The information that most investors are familiar with for stocks is price and volume data. If we go one step further, we can easily obtain company fundamental data such as net income, return on equity, free cash flow and much more. This type of financial company data can be used to inform investors' decisions about which stock to invest in.
The question is then: how do we use all this information to figure out why we should invest in stock A, compared to stock B, compared to stock C, ..., compared to stock Z?
This can be easily achieved if we can combine all these stock-specific datasets together to obtain a final stock score or ranking. This final stock score can then be used to rank all of our potential investable stocks in descending order. Then we can choose the N stocks with the highest scores to form our investable portfolio, where N is whatever size of portfolio the end investor wants.
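As a minimal sketch of the ranking step described above, the snippet below sorts a handful of stocks by score and keeps the top N. The tickers and score values are made up purely for illustration, and `top_n` is a hypothetical helper name, not part of any library.

```python
# Hypothetical final scores per ticker; the numbers are made up for illustration.
scores = {"AAA": 0.42, "BBB": 1.31, "CCC": -0.58, "DDD": 0.97, "EEE": 0.05}


def top_n(scores, n):
    """Return the n tickers with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


# N is whatever portfolio size the investor wants; 3 is an arbitrary choice here.
portfolio = top_n(scores, 3)
print(portfolio)
```

If N is larger than the number of available stocks, the slice simply returns every stock in ranked order.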
Now that we have all these different datasets to combine, the nuance is how to clean the data and then standardize and normalize the datasets, so that the stocks in each dataset can be compared against each other efficiently. This may involve taking price, volume and fundamental data and then fitting some linear model to the data. The final stage would then be to transform the output of that linear model into a final ranking or distribution for each stock.
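One possible way to picture this pipeline is sketched below, under several assumptions: the factor values are made up, each factor is rescaled by subtracting its mean and dividing by its standard deviation (the z-score covered later in this article), and the "linear model" is replaced by a simple equal-weight sum. A fitted model would supply its own coefficients instead.

```python
import numpy as np

# Hypothetical factor values for four stocks (one row per stock); made-up numbers.
tickers = ["AAA", "BBB", "CCC", "DDD"]
# Columns: return on equity (%), free cash flow yield (%), 12-month price change (%)
factors = np.array([
    [12.0, 3.0, 8.0],
    [20.0, 5.0, 15.0],
    [5.0, 1.0, -4.0],
    [15.0, 4.0, 10.0],
])

# Put every column on a comparable scale: subtract the column mean
# and divide by the column standard deviation.
scaled = (factors - factors.mean(axis=0)) / factors.std(axis=0)

# Assumed equal weights; this stands in for the coefficients of a fitted linear model.
weights = np.array([1 / 3, 1 / 3, 1 / 3])
score = scaled @ weights

# Rank the stocks from highest to lowest score.
order = np.argsort(-score)
ranking = [tickers[i] for i in order]
print(ranking)
```

Without the rescaling step, the factor with the largest raw numbers would dominate the combined score simply because of its units.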
For anyone who has worked with data, one of the biggest issues is ensuring that the data is cleaned and scrubbed and that outliers are removed. This ensures that the data passed to our financial models is as robust and error-free as we can make it, so our models can make more accurate predictions. Data cleaning and outlier detection then naturally lead to the question:
How do I standardize or normalize my data so that I can detect outliers and remove them to clean my data?
Standardizing or normalizing data is a form of data transformation. We are taking data in raw form and modifying the representation of that data, as we will see below with some examples. Once the data has been modified, it can be interpreted more easily, but the overall properties of the data should remain intact. This means that data which was an outlier in its raw form will still be an outlier in the transformed data, but the hope is that it will be easier for us to identify those outliers.
The terms standardization and normalization are used interchangeably by some people in the data science or investing industries. This can lead to some confusion (or misinterpretation) at best or misusing one method in place of the other at worst. It is best shown via a simple example.
The easiest and probably the most commonly used standardization method is to squash the data into the minimum-maximum range. As a formula, this is given by

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
This essentially transforms your data into a [0,1] range, which we can think of as a form of percentage so 0% to 100%. This is a standardization technique for raw data where all raw data (of different types) is transformed into a defined standard [0,1] interval. This technique can also be referred to as min-max scaling. However, this is not really normalization in the sense of data normalization or a normal distribution, as we try to explain in the next section.
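For readers who want to try min-max scaling, here is a minimal sketch using NumPy. The function name `min_max_scale` and the price values are illustrative, and returning zeros when all values are identical is an assumed convention rather than a standard.

```python
import numpy as np


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values are identical, so the range is zero.
        # Returning zeros here is an assumed convention to avoid dividing by zero.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


prices = [10.0, 12.0, 15.0, 20.0]  # made-up prices for illustration
scaled_prices = min_max_scale(prices)
print(scaled_prices)
```

Because the minimum and maximum define the whole scale, a single extreme value compresses every other point towards one end of the [0,1] interval.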
Normalization in the statistics, data science or mathematical professions refers to the normal distribution, which means the distribution of the data can be explained by two parameters: the mean and the standard deviation. The standard normal distribution is one of the most common distributions that quantitative researchers use and is a special case of the normal distribution, with a mean of zero and a standard deviation of one.
One method where normalization is frequently used is to convert data into a z-score, also referred to as a standard score, which is given by the following formula, where $\mu$ is the mean or average of the data and $\sigma$ is the standard deviation

$$z = \frac{x - \mu}{\sigma}$$

The z-score subtracts the average value from each data point and then normalizes the data by dividing by the standard deviation. If the raw data follows a normal distribution, then calculating a z-score turns the data into a standard normal distribution. The z-score tells the user how many standard deviations away from the average value a data point is. We can illustrate this by comparing the standard normal distribution with other normal distributions that have different parameters, to show their impact visually.
The impact of a larger standard deviation should hopefully be clear: it spreads the distribution out and creates wider tails, with more chance of data coming from those extreme events. The last example, with a positive mean of two instead of zero, illustrates how changing the mean shifts the distribution but leaves its shape unchanged.
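To connect the z-score back to outlier detection, here is a minimal sketch that computes z-scores and flags points far from the mean. The return values are made up, `z_scores` is a hypothetical helper name, and the cut-off of 2.5 standard deviations is an assumed threshold, not a rule from this article.

```python
import numpy as np


def z_scores(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Made-up daily returns in percent; the last value is deliberately extreme.
returns = np.array([0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.1, 0.0, 8.0])
z = z_scores(returns)

# Assumed cut-off: flag anything more than 2.5 standard deviations from the mean.
threshold = 2.5
outliers = returns[np.abs(z) > threshold]
print(outliers)
```

In a small sample, one extreme point inflates the standard deviation it is measured against, so the choice of threshold is a judgment call that depends on the sample size.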
If anyone wants to test and play with some data, here is some example Python code to illustrate normal distributions by varying the parameters mu and sigma, the mean and standard deviation.
# Plot normal distributions with different means and standard deviations
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Standard normal distribution: mean 0, standard deviation 1
mu = 0
std = 1
snd = stats.norm(mu, std)
# Same mean, larger standard deviations
snd2 = stats.norm(0, 2)
snd3 = stats.norm(0, 3)
# Shifted mean, same standard deviation as the standard normal
snd4 = stats.norm(2, 1)

# Generate 1000 evenly spaced values between -10 and 10
x = np.linspace(-10, 10, 1000)

# Plot the probability density function of each distribution over x
plt.plot(x, snd.pdf(x))
plt.plot(x, snd2.pdf(x))
plt.plot(x, snd3.pdf(x))
plt.plot(x, snd4.pdf(x))

plt.legend(['mean = 0, stdev = 1', 'mean = 0, stdev = 2', 'mean = 0, stdev = 3', 'mean = 2, stdev = 1'], loc='upper left')
plt.title('Normal Distributions: Different Means + Stdev', fontsize=15)
plt.xlabel('Values of Random Variable X', fontsize=15)
plt.show()