We won't cover all datasets, but we can at least cover the three main types used by quantitative investment firms. As we shall see, each needs a varying degree of outlier control.
Price and volume data is by its very nature robust and prone to the fewest errors. It is not error free, though, as no system ever is. There are also most likely additional third-party data vendors involved in delivering this data to end users. We cover this in more detail in a separate post; however, a quick example of counting a strange data error that would need to be scrubbed is given below.
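As a minimal sketch of this kind of sanity check, assume a pandas DataFrame of daily close prices (dates by tickers); the tickers, values, and the injected zero price are all hypothetical. We count the suspicious observations and mask them out:

```python
import pandas as pd

# Hypothetical daily close prices (dates x tickers); the 0.0 is a
# deliberate vendor-style data error of the kind that must be scrubbed.
prices = pd.DataFrame(
    {"AAA": [10.0, 10.2, 0.0, 10.1], "BBB": [5.0, 5.1, 5.2, 5.3]},
    index=pd.date_range("2024-01-02", periods=4),
)

# Non-positive prices should never occur, so count them first.
bad_count = int((prices <= 0).sum().sum())

# Flag them as missing so they do not feed into downstream signals.
clean = prices.mask(prices <= 0)
```

Masking to NaN (rather than deleting rows) keeps the date/ticker alignment of the matrix intact for later joins.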
Datasets from fundamental data providers (balance sheet, income statement, and detailed analyst estimate data) contain orders of magnitude more data than simple daily price and volume data. Fundamental balance-sheet-style data tends to contain upwards of 500 items per company per reporting period. Errors can easily be introduced when that data is ingested by the data provider prior to delivery to quantitative investment firms.
For detailed analyst estimate data, every analyst update to every forecasted item for every company with analyst coverage is stored by the data provider and then distributed to clients. As a result, these datasets quickly grow enormous, with more outliers introduced along the way. Consequently, the data cleaning and outlier control is more involved, and a lot more quality assurance (QA) of the data needs to be undertaken by the quantitative team. Errors can easily occur in such circumstances.
A simple example of this can be found in dividend yield. Suppose you want to build a strategy around a yield factor, using either the overall dividend yield metric in some form or a growth metric on the dividend yield. Clearly, we had better investigate the raw dividend yield data from our data provider.
What we would like to see is data that aligns with our economic rationale: a good prior might be that the data contains maximum dividend yields of perhaps 50-60%, and that these would mostly be poor, illiquid companies whose depressed price is driving the metric higher.
What is clear from the graph is that the raw dividend data contains some extremely large and unrealistic dividend yields. Rather than allowing this type of data to feed through to your quantitative model, a much safer approach is to truncate all the outliers to the 99.9th percentile, as we show in the following graph. This gives a much more reasonable and realistic dividend yield for use in our model.
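A sketch of this truncation, on synthetic data: the lognormal yields and the two absurd injected values below are hypothetical stand-ins for a raw vendor feed, not real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dividend yields (%): mostly small, with two absurd
# vendor-style errors appended at the end.
dy = np.concatenate([rng.lognormal(mean=1.0, sigma=0.6, size=1000),
                     [800.0, 1500.0]])

# Cap everything above the cross-sectional 99.9th percentile.
cap = np.nanpercentile(dy, 99.9)
dy_trunc = np.minimum(dy, cap)   # winsorize the upper tail
```

The capped series keeps every observation but pulls the unrealistic tail back to a level the rest of the cross-section supports.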
Some users may decide not to truncate their outliers to the 99.9th percentile but to remove the outliers from their model altogether. This is also a valid method, and during the research phase of the model build both methods can be tested to see whether the strategy you built is susceptible to these data outliers.
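The removal variant can be sketched the same way, again on synthetic data: instead of capping values above the percentile, set them to NaN so they drop out of the model while the matrix shape stays aligned.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical dividend yields (%) with one absurd injected value.
dy = np.concatenate([rng.lognormal(mean=1.0, sigma=0.6, size=1000),
                     [1200.0]])

cap = np.nanpercentile(dy, 99.9)
# Remove (rather than cap) the outliers: NaN preserves alignment.
dy_removed = np.where(dy > cap, np.nan, dy)
```

Running the research backtest once with capped data and once with removed data is a quick way to see whether the strategy's results depend on these tail observations.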
Combining outlier control with histograms is probably the single best way to get a visual representation of a data metric and should be the first step in analysing your data. After that, users can calculate descriptive statistics to describe the distribution in further detail. An example is shown now for price-to-earnings (P/E), where we first outlier-control extremely large, unrealistic values and then plot the histogram.
We implement our usual truncating technique here (also termed clipping or winsorizing) and ensure that our P/E data matrix has appropriate outlier control. This allows us to plot the histogram of P/Es for our US stock universe in a controlled and stable manner.
# Python code to truncate a matrix using a daily-calculated
# lower and upper percentile
import numpy as np

data_in = data_in.clip(
    lower=np.nanpercentile(data_in, 0.5, axis=1),
    upper=np.nanpercentile(data_in, 99.5, axis=1),
    axis=0
)
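A self-contained version of this clip-then-histogram step, on a synthetic right-skewed P/E cross-section (the lognormal parameters are illustrative, not fitted to any real universe):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical P/E cross-section: right-skewed with a long upper tail.
pe = rng.lognormal(mean=3.0, sigma=0.6, size=2000)

# Clip to the [0.5, 99.5] percentile band before plotting.
lo, hi = np.nanpercentile(pe, [0.5, 99.5])
pe_clipped = np.clip(pe, lo, hi)

# Histogram counts over 30 bins; these feed directly into
# matplotlib's plt.bar (or use plt.hist on pe_clipped) to plot.
counts, edges = np.histogram(pe_clipped, bins=30)
```

Without the clip, a handful of enormous P/Es would stretch the x-axis so far that the bulk of the distribution collapses into one or two bars.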
It turns out that the P/E of this universe of stocks has a median of 19 and a mean of 39, indicating that it is naturally right-skewed and prone to outliers.
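The mean-above-median signature of right skew is easy to check on any cross-section; here is a sketch on synthetic lognormal data (the parameters are hypothetical, chosen only to mimic a skewed P/E distribution):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical right-skewed P/E cross-section.
pe = rng.lognormal(mean=3.0, sigma=0.8, size=5000)

median = np.median(pe)
mean = np.mean(pe)
# Sample skewness: positive for a right-skewed distribution.
skew = np.mean((pe - mean) ** 3) / np.std(pe) ** 3
```

For a right-skewed metric the mean sits well above the median, which is exactly the pattern reported for P/E above.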
Alternative data (sometimes referred to as alt data) has been increasingly utilised as a data source by quantitative trading firms. By its very definition it represents data from non-standard sources, and as such the quality of that data varies: high from some providers, low from others. Alternative datasets have improved over the years; however, they can still be less robust than standard datasets and come with their own potential issues.
Overall, alternative data will grow and will be an increasing source of alpha for quantitative strategies going forward. Naturally, many of the issues raised above will be fixed purely by time (historic data naturally accumulates) and by expertise (providers learning what quantitative firms' criteria are) in the alternative data field.