
STRATxAI

April 2024 · 10 min read

Controlling for outliers in your data


Which datasets need outlier control?

We won't cover every dataset, but we can cover the three main types used by quantitative investment firms. As we shall see, each needs a varying degree of outlier control.

  1. Price and volume data
  2. Fundamental data
  3. Alternative data
Datasets and their quality

Price and volume data is by its very nature robust and prone to the fewest errors. It is not error free, though, as no system ever is, and there are usually additional third-party data vendors involved in delivering this data to end users. We cover this in more detail in a separate post; a quick example of a strange data error that would need to be scrubbed is given below.

An example of price QA and outlier control: a plot of the count of stocks per day where the open price is greater than the high price, which cannot occur.
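A check of this kind is simple to express with pandas. The sketch below uses made-up open and high prices (the DataFrames, dates, and tickers are all hypothetical) and counts, per day, how many stocks violate the open ≤ high invariant:

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices: rows are dates, columns are stocks.
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=5)
stocks = ["AAA", "BBB", "CCC"]
open_px = pd.DataFrame(rng.uniform(90, 110, (5, 3)), index=dates, columns=stocks)
high_px = open_px + rng.uniform(0, 5, (5, 3))

# Inject an impossible record: open above high for one stock on one day.
high_px.iloc[2, 1] = open_px.iloc[2, 1] - 1.0

# Count, per day, how many stocks violate open <= high.
violations_per_day = (open_px > high_px).sum(axis=1)
print(violations_per_day)
```

Plotting `violations_per_day` through time gives exactly the kind of QA chart described above; a non-zero count on any day flags records to scrub.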

Datasets from fundamental data providers (balance sheet, income statement, and detailed analyst estimate data) contain orders of magnitude more data than simple daily price and volume data. Balance-sheet-style fundamental data tends to contain upwards of 500 items per company per reporting period. Errors can easily be introduced when that data is ingested by the data provider prior to delivery to quantitative investment firms.

For detailed analyst estimates data, every analyst update to every forecasted item, for every company with analyst coverage, is stored by the data provider and then distributed to clients. These datasets therefore quickly grow enormous, with more outliers introduced along the way. Consequently, the data cleaning and outlier control are more involved, and a lot more quality assurance (QA) needs to be undertaken by the quantitative team. Errors can easily occur in such circumstances, a few examples of which are:

  • The fundamental data provider misread a record
  • Company accounting data entry errors
  • An analyst estimate was erroneously entered or in the wrong format (pounds instead of pence, for example)
  • Corporate actions were not applied correctly
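The pounds-versus-pence case is easy to screen for, since a units mix-up shifts a value by a factor of roughly 100 against its peers. Below is a minimal sketch using made-up EPS estimates (the tickers, analyst labels, and the 20x threshold are all hypothetical choices, not a provider's actual rule):

```python
import pandas as pd

# Hypothetical EPS estimates for one company; all values should be in pounds.
estimates = pd.Series([1.52, 1.48, 1.55, 150.0, 1.50],
                      index=["A1", "A2", "A3", "A4", "A5"])

# Flag any estimate more than ~20x the cross-analyst median: a common
# symptom of a units mix-up (e.g. pence entered instead of pounds).
median = estimates.median()
suspect = estimates[estimates / median > 20]
print(suspect)
```

Here analyst A4's entry of 150.0 is roughly 100x the consensus, the classic signature of a pence-for-pounds entry.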
An example using fundamental data

A simple example of this can be found by looking at dividend yield. Let us assume you want to build a strategy around a yield factor, using either the overall dividend yield metric in some form or a growth metric on the dividend yield. Clearly, we had better investigate the raw dividend yield data from our data provider.

What we would like to see is data that aligns with our economic rationale. A good prior may be that our data should contain maximum dividend yields of perhaps 50-60%, and that these would mostly be poor, illiquid companies whose falling price is driving the metric higher.
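Such a prior can be turned into a direct screen on the raw data. The sketch below uses invented yield values; the 60% cutoff is the prior stated above, not a universal constant:

```python
import pandas as pd

# Hypothetical raw dividend yields (as fractions) from a data provider.
raw_yield = pd.Series([0.02, 0.045, 0.08, 3.5, 0.01, 12.0])

# Economic prior: yields above ~60% are almost certainly data errors
# or distressed-price artefacts worth investigating by hand.
PRIOR_MAX = 0.60
flagged = raw_yield[raw_yield > PRIOR_MAX]
print(f"{len(flagged)} of {len(raw_yield)} observations exceed the prior")
```

Anything flagged here (the 350% and 1200% "yields" in this toy series) deserves a manual look before it reaches a model.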

An example plot of the raw maximum dividend yield from a data provider.

What is clear from the graph is that the raw dividend data contains some extremely large and unrealistic dividend yields. A much safer approach than allowing this data to feed through to your quantitative model is to truncate all the outliers to the 99.9th percentile, as we show in the following graph. This gives a much more reasonable and realistic dividend yield to use in our model.

A plot of the same dataset with the outliers truncated to the 99.9th percentile, which is a lot more stable.

Some users may decide not to truncate their outliers to the 99.9th percentile at all, and instead remove the outliers from their model altogether. This is also a valid method, and during the research phase of the model build both methods can be tested to see whether the strategy is susceptible to these data outliers.
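Both treatments are one-liners in pandas. The sketch below builds a made-up yield series with two injected errors and applies each method side by side (the data and seed are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dividend-yield series with two injected data errors.
rng = np.random.default_rng(1)
dy = pd.Series(rng.uniform(0.0, 0.06, 5000))
dy.iloc[10] = 4.0   # a 400% "yield" -- clearly erroneous
dy.iloc[99] = 9.5

p999 = dy.quantile(0.999)

# Method 1: truncate (winsorize) outliers at the 99.9th percentile.
dy_trunc = dy.clip(upper=p999)

# Method 2: remove observations above the threshold altogether.
dy_removed = dy[dy <= p999]

print(f"raw max: {dy.max():.2f}, truncated max: {dy_trunc.max():.4f}, "
      f"removed {len(dy) - len(dy_removed)} observations")
```

Truncation keeps every stock in the universe with a capped value, while removal shrinks the universe slightly; backtesting both reveals whether the strategy's results hinge on these extreme points.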

Helpful visual distribution checks

Combining outlier control with histograms is probably the single best way to get a visual representation of a data metric, and should be the first step in analysing your data. After that, users can calculate descriptive statistics to describe the distribution in further detail. An example is shown below for price-to-earnings (P/E), where we first outlier-control extremely large, unrealistic values and then plot the histogram.

Maximum P/E versus 99.5th percentile P/E for US stocks through time.

We implement our usual truncating technique here (also termed clipping or winsorizing) and ensure that our P/E data matrix has the appropriate outlier control. This allows us to then plot the histogram of P/E's for our US stock universe in a controlled and stable manner.

# Python code to truncate a matrix using daily-calculated
# lower and upper percentiles
import numpy as np

# data_in is a DataFrame of dates (rows) x stocks (columns); the
# percentiles are computed across stocks each day (axis=1) and
# aligned back to the date index when clipping (axis=0).
data_in = data_in.clip(
    lower=np.nanpercentile(data_in, 0.5, axis=1),
    upper=np.nanpercentile(data_in, 99.5, axis=1),
    axis=0,
)
Histogram of the P/E for US stocks through time after truncating outliers at the 99.5th percentile.

It turns out that the P/E of this universe of stocks has a median of 19 and a mean of 39, indicating it is naturally right-skewed and prone to outliers.
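That median-versus-mean gap is easy to verify with descriptive statistics. The sketch below uses a synthetic lognormal sample as a stand-in for a P/E cross-section (the distribution parameters are invented, chosen only because lognormal draws are right-skewed like real valuation ratios):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a P/E cross-section: lognormal draws are
# right-skewed, like real valuation ratios.
rng = np.random.default_rng(42)
pe = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=10_000))

# A mean well above the median, and positive skewness, are the
# classic signatures of a right-skewed distribution.
print(f"median={pe.median():.1f}  mean={pe.mean():.1f}  skew={pe.skew():.2f}")
```

Whenever the mean sits far above the median like this, the histogram's long right tail is doing the work, which is exactly why the outlier control above matters before any cross-sectional averaging.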

Alternative datasets

Alternative data (sometimes referred to as alt data) is a data source increasingly utilised by quantitative trading firms. By its very definition it comes from non-standard sources, so its quality varies widely: high from some providers, low from others. Alternative datasets have improved over the years, but they can still be less robust than standard datasets. Some potential issues are:

  • Lack of global coverage
  • Focused on a particular sector only
  • Point-in-time issues: especially in the early days, the rigour needed to ensure fully accurate point-in-time data was not fully understood, or could simply not be guaranteed by the data provider
  • Short history so difficult to use for long-term backtests

Overall, alternative data will grow and will be an increasing source of alpha for quantitative strategies going forward. Many of the issues raised above will be fixed simply by time (historic data is accumulating naturally) and by expertise (providers increasingly understand what quantitative firms require).
