In this post we discuss percentiles. Quick reminder: they are not the same as percentages!
Imagine you have 700 stocks and want to choose the top 25% based on certain factors. However, the factors you have chosen are measured in different ways: some are expressed as percentages, such as dividend yield, while others are plain multiples, such as a price-to-earnings ratio of 9.5.
Using percentiles simplifies this for you and spares you from needing detailed knowledge of the individual metrics and their actual values. Instead, with percentiles you can just take the top 20% or the bottom 20% of all stocks based on the factor of your choice. You let the data do the work for you rather than entering lots of very nuanced hardcoded percentages or multiples.
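As a minimal sketch of this idea, assume the factor values sit in a pandas Series indexed by ticker. The tickers and numbers below are made up for illustration; the point is only that the cut-offs come from the data, whatever the scale of the factor.

```python
import pandas as pd

# Hypothetical factor values (e.g. dividend yield in %) for ten made-up tickers.
factor = pd.Series(
    [0.5, 1.2, 2.0, 2.8, 3.1, 3.9, 4.4, 5.0, 6.3, 7.5],
    index=[f"STK{i}" for i in range(10)],
)

# Cut-offs are derived from the data itself rather than hardcoded.
top_cutoff = factor.quantile(0.80)
bottom_cutoff = factor.quantile(0.20)

top_20 = factor[factor >= top_cutoff]
bottom_20 = factor[factor <= bottom_cutoff]

print(list(top_20.index))     # the two highest-ranked tickers
print(list(bottom_20.index))  # the two lowest-ranked tickers
```

The same two lines work unchanged for a multiple such as price-to-earnings, which is what makes the approach convenient across differently scaled factors.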
Compared to hardcoded values, percentiles are a robust way to compare all stocks against each other today and through history. For example, consider a simple investing strategy over the last 12 years that invested in companies with a price-to-book below 0.65 (which was the 10th percentile back in 2010).
In the figure below, if we use a hardcoded value, the universe of stocks we can choose from varies wildly, from a high of 700 to a low of 100 companies. Using a percentile-based method, however, ensures that we consistently select the bottom 10% of companies by price-to-book, so the overall number stays fairly constant. If you noticed some upward drift from 2020 onwards, this is because the number of companies in the percentile bucket grows as our universe grows. This is for US stocks, where our universe has grown by 1000 companies (from 4000 to 5000 stocks) in the last 2 years, and thus the number of companies in the bottom 10% increased from 400 to 500.
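The effect can be reproduced on synthetic data. The sketch below assumes a price-to-book panel with one row per year and one column per stock; the numbers are randomly generated with an upward drift in valuations and are not the data behind the figure.

```python
import numpy as np
import pandas as pd

# Synthetic price-to-book panel: rows are years, columns are made-up stocks.
rng = np.random.default_rng(0)
years = list(range(2010, 2016))
n_stocks = 1000

# Valuations drift upwards over time, so a fixed cut-off catches fewer names.
level = np.linspace(1.0, 2.5, len(years))
pb = pd.DataFrame(
    rng.lognormal(mean=0.0, sigma=0.5, size=(len(years), n_stocks)) * level[:, None],
    index=years,
)

# Hardcoded screen: price-to-book below a fixed 0.65.
n_hardcoded = (pb < 0.65).sum(axis=1)

# Percentile screen: price-to-book at or below that year's 10th percentile.
cutoff = pb.quantile(0.10, axis=1)
n_percentile = pb.le(cutoff, axis=0).sum(axis=1)

print(pd.DataFrame({"hardcoded": n_hardcoded, "percentile": n_percentile}))
```

With a fixed universe of 1000 stocks the percentile screen returns the same count every year, while the hardcoded count shrinks as valuations rise.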
Percentiles are simple percentage-based calculations that tell us the numeric value below which "x" percent of our data falls.
Or, in slightly more mathematical language,
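One common formalisation is sketched here; the notation is an assumption. For observations $x_1, \dots, x_n$ and a level $p$ between 0 and 100, the $p$-th percentile $P_p$ is the smallest observed value with at least $p$ percent of the data at or below it:

```latex
P_p = \min\left\{ v \in \{x_1, \dots, x_n\} \;:\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[x_i \le v] \;\ge\; \frac{p}{100} \right\}
```

Libraries such as pandas interpolate linearly between neighbouring observations by default, so their output can differ slightly from this definition on small samples.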
One common use of percentiles is to generate graphs that use the median and the first and third quartiles to represent the data visually. This simply means that the data in question is separated at the 25th, 50th and 75th percentiles, which is a very quick way to visualise the distribution through time, as in the graph below. A user can quickly see the four areas of the graph and how the three quartile lines separate them.
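As a rough sketch of how the three lines for such a graph could be computed, assume a pandas DataFrame with one row per date and one column per stock. The data here is randomly generated and the column labels are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic factor panel: rows are dates, columns are made-up stocks.
rng = np.random.default_rng(1)
dates = pd.date_range("2020-01-01", periods=5, freq="D")
data = pd.DataFrame(rng.normal(size=(5, 200)), index=dates)

# One row per date, one column per quartile line (25th, 50th, 75th percentile).
quartiles = data.quantile([0.25, 0.50, 0.75], axis=1).T
quartiles.columns = ["q25", "median", "q75"]

print(quartiles)
# quartiles.plot() would draw the three lines that split the data into
# four areas (this needs matplotlib installed).
```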
One of the most important practical, under-the-hood uses of percentiles is outlier control. When onboarding either a critical dataset like price and fundamental data, or a new or alternative dataset, the cleaning, scrubbing and outlier control of that data is crucial. We wrote a detailed post on outlier control, but a quick illustration of it in action is shown in the next graph.
If we apply our outlier control at the 99.5th percentile, for example, we get a much more stable and controllable piece of fundamental data to use in our quantitative models. Although there is some predictive power (alpha) in the tails, that is also where data errors lie, so it is prudent to apply this type of data pruning at the data ingestion stage of the pipeline.
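The sketch below shows one possible form of this outlier control. It assumes that pruning means clipping (winsorising) values to the 0.5th and 99.5th percentiles of each date's cross-section; dropping the offending values instead would be an equally valid reading. The data and the injected errors are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic fundamental data: rows are dates, columns are made-up stocks.
rng = np.random.default_rng(2)
data = pd.DataFrame(rng.normal(loc=10.0, scale=2.0, size=(3, 1000)))
data.iloc[0, 0] = 1e6    # simulated data error in the right tail
data.iloc[1, 1] = -1e6   # simulated data error in the left tail

# Per-date bounds, so each row only uses information available on that date.
lower = data.quantile(0.005, axis=1)
upper = data.quantile(0.995, axis=1)

# Clip (winsorise) rather than drop: extreme values are pulled in to the bounds.
cleaned = data.clip(lower=lower, upper=upper, axis=0)

print(data.max(axis=1).round(2).tolist())     # raw maxima, including the error
print(cleaned.max(axis=1).round(2).tolist())  # maxima after clipping
```

Clipping keeps the panel the same shape, which can be convenient downstream, at the cost of piling a few observations up at the bounds.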
The natural question is then how we can use all these percentile metrics for a simple stock screening strategy. Most stock screeners out there allow users to hardcode fixed values for various metrics such as P/E or P/B, but as we have seen above these are not the most robust methods for cross-sectional analysis through time. Let's start with a toy example of a user choosing 5 metrics that they wish to create a stock screen with. Instead of hardcoding values, a more robust method would be to use percentiles such as
In the above, you can see we have also augmented the screen with a fixed dividend yield and used a percentile band for the payout ratio.
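A screen along these lines could be sketched as follows. The column names, the thresholds and the synthetic data are all illustrative assumptions, not the actual metrics or cut-offs of the toy example; the sketch only shows how percentile conditions, a fixed value and a percentile band can be combined.

```python
import numpy as np
import pandas as pd

# Synthetic single-date cross-section; names and thresholds are illustrative only.
rng = np.random.default_rng(3)
n = 500
stocks = pd.DataFrame(
    {
        "pe": rng.lognormal(3.0, 0.5, n),
        "pb": rng.lognormal(0.5, 0.6, n),
        "div_yield": rng.uniform(0.0, 6.0, n),       # in percent
        "payout_ratio": rng.uniform(0.0, 1.2, n),
    },
    index=[f"STK{i}" for i in range(n)],
)

# Percentile rank of each stock within the cross-section, between 0 and 1.
pct = stocks.rank(pct=True)

screen = (
    (pct["pe"] <= 0.30)                           # cheapest 30% on P/E
    & (pct["pb"] <= 0.30)                         # cheapest 30% on P/B
    & (stocks["div_yield"] >= 2.0)                # fixed dividend yield floor
    & pct["payout_ratio"].between(0.20, 0.80)     # percentile band for payout ratio
)

selected = stocks[screen]
print(len(selected), "stocks pass the screen")
```

Ranking once with `rank(pct=True)` and then filtering on the ranks keeps every condition on the same 0 to 1 scale, regardless of how the underlying metric is measured.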
A quick note of caution for users applying percentiles in their own analysis and work. When using percentiles as part of a quantitative investing strategy and judging their efficacy, we must ensure that the percentile is a point-in-time metric and not calculated on the full sample and then applied historically with lookahead (you would be surprised how often this happens). A quick code snippet for calculating a daily percentile value across the cross-section of stocks is as follows:
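import pandas as pd

# data is a pandas DataFrame: rows are dates, columns are stocks

# wrong way: a single percentile over the full sample (all dates and all stocks)
data_prc = data.stack().quantile(0.75)

# right way: one percentile per date, across the cross-section of stocks
data_prc = data.quantile(0.75, axis=1)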
The "wrong way" returns a single value that has lookahead if applied historically, because it uses all historical data in its calculation. The "right way" calculates a percentile on a per-day basis and so only uses information that is available at that point in time.
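The lookahead can be made visible with a small experiment on synthetic data: compute both versions on a truncated history and again after more dates arrive. The panel below is randomly generated, and the names `history` and `full` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic panel: rows are dates, columns are made-up stocks.
rng = np.random.default_rng(4)
dates = pd.date_range("2021-01-01", periods=6, freq="D")
data = pd.DataFrame(rng.normal(size=(6, 50)), index=dates)

history = data.iloc[:4]   # what was known on the fourth date
full = data               # the same panel after two more dates arrive

# Full-sample percentile: the value shifts once later data is added.
full_sample_then = history.stack().quantile(0.75)
full_sample_now = full.stack().quantile(0.75)
print(full_sample_then, full_sample_now)

# Per-date percentile: values for past dates are unaffected by later data.
pit_history = history.quantile(0.75, axis=1)
pit_full = full.quantile(0.75, axis=1)
print(np.allclose(pit_full.loc[history.index], pit_history))
```

If a backtest had used the full-sample value on the fourth date, it would have relied on a number that could only be known two dates later.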