PORTFOLIO SELECTION USING STOCK SCREENING
Why Use Stock Screens
Using stock screens to sort huge databases of company metrics is
attracting more and more retail investors. The appeal is obvious: it
requires no knowledge of accounting, no time spent reading reports and
researching the business, and it comes with convincing back-testing to
'prove' it works. But any simple understanding amounts to a wrong
understanding. This page presents the strategy in more detail so you
can see the problems and avoid the traps.
The common-sense question to ask yourself is: "Since all these listed metrics are claimed to produce excess returns, and most everyone uses these metrics when evaluating stocks, why isn't everyone earning excess returns? The list is so long that it should be hard NOT to earn excess returns." By 2012 there were 82 documented anomalies (by one count) that all seem to generate excess returns.
The negativity on this page is along the same lines as the old joke. Two economists are walking down the street. One says: "Hey, there's a dollar bill on the floor." The other says: "Impossible. If it were real, someone would have picked it up by now."
Data Snooping or Data Mining?
- The biggest problem with this strategy lies in its basic premise: that back-testing proves something. Academics have given this issue the name 'data snooping'. Given enough time, enough attempts, and enough imagination, almost any pattern can be teased out of any dataset. Decades and decades of dredging have been done on the historical stock exchange databases. In the early days, university mainframe computers would be reserved for a full night to crunch the numbers. But do the relationships exist outside the specific dataset analyzed (e.g. in the future)? Is the relationship due only to chance, and will it disappear once it becomes known? There is a relatively easy-to-understand discussion of this by Michael Edesess.
Zhijian Huang (2009) measured excess returns before and after different screening parameters were 'discovered'. For example, the size premium discovered in 1981 shrank from 11.5% to 0.7%. The high-dividend premium fell from 2.7% to 0.9%. All retained at least some benefit after discovery, and some premiums remained large (e.g. low trading volume relative to float, last year's winners, the January effect). McLean and Pontiff (2012) likewise found that 50% of the originally measured outperformance continued after publication.
On the other hand, Neuhierl and Schlusche (2009) found that statistical tests for data-snooping disproved the continuing existence of excess returns for a wide variety of metrics. Sometimes they worked; other times they didn't. Sometimes combining metrics did better, sometimes single metrics did better. Chordia et al. (2012) found that most pricing anomalies materially diminished in strength between the 1976-1992 timeframe and the later 1993-2009 period.
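The multiple-testing problem behind data snooping is easy to demonstrate with a simulation (all numbers below are randomly generated, not market data): test enough worthless 'signals' against the same history and one of them will always look like it produces excess returns, yet it fails on fresh data.

```python
import random
import statistics

random.seed(42)

N_STOCKS = 200
N_SIGNALS = 100  # number of worthless 'metrics' we will back-test

# Two independent periods of random returns per stock. Any signal that
# predicts the first period is pure luck, so it should fail in the second.
in_sample = [random.gauss(0.01, 0.08) for _ in range(N_STOCKS)]
out_sample = [random.gauss(0.01, 0.08) for _ in range(N_STOCKS)]

def decile_spread(signal, returns):
    """Mean return of the top decile minus the bottom decile."""
    order = sorted(range(N_STOCKS), key=lambda i: signal[i])
    k = N_STOCKS // 10
    bottom = [returns[i] for i in order[:k]]
    top = [returns[i] for i in order[-k:]]
    return statistics.mean(top) - statistics.mean(bottom)

# 'Discover' the best of 100 purely random screening signals.
best_spread, best_signal = -1.0, None
for _ in range(N_SIGNALS):
    signal = [random.random() for _ in range(N_STOCKS)]  # pure noise
    spread = decile_spread(signal, in_sample)
    if spread > best_spread:
        best_spread, best_signal = spread, signal

print(f"best in-sample decile spread: {best_spread:+.2%}")
print(f"same signal, out of sample  : {decile_spread(best_signal, out_sample):+.2%}")
```

The 'discovered' signal shows a sizeable in-sample spread purely by chance; out of sample its edge is gone, which is the pattern Huang and McLean-Pontiff document for real anomalies after publication.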
- Proponents of the efficient markets hypothesis argue on ideological grounds that many of the predictable patterns that have been identified in financial markets may be due to simple chance - that market movements are random.
- Fama and French may be the most famous for screening on cheapness and size. Their results (that both metrics are correlated with higher returns, with the cheapness premium more evident in small-cap stocks) are now considered gospel. But all they ever presented was regression analysis. The shortcomings of regression analysis are well stated by Michael Edesess.
"Regression formulas alone, however – even if they fit the data well – are not a theory. They are merely patterns perceived in data, barren of explanation – and possibly accidents of randomness. ... Without theory, regressions have no lasting or practical implications. Without an explanation of those results, there is no reason to apply results mined from historical data to predict the future, or even to evaluate historical performance that relies on making accurate predictions. ... Linear regression – a relatively unsophisticated methodology, in spite of all the babble about sophistication in the financial industry."
Armstrong points out that "The more complex the regression the more skeptical you should become. Increased complexity typically reduces forecast accuracy. Do not try to estimate relationships for more than three variables in a regression. Adding variables does not mean controlling for variables when variables co-vary with other predictor variables. The problem becomes worse as variables are added to the regression."
"Correlation does not imply causation. Regression measures correlation. A coefficient can 'get credit' for important excluded variables that happen to be correlated with the predictor variables. Adding variables to the regression cannot solve this problem. Goodness of fit metrics like R-squared bear little relationship to ex-ante forecast accuracy. Typically, fit improves as complexity increases, while forecasting accuracy decreases."
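The 'coefficient gets credit for excluded variables' problem quoted above can be sketched in a few lines (simulated data, for illustration only): a hidden factor z drives both the screening metric and the returns, while the metric itself has no causal effect at all.

```python
import random
import statistics

random.seed(0)
n = 1000

# Hidden factor z drives both the screening metric x and returns y.
# x itself has NO causal effect on y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]            # metric: noisy proxy of z
y = [0.05 * zi + random.gauss(0, 0.10) for zi in z]  # returns: driven only by z

def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var

slope = ols_slope(x, y)
print(f"regression slope of returns on the metric: {slope:.4f}")
# The slope is solidly positive even though the metric causes nothing:
# it is simply collecting credit that belongs to the excluded z.
```

No amount of goodness-of-fit checking on x alone would reveal the problem, which is exactly Armstrong's point about regression without theory.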
- Most back-testing uses datasets that go back far less than 100 years. In that time there have been very few full economic cycles. But you need multiple instances of similar conditions before you can jump to conclusions about 'what happens when ...'.
- Most readers of academic papers would presume that the database of 'facts' used was both correct and complete. Fangjian Fu (Dissecting the Asset Growth Anomaly, 2014) found that the CRSP dataset does not include the monthly returns of stocks delisted in the month, 90% of the time. Since the returns of delisted stocks tend to be very negative (-38% in his sample), omitting them introduces significant survivorship bias.
- Look-ahead bias may exist where information is assumed to have been available at the time of the stock selection, when in reality it was not. E.g. the published year-end P/E of an index may use calendar-year earnings even though Q4 results would not be reported for at least another six weeks.
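One hedge against look-ahead bias in a home-grown back-test is to lag every fundamental by an assumed publication delay. The 90-day lag and the figures below are assumptions for illustration, not a rule from this page.

```python
from datetime import date, timedelta

# Assumed worst-case delay between a fiscal period ending and the
# numbers actually being published.
REPORTING_LAG = timedelta(days=90)

def known_at(backtest_date, reports):
    """Keep only (period_end, value) pairs published by backtest_date."""
    return [(end, v) for end, v in reports
            if end + REPORTING_LAG <= backtest_date]

reports = [(date(2011, 9, 30), 1.10),   # Q3 EPS (made-up figures)
           (date(2011, 12, 31), 1.40)]  # Q4 EPS

# On Dec 31 the Q4 number exists on paper but has not been filed yet,
# so only the Q3 figure may be used by the screen.
print(known_at(date(2011, 12, 31), reports))
```

A back-test that instead joins fundamentals to prices by fiscal-period end quietly trades on information no investor had at the time.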
Screening for one parameter may generate a portfolio of stocks that could also be generated by some other parameter. This is the classic "correlation vs. causation" problem: it leads you to conclude that the wrong parameter 'causes' excess returns. In the analysis of dividend yield as a predictor of excess returns below, you can see how the waters are muddied when dividend payout ratios are included. Is the correlation between high returns and high dividend yields driven by the payout ratio?
Or is it driven by the P/E ratio? P/E equals the payout ratio divided by the dividend yield. The falling diagonal across the boxes reflects decreasing P/E ratios (and greater returns). The rising diagonal reflects neutral P/E ratios. Within that subgroup the higher returns show up in the low yielders.
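That identity is easy to verify with made-up figures: dividend yield is D/P and the payout ratio is D/E, so payout ratio divided by dividend yield is (D/E) x (P/D) = P/E.

```python
# Made-up figures for one stock:
price, eps, dividend = 40.0, 2.50, 1.00

dividend_yield = dividend / price   # D/P = 2.5%
payout_ratio = dividend / eps       # D/E = 40%
pe_direct = price / eps             # P/E = 16

# The identity: payout ratio / dividend yield reproduces P/E.
pe_from_identity = payout_ratio / dividend_yield
print(pe_direct, pe_from_identity)  # both approximately 16.0
```

Because the three ratios are algebraically locked together, any screen sorted on two of them has implicitly sorted on the third as well.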
The waters are muddied again when you include market capitalization in the analysis of dividend-growing stocks. Gwilym et al. have shown in their paper that when test portfolios are weighted by market cap, the dividend growers UNDERperform the benchmarks (Table 1). The OUTperformance only shows up in the equally-weighted portfolios. They show in Table 3 that higher returns from dividend-growers correlate exactly with smaller-sized firms. So maybe the whole 'dividends-are-great' issue is really just a manifestation of the 'small-cap' and 'value' premiums long known to exist (Fama and French).
- All back-testing procedures involve rebalancing on some schedule. Most rebalance yearly on Dec 31. If the magical results being 'proved' were real, you would expect the timing of the rebalancing to have no effect over the medium term. But testing shows otherwise. An MSCI paper compared the results of rebalancing their Value-Weighted World Index yearly at month-ends other than December. The returns for the May and August month-ends were barely better than the standard index.
| 14 yrs to Nov '10   | Return |
| MSCI World Index    | 5.69%  |
| Value-Weighted MSCI | 7.04%  |
O'Shaughnessy tested the cheapest-P/E decile (#1) from 1963-2008. Rebalancing at the November or December month-ends showed the best returns. The worst return, for the February month-end, was still in line with the returns for decile #3 and better than the basic index. When only the years 2000-2008 were included in the sample, November's and December's rankings fell toward the middle of the pack. So 'cheapness' exists to different degrees in the market at different times of the year, but when it peaks changes over time. No doubt the same applies to other screening metrics.
The problem with statistics gathered from screened portfolios extends to the day of the month chosen for measurements. Quite different rates of return, market betas, and Fama-French factors result when different measurement days are chosen. Research papers on this subject include Dimitrov and Govindaraj, Acker and Duck (2006), and Stein (2012).
- Screening for momentum has produced super-large excess returns. It is essentially band-wagon-jumping. You buy the best performing stocks of the last 6-months or year. But trends last until they don't. And when the trend reverses momentum portfolios get crushed. Hugely crushed. Especially when markets snap back after big drops. Here are charts showing the cumulative returns in two periods of time with market crashes. Can you handle this volatility?
- Many back-tested results are generated by averaging monthly returns. Investors don't really give a hoot about monthly returns. Averaging monthly returns has the effect of making volatile portfolios seem to be earning higher returns than stable portfolios. See the discussion on the Rates of Return page. Miller and MacKillop found that measuring yearly returns shows that Fama and French's claim that "small cap stocks beat large cap stocks on a risk-adjusted basis" does not hold true.
- Almost all the academic works present the results of portfolios, and then generalize from the portfolio to attributes of individual stocks. Monthly measurement includes monthly rebalancing. That rebalancing creates an excess return for the portfolio, over the returns of the average stock, when stocks are volatile and negatively correlated.
Greene and Rakowski (2011) show that this 'rebalancing bonus' is what explains the excess returns of equal-weighted indexes, momentum portfolios, and contrarian portfolios. It explains the excess returns of small-cap portfolios (even while small-cap stocks have lower average returns). Trying to duplicate these results with a buy-and-hold portfolio won't work.
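The rebalancing bonus can be seen in a deliberately extreme toy example (invented returns, not Greene and Rakowski's data): two stocks that alternate +25% and -20% out of phase each compound to exactly zero growth on their own, yet a 50/50 portfolio rebalanced every period earns +2.5% per period.

```python
periods = 20  # ten full +25% / -20% cycles

# Stock A and stock B alternate out of phase; 1.25 * 0.80 = 1.00, so
# each stock's buy-and-hold growth over a full cycle is exactly zero.
a_returns = [0.25 if t % 2 == 0 else -0.20 for t in range(periods)]
b_returns = [-0.20 if t % 2 == 0 else 0.25 for t in range(periods)]

bh_a = bh_b = 1.0  # buy-and-hold: $1 in each stock, never touched
reb = 2.0          # same $2 of capital, rebalanced to 50/50 every period
for ra, rb in zip(a_returns, b_returns):
    bh_a *= 1 + ra
    bh_b *= 1 + rb
    reb *= 1 + (ra + rb) / 2  # 50/50 weights earn the average return

print(f"buy-and-hold final value: {bh_a + bh_b:.2f}")
print(f"rebalanced final value  : {reb:.2f}")
```

The buy-and-hold pair ends where it started while the rebalanced portfolio compounds steadily, even though it holds exactly the same two stocks; the 'excess return' belongs to the trading rule, not to the stocks.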
- The point of this long list of problems is to exemplify the saying "There are lies, damned lies, and then there are statistics."
Can retail investors benefit from screening - assuming back-testing is valid?
- The markets today are vastly different from those existing 100 years ago. Investors react differently because information flows differently. The economies of countries have changed drastically. Capital moves around the globe easily now. The profitability of companies has changed drastically, with less reliance on fixed assets, etc. The fact that some metric predicted stock market gains in the past is really of little relevance in today's world.
- Free datasets for the public include only the securities STILL existing. There will be survivorship bias. For true back-testing, the dataset must include all the securities that were available at that past point in time.
- Academics often remove from their universe of possibilities all the financial and utility stocks because of differences in accounting rules and regulations. If you want to apply their same rules, you also must ignore these types of stocks.
- A common methodology for determining the excess return from some variable involves first sorting the dataset into deciles by that variable. Then the returns earned by the top and bottom deciles are measured. In an effort to zero out extraneous 'stuff' happening to the whole dataset, a zero-investment (leveraged) portfolio is created by combining a long position in the top decile with a short position in the bottom decile. The calculated excess return therefore comes from both the long and the short position.
In many cases it is the short position that generates the meaningful excess returns, NOT the long position. Unless you intend to also match your long position with a short, you should not expect the same excess return. In practice most investors considering screening are looking for marginally better returns than a passive large-cap index fund. They have no intention to short-sell anything.
For example, take the idea that low-volatility stocks outperform. The following table presents the data from Riley, 2012. Yes, the leveraged portfolio shows a 13.6% excess return. But what does that prove?
Portfolio Returns Sorted by Standard Deviation of Past Returns
| Decile    | Return  |
| 1 (low)   | 1.58%   |
| 2         | 2.32%   |
| 3         | 1.24%   |
| 4         | -0.71%  |
| 5         | 5.00%   |
| 6         | 1.75%   |
| 7         | 0.93%   |
| 8         | 2.26%   |
| 9         | -6.31%  |
| 10 (high) | -12.02% |
| 1-10      | 13.60%  |
There is no progression of returns along the continuum of sorted portfolios. The highest return (5.00%) shows up in the middle (5th) decile, not the low-volatility decile. The leveraged excess return is almost all attributable to the very poor performance of the high-volatility decile.
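Splitting the '1-10' row of the table into its two legs shows how lopsided the contributions are:

```python
# Decile returns from the table above, in percent.
decile_returns = {1: 1.58, 2: 2.32, 3: 1.24, 4: -0.71, 5: 5.00,
                  6: 1.75, 7: 0.93, 8: 2.26, 9: -6.31, 10: -12.02}

long_side = decile_returns[1]      # what a long-only investor would get
short_side = -decile_returns[10]   # profit from shorting decile 10

spread = long_side + short_side    # the published '1-10' excess return
print(f"1-10 spread   : {spread:.2f}%")
print(f"from the long : {long_side / spread:.0%}")
print(f"from the short: {short_side / spread:.0%}")
```

The long leg contributes only 1.58 of the 13.60 points; a retail investor who buys the low-volatility decile without the matching short should expect roughly that long-leg return, not the headline spread.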
- Rarely do you hear anyone claim that a particular screening method will generate excess returns consistently. That leaves the investor with sub-normal results after the first few years asking himself "Does the screening no longer work, or do I just have to hang in there longer?" There is no rational way to answer that question. Consider how the famous Fama-French excess-returns for small-size (SMB) and cheapness (HML) have varied over time - often in negative territory.
- When academics report excess returns for small capitalization stocks they really do mean SMALL. The smallest decile in their dataset is so small and so illiquid that translating theory into practice is impossible. Also, the excess returns attributed to a number of other metrics only show up in the small cap space. In a large cap sample the metrics are much less powerful.
- You must be aware of the possibility of added risk coming with the added returns. Risk can be measured many ways. E.g. O'Shaughnessy's metrics for portfolios sorted by dividend yield found that the highest-yielding stocks had superior risk metrics for standard deviation, downside deviation, beta, Sharpe ratio, and Sortino ratio. But he also found that they suffered DEEPER declines measured over the various time spans of 1, 3, 5, 7, and 10 years.
- Although books and advice are aimed at retail investors, none address the size of the portfolio and the risks of holding fewer stocks. E.g. O'Shaughnessy uses 50 stocks and Siegel uses 100 stocks. Academics use even larger portfolios. Many metrics were studied by sorting into quintiles. The retail investor is not going to buy 1/5th of the market. He won't even buy 50-100 stocks. He certainly won't be rebalancing monthly. The cost of implementing this strategy cannot be ignored. The retail investor's risk would be hugely greater.
The investor with a small selection of cherry-picked stocks will not generate those same excess returns. This is because MOST of the companies within a (e.g.) 'cheap' portfolio will have been correctly priced to reflect real problems that do not reverse in time. The portfolio's great returns come from the smaller number of companies that successfully turn around. Those stocks increase in value multiple times, more than compensating for the poor returns from the majority. The MEDIAN stock of the portfolio underperforms the market; it is only the AVERAGE of the large basket that outperforms. The method only works 'as a basket'. (See the Bird-Berlach paper.)
Aretz and Aretz (2011) show that when 1% of the stocks having the most influence on the regression analysis are identified (bold dots), the remaining 99% are seen to NOT be influenced by the screening parameter. Of course that leaves the question "What are the attributes of those stocks that drive the regression results?" but the point here is that cherry-picking individual stocks by some metric will most likely leave you with the normal 99%.
- Small cap portfolios with a low Price-to-Book are shown in theory to outperform high Price-to-Book portfolios. But the real-life experience of mutual funds is that they fail to capture the value premium. Why?
The transaction costs of trading illiquid small-cap stocks with thin floats are predictably high. Scislaw and McMillan (2012) found that fund managers realize this and exclude the smallest, most illiquid stocks from both their value and growth portfolios.
This exclusion has different effects: it adds 3.3% per year to value portfolios, but 10.4% per year to growth (high price-to-book) portfolios. Scislaw and McMillan deconstructed the mutual fund returns to show that the value portfolios' composition of stocks did indeed give rise to excess returns. But that gain was more than offset by the smaller benefit value funds got from excluding the most illiquid stocks.
What are the results AFTER the fact?
If beating the market by screening were so easy, with at least half a dozen metrics 'proven' to generate excess returns, why aren't mutual funds firing their analysts and simply screening? Replicating the methodology is almost impossible for retail investors, but institutional money should have no problem beating the market. O'Shaughnessy started a few mutual funds using his screening rules; they are a perfect test of the value of screening. The funds have had some periods of better returns and some periods of worse returns. He earned his fees with the Canadian fund, but the US funds show no long-run outperformance.
Understanding the Metrics
The metrics used for screening may be essentially the same as those used to understand and evaluate a company heuristically. But there are differences in interpretation. Here, the investor has no need to agree with the calculation of any metric. Nor should he care if there appears to be no explanation for why a metric works.
This is because they are not meant to show that one company is 'better' than another. No CEO should attempt to run his company using these metrics as operating objectives. E.g. '1-yr sales %growth' is negatively correlated with stock appreciation, but we would never want management to aim for lower sales. The metrics are used as predictors of near-term stock price movements ONLY. And not of an individual stock's price, but of the basket portfolio created by the screen.
Classification of Metrics (examples only)
- Market Attention metrics:
- Stock price hitting new highs
- 3 mo, 6 mo, 1 yr stock price % appreciation indicating momentum
- Trading volume (relative to market capitalization) is a negative indicator
- 5 year stock price % appreciation (a negative indicator because of reversion to the mean)
- Growth metrics:
- 1 year EPS % growth
- 5 year EPS % growth (negative indicator: reversion to the mean)
- ROE (= theoretically possible growth)
- 1 year sales % growth (a negative indicator per O'Shaughnessy)
- Risk metrics:
These use Balance Sheet accounts to measure liquidity and leverage.
- Valuation metrics:
There are five main valuation metrics. The modifying factor can be thought of as the link between the metric and Net Income.
- Price/Sales is modified by the factor 'net margin'.
- Price/Book Value is modified by the factor 'ROE'
- Price/Earnings is modified by the factor 'growth'
- Price/Cash Earnings is modified by 'capex per sales'. This is not the same measure as cash flow used for absolute valuations.
- Dividend Yield is modified by the 'payout ratio'.
Different Ways to Define the Metric Parameters
- The metric must be greater/less than a predefined value.
- The metric must be better than the average for the universe (or in the top decile, etc).
- Only the most extreme values are accepted until all the spaces in the portfolio are filled.
A more sophisticated method incorporates a larger number of metrics. For
each metric all the stocks are given a ranking (or decile, etc) for
that metric. A stock's overall score is the sum across the values for
all the metrics.
This way no stock is discarded just because it does not have an extreme
value on one metric. A company with moderate values on a lot of metrics
outranks one with extreme scores on a limited number of metrics.
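The rank-sum scoring just described might be sketched like this (stock names, metrics, and figures are all invented for illustration):

```python
# Three hypothetical stocks scored on three hypothetical metrics.
stocks = {
    "AAA": {"pe": 8.0,  "yield": 1.0, "momentum": 25.0},
    "BBB": {"pe": 12.0, "yield": 3.0, "momentum": 10.0},
    "CCC": {"pe": 30.0, "yield": 4.5, "momentum": -5.0},
}

# For P/E a lower value is better; for yield and momentum, higher.
LOWER_IS_BETTER = {"pe"}

def scores(stocks):
    """Sum each stock's per-metric rank (1 = best) into one score."""
    total = {name: 0 for name in stocks}
    for metric in next(iter(stocks.values())):
        reverse = metric not in LOWER_IS_BETTER
        ranked = sorted(stocks, key=lambda s: stocks[s][metric],
                        reverse=reverse)
        for rank, name in enumerate(ranked, start=1):
            total[name] += rank
    return total

# Lowest total score wins; no stock is eliminated for missing a
# single cutoff, which is the point of the method.
print(sorted(scores(stocks).items(), key=lambda kv: kv[1]))
```

Here BBB is never best on any single metric, yet its moderate ranks everywhere keep it competitive with the stocks that are extreme on one metric only.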
Where To Find The Data
Sites don't always say what universe of stocks is being searched, so
make your own presumptions from the context. While there are several
sites that can screen for US stocks, the sites for Canadian companies
are very rudimentary.
When there are no screening tools available, you can use the on-screen
data displays. Copy and paste them into a blank Excel file. Sort by
ticker and arrange the tables beside each other, ensuring all the
metrics on a line pertain to one company. Then use the sort button to
sort by each column's metric. Color-shade the values that comply with
your screen parameters. Once all the metrics have been analyzed, sum
across each line, giving different values to the different color
codings.
Google Finance - US
General Issues You Should Understand
* Reducing volatility is just as important as increasing returns. For
all portfolios, the average of your annual returns will be higher than
the compound return you realize over time. (E.g. the S&P 500's annual
returns from 1927-2006 average 12.2%, but the compound return over
time is only 10.3%.) It is the infrequent really bad years that
drag you down. Screening for single metrics widens the difference
between these two measures of return.
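The gap between averaged and compounded returns comes purely from volatility, as a miniature example shows (illustrative numbers, not the S&P data):

```python
returns = [0.50, -0.30]  # two years of illustrative annual returns

# Simple average of the annual returns: (+50% - 30%) / 2 = 10%.
arithmetic = sum(returns) / len(returns)

# Compound (geometric) return: 1.50 * 0.70 = 1.05 over two years,
# i.e. only about 2.47% per year.
growth = 1.0
for r in returns:
    growth *= 1 + r
geometric = growth ** (1 / len(returns)) - 1

print(f"average of annual returns: {arithmetic:.2%}")
print(f"compound annual return   : {geometric:.2%}")
```

The bad year does proportionally more damage than the good year does good, so the more volatile the screened portfolio, the further its compound return falls below its advertised average.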
Almost all screened portfolios in O'Shaughnessy's book had greater risk than the
underlying universe of stocks. This is predictable. When screening for
growth, you end up exposed to high priced stocks when their growth
ends. When screening for value, you end up with broken businesses.
A large number of portfolios back-tested in the book UNDERperformed
their benchmark in one third of all 5-year periods. Even if there is near
certainty of regaining the shortfall in the subsequent 5 years, do YOU
want to wait 10 years just to pull even?
* Why does it work?
The extra return generated by screening for a specific metric may not
be attributable to the metric itself. E.g. screening LargeCaps for
'dividend yield' will spit out a large number of utilities. The high
returns are a function of the sector not the metric. Screening AllCaps
for 'dividend yield' added little value for O'Shaughnessy.
* Multiple metrics. Screening for more
metrics is not necessarily better than screening for only one. When the
metrics all belong to the same category you effectively make your
selections more and more risky, with fewer and fewer redeeming
qualities.
Some metrics may have little value alone, but add value when
combined with others. E.g. using '1-year price appreciation' to sort
SmallCaps lowers your return from 12.6% to 10.2%. But when combined
with the metric 'price/sales<1', the returns increase to 18.6%.
The risk created by single metrics can be reduced by teaming up
a value metric with a market attention metric. This way you look for
companies who are still cheap but have turned the corner. Now the
markets are paying attention.
* Capitalization size. The value of a metric can be
different when applied to portfolios of different capitalization (small
cap, large cap, equal-weighted, cap-weighted). E.g. '1-yr price
appreciation' adds about 3% of value in the LargeCap universe, but
lowers returns by more than 2% in the SmallCap universe.
* Size of Portfolio.
Decreasing the number of stocks in the resulting portfolio can increase
your returns because the ones chosen are more extreme. But almost
always the increase in risk is so severe that no investor could continue
the strategy through the dark days. The advantages of screening are
greater in the SmallCap universe, but when adjusted for risk the
benefits are greater in the LargeCaps.
* Rebalance Frequency. The optimal length of time
between rebalancing with the results of a new screen varies. Generally,
1 year is optimal for LargeCaps and value metrics. SmallCaps and growth
metrics are better rebalanced more frequently. Transaction costs and
taxes will affect your decision on this.