
Wednesday, November 17, 2021

Generalisability problem with investment models

A common topic on this blog has been the problem of drawing inferences from field experiments in development. Economists refer to this as the generalisability, or external validity, of the findings of field experiments. This post will examine the same problem in finance.

This has a wider resonance in the form of what has been called a "replication crisis", even in the hard sciences. The seminal work is that of John Ioannidis, who showed that the results of many medical research papers could not be replicated by other researchers, a pattern since exposed by researchers in other fields too. I have blogged here about the work of Eva Vivalt, whose meta-analysis of nearly 600 impact evaluation studies of 20 development interventions shows a remarkable level of non-replicability and inconclusiveness.

This has an even wider relevance in the context of the debate around the narrative of the superior wisdom of experts. Covid-19 has been only the latest demonstration of the perils of relying on expert wisdom. Philip Tetlock's extensive research shows that experts often perform no better, and sometimes worse, than informed outsiders in predicting future events in their own fields.

Robin Wigglesworth has a good article in the FT which points to this problem in finance, citing the work of renowned finance guru and Duke University professor Campbell Harvey. Before getting to Harvey's work, let's read Wigglesworth's excellent description of the problem,

The heart of the issue is a phenomenon that researchers call “p-hacking”... P-hacking is when researchers overtly or subconsciously twist the data to find a superficially compelling but ultimately spurious relationship between variables. It can be done by cherry-picking what metrics to measure, or subtly changing the time period used. Just because something is narrowly statistically significant, does not mean it is actually meaningful. A trading strategy that looks golden on paper might turn up nothing but lumps of coal when actually implemented. Harvey attributes the scourge of p-hacking to incentives in academia. Getting a paper with a sensational finding published in a prestigious journal can earn an ambitious young professor the ultimate prize — tenure. Wasting months of work on a theory that does not hold up to scrutiny would frustrate anyone. It is therefore tempting to torture the data until it yields something interesting, even if other researchers are later unable to duplicate the results.
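
To see how easily this happens, here is a minimal simulation sketch. It is my own illustration, not from the article; it assumes only numpy, and every parameter value is arbitrary. It generates a purely random market, tries a few hundred random trading rules, and reports the cherry-picked "best" one, which tends to look impressive in-sample and evaporate on fresh data:

```python
# A minimal, self-contained sketch of backtest overfitting / p-hacking.
# All numbers and names are illustrative assumptions, not from the article.
import numpy as np

rng = np.random.default_rng(0)

n_days = 500            # roughly two years of daily data
n_strategies = 200      # number of candidate "signals" the researcher tries

# Purely random daily market returns: no rule can truly beat this market.
market = rng.normal(loc=0.0, scale=0.01, size=n_days)

# Each candidate strategy is a random +1/-1 position taken each day.
positions = rng.choice([-1, 1], size=(n_strategies, n_days))
strategy_returns = positions * market          # shape: (n_strategies, n_days)

# In-sample annualised Sharpe ratio for every candidate.
sharpe = (strategy_returns.mean(axis=1) / strategy_returns.std(axis=1)) * np.sqrt(252)

best = np.argmax(sharpe)
print(f"Best of {n_strategies} random strategies: in-sample Sharpe = {sharpe[best]:.2f}")

# Out-of-sample check: apply the same "winning" rule to fresh random data.
new_market = rng.normal(loc=0.0, scale=0.01, size=n_days)
oos_returns = positions[best] * new_market
oos_sharpe = (oos_returns.mean() / oos_returns.std()) * np.sqrt(252)
print(f"Same strategy on new data:  out-of-sample Sharpe = {oos_sharpe:.2f}")
# The cherry-picked strategy typically looks golden in-sample and
# collapses to roughly zero out of sample.
```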

Campbell Harvey claims there is a replication crisis in the market-beating investment strategies identified in top financial journals. He feels at least half of them are bogus and that his fellow academics are in denial about this. The paper is a short six-page read. It starts with the provocative statement,

About 90% of the articles published in academic journals in the field of finance provide evidence in “support” of the hypothesis being tested. Indeed, my research shows that over 400 factors (strategies that are supposed to beat the market) have been published in top journals. How is that possible? Finding alpha is very difficult.

He then outlines the incentive structure facing researchers and journals,

Academic journals compete with impact factors, which measure the number of times an article in a particular journal is cited by others. Research with a “positive” result (evidence supportive of the hypothesis being tested) garners far more citations than a paper with non-results. Authors need to publish to be promoted (and tenured) and to be paid more. They realize they need to deliver positive results.

To obtain positive outcomes, researchers often resort to extensive data mining. While in principle nothing is wrong with data mining if done in a highly disciplined way, often it is not. Researchers frequently achieve statistical significance (or a low p-value) by making choices. For example, many variables might be considered and the best ones are cherry picked for reporting. Different sample starting dates might be considered to generate the highest level of significance. Certain influential episodes in the data, such as the global financial crisis or COVID-19, might be censored because they diminish the strength of the results. More generally, a wide range of choices for excluding outliers is possible as well as different winsorization rules. Variables might be transformed—for example, log levels, volatility scaling, and so forth—to get the best possible fit. The estimation method used is also a choice. For example, a researcher might find that a weighted least squares model produces a “better” outcome than a regular regression.

These are just a sample of the possible choices researchers can make that all fall under the rubric of “p-hacking.” Many of these research practices qualify as research misconduct, but are hard for editors, peer reviewers, readers, and investors to detect. For example, if a researcher tries 100 variables and only reports the one that works, that is research misconduct. If a reader knew 100 variables were tried, they would also know that about five would appear to be “significant” purely by chance. Showing that a single variable works would not be viewed as a credible finding. 
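
Harvey's back-of-the-envelope arithmetic (100 tries at the 5% significance level implies roughly five false positives) is easy to verify with a quick Monte Carlo. The sketch below is my own illustration under those assumptions, using numpy and scipy; all variable names and parameter values are hypothetical:

```python
# A quick Monte Carlo check of the "100 variables, ~5 false positives" point.
# Everything here is simulated noise; names and parameters are illustrative only.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

n_obs = 250          # e.g. one year of daily observations
n_candidates = 100   # number of candidate predictors the researcher tries

# Target: random returns with no true relationship to anything.
returns = rng.normal(size=n_obs)

false_positives = 0
best_p = 1.0
for _ in range(n_candidates):
    candidate = rng.normal(size=n_obs)       # a pure-noise "factor"
    _, p_value = pearsonr(candidate, returns)
    if p_value < 0.05:
        false_positives += 1
    best_p = min(best_p, p_value)

print(f"{false_positives} of {n_candidates} noise variables have p < 0.05")
print(f"Smallest p-value found: {best_p:.4f}")
# Around five variables clear the 5% bar purely by chance; reporting only
# the "winner" makes a spurious factor look like a credible finding.
```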

His conclusion is what matters, 

The incentive problem, along with the misapplication of statistical methods, leads to the unfortunate conclusion that likely half of the empirical research findings in finance are likely false.

However, he feels that backtest-overfitted strategies (p-hacking) are less of a problem in the asset management industry than in academia, because of the need in the former to replicate results and protect reputations.
