Authors: Alexander Denev and Saeed Amen
One of the limitations of machine learning, and of statistical models in general, is that they are commonly trained on historical datasets and on a narrow set of variables.
This means that they sometimes lack the breadth (number of variables) to draw robust generalizations that will hold in the future, especially in periods of regime shifts, where variables omitted from our dataset could explain the behaviour of the system in the new regime. According to Pearl (Pearl, J., 2009. Causality. Cambridge University Press), causal relationships are more robust than spurious correlations and allow us to respond to changes over time. This is certainly true, with the caveat that causality is difficult to find, especially in financial markets, where, additionally, the causal driving forces can change continuously, e.g. the recent Covid-19 pandemic, a new phenomenon. In such periods, historical data is less relevant than an appropriate extension of the dataset, e.g. by including alternative data sources, especially if we deem they contain the “true” causal drivers, and not just a proxy.
Understanding causality
If we can understand causality in markets, we might also have a better idea of the prevailing market regime. Often the market regime can impact the performance of a trading strategy. Trend following tends to do best when there are large, outsized market moves, as we saw in oil markets during the earlier part of 2020. Indeed, trend following outperformed many other strategies during the initial sell-off over March (see FT: Quant pioneer Winton suffers in coronavirus driven sell-off 6 Apr 2020) because of these outsized moves. In general, during more range-bound market regimes, trend following tends to underperform, as it might generate signals to buy at the high and sell at the low before a reversal. By contrast, quant strategies such as statistical arbitrage[1] came unstuck during this same period (see FT: DE Shaw quant fund takes hits from markets gone haywire 24 March 2020).
If we think about discretionary trading, and in particular successful macro trading, the key is not only to make the right calls, but also to identify the market theme (in other words the causal driving force) and the likely market regime, preferably early on. In a sense, it is like a quantitative trader having an overlay which reweights models according to the current market environment or regime[2]. Indeed, we note that some macro funds performed well during the March sell-off (see Bloomberg: Chris Rokos’s Macro Hedge Fund Surges 14% in Best Month Ever 14 Apr 2020).
Drivers in 2020
We can think of many examples of how drivers impact markets and how they might change. In recent months, whilst the second wave of Covid-19 has unfortunately become a reality, it is also the case that positive drivers are coming to the fore; indeed, we are currently seeing news about coronavirus vaccines positively impacting markets. Thinking of other examples of how themes can change, looking ahead we might expect certain news-based market drivers to fade, such as the impact of Trump’s tweets once Biden is inaugurated.
In this paper, we will investigate the impact of news associated with Covid-19, Oil and also Brexit as causal drivers of markets through the framework of Probabilistic Graphical Models (PGM).
We show an example in Figure 1 of three news data sources on Oil, Covid-19 and Brexit from Bloomberg in the period 1st March 2020 – 1st May 2020 and their impact on FTSE 100 and its implied volatility.
Figure 1
Figure 1 - A Probabilistic Graphical Model trained on news count %changes and market data (01.03.2020-01.05.2020). The model (both the structure and the parameters) has been trained on the data with the PC algorithm, yielding the equations: FTSE Implied Vol Changes = 5.2626 * Oil News stream + Normal(-0.499299, 4.45033); FTSE Returns = -0.00482465 * FTSE Implied Vol Changes - 0.0636198 * Covid-19 News stream + Normal(-0.00197189, 0.0231985). NB we omit the marginal distributions of all the news stream nodes.
The model in Figure 1 is a Probabilistic Graphical Model trained automatically on the data for that period. We see that daily changes in news counts on Oil and Covid-19 impact FTSE returns and FTSE Implied Vol, while the Brexit news stream has no impact. This makes intuitive sense: the period was turbulent both because of the pandemic and because of the oil supply debacle.
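To make the mechanics concrete, below is a minimal sketch in Python of how such a model could be assembled. The file names, column names and the use of the open-source causal-learn implementation of the PC algorithm are all assumptions for illustration; we are not describing the exact Bloomberg feeds or software used to produce Figure 1.

```python
import pandas as pd
from causallearn.search.ConstraintBased.PC import pc

# Hypothetical inputs: daily news story counts per topic and FTSE market data.
# File and column names are illustrative placeholders, not the actual Bloomberg feeds.
news = pd.read_csv("news_counts.csv", index_col=0, parse_dates=True)    # columns: oil, covid19, brexit
market = pd.read_csv("ftse_market.csv", index_col=0, parse_dates=True)  # columns: ftse_close, ftse_implied_vol

df = news.join(market, how="inner")

# Daily % changes in news counts, FTSE returns and FTSE implied vol changes
features = pd.DataFrame({
    "oil_news": df["oil"].pct_change(),
    "covid_news": df["covid19"].pct_change(),
    "brexit_news": df["brexit"].pct_change(),
    "ftse_ret": df["ftse_close"].pct_change(),
    "ftse_ivol_chg": df["ftse_implied_vol"].diff(),
}).dropna()

# Learn the graph structure on the March-May 2020 window with the PC algorithm,
# using the default Fisher-Z conditional independence test for continuous data
cols = ["oil_news", "covid_news", "brexit_news", "ftse_ret", "ftse_ivol_chg"]
window = features.loc["2020-03-01":"2020-05-01"]
cg = pc(window[cols].to_numpy(), alpha=0.05)

print(cols)
print(cg.G.graph)  # adjacency matrix of the learned graph (rows/columns follow `cols`)
```

Given the learned parent sets, the linear Gaussian equations quoted in the figure captions correspond to ordinary least squares regressions of each node on its parents, with the residual standard deviation supplying the Normal noise term.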
We also see a link between FTSE Implied Vol and FTSE returns, which captures the covariance between these two variables that is left unexplained by the news stream data. Interestingly, the training algorithm has correctly captured the exogeneity of the news stream data. It is also saying that Oil does not have a direct impact on FTSE returns, as it is screened off by FTSE Implied Vol. The correlation between FTSE Implied Vol and FTSE returns is -72% for that period. After conditioning on the Covid-19 variable, it goes to -62%, which means that this driver “explains it away” to a certain extent. We can also see, for example, that roughly 20% of the variance of FTSE Implied Vol is “explained” by the variance of the daily changes in oil news data.
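The “explaining away” and variance-explained figures quoted above can be reproduced with simple regressions. Below is a minimal sketch, reusing the hypothetical `window` DataFrame from the previous sketch; the numbers it produces depend entirely on the underlying news data.

```python
import numpy as np
import statsmodels.api as sm

def residualise(y, conditioning):
    """Residuals of y after regressing out the conditioning variables (with an intercept)."""
    X = sm.add_constant(conditioning)
    return sm.OLS(y, X).fit().resid

# Raw correlation between FTSE implied vol changes and FTSE returns
raw_corr = np.corrcoef(window["ftse_ivol_chg"], window["ftse_ret"])[0, 1]

# Partial correlation after conditioning on the Covid-19 news stream: if the news stream
# "explains away" part of the co-movement, this shrinks in magnitude relative to raw_corr
res_ivol = residualise(window["ftse_ivol_chg"], window[["covid_news"]])
res_ret = residualise(window["ftse_ret"], window[["covid_news"]])
partial_corr = np.corrcoef(res_ivol, res_ret)[0, 1]

# Share of implied vol variance "explained" by oil news changes (R^2 of a simple regression)
r2_oil = sm.OLS(window["ftse_ivol_chg"],
                sm.add_constant(window[["oil_news"]])).fit().rsquared

print(f"raw corr {raw_corr:.2f}, partial corr given Covid news {partial_corr:.2f}, R^2 vs oil news {r2_oil:.2f}")
```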
Comparing the drivers
Rewind slightly back in time and we see a different picture. We show an example in Figure 2 of three news data sources on Oil, Covid-19 and Brexit in the period 1st January 2020 – 1st March 2020 and their impact on FTSE 100 and its implied volatility.
Figure 2
Figure 2 - A Probabilistic Graphical Model trained on news count %changes and market data (01.01.2020-01.03.2020). The model (both the structure and the parameters) has been trained on the data with the PC algorithm and yields the equation: FTSE Returns = -0.00375547 * FTSE Implied Vol Changes + Normal(-0.000476263, 0.0055734). NB we omit the marginal distributions of all the news stream nodes.
There wasn’t as much Covid-19 news during that period, compared to the large amount of Brexit news. Oil markets were also relatively calm. Hence, the three news streams have no impact on the market variables of interest. The correlation between FTSE Implied Vol and FTSE returns was -87% for the period, as captured by the arrow and the underlying equation. It seems that we did not include the right drivers for that period and should have looked into other exogenous data streams (however, this has a cost, as we explain shortly).
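One way to make such period-by-period comparisons systematic is to re-estimate the graph over successive windows and watch which news streams gain or lose edges into the market variables. A minimal sketch, reusing `features`, `cols` and the causal-learn `pc` call from the earlier sketches:

```python
# Re-estimate the structure on different sub-periods and compare the learned edges
periods = [("2020-01-01", "2020-03-01"), ("2020-03-01", "2020-05-01")]

for start, end in periods:
    sub = features.loc[start:end]
    cg = pc(sub[cols].to_numpy(), alpha=0.05)
    print(f"{start} to {end}")
    print(cg.G.graph)  # non-zero entries indicate edges in the learned graph
```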
Rewind again to 1st August 2017 – 1st October 2017 and you see a different reality, represented in Figure 3. As expected, Brexit is the only driver for that period.
Figure 3
Figure 3 - A Probabilistic Graphical Model trained on news and market data (01.09.2017-01.10.2017). FTSE Implied Vol Changes = -0.543462 * Brexit News stream + Normal(0.0898095, 0.890205). FTSE Returns = -0.00468426 * FTSE Implied Vol Changes + Normal(-3.48278e-05, 0.00379985).
Inputs and outputs
We must say that sometimes there is an economic rationale for not extending our datasets. Having breadth and history comes at a cost which sometimes outweighs the benefits of obtaining a better understanding or prediction. Increasing the number of variables might require additional data sources, which means paying license fees; obtaining longer history also has a cost. Moreover, the technological challenges of treating and storing diverse data sources increase exponentially with their number. Last but not least, you need data scouts and economists to quickly identify drivers and the relevant datasets containing them. We discuss these issues, and how to resolve some of them, in “The Book of Alternative Data” (John Wiley & Sons, July 2020).
We often hear statements that models “break” over time, meaning that the relationships between inputs and outputs unexpectedly change. We would expect this to happen less frequently in, say, language translation and image recognition tasks, but in non-stationary systems like financial markets it can happen very often. The way to understand, at the very least, what drives markets is to think about what the true causal drivers are (sometimes manually, sometimes through automatic learning when the drivers are too many) and to be reactive enough to, say, rebalance your portfolio in time. In other words, we need to be able to identify the market theme, and preferably early on.
One of us argued in the book “Portfolio Management under Stress – A Bayesian Net Approach to Coherent Asset Allocation” (Rebonato, R. and Denev, A., 2014. Portfolio Management under Stress: A Bayesian-Net Approach to Coherent Asset Allocation. Cambridge University Press), written in the wake of the great financial crisis, that in “normal times” markets have regularities and can be modelled statistically, while in turbulent times, especially under exogenous shocks, a causal understanding of the drivers of the regime shift is needed. In such situations the observations can be very few, so statistical models such as the one in this article might not even apply, and a more Bayesian approach is warranted. Maybe we are living in those times again? Maybe many of the models trained on the last few years of financial data are no longer relevant? Soon we might be pondering again what happened to our models, as we did during and after the great financial crisis. Hopefully, we will have learned our lessons this time.
--------------------------------------------
[1] At its very simplest level, statistical arbitrage relies on trading pairs of similar stocks. For example, we might sell BP and buy Shell, if they have diverged based on some historical relationship.
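For illustration only, a toy version of such a pairs signal might look like the sketch below; the rolling z-score of the log price spread and the two-standard-deviation entry threshold are arbitrary assumptions, not a description of how any statistical arbitrage fund actually trades.

```python
import numpy as np
import pandas as pd

def pairs_signal(px_a: pd.Series, px_b: pd.Series, lookback: int = 60, entry_z: float = 2.0) -> pd.Series:
    """Toy pairs signal: +1 = buy A / sell B, -1 = sell A / buy B, 0 = flat."""
    spread = np.log(px_a) - np.log(px_b)  # log price spread, e.g. Shell vs BP
    z = (spread - spread.rolling(lookback).mean()) / spread.rolling(lookback).std()
    signal = pd.Series(0, index=spread.index)
    signal[z > entry_z] = -1   # spread unusually wide: sell A, buy B
    signal[z < -entry_z] = 1   # spread unusually narrow: buy A, sell B
    return signal
```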
[2] We must say that, if a trader makes an informed call about a theme, but the market ends up ignoring the theme totally, it is unlikely he or she will be able to monetise that view. Of course, we can have the situation where a theme gradually becomes more relevant over time or fades into the background.
Alexander Denev leads the AI and Data Science team for Financial Services - Risk Advisory at Deloitte. His responsibilities include the development of AI products and advisory around the risks and ethical implementation of AI. He also focusses on Alternative Data, AI for Data Quality and AI for Retail Banking.
Alexander is a former Head of Quantitative Research & Advanced Analytics at IHS Markit. Prior to that, he worked at Risk Dynamics (McKinsey & Company), The Royal Bank of Scotland, the European Investment Bank (EIB), the European Investment Fund (EIF), the National Bank of Greece and Societe Generale. He holds a degree in Mathematical Finance from the University of Oxford, where he is a Visiting Lecturer on Bayesian Risk Management and Alternative Data.
He has written several papers and books on quantitative topics, ranging from stress testing and scenario analysis to asset allocation using Machine Learning techniques and alternative data. Alexander frequently takes on thought leadership engagements at conferences, in journals and at global fora.
Saeed Amen is the founder of Cuemacro and co-founder of Thalesians. Over the past fifteen years, he has developed systematic trading strategies at major investment banks including Lehman Brothers and Nomura. He is also the author of Trading Thalesians: What the ancient world can teach us about trading today (Palgrave Macmillan) and co-author of The Book of Alternative Data (Wiley), published in 2020.
Through Cuemacro, he now consults and publishes research for clients in the area of systematic trading. He has developed many Python libraries including finmarketpy and tcapy for transaction cost analysis. His clients have included major quant funds and data companies such as Bloomberg. He has presented his work at many conferences and institutions which include the ECB, IMF, Bank of England, and Federal Reserve Board.