How to apply machine learning in financial markets

Monday 17 February 2020

AI and Machine learning

Author: Chris Vryonides

In the last of this three-part series, Chris Vryonides, Director at AI consultancy Cognosian, provides some guidelines for those wanting to use machine learning in finance.

The field of machine learning (ML) is constantly evolving. Before attempting any analysis, it is important to be familiar with the fundamentals, particularly around model complexity, overfitting and how to control it. In short: understand how to increase the chances that the model will not fail out-of-sample.

Spend time exploring and visualising the data

This is fundamental to any ML project. Initially, we can examine the distribution of the data and identify issues such as missing values and outliers, and then determine how to clean the dataset (a topic in itself, but mandatory for most algorithms to have a chance of working).

The characteristics of the dataset will also inform the choice of algorithms to be explored.
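
As a minimal illustration of this first pass, the sketch below uses pandas to summarise a price file, count missing values, flag gross outliers and plot the return distribution. The file name and column names are hypothetical stand-ins, not a prescription.

```python
# A first exploratory pass using pandas. The file name ("prices.csv") and
# column names ("date", "close") are hypothetical stand-ins for real data.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")

# Distributional summary and missing-data counts
print(df.describe())
print(df.isna().sum())

# Daily returns and a crude outlier flag (more than 5 standard deviations)
returns = df["close"].pct_change().dropna()
z = (returns - returns.mean()) / returns.std()
print(f"{(z.abs() > 5).sum()} potential outliers out of {len(returns)} observations")

# Visualise the return distribution before deciding how to clean it
returns.hist(bins=100)
plt.show()
```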

Target application areas with known successes 

Ambitious blue-sky R&D requires large budgets and a stomach for a low and slow hit-rate. Absent this luxury, it is important to focus on application areas that have been proven successful. These include:

  • Ultra-short timescale price prediction for high frequency trading (HFT) using time series neural nets

  • Meta models where ML techniques are used to dynamically allocate across a number of simple models and signals

  • Hybrid approaches to support discretionary trader decisions

Furthermore, it can be beneficial to begin modelling with a gentle goal and then refine the model in the direction required. For example, if we are a liquidity taker holding intraday positions, we might initially train a model to predict very short-term moves. We can then push the prediction horizon out, where accuracy is more of a challenge but price moves are larger in magnitude and so more likely to overcome our transaction costs.
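
The sketch below illustrates this horizon trade-off on a hypothetical price series: it measures what fraction of forward moves at each horizon would exceed an assumed round-trip transaction cost. The file, column name and the 2 basis point cost are illustrative assumptions only.

```python
# Sketch: at which horizon do typical moves become large enough to clear
# transaction costs? The file, column name and the assumed 2bp round-trip
# cost are illustrative only.
import pandas as pd

prices = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")["close"]
cost = 0.0002  # assumed round-trip transaction cost (2 basis points)

for horizon in [1, 5, 15, 60]:  # bars ahead
    # Forward return over the next `horizon` bars, aligned to the current bar
    fwd_ret = prices.pct_change(horizon).shift(-horizon).dropna()
    clears_cost = (fwd_ret.abs() > cost).mean()
    print(f"horizon={horizon:3d} bars: {clears_cost:.1%} of moves exceed the cost")
```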


Understand how available ML models relate to the prediction task


It is critical to understand at least the basics of what lies under the hood of the algorithms in your arsenal. While industrial-strength open source libraries have liberated us from having to code algorithms from scratch, we still need to know enough to be non-dangerous.

In practice, for any given algorithm, this requires:

Appreciating what types of data it can handle:

  • Continuous / Discrete / Categorical etc.

  • Missing data (at random / not at random)

  • For time series: uniform / non-uniform sampling

We can often work around these issues if a particular model must be used, but at some point we must address:


What underlying dynamics is the model capable of capturing?

  • Snapshot or time series/memory

  • Does it require stationarity, or can it handle regime shifts?

  • For example, we shouldn’t naively apply Kaggle bellwether XGBoost¹ unless we can construct features that capture the temporal aspects of our data (a lagged-feature sketch follows this list)

What underlying objective function is being optimised? How does the training procedure work?

  • Is it continuous and differentiable? If so, TensorFlow (an open-source library for dataflow and automatic differentiation) or an equivalent framework should work well. If not, we may need to use global search optimisation methods such as evolutionary algorithms

What is the capacity of the model (i.e. how complex a training set can it fit)? With higher capacity comes a greater need to guard against overfitting via:

  • Regularisation and parameter restriction

  • Bayesian methods

  • More heuristic methods such as dropout for neural networks (NNs)
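
To make the XGBoost example and the capacity point concrete, here is a rough sketch: lagged returns give the snapshot model some temporal context, a walk-forward split respects time ordering, and capacity is reined in with shallow trees, subsampling and L2 regularisation. The synthetic data and parameter values are placeholders, not recommendations.

```python
# Sketch: giving a snapshot model (XGBoost) temporal context via lagged
# returns, with capacity controlled by explicit regularisation.
# Synthetic data and parameter values are placeholders only.
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 0.01, 2000))  # stand-in for real returns

# Lagged returns as features; the next-period return as the target
lags = [1, 2, 3, 5, 10]
X = pd.concat({f"lag_{k}": returns.shift(k) for k in lags}, axis=1)
y = returns.shift(-1).rename("target")
data = pd.concat([X, y], axis=1).dropna()

# Walk-forward split: train on the past, validate on the most recent data
split = int(len(data) * 0.8)
train, valid = data.iloc[:split], data.iloc[split:]

model = XGBRegressor(
    n_estimators=200,
    max_depth=3,       # shallow trees limit capacity
    learning_rate=0.05,
    subsample=0.8,     # row subsampling as an extra overfitting guard
    reg_lambda=1.0,    # L2 regularisation on leaf weights
)
model.fit(train[X.columns], train["target"])
print("out-of-sample R^2:", model.score(valid[X.columns], valid["target"]))
```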

It is crucial that we are familiar with established, traditional techniques for time series analysis, as these illuminate the ML approaches. Armed with knowledge of, say, vector autoregressions and state space models from the conventional side, we can see the underlying connections to recurrent neural networks and appreciate why unconstrained model parameters might be a very bad idea. This will also help demystify variants such as long short-term memory networks (LSTMs), once we realise they are merely non-linear state space models.
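
As a toy illustration of that connection, the snippet below runs a linear state-space recursion alongside a vanilla RNN cell update; the dimensions and randomly drawn matrices are arbitrary, and the deliberate scaling of the transition matrices hints at why unconstrained parameters can make the state explode.

```python
# Toy comparison: a linear state-space recursion versus a vanilla RNN cell.
# Dimensions and matrices are arbitrary; scaling the transition matrices
# below 1 hints at why unconstrained parameters can make the state explode.
import numpy as np

rng = np.random.default_rng(1)
dim_h, dim_x = 4, 2
A = rng.normal(size=(dim_h, dim_h)) * 0.3   # state transition
B = rng.normal(size=(dim_h, dim_x))         # observation loading
W = rng.normal(size=(dim_h, dim_h)) * 0.3   # recurrent weights
U = rng.normal(size=(dim_h, dim_x))         # input weights

h_lin = np.zeros(dim_h)
h_rnn = np.zeros(dim_h)
for _ in range(50):
    x = rng.normal(size=dim_x)            # observation at time t
    h_lin = A @ h_lin + B @ x             # linear state-space transition
    h_rnn = np.tanh(W @ h_rnn + U @ x)    # RNN: the same recursion, squashed

print(h_lin)
print(h_rnn)
```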

Understand the pitfalls of backtests

Less a can of worms, more a barrel of monkeys armed with keyboards.

It is impossible to imagine deploying a strategy without a back test. The problem is this: if we hurl enough (random) models at a back test, some will inevitably stick, and those will almost certainly betray us later.

This aspect is by far the biggest component of the “art” of quantitative investing. 

Differing approaches to mitigate this risk include:

  • Constraining the models to conform to some underlying hypothesis

  • Applying severe parameter restrictions, aiming for back test performance that is robust to parameter variations

  • Using Bayesian methods where possible to capture the impact of parameter uncertainty on forecasts

  • Ideally, a model trained on one market should produce good results out-of-sample on another market from the same asset class

  • Averaging a large number of simple models to spread model/parameter risk (this can be extremely effective at the cost of higher computational demands; see the sketch below)
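
A minimal sketch of that last idea, using a handful of moving-average crossover signals on synthetic prices; the lookback pairs, the synthetic data and the equal weighting are all illustrative assumptions.

```python
# Sketch: spreading parameter risk by averaging several simple moving-average
# crossover signals rather than picking the single best-backtesting lookback.
# The synthetic prices, lookback pairs and equal weighting are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, 3000))))

signals = []
for fast, slow in [(5, 20), (10, 50), (20, 100), (50, 200)]:
    # +1 when the fast moving average sits above the slow one, else -1
    sig = np.sign(prices.rolling(fast).mean() - prices.rolling(slow).mean())
    signals.append(sig)

# Equal-weight average of the signals, lagged one bar to avoid lookahead
combined = pd.concat(signals, axis=1).mean(axis=1).shift(1)
strategy_returns = (combined * prices.pct_change()).dropna()
sharpe = strategy_returns.mean() / strategy_returns.std() * np.sqrt(252)
print(f"toy annualised Sharpe of the averaged signal: {sharpe:.2f}")
```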

Alpha decay

Let’s imagine we didn’t overfit our training data and have found a genuinely profitable signal. High fives and back slaps all round! Now let us prepare for said alpha to dwindle as others find similar signals.

Do recent public competitions from highly successful shops (Two Sigma, XTX) represent targeted recruitment, or are they suggestive of an uphill battle against alpha decay?

We should always keep an eye on performance and be prepared to retire strategies when returns drop too much relative to back test expectations.
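
One simple way to formalise this monitoring is sketched below: compare a rolling annualised Sharpe ratio of live returns against the back test figure and raise a flag when it falls below some fraction of it. The threshold, window and synthetic "live" returns are assumptions for illustration.

```python
# Sketch: flag a strategy for review when its rolling live Sharpe falls well
# below the back test figure. Threshold, window and the synthetic "live"
# returns are illustrative assumptions.
import numpy as np
import pandas as pd

def rolling_sharpe(returns: pd.Series, window: int = 63) -> pd.Series:
    """Annualised rolling Sharpe ratio over `window` daily observations."""
    return returns.rolling(window).mean() / returns.rolling(window).std() * np.sqrt(252)

backtest_sharpe = 1.5     # expectation set during research (assumed)
retire_fraction = 0.3     # review once live Sharpe < 30% of the back test figure

rng = np.random.default_rng(3)
live_returns = pd.Series(rng.normal(0.0003, 0.01, 500))  # stand-in for live P&L

if rolling_sharpe(live_returns).iloc[-1] < retire_fraction * backtest_sharpe:
    print("Alpha decay flag: review or retire the strategy")
```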

Alternative data

Given sufficient resources (in particular, a budget for subscription fees), we might investigate alternative datasets and hybridise them with conventional signals. Bear in mind that the more accessible (cheaper) datasets may have less alpha, or at least a shorter half-life.

As such, alternative data may make more sense for larger firms, or as a standalone niche offering.

Automation

It is self-evident that in such a competitive environment, we need to automate as much of the process as possible, whilst bearing in mind the perils of throwing models at a back test.

Finally

ML is evolving at a remarkable pace and cutting-edge frameworks and research are being published constantly. It pays to keep abreast of developments; papers from other domains will often contain broadly applicable tips and tricks for coaxing better performance from familiar models.

Recommended books

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, by Aurélien Géron (Perfect blend of theory and practice for those wishing to dive in and apply the latest industry-standard toolkits)

The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani and Jerome Friedman (Rigorous coverage of the fundamentals. Online version available at: https://web.stanford.edu/~hastie/ElemStatLearn/)


Bayesian Reasoning and Machine Learning, by David Barber (Broad overview of classical machine learning methods with an exceptionally clear exposition on ML approaches to time series modelling. Online version available at: http://www.cs.ucl.ac.uk/staff/d.barber/brml/)

Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville (The standard book on neural networks and deep learning, available online at: http://www.deeplearningbook.org/front_matter.pdf)

Advances in Financial Machine Learning, by Marcos Lopez de Prado (Whilst by no means a comprehensive book on ML in finance, there is much interesting food for thought, particularly around backtesting)

 
----------------------------------------------------------------------------------------

¹ Kaggle is one of a number of online machine learning communities that host competitions to crowd-source ML solutions and expertise. XGBoost is an open-source ML algorithm that has been used in many winning submissions.

 

----------------------------------------------------------------------------------------

 

Read the first article of the series | Read the second article of the series

 

----------------------------------------------------------------------------------------

 

Chris Vryonides is Director at Cognosian, a consultancy providing bespoke AI solutions.
 
He has over 20 years' experience in quantitative finance, spanning both bulge bracket banking and investment management, and has been applying machine learning and statistical methods since 2008.
 

 

 

 



This article was produced on behalf of the CFA UK Fintech AI & Machine Learning working group. 

 

 
