Author: Chris Vryonides
In the last of this three-part series, Chris Vryonides, Director at AI consultancy Cognosian, provides some guidelines for those wanting to use machine learning in finance
The field of machine learning (ML) is constantly evolving. Before attempting any analysis, it is important to be familiar with the fundamentals, particularly around model complexity, overfitting and how to control it. In short: understand how to increase the chances that the model will not fail out-of-sample.
Spend time exploring and visualising the data
This is fundamental to any ML project. Initially, we can examine the distribution of the data, identify issues such as missing values and outliers, and then determine how to clean the dataset (a topic in itself, but mandatory for most algorithms to have a chance of working).
The characteristics of the dataset will also inform the choice of algorithms to be explored.
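As a minimal sketch of such a first pass (the file name and column names below are hypothetical, purely for illustration):

```python
# Minimal exploratory sketch, assuming a hypothetical CSV of daily bars
# with columns "date", "close" and "volume" (names are illustrative only).
import pandas as pd

df = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")

# Summary statistics and missing-data counts give a first feel for the dataset
print(df.describe())
print(df.isna().sum())

# Simple return distribution: fat tails and outliers show up immediately
returns = df["close"].pct_change().dropna()
print(returns.skew(), returns.kurtosis())
returns.hist(bins=100)  # requires matplotlib to be installed
```

Even this much will typically surface gaps, stale prices and fat-tailed returns before any modelling begins.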
Target application areas with known successes
Ambitious blue-sky R&D requires large budgets and a stomach for a low and slow hit-rate. Absent this luxury, it is important to focus on application areas that have been proven successful. These include:
- Ultra-short time scale price prediction for high frequency trading (HFT) using time series neural nets
- Meta models where ML techniques are used to dynamically allocate across a number of simple models and signals
- Hybrid approaches to support discretionary trader decisions
Furthermore, it can be beneficial to begin modelling with a gentle goal, then refine the model in the direction required. For example, suppose we are a liquidity taker taking intraday positions: we might initially train a model to predict very short-term moves. We can then push the prediction horizon out, where accuracy is more of a challenge but price moves are larger in magnitude and so more likely to overcome our transaction costs.
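A hedged sketch of this "start easy, then stretch" idea: build up/down targets at increasing horizons and watch how a simple classifier's accuracy behaves. The feature set, classifier and horizons are placeholders, not a recommendation:

```python
# Illustrative only: push a prediction horizon out step by step.
# "features" and "close" are assumed placeholders, not from the article.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def horizon_accuracy(features: pd.DataFrame, close: pd.Series, horizon: int) -> float:
    """Train a simple up/down classifier for a given horizon (in bars)."""
    target = (close.shift(-horizon) > close).astype(int)
    X, y = features.iloc[:-horizon], target.iloc[:-horizon]
    # Keep the split chronological: never shuffle time series data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False, test_size=0.3)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Accuracy typically drops as the horizon lengthens, but per-trade moves grow,
# so the economically interesting horizon is where edge still beats costs, e.g.:
# for h in (1, 5, 20):
#     print(h, horizon_accuracy(features, close, h))
```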
Understand how available ML models relate to the prediction task
It is critical to understand at least the basics of what lies under the hood of the algorithms in your arsenal. While industrial-strength open source libraries have liberated us from having to code algorithms from scratch, we still need to know enough to be non-dangerous.
In practice, for any given algorithm, this requires:
Appreciating what types of data it can handle:
- Continuous / Discrete / Categorical etc.
- Missing data (at random / not at random)
- For time series: uniform / non-uniform sampling
We can often work around these issues if a particular model must be used, but at some point we must address:
What underlying dynamics is the model capable of capturing?
- Snapshot or time series/memory
- Does it require stationarity, or can it handle regime shifts?
- For example, we shouldn’t naively apply the Kaggle bellwether XGBoost1 unless we can set up suitable features that capture the temporal aspects of our data (see the sketch below)
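For instance, a minimal sketch of hand-built temporal features that let a snapshot learner such as XGBoost see recent history (column names and window lengths are illustrative assumptions):

```python
# Sketch of hand-building temporal features for a snapshot learner such as
# a gradient-boosted tree. Lags and windows are illustrative choices only.
import pandas as pd

def make_temporal_features(close: pd.Series) -> pd.DataFrame:
    returns = close.pct_change()
    feats = pd.DataFrame(index=close.index)
    for lag in (1, 2, 3, 5, 10):
        feats[f"ret_lag_{lag}"] = returns.shift(lag)   # lagged returns
    feats["vol_20"] = returns.rolling(20).std()        # recent volatility
    feats["mom_20"] = close.pct_change(20)             # medium-term momentum
    return feats.dropna()

# These features could then be fed to e.g. xgboost.XGBClassifier alongside
# a suitably aligned forward-looking target.
```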
What underlying objective function is being optimised? How does the training procedure work?
- Is it continuous and differentiable? If so, TensorFlow (an open-source machine learning framework) or an equivalent automatic differentiation framework should work well. If not, we may need to use global search optimisation methods such as evolutionary algorithms (see the sketch below)
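As a sketch of the non-differentiable case, a global search such as SciPy's differential evolution can stand in for gradient descent; the objective below is a placeholder surface, not a real strategy:

```python
# Sketch only: when the objective is not differentiable (e.g. a Sharpe ratio
# from a trading simulation with discrete position rules), a global search
# such as differential evolution can replace gradient descent.
import numpy as np
from scipy.optimize import differential_evolution

def negative_sharpe(params: np.ndarray) -> float:
    fast, slow = params
    # ... run a backtest with these parameters and return -Sharpe ...
    return (fast - 10) ** 2 + (slow - 50) ** 2  # placeholder surface

result = differential_evolution(negative_sharpe, bounds=[(2, 50), (20, 200)], seed=0)
print(result.x)
```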
What is the capacity of the model (how complex a training set can it fit)? With higher capacity comes a greater need to guard against overfitting (see the sketch after this list) via:
- Regularisation and parameter restriction
- Bayesian methods
- More heuristic methods such as dropout for neural networks (NNs)
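As an illustration of the last two points, a minimal sketch of capacity control in a small feed-forward network, combining an L2 weight penalty with dropout (layer sizes, rates and the choice of Keras are illustrative assumptions, not recommendations):

```python
# Illustrative sketch of capacity control in a small feed-forward network:
# L2 weight penalties plus dropout. All sizes and rates are arbitrary.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        32, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # parameter shrinkage
    tf.keras.layers.Dropout(0.3),                             # heuristic regulariser
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)
```

In practice the regularisation strength and dropout rate would be chosen by validation on data held out in time.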
It is crucial that we are familiar with established, traditional techniques for time series analysis, as they illuminate the ML approaches. Armed with knowledge of, for example, vector autoregressions and state space models from the conventional side, we can see the underlying connections to recurrent neural networks and appreciate why unconstrained model parameters might be a very bad idea. This also helps demystify variants such as long short-term memory networks (LSTMs), once we realise they are merely non-linear state space models.
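To make that connection concrete, compare a linear Gaussian state space model with a basic recurrent network; the notation below is ours, purely schematic:

```latex
% Linear state space model (conventional time series side):
\[
  h_t = A\,h_{t-1} + B\,u_t + w_t, \qquad y_t = C\,h_t + v_t
\]
% A basic recurrent neural network is the non-linear analogue:
\[
  h_t = \tanh\!\left(A\,h_{t-1} + B\,u_t\right), \qquad y_t = C\,h_t
\]
```

The recurrence has the same shape in both cases; the non-linearity (and the absence of an explicit noise model) is what changes, which is why intuition about stability and parameter constraints carries over.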
Understand the pitfalls of backtests
Less a can of worms, more a barrel of monkeys armed with keyboards.
It is impossible to imagine deploying a strategy without a backtest. The problem is this: if we hurl enough (random) models at a backtest, some will inevitably stick, and they will almost certainly betray us later.
This aspect is by far the biggest component of the “art” of quantitative investing.
Differing approaches to mitigate this risk include:
- Constraining the models to conform to some underlying hypothesis
- Applying severe parameter restrictions, and aiming for backtest performance that is robust to variations in the parameters
- Using Bayesian methods where possible to capture the impact of parameter uncertainty on forecasts
- Checking transferability: ideally, a model trained on one market should produce good results out-of-sample on another market from the same asset class
- Averaging a large number of simple models to spread model/parameter risk; this can be extremely effective, at the cost of higher computational demands (see the sketch below)
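A minimal sketch of that last idea, averaging simple moving-average crossover signals over a grid of window lengths (the signal rule and the grids are illustrative assumptions):

```python
# Sketch of spreading parameter risk by averaging many simple signals:
# moving-average crossover signals over a grid of window lengths.
import numpy as np
import pandas as pd

def crossover_signal(close: pd.Series, fast: int, slow: int) -> pd.Series:
    return np.sign(close.rolling(fast).mean() - close.rolling(slow).mean())

def averaged_signal(close: pd.Series) -> pd.Series:
    signals = [
        crossover_signal(close, fast, slow)
        for fast in (5, 10, 20)
        for slow in (50, 100, 200)
    ]
    # Equal-weight average: no single parameter choice dominates the outcome
    return pd.concat(signals, axis=1).mean(axis=1)
```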
Alpha decay
Let’s imagine we didn’t overfit our training data and have found a genuinely profitable signal. High fives and back slaps all round! Now let us prepare for said alpha to dwindle as others find similar signals.
Do recent public competitions run by highly successful shops (Two Sigma, XTX) represent targeted recruitment, or are they suggestive of an uphill battle against alpha decay?
We should always keep an eye on live performance and be prepared to retire strategies when returns drop too far relative to backtest expectations.
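One hedged way to operationalise this: track a rolling realised Sharpe ratio and flag the strategy for review when it falls well below the backtest's expectation. The window and threshold below are illustrative assumptions, not recommendations:

```python
# Sketch of a simple live-performance monitor: compare a rolling realised
# Sharpe ratio against the backtest's expectation and flag degradation.
import numpy as np
import pandas as pd

def rolling_sharpe(daily_returns: pd.Series, window: int = 126) -> pd.Series:
    mean = daily_returns.rolling(window).mean()
    std = daily_returns.rolling(window).std()
    return np.sqrt(252) * mean / std  # annualised

def flag_decay(live_returns: pd.Series, backtest_sharpe: float) -> bool:
    recent = rolling_sharpe(live_returns).iloc[-1]
    # Retire or review the strategy if live performance falls well below plan
    return bool(recent < 0.5 * backtest_sharpe)
```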
Alternative data
Given sufficient resources (in particular, subscription fees), we might investigate alternative datasets and hybridise with conventional signals. Bear in mind that the more accessible (cheaper) datasets may have less alpha, or at least a shorter half-life.
As such, alternative data may make more sense for larger firms, or as a standalone niche offering.
Automation
It is self-evident that, in such a competitive environment, we need to automate as much of the process as possible, whilst bearing in mind the perils of throwing models at a backtest.
Finally
ML is evolving at a remarkable pace and cutting-edge frameworks and research are being published constantly. It pays to keep abreast of developments; papers from other domains will often contain broadly applicable tips and tricks for coaxing better performance from familiar models.
Recommended books
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, by Aurélien Géron (A perfect blend of theory and practice for those wishing to dive in and apply the latest industry-standard toolkits)
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani and Jerome Friedman (Rigorous coverage of the fundamentals. Online version available at: https://web.stanford.edu/~hastie/ElemStatLearn/)
Bayesian Reasoning and Machine Learning, by David Barber (Broad overview of classical machine learning methods, with an exceptionally clear exposition of ML approaches to time series modelling. Online version available at: http://www.cs.ucl.ac.uk/staff/d.barber/brml/)
Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville (The standard book on neural networks and deep learning. Online version available at: http://www.deeplearningbook.org/front_matter.pdf)
Advances in Financial Machine Learning, by Marcos Lopez de Prado (Whilst by no means a comprehensive book on ML in finance, there is much interesting food for thought, particularly around backtesting)
----------------------------------------------------------------------------------------
1 Kaggle is one of a number of online machine learning communities that host competitions to crowd-source ML solutions and expertise. XGBoost is an open-source ML library that has been used in many winning submissions
----------------------------------------------------------------------------------------
Chris Vryonides is a Director at Cognosian, a consultancy providing bespoke AI solutions.
He has over 20 years' experience in quantitative finance, in both bulge bracket banking and investment management, and has been applying machine learning and statistical methods since 2008.
This article was produced on behalf of the CFA UK Fintech AI & Machine Learning working group.