The last few years have seen increased interest in Machine Learning – Neural Networks (NN's) – in finance. I will focus here specifically on the application of NN's to function approximation, so basically derivatives pricing and consequently risk calculations. The goal is either to “turbo-boost” models and calculations that are already in production systems, or to allow new models that were previously not practical due to their high computational cost. I've been hearing claims of millions of times faster calculations, and that the Universal Approximation Theorem guarantees that any continuous function can be approximated accurately by even the simplest kind of net, given enough neurons and training samples. But as usual the devil is in the details: exactly how many training samples are we talking about for a required level of accuracy? What if, to guarantee acceptable accuracy, one would need impractically large training sets (and hence data generation and training times)? And never mind millions of times faster (probably more realistic for quantum computers when they arrive); I would be satisfied with 100 or even 10 times faster, provided it was easy enough to implement and deploy.
There have been waves of renewed interest in NN's for decades, but their recent resurgence is in large part due to their being made more accessible via Python packages like TensorFlow and PyTorch, which hide away the nitty-gritty that would otherwise dishearten most recent users. So, having been spurred on by a couple of people to check out this seemingly all-conquering method, I decided to see for myself. Moreover, this seems more like an engineering exercise than a mathematical one: one needs to be willing to try a lot of different combinations of “hyperparameters”, keep an eye on how all of this affects convergence, and have a knack for finding little tricks to improve it. Not to mention "free" time and relatively cheap, sustainable electricity. That's me this year then, I thought.
I am not sure what the state of adoption of NN's in finance is right now. The relevant papers all seem to be fairly recent and more of the proof-of-concept type. What puzzles me in particular is the narrow input parameter range the models are trained on. This does not seem that useful; surely one would need more coverage than that in practice? And then there is talk that such (NN) models would need to be retrained from time to time, when market parameters move out of the “pre-trained” range. Now I may be missing something here. First, I wonder why people use the term “NN models” in this case. It would be less confusing to call them "NN approximations", since they simply seek to reproduce the output of existing models. The application of NN's in this context really produces a kind of look-up table to interpolate from. As for the limited parameter range usually chosen for demonstrations, I imagine it has to do with the fact that the resulting NN's accuracy is of course higher that way. But it gives me no assurance that the method would be suitable for real-world use, where one would need quite a bit higher accuracy than what I have seen so far, as well as wider input ranges.
So I decided I wanted hard answers: can I approximate a model (function) in the whole (well, almost) of its practically useful parameter space, so that there's no need to ever retrain it (unless the original model is changed, of course)? To this end I chose a volatility model calibration exercise as my first case study, which is convenient because I had worked on this before. Note that a benefit of the NN approach (if it turns out it does the job well) is that one could make the calibration of any model super-fast, and thus enable such models as alternatives to the likes of Heston and SABR. The latter are popular exactly because they have fast analytical solutions or approximations that make calibration to market vanillas possible in practical timescales.
To demonstrate said benefit, the logical thing to do here would be to come up with a "steroidal" version of the non-affine model demo calibrator. The PDE method used there to solve for vanilla prices under those models is highly optimized and could be used to generate the required millions of training samples for the NN in a reasonable amount of time. To keep things familiar though, and save some time, I will just try this on the Heston model, whose training samples can be generated a bit faster still (I used QLib's implementation for this). If anyone wants me to produce a high-accuracy, production-level calibrator for any other volatility model, by all means get in touch.
Like I said above, the trained parameter range is much wider than anything I've seen published so far, though still kind of arbitrary. I chose it to cover the wide moneyness (S/K) range represented by the 246 SPX option chain A from . Time to maturity ranges from a few days up to 3 years, and the model parameter ranges should cover most markets. The network has about 750,000 trainable parameters, which may be too many, but one does not know beforehand what accuracy can be achieved with what architecture; it is possible that a smaller (and thus even faster) network can give acceptable results. 32 million optimally placed training samples were used. I favored accuracy over speed here, if anything to see just how accurate the NN approximation can practically get, but also because higher accuracy minimizes the chance of the optimizer converging to different locales (apparent local minima) depending on the starting parameter vector (see  for more on this).
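The post doesn't say how the training samples were "optimally placed"; one common way to fill a parameter box more evenly than uniform random sampling is a low-discrepancy (Halton) sequence. Here is a minimal sketch; the Heston parameter bounds below are hypothetical stand-ins, since the actual training ranges are not given:

```python
import numpy as np

def halton(i, base):
    """i-th element (1-indexed) of the van der Corput sequence in the given base."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

# Hypothetical Heston parameter box (kappa, theta, sigma, rho, v0);
# the bounds used for the actual NN are not stated in the post.
lo = np.array([0.1, 0.01, 0.05, -0.95, 0.01])
hi = np.array([10.0, 1.0, 2.0, 0.5, 1.0])
bases = [2, 3, 5, 7, 11]  # one prime base per dimension

n = 1000  # the actual run used 32 million points
u = np.array([[halton(i, b) for b in bases] for i in range(1, n + 1)])
samples = lo + u * (hi - lo)  # n x 5 array of training points in the box
```

For large dimensions one would use scrambled sequences or a library implementation, but the idea of deterministic, evenly spread training points is the same.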
Overall this is a vanilla calibration set-up, where a standard optimizer (Levenberg-Marquardt) is used to minimize the root mean square error between the market and model IV's, with the latter provided by the (Deep) NN approximation. There are other more specialized approaches involving NN's designed specifically for the calibration exercise, see for example the nice paper by Horvath et al. . But I tried to keep things simple here. So how does it perform then?
The mean absolute error is about 1e−3 implied volatility percentage points (i.e. 0.1 implied volatility basis points).
Obviously for individual use cases the NN specification would be customized for a particular market, thus yielding even better accuracy and/or higher speed. Either way I think this is a pretty satisfactory result and it can be improved upon further if one allows for more resources.
In terms of calibration then, how does that IV accuracy translate to predicted model parameter accuracy? I will use real-world market data that I had used before to test my PDE-based calibrator; the two SPX option chains from  and the DAX chain from . As can be seen in Table 1, the calibration is very accurate and takes a fraction of a second. In practice one would use the last result as the starting point which should help the optimizer converge faster still.
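The set-up described above, a Levenberg-Marquardt loop minimizing the misfit between market and surrogate IV's, can be sketched as follows. The surrogate here is a hypothetical stand-in for the trained NN (linear in its parameters so the toy example converges cleanly), and the damping schedule is the textbook one, not necessarily what the demo app uses:

```python
import numpy as np

# Hypothetical stand-in for the trained NN surrogate: parameters -> implied
# vols on a fixed (log-moneyness, maturity) grid. A real NN surrogate is
# nonlinear in its inputs, but the calibration loop is identical.
def surrogate_iv(p, logk, t):
    a, b, c = p
    return a + b * logk**2 / (1 + t) + c * logk / (1 + t)

logk = np.linspace(-0.3, 0.3, 7)
t = np.array([0.1, 0.5, 1.0, 2.0])
LK, T = np.meshgrid(logk, t)
p_true = np.array([0.2, 0.5, -0.1])
market = surrogate_iv(p_true, LK, T).ravel()  # synthetic "market" IV's

def residuals(p):
    return surrogate_iv(p, LK, T).ravel() - market

def jacobian(p, eps=1e-6):
    # Forward finite differences; a real NN would supply exact gradients.
    r0 = residuals(p)
    J = np.empty((r0.size, p.size))
    for j in range(p.size):
        dp = p.copy()
        dp[j] += eps
        J[:, j] = (residuals(dp) - r0) / eps
    return J

# Levenberg-Marquardt with a simple multiplicative damping schedule.
p, lam = np.array([0.3, 0.0, 0.0]), 1e-3
for _ in range(50):
    r, J = residuals(p), jacobian(p)
    step = np.linalg.solve(J.T @ J + lam * np.eye(p.size), -J.T @ r)
    if np.linalg.norm(residuals(p + step)) < np.linalg.norm(r):
        p, lam = p + step, lam / 10   # accept step, trust the model more
    else:
        lam *= 10                     # reject step, increase damping
```

In practice one would of course use a battle-tested LM implementation; the point is only that the NN replaces the expensive pricer inside `residuals`, which is where the speed-up comes from.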
For those who need hard evidence that this actually works as advertised, there's the self-contained console app demo below to download. The options data files for the 3 test chains are included. The calibrator always starts from the same parameter vector (see Table 1) and uses only the CPU, to keep things simple and facilitate comparisons with traditional methods. I have also included a smaller NN approximation that is only slightly less accurate on average but more than twice as fast.
So to conclude, it is fair to say that the NN (or should I say Machine Learning) approach passed the first real test I threw at it with relative ease. Yes, one needs to invest time to get a feel for what works and come up with ways to optimize, since this is basically an engineering problem. But the results show that, at least in this case, one can get all the accuracy practically needed in a reasonable amount of (offline) time and resources.
Finally, let's briefly mention the two main perceived issues with this approach. The first is: how does one guarantee accuracy everywhere in the input parameter hyperspace (i.e., in the tails of the error distribution)? I agree this is an issue, but there are ways to increase confidence, especially in relatively simple cases like the one here. The other is the lack of transparency and/or interpretability.
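On the first issue, one simple way to increase confidence is an out-of-sample probe: evaluate both the reference model and the approximation on a dense random sweep of the input box and look at tail quantiles of the error, not just the mean. A minimal sketch with stand-in functions (in the real exercise the reference would be the PDE/analytic pricer and the approximation the trained NN):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a "reference" model and an approximation of it
# that carries a small, oscillating error.
def reference(x):
    return np.sin(3 * x)

def approximation(x):
    return np.sin(3 * x) + 1e-4 * np.cos(40 * x)

# Dense random probe of the input box (here just 1-D on [0, 1]).
x = rng.uniform(0.0, 1.0, 1_000_000)
err = np.abs(reference(x) - approximation(x))

mae = err.mean()                 # headline number, hides the tails
q999 = np.quantile(err, 0.999)   # 99.9th percentile of the error
worst = err.max()                # worst case seen on the probe
```

Reporting `q999` and `worst` alongside the mean absolute error gives a much better picture of whether the approximation is trustworthy everywhere, which is exactly the concern with look-up-table-style NN approximations.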
 Kangro, R., Parna, K., and Sepp, A. (2004), "Pricing European Style Options under Jump Diffusion Processes with Stochastic Volatility: Applications of Fourier Transform", Acta et Commentationes Universitatis Tartuensis de Mathematica, 8, p. 123–133.
 Cui, Y., del Bano Rollin, S., and Germano, G. (2017), "Full and fast calibration of the Heston stochastic volatility model", European Journal of Operational Research, 263(2), p. 625–638.
 Horvath, B., Muguruza, A., and Tomas, M. (2019), "Deep learning volatility", arXiv:1901.09647v2 [q-fin.MF].