The easy way to price these is of course via Monte Carlo simulation; it is simple to implement and straightforward to accommodate any added contract bells and whistles. But MC is slow and it's tough to get good Greeks out of it. There is of course another way.

The TARF contract consists of a series of periodic payments (e.g. weekly or monthly) up to some maturity, but the contract can end prematurely when/if on a particular fixing date the specified target is reached. A bullish TARF for instance defines a strike *K* (usually set below the spot *S* at the time of issue) and a leverage factor *g*. At every fixing date the holder makes a profit of (*S - K*) if *S* is above *K* and a loss of *g (K - S)* is *S* fixes below *K* (per unit of notional)*; ***profits accrue towards the target (but losses do not)**.

So the structure is knocked out when/if the accrued amount*A* reaches the target. This is a strong path dependency feature; but unlike accumulators where the knock out depends on the path of the spot itself (which is already part of the numerical solution), here we need to somehow involve *A* in the solution. We do that by adding *A* as an extra pseudo-dimension, discretized by a number of individual 1-D PDE solvers (for Black Scholes, local volatility (LV) model, or local regime switching). Each of these solvers corresponds to an *A*-level within the range of possible values (from zero to the target).

Solving backwards from maturity, we march the option value in time separately for each solver until we reach a fixing/settlement date (for simplicity here I assume that fixing and settlement dates coincide). There we have a jump condition that must be applied in order to continue solving. More specifically, for each*S*-grid point on each solver we determine the payment to be made based on the *S* value and the solver's *A* level. Adding positive payments to the solver's *A*, gives us the implied *A* level that corresponds to each *S* grid point of the solver just after (in normal time direction) the payment has been made. Given the implied *A* level we can find the option value there by interpolating from the solutions of the solvers whose *A* levels are nearest (which we have all marched up to that time point). We do that with say cubic interpolation. The (initial, or rather final) values with which we start solving again towards the next fixing date are then found by adding the payout to the interpolated continuation value.

That is the outline (by the way, this "auxiliary state variable" method can be seen in action when pricing Asian options within the old option pricer in this site). I will stick with the basics here (no volatility model) in order to show that even in this simple setting the implementation leaves enough room for one to come up with a great, or not so great pricing engine.

So each individual solver is solving the Black Scholes pricing PDE using the Crank Nicolson method with Rannacher time-stepping on a non-uniform*S*-grid that places more points near the strike. This standard building block is well-optimized and one can use it as per above to put together a basic implementation of the TARF pricing engine. I did this and the results while still better than MC, left a lot to be desired in my view. Let's just say that a basic implementation results in a solution that does not behave optimally (and a lot worse than it does for Asian options by the way). So I tried to find ways to improve things. And here I will just give an indication of how much one can improve. The details that make this possible will not be the subject of this short post, but if one really wants to know you can contact me (using the contact form or the email in this paper).

So the structure is knocked out when/if the accrued amount

Solving backwards from maturity, we march the option value in time separately for each solver until we reach a fixing/settlement date (for simplicity here I assume that fixing and settlement dates coincide). There we have a jump condition that must be applied in order to continue solving. More specifically, for each

That is the outline (by the way, this "auxiliary state variable" method can be seen in action when pricing Asian options within the old option pricer in this site). I will stick with the basics here (no volatility model) in order to show that even in this simple setting the implementation leaves enough room for one to come up with a great, or not so great pricing engine.

So each individual solver is solving the Black Scholes pricing PDE using the Crank Nicolson method with Rannacher time-stepping on a non-uniform

Here I am showing a couple of the many tests I did that showcase the typical performance differential between an optimized and a "naive", basic implementation. The individual solvers used were identical (*S*-grid, boundary conditions, etc) in both engine implementations. I was aiming for an accuracy of about $ 10^{-4} $ and the time discretization error was smaller than that when *NT* = 500 time steps were used. I played a bit with the *S* resolution (*NS)* to this end. The spacing between the the *A* (*S-t*) planes was roughly the same as the average *S*-grid spacing in both implementations (though uniform for the basic and non-uniform for the optimized). Note that there are generally three ways of handling the settlement of the fixing that results in the target being reached, i.e. three knockout types. a) *No pay*: payment is not made, b) *Capped pay***:** there is partial payment so as to "fill" the target exactly, c) *Full pay*: payment is made in full. All three are of course trivially accommodated by the method.

The plots below show the numerical (spatial discretization) error for the part of the solution that is of interest (the actual solver S-grid extends further) and assuming the accrued amount today is zero, i.e.*A* = 0. This may mean we are pricing at the time of issue, or at some time during the lifetime of the contract (in which case target stands for "remaining target"). The error profile is obtained by comparing the solution to that from an optimized engine using *NS* = 6000 and *NT* = 500 (this has really, really low spatial discretization error, < $10^{-8}$ to be precise). One CPU (Ryzen 5900X) core was used for the timings shown, though if need be the method is easily parallelized. Code is written in C++ and compiled with gcc on Linux. Interest rates are zero.

First we look at a one year contract with weekly fixings/payments and volatility typical of many FX rates. The knockout type is*n**o* pay and the leverage factor is two. The optimized version achieves the desired accuracy in 0.09 seconds, while the basic version needs a lot higher spatial resolution and 5 seconds to come close to that.

The plots below show the numerical (spatial discretization) error for the part of the solution that is of interest (the actual solver S-grid extends further) and assuming the accrued amount today is zero, i.e.

First we look at a one year contract with weekly fixings/payments and volatility typical of many FX rates. The knockout type is

And now a two year contract and high underlying volatility (more common to precious metals). The knockout type in the case is f*ull pay* and the leverage factor is one. In this setting the optimized engine's relative advantage is more pronounced, achieving the desired level of accuracy in 0.06 seconds while the basic engine really struggles here and fails to come close even with a spatial resolution ten times higher. The keen observer will notice the basic problem here immediately.

And the same in log scale:

In case it is not clear to all readers, the plots above show that with a single split-second PDE solution we get very low error valuations for the whole range of spot values. This is of course under Black Scholes, but the same level of efficiency should carry over with a volatility model. Thus if the model and/or volatility dynamic assumed allows it, a whole spot ladder calculation should be possible in less than a second.

By the way, if the aim is a single spot valuation, it is often (but not always) preferable to cluster the grid points around the current spot (typical for example when one solves the PDE in log space). In this case one can get higher efficiency still, i.e. same discretization error in even less CPU time.

By the way, if the aim is a single spot valuation, it is often (but not always) preferable to cluster the grid points around the current spot (typical for example when one solves the PDE in log space). In this case one can get higher efficiency still, i.e. same discretization error in even less CPU time.

While I am at it, why not post some highly accurate benchmark valuations ("benchmark" being a theme in this site). It may be difficult to believe (and indeed as pointless as say luxury watches waterproof to depths no human can survive!), but all digits provided should be correct. Perhaps if someone wants to perform basic sanity checks on their TARF pricing engine these can be of some help. So to be clear, assuming fixing and settlement dates coincide and are all *dt = 1/52* yrs apart (say forcing an ACT/364 day count with no holidays affecting the weekly fixings dates), flattening the volatility surface and zeroing rates curves, should reproduce the values below.

Case | T (yrs) | num. fixings | target | lev. factor | KO type | spot | BS vol | TARF value |

1 | 1 | 52 | 0.3 | 2 | No Pay | 1.06 | 10% | -0.02096094 |

2 | 1 | 52 | 0.5 | 2 | No Pay | 1.06 | 10% | -0.017514630 |

3 | 1 | 52 | 0.7 | 2 | No Pay | 1.06 | 10% | 0.035902523 |

4 | 1 | 52 | 0.3 | 2 | No Pay | 1.1 | 10% | 0.235120763 |

5 | 1 | 52 | 0.3 | 2 | Capped pay | 1.1 | 10% | 0.286424762 |

6 | 1 | 52 | 0.3 | 2 | Full pay | 1.1 | 10% | 0.335294058 |

7 | 1 | 52 | 0.3 | 1 | No Pay | 1.03 | 10% | -0.29368756 |

8 | 1 | 52 | 0.3 | 1 | No pay | 1.1 | 20% | -0.01224772 |

9 | 1 | 52 | 0.7 | 1 | No pay | 1.1 | 20% | 0.037345388 |

10 | 1 | 52 | 0.7 | 1 | Full pay | 1.1 | 20% | 0.152792822 |

11 | 2 | 104 | 0.3 | 1 | No pay | 1.35 | 40% | 0.059232579 |

12 | 2 | 104 | 0.5 | 1 | No pay | 1.35 | 40% | 0.315294774 |

13 | 2 | 104 | 0.5 | 1 | Full pay | 1.1 | 40% | -3.81704655 |

I will try to follow this up with an update using some volatility model.

]]>Since I do not currently have access to a proper production model (that could for example employ some Local Stochastic Volatility model for the underlying processes), I will use simple GBM. The number of free input parameters (dimensionality of the approximated function) is 28 (see below), which combined with the pricing discontinuities at the autocall dates still make for a very challenging problem. My experience so far (on training DNN's for simpler exotics) is that replacing GBM with a volatility model does not pose unsurmountable difficulties for the DNN. I am confident that similar (or higher) accuracy to the one showcased here can be achieved for the same MRBCA, at the expense of more (offline) effort in generating data and training the DNN.

The product class we are looking at (autocallable multi barrier reverse convertibles) is popular in the Swiss market. It is a structured product paying a guaranteed coupon throughout its lifetime with a complex barrier option embedded. The latter is contingent on the worst performing out of a basket of underlying assets. There is a down-and-in type continuously monitored barrier and an up-and-out discretely monitored one (at the autocall dates). The assets' performance (level) is measured as a percentage of their initial fixing values. Accordngly, the strike level *K* for the down-and-in payoff, the down-and-in barrier *B* and the early redemption (autocall) level *A* (the "up-and-out barrier") are quoted as percentages.

In short: The product pays the holder a guaranteed coupon throughout its lifetime (up to maturity or early redemption). If on any of the observation (autocall) dates the worst-performing asset level is above the early redemption level, the product expires immediately and the amount redeemed is 100% of the nominal value. If no early redemption event happens then at maturity :

In short: The product pays the holder a guaranteed coupon throughout its lifetime (up to maturity or early redemption). If on any of the observation (autocall) dates the worst-performing asset level is above the early redemption level, the product expires immediately and the amount redeemed is 100% of the nominal value. If no early redemption event happens then at maturity :

- If during the lifetime of the product the worst-performing asset level did not at any moment touch or cross the barrier level
*B*, the amount redeemed is 100% of the nominal value. - If the worst-performing asset level did touch or cross the barrier level
*B*at some point and its final fixing level is above the strike level*K*, the amount redeemed is again 100% of the nominal value. - If the worst-performing asset did touch or cross the barrier level
*B*at some point and its final fixing level is below the strike level*K*, the amount redeemed is the percentage of the nominal equal to the worst-performing asset performance (ratio of its final to initial fixing level).

I am not going to attempt an all-encompassing DNN representation of any possible MRBCA structure, but rather focus the effort on particular subcategory. So what I am looking for is a pricing approximation for MRBCA's on the worst of 4 assets, with original maturity of 2Y and semi-annual coupon payment and autocall dates. Also assuming the product is struck at-the-money (i.e. the strike level *K* is 100%, the most usual case in practice) with an early redemption (autocall) level of also 100%, again typical in the market. The latter two could of course also be included as variable inputs in the approximation. This may well be possible while maintaining the same accuracy but I haven't tried it yet.

So the DNN approximation will be for the clean price of any such product (given the inputs described next) at any time after its inception, up to its maturity. Indeed in what follows, T denotes the time left to maturity.

So the DNN approximation will be for the clean price of any such product (given the inputs described next) at any time after its inception, up to its maturity. Indeed in what follows, T denotes the time left to maturity.

- The asset level
*S*(% of initial fixing), volatility*vol*and dividend yield*d*for each of the 4 underlying GBM processes. - Seven-point discount factor curve (1D, 1W, 1M, 3M, 6M, 1Y, 2Y).
- Time left to maturity
*T*(in years). - Barrier level
*B*(% of initial fixings). - Coupon level
*Cpn*(% p.a.). - Correlation matrix (six distinct entries).

The DNN is trained for wide ranges of its inputs to allow it to be used for a long time without the need for retraining. The approximation is only guaranteed to be good within the input ranges that it has been trained for. Those are shown below.

Operational parameter ranges | ||||

Min | Max | |||

S_{i} | 20% | 500% | ||

vol_{i} | 10% | 40% | ||

d_{i} | 0 | 10% | ||

T | 0.001 | 2 | ||

r (disc. rates) | -2% | 2.50% | ||

B | 40% | 80% | ||

Cpn | 2% p.a. | 20% p.a. | ||

ρ | -55% | 99% |

The original pricing model / function we aim to "mimic" is of course based on Monte Carlo simulation and was written in C++. I omitted things like date conventions and calendars for ease of implementation. The continuously monitored (American) down-and-in barrier feature is taken care of via the use of a probabilistic correction (Brownian Bridge). Given the assumed GBM processes, this not only perfectly eliminates any simulation bias, but also enables the use of large time-steps thus allowing for significant speedup in the generation of the training samples. The discount curve interpolation is based on cubic splines and auxiliary points. The simulation can be driven by either pseudo random numbers or quasi random sequences (Sobol). I chose the former for the generation of the training samples as it proved to be more beneficial for the learning ability of the DNN under my current setup.

Note that in contrast with the use case in the previous post, here the training output data (the MC prices) are noisy and of limited accuracy. Does this represent a big problem for the DNN's ability to learn from them? It turns out that the answer is not really.

Note that in contrast with the use case in the previous post, here the training output data (the MC prices) are noisy and of limited accuracy. Does this represent a big problem for the DNN's ability to learn from them? It turns out that the answer is not really.

The DNN is trained by feeding it millions of *[vector(inputs), price]* pairs. This process in practice has to be repeated many times since there is no general formula for what works best in each case. The pricing accuracy of the training set samples does not have to be as high as the target accuracy. As it turns out the DNN has the ability to effectively smooth out the pricing data noise and come up with an accuracy that is higher than that on the individual prices it was trained on. Also the input space coverage does not have to be uniform as we may want for example to place more points where the solution changes rapidly in an effort to end up with a balanced error distribution.

When it comes to testing the resulting DNN approximation though we create a separate (out of sample) test set of highly accurate prices uniformly filling the input space. This is to say we don't weigh some areas of the solution (say near the barrier) more than others when we calculate the error metrics. We say this is the operational (inputs) range of the DNN and we provide (or at least aim to) similar accuracy everywhere within that range. So the test set is created by drawing random inputs from uniform distributions within their respective ranges. The one exception being the correlation matrices whose coefficients follow the distribution below. We then discard those matrices that include coefficients outside our target range of (-55% to 99%).

When it comes to testing the resulting DNN approximation though we create a separate (out of sample) test set of highly accurate prices uniformly filling the input space. This is to say we don't weigh some areas of the solution (say near the barrier) more than others when we calculate the error metrics. We say this is the operational (inputs) range of the DNN and we provide (or at least aim to) similar accuracy everywhere within that range. So the test set is created by drawing random inputs from uniform distributions within their respective ranges. The one exception being the correlation matrices whose coefficients follow the distribution below. We then discard those matrices that include coefficients outside our target range of (-55% to 99%).

The overall accuracy achieved by the DNN is measured by the usual Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) metrics. We can also look at the error distribution to get an idea of how good the approximation is. What we cannot easily do is say what lies far in the tails of that distribution, or in other words provide some sort of limit for the maximum possible error. In contrast to the traditional MC model, there is no theoretical confidence interval for the DNN error.

The MAE and RMSE are calculated against a reference test set of 65K MC prices, each generated using 32 million Sobol-driven paths (with Brownian Bridge construction). Such prices are found (when re-calculating a subset using 268 million Sobol paths) to have an accuracy of 4.e-6, which is well below the target accuracy (about 1.e-4, or 1 cent in a nominal of 100$). The inputs were generated again using (22-dimensional, correlations excluded) Sobol points, in an effort to best represent the space. The average model price for this test set is 0.874 (87.4%).

In order to try and get an idea for the worst-case errors I tested the DNN against a much bigger (but less accurate) test set of 16.7 million Sobol points.

For the results presented here the DNN was trained on 80 million *[vector(inputs), price]* samples carefully chosen so as to ensure that error levels are as uniform as possible across the input space. At this training set size the convergence rate (error decrease with each doubling of training set size) was showing some signs of slowing down, but there was still room for improvement. Using a few hundred million samples would still be straightforward and would yield even better accuracy.

Still the overall quality of the approximation is excellent. The mean error is less than a cent and generally does not exceed 3 cents. The speed is as expected many orders of magnitude higher than an MC simulation with similar standard error (see below). The timings are for a single CPU core. Of course if GPU's are used instead the speed can still be improved significantly.

Still the overall quality of the approximation is excellent. The mean error is less than a cent and generally does not exceed 3 cents. The speed is as expected many orders of magnitude higher than an MC simulation with similar standard error (see below). The timings are for a single CPU core. Of course if GPU's are used instead the speed can still be improved significantly.

Deep Neural Network Pricing Performance | ||||

MAE | 6×10^{-5} | |||

RMSE | 9×10^{-5} | |||

Maximum absolute error in 16.7M test samples | 1.5×10^{-2} | |||

CPU time per price (1 core) | 6×10^{-6} secs |

in order to get similar accuracy from the traditional MC model one needs about 400K antithetic paths. With the present implementation this takes about 0.35 secs on 1 CPU core, which is about 60000 times slower than the DNN. If the MC pricing employed some volatility model needing fine time steps, the speedup factor could easily be in the order of millions (the DNN speed would remain the same).

The MC model prices a lot of the samples with almost zero error. This is because for many of the random input parameter vectors the solution is basically deterministic and the product behaves either like a bond or is certain to be redeemed early.

By far the most challenging dimension in the 28-dimentional function we are approximating here is the time to expiry T. The (clean) product price can be discontinuous at the autocall dates, posing a torture test for any numerical method. This is illustrated below where I am plotting a few sample solutions across T (keeping all other input parameters constant).**These "pathological" cases correspond to the random input parameter vectors that resulted in the worst DNN approximation errors among the 16.7 million reference set cases (top 5 worst errors)**. The MC price plots are based on 40000 valuation points using 132K Sobol-driven paths per valuation. It took about 10 mins to create each plot utilizing all 12 cores of a CPU . The corresponding 40000 DNN approximations took < 0.2sec on a single core.

By far the most challenging dimension in the 28-dimentional function we are approximating here is the time to expiry T. The (clean) product price can be discontinuous at the autocall dates, posing a torture test for any numerical method. This is illustrated below where I am plotting a few sample solutions across T (keeping all other input parameters constant).

Looking at these plots it comes as no great surprise that the DNN struggles here. Considering the vast variety of shapes the solution can take, it is nonetheless seriously impressive that the DNN can cope as well as it does overall. That said, the maximum errors above are about 1.5% (not quite visible, located within those ultra narrow dips a few hours from the auto-call dates), which is more than I would have been happy with. Still, for use in XVA type calculations and intraday portfolio valuation monitoring, the performance is more than adequate as is. For use in a production environment one would need to be even more stringent with ensuring the maximum errors do not exceed a certain threshold. When testing the DNN against the much smaller 65K reference set, the maximum error was an order of magnitude smaller (about 0.2%, or 20 cents). Looking at 100M cases may reveal an even worse case than the 1.5% error found in the 16.7M set. Nonetheless there are ways to identify and target the problematic areas of the input parameter space. I am thus confident the maximum errors can be brought down further together with the mean error metrics by increasing and further refining the synthetic training set.

In conclusion, we can say that the DNN has passed this second much more difficult test as well. There was never a doubt that the approximation accuracy increases with increasing training data. The question in my mind was rather "is the sufficient amount of training (for the DNN to produce a worthy replacement of the traditional MC and PDE-based pricing) practical in terms of time and cost"? Given the experience gathered so far I would say the answer is yes. The present results were achieved mainly on a top spec desktop with only limited use of cloud resources. Approximating fully fledged models incorporating local and/or stochastic volatility will require more computational power, but the offline effort would still correspond to reasonable time and cost. To this end, a third post in this series would look at the case of FX TARF pricing under an LV or LSV model.

P.S. The results summarily presented in these last two posts are the culmination of a lot of work and experimentation that took the better part of a year and thousands of CPU/GPU hours.

]]>The last few years have seen increased interest in Machine Learning – Neural Networks (NN’s) in finance. Here I will focus specifically on the application of NN’s to function approximation, so basically derivatives pricing and consequently risk calculations. The goal is to either “turbo-boost” models/calculations that are already in production systems, or enable the use of alternative/better models that were previously not practical due to the high computational cost. I’ve been hearing claims of millions of times faster calculations and that the Universal Approximation Theorem guarantees that all functions can be approximated accurately by the simplest kind of net, provided enough training samples. But as usual the devil is in the details; like exactly how many training samples are we talking about to achieve a typical production system level of accuracy? I was wondering, what if to guarantee acceptable accuracy one would need impractically large training sets (and hence data generation and training times)? And never mind millions of times faster (perhaps more realistic for quantum computers when they arrive?), I would be satisfied with 100 or even 10 times faster, provided it was easy enough to implement and deploy.

There have been waves of renewed interest in NN's for decades, but their recent resurgence is in large part due to them been made more accessible via Python packages like TensorFlow and PyTorch that hide away the nitty gritty that would otherwise dishearten most of the recent users. So given the low barrier to entry and having been spurred on by a couple of people, I decided to check out this seemingly all-conquering method. Moreover, this seems like an engineering exercise; one needs to be willing to try a lot of different combinations of “hyperparameters”, use suitable training sample generation techniques and bring in experience/tricks from traditional methods, all with the aim to improve the end result. Not to mention "free" time and relatively cheap, sustainable electricity. That’s me then this year I thought.

I am not sure what is the state of adoption of such NN applications in finance right now. Looking at relevant papers, they all seem to be fairly recent and more the proof-of-concept type. What puzzles me in particular is the narrow input parameter range used to train the NN's on. Surely one would need more coverage than that in practice? Consequently there is talk that such (NN) models would need to be retrained from time to time when market parameters get out of the “pre-trained” range. Now I may be missing something here. First, I think it would be less confusing to call them "NN approximations" instead of "NN models" since they simply seek to reproduce the output of existing

So I decided I want hard answers: can I approximate a model (function) in the whole (well almost) of its practically useful parameter space so that there’s no need to retrain it, like never (unless the original is changed of course)? To this aim I chose a volatility model calibration exercise as my first case study. Which is convenient because I had worked on this before. Note that a benefit of the NN approach (if it turns out they do the job well) is that one could make the calibration of any model super-fast, and thus enable such models as alternatives to the likes of Heston and SABR. The latter are popular exactly because they have fast analytical solutions or approximations that make calibration to market vanillas possible in practical timescales.

To demonstrate said benefit, the logical thing to do here would be to come up with a turbo-charged version of the non-affine model demo calibrator. The PDE method used in there to solve for vanilla prices under those models is highly optimized and could be used to generate the required millions of training samples for the NN in a reasonable amount of time. To keep things familiar though and save some time, I will just try this on the Heston model whose training samples can be generated a bit faster still (I used QLib’s implementation for this). If someone is interested in a high accuracy, production-quality robust calibration routine of any other volatility model, feel free to get in touch to discuss.

Like I said above, the trained parameter range is much wider than anything I’ve seen published so far but kind of arbitrary. I chose it to cover the wide moneyness (S/K) range represented by the 246 SPX option chain A from [1]. Time to maturity ranges from a few days up to 3 years. The model parameters should cover most markets. The network has about 750,000 trainable parameters, which may be too many, but one does not know beforehand what accuracy can be achieved with what architecture. It is possible that a smaller (and thus even faster) network can give acceptable results. 32 million optimally placed training samples were used. I favored accuracy over speed here, if anything to see just how accurate the NN approximation can practically get. But also because higher accuracy minimizes the chance of the optimizer converging to different locales (apparent local minima) depending on the starting parameter vector (see [3] for more on this).

Overall this is a vanilla calibration set-up, where a standard optimizer (Levenberg-Marquardt) is used to minimize the root mean square error between the market and model IV's, with the latter provided by the (Deep) NN approximation. There are other more specialized approaches involving NN's designed specifically for the calibration exercise, see for example the nice paper by Horvath et al. [4]. But I tried to keep things simple here. So how does it perform then?

Neural Network Approximation specification | ||||

Operational parameter ranges | ||||

min | max | |||

S/K | 0.5 | 5 | ||

T | 0.015 | 3 | ||

r | -0.02 | 0.1 | ||

d | 0 | 0.1 | ||

v_{0} | 0.002 | 0.5 | ||

v̅ | 0.005 | 0.25 | ||

κ | 1 | 20 | ||

ξ | 0.1 | 10 | ||

ρ | -0.95 | 0.1 | ||

Performance | ||||

Mean Absolute IV Error | 9.3×10^{-6} |

The mean absolute error is about 1.e−3 implied volatility percentage points (i.e. 0.1 implied volatility basis points).

The maximum error over a test set of 2 million (out of sample) random points was 5 IV basis points but that is in an area of the parameter hypercube of no interest in practice. For example on the moneyness - expiration plane, the error distribution looks like this:

Obviously for individual use cases the NN specification would be customized for a particular market, thus yielding even better accuracy and/or higher speed. Either way I think this is a pretty satisfactory result and it can be improved upon further if one allows for more resources.

In terms of calibration then, how does that IV accuracy translate to predicted model parameter accuracy? I will use real-world market data that I had used before to test my PDE-based calibrator; the two SPX option chains from [1] and the DAX chain from [2]. As can be seen in Table 1, the calibration is very accurate and takes a fraction of a second. In practice one would use the last result as the starting point which should help the optimizer converge faster still.

For those who need hard evidence that this actually works as advertised, there's the self-contained console app demo below to download. The options data files for the 3 test chains are included. The calibrator always starts from the same parameter vector (see Table 1) and uses only the CPU to keep things simple and facilitate comparisons with traditional methods. And I have also included a smaller, only slightly less accurate on average NN approximation that is more than double as fast still.

SPX Chain A from [1] (246 options) | SPX Chain B from [1] (68 options) | DAX Chain from [2] (102 options) | |||||||||

Exact | NN | error | Exact | NN | error | Exact | NN | error | |||

v0 | 0.007316 | 0.007315 | (0.01%) | 0.04575 | 0.04576 | (0.02%) | 0.1964 | 0.1964 | (0.00%) | ||

v̅ | 0.03608 | 0.03608 | (0.00%) | 0.06862 | 0.06862 | (0.00%) | 0.07441 | 0.07440 | (0.01%) | ||

k | 6.794 | 6.794 | (0.00%) | 4.906 | 4.903 | (0.06%) | 15.78 | 15.80 | (0.13%) | ||

x | 2.044 | 2.044 | (0.00%) | 1.526 | 1.525 | (0.07%) | 3.354 | 3.356 | (0.06%) | ||

r | -0.7184 | -0.7184 | (0.00%) | -0.7128 | -0.7129 | -(0.01%) | -0.5118 | -0.5118 | (0.00%) | ||

IV RMSE (b.p.) | 128.24 | 128.23 | 0.01 | 101.35 | 101.37 | 0.02 | 131.72 | 131.72 | 0.00 | ||

CPU time | 0.55 s | 0.16 s | 0.26 s |

So to conclude, it is fair to say that the NN (or should I say Machine Learning) approach passed the first real test I threw at it with relative ease. Yes, one needs to invest time to get a feeling of what works and come up with ways to optimize since this is basically an engineering problem. But the results show that at least in this case one can get all the accuracy practically needed in a reasonable amount of (offline) time and resources.

Finally let's briefly mention here the two main perceived issues with this approach: How does one guarantee accuracy everywhere in the input parameter hyperspace (i.e. looking at the tails of the error distribution)? I agree this is an issue but there are ways to increase confidence, especially in relatively simple cases like the one here. The other is lack of transparency and/or interpretability.

[1] Y. Papadopoulos, A. Lewis (2018), “A First Option Calibration of the GARCH Diffusion Model by a PDE Method.”, arXiv:1801.06141v1 [q-fin.CP].

[2] Kangro, R., Parna, K., and Sepp, A., (2004), “Pricing European Style Options under Jump Diffusion Processes with Stochastic Volatility: Applications of Fourier Transform,” Acta et Commentationes Universitatis Tartuensis de Mathematica 8, p. 123-133.

[3] Cui, Y., del Bano Rollin, S., Germano, G. (2017), "Full and fast calibration of the Heston stochastic volatility model", European Journal of Operational Research, 263(2), p. 625–638

[4] Horvath, B., Muguruza, A., Tomas, M. (2019). "Deep learning volatility.", arXiv:1901.09647v2 [q-fin.MF]

]]>[2] Kangro, R., Parna, K., and Sepp, A., (2004), “Pricing European Style Options under Jump Diffusion Processes with Stochastic Volatility: Applications of Fourier Transform,” Acta et Commentationes Universitatis Tartuensis de Mathematica 8, p. 123-133.

[3] Cui, Y., del Bano Rollin, S., Germano, G. (2017), "Full and fast calibration of the Heston stochastic volatility model", European Journal of Operational Research, 263(2), p. 625–638

[4] Horvath, B., Muguruza, A., Tomas, M. (2019). "Deep learning volatility.", arXiv:1901.09647v2 [q-fin.MF]

In my last couple of posts I spoke about how a fairly simple PDE / finite differences approach can actually enable fast and robust option calibrations of non-affine SV models. I also posted a little console app that demonstrates the approach for the GARCH diffusion model. I have since played around with that app a little more, so here I'm giving a second version that can calibrate the following 5 non-affine SV models (plus Heston for comparison).

For the GARCH diffusion or power-law model (and the PDE pricing engine used for all the models implemented in this demo) see [1]. For the Inverse Gamma (aka "Bloomberg") model see for example [2]. The XGBM model was suggested to me by Alan Lewis, who has been working on an exact solution for it (to be published soon). Here pricing is done via the PDE engine. The ACE-X1 model is one of the models I've tried calibrating, which seems to be doing a very good job in various market conditions. All the above models calibrate to a variance (or volatility) process that is arguably more realistic than that of the Heston model (which when calibrated, very often has zero as its most probable value in the long-run).

Please bear in mind that the above demo is just that, i.e. not production-level. In [1] the PDE engine was developed just for the GARCH diffusion model and Excel was used for the optimization. I then quickly coupled the code with a Levenberg-Marquardt implementation and adjusted the PDE engine for the other models without much testing (or specific optimizations). That said, it works pretty well in general, with calibration times ranging from a few seconds up to a minute. It offers three speed / accuracy settings, but even with the fastest setting calibrations should be more accurate (and much faster) than any Monte-Carlo based implementation. A production version for a chosen model would be many times faster still. Note that you will also need to download the VC++ Redistributable for VS2013. The 64-bit version (which is a little faster) also requires the installation of Intel's MKL library. The demo is free but please acknowledge if you use it and do share your findings.

EDIT April 2020: After downloading and running this on my new Windows 10 laptop I saw that the console was not displaying the inputs as intended (it was empty). To get around this please right click on the top of the console window, then click on Properties and there check "Use legacy console". Then close the console and re-launch.

EDIT April 2020: After downloading and running this on my new Windows 10 laptop I saw that the console was not displaying the inputs as intended (it was empty). To get around this please right click on the top of the console window, then click on Properties and there check "Use legacy console". Then close the console and re-launch.

As a small teaser, here's the performance of these models (in terms of IV-RMSE, recalibrated every day) for SPX options during two months at the height of the 2008 crisis. Please note that I used option data (from www.math.ku.dk/~rolf/Svend/) which include options +- 30% in moneyness only. Expirations range from 1M to 3Y. Including further out-of-the-money options does change the relative performance of the models. As does picking a different market period. That being said, my tests so far show that the considered models produce better IV fits than Heston in most cases, as well as representing arguably more realistic dynamics. Therefore they seem to be better candidates to be combined with local volatility and/or jumps than Heston.

Heston | GARCH | Power_law_0.8 | ACE-X1 | I-Ga | XGBM | |

average IV-RMS | 0.91% | 0.81% | 0.82% | 0.66% | 0.73% | 0.45% |

[1] Y. Papadopoulos, A. Lewis, “A First Option Calibration of the GARCH Diffusion Model by a PDE Method.,” arXiv:1801.06141v1 [q-fin.CP], 2018.

[2] N. Langrené, G. Lee and Z. Zili, “Switching to non-affine stochastic volatility: A closed-form expansion for the Inverse Gamma model,” arXiv:1507.02847v2 [q-fin.CP], 2016.

]]>[2] N. Langrené, G. Lee and Z. Zili, “Switching to non-affine stochastic volatility: A closed-form expansion for the Inverse Gamma model,” arXiv:1507.02847v2 [q-fin.CP], 2016.

For those interested there's now a detailed report of this joint collaboration with Alan Lewis on the arXiv: A First Option Calibration of the GARCH Diffusion Model by a PDE Method. Alan's blog can be found here.

EDIT: For the results reported in the paper, Excel's solver was used for the optimization. I've now quickly plugged the PDE engine into M. Lourakis' Levenberg-Marquardt implementation (levmar) and built a basic demo so that perhaps people can try to calibrate the model to their data. It doesn't offer many options, just a fast / accurate switch. The fast option is typically plenty accurate as well. So, if there's some dataset you may have used for calibrating say the Heston model, it would be interesting to see how will the corresponding GARCH diffusion fit compare. Irrespective of the fit, GARCH diffusion is arguably a preferable model, not least because it typically implies more plausible dynamics than Heston. Does this also translate to more stable parameters on recalibrations? Are the fitted (Q-measure) parameters closer to those obtained from the real (P-measure) world? If the answers to the above questions are mostly positive, then coupling the model with local volatility and/or jumps could give a better practical solution than what the industry uses today. (Of course one could just as easily do that with the optimal-*p* model instead of GARCH diffusion, as demonstrated in my previous post.) Something for the near future.

If you do download the calibrator and use it on your data, please do share your findings (or any problems you may encounter), either by leaving a comment below, or by sending me an email. As a bonus, I've also included the option to calibrate another (never previously calibrated) non-affine model, the general power-law model with p = 0.8 (sitting between Heston and GARCH diffusion, see [1]).

If you do download the calibrator and use it on your data, please do share your findings (or any problems you may encounter), either by leaving a comment below, or by sending me an email. As a bonus, I've also included the option to calibrate another (never previously calibrated) non-affine model, the general power-law model with p = 0.8 (sitting between Heston and GARCH diffusion, see [1]).

Note that (unless you have Visual Studio 2013 installed) you will also need to download the VC++ Redistributable for VS2013. The 64-bit version (which is a little faster) also requires the installation of Intel's MKL library.

EDIT April 2020: After downloading and running this on my new Windows 10 laptop I saw that the console was not displaying the inputs as intended (it was empty). To get around this please right click on the top of the console window, then click on Properties and there check "Use legacy console". Then close the console and re-launch.

I am also including in the download a sample dataset I found in [2] (DAX index IV surface from 2002), so that you can readily test the calibrator. I used it to calibrate both Heston (also calibrated in [2], together with many other affine models) and GARCH diffusion. In contrast to the two datasets we used in the paper, in this case GARCH diffusion (RMSE = 1.14%) "beats" Heston (RMSE = 1.32%). This calibration takes about 5 secs. This is faster than the times we report on the paper and the reason is that the data we considered there include some very far out-of-the-money options that slow things down as they require higher resolution. The Levenberg-Marquardt algo is also typically faster than Excel's solver for this problem. It is also "customizable", in the sense that one can adjust the grid resolution during the calibration based on the changing (converging) parameter vector. Still, this version is missing a further optimization that I haven't implemented yet, that I expect to reduce the time further by a factor of 2-3.

EDIT April 2020: After downloading and running this on my new Windows 10 laptop I saw that the console was not displaying the inputs as intended (it was empty). To get around this please right click on the top of the console window, then click on Properties and there check "Use legacy console". Then close the console and re-launch.

I am also including in the download a sample dataset I found in [2] (DAX index IV surface from 2002), so that you can readily test the calibrator. I used it to calibrate both Heston (also calibrated in [2], together with many other affine models) and GARCH diffusion. In contrast to the two datasets we used in the paper, in this case GARCH diffusion (RMSE = 1.14%) "beats" Heston (RMSE = 1.32%). This calibration takes about 5 secs. This is faster than the times we report on the paper and the reason is that the data we considered there include some very far out-of-the-money options that slow things down as they require higher resolution. The Levenberg-Marquardt algo is also typically faster than Excel's solver for this problem. It is also "customizable", in the sense that one can adjust the grid resolution during the calibration based on the changing (converging) parameter vector. Still, this version is missing a further optimization that I haven't implemented yet, that I expect to reduce the time further by a factor of 2-3.

The fitted parameters are:

GARCH: v0 = 0.1724, vBar = 0.0933, kappa = 7.644, xi = 7.096, rho = -0.5224.

Heston: v0 = 0.1964, vBar = 0.0744, kappa = 15.78, xi = 3.354, rho = -0.5118.

Note that both models capture the short-term smile/skew pretty well (aided by the large fitted xi's, aka vol-of-vols), but then result in a skew that decays (flattens) too fast for the longer expirations.

[1] Y. Papadopoulos, A. Lewis, “A First Option Calibration of the GARCH Diffusion Model by a PDE Method.,” arXiv:1801.06141v1 [q-fin.CP], 2018.

[2] Kangro, R., Parna, K., and Sepp, A., (2004), “Pricing European Style Options under Jump Diffusion Processes with Stochastic Volatility: Applications of Fourier Transform,” Acta et Commentationes Universitatis Tartuensis de Mathematica 8, 123-133.

]]>[2] Kangro, R., Parna, K., and Sepp, A., (2004), “Pricing European Style Options under Jump Diffusion Processes with Stochastic Volatility: Applications of Fourier Transform,” Acta et Commentationes Universitatis Tartuensis de Mathematica 8, 123-133.

However, other voices suggest that this isn't necessarily true and that models of the Heston type have simply not been used to their full potential. They say why use a deterministic starting point $v_0$ for the variance when the process is really hidden and stochastic? Instead, they propose to give such traditional SV models a "hot start", that is assume that the variance today is given by some distribution and not a fixed value. Mechkov [1] shows that when the Heston model is used like that it is indeed capable of "exploding" smiles as expiries tend to zero. Jacquier & Shi [2] present a study of the effect of the assumed initial distribution type.

The idea seems elegant and it's the kind of "trick" I like, because it's simple to apply to an existing solver so it doesn't hurt trying it out. And it gets particularly straightforward when the option price is found through a PDE solution. Then the solution is automatically returned for the whole range of possible initial variance values (corresponding to the finite difference grid in the

So here I'm going to try this out on a calibration to a chain of SPX options to see what it does. But why limit ourselves to Heston? Because it has a fast semi-analytical solution for vanillas you say. I say I can use a fast and accurate PDE solver instead, like the one I briefly tested in my previous post. Furthermore, is there any reason to believe that a square root diffusion specification for the variance should fit the market, or describe its dynamics better? Maybe linear diffusion could work best, or something in between. The PDE solver allows us to use any variance power

Sounds complicated? It actually worked on the first try. Here's what I got for the end of Q1 2017. Just using the PDE engine and Excel's solver for the optimization. The calibration involved 262 options of 9 different expiries ranging from 1W to 3Y and took a few minutes to complete. If one restricts the calibration to mainly short-term expiries then better results are obtained, but I wanted to see the overall fit when very short and long expiries are fitted simultaneously. I am also showing how the plain Heston model fares. Which on the face of it is not bad, apart for the very short (1W) smile. Visually what it does is try to "wiggle its way" into matching the short-term smiles. The wiggle seems perfectly set up for the 3W smile, but then it turns out excessive for the other expiries. The ROD model on the other hand avoids excessive "wiggling" and still manages to capture the steep smiles of the short expiries pretty well. The optimal model power

But the RMSE only tells half the story. The Feller ratio corresponding to Heston's fitted parameters is 0.09, which basically means that the by far most probable (risk-neutral) long-run volatility value is zero. In other words, the assumed volatility distribution is not plausible. The randomization idea is neither without any issues, despite the impressive improvement in fit. The optimizer calibrated to a correlation coefficient of -1 for this experiment, which seems extreme and not quite realistic.

By the way, this experiment is part of some research I've been doing in collaboration with Alan Lewis, the results of which will be available/published soon.

This year though I've worked a lot more on such solvers and I now realise that those run times are clearly not what one should be expecting, nor aim for with (fine-tuned) implementations. So just how fast should your numerical solver of the Heston (or similar) PDE be? What kind of run time should be expected for high-accuracy solutions? The answer is:

κ | η | σ | ρ | rd | rf | T | K | νo | |

Case 0 | 5 | 0.16 | 0.9 | 0.1 | 0.1 | 0 | 0.25 | 10 | 0.0625 |

Case 1 | 1.5 | 0.04 | 0.3 | -0.9 | 0.025 | 0 | 1 | 100 | 0.0625 |

Case 2 | 3 | 0.12 | 0.04 | 0.6 | 0.01 | 0.04 | 1 | 100 | 0.09 |

Case 3 | 0.6067 | 0.0707 | 0.2928 | -0.7571 | 0.03 | 0 | 3 | 100 | 0.0625 |

Case 4 | 2.5 | 0.06 | 0.5 | -0.1 | 0.0507 | 0.0469 | 0.25 | 100 | 0.0625 |

Case 5 | 3 | 0.04 | 0.01 | -0.7 | 0.05 | 0 | 0.25 | 100 | 0.09 |

My current "state-of-the-art" solver uses second-order spatial discretization and the Hundsdorfer-Verwer ADI scheme. For the results below I used the solver "as is", so no fine-tuning for each case, everything (grid construction) decided automatically by the solver. In order to give an idea of the accuracy achieved overall for each case I plot the solution error in the asset direction (so across the moneyness spectrum). The plots are cut-off where the option value becomes too small, the actual grids used by the solver extend further to the right. I am showing results for two different grid resolutions (NS x NV x NT), grid A (60 x 40 x 30) and grid B (100 x 60 x 50). The timings were taken on an i7-920 PC from 2009 in single-threaded mode. Obviously one can expect at least double single-thread speed from a modern high-spec machine. (ADI can be parallelized as well with almost no extra effort, with a parallel efficiency of about 80% achieved using basic OpenMP directives). The errors in each case are calculated by comparing with the exact (semi-analytic) values obtained using QuantLib at the highest precision. Note that I am using the relative errors which are harder to bring down as the option value tends to zero. But when one wants to use the PDE solver as the pricing engine for a calibration, then it is important to price far out-of-the-money options with low relative errors in order to get accurate implied volatilities and properly capture any smile behavior. The present solver can indeed be used to calibrate the Heston model (or any other model in this family) accurately in less than a minute and in many cases in just a few seconds.

So, if you have a Heston PDE solver then this is an efficiency reference you can use to benchmark against. Price the 6 options above and compare error plots and timings with those below. Let me know what you get. If you're getting larger errors than I do with the above resolutions, I'll tell you why that is!

There is of course another important quality for a PDE solver, robustness. The scheme I used here (H-V) is fairly robust, but can still produce some spurious oscillations when one uses too low NT/NS. I may expand this post testing other schemes in the near future.

So, if you have a Heston PDE solver then this is an efficiency reference you can use to benchmark against. Price the 6 options above and compare error plots and timings with those below. Let me know what you get. If you're getting larger errors than I do with the above resolutions, I'll tell you why that is!

There is of course another important quality for a PDE solver, robustness. The scheme I used here (H-V) is fairly robust, but can still produce some spurious oscillations when one uses too low NT/NS. I may expand this post testing other schemes in the near future.

]]>

As an example let's try to price a daily monitored up and out put option. I'll use simple Black Scholes to demonstrate, but the qualitative behaviour would be very similar under other models. In order to show the effect clearly I'll start with a uniform grid. The discetization uses central differences and is thus second order in space (asset S) and Crank-Nicolson is used in time. That will sound alarming if you're aware of C-N's inherent inability to damp spurious oscillations caused by discontinuities, but a bit of Rannacher treatment will take care of that (see here). In either case in order to take time discretization out of the picture here, I used 50000 time steps for the results below so there's no time-error (no oscillation issues either) and thus the plotted error is purely due to the S-dicretization. The placement of grid points relative to a discontinuity has a significant effect on the result. Having a grid point falling exactly on the barrier will produce different behaviour as opposed to having the barrier falling mid-way between two grid points. So Figure 1 has the story. The exact value (4.53888216 in case someone's interested) was calculated on a very fine grid. It can be seen that the worst we can do is place a grid point on the barrier and solve with no smoothing. The error using the coarsest grid (which still has 55 points up to the strike and it's close to what we would maybe use in practice) is clearly unacceptable (15.5%). The best we can do without smoothing (averaging), is to make sure the barrier falls in the middle between two grid points. This can be seen to significantly reduce the error, but only once we've sufficiently refined the grid (curve (b)). We then see what can be achieved by placing the barrier on a grid point and smooth by averaging as described above. Curve (c) shows linear smoothing already greatly improves things again compared to the previous effort in curve (b). Finally curve (d) shows that quadratic smoothing can add some extra accuracy yet.

S = 98, K = 110, B = 100, T = 0.25, vol = 0.16, r = 0.03, 63 equi-spaced monitoring dates (last one at T)

This case is also one which greatly benefits from the use of a non-uniform grid which concentrates more points near the discontinuity. The slides below show exactly that. Quadratic smoothing with the barrier on a grid point is used. First is the solution error plot for a uniform grid with of dS = 2, which corresponds to the first point from the left of curve (d) in figure 1. The second slide shows the error we get when a non-uniform grid of the same size (and hence same execution cost) is used. The error curve has been pretty much flattened even on this coarsest grid. The error now has gone from 15.5% on a unifom grid with no smoothing, down to 0.01% on a graded grid with smoothing applied, for the same computational effort. Job done. Here's the function I used for it: SmoothOperator.cpp.

]]>

Now there are generally two ways of reducing, or even getting rid of this simulation bias in special cases. The first is to shift the barrier closer to the asset spot in order to compensate for the times when we "blink". The question is by how much should the barrier be shifted? This was first treated by Broadie, Glasserman & Kou [1], for the Black-Scholes model, where they introduced a continuity correction based on the "magical" number 0.5826. Check it out if you haven't. This trick works pretty well when the spot is not very near the barrier and/or when we have many monitoring dates. But in the opposite case, it can be quite bad and produce errors of 5% or more (Gobet, [2]). This gets worse when there is a steep drop of the value near the barrier, as in a down and out put, or an up and out call with low volatility.

The second way is to use an analytic expression for the probability of hitting the barrier between two realized points in a simulated asset path. This is usually referred to as the probabilistic method, or the Brownian bridge technique, since it uses the probability of Brownian motion hitting a point conditional on two fixed end points. As already implied this probability is known analytically (see [2]) for Brownian motion (and consequently for GBM as well). So for the usual GBM asset model (Black-Scholes), this technique can remove the simulation bias completely. One can just use the probability directly, multiplying each path's payoff with the total path survival probability (based on the survival probability of each path segment

As can be seen the probabilistic method (enabled by checking "continuous monitoring") enables exact Monte Carlo pricing of this continuously monitored up and out call (for which of course the exact analytic solution is available and is shown as well and marked by the blue line). By contrast, the barrier-shifting method misprices this option by quite some margin (3.4%), as can be seen in the second slide where the probabilistic correction is unchecked and instead the barrier has been shifted downwards according to BGK's (not so magic here) formula.

Now the analytical (Brownian Bridge) hitting probability is only available for a BM/GMB process. But what if we have a not so simple process? Well, as long as the discretization scheme we use for the asset makes it locally BM/GBM (always the case when we use Euler to discretize

κ | η | σ | ρ | rd | rf | T | K | U | So | vo | |

Case 1 | 5 | 0.16 | 0.9 | 0.1 | 0.1 | 0 | 0.25 | 100 | 120 | 100 | 0.0625 |

Case 2 | 5 | 0.16 | 0.9 | 0.1 | 0.1 | 0 | 0.25 | 100 | 135 | 130 | 0.0625 |

Case 3 | 1.5 | 0.04 | 0.3 | -0.9 | 0.025 | 0 | 0.25 | 100 | 115 | 100 | 0.0625 |

Case 4 | 1.5 | 0.04 | 0.3 | -0.9 | 0.025 | 0 | 0.25 | 100 | 135 | 130 | 0.0625 |

Case 5 | 3 | 0.12 | 0.04 | 0.6 | 0.01 | 0.04 | 0.25 | 100 | 120 | 100 | 0.09 |

Case 6 | 2.5 | 0.06 | 0.5 | -0.1 | 0.0507 | 0.0469 | 0.5 | 100 | 120 | 100 | 0.0625 |

Case 7 | 6.21 | 0.019 | 0.61 | -0.7 | 0.0319 | 0 | 0.5 | 100 | 110 | 100 | 0.010201 |

Solver | Price | Difference | |

Case 1 | PDE/FDM cont. monitoring (exact) | 1.8651 | - |

MC with 63 timesteps & PCC | 1.8652 | 0.01% | |

Plain MC with 63 timesteps | 2.1670 | 16% | |

Case 2 | PDE/FDM cont. monitoring (exact) | 2.5021 | - |

MC with 63 timesteps & PCC | 2.5032 | 0.04% | |

Plain MC with 63 timesteps | 3.4159 | 37% | |

Case 3 | PDE/FDM cont. monitoring (exact) | 2.1312 | - |

MC with 63 timesteps & PCC | 2.1277 | -0.16% | |

Plain MC with 63 timesteps | 2.3369 | 10% | |

Case 4 | PDE/FDM cont. monitoring (exact) | 3.6519 | - |

MC with 63 timesteps & PCC | 3.6394 | -0.34% | |

Plain MC with 63 timesteps | 4.6731 | 28% | |

Case 5 | PDE/FDM cont. monitoring (exact) | 1.6247 | - |

MC with 63 timesteps & PCC | 1.6249 | 0.01% | |

Plain MC with 63 timesteps | 1.8890 | 16% | |

Case 6 | PDE/FDM cont. monitoring (exact) | 1.7444 | - |

MC with 125 timesteps & PCC | 1.7438 | -0.03% | |

Plain MC with 125 timesteps | 1.9209 | 10.1% | |

Case 7 | PDE/FDM cont. monitoring (exact) | 1.9856 | - |

MC with 125 timesteps & PCC | 1.9790 | -0.33% | |

Plain MC with 125 timesteps | 2.0839 | 4.95% |

Finally regarding the variance value used in the probability formulas, my brief testing showed that the best results were obtained when using the variance corresponding to the one of the two nodes *S*[i] and *S*[i+1] which is closer to the barrier. This was the choice for the results of Table 2.

Once I set up a more accurate long stepping discretization scheme, I may test to see how well this approximation holds when one uses fewer/longer time steps.

Once I set up a more accurate long stepping discretization scheme, I may test to see how well this approximation holds when one uses fewer/longer time steps.

[1] Broadie M, Glasserman P, & Kou S. A continuity correction for discrete barrier options, *Mathematical Finance**,* Vol. 7 (4) (1997), pp. 325-348.

[2] Emmanuel Gobet. Advanced Monte Carlo methods for barrier and related exotic options. Bensoussan A., Zhang Q. et Ciarlet P.*Mathematical Modeling and Numerical Methods in Finance*, Elsevier, pp.497-528, 2009, *Handbook of Numerical Analysis*

]]>[2] Emmanuel Gobet. Advanced Monte Carlo methods for barrier and related exotic options. Bensoussan A., Zhang Q. et Ciarlet P.

Well I decided to give it another try and see how it compares really after a few optimizing tweaks.

We are trying to solve the following PDE by discretizing it with the finite difference method:

$$\frac{\partial V}{\partial t} = \frac{1}{2} S^2 \nu\frac{\partial^2 V}{\partial S^2}+\rho \sigma S \nu \frac{\partial^2 V}{\partial S \partial \nu} + \frac{1}{2}\sigma^2\nu\frac{\partial^2 V}{\partial \nu^2} + (r_d-r_f)S \frac{\partial V}{\partial S}+ \kappa (\eta-\nu) \frac{\partial V}{\partial \nu}-r_d V $$

where $\nu$ is the variance of the underlying asset $S$ returns, $\sigma$ is the volatility of the $\nu$ process, $\rho$ the correlation between the $S$ and $ \nu$ processes and $\eta$ is the long-term variance to which $\nu$ is mean-reverting with rate $\kappa$. Here $S$ follows geometrical Brownian motion and $\nu$ the CIR (Cox, Ingersoll and Ross) mean-reverting process.

After we discretize the PDE we are left with an algebraic system of equations relating the values at each grid point (

The 3LFI

This scheme is strongly A-stable (?), so unlike something like Crank-Nicolson (CN) it has built-in oscillation-damping on top of being unconditionally stable. So I tested this first and while it worked pretty well, I found that it produced slightly larger errors than CN. That is while they're both second-order accurate, CN's error-constant was lower. Which is not surprising, central discretizations often produce higher accuracy than their one-sided counterparts of the same order. Moreover it was notably less time-converged compared to the ADI schemes using the same number of time-steps.

So I decided to go for higher-spec. Especially since it comes really cheap in terms of implementation. Adding an extra time-point gives us the

$$ BDF3 : \quad V_{i,j}^{(n+1)} = \left (B \cdot dt+3 \cdot V_{i,j}^{(n)}-\frac 3 2 \cdot V_{i,j}^{(n-1)}+\frac 1 3 \cdot V_{i,j}^{(n-2)}\right)/\left(\frac {11} 6 - a_{i,j} \cdot dt\right) \qquad (1)$$

While such a scheme is no longer strictly unconditionally stable (it's almost A-stable), it should still be almost as good as and should preserve the damping qualities of 3LFI/BDF2. As it happens, the results it produces are significantly more accurate than both the 3LFI/BDF2 and the ADI

Now since this is a multi-step scheme (it needs the values not just from the previous time level but also from two levels before that), we need to use something different for the first couple of time steps, a starting procedure. For the very first step we need a two-level (single step) scheme. We cannot use CN because this will allow spurious oscillations to occur given the non-smooth initial conditions of option payoff functions. Instead one can use the standard Implicit Euler scheme plus local extrapolation: we first use the IE scheme for 4 sub-steps of size

Central finite differences are used for everything, except again for the

I used successive over-relaxation (SOR) to solve the resulting system because it's so easy to set up and accommodate early exercise features. On the other hand it is in general quite slow and moreover its speed may depend significantly on the chosen relaxation factor $\omega$. This is where the hand-made part comes in. I am using an SOR variant called

As I mentioned above, I wanted to compare this general purpose, old-school method to what people use nowadays for Heston (and other parabolic PDE's with mixed derivatives), which will usually be an ADI variant like Graig-Sneyd, Modified Graig-Sneyd, or Hundsdorfer-Verwer. To this end I did not implement said methods myself, but I downloaded and used QuantLib which does implement them. It's the first time that I use QuantLib and was pleasantly surprised by the available functionality (but less so by the lack of quantitative-level documentation). So I will be comparing with their implementation.

I will use the following 6 sets of parameters found in the relevant literature in order to make a few quick comparisons:

κ | η | σ | ρ | rd | rf | T | K | νo | Ref | |

Case 0 | 5 | 0.16 | 0.9 | 0.1 | 0.1 | 0 | 0.25 | 10 | 0.0625 | [2] |

Case 1 | 1.5 | 0.04 | 0.3 | -0.9 | 0.025 | 0 | 1 | 100 | 0.0625 | [1] |

Case 2 | 3 | 0.12 | 0.04 | 0.6 | 0.01 | 0.04 | 1 | 100 | 0.09 | [1] |

Case 3 | 0.6067 | 0.0707 | 0.2928 | -0.7571 | 0.03 | 0 | 3 | 100 | 0.0625 | [1] |

Case 4 | 2.5 | 0.06 | 0.5 | -0.1 | 0.0507 | 0.0469 | 0.25 | 100 | 0.0625 | [1] |

Case 5 | 3 | 0.04 | 0.01 | -0.7 | 0.05 | 0 | 0.25 | 100 | 0.09 | [3] |

The first case has been used by Ikonen & Toivanen [2] and other authors, cases 1-4 are from Hout & Foulon [1] and the last one is from Sullivan & Sullivan [3] who quote that such a set would be typical of a calibration of the model to equity market options. Note the very small volatility of variance $ \sigma$ which makes the PDE strongly convection-dominated in the

So let's start with a very unscientific comparison. Table 2 shows the prices, errors and timings for at-the-money puts at the current variance levels $\nu_o$ of table 1, as calculated by the present solver and the modified Craig-Sneyd ADI scheme in QLib. This is just a quick and dirty way to get a first idea since there are more differences than just the method. QLib discretizes in log(

Finally a fine grid was chosen deliberately since this is where SOR-based solvers are usually reported to suffer the most in terms of efficiency when compared with splitting schemes like ADI. The stopping criterion was set low enough so that the prices are converged to all six digits shown. This is definitely lower than the average discretization error and thus unnecessary, so in real-life situations the present solver could be set up to be a little faster still.

Solver | Price | |Error| | CPU secs | |

Case 0 | Present | 0.501468 | 0.000002 | 4.1 |

QLib MGS | 0.501469 | 0.000003 | 7.5 | |

Exact | 0.5014657 | - | - | |

Case 1 | Present | 7.46150 | 0.000234 | 6.2 |

QLib MGS | 7.46175 | 0.000015 | 7.5 | |

Exact | 7.461732 | - | - | |

Case 2 | Present | 14.4157 | 0.000008 | 7.4 |

QLib MGS | 14.4166 | 0.000861 | 7.5 | |

Exact | 14.41569 | - | - | |

Case 3 | Present | 12.0821 | 0.001076 | 6.6 |

QLib MGS | 12.0830 | 0.000122 | 7.5 | |

Exact | 12.08315 | - | - | |

Case 4 | Present | 4.71629 | 0.000007 | 3.8 |

QLib MGS | 4.71637 | 0.000072 | 7.5 | |

Exact | 4.716294 | - | - | |

Case 5 | Present | 4.83260 | 0.000002 | 4.3 |

QLib MGS | 4.83267 | 0.000069 | 7.5 | |

Exact | 4.832600 | - | - |

As I've said above this is really a dirty comparison, but basically I just wanted to see how much slower the present scheme is, is it 3 times, or 5 times, or 10. The next thing to check would be robustness, but I'll leave that for a different post. So then, to my surprise the present method seems to be overall on par speed-wise with the ADI method, faster even for the chosen number of time steps. This is in contrast to what one reads in the literature where SOR is always the "slow" method one uses to compare with the more efficient modern methods. Not the case here with the

QLib by default focuses the

A better picture of the quality of the solution is possible if we look at the error distribution across different values of

When we price an option by solving a PDE we get the entire solution surface for the grid range

So what do we see in this (used frequently in literature) case ? Two things: First that the non-uniform grid construction in log(

Another advantage of SOR when it comes to option pricing, apart from its simplicity, is the ease with which one can incorporate early exercise features. It's literally one extra line of code (OK, maybe a couple more counting the equally trivial changes in the boundary conditions). So let's have a look at the accuracy of the results produced by the present method. I will again use test case 0 from table 1 which has been used extensively in the literature for American option pricing under Heston (see for example [2]). We should note here that with the ADI schemes there are various ways with which one can handle early exercise. One of the best is probably the operator splitting method of Ikonen & Toivanen in [2], although in that paper they use it with different time-discretizations. I will compare with their results below but first I will again make a quick comparison with QLib's American Heston pricing functionality, using the same ADI scheme (MGS) like before. QLib does not use a "proper" method for handling early exercise, it simply applies the American constraint explicitly, after the prices have been obtained like European at each time step (explicit payoff method). Such an approach cannot give very accurate results. With SOR on the other hand (in this context called PSOR, i.e. Projected SOR) the American constraint is really "blended into" the solution. We apply it every time we update the value of each grid point and the next point we update can "feel" the effect of the previous points we have updated, i.e. it can feel that the values of the previous points have been floored by the payoff. And this is repeated many times (as many as the iterations needed for the solution to converge for that time step), which leads to all the points properly "sharing" what each needs to know about the early exercise effect. So in the end of each time step the PDE and the constraint are satisfied at each point simultaneously. Unlike with the explicit, a posteriori enforcement of the constraint in QLib, which unavoidably cancels all the good work the ADI had done before to exactly satisfy the discretized PDE at each point. The result of these different ways of incorporating early exercise can be seen below.

Present solver BDF3 - PTSOR | QLib MGS - EP | |||||

Nt | Price | |Error| | CPU secs | Price | |Error| | CPU secs |

50 | 0.519890 | 0.000004 | 0.33 | 0.519376 | 0.000557 | 0.28 |

100 | 0.519893 | 0.000001 | 0.44 | 0.519653 | 0.000280 | 0.51 |

200 | 0.519894 | 0.000000 | 0.43 | 0.519793 | 0.000140 | 0.97 |

400 | 0.519894 | 0.000000 | 0.73 | 0.519863 | 0.000070 | 1.9 |

800 | 0.519894 | 0.000000 | 1.1 | 0.519900 | 0.000033 | 3.7 |

1600 | 0.519894 | 0.000000 | 1.6 | 0.519918 | 0.000015 | 7.3 |

20000 | 0.519894 | 0 | - | 0.519934 | 0 | - |

Table 3 shows the pricing of the American put of case 0 from table 1 (also used for such tests in [2]) with a fixed spatial grid of 200 x 100 and increasing number of time steps. The time-converged value was calculated in both cases using 20000 time steps. Note that the converged values are not the same for the two solvers because they are using different spatial discretizations.

As we can see, apart from the first order convergence, the explicit payoff method of QLib also comes with a large error constant. Even with 1600 time steps the error is greater than that of the present solver with only 50 time steps. That said, the predictably first order convergence of the Qlib approach means that one can use Richardson extrapolation and get much improved accuracy.

The other thing to note is that the present solver becomes much faster than the ADI when the number of time steps is increased. Unfortunately this proves to be a rather useless advantage since given its high temporal accuracy one would not need to use a large number of time steps anyway.

Ikonen & Toivanen in [2] provide extensive results for the same test case which include the use of the standard (P)SOR with three second order time schemes (Crank-Nicolson, BDF2 and Runge Kutta), as well as their operator splitting (OS) method coupled with a multigrid (MG) solver. Their space discretization is very similar to the one used here (second-order central schemes for first and second derivative terms), although they do add some diffusion in areas of the solution where convection dominates, in order to avoid positive non-diagonal coefficients. To this end they also use a special scheme for the discretization of the mixed derivative term. This extra safety added should in general result in some loss of accuracy. Still I think it's probably not unreasonable to make direct accuracy comparisons with their results using the same (uniform) grid sizes. I also use the same computational domain

Present solver | Ikonen & Toivanen (2009) | ||||||||||||

(NS,Nν,Nt) | PTSOR -BDF3 | PSOR-CN | OS-CN w/ MG | ||||||||||

Error | Term. error | Avg. iter | Avg. ω | CPU | Error | Avg. iter | ω | CPU | Error | CPU | |||

(40,16,16) | 4.85E-03 | - | 9.3 | 1.11 | 0.01 | 1.55E-02 | 10.4 | 1.4 | 0 | 1.49E-02 | 0 | ||

(40,32,16) | 4.99E-03 | - | 10.4 | 1.27 | 0.01 | 1.61E-02 | 12.9 | 1.4 | 0 | 1.55E-02 | 0 | ||

(80,16,16) | 1.10E-03 | - | 14.7 | 1.21 | 0.02 | 4.57E-03 | 22.1 | 1.6 | 0.01 | 4.06E-03 | 0.01 | ||

(80,32,16) | 1.33E-03 | - | 13.9 | 1.36 | 0.03 | 4.44E-03 | 22.1 | 1.6 | 0.02 | 3.69E-03 | 0.02 | ||

(80,32,32) | 1.39E-03 | - | 8.3 | 1.26 | 0.03 | 4.19E-03 | 14.4 | 1.5 | 0.04 | 3.77E-03 | 0.04 | ||

(80,64,32) | 1.42E-03 | - | 10.6 | 1.36 | 0.05 | 1.72E-03 | 16.1 | 1.6 | 0.09 | 3.83E-03 | 0.09 | ||

(160,32,32) | 3.11E-04 | 7% | 11.6 | 1.4 | 0.06 | 1.36E-03 | 32.9 | 1.7 | 0.17 | 9.79E-04 | 0.08 | ||

(160,64,32) | 3.45E-04 | - | 16.9 | 1.59 | 0.14 | 1.30E-03 | 32 | 1.7 | 0.33 | 8.78E-04 | 0.17 | ||

(160,64,64) | 3.64E-04 | - | 8.4 | 1.44 | 0.14 | 1.15E-03 | 20.1 | 1.7 | 0.42 | 9.19E-04 | 0.33 | ||

(160,128,64) | 3.74E-04 | 2% | 10.7 | 1.58 | 0.34 | 1.16E-03 | 21.9 | 1.7 | 0.92 | 9.13E-04 | 0.75 | ||

(320,64,64) | 1.09E-04 | 10% | 11.7 | 1.6 | 0.36 | 4.22E-04 | 49.9 | 1.8 | 2.1 | 2.43E-04 | 0.74 | ||

(320,128,64) | 1.10E-04 | 9% | 13.21 | 1.66 | 0.84 | 4.05E-04 | 45.4 | 1.8 | 3.7 | 2.17E-04 | 1.6 | ||

(320,128,128) | 1.05E-04 | 4% | 8 | 1.57 | 1 | 3.28E-04 | 26.9 | 1.7 | 4.5 | 2.24E-04 | 2.7 | ||

(320,256,128) | 1.04E-04 | 3% | 10.2 | 1.68 | 2.6 | 3.30E-04 | 29.8 | 1.8 | 9.8 | 2.46E-04 | 5.5 | ||

(640,128,128) | 4.19E-05 | 23% | 10.9 | 1.66 | 2.8 | 1.43E-04 | 65.3 | 1.8 | 21.3 | 6.88E-05 | 6.6 | ||

(640,256,128) | 4.38E-05 | 27% | 11.7 | 1.71 | 6.5 | 1.42E-04 | 69.4 | 1.8 | 45.2 | 6.71E-05 | 13.6 |

The error shown in table 4 is the

So then, the one thing we can directly compare is the number of SOR iterations needed. These can be seen to be lower across the board with the present set-up compared to the one in [2]. More interestingly, the big difference is with the finer grids where the standard SOR requires increasingly higher numbers of iterations to converge. This is not so in the case of the present solver, where the required number of iterations remains more or less constant. This is a result of both the use of the TSOR variant and the BDF3 scheme. Consequently, the present solver would seem to be significantly faster than the equivalent SOR solver in [2], about 6 to 7 times faster for the fine grids, based on numbers of iterations but also CPU times (note that the CPU time per iteration seems to be about the same in both cases). It could also be faster than the multi-grid solver of [2] but again, these are two completely different implementations at different times from different people, so such a comparison can only be seen as vaguely indicative. Note that the perceived speed up factor of 6-7 for the finer grids is quite sensitive to the exact termination criterion. If for example I use a stricter criterion that results in the SOR termination error being 5% or 10% instead, then the speed up factor looks more like 3-4. I don't know exactly how converged the PSOR solutions in [2] are, the authors noting that "the error due to the termination of iterations is smaller than the discretization error. On the other hand, the criterion is not unnecessary tight which would increase the CPU time without increasing accuracy".

In terms of accuracy achieved, table 4 shows that the present solver gives errors that are 2 to 4 times lower. This is most probably due to the extra diffusion added with the special discretization used in [2] as mentioned above. Indeed looking at some European option prices also reported in [2], even though second order convergence is observed, the error overall seems higher than what I get with my discretization using the same grids. In terms of temporal convergence, the present method seems close to first order for low numbers of time steps

Finally, table 5 shows benchmark values for the 10 points considered in table 4, for future reference. Many publications have presented their estimated benchmarks for those 10 points and since I like calculating benchmarks, here are mine (obviously the most accurate!). Please note though that these are not the reference values used to calculate the errors in table 4, since those needed to be for the same computational domain of

with NS = 5120, Nv = 1920, Nt = 4500.

S | |||||

ν | 8 | 9 | 10 | 11 | 12 |

0.0625 | 2.0000000 | 1.1076218 | 0.5200330 | 0.2136783 | 0.0820444 |

0.25 | 2.0783684 | 1.3336383 | 0.7959825 | 0.4482762 | 0.2428072 |

In this rather lengthy post I pitted an experimental, easy to implement solver for partial differential equations with mixed derivatives, against more advanced and complicated methods. There is really nothing new about it, except maybe the TSOR variant (which I chose not to explain for the time being). I also don't think I've seen the BDF3 scheme used for the solution of the Heston PDE before, so someone may find this interesting as well, maybe. The first results show the present combo to be very competitive against the modified Craig-Sneyd ADI implementation in QuantLib and significantly faster than other SOR-based solvers reported in literature such as the one in [2] (or the one in [3] which I've also compared against). I will not say that it really is as "good" as the ADI methods, since I have not yet thoroughly tested it in terms of robustness. In terms of stability/oscillation damping is should fare well, but that needs to be confirmed as well. It certainly has its quirks, for example it "likes" certain grid proportions better (in terms of CPU/accuracy ratio), speeds up with small time steps and slows down with larger ones and needs some heuristic logic for adjusting the relaxation factor. All to do with the iteration matrix of course, but it's not difficult to set it up for near-optimal operation. Also need to note that the implementation (QLib) I tested against may not be representative of how fast an ADI method could be, so the timings above may be misleading.

All this took me quite a bit more time to prepare than I thought it would, so I'll stop here. When I feel like more numerical games I may add another post testing stability and robustness, as well as non-uniform grids. Does the use of central spatial discretizations all around mean that we may get spurious oscillations in convection-dominated pricing problems? If that's the case, replacing the central finite differences with second order upwind ones where needed is really straightforward. It would also come with no extra performance cost, as would be the case with ADI since it would mean that fast tridiagonal solvers would have to be replaced by band-diagonal ones that are not as fast. Also my non-uniform grid implementation currently seems in many cases up to two times slower than the present (uniform) grid implementation. This is partly due to more arithmetic operations and the SOR iteration matrix being less "favourable" for the same grid size.

[1] Hout K. J. In ’t & Foulon S. ADI finite difference schemes for option pricing in the Heston model with correlation, *International Journal of Numerical Analysis and Modeling,* 7 (2) (2010), pp. 303-320.

[2] S. Ikonen & J. Toivanen, Operator splitting methods for pricing American

options under stochastic volatility,*Numerische Mathematik, *113 (2009), pp. 299–324.

[3] C. O'Sullivan & S. O'Sullivan, Pricing European and American Options in the Heston Model with Accelerated Explicit Finite Differencing Methods,*International Journal of Theoretical and Applied Finance*, Vol. 30, No. 4 (2013).

]]>[2] S. Ikonen & J. Toivanen, Operator splitting methods for pricing American

options under stochastic volatility,

[3] C. O'Sullivan & S. O'Sullivan, Pricing European and American Options in the Heston Model with Accelerated Explicit Finite Differencing Methods,