Original article was published by Maxime Bergeron on Deep Learning on Medium
Deep Learning to Jump
By Maxime Bergeron & Ivan Sergienko
In this short note, we describe a Jump Unit that can be used to fit a step function with a simple neural network. Our motivation comes from quantitative finance problems where discontinuities often appear.
Why the jump?
Discontinuous functions are a common occurrence in the pricing of financial instruments. For instance, the graph below shows the price of a typical five-year fixed-rate bond with a semi-annual coupon. We have set the coupon rate higher than the discount rate, so the value of the bond stays above its par value of $100. If you are not familiar with bond pricing, a good primer is available here.
The important things to notice for our purposes are the jumps that occur at each coupon payment date. This happens simply because money cannot be made out of thin air. The “wealth” of the security owner remains the same immediately before and after the coupon. As such, we have:
value before coupon = coupon cash + value after coupon.
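To make this relation concrete, here is a minimal sketch of a fixed-rate bond valued under a flat, continuously compounded discount rate. The function name `bond_value` and the 6% coupon and 4% discount rates are illustrative assumptions, not the parameters behind the graph above.

```python
import math

def bond_value(t, maturity=5.0, coupon_rate=0.06, discount_rate=0.04,
               par=100.0, freq=2):
    """Value at time t (in years) of all cash flows paid strictly after t."""
    coupon = par * coupon_rate / freq
    n_payments = int(round(maturity * freq))
    value = 0.0
    for k in range(1, n_payments + 1):
        pay_time = k / freq
        if pay_time > t:
            # The final payment returns the par value on top of the coupon.
            cash = coupon + (par if k == n_payments else 0.0)
            value += cash * math.exp(-discount_rate * (pay_time - t))
    return value

eps = 1e-9
before = bond_value(0.5 - eps)  # just before the first coupon date
after = bond_value(0.5)         # the coupon has just been paid
print(before - after)           # approximately the coupon cash of 3.0
```

The size of the jump equals the coupon cash, which is the statement "value before coupon = coupon cash + value after coupon" rearranged.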
Similar jumps occur in the values of more complicated path-dependent financial derivatives on exercise dates. A classic example here is that of a Bermudan swaption, a popular instrument used to manage mortgage prepayment risk. Bermudan-style options can be exercised on a predetermined schedule of dates, and the value may jump on each of those dates.
To keep things concrete, we will focus on the sub-problem of learning a piece-wise constant function with a single downward jump. We generate the training data as shown in the graph above. The reader is invited to follow along with our code in this Jupyter notebook.
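As a rough illustration of what such training data could look like, the sketch below builds a piece-wise constant target with one downward jump. The helper name `make_step_data`, the unit interval, the jump location and the two levels are all assumptions made for this example; the notebook may use different values.

```python
def make_step_data(n=200, t_jump=0.5, high=1.0, low=0.0):
    """Evenly spaced times on [0, 1] with a single downward jump at t_jump."""
    times = [i / (n - 1) for i in range(n)]
    # Points strictly before the jump sit at `high`, the rest at `low`.
    values = [high if t < t_jump else low for t in times]
    return times, values

times, values = make_step_data()
print(values[0], values[-1])
```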
So how do we machine learn a jump? A natural approach here is to use a sigmoid. However, in our problem, the steps are sharp. All the points to the left of a coupon payment date are strictly above the points to the right of it, resulting in a true discontinuity. Fitting it with a simple one-dimensional neural network, composed of a sigmoidal activation function sandwiched between two linear layers, requires an unbounded coefficient (weight) in the initial linear layer. Quantities growing without bound are always bad news for numerical methods, but it gets even worse with neural nets. Indeed, it leads to the infamous exploding gradients problem, which makes optimal parameter values nearly impossible to learn. We would have to compromise by keeping the weight large yet finite, leaving a region of the inputs (just around the jump) where it is difficult to get rid of a significant error.
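The trade-off can be seen in a small sketch. It compares a sigmoid of a given steepness against an upward unit step located at zero (the downward case is its mirror image) and reports the error at a point a small distance to the right of the jump. The function names and the chosen weights and distance are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_step_error(weight, distance):
    """Error of S(weight * x) against a unit step, evaluated at x = distance > 0."""
    return 1.0 - sigmoid(weight * distance)

# The error close to the jump shrinks as the weight grows, but never reaches zero.
errors = [sigmoid_step_error(w, distance=0.01) for w in (10.0, 100.0, 1000.0)]
for weight, err in zip((10.0, 100.0, 1000.0), errors):
    print(weight, err)
```

Any finite weight leaves a nonzero error at points close enough to the jump, which is why an exact fit would need the weight to grow without bound.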
Can we do better?
Why not simply learn a jump with… a jump? It is tempting to replace the sigmoid function in our one-dimensional network above with a discontinuous activation. The simplest candidate here is the Heaviside step function H(x) which equals 0 for x < 0, and 1 otherwise:
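Written out as a formula, this definition reads:

```latex
H(x) =
\begin{cases}
0, & x < 0, \\
1, & x \ge 0.
\end{cases}
```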
This is an obvious idea, which immediately leads to a failure like the one shown in the graph below:
To understand the problem here, let’s look at the math. The loss function corresponding to our little neural network is:
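One way to write this loss, consistent with the description that follows, is sketched here. The subscript labels (w_1 for the weight of the last linear layer, b_0 and b_1 for the biases of the first and last linear layers), the sum-of-squares form and the absence of a 1/N normalisation are assumptions of this sketch.

```latex
L(w_1, b_0, b_1) = \sum_{i} \bigl( y_i - w_1 \, H(t_i + b_0) - b_1 \bigr)^2
```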
Here, yᵢ are the training data values corresponding to times tᵢ, the subscripted w term is the weight of the last linear layer and the subscripted b terms are the biases of the two linear layers. The astute reader will notice that, without loss of generality, we have set the weight of the first linear layer to 1, and that the remaining weights and biases determine the size and position of the step. The gradient descent methods on which deep learning relies for minimizing the loss function require the use of first derivatives. In this context, one of them is rather problematic:
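As a sketch, assume the network is written w_1 H(t + b_0) + b_1 and the loss is the plain sum of squared residuals (the subscript labels are chosen here for illustration). The derivative with respect to the first-layer bias is then, formally,

```latex
\frac{\partial L}{\partial b_0}
  = -2 \, w_1 \sum_{i} \bigl( y_i - w_1 \, H(t_i + b_0) - b_1 \bigr) \, \delta(t_i + b_0),
\qquad \delta(x) = \frac{dH}{dx}(x)
```

where δ is the Dirac delta, the derivative of H in the sense of distributions.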
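The problem here is the second factor, the derivative of the Heaviside function: it vanishes wherever the b and t terms do not cancel out, and blows up where they do. As a function of the bias, the loss is therefore piece-wise constant, as the following plot clearly illustrates.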
While the actual minimum is where we expect it to be, gradient-based methods never reach it. Instead, the process gets stuck on one of the little plateaus.
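The plateaus are easy to reproduce. The sketch below evaluates the mean squared error of a Heaviside network on a small hypothetical data set (11 points with a downward jump at t = 0.5) for two nearby bias values. The function names, the data and the fixed weight and offset are assumptions made for this example.

```python
def heaviside(x):
    return 0.0 if x < 0 else 1.0

def step_loss(bias, times, values, weight=-1.0, offset=1.0):
    """MSE of weight * H(t + bias) + offset against the data."""
    residuals = [y - (weight * heaviside(t + bias) + offset)
                 for t, y in zip(times, values)]
    return sum(r * r for r in residuals) / len(residuals)

# Hypothetical data: 11 points on [0, 1], jumping down from 1 to 0 at t = 0.5.
times = [i / 10 for i in range(11)]
values = [1.0 if t < 0.5 else 0.0 for t in times]

# Two biases between the same pair of data points give exactly the same loss,
# so a finite-difference estimate of the gradient is zero.
finite_difference = (step_loss(-0.25, times, values)
                     - step_loss(-0.26, times, values)) / 0.01
print(finite_difference)
```

Moving the bias only changes the loss when the step crosses a data point, so a gradient-based update receives no signal in between.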
In fact, this problem is one of the reasons why sigmoid functions became a staple of machine learning. Replacing Heaviside functions with sigmoidal functions allowed complex neural networks such as the multi-layer perceptron to be successfully trained via gradient descent.
At this point, it may start to feel like we are going in circles. Sigmoidal functions are of limited use to fit sharp jumps, yet they were introduced to fix the obvious problems with Heaviside functions.
The following table summarizes the previous two sections and shows what we like and dislike about the two activation functions when faced with the task of learning a sharp step function:
A natural question is whether we can combine the two functions, keeping the features that we like and discarding those we dislike. Enter the Jump Unit illustrated below:
It consists of three linear nodes along with sigmoid and Heaviside activation functions arranged in parallel. To understand how this unit works, consider the equation that it encodes:
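A form consistent with this description is sketched here. The symbol names are assumptions: b_0 is the bias of the shared input layer (whose weight is fixed to 1 as before), w_S and w_H are the weights of the linear layers following the sigmoid S and the Heaviside function H, and b_1 is the single remaining output bias.

```latex
f(t) = w_S \, S(t + b_0) + w_H \, H(t + b_0) + b_1
```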
In order to simplify the equation, we omit the bias term in the linear layer following the sigmoid activation since it is accounted for by the bias term of the linear layer following the Heaviside activation. Let’s now take a look at the derivative which was problematic before:
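As a sketch, assume the unit is written f(t) = w_S S(t + b_0) + w_H H(t + b_0) + b_1 and the loss is the sum of squared residuals (the symbol names are chosen here for illustration). The derivative with respect to the shared bias then becomes, formally,

```latex
\frac{\partial L}{\partial b_0}
  = -2 \sum_{i} \bigl( y_i - f(t_i) \bigr)
    \bigl[ w_S \, S'(t_i + b_0) + w_H \, \delta(t_i + b_0) \bigr],
\qquad S'(x) = S(x) \bigl( 1 - S(x) \bigr)
```

where δ is again the Dirac delta. The first term in the square brackets is smooth and nonzero for every finite argument.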
Since the troublesome bias term now appears in the argument of both S(·) and H(·), the corresponding gradient no longer vanishes at most points and the network is able to learn. Notice that at the end of the process we also want the weight of the linear layer following the sigmoid activation to vanish, so that only the Heaviside contribution remains.
The plot above shows the MSE of our Jump Unit as a function of its two key parameters. We see that the plateaus between the steps along the bias axis are now sloped, allowing gradient-based algorithms to reach the global minimum.
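The effect on the gradient can be checked with a small sketch in plain Python. It is not the notebook's implementation and not a full training loop: the function names, the hypothetical 11-point data set and the parameter values are assumptions, and the derivative of H is treated as zero away from the jump.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def heaviside(x):
    return 0.0 if x < 0 else 1.0

def jump_unit(t, b0, w_s, w_h, b1):
    """Sigmoid and Heaviside branches sharing the input bias b0."""
    return w_s * sigmoid(t + b0) + w_h * heaviside(t + b0) + b1

def bias_gradient(times, values, b0, w_s, w_h, b1):
    """d(MSE)/d(b0), with the derivative of H taken to be zero."""
    total = 0.0
    for t, y in zip(times, values):
        s = sigmoid(t + b0)
        residual = y - jump_unit(t, b0, w_s, w_h, b1)
        total += residual * s * (1.0 - s)
    return -2.0 * w_s * total / len(times)

# Hypothetical data: 11 points on [0, 1], jumping down from 1 to 0 at t = 0.5.
times = [i / 10 for i in range(11)]
values = [1.0 if t < 0.5 else 0.0 for t in times]

# Heaviside branch only: no gradient signal for the bias.
print(bias_gradient(times, values, -0.25, 0.0, -1.0, 1.0))
# With the sigmoid branch switched on, the gradient is nonzero.
grad = bias_gradient(times, values, -0.25, -0.5, -0.5, 1.0)
new_b0 = -0.25 - 1.0 * grad  # one gradient descent step with learning rate 1.0
print(grad, new_b0)
```

In this example the step sits too far to the left of the data's jump, and the update moves the bias towards the correct location, which is the slope visible along the bias axis.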
Finally, we plot the resulting function. Voilà!
To conclude, we have shown how the benefits and drawbacks of sigmoidal and Heaviside activation functions can be combined to produce a Jump Unit capable of learning a discontinuous step function via gradient descent.
We encourage readers to try it out for themselves in this Jupyter notebook!