Original article was published on Deep Learning on Medium
Wavenet variations for financial time series prediction: the simple, the directional-Relu, and the probabilistic approach
One of the most useful features of deep neural network architectures is their capability to yield good results in very different domains. Image recognition, natural language processing, or regression tasks can be implemented with similar solutions, model parts, or models. To build a state-of-the-art model in any domain, it is important to know the new techniques emerging in other fields.
The Wavenet model by DeepMind was originally created for generating audio but showed good results with language translation and time series prediction.
In this article, we will build models based on the Wavenet architecture to predict currency prices.
Forecasting the stock market or the foreign exchange market is very difficult. There are too many unknown effects. You will never be able to predict what the president of the United States will share on Twitter tomorrow, and a single tweet has the power to alter the course of the market. But most of the time we don’t want to build a model that is capable of predicting the price in all situations. That would be an unachievable goal. We can aim lower: we can try to build models that are capable of predicting some aspects of the future, in some situations, for a not too long timeframe, with acceptable skill. This skill can be very different depending on the exact problem.
What exactly do we want to predict when we try to forecast the future price of some asset? Are we trying to predict the real future value? No. We are trying to forecast the response of a system that is made of millions or billions of individual human brains and algorithms. It doesn’t matter how rational the aggregated response is according to our judgment. If the aggregated response is biased, then our subjectively rational decision (which is a belief) can be bad. We could sink into a never-ending philosophical debate about the existence of the ultimate rational decision, but in my world, we have to predict the reaction of a very complex system, and not some very vague real value.
The system we aim to predict can’t be totally random, it must have patterns, but in the case of financial markets, these patterns are more and more difficult to exploit. To find them we have to use all the tools and data our idea can exploit, and we need the new data as soon as possible. Deep learning models can help to find non-linear associations and very weak signals if enough data is fed into them, but the longer the forecast range, the more probable it is that some factor outside of our model-world will affect the state of the system.
Maybe you have other views. I am a meteorologist, and for more than 10 years I made ultra-short-range predictions, where we always had to use (almost) real-time information to predict the weather, and often we had to disregard the base forecast model because it was obviously unable to deal with the situation at the actual spatial and temporal resolution. This experience inevitably shaped my view of prediction methods. The weather is, of course, a very different beast than the stock market. The weather was guided by the same rules yesterday as it will be tomorrow. Financial markets have aspects of a zero-sum game in the short term, where the rules can change from one day to the other. This doesn’t mean that it is possible to ever make perfect weather predictions, but that is another story.
Back to our problem:
We will use the Wavenet architecture as the core of our models, with small variations, and will attach different inputs and outputs to this core model. The Wavenet architecture is composed of great additions to the deep learning toolbox, like dilated convolutional layers, gated activations, and residual connections. You can read more about them in this article.
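To make these three building blocks concrete, here is a minimal NumPy sketch of a small stack of Wavenet-style residual blocks. It is not the model from the notebook: the function names, tensor shapes, kernel size, dilations, and random weights are all assumptions made for illustration, and using one kernel-size-one convolution for both the residual and the skip path is a simplifying choice.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """x: (time, in_channels); w: (kernel, in_channels, out_channels).

    The output at step t only sees x[t], x[t - dilation], x[t - 2*dilation], ...
    """
    kernel = w.shape[0]
    pad = (kernel - 1) * dilation
    # Left-pad with zeros so that no output step can look into the future.
    x_pad = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    out = np.zeros((x.shape[0], w.shape[2]))
    for k in range(kernel):
        start = k * dilation
        out += x_pad[start:start + x.shape[0]] @ w[k]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_block(x, w_filter, w_gate, w_out, dilation):
    # Gated activation: a tanh "filter" multiplied by a sigmoid "gate".
    filt = np.tanh(causal_dilated_conv(x, w_filter, dilation))
    gate = sigmoid(causal_dilated_conv(x, w_gate, dilation))
    skip = (filt * gate) @ w_out      # kernel-size-one convolution
    return x + skip, skip             # residual output, skip output

def wavenet_stack(x, params, dilations):
    hidden, skip_sum = x, 0.0
    for (w_filter, w_gate, w_out), dilation in zip(params, dilations):
        hidden, skip = residual_block(hidden, w_filter, w_gate, w_out, dilation)
        skip_sum = skip_sum + skip
    return skip_sum

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
steps, channels, kernel = 32, 4, 2
dilations = [1, 2, 4, 8]   # receptive field: 1 + (1 + 2 + 4 + 8) = 16 steps
params = [
    (rng.normal(size=(kernel, channels, channels)) * 0.5,
     rng.normal(size=(kernel, channels, channels)) * 0.5,
     rng.normal(size=(channels, channels)) * 0.5)
    for _ in dilations
]
x = rng.normal(size=(steps, channels))
out = wavenet_stack(x, params, dilations)
print(out.shape)
```

Doubling the dilation in every block is what lets a few layers cover a long history: with a kernel size of two, four blocks already see 16 steps back.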
Finding a good architecture is necessary, but not sufficient. Without proper input data, no model will yield useful results. There are datasets where inputting simple, one-dimensional data can be enough to get some future steps with low error, but financial datasets aren’t so easy to tame. Giving the model only the 1-minute or 1-day history of the closing ask price of an asset will not give good results (at least not for me).
Not so long ago I read Daniel Kahneman’s book, Thinking, Fast and Slow. This book isn’t about deep learning, it is about human thinking, the effect of cognitive biases, where we can and cannot trust our intuitions or decisions, and other things. A fantastic book. I wasn’t able to read it without constantly comparing the properties and biases of the human brain to deep learning models. Don’t think that I believe general artificial intelligence is almost within our grasp. I don’t even like the name “artificial intelligence”; this name promises too much compared to our available tools. But there are some similarities, and if you do machine learning there is a good chance that you make something for people or about people. Even if you aren’t interested in decision theory or psychology, reading this book can give you great ideas for improving your product. One important cognitive bias we have is briefly described as “What you see is all there is.” This refers to the fact that we make decisions based on inadequate information, or only on the information we have (most of the time). We don’t really think about the information we don’t have; that would be too much effort. This is true for deep learning models as well. It doesn’t matter how many neurons we use or what kind of state-of-the-art architecture we build. We can give the model petabytes of data, but if that data doesn’t carry enough information (different features being in some kind of association with our output) for our shiny function approximator to map it to the label, then our model will not perform well.
Here we briefly summarize what kind of data we will use; the data generating process will get a whole article of its own, which will be linked <HERE> later.
Our goal is to forecast forex bar features of the upcoming step. Forecasted pairs: EUR/USD, GBP/USD, and JPY/USD (not the more common USD/JPY). We will forecast simple price values, log-returns, and the directions of changes.
We will play a bit with the Relu activation as output. With activation functions as output, we can determine the codomain of our models. At the end, we will build the Wavenet model with distribution output. For that, we will use the Tensorflow probability library. An implementation of a similar model can be seen here: Time series prediction with multimodal distribution — Building Mixture Density Network with Keras and Tensorflow Probability.
We will forecast features calculated from the tick means of the bar ranges, and not OHLC values. Tick means are more representative measurements of the prices during the bar period, and less noisy than the closing price. It is easier for a model to find patterns if we use means.
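A tiny example shows why the tick mean can be a calmer measurement than the close. The tick values below are made up, and the plain unweighted mean is an assumption; the features in the real dataset may be computed differently.

```python
import numpy as np

# Hypothetical prices of the ticks that arrived within one 5-minute bar.
# The last tick is a short spike just before the bar closes.
ticks = np.array([1.1010, 1.1012, 1.1011, 1.1013, 1.1012, 1.1025])

bar = {
    "open": ticks[0],
    "high": ticks.max(),
    "low": ticks.min(),
    "close": ticks[-1],          # decided by a single tick
    "tick_mean": ticks.mean(),   # uses every tick of the bar
}

for name, value in bar.items():
    print(f"{name:>9}: {value:.5f}")
```

The close is set entirely by the final spike, while the tick mean moves only slightly because the other five ticks still dominate it.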
The data inputs have two main components:
- features generated from tick data during a 5-minute range (source: Dukascopy)
- features generated from economic news calendar (source: FXStreet)
Generating hundreds of features required compromises and arbitrary choices. The data preparation article is coming; preparing the data took longer than building and training the models. In the article Financial bars at the age of deep learning, you can read about some of the ideas.
To feed the data to our model we will use the Tensorflow dataset API.
For training, we will use the 2016–2018 period, and 2019 is the validation period. For simplicity, I didn’t use separate test data.
The models were trained on Google Cloud AI Platform.
The notebook with the code is available on Github: https://github.com/sinusgamma/probabilistic_wavenet_fx/blob/master/wavenet_fx_final.ipynb
The Core Wavenet model
Here is the code of the core model which will be part of all models with slightly different parameters. Depending on our input data or output attachment the parameters of the wavenet_model_setup function will change from one model to the other.
The baseline prediction
First, we set the baselines we want to outperform.
The most obvious and naive feature we can predict is the price itself. The most naive forecast, which can sometimes be very hard to surpass, is called the ‘Naive Forecast’: we predict that the price at the next step will be the same as the price at the last known step. The MAE of this forecast can be calculated from our label dataset. Here we average the MAE of the three currency pairs.
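As a minimal sketch of that calculation, assuming the labels are available as one plain price series per pair (the numbers below are invented, and `naive_forecast_mae` is a hypothetical helper, not a function from the notebook):

```python
import numpy as np

def naive_forecast_mae(prices):
    """MAE of predicting that the next price equals the last known price."""
    prices = np.asarray(prices, dtype=float)
    prediction = prices[:-1]   # last known value at each step
    target = prices[1:]        # value that actually followed
    return np.mean(np.abs(target - prediction))

# Invented label series, only to show the mechanics.
labels = {
    "EUR/USD": [1.1000, 1.1002, 1.1001, 1.1004],
    "GBP/USD": [1.3000, 1.2997, 1.2999, 1.2999],
    "JPY/USD": [0.0090, 0.0091, 0.0091, 0.0090],
}

mae_per_pair = {pair: naive_forecast_mae(p) for pair, p in labels.items()}
mean_mae = np.mean(list(mae_per_pair.values()))
print(mae_per_pair, mean_mae)
```

Note that a plain average over pairs with very different price levels is dominated by the pairs with larger absolute moves.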
If we do the same naive prediction with log-returns, then we make a mistake. Price describes a state, while log-return describes the change of a state. So with the above naive math, we would forecast that the price change will be the same as before. (In the final dataset log-returns are scaled in the input and output as well.) But let’s see how this prediction would perform:
When we forecast that the price in the next step will be the same as before, we assume that the log-return will be 0.0. Calculating the error of this forecast will give a better naive prediction with lower MAE than above.
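The difference between the two log-return baselines can be reproduced on simulated data. The sketch below assumes a random walk in log-price with independent normal steps, which is only a toy stand-in for the real dataset, so the numbers are not comparable with the MAE values of this article.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy price path: a random walk in log-price (an assumption, not the forex data).
log_price = np.log(1.10) + np.cumsum(rng.normal(0.0, 1e-4, size=10_000))
log_returns = np.diff(log_price)

target = log_returns[1:]
persistence = log_returns[:-1]     # "the change will be the same as before"
zero = np.zeros_like(target)       # "the price will not change"

mae_persistence = np.mean(np.abs(target - persistence))
mae_zero = np.mean(np.abs(target - zero))
print(mae_persistence, mae_zero)

# A zero log-return forecast is the naive price forecast in disguise:
last_price = np.exp(log_price[-1])
next_price_forecast = last_price * np.exp(0.0)   # equals last_price
```

For independent steps the persistence error is larger, because it adds the noise of the previous return to the noise of the next one.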
The price forecasting models
I tested two models to forecast the price of the currencies. One model was trained only with the features derived from the tick data, the other model was trained with additional features derived from economic news data.
In the graph above we can see the architecture of our model with price and news data.
- input_nosparse: This represents the data derived from the economic news dataset. The ‘no-sparse’ in the name means that I dropped some features which were mostly zeros because of one-hot encoding. Very sparse inputs make the training very inefficient, so I included and generated features that yield less sparse tensors. This dataset includes the last surprise factor of a given news event, and a counter, which tells the model how far we are from the last event. (Another counter could count the timesteps until the next known occurrence of that event, but I didn’t use that here.) The idea of the counter is similar to the CoordConv architecture, which “allows filters to know where they are in Cartesian space by adding extra, hard-coded input channels that contain coordinates of the data seen by the convolutional filter”. The difference is that this isn’t spatial, but temporal help. As the Wavenet model also uses convolution, I hope this will have a similar positive effect on the model performance.
- input_eventcur: This dataset shows whether the given timestep contains any economic event that could affect our prediction, and the currency of that event.
- input_curbars: Features generated from the tick data, like ‘log-returnized’ OHLC, the standard deviation or Spearman’s rank correlation of the ticks, and others.
- model_news: News data is sparse and has lots of dimensions. We use a kind of embedding before inputting it to the Wavenet part of the model. Entity embedding can’t be used with this dataset, because multiple events can coexist in the same step. Instead, we use depthwise convolution to embed this data. By depthwise convolution I mean here a convolution whose kernel size is only one (often also called a pointwise or 1×1 convolution). This way the kernel weights the input only along one axis. By increasing or decreasing the number of filters and using different activations we can use it to represent our data with fewer or more dimensions. Why don’t we use kernels with more dimensions? Imagine an image with RGB layers. The relation of neighboring pixels has meaning, and the task of a typical kernel in an image recognition problem is to exploit these relations. But what if the order of the pixels along a dimension is arbitrary? For example, what if one axis of our tensor represents different currency pairs and the other axis represents some features of the price during a bar range? The order of the currency pairs is arbitrary, and the order of the features is arbitrary as well. In this case, our dataset isn’t similar to an image, and there is no point in trying 2D kernels. But the data isn’t totally unrelated. Along one axis, we have all the features of the same currency pair, and along the other axis, we have all the currency pair data of a particular feature. For some of our datasets, we could use depthwise convolution even from two directions, but I will calculate it only from one.
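The kernel-size-one idea from the model_news item can be sketched in NumPy as a weighted sum along the feature axis, applied independently at every timestep. This reads “kernel size is only one” as a 1×1 convolution; the layer actually used in the notebook may differ, and the shapes, names, and Relu activation below are assumptions.

```python
import numpy as np

def kernel_size_one_conv(x, weights, bias):
    """x: (time, features); weights: (features, filters); bias: (filters,).

    Every timestep is embedded on its own: no neighboring steps are mixed.
    """
    return np.maximum(x @ weights + bias, 0.0)   # Relu picked as an example

rng = np.random.default_rng(0)
steps, n_features, n_filters = 6, 10, 3          # hypothetical sizes
x = rng.normal(size=(steps, n_features))
w = rng.normal(size=(n_features, n_filters))
b = np.zeros(n_filters)

embedded = kernel_size_one_conv(x, w, b)         # 10 features -> 3 dimensions

# Shuffling the (arbitrary) feature order changes nothing, as long as the
# weights are shuffled the same way.
perm = rng.permutation(n_features)
embedded_shuffled = kernel_size_one_conv(x[:, perm], w[perm], b)
print(embedded.shape, np.allclose(embedded, embedded_shuffled))
```

A wider kernel along the feature axis would not have this property: its output would depend on which features happen to sit next to each other.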
The model with the price data and the additional economic news data was better than the model with the price features alone, but I have to mention that because of the data size differences I used a different number of filters in the Wavenet, so the comparison isn’t totally fair.
In the end, the better model had 0.00024 MAE, which was worse than the naive no-price-change forecast with its 0.00014 MAE. With some tuning, architecture optimization, and input data preprocessing I think it isn’t too hard to outperform our base prediction, but we will pursue different goals. Forecasting the price is harder than forecasting some other, equally useful features. Even if we normalize or standardize our output, the variation of the price during a sequence can be far smaller than the variation in the whole dataset, which makes it harder for the model to capture the changes. It would be a better choice to forecast the change in the price compared to the start of the sequence or compared to the earlier step, or we can forecast the return or log-return instead (which we will do later).
I can’t go further without a nice chart, similar to the charts we can so often see when somebody showcases the performance of a one-step forecast model.
This always looks so good, and it is so useless and deceptive. If we examine the chart a bit longer we can notice that the prediction lags behind the real price, and this lagging prediction is worse than the naive forecast. If this were a multi-step forecast all along, that would be fantastic, and you know, I wouldn’t show it to you. But this is generated from one-step forecasts, and it is very hard for a human to evaluate the performance of the model from this image.
An alternative way to show the performance of our model on a chart is to display the real change of the price, the predicted change, and the error. This doesn’t look so good.
The log-return forecasting models with directional Relu
This chapter of the article would have been shorter if I hadn’t made a mistake in the code. But I made one and was happy about it.
Originally in the Wavenet part, I used Relu activation as output. Relu is zero where the input is negative and keeps the original number where it is positive. For price prediction, this wasn’t a problem, as every currency pair has a positive value everywhere. But log-returns can be negative or positive, so my outputs were weird: there were too many zeros.
After realizing my mistake I tried to exploit this property of the Relu activation.
I developed three models with the same architecture and input data, only the last activation function of the models was different. The first model had a normal Relu-activation and was able only to predict positive log-returns and zeros. The second model had an inverted Relu-activation and was able to predict only negative log-returns and zeros. The third model didn’t have an output activation function and was able to predict any number.
I implemented the negative Relu in a very simple way: I multiplied the Relu output by -1. Theoretically, we should implement negative Relu differently. We could keep the negative inputs and zero out the positive ones, but as the weights are learned to fit the output, this would give us the same results, only with weights of opposite sign before the last lambda layer.
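The equivalence of the two variants is easy to check numerically. In this NumPy sketch the function names are hypothetical; the point is only that flipping the sign of the Relu output gives the same values as clipping the sign-flipped input.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def neg_relu_flipped(z):
    # The simple variant: multiply the Relu output by -1.
    return -relu(z)

def neg_relu_clipped(z):
    # The "theoretical" variant: keep negative inputs, zero out positive ones.
    return np.minimum(z, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # pre-activation values

print(neg_relu_flipped(z))     # non-zero only where z is positive
print(neg_relu_clipped(-z))    # same values, from the sign-flipped input
```

Since the last layer can learn weights of either sign, a network trained with one variant can represent exactly what a network trained with the other can.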
The model predicts log-returns, but we implement some metrics to evaluate the directional forecast performance of the model.
(The log-returns in my dataset are scaled, divided by a standard deviation, but aren’t shifted.)
These metrics need some explanation. The directional metrics measure how often the model can predict whether the log-return will be positive or negative, regardless of the magnitude of the log-return. For example, ‘direction_acc’ measures the share of predictions on the correct side among all predictions. On this metric, the no-activation model scores higher than the positive or negative Relu model, as the Relu outputs are able to predict only one side. But the ‘direction_acc_pos’ metric counts only the positive log-returns, and ‘direction_acc_neg’ counts only the negative log-returns, so we can use these metrics to compare the performance of our Relu models to the no-activation model. Being on the correct side is important, but it is also important to know how often we are on the wrong side. For that, we use the ‘_inacc_’ metrics, which measure how often we predict the wrong side. Our directional accuracy and directional inaccuracy functions use > or < and not ≥ or ≤ comparisons. This way we create an uncertain zone. The model with no activation never predicted a 0.0 log-return; it isn’t really capable of that, because there is only a very small chance that the weights will output exactly zero. The sum of the directional accuracy and inaccuracy of this model is 1.0 with these metrics. But if we add together the directional accuracies and inaccuracies of both the positive and negative Relu models, we get a number less than 1.0: we get 0.86. We have another metric, ‘pred_zero’, for counting the rate of zero predictions. The rate of zeros of the no-activation model is 0.0, and maybe you would anticipate that the rate of zeros in the Relu models is close to 0.5, and their sum is close to 1.0. No, in both models the rate of zeros is above 0.5. Or from another viewpoint, they both predict the direction available to them by the Relu in less than 50 percent of the steps. The Relu output brought a new card into play: uncertainty.
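A minimal NumPy sketch of such metrics follows. The names ‘direction_acc’ and ‘pred_zero’ come from the text above; ‘direction_inacc’, the exact denominators, and the toy numbers are assumptions, so the notebook's Keras metrics may be defined differently. The clipped prediction only imitates the shape of a Relu output, it is not a trained Relu model.

```python
import numpy as np

def directional_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Strict comparisons: a prediction of exactly 0.0 is neither hit nor miss.
    hit = ((y_true > 0) & (y_pred > 0)) | ((y_true < 0) & (y_pred < 0))
    miss = ((y_true > 0) & (y_pred < 0)) | ((y_true < 0) & (y_pred > 0))
    return {
        "direction_acc": hit.mean(),
        "direction_inacc": miss.mean(),
        "pred_zero": (y_pred == 0).mean(),
    }

# Invented scaled log-returns and predictions.
y_true = np.array([0.3, -0.2, 0.1, -0.4, 0.2])
pred_no_activation = np.array([0.1, 0.1, 0.2, -0.3, -0.1])
pred_positive_relu = np.maximum(pred_no_activation, 0.0)

m_no_act = directional_metrics(y_true, pred_no_activation)
m_relu = directional_metrics(y_true, pred_positive_relu)
print(m_no_act)
print(m_relu)
```

With no exact zeros in the prediction, accuracy and inaccuracy add up to 1.0; with zeros, the missing part is exactly the ‘pred_zero’ rate.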
Training with Relu made the models more cautious because with this activation function as output they are allowed to be cautious. Of course, we could determine a threshold by optimizing for maximum F1-score on the training data, or use other methods to find the best threshold for our task, and apply that threshold to all the models. This way we would give uncertain regions to the no-activation model as well, but the Relu models did that without any postprocessing.
The Relu output models predicted lots of zeros. Because of that, their directional accuracies weren’t as high as the positive or negative directional accuracy of the no-activation model, but their inaccuracy scores were close or lower.
We can go further with our Relu models to forecast uncertain zones. If we examine the prediction of the Relu models together we can have predictions where:
- The positive and negative Relu outputs are both zero: we forecast zero log-return.
- One prediction is zero, the other isn’t: we forecast the non-zero log-return.
- Both predictions are non-zeros. The Relu output models predict opposite directions. This is an uncertain situation. We predict zero log-return.
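The three rules above can be written as one vectorized expression. This is a sketch with invented predictions and a hypothetical function name, assuming the positive model outputs values ≥ 0 and the negative model outputs values ≤ 0.

```python
import numpy as np

def joint_relu_forecast(pos_pred, neg_pred):
    """Combine the positive-Relu and negative-Relu model outputs."""
    pos_pred = np.asarray(pos_pred, dtype=float)
    neg_pred = np.asarray(neg_pred, dtype=float)
    # Both models fire in opposite directions: treat the step as uncertain.
    conflict = (pos_pred > 0) & (neg_pred < 0)
    # Otherwise at most one of them is non-zero, so the sum picks it
    # (or gives 0.0 when both are zero).
    return np.where(conflict, 0.0, pos_pred + neg_pred)

pos = np.array([0.0, 0.4, 0.0, 0.3])     # positive-Relu model
neg = np.array([0.0, 0.0, -0.2, -0.5])   # negative-Relu model
joint = joint_relu_forecast(pos, neg)
print(joint)
```

The four steps cover the cases in order: both zero, only positive, only negative, and conflicting predictions.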
After building the above joint-model we check the rate of accurate and inaccurate model directions. (_dbl is our joint model)
The joint model had fewer accurate predictions than the no-activation model but had fewer inaccurate predictions as well. The rate of accurate/inaccurate predictions was 2.42 with the no-activation model, and 3.03 with the joint-Relu model considering both directions. This is very good. (Seems too good to me, so if you discover an error in my logic or data preparation, please tell me!)
The no-activation model always had to choose a side, but the joint-Relu model was able to give 0.0 log-returns in uncertain situations (about 20 percent of the steps), and this helped the model to improve the accurate/inaccurate prediction rate.
Oh, and I almost forgot to mention that with the log-return forecast we outperformed the base model. MAE of the no-activation model: 0.2835, MAE of the base: 0.3555.
The joint Relu model helped us to discover some uncertain situations, but a more sophisticated way to describe uncertainty is to predict distributions or probabilities instead of simple values.
What does a distribution represent in our problem? It represents a belief, the subjective belief of our model about the possible outcomes of the future.
Nassim Nicholas Taleb wrote in his book, Fooled by Randomness: “Probability is not a mere computation of odds on the dice or more complicated variants; it is the acceptance of the lack of certainty in our knowledge and the development of methods for dealing with our ignorance.”
This statement is closely related to Kahneman’s “What You See Is All There Is” problem I quoted earlier.
Probability has an objectivist or frequentist view and a subjectivist or Bayesian view. You can read a bit more about them on Wikipedia. All probability problems can be approached from both views, but some problems are better viewed through objectivist glasses, and others through subjectivist glasses.
For example, tossing a fair coin or playing most games in a casino involves uncertainty, but that uncertainty is well modeled by math, and because of the law of large numbers the casino will never lose on average (or only if it is led by idiots). From the view of the casino, it isn’t really an uncertain business (if we don’t count tax, competitors, COVID-19, and politics).
But in most real-world examples we have to account for very diverse factors. Some of them are hard to represent with numbers (but we have to if we want to build models on that information), and there are situations that occur only once in history. Try to train a model on that.
Predicting forex prices or the stock market is better viewed through our subjectivist glasses. There are lots of factors we don’t know. Different people or algorithms have different information about the system, but nobody knows all the factors, and the system is continuously changing, so something that was true yesterday may not be true tomorrow. Or I could say that the distribution of our forecasted time-range will be different from the distribution of the historical data (most of the time). Of course, we don’t hope to build a perfect model; we just want to build one that is good enough for a task, and for a time.
Here we will build Wavenet models with bimodal normal distribution outputs. (Sorry Taleb.) It is possible to use lots of other distributions of the Tensorflow probability library to build mixture distributions, but my goal was to make predictions where the output had double peaks at some steps, so a bimodal normal seemed satisfactory. (You can see a mixture density network with multimodal forecasts here.)
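To show what a two-peaked normal output means numerically, here is a small NumPy sketch of a mixture of two normal distributions and its negative log-likelihood. It only illustrates the math: the models in this article use the mixture layers of the Tensorflow probability library, and the weights, locations, and scales below are invented.

```python
import numpy as np

def normal_pdf(x, loc, scale):
    return np.exp(-0.5 * ((x - loc) / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, weights, locs, scales):
    """Density of a mixture of normals; the weights must sum to 1."""
    x = np.asarray(x, dtype=float)[..., None]   # broadcast over components
    return np.sum(weights * normal_pdf(x, locs, scales), axis=-1)

def mixture_nll(y, weights, locs, scales):
    """Negative log-likelihood, a usual training loss for such outputs."""
    return -np.log(mixture_pdf(y, weights, locs, scales))

# Invented parameters: equal belief in a move down and a move up.
weights = np.array([0.5, 0.5])
locs = np.array([-1.0, 1.0])
scales = np.array([0.3, 0.3])

grid = np.linspace(-4.0, 4.0, 801)
density = mixture_pdf(grid, weights, locs, scales)
area = np.sum(density) * (grid[1] - grid[0])    # should be close to 1

print(area)
print(mixture_nll(1.0, weights, locs, scales))  # near a peak: low loss
print(mixture_nll(0.0, weights, locs, scales))  # between the peaks: high loss
```

A single normal with the same mean would put its peak at 0.0, exactly where this mixture says the outcome is least likely.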
For the first model I used the following architecture:
The three Mixture normal layers are for the three currency pairs, and we have to make the Wavenet output compatible with our MixtureNormal layers.