Journey through NLP and Time Series to Predict Gold Price

Source: Deep Learning on Medium

Data exploration

Every data science project needs data. What data have I collected?

  • 1900+ news articles from January 2014 to August 2019
  • 1400+ price points for each of the six asset classes chosen for the same period

To prepare for preprocessing, I load the articles from JSON format into a Pandas data frame. Then, for my target, I calculate the gold price fluctuation over one trading day. I merge the two data frames to align news with prices.
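A minimal sketch of these three steps, assuming pandas is installed. The column names (`date`, `text`, `gold`) and the tiny inline records are hypothetical stand-ins for the real articles and price history, and joining on the calendar date is my assumption about how the alignment works.

```python
import json
import pandas as pd

# Hypothetical stand-in for the scraped articles (the real set has 1900+ records).
raw = (
    '[{"date": "2019-08-01", "text": "Fed cuts rates"},'
    ' {"date": "2019-08-02", "text": "Trade tensions rise"}]'
)
news = pd.DataFrame(json.loads(raw))
news["date"] = pd.to_datetime(news["date"])

# Hypothetical gold closing prices for three consecutive trading days.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-31", "2019-08-01", "2019-08-02"]),
    "gold": [1400.0, 1414.0, 1442.28],
})

# One-trading-day fluctuation: fractional change from the previous close.
prices["gold_change"] = prices["gold"].pct_change()

# A left join keeps every article, even when no price exists for its date.
merged = news.merge(prices, on="date", how="left")
print(merged[["date", "text", "gold_change"]])
```

With a left join, articles published on non-trading days survive the merge with empty price cells, which then have to be filled in during preprocessing.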

Data insights

As I process the two types of data, I discover a few interesting points.

From the word cloud of all the news, I make three observations.

  • It is all about who says what.
  • Central bank is the boss.
  • The ‘invisible hand’: market forces are just as important.

Moreover, there is some relationship between gold price fluctuation and news volume. From February to November, the gold price fluctuates relatively less when news volume is low, which suggests a positive correlation between news volume and gold price volatility. The only exception is December to January, when news volume peaks but gold price fluctuation is not as high as in August.

The above are observations from me, a human. Then what can a machine tell us?

Data preprocessing

At this stage, I am dealing with three sets of data: news, asset prices, and my target.

  • News: Before lemmatization, I remove numbers, punctuation, leading and trailing white space, and stop words, and convert all text to lower case. After looking at the lengths of all documents, I pad every document with zeros to a uniform length of 24,100 words. I choose this cutoff because only five of the 1900 documents exceed it.
  • Asset prices: When I merge the two data frames, news published on non-trading days has no corresponding prices. Therefore, I fill the NaN cells with the price from the previous trading day.
  • Target: Based on the one-trading-day fluctuation, I classify changes within ±1% as ‘stay’, above +1% as ‘up’, and below -1% as ‘down’. In future work, I may narrow the ‘stay’ band to ±0.5% to create more balanced classes.
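The three preprocessing steps above can be sketched as follows. This is a simplified illustration, not my exact pipeline: the stop-word list is a toy one, lemmatization is left out, padding on the right is an assumption, and all helper names (`clean_text`, `pad_ids`, `label_change`) are hypothetical.

```python
import re
import string
import pandas as pd

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}  # toy list (assumption)

def clean_text(text):
    """Lower-case, drop digits and ASCII punctuation, and remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() also discards leading and trailing white space
    return [w for w in text.split() if w not in STOP_WORDS]

def pad_ids(token_ids, max_len):
    """Truncate or right-pad a list of token ids with zeros to max_len."""
    return token_ids[:max_len] + [0] * max(0, max_len - len(token_ids))

def label_change(change, band=0.01):
    """Map a one-day fractional change to 'up', 'down', or 'stay'."""
    if change > band:
        return "up"
    if change < -band:
        return "down"
    return "stay"

tokens = clean_text("The Fed raised rates by 25 basis points in 2018!")
padded = pad_ids([5, 3], max_len=4)

# Forward fill: a non-trading day inherits the previous trading day's price.
prices = pd.Series([1500.0, None, None, 1520.0])
filled = prices.ffill()

labels = [label_change(c) for c in (0.015, -0.02, 0.005)]
print(tokens, padded, filled.tolist(), labels)
```

Narrowing the ‘stay’ band is then a one-argument change, for example `label_change(c, band=0.005)`.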

Deep learning models

Before diving into the deep learning model that I use, I want to share a bit of the decision process that led me to a multi-input neural network model. My initial plan is to choose between two methods. One is to use a multi-input neural network model. The other is to use a recurrent neural network for text and a multivariate time series model for financial data, and then to combine the results with specific rules. I attempt the second approach first. As the results of the time series VAR (vector autoregression) model are rather unimpressive, I switch to the more sophisticated alternative.

A few more steps before training the models: standard scaling of the financial data, one-hot encoding of the targets, tokenizing the texts, preparing an embedding matrix from Stanford’s pretrained GloVe vectors (glove.6B.100d), and finally splitting the data into train, validation, and test sets.
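Here is a rough NumPy sketch of what these steps compute. It is not the code I ran: the helper names are hypothetical, the 4-dimensional toy vectors stand in for the real 100-dimensional GloVe vectors, and fitting the scaler on the training rows only is an assumption about good practice rather than a record of what I did.

```python
from collections import Counter
import numpy as np

def standard_scale(train, other):
    """Scale both arrays with the mean and std of the training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (other - mean) / std

def one_hot(labels, classes=("down", "stay", "up")):
    """Turn string labels into one-hot rows, one column per class."""
    index = {c: i for i, c in enumerate(classes)}
    out = np.zeros((len(labels), len(classes)), dtype=np.float32)
    for row, label in enumerate(labels):
        out[row, index[label]] = 1.0
    return out

def build_word_index(docs):
    """Give each word an integer id starting at 1; id 0 is reserved for padding."""
    counts = Counter(w for doc in docs for w in doc)
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the pretrained vector of word id i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix

train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[5.0, 20.0]])
train_s, test_s = standard_scale(train, test)

targets = one_hot(["up", "down"])

docs = [["gold", "fed", "gold"], ["fed", "gold", "xyzzy"]]
word_index = build_word_index(docs)
toy_vectors = {"gold": np.ones(4), "fed": np.full(4, 0.5)}  # stand-in for GloVe
embedding = build_embedding_matrix(word_index, toy_vectors, dim=4)
print(word_index, embedding.shape)
```

Words missing from the pretrained vocabulary (here the made-up token `xyzzy`) keep an all-zero row, so the embedding layer treats them like padding unless the embeddings are fine-tuned.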

The multi-input neural network is structured as shown below. For the text input, I compare the performance of a convolutional neural network (CNN) and a sequence-to-sequence recurrent neural network (encoder-decoder RNN), while keeping the same LSTM model structure for the time series of financial prices.
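For readers who want to see the idea in code, here is a hedged sketch of a two-branch model with a CNN text branch and an LSTM price branch, written with the Keras functional API and assuming TensorFlow is installed. The layer sizes, the price window length, and the function name `build_multi_input_model` are illustrative assumptions, not my exact architecture. Only the six asset classes, the three target classes, and the 100-dimensional embeddings come from the project itself.

```python
from tensorflow.keras import layers, Model

def build_multi_input_model(vocab_size, seq_len, n_assets, window,
                            embed_dim=100, n_classes=3):
    # Text branch: embedding -> 1D convolution -> global max pooling.
    text_in = layers.Input(shape=(seq_len,), name="news_tokens")
    x = layers.Embedding(vocab_size, embed_dim)(text_in)
    x = layers.Conv1D(64, kernel_size=5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)

    # Price branch: LSTM over a window of scaled asset prices.
    price_in = layers.Input(shape=(window, n_assets), name="asset_prices")
    y = layers.LSTM(32)(price_in)

    # Merge both branches and classify into up / stay / down.
    merged = layers.concatenate([x, y])
    merged = layers.Dense(32, activation="relu")(merged)
    out = layers.Dense(n_classes, activation="softmax")(merged)

    model = Model(inputs=[text_in, price_in], outputs=out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Small sizes keep the demo fast; the real documents are padded to 24,100 tokens.
model = build_multi_input_model(vocab_size=5000, seq_len=200,
                                n_assets=6, window=10)
model.summary()
```

Swapping the text branch for a recurrent one only changes the three lines between the embedding and the merge, which is what makes the CNN versus RNN comparison straightforward.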

Model performances:

The CNN model takes only 20 seconds per epoch to train, about one-twentieth of the time the RNN model needs. Moreover, the RNN model’s accuracy and F1 score plateau quickly after the first epoch, and it performs poorly on test data because it classifies everything as the majority class.

Below is the normalized confusion matrix of the CNN model on test data.
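For readers unfamiliar with the term, a normalized confusion matrix divides each row of the raw counts by the number of true samples in that class, so each row sums to one. The sketch below uses made-up labels, not results from this project, and the function name is hypothetical. It shows how a model that always predicts the majority class appears in such a matrix.

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predictions, each row sums to 1."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave rows of absent classes at zero
    return counts / row_sums

# Toy labels: 0 = down, 1 = stay, 2 = up.
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]  # a model that always predicts 'stay'

cm = normalized_confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
```

Every row puts all of its mass in the ‘stay’ column, which is the pattern a majority-class classifier produces, even though its raw accuracy here would be 50%.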

My journey

Given my initial results, there is plenty of room for improvement. Nevertheless, given that this is my first attempt at a customized multi-input deep learning model and that I have only three weeks, I find this journey to be both challenging and extremely rewarding. I plan to explore this subject further and will post an update once I do.