My experiments with Data Science techniques to beat the stock market

QuantQuest is an online competition organized by Auquan in which you solve trading problems using standard data science, math, and statistics. I participated in QuantQuest II in September 2017, and this post is an attempt to document some of my methods and learnings.

I am a recent Computer Science graduate from IIT Kanpur. I first heard about QuantQuest through the IITK placement cell during our final-year placement season. During my mid-semester recess, while I was relaxing at home, I received an email from Ms. Chandini Jain (founder of Auquan) with the subject “Optiver Amsterdam is recruiting traders from IIT Kanpur”. As with other recruiter spam, I was going to ignore it at first. But I knew that Optiver was recruiting in our placement season, so I opened the email out of curiosity and found a link to register for QuantQuest. I opened the link in incognito mode, as I am very security-conscious. There I looked at the prizes and read about previous QuantQuest winners (especially an IITK senior who ranked first in the last competition). That motivated me to participate in the competition.

After mustering enough motivation, I opened the first problem. At first, due to so many unfamiliar financial terms, the problem went over my head. Then I read it again carefully and realized it was just a regression problem.

Right after that, I read the second problem, which was a classification problem.

Getting Set Up and Understanding the Data: Weird Feature Names and What to Do with Them?

As per the instructions provided, I downloaded the toolbox and the problem template in which I was supposed to write my code. At first the toolbox was a black box to me, so I played with it by printing the ‘features’ and the ‘target variable’. I found that the toolbox processes the data one timestamp at a time, and that you can only see a certain number of data points back from the current timestamp. I realized that the stock data is time series data (this was stated in the problem definition, but I had somehow missed it). On its first run, the toolbox also downloaded the stock data, which contained several columns with weird names along with the target variable.

Then I read the wiki page of the Auquan toolbox, which explains how to compute new features with the toolbox and uses those weird names. I googled the financial terms too, but wasn’t able to put all the pieces together. However, I had done some ML courses and projects, so I knew how to create handcrafted features from the original ones (and the wiki page taught me more ways). The toolbox came in very handy for calculating new features, as all the logic to compute them is already implemented in it.

Initial Attempts: Started with the simplest model you can think of!

To solve the first problem, I decided to begin with a simple linear regression model. Since I was not familiar with the names of the original features, I used all of them and computed new features from each one using the toolbox, such as the moving average, standard deviation, minimum, and maximum over some period. I used the linear regression model from the scikit-learn Python package, which is quite fast to fit. To speed things up further, I started with only the top two stocks in the stock list. As the toolbox uses one prediction at a time for trading, and the features of the last several timestamps are also available, I used this available data to fit the model. Basically, in my first attempt, I was simply fitting the linear regression model at each timestamp on the past available data and predicting the fair value at the current timestamp. This model gave me some competition score and pnl (profit & loss), but it was not good enough to put me on the leaderboard.
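A minimal sketch of this first attempt, using pandas instead of the Auquan toolbox (whose feature API is not reproduced here). The data is synthetic, and the names `rolling_features`, `predict_latest`, `period` and `lookback` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def rolling_features(raw: pd.DataFrame, period: int = 5) -> pd.DataFrame:
    """Moving average, std, min and max of every raw column over `period` rows."""
    roll = raw.rolling(period)
    parts = {
        "ma": roll.mean(),
        "std": roll.std(),
        "min": roll.min(),
        "max": roll.max(),
    }
    out = pd.concat(
        [df.add_suffix(f"_{name}{period}") for name, df in parts.items()], axis=1
    )
    # The first `period - 1` rows have incomplete windows, so drop them.
    return out.dropna()


def predict_latest(features: pd.DataFrame, target: pd.Series, lookback: int = 50) -> float:
    """Refit on the `lookback` rows before the current one, then predict the current row."""
    past_X = features.iloc[-lookback - 1:-1]
    past_y = target.loc[past_X.index]
    model = LinearRegression().fit(past_X, past_y)
    return float(model.predict(features.iloc[[-1]])[0])


# Synthetic stand-in for one stock: three unnamed raw features and a target.
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(200, 3)).cumsum(axis=0), columns=["f1", "f2", "f3"])
target = raw["f1"] * 0.5 + rng.normal(scale=0.1, size=200)

feats = rolling_features(raw)
prediction = predict_latest(feats, target)
print(feats.shape, prediction)
```

In a live setting `predict_latest` would be called once per timestamp, which is why refitting a cheap model like linear regression every step is affordable.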

My next approach, then, was to first fit the linear regression model on the whole training dataset and then use it to make predictions at each timestamp. But the toolbox computes the features of one data point at a time, and I wanted the features of all data points (timestamps) at once. At the time, I felt this was one of the shortcomings of the toolbox. Now I realize the toolbox is designed to predict the target variable from the features and to trade and place orders in real time (or to mimic real-time trading).

A simple workaround was to first run the toolbox with a random or constant prediction, computing the features at each timestamp and appending them to an array along with the target variable. Then I could use all the data to fit the model, save it, and use it directly to predict the target variable in subsequent runs.
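The two-pass workaround can be sketched roughly as follows. The loop stands in for the toolbox handing over one timestamp at a time; `FeatureRecorder`, the file name and the synthetic data are assumptions for illustration, not part of the toolbox.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression


class FeatureRecorder:
    """Collects one (feature row, target) pair per timestamp during the first pass."""

    def __init__(self):
        self.rows = []
        self.targets = []

    def record(self, feature_row, target_value):
        self.rows.append(np.asarray(feature_row, dtype=float))
        self.targets.append(float(target_value))

    def as_arrays(self):
        return np.vstack(self.rows), np.asarray(self.targets)


# Pass 1: run through every timestamp with a dummy prediction, only recording data.
rng = np.random.default_rng(1)
true_w = np.array([0.5, -1.0, 2.0])  # hidden relationship in the synthetic data
recorder = FeatureRecorder()
for _ in range(300):
    x = rng.normal(size=3)
    y = x @ true_w + rng.normal(scale=0.01)
    recorder.record(x, y)

# Fit once on everything that was recorded, then save the model to disk.
X, y = recorder.as_arrays()
model = LinearRegression().fit(X, y)
model_path = Path(tempfile.gettempdir()) / "linreg_model.pkl"
model_path.write_bytes(pickle.dumps(model))

# Pass 2 (a later run): load the saved model and predict directly at each timestamp.
loaded = pickle.loads(model_path.read_bytes())
next_prediction = float(loaded.predict(rng.normal(size=(1, 3)))[0])
print(loaded.coef_, next_prediction)
```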

Alternatively, one could read the stock CSV files manually and compute the features directly, which would be much faster and more efficient than using the toolbox. But that would also require writing your own feature computation logic, and I didn’t want to spend that much time on the problem. So I chose the long way with much less hassle, which wasn’t really very long anyway. When I fit the linear regression model on the whole data and used it for prediction, I got pretty decent results that took me onto the leaderboard. At this point, since the stocks were completely unknown to me, I simply selected the top 20–22 stocks from the stock list to make my submission valid.

Time to improve performance: Let’s get smarter

After a few hours, I saw several new names above me on the leaderboard, which compelled me to improve my model further. Fortunately, I was taking a probabilistic machine learning course that semester, which covered several non-linear regression methods. So I used a Gaussian Process (GP) with different kernels to introduce non-linearity into my regression model. Unexpectedly, the results (score and pnl) became worse. I didn’t spend much time figuring out what was going wrong with the GP model; instead, I tried using the linear regression model in an online fashion, which gave me very similar results.
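The post does not say how the online variant was implemented. One possible way to do it, shown here purely as an assumption, is scikit-learn's `SGDRegressor` with `partial_fit`, which updates a linear model one sample at a time. The data and the learning-rate settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(5)
true_w = np.array([1.0, -2.0, 0.5])  # hidden relationship in the synthetic stream

online_model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
errors = []
for t in range(2000):
    x = rng.normal(size=(1, 3))                      # features at this timestamp
    y = x @ true_w + rng.normal(scale=0.05, size=1)  # target revealed afterwards
    if t > 0:
        # Predict first, using only what the model has learned so far.
        errors.append(abs(float(online_model.predict(x)[0]) - float(y[0])))
    # Then update the model with the newly observed pair.
    online_model.partial_fit(x, y)

early_error = float(np.mean(errors[:50]))
late_error = float(np.mean(errors[-50:]))
print(early_error, late_error, online_model.coef_)
```

Predicting before updating at each step keeps the evaluation honest: every error is measured on a data point the model has not yet seen.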

Turns out the simpler, the better… what’s next then?

I decided to use polynomial features along with the handcrafted features (moving average, standard deviation, etc.) to add non-linearity to my simple linear regression model. I experimented with polynomials of degree up to 4, but even here only the quadratic gave a better result in my case.
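A small sketch of this experiment with scikit-learn's `PolynomialFeatures`, on synthetic data that happens to contain a quadratic term. The split is chronological (no shuffling), and the numbers are illustrative rather than competition results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=400)

# Chronological split: train on the first 300 rows, evaluate on the last 100.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

scores = {}
for degree in (1, 2, 3, 4):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    scores[degree] = model.score(X_test, y_test)  # R^2 on the held-out rows

for degree, r2 in scores.items():
    print(degree, round(r2, 4))
```

The model stays linear in its coefficients; only the inputs are expanded, so fitting remains as cheap as ordinary linear regression.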

At this point, I also added a standard scaler to standardize the data to zero mean and unit variance. Since I was computing new features using the toolbox on all the original features, this resulted in a very large feature set. Such a large feature set with many unimportant features may have an adverse effect on the linear regression model. To select only the important features, I looked at the correlation between the original features and the target variable (fair value). By manually inspecting the data, I removed the features that were not changing much and had a low correlation score. (scikit-learn's feature selection utilities can do this for you quite simply.) This substantially reduced the size of the final feature set. All these changes improved my results and also my position on the leaderboard.
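The manual filtering described above can be approximated in a few lines. This is a sketch under assumptions: the thresholds `min_abs_corr` and `min_std`, the helper `select_by_correlation` and the toy columns are all hypothetical, and Pearson correlation is assumed as the correlation score.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def select_by_correlation(features, target, min_abs_corr=0.1, min_std=1e-8):
    """Keep columns that vary and whose absolute correlation with the target is high enough."""
    stds = features.std()
    varying = features.loc[:, stds > min_std]  # drop (near-)constant columns
    corr = varying.corrwith(target).abs()      # Pearson correlation per column
    return corr[corr >= min_abs_corr].index.tolist()


rng = np.random.default_rng(3)
n = 500
signal = rng.normal(size=n)
features = pd.DataFrame({
    "useful": signal + rng.normal(scale=0.1, size=n),  # tracks the target
    "noise": rng.normal(size=n),                       # unrelated to the target
    "constant": np.full(n, 7.0),                       # never changes
})
target = pd.Series(2.0 * signal + rng.normal(scale=0.1, size=n))

kept = select_by_correlation(features, target, min_abs_corr=0.3)

# Standardize the surviving columns to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features[kept])
print(kept, scaled.shape)
```

In a real pipeline the scaler should be fit on the training portion only and then reused on later data, so that no future information leaks into the model.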

At this point, I realized that I still didn’t know what those weird names meant. And it turned out that I didn’t need to know them at all to develop a decent solution to the problem!

One last attempt at improvement: LSTM!

While I was fitting my model on the training dataset and testing it on the QuantQuest server, I started reading articles on how to use LSTM, a type of recurrent neural network with a gating mechanism, on time series data. Through my courses, I knew that recurrent neural networks, with their ability to remember part of the past, can be used for time series prediction.

By this time, QuantQuest was over, but I was hooked! I implemented a simple LSTM model in Keras and trained it on the whole training dataset. The LSTM model met my expectations and gave slightly better results than the previous models. I also experimented with several variants of the LSTM model, including stateful versions and versions that look back over a number of timesteps.
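Keras LSTM layers expect 3D input of shape (samples, timesteps, features), so "looking back" over some timesteps means reshaping the flat feature table into overlapping windows first. The sketch below shows only that preparation step in NumPy; `make_windows` and `lookback` are illustrative names, and the model itself is not reproduced here.

```python
import numpy as np


def make_windows(features, target, lookback):
    """Turn a (timestamps, n_features) table into LSTM-style windows.

    Window k holds rows k .. k + lookback - 1 and is paired with the target
    at its last row, so each sample sees only the current and past timestamps.
    """
    n = len(features)
    X = np.stack([features[i - lookback:i] for i in range(lookback, n + 1)])
    y = target[lookback - 1:]
    return X, y


# Toy data: 10 timestamps, 2 features, target equal to the timestamp index.
features = np.arange(20, dtype=float).reshape(10, 2)
target = np.arange(10, dtype=float)

X, y = make_windows(features, target, lookback=3)
print(X.shape, y.shape)  # (8, 3, 2) (8,)
```

The resulting `X` has the layout a Keras LSTM layer accepts when built with an input shape of `(lookback, n_features)`.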

Even up to this point, I had put astonishingly little thought into the stocks I was modeling. I was sticking with my original choice of 20 stocks and training one LSTM model for each stock. After a discussion with Chandini about my submission, I decided to cluster stocks that behave somewhat similarly and train a common LSTM model for all the stocks in a cluster. I used the simple K-means algorithm on the original feature set of the stocks for clustering and trained one LSTM model per cluster. I submitted this solution to them long after QuantQuest was over. On further analysis, we found that for some stock clusters this approach performed much better than any of my previous models.
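A rough sketch of the clustering step with scikit-learn's `KMeans`. The post does not say how each stock was turned into a single vector, so summarizing every stock by the mean and standard deviation of its features is an assumption made here, as are the stock names and the synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Hypothetical data: six stocks, each a (timestamps, features) array.
# The first three trade around 10, the last three around 100.
stock_data = {}
for i in range(6):
    level = 10.0 if i < 3 else 100.0
    stock_data[f"STOCK_{i}"] = rng.normal(loc=level, scale=level * 0.02, size=(250, 4))

# Assumption: describe each stock by the mean and std of every feature.
names = list(stock_data)
summary = np.array([
    np.concatenate([stock_data[name].mean(axis=0), stock_data[name].std(axis=0)])
    for name in names
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(summary)
)

# Group stock names by cluster; one shared model would then be trained per group.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(name)
print(clusters)
```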

Final thoughts:

This competition gave me the perfect opportunity to apply what I was learning in the classroom to a real-world problem. I really enjoyed experimenting with the different machine learning techniques I was studying in my courses that semester and applying them to the competition problems.

At the beginning of the competition, I had no idea I could solve a stock market problem with zero trading knowledge.

My main learnings would be:

  • There was little need to know what the stocks are, which stocks the data is provided for, or what the features mean. Knowing how to work with time series data is sufficient.
  • Feature selection is quite important…
  • As is selecting the right stocks. I could have improved my results simply by paying more attention to which stocks my model could predict well.
  • Simple is better. The models that worked best for me were usually quite simple.

As a side note, I also ended up learning some new things about finance and how stock markets operate through this competition.

Source: Deep Learning on Medium