Original article was published by Suraj Parmar on Artificial Intelligence on Medium
However, Numerai data is different. The problem is predicting the stock market, but what makes it unique is that the data is obfuscated and already cleaned! We don’t know which row corresponds to which stock. The rows are grouped into eras that represent different points in time. Still, as long as the data has structure, we can certainly try to learn patterns from it.
Numerai gives this cleaned data to data scientists and asks them to provide better estimates for the data. These crowd-sourced predictions are used to build a meta-model and to invest in real stock markets around the world. The incentives are based on the quality of your predictions and the amount of NMR you have staked. You earn a percentage of your stake if your predictions help to make a profit; otherwise, your stake gets burned. This earn/burn system keeps motivating participants to make better and more unique predictions. So, the more accurate and/or unique the predictions, the higher the returns. This is what makes the problem interesting and complex (the "hardest data science problem").
Let’s address this problem on Google Colab with an end-to-end walk-through using a simple yet very effective technique: CatBoost. I’ll be explaining the Colab snippets here, so it would be really helpful to open the notebook link in a new tab alongside this article.
- Load data set (and some operations that you’ll need)
- Define a model
- Train a model
4.1 Tweak something (back to step 1)
- Predict and submit
5.1 Observe the performance over 4 weeks
Setting up Colab
We’ll need to switch the runtime to use GPU by going to
Runtime -> Change runtime type -> GPU -> Save
Colab comes preinstalled with many data science libraries. We’ll need to install the additional ones this walk-through uses (CatBoost and numerapi are both used later on).
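Before installing anything, it can help to check what the runtime already has. The snippet below is a minimal sketch, not the notebook's own cell: the package list is an assumption based on the tools this article mentions, and the helper name `missing_packages` is made up for illustration.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list, taken from the tools named in this walk-through.
required = ["catboost", "numerapi"]
to_install = missing_packages(required)

if to_install:
    # In a Colab cell, the install itself is a shell command prefixed with "!".
    print("Run in a Colab cell: !pip install " + " ".join(to_install))
else:
    print("All required packages are already available.")
```

Running the check first keeps "Run all" fast on runtimes where the packages are already present.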
We’ll go through setting up your pipeline in Colab and making it flexible enough to run experiments there and submit the predictions using API keys. After that, all you need to do is press
Run all in Colab once you have set up the keys and finalized a model.
Again, make sure you have opened the notebook alongside this article.
Loading data 📊
The tournament data already contains validation sets (val1 and val2). We usually evaluate our model’s predictions on this subset with the goal of performing well on unseen data.
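To make the train/validation split concrete, here is a small sketch that groups rows by data type and by era. It uses only the standard library on a tiny in-memory sample so it runs anywhere; the notebook itself works on the real files. The column names (`era`, `data_type`, `feature_*`, `target`) and the sample values are assumptions for illustration, not the actual tournament schema.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for the tournament file; columns and values are illustrative.
SAMPLE_CSV = """id,era,data_type,feature_a,feature_b,target
r1,era1,train,0.25,0.50,0.50
r2,era1,train,0.75,0.00,0.25
r3,era2,train,0.50,1.00,0.75
r4,era121,validation,0.00,0.25,0.50
r5,era121,validation,1.00,0.75,1.00
"""


def load_rows(handle):
    """Parse CSV rows, converting feature and target columns to floats."""
    rows = []
    for row in csv.DictReader(handle):
        for key in list(row):
            if key.startswith("feature_") or key == "target":
                row[key] = float(row[key])
        rows.append(row)
    return rows


def split_by(rows, column):
    """Group rows by the value of `column`, preserving row order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


rows = load_rows(io.StringIO(SAMPLE_CSV))
by_type = split_by(rows, "data_type")
train, validation = by_type["train"], by_type["validation"]
train_eras = split_by(train, "era")
print(len(train), "training rows,", len(validation), "validation rows")
print("training eras:", sorted(train_eras))
```

Keeping the era grouping around is useful later, because predictions are usually scored era by era rather than over the whole set at once.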
Defining and training a model 🤖⚙️
This is probably the part where most of your observations and tuning will happen. You should experiment with other types of modeling algorithms.
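As a starting point for that experimentation, the sketch below shows one way to define and fit a CatBoost model. It assumes a regression setup with `CatBoostRegressor`; the hyperparameter values and the helper names `make_params` and `train_model` are illustrative guesses, not the notebook's actual settings. The import is done lazily so the parameter helper can be inspected even where CatBoost is not installed.

```python
def make_params(use_gpu=True):
    """Build CatBoost hyperparameters; the values are illustrative starting points."""
    params = {
        "iterations": 1000,
        "learning_rate": 0.03,
        "depth": 6,
        "loss_function": "RMSE",
        "random_seed": 0,
    }
    if use_gpu:
        # Matches the GPU runtime selected in Colab; omit this key to train on CPU.
        params["task_type"] = "GPU"
    return params


def train_model(X_train, y_train, X_val, y_val, use_gpu=True):
    """Fit a CatBoost regressor and report progress on the validation set."""
    from catboost import CatBoostRegressor  # lazy import: needs `catboost` installed

    model = CatBoostRegressor(**make_params(use_gpu))
    # eval_set lets CatBoost print validation loss alongside training loss.
    model.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=100)
    return model


print(make_params(use_gpu=False))
```

Keeping the hyperparameters in one function makes the "tweak something" loop a one-line change followed by a re-run.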
Making and evaluating predictions 📐
Don’t get overwhelmed by the amount of code here. It is mostly boilerplate that helps in evaluating predictions, and you probably won’t need to change much. However, you might want to add more metrics for better evaluation once you feel comfortable with the tournament.
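To give a feel for what such evaluation code does, here is a minimal sketch of one common metric: rank (Spearman) correlation between predictions and targets, computed separately for each era. This is an assumption about a reasonable metric, not a copy of the notebook's evaluation code, and the sample values are made up.

```python
from statistics import mean, pstdev


def rank(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(preds, targets):
    """Spearman correlation is Pearson correlation applied to the ranks."""
    return pearson(rank(preds), rank(targets))


def per_era_scores(eras, preds, targets):
    """Score each era separately and return {era: correlation}."""
    grouped = {}
    for era, p, t in zip(eras, preds, targets):
        era_preds, era_targets = grouped.setdefault(era, ([], []))
        era_preds.append(p)
        era_targets.append(t)
    return {era: spearman(p, t) for era, (p, t) in grouped.items()}


# Made-up example: perfect ordering in era1, fully reversed ordering in era2.
eras = ["era1"] * 4 + ["era2"] * 4
preds = [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1]
targets = [0.0, 0.25, 0.5, 1.0, 0.0, 0.25, 0.5, 1.0]

scores = per_era_scores(eras, preds, targets)
print("per-era:", scores)
print("mean:", mean(scores.values()), "std:", pstdev(scores.values()))
```

Looking at the spread across eras, not just the mean, shows whether a model is consistently good or only good in some periods.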
Once you think your predictions are satisfying your goals, you can save and upload them with the help of
numerapi using your secret keys.
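Saving comes first. The sketch below writes ids and predictions to a CSV file with the standard library. The header names `id` and `prediction` are assumptions; check the tournament's expected submission format before uploading. The helper name `save_predictions` and the sample ids are made up for illustration.

```python
import csv
import os
import tempfile


def save_predictions(ids, predictions, path="predictions.csv"):
    """Write one row per id. Header names are assumed, not confirmed."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "prediction"])
        writer.writerows(zip(ids, predictions))
    return path


# Usage with made-up values, written to a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    out_path = save_predictions(
        ["r4", "r5"], [0.48, 0.53], os.path.join(folder, "predictions.csv")
    )
    with open(out_path, newline="") as handle:
        saved_rows = list(csv.reader(handle))
    print(saved_rows)
```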
Submitting the predictions📤
Although you can manually upload
predictions.csv, we’ll use the API for hassle-free submissions. Numerai lets you create keys for different purposes, but we’ll create a key for uploading predictions only.
To create new secret keys, go to
Settings -> Create API key -> select "Upload Predictions" -> Save
You’ll then be shown your keys. Save them somewhere safe.
Below is a sample key for submitting predictions.
You can have 10 models in one Numerai account. So, feel free to experiment with new techniques while keeping your well-performing models unchanged. You can use
numerapi to submit predictions for different models. You can see a list of your models in the options above Settings. You just need to copy the
model_id and paste it here.
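A submission step along these lines can be sketched as follows. It assumes the keys are stored in environment variables rather than pasted into the notebook; the helper names `read_credentials` and `submit` are hypothetical, and the `numerapi` import is lazy so the credential check can be tried without the package installed.

```python
import os


def read_credentials(env=os.environ):
    """Read the API key pair from environment variables (names are an assumption)."""
    public_id = env.get("NUMERAI_PUBLIC_ID")
    secret_key = env.get("NUMERAI_SECRET_KEY")
    if not public_id or not secret_key:
        raise RuntimeError("Set NUMERAI_PUBLIC_ID and NUMERAI_SECRET_KEY first.")
    return public_id, secret_key


def submit(file_path, model_id, env=os.environ):
    """Upload a predictions file for one model and return the API's response."""
    import numerapi  # lazy import: needs `numerapi` installed

    public_id, secret_key = read_credentials(env)
    napi = numerapi.NumerAPI(public_id=public_id, secret_key=secret_key)
    # model_id selects which of your models receives this submission.
    return napi.upload_predictions(file_path, model_id=model_id)


# Example call (not executed here; the model_id value is a placeholder):
# submit("predictions.csv", model_id="your-model-id")
```

Keeping the keys out of the notebook itself means you can share the notebook without leaking credentials.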
After uploading the predictions, you’ll see some metrics and information about your submission.
From my experience, it takes a couple of submissions to get up and running in the tournament. Once you have set up your workflow, all you need to do is to press
Run all in Google colab.
Your predictions will be tested on live data and given scores:
- CORR: Correlation between your predictions and live data
- Meta Model Contribution (MMC): An advanced staking option which incentivizes models that are unique in addition to high performing
You can stake your NMR on either CORR or CORR+MMC.
What’s next? 💭
There are a couple of things you can do to improve your performance. You get paid for the uniqueness of your predictions too.