Making a Continual ML Pipeline to predict Apple Stock with Global News (Python)


Simplicity is key.

Goal

In this tutorial we’ll make a Machine Learning pipeline that takes business news as input and generates predictions for the Apple stock price, re-training itself through time.

We’ll also measure how profitable it is in real life.

What we’ll do

  • Step 1: Set up technical prerequisites
  • Step 2: Get the data for daily Apple Stock since 2017
  • Step 3: Define and understand target for ML
  • Step 4: Blend business news with our data to predict the target
  • Step 5: Prepare our data and apply ML
  • Step 6: Measure and analyze results
  • Step 7: Break the data and train/test through time

Step 1. Prerequisites

  • Have Python 2.7+ or 3.5+ installed
  • Install pandas, sklearn and openblender (with pip)
$ pip install pandas OpenBlender scikit-learn

Step 2. Get the data

We’ll use this daily Apple Stock dataset

It has the daily high, low, open and close prices, as well as the percentage change during that day.

So let’s pull the data through the OpenBlender API. Open a Python script and run the following code:

# Import the libraries
import OpenBlender
import pandas as pd
import json

# Specify the action
action = 'API_getObservationsFromDataset'

# Specify your token and the 'Apple Inc. Price' id_dataset
parameters = {
    'token': 'YOUR_TOKEN_HERE',
    'id_dataset': '5d4c39d09516290b01c8307b',
    'date_filter': {"start_date": "2017-01-01T06:00:00.000Z",
                    "end_date": "2020-02-09T06:00:00.000Z"}
}

# Pull the data into a pandas DataFrame
df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']),
                  convert_dates=False,
                  convert_axes=False).sort_values('timestamp', ascending=False)
df.reset_index(drop=True, inplace=True)

Note: To get a token you have to create an account on openblender.io (free); you’ll find it in the ‘Account’ tab under your profile icon.

Now let’s look at the data:

df.head()

Check!

Step 3. Define and understand Target

Let’s plot the price and ‘change’:
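A minimal matplotlib sketch for this plot, assuming the DataFrame from Step 2 (with ‘timestamp’ holding unix seconds and the price/change columns named ‘close’ and ‘change’):

import matplotlib.pyplot as plt
import pandas as pd

df_plot = df.sort_values('timestamp')  # chronological order for plotting
dates = pd.to_datetime(df_plot['timestamp'].astype(float), unit='s')  # assuming unix seconds
fig, (ax_price, ax_change) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
ax_price.plot(dates, df_plot['close'])
ax_price.set_ylabel('Close price (USD)')
ax_change.plot(dates, df_plot['change'])
ax_change.set_ylabel('Daily change (%)')
plt.show()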

Now, what we want is to detect whether the price is going to increase or decrease on the next day, so we can buy or short accordingly.

The ‘change’ is the percentage increase or decrease between the opening and closing price, so it works for us.
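A quick sanity check of that definition on our data (this assumes ‘change’ is the intraday percent move and that the price columns are named ‘open’ and ‘close’):

# Recompute the daily percentage change from open/close (an assumption about
# how 'change' is defined; compare it against the dataset's own column)
df['change_check'] = 100 * (df['close'] - df['open']) / df['open']
print(df[['change', 'change_check']].head())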

Let’s define two target variables:

Positive POC: Where ‘change’ increased more than 0.5%

Negative POC: Where ‘change’ decreased more than 0.5%

Distribution of ‘change’

*Note that these are two different variables; we’ll make an ML model to try to predict each one separately and then make them work together.
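The rest of the tutorial only trains the positive-POC model, but to make the idea of them “working together” concrete, here’s a hypothetical sketch of a decision rule combining the two model scores (the decide helper and the 0.5 threshold are assumptions, not code from this post):

def decide(p_positive, p_negative, threshold=0.5):
    # p_positive / p_negative: scores from the two separate models
    if p_positive > threshold and p_negative <= threshold:
        return 'buy'    # we expect a > +0.5% change tomorrow
    if p_negative > threshold and p_positive <= threshold:
        return 'short'  # we expect a < -0.5% change tomorrow
    return 'hold'       # no clear signal, or the models disagree

print(decide(0.8, 0.1))  # -> 'buy'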

Step 4. Vectorize and Blend Business News

This is very simple to execute, but let’s try and understand what’s happening in the background.

What we want:

  1. We need to gather useful news data which conveniently (statistically speaking) relates to our target
  2. We want to blend it with our data in a way that the news align with the next day’s price ‘change’ (so the model can learn to predict the next day and we can actually use it)
  3. We want to transform it into numerical features so it can run through an ML model.

So let’s look at this Wall Street Journal News dataset:

And the USA Today Twitter news.

*Note: I picked these because they made sense, but you can search for hundreds of other ones.

Now let’s create a text vectorizer, which is a model on OpenBlender from which you can then pull vectorized words or groups of words as features, as if it were another dataset:

action = 'API_createTextVectorizerPlus'
parameters = {
    'token': 'YOUR_TOKEN_HERE',
    'name': 'Wall Street and USA Today Vectorizer',
    'sources': [
        {'id_dataset': "5e2ef74e9516294390e810a9",
         'features': ["text"]},
        {'id_dataset': "5e32fd289516291e346c1726",
         'features': ["text"]}
    ],
    'ngram_range': {'min': 1, 'max': 2},
    'language': 'en',
    'remove_stop_words': 'on',
    'min_count_limit': 2
}
response = OpenBlender.call(action, parameters)
response

From above, we specified the following:

  • name: We’ll name it ‘Wall Street and USA Today Vectorizer’
  • sources: The ids and source columns of the datasets to include as sources (in this case both have a single text column named ‘text’)
  • ngram_range: The minimum and maximum length, in words, of the token sets that will be generated (see the local sketch after this list)
  • language: English
  • remove_stop_words: So it eliminates stop-words from the source
  • min_count_limit: The minimum number of occurrences for a token to be included (one-time occurrences rarely help)
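OpenBlender runs this vectorization server-side, but conceptually the output is similar to scikit-learn’s CountVectorizer in binary mode. A local sketch of the idea (min_df is 1 here only because the toy corpus is two sentences; the vectorizer above uses a minimum count of 2):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["Apple releases a new iPhone",
         "Apple stock falls on weak iPhone sales"]
vec = CountVectorizer(ngram_range=(1, 2), stop_words='english',
                      binary=True, min_df=1)
X = vec.fit_transform(texts)
print(vec.get_feature_names_out())  # the 1- and 2-word tokens (n-grams)
print(X.toarray())                  # 1 if the n-gram appears in the text, 0 otherwise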

Now, if we go to our dashboard at OpenBlender, we can see the vectorizer:

It generated 4,999 n-grams, which are binary features for tokens of at most 2 words: “1” if the n-gram was mentioned and “0” otherwise.

Now we want the vectorized data compressed in 24 hour time lags and aligned with the Apple stock price from the next day.

*Note: To download all the vectorized data you need to pay about $5 by upgrading with the ‘Pay as you go’ option in OpenBlender. If you don’t upgrade you can still download a small part of the data and continue.

action = 'API_getObservationsFromDataset'
interval = 60 * 60 * 24  # One day

parameters = {
    'token': 'YOUR_TOKEN_HERE',
    'id_dataset': '5d4c39d09516290b01c8307b',
    'date_filter': {"start_date": "2017-01-01T06:00:00.000Z",
                    "end_date": "2020-02-09T06:00:00.000Z"},
    'aggregate_in_time_interval': {
        'time_interval_size': interval,
        'output': 'avg',
        'empty_intervals': 'impute'
    },
    'blends':
        [{"id_blend": "YOUR_ID_TEXT_VECTORIZER_HERE",
          "blend_class": "closest_observation",
          "restriction": "None",
          "blend_type": "text_ts",
          "specifications": {"time_interval_size": interval}
          }],
    'lag_feature': {'feature': 'change', 'periods': [1]}
}
df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']),
                  convert_dates=False,
                  convert_axes=False).sort_values('timestamp', ascending=False)
df.reset_index(drop=True, inplace=True)

This is the same service call as before but with some new parameters:

  • aggregate_in_time_interval: To aggregate the data by average over 24-hour intervals, imputing any intervals with no observations
  • blends: Join the aggregated 24-hour news data by time
  • lag_feature: We want the ‘change’ feature to be aligned with the news that happened in the previous 24 hours (see the toy example below)
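Conceptually, that lag is just a shift on the time-ordered frame. Here’s a toy reproduction of my reading of it (the exact API behavior may differ):

import pandas as pd

toy = pd.DataFrame({'timestamp': [3, 2, 1],        # descending, like our df
                    'change':    [0.8, -0.2, 1.1]})
# Each row holds one day's aggregated news; lag1_change attaches the *next*
# day's 'change' to that row, which on a descending sort is shift(1)
toy['lag1_change'] = toy['change'].shift(1)
print(toy)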

Let’s take a look at the top of the data (ordered by most recent):

print(df.shape)
df.head()

We have 1068 observations and 4887 features. Most of them are n-grams from the vectorizer, and we also have our original Apple Stock dataset.

And now we have the ‘lag1_change’ feature, which aligns each day’s news with the following day’s ‘change’ value, exactly what we need.

Step 5. Prepare the data and apply ML

There isn’t much more wrangling or cleansing to do; we just need to create our target features (positive POC and negative POC) as defined earlier.

# Where 'change' decreased more than 0.5%
df['negative_poc'] = [1 if val < -0.5 else 0 for val in df['lag1_change']]
# Where 'change' increased more than 0.5%
df['positive_poc'] = [1 if val > 0.5 else 0 for val in df['lag1_change']]
df[['lag1_change', 'positive_poc', 'negative_poc']].head()

Now, let’s try some ML to learn and predict the positive POC.

# Import libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics

# Define target and remove unwanted features
target = 'positive_poc'
df_positive = df.select_dtypes(['number'])
for rem in ['lag1_change', 'negative_poc']:
    df_positive = df_positive.loc[:, df_positive.columns != rem]

# Create train/test sets
X = df_positive.loc[:, df_positive.columns != target].values
y = df_positive.loc[:, [target]].values

# We take the first bit of the data as test and the rest as train because the
# dataset is ordered by timestamp descending: we train on earlier observations
# and test on subsequent ones.
div = int(round(len(X) * 0.29))

X_test = X[:div]
y_test = y[:div]

X_train = X[div:]
y_train = y[div:]
print('Test:')
print(X_test.shape)
print(y_test.shape)
print('Train:')
print(X_train.shape)
print(y_train.shape)

# Let's now train the model and predict. I'm actually nervous
rf = RandomForestRegressor(n_estimators=1000, random_state=1)
rf.fit(X_train, y_train.ravel())  # ravel to avoid the column-vector warning
y_pred = rf.predict(X_test)

Step 6. Analyze the results

And now, for the results:

print("AUC score:")
print(roc_auc_score(y_test, y_pred))
print('---')
# Let's binarize and look at the confusion matrix
preds = [1 if val > 0.5 else 0 for val in y_pred]
print('Confusion Matrix:')
print(metrics.confusion_matrix(y_test, preds))
print('---')
# Let's look at the accuracy
print('Accuracy:')
print(accuracy_score(y_test, preds))
print('---')

This is pretty amazing if I do say so myself.

Let’s analyze the results.

This means that of all the times this model predicted that the next day’s price change would increase 0.5% or more, it was correct 72% of the time. I don’t know of any real-life model that even compares to this.
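That 72% is the precision of the positive class; you can compute it directly from the Step 6 predictions (assuming y_test and preds are still in scope):

from sklearn.metrics import precision_score

# Of all days predicted as 'positive POC', the fraction that actually were
print(precision_score(y_test.ravel(), preds))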

*Side note: If any of you put this into production for automated ML trading, at least send me a gift.

Step 7. Break the data through time

Now, we want a methodology in which the data is trained and tested through time, to see if the results are consistent.

So let’s run it in a loop.

results = []
threshold = 0.5  # binarization threshold, as in Step 6
for i in range(0, 90, 5):
    time_chunk = i / 100
    print("time_chunk: " + str(time_chunk) + " starts")
    df_ml = df_positive[:int(round(df_positive.shape[0] * (time_chunk + 0.4)))]
    X = df_ml.loc[:, df_ml.columns != target].values
    y = df_ml.loc[:, [target]].values
    div = int(round(len(X) * 0.29))
    X_test = X[:div]
    y_test = y[:div]
    X_train = X[div:]
    y_train = y[div:]
    rf = RandomForestRegressor(n_estimators=1000, random_state=1)
    rf.fit(X_train, y_train.ravel())
    y_pred = rf.predict(X_test)
    preds = [1 if val > threshold else 0 for val in y_pred]
    try:
        roc = roc_auc_score(y_test, y_pred)
    except Exception:
        roc = 0
    conf_mat = metrics.confusion_matrix(y_test, preds)
    accuracy = accuracy_score(y_test, preds)
    results.append({
        'roc': roc,
        'accuracy': accuracy,
        'conf_mat': conf_mat,
        'time_chunk': time_chunk
    })

We can see the metrics increasing and stabilizing through time!
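A quick way to eyeball that, assuming the results list collected by the loop above:

import matplotlib.pyplot as plt

chunks = [r['time_chunk'] for r in results]
plt.plot(chunks, [r['roc'] for r in results], label='AUC')
plt.plot(chunks, [r['accuracy'] for r in results], label='accuracy')
plt.xlabel('time_chunk')
plt.ylabel('score')
plt.legend()
plt.show()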