Tuning LSTM to predict stock price in SET50 with lower than 5.5% error

Original article was published on Deep Learning on Medium

1. Build LSTM and optimize parameters for one stock in SET50 for 1, 5 and 10 days prediction

1.1 Select the workspace and install yfinance library

First, we need to get stock data from Yahoo Finance. The yfinance library is the package we need to install; it is packed with functions to get data from Yahoo Finance. The documentation can be found here — https://pypi.org/project/yfinance/.

Install yfinance using !pip install command.

!pip install yfinance
import yfinance as yf

As LSTM requires TensorFlow to build and train the model, I develop this notebook on Google’s Colaboratory — https://colab.research.google.com — which provides a free Jupyter Notebook workspace with GPU-supported TensorFlow installed. It can also read/write files to Google Drive, which is quite handy in this situation as my personal machine doesn’t have a GPU.

Keras is a deep learning library with LSTM implemented, which I’m going to use in this exploration — https://keras.io/.

Colaboratory’s screenshot from my notebook

1.2 Prepare data

Since we will use all of the SET50 data in the next topic, I’ll download all of it. The stock that I select to explore is INTUCH.BK, which is one that I have traded recently.

I use Colaboratory’s Google Drive mounting feature to store the downloaded data and also intermediate results while working on this notebook.

yfinance has the handy commands below, which can download historical data in 2 lines. First, we have to initiate a yfinance instance using the ticker name. After that, we can use the history function to download the historical data. More detail is in yfinance’s documentation: https://pypi.org/project/yfinance/

import pandas as pd

# Instantiate object from stock ticker
stock_data = yf.Ticker(stock)
# yfinance's history function can take a period argument when downloading historical data
pd.DataFrame(stock_data.history(period='max', auto_adjust=False, actions=False)).to_csv(file)

After I save the data to CSV, I explore it a bit to check for completeness, null data and the expected features (Open, High, Low, Close, Adjusted Close, Volume).

INTUCH.BK’s example data from Yahoo finance
Check null and data type of the DataFrame
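These checks can be sketched with a tiny stand-in frame (the values below are illustrative, not real INTUCH.BK data):

```python
import pandas as pd

# A tiny stand-in for the downloaded CSV (illustrative values only)
df = pd.DataFrame({
    'Open': [58.0, 58.5, None],
    'High': [59.0, 59.25, 59.5],
    'Low': [57.5, 58.0, 58.25],
    'Close': [58.5, 59.0, 58.75],
    'Adj Close': [55.1, 55.6, 55.4],
    'Volume': [1_200_000, 950_000, 1_010_000],
})

# Count nulls per column and inspect the dtypes of each feature
print(df.isnull().sum())
print(df.dtypes)

# Rows with any null are dropped before training
df_clean = df.dropna(axis=0)
print(len(df_clean))  # 2
```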

Based on a quick check, the data is quite ready to use.

After we have got all of the data, we have to make it ready to train the model. Here is the list of things to do:

  1. Drop null rows (if any) as we can’t use them anyway.
  2. Drop Date as we can’t use it as a feature in model training.
  3. Normalize the data to values between 0–1, as it helps the neural network perform better. This is per this post: https://towardsdatascience.com/why-data-should-be-normalized-before-training-a-neural-network-c626b7f66c7d. To normalize the data and scale it back up, we can use scikit-learn’s preprocessing.MinMaxScaler(). What we have to do is keep the object that we used to scale the data down and use that same object to scale the data back up.
  4. Transform the data format. We will predict the Adj Close for the prediction day range (1, 5 and 10). So, each row of the dataset will consist of the Open, High, Low, Adj Close and Volume of each day, for the number of history points that we will use to do the prediction.
  5. For example, if we use 30 history points, one row of our dataset will consist of the following features:
    [dayAopen, dayAclose, dayAvolume, dayAhigh, dayAlow, dayA-1open, dayA-1close, dayA-1volume, dayA-1high, dayA-1low, ... dayA-29low]
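Step 3’s scale-down/scale-up round trip can be sketched like this (toy prices, not real data):

```python
import numpy as np
from sklearn import preprocessing

# Toy price column (illustrative values only)
prices = np.array([[55.1], [55.6], [55.4], [56.0]])

scaler = preprocessing.MinMaxScaler()
scaled = scaler.fit_transform(prices)  # values now lie in [0, 1]

# Keep the SAME scaler object to map predictions back to real prices later
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, prices))  # True
```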

Here is the code that I use to perform all of the activities above.

# Construct the CSV filepath for INTUCH.BK
stock = 'INTUCH.BK'
filename = gdrive_path+stock+'.csv'
# Read the file and drop null rows
df = pd.read_csv(filename)
df_na = df.dropna(axis=0)
# Drop Date as this is time series data and Date isn't used. Also drop Close as we will predict Adj Close.
df_na = df_na.drop(['Date','Close'],axis=1)
# As a neural network performs better with normalized data, normalize the data to the range 0-1 before train and predict.
# After we get the predicted result, we will scale it back to the normal value to measure the error rate.
data_normaliser = preprocessing.MinMaxScaler()
y_normaliser = preprocessing.MinMaxScaler()
data_normalised = data_normaliser.fit_transform(df_na)
# The length of the window, number of days to predict
history_points = 30
predict_range = 1
# Prepare the data in the format of [day-1-open, day-1-max, day-1-min, ... day-history_point] as 1 row of input for predicting the 'predict_range' price, for train and test
ohlcv_histories_normalised = np.array([data_normalised[i : i + history_points].copy() for i in range(len(data_normalised) - history_points - predict_range + 1)])
# Get the normalised price [day1 adj close, day2 adj close, ... day-predict_range adj close] for train and test
next_day_adjclose_values_normalised = np.array([data_normalised[i + history_points : i + history_points + predict_range, 3].copy() for i in range(len(data_normalised) - history_points - predict_range + 1)])
# Create the same array as the normalised adj close but with the actual value, not the scaled-down value. This is used to calculate the prediction accuracy.
next_day_adjclose_values = np.array([df_na.iloc[i + history_points : i + history_points + predict_range]['Adj Close'].values.copy() for i in range(len(df_na) - history_points - predict_range + 1)])
# Fit the y normaliser on the actual values so that we can scale the predicted result back to actual prices
y_normaliser.fit(next_day_adjclose_values)

Now, the data is ready. As we are going to train the model, we have to split the data into train and test sets.

The older data will be the training set and the newer data will be the test set.
I select 90% of the data as train data and 10% of the data to be test data.

So, we can use Python’s array slicing to split the data. The code below is an example from my function; ohlcv_histories is the data that we prepared earlier.

n = int(ohlcv_histories.shape[0] * 0.9)
ohlcv_train = ohlcv_histories[:n]
y_train = next_day_adj_close[:n]
ohlcv_test = ohlcv_histories[n:]
y_test = next_day_adj_close[n:]

1.3 Build, Train and Validate the model

Now we are ready to create the LSTM model, then train and validate it using mean squared error. The LSTM that I will use is a simple one consisting of a hidden layer, a dropout layer, and a forecast layer.

I create it as a function so that I can change the parameters of the model. The parameters that we change when we build the LSTM models are:

  • hidden layer number — The size of the LSTM layer (number of units)
  • dropout probability — The probability to forget the information of the previous node
  • history points — The range of data used to train the model for each sample (e.g. 30 days per sample from all of the data in the training set)
  • feature number — The number of features. If we add more features, this number has to change.
  • optimizer (mostly we will use ‘adam’)

Here is the code inside the function.

# Initialize LSTM using the Keras library
model = Sequential()
# Define the hidden layer size and the shape of the input (number of history points and the number of features)
model.add(LSTM(layer_num, input_shape=(history_points, features_num)))
# Add forget (dropout) layer with the probability per argument
model.add(Dropout(dropout_prob))
# End the network with a dense layer sized per the forecast range, e.g. 1, 5, 10
model.add(Dense(predict_range))
# Build and return the model with the selected optimizer
model.compile(loss='mean_squared_error', optimizer=optimizer)

After we get the model as a result from compile(), we can fit it with the training data. Additional parameters that we can change when we fit the data are the batch size, the number of epochs and the validation split:

model.fit(x=ohlcv_train, y=y_train, batch_size=batch_size, epochs=epoch, shuffle=True, validation_split=0.1)

Once the model has completed training, we can use the test data to predict the result and compare it with the actual result by calculating mean squared error (MSE). However, the actual result that we have is in the scaled-up value (the normal price, not the normalized 0–1 output that we get from the model).

Before calculating MSE, we have to scale the predicted price back.

# The model is trained. Test with the test dataset
y_test_predicted = model.predict(ohlcv_test)
# Scale the result back to actual prices with the y_normaliser that we fitted earlier
y_test_predicted = y_normaliser.inverse_transform(y_test_predicted)
# Calculate the error with MSE, normalised by the test set's price range
real_mse = np.mean(np.square(unscaled_y_test - y_test_predicted))
scaled_mse = real_mse / (np.max(unscaled_y_test) - np.min(unscaled_y_test)) * 100
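With toy numbers, this scaled MSE works out as follows (the prices below are illustrative, not real results):

```python
import numpy as np

# Toy actual vs predicted prices (illustrative values only)
unscaled_y_test = np.array([50.0, 52.0, 54.0, 56.0])
y_test_predicted = np.array([50.5, 51.5, 54.5, 55.5])

# Plain MSE: every prediction here is off by 0.5, so the MSE is 0.25
real_mse = np.mean(np.square(unscaled_y_test - y_test_predicted))

# Normalise by the test set's price range so stocks at different
# price levels produce comparable percentages
scaled_mse = real_mse / (np.max(unscaled_y_test) - np.min(unscaled_y_test)) * 100

print(real_mse)    # 0.25
print(scaled_mse)  # ~4.17
```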

Now, we have the complete code to prepare data, build, train and validate the model, and we are also able to change parameters when we build and train the model to find the set that gives the lowest MSE.

For the first attempt, I try the lowest value for every parameter for 1 day prediction. The history points value that I use is 30 days, on all of the historical data that was downloaded.

# Must be the same as the history points that we used to prepare the data
history_points = 30
# Must be the same number of features as when we prepared the data
features_num = 5
# LSTM parameters
layer_num = 30
predict_range = 1
optimizer = 'adam'
dropout_prob = 1.0
# Create LSTM model object
model = get_LSTM_Model(layer_num, history_points, features_num, predict_range, optimizer, dropout_prob)
# Parameters for model training
batch_size = 10
epoch = 10
# Train the model with our training data
model.fit(x=ohlcv_train, y=y_train, batch_size=batch_size, epochs=epoch, shuffle=True, validation_split=0.1)

After we get the result, we can plot the predicted price and the actual price to see how they differ.

real = plt.plot(unscaled_y_test, label='real')
pred = plt.plot(y_test_predicted, label='predicted')
plt.legend(['Real', 'Predicted'])
plt.show()
The first attempt’s MSE at ~6.18%
Example result from the INTUCH prediction. The parameters are just the lowest values.

We can say that the result captures the trend quite well. It constantly predicts lower than the actual price when the price is in an uptrend, and higher than the actual price when it is in a downtrend.

1.4 Optimize parameters for 1,5 and 10 days prediction

Then, it’s time to find the best parameter values. In summary, here is the list of parameters to optimize:

  • hidden layer number
  • dropout probability
  • history points
  • batch size
  • epoch

The way that I do it is to create a function that loops through the range of one parameter while all other parameter values are fixed, to see which value of that particular parameter gives the lowest MSE. So, I’ll have 5 functions in total.

Here is an example of such a function. The other functions share the same structure, just changing the parameter.

def get_best_history_points(predict_range, max_history_points, stock_list, hidden_layer=10, batch_size=10, epoch=10, dropout_probability=1.0, mode='file'):
    mse_list = []
    exception_list = []
    for history_points in range(30, max_history_points + 1, round(max_history_points / 10)):
        for stock in stock_list:
            try:
                model, scaled_mse = train_and_validate_stock_predictor(stock, history_points, predict_range, hidden_layer, batch_size, epoch, dropout_probability, mode)
                print("Predict {} days for {} with MSE = {}".format(str(predict_range), str(stock), str(scaled_mse)))
                mse_list.append([history_points, stock, scaled_mse])
                pd.DataFrame(mse_list).to_csv('/content/drive/My Drive/Colab Notebooks/stocklist_' + str(predict_range) + '_mse_history_' + mode + '.csv')
            except Exception as e:
                print("exception " + str(e) + " on " + stock)
                exception_list.append([predict_range, stock, str(e)])
                pd.DataFrame(exception_list).to_csv('/content/drive/My Drive/Colab Notebooks/exception_list.csv')
                continue

Then, I start by running all of the functions to see which parameter at which value gives the lowest MSE.

From the first round of tuning, we found that epoch = 90 has the lowest MSE at ~2.85%.

We will run all functions except the epoch one again, and also fix the epoch value at 90 as input to all functions. This is to find other parameters that could decrease MSE further. I repeat these steps until the MSE doesn’t decrease anymore.
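This repeat-until-no-improvement search can be sketched abstractly. In the sketch below, evaluate is a hypothetical stand-in for train_and_validate_stock_predictor that scores a parameter set with a toy function instead of actually training an LSTM:

```python
# Hypothetical stand-in for train_and_validate_stock_predictor:
# pretend the best settings are epoch=90 and history_points=90,
# with the "MSE" growing as we move away from them.
def evaluate(params):
    return abs(params['epoch'] - 90) + abs(params['history_points'] - 90)

# Value ranges to sweep for each parameter (illustrative ranges)
search_space = {
    'epoch': range(10, 101, 10),
    'history_points': range(30, 121, 30),
}

params = {'epoch': 10, 'history_points': 30}  # start from the lowest values
best_mse = evaluate(params)

improved = True
while improved:                        # repeat until MSE stops decreasing
    improved = False
    for name, values in search_space.items():
        for value in values:           # sweep one parameter, fix the others
            candidate = {**params, name: value}
            mse = evaluate(candidate)
            if mse < best_mse:
                best_mse, params = mse, candidate
                improved = True

print(params)  # {'epoch': 90, 'history_points': 90}
```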

Finally, I got the result which give the lowest MSE at ~2.79% as below:

  • hidden layer number = 10
  • dropout probability = 1.0
  • batch size=10
  • epoch=90
  • history point=90

I tried to optimize the MSE further by adding some technical analysis indicators that are commonly used to trade stocks. I select MACD and EMA, which are not complicated to calculate. The example code is below. It adds MACD and EMA at 20 and 50 days to the stock data DataFrame.

# Extract Close data to calculate MACD
df_close = df[['Close']]
df_close.reset_index(level=0, inplace=True)
df_close.columns = ['ds', 'y']
# Calculate MACD by using DataFrame's EWM: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ewm.html
exp1 = df_close.y.ewm(span=12, adjust=False).mean()
exp2 = df_close.y.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2
# Merge MACD back as a new column of the input df
df = pd.merge(df, macd, how='left', left_on=None, right_on=None, left_index=True, right_index=True)
# Rename DataFrame columns
df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'MACD']
# Add new columns using the EMA window sizes. EWM can be used directly.
ema1, ema2 = 'EMA20', 'EMA50'  # column names for the new EMA features
df[ema1] = df['Close'].ewm(span=20, adjust=False).mean()
df[ema2] = df['Close'].ewm(span=50, adjust=False).mean()
return df
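On a short toy Close series, the EWM-based MACD behaves as expected: it is zero while the price is flat and turns positive once the price starts rising, because the faster EMA leads the slower one.

```python
import pandas as pd

# Toy Close series: 30 flat days followed by a rise (illustrative only)
close = pd.Series([10.0] * 30 + [11.0, 12.0, 13.0, 14.0, 15.0])

exp1 = close.ewm(span=12, adjust=False).mean()
exp2 = close.ewm(span=26, adjust=False).mean()
macd = exp1 - exp2

# While the price is flat, the fast and slow EMAs coincide, so MACD is 0;
# once the price rises, the faster EMA leads and MACD turns positive.
print(macd.iloc[29])      # 0.0
print(macd.iloc[-1] > 0)  # True
```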

However, the MSE with the additional data increased to around ~6.7% instead. So, adding them might not help for 1 day prediction.

The overall steps to find parameters for 1 day prediction are as described earlier. I repeat all of the steps for 5 days and 10 days prediction and get the following results:

1 day prediction at 2.78% MSE

  • history points : 90
  • hidden layer : 10
  • batch size : 10
  • dropout probability : 1.0
  • epoch : 90
  • add MACD and EMA? : No

5 days prediction at 7.56% MSE

  • history points : 30
  • hidden layer : 70
  • batch size : 10
  • dropout probability : 1.0
  • epoch : 60
  • add MACD and EMA? : No

10 days prediction at 14.55% MSE

  • history points : 50
  • hidden layer : 60
  • batch size : 10
  • dropout probability : 0.3
  • epoch : 80
  • add MACD and EMA? : No

It is quite a surprise to me that adding MACD and EMA doesn’t help for predicting INTUCH at any range. However, I’ll still keep the function and try it with other stocks in SET50.

Now, we have the parameters for each prediction range. We can try them with the SET50 stocks to see how many stocks can be predicted with acceptable accuracy for 1, 5 and 10 days prediction.