Original article was published on Deep Learning on Medium

# 1. Build LSTM and optimize parameters for one stock in SET50 on 1,5 and 10 days prediction

*1.1 Select the workspace and install yfinance library*

First, we will need to get stock data from Yahoo Finance. The **yfinance** library is the package that we need to install. It is packed with functions to get data from Yahoo Finance. The documentation can be found here — https://pypi.org/project/yfinance/.

Install yfinance using !pip install command.

`!pip install yfinance`

`import yfinance as yf`

As an LSTM requires TensorFlow to build and train the model, I developed this notebook on Google Colaboratory — https://colab.research.google.com — which provides a free Jupyter Notebook workspace with GPU-enabled TensorFlow installed. It can also read/write files to Google Drive, which is quite handy in this situation as my personal machine doesn't have a GPU.

Keras is a deep learning library with an LSTM implementation, which I'm going to use in this exploration — https://keras.io/.

*1.2 Prepare data*

Since we will use all of the SET50 data in the next topic, I'll download all of it. The stock that I select to explore is INTUCH.BK, which I have traded recently.

I use Colaboratory's Google Drive mounting feature to store the downloaded data and also the intermediate results while working on this notebook.

yfinance has a handy command that can download historical data within 2 lines. First, we instantiate yfinance with the ticker name. After that, we can use the history function to download the historical data. More detail is in yfinance's documentation: https://pypi.org/project/yfinance/

```python
# Instantiate object from stock ticker
stock_data = yf.Ticker(stock)

# yfinance's history function can specify the period of historical data to download
pd.DataFrame(stock_data.history(period='max', auto_adjust=False, actions=False)).to_csv(file)
```

After I save the data to CSV, I explore it a bit to check for completeness, null values, and the expected features (Open, High, Low, Close, Adjusted Close, Volume).

Based on a quick check, the data is quite ready to use.
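The quick check described above can be done with a few pandas calls. Here is a minimal sketch, using a tiny hand-made frame in place of the real downloaded CSV:

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the downloaded CSV (real data has one row per trading day)
df = pd.DataFrame({
    'Open':      [62.0, 62.5, np.nan],
    'High':      [63.0, 63.5, 64.0],
    'Low':       [61.5, 62.0, 62.5],
    'Close':     [62.5, 63.0, 63.5],
    'Adj Close': [60.1, 60.6, 61.1],
    'Volume':    [1_000_000, 1_200_000, 900_000],
})

# Check that the expected feature columns are all present
expected = {'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'}
assert expected.issubset(df.columns)

# Count nulls per column to spot incomplete rows
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])   # only 'Open' has a missing value here
```

On the real INTUCH.BK file, the same `isnull().sum()` call shows whether any rows need to be dropped in the next step.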

After we have got all of the data, we have to make it ready to train the model. Here is the list of things to do:

- Drop null rows (if any), as we can't use them anyway.
- Drop Date, as we can't use it as a feature in model training.
- Normalize the data to values between 0 and 1, as this helps the neural network perform better. This is per this post: https://towardsdatascience.com/why-data-should-be-normalized-before-training-a-neural-network-c626b7f66c7d. To normalize the data and scale it back up, we can use scikit-learn's preprocessing.MinMaxScaler(). What we have to do is keep the object that we used to scale the data down and use the same object to scale it back up.
- Transform the data format. We will predict the Adj Close for the prediction day range (1, 5, and 10). So, each row of the dataset will consist of the Open, High, Low, Adj Close, and Volume of each day, for the number of history-point days that we will use to do the prediction.
- For example, if we use 30 history points, one row of our dataset will consist of the following features:
`[dayAopen, dayAclose, dayAvolume, dayAhigh, dayAlow, dayA-1open, dayA-1close, dayA-1volume, dayA-1high, dayA-1low, ... dayA-29low]`
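To make the windowing concrete before the full code, here is a toy sketch with synthetic numbers and history_points shortened to 3, showing how each training sample is built from consecutive days:

```python
import numpy as np
from sklearn import preprocessing

# Toy data: 6 days x 5 features (Open, High, Low, Adj Close, Volume)
data = np.arange(30, dtype=float).reshape(6, 5)

# Normalize each column to 0-1, as in the real pipeline
scaler = preprocessing.MinMaxScaler()
data_n = scaler.fit_transform(data)

history_points = 3   # shortened from 30 for the example
predict_range = 1

# Each sample is a (history_points x features) window of consecutive days
windows = np.array([data_n[i:i + history_points]
                    for i in range(len(data_n) - history_points - predict_range + 1)])

# The target is the Adj Close (column index 3) of the following day(s)
targets = np.array([data_n[i + history_points:i + history_points + predict_range, 3]
                    for i in range(len(data_n) - history_points - predict_range + 1)])

print(windows.shape, targets.shape)  # (3, 3, 5) (3, 1)
```

Six days with a 3-day window and a 1-day target give three samples; each window is a 3×5 block of consecutive days, which is exactly the shape the LSTM will expect.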

Here is the code that I use to perform all of the activities above.

```python
# Construct the CSV filepath for INTUCH.BK
stock = 'INTUCH.BK'
filename = gdrive_path + stock + '.csv'

# Read the file and drop null rows
df = pd.read_csv(filename)
df_na = df.dropna(axis=0)

# Drop Date as this is time series data, so Date isn't used.
# Also drop Close as we will predict Adj Close.
df_na = df_na.drop(['Date', 'Close'], axis=1)

# As a neural network performs better with normalized data, we normalize the data
# to the value range of 0-1 before training and prediction.
# After we get the prediction result, we scale it back to the actual value to measure the error rate.
data_normaliser = preprocessing.MinMaxScaler()
y_normaliser = preprocessing.MinMaxScaler()
data_normalised = data_normaliser.fit_transform(df_na)

# The length of the input window and the number of days to predict
history_points = 30
predict_range = 1

# Prepare the data in the format of [day-1-open, day-1-max, day-1-min, ... day-history_points]
# as one row of input for predicting the 'predict_range' prices, for train and test
ohlcv_histories_normalised = np.array([
    data_normalised[i:i + history_points].copy()
    for i in range(len(data_normalised) - history_points - predict_range + 1)])

# Get the normalized target prices [day1 adj close, day2 adj close, ... day-predict_range adj close]
next_day_adjclose_values_normalised = np.array([
    data_normalised[i + history_points:i + history_points + predict_range, 3].copy()
    for i in range(len(data_normalised) - history_points - predict_range + 1)])

# The same array as the normalised adj close but with actual values, not the scaled-down values.
# This is used to calculate the prediction accuracy.
next_day_adjclose_values = np.array([
    df_na.iloc[i + history_points:i + history_points + predict_range]['Adj Close'].values.copy()
    for i in range(len(df_na) - history_points - predict_range + 1)])

# Fit the y normaliser on the actual values so that we can scale the predicted result back
y_normaliser.fit(next_day_adjclose_values)
```

Now, the data is ready. As we are going to train the model, we will have to split the data to train and test.

The older data will be the training set and the newer data will be the test set.

I select 90% of the data as train data and 10% of the data to be test data.

So, we can use Python's array slicing to split the data. The code below is an example from my function; ohlcv_histories is the data that we prepared earlier.

```python
n = int(ohlcv_histories.shape[0] * 0.9)
ohlcv_train = ohlcv_histories[:n]
y_train = next_day_adj_close[:n]
ohlcv_test = ohlcv_histories[n:]
y_test = next_day_adj_close[n:]
```

*1.3 Build, Train and Validate the model*

Then, it is time to create the LSTM model, train it, and validate it using mean squared error. The LSTM that I will use is a simple one, consisting of a hidden layer, a dropout layer, and a forecast layer.

I create it as a function so that I can change the parameters of the model. The parameters that we change when we build the LSTM models are:

- hidden layer number — the size of the LSTM layer
- dropout probability — the probability of dropping (forgetting) information from the previous layer
- history points — the range of data used to train the model for each sample (e.g. 30 days per sample, drawn from all of the data in the training set)
- feature number — the number of features; if we add more features, this number has to change
- optimizer (mostly we will use 'adam')

Here is the code inside the function.

```python
# Initialize LSTM using the Keras library
model = Sequential()

# Define the hidden layer size and the shape of the input
# (number of history points and the number of features)
model.add(LSTM(layer_num, input_shape=(history_points, features_num)))

# Add forget (dropout) layer with probability per argument
model.add(Dropout(dropout_prob))

# End the network with a dense layer sized to the forecast range, e.g. 1, 5, 10
model.add(Dense(predict_range))

# Build and return the model per the selected optimizer
model.compile(loss='mean_squared_error', optimizer=optimizer)
```

After we get the model as a result from compile(), we can fit it with the training data. Additional parameters that we can change when fitting the data are the batch size and the number of epochs:

`model.fit(x=ohlcv_train, y=y_train, batch_size=batch_size, epochs=epoch, shuffle=True, validation_split=0.1)`

Once the model has completed training, we can use the test data to predict the result and compare it with the actual result by calculating the mean squared error (MSE). However, the actual values that we have are on the original price scale, not the normalized 0–1 scale that the model outputs.

Before calculating MSE, we have to scale the predicted price back.

```python
# The model is trained. Test with the test dataset
y_test_predicted = model.predict(ohlcv_test)

# Scale the result up to the actual value with the y_normaliser that we fitted earlier
y_test_predicted = y_normaliser.inverse_transform(y_test_predicted)

# Calculate the error with MSE
real_mse = np.mean(np.square(unscaled_y_test - y_test_predicted))
scaled_mse = real_mse / (np.max(unscaled_y_test) - np.min(unscaled_y_test)) * 100
```
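The inverse_transform step and the scaled-MSE formula can be sanity-checked in isolation. A small sketch with made-up prices (the values here are purely illustrative):

```python
import numpy as np
from sklearn import preprocessing

# Made-up actual prices
actual = np.array([[60.0], [62.0], [64.0], [66.0]])

y_normaliser = preprocessing.MinMaxScaler()
normalised = y_normaliser.fit_transform(actual)   # maps 60..66 onto 0..1

# Scaling back with the same scaler recovers the original prices
recovered = y_normaliser.inverse_transform(normalised)
assert np.allclose(recovered, actual)

# Scaled MSE as used above: MSE divided by the price range, times 100
predicted = np.array([[60.5], [61.5], [64.5], [65.5]])
real_mse = np.mean(np.square(actual - predicted))
scaled_mse = real_mse / (np.max(actual) - np.min(actual)) * 100
print(round(scaled_mse, 3))  # 4.167
```

This is why keeping the fitted y_normaliser object around matters: the same min/max learned from the actual prices is what turns the model's 0–1 output back into baht.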

Now, we have the complete code to prepare the data, build, train, and validate the model, and we are able to change parameters when building and training the model to find the set that gives the lowest MSE.

For the first attempt, I try the lowest value of every parameter for 1-day prediction. The history points value that I use is 30 days, over all of the downloaded historical data.

```python
# Must be the same as the history points that we used to prepare the data
history_points = 30

# Must match the number of features used when preparing the data
features_num = 5

# LSTM parameters
layer_num = 30
predict_range = 1
optimizer = 'adam'
dropout_prob = 1.0

# Create LSTM model object
model = get_LSTM_Model(layer_num, history_points, features_num, predict_range, optimizer, dropout_prob)

# Parameters for model training
batch_size = 10
epoch = 10

# Train the model with our training data
model.fit(x=ohlcv_train, y=y_train, batch_size=batch_size, epochs=epoch, shuffle=True, validation_split=0.1)
```

After we get the result, we can plot the predicted price and the actual price to see how they differ.

```python
real = plt.plot(unscaled_y_test, label='real')
pred = plt.plot(y_test_predicted, label='predicted')
plt.legend(['Real', 'Predicted'])
plt.show()
```

We can say that the result captures the trend quite well. It consistently predicts lower than the actual price when the price is in an uptrend, while predicting higher than the actual price in a downtrend.

*1.4 Optimize parameters for 1,5 and 10 days prediction*

Then, it's time to find the best parameter values. In summary, here is the list of parameters to optimize:

- hidden layer number
- dropout probability
- history points
- batch size
- epoch

The way that I do this is to create a function that loops through a range of one parameter while all other parameter values are fixed, to see which value of that particular parameter gives the lowest MSE. So, I'll have five functions in total.

Here is an example of such a function. The other functions share the same structure, changing only the parameter being varied.

```python
def get_best_history_points(predict_range, max_history_points, stock_list, hidden_layer=10,
                            batch_size=10, epoch=10, dropout_probability=1.0, mode='file'):
    mse_list = []
    exception_list = []
    for history_points in range(30, max_history_points + 1, round(max_history_points / 10)):
        for stock in stock_list:
            try:
                model, scaled_mse = train_and_validate_stock_predictor(
                    stock, history_points, predict_range, hidden_layer,
                    batch_size, epoch, dropout_probability, mode)
                print("Predict {} days for {} with MSE = {}".format(
                    str(predict_range), str(stock), str(scaled_mse)))
                mse_list.append([history_points, stock, scaled_mse])
                pd.DataFrame(mse_list).to_csv('/content/drive/My Drive/Colab Notebooks/stocklist_'
                                              + str(predict_range) + '_mse_history_' + mode + '.csv')
            except Exception as e:
                print("exception " + str(e) + " on " + stock)
                exception_list.append([predict_range, stock, str(e)])
                pd.DataFrame(exception_list).to_csv('/content/drive/My Drive/Colab Notebooks/exception_list.csv')
                continue
```
Then, I start by running all of the functions to see which parameter, at which value, gives the lowest MSE.