Generating Simulated Stock Price Data using a Variational Autoencoder


Standard autoencoders, which are useful for tasks like data compression and denoising, learn a compact representation of the original input. Variational autoencoders, on the other hand, are generative models: they learn a distribution over a latent space, from which they can produce new data that looks similar to the training data.

A very good and intuitive explanation of the principles behind variational autoencoders can be found here.
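In a nutshell, the encoder maps each input to a mean μ and a log-variance log σ² that define a distribution over the latent space, and a latent code is drawn with the reparameterization trick, z = μ + σ · ε with ε ~ N(0, 1), which keeps the sampling step differentiable. This is exactly what the Sampling layer in the code below implements.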

A common requirement in software development is to generate simulated data, so that potentially sensitive information is not exposed to development teams. This is especially true when working with financial data: if real-life data is not available, developers will at least need simulated datasets that behave like the real thing.

In this example, I will use a simple variational autoencoder to simulate the stock prices of three technology companies: Microsoft (msft), Apple (aapl) and Amazon (amzn). I used data publicly available on Quandl to train the model, reserving the last 365 days for testing. The following is the Python code to retrieve this data:

!pip install --upgrade tensorflow
!pip install quandl
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import quandl
quandl_data_apple = quandl.get('WIKI/AAPL')
quandl_data_amazon = quandl.get('WIKI/AMZN')
quandl_data_msft = quandl.get('WIKI/MSFT')
Plot of High Stock Price data available in Quandl for Microsoft, Apple and Amazon

The training data (which is normalized before going into the model) consists of chunks of 365 consecutive data points corresponding to the highest stock price of each day. I am not using the adjusted price in this model, as the goal of this example is not prediction but data generation. Since each series is normalized by its maximum, all values fall in (0, 1], a range the decoder's tanh output layer can reproduce. The last 365 days are reserved for testing:

msft_high_raw = (quandl_data_msft['High'].values).reshape(1,-1)
msft_high = preprocessing.normalize(msft_high_raw, norm='max', axis=1)
msft_high = msft_high.reshape(msft_high.shape[1],)
msft_max_high = np.max(msft_high_raw)

apple_high_raw = (quandl_data_apple['High'].values).reshape(1,-1)
apple_high = preprocessing.normalize(apple_high_raw, norm='max', axis=1)
apple_high = apple_high.reshape(apple_high.shape[1],)
apple_max_high = np.max(apple_high_raw)

amazon_high_raw = (quandl_data_amazon['High'].values).reshape(1,-1)
amazon_high = preprocessing.normalize(amazon_high_raw, norm='max', axis=1)
amazon_high = amazon_high.reshape(amazon_high.shape[1],)
amazon_max_high = np.max(amazon_high_raw)
def generate_samples(data, sample_size):
    # Split a 1-D series into consecutive, non-overlapping chunks
    n_samples = data.shape[0] // sample_size
    result = np.empty((n_samples, sample_size))
    for i in range(n_samples):
        result[i] = data[i*sample_size : i*sample_size + sample_size]
    return result
test_data_points = 365
sample_size = 365

msft_high_train = msft_high[0:msft_high.shape[0]-test_data_points]
msft_high_test = msft_high[-test_data_points:]
X_msft_train = generate_samples(msft_high_train, sample_size)
X_msft_test = generate_samples(msft_high_test, sample_size)

apple_high_train = apple_high[0:apple_high.shape[0]-test_data_points]
apple_high_test = apple_high[-test_data_points:]
X_apple_train = generate_samples(apple_high_train, sample_size)
X_apple_test = generate_samples(apple_high_test, sample_size)

amazon_high_train = amazon_high[0:amazon_high.shape[0]-test_data_points]
amazon_high_test = amazon_high[-test_data_points:]
X_amazon_train = generate_samples(amazon_high_train, sample_size)
X_amazon_test = generate_samples(amazon_high_test, sample_size)
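As a quick sanity check, each of these arrays holds rows of 365 normalized prices; the number of training chunks depends on how much history Quandl returns for each ticker, while the testing set is always a single chunk, since sample_size equals test_data_points:

print(X_msft_train.shape)  # (number_of_training_chunks, 365)
print(X_msft_test.shape)   # (1, 365)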

The autoencoder has the following architecture:

  • Encoder network: 4 dense layers of size 256, 128, 32 and 16, with 0.1 dropout after the first two layers. The hidden layers use relu activation.
  • Decoder network: 4 dense layers of size 16, 32, 128 and 256. The hidden layers use relu activation and the output layer uses tanh activation.
  • A 1-dimensional latent space.

The following is the Keras code used to build the autoencoder:

K = keras.backend
keras.backend.clear_session()
np.random.seed(50)
tf.random.set_seed(50)

class Sampling(keras.layers.Layer):
    # Implements the reparameterization trick: z = mean + sigma * epsilon,
    # with epsilon ~ N(0, I), so the sampling step stays differentiable
    def call(self, inputs):
        mean, log_var = inputs
        return K.random_normal(tf.shape(log_var)) * K.exp(log_var / 2) + mean
latent_dim = 1
inputs = keras.layers.Input(shape=[sample_size])
z = keras.layers.Dense(256, activation='relu')(inputs)
z = keras.layers.Dropout(0.1)(z)
z = keras.layers.Dense(128, activation='relu')(z)
z = keras.layers.Dropout(0.1)(z)
z = keras.layers.Dense(32, activation='relu')(z)
z = keras.layers.Dense(16, activation='relu')(z)
latent_mean = keras.layers.Dense(latent_dim)(z)
latent_log_var = keras.layers.Dense(latent_dim)(z)
sample = Sampling()([latent_mean, latent_log_var])
variational_encoder = keras.models.Model(
inputs=[inputs], outputs=[latent_mean, latent_log_var, sample])
decoder_inputs = keras.layers.Input(shape=[latent_dim])
x = keras.layers.Dense(16, activation='relu')(decoder_inputs)
x = keras.layers.Dense(32, activation='relu')(x)
x = keras.layers.Dense(128, activation='relu')(x)
x = keras.layers.Dense(256, activation='relu')(x)
outputs = keras.layers.Dense(sample_size, activation='tanh')(x)
variational_decoder = keras.models.Model(inputs=[decoder_inputs], outputs=[outputs])
_, _, sample = variational_encoder(inputs)
reconstructions = variational_decoder(sample)
variational_ae = keras.models.Model(inputs=[inputs], outputs=[reconstructions])
latent_loss = -0.5 * K.sum(
    1 + latent_log_var - K.exp(latent_log_var) - K.square(latent_mean),
    axis=-1)
variational_ae.add_loss(K.mean(latent_loss) / (sample_size * 1.))
variational_ae.compile(loss='mse',
                       optimizer=keras.optimizers.Adam(learning_rate=0.01),
                       metrics=['mse'])
variational_ae.save('vae.tf', save_format='tf')
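For reference, the latent_loss added above is the closed-form KL divergence between the encoder's distribution N(μ, σ²) and the standard normal prior, -0.5 · Σ(1 + log σ² − σ² − μ²); dividing it by sample_size puts it on a per-data-point scale comparable to the MSE reconstruction loss. The untrained model is saved to disk, apparently so that a fresh copy can be loaded and trained separately for each stock, as done below for Microsoft.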

Results of the simulation on Microsoft data:

The following code trains the model and generates the output for the last 365 days of Microsoft stock data (the testing set):

vae_mft = keras.models.load_model('vae.tf')
history = vae_mft.fit(X_msft_train, X_msft_train, epochs=300, batch_size=32)

msft_simulated_high = np.empty((X_msft_test.shape[0],sample_size))

for i in range(X_msft_test.shape[0]):
    msft_simulated_high[i] = vae_mft.predict(X_msft_test[[i]])

plt.figure(figsize=(20,5))
plt.plot(msft_max_high * X_msft_test.flatten(), label='msft real data')
plt.plot(msft_max_high * msft_simulated_high.flatten(), label='msft simulated data')
plt.legend(loc='upper left')
plt.show()
Microsoft real data (blue) vs. simulated data (orange) for the testing set
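So far the model only reconstructs existing price paths. Since the KL term pushes the latent codes toward a standard normal, genuinely new simulated paths can also be obtained by sampling the latent space directly and decoding. A minimal sketch, not part of the original code, assuming training was done on variational_ae itself so that variational_decoder shares the trained weights:

# Decode random latent samples into brand-new simulated price paths
n_paths = 5
random_latents = np.random.normal(size=(n_paths, latent_dim))
new_paths = variational_decoder.predict(random_latents)  # shape (n_paths, 365)
plt.figure(figsize=(20,5))
for path in new_paths:
    plt.plot(msft_max_high * path)
plt.show()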

The following code generates the output of the model when applied to the entire Microsoft dataset:

# The model's output on the training chunks is needed as well,
# so that the simulated series covers the entire dataset
msft_train_high = vae_mft.predict(X_msft_train)

plt.figure(figsize=(21,5))
plt.plot(msft_max_high * np.append(X_msft_train, X_msft_test).flatten(),
         label='msft real data')
plt.plot(msft_max_high * np.append(msft_train_high, msft_simulated_high).flatten(),
         label='msft simulated data')
plt.legend(loc='upper left')
plt.show()
Microsoft real data (blue) vs. simulated data (orange) for the entire dataset

Results of the simulation on Apple and Amazon data:
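The training code for these two stocks is not shown; presumably the same steps are repeated on a fresh, untrained copy of the saved model, along these lines:

# Hypothetical reconstruction of the per-stock training step
vae_aapl = keras.models.load_model('vae.tf')
vae_aapl.fit(X_apple_train, X_apple_train, epochs=300, batch_size=32)
apple_simulated_high = vae_aapl.predict(X_apple_test)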

The following plot shows the outputs for the last 365 days of Apple data (the testing set). Notice that the model tries to reproduce, in the testing set, an event that belongs to the training set: the sharp decline in the (unadjusted) stock price caused by Apple's stock split back in 2014:

Apple real data (blue) vs. simulated data (orange) for the testing set. The sharp decline in the simulated price at the beginning of the plot suggests that the model remembers and tries to reproduce a similar event that happened in the training set.

This is what the simulated data looks like when we apply the model to the entire Apple data set:

Apple real data (blue) vs. simulated data (orange) for the entire dataset

The following are the results of applying the model to the last 365 days of Amazon data and to the entire Amazon dataset:

Amazon real data (blue) vs. simulated data (orange) for the testing set
Amazon real data (blue) vs. simulated data (orange) for the entire dataset
