Reinforcement Learning for Options Trading

Original article can be found here (source): Artificial Intelligence on Medium

Artificial intelligence is making its impact on many areas of finance, particularly trading. A diverse range of artificial intelligence subfields, such as deep learning, reinforcement learning, and natural language processing, is currently being used to predict stock movements. A reinforcement learning trading agent attempts to learn stock prices through trial and error. By combining Q-learning, a type of reinforcement learning algorithm, with the Black-Scholes model, a traditional model for option pricing, we can create a Q-Learning Black-Scholes (QLBS) model to determine optimal option prices.

In this article, I’ll go over options, the Black-Scholes model, and Q-learning before showing the implementation of a Q-Learning Black-Scholes model for a European put option.


Options Explained

Options are a type of derivative security, meaning their value depends on the price of some other asset such as stocks or commodities. For example, an option contract for a stock usually represents 100 shares of the underlying stock. The price of the option contract is known as the premium. Essentially, the contract allows the bearer to either buy or sell an amount of the underlying asset at a pre-determined price (referred to as the strike price) on or before the date the contract expires. Additionally, the bearer does not have an obligation to buy or sell, so they can also let the contract expire.

The two major types of options are call options and put options; the former allows the bearer to buy the asset at a stated price and the latter allows the bearer to sell the asset at a stated price. Buyers can use call options for speculation and put options for hedging purposes.

Image credit: Gatsby

To understand this better, we’ll look at a real-world example. Suppose that Apple shares are trading at $200 per share. You think that shares may rise above $210 over the next month and buy a $210 call with a premium of $0.67 per share. If the price rises over $210 on or before the expiration date of the contract, you can buy the shares at $210; if the shares fall or don’t rise over $210, you lose only the $67 premium ($0.67 x 100 shares). Conversely, let’s say you already own Apple shares and think that the price may fall, leading you to buy a $190 put with a premium of $0.63 per share. If the price falls below $190 on or before the expiration date of the contract, you can still sell your shares at $190, which limits your loss; if the shares don’t fall below $190, you lose only the $63 premium ($0.63 x 100 shares). Now that we understand options, let’s move on to the Black-Scholes model.
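The arithmetic in this example can be sketched in a few lines of Python. The function names `call_profit` and `put_profit` and the sample expiration prices are illustrative assumptions, not part of the original example. The sketch looks only at profit at expiration, ignores transaction costs, and treats the put on its own, without the shares it hedges.

```python
def call_profit(spot, strike, premium, shares=100):
    """Profit at expiration for the buyer of one call contract (hypothetical helper)."""
    payoff_per_share = max(spot - strike, 0.0)  # exercise only if it pays off
    return (payoff_per_share - premium) * shares


def put_profit(spot, strike, premium, shares=100):
    """Profit at expiration for the buyer of one put contract (hypothetical helper)."""
    payoff_per_share = max(strike - spot, 0.0)  # exercise only if it pays off
    return (payoff_per_share - premium) * shares


# $210 call with a $0.67 premium per share; the expiration prices are made up
for spot in (205.0, 210.0, 215.0):
    print(f"call, stock at {spot:.0f}: {call_profit(spot, 210.0, 0.67):+.2f}")

# $190 put with a $0.63 premium per share; the expiration prices are made up
for spot in (185.0, 190.0, 195.0):
    print(f"put,  stock at {spot:.0f}: {put_profit(spot, 190.0, 0.63):+.2f}")
```

However far the stock moves against the buyer, the loss never exceeds the premium paid: $67 for the call and $63 for the put.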

Black-Scholes Model

The Black-Scholes equation, which is probably the most famous equation in finance, provided the first widely used model for option pricing. The current stock price, the option’s strike price, the expected interest rate, the time to expiration and the volatility (a measure of fluctuations in price) are all used as inputs to calculate the theoretical value of an option. Introduced in 1973 by economists Fischer Black, Myron Scholes and Robert Merton, the equation was so influential that it won Scholes and Merton the 1997 Nobel Prize in Economics (Black unfortunately died before he could receive the honor).

Image Credit: KhanAcademy

In the image above, C is the call option price and N is the cumulative distribution function of the standard normal distribution. N(d1) is the call option’s delta (the ratio of the change in the option’s price to the corresponding change in the price of the underlying asset), and N(d2) is the risk-neutral probability that the call option will be exercised at expiration.

Black-Scholes for put option
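As a minimal sketch, the standard closed-form Black-Scholes prices for European calls and puts on a non-dividend-paying stock can be written as below. The function name `black_scholes_price`, its argument names, and the strike of 100 in the usage lines are assumptions made for illustration; `tau` is the time to expiration in years.

```python
import numpy as np
from scipy.stats import norm


def black_scholes_price(S, K, r, sigma, tau, kind="call"):
    """Black-Scholes price of a European option (hypothetical helper).

    S: current stock price, K: strike price, r: risk-free rate,
    sigma: volatility, tau: time to expiration in years.
    """
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    discounted_strike = K * np.exp(-r * tau)
    if kind == "call":
        return S * norm.cdf(d1) - discounted_strike * norm.cdf(d2)
    if kind == "put":
        return discounted_strike * norm.cdf(-d2) - S * norm.cdf(-d1)
    raise ValueError("kind must be 'call' or 'put'")


# Example inputs; the strike K=100 is an assumed value
call = black_scholes_price(S=100, K=100, r=0.03, sigma=0.15, tau=1.0, kind="call")
put = black_scholes_price(S=100, K=100, r=0.03, sigma=0.15, tau=1.0, kind="put")
print(f"call: {call:.4f}, put: {put:.4f}")
```

A quick sanity check on any implementation is put-call parity: the call price minus the put price should equal S - K * exp(-r * tau).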

For the sake of brevity, I’ll focus on the assumptions the Black-Scholes equation makes as well as its limitations rather than the actual math behind it. If you want a deeper understanding of Black-Scholes, watch this.


Assumptions

  • The option is European (it can only be exercised at expiration, not before).
  • No dividends are paid out during the life of the option.
  • Markets are efficient (market movements cannot be predicted).
  • There are no transaction costs in buying the option.
  • The risk-free interest rate and the volatility of the underlying asset are known and constant.
  • Log returns on the underlying asset are normally distributed.


Limitations

  • Doesn’t work for American options (which can be exercised before expiration)
  • Volatility fluctuates in real life
  • Transaction costs exist
  • The risk-free interest rate is not always constant in real life

Great, now that we have an overview of Black-Scholes, we’ll go over Q learning before jumping into the implementation of the QLBS model.


Q-Learning

Image Credit: Reinforcement Learning: An Introduction

In reinforcement learning, the goal is to maximize rewards. The agent performs an action to transition from one state to the next, and the action taken in each state gives the agent a reward. For example, think of a scenario with a dog and its human master. In the home (environment), the dog (agent) runs around (action) when the master commands it to sit down (state) and receives no treats (reward). In the next state, if the master commands the dog to sit down, it sits down (because running did not give treats) and receives treats. Essentially, the dog learns through trial and error.

Q-learning is a reinforcement learning algorithm where the goal is to learn the optimal policy (the policy tells an agent what action to take under what circumstances). A Q-Table of dimensions states x actions has values initialized to zero. Then, the agent chooses an action, observes a reward, and enters a new state, updating Q, the “quality” of the action taken in a state at each time t. Here’s the algorithm below.

Image Credit: Wikipedia

The learning rate, which is usually constant for all time t, determines to what extent from 0 (agent learns nothing new) to 1 (agent only considers recent information) new information overrides old information. Furthermore, the discount factor, which determines the importance of future rewards, ranges from 0 (only current rewards matter) to 1 (long term reward prioritized).
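The update described above can be sketched for a small Q-table as follows. This is the standard tabular Q-learning update; the function name `q_update`, the table size, and the values of `alpha` (learning rate) and `gamma` (discount factor) are assumptions chosen for illustration.

```python
import numpy as np


def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # best value the agent expects to obtain from the next state onward
    target = reward + gamma * np.max(Q[next_state])
    # blend the old estimate with the new target according to the learning rate
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
    return Q[state, action]


Q = np.zeros((3, 2))  # 3 states x 2 actions, initialized to zero
q_update(Q, state=0, action=1, reward=1.0, next_state=2)
print(Q)
```

With `alpha=0` the table never changes, and with `alpha=1` the old estimate is replaced entirely by the new target, which matches the two extremes of the learning rate.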

The agent can interact with the environment in two ways. One way is exploitation, where the agent uses the Q-table as a reference and chooses the action that has the highest value. However, the Q-table begins with all zeros, so actions sometimes have to be chosen randomly. This is exploration: the agent selects an action at random instead of choosing based on the maximum future reward. The epsilon value sets the fraction of the time the agent explores instead of exploiting.
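A common way to implement this trade-off is an epsilon-greedy rule, sketched below. The function name `epsilon_greedy` and the example Q-table are hypothetical; `rng` is a NumPy random generator passed in so that results are reproducible.

```python
import numpy as np


def epsilon_greedy(Q, state, epsilon, rng):
    """Pick an action for `state`: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # explore: choose any action uniformly at random
        return int(rng.integers(Q.shape[1]))
    # exploit: choose the action with the highest Q-value in this state
    return int(np.argmax(Q[state]))


rng = np.random.default_rng(0)
Q_example = np.array([[0.0, 1.0],
                      [2.0, 0.5]])  # made-up values: 2 states x 2 actions

greedy_action = epsilon_greedy(Q_example, state=0, epsilon=0.0, rng=rng)
random_action = epsilon_greedy(Q_example, state=0, epsilon=1.0, rng=rng)
print(greedy_action, random_action)
```

Setting `epsilon=0.0` always exploits and `epsilon=1.0` always explores; in practice epsilon is often started high and reduced as the Q-table fills in.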

Finally, it’s time to move on to the intersection of Q-learning and Black-Scholes!

Q-Learning + Black-Scholes

When Q-learning and Black-Scholes are combined, our QLBS model uses trading data to autonomously learn both the optimal option price and optimal hedge. For our implementation of the model, we’ll be working with a European put option. Before implementing the QLBS model, we’ll also implement the classic Black-Scholes formula to compare the results of the two. I’ll be leaving out the code for graphs and show the graphs directly instead to avoid making this article unnecessarily long; you can still click here to view the full code.


First, we make the necessary imports.

import numpy as np
import pandas as pd
from scipy.stats import norm
import random
import time
import matplotlib.pyplot as plt
import sys

Monte Carlo Simulation

After making imports, we’ll set the parameters for a Monte Carlo simulation of prices. A Monte Carlo simulation is used to model the probability of different outcomes in a process (such as stock price movement) which is unpredictable due to the presence of random variables.

S0 = 100 # initial stock price
mu = 0.05 # drift
sigma = 0.15 # volatility
r = 0.03 # risk-free rate
M = 1 # maturity
T = 24 # number of time steps
N_MC = 10000 # number of paths
delta_t = M / T # time interval
gamma = np.exp(- r * delta_t) # discount factor

Black-Scholes Simulation

Images from Coursera: Reinforcement Learning in Finance

np.random.seed(42)
# stock price paths: one row per Monte Carlo path, one column per time step
# (created with a float dtype so that np.log(S) below works on recent pandas versions)
S = pd.DataFrame(np.zeros((N_MC, T+1)), index=range(1, N_MC+1), columns=range(T+1))
S.loc[:,0] = S0
# standard normal random numbers
RN = pd.DataFrame(np.random.randn(N_MC,T), index=range(1, N_MC+1), columns=range(1, T+1))
# simulate geometric Brownian motion one time step at a time
for t in range(1, T+1):
    S.loc[:,t] = S.loc[:,t-1] * np.exp((mu - 1/2 * sigma**2) * delta_t + sigma * np.sqrt(delta_t) * RN.loc[:,t])
# one-step price change net of risk-free growth, and its de-meaned version
delta_S = S.loc[:,1:T].values - np.exp(r * delta_t) * S.loc[:,0:T-1]
delta_S_hat = delta_S.apply(lambda x: x - np.mean(x), axis=0)
# state variable
X = - (mu - 1/2 * sigma**2) * np.arange(T+1) * delta_t + np.log(S) # delta_t here is due to their conventions
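As a cross-check on the simulation above, the same geometric Brownian motion can be generated without a Python loop by cumulatively summing the log-price increments. This is a sketch rather than the article's implementation; the names `S_vec`, `X_vec`, `log_increments`, and `log_paths` are assumptions. Because it uses the same seed and draws the random numbers in the same order, it should give the same paths up to floating-point rounding.

```python
import numpy as np

S0, mu, sigma = 100, 0.05, 0.15  # same values as the parameters set earlier
M, T, N_MC = 1, 24, 10000
delta_t = M / T

np.random.seed(42)
RN = np.random.randn(N_MC, T)  # one standard normal draw per path and time step

# log-price increment of geometric Brownian motion over one time step
log_increments = (mu - 0.5 * sigma**2) * delta_t + sigma * np.sqrt(delta_t) * RN

# cumulative sums give log(S_t / S0) for t = 1..T; prepend zeros for t = 0
log_paths = np.hstack([np.zeros((N_MC, 1)), np.cumsum(log_increments, axis=1)])
S_vec = S0 * np.exp(log_paths)  # shape (N_MC, T + 1)

# state variable: log-price with the deterministic drift removed
X_vec = np.log(S_vec) - (mu - 0.5 * sigma**2) * np.arange(T + 1) * delta_t

# the average simulated terminal price should be close to S0 * exp(mu * M)
print(S_vec[:, -1].mean(), S0 * np.exp(mu * M))
```

Removing the drift means the state variable starts at log(S0) and afterwards carries only the random part of the log-price.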

Here’s what some stock price and state variable paths look like.