Python Multithreading vs Multiprocessing? Web Scrape Stock Price History Faster



Multithreading is faster than multiprocessing for web scraping stock price history from Yahoo Finance in Python. To understand why, you need to know the difference between multithreading and multiprocessing.

  • Multithreading: A single central processing unit (or a single core in a multicore processor) executes multiple threads of tasks concurrently.
  • Multiprocessing: Two or more central processing units (or cores) split multiple tasks across themselves.

Think about how many cores your computer's processor has. You likely have four cores, which means you can run four tasks at the same time, one on each core. By contrast, with multithreading you can create 30 threads that work on tasks concurrently within your computer system, as the short sketch below illustrates.
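
As a rough illustration (a minimal sketch, not part of the scraping project; wait_a_moment is just a stand-in for an I/O-bound task), you can check your core count and still start far more threads than cores, because threads that are waiting on I/O do not each need a core of their own:

import concurrent.futures
import multiprocessing
import time

print(multiprocessing.cpu_count())  # number of CPU cores available, e.g. 4

def wait_a_moment(n):
    time.sleep(1)  # stands in for an I/O-bound task such as a web request
    return n

# 30 threads overlap their waiting time even on a 4-core machine
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
    results = list(executor.map(wait_a_moment, range(30)))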

In this article, I will compare the speed at which Python can scrape Apple’s stock price history for the last 10 years using 13 and 27 threads on one CPU. For comparison, I will also demonstrate that multiprocessing is more expensive in terms of time when performing the same task.

For projects involving multiple stocks and more than 100 URLs, you will want to consider using Python’s asyncio module.
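
As a rough sketch of what that could look like (assuming the third-party aiohttp package is installed; fetch and fetch_all are illustrative names, not part of this project):

import asyncio
import aiohttp  # third-party: pip install aiohttp

async def fetch(session, url):
    # request one page and return its HTML as text
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls):
    # schedule all requests on one event loop and gather the results
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# pages = asyncio.run(fetch_all(urls))  # urls would be your list of URLs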

Import project dependencies

For this project, you will need the following modules:

  1. datetime: Supply input dates for the Yahoo Finance URLs.
  2. time: Format the dates in time since epoch.
  3. multiprocessing: Run multiple tasks across CPU cores.
  4. numpy: Count trading days within a time block.
  5. pandas: Create and manipulate dataframes.
  6. math: Round up to the nearest whole number.
  7. requests: Send GET requests to a web server.
  8. html: Parse the HTML content and save it into a tree.
  9. lxml: Serialize an XML tree to a string.
  10. concurrent.futures: Asynchronously execute functions.

from datetime import datetime, timedelta
import time
from multiprocessing import Pool
import numpy as np, pandas as pd
import math, requests, html
import lxml, lxml.html
import concurrent.futures

Python class representing the stock’s market price history

First, you will want to create a class to hold the individual functions to scrape stock price history from Yahoo Finance. The primary arguments for input are the stock symbol, start date, and end date for the timeframe in which you want to acquire data.

class price_history:
    def __init__(self, symbol, start, end):
        '''
        :param symbol: example "AAPL" or "TRI.TO"
        :param start: Start date as datetime object
        :param end: End date as datetime object
        '''
        self.symbol = symbol.upper()
        self.start = start
        self.end = end
        self.hdrs = {"authority": "finance.yahoo.com",
                     "method": "GET",
                     "scheme": "https",
                     "accept": "text/html,application/xhtml+xml",
                     "accept-encoding": "gzip, deflate, br",
                     "accept-language": "en-US,en;q=0.9",
                     "cache-control": "no-cache",
                     "dnt": "1",
                     "pragma": "no-cache",
                     "sec-fetch-mode": "navigate",
                     "sec-fetch-site": "same-origin",
                     "sec-fetch-user": "?1",
                     "upgrade-insecure-requests": "1",
                     "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64)"}
        self.urls = self.__urls__()

Headers are important to avoid the website flagging you as a robot and blocking you from accessing the data. The last class attribute is self.urls, which will hold the list of URLs we create in a private function later in the article.

Yahoo Finance will only allow you to scrape up to 100 trading days' worth of price data at a time. Thus, if you are interested in a larger timeframe, you will need to divide it into 100-business-day intervals and integrate multithreading into the program to handle the larger number of URLs efficiently.
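
To get a feel for the numbers (a quick back-of-the-envelope check, not part of the class itself), a 10-year window contains roughly 2,600 business days, which rounds up to about 27 blocks of 100 trading days each:

import math
from datetime import datetime, timedelta
import numpy as np

end = datetime.today()
start = end - timedelta(days=365 * 10)

# count business days in the window, then round up to whole 100-day blocks
business_days = np.busday_count(np.datetime64(start, "D"), np.datetime64(end, "D"))
pages = math.ceil(business_days / 100)
print(business_days, pages)  # roughly 2600 business days -> about 27 blocks / URLs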

Creating the primary web scraping function

If you are familiar with web scraping, then you will recognize the following function to request data from a URL and convert it into a manageable form.

def __table__(self, url, hdrs):
    page = requests.get(url, headers=hdrs)
    tree = lxml.html.fromstring(page.content)
    table = tree.xpath('//table')
    string = lxml.etree.tostring(table[0], method='xml')
    data = pd.read_html(string)[0]
    return data

Next, you will need to clean the data by removing unnecessary rows from the data frame output. The last row is usually a note about adjusted close prices; you don’t need it for your data analysis, and it will only get in the way.

def __clean_history__(self, price_history):
    history = price_history.drop(len(price_history) - 1)
    history = history.set_index('Date')
    return history

After dropping the last row and setting the index to the date, we can combine these two functions into another public method. We will call this method later within the multithreading code to retrieve data from our list of multiple URLs.

def scrape_history(self, url):
    '''
    :param url: URL location of stock price history
    :return: price history
    '''
    symbol = self.symbol
    hdrs = self.hdrs
    price_history = self.__table__(url, hdrs)
    price_history = self.__clean_history__(price_history)
    return price_history
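
Once the full class, including the URL-building methods below, is in place, you can sanity-check scrape_history on a single URL before adding any multithreading (a usage sketch; the 90-day window is arbitrary and short enough to produce just one URL):

from datetime import datetime, timedelta

end = datetime.today()
start = end - timedelta(days=90)        # well under 100 trading days, so one URL
aapl = price_history('aapl', start, end)
df = aapl.scrape_history(aapl.urls[0])  # DataFrame of daily prices indexed by Date
print(df.head())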

How to obtain more than 100 trading days of price history

As mentioned previously, Yahoo Finance will only allow you to call 100 trading days' worth of data per URL. Thus, we will need to create a few functions to manage this impediment in our program.

The first step is to check if there are more than 100 business days within the provided time frame. To do this, we need to convert our datetime objects into numpy’s datetime64[D] format using the np.datetime64 function.

Next, we can use numpy's busday_count function to check whether there are more than 100 business days within a particular time frame.

def __check__(self, s, e):
    start = np.datetime64(s, "D")
    end = np.datetime64(e, "D")
    if np.busday_count(start, end) > 100:
        response = True
    else:
        response = False
    return response

After verifying that the timeframe contains more than 100 business days, we need to split it into the number of URLs we are going to scrape. The True or False response from the prior function acts as input to the calculate pages function.

def __calc_pages__(self, response):
    s, e = [self.start, self.end]
    if response:
        pages = math.ceil(np.busday_count(np.datetime64(s, "D"), np.datetime64(e, "D")) / 100)
    else:
        pages = 1
    return pages

We pass the number of pages into the calculate start function to get all the dates at which each new section of our original timeframe begins. These will be our new start dates. A generator can do this quickly.

def __calc_start__(self, pages, s, e):
    calendar_days = (e - s) / pages
    while pages > 0:
        yield s  # yield the current section's start date before advancing
        s = s + calendar_days
        pages -= 1

I integrated the calculate start generator into a separate function that combines its output into a list. I did this to keep each function as simple as possible.

def __starts__(self, pages, s, e):
    starts = []
    for s in self.__calc_start__(pages, s, e):
        if pages == 0:
            break
        starts.append(s)
    starts.append(e)
    return starts

A fifth function calls the check, calculate pages, and starts functions in the appropriate order. We leave out the calculate start function because it is called when the program runs the starts function.

def __getStarts__(self):
    response = self.__check__(self.start, self.end)
    pages = self.__calc_pages__(response)
    starts = self.__starts__(pages, self.start, self.end)
    return starts

Build list of URLs to web scrape

The URLs will not accept a datetime object as input; the resulting URL would not exist, and your code would break. Thus, you need to format the dates as strings of seconds since the epoch using the time.mktime function.

def __format_date__(self, date_datetime):
    date_timetuple = date_datetime.timetuple()
    date_mktime = time.mktime(date_timetuple)
    date_int = int(date_mktime)
    date_str = str(date_int)
    return date_str
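
As a quick sanity check (a usage sketch, not part of the class; the exact timestamp depends on your local timezone), you can confirm that the formatted string round-trips back to the original date:

from datetime import datetime
import time

date = datetime(2020, 1, 2)
timestamp = str(int(time.mktime(date.timetuple())))
print(timestamp)                               # a string of seconds since the epoch
print(datetime.fromtimestamp(int(timestamp)))  # back to 2020-01-02 00:00:00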

The URLs function uses the format date function to convert your datetime objects into the correct format for the URLs. It takes the start dates and builds a URL for each date segment.

def __urls__(self):
    '''
    Returns
    -------
    urls : a list of URLs complete with start and end dates for each 100 trading day block
    '''
    starts = self.__getStarts__()
    symbol = self.symbol
    urls = []
    for d in range(len(starts) - 1):
        start = str(self.__format_date__(starts[d]))
        end = str(self.__format_date__(starts[d + 1]))
        url = "https://finance.yahoo.com/quote/{0}/history?period1={1}&period2={2}&interval=1d&filter=history&frequency=1d"
        url = url.format(symbol, start, end)
        urls.append(url)
    return urls

Multiprocessing versus multithreading

Now we can run our program and compare the speed at which Python can run this code using multiprocessing and multithreading.

if __name__ == "__main__":
    start = datetime.today() - timedelta(days=365 * 10)
    end = datetime.today()
    aapl = price_history('aapl', start, end)
    urls = aapl.urls

    # Multiprocessing
    t0 = time.time()
    p = Pool()
    history_pool = p.map(aapl.scrape_history, urls)
    t1 = time.time()

    print(f"{t1 - t0} seconds to download {len(urls)} urls.")

    # Multithreading
    t0 = time.time()
    # sets the number of threads to the lesser of 30 or the number of urls
    threads = min(30, len(urls))
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        history = list(executor.map(aapl.scrape_history, urls))
    t1 = time.time()

    print(f"{t1 - t0} seconds to download {len(urls)} urls.")

    history_concat = pd.concat(history)
    history_concat = history_concat[~history_concat.Open.str.contains("Dividend")]

If you run this code in PyCharm, the console prints the elapsed time for each approach.

Note: on macOS, you will need to set the environment variable OBJC_DISABLE_INITIALIZE_FORK_SAFETY to YES, for example with os.environ['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'] = 'YES'.

As you can see, you can save over 80% of your processing time by using multithreading versus multiprocessing. While multiprocessing certainly has a place in computer programming, some simple multithreading code is all that you need to complete jobs of moderate size much faster.