The relation between Covid-19 News Articles and Stock Exchange Prices

Original article can be found here (source): Artificial Intelligence on Medium

A machine learning approach to understanding the relation between news media articles and the downfall of stock exchanges: panic or information?

A pedestrian walking past a display of global stock market results in Tokyo last week. Credit: Kimimasa Mayama/EPA, via Shutterstock

Covid-19 is undoubtedly a cruel virus, and we have seen it rip families apart around the globe. At the time of writing this article, about 470,000 people have been infected by this virus on every continent except Antarctica. So this coronavirus should be treated with caution and respect. However, if you compare the current daily statistics of Covid-19 infections to the population of the world, you will find that the probability that any one of us will catch the virus today is very small.

According to the official WHO data, about 60 out of every 1 million people have hosted the virus so far. This is a very small number, one that a statistician would call an outlier. So should we really take all the precautions that governments are asking us to take, like hand washing, sanitizing, and keeping social distance? Certainly yes: even if the probability of getting sick is small, you are not special, it can infect you, and you can transmit it to others. You also do not want to cripple a health care system that is already overburdened, and you want to stay healthy (for a change). But do we really need to panic? No, right?

I believe that all of you reading this article know that we don't need to panic, and yet we're seeing empty shelves in our supermarkets. So why are those bulk buyers panicking, and how is the panic propagating?

Let me think, what do I see when I turn on any news channel or read a news report today? Firstly, I see news about coronavirus and …, well, there is no ‘and’, I only see news about the coronavirus!

Much of the news these days is disingenuous reporting, with sensational claims and flashing, scaremongering headlines clearly designed to attract your attention and clicks. Several media outlets are capitalizing on our fear of losing our dear ones and neglecting all other news, which directly or indirectly propagates panic and hysteria. Such panic is not good for our psychological and economic state, and we can already see its effect in the crashing stock markets around the world.

Several studies have shown that stock markets are directly affected by everyday news (e.g. Zhou et al. 2018; Hiransha et al. 2018). In this article, I'll show predictions of stock prices made from news articles scraped for January, February, and March of 2020. The following are some of the tasks performed for these predictions:

  1. Pulling all news data for all countries and filtering for articles related to Covid-19.
  2. Combining the news data for January, February, and March and scraping the full articles using the URLs in the data.
  3. Applying coreference resolution to the text, manually labeling economic and non-economic articles, and training a random forest/logistic regression model to classify all articles.
  4. Downloading stock exchange data.
  5. Building a neural network to predict stock prices from news articles.

Task 1

For Task 1, I downloaded GDELT data. The entire GDELT database is 100% free and open. Following is the code used to download news data for each day. For this analysis, I used English-only news articles.

Next, I searched for news headlines that contain words related to the coronavirus. For instance, I used the following keywords:

relevant_words = ['corona', 'coronavirus', 'wuhan', 'hubei', 'virus', 'quarantine']

For the month of March, the number of articles per day with these keywords ranges between 88,356 (08-03-2020) and 178,823 (04-03-2020). This number was just 12,317 on 01-02-2020. (Dates are in DD-MM-YYYY format.)

That is a huge rise, and it counts English-only articles alone.

Note that these counts include only articles whose headlines mention the above keywords. There can be several more keywords (e.g. I missed Covid-19), and some articles may talk about the coronavirus in the text but not in the headline.
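The headline filter described above can be sketched roughly as follows. This is a minimal illustration rather than the original code: the function name `is_covid_headline` and the sample headlines are hypothetical, and a simple case-insensitive substring match is assumed.

```python
# Keywords taken from the article; matching is case-insensitive.
RELEVANT_WORDS = ['corona', 'coronavirus', 'wuhan', 'hubei', 'virus', 'quarantine']


def is_covid_headline(headline, words=RELEVANT_WORDS):
    """Return True if any keyword occurs in the headline (substring match)."""
    text = headline.lower()
    return any(word in text for word in words)


# Hypothetical sample headlines, for illustration only.
headlines = [
    "Wuhan lockdown enters second week",
    "Local team wins championship",
    "Covid-19 cases rise in Europe",
]
matched = [h for h in headlines if is_covid_headline(h)]
print(matched)
```

With the keyword list as given, the third headline is not matched, which is exactly the kind of miss noted above. Passing an extended list, for example `RELEVANT_WORDS + ['covid']`, would catch it.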

Task 2

The GDELT data already contains the URL of each article. I used news-please to scrape the full text of these articles using the following script. To make it easy for myself, I combined all Covid-19 articles for each day over the past 3 months into one file.
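The original script is not reproduced here, so what follows is only a minimal sketch of how such a scraping loop might look. The helper name `scrape_articles` and its `fetch` parameter are hypothetical. By default it relies on `NewsPlease.from_url` from the news-please package, whose article objects expose the extracted body as `maintext`.

```python
def scrape_articles(urls, fetch=None):
    """Return {url: main text} for every URL that could be scraped.

    `fetch` is any callable mapping a URL to an object with a `maintext`
    attribute. It defaults to news-please's NewsPlease.from_url.
    """
    if fetch is None:
        # Imported lazily so the helper can be exercised without news-please.
        from newsplease import NewsPlease
        fetch = NewsPlease.from_url

    texts = {}
    for url in urls:
        try:
            article = fetch(url)
        except Exception:
            # Dead links and timeouts are common in large URL lists; skip them.
            continue
        maintext = getattr(article, "maintext", None)
        if maintext:
            # Keep only articles where a body text was actually extracted.
            texts[url] = maintext
    return texts
```

Passing the fetcher in as an argument keeps the loop independent of the network, which makes it easy to try out on a handful of URLs before running it over a full day of data.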

Task 3

Next, I applied coreference resolution to the text of the news articles. As the name suggests, this is the task of locating all expressions that refer to the same entity in a text. In order to understand the sentiment behind an article and classify it as economic or non-economic, it is important to replace all pronouns like he, his, her, she, them, their, us, etc. with the nouns to which they refer.

Modern Natural Language Processing (NLP) techniques like neural networks allow us to do this job easily by training a model on a coreference-annotated dataset and using the trained model to perform coreference resolution for all articles. Even better, there are tools available that are already trained on such huge datasets, and we can just use them to resolve the text of our news articles. One such tool is NeuralCoref, a pipeline extension for spaCy which annotates and resolves coreferences using a neural network.

Following is the working code for co-referencing.
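The original snippet is not reproduced here. As a rough sketch only, NeuralCoref is typically attached to a spaCy pipeline with `neuralcoref.add_to_pipe`, after which the resolved text is available as `doc._.coref_resolved`. The function names and the model name `en_core_web_sm` below are assumptions, and NeuralCoref is assumed to be installed alongside a compatible spaCy 2.x release.

```python
def build_coref_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy model and attach NeuralCoref to it (model name is an assumption)."""
    import spacy
    import neuralcoref

    nlp = spacy.load(model_name)
    neuralcoref.add_to_pipe(nlp)
    return nlp


def resolve_coreferences(text, nlp):
    """Return the text with coreferring mentions replaced by their main mention."""
    doc = nlp(text)
    # NeuralCoref exposes its results on the doc._ extension namespace.
    return doc._.coref_resolved if doc._.has_coref else text


# Example usage (requires spaCy, a model, and NeuralCoref to be installed):
# nlp = build_coref_pipeline()
# resolved = resolve_coreferences("My sister has a dog. She loves him.", nlp)
```

Building the pipeline once and passing it to `resolve_coreferences` avoids reloading the model for every article.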

After labeling some of these articles manually, I used classification algorithms, namely random forest classifiers and logistic regression, to categorize the articles into economic and non-economic ones.

First, the coreference-resolved text is cleaned using the following clean_text() function:

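import nltk
from nltk.corpus import stopwords
# assumed import: the original snippet omits it; text_to_word_sequence is a Keras utility
from keras.preprocessing.text import text_to_word_sequence

# stop words
stopw = set(stopwords.words('english'))
snow = nltk.stem.SnowballStemmer('english')
# keep words like "not" and "very", which carry sentiment, out of the stop words
reqd_words = {'only', 'very', "doesn't", 'few', 'not'}
stopw = stopw - reqd_words

# text cleaning
def clean_text(article):
    """Lowercase and tokenize, drop stop words and very short tokens, then stem."""
    cleaned_article = []
    cleaned_words_list = text_to_word_sequence(article)
    for word in cleaned_words_list:
        if word not in stopw and len(word) > 2:
            cleaned_article.append(snow.stem(word))
    return ' '.join(cleaned_article)

final_df['stemmed_articles'] = final_df.text_coref.apply(clean_text)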

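The cleaned text is then converted to vectors using TF-IDF over unigrams and bigrams as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

# convert the cleaned text into TF-IDF vectors (unigrams and bigrams)
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_features=10000)
# fit the vocabulary on the training data only, then reuse it for the test data
tfidf_xtrain_vect = tfidf.fit_transform(train_df.stemmed_articles)
tfidf_xtest_vect = tfidf.transform(test_df.stemmed_articles)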

And the model is trained using a grid search over both classifiers:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def best_model(x_train, y_train, x_test, y_test):
    # single-step pipeline; the grid swaps in the classifier itself
    pipe = Pipeline([('classifier', RandomForestClassifier())])
    param_grid = [
        {'classifier': [LogisticRegression()],
         'classifier__penalty': ['l1', 'l2'],
         # inverse_lambda (the candidate C values) is not shown in this excerpt
         'classifier__C': inverse_lambda,
         'classifier__class_weight': [None, 'balanced'],
         'classifier__solver': ['liblinear']},
        {'classifier': [RandomForestClassifier()],
         'classifier__n_estimators': list(range(10, 300, 10)),
         'classifier__max_features': list(range(6, 32, 5))}
    ]
    clf = GridSearchCV(pipe, param_grid=param_grid, cv=3, verbose=True, n_jobs=-1)
    clf.fit(x_train, y_train)
    print(f'best estimator is {clf.best_estimator_}')
    # best_estimator_ is already refit on the full training data with the
    # winning hyperparameters, so no second fit is needed
    best_clf_model = clf.best_estimator_.named_steps['classifier']
    test_predicts = best_clf_model.predict(x_test)
    cv_cm = pd.crosstab(y_test, test_predicts, rownames=["True Label"], colnames=["predicted label"])
    print("confusion matrix on test data is:")
    print(cv_cm)
    print(" ")
    print("classification report on test data is")
    print(classification_report(y_true=y_test, y_pred=test_predicts))
    return best_clf_model

The full code for this is in the following gist.

The model produces the following results on a test dataset:

confusion matrix on test data is:
predicted label  NEGATIVE  POSITIVE
True Label
NEGATIVE              405        12
POSITIVE               16       381

classification report on test data is
              precision    recall  f1-score   support
    NEGATIVE       0.96      0.97      0.97       417
    POSITIVE       0.97      0.96      0.96       397
    accuracy                           0.97       814
   macro avg       0.97      0.97      0.97       814
weighted avg       0.97      0.97      0.97       814

With this trained model, I find that approximately 20% of the news articles report economic news related to the coronavirus.

Task 4

Next, I downloaded the stock market data using the Alpha Vantage API. This requires a private API key, which one gets by submitting credit card information, but the first 500 calls are free. I used the following Alpha Vantage code to get the stock data for some of the stock exchanges.

The following plot shows the normalized closing prices of the stocks.

Stock exchange normalized prices downloaded with Alpha Vantage
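The article does not say which normalization was applied to the closing prices. One common choice, assumed here purely for illustration, is min-max scaling, which maps each exchange's prices to the range [0, 1] so that indices trading at very different levels can share one plot. The function name and the sample prices are hypothetical.

```python
def min_max_normalize(prices):
    """Scale a sequence of prices linearly to the range [0, 1]."""
    lo, hi = min(prices), max(prices)
    if hi == lo:
        # A flat series has no range to scale; map every value to 0.0.
        return [0.0 for _ in prices]
    return [(p - lo) / (hi - lo) for p in prices]


# Hypothetical closing prices, for illustration only.
closing = [100.0, 110.0, 90.0, 80.0]
normalized = min_max_normalize(closing)
print(normalized)
```

After scaling, the highest close in the period maps to 1 and the lowest to 0, so the plot compares the shape of each decline rather than absolute index levels.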

Task 5

With news articles and stock market data in hand, I used a neural network with LSTM layers to predict stock prices. For this, I adapted the methodology discussed here. I paired each day's stock exchange prices with the news articles published on the same date. The data are shuffled and split equally between the training+validation and test sets. Thus, half of the stock predictions shown below are for news articles and stock prices used in training, and half come from the independent test set. My aim is to visualize any correlation between the predicted and actual prices.
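The same-date pairing and the equal, shuffled split described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `align_and_split`, the dictionary inputs keyed by date string, and the fixed random seed are all hypothetical, and the network itself is not shown.

```python
import random


def align_and_split(news_by_date, close_by_date, seed=0):
    """Pair each day's articles with that day's close, shuffle, and split in half."""
    # Keep only dates that have both articles and a closing price
    # (exchanges are closed on weekends and holidays).
    dates = sorted(set(news_by_date) & set(close_by_date))
    samples = [(d, news_by_date[d], close_by_date[d]) for d in dates]
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(samples)
    half = len(samples) // 2
    return samples[:half], samples[half:]


# Hypothetical inputs, for illustration only.
news = {"2020-03-02": ["a"], "2020-03-03": ["b"], "2020-03-07": ["c"],
        "2020-03-04": ["d"], "2020-03-05": ["e"]}
close = {"2020-03-02": 1.0, "2020-03-03": 2.0, "2020-03-04": 3.0, "2020-03-05": 4.0}
train_val, held_out = align_and_split(news, close)
```

Because the samples are shuffled before splitting, training and test days are interleaved in time, which is why the plots below mix predictions for both sets along the same date axis.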

I will discuss the adapted version of the network in more detail in a future post and here I will present the model predictions for some of the stock exchanges.

I) New York Stock Exchange (NYSE) closing prices from 1 January 2020 to 20 March 2020. The green line shows the actual prices and the blue line shows the prices predicted from the news articles.

NYSE closing prices (green) and predicted prices (blue).

II) Same as above but for the Hong Kong Stock Exchange (HKSE).

HKSE closing prices (green) and predicted prices (blue).

III) For Australian Securities Exchange (ASX)

ASX closing prices (green) and predicted prices (blue).

IV) For Bombay Stock Exchange (BSE)

BSE closing prices (green) and predicted prices (blue).

All these stock predictions from the news article data show a correlation between the news and stock prices. The correlations are not very strong on a day-by-day basis, as stock exchange prices depend on several other factors. However, the downward trend of the predictions from news articles is similar to the actual trend.
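The article judges the correlation visually from the plots. One way to put a number on it, offered here only as a sketch and not as part of the original analysis, is the Pearson correlation coefficient between the predicted and actual closing prices. The function name and the sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical actual and predicted prices, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(actual, predicted)
```

A value near 1 would indicate that predictions rise and fall with the actual prices, while a value near 0 would indicate no linear relation. Computing it separately for the training and test halves would show how much of the agreement survives on unseen days.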

What should we do?

On a personal level, I think we need to calm down and keep working. Follow all the precautions. Think twice and cross-check before believing any news that spreads panic. There is no need to check the number of coronavirus cases every hour or to bring it up in every discussion. Possibly, we need to stop watching and reading the news about the coronavirus altogether; to stay informed, we can always turn to the official platforms developed by each country's government.

Remember, feelings like fear and panic are contagious, probably much more so than Covid-19.