Original article was published on Artificial Intelligence on Medium
Perfect! All 542,591 tweets have now been combined into one dataframe. As expected though, there are many problems with the data types and format. In addition, the dataset includes some columns that are of no interest in this project.
To start with, I will sort the tweets in ascending chronological order.
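A minimal sketch of the sort, assuming the combined dataframe is called `df` and stores its timestamps as strings in a `date` column. The sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the combined tweet dataframe.
df = pd.DataFrame({
    "date": ["2019-01-02 09:15:00", "2019-01-01 18:30:00", "2019-01-01 07:05:00"],
    "text": ["tweet c", "tweet b", "tweet a"],
})

# Parse the strings first so the sort is chronological, not lexicographic.
df["date"] = pd.to_datetime(df["date"])

# Oldest tweet first; reset the index so it follows the new order.
df = df.sort_values("date", ascending=True).reset_index(drop=True)
```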
Now that the tweets have been sorted appropriately, the non-relevant columns must be dropped. Upon inspection of the different features, the ones that are deemed to be of value are “date” and “text”.
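Dropping the unused columns could look roughly like this. Only “date” and “text” come from the article; the other column names are placeholders for whatever the raw export contains.

```python
import pandas as pd

# Hypothetical raw columns; the real export may contain different fields.
df = pd.DataFrame({
    "id": [1, 2],
    "date": ["2019-01-01", "2019-01-02"],
    "username": ["a", "b"],
    "text": ["first tweet", "second tweet"],
    "retweets": [3, 0],
})

# Keep only the two columns used in the rest of the analysis.
df = df[["date", "text"]].copy()
```

Selecting the columns to keep, rather than listing the ones to drop, means the code does not break if the export gains extra fields.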
As you may have noticed in the figure above, there are multiple rows corresponding to the same day. This is expected, as there are around 300–500 tweets for each day of the year. The best approach I came up with is to combine all of the tweets belonging to a certain date into one big string, so that sentiment analysis can be run once per day. Another issue that has to be addressed is the existence of special characters in the tweets, such as emojis and hashtags. Although they are common in tweets, they present an obstacle for successful sentiment analysis.
As you can see, the data is now sorted by date in ascending order, there is one entry for each date, and all special characters have been removed. (It is important that the date column is converted into Python “datetime” format.)
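Both steps can be sketched as follows. The cleaning rule (keep ASCII letters, digits and spaces) and the helper name `clean_tweet` are assumptions for illustration; the article does not say exactly which characters were stripped.

```python
import re
import pandas as pd

def clean_tweet(text: str) -> str:
    """Replace everything except ASCII letters, digits and spaces, then collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Made-up tweets: two on the first day, one on the second.
df = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-02"],
    "text": ["Tesla to the moon 🚀 #TSLA", "Not sure about $TSLA...", "Model 3 looks great!"],
})

df["text"] = df["text"].map(clean_tweet)

# One row per date, with that day's tweets joined into a single string.
daily = df.groupby("date")["text"].agg(" ".join).reset_index()
```

Note that this rule removes the `#` symbol but keeps the hashtag word itself. It also strips punctuation and capital-letter emphasis such as “!!!”, which some sentiment lexicons use as signals, so a gentler rule may be worth testing.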
Adding Tesla historical stock prices
Importing the Tesla historical-price dataset is now in order. After some tweaking of the dataset, a new column called “Price” will be added to “df”. The goal is to merge the two datasets so that each row holds both the closing price and the tweets of that day (the newly created column is simply a copy of the “Close” column of the Tesla historical prices dataset).
The end-goal is to be able to predict the price according to the general sentiment of the previous day. A new dataframe will thus be created, which is going to contain the:
The default VADER lexicon provided by NLTK will be used (custom lexicons with economic keywords can also be used).
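One possible way to attach the closing price is a merge on the date. The frame name `prices` and its “Date” column are assumptions, and the numbers are invented; only “Close”, “Price” and “df” come from the text.

```python
import pandas as pd

# Hypothetical daily tweet frame: one row per date.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-05"]),
    "text": ["tweets of jan 2", "tweets of jan 3", "tweets of jan 5"],
})

# Hypothetical price frame with invented values.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04"]),
    "Close": [100.0, 101.5, 99.0],
})

# Copy "Close" into a "Price" column keyed by the same "date" name as df.
prices = prices.rename(columns={"Date": "date", "Close": "Price"})[["date", "Price"]]

# Inner join: keep only days that have both tweets and a closing price.
df = df.merge(prices, on="date", how="inner")
```

An inner join silently drops weekends and market holidays, since they have tweets but no closing price. Both date columns need the same dtype for the keys to match.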
(The scores assigned to the “Comp”, “Negative”, “Neutral”, and “Positive” columns are numerical indicators used by the model to identify the sentiment of the tweet when it was written.)
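A sketch of how those four columns could be produced with NLTK's VADER analyzer. The helper names `load_vader_scorer` and `add_sentiment_columns` are hypothetical; `SentimentIntensityAnalyzer.polarity_scores` returns a dict with the keys `neg`, `neu`, `pos` and `compound`, which are renamed here to match the column names above.

```python
import pandas as pd

def load_vader_scorer():
    """Return VADER's scoring function, downloading the lexicon if it is missing."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer().polarity_scores

def add_sentiment_columns(df, score_fn, text_col="text"):
    """Append Comp/Negative/Neutral/Positive columns computed from text_col."""
    # score_fn maps a string to a dict with neg/neu/pos/compound keys.
    scores = pd.DataFrame(list(df[text_col].map(score_fn)), index=df.index)
    scores = scores.rename(columns={
        "compound": "Comp",
        "neg": "Negative",
        "neu": "Neutral",
        "pos": "Positive",
    })
    return df.join(scores[["Comp", "Negative", "Neutral", "Positive"]])
```

It would be called as `df = add_sentiment_columns(df, load_vader_scorer())`. Passing the scorer in as an argument makes it easy to swap in a custom lexicon later.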
With the final dataframe set up, it is time to create the training and testing sets. I will assign 80% of the data to training and the remaining 20% to testing.
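A minimal sketch of an 80/20 split, assuming the rows are already in date order and that the split is chronological. The article does not say whether the data was shuffled, so this is one reasonable choice rather than the method actually used.

```python
import pandas as pd

# Made-up frame with ten days of data, already sorted by date.
df = pd.DataFrame({
    "Comp": [0.1 * i for i in range(10)],
    "Price": [100.0 + i for i in range(10)],
})

# First 80% of the rows for training, the most recent 20% for testing.
split_at = int(len(df) * 0.8)
train, test = df.iloc[:split_at], df.iloc[split_at:]
```

For time-ordered data, an unshuffled split keeps future prices out of the training set. scikit-learn's `train_test_split(..., test_size=0.2, shuffle=False)` gives a comparable chronological split.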
Training a model is now in order. As I do not yet know which Machine Learning technique is best suited for this problem, I will be testing a plethora of models with the intention of evaluating which one is the most accurate. At this point, the goal is to reach an accuracy of 40%. The main techniques tested are:
- Decision Trees
- Random Forest
- Logistic Regression
- Artificial Neural Networks
After much testing, I came to the conclusion that the best possible course of action would be to utilize the benefits provided by Ensemble Learning (combining different machine learning techniques to reach more accurate results).
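To illustrate the idea, here is a small sketch using scikit-learn's `VotingRegressor`, which averages the predictions of several regressors. The article does not say which ensemble method or base models were used, so the estimators, hyperparameters and synthetic data below are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for four sentiment features and a price target.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 4))
y = 300.0 + 20.0 * X[:, 0] + rng.normal(0.0, 1.0, size=200)

# Unshuffled 80/20 split.
split_at = int(len(X) * 0.8)
X_train, X_test = X[:split_at], X[split_at:]
y_train, y_test = y[:split_at], y[split_at:]

# Average the predictions of three different regressors.
ensemble = VotingRegressor(estimators=[
    ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("linear", LinearRegression()),
])
ensemble.fit(X_train, y_train)

predictions = ensemble.predict(X_test)
r2 = ensemble.score(X_test, y_test)  # R^2 for regressors, not classification accuracy
```

Regressors are used here because the target is a price. For a buy/sell/hold label, `VotingClassifier` would be the analogous choice.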
Not only was it more accurate, but it enabled me to reach a staggering accuracy of 82.39%!
By itself, this is not the most reliable indicator of a model’s accuracy. When making the predictions, I created smaller copies of the initial dataframe, each consisting of 73 days of predictions, to make the data easier to visualize.
In order to validate the accuracy of the real-time results, I will be manually creating a confusion matrix for one of the prediction dataframes.
The image above depicts the predicted closing price and the actual closing price from January 1st, 2019 to March 14th, 2019.
In order for the prediction to be considered successful, one of the following must be true:
- The predicted price of the next day was correctly identified as a “buy” (Predicted(n+1) > Predicted(n) && Actual(n+1) > Actual(n)).
- The predicted price of the next day was correctly identified as a “sell” (Predicted(n+1) < Predicted(n) && Actual(n+1) < Actual(n)).
- The predicted price of the next day was correctly identified as a “hold” (Predicted(n+1) == Predicted(n) && Actual(n+1) == Actual(n)).
This can be easily emulated by performing the following:
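One way to express the three checks, as a sketch rather than the exact code used for the article. The function names and the sample prices are hypothetical.

```python
from collections import Counter

def signal(prev: float, curr: float) -> str:
    """Classify the move from one day to the next."""
    if curr > prev:
        return "buy"
    if curr < prev:
        return "sell"
    return "hold"

def direction_confusion(predicted, actual):
    """Count (actual_signal, predicted_signal) pairs over consecutive days."""
    matrix = Counter()
    for n in range(len(predicted) - 1):
        p = signal(predicted[n], predicted[n + 1])
        a = signal(actual[n], actual[n + 1])
        matrix[(a, p)] += 1
    return matrix

def direction_accuracy(matrix) -> float:
    """Share of day-to-day moves where predicted and actual signals agree."""
    total = sum(matrix.values())
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / total if total else 0.0

# Made-up prices for five days, giving four day-to-day moves.
predicted = [10.0, 11.0, 10.5, 10.5, 12.0]
actual = [10.0, 10.8, 11.0, 11.0, 11.5]

matrix = direction_confusion(predicted, actual)
accuracy = direction_accuracy(matrix)  # 3 of 4 moves agree -> 0.75
```

With real floating-point prices, exact equality almost never occurs, so the “hold” case will rarely trigger. Comparing against a small tolerance (for example, treating moves under a chosen threshold as “hold”) may be more useful in practice.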