Below is the proposed methodology of building an Online Reputation Manager and the implementation of the project using Natural Language Processing.
A. Extraction of Tweets
Tweets are extracted using the Twitter API offered by Twitter’s developer platform. Endpoints included in the “Tweets and Users” preview allow developers to request Tweet and user information from the Twitter API. The GET /users endpoint provides developers with information about a Twitter user in the form of hydrated user objects. A hydrated user object contains public Twitter account metadata such as name, description, location, and more, which is returned in the payload. A payload is the data returned from a request to the API.
The extracted information is then saved into a dataframe using Pandas for further processing and analysis.
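As a minimal sketch, the step above might look like the following. The actual request requires an authenticated HTTP call with a bearer token, which is omitted here; the sample payload mimics the shape of a Twitter API v2 GET /users response, and the specific user fields shown are illustrative.

```python
import pandas as pd

# Sample payload shaped like a Twitter API v2 GET /users response.
# In practice this dict would come from an authenticated API request;
# the user values below are illustrative sample data.
payload = {
    "data": [
        {
            "id": "2244994945",
            "name": "Twitter Dev",
            "username": "TwitterDev",
            "description": "The voice of the Twitter Dev team",
            "location": "Internet",
        },
        {
            "id": "123456789",
            "name": "Jane Doe",
            "username": "janedoe",
            "description": "Sample account for illustration",
            "location": "London",
        },
    ]
}

def users_to_dataframe(payload):
    """Flatten the hydrated user objects in a payload into a DataFrame."""
    return pd.DataFrame(payload["data"])

df = users_to_dataframe(payload)
```

The resulting DataFrame has one row per hydrated user object, ready for further processing with Pandas.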
B. Extraction of YouTube comments
YouTube is the world’s largest video-sharing site with about 1.9 billion monthly active users. People use it to share info, teach, entertain, advertise and much more.
Thus, YouTube comments have some pivotal data that one can utilize to carry out research and analysis of reviews and feedback of consumers on various companies, services and products. YouTube comments are extracted using the YouTube API, which enables one to search for videos matching specific search criteria.
The extracted comments are stored in a CSV file with columns such as username, description, and timestamp.
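A sketch of this flattening step is shown below. A real call would go through the YouTube Data API (e.g. the `commentThreads` endpoint via `google-api-python-client` with an API key); here a sample response dict stands in for the API result, and only the fields actually used are shown.

```python
import csv

# Sample response shaped like a YouTube Data API v3 commentThreads result.
# A real response comes from an authenticated API call; the comment
# values below are illustrative sample data.
response = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {
            "authorDisplayName": "alice",
            "textDisplay": "Great product, highly recommend!",
            "publishedAt": "2020-01-15T10:00:00Z",
        }}}},
        {"snippet": {"topLevelComment": {"snippet": {
            "authorDisplayName": "bob",
            "textDisplay": "Support was slow to respond.",
            "publishedAt": "2020-01-16T08:30:00Z",
        }}}},
    ]
}

def comments_to_rows(response):
    """Flatten top-level comments into flat dicts for CSV export."""
    rows = []
    for item in response["items"]:
        s = item["snippet"]["topLevelComment"]["snippet"]
        rows.append({
            "username": s["authorDisplayName"],
            "comment": s["textDisplay"],
            "timestamp": s["publishedAt"],
        })
    return rows

rows = comments_to_rows(response)
with open("comments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["username", "comment", "timestamp"])
    writer.writeheader()
    writer.writerows(rows)
```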
C. Extraction of News from Online Platforms
News circulated by various journalists and published by different newspapers provides vital first-hand information. It is extracted with the help of News API, a simple HTTP REST API for searching and retrieving live articles from all over the web.
One can search for articles with any combination of the following criteria:
● Keyword or phrase
● Date published
● Source name
● Source domain name
One can sort the results in the following orders:
● Date published
● Relevancy to the search keyword
● Popularity of source
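The search and sort criteria above map onto query parameters of News API's `/v2/everything` endpoint. The sketch below only builds the request URL; issuing the request (with a real API key) is left out, and the placeholder key and query values are illustrative.

```python
from urllib.parse import urlencode

def build_news_url(api_key, q=None, from_date=None, to_date=None,
                   sources=None, domains=None, sort_by="publishedAt"):
    """Build a News API /v2/everything request URL.

    sort_by accepts "publishedAt" (date published), "relevancy"
    (relevance to the search keyword), or "popularity" (source popularity).
    """
    params = {"apiKey": api_key, "sortBy": sort_by}
    if q:
        params["q"] = q          # keyword or phrase
    if from_date:
        params["from"] = from_date
    if to_date:
        params["to"] = to_date
    if sources:
        params["sources"] = sources   # source name(s)
    if domains:
        params["domains"] = domains   # source domain name(s)
    return "https://newsapi.org/v2/everything?" + urlencode(params)

# "YOUR_API_KEY" and the query are placeholders for illustration.
url = build_news_url("YOUR_API_KEY", q="Tesla",
                     from_date="2020-01-01", sort_by="relevancy")
```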
D. Sentiment Analysis using VADER (SentimentIntensityAnalyzer)
VADER (Valence Aware Dictionary and Sentiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
VADER uses a sentiment lexicon, a list of lexical features (e.g. words) that are generally labelled according to their semantic orientation as either positive or negative.
VADER has been found to be quite successful since it not only classifies a sentence as positive or negative but also tells us how positive or negative a sentiment is via a compound score. The positive, neutral and negative scores represent the proportions of the text that fall into these categories, while the compound score is a normalized sum of all the lexicon ratings, where +1 represents most positive and -1 represents most negative.
If the compound score is between -0.2 and 0.2, the sentiment is classified as neutral; if it is lower than -0.2, the sentiment is negative; and if it is greater than 0.2, it is positive.
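The thresholding rule above can be written as a small helper. The ±0.2 cut-offs are the ones chosen in this project (VADER's own documentation suggests ±0.05); obtaining the compound score itself would use NLTK's `SentimentIntensityAnalyzer`, shown only as a comment here.

```python
# In the full pipeline the compound score would come from NLTK's VADER:
#   from nltk.sentiment import SentimentIntensityAnalyzer
#   sia = SentimentIntensityAnalyzer()
#   compound = sia.polarity_scores(text)["compound"]

def label_from_compound(compound):
    """Map a VADER compound score to a label using the +/-0.2 cut-offs."""
    if compound > 0.2:
        return "positive"
    if compound < -0.2:
        return "negative"
    return "neutral"
```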
The VADER sentiment analysis library allows us to easily implement sentiment analysis that operates at almost real-time speed.
VADER has a lot of advantages over traditional methods of Sentiment Analysis, including:
● It works exceedingly well on social media type text, yet readily generalizes to multiple domains.
● It doesn’t require any training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon.
● It is fast enough to be used online with streaming data.
● It does not severely suffer from a speed-performance trade-off.
To refine the results further, the scores can be normalized considering factors like the impact of the number of followers on the spread of the comment and review on the internet.
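One possible weighting scheme for this normalization, not prescribed by the article, is to average compound scores with a logarithmic follower weight, so that widely seen posts count for more without linearly dominating the aggregate. Both the log base and the offset below are illustrative assumptions.

```python
import math

def weighted_mean_sentiment(posts):
    """Aggregate (compound_score, follower_count) pairs into one score.

    Weight = log10(followers + 10), an illustrative choice: more
    followers means more spread, but the influence grows sub-linearly.
    """
    weights = [math.log10(followers + 10) for _, followers in posts]
    total = sum(weights)
    return sum(score * w for (score, _), w in zip(posts, weights)) / total

# A positive review from a large account outweighs a negative one
# from a small account (sample numbers for illustration).
score = weighted_mean_sentiment([(0.8, 1_000_000), (-0.8, 10)])
```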
E. Sentiment Analysis using LSTM
LSTM, which stands for Long Short-Term Memory, is a powerful recurrent neural network architecture. The model is built and trained using this architecture.
The data is cleaned by removing URLs, @mentions, and punctuation. Stop words may or may not be removed; here, they were kept.
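A minimal sketch of this cleaning step, using regular expressions (the exact patterns are an assumption; stop words are deliberately kept, as above):

```python
import re

def clean_tweet(text):
    """Remove URLs, @mentions, and punctuation; keep stop words."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # strip URLs
    text = re.sub(r"@\w+", "", text)                   # strip @mentions
    text = re.sub(r"[^\w\s]", "", text)                # strip punctuation
    return re.sub(r"\s+", " ", text).strip().lower()   # tidy whitespace

cleaned = clean_tweet("Loved it! Thanks @TwitterDev https://t.co/abc")
```

Note that URLs are stripped before punctuation, since removing punctuation first would break the `://` pattern the URL regex relies on.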
The model is trained on pre-processed labelled Twitter data (tweets) using pre-trained GloVe embeddings. GloVe provides embeddings for 400,000 words, which were imported directly, and the embedding layer was set to non-trainable.
Below is the detailed explanation of the model architecture:
- Input Layer — Each tweet is padded and made equal to the maximum length of all tweets, which is 45.
- Embedding Layer — Each input is converted into embeddings using the pre-trained GloVe vectors, with 50 features per word, so the output shape of the embedding layer is (None, 45, 50).
- Two LSTM layers each with 128 units are added and a dropout layer has been added to each of them to prevent overfitting. The dropout probability has been set to 0.5.
- Finally, a dense output layer of 1 unit with sigmoid activation has been added for the binary classification problem.
The model is trained with the Adam optimizer and binary cross-entropy loss. Fitting is done over 50,000 training examples in batches of 32 for 50 epochs, and an accuracy of 86% is obtained on the test set.
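The architecture and training setup described above might be sketched in Keras as follows. Loading the 400,000 GloVe vectors into the embedding matrix is omitted (it would be set as the layer's weights before training), and the `fit` call is commented out since it assumes prepared training arrays.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, VOCAB_SIZE, EMB_DIM = 45, 400_000, 50  # per the text above

inputs = layers.Input(shape=(MAX_LEN,))
# GloVe weights would be loaded into this layer; it stays non-trainable.
x = layers.Embedding(VOCAB_SIZE, EMB_DIM, trainable=False)(inputs)
x = layers.LSTM(128, return_sequences=True)(x)
x = layers.Dropout(0.5)(x)        # dropout after each LSTM layer
x = layers.LSTM(128)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# model.fit(X_train, y_train, batch_size=32, epochs=50)  # ~50,000 examples

# Sanity check: one padded tweet in, one probability out.
probs = model(np.zeros((1, MAX_LEN)))
```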
The saved model is then used for the analysis of the extracted data.