Original article was published on Deep Learning on Medium
A basic walkthrough of the Deep learning models and various pre-processing techniques used for Natural Language Processing.
Ever wondered about the amount of data being processed daily, and the automated process which goes behind it?
We have all heard that ‘Data’ is the most important resource on the planet at the moment, yet we rarely stop to think about how data could be more valuable than the oil that has powered mankind for centuries. Data has been transforming people’s lives, organizations’ operations, and companies’ efficacy. This is mainly due to the insights it provides into how resources were used in the past, which in turn helps improve the performance of both employees and the technology being utilized.
Information is the ‘Oil’ of the Twenty-first century and analytics is the combustion engine. ~ Peter Sondergaard
According to SiliconANGLE, around 295 exabytes of data were stored worldwide between 1986 and 2007, and storage has been roughly doubling every three years. At that rate, around 600 exabytes of data would be held in storage facilities only a few years later.
Without data, we’re just another person with an opinion. ~W. Edwards Deming
You may have wondered why pop-up ads for cars appear on the websites you visit shortly after you searched for information about cars. This is how our online data is being used. Recommendation systems, predictive analysis, sentiment analysis, and text summarization are just some of the ways data is processed for better usage and insight into user behavior. Data is also used for product development and quality improvement: user reviews and performance statistics help companies gain an edge in the manufacturing and servicing of a particular product or service. The data produced comes in many forms, including text, images, numbers, and video.
> I’ve been working on text data for the past 2 years and want to shed some light on the topic and its importance in the present world. I’ll also walk through the basic methods and toolkits used for processing data in its raw and true form. Text data has applications in various fields of work, ranging from advertising to fraud detection.
> Natural Language Processing is the field of machine learning that helps machines understand, analyze, and process text written by humans, yielding insights and feedback for product development and customer satisfaction.
> At present, NLP is booming in the data science industry because the information that text data holds matters greatly for the growth of individuals and companies.
- Here are some of the applications of text processing across various industries at present:
- Spam and fraud detection- We all receive emails now and then carrying advertisements or scams trying to lure us into phishing nets. But there is usually a pattern, a certain set of keywords, that fraudsters use, which helps a fraud detector analyze the content and predict whether an email is genuine or just another scam. The same goes for spam emails about products or services the user is not interested in: a spam classifier identifies the keywords and classifies whether the mail is spam or not.
- Fake news detection- Rumors are sometimes spread for the malicious benefit of individuals. Fake news can be identified from the characteristic terminology used by the writer, helping to avoid numerous problems.
- Social media analysis- The internet is filled with text, numerical, video, and image data. This data is used to analyze users’ behavior and choices, as well as current styles and trending products. Events are unfolding around the world, and reviews about them are generated constantly; this volume of data is far too large to analyze manually, so the analysis is done with machine learning algorithms, deep learning models, or BI (Business Intelligence) tools.
- Product review analysis- Reviews and feedback are essential for a company, as they reveal the customers’ viewpoints and the features customers want changed or improved. This helps companies efficiently allocate research manpower toward the corresponding improvements in product features. Sentiment analysis of product reviews is the area of research that facilitates this idea.
- Business Intelligence- Analysts at multinational companies regularly analyze huge amounts of data related to their products, business policies, and models. Reviewing and improving functionality, along with major decision making, is done after analyzing this data, which forms the basis for decisions in a particular area.
- Chatbots and customer care services- Just a few years ago, thousands of people worked in customer care answering their customers’ queries and problems. Training machine learning models heavily on this user-generated data has enabled chatbots to answer users’ questions with context understanding and human-like interaction.
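The keyword-based spam detection described above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the keyword list and weights below are made up, whereas a real system would learn them from labelled emails.

```python
import re

# Hypothetical keyword weights for a naive spam scorer; a real
# classifier would learn these from labelled training emails.
SPAM_KEYWORDS = {"winner": 2.0, "free": 1.0, "prize": 2.0, "urgent": 1.5, "click": 1.0}

def spam_score(email_text, threshold=2.0):
    """Return (score, is_spam) based on weighted keyword hits."""
    words = re.findall(r"[a-z']+", email_text.lower())
    score = sum(SPAM_KEYWORDS.get(w, 0.0) for w in words)
    return score, score >= threshold

score, flagged = spam_score("URGENT: click here to claim your FREE prize!")
```

A trained model (e.g. naive Bayes over bag-of-words features) replaces the hand-picked weights with ones estimated from data, but the scoring idea is the same.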
There are various methodologies through which text data is processed and analyzed for insights and product development, namely:
- Named entity recognition- This method classifies the names appearing in text into categories such as person, company, location, monetary value, asset, time expression, and quantity. It supports better analysis and also works as a pre-processing step before insights are extracted from the data.
- Tokenization- This method splits text data into tokens, typically removing punctuation and splitting on the non-alphanumeric sequences present in the text; at times the tokens are substituted with a new set of token names.
- Bag of Words- This process extracts features from text by converting the words into numeric (often binary) values, which helps identify which features of the vocabulary appear in a given piece of text.
- Natural language generation- This method generates plain English sentences from provided keywords and features. It consists of stages such as content aggregation, lexical choice, and realization: content aggregation decides the main context of the text to be produced, lexical choice selects the language and the set of words to use, and realization takes care of the various grammatical aspects.
- Sentiment analysis- This technique analyzes text to produce a sentiment score and decide whether the text is written in a positive or negative tone. It is done through various methodologies such as deep neural networks and bag of words, along with lexicons such as SentiWordNet.
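Several of the methodologies above (tokenization, bag of words, and lexicon-based sentiment scoring) can be sketched together in plain Python. The tiny lexicon below is purely illustrative; real systems use graded resources such as SentiWordNet.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(tokens, vocabulary):
    """Binary bag-of-words vector over a fixed vocabulary."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# Toy sentiment lexicon (illustrative; SentiWordNet assigns graded
# positive/negative scores per word sense instead of +1/-1).
LEXICON = {"great": 1, "good": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}

def sentiment_score(tokens):
    """Sum lexicon scores; > 0 suggests positive, < 0 negative."""
    return sum(LEXICON.get(t, 0) for t in tokens)

tokens = tokenize("The battery life is great, but the camera is bad.")
vocab = ["battery", "camera", "great", "bad", "screen"]
vector = bag_of_words(tokens, vocab)  # [1, 1, 1, 1, 0]
score = sentiment_score(tokens)       # 0: one positive and one negative word
```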
> Basic pre-processing steps taken while using Natural Language Processing on text data-
- There are 3 components of text pre-processing which are as follows-
- Tokenization- This process converts the text data into tokens, which are then processed to extract information. It has been explained in brief above.
- Normalization- This converts all text characters into a uniform form so that no spurious variation biases the analysis and the data consists of a uniform sequence. It may include lowercasing all the characters, etc.
- Noise removal- This mainly removes unneeded characters such as extra whitespace, which takes up memory during computation.
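The three components can be sketched as a small pipeline, assuming lowercasing plus accent stripping for normalization and whitespace collapsing for noise removal:

```python
import re
import unicodedata

def normalize(text):
    """Lowercase and strip accents so equivalent strings compare equal."""
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def remove_noise(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def preprocess(text):
    """Normalize, remove noise, then tokenize on spaces."""
    return remove_noise(normalize(text)).split(" ")

preprocess("  Café   Reviews \n are   HERE ")
# ['cafe', 'reviews', 'are', 'here']
```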
- The methods commonly used for pre-processing of text data are-
- Stemming and Lemmatization- These techniques reduce the words present in the data to a base form while keeping the meaning and semantics of the text intact. The Porter stemmer is a well-known algorithm for stemming, while lemmatization typically relies on a vocabulary such as WordNet. Though stemming has long been used, lemmatization often proves better, as it tends to retain the meaning of the words even after the trimming.
- Removal of HTML tags and links- HTML tags and links are removed from the data, which reduces inconsistencies in the semantics of the text.
- Removal of stopwords- Words like I, was, is, and, but, etc. are considered stopwords; they carry little meaning and are thus of no use for insights from the data. Removing stopwords saves computation, since only the words useful for the analysis are left for further processing.
- Removal of numbers and special characters- Special characters and numbers are trimmed off, typically with regular expressions or NLTK utilities, for efficient processing of the text data.
- POS tagging- A POS (part-of-speech) tagger labels each word with its part of speech for better analysis of the text data. For example, the sentiment toward a product’s features can be analyzed using the adjectives in the text.
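A rough version of these cleanup steps fits in a few lines of Python. The stopword set is abbreviated, and the suffix stripper is a toy stand-in for the real Porter algorithm (available as nltk.stem.PorterStemmer); its crude over-stemming of "loved" to "lov" illustrates why lemmatization is often preferred:

```python
import re

# Abbreviated stopword list; NLTK ships a fuller one.
STOPWORDS = {"i", "is", "was", "and", "but", "the", "a", "an", "of", "to", "it"}

def strip_html(text):
    """Remove links and HTML tags with crude regexes; an HTML parser
    such as BeautifulSoup is more robust in practice."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"<[^>]+>", " ", text)

def clean_tokens(text):
    """Drop numbers and special characters, lowercase, remove stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def naive_stem(word):
    """Toy suffix stripper; the real Porter algorithm applies a far
    more careful sequence of rules."""
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

html = "<p>Visit https://example.com 100 reviewers loved the styling!</p>"
tokens = [naive_stem(w) for w in clean_tokens(strip_html(html))]
# ['visit', 'reviewer', 'lov', 'styl']
```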
- Along with the benefits come some serious controversies, whether created for the popularity of the work or of the people themselves. A few years back, Microsoft showed that it could identify people suffering from pancreatic cancer just by analyzing large samples of search-engine queries, and, astonishingly, the prediction could be made even before the user was diagnosed with the disease. This also led to mishaps when people not suffering from the disease underwent treatment for it.
- Now let’s talk about the various deep learning models used for Natural Language Processing today-
- Convolutional Neural Networks- A CNN is applied to extract aspects or higher-level features from words or n-grams. The words of the text are converted into an embedding matrix using word embeddings, which is then fed into the network. Convolutional layers produce feature maps, and max-pooling layers apply a max operation over each filter to obtain a fixed-length output while reducing the dimensionality at the same time. The main drawback of CNNs for NLP purposes is that they do not capture long-term dependencies well, which is where Recurrent Neural Networks come in.
- Recurrent Neural Networks- Have you ever wondered how a machine learning model can answer questions and summarise data while remembering context mentioned earlier? This is the network architecture that carries information about previous inputs and outputs across a sequence of tokens, with the help of various gates present in the network. Compared to CNNs (Convolutional Neural Networks), RNNs perform better on certain text-processing tasks but cannot be considered the best overall, as results and accuracy vary from task to task. The input to RNN models is mainly one-hot encodings or word embeddings, obtained through tokenization and sequence generation. There are several variants of the RNN: the LSTM (Long Short-Term Memory) has three gates (input, forget, and output) that help it remember the context of the data processed, and GRUs (Gated Recurrent Units) are similar to LSTMs in functionality and accuracy. RNNs are presently used for many purposes such as-
- Sentiment analysis
- Content summarisation- abstractive and extractive
- Natural language generation
- Attention mechanism- This method addresses a drawback of the RNN, which must compress an entire hidden-state sequence into a single fixed context vector during analysis. By letting the model weight all hidden states when producing each output, attention can efficiently be used for text summarization, sentiment analysis, and content generation where the relevant context is not explicit. It is an important field of research in NLP.
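The core idea of attention, weighting every hidden state rather than relying on one fixed context vector, can be sketched in pure Python as scaled dot-product attention. Real models use tensor libraries and learned projections; this sketch just shows the weighting arithmetic:

```python
import math

def attention(query, hidden_states):
    """Scaled dot-product attention over a sequence of hidden states.
    Returns (weights, context): softmax-normalized scores and the
    weighted sum of the states."""
    d = len(query)
    # Similarity of the query to each hidden state, scaled by sqrt(d).
    scores = [sum(q * h for q, h in zip(query, state)) / math.sqrt(d)
              for state in hidden_states]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of all hidden states.
    context = [sum(w * state[i] for w, state in zip(weights, hidden_states))
               for i in range(d)]
    return weights, context

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden-state sequence
weights, context = attention([1.0, 0.0], states)
```

States most similar to the query receive the largest weights, so the context vector emphasizes the relevant parts of the sequence instead of averaging everything equally.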
> These were the basic tools and technologies used for Natural Language Processing.
> The amount of information on this topic is endless, and a lot of research is being done in this domain to extract the best possible insights from even a small piece of text. So the next time we come across work related to text processing, not only will we know its importance, but we will also understand the process the text undergoes to produce such insightful analysis results.
Feel free to contact me for any suggestions or queries regarding the topics covered in this blog.
Harsh Patel– Information Technology Engineering, Nirma University, Ahmedabad, India