Source: Deep Learning on Medium
The idea for building this model comes mainly from Cahya Wirawan's GitHub project, so I cite the source at the beginning of this article.
cahya-wirawan/language-modeling — Playground for Language Modelling (github.com)
ULMFiT (Universal Language Model Fine-tuning for Text Classification) is a transfer learning technique built for NLP, introduced in the journal article by Jeremy Howard linked here.
The biggest difference between sentiment analysis with n-gram NLP and this deep learning approach is that the latter is much easier and requires very little domain expertise in linguistics. Moreover, when there is enough training data, the parameters of the embedding matrix can differentiate between different forms of a word without the need for lemmatization.
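To make that last point concrete, here is a minimal sketch of why lemmatization becomes unnecessary. The vectors below are hypothetical toy values, not trained embeddings; the point is that, with enough data, morphological variants of a word (e.g. the Bahasa active form "membeli" and passive form "dibeli" of "beli", to buy) end up close together in embedding space, while unrelated words do not.

```python
import math

# Toy 3-dimensional embedding vectors (hypothetical values, for
# illustration only — real embeddings are learned from the corpus).
embeddings = {
    "membeli": [0.9, 0.1, 0.2],    # "to buy" (active form)
    "dibeli":  [0.8, 0.2, 0.3],    # "bought" (passive form)
    "turun":   [-0.7, 0.6, -0.1],  # "to fall" (unrelated word)
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# The two surface forms of "beli" are far more similar to each other
# than either is to an unrelated word, so downstream layers can treat
# them alike without any lemmatization step.
print(cosine(embeddings["membeli"], embeddings["dibeli"]))   # high (~0.98)
print(cosine(embeddings["membeli"], embeddings["turun"]))    # negative
```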
As the lack of data is not enough to be the stumbling factor to train deep learning model from scratch, the labelled news using Bahasa is also almost impossible to be found. To tackle those problems, I try to integrate the latest technique in NLP, ULMFiT, to build the sentiment model using small dataset.
The steps can be summarized as follows:
- Build the language model using a Wikipedia corpus in Bahasa, following Cahya Wirawan's project.
- Collect a news corpus in Bahasa (stock or equity news) and retrain the language model on this corpus.
- Transfer the language model to a classification task using a small labelled news dataset (targeted to each equity).
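One ingredient the ULMFiT paper uses for the fine-tuning stages above is the slanted triangular learning-rate schedule: a short linear warm-up followed by a long linear decay. A minimal pure-Python sketch of that schedule (default hyperparameters taken from the paper; the function name is my own):

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate from the ULMFiT paper.

    t        -- current training iteration (0-based)
    T        -- total number of iterations
    cut_frac -- fraction of iterations spent increasing the LR
    ratio    -- how much smaller the lowest LR is than lr_max
    """
    cut = int(T * cut_frac)
    if t < cut:
        # linear warm-up toward lr_max
        p = t / cut
    else:
        # long linear decay back down
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: 100 iterations; the LR peaks at lr_max near t = 10,
# then decays slowly for the remaining 90% of training.
T = 100
lrs = [slanted_triangular_lr(t, T) for t in range(T)]
print(max(lrs))  # 0.01 (lr_max, reached at the end of the warm-up)
```

In practice this schedule is combined with gradual unfreezing and discriminative per-layer learning rates, which is what makes fine-tuning on a small labelled dataset stable.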
These steps are taken from Jeremy Howard's paper and visualized below:
Overall, I was able to replicate the ideas proposed in the paper and tailor them to our specific project. Using the news dataset, the accuracy of the language model is around 44%. Furthermore, by incorporating the small labelled news dataset, we obtain about 70% classification accuracy with two labels (positive and negative).
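For context on that 70% figure, here is a toy sketch of how the classifier would be scored (the labels and predictions below are made up for illustration). With two roughly balanced labels, always predicting the majority class scores about 50%, so ~70% is a real improvement over chance.

```python
# Hypothetical gold labels and model predictions for ten labelled
# news items (1 = positive, 0 = negative) — illustration only.
gold  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
preds = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Classification accuracy: fraction of items predicted correctly.
accuracy = sum(g == p for g, p in zip(gold, preds)) / len(gold)
print(accuracy)  # 0.8 on this toy sample

# Majority-class baseline: accuracy of always predicting the more
# frequent label. With balanced labels this is 0.5.
baseline = max(gold.count(0), gold.count(1)) / len(gold)
print(baseline)  # 0.5
```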