Source: Deep Learning on Medium
Convolutional Neural Networks for classification: a historical perspective
Convolutional Neural Network (CNNs), are mainly known for Computer Vision tasks. They, nevertheless, have various applications in text mining. In this post I will provide insights into some of the most popular NLP applications. We have seen that CNN can capture local features and combinations of these features. This make them primarily suitable for document and text classification tasks.
We will focus here on the classification application involving mainly CNNs, and not other types of neural networks, which will be the subject of other posts.
All the NLP tasks discussed below can be seen as assigning labels to words. The traditional NLP approach is:
- Extract from the sentence a rich set of hand-designed features
- Fed them to a standard classification algorithm,
Support Vector Machine (SVM), often with a linear kernel is typically used as a classification algorithm. The choice of features is a trial and error process, based first on linguistic intuition, business need and experimentation. Feature engineering as it is called is task dependent, implying a new research for each new task which impacts the cost which might be quite important for large-scale applications.
Neural Networks, offer a new and different approach where the text is processed as little as possible and is fed to a multilayer neural network (NN) architecture, trained in an end-to-end fashion. The architecture takes the input sentence and learns several layers of feature extraction that process the inputs.
In this blog I will review the main contributing articles for the classification task.
Text classification is the most known NLP application of CNN, thanks to the hierarchical convolution scheme they can capture local interactions of the type n-grams of words. CNN-based frameworks can also capture temporal and hierarchical features in variable-length text sequences. CNN use Word Embeddings as a first layer as either a static or pertained way. This trick allows to use the same concepts and architecture that made the success of CNN in image classification.
Collobert et al. 2011, before the release of word2vec, already uses a one-layer convolution block to perform many NLP tasks. One of the key points of their architecture is its ability to perform well with the use of raw words. The method is able to learns good word representations for the words. At the start, the features are taken form lookup table with randomly generated values. Word representation are then trained by backpropagation.
Yu et al. 2014, Evaluated a sentence model based on a convolutional neural network (CNN) that is sensitive to word ordering and is able to capture features of the sentence n-grams. The convolution and pooling layers help to capture long-range dependencies, which are common in questions.
Kalchbrenner et al. 2014. Extend the idea to form a dynamic CNN (DCNN) by stacking CNN and using dynamic k-max pooling operations over long sentences. This research significantly improved over existing approaches at the time in much short text and multiclass classification.
Yoon Kim extends the single-block CNN by adding multiple input channels and multiple kernels of various lengths to give higher-order combinations of ngrams. He further established that using unsupervised pertained word representation such as word2vec improve the performance of the classification
Yin et al. further improved Kim’s multi-channel, variable kernel framework and used Multichannel Variable-Size Convolution (MVCNN) architecture for sentence classification. It combines various pertained embeddings and variable-size convolution filters.
Santos and Gatti propose a new deep convolutional neural network that exploits from character-to sentence-level information to perform sentiment analysis of short texts to achieve effective short text classification.
Johnson and Zhang explore the usage of region embeddings for effective short text categorisation, due to the ability to capture contexts over a larger span where word embeddings fail.
Wang and others perform semantic clustering using density peaks on pre-trained word embeddings forming a representation they call semantic cliques, with such semantic units used further with convolutions for short-text mining.
Zhang et al. explore the use of character-level representations for a CNN instead of word-level embeddings. On a large dataset, character-level modeling of sentences performs very well when compared to traditional sparse representations or even deep learning frameworks, such as word embedding CNNs or RNNs.
Conneau et al. designed a very deep CNN along with modifications such as shortcuts to learn more complex features.
Xiao and Cho further extends the character level encoding for entire document classification task when combined with RNNs with a lower number of parameters.
Zhang et all combine CNN, Bi-LSTM with multi-attention mechanism to take into account the contextual meaning of words and improve the performance of the classification.