Original article was published on Deep Learning on Medium
Open-source datasets for Natural Language Processing (NLP)
Natural Language Processing (NLP) is one of the most active areas of research. In this post, I will walk you through some popular datasets that you can use for research and learning.
NLP can be broadly divided into seven areas based on the problem being solved:
- Speech Recognition
- Text Classification
- Document Summarization
- Q/A and Product Title Summarization
- Sentiment Analysis
- Recommender System
- Machine Translation and Language Modeling
The dataset consists of 1,000 hours of 16 kHz speech from audiobooks that are part of the LibriVox project.
- Free Spoken Digit Dataset
The dataset contains 2,500 recordings of spoken digits, stored as WAV files sampled at 8 kHz.
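Before training on audio like this, it helps to confirm that each file really has the sample rate you expect. The following is a minimal sketch using only Python's standard `wave` module. The file name `digit_demo.wav` and the helper names are hypothetical, and the script writes its own short 8 kHz test tone so that it runs without downloading anything; with the real dataset you would point `describe_wav` at one of its recordings instead.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate_hz, n_channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def write_test_tone(path, rate=8000, seconds=0.5, freq=440.0):
    """Write a mono 16-bit sine tone so the example needs no external data."""
    n_frames = int(rate * seconds)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


# Hypothetical file name; a real recording from the dataset would go here.
demo_path = os.path.join(tempfile.mkdtemp(), "digit_demo.wav")
write_test_tone(demo_path)

rate, channels, duration = describe_wav(demo_path)
print(f"{rate} Hz, {channels} channel(s), {duration:.2f} s")
```

A check like this is cheap to run over a whole folder and catches files that were resampled or saved in stereo by mistake.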
The data is an acoustic-phonetic continuous speech corpus of broadband recordings of 630 speakers.
This is a great initiative to build an open-source voice database; anyone can create an account and contribute.
Another dataset worth exploring is VoxForge.
It is a large lexical database of English. You can read more about it at the link below.
- Yelp is yet another great source for text classification data.
- Movie Binary Classification
- Reuters data is also worth exploring in this field.
To access past data, one needs to complete the Agreement Concerning Dissemination of DUC Results and the User Agreements (Organization Applications).
- Cornell Newsroom is one of the largest datasets for training and evaluating summarization systems. The dataset consists of 1.3 million articles and summaries contributed by authors and editors worldwide.
- Legal Case Reports Data Set
The dataset consists of four thousand legal cases for summarization and citation analysis.
Related summarization datasets, such as those for timeline summarization, are also worth exploring.
Q/A and Product Title Summarization
- The use case is product title summarization in E-commerce applications.
- Teaching Machines to Read and Comprehend
An alternate source of the same dataset
- Other sources for Q/A analysis
- MovieLens, a hub for movie datasets, spans 58,000 movies with 27,000,000 ratings and 1,100,000 tag applications.
- Another good movie dataset, maintained by Cornell University, contains reviews labeled with positive and negative sentiment.
- Amazon product review datasets for various categories, along with their ratings from 1 to 5
- Hotel and Car reviews datasets from Tripadvisor and Edmunds
The datasets consist of ~259,000 hotel reviews across 10 cities and ~42k reviews of ~140–250 cars per year over three years.
- The giant in this field is the Amazon review data, which contains 142.8 million product reviews and metadata spanning May 1996 to July 2014.
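Review datasets that come with 1 to 5 star ratings, like the Amazon ones listed above, are often turned into sentiment labels before training a classifier. Here is a minimal sketch of that step. The cut-offs (2 or below is negative, 4 or above is positive, anything in between is neutral) are an assumption chosen for illustration, and the sample reviews are made up rather than taken from any dataset.

```python
from collections import Counter


def rating_to_sentiment(stars):
    """Map a 1-5 star rating to a coarse sentiment label.

    The thresholds are an illustrative assumption, not something
    the datasets themselves prescribe.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"rating out of range: {stars}")
    if stars <= 2:
        return "negative"
    if stars < 4:
        return "neutral"
    return "positive"


# Made-up reviews, purely illustrative.
sample_reviews = [
    ("Stopped working after a week.", 1),
    ("Does the job, nothing special.", 3),
    ("Exactly what I needed.", 5),
    ("Good value for the price.", 4),
]

# Attach a label to each review, then count how many fall in each class.
labeled = [(text, rating_to_sentiment(stars)) for text, stars in sample_reviews]
label_counts = Counter(label for _, label in labeled)
print(label_counts)
```

Counting the labels right away is a quick way to see how imbalanced the classes are; many people drop the neutral class entirely when they want a binary task.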
Machine Translation and Language Modeling
- Language modeling: the Google word corpus and Gutenberg are two prominent sources of data.
- Machine Translation of European languages
- Statistical Machine Translation
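Large text corpora like the ones above are commonly used to estimate how likely one word is to follow another. As a rough illustration of the idea, here is a minimal bigram-count sketch. The toy sentence stands in for a real corpus, whitespace splitting is a simplification of real tokenization, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev was never seen."""
    if prev not in counts:
        return 0.0
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total


# Toy text standing in for a real corpus.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()

counts = bigram_counts(tokens)
print(next_word_probability(counts, "the", "cat"))
```

In the toy text, "the" is followed by "cat" twice and "mat" once, so the estimate for "cat" after "the" is two thirds. Real language models add smoothing so that unseen word pairs do not get a probability of zero.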
Miscellaneous Data Sets
The Blog Corpus consists of 140 million words.
Hope this was a good starting point for NLP!