Replicating the Toronto BookCorpus dataset — a write-up


While the Toronto BookCorpus (TBC) dataset is no longer publicly available, it is still used frequently in modern NLP research (e.g. to pre-train transformers such as BERT, RoBERTa, XLNet, and XLM). As such, for those of us who were not able to grab a copy of the dataset before it was taken offline, this write-up presents a way to replicate the original TBC dataset as closely as possible.

🔍 Getting a sense of the original Toronto BookCorpus dataset

In order to replicate the TBC dataset as closely as possible, we first need to consult the original paper¹ and website that introduced it, so as to get a good sense of its contents.

In the paper, Zhu et al. (2015) write: “we collected a corpus of 11,038 books from the web. […] We only included books that had more than 20K words in order to filter out perhaps noisier shorter stories.” Next, the authors present some summary statistics for the corpus, most notably that it contains roughly 74M sentences and 985M words in total.
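Re-applying the paper's 20K-word filter is straightforward once the books are on disk. Below is a minimal sketch, assuming the books are stored as plaintext files in a books/ directory (the directory name and the whitespace-based word count are my assumptions, not the authors' exact method):

```python
import glob

MIN_WORDS = 20_000  # threshold used by Zhu et al. (2015)

def is_long_enough(path, min_words=MIN_WORDS):
    """Return True if the plaintext book at `path` has more than `min_words` words."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        # A simple whitespace split approximates a word count.
        return len(f.read().split()) > min_words

# Keep only the books that pass the length filter.
books = [p for p in glob.glob("books/*.txt") if is_long_enough(p)]
```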

From the website, we learn that Smashwords served as the original source of the 11,038 books collected and used in the dataset.

📚 Collecting the books

Now that we know (roughly) how many books to collect and from what source, we can get started with collecting them. To this end, I wrote some code that scrapes the Smashwords website; it is publicly available in this GitHub repository. The code is fast (it scrapes concurrently), well-structured, and well-documented, so it should be easy to use.

Inspector mode on a Smashwords book page (accessible through “Inspect Element” or F12 on Firefox)

🔗 Getting the plaintext book URLs

In order to obtain a list of URLs of plaintext books to download, we first need to scrape the front page(s) of Smashwords for the URLs of individual book pages (every book has its own page on Smashwords). Next, we can scrape these book pages for the URLs of the plaintext books themselves. You can find instructions to do so using my code here.
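As a rough illustration of these two steps, here is a minimal sketch using requests and BeautifulSoup. The /books/view/ URL pattern and the .txt link detection are assumptions about Smashwords' page structure; the linked repository contains the actual scraping code:

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://www.smashwords.com"

def absolute(href):
    """Turn a site-relative link into an absolute URL."""
    return BASE + href if href.startswith("/") else href

def get_book_page_urls(catalog_url):
    """Step 1: scrape one catalog page for links to individual book pages."""
    soup = BeautifulSoup(requests.get(catalog_url).text, "html.parser")
    # Book pages appear to live under /books/view/<id> (an assumption).
    return [absolute(a["href"]) for a in soup.find_all("a", href=True)
            if "/books/view/" in a["href"]]

def get_plaintext_url(book_page_url):
    """Step 2: scrape a book page for its .txt download link, if offered."""
    soup = BeautifulSoup(requests.get(book_page_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].endswith(".txt"):
            return absolute(a["href"])
    return None  # no plaintext version of this book
```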

📥 Downloading the plaintext books

Now that we have a list of plaintext books to download (via their URLs), we need to… download them! This is a bit tricky, as Smashwords (temporarily) blocks any IP address that downloads too many books (more than about 500) in a short period of time. This can be circumvented, however, by downloading through a VPN and switching IP addresses regularly (about 30 times over a full run). You can find instructions to do so using my code here.
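A minimal download loop might look as follows. The two-second delay, the output directory, and the treatment of HTTP 403/429 responses as a block signal are all assumptions; the VPN IP switch itself happens outside the script:

```python
import time
from pathlib import Path

import requests

def download_books(urls, out_dir="books", delay=2.0, per_ip_limit=500):
    """Download plaintext books politely, pausing between requests."""
    Path(out_dir).mkdir(exist_ok=True)
    for i, url in enumerate(urls, start=1):
        resp = requests.get(url)
        if resp.status_code in (403, 429):
            # Smashwords has likely rate-limited this IP: switch VPN
            # endpoints, then rerun the loop on the remaining URLs.
            print(f"Blocked after {i} downloads; rotate your VPN IP and resume.")
            break
        Path(out_dir, url.rsplit("/", 1)[-1]).write_bytes(resp.content)
        time.sleep(delay)  # be polite between requests
        if i % per_ip_limit == 0:
            print(f"{i} books downloaded; consider rotating your VPN IP now.")
```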

⚙️ Pre-processing the books

In order to obtain a true replica of the Toronto BookCorpus dataset, both in terms of size and contents, we need to pre-process the plaintext books we have just downloaded in two steps: (1) sentence-tokenizing each book, and (2) writing all books to a single text file, using one sentence per line. You can find instructions to do so using my code here.
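For reference, here is a minimal sketch of these two steps using NLTK's Punkt sentence tokenizer. The input directory and the output file name are assumptions; the repository's own scripts handle this end to end:

```python
import glob

import nltk

nltk.download("punkt")  # one-time download of the Punkt tokenizer models

with open("books_large.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("books/*.txt")):
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read().replace("\n", " ")  # undo hard line wraps
        for sentence in nltk.sent_tokenize(text):
            out.write(sentence.strip() + "\n")  # one sentence per line
```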