Source: Deep Learning on Medium
1- Collect/scrape the dataset and process it to CSV
To build our text summarization model, we need a dataset in the required language. The data should be text paired with a title, so that the model can be trained to summarize the text into the title. One of the most effective dataset sources for this goal is news, as each news article has a long body with a title that summarizes it.
So we need a way to collect online news in Hindi. One of the most useful methods I have found is the amazing scraper news-please: you state the websites to scrape, and it recursively crawls them and saves the data in JSON format.
I suggest scraping on Google Colab, so as not to waste your own bandwidth; you then only download the resulting JSON files, which are much smaller than the full HTML pages.
1-A Run this notebook (Google Colab)
1-B Set the configurations (Google Colab)
In Google Colab, in the Files tab, go up one level; then under the root directory create a directory called news-please-repo, and under it create a config directory.
Here you create two files (their contents can be found in the notebook). The first file (config.cfg) sets the directory where the JSON files are saved; I like to save them to Google Drive, so feel free to set your own path (this option is the variable called working_path).
The second file sets the websites to scrape from. I used about 9 websites (their names are found in the notebook); feel free to add or swap in your own news websites.
I suggest modifying sites.hjson to contain a couple of sites per Google Colab session, so that each session scrapes from a couple of sites rather than all of them at once.
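For orientation, here is a minimal sketch of the two files. Only working_path and the site list are described above; the section layout and field names below follow the defaults shipped with news-please, and the URLs are placeholders, so copy the exact contents from the notebook:

```
# config.cfg -- only the save location is changed; everything else
# keeps the news-please defaults
[Files]
working_path = /content/drive/My Drive/news-please-data/

# sites.hjson -- a couple of sites per Colab session (placeholder URLs)
{
  base_urls : [
    { url : "https://www.example-hindi-news.com/" },
    { url : "https://www.another-hindi-news.com/" }
  ]
}
```

With both files in place, the news-please command run by the notebook picks them up from the config directory.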
1-C Download from Google Drive to process locally (Google Colab)
After running the news-please command for a couple of hours (or a couple of Google Colab sessions) and saving the resulting JSON files to your Google Drive, I suggest downloading the data to your local computer and processing the files there, as accessing the files from Google Colab is quite slow (I believe it has something to do with slow file I/O between Google Colab and Google Drive).
Download the zip by simply selecting the folder in Google Drive and downloading it; Drive will zip it automatically.
1-D Process the downloaded zip (locally)
After the zip has been downloaded, unzip it and install langdetect:
pip install langdetect
Then run this script (locally on your computer for fast file access; don't forget to modify the location of your extracted zip in the script). The script loops through all the scraped JSON files, checks whether each article is in Hindi, and saves the Hindi ones to a CSV.
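The processing step can be sketched roughly as follows. This is a sketch under assumptions, not the post's actual script: it assumes the news-please JSON layout (title in "title", body in "maintext"), and the function and parameter names are mine. The language-detection callable is injectable so langdetect can be swapped out:

```python
import csv
import glob
import json
import os


def jsons_to_csv(json_dir, out_csv, detect_lang=None):
    """Walk the scraped news-please JSON files under json_dir, keep only
    Hindi articles, and write (title, text) rows to out_csv.

    detect_lang maps text -> ISO language code; by default it uses
    langdetect (pip install langdetect), but any callable works.
    """
    if detect_lang is None:
        from langdetect import detect as detect_lang  # third-party dependency

    rows = []
    for path in glob.glob(os.path.join(json_dir, "**", "*.json"), recursive=True):
        with open(path, encoding="utf-8") as f:
            article = json.load(f)
        title = article.get("title") or ""
        text = article.get("maintext") or ""  # news-please stores the body here
        if not title or not text:
            continue
        try:
            if detect_lang(text) != "hi":  # keep Hindi articles only
                continue
        except Exception:  # detection can fail on very short or odd text
            continue
        rows.append((title, text))

    with open(out_csv, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "text"])
        writer.writerows(rows)
    return len(rows)
```

Point json_dir at your extracted zip; non-Hindi and empty articles are silently dropped, and the returned count tells you how many rows made it into the CSV.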
1-E Upload the resulting CSV
Now, after you run the script, upload the resulting CSV to your Google Drive.