Source: Deep Learning on Medium
How I Built My Own Dataset For My Machine Learning Project
My Story Of Creating My Own Dataset
At some point while figuring out your next big machine learning project, you may have discovered that there is no ready-made dataset for the problem you are chasing.
Well! I faced the same situation too, and this is the story of how I tackled it. This article is adapted from lesson 2 of the fastai course (v2).
The Origin Story
It all began a couple of days back when I was working on a secret experiment on “creating a neural network in 11 minutes”.
Here I used a waste classification dataset which had only two classes →
- O — Organic waste
- R — Recyclable waste
Since “with great power comes great responsibility”, I have already open sourced it for the sake of humanity. The tutorial is available here.
After I created this neural network, which could help people sort their trash and save the environment, I thought: what if my neural network could recognize types of waste that were not included in the original dataset?
This is where it all began.
So It Begins
With the mission of building my own dataset, I spent many sleepless nights doing very deep research on the internet to find a way to do this.
Well! The research was not that deep. I just went to lesson 2 of the fastai course and executed the following steps →
- I went to Google Image Search.
- Entered the search term “Plastic soft drink bottles” and hit enter.
- After the search results were displayed, I scrolled down the search page till I could no longer see any relevant images.
- Then I hit F12 to open the “developer tools” in my Chrome browser.
- There I clicked on the “console” tab and entered the following script.
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
- This script collected all the image URLs I had scrolled through in the previous steps, dumped them into a text file named “download.txt”, and saved it to the “Downloads” directory on my Windows machine.
Since those images could prove to be dangerous in the wrong hands, I needed a safe place to store them.
For this I did the following →
- With the help of Path() I created a root directory.
- With the help of .mkdir() I created a subdirectory inside the root directory.
I wrapped all these methods inside a function and then called that function to create the directory structure for storing the downloaded files. This function also returns the path to the main directory and the subdirectory.
from pathlib import Path  # also available via fastai's star imports

def createDirectory(subDirectoryName, directoryName):
    folder = subDirectoryName
    path = Path(directoryName)
    dest = path/folder
    dest.mkdir(parents=True, exist_ok=True)
    return path, dest

path, dest = createDirectory('N', 'data')
Yeah! That’s my climax face. At this point I had my URLs ready and had created a safe place to store my data.
Now the finale begins: it is time to download all those images.
Since all the cool kids and great programmers create functions for every programming task, I created a function once again. So, here it goes.
This function uses fastai’s download_images() method to read the URLs from the ‘download.txt’ file, download the images at those URLs, and store them in the destination folder.
def downloadImages(fileName, path, destination):
    download_images(path/fileName, destination)

downloadImages('download.txt', path, dest)
There might be some images which can’t be opened because they are corrupted or are not image files at all. Fastai provides a handy method, verify_images(), to verify and delete such images. The following line of code uses this method:
verify_images(dest, delete=True, max_size=500)
At this point I have the images of the non-recyclable trash collected and stored in a directory.
Next I did the following →
- Downloaded the waste classification dataset from Kaggle to my local machine.
- Copied the directory ’N’ which was created in the previous sections and pasted that into the ‘TRAIN’ directory of the waste classification dataset.
- Randomly copied some of the images in the directory ’N’ from the ‘TRAIN’ folder and pasted them into a newly created directory ’N’ in the ‘TEST’ folder.
- After all this copying and pasting my final dataset looked like this.
I could have written a script to do all the copying and pasting, but let’s just say I was lazy. You could try writing such a script yourself, though.
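If you do want to script it, the copy-and-split steps above could look something like the sketch below. This is just one way to do it, not the script I used; the directory names (‘TRAIN’, ‘TEST’) follow the Kaggle dataset layout described above, while the 20% test fraction and the helper name merge_and_split are my own choices.

```python
import random
import shutil
from pathlib import Path

def merge_and_split(new_class_dir, train_dir, test_dir,
                    test_fraction=0.2, seed=42):
    """Copy a new class folder (e.g. 'N') into TRAIN, then move a
    random sample of its images into a matching folder under TEST."""
    new_class_dir = Path(new_class_dir)
    train_dest = Path(train_dir) / new_class_dir.name
    test_dest = Path(test_dir) / new_class_dir.name
    train_dest.mkdir(parents=True, exist_ok=True)
    test_dest.mkdir(parents=True, exist_ok=True)

    # Copy every image from the new class folder into TRAIN/<class>
    for img in new_class_dir.iterdir():
        if img.is_file():
            shutil.copy2(img, train_dest / img.name)

    # Move a random subset from TRAIN/<class> into TEST/<class>
    images = [p for p in train_dest.iterdir() if p.is_file()]
    random.seed(seed)
    sample = random.sample(images, int(len(images) * test_fraction))
    for img in sample:
        shutil.move(str(img), test_dest / img.name)

    return len(images) - len(sample), len(sample)

# Example call, assuming the layout used in this article:
# merge_and_split('data/N', 'DATASET/TRAIN', 'DATASET/TEST')
```

Fixing the random seed keeps the train/test split reproducible, so rerunning the script produces the same partition.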
What’s Next? A Possible Sequel?
My mission to gather the knowledge about different categories of trash is over but my quest to help humanity to segregate their trash isn’t.
So, what’s next? Maybe I will upload this dataset to Kaggle, or maybe I will retrain my waste-classifying neural network on this new data. Maybe I will do both.
But whatever the case may be, I will post it here at “ML and Automation”. So stay tuned for the next article to find out how I am going to use this newly gathered data.