Data set split

Source: Deep Learning on Medium

As a beginner in deep learning, I found splitting data sets a big headache. When you apply for jobs in this field, especially entry-level ones, recruiters will often give you a data set of images and ask you to show how image classification works: just pass the images through a CNN and present the result. Easy, isn’t it? Maybe.

In most articles you find on this task, the data set is already organized, or rather already split into training, validation and testing sets, with each image labeled with the category it belongs to. Sometimes this isn’t the case. Sometimes you need to shop for the ingredients before you start baking the cake.

Bored? Well, you need not be.


This code example will help you get this boring job done before you start cooking your CNN. For this, we are going to use Google Colab (basically a Jupyter notebook plus a free GPU!). We’ll also see how to mount Google Drive in a Colab notebook so that we can use it as storage space for the data set.

All the code used here can be found on the GitHub link:

In addition to splitting the data set into various directories, we’ll see how to create a data dictionary. This dictionary is nothing more than a simple Python dictionary written to a JSON file, which contains the mapping of ‘index’ to ‘category’.

What’s the use? After predicting the classes and their probabilities, we need a mechanism to map each predicted class to the category it belongs to. This JSON file solves that problem.

So let’s just hop on!

1. Import the necessary libraries

import json    # create the JSON file
import shutil  # copy images to the train, test and valid dirs
import os      # file and directory manipulation
import math    # split-count calculation

2. Mount Google Drive

# mount Google Drive to access the stored data set and manipulate it
from google.colab import drive
drive.mount('/content/gdrive')

Here we mount Google Drive so that we can access the data set that is already stored there. After you run this code in a notebook cell, it will give you a link. Follow the link and authenticate yourself to use the drive.

Note: Mounting is session-dependent. That is, every time you reset the notebook’s runtime, you have to run this code again to mount the drive.

3. Get the list of directories (categories), create the train, test and valid directories

We expect the parent directory to contain one folder per category, with each folder holding the images of that category.

In our case there are 3 categories to classify among (ant, brontosaurus and crocodile). Get the categories into a list.

# path configuration
parent_dir = '/content/gdrive/My Drive/assignment_dataset'
# get the category folder list (every sub-directory of parent_dir)
category_list = [d for d in os.listdir(parent_dir)
                 if os.path.isdir(os.path.join(parent_dir, d))]
for category in category_list:
    print(category)

We’ll create our train, test and valid directories at this time.

# create the training, validation and testing directories
data_set_dirs = ['train', 'valid', 'test']
for dsdirs in data_set_dirs:
    path = parent_dir + '/' + dsdirs
    os.mkdir(path, 0o755)  # the mode must be octal (0o755), not decimal 755
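
One thing to keep in mind is that os.mkdir raises an error if the directory already exists, which happens as soon as the cell is run a second time. A minimal sketch of a re-run-safe variant is shown here. The helper name make_split_dirs is hypothetical, and a temporary directory stands in for the Drive folder so the snippet can run anywhere.

```python
import os
import tempfile


def make_split_dirs(parent_dir, names=("train", "valid", "test")):
    """Create one sub-directory per split; existing ones are left untouched."""
    created = []
    for name in names:
        path = os.path.join(parent_dir, name)
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, mode=0o755, exist_ok=True)
        created.append(path)
    return created


# Demo in a throwaway directory standing in for the Drive folder (assumption).
split_demo_parent = tempfile.mkdtemp()
make_split_dirs(split_demo_parent)
make_split_dirs(split_demo_parent)  # second call does not raise
print(sorted(os.listdir(split_demo_parent)))  # ['test', 'train', 'valid']
```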

After creating the new directories, the parent directory contains the train, valid and test folders alongside the category folders.

4. Set split ratio, start splitting

# define the proportion of data
train_prop = 0.6
valid_prop = test_prop = (1 - train_prop) / 2

# split the data of each category into training, validation and testing sets
def create_dataset():
    for ii, cat in enumerate(category_list):
        src_path = parent_dir + '/' + cat
        dest_dir1 = parent_dir + '/train/' + str(ii)
        dest_dir2 = parent_dir + '/valid/' + str(ii)
        dest_dir3 = parent_dir + '/test/' + str(ii)

        dest_dirs_list = [dest_dir1, dest_dir2, dest_dir3]
        for dirs in dest_dirs_list:
            os.mkdir(dirs, 0o755)

        # get the list of file names in this category's directory
        files = [f for f in os.listdir(src_path)
                 if os.path.isfile(os.path.join(src_path, f))]

        # get the training and validation file counts; the test set takes the rest
        train_count = math.ceil(train_prop * len(files))
        valid_count = int((len(files) - train_count) / 2)
        test_count = len(files) - train_count - valid_count

        # slice the files into train, validation and test lists (no overlap, no gaps)
        train_data_list = files[0:train_count]
        valid_data_list = files[train_count:train_count + valid_count]
        test_data_list = files[train_count + valid_count:]

        # copy each file into the matching destination directory
        for train_data in train_data_list:
            train_path = src_path + '/' + train_data
            shutil.copy(train_path, dest_dir1)

        for valid_data in valid_data_list:
            valid_path = src_path + '/' + valid_data
            shutil.copy(valid_path, dest_dir2)

        for test_data in test_data_list:
            test_path = src_path + '/' + test_data
            shutil.copy(test_path, dest_dir3)

create_dataset()

What exactly did we do here? The approach is to take each category from category_list and create one directory for it in each of the train, valid and test directories. So, for example, we’ll create a directory named ‘0’ for the first category in category_list in all 3 directories. For each category, 3 folders.

After that, we list the images of the category under consideration and split them across the 3 folders created previously, using the ratio defined earlier.
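
To make the arithmetic of the split concrete, here is a small sketch that isolates the slicing step. The function name split_files and the file names are made up for illustration. The point is that the three slices share their boundaries, so no file is skipped and none lands in two sets.

```python
import math


def split_files(files, train_prop=0.6):
    """Return (train, valid, test) slices of `files` with no overlap and no gaps."""
    train_count = math.ceil(train_prop * len(files))
    valid_count = (len(files) - train_count) // 2
    train = files[:train_count]
    valid = files[train_count:train_count + valid_count]
    test = files[train_count + valid_count:]  # whatever is left over
    return train, valid, test


# Hypothetical file names, only for the demo.
names = [f"img_{i}.jpg" for i in range(8)]
train, valid, test = split_files(names)
print(len(train), len(valid), len(test))  # 5 1 2
```

With 8 files and a 0.6 training proportion, the training set gets ceil(4.8) = 5 files, the validation set gets 3 // 2 = 1 and the test set gets the remaining 2. When the leftover count is odd, the extra file goes to the test set.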

Doing this for all the categories gives us a train directory in which each numbered folder holds one category. The valid and test directories have the same structure.

Note: Whether the folder names start from 0, 1 or something else is entirely up to you. As we’ll soon see, whatever numbering we choose, we’ll map it back to the category names.
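
A quick way to check the result without browsing the drive is to count the files in every split and class folder. The sketch below assumes the layout described above (split folder, then numbered class folder). The helper count_images is hypothetical, and the demo builds a tiny fake layout in a temporary directory instead of touching real data.

```python
import os
import tempfile


def count_images(parent_dir, splits=("train", "valid", "test")):
    """Return {split: {class_folder: number_of_files}} for the split layout."""
    counts = {}
    for split in splits:
        split_path = os.path.join(parent_dir, split)
        counts[split] = {}
        for class_dir in sorted(os.listdir(split_path)):
            class_path = os.path.join(split_path, class_dir)
            if os.path.isdir(class_path):
                counts[split][class_dir] = len(os.listdir(class_path))
    return counts


# Build a tiny fake layout: one class folder '0' with 3/1/1 empty files.
count_demo_parent = tempfile.mkdtemp()
for split, n_files in {"train": 3, "valid": 1, "test": 1}.items():
    class_path = os.path.join(count_demo_parent, split, "0")
    os.makedirs(class_path)
    for i in range(n_files):
        open(os.path.join(class_path, f"img_{i}.jpg"), "w").close()

print(count_images(count_demo_parent))
# {'train': {'0': 3}, 'valid': {'0': 1}, 'test': {'0': 1}}
```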

5. Dictionary generation (mapping)

The final output of our CNN will be the probability of the input image belonging to each ‘type’. Simply put, it tells you the probability of an image being an ant, a brontosaurus or a crocodile in our case. Going by this example, it will output a result such as ‘0’ is 90%, ‘1’ is 2% and ‘2’ is 8%.

To link the predicted class (a folder like 0, 1 or 2) to its ‘name’, we use the dictionary. In our case, we can build the dictionary from the index mapping and store it in a JSON file, which we can refer to later to find the correct name.

The data dictionary can be made like this:

# save the category data as a dictionary in a JSON file
cat_data = {}
for ix, cat in enumerate(category_list):
    cat_data[ix] = cat
with open('/content/gdrive/My Drive/assignment_dataset/cat_data.json', 'w') as outfile:
    json.dump(cat_data, outfile)

and the resultant JSON file is:

{"0": "brontosaurus", "1": "ant", "2": "crocodile"}
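
Reading the file back at prediction time could look like the sketch below. Note that JSON object keys are always strings, which is why the integer indices come back as "0", "1" and "2" and have to be converted. The probabilities are the made-up numbers from the example above, not real model output, and a temporary file stands in for the file on the drive.

```python
import json
import os
import tempfile

# Write a stand-in cat_data.json to a temporary location (assumption: same
# content as the file produced above).
cat_data = {0: "brontosaurus", 1: "ant", 2: "crocodile"}
json_path = os.path.join(tempfile.mkdtemp(), "cat_data.json")
with open(json_path, "w") as outfile:
    json.dump(cat_data, outfile)

# Load the mapping and turn the string keys back into integer indices.
with open(json_path) as infile:
    index_to_name = {int(key): name for key, name in json.load(infile).items()}

# Hypothetical class probabilities for one image, in index order.
probabilities = [0.90, 0.02, 0.08]
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_index, index_to_name[predicted_index])  # 0 brontosaurus
```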

So that’s it!
You just learned how to create a basic pipeline for splitting a data set into the directories required for training and testing a CNN image classifier. You also got an idea of how to map indices to categories using a simple Python dictionary and a JSON file.

Well done!

As I am just starting out in this field, please leave your valuable comments on anything from the write-up to the content or code. I highly appreciate your suggestions. And a big thanks for reading this :)