Source: Deep Learning on Medium
Get the Files Ready in Place
Download the data from Kaggle, then unzip the train and test data sets. I unzipped them into a folder named data. I used Google Drive and Colab, so in the end my file structure looks like this:
(Optional) Download data from Kaggle to Google Drive on Colab
First, follow the Kaggle API documentation and download your kaggle.json. Upload this kaggle.json to your Google Drive. Remember to change line 5 in the scripts above to point to where you actually stored your kaggle.json.
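The post's original scripts are not reproduced here, but the steps can be sketched roughly as follows. The Drive mount point, file paths, and the competition slug `digit-recognizer` are assumptions; adjust them to your own setup.

```shell
# Assumes Google Drive is already mounted at /content/drive in Colab
# and kaggle.json (from the Kaggle API docs) was uploaded to Drive.
mkdir -p ~/.kaggle
cp /content/drive/MyDrive/kaggle.json ~/.kaggle/   # adjust to where you stored it
chmod 600 ~/.kaggle/kaggle.json

# "digit-recognizer" is the slug for this competition on Kaggle.
kaggle competitions download -c digit-recognizer -p data
unzip -o data/digit-recognizer.zip -d data
```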
Read the Data from CSV to numpy array
We are using pd.read_csv from the pandas library. It is a convenient utility function that does exactly what we ask: read the data from a CSV file into a numpy array. Remember to call .values at the end, so you get a numpy array rather than a pandas DataFrame.
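A minimal, self-contained sketch of this loading step, using a tiny inline CSV as a stand-in for Kaggle's train.csv (the real file has 784 pixel columns):

```python
import io

import pandas as pd

# A tiny stand-in for Kaggle's train.csv: first column is the label,
# the remaining columns are raw pixel values.
csv_text = "label,pixel0,pixel1,pixel2\n5,0,128,255\n0,34,0,17\n"

# pd.read_csv gives a DataFrame; .values converts it to a numpy array.
data = pd.read_csv(io.StringIO(csv_text)).values

labels = data[:, 0]   # ground-truth digits
pixels = data[:, 1:]  # raw pixel values in [0, 255]
```

With the real data you would pass the file path (e.g. the train CSV inside your data folder) to pd.read_csv instead of the StringIO object.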
In this challenge, we are given the train and test data sets. In the train data set, there are 42,000 hand-written images of size 28×28. The first column of the CSV is going to be which digit the image represents(we call this ground truth and/or label), and the rest are 28×28=784 pixels with value ranged in [0, 255]. The test data set contains 28,000 entries and it does not have the ground truth column, because it is our job to figure out what the label actually is.
If we were not keeping this demonstration simple, we would also split the train data set into an actual train set and a validation/dev set. With this separate group of data, we can evaluate our model’s performance during training. Since the model is never trained on this group, it gives us a sense of how the model would perform in general. We can’t get this signal from the train data alone, because during training the model becomes more and more overfitted to the train set.
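Such a split could be sketched as below. The 90/10 ratio and the fixed random seed are assumptions for illustration, not choices from the article:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Stand-in for the 42,000-row array read from the train CSV.
data = np.arange(100 * 5).reshape(100, 5)

# Shuffle the row indices, then hold out the last 10% as a
# validation/dev set that the model never trains on.
indices = rng.permutation(len(data))
split = int(0.9 * len(data))
train_split = data[indices[:split]]
val_split = data[indices[split:]]
```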
Prepare the data with PyTorch
Ultimately, we want to create the data loader. But to build a data loader, we first need a dataset. The dataset works directly with our freshly read data and processes it on the fly, while the data loader does the labor of delivering the data when we need it. The data loader asks the dataset for one batch of data at a time, and the dataset pre-processes only that batch, not the entire data set. There is a trade-off between pre-processing all the data beforehand and processing it only when you actually need it.
To customize our own datasets, we define TrainDataset and TestDataset, which inherit from PyTorch’s Dataset. We use separate train and test dataset classes because their __getitem__ outputs differ. Alternatively, we could save a flag in __init__ that indicates how many outputs the corresponding class instance returns.
We divide the pixel values by 255.0. This step does two things: 1. it converts the values to float; 2. it normalizes the data to the range [0, 1]. Normalization is good practice because it keeps all inputs on a consistent scale, which generally helps training converge.
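A sketch of what these two dataset classes might look like, assuming the raw numpy arrays read above. The exact reshaping to 1×28×28 tensors and the method bodies are assumptions consistent with the description, not the article's verbatim code:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class TrainDataset(Dataset):
    """Wraps the labeled train array: column 0 is the digit, the rest are pixels."""

    def __init__(self, data):
        self.labels = data[:, 0]
        # Dividing by 255.0 converts to float and scales pixels into [0, 1].
        self.pixels = data[:, 1:] / 255.0

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Reshape the 784 values back into a 1x28x28 image tensor.
        image = torch.tensor(self.pixels[idx], dtype=torch.float32).reshape(1, 28, 28)
        return image, int(self.labels[idx])


class TestDataset(Dataset):
    """Wraps the unlabeled test array: every column is a pixel."""

    def __init__(self, data):
        self.pixels = data / 255.0

    def __len__(self):
        return len(self.pixels)

    def __getitem__(self, idx):
        # Only the image is returned -- there is no ground-truth column.
        return torch.tensor(self.pixels[idx], dtype=torch.float32).reshape(1, 28, 28)
```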
We also shuffle the train data when building the data loader. This randomness helps training because otherwise the model would see the samples in the same order every epoch and could get stuck in the same training pattern.
Batch size depends on the capability of our GPU and our configuration of other hyperparameters. I like to use a batch size of 2 when debugging my model. Yes, unfortunately, we will sometimes need to debug the model if we want to build our own from scratch, and it is not an easy task. During actual training, I find values between 16 and 512 reasonable.
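Putting shuffle and batch size together, a DataLoader can be built as follows. Here a TensorDataset of zero images stands in for the custom dataset, and batch_size=2 is the small debugging value mentioned above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A tiny stand-in dataset: 10 fake 1x28x28 images with integer labels.
images = torch.zeros(10, 1, 28, 28)
labels = torch.arange(10)
dataset = TensorDataset(images, labels)

# shuffle=True re-randomizes the sample order every epoch; for real
# training you would raise batch_size into the 16-512 range.
loader = DataLoader(dataset, batch_size=2, shuffle=True)

batch_images, batch_labels = next(iter(loader))
```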