Original article was published by Besma Guesmi on Deep Learning on Medium
- Data Preprocessing
Data preprocessing means preparing (cleaning and organizing) raw data so that it is suitable for building and training models. In simple terms, it is a data mining technique that transforms raw data into an understandable, readable format, so that the data is clean, consistently formatted and ready for the model.
In this project, data preprocessing consists of five steps.
1.1 Acquire the Dataset
The dataset can be downloaded from the Academic Torrents website: https://academictorrents.com/details/274be65156ed14828fb7b30b82407a2417e1924a
In this project, we use the NIfTI (Neuroimaging Informatics Technology Initiative) format, which can store data with different meanings. Imaging data, statistical values and other data (any vector, matrix, label set or mesh) can be saved in a NIfTI-1 *.nii file or a *.hdr/*.img file pair.
MR images: magnetic resonance imaging (MRI) is a type of scan that uses strong magnetic fields and radio waves to produce detailed images of the inside of the body (brain, breasts, heart, blood vessels and so on).
1.2 Importing libraries
First, we need to install all the required modules: tensorflow, keras, matplotlib and numpy, as well as other libraries specific to the project. The os module is part of the Python standard library and needs no installation.
In addition, one of the most important modules is “nibabel”, which is needed to read the NIfTI format.
Read more about Python libraries for Data Science here: https://www.upgrad.com/blog/python-libraries-for-data-science/
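As a quick check before going further, the sketch below reports which of the modules mentioned above are missing from the current environment. The module list is an assumption based on this section, not the author's exact requirements file, and find_missing is a hypothetical helper name. Note that scikit-learn is imported as “sklearn”.

```python
import importlib.util

# Import names of the third-party modules mentioned in this section.
# This list is an assumption for illustration, not the project's actual
# requirements. TensorFlow 2 also exposes Keras as tensorflow.keras.
REQUIRED_MODULES = ["tensorflow", "matplotlib", "numpy", "nibabel", "sklearn"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Any module reported as missing can then be installed with pip before running the rest of the project.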
1.3 Loading and reading the dataset
At this stage, we need to:
Store the path to our image dataset in a variable, using “os.path.join” from the os module.
Import “nibabel” and load the dataset. (Note: “nibabel” does not read the image array when the file is loaded; it defers reading until the data array is requested with the get_fdata() method.)
1.4 Standardize images
This is a critical preprocessing step in computer vision. Models generally train faster on smaller images, and training time increases as images get larger or more complex. Moreover, many deep learning architectures require all input images to be the same size, which is rarely the case for raw acquired data.
To address this, we establish a base size for all images fed into the algorithm, so that every image in the dataset is resized to this minimum image size.
The minimum image size is set to (32, 32, 1): width, height and channel.
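A minimal sketch of this resizing step is shown below, using plain NumPy index sampling. The article does not say which resizing method the project uses, so the nearest-neighbour approach, the helper name resize_nearest and the 240x240 input slice are all assumptions for illustration.

```python
import numpy as np

TARGET_SIZE = (32, 32)  # base size from the article; the channel axis is added below


def resize_nearest(image, size=TARGET_SIZE):
    """Resize a 2D image by index sampling and append a channel axis."""
    rows, cols = image.shape
    # Map every output index to a source index by scaling it to the input size.
    row_idx = np.arange(size[0]) * rows // size[0]
    col_idx = np.arange(size[1]) * cols // size[1]
    resized = image[row_idx[:, None], col_idx[None, :]]
    return resized[..., np.newaxis]  # e.g. (32, 32) becomes (32, 32, 1)


# Hypothetical 2D slice standing in for one slice of an MR volume.
slice_2d = np.random.rand(240, 240)
standardized = resize_nearest(slice_2d)
print(standardized.shape)
```

Index sampling keeps the example free of extra dependencies. An interpolating resize from an image library would usually give smoother results.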
Many other preprocessing techniques can be used to get image data ready for training. Removing the background color from the images reduces noise. Other projects may require brightening or darkening the images. Data augmentation enlarges the dataset with perturbed versions of the existing images (scaling, rotation, de-colorizing, de-texturizing and so on). In short, any adjustment applied to the dataset before training counts as preprocessing. Selecting appropriate techniques for a given dataset and problem also builds intuition about which ones are needed in other projects.
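As an illustration of data augmentation, the sketch below produces flipped and rotated copies of one image with NumPy. The choice of transformations and the helper name augment are assumptions; the article does not specify which perturbations the project applies.

```python
import numpy as np


def augment(image):
    """Return simple perturbed copies of one (height, width, channel) image."""
    return [
        np.fliplr(image),      # horizontal flip
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # rotation by 90 degrees
        np.rot90(image, k=2),  # rotation by 180 degrees
    ]


# Placeholder image with the base size used in this article.
image = np.random.rand(32, 32, 1)
augmented = augment(image)
print(len(augmented), augmented[0].shape)
```

Because the image is square, every copy keeps the (32, 32, 1) shape and can be added to the training data directly.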
1.5 Splitting dataset
The final step is to split the dataset into two separate sets: a training set and a test set.
The training set is used to train the model, and the test set is used to evaluate it.
Usually, the dataset is split into 70% training and 30% test, or 80% training and 20% test.
In the code, the data is split with scikit-learn's train_test_split function, imported with “from sklearn.model_selection import train_test_split”.
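A minimal sketch of an 80/20 split with train_test_split is shown below. The arrays are random placeholders standing in for the preprocessed images and their labels, and the random_state value is an arbitrary choice to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 images of size 32x32x1 with binary labels.
images = np.random.rand(100, 32, 32, 1)
labels = np.random.randint(0, 2, size=100)

# test_size=0.2 gives the 80/20 split; random_state makes it reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)
```

Setting test_size=0.3 instead would give the 70/30 split.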