Deep Learning — Dataset

Source: Deep Learning on Medium


Part IV

The most important ingredient of a good AI model is a good dataset. Deep learning learns from data, so it is critical to feed it the right data for the problem you are trying to solve. What makes a good dataset? First, selecting the right data to include in it; second, making sure that data is in the right scale and format.

Data is typically broken into two major categories: quantitative and qualitative. Quantitative data deals with quantities, whereas qualitative data deals with descriptions and qualities. Here is an example of qualitative data for a cup of coffee:

  1. Brown
  2. Strong aroma
  3. Hot
  4. Mild Aroma

A quantitative description of a cup of coffee would be:

  1. 106 calories
  2. 65 degrees Celsius
  3. $5.99 cost

Dataset Preparation Process

In order to train a neural network, you need to feed it tensors. A tensor is a generalization of vectors and matrices: an array of numbers arranged on a regular grid with a variable number of axes. The data also needs to be selected from what is available, then cleaned, normalized and transformed so that the algorithm can process it. Here is a three-step process for data preparation:
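To make the idea of "a variable number of axes" concrete, here is a minimal sketch using NumPy arrays as tensors. The variable names and the values are only illustrative, and the image shape is a hypothetical example.

```python
import numpy as np

# A scalar is a tensor with 0 axes.
scalar = np.array(65.0)

# A vector is a tensor with 1 axis (here: one sample with 4 features).
vector = np.array([5.1, 3.5, 1.4, 0.2])

# A matrix is a tensor with 2 axes (here: 3 samples x 4 features).
matrix = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])

# A tensor with 3 axes (here: 2 hypothetical grayscale images of 4 x 4 pixels).
images = np.zeros((2, 4, 4))

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    print(f"{name}: axes={t.ndim}, shape={t.shape}")
```

The `ndim` attribute reports the number of axes and `shape` reports the size along each axis.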

Data Selection

In order to select data, you first need to understand the business domain and the problem you are trying to solve. There is a natural tendency to believe that more data is better, but that is not necessarily true. Consider what data you actually need to address the problem at hand.

Data Conditioning

The second step, data conditioning, consists of cleaning and normalizing the data. This step is often viewed as preprocessing before data transformation.

Prepared data falls into one of four measurement levels: nominal, ordinal, interval or ratio. Nominal and ordinal observations are not inherently numeric, so they need to be converted into numbers that are meaningful to the deep learning algorithm.
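As a rough sketch of what such a conversion can look like, the example below encodes one ordinal and one nominal attribute in plain Python. The attributes (cup size and milk type), their values, and the helper names are all hypothetical and chosen only to continue the coffee example.

```python
# Ordinal: the categories have a natural order, so an integer rank preserves it.
# The size labels and their ranks are hypothetical.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}

# Nominal: the categories have no order, so each one gets its own 0/1 column.
MILK_OPTIONS = ["none", "dairy", "oat"]


def encode_ordinal(size):
    """Map an ordered category to its rank."""
    return float(SIZE_RANK[size])


def encode_nominal(milk):
    """Map an unordered category to a list with a single 1.0 (one-hot)."""
    return [1.0 if milk == option else 0.0 for option in MILK_OPTIONS]


cups = [("small", "oat"), ("large", "none"), ("medium", "dairy")]

# Each row becomes: [size rank, milk=none, milk=dairy, milk=oat]
encoded = [[encode_ordinal(size)] + encode_nominal(milk) for size, milk in cups]

for row in encoded:
    print(row)
```

Using a rank for the nominal attribute as well would suggest an order between the milk types that does not exist, which is why it gets one column per category instead.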

Data Transformation

This is the step that transforms the input data into the format and ranges that our deep learning algorithms can absorb. For example, the famous iris dataset has three classes: Iris-setosa, Iris-versicolor and Iris-virginica. When we build a multi-class classification model using deep learning, the class labels need to be reshaped from a vector that holds one class value per sample into a matrix with one boolean column per class (one-hot encoding). Here is an example of how this is done in Keras:

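# imports needed by this snippet (pandas, scikit-learn, Keras)
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
# older Keras releases exposed this function as np_utils.to_categorical
from keras.utils import to_categorical

# load data: four numeric feature columns followed by the class label
dataframe = read_csv("iris.csv", header=None)
dataset = dataframe.values
X = dataset[:, 0:4].astype(float)
Y = dataset[:, 4]

# encode class values as integers
encoder = LabelEncoder()
encoded_Y = encoder.fit_transform(Y)

# convert integers to dummy variables (i.e. one-hot encoded)
dummy_y = to_categorical(encoded_Y)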
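To see what the two encoding steps produce without needing the CSV file or Keras, here is a small NumPy-only sketch. It is not the Keras implementation, just an illustration of the same idea on four hand-written labels: `np.unique` plays the role of the label encoder and an identity matrix provides the one-hot rows.

```python
import numpy as np

# Four hand-written labels standing in for the last column of the dataset.
labels = np.array([
    "Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa",
])

# Step 1: map each distinct class name to an integer code.
# np.unique returns the classes in sorted order plus the code of every label.
classes, codes = np.unique(labels, return_inverse=True)

# Step 2: turn each code into a row with a single 1.0 (one-hot encoding).
# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes), dtype="float32")[codes]

print(classes)
print(codes)
print(one_hot)
```

Each row of `one_hot` has exactly one column set to 1, and the column index is the integer code of the class.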

Inputs to a neural network are typically tensors of floating-point data (float32). The step that turns input data into tensor format is called data vectorization.
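A minimal sketch of vectorization, assuming the raw values arrive as strings (for example, from a CSV reader). The rows below are made up for illustration.

```python
import numpy as np

# Hypothetical raw records: every value is still a string.
rows = [
    ["5.1", "3.5", "1.4", "0.2"],
    ["7.0", "3.2", "4.7", "1.4"],
]

# Parse each value to a Python float, then pack everything into one float32 tensor.
x = np.asarray([[float(value) for value in row] for row in rows], dtype="float32")

print(x.dtype, x.shape)
```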

It is problematic to feed a neural network values that take widely different ranges. The network might be able to adapt to such data automatically, but it will make learning more difficult. A best practice is to normalize each feature independently: subtract the mean of the feature and divide by its standard deviation, so that the feature is centered around 0 and has a unit standard deviation. This is called data normalization. The mean and standard deviation must be computed from the training dataset only, and those same values are then used to normalize the test data.
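The normalization described above can be sketched with NumPy as follows. The arrays are hypothetical: two features on very different scales, with a small training set and a single test sample.

```python
import numpy as np

# Hypothetical data: the second feature is about 200 times larger than the first.
x_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]], dtype="float32")
x_test = np.array([[2.0, 500.0]], dtype="float32")

# Statistics are computed per feature (axis=0) on the training data only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)

# The same training statistics are applied to both sets.
x_train_norm = (x_train - mean) / std
x_test_norm = (x_test - mean) / std

print(x_train_norm.mean(axis=0))  # close to 0 for every feature
print(x_train_norm.std(axis=0))   # close to 1 for every feature
print(x_test_norm)
```

Reusing `mean` and `std` from the training set keeps information about the test data from leaking into the preprocessing.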

In summary, to make learning easier for the network, the input data should:

  • Take small values: typically values should be in the range from 0 to 1 or -1 to 1.
  • Be homogeneous: all features should take values in roughly the same range.
  • Be normalized feature by feature to have a mean of 0.
  • Be normalized feature by feature to have a standard deviation of 1.

Summary

I hope you enjoyed reading this. This series on Deep Learning will continue by exploring other aspects and topics of the field.
