Difference between ML and deep learning with respect to splitting of the dataset into…



[Figure: amount of data vs. performance]

The major difference between machine learning and deep learning lies in how they perform as the size of the data increases. Deep learning algorithms require a large amount of data; when the data is small, they don't perform that well. This naturally raises the question: why do deep learning algorithms require so much data?

When applying machine learning algorithms to a specific problem, most of the features need to be identified by an expert and then hand-coded according to the domain and data type. In the case of deep learning, the algorithm itself detects the high-level features from the data: there is no need to handpick features, because the model learns them during training. This is why deep learning requires much more data to effectively extract and learn those features, and it is the main reason why machine learning algorithms can be trained on far less data than deep learning algorithms.

Train/cross-validation/test sets:

Training set: This part of the dataset contains the data on which we train our model, i.e., the data used to fit the model's parameters.

Cross-validation/dev set: We usually train models using several different algorithms so that we can compare which one gives the best results. To check which model performs best, we use the cross-validation set. It is also used to tune the hyperparameters of the classifier, for example, to choose the number of hidden units in a neural network.

Test set: This sample of data is used to provide an unbiased evaluation of the final model fit on the training set. Basically, it is used to check how well the model performs on unseen data.
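To make these three roles concrete, here is a minimal sketch of producing a 60/20/20 train/cross-validation/test split with scikit-learn's train_test_split, applied twice. The arrays X and y are hypothetical placeholders standing in for your features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)             # hypothetical feature matrix
y = np.random.randint(0, 2, size=1000)  # hypothetical binary labels

# First split off the 60% training set, leaving 40% behind...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)

# ...then split the remaining 40% evenly into cross-validation and test sets.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_dev), len(X_test))  # 600 200 200
```

The model's parameters are then fit on (X_train, y_train), algorithms and hyperparameters are compared on (X_dev, y_dev), and (X_test, y_test) is touched only once, for the final evaluation.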

Splitting of the dataset into train/cross-validation/test sets in the case of machine learning:

We know that machine learning algorithms can be trained on less data than deep learning algorithms. A common practice is to divide the dataset in a 60/20/20 pattern: 60% of the data points go to the training set and are used to fit the parameters; 20% go to the cross-validation set, which is used to select the best algorithm for the problem as well as for hyperparameter tuning; and the remaining 20% go to the test set, which is used for the final unbiased evaluation of the model.

The splitting pattern may vary depending on the size of the available dataset. If the dataset contains between 1,000 and 10,000 data points, a 60/20/20 pattern is commonly used. For 10,000 to 30,000 data points, a 70/15/15 pattern is more typical, and for datasets with more than 30,000 (and fewer than 80,000–90,000) data points, an 80/10/10 pattern is more likely. Beyond a certain amount of data, the accuracy of a classical machine learning model stops improving. If a much larger amount of data is available for training, deep learning is the better option for achieving higher accuracy than other machine learning algorithms. These rules of thumb are sketched as a small helper below.
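The following sketch encodes the heuristics above as a plain Python function. The name split_ratios and the exact thresholds are illustrative only; they mirror this article's rules of thumb, not hard rules.

```python
def split_ratios(n_points):
    """Return (train, dev, test) fractions following the rules of thumb
    described above for classical machine learning datasets."""
    if n_points <= 10_000:
        return 0.60, 0.20, 0.20
    elif n_points <= 30_000:
        return 0.70, 0.15, 0.15
    else:
        return 0.80, 0.10, 0.10

for n in (5_000, 20_000, 60_000):
    train, dev, test = split_ratios(n)
    print(f"{n}: train={int(n * train)}, dev={int(n * dev)}, test={int(n * test)}")
```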

Splitting of the dataset into train/cross-validation/test sets in the case of deep learning:

We know that in deep learning a huge amount of data is required to train the model, since the model learns the features during training rather than relying on hand-crafted features. The splitting pattern in deep learning depends entirely on the amount of data available. Because most of the data is needed for training, the training set will be much larger than the cross-validation and test sets; however, a certain amount of data is always required for those two sets.

For example, suppose we have a dataset with 1,000,000 data points. The splitting pattern can then be 98/1/1: the training set will contain 980,000 data points, the cross-validation set 10,000, and the test set 10,000. The dataset can also be split in a 99/0.5/0.5 pattern, since the cross-validation and test sets do not require a large amount of data, only a certain fixed amount. The sizes of the cross-validation and test sets also need not be equal.
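At this scale it is common to split by shuffled indices rather than copying arrays. Here is a minimal sketch of the 98/1/1 split for 1,000,000 points using NumPy; fixing the dev and test sets at 10,000 points each reflects the example above.

```python
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
indices = rng.permutation(n)  # shuffled indices into the full dataset

# 98/1/1 split: fix the dev and test sizes, give everything else to training.
n_dev = n_test = 10_000
n_train = n - n_dev - n_test

train_idx = indices[:n_train]
dev_idx = indices[n_train:n_train + n_dev]
test_idx = indices[n_train + n_dev:]

print(len(train_idx), len(dev_idx), len(test_idx))  # 980000 10000 10000
```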

The main takeaway is that in deep learning the training set always contains far more data points than the cross-validation and test sets. In machine learning the training set is also larger than the cross-validation and test sets, but not to the same degree as in deep learning.