Hyperparameter Tuning with Deep Learning Grid Search

Source: Deep Learning on Medium


This machine learning project is about diabetes prediction. We will be working with the Kaggle Pima Indians Diabetes dataset.

The necessary packages are imported.

# Importing the necessary packages
import pandas as pd
import numpy as np
import keras

The dataset is read into ‘df’ dataframe.

# Reading the file
df = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')

Let us understand the dataframe ‘df’.

df.shape # Shape of ‘df’

The shape is (768, 9), which means there are 768 cases (rows) and 9 columns.

df.columns # Prints columns of ‘df’

The columns are [‘Pregnancies’, ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, ‘BMI’, ‘DiabetesPedigreeFunction’, ‘Age’, ‘Outcome’]

df.describe() # Displays properties of each column

All the columns have a count of 768, which suggests there are no missing values. The mean of ‘Outcome’ is 0.35, which suggests there are more ‘Outcome’ = 0 cases than ‘Outcome’ = 1 cases in the given dataset.

The dataframe ‘df’ is converted into the numpy array ‘dataset’.

dataset = df.values

The ‘dataset’ is split into the input X and the output y.

X = dataset[:,0:8]
y = dataset[:,8].astype('int')

Standardization

It can be observed that the mean values of the columns are very different. Hence the dataset needs to be standardized so that no feature is given inappropriate weightage.

# Standardization
from sklearn.preprocessing import StandardScaler
a = StandardScaler()
a.fit(X)
X_standardized = a.transform(X)

Now let us look at the mean and standard deviation of ‘X_standardized’.

pd.DataFrame(X_standardized).describe()

The mean of every column is around 0 and the standard deviation is around 1, so the data has been standardized.

Tuning of Hyperparameters :- Batch Size and Epochs

# Importing the necessary packages
from sklearn.model_selection import GridSearchCV, KFold
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import Adam

The neural network architecture and the optimization algorithm are defined. The neural network consists of 1 input layer, 2 hidden layers with the rectified linear unit (ReLU) activation function and 1 output layer with the sigmoid activation function. Adam is chosen as the optimization algorithm for the neural network model.
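The article does not show the ‘create_model’ function itself; a minimal sketch consistent with the description above could be the following (the hidden layer sizes of 8 and 4 are assumptions at this stage, and are tuned later):

# Sketch of the model-building function described above (hidden layer sizes are assumed)
def create_model():
    model = Sequential()
    model.add(Dense(8, input_dim = 8, activation = 'relu'))
    model.add(Dense(4, activation = 'relu'))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(optimizer = Adam(), loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model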

We run the grid search over 2 hyperparameters :- ‘batch_size’ and ‘epochs’. The cross-validation technique used is K-Fold with the default value k = 3. The accuracy score is calculated.

# Create the model
model = KerasClassifier(build_fn = create_model,verbose = 0)
# Define the grid search parameters
batch_size = [10,20,40]
epochs = [10,50,100]
# Make a dictionary of the grid search parameters
param_grid = dict(batch_size = batch_size,epochs = epochs)
# Build and fit the GridSearchCV
grid = GridSearchCV(estimator = model,param_grid = param_grid,cv = KFold(),verbose = 10)
grid_result = grid.fit(X_standardized,y)

The results are summarized. The best accuracy score and the best values of hyperparameters are printed.

# Summarize the results
print('Best : {}, using {}'.format(grid_result.best_score_,grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('{},{} with: {}'.format(mean, stdev, param))

The best accuracy score is 0.7604 for ‘batch_size’ = 40 and ‘epochs’ = 10. So we choose ‘batch_size’ = 40 and ‘epochs’ = 10 while tuning other hyperparameters.

Tuning of Hyperparameters :- Learning rate and Drop out rate

The learning rate plays an important role in the optimization algorithm. If the learning rate is too large, the algorithm may diverge and never find a local optimum. If the learning rate is too small, the algorithm may take many iterations to converge, which costs a lot of computation and time. Thus we need an optimum value of the learning rate: small enough for the algorithm to converge, yet large enough to speed up convergence. The learning rate also interacts with ‘early stopping’, a regularization method in which training continues only as long as the accuracy on a held-out test set keeps improving.

Dropout is a regularization method that reduces the complexity of the model and thus prevents overfitting to the training data. By dropping an activation unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections. The dropout rate can take values between 0 and 1: 0 implies no activation units are knocked out and 1 implies all the activation units are knocked out.
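The grid search code for this step is not included in the article. A minimal sketch, assuming ‘create_model’ is redefined to accept ‘learning_rate’ and ‘dropout_rate’ and keeping the previously chosen ‘batch_size’ = 40 and ‘epochs’ = 10 (the candidate value lists are assumptions), might look like this:

# Sketch :- tuning learning rate and dropout rate (candidate values are assumptions)
from keras.layers import Dropout
def create_model(learning_rate, dropout_rate):
    model = Sequential()
    model.add(Dense(8, input_dim = 8, activation = 'relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(4, activation = 'relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(optimizer = Adam(lr = learning_rate), loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model
model = KerasClassifier(build_fn = create_model, batch_size = 40, epochs = 10, verbose = 0)
learning_rate = [0.001,0.01,0.1]
dropout_rate = [0.0,0.1,0.2]
param_grid = dict(learning_rate = learning_rate, dropout_rate = dropout_rate)
grid = GridSearchCV(estimator = model, param_grid = param_grid, cv = KFold(), verbose = 10)
grid_result = grid.fit(X_standardized, y)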

The best accuracy score is 0.7695 for ‘dropout_rate’ = 0.1 and ‘learning_rate’ = 0.001. So we choose ‘dropout_rate’ = 0.1 and ‘learning_rate’ = 0.001 while tuning other hyperparameters.

Tuning of Hyperparameters :- Activation Function and Kernel Initializer

Activation functions introduce non-linear properties to the neural network, so that complex non-linear functional mappings between input and output can be established. If we do not apply an activation function, the output would be a simple linear function of the input.

The neural network needs to start with some weights and then iteratively update them to better values. The kernel initializer decides the statistical distribution or function used to initialize those weights.
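The code for this step is again not shown; a sketch assuming ‘create_model’ accepts ‘activation_function’ and ‘kernel_initializer’, with the hyperparameters chosen so far fixed (candidate lists are assumptions), could be:

# Sketch :- tuning activation function and kernel initializer (candidate values are assumptions)
def create_model(activation_function, kernel_initializer):
    model = Sequential()
    model.add(Dense(8, input_dim = 8, kernel_initializer = kernel_initializer, activation = activation_function))
    model.add(Dropout(0.1))
    model.add(Dense(4, kernel_initializer = kernel_initializer, activation = activation_function))
    model.add(Dropout(0.1))
    model.add(Dense(1, kernel_initializer = kernel_initializer, activation = 'sigmoid'))
    model.compile(optimizer = Adam(lr = 0.001), loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model
model = KerasClassifier(build_fn = create_model, batch_size = 40, epochs = 10, verbose = 0)
activation_function = ['softmax','relu','tanh','linear']
kernel_initializer = ['uniform','normal','zero']
param_grid = dict(activation_function = activation_function, kernel_initializer = kernel_initializer)
grid = GridSearchCV(estimator = model, param_grid = param_grid, cv = KFold(), verbose = 10)
grid_result = grid.fit(X_standardized, y)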

The best accuracy score is 0.7591 for ‘activation_function’ = tanh and ‘kernel_initializer’ = uniform. So we choose ‘activation_function’ = tanh and ‘kernel_initializer’ = uniform while tuning other hyperparameters.

Tuning of Hyperparameters :- Number of Neurons in the Hidden Layers

The complexity of the model has to match the complexity of the data. The number of neurons in the hidden layers decides the complexity of the model: the higher the number of neurons, the more complex the non-linear functional mappings between input and output that can be learned.
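A sketch of this step, assuming the two hidden layer sizes are exposed as ‘neuron1’ and ‘neuron2’ (the names and candidate lists are illustrative) and fixing the hyperparameters chosen so far:

# Sketch :- tuning the number of neurons in the two hidden layers (candidate values are assumptions)
def create_model(neuron1, neuron2):
    model = Sequential()
    model.add(Dense(neuron1, input_dim = 8, kernel_initializer = 'uniform', activation = 'tanh'))
    model.add(Dropout(0.1))
    model.add(Dense(neuron2, kernel_initializer = 'uniform', activation = 'tanh'))
    model.add(Dropout(0.1))
    model.add(Dense(1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    model.compile(optimizer = Adam(lr = 0.001), loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model
model = KerasClassifier(build_fn = create_model, batch_size = 40, epochs = 10, verbose = 0)
neuron1 = [4,8,16]
neuron2 = [2,4,8]
param_grid = dict(neuron1 = neuron1, neuron2 = neuron2)
grid = GridSearchCV(estimator = model, param_grid = param_grid, cv = KFold(), verbose = 10)
grid_result = grid.fit(X_standardized, y)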

The best accuracy score is 0.7591 for number of neurons in first layer = 16 and number of neurons in second layer = 4.

The optimum values of Hyperparameters are as follows :-
Batch size = 40
Epochs = 10
Dropout rate = 0.1
Learning rate = 0.001
Activation function = tanh
Kernel Initializer = uniform
No. of neurons in layer 1 = 16
No. of neurons in layer 2 = 4

Training model with optimum values of Hyperparameters

The model is trained using the optimum values of the hyperparameters found in the previous sections.
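The exact training code is not shown in the article; a minimal sketch that reuses the last ‘create_model’ (with ‘neuron1’ and ‘neuron2’) and, for simplicity, evaluates on the same data could be:

# Sketch :- training and evaluating with the tuned hyperparameters
from sklearn.metrics import accuracy_score, classification_report
model = KerasClassifier(build_fn = create_model, neuron1 = 16, neuron2 = 4, batch_size = 40, epochs = 10, verbose = 0)
model.fit(X_standardized, y)
y_pred = model.predict(X_standardized).flatten()
print(accuracy_score(y, y_pred))
print(classification_report(y, y_pred))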

We get an accuracy of 77.6% and F1 scores of 0.84 and 0.65.

The hyperparameter optimization was carried out by taking 2 hyperparameters at a time, so we may have missed the best combination. The performance can be further improved by finding the optimum values of all the hyperparameters at once, as in the code snippet below. Note :- this process is computationally expensive.
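A sketch of such a combined search, with illustrative candidate lists (this grid already implies thousands of model fits):

# Sketch :- one grid search over all hyperparameters at once (computationally expensive)
def create_model(learning_rate, dropout_rate, activation_function, kernel_initializer, neuron1, neuron2):
    model = Sequential()
    model.add(Dense(neuron1, input_dim = 8, kernel_initializer = kernel_initializer, activation = activation_function))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neuron2, kernel_initializer = kernel_initializer, activation = activation_function))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, kernel_initializer = kernel_initializer, activation = 'sigmoid'))
    model.compile(optimizer = Adam(lr = learning_rate), loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model
model = KerasClassifier(build_fn = create_model, verbose = 0)
param_grid = dict(batch_size = [10,20,40],
                  epochs = [10,50,100],
                  learning_rate = [0.001,0.01,0.1],
                  dropout_rate = [0.0,0.1,0.2],
                  activation_function = ['relu','tanh'],
                  kernel_initializer = ['uniform','normal'],
                  neuron1 = [4,8,16],
                  neuron2 = [2,4,8])
grid = GridSearchCV(estimator = model, param_grid = param_grid, cv = KFold(), verbose = 10)
grid_result = grid.fit(X_standardized, y)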

Happy reading!