Source: Deep Learning on Medium

In Part 1, I covered a classification problem and a regression problem on a forest fire data set using artificial neural networks. Both models gave fairly poor, inconsistent results despite using deep learning to fit the data. Here I will cover three possible methods to improve these models.

### Feature Evaluation and Dimensionality Reduction

When building the previous models, I did not take into account the statistical significance of each feature. Instead, I simply included all independent variables. The first method of improving the model is to determine, and only use, the features that are most statistically significant to the dependent variable.

**Methods**

- All Features: Use every variable, which is the method I used previously. The only cases for doing this are prior domain knowledge, a requirement to keep every variable, or preparation for backward elimination.
- Backward Elimination: Start with all features and, in a series of steps, remove those that are not within a selected significance level. This method is the quickest and the one I will be using.
- Forward Selection: Also requires selecting a significance level. Fit a regression model for each feature on its own and keep the feature with the lowest p-value, then fit models that add each remaining feature to those already kept, again keeping the one with the lowest p-value. This continues until the lowest p-value exceeds the selected significance level.
- Bidirectional Elimination (stepwise): After selecting the significance levels, one step of forward selection is run, followed by all steps of backward elimination. This continues until no more variables can be added or removed.
- Score Comparison: A criterion for goodness of fit is selected and all possible regression models (2^n - 1 for n features) are attempted. The one with the best criterion value is selected. This method is the most resource intensive.
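To see why score comparison is the most resource intensive option, it helps to count the candidate models. The sketch below is illustrative only: `all_feature_subsets` is a hypothetical helper, not part of the original code, and it simply enumerates every non-empty subset of a list of feature indices.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset of the given features as a tuple."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


# With 4 features there are 2**4 - 1 = 15 candidate models to fit.
small = list(all_feature_subsets([0, 1, 2, 3]))
print(len(small))  # 15

# With the 27 feature columns used in this post (not counting the column of ones),
# the count is far too large to fit every model in practice.
print(2**27 - 1)  # 134217727
```

Each additional feature doubles the number of candidate models, which is why the stepwise methods above are usually preferred.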

**Example**

The steps for Backward Elimination are as follows:

- Step 1: Select a significance level to stay in the model (I will choose SL = 0.15), and add a column of all 1's to the feature matrix so that the fitted model includes an intercept term
- Step 2: Fit the model with all possible predictors
- Step 3: Consider the predictor with the highest p-value (if P > SL, continue to Step 4; otherwise go to Quit)
- Step 4: Remove that predictor, along with any others that share the same p-value
- Step 5: Refit the model and return to Step 3
- Quit: The model is ready

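#Evaluate statistical significance with Backward Elimination

import numpy as np

#sm.OLS is provided by statsmodels.api; recent releases of statsmodels.formula.api do not expose it

import statsmodels.api as sm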

#517 rows in X

X = np.append(arr = np.ones((517, 1)).astype(int), values = X, axis = 1)

#Select all Features

X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]]

regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()

regressor_OLS.summary()

#View results and remove the highest P-value from X-Opt and repeat

x17 has the highest p-value = 0.943. This is removed and then the process repeats until all remaining values are under 0.15. A script can be used to accomplish this.

import statsmodels.api as sm

#Relies on y (the dependent variable) already being defined

def backwardElimination(x, sl):

    numVars = len(x[0])

    for i in range(0, numVars):

        #Refit on the columns that are still in x

        classifier_OLS = sm.OLS(y, x).fit()

        maxVar = max(classifier_OLS.pvalues).astype(float)

        if maxVar > sl:

            #Find the column whose p-value is the maximum and delete it

            for j in range(0, numVars - i):

                if (classifier_OLS.pvalues[j].astype(float) == maxVar):

                    x = np.delete(x, j, 1)

    return x

#Significance Level = 15%

SL = 0.15

X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]]

X_Modeled = backwardElimination(X_opt, SL)

regressor_OLS = sm.OLS(endog = y, exog = X_Modeled).fit()

regressor_OLS.summary()

The results show that only four columns remain: the constant column of ones plus three predictors, which correspond to a dummy variable, X position, and Temp from the original data. Note that x1-x3 in this summary are not the same as x1-x3 in the initial summary, because the remaining columns are renumbered. Moving forward, all other features can be ignored in the models and only these will be considered.
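Because the summary renumbers the surviving columns, it can help to track the original column indices during elimination. The following is a minimal sketch of that idea, not the script used above: `eliminate_with_indices` and its `pvalues_for` argument are hypothetical names, and the p-values in the demo are made up purely to exercise the loop. It also removes one column per refit, which is a simplification.

```python
def eliminate_with_indices(n_columns, pvalues_for, sl):
    """Backward elimination that returns the ORIGINAL indices of kept columns.

    pvalues_for(kept) must return one p-value per index in `kept`,
    obtained by refitting the model on just those columns.
    """
    kept = list(range(n_columns))
    while kept:
        pvalues = list(pvalues_for(kept))
        worst = max(range(len(kept)), key=lambda k: pvalues[k])
        if pvalues[worst] <= sl:
            break  # every remaining column is within the significance level
        del kept[worst]  # drop the least significant column, then refit
    return kept


# Demo with made-up, fixed p-values (a real fit would change them every round).
fake_p = {0: 0.01, 1: 0.40, 2: 0.03, 3: 0.94, 4: 0.10}
kept = eliminate_with_indices(5, lambda cols: [fake_p[c] for c in cols], sl=0.15)
print(kept)  # [0, 2, 4]
```

With statsmodels, `pvalues_for` could be something like `lambda cols: sm.OLS(y, X[:, cols]).fit().pvalues`, assuming `X` and `y` are the arrays used earlier. The returned list then says directly which original columns survived.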

**Feature Extraction**

Another way of reducing dimensionality is feature extraction. The previous method, backward elimination, selects a subset of the existing features, but I could also have used Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or Kernel PCA, which are feature extraction techniques that build new features from the original ones. For example, PCA lets the user see how much of the variance each principal component explains and choose how many components to keep in order to explain a set percentage of the variance. This is done after splitting the dataset and feature scaling.

#PCA Dimensionality Reduction

from sklearn.decomposition import PCA

pca = PCA(n_components = None) #Replace none with 2

X_train_C = pca.fit_transform(X_train_C)

X_test_C = pca.transform(X_test_C)

explained_variance = pca.explained_variance_ratio_

The explained variance per component is ordered from greatest to least, as shown in the first six rows of *explained_variance*. Therefore, if I wanted to keep only two components so that I could graph the data against the model in 2D, I would be able to explain (13.9 + 7.4)%, or approximately 21%, of the variance. I would then need to split the data set and apply feature scaling again, and rerun the above code with *None* replaced by *2*.
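The same arithmetic generalizes to choosing how many components are needed for a target share of the variance. The sketch below assumes a list shaped like `pca.explained_variance_ratio_`; only the first two values (0.139 and 0.074) come from the text above, and the remaining ones are placeholders invented for illustration. `components_for` is a hypothetical helper.

```python
from itertools import accumulate


def components_for(ratios, target):
    """Return the smallest number of leading components whose explained
    variance ratios sum to at least `target`, or None if never reached."""
    for count, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return count
    return None


# First two values are from the post; the rest are illustrative placeholders.
ratios = [0.139, 0.074, 0.06, 0.05, 0.05, 0.04]

print(round(sum(ratios[:2]), 3))     # 0.213
print(components_for(ratios, 0.30))  # 4
```

The result could then be passed as `n_components` when rebuilding the PCA object.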

### Parameter Tuning

At its core, parameter tuning involves creating a model for every combination of a given set of parameter values and then comparing all of them. The parameters that give the most accurate model are then selected. For instance, you could train ten models with ten different batch sizes and compare them to select the best batch size. With parameter tuning you can also see the effect that different parameters have on the model's accuracy.

**Method**

With parameter tuning, you list the values to test for each selected parameter and then run the search. A script builds and compares models for every possible combination. This is compute intensive, so this stage will take the longest, especially as the number of parameters and the number of choices for each one grow.
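As a rough illustration of what an exhaustive search does and why it gets expensive, the sketch below enumerates every combination in a parameter grid and keeps the best-scoring one. `exhaustive_search` and `toy_score` are hypothetical stand-ins; in a real run the scoring function would train and cross-validate a model for each combination.

```python
from itertools import product


def exhaustive_search(param_grid, score_fn):
    """Score every combination in param_grid; return (best_params, best_score, n_tried)."""
    names = list(param_grid)
    best_params, best_score, n_tried = None, float("-inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


def toy_score(batch_size, epochs, optimizer):
    """Made-up score so the example runs instantly; not a real model."""
    return (optimizer == "adam") - abs(batch_size - 16) / 100 - epochs / 10000


grid = {"batch_size": [1, 16, 32], "epochs": [100, 500], "optimizer": ["adam", "rmsprop"]}
best, score, tried = exhaustive_search(grid, toy_score)
print(tried)  # 12 combinations (3 * 2 * 2)
print(best)   # {'batch_size': 16, 'epochs': 100, 'optimizer': 'adam'}
```

With 10-fold cross-validation, each of those 12 combinations is trained 10 times, so a grid of this size already means 120 model fits before the best one is chosen.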

**Example**

I will adjust the batch size, optimizer, and number of epochs to determine the best set of those parameters for the classification problem of determining the size class of a fire.

'''CLASSIFICATION'''

#Avoid: ValueError: Classification metrics can't handle a mix of continuous-multioutput and binary targets

import pandas as pd

dataset = pd.read_csv('forestfires.csv')

y = dataset.iloc[:, 12].values # dependent variable (burned area)

for i in range(0, len(y)):

    #Scale the area by 2.47 (hectares to acres), then label: 0 = under 100 acres, 1 = 100 acres or more

    y[i] = (y[i]*2.47)

    if y[i] < 100.0:

        y[i] = 0

    else:

        y[i] = 1

y_Corrected = y.astype(int)

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_Modeled, y_Corrected, test_size = 0.2)

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

#Tuning For Epochs, Batch Size, Optimizer

from keras.models import Sequential

from keras.layers import Dense

#KerasClassifier import path for older Keras releases; newer setups may need the separate scikeras package

from keras.wrappers.scikit_learn import KerasClassifier

from sklearn.model_selection import GridSearchCV

def build_classifier(optimizer):

    classifier = Sequential()

    classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim = 4))

    classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))

    classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))

    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

    #Use the optimizer passed in by the grid search rather than a fixed one

    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])

    return classifier

classifier = KerasClassifier(build_fn = build_classifier)

parameters = {'batch_size': [1, 16, 32], 'epochs': [100, 500], 'optimizer': ['adam', 'rmsprop']}

grid_search = GridSearchCV(estimator = classifier, param_grid = parameters, scoring = 'accuracy', cv = 10)

grid_search = grid_search.fit(X_train, y_train)

best_parameters = grid_search.best_params_

best_accuracy = grid_search.best_score_

print('Best Parameters: %s' % best_parameters)

print('Best Accuracy: %s' % best_accuracy)

From the output the best parameters are:

- Batch size: 25
- Epochs: 100
- Optimizer: adam

These can then be used to build and test a new classification model, which I will cover with the final improvements. Note that this function can also be modified to test a few different structures for the neural network, adjusting the number of layers and the nodes in each layer, but this will also increase the time required to tune the parameters.

### Dropout

The concept of dropout is simple: randomly drop a set fraction of nodes for each input during training. The reasoning behind this is to prevent the network from overfitting by relying too heavily on any single node. Overfitting was one possible reason for the models giving high accuracy on training data but low accuracy on test data.

**Method**

Randomly drop a selected percentage of nodes from the chosen layer. This is done using the Dropout layer in Keras.
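To make the mechanism concrete, here is a minimal pure-Python sketch of dropout applied to one layer's activations during training. It assumes the common "inverted dropout" formulation, in which surviving activations are scaled by 1 / (1 - rate) so their expected sum is unchanged; `apply_dropout` is a hypothetical helper, not a Keras function.

```python
import random


def apply_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
layer_output = [1.0] * 10000
dropped = apply_dropout(layer_output, rate=0.1, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(zero_fraction)    # close to 0.1
print(mean_activation)  # close to 1.0
```

Dropout is only active while training; at prediction time every node is used.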

**Example**

Using all of the improvements from the previous stages, I will implement a model with dropout set to 10% and measure its accuracy against a test set.

#BUILD NEW MODEL

#Using Modifications

from keras.models import Sequential

from keras.layers import Dense, Dropout #For Layers of ANN

classifier = Sequential()

classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim = 4))

classifier.add(Dropout(rate=0.1))

classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))

classifier.add(Dropout(rate=0.1))

classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))

classifier.add(Dropout(rate=0.1))

classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

classifier.fit(X_train, y_train, batch_size = 25, epochs = 100)

from sklearn.metrics import confusion_matrix

y_pred = classifier.predict(X_test)

y_pred = (y_pred > 0.5)

cm = confusion_matrix(y_test, y_pred)

Evaluation Results

Mean: 0.9394308885561603

Variance: 0.01942428129876616

Now we can see that the model improved significantly, and the test and evaluation accuracies are very close to the accuracy seen during training. It should be noted, though, that because there are so few cases of large forest fires, the model tends towards false negatives: it predicts that the majority of fires are small, under-predicting the burned area as under 100 acres. This could be an artifact of the data, and therefore another model (KNN, Decision Tree, XGBoost, etc.) should be used to test this phenomenon.
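The effect of class imbalance on accuracy can be shown with a small, made-up example. The labels below are hypothetical (they are not the forest fire data), and `binary_metrics` is an illustrative helper: a model that always predicts the small class scores high accuracy while missing every large fire.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, recall for class 1, and the false-negative count."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall, fn


# Hypothetical labels: 18 small fires (0) and 2 large fires (1).
y_true = [0] * 18 + [1] * 2
always_small = [0] * 20  # a model that never predicts a large fire

accuracy, recall, false_negatives = binary_metrics(y_true, always_small)
print(accuracy)         # 0.9
print(recall)           # 0.0
print(false_negatives)  # 2
```

Checking recall on the large-fire class alongside accuracy is one way to tell whether a model has learned anything beyond the majority class.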

For the sake of visualizing the model, however, I can select and train the model with only two independent variables, which lets me show a 2D graph of the ANN model. The data will collect in two lines since one feature is binary.

### Conclusion

As can be seen, adding a minimal amount of model evaluation and tuning can assist in improving a model. Though there may be no way to improve the quality of the small dataset, these models still have room for improvement through varying neural network structure, trying various dimensionality reduction techniques, and adjusting the dropout. In addition, the prediction to be made about the data could be changed as well to answer a different question that may be better suited to the data set.