Source: Deep Learning on Medium
Local Optima: Earlier, the main worry while minimizing the cost was getting stuck in a local optimum rather than reaching the global optimum, but research has since shown that in high-dimensional networks the chance of getting stuck in a local optimum is very low.
Instead, the main problem now is plateaus, which make learning slow.
Epochs — If too low you might underfit; if too high you might overfit (if the network is large).
Basic Recipe for Machine Learning:
High Bias: Low Training Set Performance.
To address High Bias use Bigger Network or Train Longer.
High Variance: Low Validation/Dev Set Performance.
To address High Variance use More Data or use Regularization.
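This recipe can be sketched as a simple decision rule. The `target` and `gap` thresholds below are illustrative assumptions, not standard values:

```python
def next_step(train_acc, dev_acc, target=0.95, gap=0.05):
    """Basic recipe: diagnose high bias first, then high variance.

    train_acc below target       -> high bias (underfitting)
    dev_acc well below train_acc -> high variance (overfitting)
    """
    if train_acc < target:
        return "high bias: try a bigger network or train longer"
    if train_acc - dev_acc > gap:
        return "high variance: try more data or regularization"
    return "looks fine"

print(next_step(0.80, 0.78))  # poor training performance -> high bias
print(next_step(0.99, 0.85))  # big train/dev gap -> high variance
```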
Early Stopping — Stop the training before the validation error gets worse.
Batch Normalization — Besides acting as a regularizer, it speeds up training and lets you train with a higher learning rate. We use Batch Norm pretty much always.
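The normalization step itself is simple; here is a minimal NumPy sketch of the training-time forward pass (omitting the running statistics that are used at inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    x: (batch, features) activations; gamma/beta: learnable (features,) params.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta

x = np.random.randn(64, 10) * 5 + 3          # badly scaled activations
out = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(6))             # each feature now has mean ~0
print(out.std(axis=0).round(2))              # ...and std ~1
```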
Data Augmentation — It is also used to provide Regularization.
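A minimal sketch of the idea, using random horizontal flips and random crops as the label-preserving transforms (real pipelines also add rotation, color jitter, etc.):

```python
import numpy as np

def augment(img, rng):
    """Return a randomly perturbed copy of a 28x28 image.

    Each call yields a slightly different training example from the same
    source image, which regularizes the model.
    """
    if rng.random() < 0.5:
        img = img[:, ::-1]                    # random horizontal flip
    top = rng.integers(0, 3)
    left = rng.integers(0, 3)
    return img[top:top + 26, left:left + 26]  # random 26x26 crop

rng = np.random.default_rng(0)
img = np.arange(28 * 28, dtype=float).reshape(28, 28)
batch = [augment(img, rng) for _ in range(4)]  # 4 distinct variants of one image
print([a.shape for a in batch])
```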
Weight Initialization — Xavier
To mitigate Vanishing and Exploding Gradients we use Xavier Initialization, which scales the initial weights so that the variance of each layer's activations stays close to 1 as the signal propagates through the network.
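A rough NumPy sketch of why this works: with weight variance 2/(fan_in + fan_out), the activation variance stays on the order of 1 even through a deep stack of linear layers, instead of vanishing or exploding:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Xavier/Glorot initialization: weight variance 2/(fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Push a signal through 20 linear layers and watch its variance
x = np.random.default_rng(1).normal(size=(256, 512))
for _ in range(20):
    x = x @ xavier_init(512, 512)
print(round(float(x.var()), 2))  # stays O(1) rather than -> 0 or -> inf
```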
Activation Functions — ReLU (Leaky ReLU, ELU), Sigmoid (Log Sigmoid), Tanh (Tan Sigmoid), SoftMax.
Output Types — Classification (Single-class (Cat vs Dog), Multiclass (Object Recognition, Segmentation)), Regression, Generation.
Loss — MSE, RMSE, Cross Entropy, Mean Absolute Error (MAE), Negative Log Likelihood, Wasserstein
Note: When you do single label multi-class classification then you use SoftMax as your Activation function for last layer and Cross-Entropy as your Loss Function.
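A minimal NumPy sketch of that pairing, including the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    """Mean negative log-probability of the correct class."""
    probs = softmax(logits)
    return float(-np.log(probs[np.arange(len(logits)), targets]).mean())

logits = np.array([[2.0, 0.5, -1.0]])                  # raw scores for 3 classes
print(softmax(logits).round(3))                        # probabilities summing to 1
print(round(cross_entropy(logits, np.array([0])), 3))  # low loss: predicted class
print(round(cross_entropy(logits, np.array([2])), 3))  # high loss: wrong class
```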
Metrics — Accuracy, F1 Score (Precision, Recall) (2PR/(P+R)), Error Rate.
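The F1 formula in code, starting from raw true-positive, false-positive, and false-negative counts:

```python
def f1_score(tp, fp, fn):
    """F1 = 2PR/(P+R): the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives
print(round(f1_score(8, 2, 4), 3))  # P=0.8, R=0.667 -> F1 = 0.727
```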
Types of Search for Local Minima — Gradient Descent, Simulated Annealing, Evolutionary.
Optimizer — Gradient Descent, Momentum, RMS Prop, Adam
Exponentially Weighted Moving Average (EWMA) — Here Beta sets the averaging window: the closer Beta is to 1, the larger the window and the smoother the resulting curve.
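A minimal implementation, with the bias correction that compensates for the average starting at zero. Note how a larger Beta yields a smoother curve:

```python
def ewma(xs, beta):
    """EWMA: v_t = beta * v_{t-1} + (1 - beta) * x_t.

    Averages over roughly 1/(1-beta) recent points, so Beta closer to 1
    means a wider window and a smoother curve.
    """
    v, out = 0.0, []
    for t, x in enumerate(xs, 1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t))  # bias correction for the early steps
    return out

noisy = [1, 3, 2, 4, 3, 5, 4, 6]
print([round(v, 2) for v in ewma(noisy, beta=0.9)])  # smooth
print([round(v, 2) for v in ewma(noisy, beta=0.5)])  # tracks the data closely
```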
These optimizers provide Dynamic (adaptive) Learning Rates: they work out which parameters need large updates and which need small ones.
Momentum (Beta 1) — an EWMA of the gradients is introduced in order to damp the oscillations.
RMS Prop (Beta 2) — dampens the vertical oscillations and speeds up progress in the horizontal direction.
Adam — the combination of Momentum and RMS Prop: both moving averages are maintained and combined in a single update. Bias Correction is also used to remove the initial bias toward zero.
(Typical defaults: Beta 1 = 0.9, Beta 2 = 0.999, Epsilon = 10^-8; Epsilon keeps the denominator from being zero.) Even with Adam, you still need Learning Rate Annealing.
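Putting the pieces together, a single Adam update can be sketched in a few lines of NumPy using the default hyperparameters above:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: Momentum (beta1) + RMS Prop (beta2) with bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (RMS Prop)
    m_hat = m / (1 - beta1 ** t)                 # bias correction removes the
    v_hat = v / (1 - beta2 ** t)                 # initial bias toward zero
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # eps avoids division by zero
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w), starting from w = 5
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(round(w, 3))  # converges toward the minimum at 0
```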
Batch Size — 4,…, 16, 32 …1024… depending on the computing power you have.
Learning Rate Decay (Learning Rate Annealing) — To speed up optimization, the learning rate starts higher and is reduced toward the end of training.
Types: Exponential Decay, Staircase Decay.
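Both schedules are one-liners; the constants below are illustrative:

```python
import math

def exponential_decay(lr0, k, t):
    """Smooth decay: lr = lr0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

def staircase_decay(lr0, drop, every, t):
    """Step decay: multiply the rate by `drop` every `every` epochs."""
    return lr0 * drop ** (t // every)

for t in [0, 5, 10, 20]:
    print(t,
          round(exponential_decay(0.1, 0.1, t), 4),   # shrinks every epoch
          staircase_decay(0.1, 0.5, 10, t))           # halves every 10 epochs
```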
In the Fastai library fit_one_cycle() works well: it first increases the learning rate and then decreases it.
Regularization — Batch Normalization, L2(Ridge Regression)(Weight Decay), Dropout, L1(Lasso(Least Absolute Shrinkage and Selection Operator)), Elastic Net(L1 and L2)
Extra Note: When updating a weight you use the gradient of the loss with respect to that weight. With L2 regularization, the gradient is also applied to the regularization term, which yields a term proportional to the weight itself; subtracting it shrinks the weight on every update, which is why this is called Weight Decay.
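A quick numerical check of this equivalence for plain SGD (the numbers are illustrative):

```python
lr, lam = 0.1, 0.01
w, grad_loss = 2.0, 0.5   # current weight and gradient of the data loss

# L2 regularization adds lam/2 * w^2 to the loss; its gradient is lam * w
w_l2 = w - lr * (grad_loss + lam * w)

# Equivalently, shrink ("decay") the weight first, then take the plain step
w_decay = w * (1 - lr * lam) - lr * grad_loss

print(w_l2, w_decay)  # same result for plain SGD
```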
In Practice, you want a bit of both Dropout and L2 (Weight Decay) and there is no rule that you should be selecting one over the other.
Regularization Rate — If it is Overfitting then use High Regularization and if it is Underfitting then use Low Regularization.
Learning Rate — Use lr_find from the Fastai library and inspect the plot: pick the value where the downward slope is steepest (per Jeremy Howard), or a value about 10 times smaller than the point where the loss starts to increase (per David Silva).
Discriminative Learning Rates is the process of applying different learning rates to different sections of the NN.
We apply low learning rate to starting layers because as we are using transfer learning these layers already have near optimal values.
We apply a bit higher learning rate to later layers because we need to train them for our new dataset which was not used while making the pre-trained model.
Deciding Learning rate using Fastai Library:
learn.fit_one_cycle(5, 1e-3) -> This applies 1e-3 to all the layers.
learn.fit_one_cycle(5, slice(1e-3)) -> This applies 1e-3 to the last layers and 1e-3/3 to the other layers.
learn.fit_one_cycle(5, slice(1e-5, 1e-3)) -> This applies 1e-3 to the last layer group, 1e-5 to the first, and spreads the learning rates in increasing order from the first group to the last, with increments that depend on how many layer groups lie in between.
In Fast.ai the CNN is divided into three layer groups: the newly added layers at the end form one group, and the remaining pre-trained layers are split in half into the other two groups.
Then we apply 1e-5 to the first layer group.
Then we apply 1e-4 to the middle layer group.
Then we apply 1e-3 to the last layer group.
(Here Fast.ai -> Fastai Library)
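The geometric spread that slice(lr_min, lr_max) produces can be sketched as follows. This is a simplified stand-in for fastai's internal logic, not its actual code:

```python
def discriminative_lrs(lr_min, lr_max, n_groups):
    """Spread learning rates geometrically from the first layer group
    (lowest lr, mostly pre-trained) to the last (highest lr, newly added)."""
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]

print(discriminative_lrs(1e-5, 1e-3, 3))  # ~ [1e-5, 1e-4, 1e-3], as in the example above
```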