- Use the **ADAM optimizer**. It works really well. Prefer it to more traditional optimizers such as vanilla gradient descent. TensorFlow note: if saving and restoring weights, remember to set up the `Saver` *after* setting up the `AdamOptimizer`, because ADAM has state (per-weight moment estimates that act as per-weight learning rates) that needs to be restored as well.
- **ReLU** is the best nonlinearity (activation function). Kind of like how Sublime is the best text editor, pun intended. But really, ReLUs are fast, simple, and, amazingly, they work, without diminishing gradients along the way. While sigmoid is a common textbook activation function, it does not propagate gradients well through DNNs.
- Do **NOT** use an activation function at your output layer. This should be obvious, but it is an easy mistake to make if you build each layer with a shared function: be sure to *turn off* the activation function at the output.
- DO **add a bias** in every layer. A bias essentially translates a plane into a best-fitting position. In `y=mx+b`, b is the bias, allowing the line to move up or down into the “best fit” position.
- Whiten (**normalize**) your input data. For training, subtract the mean of the data set, then divide by its standard deviation. The less your weights have to be stretched and pulled in every which direction, the faster and more easily your network will learn. Keeping the input data mean-centered with constant variance will help with this.
- **Don’t bother decaying the learning rate** (usually). Learning rate decay was more common with SGD, but ADAM takes care of this naturally. If you absolutely want to squeeze out every ounce of performance: decay the learning rate for a short time at the end of training; you’ll probably see a sudden, very small drop in error, then it will flatten out again.
- **If your convolution layer has 64 or 128 filters**, that’s probably plenty. Especially for a deep network. Like, really, 128 is A LOT. If you already have a high number of filters, adding more probably won’t improve things.
- Pooling essentially lets the network learn “the general idea” of “that part” of an image. Max pooling, for example, can help a convolutional network become robust against small translations, and to a lesser degree rotation and scaling, of features in the image.
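To make the ADAM state point concrete, here is a minimal NumPy sketch of the standard Adam update rule. It is not TensorFlow's implementation; the function name `adam_step` and the `state` dictionary are made up for illustration. The thing to notice is that `m`, `v`, and the step counter `t` live alongside the weights, so a checkpoint that restores only the weights loses them.

```python
import numpy as np

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `state` holds the per-weight moment estimates."""
    state["t"] += 1
    # Running averages of the gradient and of the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Each weight gets its own effective step size via sqrt(v_hat).
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.array([1.0, -2.0])
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}

# Toy problem (an assumption for the demo): minimize sum(w**2), gradient 2 * w.
for _ in range(2000):
    w = adam_step(w, 2 * w, state, lr=0.01)
```

If you checkpoint by hand, `state` has to be saved and restored together with `w`, which is what setting up the `Saver` after the optimizer does for you in TensorFlow.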
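The tips about the output layer and the bias can be sketched with a shared layer function. This is a plain NumPy illustration, not any particular framework's API; `dense` and its `activation` argument are hypothetical names. Every layer gets a bias, and the activation defaults to off so the output layer stays linear unless you ask otherwise.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense(x, weights, bias, activation=None):
    """Affine layer x @ weights + bias, with an optional activation."""
    z = x @ weights + bias
    return z if activation is None else activation(z)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                    # 4 samples, 3 features
w1, b1 = rng.normal(size=(3, 5)), np.zeros(5)  # hidden layer parameters
w2, b2 = rng.normal(size=(5, 1)), np.zeros(1)  # output layer parameters

hidden = dense(x, w1, b1, activation=relu)     # hidden layer: ReLU on
output = dense(hidden, w2, b2)                 # output layer: activation off
```

Defaulting `activation` to `None` means the easy mistake becomes forgetting the nonlinearity in a hidden layer, which tends to be noticed faster than an accidental ReLU on the output.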
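For the input normalization tip, a minimal NumPy sketch might look like the following. The helper names `fit_normalizer` and `normalize` are made up for this example, the data is synthetic, and two details are assumptions beyond the text above: the small `eps` guard, and reusing the training statistics on the test data.

```python
import numpy as np

def fit_normalizer(train_x, eps=1e-8):
    """Return per-feature mean and standard deviation of the training set."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    # eps avoids division by zero for constant features (an added safeguard).
    return mean, std + eps

def normalize(x, mean, std):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mean) / std

rng = np.random.default_rng(0)
train_x = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # synthetic data
test_x = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

mean, std = fit_normalizer(train_x)
train_n = normalize(train_x, mean, std)
test_n = normalize(test_x, mean, std)  # reuse the training statistics
```

After this, each training feature has mean close to 0 and standard deviation close to 1. The test features land near those values too, but not exactly, since they are scaled with the training statistics.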

Source: Deep Learning on Medium