Understanding Keras — Dense Layers

Let’s dive into all the nuts and bolts of a Keras Dense Layer!

Diving into Keras

One of the things that I find really helps me to understand an API or technology is diving into its documentation. Keras is no different! It has pretty well-written documentation, and I think we can all benefit from getting more acquainted with it. That’s what inspired this blog (and more to come), where we step through the various documented layers and other fun things that Keras has to offer and see if we can’t learn a new thing or two about this awesome API!

Today, we will dive into the most basic layer, the Dense layer (something I have no doubt you are all at least a bit familiar with!). This post is also available in video form, which you can check out here!

The Dense Layer

So, if you don’t know where the documentation for the Dense layer lives on Keras’ site, you can check it out here as part of the core layers section. There, it appears as the very first layer. Indeed, it is that important.

Right away, we can look at the default parameters of the layer, all of which we will explore today. Here it is, if you don’t want to click the link:

keras.layers.Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)

But before we get into the parameters, let’s just take a brief look at the basic description Keras gives us of this layer and unpack that a bit.

Just your regular densely-connected NN layer.

That seems simple enough! Furthermore, it tells us that a Dense layer implements the equation output = activation(dot(input, kernel) + bias). This means we take the dot product between our input tensor and the layer’s weight kernel matrix. Then, we add a bias vector (if we want one) and apply an element-wise activation to the resulting values (some sort of function, linear or, more often, non-linear!).
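To make that equation concrete, here is a minimal NumPy sketch of the computation. The names x, kernel, bias, and relu are purely illustrative, not Keras internals.

import numpy as np

x = np.array([[1.0, 2.0, 3.0]])          # input: a batch of 1 sample with 3 features
kernel = np.full((3, 4), 0.1)            # weight matrix of shape (input_dim, units)
bias = np.zeros(4)                       # bias vector of shape (units,)

def relu(z):
    return np.maximum(z, 0.0)            # an example element-wise activation

output = relu(np.dot(x, kernel) + bias)  # output = activation(dot(input, kernel) + bias)
print(output.shape)                      # (1, 4)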

Another interesting note is that if you give a Dense layer an input with a rank greater than 2, it will be flattened before taking the dot product. This is good to know, and something I wasn’t directly aware of before reading the documentation.

With all that introduction taken care of, let’s start diving into the parameters.

Units

Units is the most basic parameter to understand: a positive integer that denotes the output size of the layer, and the most important parameter we can set here. The units parameter actually dictates the shapes of the weight matrix and bias vector (the bias vector will have units entries, while the weight matrix’s shape is inferred from the size of the input data so that the dot product produces an output of size units).
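As a quick sketch of how units shapes things, here’s a hypothetical layer with 16 input features and units=8 (the sizes are arbitrary, just for illustration):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(8, input_shape=(16,))])
kernel, bias = model.layers[0].get_weights()
print(kernel.shape)  # (16, 8) -- inferred from the input size and units
print(bias.shape)    # (8,)    -- the same size as units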

Activation

This parameter sets the element-wise activation function used in the Dense layer. By default, it is set to None, which means a linear (identity) activation. This may work for your use-case! However, linearity is limited, and so Keras gives us a bunch of built-in activation functions to choose from for our layer.
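For example, picking an activation might look something like this (the layer sizes are arbitrary):

from keras.layers import Dense

layer_relu = Dense(32, activation='relu')  # a built-in, non-linear activation, chosen by name
layer_linear = Dense(32)                   # the default: activation=None, i.e. linear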

Use Bias

This parameter is very simple! It just controls whether or not we use a bias vector in the layer’s calculation. There may be cases in which we do not want one. By default, it is set to True, so Keras assumes we want a bias vector and will learn its values.
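If you did want to drop the bias, it’s as simple as this (the layer size is arbitrary):

from keras.layers import Dense

no_bias_layer = Dense(32, use_bias=False)  # only the kernel is learned, no bias vector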

Initializers

The initializer parameters tell Keras how to initialize the values of our layer. For the Dense layer, we need to initialize our weight matrix and our bias vector (if we are using it). Like with activations, there are a bunch of different initializers to explore!

Specifically, by default Keras uses the Zeros initializer for the bias and the Glorot Uniform initializer for the kernel weight matrix. As you might assume, the Zeros initializer will simply set our bias vector to all zeros.
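Spelled out explicitly, the defaults look like this (equivalent to leaving the arguments off entirely):

from keras.layers import Dense

layer = Dense(64,
              kernel_initializer='glorot_uniform',  # the default for the kernel
              bias_initializer='zeros')             # the default for the bias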

The Glorot Uniform is the interesting one in this case. It pulls values from a uniform distribution; however, its limits scale with the size of the Dense layer! It uses the following equation to calculate those limits:

limit = sqrt(6 / (fan_in + fan_out))
# Where the uniform distribution will fall uniformly between [-limit, limit]

Fan in is simply the number of units in the input tensor and fan out is the number of units in the output tensor. Why this range? Well, it comes from Glorot and Bengio’s paper, “Understanding the difficulty of training deep feedforward neural networks.” We won’t get into the paper in this post, but I encourage you to read it if you’re interested!
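As a rough sketch, the sampling might look like this for a hypothetical layer with 128 inputs feeding 64 units:

import numpy as np

fan_in, fan_out = 128, 64                 # e.g. 128 input features, 64 output units
limit = np.sqrt(6.0 / (fan_in + fan_out))
kernel = np.random.uniform(-limit, limit, size=(fan_in, fan_out))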

Regularizers

The next three parameters are regularization (or penalty) parameters. By default, these aren’t used, but they can be useful in helping with the generalization of your model in some situations, so it’s important to know that they exist!

You can check it out in more detail here if you want to get your hands on some regularizers (there are L1, L2 and L1_L2 penalties ready for you to use out of the box, as well as information on how to write your own regularizer). The point is, we can apply a regularizer to three components of our layer: the weight matrix, the bias vector, or the layer’s output (the activity regularizer, applied after the activation). These techniques have various effects, such as keeping things sparse or keeping weights close to zero. It’s another hyperparameter to explore, and perhaps one that helps your model get that last percentage of generalization before you deploy it for public use!
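As an illustration, attaching an L2 penalty to the kernel and an L1 penalty to the layer’s output might look like this (the penalty strengths are arbitrary):

from keras.layers import Dense
from keras import regularizers

layer = Dense(64,
              kernel_regularizer=regularizers.l2(0.01),      # pushes weights towards zero
              activity_regularizer=regularizers.l1(0.001))   # encourages sparse outputs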

Constraints

Finally, the last parameters we will discuss are the two constraint parameters. Simply put, these can constrain the values that our weight matrix or our bias vector can take on. By default, these aren’t used, but you can view some of the options available on the constraint page. One of them is fairly easy to understand: the NonNeg constraint, which forces the weight/bias values to be greater than or equal to 0.
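Using it is as simple as passing it in (the layer size is arbitrary):

from keras.layers import Dense
from keras.constraints import NonNeg

layer = Dense(32, kernel_constraint=NonNeg())  # kernel weights are kept >= 0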

These can be useful if you’re trying to do any sort of weight clipping. For example, the W-GAN uses weight clipping. If you were to re-implement this model yourself in Keras, you might wish to use a Constraint to enforce this idea!
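As a sketch of that idea, here is a hypothetical custom Constraint that clips weights element-wise, loosely in the spirit of the W-GAN’s clipping. The class name and clip value are made up for illustration, not the paper’s actual implementation.

from keras import backend as K
from keras.constraints import Constraint
from keras.layers import Dense

class ClipWeights(Constraint):
    # Hypothetical element-wise clipping constraint
    def __init__(self, clip_value=0.01):
        self.clip_value = clip_value

    def __call__(self, w):
        return K.clip(w, -self.clip_value, self.clip_value)

    def get_config(self):
        return {'clip_value': self.clip_value}

critic_layer = Dense(64, kernel_constraint=ClipWeights(0.01))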

Wrapping-Up

So there you have it, the Dense layer! I hope you found this post helpful and learned something about the Dense layer that you didn’t know before. Feel free to let me know if you’d like to see more of these articles (or videos) and I’d love to have a conversation with you about it here or on Twitter.

If you want to read more of what I’ve written, why not check out some of my other posts?


Originally hosted at hunterheidenreich.com.