Deep Neural Networks As Computational Graphs


Neural Networks as Computational Graphs

I like to think of the architecture of a deep neural network as a template for a function. When we define the architecture of a neural network, we're laying out the series of sub-functions and specifying how they should be composed. When we train the neural network, we're experimenting with the parameters of these sub-functions. Consider this function as an example:

f(x, y) = ax² + by² + cxy

The component sub-functions of this function are all of the operators: two squares, two additions, and four multiplications. The tunable parameters of this function are a, b, and c; in neural network parlance these are called weights. The inputs to the function are x and y. We can't tune those values in machine learning because they come from the dataset, which we would have (hopefully) gathered earlier in the process.

By changing the values of our weights (a, b, and c) we can dramatically impact the output of the function. On the other hand, regardless of the values of a, b, and c there will always be an x², a y² and an xy term — so our function has a limited range of possible configurations.
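To make this concrete, here is a minimal sketch of the function in Python (assuming the ax² + by² + cxy form above; the variable names are mine). The weights a, b, and c are the only values training would be allowed to change:

```python
def f(x, y, a, b, c):
    # The example function: a*x^2 + b*y^2 + c*x*y.
    # x and y come from the dataset; a, b, and c are the tunable weights.
    return a * x**2 + b * y**2 + c * x * y

# Different weights give very different outputs for the same input:
print(f(2.0, 3.0, a=1.0, b=1.0, c=1.0))   # 19.0
print(f(2.0, 3.0, a=0.5, b=-2.0, c=3.0))  # 2.0
```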

Here is a computational graph representing this function:

Take a minute to look at this graph. Is it clear how this graph represents our function f(x, y)?

This isn’t technically a neural network, but it’s very close in all the ways that count. It’s a graph that represents a function; we could use it to predict some kinds of trends; and we could train it using gradient descent and backpropagation if we had a dataset that mapped two inputs to an output. This particular computational graph will be good at modeling some quadratic trends involving exactly 2 variables, but bad at modeling anything else.
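One way to see the "graph" in this function is to write it out as its primitive operations, one line per node, with the intermediate variables playing the role of edges. A rough sketch:

```python
def f_as_graph(x, y, a, b, c):
    # Each line is one node in the computational graph; the variables
    # are the edges carrying values from one node to the next.
    x_sq = x * x          # square node
    y_sq = y * y          # square node
    ax2 = a * x_sq        # multiplication node
    by2 = b * y_sq        # multiplication node
    xy = x * y            # multiplication node
    cxy = c * xy          # multiplication node
    partial = ax2 + by2   # addition node
    return partial + cxy  # addition node
```

Counting the lines gives exactly the two squares, four multiplications, and two additions listed earlier.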

In this example, training the network would amount to changing the weights until we find some combination of a, b, and c that causes the function to work well as a predictor for our dataset. If you’re familiar with linear regression, this should feel similar to tuning the weights of the linear expression.
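As a sketch of what that training loop could look like for this tiny "network" (toy data, hand-derived gradients for a mean squared error loss; a real framework would get these gradients from backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)
targets = 2.0 * x**2 - 1.0 * y**2 + 0.5 * x * y   # toy data generated by known weights

a, b, c = 0.0, 0.0, 0.0
lr = 0.01
for step in range(1000):
    preds = a * x**2 + b * y**2 + c * x * y
    error = preds - targets                        # drives the mean squared error
    # Gradient of the loss with respect to each weight
    a -= lr * np.mean(2 * error * x**2)
    b -= lr * np.mean(2 * error * y**2)
    c -= lr * np.mean(2 * error * x * y)

print(round(a, 2), round(b, 2), round(c, 2))       # should approach 2.0, -1.0, 0.5
```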

This graph is still quite simple compared to even the simplest neural networks that are used in practice, but the main idea — that a, b, and c can be adjusted to improve the model’s performance — remains the same.

The reason this neural network would not be used in practice is that it isn’t very flexible. This function only has 3 parameters to tune: a, b, and c. Making matters worse, we’ve only given ourselves room for 2 features per input (x and y).

Fortunately, we can easily solve this problem by using more complex functions and allowing for more complex input. Huzzah!

Recall two facts about deep neural networks:

  1. DNNs are a special kind of graph, a “computational graph”.
  2. DNNs are made up of a series of “fully connected” layers of nodes.

“Fully connected” means that the output from each node in the first layer becomes one of the inputs for every node in the second layer. In a computational graph the edges are the output values of functions — so in a fully connected layer the output for each sub-function is used as one of the inputs for each of the sub-functions in the next layer. But, what are those functions?

The function performed by each node in the neural net is called a transfer function (which is also called the activation function). There are two steps in every transfer function. First, all of the input values are combined in some way, usually as a weighted sum. Second, a “nonlinear” function is applied to that sum; this second function might change from layer to layer within a single neural network.

Popular nonlinear functions for this second step are tanh, the sigmoid (also called the logistic function), and max(0, x) (called the Rectified Linear Unit, or ReLU). At the time of this writing, ReLU is the most popular choice of nonlinearity, but things change quickly.
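Here is a rough sketch of that two-step transfer function (no bias yet; that comes up below). The helper names are mine, not any particular library's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def transfer(inputs, weights, nonlinearity=sigmoid):
    # Step 1: combine the inputs, here as a weighted sum.
    z = np.dot(weights, inputs)
    # Step 2: apply a nonlinear function to that sum.
    return nonlinearity(z)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(transfer(x, w))            # sigmoid of the weighted sum
print(transfer(x, w, np.tanh))   # same sum, tanh instead
print(transfer(x, w, relu))      # same sum, ReLU instead
```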

If we zoom in on a neural network, we’d notice that each “node” in the network is actually two nodes in our computational graph:

Each node is actually multiple component functions.

In this case, the transfer function is a sum followed by a sigmoid. Typically, all the nodes in a layer use the same transfer function. Indeed, it is common for all the layers in the same network to use the same nonlinearity, though it is not a requirement by any means.

The last sources of complexity in our neural network are biases and weights. Every incoming edge has a unique weight; the output value from the previous node is multiplied by this weight before it is given to the transfer function. Each transfer function also has a single bias, which is added to the weighted sum before the nonlinearity is applied. Let’s zoom in one more time:

In this diagram we can see that each input is first multiplied by its weight, the weighted inputs are summed, and the bias is added to that sum; finally, the total is sent to our nonlinear function (sigmoid in this case). These weights and biases are the parameters that are ultimately fine-tuned during training.
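Putting the zoomed-in picture into code, a single node and a fully connected layer of such nodes might look roughly like this (again a sketch, not a particular library's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node(inputs, weights, bias):
    # Multiply each input by its weight, sum the products, add the bias,
    # then pass the total through the nonlinearity.
    return sigmoid(np.dot(weights, inputs) + bias)

def layer(inputs, weight_matrix, biases):
    # A fully connected layer: every node sees every input.
    return sigmoid(weight_matrix @ inputs + biases)

x = np.array([0.2, 0.7, -1.1])
W = np.random.default_rng(1).normal(size=(4, 3))   # 4 nodes, 3 inputs each
b = np.zeros(4)                                    # one bias per node
print(layer(x, W, b))                              # one output per node
```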

In the previous example, I said we didn’t have enough flexibility because we only had 3 parameters to fine-tune. So just how many parameters are there in a deep neural network for us to tune?

If we define a neural net for binary classification (in a relationship / not in a relationship) with 2 hidden layers of 512 nodes each and an input vector with 20 features, we will have 20*512 + 512*512 + 512*2 = 273,408 weights that we can fine-tune, plus 1,024 biases (one for each node in the hidden layers). This is a “simple” neural network. “Complex” neural networks frequently have several million tunable weights and biases.
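For reference, that count works out like this (a quick sketch using the layer sizes from the example):

```python
layer_sizes = [20, 512, 512, 2]   # input features, two hidden layers, two outputs

weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
biases = sum(layer_sizes[1:-1])   # one bias per hidden node, as counted above

print(weights)   # 20*512 + 512*512 + 512*2 = 273408
print(biases)    # 512 + 512 = 1024
```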

This extraordinary flexibility is what allows neural nets to find and model complex relationships. It’s also why they require lots of data to train. Using backpropagation and gradient descent we can purposely change the millions of weights until the output becomes more correct, but because we’re doing calculations involving millions of variables it takes a lot of time and a lot of data to find the right combination of weights and biases.

While they are sometimes called a “black box”, neural networks are really just a way of representing very complex mathematical functions. The neural nets we build are particularly useful functions because they have so many parameters that can be fine-tuned. The result of the fine-tuning is that rich complexities between different components of the input can be plucked out of the noise.

Ultimately, the “architecture” of our computational graph will have a big impact on how well our network can perform. Questions like how many nodes to use per layer, which activation function to use at each layer, and how many layers to use are active subjects of research, and the answers may change dramatically from neural network to neural network. The architecture will depend on the type of prediction being made and the kind of data being fed into the system; just as we shouldn’t use a linear function to model parabolic data, we shouldn’t expect any single neural net to solve every problem.