Improving upon Rosenblatt’s perceptron


Adaptive Linear Neurons (1960)

Like Rosenblatt’s perceptron, ADALINE (also known as the Adaptive Linear Element, trained with the so-called Widrow-Hoff rule) is a single-layer perceptron. It differs, however, in how it learns the weights w and bias b from the data.

ADALINE. To be compared with the diagram for Rosenblatt’s perceptron.

The main difference comes from the feedback error used to adapt the weights and bias of the two perceptrons. While Rosenblatt’s perceptron uses the classification error (i.e. a binary value), ADALINE introduces the concept of a so-called loss function (also sometimes called a cost function or objective function) relying on the output of the artificial neuron prior to quantization (i.e. on a continuous value). Although this might seem like a minor difference, we’ll see shortly that it actually makes a big difference when it comes to the final model!

The introduction of a loss function makes ADALINE much closer to modern machine learning methods than Rosenblatt’s perceptron is. Given our set of m examples (xₘ, yₘ), with yₘ ∈ {0, 1} denoting the class xₘ belongs to, the loss function for ADALINE is defined as

ℒ(w, b) = ½ Σₘ (yₘ − φ(zₘ))²,   with zₘ = wᵀxₘ + b,
where φ(z) is the activation function (i.e. the identity function here). Based on this definition, the weights w and bias b are obtained as the solution of the minimization problem

(w, b) = argmin ℒ(w, b),
i.e. they are chosen such that the sum of squared errors on our set of examples is as small as possible.
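To make this definition concrete, here is a minimal numpy sketch of how this sum-of-squared-errors loss could be evaluated (the function and variable names are illustrative, not taken from the original code):

```python
import numpy as np

def adaline_loss(w, b, X, y):
    """Sum-of-squared-errors loss of ADALINE, with identity activation phi(z) = z."""
    output = X @ w + b  # phi(z_m) = w^T x_m + b for every example, stacked in a vector
    return 0.5 * np.sum((y - output) ** 2)
```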

How to minimize the loss function?

Let us see how to find the weights and bias minimizing our loss function. To do so, we’ll rely on the fact that the loss function is quadratic and the associated minimization problem is convex. If you don’t know what a convex optimization problem is, it is, in a nutshell, an optimization problem we know how to solve efficiently. We’ll also require some basic high-school calculus.

A convex function ℒ(z) has only one minimum. It is located at the point z where the gradient of ℒ(z) is zero, i.e.

∇ℒ(z) = 0.
Our goal is to find the set of weights w and bias b satisfying this condition. These are the weights and bias resulting in the sum of squared errors being as small as it can possibly get (given our data).

The Delta rule: Remember that the gradient of a function points in the direction of maximum positive slope. A simple heuristic to find the point where the function is minimum is thus to move in the direction opposite to the gradient. Hence, let us first compute the gradient of our loss function. For a general activation function φ(z), it is given by

∂ℒ/∂w = −Σₘ (yₘ − φ(zₘ)) φ’(zₘ) xₘ

and

∂ℒ/∂b = −Σₘ (yₘ − φ(zₘ)) φ’(zₘ),
with φ’(z) the derivative of the activation function with respect to z. Note that ADALINE stands for the ADAptive LInear NEuron: its activation function is the identity, i.e. φ(z) = z and thus φ’(z) = 1. The gradient of the loss function thus simplifies to

∂ℒ/∂w = −Σₘ (yₘ − zₘ) xₘ

and

∂ℒ/∂b = −Σₘ (yₘ − zₘ).
Starting from a given set of weights w and bias b, the delta rule states that, to reduce the sum of squared errors, these weights and bias need to be updated as follows

w ← w + ∆w   and   b ← b + ∆b,
where the updates ∆w and ∆b are given by

∆w = −α ∂ℒ/∂w = α Σₘ (yₘ − φ(zₘ)) xₘ

and

∆b = −α ∂ℒ/∂b = α Σₘ (yₘ − φ(zₘ)).
Here, α is a scalar commonly called the learning rate or step size. In its simplest form, the delta rule assumes α to be constant. The weights w and bias b are updated repeatedly, until a prescribed number of iterations has been performed or until the norm of the update becomes smaller than a user-defined tolerance. That is all there is to it.
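To connect these formulas to code, here is a minimal numpy sketch of a single batch delta-rule update for the identity activation (function and variable names are illustrative):

```python
import numpy as np

def delta_rule_update(w, b, X, y, alpha=0.1):
    """One batch delta-rule update for ADALINE (identity activation)."""
    output = X @ w + b                 # z_m = w^T x_m + b for every example
    residuals = y - output             # y_m - phi(z_m), with phi(z) = z
    delta_w = alpha * X.T @ residuals  # alpha * sum_m (y_m - z_m) x_m
    delta_b = alpha * residuals.sum()  # alpha * sum_m (y_m - z_m)
    return w + delta_w, b + delta_b
```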

As you can see, both the mathematics and the philosophy behind the delta rule are simple. It is nonetheless a special case of the more general backpropagation algorithm used to train deeper neural networks, hence its importance in this series. Do not hesitate to re-derive all of the math as an exercise. In the meantime, let us now move to the fun stuff and implement ADALINE in Python. Assuming you are already familiar with Python, the following code should be quite self-explanatory.

For the sake of clarity and usability, we will try throughout this series to stick to the scikit-learn API. An extended version of this code is also available on my TowardsDataScience Github repo (here).
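The snippet below is a minimal sketch of such an implementation, using batch gradient descent with the delta rule and loosely following the scikit-learn fit/predict convention. Class, parameter, and attribute names are illustrative and not necessarily those used in the repository; labels are assumed to be 0/1.

```python
import numpy as np

class AdalineGD:
    """ADALINE trained with batch gradient descent (the delta rule)."""

    def __init__(self, learning_rate=0.1, n_epochs=100):
        self.learning_rate = learning_rate  # alpha in the delta rule
        self.n_epochs = n_epochs            # number of passes over the data

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        self.w_ = np.zeros(X.shape[1])      # weights w
        self.b_ = 0.0                       # bias b
        self.errors_ = []                   # misclassified points per epoch

        for _ in range(self.n_epochs):
            output = X @ self.w_ + self.b_  # phi(z) = z (identity activation)
            residuals = y - output          # y_m - phi(z_m)
            self.w_ += self.learning_rate * X.T @ residuals  # w <- w + delta_w
            self.b_ += self.learning_rate * residuals.sum()  # b <- b + delta_b
            self.errors_.append(int(np.sum(self.predict(X) != y)))
        return self

    def predict(self, X):
        # Quantize the continuous output to recover a 0/1 class label.
        return np.where(np.asarray(X, dtype=float) @ self.w_ + self.b_ >= 0.5, 1, 0)
```

With a dataset (X, y) at hand, model = AdalineGD(learning_rate=0.1, n_epochs=100).fit(X, y) trains the neuron, and model.errors_ records the kind of learning curve shown in figure 4.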

ADALINE vs. Rosenblatt’s perceptron

Figure 4: Number of misclassified points in the training dataset as the training proceeds.
Figure 5: Evolution of the decision boundaries of Rosenblatt’s perceptron (light gray) and ADALINE (black) over 100 epochs. In both cases, the learning rate is set to 0.1.

Figure 4 shows the number of misclassified points as the training of both Rosenblatt’s perceptron and ADALINE proceeds, while figure 5 depicts the evolution of the decision boundaries for both models. Rosenblatt’s perceptron is all over the place, with an accuracy close to 99% at a given epoch and below 50% at the next. This lack of robustness comes from the fact that a single mislabeled data point prevents the perceptron learning algorithm from ever converging.

Comparatively, the number of misclassified points for ADALINE decreases monotonically as the training proceeds. Looking at the evolution of ADALINE’s decision boundary, it can be seen that it is much smoother than that of Rosenblatt’s perceptron. This smoother evolution results from the definition of the loss function, which enables the training procedure to make small adjustments to the weights w and bias b at each step based on an evaluation of how wrong the current prediction is, rather than simply on whether it is wrong or not (as is the case for Rosenblatt’s perceptron). The information encoded by the loss function thus makes the training of ADALINE more robust and well-behaved.

Note finally that, even though the two classes are not linearly separable (due to the one mislabeled data point), the final linear decision boundary of ADALINE has nonetheless been chosen in a principled way, based on the formulation of a minimization problem, thus ensuring it is optimal in some sense. No such thing can be said about the decision boundary of Rosenblatt’s perceptron: it simply does not exist for the problem considered herein, the perceptron learning algorithm being unable to converge. Herein lies the major difference between these two single-layer perceptrons…

Is ADALINE ordinary least squares in disguise?

Before concluding, let us try to get a better understanding of what ADALINE is actually learning. Although it is used for classification purposes, it is closely related to ordinary least squares (OLS), one of the simplest regression models. This close connection is clearly visible when looking at the loss function

ℒ(w, b) = ½ Σₘ (yₘ − wᵀxₘ − b)².
Using a simple change of variables, namely augmenting each example with a constant feature equal to 1,

x̃ₘ = [1, xₘᵀ]ᵀ,

and stacking the bias and the weights into a single vector,

θ = [b, wᵀ]ᵀ,

this loss function can be re-written as

ℒ(θ) = ½ ‖y − Xθ‖²,

where each row of X is a given (augmented) training example x̃ₘᵀ and y is the vector collecting all of the labels yₘ. This loss function is nothing but the quadratic loss function minimized by ordinary least squares. Its closed-form optimal solution is

θ* = (XᵀX)⁻¹ Xᵀ y.
This equivalence between the two models thus allows us to understand the weights w and bias b of ADALINE as the coefficients appearing in the equation of the hyperplane that best separates the two classes in the least-squares sense.
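As a small illustration of this equivalence (the function name is illustrative, not from the original code), the optimal θ = [b, wᵀ]ᵀ can be computed directly in closed form with numpy and compared to the values ADALINE’s gradient descent approaches after enough epochs and for a small enough learning rate:

```python
import numpy as np

def ols_solution(X, y):
    """Closed-form least-squares solution theta = (b, w) of ADALINE's loss."""
    X_aug = np.column_stack([np.ones(len(X)), X])      # prepend a constant feature for the bias
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # minimizes ||y - X_aug @ theta||^2
    return theta[1:], theta[0]                         # weights w, bias b
```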

Although ADALINE improves upon Rosenblatt’s perceptron, we’ll see in coming posts that solving the problem in the least-squares sense is not the most appropriate thing one could do for a classification problem…