Cascade-Correlation, a Forgotten Learning Architecture



Part 1: Cascade Correlation

Without further ado, let’s implement a simplistic version of CasCor! Since we are working with Python, we need to import some basics and PyTorch:
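A minimal set of imports for the code sketches below might look like this (the original notebook may import more):

```python
# Basic imports used by the sketches in this article.
import numpy as np
import torch
import matplotlib.pyplot as plt
```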

Overview

The CasCor algorithm starts off with a one-layer network (only output neurons). It then follows these steps:

  1. Train (only!) the output neurons until reaching a plateau
  2. If the residual error is good enough, stop;
  3. Otherwise, freeze all weights of the existing network
  4. Then train a new hidden neuron, by optimizing the correlation between its output and the last measured residual error (see below)
  5. Go back to step 1 and repeat

Step 4 is the most important and most complicated part of CasCor, and it is also illustrated in the figure below.
Until its training has finished, the new hidden neuron is disconnected from the rest of the network. We feed all the original inputs and the values of the earlier hidden units as a weighted sum into its activation function. The training algorithm then optimizes the neuron’s weights to achieve the best possible correlation between its output value and the residual error over the sample set, as measured in an earlier iteration.
To train for optimum correlation, we need to use the correlation measure instead of a standard loss function, but more on that later.

After training has finished, we can add the hidden neuron to the network. The output neurons will then receive its value as an additional input, and we need to train their new weights, so we jump back to step 1.

The CasCor paper [1] illustrates the resulting network (with shortcuts) and the process of adding a hidden neuron, in a similar fashion to this:

When adding a new hidden neuron, only the new weights (red squares) are trained. Here, the arrows are weighted sums fed into the circles depicting the activation functions. (Graphic translated and adapted from Wikipedia, 2020)

In this article, we focus on the two major parts of CasCor: Steps 1. and 4., i.e. training the outputs and training a hidden neuron. The rest we’ll simply do manually.

Test Set

For simplicity, we use a 2D classification task for testing (it’s easier to debug). Our network will thus have 2 input dimensions — the coordinates of the grid — and 1 output dimension — a value between 0 and 1.
We’ll train on the following data sets (without test sets), where 0 values are black and 1 values are white:
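(The original data sets are shown as images; as a stand-in for the sketches in this article, assume two toy grids along these lines: a linearly separable split as the ‘simple’ set and a circular region as the harder second set. The shapes and helper names are my own.)

```python
# Illustrative stand-ins for the two data sets: a 30x30 grid of 2D
# coordinates with a 0/1 target per point.
def make_grid(n=30):
    return np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))

xs, ys = make_grid()

# 'Simple' set: a linearly separable split of the plane.
targets_simple = (xs + ys > 0).astype(float)

# Second set: a circular region, which a single neuron cannot fit well.
targets_circle = (xs ** 2 + ys ** 2 < 0.5).astype(float)
```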

To actually feed the input and output into our training function, we also need to convert these to a ‘long’ form and add a static bias value of 1.
Plus, testing showed that CasCor and quickprop perform better when the inputs are normalized, so let’s do that as well.
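A sketch of what that conversion could look like (the helper name to_long_form and the details are assumptions of mine):

```python
# Hypothetical helper: flatten a grid into an (N, 3) input matrix of
# [x, y, bias] rows and an (N, 1) target vector.
def to_long_form(xs, ys, targets):
    inputs = np.stack([xs.ravel(), ys.ravel()], axis=1)
    # Normalize the coordinates to zero mean and unit variance.
    inputs = (inputs - inputs.mean(axis=0)) / inputs.std(axis=0)
    # Append a static bias input of 1 to every sample.
    inputs = np.hstack([inputs, np.ones((inputs.shape[0], 1))])
    targets = targets.reshape(-1, 1)
    return (torch.tensor(inputs, dtype=torch.float32),
            torch.tensor(targets, dtype=torch.float32))
```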

Quickprop

To train the units in the CasCor network, the original authors use a technique called quickprop, also invented by Fahlman, in 1988 [2].
Quickprop is an alternative to back-propagation that uses a variation of Newton’s method. Besides the original paper, this useful blog post by Giuseppe Bonaccorso describes it quite well.

Note that quickprop is not strictly necessary to implement CasCor. However, to stick close to the original paper and for maximized learning, we’ll use it here as well. It is actually an interesting topic all on its own, and I encourage you to investigate it!
If you couldn’t care less about quickprop, skip ahead to the next section and treat any further mention of it simply as ‘training neuron weights based on given input & expected output pairs’.

Our implementation is based on the blog post — but since we don’t want to focus on quickprop in this article, we’ll just peek at some adjustments to their code instead of diving into the maths.

Flexibility. The code from the post uses a fixed activation and loss function and implements their gradients statically. For CasCor, we need to be a bit more flexible (at least when it comes to the loss), so we pass these functions in as parameters.

Automatic Gradient Computation. Since the activation and loss function are now variable, we’d run into trouble trying to derive their gradients by hand. But, using PyTorch, we can skip that entirely and let autograd do the heavy lifting.

Convergence. Giuseppe’s code tests the change in weights per iteration to determine convergence. Some quick tests found that to be troublesome, since it often seems to get stuck on saddle points and in local minima. Instead, we’ll use the residual error.
Specifically, we’ll calculate a running mean of the residual error and check whether the change in that mean per iteration is smaller than a given tolerance.
Last but not least, if the error diverges or converges too slowly, quickprop simply gives up after a certain number of iterations (it runs out of patience, see the function parameter).
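Putting those adjustments together, a condensed version of the routine could look like the sketch below. This is not the original code: the hyper-parameter names and defaults (learning_rate, tolerance, patience, mu) are illustrative, and the maximum growth factor mu is taken from Fahlman’s paper [2].

```python
def quickprop(x, y, weights, activation, loss_fn,
              learning_rate=1e-4, tolerance=1e-6, patience=20000, mu=1.75):
    # The activation and loss come in as functions; autograd supplies
    # their gradients, so no derivatives are hard-coded here.
    w = weights.clone().detach().requires_grad_(True)

    # First step: plain gradient descent to get an initial step/gradient pair.
    loss = loss_fn(activation(x @ w), y)
    grad = torch.autograd.grad(loss, w)[0]
    dw = -learning_rate * grad
    grad_prev = grad
    mean_err = loss.item()

    for _ in range(patience):
        w = (w + dw).detach().requires_grad_(True)
        loss = loss_fn(activation(x @ w), y)
        grad = torch.autograd.grad(loss, w)[0]

        # Quickprop step: fit a parabola per weight through the last two
        # gradients and jump towards its minimum.
        new_dw = dw * grad / (grad_prev - grad + 1e-10)
        # Bound the step with the maximum growth factor mu from [2].
        new_dw = torch.max(torch.min(new_dw, mu * dw.abs()), -mu * dw.abs())
        dw, grad_prev = new_dw, grad

        # Convergence: track a running mean of the error and stop once it
        # barely changes; otherwise give up after 'patience' iterations.
        new_mean_err = 0.9 * mean_err + 0.1 * loss.item()
        if abs(new_mean_err - mean_err) < tolerance:
            break
        mean_err = new_mean_err

    return w.detach(), loss.item()
```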

Output Training

With quickprop implemented, let’s get to the fun part!
CasCor starts with a one-layer network, i.e. we will use a single output neuron and connect it directly to our inputs (and the bias). To train this neuron, we take sets of input (x) and output (y) samples, create newly initialized (random) weights, and feed all of that into quickprop.
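As a sketch (the helper name train_outputs, the sigmoid activation and the mean squared error loss are my assumptions), this boils down to:

```python
def train_outputs(x, y, n_inputs):
    # Freshly initialized (random) output weights, one per input plus bias.
    weights = torch.randn(n_inputs, 1) * 0.1
    activation = torch.sigmoid
    loss_fn = lambda out, target: torch.mean((out - target) ** 2)
    return quickprop(x, y, weights, activation, loss_fn)
```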

Note that this approach does not care whether the network is single-layer or deeper — since we are only training the output weights, we could also have run the inputs through a number of hidden layers and used that as the x for the training.

Let’s test this with the training sets.
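A sketch of such a test run, using the stand-in data sets and helpers from above (the plotting details are mine):

```python
for targets in (targets_simple, targets_circle):
    x, y = to_long_form(xs, ys, targets)
    out_weights, _ = train_outputs(x, y, x.shape[1])
    pred = torch.sigmoid(x @ out_weights)          # predictions on the grid
    plt.imshow(pred.numpy().reshape(xs.shape), cmap='gray')
    plt.show()
```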

Looks good for the simple one!

Unsurprisingly, this doesn’t match up so well, since we only trained a single unit. But it is a good enough approximation for now.

Hidden Neuron Training

As we have seen, our simple one-neuron model approximates the second shape with quite a bit of error. To achieve a better fit, we’ll need to add hidden layer(s). When we add a hidden neuron, we:

  1. Freeze all other parameters (including output)
  2. Run the training samples forward through the net and use the input values and the other hidden units’ values as the input of the new unit
  3. Train the new neuron such that its value best correlates with the residual errors calculated in an earlier run

The correlation function S (also described in more detail in [1]) is given by:

S = \sum_{o} \left| \sum_{p} (V_p - \bar{V}) (E_{p,o} - \bar{E}_o) \right|

where V is the value of the neuron (the output of its activation function phi), E is the residual error (the earlier prediction minus the actual target), the bar-terms are means over the training samples, and o and p index the output units and the samples respectively.

With this as our ‘loss’ function, we can simply run quickprop again.
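A sketch of this ‘loss’ (the helper name and the sign handling are my own; since quickprop minimizes, we return the negative of S):

```python
def make_correlation_loss(residual_error):
    # 'residual_error' is the (prediction - target) matrix from the
    # previous output training run, one column per output unit.
    err_centered = residual_error - residual_error.mean(dim=0)

    def correlation_loss(value, _target_unused):
        value_centered = value - value.mean()
        # S = sum over outputs o of |sum over samples p of (V_p - V̄)(E_po - Ē_o)|
        s = torch.sum(torch.abs(torch.sum(value_centered * err_centered, dim=0)))
        return -s   # negate, because quickprop minimizes

    return correlation_loss
```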

Let’s run this with the one-layer net predictions (pred) of the last sample set!
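A sketch of that run (using the stand-ins from above; x, y and pred refer to the circular data set here):

```python
residual_error = pred - y                        # error of the one-layer net
cand_weights = torch.randn(x.shape[1], 1) * 0.1  # fresh candidate weights
hidden_weights, _ = quickprop(x, y, cand_weights, torch.sigmoid,
                              make_correlation_loss(residual_error))
neuron_value = torch.sigmoid(x @ hidden_weights)

# First plot: the residual error; second plot: the new neuron's output.
plt.imshow(residual_error.numpy().reshape(xs.shape), cmap='gray')
plt.show()
plt.imshow(neuron_value.numpy().reshape(xs.shape), cmap='gray')
plt.show()
```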

This shows us pretty nicely where the error currently is biggest (very dark or very bright spots in the first plot) and what the neuron tries to do to correlate with that (second plot).

Combining Hidden & Output Neurons

As the last step (of this article) we now train our output layer again, additionally based on the values (neuron_value) computed by our newly trained neuron. To do so, we need to include these values as input to the output neurons (x2).
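A sketch of that step, reusing the helpers from above:

```python
# Append the frozen hidden neuron's value as an extra input column (x2)
# and train the output neuron again on this augmented input.
x2 = torch.cat([x, neuron_value], dim=1)
out_weights2, _ = train_outputs(x2, y, x2.shape[1])
pred2 = torch.sigmoid(x2 @ out_weights2)

plt.imshow(pred2.numpy().reshape(xs.shape), cmap='gray')
plt.show()
```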

Pretty!

The output neuron can now approximate the shape because it can base its regression on the additional value of the hidden neuron. And since that correlates with the error we got from the earlier optimum, adding it has the effect of reducing the problem the output neuron has to solve by one dimension.

Caveats

All the pictures are the result of executing the code snippets; what you cannot see, however, is how often the snippets were run until they generated those images. In particular, the training only converges to weights that produce the displayed values with fairly low probability, i.e. I had to run the algorithm a couple of times until I got that particular output.
This can presumably be attributed in part to the quickprop implementation. From Fahlman’s paper [2] and another, more recent comparison of quickprop with back-propagation [3] it becomes clear that a) it will need some more adjustments to yield better results, and b) it doesn’t perform equally well on all domains. Specifically, it fails to compete with back-prop on bias and variance in some real-world image classification problems.

That said, a comparable network structure (1 hidden, 1 output neuron) trained with standard back-propagation and the Adam optimizer never even converged to the above result during my test runs (it needed at least one more hidden neuron to get there). But this could be bad luck, since it wasn’t a controlled test setup.

Another problem of CasCor in general is the thinness of the generated network relative to its depth. Since GPUs can handle matrices very efficiently, it is faster to run through broader networks that are less deep (i.e. that can better be represented as matrices) than through thin and deep networks such as the ones CasCor generates by default.
This is of course not much of a problem with our little example, but it might become one when the generated network has to solve a more complicated task.

Which brings us to the next section.

Future Improvements

As we have seen, this basic implementation of CasCor in fact works! 🙂
However, we are still missing a lot of boilerplate code that automates the process, and some optimization of the training method, to find a global optimum with higher probability.
That is why, in the next parts of this series, we’ll see how we can:

  • Automate the output -> hidden -> output -> hidden -> … training cycle
  • Change quickprop to deliver more stable results and to train a set of new nodes instead of only one (and pick the best)
  • Further improve that process to pick more than one node (i.e. have ‘broader’ layers)
  • Change the problem domain to some more interesting/challenging ones (e.g. domains that can only be solved with deeper and/or recurrent networks)
  • Benchmark it against other machine learning methods in a controlled setting