Power Neural Network


This is a note about an idea from half a year ago, in which I described what I call the Power Neural Network to my coauthor.

In the human brain, not every neural layer is fully connected (the brain does not use fully connected (FC) layers). This is a form of constraint on the system. The non-FC property is the basis for the power net idea, which I previously described as “cone layers”. FC may be too flexible and may contain a lot of wasteful connections. Having to mute those out over the course of training may be inefficient compared to having a net that already mutes them out, and the muting out of useless connections is not even guaranteed to happen.

Imagine that a layer may have sections reserved for specific neural tasks, to enable modularization of cognitive functions. For example, riding a bike takes multiple modules, and also their combinations, e.g. moving the arms together with the body to balance the bike.

Photo by Viktor Kern on Unsplash

So we have the concepts of modules and levels. In the example above, the arm and body modules connect to a higher-level module called “balancing a bike”. The two lower modules can certainly connect to other modules like “balancing a surfboard” or even “balancing in general”, but they do not connect to modules like “listening to music” or “reading”.

The cone NN architecture I described to you earlier (like the example above) can be generalized into connection via the power set. Imagine a lower-level network with 3 elements (modules) {a, b, c}; the possible combinations number 2³ = 8: {∅, (a), (b), (c), (a, b), (a, c), (b, c), (a, b, c)}.
These should be the components that exist in the next level, obtained by assigning groups of neurons to accept incoming connections only from their respective power set elements.
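
As a rough illustration of this construction (the module names, sizes, and the powerset_masks helper below are my own hypothetical choices, not part of the original note), one way to realize it is to enumerate the power set of the lower-level modules and turn each subset into a binary connectivity mask for the corresponding group of neurons in the next level:

```python
from itertools import chain, combinations

import numpy as np

def powerset(items):
    """All subsets of `items`, from the empty set to the full set."""
    return list(chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))

def powerset_masks(module_sizes):
    """Build one binary connectivity mask per power-set element.

    `module_sizes` maps a lower-level module name to its neuron count.
    Each mask has one entry per lower-level neuron and is 1 only for
    neurons belonging to modules in that subset.
    """
    names = list(module_sizes)
    total = sum(module_sizes.values())
    # Starting index of each module inside the concatenated lower layer.
    offsets, pos = {}, 0
    for name in names:
        offsets[name] = pos
        pos += module_sizes[name]

    masks = {}
    for subset in powerset(names):
        mask = np.zeros(total, dtype=np.float32)
        for name in subset:
            mask[offsets[name]:offsets[name] + module_sizes[name]] = 1.0
        masks[subset] = mask
    return masks

# Example: three lower-level modules a, b, c -> 2^3 = 8 subsets / masks.
masks = powerset_masks({"a": 4, "b": 4, "c": 4})
print(len(masks))         # 8
print(masks[("a", "c")])  # 1s for a's and c's neurons, 0s for b's
```

Each group of neurons at the next level would then only read from the lower-level neurons where its mask is 1, which is exactly the non-FC constraint described above.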

We may call this PowerNet, since it’s derived from the power set. The implementation of this network may not be trivial, or there may exist some smart matrix multiplication trick to achieve the same goal. A few things to think about:
– the growing size of power sets — 2³ = 8, but the third layer would then have 2⁸ = 256 components, and it grows even faster after that. Networks that need to stay small cannot afford that size; see the next item for a remedy.
– relaxing the condition — do some hybrid of partial FC and PowerNet; perhaps the separation does not have to be strict, but can be probabilistic, like Dropout. We could also have some kind of PowerDropout by implementing a probabilistic PowerNet that applies dropout to the connections that should be dropped (see the sketch after this list).
– complexity: compute cost will of course be lower than FC, since we are calculating fewer terms. With a proper implementation, the cost should be proportional only to the connections that are actually kept.
– theoretical basis: is it Turing complete, or almost Turing complete? Will it converge fast enough, or even faster, for the functions it is approximating? I have no idea.
– empirical basis: if we can build it, we can of course test it. My hope is that it can be used for the HydraNet for multitasking.
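
To make the masked matrix multiplication and the probabilistic relaxation concrete, here is a minimal PyTorch sketch. It is only one possible reading of the idea: the MaskedLinear module, the drop_prob parameter, and the choice to keep disallowed connections with a small probability during training are my own assumptions, not something specified above.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """A linear layer whose weights are multiplied by a fixed binary mask.

    `mask` has shape (out_features, in_features); a 0 entry removes that
    connection. With `drop_prob` > 0, disallowed connections are not removed
    outright but are kept stochastically during training; a loose
    "PowerDropout"-style relaxation of the strict separation.
    """

    def __init__(self, in_features, out_features, mask, drop_prob=0.0):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.register_buffer("mask", mask.float())
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training and self.drop_prob > 0.0:
            # Always keep allowed connections; keep a disallowed connection
            # only with probability `drop_prob` (soft, probabilistic masking).
            soft = torch.bernoulli(torch.full_like(self.mask, self.drop_prob))
            effective_mask = torch.clamp(self.mask + soft, max=1.0)
        else:
            effective_mask = self.mask
        return nn.functional.linear(
            x, self.linear.weight * effective_mask, self.linear.bias)

# Example: 12 lower-level neurons (modules a, b, c with 4 neurons each)
# feeding a next-level group that is only allowed to see modules a and c.
mask = torch.zeros(6, 12)
mask[:, 0:4] = 1.0    # module a
mask[:, 8:12] = 1.0   # module c
layer = MaskedLinear(12, 6, mask, drop_prob=0.1)
out = layer(torch.randn(2, 12))
print(out.shape)  # torch.Size([2, 6])
```

Note that this sketch still pays the full dense matmul cost; realizing the compute savings mentioned in the complexity item would need a sparse or block-structured implementation.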
