The Transference Problem of Data-specific Features

This post highlights a problem with transfer learning: take a network trained on dataset A, fine-tune it on dataset B, and then re-test its performance on dataset A, and you will find that performance has decreased. In the literature this is known as catastrophic forgetting.

For example, take a network pre-trained on ImageNet and fine-tune its last layer of features on a new dataset: the weights that were optimal for the ImageNet task will have been corrupted. While this isn’t catastrophic, it is notable that the brain does not appear to have this problem. For humans, learning a brand new task doesn’t skew the whole mind toward that task; we are able to perform well on many specific tasks at once. For example, if someone shows you an item once, like the name of an interesting mineral, a week later you can probably still recall it, even though you have learned and used your mind for countless other things during the elapsed week.
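To make the setup concrete, here is a minimal PyTorch sketch of that scenario, assuming torchvision’s pretrained ResNet-18 as the ImageNet model; the dataset B loader is a stand-in, not part of the original post:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights (the "dataset A" task).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Swap the 1000-way ImageNet head for a hypothetical 10-class dataset B.
model.fc = nn.Linear(model.fc.in_features, 10)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Stand-in for a real DataLoader over dataset B (an assumption).
dataset_b_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,)))]

model.train()
for images, labels in dataset_b_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
# Every step above also moves the shared feature weights away from their
# ImageNet optimum, so re-attaching the original head and re-testing on
# ImageNet would show degraded accuracy.
```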

One possible solution is a set of constant internal features in terms of which the network encodes everything it is trained on: a set of innate features that does not change for every new dataset the network sees. New knowledge would then be encoded as a combination of an innate component and a data-specific component. Crucially, this allows new and old stimuli to be compared, and the model to operate robustly, because all stored information is expressed in reference to the same innate component.

Essentially, this means a network model should share one set of features across all the tasks it is trained on: an a priori, high-dimensional vector space within which all new data is defined. Keeping that space independent of data-specific features may also open the possibility of interpolating between sparse data points, extrapolation, generalization, or creativity.
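A crude sketch of this arrangement, assuming a frozen (here, randomly initialized) projection as the innate feature space and small trainable heads as the data-specific components; the class and names are illustrative, not an established method:

```python
import torch
import torch.nn as nn

class InnateFeatureModel(nn.Module):
    """A fixed, shared 'innate' feature space plus small task-specific
    heads. The frozen random projection is a stand-in; it could equally
    be a pretrained encoder."""
    def __init__(self, input_dim=784, feature_dim=512):
        super().__init__()
        self.innate = nn.Linear(input_dim, feature_dim)
        for p in self.innate.parameters():
            p.requires_grad = False  # the innate component is never updated
        self.heads = nn.ModuleDict()  # one data-specific component per task

    def add_task(self, name, num_classes):
        self.heads[name] = nn.Linear(self.innate.out_features, num_classes)

    def forward(self, x, task):
        z = torch.relu(self.innate(x))  # all data expressed in the innate space
        return self.heads[task](z)

model = InnateFeatureModel()
model.add_task("dataset_a", num_classes=10)
model.add_task("dataset_b", num_classes=5)
# Training one head never touches the innate projection or any other head,
# so dataset A's encoding survives training on dataset B by construction.
```

Because every task is expressed in the same innate space, the open question the post raises is where such a space should come from in the first place.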

Having said this, the transference problem could also be solved by exposing the network to an extremely large dataset early in training, such that the learned features essentially represent universally applicable features. This may be the developmental process for human toddlers, who have relatively long periods of observation and dependency on other agents before they begin to exert their own judgement. Additionally, the human mind’s resistance to being skewed by specific tasks may be due to the human brain having orders of magnitude more neural connections and parameters than current artificial networks.

Or, as with many seemingly opposing conceptual disagreements, the solution may be a duality of innate and learned features, just as nature and nurture are today considered inseparable. This would be a transition from features learned from data to those features becoming innate, with all subsequent knowledge ultimately learned in terms of those originally data-specific features. An infant at one week, for example, may have no innate cognitive features, and is completely open to floating or teleporting objects; with continued experience, they ingrain a set of internal features according to which they begin to expect and predict occurrences in the world. They thus transition from observing and perceiving, where nothing is a surprise, to the more mature judging, planning, and acting, where one can be surprised, become angry, and make forward plans.

One prediction of this theory would be an inability of young children to brainstorm or create long-term plans. The association between the immature orbitofrontal cortex and a failure to consider the future consequences of situations and personal actions may be some supporting evidence. Additionally, there should be fewer cognitive effects of fear, anxiety, or anticipation in young children. This does appear to be the case: serious anxiety disorders tend to begin during the teenage years.

This process essentially equates to neuroplasticity decreasing from childhood to adulthood. Where there was originally a blank slate on which to learn perceptual features, plasticity then decreases so that those features form the basis for interpreting everything encountered afterwards.
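One hedged network analogue of this (my sketch, not something the post specifies) is a per-layer plasticity schedule: the learning rate of the early, “perceptual” layers decays toward zero while later layers remain trainable:

```python
import torch

def plasticity_schedule(epoch, base_lr=1e-3, half_life=10):
    """Learning rate of the early layers, halving every `half_life` epochs."""
    return base_lr * 0.5 ** (epoch / half_life)

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(),  # early "perceptual" layers
    torch.nn.Linear(256, 10),                    # later, task-specific layer
)
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": plasticity_schedule(0)},
    {"params": model[2].parameters(), "lr": 1e-3},
])

for epoch in range(30):
    # Early layers gradually "set"; later layers stay plastic.
    optimizer.param_groups[0]["lr"] = plasticity_schedule(epoch)
    # ... one pass of training on the current data would go here ...
```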

One class of phenomena supporting the presence of innate responses is the consistent reaction of humans to stimuli such as snakes and spiders. That these responses are predictable regardless of the learning environment indicates that there must be some mechanism by which a pattern of perceptual features (e.g. seeing a snake) always maps onto an innate physiological response. Similarly, stimuli such as facial beauty and sexual stimuli evoke predictable responses, again indicative of something innate.

However, as mentioned in the prior post ‘One Domain of Input Data’, if a learning algorithm that takes regularities in the input data as its features can be successfully implemented, it would function as a universal feature detector: for any image or audio data, repeated sequences become the features in terms of which future data is understood. This strategy is also much more consistent with the nature of a general learning algorithm, where the rules of learning are specified but not the substance, in contrast to current network approaches, which store the substance of dataset-specific features in their layers and remain highly constrained to that specific dataset.
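As a toy illustration of “regularities become features” (a deliberately simple sketch; the post does not specify an algorithm), frequently repeated n-grams in a raw symbol stream can be extracted and then used to parse new input:

```python
from collections import Counter

def extract_features(stream, n=3, min_count=3):
    """Treat subsequences repeated at least `min_count` times as features."""
    grams = (tuple(stream[i:i + n]) for i in range(len(stream) - n + 1))
    return {g for g, c in Counter(grams).items() if c >= min_count}

def encode(stream, features, n=3):
    """Describe new data in terms of previously learned features."""
    grams = [tuple(stream[i:i + n]) for i in range(len(stream) - n + 1)]
    return [g for g in grams if g in features]

# "Training": regularities in the raw input stream become the feature set.
past_input = "badupatygolabadupatybadu"
features = extract_features(past_input)

# New input is then parsed in terms of those learned features.
print(encode("xxbaduyy", features))  # recognises the familiar 'bad'/'adu' chunks
```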

Work by the psychologist Jeffrey Elman indicates that infants listening to clips of essentially random audio are particularly sensitive to repeated sequences within it. Thus, in the same way that words are, in some sense, features with which we parse the world, those repeated sequences become the features with which to parse future stimuli. This is insight into how a mind can construct its own features and, over time, build complex mental models from raw data in an unsupervised way.