The Illusion of The UnGameable Objective



Harvard cognitive scientist Joscha Bach, in a tongue-in-cheek tweet, proposed what he calls “The Lebowski Theorem”:

No super intelligent AI is going to bother with a task that is harder than hacking its reward function.

People are prone to a cognitive bias known as the “illusion of control”: we overestimate our ability to control events, and this manifests as a need to control complex situations in which control is improbable. A useful framework for distinguishing between kinds of systems is the Cynefin framework, which divides systems into simple, complicated, complex, and chaotic. Control is possible for simple and complicated systems but becomes intractable for complex and chaotic ones.

Deep learning systems are complex systems (or even exhibit “transient chaos”). The complex behavior shows up in the learning phase, which is precisely the phase the objective function (a.k.a. fitness function) is meant to control. The problem with complex systems is that they can learn unintended ways of conforming to the objective function. Analogous to how humans discover loopholes in existing government regulation, complex learning systems can find solutions that merely mimic intelligence.
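To make this concrete, here is a minimal sketch (with an invented dataset, not tied to any published result) of a model conforming to its objective in an unintended way: a logistic-regression classifier latches onto a spurious shortcut feature, aces its training objective, and falls apart once the shortcut disappears.

```python
# Hypothetical illustration: a classifier "satisfies" its objective by
# exploiting a spurious shortcut feature instead of the intended signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Intended signal: only weakly predictive of the label.
signal = rng.normal(size=n)
labels = (signal + rng.normal(scale=2.0, size=n) > 0).astype(int)

# Spurious shortcut: perfectly correlated with the label in the training
# data only (think of an annotation artifact or dataset bias).
X_train = np.column_stack([signal, labels.astype(float)])
clf = LogisticRegression().fit(X_train, labels)
print("train accuracy:", clf.score(X_train, labels))   # close to 1.0

# At test time the shortcut carries no information, and performance collapses.
X_test = np.column_stack([signal, rng.integers(0, 2, size=n).astype(float)])
print("test accuracy: ", clf.score(X_test, labels))    # far lower
```

The objective function was optimized exactly as specified; it was simply satisfiable by something other than what we intended.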

Recent investigations of deep learning applied to natural language processing (NLP) reveal how these systems can achieve outstanding benchmark performance without developing any real understanding of the underlying natural language text. Two recent articles explore these issues in NLP systems.
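As a rough illustration of the kind of surface heuristic involved (not taken from either article), consider an “NLI model” that simply predicts entailment whenever every word of the hypothesis appears in the premise. On datasets where word overlap happens to correlate with the label, such a heuristic scores well while understanding nothing:

```python
# Toy sketch of a surface heuristic an NLP model can exploit: predict
# "entailment" whenever every hypothesis word appears in the premise.
def overlap_heuristic(premise: str, hypothesis: str) -> str:
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# On examples with incidental overlap the heuristic looks competent...
print(overlap_heuristic("the lawyer met the judge", "the lawyer met the judge"))
# ...but it fails as soon as word order actually matters: both calls print
# "entailment", even though only the first is correct.
print(overlap_heuristic("the judge met the lawyer", "the lawyer met the judge"))
```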

We are all too often blinded by the surprising performance of deep learning systems and fail to realize that our objective function may in fact have been hacked. The exploration capabilities inherent in these systems can lead to solutions the researchers never expected. The complex system may simply have discovered an unknown heuristic that the researcher is unaware of. Many of the incremental advances in state-of-the-art architectures may be a consequence of cherry picking: the machine does the cherry picking, and the human researcher announces the results.

Hacking of the objective function may be rare for controlled, complicated systems but pervasive in complex learning systems, and the behavior itself is not well researched. The existence of adversarial examples that can easily fool a network is one manifestation of this hacking: the system satisfies the objective function and the validation tests, yet fails to function correctly in adversarial scenarios. It is like a magic trick in which, through diversion, we are primed to believe in the impossible, yet when we look behind the curtain we discover the ordinary. Complex learning systems are playing a magic trick on researchers, and the researchers are all falling for it!
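For the curious, the sketch below shows one standard recipe for constructing such adversarial inputs, the fast gradient sign method (FGSM); `model`, `image`, and `label` are placeholders for whatever network and data you have at hand.

```python
# Minimal FGSM sketch: nudge each pixel in the direction that increases the
# loss, producing an input that looks unchanged but can flip the prediction.
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Perturb `image` slightly so that `model` is more likely to misclassify it."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```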

Complex learning systems will choose the path of least action (or resistance). As Joscha Bach alludes to, these systems will arrive at a solution that will most likely hack the original objective function.

Disentangling truth from fiction is not a task humans are well equipped for — or one that we even deem generally necessary. For much of our life, our defenses are down. It doesn’t require a sophisticated conversational magic trick to deceive you, it just requires a minor nudge.

Jeff Clune co-authored a survey (“The Surprising Creativity of Digital Evolution”) of the many ways evolutionary learning systems have discovered unexpected solutions to problems. In one experiment, the learning system learned to exploit numerical imprecision in the physics engine to gain free energy. Another system, designed to learn how to fix buggy list-sorting code, arrived at a solution that ensured the list always had no entries and was therefore always sorted. A long list of reward-hacking examples can be found in “Specification gaming examples in AI”.
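The list-sorting anecdote is easy to reproduce in miniature. Assuming the fitness test only checks that the output is sorted (a simplification of the original experiment), a degenerate “fix” that returns an empty list passes just as well as an honest one:

```python
# A fitness test like "output must be sorted" can be gamed trivially.
def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

def honest_fix(xs):
    return sorted(xs)

def degenerate_fix(xs):
    return []  # an empty list is vacuously sorted

test_inputs = [[3, 1, 2], [9, 9, 1], []]
print(all(is_sorted(honest_fix(xs)) for xs in test_inputs))      # True
print(all(is_sorted(degenerate_fix(xs)) for xs in test_inputs))  # also True
```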

Goodhart’s Law says that when a measure becomes a target, it ceases to be a good measure; under continued optimization, the original objective function outlives its usefulness. A recent paper discusses four variants of this law, four ways in which a system becomes misaligned with the original intent of the objective function. The extremal variant of Goodhart’s law is an example of hacking the objective function: a complex learning system discovers a solution outside the known constraints, so the objective appears to have been achieved, but in a domain different from the one originally intended.
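A toy numerical example (with made-up functional forms) shows the extremal failure mode: a proxy metric that tracks the true objective over the ordinary operating range stops tracking it once an optimizer pushes the system to extremes.

```python
# Extremal Goodhart in miniature: the proxy and the true objective agree for
# small x, but optimizing the proxy drives x far past where the truth peaks.
import numpy as np

def true_objective(x):
    return x - 0.05 * x**2      # peaks around x = 10, then declines

def proxy_metric(x):
    return x                    # correlates with the truth for small x

xs = np.linspace(0, 30, 301)
proxy_opt_x = xs[np.argmax(proxy_metric(xs))]    # optimizer drives x to 30
true_opt_x = xs[np.argmax(true_objective(xs))]   # the truth peaks near 10

print("proxy-optimal x:", proxy_opt_x, "-> true value", true_objective(proxy_opt_x))
print("truly optimal x:", true_opt_x, "-> true value", true_objective(true_opt_x))
```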

This reward-hacking behavior is of course extremely important in the study of AI safety. “Concrete Problems in AI Safety” explores five practical research problems that all relate to a misalignment between the original objective function and the learning process: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.

In “Deep Reinforcement Learning Doesn’t Work Yet”, Alex Irpan goes over the pain of getting RL to work in practice. He writes:

Making a reward function isn’t that difficult. The difficulty comes when you try to design a reward function that encourages the behaviors you want while still being learnable.

Remember that working with deep learning networks is not like traditional programming, where everything is fully specified and thus fully constrained. Rather, it involves balancing constraints against enough freedom for the system to learn.
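Here is a hypothetical, deliberately simplified illustration of that tension (the numbers are invented, not from Irpan’s post): a shaped reward meant to encourage progress toward a goal ends up making aimless looping the highest-return behavior.

```python
# Innocent-looking shaped reward: +1 for passing a checkpoint, +10 for the goal.
STEPS = 100           # episode length
CHECKPOINT_BONUS = 1  # shaping reward for passing the checkpoint
GOAL_REWARD = 10      # reward for actually finishing

# Policy A: pass the checkpoint once, then go straight to the goal.
return_goal_seeker = CHECKPOINT_BONUS + GOAL_REWARD

# Policy B: shuttle back and forth over the checkpoint every 4 steps
# for the whole episode and never finish.
return_loop_exploiter = CHECKPOINT_BONUS * (STEPS // 4)

print(return_goal_seeker)     # 11
print(return_loop_exploiter)  # 25 -- the gamed behavior scores higher
```

The reward was perfectly learnable; it just rewarded the wrong thing once the learner explored hard enough.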

Unfortunately, even when we do have constraints or rules, those rules are meant to be broken. Jules Hedges makes the astute observation in “Breaking the Rules” that even in refereed games, rules are broken to gain an advantage. He writes:

A player dives, and successfully tricks the referee into awarding a foul against the other team

This is an example of meta-level reasoning about the rules, exploiting the known bounded rationality of the referee’s ability to enforce the rules correctly.

This is an astute observation, because in many real environments the question of rule enforcement comes into play. Enforcement requires detection, so a system that can mask its adversarial intentions behind a distributed, diffuse set of actions that stay below the noise level will escape it. Ultimately, it may reduce to a question of costs: rules get circumvented whenever the cost of abiding by them outweighs the cost of breaking them without being caught. Keeping those costs aligned is easier said than done.
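A back-of-the-envelope model of that cost argument, with purely illustrative numbers, looks like this: circumvention wins whenever its expected penalty, discounted by the probability of detection, is smaller than the cost of compliance.

```python
# Illustrative cost model (all numbers are assumptions, not from the article).
def expected_cost_of_cheating(p_detect, penalty, cost_to_cheat):
    return cost_to_cheat + p_detect * penalty

compliance_cost = 10.0

# Diffuse, low-level violations are hard to detect...
print(expected_cost_of_cheating(p_detect=0.05, penalty=50.0, cost_to_cheat=2.0))  # 4.5 < 10
# ...so unless detection improves, circumventing the rules is the cheaper option.
print(expected_cost_of_cheating(p_detect=0.60, penalty=50.0, cost_to_cheat=2.0))  # 32.0 > 10
```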
