The AI Behind OpenAI’s Robotic Hand That Can Solve a Rubik’s Cube One-Handed

Source: Deep Learning on Medium
Yesterday, artificial intelligence (AI) powerhouse OpenAI astonished the world by unveiling a prototype of a robotic arm that could solve a Rubik’s cube with one hand. The prototype not only represented a milestone for the robotics ecosystem in solving highly complex tasks that actively require sensory information, but also resulted in a major achievement for the AI community: the OpenAI robot was trained entirely in simulation, using the reinforcement learning methods behind the OpenAI Five system that beat human players at Dota 2. The research was discussed in a paper that accompanied the news.

The importance of OpenAI’s achievement was not about designing a robot that could solve a Rubik’s cube. That has been done many times before. A few years ago, a machine developed by MIT solved a cube in less than 0.4 seconds. In late 2018, a Japanese YouTube channel called Human Controller even developed its own self-solving Rubik’s cube using a 3D-printed core attached to programmable servo motors. However, all those attempts required robots highly specialized for that single task. Instead of recreating that work, OpenAI designed a robot that solves the Rubik’s cube the way a human would: using trial and error and mastering movements with its hand. This achievement is relevant because it can be extended to many physical manipulation tasks.

Designing robots that can be as versatile as humans in cognitive and physical manipulation tasks remains the biggest challenge in robotics. For the last few decades, hard tasks which humans accomplish with their fixed pair of hands have required designing a custom robot for each task, which is obviously not scalable. A Rubik’s cube is a perfect example of a task that combines physical and cognitive skills. It takes children several years to master the process and gain the dexterity required to solve a Rubik’s cube. Now, imagine training a robot to solve one. Even with the perfect design, the training would require massive amounts of data, which has proven unfeasible in the past. OpenAI decided to tackle this challenge by relying on simulations based on reinforcement learning. But even before that, OpenAI needed to break down the Rubik’s cube problem into a series of discrete, solvable tasks.

Tasks for Solving a Rubik’s Cube

Designing a robot that can solve a Rubik’s cube requires both rapid processing of visual information and incredible levels of dexterity. At a high level, OpenAI divided the challenge into two fundamental tasks: block rotation and solving the Rubik’s cube itself.

Block Rotation

For years, OpenAI’s robotics division has been working on robotic arms that can effectively manipulate objects, work which serves as the foundation for this project. The goal of the block reorientation task is to rotate a block into a desired goal orientation. A goal is considered achieved if the block’s rotation matches the goal rotation to within 0.4 radians.
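That 0.4-radian success criterion is easy to make concrete. Below is a minimal sketch (not OpenAI's code) that checks it using unit quaternions, a common way to represent orientations; the function names are my own:

```python
import math

def rotation_distance(q1, q2):
    """Angular distance in radians between two unit quaternions (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against floating-point overshoot
    return 2.0 * math.acos(dot)

def goal_achieved(block_q, goal_q, threshold=0.4):
    """Success criterion from the paper: within 0.4 rad of the goal rotation."""
    return rotation_distance(block_q, goal_q) <= threshold

identity = (1.0, 0.0, 0.0, 0.0)
# A 90-degree rotation about the z-axis is well outside the 0.4 rad tolerance:
quarter_turn = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
```

A block left in its goal orientation trivially passes, while a quarter turn away (about 1.57 rad) fails, which matches the intuition that the tolerance is tight but forgiving of small wobbles.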

Rubik’s Cube Algorithm

Conceptually, a Rubik’s cube is a puzzle with 6 internal degrees of freedom. It consists of 26 cubelets that are connected via a system of joints and springs. Each of the 6 faces of the cube can be rotated, allowing the Rubik’s cube to be scrambled. A Rubik’s cube is considered solved if all 6 faces have been returned to a single color each.
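The solved condition described above, all 6 faces returned to a single color each, translates directly into code. Here is a tiny illustrative check over an assumed flat sticker layout (the face names and data structure are my own, not from the paper):

```python
def is_solved(cube):
    """A Rubik's cube is solved when each of its 6 faces shows a single color.
    `cube` maps a face name to the 9 sticker colors on that face."""
    return all(len(set(stickers)) == 1 for stickers in cube.values())

# Conventional face labels: Up, Down, Left, Right, Front, Back.
solved = {face: [face] * 9 for face in "UDLRFB"}
scrambled = dict(solved, U=["U"] * 8 + ["F"])  # one sticker out of place
```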

OpenAI structured the challenge of solving a Rubik’s cube as two fundamental subgoals: a rotation, which turns a single face of the Rubik’s cube by 90 degrees in the clockwise or counter-clockwise direction, and a flip, which moves a different face of the Rubik’s cube to the top. These subgoals can then be performed sequentially to eventually solve the Rubik’s cube.

On the methodology side, plenty of algorithms have been designed to solve a Rubik’s cube. OpenAI relied on Kociemba’s algorithm to pick the solution steps. This algorithm produces a sequence of subgoals for the robotic hand to perform.
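To see how a solver's output becomes hand subgoals, consider that Kociemba-style solutions are written in standard face notation (e.g. "R", "U'", "F2"), while the hand only performs the two subgoal types described above. A hypothetical sketch (my own illustration, not OpenAI's code) of that expansion:

```python
# Expand a face-notation move sequence into the two subgoal types the hand
# can execute: a flip that brings the target face to the top, followed by
# one or two 90-degree rotations of the top face.

def to_subgoals(moves):
    subgoals = []
    for move in moves.split():
        face, modifier = move[0], move[1:]
        subgoals.append(("flip", face))           # reorient so `face` is on top
        if modifier == "2":                        # half turn = two quarter turns
            subgoals += [("rotate", "cw"), ("rotate", "cw")]
        elif modifier == "'":
            subgoals.append(("rotate", "ccw"))
        else:
            subgoals.append(("rotate", "cw"))
    return subgoals
```

For instance, the three-move sequence "R U' F2" expands into seven subgoals, since the half turn on F costs two quarter-turn rotations.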

To help sense the state of the Rubik’s cube, OpenAI designed a “smart” Rubik’s cube with built-in sensors and a Bluetooth module. This design allows the system to sense and track the different positions of the cube.

The AI: Automatic Domain Randomization

To accomplish the goal of learning to solve a Rubik’s cube like a human would, OpenAI relied on training neural networks with reinforcement learning in simulation and transferring that knowledge to the robotic hand. This type of technique is called domain randomization and has been tested in different robotic environments. However, basic domain randomization still struggles to simulate the physics of solving a Rubik’s cube in the real world. Factors like friction, elasticity and dynamics are incredibly difficult to measure and model for objects as complex as Rubik’s cubes or robotic hands, and OpenAI found that domain randomization alone is not enough.

To address these challenges, OpenAI designed a method called Automatic Domain Randomization (ADR), which endlessly generates progressively more difficult environments in simulation. The main hypothesis behind ADR is that training on a maximally diverse distribution over environments leads to transfer via emergent meta-learning. More concretely, if the model has some form of memory, it can learn to adjust its behavior during deployment to improve performance on the current environment over time, i.e. by implementing a learning algorithm internally.

ADR is used to generate distributions over environments by randomizing certain aspects of them. For instance, one of the parameters OpenAI randomizes is the size of the Rubik’s cube. ADR begins with a fixed cube size and gradually increases the randomization range as training progresses. The algorithm gradually expands the distribution over environments, which helps the AI models perform better.
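The expansion loop can be sketched for a single parameter. This is a simplified illustration of the idea (not OpenAI's implementation, which tracks per-boundary performance buffers): the randomization range starts collapsed at a nominal value and widens whenever recent performance clears a threshold.

```python
import random

class ADRParam:
    """Minimal single-parameter Automatic Domain Randomization sketch."""

    def __init__(self, nominal, step=0.001):
        self.low = self.high = nominal   # start with no randomization
        self.step = step
        self.results = []                # recent episode success flags

    def sample(self):
        """Draw a value for this environment parameter."""
        return random.uniform(self.low, self.high)

    def update(self, success, threshold=0.8, window=20):
        """Widen the range once a window of episodes performs well enough."""
        self.results.append(success)
        if len(self.results) >= window:
            rate = sum(self.results) / len(self.results)
            self.results.clear()
            if rate >= threshold:        # doing well: make the task harder
                self.low -= self.step
                self.high += self.step
            # (full ADR also narrows boundaries when performance drops)

cube_size = ADRParam(nominal=0.0575)     # cube edge in meters; value assumed
```

Because the range only grows when the policy already succeeds on the current distribution, the curriculum stays just beyond the model's comfort zone, which is what drives the transfer to real-world physics.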

The Architecture

OpenAI’s implementation relies on a distributed architecture for training both the vision and policy neural networks. The system uses Redis for storing the ADR parameters and training data. A series of workers regularly run the algorithm and evaluate its performance. The ADR updater uses those performance buffers to obtain average performance and increases or decreases the randomization boundaries accordingly. A group of rollout workers (for the policy) and data producers (for vision) generate data by sampling an environment parameterized by the current set of ADR parameters. This data is then used by the optimizers to improve the policy and vision models, respectively.
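The data flow among those components can be sketched in a few lines. This single-process toy uses in-memory queues as a stand-in for the Redis buffers of the real distributed system, and every name in it is my own illustration rather than OpenAI's code:

```python
from queue import Queue

adr_params = {"cube_size_range": 0.0}      # current randomization widths
performance_buffer = Queue()               # evaluation results from workers
training_buffer = Queue()                  # rollout / vision training data

def rollout_worker(env_sampler, policy):
    """Sample an environment from the current ADR params and produce data."""
    env = env_sampler(adr_params)
    training_buffer.put(policy.collect(env))

def adr_updater(threshold=0.8, step=0.01):
    """Drain the performance buffer and widen boundaries if results are good."""
    scores = []
    while not performance_buffer.empty():
        scores.append(performance_buffer.get())
    if scores and sum(scores) / len(scores) >= threshold:
        adr_params["cube_size_range"] += step
```

In the real system these roles run as separate processes on separate machines, with Redis providing the shared state; the sketch only shows who reads and writes which buffer.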

The Results

Constant training on simulations with randomized physics produces a model that is remarkably resilient to adversarial conditions. To test the limits of this model, OpenAI played with different perturbations of the physics environment. In most tests, the robot was able to adapt to the new physics and continue solving the Rubik’s cube.

The Meta-Learning Angle

One of the original theses of the ADR algorithm was that sufficiently randomized environments lead to emergent meta-learning, an organic ability to learn how to learn. To test this, OpenAI measured the time to success per cube flip (rotating the cube such that a different color faces up) for the neural network under different perturbations, such as resetting the network’s memory, resetting the dynamics, or breaking a joint. Initially, as the neural network successfully achieves more flips, each successive time to success decreases because the network learns to adapt. However, when perturbations are applied, the time to success increases, because the strategy the network is employing doesn’t work in the changed environment. The network then relearns the new environment, and the time to success again decreases to the previous baseline.

OpenAI’s achievement in solving the Rubik’s cube using reinforcement learning and simulation is nothing short of a major milestone for modern AI. The principles applied in the model can be used in many object-manipulation tasks. The ADR algorithm shows that intelligent randomization can lead to emergent meta-learning and become the foundation for creating truly robust AI models.