Interaction networks for learning about objects, relations and physics

Interaction networks for learning about objects, relations and physics Google DeepMind, NIPS 2016

Welcome back! There were so many great papers from OSDI ’16 to cover at the end of last year that I didn’t have a chance to get to NIPS. I’m kicking off this year therefore with a few of the Google DeepMind papers from NIPS ’16 in December.

Last year we looked at ‘Building machines that learn and think like people’ which talked about some of the important differences between how computers learn and how humans learn. In particular, machine learning is anchored in statistical pattern recognition whereas we conjecture that humans learn by building models of the world that we reason over. An intuitive understanding of basic physics begins to develop in humans from about 2 months old: solidity, continuity, paths of motion, and so on. In ’Towards deep symbolic reinforcement learning’, Garnelo et al. argue for a combination of deep learning and symbolic reasoning to accommodate this kind of knowledge. In today’s paper, the Google DeepMind team demonstrate a pure deep learning approach to learning a physics engine…

Reasoning about objects, relations, and physics is central to human intelligence, and a key goal of artificial intelligence…

Interaction networks deal with objects and the relations between them. An object-centric model learns the behaviour of individual objects and the aggregated effects of the interactions they participate in. A relation-centric model learns the behaviours of two or more interacting objects. The input to an interaction network is a graph of objects and the relations between them (learning that graph from visual representations of a scene is a separate challenge!).

Decomposing complex systems into objects and relations, and reasoning about them explicitly, provides for combinatorial generalisation to novel contexts, one of the most important future challenges for AI, and a crucial step toward closing the gap between how humans and machines think.

In the sections that follow we’ll look at : (a) the abstract structure of an interaction network (IN) – i.e., how it models the real world; (b) an embodiment of INs using deep neural network building blocks; and (c) examples of the kind of things that INs were able to learn during evaluation.

The Interaction Network (IN) model

Let o_t represent the state of object o at time t.

Now consider a directed relationship r between two objects o_1 and o_2. This could be for example a fixed object attached by a spring to a freely moving second object. Let f_R be a relation-centric function that takes as input two objects and the relation between them, and returns the effect of the interaction, e_{t+1}.

e_{t+1} = f_R(o_{1, t}, o_{2,t})

Let f_O be an object-centric function that takes as input the state of an object at time o_t together with any predicted effect on the state of the object at time t+1, e_{t+1}, and outputs the state of the object at time t+1, o_{t+1}.

o_{t+1} = f_O(o_t, e_{t+1})

A system with many objects and relations between them can be represented by a graph G = The relations themselves may have attributes.

The final input to the system is X, representing external effects such as active control inputs or gravitational acceleration which are applied to each object separately.

A basic interaction network function is then defined as:

IN(G) = \phi_O(a(G, X, \phi_R(m(G))))

That might look complicated at first glance, but all it really says is first apply the relation function f_R to all (o_i,o_j,r_k) tuples, and then aggregate (that’s the function a in the above) all of the effects that apply to each receiver object and combine them with any effects due to X. Finally apply the object function f_O to each object together with the result of the effect aggregation step for that object.

This is represented visually in the figure at the start of this section.

A learnable implementation of an IN

The general definition of the IN in the previous section is agnostic to the choice of functions and algorithms, but we now outline a learnable implementation capable of reasoning about complex systems with nonlinear relations and dynamics. We use standard deep neural network building blocks, multilayer perceptrons (MLP), matrix operations, etc., which can be trained efficiently from data using gradient-based optimisation, such as stochastic gradient descent.

If each object has a state vector of length D_s, and there are N_o objects, then O is represented as a D_s \times N_o matrix.

Relationships are represented by three matrices: R_r is a binary vector whose j^{th} column is a one-hot vector indicating the receiver object of the j^{th} relation; R_s similarly identifies sender objects. The final matrix, R_a encodes the attributes of each relationship.

The functions f_R and f_O are both encoded as multilayer perceptrons. The aggregation function a is implemented as a matrix product.

Training an IN requires optimizing an objective function over the learnable parameters of \phi_R and \phi_O. Note, m and a involve matrix operations that do not contain learnable parameters.

For the evaluation, the best model architecture for the MLPs was selected by a grid search over layer sizes and depths. The f_R MLP applied to each column had four 150-length hidden layers of linear transforms plus biases, followed by rectified linear units (ReLUs), and a output layer of length 50 that was a linear transformation plus bias. The f_O MLP had one 100-length hidden layer and output length 2, targeting the x,y velocity.

How well does it learn?

The experiments involved predicting future states of systems and estimating their abstract properties (potential energy).

We evaluated the IN’s ability to learn to make these judgments in three complex physical domains: n-body systmems; balls bouncing in a box; and strings composed of springs that collide with rigid objects.

In the figure below we see 1000 rollout steps for n-body, bouncing ball, and string problems. On the left in each case are the trajectories from a physics engine simulation, and on the right the predictions of the IN model.

Our results shown that the IN can predict the next-step dynamics of our task domains very accurately after training… We also found that the IN trained on single-step predictions can be used to simulate trajectories over thousands of steps very effectively, often tracking the ground truth closely, especially in the n-body and string domains.

Where results do differ, this can be (partially?) attributed to the highly non-linear dynamics in which imperceptible prediction errors by the model can lead to large differences in system state. Such incoherent rollouts however remain consistent with our understanding of the domains.

What does it all mean?

Our results provide surprisingly strong evidence of IN’s ability to learn accurate physical simulations and generalize their training to novel systems with different numbers and configurations of objects and relations… Our interaction network implementation is the first learnable physics engine than can scale up to real-world problems, and is a promising template for new AI approaches to reasoning about other physical and mechanical systems, scene understanding, social perception, hierarchical planning, and analogical reasoning.

Now we just need to bolt-on a perceptual front-end that can infer objects and relations from raw observations!