I just read Deepmind's paper on Relational Networks:

https://arxiv.org/pdf/1706.01427.pdf

**My take/summary of the paper**

A Relational Neural Network (RN) takes a set of 'objects' represented by vectors and computes the following function:

R(objects) = f1( sigma{i,j}( f2(objects(i), objects(j)) ) )

Where f1 and f2 are learned neural nets, such as MLPs. In pseudo code this is:

Vector<float> R(Vector<Object> objects) {

    sum = zero_vector

    for obj1 in objects:

        for obj2 in objects:

            sum = sum + f2(obj1, obj2)

    return f1(sum)

}

So we have a function that takes a set of vectors as input, runs all possible object pairings through neural net f2, sums the output vectors from f2 and passes the aggregated result through neural net f1. f2 learns to compute a relation between objects; the relations are aggregated and f1 learns to produce the desired result from the aggregate.
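A minimal sketch of this forward pass in numpy, with f1 and f2 stood in by tiny one-layer nets. The sizes and weights here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4-d object vectors, 8-d relation vectors, 3-d output.
W2 = rng.standard_normal((8, 2 * 4))   # stands in for the learned net f2
W1 = rng.standard_normal((3, 8))       # stands in for the learned net f1

def f2(o1, o2):
    # pairwise relation: a one-layer net over the concatenated pair
    return np.tanh(W2 @ np.concatenate([o1, o2]))

def f1(agg):
    # produces the final answer from the aggregated relations
    return W1 @ agg

def relational_network(objects):
    # sum f2 over all ordered pairs, then pass the aggregate through f1
    agg = sum(f2(o1, o2) for o1 in objects for o2 in objects)
    return f1(agg)

objects = [rng.standard_normal(4) for _ in range(5)]
out = relational_network(objects)
print(out.shape)  # (3,)
```

Because the pairwise outputs are summed, the result is invariant to the order of the input objects, which is what makes this a function over a *set*.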

The RN implements the standard neural net API: vectors in, vectors out, fully differentiable, trainable with backprop. As such, an RN can be plugged into any deep neural network. An RN is optimised to learn relations, just as Convolutional Networks are optimised for spatial reasoning (they implement locality and translation invariance) and LSTMs are optimised for sequence learning.

Deepmind connected a convolutional network, which took in images, and an LSTM, which took in a textual question, to an RN. They trained the system on images containing multiple shapes, which could be large/small, red/green/blue, square/circle/triangle, and questions such as:

- What colour is the nearest shape to the red square?
- How many shapes are on the left of the blue shape?
- How many objects have the shape of the green object?

The convolutional network finds objects in the image, the LSTM builds a code which represents the 'meaning' of the question, and the RN looks at pairs of objects, computing a property of each pair (a relation), aggregating the properties and answering the question from the aggregate. In addition to the pair of objects, Deepmind pass the question code to f2 so that it can condition on the question. The motivation is that if the question concerns large circles, then relations between small squares are probably irrelevant.
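As a sketch, conditioning on the question just means widening the input of the pairwise net so the question code rides along with every pair. The dimensions here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

OBJ, Q, REL = 4, 6, 8          # illustrative dimensions
Wg = rng.standard_normal((REL, 2 * OBJ + Q))

def f2(o1, o2, q):
    # the question code q is concatenated onto every object pair,
    # so the relation computed can depend on what was asked
    return np.tanh(Wg @ np.concatenate([o1, o2, q]))

q = rng.standard_normal(Q)
o1, o2 = rng.standard_normal(OBJ), rng.standard_normal(OBJ)
print(f2(o1, o2, q).shape)  # (8,)
```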

**My thoughts arising from the paper**

Let's think about how the system might answer the question 'What colour is the furthest shape from the red square?'. Each shape can be represented by a vector like <size colour shape x y>, e.g. <small red square 25 13>. More likely they use a 1-hot or embedding representation for colour, size and shape, but meh. So f2 receives two vectors representing shapes and a question code as input. It can learn to evaluate something like the distance between the shapes iff shape1 is a red square. The distance between the shapes and the colour of shape2 from each evaluation of f2 is coded into a vector, and these vectors are summed and passed to f1. f1 must find the largest distance and output its associated colour.

Now the system answers these kinds of questions very well, but something seems wrong to me. The system must learn to encode the distance/colour tuples into the output vector in such a way that different results do not overwrite each other. Presumably each tuple is placed at a distinct offset in the vector. f1 then needs to select the tuple with the largest distance. That requires evaluating relations between tuples in an MLP - if an MLP did that well we would not need RNs! The system appears to be capable of doing this in the test domain, but it does not seem easy to learn and feels unnatural. The system is also limited in the number of shapes it can process because of the fixed length of the output vector.

I considered two ways this problem might be improved.

- Feed the output vectors from f2 into a winner-takes-all network.
- Since the output of the comparisons is a list of vectors, make the RN recurrent and feed the output vectors back around.

Solution 1 makes learning easier. Essentially f2 must write a colour and an 'activation' into each output vector. The winner-takes-all network picks the vector with the highest activation and f1 learns to extract the colour. Job done! However solution 1 is not ideal. It hardwires a 'max' operation into the network, making the algorithm less general. It works well for 'farthest', but questions involving counts need the summation and would not work.
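A sketch of the winner-takes-all idea: each relation vector carries a payload (here a one-hot colour code) plus a scalar activation, and the aggregation keeps only the payload of the most active vector. The vector layout and names are my own, not from the paper:

```python
import numpy as np

def winner_takes_all(relation_vectors):
    # each row: [payload..., activation]; keep the payload of the row
    # with the highest activation instead of summing everything
    rv = np.asarray(relation_vectors)
    winner = np.argmax(rv[:, -1])
    return rv[winner, :-1]

# e.g. payload = one-hot colour, activation = distance to the red square
rels = [
    [1.0, 0.0, 0.0, 2.5],   # red,   distance 2.5
    [0.0, 1.0, 0.0, 7.1],   # green, distance 7.1  <- farthest
    [0.0, 0.0, 1.0, 4.0],   # blue,  distance 4.0
]
print(winner_takes_all(rels))  # [0. 1. 0.] -> green
```

Note the hard argmax is not differentiable; in practice a soft version (e.g. a softmax-weighted sum over activations) would be needed to train the whole thing end to end.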

Solution 2 is more interesting. If the outputs of relations are fed back into the RN as vectors, then the second step implements relations between relations (so now 'nearest', 'farthest', 'biggest' etc. are easily evaluated). 'Farthest' is a relationship between values of 'distance', which is itself a relationship between positions. There is still a need to perform aggregation. I think it would be possible to provide a suite of functions: summation, max, min etc. Alternatively a 'learnt' aggregation might be possible, using a recurrent network to process all the output vectors. By implementing recurrence, solution 2 allows very complex relations over relations over relations....
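One way to sketch the 'suite of aggregation functions' idea: compute several poolings over the relation vectors and concatenate them, letting the outer net learn which pooling is relevant to the question. This is my own illustration, not the paper's method:

```python
import numpy as np

def aggregate_suite(relation_vectors):
    # stack of pairwise relation outputs, one vector per row
    rv = np.asarray(relation_vectors)
    # offer sum, max and min to the outer net and let it pick
    return np.concatenate([rv.sum(axis=0), rv.max(axis=0), rv.min(axis=0)])

rels = np.array([[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0]])
print(aggregate_suite(rels))  # concatenation of sum, max and min over the rows
```

The counting questions can still use the summation slots, while 'farthest'-style questions can read from the max slots.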

**Another step beyond**

So RNs essentially take a programming cliche, the nested for-loop (or its parallel counterpart), and plug in a couple of neural nets to provide learnable functions. Is it possible to take this idea and run with it? Can we 'neuralise' some other programming cliches? Here are some ideas:

- Go bigger - iterate over 3 objects i,j,k to find ternary relationships.
- Go smaller - iterate in a single loop over objects. This would be more efficient for searching and counting.
- Neuralise the 'sort' function. Now our network can understand orderings as well as relations.
- Neuralise compression. Might help to keep things efficient or throw away unnecessary information.
- Implement 'concatenation' - allows lists of objects to be put back into a single vector and then processed as a group.
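The first idea above, for example, is just one more loop. A toy sketch, with a hypothetical ternary inner net f3 replacing f2 (the scalar stand-in functions here are only to make the shape of the computation concrete):

```python
def ternary_rn(objects, f3, f1):
    # sum the inner net over all ordered triples: O(n^3) evaluations
    # of f3 rather than the O(n^2) of the pairwise RN
    agg = sum(f3(a, b, c) for a in objects for b in objects for c in objects)
    return f1(agg)

# toy stand-ins: scalar 'objects', addition as the ternary relation
print(ternary_rn([1.0, 2.0, 3.0], lambda a, b, c: a + b + c, lambda x: x))
# 162.0 (each of the 3 objects fills 27 of the 81 slots in the 27 triples)
```

The cubic cost is the obvious catch: ternary relations pay for their extra expressiveness with another factor of n in compute.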

Finally, is it possible to combine these ideas into a very flexible architecture? I imagine a vector input to a recurrent network which contains parallel modules:

- LSTM
- Convolutional network
- Relational network
- Sorting network
- Compression network
- Concatenation network

The system learns to pass the vector through the appropriate module(s) and then feeds data back around, where it can pass through a different module. Of course, training this kind of recurrent network would be *very* hard, since of all the modules proposed, only LSTMs are designed to propagate gradients back through many steps.