I just read DeepMind's paper on their new reinforcement-learning system, which uses 'imagination' during problem solving. It's pretty cool. The features I liked are:

- Plans ahead using an Environment Model (EM), which predicts what will happen when it takes an action in a given state.
- Can build the environment model as it learns, or use a supplied model.
- Deliberately robust to errors in the EM. If the EM does not help, the system learns to ignore or down-weight it, thus falling back on standard Reinforcement Learning.

The planning uses various simple look-ahead schemes, e.g.:

- Considering all alternative actions for the next step.
- Recursively chaining actions to predict (imagine) the result of the next N steps.
- Learning to combine the two methods above to perform plan-tree expansion (though I think they never actually used this idea).
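The chaining idea is easy to sketch. Below is a toy illustration of my reading of it, not the paper's architecture: `imagine_rollout` and `toy_em` are names I made up, and the integer-position EM is just a stand-in for the learned model.

```python
def imagine_rollout(em, state, actions):
    """Chain the environment model: feed each imagined state back in
    to predict the result of a whole action sequence.
    'em' is any callable (state, action) -> next state."""
    trajectory = [state]
    for a in actions:
        state = em(state, a)
        trajectory.append(state)
    return trajectory

# Toy deterministic EM: the state is an integer position, actions are +/-1 moves.
toy_em = lambda s, a: s + a

# Imagine three steps ahead from position 0.
traj = imagine_rollout(toy_em, 0, [1, 1, -1])

# The first scheme (consider all alternative next actions) is one-step lookahead:
candidates = {a: toy_em(0, a) for a in (-1, +1)}
```

Combining the two, i.e. expanding several candidate actions and then rolling each forward, is what would give the plan tree.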

Clearly the Environment Model is the interesting part of the system. The details are a bit sketchy. The idea is to train a neural net that takes the current state and a possible action as input, and outputs a probabilistic, imagined next state. Since the input state is a pixel image giving a view of a game, the output state gives pixel probabilities.
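In interface terms, that looks something like the sketch below. This is only a shape-level stand-in with untrained random weights (the real EM would be a trained convolutional net); `environment_model` and `W` are my own names, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def environment_model(state, action, W):
    """Toy stand-in for the learned EM: maps a flattened pixel state
    plus a one-hot action to per-pixel 'on' probabilities for the
    imagined next frame."""
    x = np.concatenate([state.ravel(), action])
    logits = W @ x
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> pixel probabilities
    return probs.reshape(state.shape)

# A 4x4 binary 'frame' and four possible actions (up/down/left/right).
state = rng.integers(0, 2, size=(4, 4)).astype(float)
action = np.eye(4)[2]                      # one-hot "left"
W = rng.normal(size=(16, 16 + 4))          # untrained weights, for shapes only

next_frame_probs = environment_model(state, action, W)
```

The point is that the output is not a single next frame but a probability per pixel, which is exactly where my worry below comes from.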

The EM clearly works playing Sokoban, but it occurs to me that actions in this game are quite deterministic: if you push a box forward into an empty space, the player and the box both move forward one step, so the EM should assign near-100% probability to the new positions. The situation would be quite different if the action were, e.g., letting a pen balanced on its end fall. Here the resulting pen position is quite non-deterministic, although there is a well-defined locus of positions where it might end up. A probabilistic model would 'average' the possible positions together, giving a small probability to every position on a circle. That is not what really happens.
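The pen example can be made concrete with a few lines of simulation. Every actual outcome is a tip position on a circle of radius L around the base, but the probability-weighted average of those outcomes sits near the centre, a position the tip never occupies:

```python
import math
import random

random.seed(0)
L = 1.0  # pen length

# Each sample: the pen tips over in a uniformly random direction,
# so its tip lands somewhere on a circle of radius L around the base.
samples = []
for _ in range(10_000):
    theta = random.uniform(0, 2 * math.pi)
    samples.append((L * math.cos(theta), L * math.sin(theta)))

mean_x = sum(x for x, _ in samples) / len(samples)
mean_y = sum(y for _, y in samples) / len(samples)

# Every real outcome lies at distance L from the base...
assert all(abs(math.hypot(x, y) - L) < 1e-9 for x, y in samples)
# ...but the averaged 'expected position' is near (0, 0),
# somewhere the tip never actually ends up.
```

A per-pixel probability map is exactly this kind of average: a faint ring of low probabilities rather than any one realistic fallen pen.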

Hence my reason for writing this message. Could a more sophisticated EM not be implemented using Generative Adversarial Networks? These are well known for 'imagination' applications. The benefit is that they generate realistic, specific outcomes; they do not blur multiple possibilities together. Of course, one could run the GAN multiple times to explore the variation in outcomes. If it is run enough times, the probability distribution emerges.
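To sketch what I mean (with a trivial stand-in for a trained generator, since `generator` here is purely hypothetical): each call with a fresh noise draw yields one sharp, concrete outcome, and repeated calls recover the distribution over outcomes empirically.

```python
import math
import random
from collections import Counter

random.seed(1)

def generator(state, action, noise):
    """Hypothetical GAN-generator stand-in: one noise draw -> one
    concrete, non-blurred outcome (here, a discretised landing
    direction for the fallen pen, one of 8 sectors)."""
    theta = noise * 2 * math.pi
    return round(theta / (math.pi / 4)) % 8

# Each call produces a single sharp outcome; sampling many times
# recovers the (roughly uniform) distribution over outcomes.
counts = Counter(generator(None, None, random.random()) for _ in range(8000))
```

So the agent could either plan against individual sampled futures, or aggregate many samples when it genuinely needs a distribution.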

Of course, the system just needs information from the EM to choose the next step. It is possible that a blurry probability distribution provides this information better than samples of actual possible future events. Thoughts?