I have been thinking a lot about the strengths and limitations of Deep Reinforcement Learning (RL). It led me to an idea that I'd like to discuss.
RL uses theory based on the Bellman equation to assign appropriate credit to actions taken some time before a reward is received. This theory leads to gradually improving estimates of the value of each action in a given situation, and a DNN is trained to produce those (approximate) values, for each action, given the perceptual input. The outcome is that the DNN learns features of the perceptual input that are relevant to maximising reward.
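To make the credit-assignment idea concrete, here is a rough tabular Q-learning sketch. The environment, sizes and hyperparameters are toy placeholders I've made up purely for illustration; a deep RL agent would replace the Q table with a DNN over perceptual input.

    import numpy as np

    # Minimal tabular Q-learning sketch: the Bellman backup propagates credit
    # for a delayed reward backwards through the value estimates over many trials.
    # The environment is a toy placeholder (random transitions, reward only in
    # the final state), invented purely for illustration.
    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.99, 0.1
    rng = np.random.default_rng(0)

    def step(state, action):
        # Toy dynamics: jump to a random state; reward only in the last state.
        next_state = int(rng.integers(n_states))
        reward = 1.0 if next_state == n_states - 1 else 0.0
        return next_state, reward, reward > 0

    for episode in range(1000):
        state, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = step(state, action)
            # Bellman (TD) update: the target bootstraps on the next state's value,
            # which is how reward received later gets credited to earlier actions.
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state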
A problem with RL is that rewards may be very infrequent. This makes it hard to identify which actions actually helped to obtain the reward. The algorithm figures out which actions contributed in an essentially statistical way, so infrequent rewards mean a LOT of trials are needed. Auxiliary tasks are often added to generate denser rewards. These help the network learn features faster, so well-chosen auxiliary tasks speed up learning of the main task. Learning multiple value functions drives feature learning very well.
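The way I picture the auxiliary-task setup is a shared encoder with several prediction heads, so gradients keep flowing into the features even when the main reward is sparse. This is only a sketch with made-up sizes, loss weights and auxiliary targets (reward prediction and next-observation prediction), not taken from any specific paper:

    import torch
    import torch.nn as nn

    class AuxiliaryTaskAgent(nn.Module):
        # Shared encoder whose features are shaped by several prediction targets,
        # not only the (sparse) main-task value. Sizes are illustrative assumptions.
        def __init__(self, obs_dim=64, n_actions=4, hidden=128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            self.value_head = nn.Linear(hidden, n_actions)    # main task: Q-values
            self.reward_head = nn.Linear(hidden, 1)           # auxiliary: immediate reward
            self.next_obs_head = nn.Linear(hidden, obs_dim)   # auxiliary: next observation

        def forward(self, obs):
            h = self.encoder(obs)
            return self.value_head(h), self.reward_head(h), self.next_obs_head(h)

    def combined_loss(model, obs, q_target, reward, next_obs, aux_weight=0.5):
        # Main TD-style loss plus auxiliary losses; the denser auxiliary signals
        # keep training the encoder even when rewards are rare.
        q_pred, r_pred, next_pred = model(obs)
        main = nn.functional.mse_loss(q_pred, q_target)
        aux = (nn.functional.mse_loss(r_pred.squeeze(-1), reward)
               + nn.functional.mse_loss(next_pred, next_obs))
        return main + aux_weight * aux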
Although RL allows an agent to accomplish a task, it is very data hungry and only learns a very narrow, task-specific model of the world. This led me to wonder whether there are alternative paradigms to RL that could drive feature learning and perhaps yield a better understanding of the world. An idea that occurred to me is that learning a policy that can simulate/predict the world might be desirable. This needs a scheme analogous to RL which continuously refines a function that can be approximated by a DNN. For example, after new experiences are observed in the world, the system runs the simulator policy and compares the results with what really happened. The errors are identified, the model is updated and the DNN is retrained. Over time the simulator becomes better at correctly predicting the world. It can be continuously checked against past experience and honed to fit as closely as possible.
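A minimal sketch of that predict-compare-correct loop, assuming the world state is a flat vector and the simulator is a one-step dynamics network trained by gradient descent on its prediction error (all names and sizes here are my own placeholders):

    import torch
    import torch.nn as nn

    class OneStepSimulator(nn.Module):
        # Predicts the next observation from the current observation and action.
        def __init__(self, obs_dim=64, act_dim=4, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, obs_dim),
            )

        def forward(self, obs, act):
            return self.net(torch.cat([obs, act], dim=-1))

    def refine_simulator(sim, optimizer, replay_batch):
        # One refinement step: run the simulator on recorded experience,
        # compare with what really happened, and update on the error.
        obs, act, next_obs = replay_batch                # tensors from real interaction
        pred = sim(obs, act)
        error = nn.functional.mse_loss(pred, next_obs)   # measured prediction accuracy
        optimizer.zero_grad()
        error.backward()
        optimizer.step()
        return error.item()                              # track accuracy over time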
Learning a simulator can drive feature learning just as RL does; however, it also produces a simulator whose accuracy is well measured. This simulator is essentially a model of the world, and a highly accurate simulator could be used to train a new task using RL (i.e. it enables model-based RL). The simulator represents deeper understanding, not just of a single task, but of how the world behaves. As such, the information can be used to enable many different tasks to be performed. A lot of reasoning can be done by running a simulator to see what is likely to happen. Josh Tenenbaum believes that humans often run mental simulations as part of their reasoning.
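As one example of "reasoning by running the simulator", a simple random-shooting planner could imagine a few candidate action sequences inside the learned model and pick the one whose predicted outcome scores best under a task-specific reward function. The horizon, candidate count and reward_fn below are placeholders of mine:

    import torch

    def plan_with_simulator(sim, reward_fn, obs, act_dim=4, horizon=5, n_candidates=64):
        # Random-shooting planner: roll each candidate action sequence forward
        # inside the learned simulator and return the first action of the best one.
        # reward_fn is a task-specific scoring function supplied by the new task.
        with torch.no_grad():
            state = obs.unsqueeze(0).expand(n_candidates, -1)     # copy current observation
            actions = torch.rand(n_candidates, horizon, act_dim)  # candidate action sequences
            total_reward = torch.zeros(n_candidates)
            for t in range(horizon):
                state = sim(state, actions[:, t])                 # imagined next state
                total_reward += reward_fn(state)                  # score imagined outcome
            best = torch.argmax(total_reward)
            return actions[best, 0]                               # execute only the first action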
So how should the simulator be built? I imagine a 'policy' that generates the most likely events given the perceptual input. The events are used to update the perceptual input in the predicted way, the simulator is run forward a few steps, and the result is then compared to reality. The accuracy must be measured and credit assigned to the predicted events. This is very far from a workable design. How can the events in the event space be determined? It seems very open-ended. Is it better to predict at the pixel level? What parameters should control the simulator? Does model-based RL (like Dyna) already achieve the same goals? Any ideas?
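On that last question, my understanding of Dyna (Sutton's Dyna-Q) is that it interleaves updates from real experience with updates from a learned model, which gets part of the way towards these goals. A tabular sketch, with toy sizes and a deterministic learned model, looks roughly like this:

    import numpy as np

    # Tabular Dyna-Q sketch (after Sutton & Barto): real experience updates both the
    # value function and a learned model; imagined experience sampled from the model
    # then provides extra value updates. Environment and sizes are toy placeholders.
    n_states, n_actions, n_planning = 20, 4, 10
    alpha, gamma, epsilon = 0.1, 0.95, 0.1
    Q = np.zeros((n_states, n_actions))
    model = {}                      # (state, action) -> (reward, next_state)
    rng = np.random.default_rng(0)

    def env_step(s, a):
        # Toy placeholder environment: random jumps, reward in the last state.
        s_next = int(rng.integers(n_states))
        return (1.0 if s_next == n_states - 1 else 0.0), s_next

    def q_update(s, a, r, s_next):
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

    s = 0
    for step_i in range(5000):
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        r, s_next = env_step(s, a)
        q_update(s, a, r, s_next)            # learn from real experience
        model[(s, a)] = (r, s_next)          # update the (deterministic) learned model
        for _ in range(n_planning):          # planning: learn from imagined experience
            ps, pa = list(model.keys())[rng.integers(len(model))]
            pr, ps_next = model[(ps, pa)]
            q_update(ps, pa, pr, ps_next)
        s = s_next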