DeepMind recently published their third paper in Nature, "Hybrid computing using a neural network with dynamic external memory". They devise a recurrent network (a deep LSTM controller) that, based on what it read from an external memory at the previous step and on the current input, iteratively emits new read/write commands to the memory together with the action output. They call it the DNC (Differentiable Neural Computer). The hope is that the network can perform reasoning based on the given information. They evaluate the model on the bAbI reasoning tasks, graph traversal/shortest-path prediction, deduction of relationships in a family tree, and a block puzzle game, showing that its performance is much better than an LSTM without external memory.
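To make the read/write loop concrete, here is a minimal numpy sketch of what one such timestep might look like. It is heavily simplified (a single read head, content-based addressing only), and the "controller" callable and interface field names are my own placeholders standing in for the deep LSTM and its interface vector; this is an illustration of the idea, not the paper's exact design.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_weights(memory, key, beta):
    # Content-based addressing: cosine similarity between the key and each
    # memory row, sharpened by beta and normalized with a softmax.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sims)

def dnc_step(memory, prev_read, x, controller):
    # One heavily simplified timestep: the controller sees the current input
    # together with what was read from memory at the previous step, and emits
    # an action output plus an "interface" that says where to write and read.
    out, iface = controller(np.concatenate([x, prev_read]))
    w_write = content_weights(memory, iface["write_key"], iface["write_beta"])
    # Erase-then-add write to the addressed rows.
    memory = memory * (1 - np.outer(w_write, iface["erase"])) \
             + np.outer(w_write, iface["write_vec"])
    w_read = content_weights(memory, iface["read_key"], iface["read_beta"])
    read_vec = w_read @ memory  # weighted sum of memory rows
    return memory, read_vec, out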
Here are some comments:
1. Overall, it seems that the model implicitly learns a heuristic function for search-based reasoning. As mentioned in the paper, "Visualization of a DNC trained on shortest-path suggests that it progressively explored the links radiating out from the start and end nodes until a connecting path was found (Supplementary Video 1)." We can also see this behavior in the London Underground task (Fig. 3). This can be efficient when the search space is small, but it is not necessarily a good strategy for real problems.
2. There seem to be many manually tunable knobs in the network. The network's job is to emit the next set of operations on the external memory, and there are many types of such operations, each with its own attention mechanism (content-based attention, attention that follows the order of consecutive writes, and a "usage" mechanism built into reading and writing; a small sketch of the usage-based allocation step is given after these comments). It is not clear which components matter most. Ideally, there should be a more automatic or principled approach.
3. Interesting details:
(1) Training a sequential structured prediction model directly on the ground-truth answers is not necessarily good: once the prediction deviates from the ground truth, the model can fail easily. In this paper, they use DAgger [1], which blends the ground-truth distribution with the currently predicted distribution during training. This makes the prediction more robust (a DAgger-style sketch also follows these comments).
(2) For the block puzzle game, they use an actor-critic-like model. In this scenario, the DNC outputs both a policy and a value function, conditioned on the configuration of the game, which is given as input at the beginning. This coincides with our experience with our Doom AI: the actor-critic model converges faster than Q-learning (a sketch of the actor-critic objective is given below as well).
(3) Curriculum training (i.e., training the model on easy tasks first) plays an important role. This agrees with our experience training our Doom AI (we will release the paper soon). A minimal curriculum loop is sketched at the end of these comments.
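Regarding comment 2, the "usage" mechanism is, to my reading, the most unusual of these knobs: writes are steered toward the least-used memory slots. Below is a small numpy sketch of that allocation step, following my understanding of the paper's allocation weighting; details such as how usage is updated by the free gates are omitted.

import numpy as np

def allocation_weights(usage):
    # usage[i] in [0, 1] says how "occupied" memory slot i is; the write head
    # is steered toward the least-used slots.
    order = np.argsort(usage)          # least-used slots first
    a = np.zeros_like(usage)
    cumprod = 1.0
    for j in order:
        a[j] = (1.0 - usage[j]) * cumprod
        cumprod *= usage[j]
    return a

usage = np.array([0.9, 0.1, 0.5, 0.99])
print(allocation_weights(usage))       # the least-used slot (index 1) dominates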
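For comment 3(1), here is a generic DAgger-style data-collection sketch, not the paper's exact variant: the visited states are always labeled with the oracle's action, but the action actually executed is a mixture of oracle and model, controlled by a mixing coefficient that is annealed over training. The "env", "oracle_policy" and "model_policy" names are hypothetical placeholders.

import numpy as np

def collect_trajectory(env, model_policy, oracle_policy, beta):
    # At each step, act with the oracle with probability beta and with the
    # current model otherwise, but always record the oracle's action as the
    # training label. Annealing beta toward 0 exposes the model to states
    # drawn from its own distribution while keeping correct supervision.
    data, state, done = [], env.reset(), False
    while not done:
        expert_action = oracle_policy(state)
        action = expert_action if np.random.rand() < beta else model_policy(state)
        data.append((state, expert_action))
        state, done = env.step(action)
    return data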
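For comment 3(2), this is the standard advantage actor-critic objective of the kind the setup suggests, where a policy head and a value head sit on top of the same controller; it is a generic sketch, not the paper's exact loss.

import numpy as np

def a2c_loss_terms(log_probs, values, rewards, gamma=0.99):
    # log_probs[t]: log pi(a_t | s_t) from the policy head
    # values[t]:    V(s_t) from the value head (both heads share the controller)
    # rewards[t]:   reward received after taking a_t
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):       # discounted return-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values      # critic baseline reduces variance
    policy_loss = -(log_probs * advantages).sum()  # raise prob. of above-baseline actions
    value_loss = ((returns - values) ** 2).sum()   # regress V toward the observed return
    return policy_loss, value_loss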
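For comment 3(3), curriculum training can be as simple as the loop below: stay on one difficulty level until the model is reliable there, then move on. The "evaluate", "sample_batch" and "train_step" names are hypothetical placeholders, not the paper's training code.

def curriculum_train(model, task_levels, threshold=0.9):
    # task_levels is assumed to be ordered from easy to hard (e.g. small graphs
    # before large ones); advance only once accuracy passes the threshold.
    for level in task_levels:
        while evaluate(model, level) < threshold:
            model.train_step(level.sample_batch())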
References
[1] Ross et al., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning," AISTATS 2011.