Thursday, October 13, 2016

Notes on DeepMind's 3rd Nature paper

Recently, DeepMind published their 3rd Nature paper, "Hybrid computing using a neural network with dynamic external memory". They devise a recurrent network structure (a deep LSTM) that, based on the previous readout from an external memory and the current input, iteratively emits new read/write commands to the memory as well as the action output. They call it the DNC (Differentiable Neural Computer). The hope is that the network can perform reasoning based on the given information. They evaluate the model on the bAbI reasoning tasks, network traversal / shortest-path prediction, deduction of relationships in a family tree, and a block puzzle game, showing that its performance is much better than an LSTM without external memory.
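
To make the control flow concrete, here is a minimal sketch of one step of that loop. It is my own simplification with made-up dimensions, a single tanh layer standing in for the deep LSTM controller, and a deliberately crude write rule; the real interface vector carries many more signals:

    import numpy as np

    rng = np.random.default_rng(0)
    X, H, N, W = 8, 16, 32, 8          # input, hidden, memory slots, slot width

    # A single tanh layer stands in for the paper's deep LSTM controller.
    W_h     = rng.normal(0, 0.1, (H, X + W))
    W_key   = rng.normal(0, 0.1, (W, H))   # read key for content lookup
    W_write = rng.normal(0, 0.1, (W, H))   # write vector
    W_out   = rng.normal(0, 0.1, (4, H))   # action output (say, 4 actions)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def dnc_step(M, x_t, prev_read):
        # Controller sees the current input and the previous memory readout.
        h = np.tanh(W_h @ np.concatenate([x_t, prev_read]))
        # Content-based read: cosine similarity between key and memory rows.
        key = W_key @ h
        sim = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
        w_read = softmax(sim)
        read = w_read @ M                  # soft readout
        # Crude write rule (mine, not the paper's): deposit the write
        # vector into the slot that the read attended to the least.
        M = M.copy()
        M[np.argmin(w_read)] += W_write @ h
        return W_out @ h, read, M

    M, read = np.full((N, W), 1e-6), np.zeros(W)
    for t in range(5):
        action, read, M = dnc_step(M, rng.normal(size=X), read)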

Here are some comments:

1. Overall, it seems that they implicitly learn a heuristic function for search-based reasoning. As they mention in the paper, "Visualization of a DNC trained on shortest-path suggests that it progressively explored the links radiating out from the start and end nodes until a connecting path was found (Supplementary Video 1)." We can also see this behavior in the London Underground task (Fig. 3). This can be efficient for experiments with small search spaces, but it is not necessarily a good strategy for real-world problems.

2. There seem to be many manually tuned knobs in the network. The network's job is to emit the next set of operations on the external memory, and there are many types of such operations, each with its own attention mechanism (content-based attention, attention that follows the temporal order of writes, and a "usage" mechanism built into reading and writing). It is not clear which components are the important ones. Ideally, there would be a more automatic or principled approach.
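
As an example of one such knob, here is the usage-based allocation weighting, written from the formula in the paper as I understand it (the toy usage vector is mine): slots with low usage receive high allocation weight for the next write.

    import numpy as np

    def allocation_weights(usage):
        # a[phi[j]] = (1 - u[phi[j]]) * prod_{i<j} u[phi[i]],
        # where phi sorts slots from least to most used.
        order = np.argsort(usage)
        a, prod = np.zeros_like(usage), 1.0
        for j in order:
            a[j] = (1.0 - usage[j]) * prod
            prod *= usage[j]
        return a

    usage = np.array([0.9, 0.1, 0.5, 0.99])
    print(allocation_weights(usage))       # the least-used slot dominates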

3. Interesting details:
(1) Training a sequential structured prediction model directly on the ground-truth answers is not necessarily good: once the predictions deviate from the ground truth at test time, errors compound and the model may fail easily. In this paper, they use DAgger [1], which blends the ground-truth distribution with the current predicted distribution during training. This makes the predictions more robust (see the first sketch after this list).

(2) For the block puzzle game, they use an actor-critic-like model. In this scenario, the DNC outputs policy and value functions, conditioned on the configuration of the game, which is given as input at the beginning. This coincides with our experience with Doom AI: the actor-critic model converges faster than Q-learning (see the second sketch after this list).

(3) Curriculum training (i.e., training the model on easy tasks first) plays an important role. This also agrees with our experience training our Doom AI (we will release the paper soon).
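
On (1): one simple way to realize that blending during training is to unroll the model while feeding it the ground-truth symbol with probability beta and its own sample otherwise, so it learns to recover from its own mistakes. This is my own toy version (closer in spirit to scheduled sampling than to the exact procedure in the paper); the ToyModel and all names are made up:

    import numpy as np

    rng = np.random.default_rng(0)

    class ToyModel:
        # Tiny stand-in: fixed random next-token distributions.
        def __init__(self, vocab=5):
            self.P = rng.dirichlet(np.ones(vocab), size=vocab)
            self.start_token = 0
        def init(self):
            return None
        def step(self, prev, state):
            return self.P[prev], state

    def unroll(model, targets, beta):
        state, prev = model.init(), model.start_token
        losses = []
        for y in targets:
            probs, state = model.step(prev, state)
            losses.append(-np.log(probs[y] + 1e-9))    # loss vs. ground truth
            if rng.random() < beta:
                prev = y                               # teacher forcing
            else:
                prev = rng.choice(len(probs), p=probs) # model's own sample
        return np.mean(losses)

    # beta is typically annealed from 1 (pure teacher forcing) toward 0.
    print(unroll(ToyModel(), targets=[1, 3, 2, 4], beta=0.5))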
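
On (2): a minimal sketch of the actor-critic pattern, with one shared body and two heads; the policy gradient is weighted by the advantage r + gamma*V(s') - V(s). Sizes and names are mine, not the paper's:

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, H = 10, 4, 32                     # state, action, hidden sizes
    W_body = rng.normal(0, 0.1, (H, S))     # shared body
    W_pi   = rng.normal(0, 0.1, (A, H))     # actor head
    W_v    = rng.normal(0, 0.1, (1, H))     # critic head

    def forward(s):
        h = np.tanh(W_body @ s)
        logits = W_pi @ h
        pi = np.exp(logits - logits.max()); pi /= pi.sum()
        return pi, (W_v @ h).item()

    def actor_critic_losses(s, a, r, s_next, gamma=0.99):
        pi, v = forward(s)
        _, v_next = forward(s_next)
        advantage = r + gamma * v_next - v
        policy_loss = -np.log(pi[a] + 1e-9) * advantage   # actor
        value_loss = 0.5 * advantage ** 2                  # critic
        return policy_loss, value_loss

    s, s2 = rng.normal(size=S), rng.normal(size=S)
    print(actor_critic_losses(s, a=2, r=1.0, s_next=s2))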

References.
[1] Ross et al., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning", AISTATS 2011.

Tuesday, October 11, 2016

Doom AI competition

Here I put some old news to warm up this newly opened English blog.

Our team "F1" (my intern Yuxin Wu and I) won Track 1 of the Doom AI Competition! Please check the results here:

http://vizdoom.cs.put.edu.pl/competition-cig-2016/results

Some interesting videos here:

Track 1:
https://www.youtube.com/watch?v=94EPSjQH38Y

Track 2:
https://www.youtube.com/watch?v=tDRdgpkleXI

Human vs. AI (our bot briefly led the human player from 6:30 to 7:00!):
https://www.youtube.com/watch?v=Qv4esGWOg7w

Some Notes on Quantum Field Theory

I got some time to read an introduction to Quantum Field Theory, mostly from here: http://hepwww.rl.ac.uk/hepsummerschool/Dasgupta%2008%20Intro%20to%20QFT.pdf

The motivation is to see how physicists formulate complicated objects, so that we could borrow ideas for theories in Deep Learning. In particular, how they combine special relativity and quantum mechanics into one nice formulation.

Outline: 
1. Quantum mechanics, i.e., the Schrödinger equation that describes how a single particle evolves over time, is not invariant under Lorentz transformations, which is a must for any theory compatible with relativity (special or general).
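
Concretely (standard textbook equations, written in units hbar = c = 1): the Schrödinger equation is first order in time but second order in space, so time and space cannot mix under a Lorentz transformation; a Lorentz-invariant wave equation such as Klein-Gordon treats them symmetrically.

    % Schrodinger equation: first order in t, second order in x
    i \partial_t \psi = -\frac{1}{2m} \nabla^2 \psi + V \psi
    % Klein-Gordon equation: t and x enter symmetrically (Lorentz invariant)
    (\partial_t^2 - \nabla^2 + m^2)\, \phi = 0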

2. There were attempts to modify the Schrödinger equation for this. They include the famous Dirac equation, which makes the wave function 4-dimensional so that it conforms to Lorentz transformations, but ends up with eigenstates of unbounded negative eigenvalues (or "negative energies"). That is why Dirac came up with the concept of the "Dirac sea": assuming all the negative-energy states are already occupied, a particle with positive energy cannot fall into them, by the Pauli exclusion principle.
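
For reference (again standard material, hbar = c = 1):

    % Dirac equation: first order in both time and space; the 4x4 gamma
    % matrices act on a 4-component spinor psi
    (i \gamma^\mu \partial_\mu - m)\, \psi = 0
    % Its free solutions come in two branches E = ±sqrt(p^2 + m^2);
    % the negative branch is unbounded below, hence the Dirac sea.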

3. QFT solves the problem by considering a field theory rather than a particle theory, and by using the Heisenberg picture (operators evolve) rather than the Schrödinger picture (state vectors evolve). This yields a field of operators changing over spacetime, but the energy of the ground state (the "vacuum state") comes out infinite.
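
In formulas, for the standard free scalar field (normalization conventions vary between textbooks, so take the measure factors with a grain of salt):

    % Free scalar field in the Heisenberg picture: an operator at every
    % spacetime point, expanded in annihilation/creation operators a, a†
    \phi(x) = \int \frac{d^3 k}{(2\pi)^3\, 2\omega_k}
              \left( a(k)\, e^{-ik \cdot x} + a^\dagger(k)\, e^{ik \cdot x} \right)
    % Every mode contributes a zero-point energy omega_k / 2, so the
    % vacuum energy diverges:
    H \sim \int d^3 k \; \omega_k \left( a^\dagger(k)\, a(k) + \tfrac{1}{2} \right)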

4. Infinite energy of the ground state is really bad. Renormalization then follows: assume the energy of the ground state to be zero (!!) and only compute the energy difference between an excited state and the ground state. This gives sensible answers. Note that this is not the "renormalization group".
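
The standard way to implement this subtraction is normal ordering (moving all annihilation operators to the right), which drops the divergent zero-point term:

    % Normal-ordered Hamiltonian: the vacuum now has zero energy
    :H: \;\sim\; \int d^3 k \; \omega_k \, a^\dagger(k)\, a(k),
    \qquad :H:\, |0\rangle = 0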

5. The good thing about QFT is that it can also model particle interactions (e.g., two particles colliding) once we add an interaction term to the Lagrangian. The Hilbert space is so vast that one unit vector can represent any state, regardless of the number of particles. So we use one unit vector to represent the incoming particles and one to represent the outgoing particles, and the "interaction" is basically a unitary operator that transforms one into the other. The "cross section" (the probability that an interaction happens) depends on their inner product, and thus on the unitary operator. Since the cross section is measurable, the theory can finally be tested against experiment.
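
As a standard example (phi^4 theory; my choice of illustration):

    % Free Lagrangian plus an interaction term (phi^4 theory)
    \mathcal{L} = \tfrac{1}{2} (\partial_\mu \phi)^2
                  - \tfrac{1}{2} m^2 \phi^2 - \frac{\lambda}{4!} \phi^4
    % Scattering amplitude = inner product of in and out states through
    % the unitary S operator; the cross section goes as |amplitude|^2
    \mathcal{A}_{fi} = \langle f | S | i \rangle, \qquad
    \sigma \propto |\mathcal{A}_{fi}|^2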

6. The remainder of the introduction focuses on computation. Computing the cross section is hard, so only Lagrangians of specific forms are considered. As in signal processing, the overall cross section is an integral over Green functions (the response to a delta impulse). Depending on the structure of the Lagrangian, the Green function may factorize (here I see a bit of a connection with graphical models). Using the Feynman rules, one can write down the Green function directly from the structure of the Lagrangian.
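
The basic building block is the free (Feynman) propagator, the Green function of the free equation of motion; the Feynman rules stitch copies of it together according to the interaction terms:

    % Green function of the free Klein-Gordon operator ...
    (\partial^2 + m^2)\, G(x - y) = -i\, \delta^4(x - y)
    % ... which in momentum space is the familiar Feynman propagator
    G(p) = \frac{i}{p^2 - m^2 + i\epsilon}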

Some thoughts (I could be very wrong, since I am really not an expert in this direction):

1. Physicists will add extra dimensions if the algebra does not work out, since, by group representation theory, any fancy group operation can be embedded into a matrix space with a sufficient number of dimensions. That is probably the reason why the world is said to be 11-dimensional (maybe more, if new physical phenomena are discovered). So I would not think too much about the specific number.

2. "Vacuum is nontrivial" is a construction of the theory. By representing all interesting states as the consequence of the (creation and annihilation) operators acting on the "vacuum state" |0>, the theory puts all the parameters there. From that point of view, of course vacuum is nontrivial. Using empirical evidence we could then determine the parameters in the Lagrangian, and thus determine the characteristics of vacuum. 

Points 1 and 2 are often misused in popular science. Each idea is clear in its context but highly ambiguous when taken in isolation.