Wednesday, April 14, 2021

Research advice and tips for PhD study

Here is a short note about a subset of the questions raised in the AIStats'21 mentorship session, along with my answers. They cover various pieces of research advice and tips for PhD study.

Disclaimer: This is far from a complete list, just a summary of the mentorship session. If I have time, I will try to make it more comprehensive later.

How to pick a good research topic?

Do not pick topics that are super hot. A PhD is about working on one research direction persistently, digging deep, and becoming an expert in that particular direction. Working on a hot topic can be quite frustrating, since new papers keep coming out and the performance you currently have may fall behind.

At the beginning of a PhD, the most important thing is to build confidence. Publishing the first paper means a lot: it gives confidence and enables further collaborations with other people. So there is a trade-off between aiming high and reaching the first milestones. On the other hand, procrastinating until a few years later leads to much higher pressure (e.g., from the advisor, the family and the peers), and the PhD experience is likely to be not great.

Also, pick topics that are suitable for you. Think about your existing skills and the skills required for a given research direction. It is great if you can pick a direction that you really enjoy (and constantly get intrinsic rewards from). If that's hard, pick a direction that is useful and can potentially make a good impact.

You can also try to negotiate with your advisor to pick a topic of mutual interest (a win-win situation). Take myself as an example: I was technically strong but very weak in presentation and writing, due to an introverted personality and the language barrier. So I ended up with a direction different from my advisor's, but received regular coaching on presentation and writing. I spent two full months preparing my first presentation (a CVPR oral talk), memorizing every word (otherwise I would not have been able to utter a single word in front of a huge audience). After that, things became much better and easier. From my advisor's point of view, the papers made good milestones for grants, so he was also happy with it.

You don't need to make that decision alone. You can ask senior PhD students, other professors, etc. Make connections with other people and listen to their advice and thoughts.

How to learn new directions with no prior experience?

Read papers, read surveys, and keep reading them. Don't memorize them; think about the underlying logic chains and how the story flows.

One tip is to "guess" or "predict" what the paper will say next, rather than reading passively. This forces you to think about the logic (like self-supervised learning) and at the same time substantially accelerates your paper reading later. It also naturally leads to novel research ideas once you have read enough papers and summarized them well.

Another tip is to write a 1-2 sentence summary of each paper you read (or tell other people what the paper is about). Once you realize you cannot summarize it well, it is time to go back and re-read the paper. Reading alone doesn't count, since the understanding may not be there.

At the beginning, the summaries can be quite low-quality, overly focused on small details, etc. But things will get substantially better if you keep this habit long enough.

You can also ask other people who are experts in the field. Think about what you already know, and ask questions proactively. For example, ask about the top 3 works in the field and why they are impactful.

Note that the meta-skill of "ramping up on a new direction quickly" can be super important for the rest of your life. The research direction you worked on in your PhD can become obsolete very soon, but the learned meta-skill enables you to pick up new things very quickly and adapt to new situations. This is one of the big things I learned from my PhD life.

I personally switched from physics-based vision (my PhD topic) to deep reinforcement learning after I joined FAIR. It was a very nice experience, and a lot of things needed to be done in a proactive and self-motivated manner. Hands-on capability (e.g., strong coding) definitely helped when there was no PyTorch/TensorFlow and no existing codebase to reference. Math skills are also very important for nailing down the gist of an idea and reasoning about things clearly.

Always remember that freedom is not granted but earned, if you act fast enough.

How to make full use of the focus time? What becomes different once you enter industry?

How to make efficient use of focus time is another important meta-skill to be learned during a PhD. Note that it is very hard to do at the beginning. People who are motivated to do a PhD are often lured by the big, open problems in the field and think they can solve them in a few years. Such a high aim can sometimes have negative, even detrimental, effects on their PhD life.

I personally experienced a lot of "wasted" thinking at the beginning of my PhD. E.g., I would spend 40 hours thinking about useless stuff and another 40 hours doing the actual work that was immediate or requested by my advisor. This was a very frustrating experience.

Fortunately, things get better over time. Here are some tips. Break a demanding goal into small steps until they are reachable. List possible milestones and arrange them properly. Formulate (or nail down) what you have been thinking about clearly on a piece of paper. Give yourself time to rest and refresh, etc. To be honest, many interesting thoughts hit me in the shower.

Sometimes reading the literature will also trigger new thinking. I have experienced this many times, in particular when what I predicted for the rest of a paper differed from what was actually written there. Such "what if" thinking naturally leads to nice insights, and is quite efficient by itself.

The reason such "efficient focused thinking" is important is that after PhD graduation, you won't have the luxury of spending 40 hours thinking about useless things anymore. You will need to take care of your family, deal with lots of regular errands, and be interrupted constantly by emails, calls, messages, etc. If you have already learned how to think efficiently, it will bring a tremendous advantage to your future career.

Saturday, July 8, 2017

Open source ELF

Finally we open-sourced ELF, an extensive, lightweight and flexible framework for game research.

Link to repository: https://github.com/facebookresearch/ELF

Arxiv: https://arxiv.org/abs/1707.01067

Facebook engineering blogpost: https://code.facebook.com/posts/132985767285406/introducing-elf-an-extensive-lightweight-and-flexible-platform-for-game-research/

Game replay: https://youtu.be/YgZyWobkqfw

We have been working on this framework for about half a year. The goal of ELF is to "build an affordable framework for game and reinforcement learning research". ELF gives an end-to-end solution, from game simulators to training paradigms. It reduces the resources required (CPU, GPU and memory) and at the same time increases the readability of the code via better engineering design. Furthermore, it provides a miniature and fast real-time strategy (RTS) engine, on which future features can be built.

In terms of contributions, I (Yuandong Tian) led the framework design and wrote the code of the ELF framework and the RTS engine; Qucheng Gong built two extensions (Capture the Flag and Tower Defense) and the web-based visualization; Yuxin Wu plugged the Atari emulator into the framework and ran the speed tests; Wendy (Wenling) Shang added LeakyReLU and Batch Normalization to improve the performance of the trained AI. Finally, Larry Zitnick gave a lot of important suggestions.

The design of ELF changed multiple times and has now converged to the current, reasonable version. The main idea is to use C++ multi-threading for concurrent game execution, so that in each iteration the Python interface always receives a batch of game states of a predefined size, in randomized order. This batch can be sent directly to the reinforcement learning algorithm for forward/backward passes. Compared to existing frameworks that wrap a single game instance in one Python interface, this design does not require a custom-made Python framework for inter-process communication, which makes the code much cleaner, more readable, and faster.
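To make the batching idea concrete, here is a minimal pure-Python sketch of the producer/consumer pattern (ELF does this hand-off on the C++ side for speed; all names here are illustrative, not the actual ELF API):

import queue
import random
import threading

BATCH_SIZE = 4
state_q = queue.Queue()  # game threads push their states here

def game_thread(game_id, num_steps=10):
    # Each thread simulates one game: push the current state, then
    # block until the model replies with an action.
    for _ in range(num_steps):
        reply = queue.Queue(maxsize=1)
        state_q.put({"id": game_id, "s": random.random(), "reply": reply})
        _ = reply.get()  # block until the model's decision arrives

threads = [threading.Thread(target=game_thread, args=(i,)) for i in range(8)]
for t in threads:
    t.start()

for _ in range(8 * 10 // BATCH_SIZE):  # training loop (toy version)
    batch = [state_q.get() for _ in range(BATCH_SIZE)]  # wait for a full batch
    actions = [0 for _ in batch]  # stand-in for one batched forward pass
    for item, a in zip(batch, actions):
        item["reply"].put(a)

for t in threads:
    t.join()

The point is that the model always sees a full batch, so a single forward pass serves many concurrently running games at once.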

In ELF, we use PyTorch as the training backend. A Python dictionary is used as the interface between models and algorithms. Any algorithm or model reads the entries it needs via predefined keys, and returns the key-value pairs it produces. This decouples training algorithms from models and enhances ELF's flexibility and readability. This design has another benefit: different models can be used depending on the game instance and its current game state. This unifies many paradigms that require topology changes between game instances and models, such as Monte-Carlo Tree Search (MCTS) and Self-Play. People who have tried DarkForest might have been annoyed at having to invoke two separate programs, one for MCTS and the other for policy/value evaluation on the GPU. In ELF, one is sufficient.
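A rough sketch of this dictionary convention (the keys, shapes and class names below are illustrative, not the actual ELF interface):

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, obs_dim=64, num_actions=9):
        super().__init__()
        self.trunk = nn.Linear(obs_dim, 128)
        self.pi_head = nn.Linear(128, num_actions)
        self.v_head = nn.Linear(128, 1)

    def forward(self, batch):
        # Read only the keys this model needs from the shared dict...
        h = torch.relu(self.trunk(batch["s"]))
        # ...and return the key-value pairs it produces.
        return {"pi": torch.softmax(self.pi_head(h), dim=1),
                "V": self.v_head(h)}

batch = {"s": torch.randn(128, 64), "last_r": torch.zeros(128)}
out = PolicyValueNet()(batch)  # the algorithm only touches out["pi"], out["V"]

Because the model and the algorithm only agree on keys, either side can be swapped out without touching the other.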

Furthermore, ELF is not limited to game environments. Any environment/simulator with a C/C++ interface, e.g., a physics engine or a discrete/continuous control system, can be incorporated into ELF. The framework automatically handles the synchronization and returns batched states. In this sense, ELF is very general.

Within ELF, we have implemented a miniature RTS engine and three concrete environments (MiniRTS, Capture the Flag and Tower Defense). MiniRTS is a miniature RTS game that captures the key dynamics of the genre: gathering resources, building troops and buildings, defending against and attacking the enemy, fog-of-war (the region outside the player's sight is unknown), etc. Units in MiniRTS move continuously on the map, with collision checking and path planning.

Because it was built from scratch, the RTS engine is customized to facilitate deep learning and reinforcement learning research. MiniRTS, built in two weeks, is not as complicated as commercial games built by large teams over several months. However, it runs fast, uses minimal resources, and is extensible. For example, MiniRTS runs at 40K FPS per core on a MacBook, and it takes only 1.5 minutes to evaluate 10k games. Finally, we also provide an interactive web-based visualization tool that can be used to analyze game replays and serve as a play interface between humans and the AI. In comparison, using commercial games as a research platform would require far more resources, with limited customizability.

On the three concrete games, we train an actor-critic model with an off-policy extension. When training on MiniRTS, we only use the reward that comes from the final outcome of the game. No reward shaping is used, e.g., providing auxiliary rewards when a tank is built or minerals are gathered. The action space in RTS games is generally exponential, so we discretize it into 9 global, strategic actions (e.g., build workers/troops, all attack/defend, etc.) so that existing methods can be used. The model we trained on MiniRTS beats our rule-based AI 70% of the time.
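For reference, here is a bare-bones sketch of an advantage actor-critic update over such a discrete set of strategic actions (a generic A2C step for illustration, not ELF's exact off-policy variant):

import torch

def actor_critic_loss(pi, V, actions, returns,
                      value_coef=0.5, entropy_coef=0.01):
    # pi: (B, 9) action probabilities over the 9 strategic actions
    # V: (B, 1) value estimates; returns: (B,) final-outcome returns
    dist = torch.distributions.Categorical(probs=pi)
    adv = returns - V.squeeze(1).detach()       # advantage estimate
    policy_loss = -(dist.log_prob(actions) * adv).mean()
    value_loss = (returns - V.squeeze(1)).pow(2).mean()
    entropy = dist.entropy().mean()             # encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy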

For people who are interested in game and reinforcement learning research (as a shameless advertisement), I would strongly recommend this framework.

Enjoy!

Thursday, October 13, 2016

Notes on DeepMind's 3rd Nature paper

DeepMind recently published their 3rd paper in Nature, "Hybrid computing using a neural network with dynamic external memory". They devise a recurrent network structure (a deep LSTM) that iteratively sends new read/write commands to an external memory, as well as the action output, based on the previous reads from the memory and the current input. They call it the DNC (Differentiable Neural Computer). The hope is that the network can perform reasoning based on the given information. They test their model on bAbI reasoning tasks, network traversal/shortest-path prediction, deduction of relationships in a family tree, and block puzzle games, showing that its performance is much better than an LSTM without external memory.
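Schematically, the controller-plus-memory loop looks like the sketch below (a toy version with content-based addressing only; the real DNC has far more machinery, and all names here are made up for illustration):

import torch
import torch.nn as nn

class TinyMemoryController(nn.Module):
    # Toy DNC-flavored loop: an LSTM cell emits a read key and a write
    # vector at each step; reads use content-based (cosine) attention.
    def __init__(self, in_dim=8, hid=32, slots=16, width=8):
        super().__init__()
        self.lstm = nn.LSTMCell(in_dim + width, hid)
        self.read_key = nn.Linear(hid, width)
        self.write_vec = nn.Linear(hid, width)
        self.out = nn.Linear(hid, in_dim)
        self.hid, self.slots, self.width = hid, slots, width

    def forward(self, xs):  # xs: (T, B, in_dim)
        B = xs.shape[1]
        memory = torch.zeros(B, self.slots, self.width)
        h, c = torch.zeros(B, self.hid), torch.zeros(B, self.hid)
        read = torch.zeros(B, self.width)
        outputs = []
        for x in xs:
            h, c = self.lstm(torch.cat([x, read], dim=1), (h, c))
            # Content-based addressing: weight each memory slot by
            # cosine similarity with the emitted read key.
            w = torch.softmax(torch.cosine_similarity(
                memory, self.read_key(h).unsqueeze(1), dim=2), dim=1)
            read = (w.unsqueeze(2) * memory).sum(dim=1)
            # Soft write: add the write vector at the attended slots
            # (the real DNC uses separate read/write heads and more).
            memory = memory + w.unsqueeze(2) * self.write_vec(h).unsqueeze(1)
            outputs.append(self.out(h))
        return torch.stack(outputs)

ys = TinyMemoryController()(torch.randn(5, 2, 8))  # (T, B, in_dim)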

Here are some comments:

1. Overall, it seems that the DNC implicitly learns a heuristic function for search-based reasoning. As they mention in the paper: "Visualization of a DNC trained on shortest-path suggests that it progressively explored the links radiating out from the start and end nodes until a connecting path was found (Supplementary Video 1)." We can also see this behavior in the London Underground task (Fig. 3). This can be efficient for experiments with a small search space, but is not necessarily a good strategy for real problems.

2. There seem to be lots of manually tunable knobs in the network. The network's job is to emit the next set of operations on the external memory, and there are many types of such operations, with different kinds of attention mechanisms (content-based attention, attention following the order of consecutive writes, and a "usage" mechanism built into reading and writing). It is not clear which components are more important. Ideally, there should be a more automatic or principled approach.

3. Interesting details:
(1) Training a sequential structured prediction model directly on the ground-truth answers is not necessarily good, since when the prediction deviates from the ground truth, the model may fail easily. In this paper, they use DAgger [1] for structured prediction, which blends the ground-truth distribution with the current predicted distribution. This makes the prediction robust (a minimal sketch of this mixing appears after these notes).

(2) For the block puzzle games, they use an actor-critic-like model. In this scenario, the DNC outputs the policy and value functions, conditioned on the configuration of the game, which is taken as input at the beginning. This coincides with our experience with our Doom AI: the actor-critic model converges faster than Q-learning.

(3) Curriculum training (i.e., training the model on easy tasks first) plays an important role. This agrees with our experience training our Doom AI (we will release the paper soon).
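Here is the mixing trick from (1) in a minimal form (a scheduled-sampling-style sketch of the DAgger idea; beta is the mixing coefficient, typically annealed toward 0 during training, and all names are illustrative):

import torch

def next_decoder_input(ground_truth_token, predicted_dist, beta):
    # With probability beta, feed the ground-truth token to the next
    # step; otherwise, sample from the model's current prediction so
    # it learns to recover from its own mistakes.
    if torch.rand(1).item() < beta:
        return ground_truth_token
    return torch.distributions.Categorical(probs=predicted_dist).sample()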

References:
[1] Ross et al., "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning", AISTATS 2011.

Tuesday, October 11, 2016

Doom AI competition

Here I post some old news in order to warm up this newly opened English blog.

Our team (named "F1": my intern Yuxin Wu and me) won Track 1 of the Doom AI Competition! Please check the results here:

http://vizdoom.cs.put.edu.pl/competition-cig-2016/results

Some interesting videos here:

Track 1:
https://www.youtube.com/watch?v=94EPSjQH38Y

Track 2:
https://www.youtube.com/watch?v=tDRdgpkleXI

Human vs. AI (our bot briefly led over the human from 6:30 to 7:00!):
https://www.youtube.com/watch?v=Qv4esGWOg7w

Some Notes on Quantum Field Theory

I got some time to read an introduction to Quantum Field Theory, mostly from here: (http://hepwww.rl.ac.uk/hepsummerschool/Dasgupta%2008%20Intro%20to%20QFT.pdf)

The motivation here is to understand how physicists formulate complicated objects, so that we could borrow ideas for theories in Deep Learning. In particular, how they combine special relativity and quantum mechanics into one nice formulation.

Outline: 
1. Quantum mechanics, i.e., the Schrödinger equation that describes how a single particle evolves over time, is not invariant under Lorentz transformations, which is a must for any theory compatible with relativity (special/general). (The relevant equations for this outline are collected below.)

2. There have been attempts to modify the Schrödinger equation to fix this. These include the famous Dirac equation, which makes the wave function 4-dimensional so that it conforms to Lorentz transformations, but then ends up with eigenstates that have unbounded negative eigenvalues (or "negative energies"). That's why Dirac came up with the concept of the "Dirac Sea": assuming all the negative states are already occupied, a particle with positive energy will not fall into them, by Pauli's exclusion principle.

3. QFT solves the problem by considering a field theory rather than a particle theory, and by using the Heisenberg picture (the operators evolve) rather than the Schrödinger picture (the state vectors evolve). This yields a field of changing operators over spacetime, and the energy of the ground state (the "vacuum state") is infinite.

4. Infinite energy in the ground state is really bad. Renormalization then follows: assume the energy of the ground state to be zero (!!) and only compute the energy difference between an excited state and the ground state. This gives sensible solutions. Note that this is not the "renormalization group".

5. The good thing about QFT is that it can also model particle interactions (e.g., two particles colliding) once we add an interaction term to the Lagrangian. The Hilbert space is so vast that one or a few unit vectors can represent any state, regardless of the number of particles. So we just use one unit vector to represent the input particles and one to represent the output particles, and the "interaction" is basically a unitary operator that transforms one into the other. The "cross section" (the probability that an interaction happens) depends on their inner product, and thus on that unitary operator. Since the cross section is measurable, the theory is ultimately testable.

6. The remainder of the introduction focuses on computation. Computing the cross section is hard, so only Lagrangians of specific forms are considered. As in signal processing, the overall cross section is an integral over Green's functions (the response function to a delta impulse). Depending on the structure of the Lagrangian, the Green's function may factorize (here I see a bit of a connection with graphical models). Using the Feynman rules, one can write down the Green's function from the structure of the Lagrangian.
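For reference, the standard textbook equations behind points 1, 2 and 5 above (in natural units; nothing here is derived in this note):

\[ i\,\partial_t \psi = -\frac{1}{2m}\nabla^2 \psi \qquad \text{(free Schrödinger equation: first order in time, second order in space, hence not Lorentz covariant)} \]

\[ (i\gamma^\mu \partial_\mu - m)\,\psi = 0 \qquad \text{(Dirac equation: } \psi \text{ is a 4-component spinor and } \gamma^\mu \text{ are the } 4\times 4 \text{ gamma matrices)} \]

\[ P_{i \to f} = |\langle f | S | i \rangle|^2 \qquad \text{(transition probability from the S-matrix, from which the cross section is computed)} \]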

Some thoughts (I could be very wrong, since I am really not an expert in this direction):

1. Physicists will add extra dimensions if the algebra does not work out, since by group representation theory, any fancy group operation can be embedded into a matrix space with a sufficient number of dimensions. That's probably the reason why the world is said to be 11-dimensional (maybe more, if new physical phenomena are discovered). So I won't think too much about the specific number.

2. "Vacuum is nontrivial" is a construction of the theory. By representing all interesting states as the consequence of the (creation and annihilation) operators acting on the "vacuum state" |0>, the theory puts all the parameters there. From that point of view, of course vacuum is nontrivial. Using empirical evidence we could then determine the parameters in the Lagrangian, and thus determine the characteristics of vacuum. 

Points 1 and 2 are often misused in popular science. The ideas are clear in context but highly ambiguous when taken separately.