Read to Fight Monsters: Using RL to teach agents to generalize to new settings

2/13/2020

What the research is:

A grounded reinforcement learning problem called Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, environment dynamics described in a document, and environment observations. We also propose a reinforcement learning approach called txt2π, which models this three-way interaction. In solving RTFM with txt2π, the agent learns complex tasks that require several steps of reasoning and coreference, outperforming state-of-the-art methods such as FiLM. Moreover, by requiring the agent to read, txt2π enables it to generalize to new environments with dynamics not seen during training.

How it works:

To study generalization via reading, we set up RTFM as a game-like scenario in which the agent must use both its own environment observations and a text document that explains the environment dynamics to achieve a set goal. We procedurally generate a large number of unique environment dynamics (including a list of items, such as poisonous monsters and blessed items), associated text descriptions (for instance, “Blessed items are effective against poison monsters”), and goals (for example, “Defeat the order of the forest”). These environment dynamics and corresponding language descriptions differ every episode, so the agent cannot memorize a limited set of dynamics but instead has to systematically generalize through reading. Exposing the agent to a combinatorially large set of dynamics requires it to cross-reference relevant information within the document and from its observations to shape its policy and accomplish the goal.
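The procedural generation described above can be illustrated with a small Python sketch. It is not the RTFM generator itself: the vocabularies, the sentence templates, and the function name `sample_episode` are assumptions made for illustration, and only a few of the words (such as "blessed", "poison", and "order of the forest") come from the examples in this post.

```python
import random

# Illustrative vocabularies. Only some of these words appear in the post;
# the rest are assumptions made for the sake of the example.
MODIFIERS = ["blessed", "fanatical", "shimmering", "grandmaster"]
ELEMENTS = ["poison", "fire", "cold", "lightning"]
MONSTERS = ["goblin", "bat", "wolf", "jaguar"]
TEAMS = ["order of the forest", "rebel enclave", "star alliance"]


def sample_episode(rng):
    """Sample one set of dynamics, a document describing it, and a goal."""
    # Randomly pair each item modifier with the element it defeats.
    elements = ELEMENTS[:]
    rng.shuffle(elements)
    beats = dict(zip(MODIFIERS, elements))

    # Randomly assign each monster type to a team.
    team_of = {monster: rng.choice(TEAMS) for monster in MONSTERS}

    # Pick the target team among teams that actually have a monster.
    target_team = rng.choice(sorted(set(team_of.values())))

    # Describe the sampled dynamics in text, in shuffled order, so the
    # agent cannot rely on sentence position.
    sentences = [
        f"{mod.capitalize()} items are effective against {el} monsters."
        for mod, el in beats.items()
    ]
    sentences += [f"The {m} belongs to the {t}." for m, t in team_of.items()]
    rng.shuffle(sentences)

    return {
        "beats": beats,
        "team_of": team_of,
        "document": " ".join(sentences),
        "goal": f"Defeat the {target_team}",
    }


if __name__ == "__main__":
    episode = sample_episode(random.Random(0))
    print(episode["goal"])
    print(episode["document"])
```

Because the modifier-to-element pairing and the team assignments are resampled every episode, the only reliable way to know which item beats which monster is to read the generated document, which is the property the task is designed to test.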

The txt2π method we propose consists of bidirectional feature-wise linear modulation layers that build codependent representations of the environment, the goal, and the textual document. Unlike previous methods, the attention in each layer of txt2π allows a selective reading of the document during each stage of the reasoning process. In testing, agents trained with this method exhibit complex behavior, such as acquiring the correct items before engaging the correct enemies and avoiding incorrect enemies.
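To make the idea of bidirectional, FiLM-style modulation with attention more concrete, here is a heavily simplified NumPy sketch. It is not the txt2π architecture: the mean pooling, the single dot-product attention step, the parameter names, and the random weights standing in for learned ones are all assumptions chosen to keep the example short.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


def attend(tokens, query):
    """Dot-product attention: weight each document token by its match to the query."""
    weights = softmax(tokens @ query)      # (num_tokens,)
    return weights @ tokens, weights       # summary has shape (text_dim,)


def film(features, conditioning, w_gamma, w_beta):
    """Feature-wise modulation: a per-channel scale and shift."""
    gamma = conditioning @ w_gamma         # (channels,)
    beta = conditioning @ w_beta           # (channels,)
    return gamma * features + beta         # broadcasts over rows


def bidirectional_film(visual, text, params):
    """One simplified layer in which text and observations modulate each other.

    visual: (num_cells, vis_dim) features of the grid observation
    text:   (num_tokens, text_dim) features of the document
    """
    # Summarize the observation by mean pooling (an assumption for brevity).
    visual_summary = visual.mean(axis=0)
    # Use the observation to decide which parts of the document to read.
    query = visual_summary @ params["w_query"]
    text_summary, weights = attend(text, query)

    # Text modulates visual features, and visual features modulate text.
    visual_out = film(visual, text_summary, params["t2v_gamma"], params["t2v_beta"])
    text_out = film(text, visual_summary, params["v2t_gamma"], params["v2t_beta"])
    return visual_out, text_out, weights


def init_params(vis_dim, text_dim, rng):
    """Random weights standing in for learned parameters."""
    shapes = {
        "w_query": (vis_dim, text_dim),
        "t2v_gamma": (text_dim, vis_dim),
        "t2v_beta": (text_dim, vis_dim),
        "v2t_gamma": (vis_dim, text_dim),
        "v2t_beta": (vis_dim, text_dim),
    }
    return {name: rng.normal(scale=0.1, size=dims) for name, dims in shapes.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(36, 8))      # e.g. a 6x6 grid, flattened
    text = rng.normal(size=(20, 16))       # 20 document tokens
    params = init_params(8, 16, rng)
    v_out, t_out, weights = bidirectional_film(visual, text, params)
    print(v_out.shape, t_out.shape, weights.shape)
```

Stacking several such layers, each with its own attention step, would let later layers read different parts of the document depending on what earlier layers have already combined, which is the intuition behind selective reading at each stage of reasoning.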

Key snapshots from a trained policy on one randomly sampled environment. Frame 1 shows the initial world. In frame 4, the agent approaches the “fanatical sword,” which beats the target “fire goblin.” In frame 5, the agent acquires the sword. In frame 10, the agent evades the distractor “poison bat” while chasing the target. In frame 11, the agent engages the target and defeats it, thereby winning the episode. Sprites are used for visualization only; the agent observes cell content as text (shown in white).

Why it matters:

This work suggests that language understanding via reading is a promising way to learn policies that generalize to new environments. Although txt2π outperforms state-of-the-art methods such as FiLM on RTFM, our best models still trail the performance of human players, so there is ample room for improvement in grounded policy learning on complex RTFM problems. Looking ahead, we are interested in exploring how to use supporting evidence in external documentation to reason about plans and induce hierarchical policies.