Carlos Florensa - Reverse Curriculum Generation for Reinforcement Learning (2017)

Created: November 4, 2017 / Updated: February 6, 2021 / Status: finished / 3 min read (~402 words)

  • Given a start and end state, how is reversing the two actually any better?
    • It looks like this is basically creating some sort of field that indicates the direction to take from the current position to the end position
      • It would mean this method needs the ability to generate states starting from the end, which is not possible in video games, for instance
  • How is the basic RL approach different from A*?

  • RL applied to the robotics domain, where one can start the agent from the end (goal) state and have it learn its way backwards until it can reach the goal from the original start states

  • We propose a method to learn tasks without requiring any prior knowledge other than obtaining a single state in which the task is achieved. The robot is trained in "reverse", gradually learning to reach the goal from a set of start states increasingly far from the goal

  • In our work, we avoid all reward engineering or use of demonstrations by exploiting two key insights
    • It's easier to reach the goal from states nearby the goal, or from states nearby where the agent already knows how to reach the goal
    • Applying random actions from one such state leads the agent to new feasible nearby states, from where it is not too much harder to reach the goal (a minimal sketch of this expansion follows after this list)
  • Our approach can be understood as sequentially composing locally stabilizing controllers by growing a tree of stabilized trajectories backwards from the goal state
    • This can be viewed as a "funnel" which takes start states to the goal state via a series of locally valid policies
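
A minimal sketch of the backward expansion described above, under assumptions not stated in these notes: a hypothetical environment exposing `reset_to(state)`, `sample_random_action()`, and a gym-style `step(action)`, plus a hypothetical `policy.train(...)`. It only illustrates the idea of generating new start states by short random walks from states the agent can already solve; it is not the authors' implementation.

```python
import random

def nearby_starts(env, seed_states, num_new_starts, horizon):
    """Generate new feasible start states by taking short random walks
    ("Brownian motion") outwards from states near the goal."""
    new_starts = []
    while len(new_starts) < num_new_starts:
        state = random.choice(seed_states)
        env.reset_to(state)                      # requires arbitrary resets (Assumption 1)
        for _ in range(horizon):
            action = env.sample_random_action()  # uniformly random actions (Assumption 3)
            state, _, done, _ = env.step(action)
            new_starts.append(state)
            if done:
                break
    return new_starts

def reverse_curriculum(env, goal_state, policy, iterations):
    """Grow the set of start states outwards from the single goal state,
    training the policy at each stage to reach the goal from those starts."""
    starts = [goal_state]
    for _ in range(iterations):
        starts = nearby_starts(env, starts, num_new_starts=100, horizon=10)
        policy.train(env, start_states=starts, goal=goal_state)  # hypothetical trainer
    return policy
```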

  • Assumption 1: We can arbitrarily reset the agent into any start state $s_0 \in \mathcal{S}$ at the beginning of all trajectories
  • Assumption 2: At least one state $s^g$ is provided such that $s^g \in \mathcal{S}^g$
  • Assumption 3: The Markov Chain induced by taking uniformly sampled random actions has a communicating class including all start states $\mathcal{S}^0$ and the given goal state $s^g$
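
Assumption 1 is the main practical requirement behind this setup: the environment has to support resetting to an arbitrary state, not only to a fixed initial state distribution. Below is a hypothetical wrapper sketching that interface (none of these names come from the paper; `sim` stands for any simulator whose state can be read and set directly).

```python
class ResettableEnv:
    """Hypothetical wrapper exposing the interface used in the sketch above."""

    def __init__(self, sim):
        self.sim = sim  # assumed simulator with get_state(), set_state(), step()

    def reset_to(self, state):
        # Assumption 1: the agent can be placed into any start state
        # at the beginning of a trajectory.
        self.sim.set_state(state)
        return self.sim.get_state()

    def step(self, action):
        # Gym-style (next_state, reward, done, info) return is assumed.
        return self.sim.step(action)

    def sample_random_action(self):
        # Uniformly random actions, as in the Markov chain of Assumption 3.
        return self.sim.action_space.sample()
```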

  • A limitation of the current approach is that it generates start states that grow from a single goal uniformly outwards, until they cover the original start state distribution $\text{Unif}(\mathcal{S}^0)$

  • Florensa, Carlos, et al. "Reverse Curriculum Generation for Reinforcement Learning." arXiv preprint arXiv:1707.05300 (2017).