Bootstrapping methods are more difficult to combine with function approximation (FA) than non-bootstrapping methods are.

What is function approximation?

What is the main cause of a diverging value function?

When can the value function diverge?

The deadly triad (excerpt from Sutton’s NIPS 2015 tutorial slides; a minimal divergence sketch follows the list)

The risk of divergence arises whenever we combine three things:

  1. Function approximation:
    significantly generalizing from large numbers of examples.

  2. Bootstrapping:
    learning value estimates from other value estimates, as in dynamic programming and temporal-difference learning.

  3. Off-policy learning
    learning about a policy from data not due to that policy, as in Q-learning, where we learn about the greedy policy from data with a necessarily more exploratory policy.
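
To make the triad concrete, here is a minimal sketch (not from the slides) of the well-known two-state "w → 2w" example from Sutton & Barto, Ch. 11: one linear weight (function approximation), a TD(0) target (bootstrapping), and updates applied only along one transition rather than under the on-policy distribution (off-policy). The step size and discount below are illustrative choices; any gamma > 0.5 diverges here.

```python
import numpy as np

# Two states with features x=1 and x=2, so estimated values are w and 2*w.
# A single transition goes from the first state to the second with reward 0.
# Off-policy, we repeatedly apply the semi-gradient TD(0) update to this
# transition alone and never correct along the rest of the state distribution.

alpha = 0.1   # step size (illustrative)
gamma = 0.99  # discount factor; divergence needs gamma > 0.5
w = 1.0       # single linear weight; v(s) = w * x(s)

x_from, x_to, reward = 1.0, 2.0, 0.0

for step in range(100):
    td_error = reward + gamma * w * x_to - w * x_from
    w += alpha * td_error * x_from        # semi-gradient TD(0) update
    if step % 20 == 0:
        print(f"step {step:3d}  w = {w:.3e}")

# Each update multiplies w by 1 + alpha * (2*gamma - 1) > 1,
# so the weight grows without bound: the deadly triad in action.
```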

Based on the above, the following should converge (always?):

  • On-policy learning with any form of bootstrapping, such as semi-gradient TD(0) with linear function approximation (see the sketch below)
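
For contrast, a minimal sketch of on-policy semi-gradient TD(0) with linear (one-hot) features on a small random walk. The environment and hyperparameters are illustrative assumptions; with on-policy sampling and linear FA this combination is known to converge (Tsitsiklis & Van Roy, 1997).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 5-state random walk: states 0..4, start in the middle,
# terminate off either end, reward +1 only when exiting to the right.
n_states = 5
alpha, gamma = 0.1, 1.0
w = np.zeros(n_states)              # linear weights over one-hot features

def features(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

for episode in range(2000):
    s = n_states // 2
    while True:
        a = rng.choice([-1, +1])            # behavior = target policy (on-policy)
        s_next = s + a
        if s_next < 0:                      # terminate left, reward 0
            r, v_next = 0.0, 0.0
        elif s_next >= n_states:            # terminate right, reward +1
            r, v_next = 1.0, 0.0
        else:
            r, v_next = 0.0, w @ features(s_next)
        td_error = r + gamma * v_next - w @ features(s)
        w += alpha * td_error * features(s)  # semi-gradient TD(0) update
        if s_next < 0 or s_next >= n_states:
            break
        s = s_next

print(np.round(w, 2))   # roughly approaches the true values [1/6, 2/6, 3/6, 4/6, 5/6]
```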

Off/On-policy

  • Value Iteration: off-policy
  • Q-learning: off-policy (contrasted with SARSA in the sketch below)
  • Policy Iteration: on-policy
  • SARSA: on-policy
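
The on/off-policy distinction shows up directly in the update targets: Q-learning bootstraps from the greedy action, regardless of what the (more exploratory) behavior policy will actually do, while SARSA bootstraps from the action the behavior policy actually took. A minimal sketch of the two tabular update rules; the Q-table layout and the epsilon-greedy helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    # Behavior policy used by both algorithms (illustrative helper).
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the target uses the greedy action in s_next,
    # not necessarily the action the behavior policy will take.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: the target uses a_next, the action actually selected
    # by the behavior policy in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```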