Bootstrapping methods are more difficult to combine with function approximation (FA) than non-bootstrapping methods are.
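To see the difference concretely, here is a minimal sketch of the two semi-gradient updates under linear function approximation (the function names and toy features are ours, not from the source): Monte Carlo targets the observed return, while TD(0) bootstraps from the current estimate of the next state's value.

```python
import numpy as np

def mc_update(w, x_s, G, alpha):
    """Monte Carlo: the target is the observed return G -- no bootstrapping."""
    return w + alpha * (G - w @ x_s) * x_s

def td0_update(w, x_s, r, x_next, gamma, alpha):
    """TD(0): the target r + gamma * v(s') reuses the current weights -- bootstrapping."""
    td_target = r + gamma * (w @ x_next)
    return w + alpha * (td_target - w @ x_s) * x_s

# Toy usage with one-hot features for two states (illustrative values only).
w = np.zeros(3)
x_s, x_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
w = td0_update(w, x_s, r=1.0, x_next=x_next, gamma=0.9, alpha=0.1)
```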
What is function approximation?
What is the main cause of a diverging value function?
When can the value function diverge?
The deadly triad (excerpt from Sutton's NIPS 2015 tutorial slides)
The risk of divergence arises whenever we combine three things:
- Function approximation: significantly generalizing from large numbers of examples.
- Bootstrapping: learning value estimates from other value estimates, as in dynamic programming and temporal-difference learning.
- Off-policy learning: learning about a policy from data not due to that policy, as in Q-learning, where we learn about the greedy policy from data with a necessarily more exploratory policy.
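A minimal sketch of the classic two-state example from Sutton & Barto's discussion of off-policy divergence (the step size and discount below are arbitrary choices of ours): two states share a single weight w, with features 1 and 2, so their estimated values are w and 2w. Repeatedly updating only the transition from the first state to the second, as off-policy sampling allows, makes w grow without bound whenever gamma > 0.5.

```python
# Semi-gradient TD(0) applied to only one transition:
# v(s) = 1 * w, v(s') = 2 * w, reward 0.
# Each update multiplies w by (1 + alpha * (2 * gamma - 1)),
# which exceeds 1 whenever gamma > 0.5 -- so w diverges.
alpha, gamma = 0.1, 0.99
w = 1.0
for _ in range(100):
    td_error = 0.0 + gamma * (2 * w) - (1 * w)   # reward 0, bootstrapped target 2w
    w += alpha * td_error * 1                    # feature of the updated state is 1
print(w)                                         # roughly 1.098**100: clearly blowing up
```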
Based on the above, the following should converge (always?):
- On-policy learning with any form of bootstrapping, such as TD(0)
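As an illustration, here is a minimal on-policy semi-gradient TD(0) sketch on a toy 5-state random walk (the MRP and constants are our own example, not from the source). With one-hot (i.e. linear) features and data generated by the very policy being evaluated, the weights settle instead of diverging.

```python
import numpy as np

# Toy random-walk MRP: states 0..4, start in the middle, step left/right with
# equal probability, reward +1 only when exiting on the right.
# One-hot features make this tabular, a special case of linear FA.
n_states, gamma, alpha = 5, 0.9, 0.1
w = np.zeros(n_states)
rng = np.random.default_rng(0)

for episode in range(2000):
    s = 2
    while True:
        s_next = s + rng.choice([-1, 1])          # on-policy: the same random policy generates the data
        r = 1.0 if s_next == n_states else 0.0
        done = s_next < 0 or s_next == n_states
        target = r if done else r + gamma * w[s_next]
        w[s] += alpha * (target - w[s])           # semi-gradient TD(0) update
        if done:
            break
        s = s_next

print(w)   # approaches the true state values rather than diverging
```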
Off-policy vs. on-policy:
- Value Iteration: off-policy
- Q-learning: off-policy
- Policy Iteration: on-policy
- SARSA: on-policy
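The distinction is visible directly in the update targets; a minimal sketch (function names are ours): SARSA evaluates the policy that actually generated the next action, while Q-learning evaluates the greedy policy regardless of how the data was collected.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """a_next was chosen by the behaviour policy itself -> on-policy."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Target uses the greedy action, which the behaviour policy need not take -> off-policy."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```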