Bootstrapping methods are more difficult to combine with function approximation (FA) than non-bootstrapping methods are.

What is function approximation?

What is the main cause of a diverging value function?

When can the value function diverge?

The deadly triad (excerpt from Sutton’s NIPS 2015 tutorial slides; a minimal divergence sketch follows the list)

The risk of divergence arises whenever we combine three things:

  1. Function approximation:
    significantly generalizing from large numbers of examples.

  2. Bootstrapping:
    learning value estimates from other value estimates, as in dynamic programming and temporal-difference learning.

  3. Off-policy learning
    learning about a policy from data not due to that policy, as in Q-learning, where we learn about the greedy policy from data with a necessarily more exploratory policy.
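
To make the triad concrete, here is a minimal sketch (not from the slides) of the well-known two-state "w → 2w" example from Sutton & Barto, Ch. 11: one linear weight (function approximation), a TD(0) target (bootstrapping), and updates applied only along one transition rather than under the on-policy distribution (off-policy). The step size and discount below are illustrative choices; any gamma > 0.5 diverges here.

```python
import numpy as np

# Two states with features x=1 and x=2, so estimated values are w and 2*w.
# A single transition goes from the first state to the second with reward 0.
# Off-policy, we repeatedly apply the semi-gradient TD(0) update to this
# transition alone and never correct along the rest of the state distribution.

alpha = 0.1   # step size (illustrative)
gamma = 0.99  # discount factor; divergence needs gamma > 0.5
w = 1.0       # single linear weight; v(s) = w * x(s)

x_from, x_to, reward = 1.0, 2.0, 0.0

for step in range(100):
    td_error = reward + gamma * w * x_to - w * x_from
    w += alpha * td_error * x_from        # semi-gradient TD(0) update
    if step % 20 == 0:
        print(f"step {step:3d}  w = {w:.3e}")

# Each update multiplies w by 1 + alpha * (2*gamma - 1) > 1,
# so the weight grows without bound: the deadly triad in action.
```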

Based on the above, the following should converge (always?):

  • On-policy learning with any form of bootstrapping, such as semi-gradient TD(0) with linear function approximation (see the sketch below)
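
For contrast, a minimal sketch of on-policy semi-gradient TD(0) with linear (one-hot) features on a small random walk. The environment and hyperparameters are illustrative assumptions; with on-policy sampling and linear FA this combination is known to converge (Tsitsiklis & Van Roy, 1997).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 5-state random walk: states 0..4, start in the middle,
# terminate off either end, reward +1 only when exiting to the right.
n_states = 5
alpha, gamma = 0.1, 1.0
w = np.zeros(n_states)              # linear weights over one-hot features

def features(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

for episode in range(2000):
    s = n_states // 2
    while True:
        a = rng.choice([-1, +1])            # behavior = target policy (on-policy)
        s_next = s + a
        if s_next < 0:                      # terminate left, reward 0
            r, v_next = 0.0, 0.0
        elif s_next >= n_states:            # terminate right, reward +1
            r, v_next = 1.0, 0.0
        else:
            r, v_next = 0.0, w @ features(s_next)
        td_error = r + gamma * v_next - w @ features(s)
        w += alpha * td_error * features(s)  # semi-gradient TD(0) update
        if s_next < 0 or s_next >= n_states:
            break
        s = s_next

print(np.round(w, 2))   # roughly approaches the true values [1/6, 2/6, 3/6, 4/6, 5/6]
```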

Off/On-policy

  • Value Iteration: off-policy
  • Q-learning: off-policy (contrasted with SARSA in the sketch below)
  • Policy Iteration: on-policy
  • SARSA: on-policy
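
The on/off-policy distinction shows up directly in the update targets: Q-learning bootstraps from the greedy action, regardless of what the (more exploratory) behavior policy will actually do, while SARSA bootstraps from the action the behavior policy actually took. A minimal sketch of the two tabular update rules; the Q-table layout and the epsilon-greedy helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    # Behavior policy used by both algorithms (illustrative helper).
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the target uses the greedy action in s_next,
    # not necessarily the action the behavior policy will take.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: the target uses a_next, the action actually selected
    # by the behavior policy in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```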