Resources

- Policy gradient methods for robotics
- arg min blog: a blog of minimum value

Notation

| variable | definition | name |
|---|---|---|
| \(\tau\) | \(s_0, u_0, \dots, s_{H-1}, u_{H-1}\) | trajectory |
| \(p(s_{t+1}\mid s_t, u_t)\) | \(\mathbb{R}\) | state transition (dynamics) model |
| \(\pi_{\theta}(u_t\mid s_t)\) | \(\mathbb{R}\) | policy |
| \(P_{\theta}(\tau)\) | \(p(s_0)\prod_{t=0}^{H-1} p(s_{t+1}\mid s_t, u_t)\, \pi_{\theta}(u_t\mid s_t)\) | probability of a trajectory |
| \(R(\tau)\) | \(\sum_{t=0}^{H-1} r(s_t, u_t)\) | utility of a trajectory |
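As a concrete illustration of the notation, here is a minimal sketch assuming per-step rewards and log-probabilities from a single rollout of horizon \(H\) are already available; the array names and random values are placeholders, not part of the text above.

```python
import numpy as np

# Minimal sketch of the notation above, assuming one rollout of horizon H.
# All arrays are placeholders filled with random values for illustration.
H = 5
rewards = np.random.randn(H)        # r(s_t, u_t) for t = 0..H-1
log_pi = np.random.randn(H)         # log pi_theta(u_t | s_t)
log_dynamics = np.random.randn(H)   # log p(s_{t+1} | s_t, u_t)
log_p_s0 = np.random.randn()        # log p(s_0)

# R(tau): utility (return) of the trajectory.
R_tau = rewards.sum()

# log P_theta(tau): initial-state term + dynamics terms + policy terms.
log_P_tau = log_p_s0 + log_dynamics.sum() + log_pi.sum()
```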

Policy gradient likelihood ratio

We seek to maximise the expected reward:

\[\begin{align} J(\theta) &= \mathbb{E}_{\tau \sim P_{\theta}}\left[ R(\tau) \right] \\ &= \sum_{\tau} P_{\theta}(\tau) R(\tau) \\ \end{align}\]
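The expectation above already suggests a Monte Carlo view: roll out the policy \(m\) times and average the returns. The sketch below assumes a hypothetical `rollout_return` routine that samples \(\tau \sim P_{\theta}\) and computes \(R(\tau)\); it is not defined in the text.

```python
import numpy as np

# Monte Carlo estimate of J(theta): average return over m sampled trajectories.
# `rollout_return` is a hypothetical stand-in for sampling tau ~ P_theta and
# computing R(tau).
def estimate_J(rollout_return, m=100):
    return float(np.mean([rollout_return() for _ in range(m)]))
```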

We can derive an expression for the gradient of the above objective that is independent of the state transition model \(p(s_{t+1}|s_t, u_t)\) and depends only on the policy: taking the log turns the trajectory probability into a sum, and the initial-state and dynamics terms vanish under \(\nabla_{\theta}\) because they do not depend on \(\theta\).

\[\begin{align} \nabla_{\theta} J(\theta) &= \sum_{\tau} \nabla_{\theta} P_{\theta}(\tau) R(\tau) \\ &= \sum_{\tau} P_{\theta}(\tau) \frac{\nabla_{\theta} P_{\theta}(\tau)}{P_{\theta}(\tau)} R(\tau) \\ &= \sum_{\tau} P_{\theta}(\tau) \nabla_{\theta} \log P_{\theta}(\tau)\, R(\tau) \\ &= \sum_{\tau} P_{\theta}(\tau) \nabla_{\theta} \log \left( p(s_0)\prod_{t=0}^{H-1} p(s_{t+1}|s_t, u_t)\, \pi_{\theta}(u_t|s_t) \right) R(\tau) \\ &= \sum_{\tau} P_{\theta}(\tau) \nabla_{\theta} \left( \log p(s_0) + \sum_{t=0}^{H-1} \left[ \log p(s_{t+1}|s_t, u_t) + \log \pi_{\theta}(u_t|s_t) \right] \right) R(\tau) \\ &= \sum_{\tau} P_{\theta}(\tau) \left(\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(u_t|s_t) \right) R(\tau)\\ &= \mathbb{E}_{\tau \sim P_{\theta}}\left[ \left(\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(u_t|s_t) \right) R(\tau) \right] \end{align}\]
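The only gradient the estimator needs is the score \(\nabla_{\theta} \log \pi_{\theta}(u_t|s_t)\). As a sanity check under an assumed Gaussian policy \(\pi_{\theta}(u|s) = \mathcal{N}(u;\, \theta^\top s,\, \sigma^2)\) (a choice made for this sketch, not taken from the text), the closed-form score can be compared against finite differences:

```python
import numpy as np

# Finite-difference check of the score grad_theta log pi_theta(u|s) for an
# assumed Gaussian policy with mean theta^T s and fixed standard deviation sigma.
rng = np.random.default_rng(0)
S, sigma = 4, 0.5
theta = rng.normal(size=S)
s = rng.normal(size=S)
u = rng.normal()

def log_pi(theta):
    mean = theta @ s
    return -0.5 * ((u - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Closed-form score: (u - theta^T s) / sigma^2 * s
score = (u - theta @ s) / sigma**2 * s

# Numerical gradient by central differences
eps = 1e-6
num = np.array([
    (log_pi(theta + eps * np.eye(S)[i]) - log_pi(theta - eps * np.eye(S)[i])) / (2 * eps)
    for i in range(S)
])

assert np.allclose(score, num, atol=1e-5)
```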

As this is an expectation over trajectories, we can approximate the gradient by sampling \(m\) trajectories under the current policy and the (unknown) state transition model:

\[\begin{align} \nabla_{\theta} J(\theta) &\approx \frac{1}{m} \sum_{i=1}^m \left(\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(u^{(i)}_t|s^{(i)}_t) \right) R(\tau^{(i)}) \end{align}\]
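To make the estimator concrete, here is an illustrative sketch assuming a linear softmax policy over a small discrete action set and a toy environment; the shapes, reward, and dynamics are assumptions made for this example, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A, H, m = 4, 3, 10, 32          # state dim, number of actions, horizon, trajectories
theta = np.zeros((S, A))           # policy parameters

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, u):
    # For pi_theta(u|s) = softmax(theta^T s)_u, the score is outer(s, one_hot(u) - pi).
    pi = softmax(s @ theta)
    one_hot = np.zeros(A)
    one_hot[u] = 1.0
    return np.outer(s, one_hot - pi)

grad_J = np.zeros_like(theta)
for _ in range(m):
    s = rng.normal(size=S)                     # s_0 ~ p(s_0)
    scores, rewards = [], []
    for t in range(H):
        pi = softmax(s @ theta)
        u = rng.choice(A, p=pi)                # u_t ~ pi_theta(.|s_t)
        scores.append(grad_log_pi(theta, s, u))
        rewards.append(-np.abs(s).sum())       # toy reward r(s_t, u_t)
        s = s + rng.normal(size=S)             # toy dynamics p(s_{t+1}|s_t, u_t)
    grad_J += sum(scores) * sum(rewards)       # (sum_t grad log pi) * R(tau)
grad_J /= m                                    # Monte Carlo average over m trajectories
```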

Each trajectory's gradient is weighted by the reward of the full trajectory. A key observation is that the action taken at time \(t\) cannot influence rewards received before time \(t\), so those past-reward terms can be dropped from the weighting:

\[\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{i=1}^m \left(\sum_{t=0}^{H-1} \nabla_{\theta} \log \pi_{\theta}(u^{(i)}_t|s^{(i)}_t) \left[ \sum_{k=t}^{H-1} r(s^{(i)}_k, u^{(i)}_k) \right] \right)\]

This has the effect of reducing the variance of the gradient estimate.
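A minimal sketch of the reward-to-go weighting, assuming per-step rewards and per-step score terms \(\nabla_{\theta} \log \pi_{\theta}(u_t|s_t)\) from one sampled trajectory are already available; the values below are placeholders.

```python
import numpy as np

rewards = np.array([1.0, 0.5, -0.2, 2.0])               # r(s_t, u_t), t = 0..H-1
scores = [np.ones(3) * t for t in range(len(rewards))]  # placeholder grad log pi terms

# reward-to-go: sum_{k=t}^{H-1} r_k, computed with a reverse cumulative sum
reward_to_go = np.cumsum(rewards[::-1])[::-1]

# each score term is now weighted only by rewards from its own time step onward
grad_contrib = sum(g * rtg for g, rtg in zip(scores, reward_to_go))
```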