Trust Region Policy Optimization#

Quick Facts#

  1. TRPO is an on-policy algorithm.

  2. TRPO is an improvement on NPG.

  3. TRPO is an important theoretical basis for CPO.

  4. API documentation is available for TRPO.


TRPO Theorem#

Background#

Trust region policy optimization (TRPO) is an iterative method for optimizing policies in reinforcement learning that ensures monotonic improvements. It works by iteratively finding a local approximation of the objective return and maximizing the approximated function. TRPO guarantees that the new policy is constrained within a trust region relative to the current policy, which is achieved by using KL divergence to measure the distance between the two policies.

TRPO is well-suited for optimizing large nonlinear policies such as neural networks. It builds on the Natural Policy Gradient (NPG) method, which uses the conjugate gradient algorithm to avoid the expensive cost of computing and inverting the Fisher information matrix. Furthermore, TRPO incorporates a line search mechanism to ensure that the updated policy adheres to the predetermined KL divergence constraint.

Problems of NPG

  • It is very difficult to calculate the Hessian matrix directly.

  • The fixed step size introduces errors from the Taylor expansion approximation.

  • Low utilization of sampled data.

Advantages of TRPO

  • Using the conjugate gradient algorithm with Fisher-vector products to compute the search direction without forming the full Hessian.

  • Using a line search to correct the error introduced by the Taylor expansion.

  • Using importance sampling to reuse sampled data.


Performance difference over policies#

In policy optimization, our objective is to ensure that every update leads to a consistent improvement in the expected return. To accomplish this, we usually formulate the equation for expected return in a specific format that is both intuitive and straightforward to manipulate.

(1)#\[J^R(\pi') = J^R(\pi) + \{J^R(\pi') - J^R(\pi)\}\]

To achieve monotonic improvement, we only need to ensure that \(\Delta = J^R(\pi') - J^R(\pi)\) is non-negative at every update.

As shown in NPG, the difference in performance between two policies \(\pi'\) and \(\pi\) can be expressed as:

Theorem 1 (Performance Difference Bound)

(2)#\[ J^R(\pi') = J^R(\pi) + \mathbb{E}_{\tau \sim \pi'}[\sum_{t=0}^{\infty} \gamma^t A^{R}_{\pi}(s_t,a_t)]\]

where this expectation is taken over trajectories \(\tau=(s_0, a_0, s_1, a_1, \cdots)\), and the notation \(\mathbb{E}_{\tau \sim \pi'}[\cdots]\) indicates that actions are sampled from \(\pi'\) to generate \(\tau\).

Theorem 1 is intuitive: the expected discounted return of \(\pi'\) can be viewed as the expected discounted return of \(\pi\) plus an extra advantage of \(\pi'\) over \(\pi\). The latter term measures how much \(\pi'\) improves over \(\pi\), which is the quantity of interest.

Note

We can rewrite Theorem 1 with a sum over states instead of timesteps:

(3)#\[\begin{split} J^R(\pi') &=J^R(\pi)+\sum_{t=0}^{\infty} \sum_s P\left(s_t=s \mid \pi'\right) \sum_a \pi' (a \mid s) \gamma^t A^{R}_{\pi}(s, a) \\ &=J^R(\pi)+\sum_s \sum_{t=0}^{\infty} \gamma^t P\left(s_t=s \mid \pi' \right) \sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a) \\ &=J^R(\pi)+\sum_s d_{\pi'}(s) \sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a)\end{split}\]

This equation implies for any policy \(\pi'\), if it has a nonnegative expected advantage at every state \(s\), i.e., \(\sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a) \geq 0\), it is guaranteed to increase the policy performance \(J^R\), or leave it constant in the case that the expected advantage is zero everywhere. However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation errors, that there will be some states \(s\) in which the expected advantage is negative, that is, \(\sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a)<0\).


Surrogate function for the objective#

Eq.3 requires information about the future state distribution under \(\pi'\), which is usually unknown and difficult to estimate. This complex dependency of \(d_{\pi'}(s)\) on \(\pi'\) makes Eq.3 difficult to optimize directly. Instead, we introduce the following local approximation to \(J^R\):

(4)#\[L_\pi(\pi')=J^R(\pi)+\sum_s d_\pi(s) \sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a)\]

Here we only replace \(d_{\pi'}\) with \(d_\pi\). It has been proved that if the two policies \(\pi'\) and \(\pi\) are close enough, \(L_\pi(\pi')\) is a good local approximation of \(J^R(\pi')\).
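Because both the state distribution \(d_\pi\) and the sampled actions in Eq.4 come from the current policy \(\pi\), the advantage term of \(L_\pi(\pi')\) can be estimated from data collected under \(\pi\) by importance sampling, which is how TRPO reuses sampled data. The following is a minimal sketch of such an estimator, not OmniSafe's implementation; the tensor names logp_new, logp_old, and adv are assumed to hold the new and old log-probabilities of the sampled actions and their advantage estimates.

import torch

def surrogate_advantage(logp_new, logp_old, adv):
    """Monte Carlo estimate of the advantage term of L_pi(pi') in Eq.4.

    States and actions are collected under the old policy pi; the ratio
    pi'(a|s) / pi(a|s) reweights each advantage (importance sampling).
    """
    ratio = torch.exp(logp_new - logp_old)
    return (ratio * adv).mean()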

Corollary 1 (Performance Difference Bound)

Formally, suppose a parameterized policy \(\pi_{\boldsymbol{\theta}}\), where \(\pi_{\boldsymbol{\theta}}(a \mid s)\) is a differentiable function of the parameter vector \({\boldsymbol{\theta}}\), then \(L_\pi\) matches \(J^R\) to first order (see NPG). That is, for any parameter value \({\boldsymbol{\theta}}_0\), we have:

(5)#\[L_{\pi_{{\boldsymbol{\theta}}_0}}\left(\pi_{{\boldsymbol{\theta}}_0}\right)=J^R\left(\pi_{{\boldsymbol{\theta}}_0}\right)\]
(6)#\[\nabla_{\boldsymbol{\theta}} L_{\pi_{{\boldsymbol{\theta}}_0}}\left(\pi_{\boldsymbol{\theta}}\right)|_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}_0}=\left.\nabla_{\boldsymbol{\theta}} J^R\left(\pi_{\boldsymbol{\theta}}\right)\right|_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}_0}\]

Eq.6 implies that a sufficiently small step \(\pi_{{\boldsymbol{\theta}}_0} \rightarrow \pi'\) that improves \(L_{\pi_{{\boldsymbol{\theta}}_0}}\) will also improve \(J^R\), but it does not provide explicit guidance on determining the appropriate step size for policy updates.

To address this issue, Kakade and Langford (see NPG) proposed a policy updating scheme called conservative policy iteration (CPI), which provides an explicit lower bound on the improvement of \(J^R\). To define the conservative policy iteration update, let \(\pi_{\mathrm{old}}\) denote the current policy, and let \(\pi^{*}=\arg \underset{\pi}{\max}\, L_{\pi_{\text {old }}}\left(\pi\right)\). The new policy \(\pi_{\text {new }}\) is defined as the following mixture:

(7)#\[\pi_{\text {new }}(a \mid s)=(1-\alpha) \pi_{\text {old }}(a \mid s)+\alpha \pi^{*}(a \mid s)\]
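Sampling from the mixture policy in Eq.7 simply means flipping an \(\alpha\)-biased coin and acting with either \(\pi_{\text{old}}\) or \(\pi^{*}\). A minimal sketch, assuming each policy maps an observation to a torch distribution (these callables are illustrative, not OmniSafe APIs):

import torch

def sample_mixture_action(obs, pi_old, pi_star, alpha):
    """Draw an action from (1 - alpha) * pi_old + alpha * pi_star (Eq.7)."""
    if torch.rand(1).item() < alpha:
        return pi_star(obs).sample()  # with probability alpha, act with pi*
    return pi_old(obs).sample()       # otherwise, act with pi_old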

Kakade and Langford derived the following lower bound:

(8)#\[\begin{split}J^R\left(\pi_{\text {new }}\right) &\geq L_{\pi_{\text {old }}}\left(\pi_{\text {new }}\right)-\frac{2 \epsilon \gamma}{(1-\gamma)^2} \alpha^2 \\ \text { where } \epsilon &=\max _s\left|\mathbb{E}_{a \sim \pi^{*}(a \mid s)}\left[A^{R}_{\pi}(s, a)\right]\right|\end{split}\]

However, the lower bound in Eq.8 only applies to mixture policies, so it needs to be extended to general policy cases.


Monotonic Improvement Guarantee for General Stochastic Policies#

Based on the theoretical guarantee in Eq.8 for the mixture policy case, TRPO extends the lower bound to general policies by replacing \(\alpha\) with a distance measure between \(\pi\) and \(\pi'\), and changing the constant \(\epsilon\) appropriately. The chosen distance measure is the total variation (TV) divergence, defined by \(D_{TV}(p \| q)=\frac{1}{2} \sum_i \left|p_i-q_i\right|\) for discrete probability distributions \(p, q\). Define \(D_{\mathrm{TV}}^{\max }(\pi, \pi')\) as

(9)#\[D_{\mathrm{TV}}^{\max}(\pi, \pi')=\max_s D_{\mathrm{TV}}\left(\pi\left(\cdot \mid s\right) \| \pi'\left(\cdot \mid s\right)\right)\]
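For a discrete action space, \(D_{\mathrm{TV}}^{\max}\) can be evaluated directly from the two policies' action probabilities. A small illustrative sketch, assuming hypothetical tensors of per-state action probabilities:

import torch

def tv_max(probs_old, probs_new):
    """D_TV^max(pi, pi') over a batch of states.

    probs_old, probs_new: tensors of shape (num_states, num_actions)
    holding pi(a|s) and pi'(a|s) for each state in the batch.
    """
    tv_per_state = 0.5 * (probs_old - probs_new).abs().sum(dim=-1)
    return tv_per_state.max()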

And the new bound is derived by introducing the \(\alpha\)-coupling method.

Theorem 2 (Performance Difference Bound derived by \(\alpha\)-coupling method)

Let \(\alpha=D_{\mathrm{TV}}^{\max }\left(\pi_{\mathrm{old}}, \pi_{\text {new }}\right)\). Then the following bound holds:

(10)#\[\begin{split}J^{R}\left(\pi_{\text {new }}\right) &\geq L_{\pi_{\text {old }}}\left(\pi_{\text {new }}\right)-\frac{4 \epsilon \gamma}{(1-\gamma)^2} \alpha^2 \\ \text { where } \epsilon &=\max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\end{split}\]

The proof extends Kakade and Langford's result. It relies on the fact that random variables drawn from two distributions whose total variation divergence is at most \(\alpha\) can be coupled so that they are equal with probability at least \(1-\alpha\).

Next, we note the following relationship between the total variation divergence and the \(\mathrm{KL}\) divergence: \([D_{\mathrm{TV}}(p \| q)]^2 \leq D_{\mathrm{KL}}(p \| q)\). Let \(D_{\mathrm{KL}}^{\max }(\pi, \pi')=\underset{s}{\max} D_{\mathrm{KL}}(\pi(\cdot|s) \| \pi'(\cdot|s))\). The following bound then follows directly from Theorem 2 :

(11)#\[\begin{split}J^R(\pi') & \geq L_\pi(\pi')-C D_{\mathrm{KL}}^{\max }(\pi, \pi') \\ \quad \text { where } C &=\frac{4 \epsilon \gamma}{(1-\gamma)^2}\end{split}\]
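The relation \([D_{\mathrm{TV}}(p \| q)]^2 \leq D_{\mathrm{KL}}(p \| q)\) used above can be sanity-checked numerically for any pair of categorical distributions; a quick sketch with arbitrary example probabilities:

import torch
from torch.distributions import Categorical

p = Categorical(probs=torch.tensor([0.7, 0.2, 0.1]))
q = Categorical(probs=torch.tensor([0.5, 0.3, 0.2]))

tv = 0.5 * (p.probs - q.probs).abs().sum()       # D_TV(p || q)
kl = torch.distributions.kl.kl_divergence(p, q)  # D_KL(p || q)
assert tv ** 2 <= kl                             # the bound holds for this pair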

TRPO describes an approximate policy iteration scheme based on the policy improvement bound in Eq.11. Note that for now, we assume exact evaluation of the advantage values \(A^{R}_{\pi}\).

It follows from Eq.11 that TRPO is guaranteed to generate a monotonically improving sequence of policies \(J^R\left(\pi_0\right) \leq J^R\left(\pi_1\right) \leq J^R\left(\pi_2\right) \leq \cdots \leq J^R\left(\pi_n\right)\). To see this, let \(M_i(\pi)=L_{\pi_i}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_i, \pi\right)\). Then

(12)#\[\begin{split}J^{R}\left(\pi_{i+1}\right) &\geq M_i\left(\pi_{i+1}\right) \\ J^{R}\left(\pi_i\right)&=M_i\left(\pi_i\right), \text { therefore, } \\ J^{R}\left(\pi_{i+1}\right)-J^{R}\left(\pi_i\right)&\geq M_i\left(\pi_{i+1}\right)-M_i\left(\pi_i\right)\end{split}\]

Thus, by maximizing \(M_i\) at each iteration, we guarantee that the true objective \(J^R\) is non-decreasing.


Practical Implementation#

Approximately Solving the TRPO Update#

So far, we have presented an iterative algorithm with a theoretical guarantee that each new policy monotonically improves on the current one. In practice, however, when we work with parameterized policies \(\pi_{{\boldsymbol{\theta}}}(a \mid s)\), the algorithm as stated does not work well. Plugging in the notation \({\boldsymbol{\theta}}\), the update step becomes maximizing the penalized objective

(13)#\[L_{{\boldsymbol{\theta}}_{old}}({\boldsymbol{\theta}})-C D_{\mathrm{KL}}^{\max }({\boldsymbol{\theta}}_{old}, {\boldsymbol{\theta}})\]

where \(C=\frac{4 \epsilon \gamma}{(1-\gamma)^2}\), and \({\boldsymbol{\theta}}_{old}, {\boldsymbol{\theta}}\) are short for \(\pi_{{\boldsymbol{\theta}}_{old}}, \pi_{{\boldsymbol{\theta}}}\). In practice, the penalty coefficient \(C\) on the KL divergence would produce very small step sizes, making the improvement too conservative. To allow larger steps, TRPO replaces the KL penalty with a fixed KL divergence constraint that bounds the distance between \(\pi_{{\boldsymbol{\theta}}_{old}}\) and \(\pi_{{\boldsymbol{\theta}}}\):

(14)#\[\begin{split}\underset{{\boldsymbol{\theta}}}{\max}\quad &L_{{\boldsymbol{\theta}}_{old}}({\boldsymbol{\theta}}) \\ \text{s.t. } \quad &D_{\mathrm{KL}}^{\max }({\boldsymbol{\theta}}_{old}, {\boldsymbol{\theta}}) \le \delta\end{split}\]

This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to a large number of constraints. Instead, TRPO uses a heuristic approximation that considers the average KL divergence:

(15)#\[\begin{split}\underset{{\boldsymbol{\theta}}}{\max}\quad &L_{{\boldsymbol{\theta}}_{old}}({\boldsymbol{\theta}}) \\ \text{s.t. } \quad &\bar{D}_{\mathrm{KL}}({\boldsymbol{\theta}}_{old}, {\boldsymbol{\theta}}) \le \delta\end{split}\]

where \(\bar{D}_{\mathrm{KL}}({\boldsymbol{\theta}}_1, {\boldsymbol{\theta}}_2):=\mathbb{E}_{s \sim d_{\pi_{{\boldsymbol{\theta}}_1}}}\left[D_{\mathrm{KL}}\left(\pi_{{\boldsymbol{\theta}}_1}(\cdot \mid s) \| \pi_{{\boldsymbol{\theta}}_2}(\cdot \mid s)\right)\right]\). The method TRPO describes involves two steps:

Two Steps For TRPO Update

  • Compute a search direction, using a linear approximation to the objective and quadratic approximation to the constraint.

  • Perform a line search in the specified direction, ensuring both improvement of the nonlinear objective and satisfaction of the nonlinear constraint.
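Before detailing these two steps, it helps to see how the average KL divergence \(\bar{D}_{\mathrm{KL}}\) in Eq.15 is estimated in practice: the expectation over states is replaced by an empirical mean over the states in the sampled batch. A minimal sketch for diagonal Gaussian policies; the tensors of means and standard deviations are hypothetical, not OmniSafe's internal data structures:

import torch
from torch.distributions import Normal

def mean_kl(mu_old, std_old, mu_new, std_new):
    """Empirical estimate of the average KL divergence over a batch of states.

    Every tensor has shape (batch_size, action_dim); the KL of the diagonal
    Gaussians is summed over action dimensions and averaged over states.
    """
    p = Normal(mu_old, std_old)  # old policy pi_theta_old(.|s)
    q = Normal(mu_new, std_new)  # new policy pi_theta(.|s)
    return torch.distributions.kl.kl_divergence(p, q).sum(dim=-1).mean()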

Problems

  • It is prohibitively costly to form the full Hessian matrix.

  • How can we compute the maximal step length such that the KL divergence constraint is satisfied?

  • How can we ensure improvement of the surrogate objective and satisfaction of the KL divergence constraint?

Solutions

  • The conjugate gradient algorithm can approximately find the search direction without forming the full Hessian matrix.

  • The maximal step length can be computed from an intermediate result of the conjugate gradient algorithm.

  • A line search algorithm can be used to meet both requirements.

Computing the Fisher-Vector Product

TRPO approximately computes the search direction by solving the equation \(Hx=g\), where \(g\) is the gradient of the surrogate objective and \(H\) is the Fisher information matrix, i.e., the quadratic approximation to the KL divergence constraint \(\bar{D}_{\mathrm{KL}}\left({\boldsymbol{\theta}}_{\text {old }}, {\boldsymbol{\theta}}\right) \approx \frac{1}{2}\left({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{\text {old }}\right)^T H\left({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{\text {old }}\right)\) with \(H_{i j}=\frac{\partial}{\partial {\boldsymbol{\theta}}_i} \frac{\partial}{\partial {\boldsymbol{\theta}}_j} \bar{D}_{\mathrm{KL}}\left({\boldsymbol{\theta}}_{\text {old }}, {\boldsymbol{\theta}}\right)\). Because it is very costly to compute the full \(H\) or \(H^{-1}\) directly, TRPO uses the conjugate gradient algorithm to approximately solve \(Hx=g\) without forming the full matrix.
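The key trick is that \(Hx\) can be obtained by differentiating twice, without ever materializing \(H\): compute the gradient of \(\bar{D}_{\mathrm{KL}}\), take its inner product with \(x\), and differentiate that scalar again. The same double-backprop pattern appears in trpo._fvp() below; here is a self-contained toy sketch (the quadratic test function and the names are only for illustration):

import torch

def hessian_vector_product(scalar, params, vec):
    """Compute (d^2 scalar / d params^2) @ vec by double backpropagation."""
    grad = torch.autograd.grad(scalar, params, create_graph=True)[0]
    return torch.autograd.grad(torch.dot(grad, vec), params)[0]

# Toy check: f(x) = x^T A x has Hessian 2A (A symmetric), so the product is 2 A v.
A = torch.tensor([[2.0, 0.5], [0.5, 1.0]])
x = torch.zeros(2, requires_grad=True)
v = torch.tensor([1.0, -1.0])
print(hessian_vector_product(x @ A @ x, x, v))  # 2 A v = [3., -1.]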

Computing The Final Update Step

Having computed the search direction \(s\approx H^{-1}g\), TRPO next needs to compute a step length that ensures improvement of the surrogate objective and satisfaction of the KL divergence constraint. First, TRPO computes the maximal step length \(\beta\) such that \({\boldsymbol{\theta}} + \beta s\) satisfies the KL divergence constraint. To do this, set \(\delta=\bar{D}_{\mathrm{KL}} \approx \frac{1}{2}(\beta s)^T H(\beta s)=\frac{1}{2} \beta^2 s^T H s\), which gives \(\beta=\sqrt{2 \delta / s^T H s}\).

Hint

The term \(s^T H s\) is an intermediate result produced by the conjugate gradient algorithm.
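Putting the two previous paragraphs together, the maximal step length takes only a couple of lines to compute. This is an illustrative sketch, not OmniSafe's code; hvp stands for any function that returns \(Hv\), such as the Fisher-vector product described above:

import torch

def max_step_length(step_dir, hvp, target_kl, eps=1e-8):
    """beta = sqrt(2 * delta / s^T H s): the largest step with
    0.5 * (beta * s)^T H (beta * s) <= delta."""
    shs = torch.dot(step_dir, hvp(step_dir))  # s^T H s, also a CG by-product
    return torch.sqrt(2.0 * target_kl / (shs + eps))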

To meet the constraint, TRPO uses a line search to compute the final step length. Specifically, TRPO performs the line search on the objective \(L_{{\boldsymbol{\theta}}_{\text {old }}}({\boldsymbol{\theta}})-\mathcal{X}\left[\bar{D}_{\text {KL }}\left({\boldsymbol{\theta}}_{\text {old }}, {\boldsymbol{\theta}}\right) \leq \delta\right]\), where \(\mathcal{X}[\ldots]\) equals \(0\) when its argument is true and \(+\infty\) when it is false. Starting from the maximal step length \(\beta\) computed in the previous paragraph, TRPO shrinks \(\beta\) exponentially until the objective improves. Without this line search, the algorithm occasionally computes large steps that cause a catastrophic degradation of performance.

Code with OmniSafe#

Quick start#

Run TRPO in OmniSafe

Here are 3 ways to run TRPO in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

Run Agent from preset yaml file:

import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('TRPO', env_id)
agent.learn()

Run Agent from custom config dict:

import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('TRPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()

Run Agent from custom terminal config: we use train_policy.py as the entry point. You can train the agent with TRPO simply by running train_policy.py with arguments specifying the algorithm and the environment. For example, to run TRPO in SafetyPointGoal1-v0 with 1 torch thread, seed 0, and a single vectorized environment, you can use the following command:

cd examples
python train_policy.py --algo TRPO --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • TRPO.learn()

    • TRPO._env.rollout()

    • TRPO._update()

      • TRPO._buf.get()

      • TRPO._update_actor()

        • TRPO._fvp()

        • conjugate_gradients()

        • TRPO._cpo_search_step()

      • TRPO._update_reward_critic()


Documentation of algorithm specific functions#

trpo._fvp()

The TRPO algorithm builds a Hessian-vector product, instead of the full Hessian matrix, from an approximation of the KL divergence, following these steps:

  1. Calculate the KL divergence between the two policies. Note that self._actor_critic.actor denotes the actor \(\pi\) and kl denotes the KL divergence.

self._actor_critic.actor.zero_grad()
# distribution that gradients flow through
q_dist = self._actor_critic.actor(self._fvp_obs)
with torch.no_grad():
    # detached copy used as the fixed reference distribution
    p_dist = self._actor_critic.actor(self._fvp_obs)
kl = torch.distributions.kl.kl_divergence(p_dist, q_dist).mean()
  2. Use torch.autograd.grad() to compute the Hessian-vector product. Note that we compute the gradient of kl_p (the dot product of the flattened KL-divergence gradient and the input vector params) rather than of grads (the Jacobian of the KL divergence) alone.

grads = torch.autograd.grad(
    kl,
    tuple(self._actor_critic.actor.parameters()),
    create_graph=True,
)
flat_grad_kl = torch.cat([grad.view(-1) for grad in grads])

# dot product of the flattened KL gradient with the input vector `params`
kl_p = (flat_grad_kl * params).sum()
grads = torch.autograd.grad(
    kl_p,
    tuple(self._actor_critic.actor.parameters()),
    retain_graph=False,
)
  3. Return the Hessian-vector product, with a damping term params * cg_damping added for numerical stability.

flat_grad_grad_kl = torch.cat([grad.contiguous().view(-1) for grad in grads])
distributed.avg_tensor(flat_grad_grad_kl)

return flat_grad_grad_kl + params * self._cfgs.algo_cfgs.cg_damping

conjugate_gradients()

The TRPO algorithm uses the conjugate gradient algorithm to find the update direction using Hessian-vector products. The conjugate gradient method approximately solves \(Hx=g\) via the following steps:

  1. Set the initial solution vector_x to zero and compute the residual vector_r between the target vector_b (\(g\) in the equation above) and fisher_product(vector_x). Note that fisher_product computes the Hessian-vector product, playing the role of \(H\).

vector_x = torch.zeros_like(vector_b)
vector_r = vector_b - fisher_product(vector_x)
vector_p = vector_r.clone()
rdotr = torch.dot(vector_r, vector_r)
  2. Perform num_steps iterations of conjugate gradient.

for _ in range(num_steps):
    vector_z = fisher_product(vector_p)
    alpha = rdotr / (torch.dot(vector_p, vector_z) + eps)
    vector_x += alpha * vector_p
    vector_r -= alpha * vector_z
    new_rdotr = torch.dot(vector_r, vector_r)
    if torch.sqrt(new_rdotr) < residual_tol:
        break
    vector_mu = new_rdotr / (rdotr + eps)
    vector_p = vector_r + vector_mu * vector_p
    rdotr = new_rdotr
return vector_x
  3. Return the approximate solution \(x \approx H^{-1} g\) without ever forming \(H^{-1}\) explicitly.
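The routine above is generic: given any symmetric positive-definite matrix-vector product, it converges toward \(H^{-1}g\). The following self-contained sanity check re-implements the same loop outside OmniSafe on a tiny system (all names and values here are illustrative):

import torch

def cg(fisher_product, vector_b, num_steps=10, eps=1e-8, residual_tol=1e-10):
    """Plain conjugate gradient: approximately solve H x = b."""
    vector_x = torch.zeros_like(vector_b)
    vector_r = vector_b - fisher_product(vector_x)
    vector_p = vector_r.clone()
    rdotr = torch.dot(vector_r, vector_r)
    for _ in range(num_steps):
        vector_z = fisher_product(vector_p)
        alpha = rdotr / (torch.dot(vector_p, vector_z) + eps)
        vector_x += alpha * vector_p
        vector_r -= alpha * vector_z
        new_rdotr = torch.dot(vector_r, vector_r)
        if torch.sqrt(new_rdotr) < residual_tol:
            break
        vector_p = vector_r + (new_rdotr / (rdotr + eps)) * vector_p
        rdotr = new_rdotr
    return vector_x

H = torch.tensor([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive-definite stand-in
g = torch.tensor([1.0, 1.0])
x = cg(lambda v: H @ v, g)
print(torch.allclose(H @ x, g, atol=1e-4))  # True: x approximates H^{-1} g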

trpo._search_step_size()

The TRPO algorithm performs a line search to find a step that improves the surrogate objective while keeping the policy update within the trust region, following these steps:

  1. Get the current policy parameters and initialize the step size.

# How far to go in a single update
step_frac = 1.0
# Get old parameterized policy expression
theta_old = get_flat_params_from(self._actor_critic.actor)
  2. Calculate the expected reward improvement.

expected_improve = g_flat.dot(step_dir)
  3. Perform a line search to find a step that improves the surrogate objective while staying within the trust region.

  • Search over acceptance steps, ranging from 0 to the total number of steps.

# While not within_trust_region and not out of total_steps:
for step in range(total_steps):
    # update theta params
    new_theta = theta_old + step_frac * step_direction
    # set new params as params of net
    set_param_values_to_model(self._actor_critic.actor, new_theta)
  • In each iteration of the loop, calculate the policy loss and the KL divergence between the new and old policies.

with torch.no_grad():
    loss, _ = self._loss_pi(obs, act, logp, adv)
    # compute KL distance between new and old policy
    q_dist = self._actor_critic.actor(obs)
    # KL-distance of old p-dist and new q-dist, applied in KLEarlyStopping
    kl = torch.distributions.kl.kl_divergence(p_dist, q_dist).mean().item()
    kl = distributed.dist_avg(kl)
  • Accept the step only if the surrogate objective improves and the update stays within the trust region.

# real loss improve: old policy loss - new policy loss
loss_improve = loss_before - loss.item()
# average processes.... multi-processing style like: mpi_tools.mpi_avg(xxx)
loss_improve = distributed.dist_avg(loss_improve)
self._logger.log(f'Expected Improvement: {expected_improve} Actual: {loss_improve}')
if not torch.isfinite(loss):
    self._logger.log('WARNING: loss_pi not finite')
elif loss_improve < 0:
    self._logger.log('INFO: did not improve improve <0')
elif kl > self._cfgs.algo_cfgs.target_kl:
    self._logger.log('INFO: violated KL constraint.')
else:
    # step only if surrogate is improved and when within trust reg.
    acceptance_step = step + 1
    self._logger.log(f'Accept step at i={acceptance_step}')
    break
  4. Return the appropriate step direction and acceptance step.


Configs#

Train Configs

  • device (str): Device to use for training, options: cpu, cuda, cuda:0, etc.

  • torch_threads (int): Number of threads to use for PyTorch.

  • total_steps (int): Total number of steps to train the agent.

  • parallel (int): Number of parallel agents, similar to A3C.

  • vector_env_nums (int): Number of the vector environments.

Algorithms Configs

Note

The following configs are specific to TRPO algorithm.

  • cg_damping (float): Damping coefficient for conjugate gradient.

  • cg_iters (int): Number of iterations for conjugate gradient.

  • fvp_sample_freq (int): Frequency of sampling for Fisher vector product.

  • steps_per_epoch (int): Number of steps to update the policy network.

  • update_iters (int): Number of iterations to update the policy network.

  • batch_size (int): Batch size for each iteration.

  • target_kl (float): Target KL divergence.

  • entropy_coef (float): Coefficient of entropy.

  • reward_normalize (bool): Whether to normalize the reward.

  • cost_normalize (bool): Whether to normalize the cost.

  • obs_normalize (bool): Whether to normalize the observation.

  • kl_early_stop (bool): Whether to stop the training when KL divergence is too large.

  • max_grad_norm (float): Maximum gradient norm.

  • use_max_grad_norm (bool): Whether to use maximum gradient norm.

  • use_critic_norm (bool): Whether to use critic norm.

  • critic_norm_coef (float): Coefficient of critic norm.

  • gamma (float): Discount factor.

  • cost_gamma (float): Cost discount factor.

  • lam (float): Lambda for GAE-Lambda.

  • lam_c (float): Lambda for cost GAE-Lambda.

  • adv_estimation_method (str): The method to estimate the advantage.

  • standardized_rew_adv (bool): Whether to use standardized reward advantage.

  • standardized_cost_adv (bool): Whether to use standardized cost advantage.

  • penalty_coef (float): Penalty coefficient for cost.

  • use_cost (bool): Whether to use cost.
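Most of these keys can be overridden with the same custom_cfgs mechanism shown in the Quick start. A short example, assuming the keys sit under algo_cfgs as listed above (the values are arbitrary illustrations, not recommended defaults):

import omnisafe

env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'algo_cfgs': {
        'target_kl': 0.01,
        'cg_iters': 15,
        'cg_damping': 0.1,
    },
}

agent = omnisafe.Agent('TRPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()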

Model Configs

  • weight_initialization_mode (str): The type of weight initialization method.

  • actor_type (str): The type of actor; defaults to gaussian_learning.

  • linear_lr_decay (bool): Whether to use linear learning rate decay.

  • exploration_noise_anneal (bool): Whether to anneal the exploration noise.

  • std_range (list): The range of standard deviation.

Hint

actor (dict): parameters for the actor network

  • activations: tanh

  • hidden_sizes: [64, 64]

Hint

critic (dict): parameters for the critic network

  • activations: tanh

  • hidden_sizes: [64, 64]

Logger Configs

  • use_wandb (bool): Whether to use wandb to log the training process.

  • wandb_project (str): The name of wandb project.

  • use_tensorboard (bool): Whether to use tensorboard to log the training process.

  • log_dir (str): The directory to save the log files.

  • window_lens (int): The length of the window to calculate the average reward.

  • save_model_freq (int): The frequency to save the model.


Reference#

Appendix#


Proof of Theorem 1 (Difference between two arbitrary policies)#

Proof of Theorem 1

First note that \(A^{R}_{\pi}(s, a)=\mathbb{E}_{s' \sim \mathbb{P}\left(s^{\prime} \mid s, a\right)}\left[r(s)+\gamma V^R_{\pi}\left(s^{\prime}\right)-V^R_{\pi}(s)\right]\). Therefore,

(16)#\[\begin{split}\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{R}_{\pi}\left(s_t, a_t\right)\right] &=\mathbb{E}_{\tau \sim \pi'}\left[\sum _ { t = 0 } ^ { \infty } \gamma ^ { t } \left(r\left(s_t\right)+\gamma V^{R}_{\pi}\left(s_{t+1}\right)-V^{R}_{\pi}\left(s_{t} \right)\right) \right] \\ &=\mathbb{E}_{\tau \sim \pi'}\left[-V^R_{\pi}\left(s_0\right)+\sum_{t=0}^{\infty} \gamma^t r\left(s_t\right)\right] \\ &=-\mathbb{E}_{s_0}\left[V^R_{\pi}\left(s_0\right)\right]+\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t r\left(s_t\right)\right] \\ &=-J^R(\pi)+J^R(\pi')\end{split}\]

Proof of Corollary 1#

Proof of Corollary 1

From Eq.2 and Eq.4, we can easily see that

(17)#\[\begin{split}& L_{\pi_{{\boldsymbol{\theta}}_0}}\left(\pi_{{\boldsymbol{\theta}}_0}\right)=J^{R}\left(\pi_{{\boldsymbol{\theta}}_0}\right)\quad \\ \text{since}~~ &\sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \pi_{{\boldsymbol{\theta}}_0}(a \mid s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s, a)=0.\end{split}\]

Now, writing Eq.3 for the parameterized policies, we have:

(18)#\[J^{R}\left(\pi^{'}_{{\boldsymbol{\theta}}}\right) = J^{R}(\pi_{{\boldsymbol{\theta}}_0}) + \sum_s d_{\pi^{'}_{{\boldsymbol{\theta}}}}(s) \sum_a \pi^{'}_{{\boldsymbol{\theta}}}(a|s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a)\]

So,

(19)#\[\begin{split}\nabla_{{\boldsymbol{\theta}}} J^{R}(\pi_{{\boldsymbol{\theta}}})|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} &= \sum_s \nabla_{{\boldsymbol{\theta}}} d_{\pi_{{\boldsymbol{\theta}}}}(s)\big|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} \sum_a \pi_{{\boldsymbol{\theta}}_0}(a|s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a)+\sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \nabla_{{\boldsymbol{\theta}}} \pi_{{\boldsymbol{\theta}}}(a|s)\big|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a) \\ &= \sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \nabla_{{\boldsymbol{\theta}}} \pi_{{\boldsymbol{\theta}}}(a|s)\big|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a)\end{split}\]

Note

\(\sum_a \pi_{{\boldsymbol{\theta}}_0}(a|s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a)=0\) for every state \(s\), so the first term vanishes when the gradient is evaluated at \({\boldsymbol{\theta}}={\boldsymbol{\theta}}_0\).

Meanwhile,

(20)#\[L_{\pi_{{\boldsymbol{\theta}}_0}}(\pi_{{\boldsymbol{\theta}}})=J^{R}(\pi_{{\boldsymbol{\theta}}_0})+\sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \pi_{{\boldsymbol{\theta}}}(a \mid s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s, a)\]

So,

(21)#\[\nabla_{{\boldsymbol{\theta}}} L_{\pi_{{\boldsymbol{\theta}}_0}}(\pi_{{\boldsymbol{\theta}}}) \big| _{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0}=\sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \nabla_{{\boldsymbol{\theta}}} \pi_{{\boldsymbol{\theta}}}(a \mid s)\big|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s, a)\]

Combining Eq.19 and Eq.21, we have

(22)#\[\left.\nabla_{\boldsymbol{\theta}} L_{\pi_{{\boldsymbol{\theta}}_0}}\left(\pi_{\boldsymbol{\theta}}\right)\right|_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}_0}=\left.\nabla_{\boldsymbol{\theta}} J^{R}\left(\pi_{\boldsymbol{\theta}}\right)\right|_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}_0}\]

Proof of Theorem 2 (Difference between two arbitrary policies)#

Define \(\bar{A}^R(s)\) as the expected advantage of \(\pi'\) over \(\pi\) at \(s\),

(23)#\[\bar{A}^R(s)=\mathbb{E}_{a \sim \pi^{'}(\cdot \mid s)}\left[A^{R}_{\pi}(s, a)\right]\]

Theorem 1 can be written as follows:

(24)#\[J^R(\pi')=J^R(\pi)+\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^R\left(s_t\right)\right]\]

Note that \(L_\pi\) can be written as

(25)#\[L_\pi(\pi')=J^R(\pi)+\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^R\left(s_t\right)\right]\]

To bound the difference between \(J^R(\pi')\) and \(L_\pi(\pi')\), we need to bound the difference arising at each timestep. To do this, we first introduce a measure of how much \(\pi\) and \(\pi'\) agree. Specifically, we couple the policies so that they define a joint distribution over pairs of actions.

Definition 1

\((\pi, \pi')\) is an \(\alpha\)-coupled policy pair if it defines a joint distribution \((a, a') \mid s\) such that \(P(a \neq a' \mid s) \leq \alpha\) for all \(s\). \(\pi\) and \(\pi'\) denote the marginal distributions of \(a\) and \(a'\), respectively.

Computationally, \(\alpha\)-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of \(\pi\) and \(\pi'\) after setting that seed, the results will agree for at least fraction \(1-\alpha\) of seeds.

Lemma 1

Given that \(\pi, \pi'\) are \(\alpha\)-coupled policies, for all s,

(26)#\[|\bar{A}^R(s)| \leq 2 \alpha \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\]

Lemma 2

Let \((\pi, \pi')\) be an \(\alpha\)-coupled policy pair. Then

(27)#\[\begin{split}\left|\mathbb{E}_{s_t \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi}\left[\bar{A}^R\left(s_t\right)\right]\right|&\leq 2\left(1-(1-\alpha)^t\right) \max _s \left|\bar{A}^R(s)\right| \\ &\leq 4 \alpha\left(1-(1-\alpha)^t\right) \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\end{split}\]

Proof of Lemma 1

(28)#\[\begin{split}\bar{A}^R(s) &= \mathbb{E}_{a' \sim \pi'}\left[A^{R}_{\pi}(s, a')\right] - \mathbb{E}_{a \sim \pi}\left[A^{R}_{\pi}(s, a)\right] \\ &=\mathbb{E}_{(a, a') \sim(\pi, \pi')}\left[A^{R}_{\pi}(s, a')-A^{R}_{\pi}(s, a)\right]\\ &= P(a \neq a' \mid s)\, \mathbb{E}_{(a, a') \sim(\pi, \pi') \mid a \neq a'}\left[A^{R}_{\pi}(s, a')-A^{R}_{\pi}(s, a)\right]\end{split}\]

So,

(29)#\[|\bar{A}^R(s)| \leq \alpha \cdot 2 \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\]

Proof of Lemma 2

Given the coupled policy pair \((\pi, \pi')\), we can also obtain a coupling over the trajectory distributions produced by \(\pi\) and \(\pi'\), respectively. Namely, we have pairs of trajectories \(\tau, \tau'\), where \(\tau\) is obtained by taking actions from \(\pi\) and \(\tau'\) is obtained by taking actions from \(\pi'\), with the same random seed used to generate both trajectories. We consider the advantage of \(\pi'\) over \(\pi\) at timestep \(t\) and decompose this expectation based on whether \(\pi\) agrees with \(\pi'\) at all timesteps \(i<t\).

Let \(n_t\) denote the number of times that \(a_i \neq a^{'}_i\) for \(i<t\), i.e., the number of times that \(\pi\) and \(\pi'\) disagree before timestep \(t\).

(30)#\[\begin{split}\mathbb{E}_{s_t \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]&=P\left(n_t=0\right) \mathbb{E}_{s_t \sim \pi' \mid n_t=0}\left[\bar{A}^R\left(s_t\right)\right]\\ &+P\left(n_t>0\right) \mathbb{E}_{s_t \sim \pi' \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\end{split}\]

The expectation decomposes similarly when actions are sampled using \(\pi\):

(31)#\[\begin{split}\mathbb{E}_{s_t \sim \pi}\left[\bar{A}^R\left(s_t\right)\right]&=P\left(n_t=0\right) \mathbb{E}_{s_t \sim \pi \mid n_t=0}\left[\bar{A}^R\left(s_t\right)\right]\\ &+P\left(n_t>0\right) \mathbb{E}_{s_t \sim \pi \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\end{split}\]

Note that the \(n_t=0\) terms are equal:

(32)#\[\mathbb{E}_{s_t \sim \pi' \mid n_t=0}\left[\bar{A}^R\left(s_t\right)\right]=\mathbb{E}_{s_t \sim \pi \mid n_t=0}\left[\bar{A}^R\left(s_t\right)\right]\]

because \(n_t=0\) indicates that \(\pi\) and \(\pi'\) agreed on all timesteps less than \(t\). Subtracting Eq.31 from Eq.30, we get

(33)#\[\begin{split}&\mathbb{E}_{s_t \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi}\left[\bar{A}^R\left(s_t\right)\right] \\ =&P\left(n_t>0\right)\left(\mathbb{E}_{s_t \sim \pi' \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\right)\end{split}\]

By the definition of \(\alpha\), the probability that \(\pi\) and \(\pi'\) agree at timestep \(i\) is at least \(1-\alpha\), so \(P\left(n_t=0\right) \geq(1-\alpha)^t\), and

(34)#\[P\left(n_t>0\right) \leq 1-(1-\alpha)^t\]

Next, note that

(35)#\[\begin{split}&\left|\mathbb{E}_{s_t \sim \pi' \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\right| \\ & \leq\left|\mathbb{E}_{s_t \sim \pi' \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\right|+\left|\mathbb{E}_{s_t \sim \pi \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\right| \\ & \leq 4 \alpha \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\end{split}\]

where the second inequality follows from Lemma 1. Plugging Eq.34 and Eq.35 into Eq.33, we get

(36)#\[\left|\mathbb{E}_{s_t \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi}\left[\bar{A}^R\left(s_t\right)\right]\right| \leq 4 \alpha\left(1-(1-\alpha)^t\right) \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\]

The preceding lemma bounds the difference in expected advantage at each timestep \(t\). We can sum over time to bound the difference between \(J^R(\pi')\) and \(L_\pi(\pi')\). Subtracting Eq.25 from Eq.24, and defining \(\epsilon=\max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\), we have

(37)#\[\begin{split}\left|J^R(\pi')-L_\pi(\pi')\right| &\leq\sum_{t=0}^{\infty} \gamma^t\left|\mathbb{E}_{\tau \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{\tau \sim \pi}\left[\bar{A}^R\left(s_t\right)\right]\right| \\ & \leq \sum_{t=0}^{\infty} \gamma^t \cdot 4 \epsilon \alpha\left(1-(1-\alpha)^t\right) \\ &=4 \epsilon \alpha\left(\frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)}\right) \\ &=\frac{4 \alpha^2 \gamma \epsilon}{(1-\gamma)(1-\gamma(1-\alpha))} \\ & \leq \frac{4 \alpha^2 \gamma \epsilon}{(1-\gamma)^2}\end{split}\]

Last, to replace \(\alpha\) by the total variation divergence, we need to use the correspondence between TV divergence and coupled random variables:

Note

Suppose \(p_X\) and \(p_Y\) are distributions with \(D_{T V}\left(p_X \| p_Y\right)=\alpha\). Then there exists a joint distribution \((X, Y)\) whose marginals are \(p_X, p_Y\), for which \(X=Y\) with probability \(1-\alpha\). See Proposition 4.7 of (Levin et al., 2009) for more details.

It follows that if we have two policies \(\pi\) and \(\pi'\) such that

(38)#\[\max_s D_{\mathrm{TV}}(\pi(\cdot|s) \| \pi'(\cdot|s)) \leq \alpha\]

then we can define an \(\alpha\)-coupled policy pair \((\pi, \pi')\) with appropriate marginals. Taking \(\alpha=\underset{s}{\max} D_{T V}\left(\pi(\cdot \mid s) \| \pi'(\cdot \mid s)\right)\) in Eq.37, Theorem 2 follows.