Projection-Based Constrained Policy Optimization#

Quick Facts#

  1. PCPO is an on-policy algorithm.

  2. PCPO can be used for environments with both discrete and continuous action spaces.

  3. PCPO is an improvement built on CPO.

  4. The OmniSafe implementation of PCPO supports parallelization.

  5. API documentation is available for PCPO.

PCPO Theorem#

Background#

Projection-Based Constrained Policy Optimization (PCPO) is a two-stage iterative method for optimizing policies. The first stage involves a local reward improvement update, while the second stage reconciles any constraint violation by projecting the policy back onto the constraint set.

PCPO builds on CPO (Constrained Policy Optimization) and, like CPO, provides lower bounds on reward improvement and upper bounds on constraint violation.

In addition to these guarantees, PCPO characterizes its convergence based on two metrics: \(L2\) norm and \(KL\) divergence. It is designed to address the challenge of learning control policies that optimize a reward function while satisfying constraints.

Hint

If you are new to the CPO family of algorithms, we recommend reading the CPO tutorial (Constrained Policy Optimization) first to fully understand the concepts introduced in this chapter.


Optimization Objective#

In the previous chapters, you learned that CPO solves the following optimization problem:

(1)#\[\begin{split}\pi_{k+1} &= \arg\max_{\pi \in \Pi_{\boldsymbol{\theta}}}J^R(\pi)\\ \text{s.t.}\quad D(\pi,\pi_k) & \le\delta\\ J^{C_i}(\pi) &\le d_i\quad i=1,...m\end{split}\]

where \(\Pi_{\boldsymbol{\theta}}\subseteq\Pi\) denotes the set of parametrized policies with parameters \(\boldsymbol{\theta}\), and \(D\) is some distance measure. In local policy search for CMDPs, we additionally require policy iterates to be feasible, so instead of optimizing over \(\Pi_{\boldsymbol{\theta}}\), PCPO optimizes over \(\Pi_{\boldsymbol{\theta}}\cap\Pi_{C}\). Next, we introduce how PCPO solves this optimization problem. To gain a clearer understanding, we suggest reading the next section with the following questions in mind:

Questions

  • What is the two-stage policy update, and how does it work?

  • What are the performance bounds for PCPO, and how are they derived?

  • How does PCPO solve the optimization problem in practice?


Two-stage Policy Update#

PCPO updates the policy in two stages. The first stage, the Reward Improvement Stage, maximizes the reward using a trust region optimization method without considering the cost constraint; this may result in an intermediate policy that fails to satisfy the constraint. The second stage, the Projection Stage, reconciles the constraint violation (if any) by projecting the policy back onto the constraint set, i.e., choosing the policy in the constraint set that is closest to the intermediate policy. Next, we describe how PCPO completes this two-stage update.

Reward Improvement Stage

First, PCPO optimizes the reward function by maximizing the reward advantage function \(A^R_{\pi}(s,a)\) subject to a \(KL\)-divergence constraint. This constrains the intermediate policy \(\pi_{k+\frac12}\) to be within a \(\delta\)-neighborhood of \(\pi_{k}\):

(2)#\[\begin{split}&\pi_{k+\frac12}=\arg\underset{\pi}{\max}\underset{\substack{s\sim d_{\pi_k}\\ a\sim\pi}}{\mathbb{E}}[A^R_{\pi_k}(s,a)]\\ \text{s.t.}\quad &\underset{s\sim d_{\pi_k}}{\mathbb{E}}[D_{KL}(\pi||\pi_k)[s]]\le\delta\nonumber\end{split}\]

This update rule with the trust region is the TRPO update (see Trust Region Policy Optimization). It constrains the policy change to a divergence neighborhood and guarantees reward improvement.

Projection Stage

Second, PCPO projects the intermediate policy \(\pi_{k+\frac12}\) onto the constraint set by minimizing a distance measure \(D\) between \(\pi_{k+\frac12}\) and \(\pi\):

(3)#\[\begin{split}&\pi_{k+1}=\arg\underset{\pi}{\min} D(\pi,\pi_{k+\frac12})\\ \text{s.t.}\quad &J^C\left(\pi_k\right)+\underset{\substack{s\sim d_{\pi_k}\\ a\sim\pi}}{\mathbb{E}}\left[A^C_{\pi_k}(s, a)\right] \leq d\end{split}\]

The Projection Stage guarantees that the constraint-satisfying policy \(\pi_{k+1}\) stays in close proximity to \(\pi_{k+\frac{1}{2}}\). In other words, the Reward Improvement Stage drives the update toward maximizing reward within the trust region defined by \(D\), while the Projection Stage steers the policy in a direction that satisfies the constraint without moving far under \(D\).


Policy Performance Bounds#

In safety-critical applications, it is important to know how much the performance of a system can degrade when a learning algorithm is applied. PCPO provides a worst-case performance bound for each of the two cases: when the current policy satisfies the constraint and when it violates the constraint.

Worst-case Bound on Updating Constraint-satisfying Policies

Define \(\epsilon_{\pi_{k+1}}^{R}\doteq \max\limits_{s}\big|\underset{a\sim\pi_{k+1}}{\mathbb{E}}[A^{R}_{\pi_{k}}(s,a)]\big|\), and \(\epsilon_{\pi_{k+1}}^{C}\doteq \max\limits_{s}\big|\underset{a\sim\pi_{k+1}}{\mathbb{E}}[A^{C}_{\pi_{k}}(s,a)]\big|\). If the current policy \(\pi_k\) satisfies the constraint, then under \(KL\) divergence projection, the lower bound on reward improvement, and upper bound on constraint violation for each policy update are

(4)#\[\begin{split}J^{R}(\pi_{k+1})-J^{R}(\pi_{k})&\geq-\frac{\sqrt{2\delta}\gamma\epsilon_{\pi_{k+1}}^{R}}{(1-\gamma)^{2}}\\ J^{C}(\pi_{k+1})&\leq d+\frac{\sqrt{2\delta}\gamma\epsilon_{\pi_{k+1}}^{C}}{(1-\gamma)^{2}}\end{split}\]

where \(\delta\) is the step size in the reward improvement step.

Worst-case Bound on Updating Constraint-violating Policies

Define \(\epsilon_{\pi_{k+1}}^{R}\doteq \max\limits_{s}\big|\underset{a\sim\pi_{k+1}}{\mathbb{E}}[A^{R}_{\pi_{k}}(s,a)]\big|\), \(\epsilon_{\pi_{k+1}}^{C}\doteq \max\limits_{s}\big|\underset{a\sim\pi_{k+1}}{\mathbb{E}}[A^{C}_{\pi_{k}}(s,a)]\big|\), \(b^{+}\doteq \max(0,J^{C}(\pi_k)-d),\) and \(\alpha_{KL} \doteq \frac{1}{2a^T\boldsymbol{H}^{-1}a},\) where \(a\) is the gradient of the cost advantage function and \(\boldsymbol{H}\) is the Hessian of the \(KL\) divergence constraint. If the current policy \(\pi_k\) violates the constraint, then under \(KL\) divergence projection, the lower bound on reward improvement and the upper bound on constraint violation for each policy update are

(5)#\[\begin{split}J^{R}(\pi_{k+1})-J^{R}(\pi_{k})\geq&-\frac{\sqrt{2(\delta+{b^+}^{2}\alpha_\mathrm{KL})}\gamma\epsilon_{\pi_{k+1}}^{R}}{(1-\gamma)^{2}}\\ J^{C}(\pi_{k+1})\leq& ~d+\frac{\sqrt{2(\delta+{b^+}^{2}\alpha_\mathrm{KL})}\gamma\epsilon_{\pi_{k+1}}^{C}}{(1-\gamma)^{2}}\end{split}\]

where \(\delta\) is the step size in the reward improvement step.
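
As a quick numerical illustration, the sketch below evaluates both worst-case bounds for given values of \(\delta\), \(\gamma\), \(d\) and the advantage terms \(\epsilon\). The helper name and the numbers are ours and purely hypothetical; setting b_plus and alpha_kl to zero recovers the constraint-satisfying case:

import math

def pcpo_worst_case_bounds(delta, gamma, eps_r, eps_c, d, b_plus=0.0, alpha_kl=0.0):
    """Evaluate the worst-case bounds of Eq.4 (feasible pi_k) and Eq.5 (infeasible pi_k)."""
    radius = math.sqrt(2.0 * (delta + b_plus**2 * alpha_kl))
    reward_lower_bound = -radius * gamma * eps_r / (1.0 - gamma) ** 2
    cost_upper_bound = d + radius * gamma * eps_c / (1.0 - gamma) ** 2
    return reward_lower_bound, cost_upper_bound

# Hypothetical numbers: step size 0.01, discount 0.99, cost limit 25.
print(pcpo_worst_case_bounds(delta=0.01, gamma=0.99, eps_r=0.1, eps_c=0.1, d=25.0))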


Practical Implementation#

Implementation of a Two-stage Update#

For a large neural network policy with thousands of parameters, directly solving the PCPO update in Eq.2 and Eq.3 is impractical due to the computational cost. PCPO therefore uses local approximations: with a small step size \(\delta\), the objective and the cost constraint can be approximated by first-order expansions, while the \(KL\) divergence constraint in the reward improvement step and the \(KL\) divergence measure in the projection step can be approximated by second-order expansions.

Reward Improvement Stage

Define:

\(g\doteq\nabla_\boldsymbol{\theta}\underset{\substack{s\sim d_{\pi_k} \\ a\sim \pi}}{\mathbb{E}}[A_{\pi_k}^{R}(s,a)]\) is the gradient of the reward advantage function,

\(a\doteq\nabla_\boldsymbol{\theta}\underset{\substack{s\sim d_{\pi_k} \\ a\sim \pi}}{\mathbb{E}}[A_{\pi_k}^{C}(s,a)]\) is the gradient of the cost advantage function,

\(\boldsymbol{H}_{i,j}\doteq \frac{\partial^2 \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[KL(\pi ||\pi_{k})[s]\big]}{\partial \boldsymbol{\theta}_i\partial \boldsymbol{\theta}_j}\) is the Hessian of the \(KL\) divergence constraint (\(\boldsymbol{H}\) is also called the Fisher information matrix; it is symmetric positive semi-definite), \(b\doteq J^{C}(\pi_k)-d\) is the constraint violation of the policy \(\pi_{k}\), and \(\boldsymbol{\theta}\) is the parameter vector of the policy. PCPO linearizes the objective function at \(\pi_k\) subject to a second-order approximation of the \(KL\) divergence constraint, obtaining the following update:

(6)#\[\begin{split}&\boldsymbol{\theta}_{k+\frac{1}{2}} =\arg \underset{\boldsymbol{\theta}}{\max}g^{T}(\boldsymbol{\theta}-\boldsymbol{\theta}_k) \\ \text{s.t.}\quad &\frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})^{T}\boldsymbol{H}(\boldsymbol{\theta}-\boldsymbol{\theta}_k)\le \delta . \label{eq:update1}\end{split}\]

The above problem is essentially the optimization problem solved in TRPO, and it can be solved with the method introduced in the TRPO tutorial.
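
For reference, this subproblem has the closed-form solution derived in the Appendix (Eq.23):

\[\boldsymbol{\theta}_{k+\frac{1}{2}}=\boldsymbol{\theta}_{k}+\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}\boldsymbol{H}^{-1}g.\]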

Projection Stage

PCPO offers guidance on the choice of distance measure: if the projection is defined in the parameter space, the \(L2\) norm projection is used, while if it is defined in the probability space, the \(KL\) divergence projection is preferable; the latter can be approximated through a second-order expansion. PCPO again linearizes the cost constraint at \(\pi_{k}\), which gives the following update for the projection step:

(7)#\[\begin{split}&\boldsymbol{\theta}_{k+1} =\arg \underset{\boldsymbol{\theta}}{\min}\frac{1}{2}(\boldsymbol{\theta}-{\boldsymbol{\theta}}_{k+\frac{1}{2}})^{T}\boldsymbol{L}(\boldsymbol{\theta}-{\boldsymbol{\theta}}_{k+\frac{1}{2}})\\ \text{s.t.}\quad & a^{T}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})+b\leq 0\end{split}\]

where \(\boldsymbol{L}=\boldsymbol{I}\) for \(L2\) norm projection, and \(\boldsymbol{L}=\boldsymbol{H}\) for \(KL\) divergence projection.

PCPO solves Eq.6 and Eq.7 using convex programming; see the Appendix for details.

For each policy update:

(8)#\[\boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}_{k}+\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}\boldsymbol{H}^{-1}g -\max\left(0,\frac{\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}a^{T}\boldsymbol{H}^{-1}g+b}{a^T\boldsymbol{L}^{-1}a}\right)\boldsymbol{L}^{-1}a\]
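
For small problems where \(\boldsymbol{H}\) can be formed explicitly, this closed-form update can be written out directly. The NumPy sketch below is purely illustrative (the helper name and toy numbers are ours); the practical implementation never materializes \(\boldsymbol{H}\) and relies on Hessian-vector products with conjugate gradient instead, as discussed in the hint below:

import numpy as np

def pcpo_update(theta_k, g, a, H, b, delta, projection='kl'):
    """Closed-form PCPO step of Eq.8 with an explicitly formed Fisher matrix H."""
    H_inv = np.linalg.inv(H)
    L_inv = H_inv if projection == 'kl' else np.eye(len(theta_k))
    eta = np.sqrt(2.0 * delta / (g @ H_inv @ g))                  # reward-improvement step size
    nu = max(0.0, (eta * (a @ H_inv @ g) + b) / (a @ L_inv @ a))  # projection multiplier
    return theta_k + eta * (H_inv @ g) - nu * (L_inv @ a)

# Toy example with a random positive-definite H and hypothetical gradients.
rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n))
H = M @ M.T + n * np.eye(n)
print(pcpo_update(np.zeros(n), rng.normal(size=n), rng.normal(size=n), H, b=0.1, delta=0.01))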

Hint

\(\boldsymbol{H}\) is assumed to be invertible, and PCPO requires inverting \(\boldsymbol{H}\), which is impractical for huge neural network policies. Hence it uses the conjugate gradient method instead. (See the Appendix for a discussion of the trade-off between the approximation error and the computational efficiency of the conjugate gradient method.)
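
A minimal, self-contained conjugate gradient sketch of the kind referred to in the hint: it approximately solves \(\boldsymbol{H}x=g\) using only Hessian-vector products, so \(\boldsymbol{H}\) is never formed or inverted explicitly. The routine and its names are ours; OmniSafe ships its own conjugate_gradients utility:

import torch

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g given only a Hessian-vector product hvp(v) = H v."""
    x = torch.zeros_like(g)
    r = g.clone()           # residual g - H x (x starts at zero)
    p = g.clone()           # search direction
    rdotr = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rdotr / p.dot(Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        new_rdotr = r.dot(r)
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

# Usage with a small explicit matrix standing in for the Fisher information.
H = torch.tensor([[2.0, 0.3], [0.3, 1.0]])
g = torch.tensor([1.0, -1.0])
x = conjugate_gradient(lambda v: H @ v, g)   # x is approximately H^{-1} g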

Question

Is using a linear approximation to the constraint set enough to ensure constraint satisfaction?

Question

Can PCPO solve multi-constraint problems? If so, how?

Answer

If the step size \(\delta\) is small, then the linearization of the constraint set is accurate enough to locally approximate it.

Answer

The update in Eq.8 can be extended to multiple constraints by using alternating projections, i.e., by sequentially projecting onto each of the constraint sets, as sketched below.
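
A rough sketch of that idea under the same linearized approximation: after the reward-improvement step, project sequentially onto each linearized constraint \(\{\boldsymbol{\theta}: a_i^T(\boldsymbol{\theta}-\boldsymbol{\theta}_k)+b_i\le 0\}\), reusing the single-constraint projection formula. Function names and the fixed number of sweeps are our own illustrative choices, not the OmniSafe implementation:

import numpy as np

def project_single(theta, theta_k, a, b, L_inv):
    """Project theta onto {x : a^T (x - theta_k) + b <= 0} under the metric induced by L."""
    violation = a @ (theta - theta_k) + b
    if violation <= 0.0:
        return theta
    return theta - (violation / (a @ L_inv @ a)) * (L_inv @ a)

def alternating_projections(theta_half, theta_k, constraints, L_inv, sweeps=10):
    """Cycle through the linearized constraints (a_i, b_i), projecting onto one at a time."""
    theta = theta_half.copy()
    for _ in range(sweeps):
        for a, b in constraints:
            theta = project_single(theta, theta_k, a, b, L_inv)
    return theta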


Analysis#

The update rule in Eq.8 shows that the difference between PCPO with the \(KL\) divergence projection and with the \(L2\) norm projection lies in the cost update direction, which leads to a difference in reward improvement. The two projections converge to different stationary points with different convergence rates, related to the smallest and largest singular values of the Fisher information matrix, as shown in Theorem 3. For the analysis, PCPO assumes that the policy minimizes a negative reward objective function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\), and that \(f\) is \(L\)-smooth and twice continuously differentiable over the closed and convex constraint set.

Theorem 3

Let \(\eta\doteq \sqrt{\frac{2\delta}{g^{T}\boldsymbol{H}^{-1}g}}\) as in Eq.8, where \(\delta\) is the step size for reward improvement, \(g\) is the gradient of \(f\), and \(\boldsymbol{H}\) is the Fisher information matrix. Let \(\sigma_\mathrm{max}(\boldsymbol{H})\) be the largest singular value of \(\boldsymbol{H}\), and \(a\) be the gradient of the cost advantage function as in Eq.8. Then PCPO with \(KL\) divergence projection converges to a stationary point either inside the constraint set or on the boundary of the constraint set. In the latter case, the Lagrangian condition \(g=-\alpha a, \alpha\geq0\) holds. Moreover, at step \(k+1\) the objective value satisfies

(9)#\[f(\boldsymbol{\theta}_{k+1})\leq f(\boldsymbol{\theta}_{k})+||\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}||^2_{-\frac{1}{\eta}\boldsymbol{H}+\frac{L}{2}\boldsymbol{I}}.\]

PCPO with \(L2\) norm projection converges to a stationary point either inside the constraint set or on the boundary of the constraint set. In the latter case, the Lagrangian condition \(\boldsymbol{H}^{-1}g=-\alpha a, \alpha\geq0\) holds. If \(\sigma_\mathrm{max}(\boldsymbol{H})\leq1\), then at step \(k+1\) the objective value satisfies

(10)#\[f(\boldsymbol{\theta}_{k+1})\leq f(\boldsymbol{\theta}_{k})+(\frac{L}{2}-\frac{1}{\eta})||\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}||^2_2.\]

Theorem 3 shows that at such a stationary point on the boundary, \(g\) points in the direction opposite to \(a\).

Further, the improvement of the objective value is affected by the singular values of the Fisher information matrix. Specifically, the objective under the \(KL\) divergence projection decreases when \(\frac{L\eta}{2}\boldsymbol{I}\prec\boldsymbol{H}\), implying that \(\sigma_\mathrm{min}(\boldsymbol{H})> \frac{L\eta}{2}\), and the objective under the \(L2\) norm projection decreases when \(\eta<\frac{2}{L}\), implying that the condition number of \(\boldsymbol{H}\) is upper bounded: \(\frac{\sigma_\mathrm{max}(\boldsymbol{H})}{\sigma_\mathrm{min}(\boldsymbol{H})}<\frac{2||g||^2_2}{L^2\delta}\). By observing the singular values of the Fisher information matrix, we can dynamically select the suitable projection and thereby achieve objective improvement.
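
Assuming an estimate of the smoothness constant \(L\) is available (rarely the case for deep policies, so treat this purely as an illustrative sketch with names of our own choosing), the selection rule above can be checked numerically:

import numpy as np

def choose_projection(H, g, delta, L_smooth):
    """Select a projection based on the descent conditions from Theorem 3."""
    eta = np.sqrt(2.0 * delta / (g @ np.linalg.solve(H, g)))
    sigma_min = np.linalg.svd(H, compute_uv=False)[-1]  # smallest singular value of H
    if sigma_min > L_smooth * eta / 2.0:  # KL-projection objective is guaranteed to decrease
        return 'kl'
    if eta < 2.0 / L_smooth:              # descent condition for the L2 norm projection
        return 'l2'
    return 'kl'                           # neither condition holds; fall back to a default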

In the Appendix, we further use an example to compare the optimization trajectories and stationary points of \(KL\) divergence and \(L2\) norm projections.


Code with OmniSafe#

Quick start#

Run PCPO in OmniSafe

Here are 3 ways to run PCPO in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('PCPO', env_id)
agent.learn()

import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('PCPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()

We use train_policy.py as the entry point. You can train an agent with PCPO simply by running train_policy.py with the appropriate algorithm and environment arguments. For example, to run PCPO in SafetyPointGoal1-v0 with a single vectorized environment and 1 torch thread, you can use the following command:

cd examples
python train_policy.py --algo PCPO --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • PCPO.learn()

    • PCPO._env.rollout()

    • PCPO._update()

      • PCPO._buf.get()

      • PCPO._update_actor()

        • PCPO._fvp()

        • conjugate_gradients()

        • PCPO._cpo_search_step()

      • PCPO._update_cost_critic()

      • PCPO._update_reward_critic()


Documentation of algorithm specific functions#

pcpo._update_actor()

Update the policy network by following these steps:

  1. Get the policy reward performance gradient g (flattened as a vector)

theta_old = get_flat_params_from(self._actor_critic.actor)
self._actor_critic.actor.zero_grad()
loss_reward, info = self._loss_pi(obs, act, logp, adv_r)
loss_reward_before = distributed.dist_avg(loss_reward).item()
p_dist = self._actor_critic.actor(obs)
# The flattened reward gradient `grads` is then extracted analogously to the cost
# gradient in step 2 below (backward pass followed by get_flat_gradients_from).
  2. Get the policy cost performance gradient b (flattened as a vector)

self._actor_critic.zero_grad()
loss_cost = self._loss_pi_cost(obs, act, logp, adv_c)
loss_cost_before = distributed.dist_avg(loss_cost).item()

loss_cost.backward()
distributed.avg_grads(self._actor_critic.actor)

b_grads = get_flat_gradients_from(self._actor_critic.actor)
  3. Build the Hessian-vector product from an approximation of the \(KL\) divergence and solve for the search direction with conjugate_gradients

# p approximates H^{-1} b_grads via conjugate gradient on Fisher-vector products (self._fvp)
p = conjugate_gradients(self._fvp, b_grads, self._cfgs.algo_cfgs.cg_iters)
q = xHx               # xHx and grads come from the reward-gradient step above
r = grads.dot(p)
s = b_grads.dot(p)
  4. Determine the step direction and search for an acceptable step size (via self._cpo_search_step())

step_direction, accept_step = self._cpo_search_step(
    step_direction=step_direction,
    grads=grads,
    p_dist=p_dist,
    obs=obs,
    act=act,
    logp=logp,
    adv_r=adv_r,
    adv_c=adv_c,
    loss_reward_before=loss_reward_before,
    loss_cost_before=loss_cost_before,
    total_steps=200,
    violation_c=ep_costs,
)
  5. Update the actor network parameters

theta_new = theta_old + step_direction
set_param_values_to_model(self._actor_critic.actor, theta_new)
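
The flattening helpers used above (get_flat_params_from, set_param_values_to_model) follow a standard PyTorch pattern; conceptually they behave like the simplified sketch below, which is our illustration rather than the OmniSafe source:

import torch

def flat_params(module: torch.nn.Module) -> torch.Tensor:
    """Concatenate all parameters of a module into one flat vector."""
    return torch.cat([p.data.view(-1) for p in module.parameters()])

def set_flat_params(module: torch.nn.Module, flat: torch.Tensor) -> None:
    """Write a flat vector back into the module's parameters, slice by slice."""
    offset = 0
    for p in module.parameters():
        numel = p.numel()
        p.data.copy_(flat[offset:offset + numel].view_as(p))
        offset += numel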

Configs#

Train Configs

  • device (str): Device to use for training, options: cpu, cuda, cuda:0, etc.

  • torch_threads (int): Number of threads to use for PyTorch.

  • total_steps (int): Total number of steps to train the agent.

  • parallel (int): Number of parallel agents, similar to A3C.

  • vector_env_nums (int): Number of the vector environments.

Algorithms Configs

Note

The following configs are specific to the PCPO algorithm.

  • cg_damping (float): Damping coefficient for conjugate gradient.

  • cg_iters (int): Number of iterations for conjugate gradient.

  • fvp_sample_freq (int): Frequency of sampling for Fisher vector product.

  • steps_per_epoch (int): Number of steps to update the policy network.

  • update_iters (int): Number of iterations to update the policy network.

  • batch_size (int): Batch size for each iteration.

  • target_kl (float): Target KL divergence.

  • entropy_coef (float): Coefficient of entropy.

  • reward_normalize (bool): Whether to normalize the reward.

  • cost_normalize (bool): Whether to normalize the cost.

  • obs_normalize (bool): Whether to normalize the observation.

  • kl_early_stop (bool): Whether to stop the training when KL divergence is too large.

  • max_grad_norm (float): Maximum gradient norm.

  • use_max_grad_norm (bool): Whether to use maximum gradient norm.

  • use_critic_norm (bool): Whether to use critic norm.

  • critic_norm_coef (float): Coefficient of critic norm.

  • gamma (float): Discount factor.

  • cost_gamma (float): Cost discount factor.

  • lam (float): Lambda for GAE-Lambda.

  • lam_c (float): Lambda for cost GAE-Lambda.

  • adv_estimation_method (str): The method to estimate the advantage.

  • standardized_rew_adv (bool): Whether to use standardized reward advantage.

  • standardized_cost_adv (bool): Whether to use standardized cost advantage.

  • penalty_coef (float): Penalty coefficient for cost.

  • use_cost (bool): Whether to use cost.

Model Configs

  • weight_initialization_mode (str): The type of weight initialization method.

  • actor_type (str): The type of actor; defaults to gaussian_learning.

  • linear_lr_decay (bool): Whether to use linear learning rate decay.

  • exploration_noise_anneal (bool): Whether to anneal the exploration noise.

  • std_range (list): The range of standard deviation.

Hint

actor (dictionary): parameters for the actor network

  • activations: tanh

  • hidden_sizes:

  • 64

  • 64

Hint

critic (dictionary): parameters for the critic network

  • activations: tanh

  • hidden_sizes:

  • 64

  • 64
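
Several of these keys can be combined through custom_cfgs just like in the quick-start example. The snippet below is a hedged illustration; the exact nesting, in particular the model_cfgs key and its fields, should be checked against the preset yaml file of your OmniSafe version:

import omnisafe

env_id = 'SafetyPointGoal1-v0'

# Hypothetical overrides: key names mirror the config entries documented above.
custom_cfgs = {
    'algo_cfgs': {
        'cg_iters': 15,
        'target_kl': 0.01,
    },
    'model_cfgs': {
        'actor': {'hidden_sizes': [64, 64]},
        'critic': {'hidden_sizes': [64, 64]},
    },
}

agent = omnisafe.Agent('PCPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()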

Logger Configs

  • use_wandb (bool): Whether to use wandb to log the training process.

  • wandb_project (str): The name of wandb project.

  • use_tensorboard (bool): Whether to use tensorboard to log the training process.

  • log_dir (str): The directory to save the log files.

  • window_lens (int): The length of the window to calculate the average reward.

  • save_model_freq (int): The frequency to save the model.


References#

Appendix#


Proof of Theorem 2#

To prove the policy performance bound when the current policy is infeasible (i.e., constraint-violating), we first prove two lemmas about the \(KL\) divergence between \(\pi_{k}\) and \(\pi_{k+1}\). We then prove the main theorem for the worst-case performance degradation.

Lemma 1

If the current policy \(\pi_{k}\) satisfies the constraint, the constraint set is closed and convex, and the \(KL\) divergence constraint for the first step is \(\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+\frac{1}{2}} ||\pi_{k})[s]\big]\leq \delta\), where \(\delta\) is the step size in the reward improvement step, then under \(KL\) divergence projection, we have

(11)#\[\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+1} ||\pi_{k})[s]\big]\leq \delta.\]

Lemma 2

If the current policy \(\pi_{k}\) violates the constraint, the constraint set is closed and convex, and the \(KL\) divergence constraint for the first step is \(\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+\frac{1}{2}} ||\pi_{k})[s]\big]\leq \delta\), where \(\delta\) is the step size in the reward improvement step, then under the \(KL\) divergence projection, we have

(12)#\[\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+1} ||\pi_{k})[s]\big]\leq \delta+{b^+}^2\alpha_\mathrm{KL},\]

where \(\alpha_\mathrm{KL} \doteq \frac{1}{2a^T\boldsymbol{H}^{-1}a}\), \(a\) is the gradient of the cost advantage function, \(\boldsymbol{H}\) is the Hessian of the \(KL\) divergence constraint, and \(b^+\doteq\max(0,J^{C}(\pi_k)-d)\).

Proof of Lemma 1

By the Bregman divergence projection inequality, with \(\pi_{k}\) in the constraint set and \(\pi_{k+1}\) the projection of \(\pi_{k+\frac{1}{2}}\) onto the constraint set, we have

(13)#\[\begin{split}&\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k} ||\pi_{k+\frac{1}{2}})[s]\big]\geq \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k}||\pi_{k+1})[s]\big] \\ &+ \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+1} ||\pi_{k+\frac{1}{2}})[s]\big]\\ &\Rightarrow\delta\geq \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k} ||\pi_{k+\frac{1}{2}})[s]\big]\geq \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k}||\pi_{k+1})[s]\big].\end{split}\]

The derivation uses the fact that \(KL\) divergence is always greater than zero. We know that \(KL\) divergence is asymptotically symmetric when updating the policy within a local neighborhood. Thus, we have

(14)#\[\delta\geq \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+\frac{1}{2}} ||\pi_{k})[s]\big]\geq \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+1}||\pi_{k})[s]\big].\]

Proof of Lemma 2

We define the sub-level set of cost constraint functions for the current infeasible policy \(\pi_k\):

(15)#\[\begin{split}L_{\pi_k}=\{\pi~|~J^{C}(\pi_{k})+ \mathbb{E}_{\substack{s\sim d_{\pi_{k}}\\ a\sim \pi}}[A_{\pi_k}^{C}(s,a)]\leq J^{C}(\pi_{k})\}.\end{split}\]

This implies that the current policy \(\pi_k\) lies in \(L_{\pi_k}\), while \(\pi_{k+\frac{1}{2}}\) is projected onto the constraint set \(\{\pi~|~J^{C}(\pi_{k})+ \mathbb{E}_{\substack{s\sim d_{\pi_{k}}\\ a\sim \pi}}[A_{\pi_k}^{C}(s,a)]\leq d\}\). Next, we define the policy \(\pi_{k+1}^l\) as the projection of \(\pi_{k+\frac{1}{2}}\) onto \(L_{\pi_k}\).

For these three policies \(\pi_k, \pi_{k+1}\) and \(\pi_{k+1}^l\), with \(\varphi(x)\doteq\sum_i x_i\log x_i\), we have

(16)#\[ \begin{align}\begin{aligned}\begin{split}\delta &\geq \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+1}^l ||\pi_{k})[s]\big] \\&=\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+1} ||\pi_{k})[s]\big] -\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL} (\pi_{k+1} ||\pi_{k+1}^l)[s]\big]\\ &+\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[(\nabla\varphi(\pi_k)-\nabla\varphi(\pi_{k+1}^{l}))^T(\pi_{k+1}-\pi_{k+1}^l)[s]\big] \nonumber \\\end{split}\\\begin{split}\Rightarrow \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL} (\pi_{k+1} ||\pi_{k})[s]\big]&\leq \delta + \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL} (\pi_{k+1} ||\pi_{k+1}^l)[s]\big]\\ &- \underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[(\nabla\varphi(\pi_k)-\nabla\varphi(\pi_{k+1}^{l}))^T(\pi_{k+1}-\pi_{k+1}^l)[s]\big].\end{split}\end{aligned}\end{align} \]

The inequality \(\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL} (\pi_{k+1}^l ||\pi_{k})[s]\big]\leq\delta\) comes from the fact that \(\pi_{k}\) and \(\pi_{k+1}^l\) are in \(L_{\pi_k}\), together with Lemma 1.

If the constraint violation of the current policy \(\pi_k\), and hence \(b^+\), is small, then \(\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL} (\pi_{k+1} ||\pi_{k+1}^l)[s]\big]\) can be approximated by a second-order expansion. By the update rule in Eq.8, we have

(17)#\[\begin{split}\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+1} ||\pi_{k+1}^l)[s]\big] &\approx \frac{1}{2}(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k+1}^l)^{T}\boldsymbol{H}(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k+1}^l)\\ &=\frac{1}{2} \Big(\frac{b^+}{a^T\boldsymbol{H}^{-1}a}\boldsymbol{H}^{-1}a\Big)^T\boldsymbol{H}\Big(\frac{b^+}{a^T\boldsymbol{H}^{-1}a}\boldsymbol{H}^{-1}a\Big)\\ &=\frac{{b^+}^2}{2a^T\boldsymbol{H}^{-1}a}\\ &={b^+}^2\alpha_\mathrm{KL},\end{split}\]

where \(\alpha_\mathrm{KL} \doteq \frac{1}{2a^T\boldsymbol{H}^{-1}a}.\)

And since \(\delta\) is small, we have \(\nabla\varphi(\pi_k)-\nabla\varphi(\pi_{k+1}^{l})\approx \mathbf{0}\) for any given \(s\). Thus, the third term in Eq.16 can be eliminated.

Combining Eq.16 and Eq.17, we have \(\underset{s\sim d_{\pi_{k}}}{\mathbb{E}}\big[\mathrm{KL}(\pi_{k+1}||\pi_{k})[s]\big]\leq \delta+{b^+}^2\alpha_\mathrm{KL}\).

Now we use Lemma 2 to prove Theorem 2. Following the same steps as in the proof of Theorem 1, we complete the proof.

Proof of Analytical Solution to PCPO#

Analytical Solution to PCPO

Consider the PCPO problem. In the first step, we optimize the reward (written below as minimizing the negative reward, to match the KKT analysis that follows):

(18)#\[\begin{split}\boldsymbol{\theta}_{k+\frac{1}{2}} = & \arg \underset{\boldsymbol{\theta}}{\min} -g^{T}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k}) \\ \text{s.t.}\quad&\frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})^{T}\boldsymbol{H}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})\leq \delta,\end{split}\]

and in the second step, we project the policy onto the constraint set:

(19)#\[\begin{split}\boldsymbol{\theta}_{k+1} = &\arg\underset{\boldsymbol{\theta}}{ \min} \frac{1}{2}(\boldsymbol{\theta}-{\boldsymbol{\theta}}_{k+\frac{1}{2}})^{T}\boldsymbol{L}(\theta-{\boldsymbol{\theta}}_{k+\frac{1}{2}}) \\ \text{s.t.}\quad &a^{T}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})+b\leq 0,\end{split}\]

where \(g, a, \boldsymbol{\theta} \in \mathbb{R}^n, b, \delta\in \mathbb{R}, \delta>0,\) and \(\boldsymbol{H},\boldsymbol{L}\in \mathbb{R}^{n\times n}, \boldsymbol{L}=\boldsymbol{H}\), if using the \(KL\) divergence projection, and \(\boldsymbol{L}=\boldsymbol{I}\) if using the \(L2\) norm projection. When there is at least one strictly feasible point, the optimal solution satisfies

(20)#\[\begin{split}\boldsymbol{\theta}_{k+1}&=\boldsymbol{\theta}_{k}+\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}\boldsymbol{H}^{-1}g\nonumber\\ &-\max(0,\frac{\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}a^{T}\boldsymbol{H}^{-1}g+b}{a^T\boldsymbol{L}^{-1}a})\boldsymbol{L}^{-1}a\end{split}\]

assuming that \(\boldsymbol{H}\) is invertible to get a unique solution.

Proof of Analytical Solution to PCPO

For the first problem, since \(\boldsymbol{H}\) is the Fisher Information matrix, which automatically guarantees it is positive semi-definite, it is a convex program with quadratic inequality constraints. Hence if the primal problem has a feasible point, then Slater’s condition is satisfied and strong duality holds. Let \(\boldsymbol{\theta}^{*}\) and \(\lambda^*\) denote the solutions to the primal and dual problems, respectively. In addition, the primal objective function is continuously differentiable. Hence the Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for the optimality of \(\boldsymbol{\theta}^{*}\) and \(\lambda^*.\) We now form the Lagrangian:

(21)#\[\mathcal{L}(\boldsymbol{\theta},\lambda)=-g^{T}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})+\lambda\Big(\frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})^{T}\boldsymbol{H}(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})- \delta\Big).\]

And we have the following KKT conditions:

(22)#\[\begin{split}-g + \lambda^*\boldsymbol{H}\boldsymbol{\theta}^{*}-\lambda^*\boldsymbol{H}\boldsymbol{\theta}_{k}=0~~~~&~~~\nabla_\boldsymbol{\theta}\mathcal{L}(\boldsymbol{\theta}^{*},\lambda^{*})=0 \\ \frac{1}{2}(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}_{k})^{T}\boldsymbol{H}(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}_{k})- \delta=0~~~~&~~~\nabla_\lambda\mathcal{L}(\boldsymbol{\theta}^{*},\lambda^{*})=0 \\ \frac{1}{2}(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}_{k})^{T}\boldsymbol{H}(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}_{k})-\delta\leq0~~~~&~~~\text{primal constraints}\label{KKT_3}\\ \lambda^*\geq0~~~~&~~~\text{dual constraints}\\ \lambda^*\Big(\frac{1}{2}(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}_{k})^{T}\boldsymbol{H}(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}_{k})-\delta\Big)=0~~~~&~~~\text{complementary slackness}\end{split}\]

By Eq.22, we have \(\boldsymbol{\theta}^{*}=\boldsymbol{\theta}_{k}+\frac{1}{\lambda^*}\boldsymbol{H}^{-1}g\) and \(\lambda^*=\sqrt{\frac{g^T\boldsymbol{H}^{-1}g}{2\delta}}\). Hence we have the optimal solution:

(23)#\[\boldsymbol{\theta}_{k+\frac{1}{2}}=\boldsymbol{\theta}^{*}=\boldsymbol{\theta}_{k}+\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}\boldsymbol{H}^{-1}g\]
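
The value of \(\lambda^*\) follows from substituting \(\boldsymbol{\theta}^{*}-\boldsymbol{\theta}_{k}=\frac{1}{\lambda^*}\boldsymbol{H}^{-1}g\) into the active trust-region constraint (complementary slackness with \(\lambda^*>0\)):

\[\frac{1}{2}\Big(\frac{1}{\lambda^*}\boldsymbol{H}^{-1}g\Big)^{T}\boldsymbol{H}\Big(\frac{1}{\lambda^*}\boldsymbol{H}^{-1}g\Big)=\delta \;\Rightarrow\; \frac{g^{T}\boldsymbol{H}^{-1}g}{2{\lambda^*}^{2}}=\delta \;\Rightarrow\; \lambda^*=\sqrt{\frac{g^{T}\boldsymbol{H}^{-1}g}{2\delta}}.\]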

Following the same reasoning, we now form the Lagrangian of the second problem:

(24)#\[\mathcal{L}(\boldsymbol{\theta},\lambda)=\frac{1}{2}(\boldsymbol{\theta}-{\boldsymbol{\theta}}_{k+\frac{1}{2}})^{T}\boldsymbol{L}(\boldsymbol{\theta}-{\boldsymbol{\theta}}_{k+\frac{1}{2}})+\lambda(a^T(\boldsymbol{\theta}-\boldsymbol{\theta}_{k})+b)\]

And we have the following KKT conditions:

(25)#\[\begin{split}\boldsymbol{L}\boldsymbol{\theta}^*-\boldsymbol{L}\boldsymbol{\theta}_{k+\frac{1}{2}}+\lambda^*a=0~~~~&~~~\nabla_\boldsymbol{\theta}\mathcal{L}(\boldsymbol{\theta}^{*},\lambda^{*})=0 \\ a^T(\boldsymbol{\theta}^*-\boldsymbol{\theta}_{k})+b=0~~~~&~~~\nabla_\lambda\mathcal{L}(\boldsymbol{\theta}^{*},\lambda^{*})=0 \\ a^T(\boldsymbol{\theta}^*-\boldsymbol{\theta}_{k})+b\leq0~~~~&~~~\text{primal constraints} \\ \lambda^*\geq0~~~~&~~~\text{dual constraints} \\ \lambda^*(a^T(\boldsymbol{\theta}^*-\boldsymbol{\theta}_{k})+b)=0~~~~&~~~\text{complementary slackness}\end{split}\]

By Eq.25, we have \(\boldsymbol{\theta}^{*}=\boldsymbol{\theta}_{k+\frac{1}{2}}-\lambda^*\boldsymbol{L}^{-1}a\), and solving Eq.25 gives \(\lambda^*=\max\big(0,\frac{a^T(\boldsymbol{\theta}_{k+\frac{1}{2}}-\boldsymbol{\theta}_{k})+b}{a^T\boldsymbol{L}^{-1}a}\big)\). Hence we have the optimal solution:

(26)#\[\boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}^{*}=\boldsymbol{\theta}_{k+\frac{1}{2}}-\max(0,\frac{a^T(\boldsymbol{\theta}_{k+\frac{1}{2}}-\boldsymbol{\theta}_{k})+b}{a^T\boldsymbol{L}^{-1}a})\boldsymbol{L}^{-1}a\]
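
By Eq.23, \(\boldsymbol{\theta}_{k+\frac{1}{2}}-\boldsymbol{\theta}_{k}=\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}\boldsymbol{H}^{-1}g\), so the numerator of the max term becomes

\[a^{T}(\boldsymbol{\theta}_{k+\frac{1}{2}}-\boldsymbol{\theta}_{k})+b=\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}a^{T}\boldsymbol{H}^{-1}g+b.\]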

Substituting this back into Eq.26, we have

(27)#\[\begin{split}\boldsymbol{\theta}_{k+1}&=\boldsymbol{\theta}_{k}+\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}\boldsymbol{H}^{-1}g\\ &-\max(0,\frac{\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}a^{T}\boldsymbol{H}^{-1}g+b}{a^T\boldsymbol{L}^{-1}a})\boldsymbol{L}^{-1}a\end{split}\]

Proof of Theorem 3#

For our analysis, we make the following assumptions: we minimize the negative reward objective function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) (following the convention in the literature of minimizing the objective function), and \(f\) is \(L\)-smooth and twice continuously differentiable over the closed and convex constraint set \(\mathcal{C}\). We use the following Lemma 3 to characterize the projection and to support the proof of Theorem 3.

Lemma 3

For any \(\boldsymbol{\theta}\), \(\boldsymbol{\theta}^{*}=\mathrm{Proj}^{\boldsymbol{L}}_{\mathcal{C}}(\boldsymbol{\theta})\) if and only if \((\boldsymbol{\theta}-\boldsymbol{\theta}^*)^T\boldsymbol{L}(\boldsymbol{\theta}'-\boldsymbol{\theta}^*)\leq0, \forall\boldsymbol{\theta}'\in\mathcal{C}\), where \(\mathrm{Proj}^{\boldsymbol{L}}_{\mathcal{C}}(\boldsymbol{\theta})\doteq \underset{\boldsymbol{\theta}' \in \mathrm{C}}{\arg \min}||\boldsymbol{\theta}-\boldsymbol{\theta}'||^2_{\boldsymbol{L}}\) and \(\boldsymbol{L}=\boldsymbol{H}\) if using the \(KL\) divergence projection, and \(\boldsymbol{L}=\boldsymbol{I}\) if using the \(L2\) norm projection.

Proof of Lemma 3

\((\Rightarrow)\) Let \(\boldsymbol{\theta}^{*}=\mathrm{Proj}^{\boldsymbol{L}}_{\mathcal{C}}(\boldsymbol{\theta})\) for a given \(\boldsymbol{\theta} \not\in\mathcal{C},\) \(\boldsymbol{\theta}'\in\mathcal{C}\) be such that \(\boldsymbol{\theta}'\neq\boldsymbol{\theta}^*,\) and \(\alpha\in(0,1).\) Then we have

(28)#\[\begin{split}\label{eq:appendix_lemmaD1_0} \left\|\boldsymbol{\theta}-\boldsymbol{\theta}^*\right\|_L^2 & \leq\left\|\boldsymbol{\theta}-\left(\boldsymbol{\theta}^*+\alpha\left(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}^*\right)\right)\right\|_L^2 \\ &=\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^*\right\|_L^2+\alpha^2\left\|\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}^*\right\|_{\boldsymbol{L}}^2\\ ~~~~ &-2\alpha\left(\boldsymbol{\theta}-\boldsymbol{\theta}^*\right)^T \boldsymbol{L}\left(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}^*\right) \\ & \Rightarrow\left(\boldsymbol{\theta}-\boldsymbol{\theta}^*\right)^T \boldsymbol{L}\left(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}^*\right) \leq \frac{\alpha}{2}\left\|\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}^*\right\|_{\boldsymbol{L}}^2\end{split}\]

Since the right-hand side of Eq.28 can be made arbitrarily small by taking \(\alpha\rightarrow0\), we have:

(29)#\[(\boldsymbol{\theta}-\boldsymbol{\theta}^*)^T\boldsymbol{L}(\boldsymbol{\theta}'-\boldsymbol{\theta}^*)\leq0, \forall\boldsymbol{\theta}'\in\mathcal{C}.\]

\((\Leftarrow)\) Let \(\boldsymbol{\theta}^*\in\mathcal{C}\) be such that \((\boldsymbol{\theta}-\boldsymbol{\theta}^*)^T\boldsymbol{L}(\boldsymbol{\theta}'-\boldsymbol{\theta}^*)\leq0, \forall\boldsymbol{\theta}'\in\mathcal{C}\). We show that \(\boldsymbol{\theta}^*\) must be the optimal solution. Let \(\boldsymbol{\theta}'\in\mathcal{C}\) and \(\boldsymbol{\theta}'\neq\boldsymbol{\theta}^*\). Then we have

(30)#\[\begin{split}\begin{split} &\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\|_L^2-\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^*\right\|_L^2\\ &=\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^*+\boldsymbol{\theta}^*-\boldsymbol{\theta}^{\prime}\right\|_L^2-\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^*\right\|_L^2 \\ &=\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^*\right\|_L^2+\left\|\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}^*\right\|_L^2-2\left(\boldsymbol{\theta}-\boldsymbol{\theta}^*\right)^T \boldsymbol{L}\left(\boldsymbol{\theta}^{\prime}-\boldsymbol{\theta}^*\right)\\ &~~~~-\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^*\right\|_{\boldsymbol{L}}^2 \\ &>0 \\ &\Rightarrow\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\|_L^2 >\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^*\right\|_L^2 . \end{split}\end{split}\]

Hence, \(\boldsymbol{\theta}^*\) is the optimal solution to the optimization problem, and \(\boldsymbol{\theta}^*=\mathrm{Proj}^{\boldsymbol{L}}_{\mathcal{C}}(\boldsymbol{\theta})\).

Based on Lemma 3, we now prove Theorem 3.

Theorem 3 (Stationary Points of PCPO with the \(KL\) divergence and \(L2\) Norm Projections)

Let \(\eta\doteq \sqrt{\frac{2\delta}{g^{T}\boldsymbol{H}^{-1}g}}\) as in Eq.8, where \(\delta\) is the step size for reward improvement, \(g\) is the gradient of \(f\), and \(\boldsymbol{H}\) is the Fisher information matrix. Let \(\sigma_\mathrm{max}(\boldsymbol{H})\) be the largest singular value of \(\boldsymbol{H}\), and \(a\) be the gradient of the cost advantage function as in Eq.8. Then PCPO with the \(KL\) divergence projection converges to stationary points with \(g\in-a\) (i.e., the gradient of \(f\) is aligned with the negative gradient of the cost advantage function). The objective value changes by

(31)#\[f(\boldsymbol{\theta}_{k+1})\leq f(\boldsymbol{\theta}_{k})+||\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}||^2_{-\frac{1}{\eta}\boldsymbol{H}+\frac{L}{2}\boldsymbol{I}}\]

PCPO with the \(L2\) norm projection converges to stationary points with \(\boldsymbol{H}^{-1}g\in-a\) (i.e., the product of the inverse of \(\boldsymbol{H}\) and gradient of \(f\) belongs to the negative gradient of the cost advantage function). If \(\sigma_\mathrm{max}(\boldsymbol{H})\leq1\), then the objective value changes by

(32)#\[f(\boldsymbol{\theta}_{k+1})\leq f(\boldsymbol{\theta}_{k})+(\frac{L}{2}-\frac{1}{\eta})||\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}||^2_2\]

Proof of Theorem 3

The proof of the theorem is based on working in a Hilbert space and on the non-expansive property of the projection. We first characterize the stationary points of PCPO with the \(KL\) divergence and \(L2\) norm projections, and then bound the change of the objective value.

At a stationary point \(\boldsymbol{\theta}^*\), we have

(33)#\[\begin{split}\boldsymbol{\theta}^{*}&=\boldsymbol{\theta}^{*}-\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}\boldsymbol{H}^{-1}g -\max\left(0,\frac{\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}a^{T}\boldsymbol{H}^{-1}g+b}{a^T\boldsymbol{L}^{-1}a}\right)\boldsymbol{L}^{-1}a\\ &\Leftrightarrow \sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}\boldsymbol{H}^{-1}g = -\max(0,\frac{\sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}a^{T}\boldsymbol{H}^{-1}g+b}{a^T\boldsymbol{L}^{-1}a})\boldsymbol{L}^{-1}a\\ &\Leftrightarrow \boldsymbol{H}^{-1}g \in -\boldsymbol{L}^{-1}a. \label{eq:appendixStationary}\end{split}\]

For the \(KL\) divergence projection (\(\boldsymbol{L}=\boldsymbol{H}\)), Eq.33 boils down to \(g\in-a\), and for the \(L2\) norm projection (\(\boldsymbol{L}=\boldsymbol{I}\)), Eq.33 is equivalent to \(\boldsymbol{H}^{-1}g\in-a\).

Now we prove the second part of the theorem. Based on Lemma 3, for the \(KL\) divergence projection, we have

(34)#\[\begin{split}\label{eq:appendix_converge_0} \left(\boldsymbol{\theta}_k-\boldsymbol{\theta}_{k+1}\right)^T \boldsymbol{H}\left(\boldsymbol{\theta}_k-\eta \boldsymbol{H}^{-1} \boldsymbol{g}-\boldsymbol{\theta}_{k+1}\right) \leq 0 \\ \Rightarrow \boldsymbol{g}^T\left(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right) \leq-\frac{1}{\eta}\left\|\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right\|_{\boldsymbol{H}}^2\end{split}\]

By Eq.34 and the \(L\)-smoothness of \(f\), we have

(35)#\[\begin{split}f\left(\boldsymbol{\theta}_{k+1}\right) & \leq f\left(\boldsymbol{\theta}_k\right)+\boldsymbol{g}^T\left(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right)+\frac{L}{2}\left\|\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right\|_2^2 \\ & \leq f\left(\boldsymbol{\theta}_k\right)-\frac{1}{\eta}\left\|\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right\|_{\boldsymbol{H}}^2+\frac{L}{2}\left\|\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right\|_2^2 \\ &=f\left(\boldsymbol{\theta}_k\right)+\left(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right)^T\left(-\frac{1}{\eta} \boldsymbol{H}+\frac{L}{2} \boldsymbol{I}\right)\left(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right) \\ &=f\left(\boldsymbol{\theta}_k\right)+\left\|\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_k\right\|_{-\frac{1}{\eta} \boldsymbol{H}+\frac{L}{2} \boldsymbol{I}}^2\end{split}\]

For the \(L2\) norm projection, we have

(36)#\[\begin{split}(\boldsymbol{\theta}_{k}-\boldsymbol{\theta}_{k+1})^T(\boldsymbol{\theta}_{k}-\eta\boldsymbol{H}^{-1}g-\boldsymbol{\theta}_{k+1})\leq0\\ \Rightarrow g^T\boldsymbol{H}^{-1}(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k})\leq -\frac{1}{\eta}||\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}||^2_2\end{split}\]

By Eq.36, the \(L\)-smoothness of \(f\), and the assumption \(\sigma_\mathrm{max}(\boldsymbol{H})\leq1\), we have

(37)#\[\begin{split}f(\boldsymbol{\theta}_{k+1})&\leq f(\boldsymbol{\theta}_{k})+g^T(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k})+\frac{L}{2}||\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}||^2_2 \nonumber\\ &\leq f(\boldsymbol{\theta}_{k})+(\frac{L}{2}-\frac{1}{\eta})||\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}||^2_2.\nonumber\end{split}\]

To see why we need the assumption of \(\sigma_\mathrm{max}(\boldsymbol{H})\leq1\), we define \(\boldsymbol{H}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{U}^T\) as the singular value decomposition of \(\boldsymbol{H}\) with \(u_i\) being the column vector of \(\boldsymbol{U}\). Then we have

(38)#\[\begin{split}g^T\boldsymbol{H}^{-1}(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}) &=g^T\boldsymbol{U}\boldsymbol{\Sigma}^{-1}\boldsymbol{U}^T(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}) \nonumber\\ &=g^T(\sum_{i}\frac{1}{\sigma_i(\boldsymbol{H})}u_iu_i^T)(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k})\nonumber\\ &=\sum_{i}\frac{1}{\sigma_i(\boldsymbol{H})}g^T(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}).\nonumber\end{split}\]

If we want to have

(39)#\[g^T(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k})\leq g^T\boldsymbol{H}^{-1}(\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k})\leq -\frac{1}{\eta}||\boldsymbol{\theta}_{k+1}-\boldsymbol{\theta}_{k}||^2_2,\]

then every singular value \(\sigma_i(\boldsymbol{H})\) of \(\boldsymbol{H}\) needs to be smaller than \(1\), and hence \(\sigma_\mathrm{max}(\boldsymbol{H})\leq1\), which justifies the assumption we use to prove the bound.

Hint

To make the objective value for PCPO with the \(KL\) divergence projection improve, the quadratic term on the right-hand side of Eq.35 needs to be negative. Hence we have \(\frac{L\eta}{2}\boldsymbol{I}\prec\boldsymbol{H}\), implying that \(\sigma_\mathrm{min}(\boldsymbol{H})>\frac{L\eta}{2}\). And to make the objective value for PCPO with the \(L2\) norm projection improve, the quadratic term on the right-hand side of Eq.37 needs to be negative. Hence we have \(\eta<\frac{2}{L}\), implying that

(40)#\[\begin{split}&\eta = \sqrt{\frac{2\delta}{g^T\boldsymbol{H}^{-1}g}}<\frac{2}{L}\nonumber\\ \Rightarrow& \frac{2\delta}{g^T\boldsymbol{H}^{-1}g} < \frac{4}{L^2} \nonumber\\ \Rightarrow& \frac{g^{T}\boldsymbol{H}^{-1}g}{2\delta}>\frac{L^2}{4}\nonumber\\ \Rightarrow& \frac{L^2\delta}{2}<g^T\boldsymbol{H}^{-1}g\nonumber\\ &\leq||g||_2||\boldsymbol{H}^{-1}g||_2\nonumber\\ &\leq||g||_2||\boldsymbol{H}^{-1}||_2||g||_2\nonumber\\ &=\sigma_\mathrm{max}(\boldsymbol{H}^{-1})||g||^2_2\nonumber\\ &=\sigma_\mathrm{min}(\boldsymbol{H})||g||^2_2\nonumber\\ \Rightarrow&\sigma_\mathrm{min}(\boldsymbol{H})>\frac{L^2\delta}{2||g||^2_2}. \label{eqnarray}\end{split}\]