First Order Constrained Optimization in Policy Space#

Quick Facts#

  1. FOCOPS is an on-policy algorithm.

  2. FOCOPS can be used for environments with both discrete and continuous action spaces.

  3. FOCOPS is a first-order method.

  4. API documentation is available for FOCOPS.

FOCOPS Theorem#

Background#

First Order Constrained Optimization in Policy Space (FOCOPS) is a first-order method that maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints. FOCOPS argues that CPO has the disadvantages below:

Problems of CPO

  • Sampling errors resulting from taking sample trajectories from the current policy.

  • Approximation errors resulting from Taylor approximations.

  • Approximation errors resulting from using the conjugate gradient method to compute the inverse of the Fisher information matrix.

Advantage of FOCOPS

  • Extremely simple to implement since it only utilizes first-order approximations.

  • Its simple first-order approach avoids the errors caused by Taylor approximations and by the conjugate gradient method.

  • Outperforms CPO in experiments.

  • No recovery steps are required.

FOCOPS mainly includes the following contributions:

  • Provides a two-stage policy update to optimize the current policy.

  • Gives the practical implementation for solving the two-stage policy update.

  • Offers rigorous proofs of the above results, as detailed in the Appendix of this tutorial.

One suggested reading order is CPO (Constrained Policy Optimization), then PCPO (Projection-Based Constrained Policy Optimization), and finally FOCOPS. If you have not yet read PCPO, that is fine. However, be sure to read this article after the CPO tutorial so that you can fully understand the following passage.


Optimization Objective#

In the previous chapters, you learned that CPO solves the following optimization problems:

(1)#\[\begin{split}\pi_{k+1}&=\arg \max _{\pi \in \Pi_{\boldsymbol{\theta}}} \mathbb{E}_{\substack{s \sim d_{\pi_k}\\a \sim \pi}}[A^R_{\pi_k}(s, a)]\\ \text{s.t.} \quad J^{C_i}\left(\pi_k\right) &\leq d_i-\frac{1}{1-\gamma} \mathbb{E}_{\substack{s \sim d_{\pi_k} \\ a \sim \pi}}\left[A^{C_i}_{\pi_k}(s, a)\right] \quad \forall i \\ \bar{D}_{K L}\left(\pi \| \pi_k\right) &\leq \delta\end{split}\]

where \(\Pi_{{\boldsymbol{\theta}}}\subseteq\Pi\) denotes the set of parametrized policies with parameters \({\boldsymbol{\theta}}\), and \(\bar{D}_{K L}\) is the average \(KL\) divergence between two policies. In local policy search for CMDPs, we additionally require the policy iterates to be feasible, so instead of optimizing over \(\Pi_{{\boldsymbol{\theta}}}\), the algorithm optimizes over \(\Pi_{{\boldsymbol{\theta}}}\cap\Pi_{C}\). Next, we will introduce how FOCOPS solves the above optimization problem. To gain a clearer understanding, we suggest reading the next section with the following questions in mind:

Questions

  • What is the two-stage policy update, and how does it work?

  • How to practically implement FOCOPS?

  • How do parameters impact the performance of the algorithm?


Two-stage Policy Update#

Instead of solving Eq.1 directly, FOCOPS uses a two-stage approach, summarized below:

Two-stage Policy Update

  • Given policy \(\pi_{{\boldsymbol{\theta}}_k}\), find an optimal update policy \(\pi^*\) by solving the optimization problem from Eq.1 in the non-parameterized policy space.

  • Project the policy found in the previous step back into the parameterized policy space \(\Pi_{{\boldsymbol{\theta}}}\) by searching for the closest policy \(\pi_{{\boldsymbol{\theta}}}\in\Pi_{{\boldsymbol{\theta}}}\) to \(\pi^*\), to obtain \(\pi_{{\boldsymbol{\theta}}_{k+1}}\).


Finding the Optimal Update Policy#

In the first stage, FOCOPS rewrites Eq.1 as below:

(2)#\[\begin{split}\pi^* &=\arg \max _{\pi \in \Pi} \mathbb{E}_{\substack{s \sim d_{\pi_k}\\a \sim \pi}}[A^R_{\pi_k}(s, a)]\\ \text{s.t.} \quad J^{C}\left(\pi_k\right) &\leq d-\frac{1}{1-\gamma} \mathbb{E}_{\substack{s \sim d_{\pi_k} \\ a \sim \pi}}\left[A^{C}_{\pi_k}(s, a)\right] \quad \\ \bar{D}_{K L}\left(\pi \| \pi_k\right) & \leq \delta\end{split}\]

This problem is only slightly different from Eq.1: the optimization is now over the non-parameterized policy \(\pi\) rather than the policy parameters \({\boldsymbol{\theta}}\). FOCOPS then provides the following solution:

Theorem 1

Let \(\tilde{b}=(1-\gamma)\left(d-J^C\left(\pi_{{\boldsymbol{\theta}}_k}\right)\right)\), where \(d\) is the cost limit in Eq.2. If \(\pi_{{\boldsymbol{\theta}}_k}\) is a feasible solution, the optimal policy for Eq.2 takes the form

(3)#\[\pi^*(a \mid s)=\frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right)\]

where \(Z_{\lambda,\nu}(s)\) is the partition function which ensures Eq.3 is a valid probability distribution, \(\lambda\) and \(\nu\) are solutions to the optimization problem:

(4)#\[\begin{split}\min _{\lambda, \nu \geq 0} \lambda \delta+\nu \tilde{b}+\lambda \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}[\log Z_{\lambda, \nu}(s)]\end{split}\]

The form of the optimal policy is intuitive: it assigns high probability mass to areas of the state-action space with high return, offset by a penalty term \(\nu\) times the cost advantage. We refer to the optimal solution of Eq.2 as the optimal update policy. If the equations above seem opaque at first, simply remember that FOCOPS ultimately solves Eq.2 by solving Eq.3 and Eq.4; Theorem 1 provides a viable solution.
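
As a concrete illustration, here is a minimal sketch of Eq.3 for a single state with a discrete action space, written with plain PyTorch tensors (the helper name and all numbers are ours, for illustration only, and are not part of OmniSafe):

import torch

def optimal_update_policy(pi_old, adv_r, adv_c, lam, nu):
    # Unnormalized pi*(a|s) from Eq.3: pi_{theta_k}(a|s) * exp((A^R - nu * A^C) / lam).
    unnormalized = pi_old * torch.exp((adv_r - nu * adv_c) / lam)
    # Dividing by the partition function Z_{lam,nu}(s) yields a valid distribution.
    return unnormalized / unnormalized.sum()

# Toy usage: three actions under a uniform old policy.
pi_star = optimal_update_policy(
    pi_old=torch.tensor([1 / 3, 1 / 3, 1 / 3]),
    adv_r=torch.tensor([1.0, 0.0, -1.0]),
    adv_c=torch.tensor([0.0, 1.0, 0.0]),
    lam=1.5,
    nu=0.5,
)
print(pi_star, pi_star.sum())  # mass shifts toward actions with high A^R - nu * A^C; sums to 1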

Question

What is FOCOPS's worst-case guarantee for cost-constraint violation?

Answer

FOCOPS shows that the optimal update policy \(\pi^*\) satisfies the following worst-case bound for cost-constraint violation, analogous to the guarantee in CPO:

(5)#\[J^C\left(\pi^*\right) \leq d+\frac{\sqrt{2 \delta} \gamma \epsilon_{\pi^*}^C}{(1-\gamma)^2}\]

where \(\epsilon^C_{\pi^*}=\max _s\left|\underset{a \sim \pi^*}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\right|\).

Question

Can FOCOPS solve multi-constraint problems, and how?

Answer

By introducing Lagrange multipliers \(\nu_1,\nu_2,\ldots,\nu_m\ge0\), one for each cost constraint, and applying a similar duality argument, FOCOPS extends its results to accommodate multiple constraints.


Approximating the Optimal Update Policy#

The optimal update policy \(\pi^*\) obtained in the previous section is not a parameterized policy. In this section, we show how FOCOPS projects the optimal update policy back into the parameterized policy space by minimizing the loss function:

(6)#\[\mathcal{L}({\boldsymbol{\theta}})=\underset{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]\right]\]

Here \(\pi_{{\boldsymbol{\theta}}}\in \Pi_{{\boldsymbol{\theta}}}\) is the projected policy that FOCOPS uses to approximate the optimal update policy. First-order methods are again used to minimize this loss function:

Corollary 1

The gradient of \(\mathcal{L}({\boldsymbol{\theta}})\) takes the form

(7)#\[\nabla_{\boldsymbol{\theta}} \mathcal{L}({\boldsymbol{\theta}})=\underset{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[\nabla_{\boldsymbol{\theta}} D_{K L}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]\right]\]

where

(8)#\[\begin{split}\nabla_{\boldsymbol{\theta}} D_{K L}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s] &=\nabla_{\boldsymbol{\theta}} D_{K L}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)[s] \\ & -\frac{1}{\lambda} \underset{a \sim \pi_{{\boldsymbol{\theta}}_k}}{\mathbb{E}}\left[\frac{\nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a \mid s)}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\end{split}\]

Note that the gradient in Eq.7 can be estimated by sampling from trajectories generated by the policy \(\pi_{{\boldsymbol{\theta}}_k}\), so the loss can be minimized with stochastic gradient descent.
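
As a sanity check, the following toy, self-contained PyTorch snippet builds such a stochastic estimate for a one-dimensional Gaussian policy; all tensors and the policy itself are fabricated for illustration and are not OmniSafe code. Calling backward() on the loss produces a mini-batch estimate of the gradient in Eq.8:

import torch
from torch.distributions import Normal, kl_divergence

mean = torch.zeros(4, requires_grad=True)   # current policy pi_theta (learnable mean)
old_mean = torch.zeros(4)                   # old policy pi_{theta_k}
act = torch.tensor([0.1, -0.3, 0.2, 0.0])   # actions sampled from pi_{theta_k}
adv_r = torch.tensor([1.0, -0.5, 0.3, 0.2])
adv_c = torch.tensor([0.0, 1.0, 0.5, 0.0])
lam, nu = 1.5, 0.5

dist, old_dist = Normal(mean, 1.0), Normal(old_mean, 1.0)
ratio = torch.exp(dist.log_prob(act) - old_dist.log_prob(act))  # pi_theta / pi_{theta_k}
kl = kl_divergence(dist, old_dist)                              # per-state KL divergence
loss = (kl - (1.0 / lam) * ratio * (adv_r - nu * adv_c)).mean()
loss.backward()                                                 # mini-batch estimate of Eq.8
print(mean.grad)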

Corollary 1 outlines the FOCOPS algorithm:

Note

At every iteration, we begin with a policy \(\pi_{{\boldsymbol{\theta}}_k}\), which we use to run trajectories and gather data. We first use that data to estimate \(\lambda\) and \(\nu\) in Eq.4. We then draw a mini-batch from the data to estimate \(\nabla_{\boldsymbol{\theta}} \mathcal{L}({\boldsymbol{\theta}})\) as given in Corollary 1. After taking a gradient step using Eq.7, we draw another mini-batch and repeat the process.


Practical Implementation#

Hint

Solving Eq.4 is computationally impractical for large state or action spaces, as it requires calculating the partition function \(Z_{\lambda,\nu}(s)\), which often involves evaluating a high-dimensional integral or sum. Furthermore, \(\lambda\) and \(\nu\) depend on \(k\) and should be adapted at every iteration.

This section introduces how FOCOPS implements its algorithm in practice. Through hyperparameter sweeps, FOCOPS found that a fixed \(\lambda\) provides good results, which means the value of \(\lambda\) does not have to be updated. However, \(\nu\) needs to be continuously adapted during training to ensure cost-constraint satisfaction. FOCOPS appeals to an intuitive heuristic for determining \(\nu\) based on primal-dual gradient methods. By strong duality, the optimal \(\lambda^*\) and \(\nu^*\) minimize the dual function in Eq.4, which we denote by \(L(\pi^*,\lambda,\nu)\). Applying gradient descent w.r.t. \(\nu\) to minimize \(L(\pi^*,\lambda,\nu)\), we obtain:

Corollary 2

The derivative of \(L(\pi^*,\lambda,\nu)\) w.r.t \(\nu\) is

(9)#\[\begin{split}\frac{\partial L\left(\pi^*, \lambda, \nu\right)}{\partial \nu}=\tilde{b}-\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\end{split}\]

The last term in the gradient expression in Eq.9 cannot be evaluated exactly since we do not have access to \(\pi^*\). However, since \(\pi_{{\boldsymbol{\theta}}_k}\) and \(\pi^*\) are close, it is reasonable to assume that \(\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}\\ a \sim \pi^*}}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right] \approx \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}\\ a \sim \pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]=0\). In practice, this term is therefore set to zero, which gives the update rule:

(10)#\[\nu \leftarrow \underset{\nu}{\operatorname{proj}}\left[\nu-\alpha\left(d-J^C\left(\pi_{{\boldsymbol{\theta}}_k}\right)\right)\right]\]

where \(\alpha\) is the step size. Note that the discount factor \((1-\gamma)\) from \(\tilde{b}\) has been absorbed into the step size. The projection operator \(\operatorname{proj}_{\nu}\) projects \(\nu\) back into the interval \([0,\nu_{max}]\), where \(\nu_{max}\) is chosen so that \(\nu\) does not become too large; in fact, FOCOPS reports that even setting \(\nu_{max}=+\infty\) does not appear to greatly reduce performance. In practice, \(J^C(\pi_{{\boldsymbol{\theta}}_k})\) can be estimated via Monte Carlo methods using trajectories collected from \(\pi_{{\boldsymbol{\theta}}_k}\). Using the update rule Eq.10, FOCOPS performs one update step on \(\nu\) before updating the policy parameters \({\boldsymbol{\theta}}\). A per-state acceptance indicator function \(I\left(s_j\right)^n:=\mathbf{1}_{D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)\left[s_j\right] \leq \delta}\) is also added to Eq.7 in order to better enforce the KL-divergence constraint for this first-order method.
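
Below is a minimal sketch of the update rule Eq.10 together with the projection step, using hypothetical variable names rather than OmniSafe's Lagrange-multiplier class (the default nu_max here is an arbitrary placeholder):

def update_nu(nu, ep_cost, cost_limit, alpha, nu_max=2.0):
    # Gradient step from Eq.10: nu grows when the estimated cost J^C(pi_{theta_k})
    # exceeds the limit d, and shrinks otherwise.
    nu = nu - alpha * (cost_limit - ep_cost)
    # proj_nu: clip nu back into the interval [0, nu_max].
    return min(max(nu, 0.0), nu_max)

# Example: the cost estimate exceeds the limit, so nu increases from 0.1 to 0.15.
nu = update_nu(nu=0.1, ep_cost=30.0, cost_limit=25.0, alpha=0.01)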

Hint

Here \(N\) is the number of samples collected by the policy \(\pi_{{\boldsymbol{\theta}}_k}\), and \(\hat A^R\) and \(\hat A^C\) are estimates of the reward and cost advantage functions obtained from the critic networks, computed with the Generalized Advantage Estimator (GAE). Note that FOCOPS only requires first-order methods (gradient descent) and is thus extremely simple to implement.
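
For reference, here is a minimal single-trajectory GAE sketch (a simplified stand-in, not OmniSafe's buffer implementation); the same routine applied to the cost signal yields \(\hat A^C\):

import torch

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over one trajectory of length T.
    advantages = torch.zeros_like(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_value - values[t]
        # Discounted, lambda-weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

adv_r_hat = gae(torch.tensor([1.0, 0.0, 1.0]), torch.tensor([0.5, 0.4, 0.3]), last_value=0.0)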


Variables Analysis#

In this section, we explain the meaning of the parameters \(\lambda\) and \(\nu\) in FOCOPS and their impact on the algorithm's performance in experiments.

Analysis of \(\lambda\)

In Eq.3, note that as \(\lambda \rightarrow 0\), \(\pi^*\) approaches a greedy policy; as \(\lambda\) increases, the policy becomes more exploratory. Therefore, \(\lambda\) is similar to the temperature term used in maximum entropy reinforcement learning, which has been shown to produce good results when kept fixed during training. In practice, FOCOPS finds that the algorithm reaches its best performance when \(\lambda\) is fixed.

Analysis of \(\nu\)

We recall that in Eq.3, \(\nu\) acts as a cost-penalty term: increasing \(\nu\) makes it less likely for state-action pairs with higher costs to be sampled by \(\pi^*\). In this regard, the update rule in Eq.10 is intuitive: it increases \(\nu\) if \(J^C(\pi_{{\boldsymbol{\theta}}_k})>d\) (i.e., the agent violates the cost constraint) and decreases \(\nu\) otherwise.


Code with OmniSafe#

Quick start#

Run FOCOPS in OmniSafe

Here are 3 ways to run FOCOPS in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('FOCOPS', env_id)
agent.learn()
import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('FOCOPS', env_id, custom_cfgs=custom_cfgs)
agent.learn()

We use train_policy.py as the entry file. You can train the agent with FOCOPS simply by running train_policy.py with arguments specifying FOCOPS and the environment. For example, to run FOCOPS on SafetyPointGoal1-v0 with 1 torch thread, seed 0, and a single environment, you can use the following command:

cd examples
python train_policy.py --algo FOCOPS --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • FOCOPS.learn()

    • FOCOPS._env.rollout()

    • FOCOPS._update()

      • FOCOPS._buf.get()

      • FOCOPS._update_lagrange()

      • FOCOPS._update_actor()

      • FOCOPS._update_cost_critic()

      • FOCOPS._update_reward_critic()


Documentation of algorithm-specific functions#

FOCOPS._compute_adv_surrogate()

Compute the surrogate advantage function.

# Combine the reward and cost advantages using the current Lagrange multiplier,
# normalized by (1 + multiplier).
return (adv_r - self._lagrange.lagrangian_multiplier * adv_c) / (
    1 + self._lagrange.lagrangian_multiplier
)
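
For intuition, here is a standalone check of the same combination with plain tensors (the numbers are made up for illustration):

import torch

adv_r = torch.tensor([1.0, 0.5])
adv_c = torch.tensor([2.0, 0.0])
nu = 0.5  # stand-in for the current Lagrange multiplier

# Matches the surrogate above: (A^R - nu * A^C) / (1 + nu).
adv = (adv_r - nu * adv_c) / (1 + nu)
print(adv)  # tensor([0.0000, 0.3333])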

FOCOPS._loss_pi()

Compute the loss of the policy network.

In FOCOPS, the loss is defined as:

(11)#\[\mathcal{L} = D_{K L}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)[s] -\frac{1}{\lambda} \underset{a \sim \pi_{{\boldsymbol{\theta}}_k}} {\mathbb{E}}\left[\frac{\pi_{\boldsymbol{\theta}}(a \mid s)} {\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(A^{R}_{\pi_{{\boldsymbol{\theta}}_k}}(s, a) -\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\]

In the code, the loss is computed as follows:

distribution = self._actor_critic.actor(obs)
logp_ = self._actor_critic.actor.log_prob(act)
std = self._actor_critic.actor.std
# Importance-sampling ratio pi_theta(a|s) / pi_{theta_k}(a|s).
ratio = torch.exp(logp_ - logp)

# Per-state KL divergence between the current policy and the old policy pi_{theta_k}.
kl = torch.distributions.kl_divergence(distribution, self._p_dist).sum(-1, keepdim=True)
# FOCOPS loss: KL term minus the scaled, importance-weighted surrogate advantage,
# masked by the per-state acceptance indicator (kl <= focops_eta).
loss = (kl - (1 / self._cfgs.algo_cfgs.focops_lam) * ratio * adv) * (
    kl.detach() <= self._cfgs.algo_cfgs.focops_eta
).type(torch.float32)
loss = loss.mean()
# Entropy regularization to encourage exploration.
loss -= self._cfgs.algo_cfgs.entropy_coef * distribution.entropy().mean()

Configs#

Train Configs

  • device (str): Device to use for training, options: cpu, cuda, cuda:0, etc.

  • torch_threads (int): Number of threads to use for PyTorch.

  • total_steps (int): Total number of steps to train the agent.

  • parallel (int): Number of parallel agents, similar to A3C.

  • vector_env_nums (int): Number of the vector environments.

Algorithms Configs

Note

The following configs are specific to the FOCOPS algorithm.

  • clip (float): Clipping parameter for FOCOPS.

  • steps_per_epoch (int): Number of steps to update the policy network.

  • update_iters (int): Number of iterations to update the policy network.

  • batch_size (int): Batch size for each iteration.

  • target_kl (float): Target KL divergence.

  • entropy_coef (float): Coefficient of entropy.

  • reward_normalize (bool): Whether to normalize the reward.

  • cost_normalize (bool): Whether to normalize the cost.

  • obs_normalize (bool): Whether to normalize the observation.

  • kl_early_stop (bool): Whether to stop the training when KL divergence is too large.

  • max_grad_norm (float): Maximum gradient norm.

  • use_max_grad_norm (bool): Whether to use maximum gradient norm.

  • use_critic_norm (bool): Whether to use critic norm.

  • critic_norm_coef (float): Coefficient of critic norm.

  • gamma (float): Discount factor.

  • cost_gamma (float): Cost discount factor.

  • lam (float): Lambda for GAE-Lambda.

  • lam_c (float): Lambda for cost GAE-Lambda.

  • adv_estimation_method (str): The method to estimate the advantage.

  • standardized_rew_adv (bool): Whether to use standardized reward advantage.

  • standardized_cost_adv (bool): Whether to use standardized cost advantage.

  • penalty_coef (float): Penalty coefficient for cost.

  • use_cost (bool): Whether to use cost.

Model Configs

  • weight_initialization_mode (str): The type of weight initialization method.

  • actor_type (str): The type of actor; defaults to gaussian_learning.

  • linear_lr_decay (bool): Whether to use linear learning rate decay.

  • exploration_noise_anneal (bool): Whether to use exploration noise anneal.

  • std_range (list): The range of standard deviation.

Hint

actor (dictionary): parameters for the actor network

  • activations: tanh

  • hidden_sizes: [64, 64]

Hint

critic (dictionary): parameters for the critic network

  • activations: tanh

  • hidden_sizes: [64, 64]

Logger Configs

  • use_wandb (bool): Whether to use wandb to log the training process.

  • wandb_project (str): The name of wandb project.

  • use_tensorboard (bool): Whether to use tensorboard to log the training process.

  • log_dir (str): The directory to save the log files.

  • window_lens (int): The length of the window to calculate the average reward.

  • save_model_freq (int): The frequency to save the model.

Lagrange Configs

Note

The following configs are specific to the FOCOPS algorithm.

  • lagrangian_upper_bound (float): Upper bound of Lagrange multiplier.

  • cost_limit (float): Tolerance of constraint violation.

  • lagrangian_multiplier_init (float): Initial value of Lagrange multiplier.

  • lambda_lr (float): Learning rate of Lagrange multiplier.

  • lambda_optimizer (str): Optimizer for Lagrange multiplier.



Appendix#

Proof for Theorem 1#

Lemma 1

Problem Eq.2 is convex w.r.t. \(\pi=\{\pi(a|s): s\in \mathcal{S}, a\in\mathcal{A}\}\).

Proof of Lemma 1

First, note that the objective function is linear w.r.t. \(\pi\). Since \(J^{C}(\pi_{{\boldsymbol{\theta}}_k})\) is a constant w.r.t. \(\pi\), the cost constraint in Eq.2 is linear in \(\pi\). The KL constraint in Eq.2 can be rewritten as \(\sum_s d_{\pi_{{\boldsymbol{\theta}}_k}}(s) D_{\mathrm{KL}}\left(\pi \| \pi_{{\boldsymbol{\theta}}_k}\right)[s] \leq \delta\); since the \(KL\) divergence is convex w.r.t. its first argument, this constraint, a non-negative weighted sum of convex functions, is also convex. Finally, \(\pi_{{\boldsymbol{\theta}}_k}\) itself satisfies both constraints (it is feasible and has zero KL divergence from itself), so Slater's constraint qualification holds, and strong duality holds.

Proof of Theorem 1

By Lemma 1, the optimal value \(p^*\) of Eq.2 can be obtained by solving the corresponding dual problem. Let

(12)#\[L(\pi, \lambda, \nu)=\lambda \delta+\nu \tilde{b}+\underset{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[A^{lag}-\lambda D_{\mathrm{KL}}\left(\pi \| \pi_{{\boldsymbol{\theta}}_k}\right)[s]\right]\nonumber\]

where \(A^{lag}=\underset{a \sim \pi(\cdot \mid s)}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\). Therefore,

(13)#\[p^*=\max _{\pi \in \Pi} \min _{\lambda, \nu \geq 0} L(\pi, \lambda, \nu)=\min _{\lambda, \nu \geq 0} \max _{\pi \in \Pi} L(\pi, \lambda, \nu)\]

Note that if \(\pi^*\), \(\lambda^*\), \(\nu^*\) are optimal for Eq.13, \(\pi^*\) is also optimal for Eq.2 because of the strong duality.

Consider the inner maximization problem in Eq.13. We separate it from the original problem and try to solve it first:

(14)#\[\begin{split}&\underset{\pi}{\operatorname{max}}\, A^{lag}-\underset{a \sim \pi(\cdot \mid s)}{\mathbb{E}}\left[\lambda\left(\log \pi(a \mid s)-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)\right] \\ \text { s.t. } & \sum_a \pi(a \mid s)=1 \\ & \pi(a \mid s) \geq 0 \quad \forall a \in \mathcal{A}\end{split}\]

This is equivalent to the inner maximization problem in Eq.13. We can solve this convex optimization problem using a simple Lagrangian argument, writing its Lagrangian as:

(15)#\[G(\pi)=\sum_a \pi(a \mid s)\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a) -\lambda\left(\log \pi(a \mid s)-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)+\zeta\right]-\zeta\]

where \(\zeta\) is the Lagrange multiplier associated with the constraint \(\sum_a \pi(a \mid s)=1\). Differentiating \(G(\pi)\) w.r.t. \(\pi(a \mid s)\) for some \(a\):

(16)#\[\frac{\partial G}{\partial \pi(a \mid s)}=A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\lambda\left(\log \pi(a \mid s)+1-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)+\zeta\]

Setting Eq.16 to zero and rearranging the term, we obtain:

(17)#\[\pi(a \mid s)=\pi_{{\boldsymbol{\theta}}_k}(a \mid s) \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)+\frac{\zeta}{\lambda}-1\right)\]

We choose \(\zeta\) so that \(\sum_a \pi(a \mid s)=1\). Since the term \(\zeta/\lambda-1\) is constant w.r.t. \(a\), we absorb it into the normalization and write \(\exp\left(1-\zeta/\lambda\right)\) as \(Z_{\lambda, \nu}(s)\). We find that the optimal solution \(\pi^*\) to Eq.14 takes the form

(18)#\[\pi^*(a \mid s)=\frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right)\]
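
For completeness, normalizing Eq.18 over actions gives the explicit form of the partition function (it also appears in the first line of Eq.28):

\[Z_{\lambda, \nu}(s)=\sum_a \pi_{{\boldsymbol{\theta}}_k}(a \mid s) \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right)\]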

Substituting \(\pi^*\) back into the inner objective, we then obtain:

(19)#\[\begin{split}&\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\lambda\left(\log \pi^*(a \mid s)-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)\right] \\ = &\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\lambda\left(\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)-\log Z_{\lambda, \nu}(s)\right.\right. \\ &\left.\left. + \frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)\right]\\ = &\lambda\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}[\log Z_{\lambda,\nu}(s)]\nonumber\end{split}\]

Plugging this result back into Eq.13, we obtain:

(20)#\[\begin{split}p^*=\underset{\lambda,\nu\ge0}{\min}\lambda\delta+\nu\tilde{b}+\lambda\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}[\log Z_{\lambda,\nu}(s)]\end{split}\]

Proof of Corollary#

Proof of Corollary 1

We only need to calculate the gradient of the loss function for a single sampled state \(s\). We first note that

(21)#\[\begin{split}&D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]\\ =&-\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \pi^*(a \mid s)+\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \pi_{\boldsymbol{\theta}}(a \mid s) \\ =&H\left(\pi_{\boldsymbol{\theta}}, \pi^*\right)[s]-H\left(\pi_{\boldsymbol{\theta}}\right)[s]\end{split}\]

where \(H\left(\pi_{\boldsymbol{\theta}}\right)[s]\) is the entropy and \(H\left(\pi_{\boldsymbol{\theta}}, \pi^*\right)[s]\) is the cross-entropy under state \(s\). This identity is standard in information theory and can be found in any textbook on the subject. Expanding the cross-entropy term gives us the following:

(22)#\[\begin{split}&H\left(\pi_{\boldsymbol{\theta}}, \pi^*\right)[s]\\ &=-\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \pi^*(a \mid s) \\ &=-\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \left(\frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left[\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\right) \\ &=-\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)+\log Z_{\lambda, \nu}(s)-\frac{1}{\lambda} \sum_a \pi_{\boldsymbol{\theta}}(a \mid s)\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\end{split}\]

We then subtract the entropy term to recover the \(KL\) divergence:

(23)#\[\begin{split}&D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]=D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)[s]+\log Z_{\lambda, \nu}(s)-\\&\frac{1}{\lambda} \underset{a \sim \pi_{{\boldsymbol{\theta}}_k}(\cdot \mid s)}{\mathbb{E}}\left[\frac{\pi_{\boldsymbol{\theta}}(a \mid s)}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\nonumber\end{split}\]

In the last equality, we applied importance sampling to rewrite the expectation w.r.t. \(\pi_{{\boldsymbol{\theta}}_k}\). Finally, taking the gradient on both sides gives us the following:

(24)#\[\begin{split}&\nabla_{\boldsymbol{\theta}} D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]=\nabla_{\boldsymbol{\theta}} D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)[s]\\&-\frac{1}{\lambda} \underset{a \sim \pi_{{\boldsymbol{\theta}}_k}(\cdot \mid s)}{\mathbb{E}}\left[\frac{\nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a \mid s)}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\nonumber\end{split}\]

Proof of Corollary 2

From Theorem 1, we have:

(25)#\[\begin{split}L\left(\pi^*, \lambda, \nu\right)=\lambda \delta+\nu \tilde{b}+\lambda \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right]\end{split}\]

The first two terms are affine w.r.t. \(\nu\), so their derivative w.r.t. \(\nu\) is simply \(\tilde{b}\). We therefore focus on the expectation in the last term. To simplify the derivation, we first calculate the derivative of \(\pi^*\) w.r.t. \(\nu\):

(26)#\[\begin{split}\frac{\partial \pi^*(a \mid s)}{\partial \nu} &=\frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}^2(s)}\left[Z_{\lambda, \nu}(s) \frac{\partial}{\partial \nu} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right)\right.\\ &\left.-\exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right) \frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu}\right] \\ &=-\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \pi^*(a \mid s)-\pi^*(a \mid s) \frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\nonumber\end{split}\]

Therefore the derivative of the expectation in the last term of \(L(\pi^*,\lambda,\nu)\) can be written as:

(27)#\[\begin{split}\frac{\partial}{\partial \nu} \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right] &= \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[\frac{\partial}{\partial \nu}\left(\frac{\pi^*(a \mid s)}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)} \log Z_{\lambda, \nu}(s)\right)\right] \\ &= \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[\frac{1}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(\frac{\partial \pi^*(a \mid s)}{\partial \nu} \log Z_{\lambda, \nu}(s)+\pi^*(a \mid s) \frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right)\right] \\ &= \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[-\left(\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda}+\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right) \log Z_{\lambda, \nu}(s)+\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right]\end{split}\]

Also:

(28)#\[\begin{split}\frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu} &=\frac{\partial}{\partial \nu} \sum_a \pi_{{\boldsymbol{\theta}}_k}(a \mid s) \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right) \\ &=\sum_a-\pi_{{\boldsymbol{\theta}}_k}(a \mid s) \frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right) \\ &=\sum_a-\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right) Z_{\lambda, \nu}(s) \\ &=-\frac{Z_{\lambda, \nu}(s)}{\lambda} \underset{a \sim \pi^*(\cdot \mid s)}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\end{split}\]

Therefore:

(29)#\[\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}=\frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu} \frac{1}{Z_{\lambda, \nu}(s)}=-\frac{1}{\lambda} \underset{a \sim \pi^*(\cdot \mid s)}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\]

Plugging Eq.29 into the last equality in Eq.27 gives us:

(30)#\[\begin{split}\frac{\partial}{\partial \nu} \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right] &=\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[-\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \log Z_{\lambda, \nu}(s)+\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \log Z_{\lambda, \nu}(s)-\frac{1}{\lambda} A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right] \\ &=-\frac{1}{\lambda} \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\end{split}\]

Combining Eq.30 with the derivative of the affine terms gives us the final desired result.