First Order Constrained Optimization in Policy Space#
Quick Facts#
FOCOPS is an on-policy algorithm.
FOCOPS can be used for environments with both discrete and continuous action spaces.
FOCOPS is an algorithm using first-order method.
An API Documentation is available for FOCOPS.
FOCOPS Theorem#
Background#
First Order Constrained Optimization in Policy Space (FOCOPS) is a first-order method that maximizes an agent’s overall reward while ensuring the agent satisfies a set of cost constraints. FOCOPS purposes that CPO has disadvantages below:
Problems of CPO
Error resulting from taking sample trajectories from the current policy.
Approximation errors resulting from Taylor approximations.
Approximation errors result from using the conjugate method to calculate the inverse of the Fisher information matrix.
Advantage of FOCOPS
Extremely simple to implement since it only utilizes first-order approximations.
Simple first-order method avoids error caused by Taylor method and the conjugate method.
Outperform CPO in the experiment.
No recovery steps are required.
FOCOPS mainly includes the following contributions:
Provides a two-stage policy update to optimize the current policy.
Gives the practical implementation for solving the two-stage policy update.
Offers rigorous derivative proofs for the above theories, as detailed in the Appendix to this tutorial.
One suggested reading order is CPO(Constrained Policy Optimization), PCPO(Projection-Based Constrained Policy Optimization), then FOCOPS. If you have yet to read the PCPO, it does not matter. Nevertheless, be sure to read this article after reading the CPO tutorial we have written so that you can fully understand the following passage.
Optimization Objective#
In the previous chapters, you learned that CPO solves the following optimization problems:
where \(\prod_{{\boldsymbol{\theta}}}\subseteq\prod\) denotes the parametrized policies with parameters \({\boldsymbol{\theta}}\), and \(\bar{D}_{K L}\) is the \(KL\) divergence of two policies. In local policy search for CMDPs, we require policy iterates to be feasible. Instead of optimizing over \(\prod_{{\boldsymbol{\theta}}}\), PCPO optimizes over \(\prod_{{\boldsymbol{\theta}}}\cap\prod_{C}\). Next, we will introduce you to how FOCOPS solves the above optimization problems. For you to have a clearer understanding, we hope that you will read the next section with the following questions:
Questions
What is a two-stage policy update, and how?
How to practically implement FOCOPS?
How do parameters impact the performance of the algorithm?
Two-stage Policy Update#
Instead of solving the Eq.1 directly, FOCOPS uses a two-stage approach summarized below:
Two-stage Policy Update
Given policy \(\pi_{{\boldsymbol{\theta}}_k}\), find an optimal update policy \(\pi^*\) by solving the optimization problem from Eq.1 in the non-parameterized policy space.
Project the policy found in the previous step back into the parameterized policy space \(\Pi_{{\boldsymbol{\theta}}}\) by searching for the closest policy \(\pi_{{\boldsymbol{\theta}}}\in\Pi_{{\boldsymbol{\theta}}}\) to \(\pi^*\), to obtain \(\pi_{{\boldsymbol{\theta}}_{k+1}}\).
Finding the Optimal Update Policy#
In the first stage, FOCOPS rewrites Eq.1 as below:
These problems are only slightly different from Eq.1 , that is, what we focus on now is the non-parameterized policy \(\pi\) but not the policy parameter \({\boldsymbol{\theta}}\). Then FOCOPS provides a solution as follows:
Theorem 1
Let \(\tilde{b}=(1-\gamma)\left(b-\tilde{J}^C\left(\pi_{{\boldsymbol{\theta}}_k}\right)\right)\). If \(\pi_{{\boldsymbol{\theta}}_k}\) is a feasible solution, the optimal policy for Eq.2 takes the form
where \(Z_{\lambda,\nu}(s)\) is the partition function which ensures Eq.3 is a valid probability distribution, \(\lambda\) and \(\nu\) are solutions to the optimization problem:
The form of the optimal policy is intuitive. It gives high probability mass to areas of the state-action space with high return, offset by a penalty term times the cost advantage. We will refer to the optimal solution to Eq.2 as the optimal update policy. Suppose you need help understanding the meaning of the above Equation. In that case, you can first think that FOCOPS finally solves Eq.2 by solving Eq.3 and Eq.4. Theorem 1 is a viable solution.
Question
What is the bound for FOCOPS worst-case guarantee for cost constraint?
Question
Can FOCOPS solve the multi-constraint problem and how?
Answer
FOCOPS purposes that the optimal update policy \(\pi^*\) satisfies the following bound for the worst-case guarantee for cost constraint in CPO:
where \(\epsilon^C_{\pi^*}=\max _s\left|\underset{a \sim \pi}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\right|\).
Answer
By introducing Lagrange multipliers \(\nu_1,\nu_2,...,\nu_m\ge0\), one for each cost constraint and applying a similar duality argument, FOCOPS extends its results to accommodate for multiple constraints.
Approximating the Optimal Update Policy#
The optimal update policy \(\pi^*\) is obtained in the previous section. However, it is not a parameterized policy. In this section, we will show you how FOCOPS projects the optimal update policy back into the parameterized policy space by minimizing the loss function:
Here \(\pi_{{\boldsymbol{\theta}}}\in \Pi_{{\boldsymbol{\theta}}}\) is some projected policy that FOCOPS will use to approximate the optimal update policy. The first-order methods are also used to minimize this loss function:
Corollary 1
The gradient of \(\mathcal{L}({\boldsymbol{\theta}})\) takes the form
where
Note that Eq.7 can be estimated by sampling from the trajectories generated by policy \(\pi_{{\boldsymbol{\theta}}_k}\) so it can be trained using stochastic gradients.
Corollary 1 outlines the FOCOPS algorithm:
Note
At every iteration, we begin with a policy \(\pi_{{\boldsymbol{\theta}}_k}\), which we use to run trajectories and gather data. We use that data and Eq.4 first to estimate \(\lambda\) and \(\nu\). We then draw a mini-batch from the data to estimate \(\nabla_{\boldsymbol{\theta}} \mathcal{L}({\boldsymbol{\theta}})\) given in Corollary 1. After taking a gradient step using Equation Eq.7, we draw another mini-batch then repeat the process.
Practical Implementation#
Hint
Solving Eq.4 is computationally impractical for large state or action spaces as it requires calculating the partition function \(Z_{\lambda,\nu}(s)\), which often involves evaluating a high-dimensional integral or sum. Furthermore, \(\lambda\) and \(\nu\) are depend on \(k\) and should be adapted at every iteration.
This section will introduce you to how FOCOPS practically implements its algorithm. In practice, though hyperparameter sweeps, FOCOPS found that a fixed \(\lambda\) provides good results, which means the value \(\lambda\) does not have to be updated. However, \(\nu\) needs to be continuously adapted during training to ensure cost-constraint satisfaction. FOCOPS appeals to an intuitive heuristic for determining \(\nu\) based on primal-dual gradient methods. With strong duality, the optimal \(\lambda^*\) and \(\nu\) minimizes the dual function Eq.4, which is then denoted as \(L(\pi^*,\lambda,\nu)\). By applying gradient descent w.r.t \(\nu\) to minimize \(L(\pi^*,\lambda,\nu)\), we obtain:
Corollary 2
The derivative of \(L(\pi^*,\lambda,\nu)\) w.r.t \(\nu\) is
The last term in the gradient expression in Eq.9 cannot be evaluated since we do not have access to \(\pi^*\). Since \(\pi_{{\boldsymbol{\theta}}_k}\) and \(\pi^*\) are close, it is reasonable to assume that \(\underset{\substack{s \sim d_{\pi_k}\\ a \sim \pi^*}}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right] \approx \underset{\substack{s \sim d_{\pi_k}\\ a \sim \pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]=0\). In practice, this term can be set to zero, which gives the updated term:
where \(\alpha\) is the step size. Note that we have incorporated the discount term \((1-\gamma)\) into \(\tilde{b}\) into the step size. The projection operator \(proj_{\nu}\) projects \(\nu\) back into the interval \([0,\nu_{max}]\), where \(\nu_{max}\) is chosen so that \(\nu\) does not become too large. In fact. FOCOPS purposed that even setting \(\nu_{max}=+\infty\) does not appear to reduce performance greatly. Practically, \(J^C(\pi_{{\boldsymbol{\theta}}_k})\) can be estimated via Monte Carlo methods using trajectories collected from \(\pi_{{\boldsymbol{\theta}}_k}\). Using the update rule Eq.10, FOCOPS performs one update step on \(\nu\) before updating the policy parameters \({\boldsymbol{\theta}}\). A per-state acceptance indicator function \(I\left(s_j\right)^n:=\mathbf{1}_{D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)\left[s_j\right] \leq \delta}\) is added to Eq.7, in order better to enforce the accuracy for the first-order purposed method.
Hint
Here \(N\) is the number of samples collected by policy \(\pi_{{\boldsymbol{\theta}}_k}\). \(\hat A^R\) and \(\hat A^C\) are estimates of the advantage functions (for the return and cost) obtained from critic networks. The advantage functions are obtained using the Generalized Advantage Estimator (GAE). Note that FOCOPS only requires first-order methods (gradient descent) and is thus extremely simple to implement.
Variables Analysis#
In this section, we will explain the meaning of parameters \(\lambda\) and \(\mu\) of FOCOPS and their impact on the algorithm’s performance in the experiment.
Analysis of \(\lambda\)
In Eq.3, note that as \(\lambda \rightarrow 0\), \(\pi^*\) approaches a greedy policy; as \(\lambda\) increases, the policy becomes more exploratory. Therefore \(\lambda\) is similar to the temperature term used in maximum entropy reinforcement learning, which has been shown to produce good results when fixed during training. In practice, FOCOPS finds that its algorithm reaches the best performance when the \(\lambda\) is fixed.
Analysis of \(\nu\)
We recall that in Eq.3, \(\nu\) acts as a cost penalty term. Increasing \(\nu\) makes it less likely for state-action pairs with higher costs to be sampled by \(\pi^*\). Hence in this regard, the update rule in Eq.10 is intuitive, because it increases \(\nu\) if \(J^C(\pi_{{\boldsymbol{\theta}}_k})>d\) (which means the agent violates the cost constraints) and decreases \(\nu\) otherwise.
Code with OmniSafe#
Quick start#
Run FOCOPS in OmniSafe
Here are 3 ways to run FOCOPS in OmniSafe:
Run Agent from preset yaml file
Run Agent from custom config dict
Run Agent from custom terminal config
1import omnisafe
2
3
4env_id = 'SafetyPointGoal1-v0'
5
6agent = omnisafe.Agent('FOCOPS', env_id)
7agent.learn()
1import omnisafe
2
3
4env_id = 'SafetyPointGoal1-v0'
5custom_cfgs = {
6 'train_cfgs': {
7 'total_steps': 10000000,
8 'vector_env_nums': 1,
9 'parallel': 1,
10 },
11 'algo_cfgs': {
12 'steps_per_epoch': 20000,
13 },
14 'logger_cfgs': {
15 'use_wandb': False,
16 'use_tensorboard': True,
17 },
18}
19
20agent = omnisafe.Agent('FOCOPS', env_id, custom_cfgs=custom_cfgs)
21agent.learn()
We use train_policy.py
as the entrance file. You can train the agent with FOCOPS simply using train_policy.py
, with arguments about FOCOPS and environments does the training.
For example, to run FOCOPS in SafetyPointGoal1-v0 , with 1 torch thread, seed 0 and single environment, you can use the following command:
1cd examples
2python train_policy.py --algo FOCOPS --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1
Architecture of functions#
FOCOPS.learn()
FOCOPS._env.rollout()
FOCOPS._update()
FOCOPS._buf.get()
FOCOPS._update_lagrange()
FOCOPS._update_actor()
FOCOPS._update_cost_critic()
FOCOPS._update_reward_critic()
Documentation of algorithm specific functions#
FOCOPS._compute_adv_surrogate()
Compute the surrogate advantage function.
1return (adv_r - self._lagrange.lagrangian_multiplier * adv_c) / (
2 1 + self._lagrange.lagrangian_multiplier
3)
FOCOPS._loss_pi()
Compute the loss of policy network.
In FOCOPS, the loss is defined as:
In code implementation, we use the following code to compute the loss:
1distribution = self._actor_critic.actor(obs)
2logp_ = self._actor_critic.actor.log_prob(act)
3std = self._actor_critic.actor.std
4ratio = torch.exp(logp_ - logp)
5
6kl = torch.distributions.kl_divergence(distribution, self._p_dist).sum(-1, keepdim=True)
7loss = (kl - (1 / self._cfgs.algo_cfgs.focops_lam) * ratio * adv) * (
8 kl.detach() <= self._cfgs.algo_cfgs.focops_eta
9).type(torch.float32)
10loss = loss.mean()
11loss -= self._cfgs.algo_cfgs.entropy_coef * distribution.entropy().mean()
Configs#
Train Configs
device (str): Device to use for training, options:
cpu
,cuda
,cuda:0
, etc.torch_threads (int): Number of threads to use for PyTorch.
total_steps (int): Total number of steps to train the agent.
parallel (int): Number of parallel agents, similar to A3C.
vector_env_nums (int): Number of the vector environments.
Algorithms Configs
Note
The following configs are specific to FOCOPS algorithm.
clip (float): Clipping parameter for FOCOPS.
steps_per_epoch (int): Number of steps to update the policy network.
update_iters (int): Number of iterations to update the policy network.
batch_size (int): Batch size for each iteration.
target_kl (float): Target KL divergence.
entropy_coef (float): Coefficient of entropy.
reward_normalize (bool): Whether to normalize the reward.
cost_normalize (bool): Whether to normalize the cost.
obs_normalize (bool): Whether to normalize the observation.
kl_early_stop (bool): Whether to stop the training when KL divergence is too large.
max_grad_norm (float): Maximum gradient norm.
use_max_grad_norm (bool): Whether to use maximum gradient norm.
use_critic_norm (bool): Whether to use critic norm.
critic_norm_coef (float): Coefficient of critic norm.
gamma (float): Discount factor.
cost_gamma (float): Cost discount factor.
lam (float): Lambda for GAE-Lambda.
lam_c (float): Lambda for cost GAE-Lambda.
adv_estimation_method (str): The method to estimate the advantage.
standardized_rew_adv (bool): Whether to use standardized reward advantage.
standardized_cost_adv (bool): Whether to use standardized cost advantage.
penalty_coef (float): Penalty coefficient for cost.
use_cost (bool): Whether to use cost.
Model Configs
weight_initialization_mode (str): The type of weight initialization method.
actor_type (str): The type of actor, default to
gaussian_learning
.linear_lr_decay (bool): Whether to use linear learning rate decay.
exploration_noise_anneal (bool): Whether to use exploration noise anneal.
std_range (list): The range of standard deviation.
Hint
actor (dictionary): parameters for actor network actor
activations: tanh
hidden_sizes:
64
64
Hint
critic (dictionary): parameters for critic network critic
activations: tanh
hidden_sizes:
64
64
Logger Configs
use_wandb (bool): Whether to use wandb to log the training process.
wandb_project (str): The name of wandb project.
use_tensorboard (bool): Whether to use tensorboard to log the training process.
log_dir (str): The directory to save the log files.
window_lens (int): The length of the window to calculate the average reward.
save_model_freq (int): The frequency to save the model.
Lagrange Configs
Note
The following configs are specific to FOCOPS algorithm.
lagrangian_upper_bound (float): Upper bound of Lagrange multiplier.
cost_limit (float): Tolerance of constraint violation.
lagrangian_multiplier_init (float): Initial value of Lagrange multiplier.
lambda_lr (float): Learning rate of Lagrange multiplier.
lambda_optimizer (str): Optimizer for Lagrange multiplier.
References#
Appendix#
Proof for Theorem 1#
Lemma 1
Problem Eq.2 is convex w.r.t \(\pi={\pi(a|s):s\in \mathcal{S},a\in\mathcal{A}}\).
Proof of Lemma 1
First, note that the objective function is linear w.r.t \(\pi\). Since \(J^{C}(\pi_{{\boldsymbol{\theta}}_k})\) is a constant w.r.t \(\pi\), constraint Eq.2 is linear. Constraint Eq.2 can be rewritten as \(\sum_s d_{\pi_{{\boldsymbol{\theta}}_k}}(s) D_{\mathrm{KL}}\left(\pi \| \pi_{{\boldsymbol{\theta}}_k}\right)[s] \leq \delta\). The \(KL\) divergence is convex w.r.t its first argument. Hence constraint Eq.2, a linear combination of convex functions, is also convex. Since \(\pi_{{\boldsymbol{\theta}}_k}\) satisfies constraint Eq.2 also satisfies constraint Eq.2, therefore Slater’s constraint qualification holds, and strong duality holds.
Proof of Theorem 1 (Click here)
Based on Lemma 1 the optimal value of the Eq.2 \(p^*\) can be solved by solving the corresponding dual problem. Let
where \(A^{lag}=\underset{a \sim \pi(\cdot \mid s)}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\). Therefore.
Note that if \(\pi^*\), \(\lambda^*\), \(\nu^*\) are optimal for Eq.13, \(\pi^*\) is also optimal for Eq.2 because of the strong duality.
Consider the inner maximization problem in Eq.13. We separate it from the original problem and try to solve it first:
Which is equivalent to the inner maximization problem in Eq.13. We can solve this convex optimization problem using a simple Lagrangian argument. We can write the Lagrangian of it as:
where \(\zeta > 0\) is the Lagrange multiplier associated with the constraint \(\sum_a \pi(a \mid s)=1\). Different \(G(\pi)\) w.r.t. \(\pi(a \mid s)\) for some \(a\):
Setting Eq.16 to zero and rearranging the term, we obtain:
We chose \(\zeta\) so that \(\sum_a \pi(a \mid s)=1\) and rewrite \(\zeta / \lambda+1\) as \(Z_{\lambda, \nu}(s)\). We find that the optimal solution \(\pi^*\) to Eq.14 takes the form
Then we obtain:
Plugging the result back to Eq.13, we obtain:
Proof of Corollary#
Proof of Corollary 1
We only need to calculate the gradient of the loss function for a single sampled s. We first note that,
where \(H\left(\pi_{\boldsymbol{\theta}}\right)[s]\) is the entropy and \(H\left(\pi_{\boldsymbol{\theta}}, \pi^*\right)[s]\) is the cross-entropy under state \(s\). The above is the basic mathematical knowledge in information theory, which you can get in any information theory textbook. We expand the cross entropy term, which gives us the following:
We then subtract the entropy term to recover the \(KL\) divergence:
In the last equality, we applied importance sampling to rewrite the expectation w.r.t. \(\pi_{{\boldsymbol{\theta}}_k}\). Finally, taking the gradient on both sides gives us the following:
Proof of Corollary 2
From Theorem 1, we have:
The first two terms are an affine function w.r.t. \(\nu\). Therefore, its derivative is \(\tilde{b}\). We will then focus on the expectation in the last term. To simplify our derivation, we will first calculate the derivative of \(\pi^*\) w.r.t. \(\nu\),
Therefore the derivative of the expectation in the last term of \(L(\pi^*,\lambda,\nu)\) can be written as:
Also:
Therefore:
Plugging Eq.29 into the last equality in Eq.27 gives us:
Combining Eq.30 with the derivatives of the affine term give us the final desired result.