First Order Constrained Optimization in Policy Space#

Quick Facts#

  1. FOCOPS is an on-policy algorithm.

  2. FOCOPS can be used for environments with both discrete and continuous action spaces.

  3. FOCOPS is a first-order method.

  4. API documentation is available for FOCOPS.

FOCOPS Theorem#

Background#

First Order Constrained Optimization in Policy Space (FOCOPS) is a first-order method that maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints. FOCOPS argues that CPO has the disadvantages below:

Problems of CPO

  • Sampling errors resulting from taking sample trajectories from the current policy.

  • Approximation errors resulting from Taylor approximations.

  • Approximation errors resulting from using the conjugate gradient method to compute the inverse of the Fisher information matrix.

Advantage of FOCOPS

  • Extremely simple to implement since it only utilizes first-order approximations.

  • Its simple first-order approach avoids the errors caused by Taylor approximations and by the conjugate gradient method.

  • Outperforms CPO in experiments.

  • No recovery steps are required.

FOCOPS mainly includes the following contributions:

  • Provides a two-stage policy update to optimize the current policy.

  • Gives the practical implementation for solving the two-stage policy update.

  • Offers rigorous proofs of the above results, as detailed in the Appendix of this tutorial.

One suggested reading order is CPO (Constrained Policy Optimization), then PCPO (Projection-Based Constrained Policy Optimization), and finally FOCOPS. If you have not yet read PCPO, that is fine. However, be sure to read this article after the CPO tutorial so that you can fully understand the following passage.


Optimization Objective#

In the previous chapters, you learned that CPO solves the following optimization problems:

(1)#\[\begin{split}\pi_{k+1}&=\arg \max _{\pi \in \Pi_{\boldsymbol{\theta}}} \mathbb{E}_{\substack{s \sim d_{\pi_k}\\a \sim \pi}}[A^R_{\pi_k}(s, a)]\\ \text{s.t.} \quad J^{C_i}\left(\pi_k\right) &\leq d_i-\frac{1}{1-\gamma} \mathbb{E}_{\substack{s \sim d_{\pi_k} \\ a \sim \pi}}\left[A^{C_i}_{\pi_k}(s, a)\right] \quad \forall i \\ \bar{D}_{K L}\left(\pi \| \pi_k\right) &\leq \delta\end{split}\]

where \(\Pi_{{\boldsymbol{\theta}}}\subseteq\Pi\) denotes the set of parametrized policies with parameters \({\boldsymbol{\theta}}\), and \(\bar{D}_{K L}\) is the average \(KL\) divergence between two policies. In local policy search for CMDPs, we additionally require the policy iterates to be feasible, so instead of optimizing over \(\Pi_{{\boldsymbol{\theta}}}\), the algorithm optimizes over \(\Pi_{{\boldsymbol{\theta}}}\cap\Pi_{C}\). Next, we will introduce how FOCOPS solves the above optimization problem. To gain a clearer understanding, we suggest reading the next section with the following questions in mind:

Questions

  • What is the two-stage policy update, and how does it work?

  • How to practically implement FOCOPS?

  • How do parameters impact the performance of the algorithm?


Two-stage Policy Update#

Instead of solving Eq.1 directly, FOCOPS uses a two-stage approach, summarized below:

Two-stage Policy Update

  • Given policy \(\pi_{{\boldsymbol{\theta}}_k}\), find an optimal update policy \(\pi^*\) by solving the optimization problem from Eq.1 in the non-parameterized policy space.

  • Project the policy found in the previous step back into the parameterized policy space \(\Pi_{{\boldsymbol{\theta}}}\) by searching for the closest policy \(\pi_{{\boldsymbol{\theta}}}\in\Pi_{{\boldsymbol{\theta}}}\) to \(\pi^*\), to obtain \(\pi_{{\boldsymbol{\theta}}_{k+1}}\).


Finding the Optimal Update Policy#

In the first stage, FOCOPS rewrites Eq.1 as below:

(2)#\[\begin{split}\pi^* &=\arg \max _{\pi \in \Pi} \mathbb{E}_{\substack{s \sim d_{\pi_k}\\a \sim \pi}}[A^R_{\pi_k}(s, a)]\\ \text{s.t.} \quad J^{C}\left(\pi_k\right) &\leq d-\frac{1}{1-\gamma} \mathbb{E}_{\substack{s \sim d_{\pi_k} \\ a \sim \pi}}\left[A^{C}_{\pi_k}(s, a)\right] \quad \\ \bar{D}_{K L}\left(\pi \| \pi_k\right) & \leq \delta\end{split}\]

This problem is only slightly different from Eq.1: the optimization is now over the non-parameterized policy \(\pi\) rather than the policy parameters \({\boldsymbol{\theta}}\). FOCOPS then provides the following solution:

Theorem 1

Let \(\tilde{b}=(1-\gamma)\left(d-J^C\left(\pi_{{\boldsymbol{\theta}}_k}\right)\right)\), where \(d\) is the cost limit in Eq.2. If \(\pi_{{\boldsymbol{\theta}}_k}\) is a feasible solution, the optimal policy for Eq.2 takes the form

(3)#\[\pi^*(a \mid s)=\frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right)\]

where \(Z_{\lambda,\nu}(s)\) is the partition function which ensures Eq.3 is a valid probability distribution, \(\lambda\) and \(\nu\) are solutions to the optimization problem:

(4)#\[\begin{split}\min _{\lambda, \nu \geq 0} \lambda \delta+\nu \tilde{b}+\lambda \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}[\log Z_{\lambda, \nu}(s)]\end{split}\]

The form of the optimal policy is intuitive: it assigns high probability mass to areas of the state-action space with high return, offset by a penalty term \(\nu\) times the cost advantage. We refer to the optimal solution of Eq.2 as the optimal update policy. If the equations above seem opaque at first, simply remember that FOCOPS ultimately solves Eq.2 by solving Eq.3 and Eq.4; Theorem 1 provides a viable solution.
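
As a concrete illustration, here is a minimal sketch of Eq.3 for a single state with a discrete action space, written with plain PyTorch tensors (the helper name and all numbers are ours, for illustration only, and are not part of OmniSafe):

import torch

def optimal_update_policy(pi_old, adv_r, adv_c, lam, nu):
    # Unnormalized pi*(a|s) from Eq.3: pi_{theta_k}(a|s) * exp((A^R - nu * A^C) / lam).
    unnormalized = pi_old * torch.exp((adv_r - nu * adv_c) / lam)
    # Dividing by the partition function Z_{lam,nu}(s) yields a valid distribution.
    return unnormalized / unnormalized.sum()

# Toy usage: three actions under a uniform old policy.
pi_star = optimal_update_policy(
    pi_old=torch.tensor([1 / 3, 1 / 3, 1 / 3]),
    adv_r=torch.tensor([1.0, 0.0, -1.0]),
    adv_c=torch.tensor([0.0, 1.0, 0.0]),
    lam=1.5,
    nu=0.5,
)
print(pi_star, pi_star.sum())  # mass shifts toward actions with high A^R - nu * A^C; sums to 1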

Question

What is FOCOPS's worst-case guarantee for cost-constraint violation?

Answer

FOCOPS shows that the optimal update policy \(\pi^*\) satisfies the following worst-case bound for cost-constraint violation, analogous to the guarantee in CPO:

(5)#\[J^C\left(\pi^*\right) \leq d+\frac{\sqrt{2 \delta} \gamma \epsilon_{\pi^*}^C}{(1-\gamma)^2}\]

where \(\epsilon^C_{\pi^*}=\max _s\left|\underset{a \sim \pi^*}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\right|\).

Question

Can FOCOPS solve multi-constraint problems, and how?

Answer

By introducing Lagrange multipliers \(\nu_1,\nu_2,\ldots,\nu_m\ge0\), one for each cost constraint, and applying a similar duality argument, FOCOPS extends its results to accommodate multiple constraints.


Approximating the Optimal Update Policy#

The optimal update policy \(\pi^*\) obtained in the previous section is not a parameterized policy. In this section, we show how FOCOPS projects the optimal update policy back into the parameterized policy space by minimizing the loss function:

(6)#\[\mathcal{L}({\boldsymbol{\theta}})=\underset{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]\right]\]

Here \(\pi_{{\boldsymbol{\theta}}}\in \Pi_{{\boldsymbol{\theta}}}\) is the projected policy that FOCOPS uses to approximate the optimal update policy. First-order methods are again used to minimize this loss function:

Corollary 1

The gradient of \(\mathcal{L}({\boldsymbol{\theta}})\) takes the form

(7)#\[\nabla_{\boldsymbol{\theta}} \mathcal{L}({\boldsymbol{\theta}})=\underset{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[\nabla_{\boldsymbol{\theta}} D_{K L}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]\right]\]

where

(8)#\[\begin{split}\nabla_{\boldsymbol{\theta}} D_{K L}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s] &=\nabla_{\boldsymbol{\theta}} D_{K L}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)[s] \\ & -\frac{1}{\lambda} \underset{a \sim \pi_{{\boldsymbol{\theta}}_k}}{\mathbb{E}}\left[\frac{\nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a \mid s)}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\end{split}\]

Note that the gradient in Eq.7 can be estimated by sampling from trajectories generated by the policy \(\pi_{{\boldsymbol{\theta}}_k}\), so the loss can be minimized with stochastic gradient descent.
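
As a sanity check, the following toy, self-contained PyTorch snippet builds such a stochastic estimate for a one-dimensional Gaussian policy; all tensors and the policy itself are fabricated for illustration and are not OmniSafe code. Calling backward() on the loss produces a mini-batch estimate of the gradient in Eq.8:

import torch
from torch.distributions import Normal, kl_divergence

mean = torch.zeros(4, requires_grad=True)   # current policy pi_theta (learnable mean)
old_mean = torch.zeros(4)                   # old policy pi_{theta_k}
act = torch.tensor([0.1, -0.3, 0.2, 0.0])   # actions sampled from pi_{theta_k}
adv_r = torch.tensor([1.0, -0.5, 0.3, 0.2])
adv_c = torch.tensor([0.0, 1.0, 0.5, 0.0])
lam, nu = 1.5, 0.5

dist, old_dist = Normal(mean, 1.0), Normal(old_mean, 1.0)
ratio = torch.exp(dist.log_prob(act) - old_dist.log_prob(act))  # pi_theta / pi_{theta_k}
kl = kl_divergence(dist, old_dist)                              # per-state KL divergence
loss = (kl - (1.0 / lam) * ratio * (adv_r - nu * adv_c)).mean()
loss.backward()                                                 # mini-batch estimate of Eq.8
print(mean.grad)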

Corollary 1 outlines the FOCOPS algorithm:

Note

At every iteration, we begin with a policy \(\pi_{{\boldsymbol{\theta}}_k}\), which we use to run trajectories and gather data. We first use that data to estimate \(\lambda\) and \(\nu\) in Eq.4. We then draw a mini-batch from the data to estimate \(\nabla_{\boldsymbol{\theta}} \mathcal{L}({\boldsymbol{\theta}})\) as given in Corollary 1. After taking a gradient step using Eq.7, we draw another mini-batch and repeat the process.


Practical Implementation#

Hint

Solving Eq.4 is computationally impractical for large state or action spaces, as it requires calculating the partition function \(Z_{\lambda,\nu}(s)\), which often involves evaluating a high-dimensional integral or sum. Furthermore, \(\lambda\) and \(\nu\) depend on \(k\) and should be adapted at every iteration.

This section introduces how FOCOPS implements its algorithm in practice. Through hyperparameter sweeps, FOCOPS found that a fixed \(\lambda\) provides good results, which means the value of \(\lambda\) does not have to be updated. However, \(\nu\) needs to be continuously adapted during training to ensure cost-constraint satisfaction. FOCOPS appeals to an intuitive heuristic for determining \(\nu\) based on primal-dual gradient methods. By strong duality, the optimal \(\lambda^*\) and \(\nu^*\) minimize the dual function in Eq.4, which we denote by \(L(\pi^*,\lambda,\nu)\). Applying gradient descent w.r.t. \(\nu\) to minimize \(L(\pi^*,\lambda,\nu)\), we obtain:

Corollary 2

The derivative of \(L(\pi^*,\lambda,\nu)\) w.r.t \(\nu\) is

(9)#\[\begin{split}\frac{\partial L\left(\pi^*, \lambda, \nu\right)}{\partial \nu}=\tilde{b}-\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\end{split}\]

The last term in the gradient expression in Eq.9 cannot be evaluated exactly since we do not have access to \(\pi^*\). However, since \(\pi_{{\boldsymbol{\theta}}_k}\) and \(\pi^*\) are close, it is reasonable to assume that \(\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}\\ a \sim \pi^*}}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right] \approx \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}\\ a \sim \pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]=0\). In practice, this term is therefore set to zero, which gives the update rule:

(10)#\[\nu \leftarrow \underset{\nu}{\operatorname{proj}}\left[\nu-\alpha\left(d-J^C\left(\pi_{{\boldsymbol{\theta}}_k}\right)\right)\right]\]

where \(\alpha\) is the step size. Note that the discount factor \((1-\gamma)\) from \(\tilde{b}\) has been absorbed into the step size. The projection operator \(\operatorname{proj}_{\nu}\) projects \(\nu\) back into the interval \([0,\nu_{max}]\), where \(\nu_{max}\) is chosen so that \(\nu\) does not become too large; in fact, FOCOPS reports that even setting \(\nu_{max}=+\infty\) does not appear to greatly reduce performance. In practice, \(J^C(\pi_{{\boldsymbol{\theta}}_k})\) can be estimated via Monte Carlo methods using trajectories collected from \(\pi_{{\boldsymbol{\theta}}_k}\). Using the update rule Eq.10, FOCOPS performs one update step on \(\nu\) before updating the policy parameters \({\boldsymbol{\theta}}\). A per-state acceptance indicator function \(I\left(s_j\right)^n:=\mathbf{1}_{D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)\left[s_j\right] \leq \delta}\) is also added to Eq.7 in order to better enforce the KL-divergence constraint for this first-order method.
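
Below is a minimal sketch of the update rule Eq.10 together with the projection step, using hypothetical variable names rather than OmniSafe's Lagrange-multiplier class (the default nu_max here is an arbitrary placeholder):

def update_nu(nu, ep_cost, cost_limit, alpha, nu_max=2.0):
    # Gradient step from Eq.10: nu grows when the estimated cost J^C(pi_{theta_k})
    # exceeds the limit d, and shrinks otherwise.
    nu = nu - alpha * (cost_limit - ep_cost)
    # proj_nu: clip nu back into the interval [0, nu_max].
    return min(max(nu, 0.0), nu_max)

# Example: the cost estimate exceeds the limit, so nu increases from 0.1 to 0.15.
nu = update_nu(nu=0.1, ep_cost=30.0, cost_limit=25.0, alpha=0.01)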

Hint

Here \(N\) is the number of samples collected by the policy \(\pi_{{\boldsymbol{\theta}}_k}\), and \(\hat A^R\) and \(\hat A^C\) are estimates of the reward and cost advantage functions obtained from the critic networks, computed with the Generalized Advantage Estimator (GAE). Note that FOCOPS only requires first-order methods (gradient descent) and is thus extremely simple to implement.
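
For reference, here is a minimal single-trajectory GAE sketch (a simplified stand-in, not OmniSafe's buffer implementation); the same routine applied to the cost signal yields \(\hat A^C\):

import torch

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over one trajectory of length T.
    advantages = torch.zeros_like(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_value - values[t]
        # Discounted, lambda-weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

adv_r_hat = gae(torch.tensor([1.0, 0.0, 1.0]), torch.tensor([0.5, 0.4, 0.3]), last_value=0.0)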


Variables Analysis#

In this section, we explain the meaning of the parameters \(\lambda\) and \(\nu\) in FOCOPS and their impact on the algorithm's performance in experiments.

Analysis of \(\lambda\)

In Eq.3, note that as \(\lambda \rightarrow 0\), \(\pi^*\) approaches a greedy policy; as \(\lambda\) increases, the policy becomes more exploratory. Therefore, \(\lambda\) is similar to the temperature term used in maximum entropy reinforcement learning, which has been shown to produce good results when kept fixed during training. In practice, FOCOPS finds that the algorithm reaches its best performance when \(\lambda\) is fixed.

Analysis of \(\nu\)

We recall that in Eq.3, \(\nu\) acts as a cost-penalty term: increasing \(\nu\) makes it less likely for state-action pairs with higher costs to be sampled by \(\pi^*\). In this regard, the update rule in Eq.10 is intuitive: it increases \(\nu\) if \(J^C(\pi_{{\boldsymbol{\theta}}_k})>d\) (i.e., the agent violates the cost constraint) and decreases \(\nu\) otherwise.


Code with OmniSafe#

Quick start#

Run FOCOPS in OmniSafe

Here are 3 ways to run FOCOPS in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('FOCOPS', env_id)
agent.learn()
import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('FOCOPS', env_id, custom_cfgs=custom_cfgs)
agent.learn()

We use train_policy.py as the entry file. You can train the agent with FOCOPS simply by running train_policy.py with arguments specifying FOCOPS and the environment. For example, to run FOCOPS on SafetyPointGoal1-v0 with 1 torch thread, seed 0, and a single environment, you can use the following command:

cd examples
python train_policy.py --algo FOCOPS --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • FOCOPS.learn()

    • FOCOPS._env.rollout()

    • FOCOPS._update()

      • FOCOPS._buf.get()

      • FOCOPS._update_lagrange()

      • FOCOPS._update_actor()

      • FOCOPS._update_cost_critic()

      • FOCOPS._update_reward_critic()


Documentation of algorithm-specific functions#

FOCOPS._compute_adv_surrogate()

Compute the surrogate advantage function.

# Combine the reward and cost advantages using the current Lagrange multiplier,
# normalized by (1 + multiplier).
return (adv_r - self._lagrange.lagrangian_multiplier * adv_c) / (
    1 + self._lagrange.lagrangian_multiplier
)
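
For intuition, here is a standalone check of the same combination with plain tensors (the numbers are made up for illustration):

import torch

adv_r = torch.tensor([1.0, 0.5])
adv_c = torch.tensor([2.0, 0.0])
nu = 0.5  # stand-in for the current Lagrange multiplier

# Matches the surrogate above: (A^R - nu * A^C) / (1 + nu).
adv = (adv_r - nu * adv_c) / (1 + nu)
print(adv)  # tensor([0.0000, 0.3333])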

FOCOPS._loss_pi()

Compute the loss of the policy network.

In FOCOPS, the loss is defined as:

(11)#\[\mathcal{L} = D_{K L}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)[s] -\frac{1}{\lambda} \underset{a \sim \pi_{{\boldsymbol{\theta}}_k}} {\mathbb{E}}\left[\frac{\pi_{\boldsymbol{\theta}}(a \mid s)} {\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(A^{R}_{\pi_{{\boldsymbol{\theta}}_k}}(s, a) -\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\]

In the code, the loss is computed as follows:

distribution = self._actor_critic.actor(obs)
logp_ = self._actor_critic.actor.log_prob(act)
std = self._actor_critic.actor.std
# Importance-sampling ratio pi_theta(a|s) / pi_{theta_k}(a|s).
ratio = torch.exp(logp_ - logp)

# Per-state KL divergence between the current policy and the old policy pi_{theta_k}.
kl = torch.distributions.kl_divergence(distribution, self._p_dist).sum(-1, keepdim=True)
# FOCOPS loss: KL term minus the scaled, importance-weighted surrogate advantage,
# masked by the per-state acceptance indicator (kl <= focops_eta).
loss = (kl - (1 / self._cfgs.algo_cfgs.focops_lam) * ratio * adv) * (
    kl.detach() <= self._cfgs.algo_cfgs.focops_eta
).type(torch.float32)
loss = loss.mean()
# Entropy regularization to encourage exploration.
loss -= self._cfgs.algo_cfgs.entropy_coef * distribution.entropy().mean()

Configs#

Train Configs

  • device (str): Device to use for training, options: cpu, cuda, cuda:0, etc.

  • torch_threads (int): Number of threads to use for PyTorch.

  • total_steps (int): Total number of steps to train the agent.

  • parallel (int): Number of parallel agents, similar to A3C.

  • vector_env_nums (int): Number of the vector environments.

Algorithms Configs

Note

The following configs are specific to the FOCOPS algorithm.

  • clip (float): Clipping parameter for FOCOPS.

  • steps_per_epoch (int): Number of steps to update the policy network.

  • update_iters (int): Number of iterations to update the policy network.

  • batch_size (int): Batch size for each iteration.

  • target_kl (float): Target KL divergence.

  • entropy_coef (float): Coefficient of entropy.

  • reward_normalize (bool): Whether to normalize the reward.

  • cost_normalize (bool): Whether to normalize the cost.

  • obs_normalize (bool): Whether to normalize the observation.

  • kl_early_stop (bool): Whether to stop the training when KL divergence is too large.

  • max_grad_norm (float): Maximum gradient norm.

  • use_max_grad_norm (bool): Whether to use maximum gradient norm.

  • use_critic_norm (bool): Whether to use critic norm.

  • critic_norm_coef (float): Coefficient of critic norm.

  • gamma (float): Discount factor.

  • cost_gamma (float): Cost discount factor.

  • lam (float): Lambda for GAE-Lambda.

  • lam_c (float): Lambda for cost GAE-Lambda.

  • adv_estimation_method (str): The method to estimate the advantage.

  • standardized_rew_adv (bool): Whether to use standardized reward advantage.

  • standardized_cost_adv (bool): Whether to use standardized cost advantage.

  • penalty_coef (float): Penalty coefficient for cost.

  • use_cost (bool): Whether to use cost.

Model Configs

  • weight_initialization_mode (str): The type of weight initialization method.

  • actor_type (str): The type of actor; defaults to gaussian_learning.

  • linear_lr_decay (bool): Whether to use linear learning rate decay.

  • exploration_noise_anneal (bool): Whether to use exploration noise anneal.

  • std_range (list): The range of standard deviation.

Hint

actor (dictionary): parameters for the actor network

  • activations: tanh

  • hidden_sizes: [64, 64]

Hint

critic (dictionary): parameters for the critic network

  • activations: tanh

  • hidden_sizes: [64, 64]

Logger Configs

  • use_wandb (bool): Whether to use wandb to log the training process.

  • wandb_project (str): The name of wandb project.

  • use_tensorboard (bool): Whether to use tensorboard to log the training process.

  • log_dir (str): The directory to save the log files.

  • window_lens (int): The length of the window to calculate the average reward.

  • save_model_freq (int): The frequency to save the model.

Lagrange Configs

Note

The following configs are specific to the FOCOPS algorithm.

  • lagrangian_upper_bound (float): Upper bound of Lagrange multiplier.

  • cost_limit (float): Tolerance of constraint violation.

  • lagrangian_multiplier_init (float): Initial value of Lagrange multiplier.

  • lambda_lr (float): Learning rate of Lagrange multiplier.

  • lambda_optimizer (str): Optimizer for Lagrange multiplier.



Appendix#

Proof for Theorem 1#

Lemma 1

Problem Eq.2 is convex w.r.t. \(\pi=\{\pi(a|s): s\in \mathcal{S}, a\in\mathcal{A}\}\).

Proof of Lemma 1

First, note that the objective function is linear w.r.t. \(\pi\). Since \(J^{C}(\pi_{{\boldsymbol{\theta}}_k})\) is a constant w.r.t. \(\pi\), the cost constraint in Eq.2 is linear in \(\pi\). The KL constraint in Eq.2 can be rewritten as \(\sum_s d_{\pi_{{\boldsymbol{\theta}}_k}}(s) D_{\mathrm{KL}}\left(\pi \| \pi_{{\boldsymbol{\theta}}_k}\right)[s] \leq \delta\); since the \(KL\) divergence is convex w.r.t. its first argument, this constraint, a non-negative weighted sum of convex functions, is also convex. Finally, \(\pi_{{\boldsymbol{\theta}}_k}\) itself satisfies both constraints (it is feasible and has zero KL divergence from itself), so Slater's constraint qualification holds, and strong duality holds.

Proof of Theorem 1

By Lemma 1, the optimal value \(p^*\) of Eq.2 can be obtained by solving the corresponding dual problem. Let

(12)#\[L(\pi, \lambda, \nu)=\lambda \delta+\nu \tilde{b}+\underset{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[A^{lag}-\lambda D_{\mathrm{KL}}\left(\pi \| \pi_{{\boldsymbol{\theta}}_k}\right)[s]\right]\nonumber\]

where \(A^{lag}=\underset{a \sim \pi(\cdot \mid s)}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\). Therefore,

(13)#\[p^*=\max _{\pi \in \Pi} \min _{\lambda, \nu \geq 0} L(\pi, \lambda, \nu)=\min _{\lambda, \nu \geq 0} \max _{\pi \in \Pi} L(\pi, \lambda, \nu)\]

Note that if \(\pi^*\), \(\lambda^*\), \(\nu^*\) are optimal for Eq.13, \(\pi^*\) is also optimal for Eq.2 because of the strong duality.

Consider the inner maximization problem in Eq.13. We separate it from the original problem and try to solve it first:

(14)#\[\begin{split}&\underset{\pi}{\operatorname{max}}\, A^{lag}-\underset{a \sim \pi(\cdot \mid s)}{\mathbb{E}}\left[\lambda\left(\log \pi(a \mid s)-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)\right] \\ \text { s.t. } & \sum_a \pi(a \mid s)=1 \\ & \pi(a \mid s) \geq 0 \quad \forall a \in \mathcal{A}\end{split}\]

This is equivalent to the inner maximization problem in Eq.13. We can solve this convex optimization problem using a simple Lagrangian argument, writing its Lagrangian as:

(15)#\[G(\pi)=\sum_a \pi(a \mid s)\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a) -\lambda\left(\log \pi(a \mid s)-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)+\zeta\right]-\zeta\]

where \(\zeta\) is the Lagrange multiplier associated with the constraint \(\sum_a \pi(a \mid s)=1\). Differentiating \(G(\pi)\) w.r.t. \(\pi(a \mid s)\) for some \(a\):

(16)#\[\frac{\partial G}{\partial \pi(a \mid s)}=A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\lambda\left(\log \pi(a \mid s)+1-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)+\zeta\]

Setting Eq.16 to zero and rearranging the term, we obtain:

(17)#\[\pi(a \mid s)=\pi_{{\boldsymbol{\theta}}_k}(a \mid s) \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)+\frac{\zeta}{\lambda}-1\right)\]

We choose \(\zeta\) so that \(\sum_a \pi(a \mid s)=1\). Since the term \(\zeta/\lambda-1\) is constant w.r.t. \(a\), we absorb it into the normalization and write \(\exp\left(1-\zeta/\lambda\right)\) as \(Z_{\lambda, \nu}(s)\). We find that the optimal solution \(\pi^*\) to Eq.14 takes the form

(18)#\[\pi^*(a \mid s)=\frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right)\]
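
For completeness, normalizing Eq.18 over actions gives the explicit form of the partition function (it also appears in the first line of Eq.28):

\[Z_{\lambda, \nu}(s)=\sum_a \pi_{{\boldsymbol{\theta}}_k}(a \mid s) \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right)\]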

Substituting \(\pi^*\) back into the inner objective, we then obtain:

(19)#\[\begin{split}&\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\lambda\left(\log \pi^*(a \mid s)-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)\right] \\ = &\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\lambda\left(\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)-\log Z_{\lambda, \nu}(s)\right.\right. \\ &\left.\left. + \frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)-\log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)\right)\right]\\ = &\lambda\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}[\log Z_{\lambda,\nu}(s)]\nonumber\end{split}\]

Plugging this result back into Eq.13, we obtain:

(20)#\[\begin{split}p^*=\underset{\lambda,\nu\ge0}{\min}\lambda\delta+\nu\tilde{b}+\lambda\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}[\log Z_{\lambda,\nu}(s)]\end{split}\]

Proof of Corollary#

Proof of Corollary 1

We only need to calculate the gradient of the loss function for a single sampled state \(s\). We first note that

(21)#\[\begin{split}&D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]\\ =&-\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \pi^*(a \mid s)+\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \pi_{\boldsymbol{\theta}}(a \mid s) \\ =&H\left(\pi_{\boldsymbol{\theta}}, \pi^*\right)[s]-H\left(\pi_{\boldsymbol{\theta}}\right)[s]\end{split}\]

where \(H\left(\pi_{\boldsymbol{\theta}}\right)[s]\) is the entropy and \(H\left(\pi_{\boldsymbol{\theta}}, \pi^*\right)[s]\) is the cross-entropy under state \(s\). This identity is standard in information theory and can be found in any textbook on the subject. Expanding the cross-entropy term gives us the following:

(22)#\[\begin{split}&H\left(\pi_{\boldsymbol{\theta}}, \pi^*\right)[s]\\ &=-\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \pi^*(a \mid s) \\ &=-\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \left(\frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left[\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\right) \\ &=-\sum_a \pi_{\boldsymbol{\theta}}(a \mid s) \log \pi_{{\boldsymbol{\theta}}_k}(a \mid s)+\log Z_{\lambda, \nu}(s)-\frac{1}{\lambda} \sum_a \pi_{\boldsymbol{\theta}}(a \mid s)\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\end{split}\]

We then subtract the entropy term to recover the \(KL\) divergence:

(23)#\[\begin{split}&D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]=D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)[s]+\log Z_{\lambda, \nu}(s)-\\&\frac{1}{\lambda} \underset{a \sim \pi_{{\boldsymbol{\theta}}_k}(\cdot \mid s)}{\mathbb{E}}\left[\frac{\pi_{\boldsymbol{\theta}}(a \mid s)}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\nonumber\end{split}\]

In the last equality, we applied importance sampling to rewrite the expectation w.r.t. \(\pi_{{\boldsymbol{\theta}}_k}\). Finally, taking the gradient on both sides gives us the following:

(24)#\[\begin{split}&\nabla_{\boldsymbol{\theta}} D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi^*\right)[s]=\nabla_{\boldsymbol{\theta}} D_{\mathrm{KL}}\left(\pi_{\boldsymbol{\theta}} \| \pi_{{\boldsymbol{\theta}}_k}\right)[s]\\&-\frac{1}{\lambda} \underset{a \sim \pi_{{\boldsymbol{\theta}}_k}(\cdot \mid s)}{\mathbb{E}}\left[\frac{\nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a \mid s)}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right]\nonumber\end{split}\]

Proof of Corollary 2

From Theorem 1, we have:

(25)#\[\begin{split}L\left(\pi^*, \lambda, \nu\right)=\lambda \delta+\nu \tilde{b}+\lambda \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right]\end{split}\]

The first two terms are affine w.r.t. \(\nu\), so their derivative w.r.t. \(\nu\) is simply \(\tilde{b}\). We therefore focus on the expectation in the last term. To simplify the derivation, we first calculate the derivative of \(\pi^*\) w.r.t. \(\nu\):

(26)#\[\begin{split}\frac{\partial \pi^*(a \mid s)}{\partial \nu} &=\frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}^2(s)}\left[Z_{\lambda, \nu}(s) \frac{\partial}{\partial \nu} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right)\right.\\ &\left.-\exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right) \frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu}\right] \\ &=-\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \pi^*(a \mid s)-\pi^*(a \mid s) \frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\nonumber\end{split}\]

Therefore the derivative of the expectation in the last term of \(L(\pi^*,\lambda,\nu)\) can be written as:

(27)#\[\begin{split}\frac{\partial}{\partial \nu} \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right] &= \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[\frac{\partial}{\partial \nu}\left(\frac{\pi^*(a \mid s)}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)} \log Z_{\lambda, \nu}(s)\right)\right] \\ &= \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi_{{\boldsymbol{\theta}}_k}}}{\mathbb{E}}\left[\frac{1}{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}\left(\frac{\partial \pi^*(a \mid s)}{\partial \nu} \log Z_{\lambda, \nu}(s)+\pi^*(a \mid s) \frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right)\right] \\ &= \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[-\left(\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda}+\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right) \log Z_{\lambda, \nu}(s)+\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}\right]\end{split}\]

Also:

(28)#\[\begin{split}\frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu} &=\frac{\partial}{\partial \nu} \sum_a \pi_{{\boldsymbol{\theta}}_k}(a \mid s) \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right) \\ &=\sum_a-\pi_{{\boldsymbol{\theta}}_k}(a \mid s) \frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right) \\ &=\sum_a-\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \frac{\pi_{{\boldsymbol{\theta}}_k}(a \mid s)}{Z_{\lambda, \nu}(s)} \exp \left(\frac{1}{\lambda}\left(A^R_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)-\nu A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right)\right) Z_{\lambda, \nu}(s) \\ &=-\frac{Z_{\lambda, \nu}(s)}{\lambda} \underset{a \sim \pi^*(\cdot \mid s)}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\end{split}\]

Therefore:

(29)#\[\frac{\partial \log Z_{\lambda, \nu}(s)}{\partial \nu}=\frac{\partial Z_{\lambda, \nu}(s)}{\partial \nu} \frac{1}{Z_{\lambda, \nu}(s)}=-\frac{1}{\lambda} \underset{a \sim \pi^*(\cdot \mid s)}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\]

Plugging Eq.29 into the last equality in Eq.27 gives us:

(30)#\[\begin{split}\frac{\partial}{\partial \nu} \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[\log Z_{\lambda, \nu}(s)\right] &=\underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[-\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \log Z_{\lambda, \nu}(s)+\frac{A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)}{\lambda} \log Z_{\lambda, \nu}(s)-\frac{1}{\lambda} A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right] \\ &=-\frac{1}{\lambda} \underset{\substack{s \sim d_{\pi_{{\boldsymbol{\theta}}_k}} \\ a \sim \pi^*}}{\mathbb{E}}\left[A^C_{\pi_{{\boldsymbol{\theta}}_k}}(s, a)\right]\end{split}\]

Combining Eq.30 with the derivative of the affine terms gives us the final desired result.