First Order Algorithms#
- FOCOPS: The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.
- CUP: The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.
FOCOPS#
Documentation
- class omnisafe.algorithms.on_policy.FOCOPS(env_id, cfgs)[source]#
The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.
References
- Title: First Order Constrained Optimization in Policy Space
- Authors: Yiming Zhang, Quan Vuong, Keith W. Ross.
- URL: FOCOPS
Initialize an instance of algorithm.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
FOCOPS uses the following surrogate loss:
(3)#
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}} (s, a) - \lambda A^C_{\pi_{\theta}} (s, a) \right]\]
- Parameters:
  - adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function combined with reward and cost.
- Return type:
Tensor
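As a rough illustration, the blend above is a one-liner in practice. The following is a minimal sketch, assuming the Lagrange multiplier is passed in as a plain float; the function and argument names are illustrative, not OmniSafe's exact internals.

```python
import torch


def compute_adv_surrogate(adv_r: torch.Tensor, adv_c: torch.Tensor, lagrange: float) -> torch.Tensor:
    """Blend reward and cost advantages: (A^R - lambda * A^C) / (1 + lambda)."""
    return (adv_r - lagrange * adv_c) / (1.0 + lagrange)
```

With \(\lambda = 0\) this reduces to the plain reward advantage; as \(\lambda\) grows, the cost advantage dominates the blend.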
- _init()[source]#
Initialize the FOCOPS specific model.
The FOCOPS algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]#
Log the FOCOPS specific information.
Things to log:
- Metrics/LagrangeMultiplier: The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs, act, logp, adv)[source]#
Compute pi/actor loss.
In FOCOPS, the loss is defined as:
\[L = \nabla_{\theta} D_{KL} \left( \pi_{\theta}^{'} \| \pi_{\theta} \right)[s] - \frac{1}{\eta} \underset{a \sim \pi_{\theta}}{\mathbb{E}} \left[ \frac{\nabla_{\theta} \pi_{\theta} (a \mid s)}{\pi_{\theta}(a \mid s)} \left( A^{R}_{\pi_{\theta}} (s, a) - \lambda A^C_{\pi_{\theta}} (s, a) \right) \right]\]
where \(\eta\) is a hyperparameter, \(\lambda\) is the Lagrange multiplier, \(A^{R}_{\pi_{\theta}}(s, a)\) is the reward advantage function, \(A^C_{\pi_{\theta}}(s, a)\) is the cost advantage function, \(\pi_{\theta}^{'}\) is the updated policy (the projection toward the optimal policy \(\pi^*\)), and \(\pi_{\theta}\) is the current policy.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
  - act (torch.Tensor) – The action sampled from buffer.
  - logp (torch.Tensor) – The log probability of action sampled from buffer.
  - adv (torch.Tensor) – The advantage sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
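As a concrete reading of the loss above, the sketch below evaluates it for a batch, assuming Gaussian policies and that adv is already the combined advantage from _compute_adv_surrogate. The argument names and the hyperparameter eta are assumptions for illustration, not the exact OmniSafe implementation.

```python
import torch
from torch.distributions import Normal, kl_divergence


def focops_policy_loss(
    new_dist: Normal,        # updated policy pi'_theta
    old_dist: Normal,        # policy pi_theta that collected the data
    logp_new: torch.Tensor,  # log pi'_theta(a|s) for the sampled actions
    logp_old: torch.Tensor,  # log pi_theta(a|s) stored in the buffer
    adv: torch.Tensor,       # combined reward/cost advantage from _compute_adv_surrogate
    eta: float,              # FOCOPS temperature hyperparameter
) -> torch.Tensor:
    # KL(pi' || pi) keeps the parametric update close to the data-collecting policy.
    kl = kl_divergence(new_dist, old_dist).sum(dim=-1)
    # Importance-sampling ratio pi'(a|s) / pi(a|s).
    ratio = torch.exp(logp_new - logp_old)
    loss = kl - (1.0 / eta) * ratio * adv
    return loss.mean()
```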
- _update()[source]#
Update actor, critic, and Lagrange multiplier parameters.
In FOCOPS, the Lagrange multiplier is updated by the naive Lagrange multiplier update rule (a minimal sketch of this rule follows this entry).
Then, in each iteration of the policy update, FOCOPS computes the current policy's distribution, which is used to calculate the policy loss.
- Return type:
None
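The "naive" multiplier update mentioned above is a single projected gradient-ascent step on the constraint violation. A minimal sketch, assuming the average episode cost and the cost limit are available as plain floats; all names here are illustrative.

```python
def update_lagrange_multiplier(
    lagrange: float,      # current multiplier value, lambda >= 0
    mean_ep_cost: float,  # J_C: average episode cost from the latest rollouts
    cost_limit: float,    # d: the constraint threshold
    lambda_lr: float,     # step size for the multiplier
) -> float:
    """lambda <- max(0, lambda + lr * (J_C - d))."""
    return max(0.0, lagrange + lambda_lr * (mean_ep_cost - cost_limit))
```

The multiplier grows while the policy is over budget and decays back toward zero once the constraint is satisfied.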
CUP#
Documentation
- class omnisafe.algorithms.on_policy.CUP(env_id, cfgs)[source]#
The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.
References
- Title: Constrained Update Projection Approach to Safe Policy Optimization
- Authors: Long Yang, Jiaming Ji, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei Li, Yaodong Yang, Gang Pan.
- URL: CUP
Initialize an instance of algorithm.
- _init()[source]#
The initialization of the algorithm.
Here we additionally initialize the Lagrange multiplier.
- Return type:
None
- _init_log()[source]#
Log the CUP specific information.
Things to log:
- Metrics/LagrangeMultiplier: The Lagrange multiplier.
- Loss/Loss_pi_c: The loss of the cost performance.
- Train/SecondStepStopIter: The number of iterations to stop the second step.
- Train/SecondStepEntropy: The entropy of the current policy.
- Train/SecondStepPolicyRatio: The ratio between the current policy and the old policy.
- Return type:
None
- _loss_pi_cost(obs, act, logp, adv_c)[source]#
Compute the loss of the cost performance for the current policy.
We compute the KL divergence between the current policy and the old policy, the entropy of the current policy, and the ratio between the current policy and the old policy.
The loss of the cost performance is defined as:
(6)#
\[L = \underset{a \sim \pi_{\theta}}{\mathbb{E}} \left[ \lambda \frac{1 - \gamma \nu}{1 - \gamma} \frac{\pi_{\theta}^{'} (a|s)}{\pi_{\theta} (a|s)} A^{C}_{\pi_{\theta}} + KL \left( \pi_{\theta}^{'} (a|s) \| \pi_{\theta} (a|s) \right) \right]\]
where \(\lambda\) is the Lagrange multiplier, \(\frac{1 - \gamma \nu}{1 - \gamma}\) is the coefficient value, \(\pi_{\theta}^{'} (a|s)\) is the current policy, \(\pi_{\theta} (a|s)\) is the old policy, \(A^{C}_{\pi_{\theta}}\) is the cost advantage, and \(KL (\pi_{\theta}^{'} (a|s) \| \pi_{\theta} (a|s))\) is the KL divergence between the current policy and the old policy.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
  - act (torch.Tensor) – The action sampled from buffer.
  - logp (torch.Tensor) – The log probability of action sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The loss of the cost performance.
- Return type:
Tensor
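Read literally, Eq. (6) can be evaluated per batch as below. The distribution arguments, coef (the \(\frac{1 - \gamma \nu}{1 - \gamma}\) factor), and lagrange are assumptions for illustration rather than OmniSafe's exact signature.

```python
import torch
from torch.distributions import Normal, kl_divergence


def cup_cost_loss(
    new_dist: Normal,        # current policy pi'_theta
    old_dist: Normal,        # old policy pi_theta that collected the data
    logp_new: torch.Tensor,  # log pi'_theta(a|s)
    logp_old: torch.Tensor,  # log pi_theta(a|s) from the buffer
    adv_c: torch.Tensor,     # cost advantage A^C
    lagrange: float,         # Lagrange multiplier lambda
    coef: float,             # (1 - gamma * nu) / (1 - gamma)
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)              # policy ratio pi'/pi
    kl = kl_divergence(new_dist, old_dist).sum(dim=-1)  # KL(pi' || pi)
    # Penalize the scaled cost advantage while staying close to the old policy in KL.
    loss = lagrange * coef * ratio * adv_c + kl
    return loss.mean()
```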
- _update()[source]#
Update actor, critic, and Lagrange multiplier parameters.
In CUP, the Lagrange multiplier is updated by the naive Lagrange multiplier update rule.
Then, in each iteration of the policy update, CUP computes the current policy's distribution, which is used to calculate the policy loss (a rough sketch of such an update loop follows this entry).
- Return type:
None
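To connect the pieces, here is a rough, self-contained toy loop that iterates the Eq. (6) loss with an early stop on KL, the kind of quantity that Train/SecondStepStopIter, Train/SecondStepEntropy, and Train/SecondStepPolicyRatio record. The toy Gaussian policy, the synthetic batch, and every name below are assumptions for illustration, not OmniSafe's implementation.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

# Toy stand-ins for one buffer batch: actions, old log-probabilities, cost advantages.
act = torch.randn(64, 1)
old_dist = Normal(torch.zeros(64, 1), torch.ones(64, 1))
logp_old = old_dist.log_prob(act).sum(dim=-1)
adv_c = torch.randn(64)

# A single learnable mean shift stands in for the actor parameters.
mean_shift = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adam([mean_shift], lr=1e-2)

lagrange, coef, target_kl, max_iters = 0.5, 1.0, 0.02, 20

for stop_iter in range(max_iters):
    new_dist = Normal(old_dist.mean + mean_shift, old_dist.stddev)
    logp_new = new_dist.log_prob(act).sum(dim=-1)
    ratio = torch.exp(logp_new - logp_old)               # SecondStepPolicyRatio
    kl = kl_divergence(new_dist, old_dist).sum(dim=-1)
    # Eq. (6): push down the scaled cost advantage while staying close in KL.
    loss = (lagrange * coef * ratio * adv_c + kl).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if kl.mean().item() > target_kl:
        break  # early stop; stop_iter plays the role of SecondStepStopIter

print(stop_iter, new_dist.entropy().mean().item())       # SecondStepEntropy
```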