First Order Algorithms#
- FOCOPS: The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.
- CUP: The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.
FOCOPS#
Documentation
- class omnisafe.algorithms.on_policy.FOCOPS(env_id, cfgs)[source]#
The First Order Constrained Optimization in Policy Space (FOCOPS) algorithm.
References
- Title: First Order Constrained Optimization in Policy Space
- Authors: Yiming Zhang, Quan Vuong, Keith W. Ross.
- URL: FOCOPS
Initialize an instance of algorithm.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
FOCOPS uses the following surrogate loss:
(3)#
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}} (s, a) - \lambda A^C_{\pi_{\theta}} (s, a) \right]\]
- Parameters:
  - adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function combined with reward and cost.
- Return type:
Tensor
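As a rough illustration, the blend above is a one-liner in practice. The following is a minimal sketch, assuming the Lagrange multiplier is passed in as a plain float; the function and argument names are illustrative, not OmniSafe's exact internals.

```python
import torch


def compute_adv_surrogate(adv_r: torch.Tensor, adv_c: torch.Tensor, lagrange: float) -> torch.Tensor:
    """Blend reward and cost advantages: (A^R - lambda * A^C) / (1 + lambda)."""
    return (adv_r - lagrange * adv_c) / (1.0 + lagrange)
```

With \(\lambda = 0\) this reduces to the plain reward advantage; as \(\lambda\) grows, the cost advantage dominates the blend.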
- _init()[source]#
Initialize the FOCOPS specific model.
The FOCOPS algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]#
Log the FOCOPS specific information.
Things to log:
- Metrics/LagrangeMultiplier: The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs, act, logp, adv)[source]#
Compute pi/actor loss.
In FOCOPS, the loss is defined as:
\[L = \nabla_{\theta} D_{KL} \left( \pi_{\theta}^{'} \| \pi_{\theta} \right)[s] - \frac{1}{\eta} \underset{a \sim \pi_{\theta}}{\mathbb{E}} \left[ \frac{\nabla_{\theta} \pi_{\theta} (a \mid s)}{\pi_{\theta}(a \mid s)} \left( A^{R}_{\pi_{\theta}} (s, a) - \lambda A^C_{\pi_{\theta}} (s, a) \right) \right]\]
where \(\eta\) is a hyperparameter, \(\lambda\) is the Lagrange multiplier, \(A^{R}_{\pi_{\theta}}(s, a)\) is the reward advantage function, \(A^C_{\pi_{\theta}}(s, a)\) is the cost advantage function, \(\pi_{\theta}^{'}\) is the updated policy (the projection toward the optimal policy \(\pi^*\)), and \(\pi_{\theta}\) is the current policy.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
  - act (torch.Tensor) – The action sampled from buffer.
  - logp (torch.Tensor) – The log probability of action sampled from buffer.
  - adv (torch.Tensor) – The advantage sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
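As a concrete reading of the loss above, the sketch below evaluates it for a batch, assuming Gaussian policies and that adv is already the combined advantage from _compute_adv_surrogate. The argument names and the hyperparameter eta are assumptions for illustration, not the exact OmniSafe implementation.

```python
import torch
from torch.distributions import Normal, kl_divergence


def focops_policy_loss(
    new_dist: Normal,        # updated policy pi'_theta
    old_dist: Normal,        # policy pi_theta that collected the data
    logp_new: torch.Tensor,  # log pi'_theta(a|s) for the sampled actions
    logp_old: torch.Tensor,  # log pi_theta(a|s) stored in the buffer
    adv: torch.Tensor,       # combined reward/cost advantage from _compute_adv_surrogate
    eta: float,              # FOCOPS temperature hyperparameter
) -> torch.Tensor:
    # KL(pi' || pi) keeps the parametric update close to the data-collecting policy.
    kl = kl_divergence(new_dist, old_dist).sum(dim=-1)
    # Importance-sampling ratio pi'(a|s) / pi(a|s).
    ratio = torch.exp(logp_new - logp_old)
    loss = kl - (1.0 / eta) * ratio * adv
    return loss.mean()
```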
- _update()[source]#
Update actor, critic, and Lagrange multiplier parameters.
In FOCOPS, the Lagrange multiplier is updated by the naive Lagrange multiplier update rule (a minimal sketch of this rule follows this entry).
Then, in each iteration of the policy update, FOCOPS computes the current policy's distribution, which is used to calculate the policy loss.
- Return type:
None
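The "naive" multiplier update mentioned above is a single projected gradient-ascent step on the constraint violation. A minimal sketch, assuming the average episode cost and the cost limit are available as plain floats; all names here are illustrative.

```python
def update_lagrange_multiplier(
    lagrange: float,      # current multiplier value, lambda >= 0
    mean_ep_cost: float,  # J_C: average episode cost from the latest rollouts
    cost_limit: float,    # d: the constraint threshold
    lambda_lr: float,     # step size for the multiplier
) -> float:
    """lambda <- max(0, lambda + lr * (J_C - d))."""
    return max(0.0, lagrange + lambda_lr * (mean_ep_cost - cost_limit))
```

The multiplier grows while the policy is over budget and decays back toward zero once the constraint is satisfied.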
CUP#
Documentation
- class omnisafe.algorithms.on_policy.CUP(env_id, cfgs)[source]#
The Constrained Update Projection (CUP) Approach to Safe Policy Optimization.
References
- Title: Constrained Update Projection Approach to Safe Policy Optimization
- Authors: Long Yang, Jiaming Ji, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei Li, Yaodong Yang, Gang Pan.
- URL: CUP
Initialize an instance of algorithm.
- _init()[source]#
The initialization of the algorithm.
Here we additionally initialize the Lagrange multiplier.
- Return type:
None
- _init_log()[source]#
Log the CUP specific information.
Things to log:
- Metrics/LagrangeMultiplier: The Lagrange multiplier.
- Loss/Loss_pi_c: The loss of the cost performance.
- Train/SecondStepStopIter: The number of iterations to stop the second step.
- Train/SecondStepEntropy: The entropy of the current policy.
- Train/SecondStepPolicyRatio: The ratio between the current policy and the old policy.
- Return type:
None
- _loss_pi_cost(obs, act, logp, adv_c)[source]#
Compute the loss of the cost performance for the current policy.
We compute the KL divergence between the current policy and the old policy, the entropy of the current policy, and the ratio between the current policy and the old policy.
The loss of the cost performance is defined as:
(6)#
\[L = \underset{a \sim \pi_{\theta}}{\mathbb{E}} \left[ \lambda \frac{1 - \gamma \nu}{1 - \gamma} \frac{\pi_{\theta}^{'} (a|s)}{\pi_{\theta} (a|s)} A^{C}_{\pi_{\theta}} + KL \left( \pi_{\theta}^{'} (a|s) \| \pi_{\theta} (a|s) \right) \right]\]
where \(\lambda\) is the Lagrange multiplier, \(\frac{1 - \gamma \nu}{1 - \gamma}\) is the coefficient value, \(\pi_{\theta}^{'} (a|s)\) is the current policy, \(\pi_{\theta} (a|s)\) is the old policy, \(A^{C}_{\pi_{\theta}}\) is the cost advantage, and \(KL (\pi_{\theta}^{'} (a|s) \| \pi_{\theta} (a|s))\) is the KL divergence between the current policy and the old policy.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
  - act (torch.Tensor) – The action sampled from buffer.
  - logp (torch.Tensor) – The log probability of action sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The loss of the cost performance.
- Return type:
Tensor
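Read literally, Eq. (6) can be evaluated per batch as below. The distribution arguments, coef (the \(\frac{1 - \gamma \nu}{1 - \gamma}\) factor), and lagrange are assumptions for illustration rather than OmniSafe's exact signature.

```python
import torch
from torch.distributions import Normal, kl_divergence


def cup_cost_loss(
    new_dist: Normal,        # current policy pi'_theta
    old_dist: Normal,        # old policy pi_theta that collected the data
    logp_new: torch.Tensor,  # log pi'_theta(a|s)
    logp_old: torch.Tensor,  # log pi_theta(a|s) from the buffer
    adv_c: torch.Tensor,     # cost advantage A^C
    lagrange: float,         # Lagrange multiplier lambda
    coef: float,             # (1 - gamma * nu) / (1 - gamma)
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)              # policy ratio pi'/pi
    kl = kl_divergence(new_dist, old_dist).sum(dim=-1)  # KL(pi' || pi)
    # Penalize the scaled cost advantage while staying close to the old policy in KL.
    loss = lagrange * coef * ratio * adv_c + kl
    return loss.mean()
```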
- _update()[source]#
Update actor, critic, and Lagrange multiplier parameters.
In CUP, the Lagrange multiplier is updated by the naive Lagrange multiplier update rule.
Then, in each iteration of the policy update, CUP computes the current policy's distribution, which is used to calculate the policy loss (a rough sketch of such an update loop follows this entry).
- Return type:
None
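To connect the pieces, here is a rough, self-contained toy loop that iterates the Eq. (6) loss with an early stop on KL, the kind of quantity that Train/SecondStepStopIter, Train/SecondStepEntropy, and Train/SecondStepPolicyRatio record. The toy Gaussian policy, the synthetic batch, and every name below are assumptions for illustration, not OmniSafe's implementation.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

# Toy stand-ins for one buffer batch: actions, old log-probabilities, cost advantages.
act = torch.randn(64, 1)
old_dist = Normal(torch.zeros(64, 1), torch.ones(64, 1))
logp_old = old_dist.log_prob(act).sum(dim=-1)
adv_c = torch.randn(64)

# A single learnable mean shift stands in for the actor parameters.
mean_shift = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adam([mean_shift], lr=1e-2)

lagrange, coef, target_kl, max_iters = 0.5, 1.0, 0.02, 20

for stop_iter in range(max_iters):
    new_dist = Normal(old_dist.mean + mean_shift, old_dist.stddev)
    logp_new = new_dist.log_prob(act).sum(dim=-1)
    ratio = torch.exp(logp_new - logp_old)               # SecondStepPolicyRatio
    kl = kl_divergence(new_dist, old_dist).sum(dim=-1)
    # Eq. (6): push down the scaled cost advantage while staying close in KL.
    loss = (lagrange * coef * ratio * adv_c + kl).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if kl.mean().item() > target_kl:
        break  # early stop; stop_iter plays the role of SecondStepStopIter

print(stop_iter, new_dist.entropy().mean().item())       # SecondStepEntropy
```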