Lagrange Algorithms#
- PPOLag: The Lagrange version of the PPO algorithm.
- TRPOLag: The Lagrange version of the TRPO algorithm.
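As a usage sketch (the environment id is illustrative; any task registered with OmniSafe works the same way), a Lagrangian algorithm can be trained through the standard omnisafe.Agent interface:

```python
import omnisafe

# Usage sketch: train the Lagrangian PPO variant documented below.
# 'SafetyPointGoal1-v0' is an illustrative Safety-Gymnasium task id.
agent = omnisafe.Agent('PPOLag', 'SafetyPointGoal1-v0')
agent.learn()
```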
PPOLag#
Documentation
- class omnisafe.algorithms.on_policy.PPOLag(env_id, cfgs)[source]#
The Lagrange version of the PPO algorithm.
A simple combination of the Lagrange method and the Proximal Policy Optimization algorithm.
Initialize an instance of algorithm.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
PPOLag uses the following surrogate loss:
(3)#
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}} (s, a) - \lambda A^{C}_{\pi_{\theta}} (s, a) \right]\]
- Parameters:
  - adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function combined with reward and cost.
- Return type:
Tensor
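A minimal sketch of this combination, assuming a non-negative multiplier value is already available (the standalone function and the name lagrangian_multiplier are illustrative, not the exact internal attributes):

```python
import torch

def compute_adv_surrogate(adv_r: torch.Tensor,
                          adv_c: torch.Tensor,
                          lagrangian_multiplier: float) -> torch.Tensor:
    """Combine reward and cost advantages as in Eq. (3)."""
    penalty = lagrangian_multiplier
    # Dividing by (1 + lambda) keeps the combined advantage on roughly
    # the same scale as the plain reward advantage.
    return (adv_r - penalty * adv_c) / (1.0 + penalty)
```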
- _init()[source]#
Initialize the PPOLag specific model.
The PPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]#
Log the PPOLag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _update()[source]#
Update the actor and critic networks, as in the PolicyGradient algorithm. Additionally, update the Lagrange multiplier parameter by calling the update_lagrange_multiplier() method.
- Return type:
None
Note
The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:
(4)#
\[L_{\pi} = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \frac{\pi_{\theta} (a_t|s_t)}{\pi_{\theta}^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}} (s_t, a_t) - \lambda A^{C}_{\pi_{\theta}} (s_t, a_t) \right] \right]\]
where \(\lambda\) is the Lagrange multiplier parameter.
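Conceptually, update_lagrange_multiplier() performs gradient ascent on the constraint violation, so \(\lambda\) grows while the average episode cost exceeds the limit and shrinks otherwise. The sketch below illustrates this idea; the attribute names, the Adam step, and the clamping are assumptions for illustration, not the exact OmniSafe implementation:

```python
import torch

class LagrangeMultiplier:
    """Illustrative Lagrange multiplier with a gradient-ascent update."""

    def __init__(self, cost_limit: float, init_value: float = 0.001,
                 lr: float = 0.035) -> None:
        self.cost_limit = cost_limit
        # Parameterize the multiplier directly and clamp it to stay >= 0.
        self.lagrangian_multiplier = torch.nn.Parameter(
            torch.as_tensor(init_value), requires_grad=True)
        self.optimizer = torch.optim.Adam([self.lagrangian_multiplier], lr=lr)

    def update_lagrange_multiplier(self, mean_ep_cost: float) -> None:
        # Ascend on lambda * (Jc - d): the multiplier grows while the
        # average episode cost exceeds the limit and shrinks otherwise.
        self.optimizer.zero_grad()
        loss = -self.lagrangian_multiplier * (mean_ep_cost - self.cost_limit)
        loss.backward()
        self.optimizer.step()
        self.lagrangian_multiplier.data.clamp_(min=0.0)
```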
TRPOLag#
Documentation
- class omnisafe.algorithms.on_policy.TRPOLag(env_id, cfgs)[source]#
The Lagrange version of the TRPO algorithm.
A simple combination of the Lagrange method and the Trust Region Policy Optimization algorithm.
Initialize an instance of algorithm.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
TRPOLag uses the following surrogate loss:
(7)#
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}} (s, a) - \lambda A^{C}_{\pi_{\theta}} (s, a) \right]\]
- Parameters:
  - adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function combined with reward and cost.
- Return type:
Tensor
- _init()[source]#
Initialize the TRPOLag specific model.
The TRPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]#
Log the TRPOLag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _update()[source]#
Update the actor and critic networks, as in the PolicyGradient algorithm. Additionally, update the Lagrange multiplier parameter by calling the update_lagrange_multiplier() method.
- Return type:
None
Note
The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:
(8)#
\[L_{\pi} = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \frac{\pi_{\theta} (a_t|s_t)}{\pi_{\theta}^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}} (s_t, a_t) - \lambda A^{C}_{\pi_{\theta}} (s_t, a_t) \right] \right]\]
where \(\lambda\) is the Lagrange multiplier parameter.
CRPO#
Documentation
- class omnisafe.algorithms.on_policy.OnCRPO(env_id, cfgs)[source]#
The on-policy CRPO algorithm.
References
Title: CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee.
Authors: Tengyu Xu, Yingbin Liang, Guanghui Lan.
URL: CRPO.
Initialize an instance of OnCRPO.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute the advantage surrogate.
In the CRPO algorithm, we first check whether the cost is within the limit. If it is, we use the reward advantage; otherwise, we use the cost advantage, as sketched below.
- Parameters:
  - adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function chosen from reward and cost.
- Return type:
Tensor
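A minimal sketch of this switching rule; the mean_ep_cost, cost_limit, and tolerance arguments are assumptions standing in for the statistics the algorithm tracks internally:

```python
import torch

def compute_adv_surrogate(adv_r: torch.Tensor,
                          adv_c: torch.Tensor,
                          mean_ep_cost: float,
                          cost_limit: float,
                          tolerance: float = 0.0) -> torch.Tensor:
    """Choose which advantage to optimize, following the CRPO rule."""
    if mean_ep_cost <= cost_limit + tolerance:
        # Constraint satisfied: keep improving the reward.
        return adv_r
    # Constraint violated: maximize the negated cost advantage,
    # i.e. take an update step that reduces the expected cost.
    return -adv_c
```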
DDPGLag#
Documentation
- class omnisafe.algorithms.off_policy.DDPGLag(env_id, cfgs)[source]#
The Lagrangian version of Deep Deterministic Policy Gradient (DDPG) algorithm.
References
Title: Continuous control with deep reinforcement learning
Authors: Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra.
URL: DDPG
Initialize an instance of algorithm.
- _init()[source]#
The initialization of the algorithm.
Here we additionally initialize the Lagrange multiplier.
- Return type:
None
- _init_log()[source]#
Log the DDPGLag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in DDPGLag is defined as:
(10)#
\[L = -Q^V (s, \pi (s)) + \lambda Q^C (s, \pi (s))\]
where \(Q^V\) is the minimum of the two reward critic networks' outputs, \(Q^C\) is the value of the cost critic network, and \(\pi\) is the policy network.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
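A minimal sketch of this loss, assuming callables for the actor and the two critics and a scalar multiplier are passed in (these names are illustrative); SACLag and TD3Lag below use the same form of actor loss:

```python
import torch

def loss_pi(obs: torch.Tensor,
            actor,                       # maps observations to actions
            reward_critic,               # returns Q^V(s, a)
            cost_critic,                 # returns Q^C(s, a)
            lagrangian_multiplier: float) -> torch.Tensor:
    """Actor loss of Eq. (10): trade reward value against penalized cost."""
    action = actor(obs)
    q_r = reward_critic(obs, action)     # reward value of pi(s)
    q_c = cost_critic(obs, action)       # cost value of pi(s)
    return (-q_r + lagrangian_multiplier * q_c).mean()
```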
SACLag#
Documentation
- class omnisafe.algorithms.off_policy.SACLag(env_id, cfgs)[source]#
The Lagrangian version of Soft Actor-Critic (SAC) algorithm.
References
Title: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Authors: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine.
URL: SAC
Initialize an instance of algorithm.
- _init()[source]#
The initialization of the algorithm.
Here we additionally initialize the Lagrange multiplier.
- Return type:
None
- _init_log()[source]#
Log the SACLag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in SACLag is defined as:
(12)#
\[L = -Q^V (s, \pi (s)) + \lambda Q^C (s, \pi (s))\]
where \(Q^V\) is the minimum of the two reward critic networks' outputs, \(Q^C\) is the value of the cost critic network, and \(\pi\) is the policy network.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
TD3Lag#
Documentation
- class omnisafe.algorithms.off_policy.TD3Lag(env_id, cfgs)[source]#
The Lagrangian version of Twin Delayed DDPG (TD3) algorithm.
References
Title: Addressing Function Approximation Error in Actor-Critic Methods
Authors: Scott Fujimoto, Herke van Hoof, David Meger.
URL: TD3
Initialize an instance of algorithm.
- _init()[source]#
The initialization of the algorithm.
Here we additionally initialize the Lagrange multiplier.
- Return type:
None
- _init_log()[source]#
Log the TD3Lag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in TD3Lag is defined as:
(14)#
\[L = -Q^V (s, \pi (s)) + \lambda Q^C (s, \pi (s))\]
where \(Q^V\) is the minimum of the two reward critic networks' outputs, \(Q^C\) is the value of the cost critic network, and \(\pi\) is the policy network.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor