Lagrange Algorithms#
- PPOLag: The Lagrange version of the PPO algorithm.
- TRPOLag: The Lagrange version of the TRPO algorithm.
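As a usage sketch (the environment id is illustrative; any task registered with OmniSafe works the same way), a Lagrangian algorithm can be trained through the standard omnisafe.Agent interface:

```python
import omnisafe

# Usage sketch: train the Lagrangian PPO variant documented below.
# 'SafetyPointGoal1-v0' is an illustrative Safety-Gymnasium task id.
agent = omnisafe.Agent('PPOLag', 'SafetyPointGoal1-v0')
agent.learn()
```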
PPOLag#
Documentation
- class omnisafe.algorithms.on_policy.PPOLag(env_id, cfgs)[source]#
The Lagrange version of the PPO algorithm.
A simple combination of the Lagrange method and the Proximal Policy Optimization algorithm.
Initialize an instance of algorithm.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
PPOLag uses the following surrogate loss:
(3)#
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}} (s, a) - \lambda A^{C}_{\pi_{\theta}} (s, a) \right]\]
- Parameters:
  - adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function combined with reward and cost.
- Return type:
Tensor
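A minimal sketch of this combination, assuming a non-negative multiplier value is already available (the standalone function and the name lagrangian_multiplier are illustrative, not the exact internal attributes):

```python
import torch

def compute_adv_surrogate(adv_r: torch.Tensor,
                          adv_c: torch.Tensor,
                          lagrangian_multiplier: float) -> torch.Tensor:
    """Combine reward and cost advantages as in Eq. (3)."""
    penalty = lagrangian_multiplier
    # Dividing by (1 + lambda) keeps the combined advantage on roughly
    # the same scale as the plain reward advantage.
    return (adv_r - penalty * adv_c) / (1.0 + penalty)
```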
- _init()[source]#
Initialize the PPOLag specific model.
The PPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]#
Log the PPOLag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _update()[source]#
Update the actor and critic networks, as in the PolicyGradient algorithm. Additionally, update the Lagrange multiplier parameter by calling the update_lagrange_multiplier() method.
- Return type:
None
Note
The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:
(4)#
\[L_{\pi} = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \frac{\pi_{\theta} (a_t|s_t)}{\pi_{\theta}^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}} (s_t, a_t) - \lambda A^{C}_{\pi_{\theta}} (s_t, a_t) \right] \right]\]
where \(\lambda\) is the Lagrange multiplier parameter.
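Conceptually, update_lagrange_multiplier() performs gradient ascent on the constraint violation, so \(\lambda\) grows while the average episode cost exceeds the limit and shrinks otherwise. The sketch below illustrates this idea; the attribute names, the Adam step, and the clamping are assumptions for illustration, not the exact OmniSafe implementation:

```python
import torch

class LagrangeMultiplier:
    """Illustrative Lagrange multiplier with a gradient-ascent update."""

    def __init__(self, cost_limit: float, init_value: float = 0.001,
                 lr: float = 0.035) -> None:
        self.cost_limit = cost_limit
        # Parameterize the multiplier directly and clamp it to stay >= 0.
        self.lagrangian_multiplier = torch.nn.Parameter(
            torch.as_tensor(init_value), requires_grad=True)
        self.optimizer = torch.optim.Adam([self.lagrangian_multiplier], lr=lr)

    def update_lagrange_multiplier(self, mean_ep_cost: float) -> None:
        # Ascend on lambda * (Jc - d): the multiplier grows while the
        # average episode cost exceeds the limit and shrinks otherwise.
        self.optimizer.zero_grad()
        loss = -self.lagrangian_multiplier * (mean_ep_cost - self.cost_limit)
        loss.backward()
        self.optimizer.step()
        self.lagrangian_multiplier.data.clamp_(min=0.0)
```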
TRPOLag#
Documentation
- class omnisafe.algorithms.on_policy.TRPOLag(env_id, cfgs)[source]#
The Lagrange version of the TRPO algorithm.
A simple combination of the Lagrange method and the Trust Region Policy Optimization algorithm.
Initialize an instance of algorithm.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute surrogate loss.
TRPOLag uses the following surrogate loss:
(7)#
\[L = \frac{1}{1 + \lambda} \left[ A^{R}_{\pi_{\theta}} (s, a) - \lambda A^{C}_{\pi_{\theta}} (s, a) \right]\]
- Parameters:
  - adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function combined with reward and cost.
- Return type:
Tensor
- _init()[source]#
Initialize the TRPOLag specific model.
The TRPOLag algorithm uses a Lagrange multiplier to balance the cost and reward.
- Return type:
None
- _init_log()[source]#
Log the TRPOLag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _update()[source]#
Update the actor and critic networks, as in the PolicyGradient algorithm. Additionally, update the Lagrange multiplier parameter by calling the update_lagrange_multiplier() method.
- Return type:
None
Note
The _loss_pi() method is defined in the PolicyGradient algorithm. When a Lagrange multiplier is used, _loss_pi() returns the policy loss as:
(8)#
\[L_{\pi} = -\underset{s_t \sim \rho_{\theta}}{\mathbb{E}} \left[ \frac{\pi_{\theta} (a_t|s_t)}{\pi_{\theta}^{old}(a_t|s_t)} \left[ A^{R}_{\pi_{\theta}} (s_t, a_t) - \lambda A^{C}_{\pi_{\theta}} (s_t, a_t) \right] \right]\]
where \(\lambda\) is the Lagrange multiplier parameter.
CRPO#
Documentation
- class omnisafe.algorithms.on_policy.OnCRPO(env_id, cfgs)[source]#
The on-policy CRPO algorithm.
References
Title: CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee.
Authors: Tengyu Xu, Yingbin Liang, Guanghui Lan.
URL: CRPO.
Initialize an instance of OnCRPO.
- _compute_adv_surrogate(adv_r, adv_c)[source]#
Compute the advantage surrogate.
In the CRPO algorithm, we first check whether the cost is within the limit. If it is, we use the reward advantage; otherwise, we use the cost advantage, as sketched below.
- Parameters:
  - adv_r (torch.Tensor) – The reward_advantage sampled from buffer.
  - adv_c (torch.Tensor) – The cost_advantage sampled from buffer.
- Returns:
The advantage function chosen from reward and cost.
- Return type:
Tensor
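A minimal sketch of this switching rule; the mean_ep_cost, cost_limit, and tolerance arguments are assumptions standing in for the statistics the algorithm tracks internally:

```python
import torch

def compute_adv_surrogate(adv_r: torch.Tensor,
                          adv_c: torch.Tensor,
                          mean_ep_cost: float,
                          cost_limit: float,
                          tolerance: float = 0.0) -> torch.Tensor:
    """Choose which advantage to optimize, following the CRPO rule."""
    if mean_ep_cost <= cost_limit + tolerance:
        # Constraint satisfied: keep improving the reward.
        return adv_r
    # Constraint violated: maximize the negated cost advantage,
    # i.e. take an update step that reduces the expected cost.
    return -adv_c
```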
DDPGLag#
Documentation
- class omnisafe.algorithms.off_policy.DDPGLag(env_id, cfgs)[source]#
The Lagrangian version of Deep Deterministic Policy Gradient (DDPG) algorithm.
References
Title: Continuous control with deep reinforcement learning
Authors: Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra.
URL: DDPG
Initialize an instance of algorithm.
- _init()[source]#
The initialization of the algorithm.
Here we additionally initialize the Lagrange multiplier.
- Return type:
None
- _init_log()[source]#
Log the DDPGLag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in DDPGLag is defined as:
(10)#
\[L = -Q^V (s, \pi (s)) + \lambda Q^C (s, \pi (s))\]
where \(Q^V\) is the minimum of the two reward critic networks' outputs, \(Q^C\) is the value of the cost critic network, and \(\pi\) is the policy network.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
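A minimal sketch of this loss, assuming callables for the actor and the two critics and a scalar multiplier are passed in (these names are illustrative); SACLag and TD3Lag below use the same form of actor loss:

```python
import torch

def loss_pi(obs: torch.Tensor,
            actor,                       # maps observations to actions
            reward_critic,               # returns Q^V(s, a)
            cost_critic,                 # returns Q^C(s, a)
            lagrangian_multiplier: float) -> torch.Tensor:
    """Actor loss of Eq. (10): trade reward value against penalized cost."""
    action = actor(obs)
    q_r = reward_critic(obs, action)     # reward value of pi(s)
    q_c = cost_critic(obs, action)       # cost value of pi(s)
    return (-q_r + lagrangian_multiplier * q_c).mean()
```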
SACLag#
Documentation
- class omnisafe.algorithms.off_policy.SACLag(env_id, cfgs)[source]#
The Lagrangian version of Soft Actor-Critic (SAC) algorithm.
References
Title: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Authors: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine.
URL: SAC
Initialize an instance of algorithm.
- _init()[source]#
The initialization of the algorithm.
Here we additionally initialize the Lagrange multiplier.
- Return type:
None
- _init_log()[source]#
Log the SACLag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in SACLag is defined as:
(12)#
\[L = -Q^V (s, \pi (s)) + \lambda Q^C (s, \pi (s))\]
where \(Q^V\) is the minimum of the two reward critic networks' outputs, \(Q^C\) is the value of the cost critic network, and \(\pi\) is the policy network.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
TD3Lag#
Documentation
- class omnisafe.algorithms.off_policy.TD3Lag(env_id, cfgs)[source]#
The Lagrangian version of Twin Delayed DDPG (TD3) algorithm.
References
Title: Addressing Function Approximation Error in Actor-Critic Methods
Authors: Scott Fujimoto, Herke van Hoof, David Meger.
URL: TD3
Initialize an instance of algorithm.
- _init()[source]#
The initialization of the algorithm.
Here we additionally initialize the Lagrange multiplier.
- Return type:
None
- _init_log()[source]#
Log the TD3Lag specific information.
Things to log:
- Metrics/LagrangeMultiplier – The Lagrange multiplier.
- Return type:
None
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in TD3Lag is defined as:
(14)#
\[L = -Q^V (s, \pi (s)) + \lambda Q^C (s, \pi (s))\]
where \(Q^V\) is the minimum of the two reward critic networks' outputs, \(Q^C\) is the value of the cost critic network, and \(\pi\) is the policy network.
- Parameters:
  - obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor