Second Order Algorithms#

CPO(env_id, cfgs)

The Constrained Policy Optimization (CPO) algorithm.

PCPO(env_id, cfgs)

The Projection-Based Constrained Policy Optimization (PCPO) algorithm.

Constrained Policy Optimization#

Documentation

class omnisafe.algorithms.on_policy.CPO(env_id, cfgs)[source]#

The Constrained Policy Optimization (CPO) algorithm.

CPO is a derivative of TRPO.

References

  • Title: Constrained Policy Optimization

  • Authors: Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel.

  • URL: https://arxiv.org/abs/1705.10528

Initialize an instance of the algorithm.
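
As a usage illustration (not part of the class reference), a CPO agent is typically trained through OmniSafe's high-level Agent entry point; the environment id below is only an example.

    import omnisafe

    # Train CPO on an example Safety-Gymnasium task.
    env_id = 'SafetyPointGoal1-v0'
    agent = omnisafe.Agent('CPO', env_id)
    agent.learn()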

_cpo_search_step(step_direction, grads, p_dist, obs, act, logp, adv_r, adv_c, loss_reward_before, loss_cost_before, total_steps=15, decay=0.8, violation_c=0, optim_case=0)[source]#

Use a line search to find the step size that satisfies the constraints.

CPO uses a line search to find the step size that satisfies the constraints, which are defined as:

(3)#\[\begin{split}J^C (\theta + \alpha \delta) - J^C (\theta) \leq \max \{ 0, c \} \\ D_{KL} (\pi_{\theta} (\cdot|s) || \pi_{\theta + \alpha \delta} (\cdot|s)) \leq \delta_{KL}\end{split}\]

where \(\delta_{KL}\) is the KL-divergence constraint threshold, \(\alpha\) is the step size, and \(c\) is the constraint violation.

Parameters:
  • step_direction (torch.Tensor) – The step direction.

  • grads (torch.Tensor) – The gradient of the policy.

  • p_dist (torch.distributions.Distribution) – The old policy distribution.

  • obs (torch.Tensor) – The observation.

  • act (torch.Tensor) – The action.

  • logp (torch.Tensor) – The log probability of the action.

  • adv_r (torch.Tensor) – The reward advantage.

  • adv_c (torch.Tensor) – The cost advantage.

  • loss_reward_before (float) – The reward loss of the policy before the update.

  • loss_cost_before (float) – The cost loss of the policy before the update.

  • total_steps (int, optional) – The total steps to search. Defaults to 15.

  • decay (float, optional) – The decay rate of the step size. Defaults to 0.8.

  • violation_c (int, optional) – The violation of the constraint. Defaults to 0.

  • optim_case (int, optional) – The optimization case. Defaults to 0.

Returns:

A tuple of the final step direction and the acceptance step size.

Return type:

tuple[Tensor, int]
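
The backtracking procedure can be illustrated with a short sketch. This is not the library's implementation; the helpers compute_losses and compute_kl are hypothetical stand-ins for evaluating the surrogate losses and the KL divergence at candidate parameters.

    import torch

    def backtracking_line_search(
        theta_old: torch.Tensor,       # flattened policy parameters before the update
        step_direction: torch.Tensor,  # full step from the conjugate-gradient solver
        loss_reward_before: float,
        loss_cost_before: float,
        target_kl: float,              # delta_KL in constraint (3)
        violation_c: float,            # c, the current constraint violation
        compute_losses,                # hypothetical: theta -> (loss_reward, loss_cost)
        compute_kl,                    # hypothetical: theta -> KL(pi_old || pi_new)
        total_steps: int = 15,
        decay: float = 0.8,
    ):
        """Shrink the step until the reward loss improves and both constraints in (3) hold."""
        step_frac = 1.0
        for _ in range(total_steps):
            theta_new = theta_old + step_frac * step_direction
            loss_reward, loss_cost = compute_losses(theta_new)
            kl = compute_kl(theta_new)
            if (
                loss_reward <= loss_reward_before                          # reward surrogate not worse
                and loss_cost - loss_cost_before <= max(0.0, violation_c)  # cost constraint
                and kl <= target_kl                                        # trust-region constraint
            ):
                return theta_new, step_frac                                # accept this step
            step_frac *= decay                                             # otherwise backtrack
        return theta_old, 0.0                                              # no acceptable step found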

_determine_case(b_grads, ep_costs, q, r, s)[source]#

Determine the case of the trust region update.

Parameters:
  • b_grads (torch.Tensor) – Gradient of the cost function.

  • ep_costs (torch.Tensor) – Cost of the current episode.

  • q (torch.Tensor) – The quadratic term of the quadratic approximation of the cost function.

  • r (torch.Tensor) – The linear term of the quadratic approximation of the cost function.

  • s (torch.Tensor) – The constant term of the quadratic approximation of the cost function.

Returns:
  • optim_case – The case of the trust region update.

  • A – The quadratic term of the quadratic approximation of the cost function.

  • B – The linear term of the quadratic approximation of the cost function.

Return type:

tuple[int, Tensor, Tensor]
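
The case analysis mirrors the feasibility and recovery reasoning of the CPO paper. The sketch below is an illustration under the usual definitions \(q = g^T H^{-1} g\), \(r = g^T H^{-1} b\), \(s = b^T H^{-1} b\); the threshold values and exact return types are assumptions, not the library's code.

    import torch

    def determine_case(b_grads, ep_costs, q, r, s, target_kl=0.01):
        """Classify the trust-region sub-problem (sketch of the CPO case analysis)."""
        if b_grads.dot(b_grads) <= 1e-6 and ep_costs < 0:
            # Cost gradient is ~0 and the constraint is satisfied: plain TRPO step.
            A, B = torch.zeros(1), torch.zeros(1)
            return 4, A, B
        A = q - r**2 / s                     # positive by the Cauchy-Schwarz inequality
        B = 2 * target_kl - ep_costs**2 / s  # does the cost boundary intersect the trust region?
        if ep_costs < 0 and B < 0:
            optim_case = 3                   # whole trust region is feasible
        elif ep_costs < 0 and B >= 0:
            optim_case = 2                   # current point feasible, boundary intersects
        elif ep_costs >= 0 and B >= 0:
            optim_case = 1                   # current point infeasible, recovery possible
        else:
            optim_case = 0                   # infeasible, no intersection: pure recovery step
        return optim_case, A, B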

_init_log()[source]#

Log the CPO-specific information.

Things to log:

  • Misc/AcceptanceStep – The acceptance step size.

Return type:

None
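
In practice this override only adds one key on top of what the TRPO base class already logs. A minimal sketch, assuming the base class initializes the logger and that the logger exposes a register_key method (both assumptions here):

    def _init_log(self) -> None:
        # Inherit the parent algorithm's logging keys, then add the CPO-specific one.
        super()._init_log()
        self._logger.register_key('Misc/AcceptanceStep')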

_loss_pi_cost(obs, act, logp, adv_c)[source]#

Compute the cost performance of the current policy.

We compute the surrogate cost loss of the current policy from the cost advantages estimated under the old policy.

(4)#\[L = \mathbb{E}_{\pi} \left[ \frac{\pi^{'} (a|s)}{\pi (a|s)} A^C (s, a) \right]\]

where \(A^C (s, a)\) is the cost advantage, \(\pi (a|s)\) is the old policy, and \(\pi^{'} (a|s)\) is the current policy.

Parameters:
  • obs (torch.Tensor) – The observation sampled from the buffer.

  • act (torch.Tensor) – The action sampled from the buffer.

  • logp (torch.Tensor) – The log probability of the action sampled from the buffer.

  • adv_c (torch.Tensor) – The cost advantage sampled from the buffer.

Returns:

The loss of the cost performance.

Return type:

Tensor
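
Equation (4) reduces to an importance-weighted mean of the cost advantages. A minimal sketch, assuming the new log probabilities come from evaluating the current actor on the sampled actions:

    import torch

    def loss_pi_cost(logp_new: torch.Tensor, logp_old: torch.Tensor, adv_c: torch.Tensor) -> torch.Tensor:
        """Surrogate cost loss of equation (4): E[ pi'(a|s) / pi(a|s) * A^C(s, a) ]."""
        ratio = torch.exp(logp_new - logp_old)  # pi'(a|s) / pi(a|s)
        return (ratio * adv_c).mean()           # importance-weighted cost advantage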

_update_actor(obs, act, logp, adv_r, adv_c)[source]#

Update policy network.

Constrained Policy Optimization updates the policy network using the conjugate gradient algorithm, following these steps:

  • Compute the gradient of the policy.

  • Compute the step direction.

  • Search for a step size that satisfies the constraint.

  • Update the policy network.

Parameters:
  • obs (torch.Tensor) – The observation tensor.

  • act (torch.Tensor) – The action tensor.

  • logp (torch.Tensor) – The log probability of the action.

  • adv_r (torch.Tensor) – The reward advantage tensor.

  • adv_c (torch.Tensor) – The cost advantage tensor.

Return type:

None
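
The step direction in the second bullet comes from (approximately) solving \(H x = g\) with the conjugate gradient method, where \(H\) is the Fisher information matrix accessed only through matrix-vector products. The solver below is a generic sketch of that routine, not the library's implementation; the explicit matrix in the usage lines merely stands in for the Fisher information.

    import torch

    def conjugate_gradient(Avp, b, num_steps=10, residual_tol=1e-10):
        """Solve A x = b for symmetric positive-definite A, given only products Avp(v) = A v."""
        x = torch.zeros_like(b)
        r = b.clone()             # residual b - A x (x starts at zero)
        p = b.clone()             # search direction
        rdotr = r.dot(r)
        for _ in range(num_steps):
            Ap = Avp(p)
            alpha = rdotr / p.dot(Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            new_rdotr = r.dot(r)
            if new_rdotr < residual_tol:
                break
            p = r + (new_rdotr / rdotr) * p
            rdotr = new_rdotr
        return x

    # Usage: a small SPD matrix standing in for the Fisher information matrix H.
    H = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
    g = torch.tensor([1.0, 1.0])
    step_direction = conjugate_gradient(lambda v: H @ v, g)  # approx. H^{-1} g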

Projection-Based Constrained Policy Optimization#

Documentation

class omnisafe.algorithms.on_policy.PCPO(env_id, cfgs)[source]#

The Projection-Based Constrained Policy Optimization (PCPO) algorithm.

References

  • Title: Projection-Based Constrained Policy Optimization

  • Authors: Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J. Ramadge.

  • URL: https://arxiv.org/abs/2010.03152

Initialize an instance of the algorithm.

_update_actor(obs, act, logp, adv_r, adv_c)[source]#

Update policy network.

PCPO updates the policy network using the conjugate gradient algorithm, following these steps:

  • Compute the gradient of the policy.

  • Compute the step direction.

  • Search for a step size that satisfies the constraints (both the KL divergence and the cost limit).

  • Update the policy network.

Parameters:
  • obs (torch.Tensor) – The observation tensor.

  • act (torch.Tensor) – The action tensor.

  • logp (torch.Tensor) – The log probability of the action.

  • adv_r (torch.Tensor) – The reward advantage tensor.

  • adv_c (torch.Tensor) – The cost advantage tensor.

Return type:

None
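
For comparison with CPO's single constrained step, the PCPO paper derives a closed-form two-step update: first a TRPO-style reward improvement step, then a projection back onto the cost-constraint set. The sketch below writes out that closed form for the KL-projection variant; Hinv is a hypothetical callable returning \(H^{-1} v\), and the small ridge terms are assumptions added for numerical safety.

    import torch

    def pcpo_update(theta, g, b, Hinv, c, target_kl):
        """Two-step PCPO update (KL-projection variant), sketched from the paper's closed form.

        theta: flattened policy parameters; g: reward gradient; b: cost gradient;
        Hinv: callable v -> H^{-1} v; c: current constraint violation; target_kl: trust-region size.
        """
        # Step 1: reward improvement, a TRPO-style step inside the KL trust region.
        Hinv_g = Hinv(g)
        step_scale = torch.sqrt(2 * target_kl / (g.dot(Hinv_g) + 1e-8))
        theta_half = theta + step_scale * Hinv_g

        # Step 2: project back onto the cost-constraint set using the KL-divergence metric.
        Hinv_b = Hinv(b)
        proj_scale = torch.clamp((b.dot(theta_half - theta) + c) / (b.dot(Hinv_b) + 1e-8), min=0.0)
        return theta_half - proj_scale * Hinv_b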