Base Off-policy Algorithms#

DDPG(env_id, cfgs)

The Deep Deterministic Policy Gradient (DDPG) algorithm.

TD3(env_id, cfgs)

The Twin Delayed DDPG (TD3) algorithm.

SAC(env_id, cfgs)

The Soft Actor-Critic (SAC) algorithm.

Deep Deterministic Policy Gradient#

Documentation

class omnisafe.algorithms.off_policy.DDPG(env_id, cfgs)[source]#

The Deep Deterministic Policy Gradient (DDPG) algorithm.

References

  • Title: Continuous control with deep reinforcement learning

  • Authors: Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra.

  • URL: https://arxiv.org/abs/1509.02971

Initialize an instance of the algorithm.

_init()[source]#

The initialization of the algorithm.

Users can customize the initialization of the algorithm by overriding this method in a subclass.

Return type:

None

Examples

>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
_init_env()[source]#

Initialize the environment.

OmniSafe uses omnisafe.adapter.OffPolicyAdapter to adapt the environment to this algorithm.

Users can customize the environment by overriding this method in a subclass.

Return type:

None

Examples

>>> def _init_env(self) -> None:
...     self._env = CustomAdapter()
Raises:
  • AssertionError – If the number of steps per epoch is not divisible by the number of environments.

  • AssertionError – If the total number of steps is not divisible by the number of steps per epoch.
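
A minimal sketch of the checks behind these errors; the attribute names holding the configured step counts are illustrative, not the exact OmniSafe fields:

>>> # Illustrative attribute names; the actual configuration fields may differ.
>>> assert self._steps_per_epoch % self._num_envs == 0, (
...     'The number of steps per epoch must be divisible by the number of environments.'
... )
>>> assert self._total_steps % self._steps_per_epoch == 0, (
...     'The total number of steps must be divisible by the number of steps per epoch.'
... )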

_init_log()[source]#

Log information about the epoch.

Things to log and their descriptions:

  • Train/Epoch – Current epoch.
  • Metrics/EpCost – Average cost of the epoch.
  • Metrics/EpRet – Average return of the epoch.
  • Metrics/EpLen – Average episode length of the epoch.
  • Metrics/TestEpCost – Average cost of the evaluation epoch.
  • Metrics/TestEpRet – Average return of the evaluation epoch.
  • Metrics/TestEpLen – Average episode length of the evaluation epoch.
  • Value/reward_critic – Average reward value from the reward critic network during rollout() in the epoch.
  • Value/cost_critic – Average cost value from the cost critic network during rollout() in the epoch.
  • Loss/Loss_pi – Loss of the policy network.
  • Loss/Loss_reward_critic – Loss of the reward critic network.
  • Loss/Loss_cost_critic – Loss of the cost critic network.
  • Train/LR – Learning rate of the policy network.
  • Misc/Seed – Seed of the experiment.
  • Misc/TotalEnvSteps – Total environment steps of the experiment.
  • Time/Total – Total time.
  • Time/Rollout – Rollout time.
  • Time/Update – Update time.
  • Time/Evaluate – Evaluation time.
  • FPS – Frames per second of the epoch.

Return type:

None

_init_model()[source]#

Initialize the model.

OmniSafe uses omnisafe.models.actor_critic.constraint_actor_q_critic.ConstraintActorQCritic as the default model.

Users can customize the model by overriding this method in a subclass.

Return type:

None

Examples

>>> def _init_model(self) -> None:
...     self._actor_critic = CustomActorQCritic()
_log_when_not_update()[source]#

Log default values when no update is performed.

Return type:

None

_loss_pi(obs)[source]#

Compute the pi/actor loss.

The loss function in DDPG is defined as:

\[L = -Q^V (s, \pi (s))\]

where \(Q^V\) is the reward critic network, and \(\pi\) is the policy network.

Parameters:

obs (torch.Tensor) – The observation sampled from the buffer.

Returns:

The loss of pi/actor.

Return type:

Tensor
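
A minimal sketch of this loss, assuming the actor-critic exposes an actor with a deterministic predict method and a reward_critic that returns a list of Q estimates (illustrative names, not necessarily the exact OmniSafe attributes):

>>> import torch
>>> def _loss_pi(self, obs: torch.Tensor) -> torch.Tensor:
...     # Act deterministically, then score the action with the reward critic.
...     action = self._actor_critic.actor.predict(obs, deterministic=True)
...     # Maximizing Q is the same as minimizing -Q.
...     return -self._actor_critic.reward_critic(obs, action)[0].mean()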

_update()[source]#

Update the actor and critic networks.

  • Get the data from the buffer.

Note

The sampled data contains the following entries:

  • obs – the observation stored in the buffer.
  • act – the action stored in the buffer.
  • reward – the reward stored in the buffer.
  • cost – the cost stored in the buffer.
  • next_obs – the next observation stored in the buffer.
  • done – the termination flag stored in the buffer.

The basic process of each update is as follows (a sketch appears after this list):

  1. Get the mini-batch data from the buffer.

  2. Compute the loss of each network.

  3. Update the networks with the loss.

  4. Repeat steps 2 and 3 for update_iters iterations.

Return type:

None
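
A hedged sketch of that loop, assuming the buffer provides a sample_batch() method returning a dict and that the number of update iterations is stored on the instance (illustrative names):

>>> def _update(self) -> None:
...     for _ in range(self._update_iters):  # assumed attribute holding update_iters
...         data = self._buffer.sample_batch()  # step 1: mini-batch from the buffer
...         self._update_reward_critic(
...             data['obs'], data['act'], data['reward'], data['done'], data['next_obs'],
...         )
...         self._update_cost_critic(
...             data['obs'], data['act'], data['cost'], data['done'], data['next_obs'],
...         )
...         self._update_actor(data['obs'])  # steps 2-3: compute losses and step optimizers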

_update_actor(obs)[source]#

Update actor.

  • Get the loss of the actor.

  • Update the actor with the loss.

  • Log useful information.

Parameters:

obs (torch.Tensor) – The observation sampled from the buffer.

Return type:

None
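
A sketch of those three steps, assuming a dedicated actor optimizer and the logger key from the table above (illustrative attribute names):

>>> import torch
>>> def _update_actor(self, obs: torch.Tensor) -> None:
...     loss = self._loss_pi(obs)                 # get the loss of the actor
...     self._actor_optimizer.zero_grad()         # assumed optimizer attribute
...     loss.backward()                           # update the actor with the loss
...     self._actor_optimizer.step()
...     self._logger.store({'Loss/Loss_pi': loss.item()})  # log useful information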

_update_cost_critic(obs, action, cost, done, next_obs)[source]#

Update cost critic.

  • Get the TD loss of the cost critic.

  • Update the critic network with the loss.

  • Log useful information.

Parameters:
  • obs (torch.Tensor) – The observation sampled from the buffer.

  • action (torch.Tensor) – The action sampled from the buffer.

  • cost (torch.Tensor) – The cost sampled from the buffer.

  • done (torch.Tensor) – The termination flag sampled from the buffer.

  • next_obs (torch.Tensor) – The next observation sampled from the buffer.

Return type:

None

_update_reward_critic(obs, action, reward, done, next_obs)[source]#

Update reward critic.

  • Get the TD loss of the reward critic.

  • Update the critic network with the loss.

  • Log useful information.

Parameters:
  • obs (torch.Tensor) – The observation sampled from the buffer.

  • action (torch.Tensor) – The action sampled from the buffer.

  • reward (torch.Tensor) – The reward sampled from the buffer.

  • done (torch.Tensor) – The termination flag sampled from the buffer.

  • next_obs (torch.Tensor) – The next observation sampled from the buffer.

Return type:

None
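
Both critic updates follow the same one-step TD pattern. Below is a sketch for the reward critic (the cost critic is analogous, with cost in place of reward), assuming a discount factor, target networks, and a single Q head (illustrative names, not necessarily the exact OmniSafe attributes):

>>> import torch
>>> def _update_reward_critic(self, obs, action, reward, done, next_obs) -> None:
...     with torch.no_grad():
...         # Bootstrap the TD target from the target actor and target critic.
...         next_action = self._actor_critic.target_actor.predict(next_obs, deterministic=True)
...         next_q = self._actor_critic.target_reward_critic(next_obs, next_action)[0]
...         target = reward + self._gamma * (1 - done) * next_q  # assumed discount attribute
...     q = self._actor_critic.reward_critic(obs, action)[0]
...     loss = torch.nn.functional.mse_loss(q, target)
...     self._reward_critic_optimizer.zero_grad()  # assumed optimizer attribute
...     loss.backward()
...     self._reward_critic_optimizer.step()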

learn()[source]#

This is the main function for the algorithm update.

It is divided into the following steps:

  • rollout(): collect interactive data from the environment.

  • update(): perform the actor and critic updates.

  • log(): log epoch/update information for visualization and terminal printing.

Returns:
  • ep_ret – Average episode return in the final epoch.

  • ep_cost – Average episode cost in the final epoch.

  • ep_len – Average episode length in the final epoch.

Return type:

tuple[float, float, float]
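
A high-level sketch of that loop; the rollout call, logger methods, and attribute names are illustrative, not the exact OmniSafe signatures:

>>> def learn(self) -> tuple[float, float, float]:
...     for epoch in range(self._epochs):   # assumed attribute holding the epoch count
...         self._env.rollout(...)          # collect interactive data (arguments omitted)
...         self._update()                  # perform actor/critic updates
...         self._logger.dump_tabular()     # print and save the epoch log
...     # Report the final-epoch statistics described above.
...     ep_ret = self._logger.get_stats('Metrics/EpRet')[0]
...     ep_cost = self._logger.get_stats('Metrics/EpCost')[0]
...     ep_len = self._logger.get_stats('Metrics/EpLen')[0]
...     return ep_ret, ep_cost, ep_len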

Twin Delayed DDPG#

Documentation

class omnisafe.algorithms.off_policy.TD3(env_id, cfgs)[source]#

The Twin Delayed DDPG (TD3) algorithm.

References

  • Title: Addressing Function Approximation Error in Actor-Critic Methods

  • Authors: Scott Fujimoto, Herke van Hoof, David Meger.

  • URL: https://arxiv.org/abs/1802.09477

Initialize an instance of the algorithm.

_init_model()[source]#

Initialize the model.

The num_critics in the critic configuration must be 2.

Return type:

None

_update_reward_critic(obs, action, reward, done, next_obs)[source]#

Update reward critic.

  • Get the target action from the target actor.

  • Add noise to the target action.

  • Clip the noise.

  • Get the target Q values from the target critics.

  • Use the minimum target Q value to update the reward critic.

  • Log useful information.

Parameters:
  • obs (torch.Tensor) – The observation sampled from the buffer.

  • action (torch.Tensor) – The action sampled from the buffer.

  • reward (torch.Tensor) – The reward sampled from the buffer.

  • done (torch.Tensor) – The termination flag sampled from the buffer.

  • next_obs (torch.Tensor) – The next observation sampled from the buffer.

Return type:

None
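
A sketch of the target computation described above, assuming hyperparameters policy_noise, noise_clip, an action bound act_limit, and illustrative attribute names:

>>> import torch
>>> with torch.no_grad():
...     # Target policy smoothing: perturb the target action with clipped noise.
...     next_action = self._actor_critic.target_actor.predict(next_obs, deterministic=True)
...     noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
...     next_action = (next_action + noise).clamp(-act_limit, act_limit)
...     # Clipped double-Q: bootstrap from the smaller of the two target critics.
...     q1, q2 = self._actor_critic.target_reward_critic(next_obs, next_action)
...     target = reward + self._gamma * (1 - done) * torch.min(q1, q2)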

Soft Actor-Critic#

Documentation

class omnisafe.algorithms.off_policy.SAC(env_id, cfgs)[source]#

The Soft Actor-Critic (SAC) algorithm.

References

  • Title: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

  • Authors: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine.

  • URL: https://arxiv.org/abs/1801.01290

Initialize an instance of the algorithm.

property _alpha: float#

The value of the temperature parameter alpha.

_init()[source]#

The initialization of the algorithm.

Users can customize the initialization of the algorithm by overriding this method in a subclass.

Return type:

None

Examples

>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()

In SAC, we additionally need to initialize log_alpha and the alpha_optimizer, as sketched below.
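
A sketch of that extra initialization, assuming an auto_alpha flag, a fixed fallback alpha, and an alpha learning rate in the configuration (illustrative names); the target entropy uses the common -|A| heuristic:

>>> import torch
>>> def _init(self) -> None:
...     super()._init()
...     if self._cfgs.algo_cfgs.auto_alpha:  # assumed config flag
...         # Optimize log(alpha) so that alpha stays positive.
...         self._target_entropy = -float(self._env.action_space.shape[0])
...         self._log_alpha = torch.zeros(1, requires_grad=True)
...         self._alpha_optimizer = torch.optim.Adam(
...             [self._log_alpha], lr=self._cfgs.algo_cfgs.alpha_lr,  # assumed config field
...         )
...     else:
...         self._log_alpha = torch.log(torch.tensor(self._cfgs.algo_cfgs.alpha))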

_init_log()[source]#

Log information about the epoch.

Things to log and their descriptions:

  • Train/Epoch – Current epoch.
  • Metrics/EpCost – Average cost of the epoch.
  • Metrics/EpRet – Average return of the epoch.
  • Metrics/EpLen – Average episode length of the epoch.
  • Metrics/TestEpCost – Average cost of the evaluation epoch.
  • Metrics/TestEpRet – Average return of the evaluation epoch.
  • Metrics/TestEpLen – Average episode length of the evaluation epoch.
  • Value/reward_critic – Average reward value from the reward critic network during rollout() in the epoch.
  • Value/cost_critic – Average cost value from the cost critic network during rollout() in the epoch.
  • Loss/Loss_pi – Loss of the policy network.
  • Loss/Loss_reward_critic – Loss of the reward critic network.
  • Loss/Loss_cost_critic – Loss of the cost critic network.
  • Train/LR – Learning rate of the policy network.
  • Misc/Seed – Seed of the experiment.
  • Misc/TotalEnvSteps – Total environment steps of the experiment.
  • Time/Total – Total time.
  • Time/Rollout – Rollout time.
  • Time/Update – Update time.
  • Time/Evaluate – Evaluation time.
  • FPS – Frames per second of the epoch.

Return type:

None

_init_model()[source]#

Initialize the model.

The num_critics in the critic configuration must be 2.

Return type:

None

_log_when_not_update()[source]#

Log default values when no update is performed.

Return type:

None

_loss_pi(obs)[source]#

Compute the pi/actor loss.

The loss function in SAC is defined as:

\[L = -Q^V (s, \pi (s)) + \alpha \log \pi (s)\]

where \(Q^V\) is the minimum of the two reward critic networks, \(\pi\) is the policy network, and \(\alpha\) is the temperature parameter.

Parameters:

obs (torch.Tensor) – The observation sampled from the buffer.

Returns:

The loss of pi/actor.

Return type:

Tensor
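
A minimal sketch of this loss, assuming the actor can return a sampled action together with its log-probability and that the reward critic returns two Q estimates (illustrative names, not necessarily the exact OmniSafe attributes):

>>> import torch
>>> def _loss_pi(self, obs: torch.Tensor) -> torch.Tensor:
...     # Sample a (reparameterized) action and its log-probability.
...     action = self._actor_critic.actor.predict(obs, deterministic=False)
...     log_prob = self._actor_critic.actor.log_prob(action)
...     # Minimum over the two reward critics, plus the entropy term weighted by alpha.
...     q1, q2 = self._actor_critic.reward_critic(obs, action)
...     return (self._alpha * log_prob - torch.min(q1, q2)).mean()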

_update_actor(obs)[source]#

Update the actor, and also update alpha if auto_alpha is True.

Parameters:

obs (torch.Tensor) – The observation sampled from the buffer.

Return type:

None
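
A sketch of the alpha update performed alongside the actor step when auto_alpha is True, assuming the log_alpha, alpha_optimizer, and target entropy set up in _init() (illustrative names):

>>> import torch
>>> def _update_actor(self, obs: torch.Tensor) -> None:
...     super()._update_actor(obs)  # DDPG-style actor step with the SAC loss
...     if self._cfgs.algo_cfgs.auto_alpha:  # assumed config flag
...         with torch.no_grad():
...             action = self._actor_critic.actor.predict(obs, deterministic=False)
...             log_prob = self._actor_critic.actor.log_prob(action)
...         # Push alpha toward matching the target entropy.
...         alpha_loss = -(self._log_alpha * (log_prob + self._target_entropy)).mean()
...         self._alpha_optimizer.zero_grad()
...         alpha_loss.backward()
...         self._alpha_optimizer.step()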

_update_reward_critic(obs, action, reward, done, next_obs)[source]#

Update reward critic.

  • Sample the target action from the target actor.

  • Get the target Q value from the target critic.

  • Use the minimum target Q value to update the reward critic.

  • Add the entropy term to the reward critic target.

  • Log useful information.

Parameters:
  • obs (torch.Tensor) – The observation sampled from the buffer.

  • action (torch.Tensor) – The action sampled from the buffer.

  • reward (torch.Tensor) – The reward sampled from the buffer.

  • done (torch.Tensor) – The termination flag sampled from the buffer.

  • next_obs (torch.Tensor) – The next observation sampled from the buffer.

Return type:

None
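
A sketch of the entropy-regularized target described above, with the same illustrative attribute names as the earlier critic sketch:

>>> import torch
>>> with torch.no_grad():
...     # Sample the next action and its log-probability from the policy.
...     next_action = self._actor_critic.actor.predict(next_obs, deterministic=False)
...     next_log_prob = self._actor_critic.actor.log_prob(next_action)
...     q1, q2 = self._actor_critic.target_reward_critic(next_obs, next_action)
...     # Soft Q target: clipped double-Q minus the weighted entropy term.
...     target = reward + self._gamma * (1 - done) * (
...         torch.min(q1, q2) - self._alpha * next_log_prob
...     )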