Base Off-policy Algorithms#
- The Deep Deterministic Policy Gradient (DDPG) algorithm.
- The Twin Delayed DDPG (TD3) algorithm.
- The Soft Actor-Critic (SAC) algorithm.
Deep Deterministic Policy Gradient#
Documentation
- class omnisafe.algorithms.off_policy.DDPG(env_id, cfgs)[source]#
The Deep Deterministic Policy Gradient (DDPG) algorithm.
References
Title: Continuous control with deep reinforcement learning
Authors: Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra.
URL: DDPG
Initialize an instance of the algorithm.
- _init()[source]#
The initialization of the algorithm.
Users can customize the initialization of the algorithm by overriding this method.
- Return type:
None
Examples
>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
- _init_env()[source]#
Initialize the environment.
OmniSafe uses omnisafe.adapter.OffPolicyAdapter to adapt the environment to this algorithm. Users can customize the environment by overriding this method.
- Return type:
None
Examples
>>> def _init_env(self) -> None:
...     self._env = CustomAdapter()
- Raises:
AssertionError – If the number of steps per epoch is not divisible by the number of environments.
AssertionError – If the total number of steps is not divisible by the number of steps per epoch.
- _init_log()[source]#
Log info about the epoch.
Things to log | Description
Train/Epoch | Current epoch.
Metrics/EpCost | Average cost of the epoch.
Metrics/EpRet | Average return of the epoch.
Metrics/EpLen | Average length of the epoch.
Metrics/TestEpCost | Average cost of the evaluation epoch.
Metrics/TestEpRet | Average return of the evaluation epoch.
Metrics/TestEpLen | Average length of the evaluation epoch.
Value/reward_critic | Average value in rollout() (from the critic network) of the epoch.
Values/cost_critic | Average cost in rollout() (from the critic network) of the epoch.
Loss/Loss_pi | Loss of the policy network.
Loss/Loss_reward_critic | Loss of the reward critic.
Loss/Loss_cost_critic | Loss of the cost critic network.
Train/LR | Learning rate of the policy network.
Misc/Seed | Seed of the experiment.
Misc/TotalEnvSteps | Total steps of the experiment.
Time/Total | Total time.
Time/Rollout | Rollout time.
Time/Update | Update time.
Time/Evaluate | Evaluation time.
FPS | Frames per second of the epoch.
- Return type:
None
- _init_model()[source]#
Initialize the model.
OmniSafe uses omnisafe.models.actor_critic.constraint_actor_q_critic.ConstraintActorQCritic as the default model. Users can customize the model by overriding this method.
- Return type:
None
Examples
>>> def _init_model(self) -> None:
...     self._actor_critic = CustomActorQCritic()
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in DDPG is defined as:
\[L = -Q^V (s, \pi (s))\]
where \(Q^V\) is the reward critic network, and \(\pi\) is the policy network.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
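As a rough illustration of this loss, here is a minimal sketch in plain PyTorch; actor and reward_critic are illustrative callables, not the exact OmniSafe attributes:
>>> import torch
>>> def ddpg_actor_loss(actor, reward_critic, obs: torch.Tensor) -> torch.Tensor:
...     # L = -Q^V(s, pi(s)): act deterministically, then score with the reward critic.
...     action = actor(obs)
...     return -reward_critic(obs, action).mean()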
- _update()[source]#
Update actor and critic.
- Return type:
None

Get the data from the buffer.

Note
obs: observation stored in buffer.
act: action stored in buffer.
reward: reward stored in buffer.
cost: cost stored in buffer.
next_obs: next observation stored in buffer.
done: terminated flag stored in buffer.

Update the value network by _update_reward_critic().
Update the cost network by _update_cost_critic().
Update the policy network by _update_actor().
The basic process of each update is as follows (see the sketch below):
1. Get the mini-batch data from the buffer.
2. Compute the loss of the network.
3. Update the network using the loss.
4. Repeat steps 2 and 3 for update_iters iterations.
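A schematic version of that loop, with the buffer API and the network-update helpers as illustrative placeholders rather than OmniSafe internals:
>>> def update(buffer, update_critics, update_actor, update_iters: int, batch_size: int) -> None:
...     for _ in range(update_iters):
...         # Sample a mini-batch, compute the losses, and take one gradient step per network.
...         batch = buffer.sample(batch_size)
...         update_critics(batch)
...         update_actor(batch['obs'])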
- _update_actor(obs)[source]#
Update actor.
Get the loss of actor.
Update actor by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Return type:
None
- _update_cost_critic(obs, action, cost, done, next_obs)[source]#
Update cost critic.
Get the TD loss of cost critic.
Update critic network by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
cost (torch.Tensor) – The cost sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
- _update_reward_critic(obs, action, reward, done, next_obs)[source]#
Update reward critic.
Get the TD loss of reward critic.
Update critic network by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
reward (torch.Tensor) – The reward sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
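A hedged sketch of the TD loss used in this update; target_actor, target_critic, and the gamma value are illustrative names and a typical discount factor, not the exact OmniSafe attributes:
>>> import torch
>>> import torch.nn.functional as F
>>> def ddpg_critic_loss(critic, target_actor, target_critic,
...                      obs, action, reward, done, next_obs, gamma=0.99):
...     with torch.no_grad():
...         next_action = target_actor(next_obs)
...         target_q = reward + gamma * (1.0 - done) * target_critic(next_obs, next_action)
...     # Mean-squared TD error between the current Q estimate and the bootstrapped target.
...     return F.mse_loss(critic(obs, action), target_q)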
- learn()[source]#
This is the main function for the algorithm update.
It is divided into the following steps:
- rollout(): collect interactive data from the environment.
- update(): perform the actor/critic updates.
- log(): log epoch/update information for visualization and terminal printing.
- Returns:
ep_ret – average episode return in the final epoch.
ep_cost – average episode cost in the final epoch.
ep_len – average episode length in the final epoch.
- Return type:
tuple[float, float, float]
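For reference, training is typically launched through OmniSafe's high-level interface rather than by calling learn() directly; a minimal sketch, assuming the omnisafe.Agent entry point and an example environment id:
>>> import omnisafe
>>> agent = omnisafe.Agent('DDPG', 'SafetyPointGoal1-v0')  # example environment id
>>> agent.learn()  # returns (ep_ret, ep_cost, ep_len) of the final epoch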
Twin Delayed DDPG#
Documentation
- class omnisafe.algorithms.off_policy.TD3(env_id, cfgs)[source]#
The Twin Delayed DDPG (TD3) algorithm.
References
Title: Addressing Function Approximation Error in Actor-Critic Methods
Authors: Scott Fujimoto, Herke van Hoof, David Meger.
URL: TD3
Initialize an instance of the algorithm.
- _init_model()[source]#
Initialize the model.
The num_critics in the critic configuration must be 2.
- Return type:
None
- _update_reward_critic(obs, action, reward, done, next_obs)[source]#
Update reward critic.
Get the target action by target actor.
Add noise to target action.
Clip the noise.
Get the target Q value by target critic.
Use the minimum target Q value to update reward critic.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
reward (torch.Tensor) – The reward sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
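A minimal sketch of this clipped double-Q target with target policy smoothing; the attribute names, noise scale, and clip range are illustrative assumptions:
>>> import torch
>>> def td3_target(target_actor, target_critics, reward, done, next_obs,
...                gamma=0.99, policy_noise=0.2, noise_clip=0.5):
...     with torch.no_grad():
...         next_action = target_actor(next_obs)
...         # Target policy smoothing: add clipped noise to the target action.
...         noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
...         next_action = (next_action + noise).clamp(-1.0, 1.0)
...         # Clipped double-Q: bootstrap from the minimum of the two target critics.
...         q1, q2 = (c(next_obs, next_action) for c in target_critics)
...         return reward + gamma * (1.0 - done) * torch.min(q1, q2)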
Soft Actor-Critic#
Documentation
- class omnisafe.algorithms.off_policy.SAC(env_id, cfgs)[source]#
The Soft Actor-Critic (SAC) algorithm.
References
Title: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Authors: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine.
URL: SAC
Initialize an instance of the algorithm.
- property _alpha: float#
The value of alpha.
- _init()[source]#
The initialization of the algorithm.
Users can customize the initialization of the algorithm by overriding this method.
- Return type:
None
Examples
>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
In SAC, we also need to initialize the log_alpha and alpha_optimizer.
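A minimal sketch of such an initialization, assuming a learnable log-temperature optimized with Adam and the common -|A| entropy target (all names and values here are illustrative, not the OmniSafe defaults):
>>> import torch
>>> act_dim = 2  # example action dimension
>>> target_entropy = -float(act_dim)  # common heuristic: target entropy of -|A|
>>> log_alpha = torch.zeros(1, requires_grad=True)
>>> alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
>>> alpha = log_alpha.exp().item()  # current temperature value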
- _init_log()[source]#
Log info about the epoch.
Things to log | Description
Train/Epoch | Current epoch.
Metrics/EpCost | Average cost of the epoch.
Metrics/EpRet | Average return of the epoch.
Metrics/EpLen | Average length of the epoch.
Metrics/TestEpCost | Average cost of the evaluation epoch.
Metrics/TestEpRet | Average return of the evaluation epoch.
Metrics/TestEpLen | Average length of the evaluation epoch.
Value/reward_critic | Average value in rollout() (from the critic network) of the epoch.
Values/cost_critic | Average cost in rollout() (from the critic network) of the epoch.
Loss/Loss_pi | Loss of the policy network.
Loss/Loss_reward_critic | Loss of the reward critic.
Loss/Loss_cost_critic | Loss of the cost critic network.
Train/LR | Learning rate of the policy network.
Misc/Seed | Seed of the experiment.
Misc/TotalEnvSteps | Total steps of the experiment.
Time/Total | Total time.
Time/Rollout | Rollout time.
Time/Update | Update time.
Time/Evaluate | Evaluation time.
FPS | Frames per second of the epoch.
- Return type:
None
- _init_model()[source]#
Initialize the model.
The num_critics in the critic configuration must be 2.
- Return type:
None
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in SAC is defined as:
\[L = -Q^V (s, \pi (s)) + \alpha \log \pi (s)\]
where \(Q^V\) is the minimum of the two reward critic networks, \(\pi\) is the policy network, and \(\alpha\) is the temperature parameter.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
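As a rough sketch of this loss, assuming an actor whose sample() method returns a reparameterized action and its log-probability, and two reward critics (all names are illustrative):
>>> import torch
>>> def sac_actor_loss(actor, critics, log_alpha, obs):
...     action, log_prob = actor.sample(obs)  # reparameterized sample keeps gradients
...     q1, q2 = (c(obs, action) for c in critics)
...     # L = alpha * log pi(a|s) - min_i Q_i(s, a), averaged over the batch.
...     return (log_alpha.exp().detach() * log_prob - torch.min(q1, q2)).mean()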
- _update_actor(obs)[source]#
Update the actor, and alpha if auto_alpha is True.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Return type:
None
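When auto_alpha is enabled, the temperature itself is adjusted from the sampled log-probabilities; a hedged sketch of the standard SAC temperature update, where target_entropy is an assumed entropy target rather than a documented OmniSafe attribute:
>>> import torch
>>> def update_alpha(log_alpha, alpha_optimizer, log_prob, target_entropy):
...     # Standard SAC temperature loss: drive the policy entropy toward target_entropy.
...     alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
...     alpha_optimizer.zero_grad()
...     alpha_loss.backward()
...     alpha_optimizer.step()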
- _update_reward_critic(obs, action, reward, done, next_obs)[source]#
Update reward critic.
Sample the target action by target actor.
Get the target Q value by target critic.
Use the minimum target Q value to update reward critic.
Add the entropy loss to reward critic.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
reward (torch.Tensor) – The reward sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
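A minimal sketch of the soft TD target described above, with illustrative names and the same sampling convention as the actor-loss sketch:
>>> import torch
>>> def sac_target(target_actor, target_critics, log_alpha, reward, done, next_obs, gamma=0.99):
...     with torch.no_grad():
...         next_action, next_log_prob = target_actor.sample(next_obs)
...         q1, q2 = (c(next_obs, next_action) for c in target_critics)
...         # Soft target: subtract the entropy term alpha * log pi from the min Q.
...         soft_q = torch.min(q1, q2) - log_alpha.exp() * next_log_prob
...         return reward + gamma * (1.0 - done) * soft_q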