Base Off-policy Algorithms#
- The Deep Deterministic Policy Gradient (DDPG) algorithm.
- The Twin Delayed DDPG (TD3) algorithm.
- The Soft Actor-Critic (SAC) algorithm.
Deep Deterministic Policy Gradient#
Documentation
- class omnisafe.algorithms.off_policy.DDPG(env_id, cfgs)[source]#
The Deep Deterministic Policy Gradient (DDPG) algorithm.
References
Title: Continuous control with deep reinforcement learning
Authors: Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra.
URL: DDPG
Initialize an instance of the algorithm.
- _init()[source]#
The initialization of the algorithm.
Users can customize the initialization of the algorithm by overriding this method.
- Return type:
None
Examples
>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
- _init_env()[source]#
Initialize the environment.
OmniSafe uses omnisafe.adapter.OffPolicyAdapter to adapt the environment to this algorithm. Users can customize the environment by overriding this method.
- Return type:
None
Examples
>>> def _init_env(self) -> None:
...     self._env = CustomAdapter()
- Raises:
AssertionError – If the number of steps per epoch is not divisible by the number of environments.
AssertionError – If the total number of steps is not divisible by the number of steps per epoch.
- _init_log()[source]#
Log info about the epoch.
Things to log | Description
Train/Epoch | Current epoch.
Metrics/EpCost | Average cost of the epoch.
Metrics/EpRet | Average return of the epoch.
Metrics/EpLen | Average length of the epoch.
Metrics/TestEpCost | Average cost of the evaluation epoch.
Metrics/TestEpRet | Average return of the evaluation epoch.
Metrics/TestEpLen | Average length of the evaluation epoch.
Value/reward_critic | Average value in rollout() (from the critic network) of the epoch.
Values/cost_critic | Average cost in rollout() (from the critic network) of the epoch.
Loss/Loss_pi | Loss of the policy network.
Loss/Loss_reward_critic | Loss of the reward critic.
Loss/Loss_cost_critic | Loss of the cost critic network.
Train/LR | Learning rate of the policy network.
Misc/Seed | Seed of the experiment.
Misc/TotalEnvSteps | Total steps of the experiment.
Time/Total | Total time.
Time/Rollout | Rollout time.
Time/Update | Update time.
Time/Evaluate | Evaluation time.
FPS | Frames per second of the epoch.
- Return type:
None
- _init_model()[source]#
Initialize the model.
OmniSafe uses omnisafe.models.actor_critic.constraint_actor_q_critic.ConstraintActorQCritic as the default model. Users can customize the model by overriding this method.
- Return type:
None
Examples
>>> def _init_model(self) -> None:
...     self._actor_critic = CustomActorQCritic()
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in DDPG is defined as:
\[L = -Q^V (s, \pi (s))\]
where \(Q^V\) is the reward critic network, and \(\pi\) is the policy network.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
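As a rough illustration of this loss, here is a minimal sketch in plain PyTorch; actor and reward_critic are illustrative callables, not the exact OmniSafe attributes:
>>> import torch
>>> def ddpg_actor_loss(actor, reward_critic, obs: torch.Tensor) -> torch.Tensor:
...     # L = -Q^V(s, pi(s)): act deterministically, then score with the reward critic.
...     action = actor(obs)
...     return -reward_critic(obs, action).mean()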
- _update()[source]#
Update actor and critic.
- Return type:
None

Get the data from the buffer.

Note
obs: observation stored in buffer.
act: action stored in buffer.
reward: reward stored in buffer.
cost: cost stored in buffer.
next_obs: next observation stored in buffer.
done: terminated flag stored in buffer.

Update the value network by _update_reward_critic().
Update the cost network by _update_cost_critic().
Update the policy network by _update_actor().
The basic process of each update is as follows (see the sketch below):
1. Get the mini-batch data from the buffer.
2. Compute the loss of the network.
3. Update the network using the loss.
4. Repeat steps 2 and 3 for update_iters iterations.
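A schematic version of that loop, with the buffer API and the network-update helpers as illustrative placeholders rather than OmniSafe internals:
>>> def update(buffer, update_critics, update_actor, update_iters: int, batch_size: int) -> None:
...     for _ in range(update_iters):
...         # Sample a mini-batch, compute the losses, and take one gradient step per network.
...         batch = buffer.sample(batch_size)
...         update_critics(batch)
...         update_actor(batch['obs'])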
- _update_actor(obs)[source]#
Update actor.
Get the loss of actor.
Update actor by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Return type:
None
- _update_cost_critic(obs, action, cost, done, next_obs)[source]#
Update cost critic.
Get the TD loss of cost critic.
Update critic network by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
cost (torch.Tensor) – The cost sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
- _update_reward_critic(obs, action, reward, done, next_obs)[source]#
Update reward critic.
Get the TD loss of reward critic.
Update critic network by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
reward (torch.Tensor) – The reward sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
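A hedged sketch of the TD loss used in this update; target_actor, target_critic, and the gamma value are illustrative names and a typical discount factor, not the exact OmniSafe attributes:
>>> import torch
>>> import torch.nn.functional as F
>>> def ddpg_critic_loss(critic, target_actor, target_critic,
...                      obs, action, reward, done, next_obs, gamma=0.99):
...     with torch.no_grad():
...         next_action = target_actor(next_obs)
...         target_q = reward + gamma * (1.0 - done) * target_critic(next_obs, next_action)
...     # Mean-squared TD error between the current Q estimate and the bootstrapped target.
...     return F.mse_loss(critic(obs, action), target_q)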
- learn()[source]#
This is the main function for the algorithm update.
It is divided into the following steps:
- rollout(): collect interactive data from the environment.
- update(): perform the actor/critic updates.
- log(): log epoch/update information for visualization and terminal printing.
- Returns:
ep_ret – average episode return in the final epoch.
ep_cost – average episode cost in the final epoch.
ep_len – average episode length in the final epoch.
- Return type:
tuple[float, float, float]
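For reference, training is typically launched through OmniSafe's high-level interface rather than by calling learn() directly; a minimal sketch, assuming the omnisafe.Agent entry point and an example environment id:
>>> import omnisafe
>>> agent = omnisafe.Agent('DDPG', 'SafetyPointGoal1-v0')  # example environment id
>>> agent.learn()  # returns (ep_ret, ep_cost, ep_len) of the final epoch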
Twin Delayed DDPG#
Documentation
- class omnisafe.algorithms.off_policy.TD3(env_id, cfgs)[source]#
The Twin Delayed DDPG (TD3) algorithm.
References
Title: Addressing Function Approximation Error in Actor-Critic Methods
Authors: Scott Fujimoto, Herke van Hoof, David Meger.
URL: TD3
Initialize an instance of the algorithm.
- _init_model()[source]#
Initialize the model.
The num_critics in the critic configuration must be 2.
- Return type:
None
- _update_reward_critic(obs, action, reward, done, next_obs)[source]#
Update reward critic.
Get the target action by target actor.
Add noise to target action.
Clip the noise.
Get the target Q value by target critic.
Use the minimum target Q value to update reward critic.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
reward (torch.Tensor) – The reward sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
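A minimal sketch of this clipped double-Q target with target policy smoothing; the attribute names, noise scale, and clip range are illustrative assumptions:
>>> import torch
>>> def td3_target(target_actor, target_critics, reward, done, next_obs,
...                gamma=0.99, policy_noise=0.2, noise_clip=0.5):
...     with torch.no_grad():
...         next_action = target_actor(next_obs)
...         # Target policy smoothing: add clipped noise to the target action.
...         noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
...         next_action = (next_action + noise).clamp(-1.0, 1.0)
...         # Clipped double-Q: bootstrap from the minimum of the two target critics.
...         q1, q2 = (c(next_obs, next_action) for c in target_critics)
...         return reward + gamma * (1.0 - done) * torch.min(q1, q2)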
Soft Actor-Critic#
Documentation
- class omnisafe.algorithms.off_policy.SAC(env_id, cfgs)[source]#
The Soft Actor-Critic (SAC) algorithm.
References
Title: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Authors: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine.
URL: SAC
Initialize an instance of the algorithm.
- property _alpha: float#
The value of alpha.
- _init()[source]#
The initialization of the algorithm.
Users can customize the initialization of the algorithm by overriding this method.
- Return type:
None
Examples
>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
In SAC, we also need to initialize the log_alpha and alpha_optimizer.
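A minimal sketch of such an initialization, assuming a learnable log-temperature optimized with Adam and the common -|A| entropy target (all names and values here are illustrative, not the OmniSafe defaults):
>>> import torch
>>> act_dim = 2  # example action dimension
>>> target_entropy = -float(act_dim)  # common heuristic: target entropy of -|A|
>>> log_alpha = torch.zeros(1, requires_grad=True)
>>> alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
>>> alpha = log_alpha.exp().item()  # current temperature value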
- _init_log()[source]#
Log info about the epoch.
Things to log | Description
Train/Epoch | Current epoch.
Metrics/EpCost | Average cost of the epoch.
Metrics/EpRet | Average return of the epoch.
Metrics/EpLen | Average length of the epoch.
Metrics/TestEpCost | Average cost of the evaluation epoch.
Metrics/TestEpRet | Average return of the evaluation epoch.
Metrics/TestEpLen | Average length of the evaluation epoch.
Value/reward_critic | Average value in rollout() (from the critic network) of the epoch.
Values/cost_critic | Average cost in rollout() (from the critic network) of the epoch.
Loss/Loss_pi | Loss of the policy network.
Loss/Loss_reward_critic | Loss of the reward critic.
Loss/Loss_cost_critic | Loss of the cost critic network.
Train/LR | Learning rate of the policy network.
Misc/Seed | Seed of the experiment.
Misc/TotalEnvSteps | Total steps of the experiment.
Time/Total | Total time.
Time/Rollout | Rollout time.
Time/Update | Update time.
Time/Evaluate | Evaluation time.
FPS | Frames per second of the epoch.
- Return type:
None
- _init_model()[source]#
Initialize the model.
The num_critics in the critic configuration must be 2.
- Return type:
None
- _loss_pi(obs)[source]#
Compute the pi/actor loss.
The loss function in SAC is defined as:
\[L = -Q^V (s, \pi (s)) + \alpha \log \pi (s)\]
where \(Q^V\) is the minimum of the two reward critic networks, \(\pi\) is the policy network, and \(\alpha\) is the temperature parameter.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
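As a rough sketch of this loss, assuming an actor whose sample() method returns a reparameterized action and its log-probability, and two reward critics (all names are illustrative):
>>> import torch
>>> def sac_actor_loss(actor, critics, log_alpha, obs):
...     action, log_prob = actor.sample(obs)  # reparameterized sample keeps gradients
...     q1, q2 = (c(obs, action) for c in critics)
...     # L = alpha * log pi(a|s) - min_i Q_i(s, a), averaged over the batch.
...     return (log_alpha.exp().detach() * log_prob - torch.min(q1, q2)).mean()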
- _update_actor(obs)[source]#
Update the actor, and alpha if auto_alpha is True.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Return type:
None
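When auto_alpha is enabled, the temperature itself is adjusted from the sampled log-probabilities; a hedged sketch of the standard SAC temperature update, where target_entropy is an assumed entropy target rather than a documented OmniSafe attribute:
>>> import torch
>>> def update_alpha(log_alpha, alpha_optimizer, log_prob, target_entropy):
...     # Standard SAC temperature loss: drive the policy entropy toward target_entropy.
...     alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
...     alpha_optimizer.zero_grad()
...     alpha_loss.backward()
...     alpha_optimizer.step()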
- _update_reward_critic(obs, action, reward, done, next_obs)[source]#
Update reward critic.
Sample the target action by target actor.
Get the target Q value by target critic.
Use the minimum target Q value to update reward critic.
Add the entropy loss to reward critic.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
reward (torch.Tensor) – The reward sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
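A minimal sketch of the soft TD target described above, with illustrative names and the same sampling convention as the actor-loss sketch:
>>> import torch
>>> def sac_target(target_actor, target_critics, log_alpha, reward, done, next_obs, gamma=0.99):
...     with torch.no_grad():
...         next_action, next_log_prob = target_actor.sample(next_obs)
...         q1, q2 = (c(next_obs, next_action) for c in target_critics)
...         # Soft target: subtract the entropy term alpha * log pi from the min Q.
...         soft_q = torch.min(q1, q2) - log_alpha.exp() * next_log_prob
...         return reward + gamma * (1.0 - done) * soft_q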