Base Model-based Algorithms#

LOOP#

Documentation

class omnisafe.algorithms.model_based.base.LOOP(env_id, cfgs)[source]#

The Learning Off-Policy with Online Planning (LOOP) algorithm.

References

  • Title: Learning Off-Policy with Online Planning

  • Authors: Harshit Sikchi, Wenxuan Zhou, David Held.

  • URL: LOOP

Initialize an instance of the algorithm.
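A minimal usage sketch via OmniSafe's high-level trainer. The entry point and the environment id below are illustrative assumptions (any supported safety-gymnasium task can be substituted), not part of this class's documented interface.

>>> import omnisafe
>>> # Illustrative: train LOOP end-to-end on an example safety-gymnasium task.
>>> agent = omnisafe.Agent('LOOP', 'SafetyPointGoal1-v0')
>>> agent.learn()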

_alpha_discount()[source]#

Discount the temperature parameter alpha.

Return type:

None

_init()[source]#

The initialization of the algorithm.

Users can customize the initialization of the algorithm by overriding this method.

Return type:

None

Examples

>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
_init_log()[source]#

Initialize logger.

Things to log:

  • Value/alpha: The value of alpha.

  • Values/reward_critic: Average value in rollout() (from the critic network) of the epoch.

  • Values/cost_critic: Average cost in rollout() (from the critic network) of the epoch.

  • Loss/Loss_cost_critic: Loss of the cost critic network.

  • Loss/Loss_reward_critic: Loss of the reward critic network.

  • Loss/Loss_pi: Loss of the policy network.

Return type:

None

_init_model()[source]#

Initialize the dynamics model and the planner.

LOOP uses the following models (a hedged wiring sketch follows this entry):

  • dynamics model: to predict the next state and the cost.

  • actor_critic: to predict the action and the value.

  • planner: to generate the action.

Return type:

None
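For illustration, here is a hedged sketch of how a subclass might wire these three components together, in the same spirit as the _init() example above. The component class names are hypothetical placeholders, not OmniSafe APIs.

>>> def _init_model(self) -> None:
...     # Hypothetical component classes, shown only to illustrate the wiring.
...     self._dynamics = MyEnsembleDynamics()  # predicts next state and cost
...     self._actor_critic = MyActorQCritic()  # predicts action and value
...     self._planner = MyPlanner(self._dynamics, self._actor_critic)  # generates the action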

_loss_pi(obs)[source]#

Compute the pi/actor loss.

The loss function in SAC is defined as:

\[L = -Q^V (s, \pi (s)) + \alpha \log \pi (s)\]

where \(Q^V\) is the minimum of the two reward critic networks' values, \(\pi\) is the policy network, and \(\alpha\) is the temperature parameter. A hedged sketch of this computation follows this entry.

Parameters:

obs (torch.Tensor) – The observation sampled from buffer.

Returns:

The loss of pi/actor.

Return type:

Tensor
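For intuition, a minimal self-contained sketch of this loss with generic torch callables. The interfaces (an actor returning a reparameterized action with its log-probability, a critic returning two Q-estimates) are assumptions for illustration, not the exact OmniSafe signatures.

>>> import torch
>>> def sac_actor_loss(actor, reward_critic, obs, alpha):
...     # Sample a reparameterized action and its log-probability from the policy.
...     action, log_prob = actor(obs)
...     # Use the minimum of the two reward critics to reduce overestimation.
...     q1, q2 = reward_critic(obs, action)
...     q_min = torch.min(q1, q2)
...     # L = -Q^V(s, pi(s)) + alpha * log pi(s), averaged over the batch.
...     return (alpha * log_prob - q_min).mean()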

_save_model()[source]#

Save the model.

Return type:

None

_select_action(current_step, state)[source]#

Select action.

Parameters:
  • current_step (int) – The current step.

  • state (torch.Tensor) – The current state.

Returns:

The selected action.

Return type:

Tensor

_store_real_data(state, action, reward, cost, terminated, truncated, next_state, info)[source]#

Store real data in the buffer.

Parameters:
  • state (torch.Tensor) – The state from the environment.

  • action (torch.Tensor) – The action from the agent.

  • reward (torch.Tensor) – The reward signal from the environment.

  • cost (torch.Tensor) – The cost signal from the environment.

  • terminated (torch.Tensor) – The terminated signal from the environment.

  • truncated (torch.Tensor) – The truncated signal from the environment.

  • next_state (torch.Tensor) – The next state from the environment.

  • info (dict[str, Any]) – The information from the environment.

Return type:

None

_update_actor(obs)[source]#

Update the actor using the Soft Actor-Critic algorithm.

  • Get the loss of actor.

  • Update actor by loss.

  • Log useful information.

Parameters:

obs (torch.Tensor) – The observation sampled from buffer.

Return type:

None

_update_cost_critic(obs, action, cost, done, next_obs)[source]#

Update the cost critic using the TD3 algorithm. A hedged sketch of the TD target follows this entry.

  • Get the TD loss of cost critic.

  • Update critic network by loss.

  • Log useful information.

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • action (torch.Tensor) – The action sampled from buffer.

  • cost (torch.Tensor) – The cost sampled from buffer.

  • done (torch.Tensor) – The terminated flag sampled from the buffer.

  • next_obs (torch.Tensor) – The next observation sampled from buffer.

Return type:

None
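A rough sketch of the temporal-difference target such an update regresses the cost critic toward. TD3 details (twin critics, target-policy smoothing noise) are omitted, and the network interfaces are assumptions for illustration.

>>> import torch
>>> def cost_critic_target(target_cost_critic, target_actor, cost, done, next_obs, gamma=0.99):
...     with torch.no_grad():
...         # Bootstrap the cost value of the target policy's action at the next state.
...         next_action = target_actor(next_obs)
...         next_qc = target_cost_critic(next_obs, next_action)
...         return cost + gamma * (1.0 - done) * next_qc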

_update_policy(current_step)[source]#

Update policy.

  • Get the data from the buffer.

Note

The data sampled from the buffer includes:

  • obs: the observation stored in the buffer.

  • act: the action stored in the buffer.

  • reward: the reward stored in the buffer.

  • cost: the cost stored in the buffer.

  • next_obs: the next observation stored in the buffer.

  • done: the terminated flag stored in the buffer.

The basic process of each update is as follows (a hedged sketch follows this entry):

  1. Get the mini-batch data from the buffer.

  2. Get the loss of the network.

  3. Update the network by the loss.

  4. Repeat steps 2 and 3 until update_policy_iters updates have been performed.

Parameters:

current_step (int) – The current step.

Return type:

None
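A compact sketch of the loop described above, assuming a hypothetical buffer with a sample_batch() method and the update helpers documented in this class.

>>> def update_policy_sketch(self, current_step):
...     # Hypothetical buffer/config attribute names; the structure follows steps 1-4 above.
...     for _ in range(self._update_policy_iters):
...         data = self._buf.sample_batch()
...         self._update_reward_critic(data['obs'], data['act'], data['reward'],
...                                    data['done'], data['next_obs'])
...         self._update_cost_critic(data['obs'], data['act'], data['cost'],
...                                  data['done'], data['next_obs'])
...         self._update_actor(data['obs'])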

_update_reward_critic(obs, action, reward, done, next_obs)[source]#

Update the reward critic using the Soft Actor-Critic algorithm. A hedged sketch of the soft TD target follows this entry.

  • Get the TD loss of reward critic.

  • Update critic network by loss.

  • Log useful information.

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • action (torch.Tensor) – The action sampled from buffer.

  • reward (torch.Tensor) – The reward sampled from buffer.

  • done (torch.Tensor) – The terminated flag sampled from the buffer.

  • next_obs (torch.Tensor) – The next observation sampled from buffer.

Return type:

None
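For comparison with the cost critic above, a sketch of the soft Bellman target a SAC-style reward critic regresses toward; the interfaces are again illustrative assumptions.

>>> import torch
>>> def reward_critic_target(target_reward_critic, actor, reward, done, next_obs, alpha, gamma=0.99):
...     with torch.no_grad():
...         # Soft backup: min of the two target critics minus the entropy temperature term.
...         next_action, next_log_prob = actor(next_obs)
...         q1, q2 = target_reward_critic(next_obs, next_action)
...         next_v = torch.min(q1, q2) - alpha * next_log_prob
...         return reward + gamma * (1.0 - done) * next_v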

PETS#

Documentation

class omnisafe.algorithms.model_based.base.PETS(env_id, cfgs)[source]#

The Probabilistic Ensembles with Trajectory Sampling (PETS) algorithm.

References

  • Title: Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

  • Authors: Kurtland Chua, Roberto Calandra, Rowan McAllister, Sergey Levine.

  • URL: PETS

Initialize an instance of the algorithm.

_algo_reset()[source]#

Reset the algorithm.

Return type:

None

_evaluation_single_step(current_step, use_real_input=True)[source]#

Evaluate the dynamics model for a single step.

Parameters:
  • current_step (int) – The current step.

  • use_real_input (bool) – Whether to use real input or not.

Return type:

None

_init()[source]#

Initialize the algorithm.

Return type:

None

_init_env()[source]#

Initialize the environment.

Return type:

None

_init_log()[source]#

Initialize logger.

Things to log:

  • Train/Epoch: Current epoch.

  • TotalEnvSteps: Total steps of the experiment.

  • Metrics/EpRet: Average return of the epoch.

  • Metrics/EpCost: Average cost of the epoch.

  • Metrics/EpLen: Average length of the epoch.

  • EvalMetrics/EpRet: Average episode return in evaluation.

  • EvalMetrics/EpCost: Average episode cost in evaluation.

  • EvalMetrics/EpLen: Average episode length in evaluation.

  • Loss/DynamicsTrainMseLoss: The training loss of the dynamics model.

  • Loss/DynamicsValMseLoss: The validation loss of the dynamics model.

  • Plan/iter: The number of iterations in the planner.

  • Plan/last_var_mean: The mean of the last variance in the planner.

  • Plan/last_var_max: The max of the last variance in the planner.

  • Plan/last_var_min: The min of the last variance in the planner.

  • Plan/episode_returns_max: The max of the episode returns in the planner.

  • Plan/episode_returns_mean: The mean of the episode returns in the planner.

  • Plan/episode_returns_min: The min of the episode returns in the planner.

  • Time/Total: The total time of the algorithm.

  • Time/Rollout: The time of the rollout.

  • Time/UpdateActorCritic: The time of the actor-critic update.

  • Time/Eval: The time of the evaluation.

  • Time/Epoch: The time of the epoch.

  • Time/FPS: The FPS of the algorithm.

  • Time/UpdateDynamics: The time of the dynamics update.

Return type:

None

_init_model()[source]#

Initialize the dynamics model and the planner.

Return type:

None
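To give a feel for how a learned dynamics model and a planner interact in PETS-style control, here is a toy sketch of model-predictive action selection by random shooting. It is a conceptual illustration only; PETS typically couples a probabilistic ensemble with a cross-entropy-method planner, and every name below is a placeholder.

>>> import torch
>>> def plan_random_shooting(dynamics, reward_fn, state, horizon=10, num_samples=64, act_dim=2):
...     # Sample candidate action sequences uniformly in [-1, 1].
...     actions = torch.rand(num_samples, horizon, act_dim) * 2 - 1
...     returns = torch.zeros(num_samples)
...     states = state.expand(num_samples, -1).clone()
...     for t in range(horizon):
...         # Roll every candidate forward through the (learned) dynamics model.
...         states = dynamics(states, actions[:, t])
...         returns += reward_fn(states, actions[:, t])
...     # MPC style: execute only the first action of the best-scoring candidate.
...     return actions[returns.argmax(), 0]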

_save_model()[source]#

Save the model.

Return type:

None

_select_action(current_step, state)[source]#

Select action.

Parameters:
  • current_step (int) – The current step.

  • state (torch.Tensor) – The current state.

Returns:

The selected action.

Return type:

Tensor

_store_real_data(state, action, reward, cost, terminated, truncated, next_state, info)[source]#

Store real data in the buffer.

Parameters:
  • state (torch.Tensor) – The state from the environment.

  • action (torch.Tensor) – The action from the agent.

  • reward (torch.Tensor) – The reward signal from the environment.

  • cost (torch.Tensor) – The cost signal from the environment.

  • terminated (torch.Tensor) – The terminated signal from the environment.

  • truncated (torch.Tensor) – The truncated signal from the environment.

  • next_state (torch.Tensor) – The next state from the environment.

  • info (dict[str, Any]) – The information from the environment.

Return type:

None

_update_dynamics_model()[source]#

Update dynamics model.

Return type:

None
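A rough sketch of what one dynamics-training step could look like, fitting each ensemble member to observed transitions with an MSE objective (matching the Loss/DynamicsTrainMseLoss metric logged above). The attribute names and the shared-batch treatment are simplifying assumptions.

>>> import torch
>>> def train_ensemble_step(models, optimizers, obs, act, next_obs):
...     # Fit each ensemble member on the same batch (real implementations bootstrap per member).
...     losses = []
...     for model, opt in zip(models, optimizers):
...         pred_next = model(obs, act)
...         loss = torch.nn.functional.mse_loss(pred_next, next_obs)
...         opt.zero_grad()
...         loss.backward()
...         opt.step()
...         losses.append(loss.item())
...     return sum(losses) / len(losses)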

_update_epoch()[source]#

Update function per epoch.

Return type:

None

_update_policy(current_step)[source]#

Update policy.

Return type:

None

draw_picture(timestep, num_episode, pred_state, true_state, save_replay_path='./', name='reward')[source]#

Draw a curve of the predicted value and the ground-truth value.

Parameters:
  • timestep (int) – The current step.

  • num_episode (int) – The number of episodes.

  • pred_state (list[float]) – The predicted state.

  • true_state (list[float]) – The true state.

  • save_replay_path (str) – The path for saving replay.

  • name (str) – The name of the curve.

Return type:

None
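An illustrative call, assuming pets is an initialized instance and that the predicted and ground-truth rewards were collected during evaluation; the resulting figure is written under save_replay_path.

>>> pred = [0.10, 0.30, 0.20]   # illustrative predicted rewards
>>> true = [0.10, 0.25, 0.30]   # illustrative ground-truth rewards
>>> pets.draw_picture(timestep=1000, num_episode=1, pred_state=pred,
...                   true_state=true, save_replay_path='./', name='reward')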

learn()[source]#

This is the main function for the algorithm update.

It is divided into the following steps:

  • rollout(): collect interactive data from the environment.

  • update(): perform actor/critic updates.

  • log(): print epoch/update information for visualization and the terminal log.

Returns:
  • ep_ret – Average episode return in the final epoch.

  • ep_cost – Average episode cost in the final epoch.

  • ep_len – Average episode length in the final epoch.

Return type:

tuple[float, float, float]
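A short sketch of consuming the documented return values, assuming algo is a configured instance of this class.

>>> ep_ret, ep_cost, ep_len = algo.learn()
>>> print(f'return={ep_ret:.2f}, cost={ep_cost:.2f}, length={ep_len:.0f}')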