Base Model-based Algorithms#

LOOP#

Documentation

class omnisafe.algorithms.model_based.base.LOOP(env_id, cfgs)[source]#

The Learning Off-Policy with Online Planning (LOOP) algorithm.

References

  • Title: Learning Off-Policy with Online Planning

  • Authors: Harshit Sikchi, Wenxuan Zhou, David Held.

  • URL: LOOP

Initialize an instance of the algorithm.
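A minimal usage sketch via OmniSafe's high-level trainer. The entry point and the environment id below are illustrative assumptions (any supported safety-gymnasium task can be substituted), not part of this class's documented interface.

>>> import omnisafe
>>> # Illustrative: train LOOP end-to-end on an example safety-gymnasium task.
>>> agent = omnisafe.Agent('LOOP', 'SafetyPointGoal1-v0')
>>> agent.learn()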

_alpha_discount()[source]#

Discount the temperature parameter alpha.

Return type:

None

_init()[source]#

The initialization of the algorithm.

Users can customize the initialization of the algorithm by overriding this method.

Return type:

None

Examples

>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
_init_log()[source]#

Initialize logger.

Things to log:

  • Value/alpha: The value of alpha.

  • Values/reward_critic: Average value in rollout() (from the critic network) of the epoch.

  • Values/cost_critic: Average cost in rollout() (from the critic network) of the epoch.

  • Loss/Loss_cost_critic: Loss of the cost critic network.

  • Loss/Loss_reward_critic: Loss of the reward critic network.

  • Loss/Loss_pi: Loss of the policy network.

Return type:

None

_init_model()[source]#

Initialize the dynamics model and the planner.

LOOP uses the following models (a hedged wiring sketch follows this entry):

  • dynamics model: to predict the next state and the cost.

  • actor_critic: to predict the action and the value.

  • planner: to generate the action.

Return type:

None
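For illustration, here is a hedged sketch of how a subclass might wire these three components together, in the same spirit as the _init() example above. The component class names are hypothetical placeholders, not OmniSafe APIs.

>>> def _init_model(self) -> None:
...     # Hypothetical component classes, shown only to illustrate the wiring.
...     self._dynamics = MyEnsembleDynamics()  # predicts next state and cost
...     self._actor_critic = MyActorQCritic()  # predicts action and value
...     self._planner = MyPlanner(self._dynamics, self._actor_critic)  # generates the action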

_loss_pi(obs)[source]#

Compute the pi/actor loss.

The loss function in SAC is defined as:

\[L = -Q^V (s, \pi (s)) + \alpha \log \pi (s)\]

where \(Q^V\) is the minimum of the two reward critic networks' values, \(\pi\) is the policy network, and \(\alpha\) is the temperature parameter. A hedged sketch of this computation follows this entry.

Parameters:

obs (torch.Tensor) – The observation sampled from buffer.

Returns:

The loss of pi/actor.

Return type:

Tensor
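For intuition, a minimal self-contained sketch of this loss with generic torch callables. The interfaces (an actor returning a reparameterized action with its log-probability, a critic returning two Q-estimates) are assumptions for illustration, not the exact OmniSafe signatures.

>>> import torch
>>> def sac_actor_loss(actor, reward_critic, obs, alpha):
...     # Sample a reparameterized action and its log-probability from the policy.
...     action, log_prob = actor(obs)
...     # Use the minimum of the two reward critics to reduce overestimation.
...     q1, q2 = reward_critic(obs, action)
...     q_min = torch.min(q1, q2)
...     # L = -Q^V(s, pi(s)) + alpha * log pi(s), averaged over the batch.
...     return (alpha * log_prob - q_min).mean()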

_save_model()[source]#

Save the model.

Return type:

None

_select_action(current_step, state)[source]#

Select action.

Parameters:
  • current_step (int) – The current step.

  • state (torch.Tensor) – The current state.

Returns:

The selected action.

Return type:

Tensor

_store_real_data(state, action, reward, cost, terminated, truncated, next_state, info)[source]#

Store real data in the buffer.

Parameters:
  • state (torch.Tensor) – The state from the environment.

  • action (torch.Tensor) – The action from the agent.

  • reward (torch.Tensor) – The reward signal from the environment.

  • cost (torch.Tensor) – The cost signal from the environment.

  • terminated (torch.Tensor) – The terminated signal from the environment.

  • truncated (torch.Tensor) – The truncated signal from the environment.

  • next_state (torch.Tensor) – The next state from the environment.

  • info (dict[str, Any]) – The information from the environment.

Return type:

None

_update_actor(obs)[source]#

Update the actor using the Soft Actor-Critic algorithm.

  • Get the loss of actor.

  • Update actor by loss.

  • Log useful information.

Parameters:

obs (torch.Tensor) – The observation sampled from buffer.

Return type:

None

_update_cost_critic(obs, action, cost, done, next_obs)[source]#

Update the cost critic using the TD3 algorithm. A hedged sketch of the TD target follows this entry.

  • Get the TD loss of cost critic.

  • Update critic network by loss.

  • Log useful information.

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • action (torch.Tensor) – The action sampled from buffer.

  • cost (torch.Tensor) – The cost sampled from buffer.

  • done (torch.Tensor) – The terminated flag sampled from the buffer.

  • next_obs (torch.Tensor) – The next observation sampled from buffer.

Return type:

None
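A rough sketch of the temporal-difference target such an update regresses the cost critic toward. TD3 details (twin critics, target-policy smoothing noise) are omitted, and the network interfaces are assumptions for illustration.

>>> import torch
>>> def cost_critic_target(target_cost_critic, target_actor, cost, done, next_obs, gamma=0.99):
...     with torch.no_grad():
...         # Bootstrap the cost value of the target policy's action at the next state.
...         next_action = target_actor(next_obs)
...         next_qc = target_cost_critic(next_obs, next_action)
...         return cost + gamma * (1.0 - done) * next_qc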

_update_policy(current_step)[source]#

Update policy.

  • Get the data from the buffer.

Note

The data sampled from the buffer includes:

  • obs: the observation stored in the buffer.

  • act: the action stored in the buffer.

  • reward: the reward stored in the buffer.

  • cost: the cost stored in the buffer.

  • next_obs: the next observation stored in the buffer.

  • done: the terminated flag stored in the buffer.

The basic process of each update is as follows (a hedged sketch follows this entry):

  1. Get the mini-batch data from the buffer.

  2. Get the loss of the network.

  3. Update the network by the loss.

  4. Repeat steps 2 and 3 until update_policy_iters updates have been performed.

Parameters:

current_step (int) – The current step.

Return type:

None
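A compact sketch of the loop described above, assuming a hypothetical buffer with a sample_batch() method and the update helpers documented in this class.

>>> def update_policy_sketch(self, current_step):
...     # Hypothetical buffer/config attribute names; the structure follows steps 1-4 above.
...     for _ in range(self._update_policy_iters):
...         data = self._buf.sample_batch()
...         self._update_reward_critic(data['obs'], data['act'], data['reward'],
...                                    data['done'], data['next_obs'])
...         self._update_cost_critic(data['obs'], data['act'], data['cost'],
...                                  data['done'], data['next_obs'])
...         self._update_actor(data['obs'])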

_update_reward_critic(obs, action, reward, done, next_obs)[source]#

Update the reward critic using the Soft Actor-Critic algorithm. A hedged sketch of the soft TD target follows this entry.

  • Get the TD loss of reward critic.

  • Update critic network by loss.

  • Log useful information.

Parameters:
  • obs (torch.Tensor) – The observation sampled from buffer.

  • action (torch.Tensor) – The action sampled from buffer.

  • reward (torch.Tensor) – The reward sampled from buffer.

  • done (torch.Tensor) – The terminated flag sampled from the buffer.

  • next_obs (torch.Tensor) – The next observation sampled from buffer.

Return type:

None
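For comparison with the cost critic above, a sketch of the soft Bellman target a SAC-style reward critic regresses toward; the interfaces are again illustrative assumptions.

>>> import torch
>>> def reward_critic_target(target_reward_critic, actor, reward, done, next_obs, alpha, gamma=0.99):
...     with torch.no_grad():
...         # Soft backup: min of the two target critics minus the entropy temperature term.
...         next_action, next_log_prob = actor(next_obs)
...         q1, q2 = target_reward_critic(next_obs, next_action)
...         next_v = torch.min(q1, q2) - alpha * next_log_prob
...         return reward + gamma * (1.0 - done) * next_v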

PETS#

Documentation

class omnisafe.algorithms.model_based.base.PETS(env_id, cfgs)[source]#

The Probabilistic Ensembles with Trajectory Sampling (PETS) algorithm.

References

  • Title: Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

  • Authors: Kurtland Chua, Roberto Calandra, Rowan McAllister, Sergey Levine.

  • URL: PETS

Initialize an instance of the algorithm.

_algo_reset()[source]#

Reset the algorithm.

Return type:

None

_evaluation_single_step(current_step, use_real_input=True)[source]#

Evaluate the dynamics model for a single step.

Parameters:
  • current_step (int) – The current step.

  • use_real_input (bool) – Whether to use real input or not.

Return type:

None

_init()[source]#

Initialize the algorithm.

Return type:

None

_init_env()[source]#

Initialize the environment.

Return type:

None

_init_log()[source]#

Initialize logger.

Things to log:

  • Train/Epoch: Current epoch.

  • TotalEnvSteps: Total steps of the experiment.

  • Metrics/EpRet: Average return of the epoch.

  • Metrics/EpCost: Average cost of the epoch.

  • Metrics/EpLen: Average length of the epoch.

  • EvalMetrics/EpRet: Average episode return in evaluation.

  • EvalMetrics/EpCost: Average episode cost in evaluation.

  • EvalMetrics/EpLen: Average episode length in evaluation.

  • Loss/DynamicsTrainMseLoss: The training loss of the dynamics model.

  • Loss/DynamicsValMseLoss: The validation loss of the dynamics model.

  • Plan/iter: The number of iterations in the planner.

  • Plan/last_var_mean: The mean of the last variance in the planner.

  • Plan/last_var_max: The max of the last variance in the planner.

  • Plan/last_var_min: The min of the last variance in the planner.

  • Plan/episode_returns_max: The max of the episode returns in the planner.

  • Plan/episode_returns_mean: The mean of the episode returns in the planner.

  • Plan/episode_returns_min: The min of the episode returns in the planner.

  • Time/Total: The total time of the algorithm.

  • Time/Rollout: The time of the rollout.

  • Time/UpdateActorCritic: The time of the actor-critic update.

  • Time/Eval: The time of the evaluation.

  • Time/Epoch: The time of the epoch.

  • Time/FPS: The FPS of the algorithm.

  • Time/UpdateDynamics: The time of the dynamics update.

Return type:

None

_init_model()[source]#

Initialize the dynamics model and the planner.

Return type:

None
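To give a feel for how a learned dynamics model and a planner interact in PETS-style control, here is a toy sketch of model-predictive action selection by random shooting. It is a conceptual illustration only; PETS typically couples a probabilistic ensemble with a cross-entropy-method planner, and every name below is a placeholder.

>>> import torch
>>> def plan_random_shooting(dynamics, reward_fn, state, horizon=10, num_samples=64, act_dim=2):
...     # Sample candidate action sequences uniformly in [-1, 1].
...     actions = torch.rand(num_samples, horizon, act_dim) * 2 - 1
...     returns = torch.zeros(num_samples)
...     states = state.expand(num_samples, -1).clone()
...     for t in range(horizon):
...         # Roll every candidate forward through the (learned) dynamics model.
...         states = dynamics(states, actions[:, t])
...         returns += reward_fn(states, actions[:, t])
...     # MPC style: execute only the first action of the best-scoring candidate.
...     return actions[returns.argmax(), 0]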

_save_model()[source]#

Save the model.

Return type:

None

_select_action(current_step, state)[source]#

Select action.

Parameters:
  • current_step (int) – The current step.

  • state (torch.Tensor) – The current state.

Returns:

The selected action.

Return type:

Tensor

_store_real_data(state, action, reward, cost, terminated, truncated, next_state, info)[source]#

Store real data in the buffer.

Parameters:
  • state (torch.Tensor) – The state from the environment.

  • action (torch.Tensor) – The action from the agent.

  • reward (torch.Tensor) – The reward signal from the environment.

  • cost (torch.Tensor) – The cost signal from the environment.

  • terminated (torch.Tensor) – The terminated signal from the environment.

  • truncated (torch.Tensor) – The truncated signal from the environment.

  • next_state (torch.Tensor) – The next state from the environment.

  • info (dict[str, Any]) – The information from the environment.

Return type:

None

_update_dynamics_model()[source]#

Update dynamics model.

Return type:

None
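A rough sketch of what one dynamics-training step could look like, fitting each ensemble member to observed transitions with an MSE objective (matching the Loss/DynamicsTrainMseLoss metric logged above). The attribute names and the shared-batch treatment are simplifying assumptions.

>>> import torch
>>> def train_ensemble_step(models, optimizers, obs, act, next_obs):
...     # Fit each ensemble member on the same batch (real implementations bootstrap per member).
...     losses = []
...     for model, opt in zip(models, optimizers):
...         pred_next = model(obs, act)
...         loss = torch.nn.functional.mse_loss(pred_next, next_obs)
...         opt.zero_grad()
...         loss.backward()
...         opt.step()
...         losses.append(loss.item())
...     return sum(losses) / len(losses)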

_update_epoch()[source]#

Update function per epoch.

Return type:

None

_update_policy(current_step)[source]#

Update policy.

Return type:

None

draw_picture(timestep, num_episode, pred_state, true_state, save_replay_path='./', name='reward')[source]#

Draw a curve of the predicted value and the ground-truth value.

Parameters:
  • timestep (int) – The current step.

  • num_episode (int) – The number of episodes.

  • pred_state (list[float]) – The predicted state.

  • true_state (list[float]) – The true state.

  • save_replay_path (str) – The path for saving replay.

  • name (str) – The name of the curve.

Return type:

None
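An illustrative call, assuming pets is an initialized instance and that the predicted and ground-truth rewards were collected during evaluation; the resulting figure is written under save_replay_path.

>>> pred = [0.10, 0.30, 0.20]   # illustrative predicted rewards
>>> true = [0.10, 0.25, 0.30]   # illustrative ground-truth rewards
>>> pets.draw_picture(timestep=1000, num_episode=1, pred_state=pred,
...                   true_state=true, save_replay_path='./', name='reward')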

learn()[source]#

This is the main function for the algorithm update.

It is divided into the following steps:

  • rollout(): collect interactive data from the environment.

  • update(): perform actor/critic updates.

  • log(): print epoch/update information for visualization and the terminal log.

Returns:
  • ep_ret – Average episode return in the final epoch.

  • ep_cost – Average episode cost in the final epoch.

  • ep_len – Average episode length in the final epoch.

Return type:

tuple[float, float, float]
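A short sketch of consuming the documented return values, assuming algo is a configured instance of this class.

>>> ep_ret, ep_cost, ep_len = algo.learn()
>>> print(f'return={ep_ret:.2f}, cost={ep_cost:.2f}, length={ep_len:.0f}')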