Base Model-based Algorithms#
LOOP#
Documentation
- class omnisafe.algorithms.model_based.base.LOOP(env_id, cfgs)[source]#
The Learning Off-Policy with Online Planning (LOOP) algorithm.
References
Title: Learning Off-Policy with Online Planning
Authors: Harshit Sikchi, Wenxuan Zhou, David Held.
URL: LOOP
Initialize an instance of the algorithm.
- _init()[source]#
The initialization of the algorithm.
Users can define the initialization of the algorithm by overriding this method.
- Return type:
None
Examples
>>> def _init(self) -> None:
...     super()._init()
...     self._buffer = CustomBuffer()
...     self._model = CustomModel()
- _init_log()[source]#
Initialize logger.
Things to log
Description
Value/alpha
The value of alpha.
Values/reward_critic
Average value in rollout() (from critic network) of the epoch.
Values/cost_critic
Average cost in rollout() (from critic network) of the epoch.
Loss/Loss_cost_critic
Loss of the cost critic network.
Loss/Loss_reward_critic
Loss of the reward critic network.
Loss/Loss_pi
Loss of the policy network.
- Return type:
None
- _init_model()[source]#
Initialize the dynamics model and the planner.
LOOP uses the following models:
dynamics model: to predict the next state and the cost.
actor_critic: to predict the action and the value.
planner: to generate the action.
- Return type:
None
- _loss_pi(obs)[source]#
Computing pi/actor loss.
The loss function in SAC is defined as:
(2)#
\[L = -Q^V (s, \pi (s)) + \alpha \log \pi (s)\]
where \(Q^V\) is the min value of two reward critic networks, \(\pi\) is the policy network, and \(\alpha\) is the temperature parameter.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Returns:
The loss of pi/actor.
- Return type:
Tensor
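For intuition, the loss above can be written out directly. The sketch below is only illustrative: actor, reward_critic, and alpha are hypothetical stand-ins for the policy network, the twin reward critics, and the temperature parameter, not OmniSafe's actual attributes.
>>> import torch
>>> def loss_pi_sketch(actor, reward_critic, obs, alpha):
...     action, log_prob = actor(obs)  # reparameterized action and its log-probability
...     q1, q2 = reward_critic(obs, action)  # twin reward critic estimates
...     q_min = torch.min(q1, q2)  # pessimistic value Q^V
...     return (alpha * log_prob - q_min).mean()  # -Q^V(s, pi(s)) + alpha * log pi(s)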
- _select_action(current_step, state)[source]#
Select action.
- Parameters:
current_step (int) – The current step.
state (torch.Tensor) – The current state.
- Returns:
The selected action.
- Return type:
Tensor
- _store_real_data(state, action, reward, cost, terminated, truncated, next_state, info)[source]#
Store real data in buffer.
- Parameters:
state (torch.Tensor) – The state from the environment.
action (torch.Tensor) – The action from the agent.
reward (torch.Tensor) – The reward signal from the environment.
cost (torch.Tensor) – The cost signal from the environment.
terminated (torch.Tensor) – The terminated signal from the environment.
truncated (torch.Tensor) – The truncated signal from the environment.
next_state (torch.Tensor) – The next state from the environment.
info (dict[str, Any]) – The information from the environment.
- Return type:
None
- _update_actor(obs)[source]#
Update the actor using the Soft Actor-Critic algorithm.
Get the loss of actor.
Update actor by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
- Return type:
None
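The three steps listed above correspond roughly to the sketch below, which reuses the loss from _loss_pi(); actor_optimizer and logger are assumed handles rather than documented attributes.
>>> import torch
>>> def update_actor_sketch(actor, actor_optimizer, reward_critic, obs, alpha, logger):
...     action, log_prob = actor(obs)
...     loss = (alpha * log_prob - torch.min(*reward_critic(obs, action))).mean()  # 1. get the loss of actor
...     actor_optimizer.zero_grad()  # 2. update actor by loss
...     loss.backward()
...     actor_optimizer.step()
...     logger.store({'Loss/Loss_pi': loss.item()})  # 3. log useful information (hypothetical logger API)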
- _update_cost_critic(obs, action, cost, done, next_obs)[source]#
Update the cost critic using the TD3 algorithm.
Get the TD loss of cost critic.
Update critic network by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
cost (torch.Tensor) – The cost sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
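A rough sketch of this TD update is given below; the clipped double-Q target mirrors plain TD3, and every name (target networks, optimizer, gamma) is a placeholder assumption rather than OmniSafe's exact implementation.
>>> import torch
>>> import torch.nn.functional as F
>>> def update_cost_critic_sketch(cost_critic, target_cost_critic, target_actor,
...                               optimizer, obs, action, cost, done, next_obs, gamma=0.99):
...     with torch.no_grad():
...         next_action = target_actor(next_obs)
...         q1_next, q2_next = target_cost_critic(next_obs, next_action)
...         target = cost + gamma * (1.0 - done) * torch.min(q1_next, q2_next)  # TD target for cost
...     q1, q2 = cost_critic(obs, action)
...     loss = F.mse_loss(q1, target) + F.mse_loss(q2, target)  # TD loss of both cost critics
...     optimizer.zero_grad()
...     loss.backward()
...     optimizer.step()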
- _update_policy(current_step)[source]#
Update policy.
1. Get the data from buffer.

Note
obs: observation stored in buffer.
act: action stored in buffer.
reward: reward stored in buffer.
cost: cost stored in buffer.
next_obs: next observation stored in buffer.
done: terminated flag stored in buffer.

2. Update value net by _update_reward_critic().
3. Update cost net by _update_cost_critic().
4. Update policy net by _update_actor().
The basic process of each update is as follows (a sketch follows this entry):
1. Get the mini-batch data from buffer.
2. Get the loss of the network.
3. Update the network by the loss.
4. Repeat steps 2 and 3 for update_policy_iters iterations.
- Parameters:
current_step (int) – The current step.
- Return type:
None
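Putting the pieces together, the update loop could look like the sketch below, assuming a hypothetical buffer.sample_batch() that returns a dict of tensors keyed as in the note above.
>>> def update_policy_sketch(algo, buffer, update_policy_iters):
...     for _ in range(update_policy_iters):
...         batch = buffer.sample_batch()  # 1. get the mini-batch data from buffer
...         algo._update_reward_critic(batch['obs'], batch['act'], batch['reward'],
...                                    batch['done'], batch['next_obs'])
...         algo._update_cost_critic(batch['obs'], batch['act'], batch['cost'],
...                                  batch['done'], batch['next_obs'])
...         algo._update_actor(batch['obs'])  # 2-3. compute each loss and update the networks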
- _update_reward_critic(obs, action, reward, done, next_obs)[source]#
Update the reward critic using the Soft Actor-Critic algorithm.
Get the TD loss of reward critic.
Update critic network by loss.
Log useful information.
- Parameters:
obs (torch.Tensor) – The observation sampled from buffer.
action (torch.Tensor) – The action sampled from buffer.
reward (torch.Tensor) – The reward sampled from buffer.
done (torch.Tensor) – The terminated flag sampled from buffer.
next_obs (torch.Tensor) – The next observation sampled from buffer.
- Return type:
None
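For reference, an entropy-regularized SAC target has the shape sketched below; argument names are placeholders and the exact target used by LOOP may differ.
>>> import torch
>>> import torch.nn.functional as F
>>> def update_reward_critic_sketch(reward_critic, target_reward_critic, actor, optimizer,
...                                 obs, action, reward, done, next_obs, alpha, gamma=0.99):
...     with torch.no_grad():
...         next_action, next_log_prob = actor(next_obs)
...         q_next = torch.min(*target_reward_critic(next_obs, next_action)) - alpha * next_log_prob
...         target = reward + gamma * (1.0 - done) * q_next  # soft TD target
...     q1, q2 = reward_critic(obs, action)
...     loss = F.mse_loss(q1, target) + F.mse_loss(q2, target)
...     optimizer.zero_grad()
...     loss.backward()
...     optimizer.step()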
PETS#
Documentation
- class omnisafe.algorithms.model_based.base.PETS(env_id, cfgs)[source]#
The Probabilistic Ensembles with Trajectory Sampling (PETS) algorithm.
References
Title: Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
Authors: Kurtland Chua, Roberto Calandra, Rowan McAllister, Sergey Levine.
URL: PETS
Initialize an instance of the algorithm.
- _evaluation_single_step(current_step, use_real_input=True)[source]#
Evaluate the dynamics model for a single step.
- Parameters:
current_step (int) – The current step.
use_real_input (bool) – Whether to use real input or not.
- Return type:
None
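One plausible reading of use_real_input is sketched below: roll the dynamics model along a recorded trajectory, feeding it either the real states or its own predictions. The dynamics_model(state, action) interface is an assumption for illustration only.
>>> import torch
>>> def evaluation_single_step_sketch(dynamics_model, states, actions, use_real_input=True):
...     preds, current = [], states[0]
...     for state, action in zip(states, actions):
...         inp = state if use_real_input else current  # real input vs. the model's own prediction
...         current = dynamics_model(inp, action)
...         preds.append(current)
...     return torch.stack(preds)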
- _init_log()[source]#
Initialize logger.
Things to log
Description
Train/Epoch
Current epoch.
TotalEnvSteps
Total steps of the experiment.
Metrics/EpRet
Average return of the epoch.
Metrics/EpCost
Average cost of the epoch.
Metrics/EpLen
Average length of the epoch.
EvalMetrics/EpRet
Average episode return in evaluation.
EvalMetrics/EpCost
Average episode cost in evaluation.
EvalMetrics/EpLen
Average episode length in evaluation.
Loss/DynamicsTrainMseLoss
The training loss of the dynamics model.
Loss/DynamicsValMseLoss
The validation loss of the dynamics model.
Plan/iter
The number of iterations in the planner.
Plan/last_var_mean
The mean of the last variance in the planner.
Plan/last_var_max
The max of the last variance in the planner.
Plan/last_var_min
The min of the last variance in the planner.
Plan/episode_returns_max
The max of the episode returns in the planner.
Plan/episode_returns_mean
The mean of the episode returns in the planner.
Plan/episode_returns_min
The min of the episode returns in the planner.
Time/Total
The total time of the algorithm.
Time/Rollout
The time of the rollout.
Time/UpdateActorCritic
The time of the actor-critic update.
Time/Eval
The time of the evaluation.
Time/Epoch
The time of the epoch.
Time/FPS
The FPS of the algorithm.
Time/UpdateDynamics
The time of the dynamics update.
- Return type:
None
- _select_action(current_step, state)[source]#
Select action.
- Parameters:
current_step (int) – The current step.
state (torch.Tensor) – The current state.
- Returns:
The selected action.
- Return type:
Tensor
- _store_real_data(state, action, reward, cost, terminated, truncated, next_state, info)[source]#
Store real data in buffer.
- Parameters:
state (torch.Tensor) – The state from the environment.
action (torch.Tensor) – The action from the agent.
reward (torch.Tensor) – The reward signal from the environment.
cost (torch.Tensor) – The cost signal from the environment.
terminated (torch.Tensor) – The terminated signal from the environment.
truncated (torch.Tensor) – The truncated signal from the environment.
next_state (torch.Tensor) – The next state from the environment.
info (dict[str, Any]) – The information from the environment.
- Return type:
None
- draw_picture(timestep, num_episode, pred_state, true_state, save_replay_path='./', name='reward')[source]#
Draw a curve of the predicted value and the ground truth value.
- Parameters:
timestep (int) – The current step.
num_episode (int) – The number of episodes.
pred_state (list[float]) – The predicted state.
true_state (list[float]) – The true state.
save_replay_path (str) – The path for saving replay.
name (str) – The name of the curve.
- Return type:
None
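As a loose illustration, such a curve could be produced with matplotlib; labels and the file-naming scheme below are assumptions, not the actual implementation.
>>> import matplotlib.pyplot as plt
>>> def draw_picture_sketch(timestep, num_episode, pred_state, true_state,
...                         save_replay_path='./', name='reward'):
...     plt.figure()
...     plt.plot(true_state, label='ground truth')
...     plt.plot(pred_state, label='prediction')
...     plt.xlabel('step')
...     plt.ylabel(name)
...     plt.title(f'{name}: episode {num_episode}, timestep {timestep}')
...     plt.legend()
...     plt.savefig(f'{save_replay_path}/{name}_{timestep}.png')
...     plt.close()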
- learn()[source]#
This is the main function for algorithm update.
It is divided into the following steps:
1. rollout(): collect interactive data from the environment.
2. update(): perform actor/critic updates.
3. log(): record epoch/update information for visualization and terminal log print.
- Returns:
ep_ret – Average episode return in final epoch.
ep_cost – Average episode cost in final epoch.
ep_len – Average episode length in final epoch.
- Return type:
tuple[float, float, float]
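At a high level the flow reads like the sketch below; the step bookkeeping and the final-epoch attributes are illustrative assumptions, and the real method signatures may differ.
>>> def learn_sketch(algo, total_steps, steps_per_epoch):
...     current_step = 0
...     while current_step < total_steps:
...         algo.rollout(current_step)  # collect interactive data from environment
...         algo.update(current_step)   # dynamics and actor/critic updates
...         algo.log(current_step)      # epoch/update information for visualization and terminal log
...         current_step += steps_per_epoch
...     return algo.ep_ret, algo.ep_cost, algo.ep_len  # hypothetical attributes for the final-epoch averages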