OmniSafe Buffer#

BaseBuffer(obs_space, act_space, size[, device])

Abstract base class for buffers.

OnPolicyBuffer(obs_space, act_space, size, ...)

A buffer for storing trajectories experienced by an agent interacting with the environment.

OffPolicyBuffer(obs_space, act_space, size, ...)

A replay buffer for off-policy algorithms.

VectorOffPolicyBuffer(obs_space, act_space, ...)

Vectorized off-policy buffer.

VectorOnPolicyBuffer(obs_space, act_space, ...)

Vectorized on-policy buffer.

Base Buffer#

Documentation

class omnisafe.common.buffer.BaseBuffer(obs_space, act_space, size, device=DEVICE_CPU)[source]#

Abstract base class for buffers.

Warning

The buffer only supports Box spaces.

In the base buffer, we store the following data:

Name   | Shape                    | Dtype         | Description
-------|--------------------------|---------------|--------------------------------------
obs    | (size, *obs_space.shape) | torch.float32 | The observation from the environment.
act    | (size, *act_space.shape) | torch.float32 | The action from the agent.
reward | (size,)                  | torch.float32 | The single-step reward.
cost   | (size,)                  | torch.float32 | The single-step cost.
done   | (size,)                  | torch.float32 | Whether the episode is done.

Parameters:
  • obs_space (OmnisafeSpace) – The observation space.

  • act_space (OmnisafeSpace) – The action space.

  • size (int) – The size of the buffer.

  • device (torch.device) – The device of the buffer. Defaults to torch.device('cpu').

Variables:

data (dict[str, torch.Tensor]) – The data of the buffer.

Raises:
  • NotImplementedError – If the observation space is not Box.

  • NotImplementedError – If the action space is not Box.

Initialize an instance of BaseBuffer.
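
The table above corresponds to a dictionary of pre-allocated tensors. The following is a minimal illustrative sketch of that layout, not the actual OmniSafe implementation; it assumes gymnasium Box spaces and zero-initialized storage.

>>> import torch
>>> from gymnasium.spaces import Box
>>> obs_space = Box(low=-1.0, high=1.0, shape=(4,))
>>> act_space = Box(low=-1.0, high=1.0, shape=(2,))
>>> size = 1000
>>> # Illustrative only: the per-field layout described in the table above.
>>> data = {
...     'obs': torch.zeros((size, *obs_space.shape), dtype=torch.float32),
...     'act': torch.zeros((size, *act_space.shape), dtype=torch.float32),
...     'reward': torch.zeros(size, dtype=torch.float32),
...     'cost': torch.zeros(size, dtype=torch.float32),
...     'done': torch.zeros(size, dtype=torch.float32),
... }
>>> data['obs'].shape
torch.Size([1000, 4])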

add_field(name, shape, dtype)[source]#

Add a field to the buffer.

Examples

>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape
(buffer.size, 2, 3)
Parameters:
  • name (str) – The name of the field.

  • shape (tuple of int) – The shape of the field.

  • dtype (torch.dtype) – The dtype of the field.

Return type:

None

property device: device#

The device of the buffer.

property size: int#

The size of the buffer.

abstract store(**data)[source]#

Store a transition in the buffer.

Warning

This is an abstract method.

Examples

>>> buffer = BaseBuffer(...)
>>> buffer.store(obs=obs, act=act, reward=reward, cost=cost, done=done)
Parameters:

data (torch.Tensor) – The data to store.

Return type:

None
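
Since store() is abstract, each subclass decides how a transition is written into data. The snippet below is a hypothetical subclass used only to illustrate one possible override; it is not part of the library.

>>> import torch
>>> from omnisafe.common.buffer import BaseBuffer
>>> class SimpleBuffer(BaseBuffer):
...     """Hypothetical subclass illustrating a store() override."""
...     def __init__(self, obs_space, act_space, size, device=torch.device('cpu')):
...         super().__init__(obs_space, act_space, size, device)
...         self._ptr = 0  # assumed bookkeeping: next write position
...     def store(self, **data):
...         # Write each provided tensor into its pre-allocated slot.
...         for key, value in data.items():
...             self.data[key][self._ptr] = value
...         self._ptr = (self._ptr + 1) % self.size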

On Policy Buffer#

Documentation

class omnisafe.common.buffer.OnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient=0, standardized_adv_r=False, standardized_adv_c=False, device=DEVICE_CPU)[source]#

A buffer for storing trajectories experienced by an agent interacting with the environment.

The buffer also provides functionality for calculating the advantages of state-action pairs, supporting the GAE, GAE-RTG, V-trace, and plain estimators.

Warning

The buffer only supports Box spaces.

Compared to the base buffer, the on-policy buffer stores extra data:

Name           | Shape   | Dtype         | Description
---------------|---------|---------------|-------------------------------------------
discounted_ret | (size,) | torch.float32 | The discounted sum of returns.
value_r        | (size,) | torch.float32 | The value estimated by the reward critic.
value_c        | (size,) | torch.float32 | The value estimated by the cost critic.
adv_r          | (size,) | torch.float32 | The advantage of the reward.
adv_c          | (size,) | torch.float32 | The advantage of the cost.
target_value_r | (size,) | torch.float32 | The target value of the reward critic.
target_value_c | (size,) | torch.float32 | The target value of the cost critic.
logp           | (size,) | torch.float32 | The log probability of the action.

Parameters:
  • obs_space (OmnisafeSpace) – The observation space.

  • act_space (OmnisafeSpace) – The action space.

  • size (int) – The size of the buffer.

  • gamma (float) – The discount factor.

  • lam (float) – The lambda factor for calculating the advantages.

  • lam_c (float) – The lambda factor for calculating the advantages of the cost.

  • advantage_estimator (AdvatageEstimator) – The advantage estimator.

  • penalty_coefficient (float, optional) – The penalty coefficient. Defaults to 0.

  • standardized_adv_r (bool, optional) – Whether to standardize the advantages of the reward. Defaults to False.

  • standardized_adv_c (bool, optional) – Whether to standardize the advantages of the cost. Defaults to False.

  • device (torch.device, optional) – The device to store the data. Defaults to torch.device('cpu').

Variables:
  • ptr (int) – The pointer of the buffer.

  • path_start (int) – The start index of the current path.

  • max_size (int) – The maximum size of the buffer.

  • data (dict) – The data stored in the buffer.

  • obs_space (OmnisafeSpace) – The observation space.

  • act_space (OmnisafeSpace) – The action space.

  • device (torch.device) – The device to store the data.

Initialize an instance of OnPolicyBuffer.
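
A minimal end-to-end usage sketch (illustrative, not taken from the OmniSafe documentation). It assumes that 'gae' is an accepted advantage_estimator value and that the per-step fields expected by store() are obs, act, reward, cost, done, value_r, value_c, and logp.

>>> import torch
>>> from gymnasium.spaces import Box
>>> from omnisafe.common.buffer import OnPolicyBuffer
>>> buffer = OnPolicyBuffer(
...     obs_space=Box(low=-1.0, high=1.0, shape=(4,)),
...     act_space=Box(low=-1.0, high=1.0, shape=(2,)),
...     size=2048,
...     gamma=0.99,
...     lam=0.95,
...     lam_c=0.95,
...     advantage_estimator='gae',  # assumed valid AdvatageEstimator value
... )
>>> # Store one transition per environment step (field names assumed as above).
>>> buffer.store(
...     obs=torch.zeros(4), act=torch.zeros(2),
...     reward=torch.tensor(1.0), cost=torch.tensor(0.0), done=torch.tensor(0.0),
...     value_r=torch.tensor(0.5), value_c=torch.tensor(0.1), logp=torch.tensor(-0.7),
... )
>>> # Bootstrap with the last value estimates when the path ends.
>>> buffer.finish_path(last_value_r=torch.zeros(1), last_value_c=torch.zeros(1))
>>> data = buffer.get()  # dict containing obs, act, adv_r, adv_c, target_value_r, ...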

_calculate_adv_and_value_targets(values, rewards, lam)[source]#

Compute the estimated advantage.

Three methods are supported:

  • GAE (Generalized Advantage Estimation)

    GAE is a variance reduction method for the actor-critic algorithm. It is proposed in the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation.

    GAE calculates the advantage using the following formula:

    (4)#\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k}\]

    where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\). When \(\lambda = 1\), GAE reduces to the Monte Carlo estimate, which is unbiased but has high variance. When \(\lambda = 0\), GAE reduces to the one-step TD estimate, which is biased but has low variance. A worked sketch of this recursion is given after this method's entry.

  • V-trace

    V-trace is a variance reduction method for the actor-critic algorithm. It is proposed in the paper IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.

    V-trace calculates the advantage using the following formula:

    (5)#\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n})\]

    where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\), \(\rho_{t+k} = \frac{\pi(a_{t+k}|s_{t+k})}{b_{t+k}}\), \(b_{t+k}\) is the behavior policy's probability of the action, and \(d_{t+k}\) is the done flag.

  • Plain

    The plain method is the advantage estimate of the original actor-critic algorithm. It is unbiased but has high variance.

Parameters:
  • values (torch.Tensor) – The value of states.

  • rewards (torch.Tensor) – The reward of states.

  • lam (float) – The lambda parameter in GAE formula.

Returns:
  • adv (torch.Tensor) – The estimated advantage.

  • target_value (torch.Tensor) – The target value for the value function.

Raises:

NotImplementedError – If the advantage estimator is not supported.

Return type:

tuple[Tensor, Tensor]
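
The GAE formula above can be computed with a simple backward recursion. The snippet below is an illustrative worked sketch of GAE on a three-step trajectory; it is not the buffer's internal code.

>>> import torch
>>> gamma, lam = 0.99, 0.95
>>> rewards = torch.tensor([1.0, 0.0, 2.0])
>>> values = torch.tensor([0.5, 0.4, 0.9, 0.0])  # includes the bootstrap value V(s_T)
>>> # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
>>> deltas = rewards + gamma * values[1:] - values[:-1]
>>> adv = torch.zeros_like(rewards)
>>> running = torch.tensor(0.0)
>>> for t in reversed(range(len(rewards))):
...     running = deltas[t] + gamma * lam * running  # A_t = delta_t + (gamma*lam) * A_{t+1}
...     adv[t] = running
>>> target_value = adv + values[:-1]  # a common choice of value target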

static _calculate_v_trace(policy_action_probs, values, rewards, behavior_action_probs, gamma=0.99, rho_bar=1.0, c_bar=1.0)[source]#

This function is used to calculate V-trace targets.

(6)#\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n})\]

Calculate V-trace targets for off-policy actor-critic learning recursively. For more details, please refer to the paper: Espeholt et al. 2018, IMPALA.

Parameters:
  • policy_action_probs (torch.Tensor) – Action probabilities of the policy.

  • values (torch.Tensor) – The value of states.

  • rewards (torch.Tensor) – The reward of states.

  • behavior_action_probs (torch.Tensor) – Action probabilities of the behavior policy.

  • gamma (float, optional) – The discount factor. Defaults to 0.99.

  • rho_bar (float, optional) – The maximum value of importance weights. Defaults to 1.0.

  • c_bar (float, optional) – The maximum value of clipped importance weights. Defaults to 1.0.

Returns:

V-trace targets, with shape (batch_size, sequence_length).

Raises:
  • AssertionError – If the input tensors are scalars.

  • AssertionError – If c_bar is greater than rho_bar.

Return type:

tuple[Tensor, Tensor, Tensor]
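
The rho_bar and c_bar parameters bound the importance weights used in the V-trace recursion above. A minimal illustrative sketch of that clipping (not the library's implementation):

>>> import torch
>>> policy_probs = torch.tensor([0.30, 0.55, 0.20])
>>> behavior_probs = torch.tensor([0.25, 0.60, 0.40])
>>> rho_bar, c_bar = 1.0, 1.0  # c_bar must not exceed rho_bar
>>> ratios = policy_probs / behavior_probs
>>> rhos = torch.clamp(ratios, max=rho_bar)  # clipped weights for the value targets
>>> cs = torch.clamp(ratios, max=c_bar)      # clipped weights for the trace terms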

finish_path(last_value_r=None, last_value_c=None)[source]#

Finish the current path and calculate the advantages of state-action pairs.

On-policy algorithms need to calculate the advantages of state-action pairs after the path is finished. This function calculates the advantages of state-action pairs and stores them in the buffer, following the steps:

Hint

  1. Calculate the discounted return.

  2. Calculate the advantages of the reward.

  3. Calculate the advantages of the cost.

Parameters:
  • last_value_r (torch.Tensor, optional) – The reward value estimate of the last state of the current path. Defaults to torch.zeros(1).

  • last_value_c (torch.Tensor, optional) – The cost value estimate of the last state of the current path. Defaults to torch.zeros(1).

Return type:

None

get()[source]#

Get the data in the buffer.

Hint

We provide a trick to standardize the advantages of state-action pairs: we calculate the mean and standard deviation of the advantages and use them to standardize the advantages before they are returned. You can turn on this trick by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.

Returns:

The data stored and calculated in the buffer.

Return type:

dict[str, Tensor]
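
The standardization mentioned in the hint above amounts to whitening the advantage tensor before it is returned; a minimal sketch, with an assumed small epsilon for numerical stability:

>>> import torch
>>> adv_r = torch.randn(2048)                              # advantages collected over one epoch
>>> adv_r = (adv_r - adv_r.mean()) / (adv_r.std() + 1e-8)  # zero mean, unit standard deviation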

property standardized_adv_c: bool#

Whether to standardize the advantages of the cost.

property standardized_adv_r: bool#

Whether to standardize the advantages of the reward.

store(**data)[source]#

Store data into the buffer.

Warning

The total size of the data must be less than the buffer size.

Parameters:

data (torch.Tensor) – The data to store.

Return type:

None

Off Policy Buffer#

Documentation

class omnisafe.common.buffer.OffPolicyBuffer(obs_space, act_space, size, batch_size, device=DEVICE_CPU)[source]#

A replay buffer for off-policy algorithms.

Warning

The buffer only supports Box spaces.

Compared to the base buffer, the off-policy buffer stores extra data:

Name     | Shape                    | Dtype         | Description
---------|--------------------------|---------------|-------------------------------------------
next_obs | (size, *obs_space.shape) | torch.float32 | The next observation from the environment.

Parameters:
  • obs_space (OmnisafeSpace) – The observation space.

  • act_space (OmnisafeSpace) – The action space.

  • size (int) – The size of the buffer.

  • batch_size (int) – The batch size of the buffer.

  • device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').

Variables:

data (dict[str, torch.Tensor]) – The data stored in the buffer.

Initialize an instance of OffPolicyBuffer.
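
A minimal usage sketch (illustrative, not from the OmniSafe documentation), assuming the stored fields are exactly those listed in the tables above.

>>> import torch
>>> from gymnasium.spaces import Box
>>> from omnisafe.common.buffer import OffPolicyBuffer
>>> buffer = OffPolicyBuffer(
...     obs_space=Box(low=-1.0, high=1.0, shape=(4,)),
...     act_space=Box(low=-1.0, high=1.0, shape=(2,)),
...     size=100000,
...     batch_size=256,
... )
>>> buffer.store(
...     obs=torch.zeros(4), act=torch.zeros(2),
...     reward=torch.tensor(1.0), cost=torch.tensor(0.0), done=torch.tensor(0.0),
...     next_obs=torch.zeros(4),
... )
>>> batch = buffer.sample_batch()  # dict of tensors, each with batch_size rows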

property batch_size: int#

Return the batch size of the buffer.

property max_size: int#

Return the max size of the buffer.

sample_batch()[source]#

Sample a batch of data from the buffer.

Returns:

The sampled batch of data.

Return type:

dict[str, Tensor]

property size: int#

Return the current size of the buffer.

store(**data)[source]#

Store data into the buffer.

Hint

The ReplayBuffer is a circular buffer. When the buffer is full, the oldest data will be overwritten.

Parameters:

data (torch.Tensor) – The data to be stored.

Return type:

None
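
The circular behaviour described in the hint can be pictured as a write pointer that wraps around once the buffer is full; a minimal illustrative sketch of that bookkeeping (not the library's code):

>>> max_size = 5
>>> ptr, size = 0, 0
>>> for step in range(8):               # store 8 transitions into 5 slots
...     slot = ptr                      # this transition overwrites slot `ptr`
...     ptr = (ptr + 1) % max_size      # wrap around at the end of the buffer
...     size = min(size + 1, max_size)  # the usable size saturates at max_size
>>> ptr, size
(3, 5)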

Vector On Policy Buffer#

Documentation

class omnisafe.common.buffer.VectorOnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient, standardized_adv_r, standardized_adv_c, num_envs=1, device=DEVICE_CPU)[source]#

Vectorized on-policy buffer.

The vector-on-policy buffer is used to store the data from vector environments. The data is stored in a list of on-policy buffers, each of which corresponds to one environment.

Warning

The buffer only supports Box spaces.

Parameters:
  • obs_space (OmnisafeSpace) – Observation space.

  • act_space (OmnisafeSpace) – Action space.

  • size (int) – Size of the buffer.

  • gamma (float) – Discount factor.

  • lam (float) – Lambda for GAE.

  • lam_c (float) – Lambda for GAE for cost.

  • advantage_estimator (AdvatageEstimator) – Advantage estimator.

  • penalty_coefficient (float) – Penalty coefficient.

  • standardized_adv_r (bool) – Whether to standardize the advantage for reward.

  • standardized_adv_c (bool) – Whether to standardize the advantage for cost.

  • num_envs (int, optional) – Number of environments. Defaults to 1.

  • device (torch.device, optional) – Device to store the data. Defaults to torch.device('cpu').

Variables:

buffers (list[OnPolicyBuffer]) – List of on-policy buffers.

Initialize an instance of VectorOnPolicyBuffer.
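
A minimal construction sketch (illustrative; the 'gae' estimator string is an assumption carried over from the on-policy buffer above). Each of the num_envs environments is backed by its own internal OnPolicyBuffer, and finish_path closes the path of a single environment selected by idx.

>>> import torch
>>> from gymnasium.spaces import Box
>>> from omnisafe.common.buffer import VectorOnPolicyBuffer
>>> buffer = VectorOnPolicyBuffer(
...     obs_space=Box(low=-1.0, high=1.0, shape=(4,)),
...     act_space=Box(low=-1.0, high=1.0, shape=(2,)),
...     size=1024,
...     gamma=0.99,
...     lam=0.95,
...     lam_c=0.95,
...     advantage_estimator='gae',  # assumed valid AdvatageEstimator value
...     penalty_coefficient=0.0,
...     standardized_adv_r=True,
...     standardized_adv_c=False,
...     num_envs=4,
... )
>>> # When environment 2 finishes an episode, close only its path.
>>> buffer.finish_path(last_value_r=torch.zeros(1), last_value_c=torch.zeros(1), idx=2)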

finish_path(last_value_r=None, last_value_c=None, idx=0)[source]#

Finish the current path and calculate the advantages of its state-action pairs.

In the vector-on-policy buffer, only the path of the internal buffer selected by idx is finished.

Return type:

None

get()[source]#

Get the data in the buffer.

We provide a trick to standardize the advantages of state-action pairs: we calculate the mean and standard deviation of the advantages and use them to standardize the advantages before they are returned. You can turn on this trick by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.

Returns:

The data stored and calculated in the buffer.

Return type:

dict[str, Tensor]

property num_buffers: int#

Number of buffers.

store(**data)[source]#

Store vectorized data into vectorized buffer.

Return type:

None

Vector Off Policy Buffer#

Documentation

class omnisafe.common.buffer.VectorOffPolicyBuffer(obs_space, act_space, size, batch_size, num_envs, device=DEVICE_CPU)[source]#

Vectorized off-policy buffer.

The vector-off-policy buffer is a vectorized version of the off-policy buffer. It stores the data in a single tensor, and the data of each environment is stored in a separate column.

Warning

The buffer only supports Box spaces.

Parameters:
  • obs_space (OmnisafeSpace) – The observation space.

  • act_space (OmnisafeSpace) – The action space.

  • size (int) – The size of the buffer.

  • batch_size (int) – The batch size of the buffer.

  • num_envs (int) – The number of environments.

  • device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').

Variables:

data (dict[str, torch.Tensor]) – The data of the buffer.

Raises:
  • NotImplementedError – If the observation space is not Box.

  • NotImplementedError – If the action space is not Box.

Initialize an instance of VectorOffPolicyBuffer.
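
A minimal construction sketch (illustrative, not from the OmniSafe documentation); the stored fields gain a per-environment dimension as described above.

>>> from gymnasium.spaces import Box
>>> from omnisafe.common.buffer import VectorOffPolicyBuffer
>>> buffer = VectorOffPolicyBuffer(
...     obs_space=Box(low=-1.0, high=1.0, shape=(4,)),
...     act_space=Box(low=-1.0, high=1.0, shape=(2,)),
...     size=100000,
...     batch_size=256,
...     num_envs=4,
... )
>>> buffer.num_envs
4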

add_field(name, shape, dtype)[source]#

Add a field to the buffer.

Examples

>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape
(buffer.size, 2, 3)
Parameters:
  • name (str) – The name of the field.

  • shape (tuple of int) – The shape of the field.

  • dtype (torch.dtype) – The dtype of the field.

Return type:

None

property num_envs: int#

The number of parallel environments.

sample_batch()[source]#

Sample a batch of data from the buffer.

Returns:

The sampled batch of data.

Return type:

dict[str, Tensor]