OmniSafe Buffer#
- BaseBuffer – Abstract base class for buffer.
- OnPolicyBuffer – A buffer for storing trajectories experienced by an agent interacting with the environment.
- OffPolicyBuffer – A ReplayBuffer for off-policy algorithms.
- VectorOnPolicyBuffer – Vectorized on-policy buffer.
- VectorOffPolicyBuffer – Vectorized off-policy buffer.
Base Buffer#
- class omnisafe.common.buffer.BaseBuffer(obs_space, act_space, size, device=DEVICE_CPU)[source]#
Abstract base class for buffer.
Warning
The buffer only supports Box spaces.
In the base buffer, we store the following data:

Name   | Shape                    | Dtype         | Description
obs    | (size, *obs_space.shape) | torch.float32 | The observation from the environment.
act    | (size, *act_space.shape) | torch.float32 | The action from the agent.
reward | (size,)                  | torch.float32 | Single-step reward.
cost   | (size,)                  | torch.float32 | Single-step cost.
done   | (size,)                  | torch.float32 | Whether the episode is done.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
device (torch.device) – The device of the buffer. Defaults to torch.device('cpu').
- Variables:
data (dict[str, torch.Tensor]) – The data of the buffer.
- Raises:
NotImplementedError – If the observation space or the action space is not Box.
Initialize an instance of BaseBuffer.
- add_field(name, shape, dtype)[source]#
Add a field to the buffer.
Examples
>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape
(buffer.size, 2, 3)
- Parameters:
name (str) – The name of the field.
shape (tuple of int) – The shape of the field.
dtype (torch.dtype) – The dtype of the field.
- Return type:
None
- property device: device#
The device of the buffer.
- property size: int#
The size of the buffer.
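To make the stored-data table concrete, the following is a minimal sketch of how these fields could be pre-allocated for Box spaces. The space shapes and sizes are illustrative; this is not OmniSafe's exact implementation.

# Illustrative pre-allocation of the base buffer fields for Box spaces.
# Shapes and names mirror the table above; values are placeholders.
import torch
from gymnasium.spaces import Box

obs_space = Box(low=-1.0, high=1.0, shape=(4,))
act_space = Box(low=-1.0, high=1.0, shape=(2,))
size = 1000

data = {
    'obs': torch.zeros((size, *obs_space.shape), dtype=torch.float32),
    'act': torch.zeros((size, *act_space.shape), dtype=torch.float32),
    'reward': torch.zeros(size, dtype=torch.float32),
    'cost': torch.zeros(size, dtype=torch.float32),
    'done': torch.zeros(size, dtype=torch.float32),
}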
On Policy Buffer#
- class omnisafe.common.buffer.OnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient=0, standardized_adv_r=False, standardized_adv_c=False, device=DEVICE_CPU)[source]#
A buffer for storing trajectories experienced by an agent interacting with the environment.
The buffer also provides the functionality of calculating the advantages of state-action pairs, using the GAE, GAE-RTG, V-trace, or Plain method.
Warning
The buffer only supports Box spaces.
Compared to the base buffer, the on-policy buffer stores extra data:

Name           | Shape   | Dtype         | Description
discounted_ret | (size,) | torch.float32 | The discounted sum of return.
value_r        | (size,) | torch.float32 | The value estimated by the reward critic.
value_c        | (size,) | torch.float32 | The value estimated by the cost critic.
adv_r          | (size,) | torch.float32 | The advantage of the reward.
adv_c          | (size,) | torch.float32 | The advantage of the cost.
target_value_r | (size,) | torch.float32 | The target value of the reward critic.
target_value_c | (size,) | torch.float32 | The target value of the cost critic.
logp           | (size,) | torch.float32 | The log probability of the action.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
gamma (float) – The discount factor.
lam (float) – The lambda factor for calculating the advantages.
lam_c (float) – The lambda factor for calculating the advantages of the cost.
advantage_estimator (AdvatageEstimator) – The advantage estimator.
penalty_coefficient (float, optional) – The penalty coefficient. Defaults to 0.
standardized_adv_r (bool, optional) – Whether to standardize the reward advantages. Defaults to False.
standardized_adv_c (bool, optional) – Whether to standardize the cost advantages. Defaults to False.
device (torch.device, optional) – The device to store the data. Defaults to torch.device('cpu').
- Variables:
ptr (int) – The pointer of the buffer.
path_start (int) – The start index of the current path.
max_size (int) – The maximum size of the buffer.
data (dict) – The data stored in the buffer.
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
device (torch.device) – The device to store the data.
Initialize an instance of OnPolicyBuffer.
- _calculate_adv_and_value_targets(values, rewards, lam)[source]#
Compute the estimated advantage.
Three methods are supported:
GAE (Generalized Advantage Estimation)
GAE is a variance reduction method for the actor-critic algorithm. It is proposed in the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation.
GAE calculates the advantage using the following formula:
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k}\]
where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\). When \(\lambda = 1\), GAE reduces to the Monte Carlo method, which is unbiased but has high variance. When \(\lambda = 0\), GAE reduces to the one-step TD method, which is biased but has low variance.
V-trace
V-trace is a variance reduction method for the actor-critic algorithm. It is proposed in the paper IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.
V-trace calculates the advantage using the following formula:
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n})\]
where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\), \(\rho_{t+k} = \frac{\pi(a_{t+k}|s_{t+k})}{b(a_{t+k}|s_{t+k})}\), \(b\) is the behavior policy, and \(d_{t+k}\) is the done flag.
Plain
Plain method is the original actor-critic algorithm. It is unbiased but has high variance.
- Parameters:
values (torch.Tensor) – The value of states.
rewards (torch.Tensor) – The reward of states.
lam (float) – The lambda parameter in GAE formula.
- Returns:
adv (torch.Tensor) – The estimated advantage.
target_value (torch.Tensor) – The target value for the value function.
- Raises:
NotImplementedError – If the advantage estimator is not supported.
- Return type:
tuple[Tensor, Tensor]
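As an illustration of the GAE formula above, here is a minimal sketch of the backward recursion over TD residuals. The function name and defaults are hypothetical; OmniSafe's actual implementation may differ.

import torch

def gae_advantages(values, rewards, last_value, gamma=0.99, lam=0.95):
    # Append the bootstrap value so the TD residual of the last step can be formed.
    values = torch.cat([values, last_value.reshape(1)])
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = torch.zeros_like(rewards)
    running = torch.zeros(())
    # A_t = delta_t + (gamma * lam) * A_{t+1}, computed backwards in time.
    for t in reversed(range(rewards.shape[0])):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    # Target values for the critic are advantages plus the value baseline.
    target_values = advantages + values[:-1]
    return advantages, target_values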
- static _calculate_v_trace(policy_action_probs, values, rewards, behavior_action_probs, gamma=0.99, rho_bar=1.0, c_bar=1.0)[source]#
This function is used to calculate V-trace targets.
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n})\]
Calculate V-trace targets for off-policy actor-critic learning recursively. For more details, please refer to the paper: Espeholt et al. 2018, IMPALA.
- Parameters:
policy_action_probs (torch.Tensor) – Action probabilities of the policy.
values (torch.Tensor) – The value of states.
rewards (torch.Tensor) – The reward of states.
behavior_action_probs (torch.Tensor) – Action probabilities of the behavior policy.
gamma (float, optional) – The discount factor. Defaults to 0.99.
rho_bar (float, optional) – The maximum value of importance weights. Defaults to 1.0.
c_bar (float, optional) – The maximum value of clipped importance weights. Defaults to 1.0.
- Returns:
V-trace targets, with shape (batch_size, sequence_length).
- Raises:
AssertionError – If the input tensors are scalars.
AssertionError – If c_bar is greater than rho_bar.
- Return type:
tuple[Tensor, Tensor, Tensor]
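For reference, a hedged sketch of a recursive V-trace target computation in the spirit of Espeholt et al. 2018. Function and variable names are illustrative and not OmniSafe's exact API.

import torch

def v_trace_targets(policy_probs, behavior_probs, values, rewards,
                    gamma=0.99, rho_bar=1.0, c_bar=1.0):
    # Truncated importance weights; c_bar is normally <= rho_bar.
    ratios = policy_probs / behavior_probs
    rhos = torch.clamp(ratios, max=rho_bar)
    cs = torch.clamp(ratios, max=c_bar)
    # Bootstrap with the last value estimate (a real implementation would use
    # the value of the state following the final transition).
    values_next = torch.cat([values[1:], values[-1:]])
    deltas = rhos * (rewards + gamma * values_next - values)
    vs_minus_v = torch.zeros_like(values)
    acc = torch.zeros(())
    # vs_t - V(x_t) = delta_t + gamma * c_t * (vs_{t+1} - V(x_{t+1})), backwards in time.
    for t in reversed(range(rewards.shape[0])):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    vs = values + vs_minus_v
    # Policy-gradient advantages use the one-step-shifted targets.
    vs_next = torch.cat([vs[1:], values[-1:]])
    advantages = rhos * (rewards + gamma * vs_next - values)
    return vs, advantages, rhos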
- finish_path(last_value_r=None, last_value_c=None)[source]#
Finish the current path and calculate the advantages of state-action pairs.
On-policy algorithms need to calculate the advantages of state-action pairs after the path is finished. This function calculates the advantages of state-action pairs and stores them in the buffer, following the steps:
Hint
Calculate the discounted return.
Calculate the advantages of the reward.
Calculate the advantages of the cost.
- Parameters:
last_value_r (torch.Tensor, optional) – The reward value of the last state of the current path. Defaults to torch.zeros(1).
last_value_c (torch.Tensor, optional) – The cost value of the last state of the current path. Defaults to torch.zeros(1).
- Return type:
None
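The first step, computing the discounted return, can be sketched as a simple backward pass bootstrapped with the last value estimate (the helper name is illustrative, not OmniSafe's exact code):

import torch

def discounted_return(rewards, last_value, gamma=0.99):
    # R_t = r_t + gamma * R_{t+1}, with the final step bootstrapped by last_value.
    returns = torch.zeros_like(rewards)
    running = last_value.reshape(())
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns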
- get()[source]#
Get the data in the buffer.
Hint
We provide a trick to standardize the advantages of state-action pairs: we compute the mean and standard deviation of the advantages and then standardize them. You can turn on this trick by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.
- Returns:
The data stored and calculated in the buffer.
- Return type:
dict[str, Tensor]
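A minimal sketch of the standardization trick mentioned in the hint above (the epsilon value and helper name are illustrative):

import torch

def standardize(adv, eps=1e-8):
    # Zero-mean, unit-variance normalization of the advantages.
    return (adv - adv.mean()) / (adv.std() + eps)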
- property standardized_adv_c: bool#
Whether to standardize the advantages of the critic.
- property standardized_adv_r: bool#
Whether to standardize the advantages of the actor.
Off Policy Buffer#
- class omnisafe.common.buffer.OffPolicyBuffer(obs_space, act_space, size, batch_size, device=DEVICE_CPU)[source]#
A ReplayBuffer for off-policy algorithms.
Warning
The buffer only supports Box spaces.
Compared to the base buffer, the off-policy buffer stores extra data:

Name     | Shape                    | Dtype         | Description
next_obs | (size, *obs_space.shape) | torch.float32 | The next observation from the environment.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
batch_size (int) – The batch size of the buffer.
device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').
- Variables:
data (dict[str, torch.Tensor]) – The data stored in the buffer.
Initialize an instance of OffPolicyBuffer.
- property batch_size: int#
Return the batch size of the buffer.
- property max_size: int#
Return the max size of the buffer.
- sample_batch()[source]#
Sample a batch of data from the buffer.
- Returns:
The sampled batch of data.
- Return type:
dict[str, Tensor]
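A hedged usage sketch, assuming gymnasium Box spaces and that transitions have already been stored in the buffer; the constructor arguments shown are illustrative:

import torch
from gymnasium.spaces import Box
from omnisafe.common.buffer import OffPolicyBuffer

obs_space = Box(low=-1.0, high=1.0, shape=(4,))
act_space = Box(low=-1.0, high=1.0, shape=(2,))
buffer = OffPolicyBuffer(obs_space, act_space, size=10000, batch_size=256)

# ... fill the buffer with transitions before sampling ...
batch = buffer.sample_batch()
obs, act, reward, cost = batch['obs'], batch['act'], batch['reward'], batch['cost']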
- property size: int#
Return the current size of the buffer.
Vector On Policy Buffer#
- class omnisafe.common.buffer.VectorOnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient, standardized_adv_r, standardized_adv_c, num_envs=1, device=DEVICE_CPU)[source]#
Vectorized on-policy buffer.
The vector-on-policy buffer is used to store the data from vector environments. The data is stored in a list of on-policy buffers, each of which corresponds to one environment.
Warning
The buffer only supports Box spaces.
- Parameters:
obs_space (OmnisafeSpace) – Observation space.
act_space (OmnisafeSpace) – Action space.
size (int) – Size of the buffer.
gamma (float) – Discount factor.
lam (float) – Lambda for GAE.
lam_c (float) – Lambda for GAE for cost.
advantage_estimator (AdvatageEstimator) – Advantage estimator.
penalty_coefficient (float) – Penalty coefficient.
standardized_adv_r (bool) – Whether to standardize the advantage for reward.
standardized_adv_c (bool) – Whether to standardize the advantage for cost.
num_envs (int, optional) – Number of environments. Defaults to 1.
device (torch.device, optional) – Device to store the data. Defaults to torch.device('cpu').
- Variables:
buffers (list[OnPolicyBuffer]) – List of on-policy buffers.
Initialize an instance of VectorOnPolicyBuffer.
- finish_path(last_value_r=None, last_value_c=None, idx=0)[source]#
Finish the current path and calculate the advantages of state-action pairs for the buffer at index idx.
- Return type:
None
- get()[source]#
Get the data in the buffer.
In the vector-on-policy buffer, we get the data from each sub-buffer and then concatenate them.
We provide a trick to standardize the advantages of state-action pairs: we compute the mean and standard deviation of the advantages and then standardize them. You can turn on this trick by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.
- Returns:
The data stored and calculated in the buffer.
- Return type:
dict[str, Tensor]
- property num_buffers: int#
Number of buffers.
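Conceptually, the vectorized buffer behaves like a list of per-environment on-policy buffers whose data is concatenated at collection time. A rough sketch of that idea (names and structure are illustrative, not OmniSafe's implementation):

import torch

class VectorBufferSketch:
    def __init__(self, make_buffer, num_envs):
        # One independent on-policy buffer per parallel environment.
        self.buffers = [make_buffer() for _ in range(num_envs)]

    def finish_path(self, last_value_r, last_value_c, idx=0):
        # Only the trajectory of environment `idx` is finished.
        self.buffers[idx].finish_path(last_value_r, last_value_c)

    def get(self):
        # Concatenate the per-environment data along the batch dimension.
        data = [b.get() for b in self.buffers]
        return {key: torch.cat([d[key] for d in data], dim=0) for key in data[0]}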
Vector Off Policy Buffer#
- class omnisafe.common.buffer.VectorOffPolicyBuffer(obs_space, act_space, size, batch_size, num_envs, device=DEVICE_CPU)[source]#
Vectorized off-policy buffer.
The vector-off-policy buffer is a vectorized version of the off-policy buffer. It stores the data in a single tensor, and the data of each environment is stored in a separate column.
Warning
The buffer only supports Box spaces.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
batch_size (int) – The batch size of the buffer.
num_envs (int) – The number of environments.
device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').
- Variables:
data (dict[str, torch.Tensor]) – The data of the buffer.
- Raises:
NotImplementedError – If the observation space or the action space is not Box.
Initialize an instance of VectorOffPolicyBuffer.
- add_field(name, shape, dtype)[source]#
Add a field to the buffer.
Examples
>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape
(buffer.size, 2, 3)
- Parameters:
name (str) – The name of the field.
shape (tuple of int) – The shape of the field.
dtype (torch.dtype) – The dtype of the field.
- Return type:
None
- property num_envs: int#
The number of parallel environments.
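To illustrate the per-environment layout, here is a rough sketch of a replay field with an explicit environment dimension and of flattening it when sampling. The exact tensor layout and sampling logic in OmniSafe may differ:

import torch

size, num_envs, obs_dim = 10000, 4, 8
# One column per environment: shape (size, num_envs, obs_dim).
obs = torch.zeros((size, num_envs, obs_dim), dtype=torch.float32)

# Sampling picks time indices, then folds the environment dimension into the batch.
idx = torch.randint(0, size, (256,))
batch_obs = obs[idx].reshape(-1, obs_dim)  # shape: (256 * num_envs, obs_dim)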