OmniSafe Buffer#
- BaseBuffer – Abstract base class for buffer.
- OnPolicyBuffer – A buffer for storing trajectories experienced by an agent interacting with the environment.
- OffPolicyBuffer – A ReplayBuffer for off-policy algorithms.
- VectorOnPolicyBuffer – Vectorized on-policy buffer.
- VectorOffPolicyBuffer – Vectorized off-policy buffer.
Base Buffer#
- class omnisafe.common.buffer.BaseBuffer(obs_space, act_space, size, device=DEVICE_CPU)[source]#
Abstract base class for buffer.
Warning
The buffer only supports Box spaces.
In the base buffer, we store the following data:

Name   | Shape                    | Dtype         | Description
obs    | (size, *obs_space.shape) | torch.float32 | The observation from the environment.
act    | (size, *act_space.shape) | torch.float32 | The action from the agent.
reward | (size,)                  | torch.float32 | Single-step reward.
cost   | (size,)                  | torch.float32 | Single-step cost.
done   | (size,)                  | torch.float32 | Whether the episode is done.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
device (torch.device) – The device of the buffer. Defaults to torch.device('cpu').
- Variables:
data (dict[str, torch.Tensor]) – The data of the buffer.
- Raises:
NotImplementedError – If the observation space or the action space is not Box.
Initialize an instance of BaseBuffer.
- add_field(name, shape, dtype)[source]#
Add a field to the buffer.
Examples
>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape
(buffer.size, 2, 3)
- Parameters:
name (str) – The name of the field.
shape (tuple of int) – The shape of the field.
dtype (torch.dtype) – The dtype of the field.
- Return type:
None
- property device: device#
The device of the buffer.
- property size: int#
The size of the buffer.
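To make the stored-data table concrete, the following is a minimal sketch of how these fields could be pre-allocated for Box spaces. The space shapes and sizes are illustrative; this is not OmniSafe's exact implementation.

# Illustrative pre-allocation of the base buffer fields for Box spaces.
# Shapes and names mirror the table above; values are placeholders.
import torch
from gymnasium.spaces import Box

obs_space = Box(low=-1.0, high=1.0, shape=(4,))
act_space = Box(low=-1.0, high=1.0, shape=(2,))
size = 1000

data = {
    'obs': torch.zeros((size, *obs_space.shape), dtype=torch.float32),
    'act': torch.zeros((size, *act_space.shape), dtype=torch.float32),
    'reward': torch.zeros(size, dtype=torch.float32),
    'cost': torch.zeros(size, dtype=torch.float32),
    'done': torch.zeros(size, dtype=torch.float32),
}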
On Policy Buffer#
- class omnisafe.common.buffer.OnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient=0, standardized_adv_r=False, standardized_adv_c=False, device=DEVICE_CPU)[source]#
A buffer for storing trajectories experienced by an agent interacting with the environment.
The buffer also provides the functionality of calculating the advantages of state-action pairs, using the GAE, GAE-RTG, V-trace, or Plain method.
Warning
The buffer only supports Box spaces.
Compared to the base buffer, the on-policy buffer stores extra data:

Name           | Shape   | Dtype         | Description
discounted_ret | (size,) | torch.float32 | The discounted sum of return.
value_r        | (size,) | torch.float32 | The value estimated by the reward critic.
value_c        | (size,) | torch.float32 | The value estimated by the cost critic.
adv_r          | (size,) | torch.float32 | The advantage of the reward.
adv_c          | (size,) | torch.float32 | The advantage of the cost.
target_value_r | (size,) | torch.float32 | The target value of the reward critic.
target_value_c | (size,) | torch.float32 | The target value of the cost critic.
logp           | (size,) | torch.float32 | The log probability of the action.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
gamma (float) – The discount factor.
lam (float) – The lambda factor for calculating the advantages.
lam_c (float) – The lambda factor for calculating the advantages of the cost.
advantage_estimator (AdvatageEstimator) – The advantage estimator.
penalty_coefficient (float, optional) – The penalty coefficient. Defaults to 0.
standardized_adv_r (bool, optional) – Whether to standardize the reward advantages. Defaults to False.
standardized_adv_c (bool, optional) – Whether to standardize the cost advantages. Defaults to False.
device (torch.device, optional) – The device to store the data. Defaults to torch.device('cpu').
- Variables:
ptr (int) – The pointer of the buffer.
path_start (int) – The start index of the current path.
max_size (int) – The maximum size of the buffer.
data (dict) – The data stored in the buffer.
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
device (torch.device) – The device to store the data.
Initialize an instance of OnPolicyBuffer.
- _calculate_adv_and_value_targets(values, rewards, lam)[source]#
Compute the estimated advantage.
Three methods are supported:
GAE (Generalized Advantage Estimation)
GAE is a variance reduction method for the actor-critic algorithm. It is proposed in the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation.
GAE calculates the advantage using the following formula:
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k}\]
where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\). When \(\lambda = 1\), GAE reduces to the Monte Carlo method, which is unbiased but has high variance. When \(\lambda = 0\), GAE reduces to the one-step TD method, which is biased but has low variance.
V-trace
V-trace is a variance reduction method for the actor-critic algorithm. It is proposed in the paper IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.
V-trace calculates the advantage using the following formula:
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n})\]
where \(\delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k})\), \(\rho_{t+k} = \frac{\pi(a_{t+k}|s_{t+k})}{b(a_{t+k}|s_{t+k})}\), \(b\) is the behavior policy, and \(d_{t+k}\) is the done flag.
Plain
Plain method is the original actor-critic algorithm. It is unbiased but has high variance.
- Parameters:
values (torch.Tensor) – The value of states.
rewards (torch.Tensor) – The reward of states.
lam (float) – The lambda parameter in GAE formula.
- Returns:
adv (torch.Tensor) – The estimated advantage.
target_value (torch.Tensor) – The target value for the value function.
- Raises:
NotImplementedError – If the advantage estimator is not supported.
- Return type:
tuple[Tensor, Tensor]
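As an illustration of the GAE formula above, here is a minimal sketch of the backward recursion over TD residuals. The function name and defaults are hypothetical; OmniSafe's actual implementation may differ.

import torch

def gae_advantages(values, rewards, last_value, gamma=0.99, lam=0.95):
    # Append the bootstrap value so the TD residual of the last step can be formed.
    values = torch.cat([values, last_value.reshape(1)])
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = torch.zeros_like(rewards)
    running = torch.zeros(())
    # A_t = delta_t + (gamma * lam) * A_{t+1}, computed backwards in time.
    for t in reversed(range(rewards.shape[0])):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    # Target values for the critic are advantages plus the value baseline.
    target_values = advantages + values[:-1]
    return advantages, target_values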
- static _calculate_v_trace(policy_action_probs, values, rewards, behavior_action_probs, gamma=0.99, rho_bar=1.0, c_bar=1.0)[source]#
This function is used to calculate V-trace targets.
\[A_t = \sum_{k=0}^{n-1} (\lambda \gamma)^k \delta_{t+k} + (\lambda \gamma)^n \rho_{t+n} (1 - d_{t+n}) (V(x_{t+n}) - b_{t+n})\]
Calculate V-trace targets for off-policy actor-critic learning recursively. For more details, please refer to the paper: Espeholt et al. 2018, IMPALA.
- Parameters:
policy_action_probs (torch.Tensor) – Action probabilities of the policy.
values (torch.Tensor) – The value of states.
rewards (torch.Tensor) – The reward of states.
behavior_action_probs (torch.Tensor) – Action probabilities of the behavior policy.
gamma (float, optional) – The discount factor. Defaults to 0.99.
rho_bar (float, optional) – The maximum value of importance weights. Defaults to 1.0.
c_bar (float, optional) – The maximum value of clipped importance weights. Defaults to 1.0.
- Returns:
V-trace targets, with shape (batch_size, sequence_length).
- Raises:
AssertionError – If the input tensors are scalars.
AssertionError – If c_bar is greater than rho_bar.
- Return type:
tuple[Tensor, Tensor, Tensor]
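For reference, a hedged sketch of a recursive V-trace target computation in the spirit of Espeholt et al. 2018. Function and variable names are illustrative and not OmniSafe's exact API.

import torch

def v_trace_targets(policy_probs, behavior_probs, values, rewards,
                    gamma=0.99, rho_bar=1.0, c_bar=1.0):
    # Truncated importance weights; c_bar is normally <= rho_bar.
    ratios = policy_probs / behavior_probs
    rhos = torch.clamp(ratios, max=rho_bar)
    cs = torch.clamp(ratios, max=c_bar)
    # Bootstrap with the last value estimate (a real implementation would use
    # the value of the state following the final transition).
    values_next = torch.cat([values[1:], values[-1:]])
    deltas = rhos * (rewards + gamma * values_next - values)
    vs_minus_v = torch.zeros_like(values)
    acc = torch.zeros(())
    # vs_t - V(x_t) = delta_t + gamma * c_t * (vs_{t+1} - V(x_{t+1})), backwards in time.
    for t in reversed(range(rewards.shape[0])):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    vs = values + vs_minus_v
    # Policy-gradient advantages use the one-step-shifted targets.
    vs_next = torch.cat([vs[1:], values[-1:]])
    advantages = rhos * (rewards + gamma * vs_next - values)
    return vs, advantages, rhos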
- finish_path(last_value_r=None, last_value_c=None)[source]#
Finish the current path and calculate the advantages of state-action pairs.
On-policy algorithms need to calculate the advantages of state-action pairs after the path is finished. This function calculates the advantages of state-action pairs and stores them in the buffer, following the steps:
Hint
Calculate the discounted return.
Calculate the advantages of the reward.
Calculate the advantages of the cost.
- Parameters:
last_value_r (torch.Tensor, optional) – The reward value of the last state of the current path. Defaults to torch.zeros(1).
last_value_c (torch.Tensor, optional) – The cost value of the last state of the current path. Defaults to torch.zeros(1).
- Return type:
None
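The first step, computing the discounted return, can be sketched as a simple backward pass bootstrapped with the last value estimate (the helper name is illustrative, not OmniSafe's exact code):

import torch

def discounted_return(rewards, last_value, gamma=0.99):
    # R_t = r_t + gamma * R_{t+1}, with the final step bootstrapped by last_value.
    returns = torch.zeros_like(rewards)
    running = last_value.reshape(())
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns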
- get()[source]#
Get the data in the buffer.
Hint
We provide a trick to standardize the advantages of state-action pairs: we compute the mean and standard deviation of the advantages and then standardize them. You can turn on this trick by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.
- Returns:
The data stored and calculated in the buffer.
- Return type:
dict[str, Tensor]
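A minimal sketch of the standardization trick mentioned in the hint above (the epsilon value and helper name are illustrative):

import torch

def standardize(adv, eps=1e-8):
    # Zero-mean, unit-variance normalization of the advantages.
    return (adv - adv.mean()) / (adv.std() + eps)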
- property standardized_adv_c: bool#
Whether to standardize the advantages of the critic.
- property standardized_adv_r: bool#
Whether to standardize the advantages of the actor.
Off Policy Buffer#
- class omnisafe.common.buffer.OffPolicyBuffer(obs_space, act_space, size, batch_size, device=DEVICE_CPU)[source]#
A ReplayBuffer for off-policy algorithms.
Warning
The buffer only supports Box spaces.
Compared to the base buffer, the off-policy buffer stores extra data:

Name     | Shape                    | Dtype         | Description
next_obs | (size, *obs_space.shape) | torch.float32 | The next observation from the environment.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
batch_size (int) – The batch size of the buffer.
device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').
- Variables:
data (dict[str, torch.Tensor]) – The data stored in the buffer.
Initialize an instance of OffPolicyBuffer.
- property batch_size: int#
Return the batch size of the buffer.
- property max_size: int#
Return the max size of the buffer.
- sample_batch()[source]#
Sample a batch of data from the buffer.
- Returns:
The sampled batch of data.
- Return type:
dict[str, Tensor]
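A hedged usage sketch, assuming gymnasium Box spaces and that transitions have already been stored in the buffer; the constructor arguments shown are illustrative:

import torch
from gymnasium.spaces import Box
from omnisafe.common.buffer import OffPolicyBuffer

obs_space = Box(low=-1.0, high=1.0, shape=(4,))
act_space = Box(low=-1.0, high=1.0, shape=(2,))
buffer = OffPolicyBuffer(obs_space, act_space, size=10000, batch_size=256)

# ... fill the buffer with transitions before sampling ...
batch = buffer.sample_batch()
obs, act, reward, cost = batch['obs'], batch['act'], batch['reward'], batch['cost']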
- property size: int#
Return the current size of the buffer.
Vector On Policy Buffer#
- class omnisafe.common.buffer.VectorOnPolicyBuffer(obs_space, act_space, size, gamma, lam, lam_c, advantage_estimator, penalty_coefficient, standardized_adv_r, standardized_adv_c, num_envs=1, device=DEVICE_CPU)[source]#
Vectorized on-policy buffer.
The vector-on-policy buffer is used to store the data from vector environments. The data is stored in a list of on-policy buffers, each of which corresponds to one environment.
Warning
The buffer only supports Box spaces.
- Parameters:
obs_space (OmnisafeSpace) – Observation space.
act_space (OmnisafeSpace) – Action space.
size (int) – Size of the buffer.
gamma (float) – Discount factor.
lam (float) – Lambda for GAE.
lam_c (float) – Lambda for GAE for cost.
advantage_estimator (AdvatageEstimator) – Advantage estimator.
penalty_coefficient (float) – Penalty coefficient.
standardized_adv_r (bool) – Whether to standardize the advantage for reward.
standardized_adv_c (bool) – Whether to standardize the advantage for cost.
num_envs (int, optional) – Number of environments. Defaults to 1.
device (torch.device, optional) – Device to store the data. Defaults to torch.device('cpu').
- Variables:
buffers (list[OnPolicyBuffer]) – List of on-policy buffers.
Initialize an instance of VectorOnPolicyBuffer.
- finish_path(last_value_r=None, last_value_c=None, idx=0)[source]#
Finish the current path and calculate the advantages of state-action pairs for the buffer at index idx.
- Return type:
None
- get()[source]#
Get the data in the buffer.
In the vector-on-policy buffer, we get the data from each sub-buffer and then concatenate them.
We provide a trick to standardize the advantages of state-action pairs: we compute the mean and standard deviation of the advantages and then standardize them. You can turn on this trick by setting standardized_adv_r to True. The same trick is applied to the advantages of the cost.
- Returns:
The data stored and calculated in the buffer.
- Return type:
dict[str, Tensor]
- property num_buffers: int#
Number of buffers.
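Conceptually, the vectorized buffer behaves like a list of per-environment on-policy buffers whose data is concatenated at collection time. A rough sketch of that idea (names and structure are illustrative, not OmniSafe's implementation):

import torch

class VectorBufferSketch:
    def __init__(self, make_buffer, num_envs):
        # One independent on-policy buffer per parallel environment.
        self.buffers = [make_buffer() for _ in range(num_envs)]

    def finish_path(self, last_value_r, last_value_c, idx=0):
        # Only the trajectory of environment `idx` is finished.
        self.buffers[idx].finish_path(last_value_r, last_value_c)

    def get(self):
        # Concatenate the per-environment data along the batch dimension.
        data = [b.get() for b in self.buffers]
        return {key: torch.cat([d[key] for d in data], dim=0) for key in data[0]}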
Vector Off Policy Buffer#
- class omnisafe.common.buffer.VectorOffPolicyBuffer(obs_space, act_space, size, batch_size, num_envs, device=DEVICE_CPU)[source]#
Vectorized off-policy buffer.
The vector-off-policy buffer is a vectorized version of the off-policy buffer. It stores the data in a single tensor, and the data of each environment is stored in a separate column.
Warning
The buffer only supports Box spaces.
- Parameters:
obs_space (OmnisafeSpace) – The observation space.
act_space (OmnisafeSpace) – The action space.
size (int) – The size of the buffer.
batch_size (int) – The batch size of the buffer.
num_envs (int) – The number of environments.
device (torch.device, optional) – The device of the buffer. Defaults to torch.device('cpu').
- Variables:
data (dict[str, torch.Tensor]) – The data of the buffer.
- Raises:
NotImplementedError – If the observation space or the action space is not Box.
Initialize an instance of VectorOffPolicyBuffer.
- add_field(name, shape, dtype)[source]#
Add a field to the buffer.
Examples
>>> buffer = BaseBuffer(...)
>>> buffer.add_field('new_field', (2, 3), torch.float32)
>>> buffer.data['new_field'].shape
(buffer.size, 2, 3)
- Parameters:
name (str) – The name of the field.
shape (tuple of int) – The shape of the field.
dtype (torch.dtype) – The dtype of the field.
- Return type:
None
- property num_envs: int#
The number of parallel environments.
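To illustrate the per-environment layout, here is a rough sketch of a replay field with an explicit environment dimension and of flattening it when sampling. The exact tensor layout and sampling logic in OmniSafe may differ:

import torch

size, num_envs, obs_dim = 10000, 4, 8
# One column per environment: shape (size, num_envs, obs_dim).
obs = torch.zeros((size, num_envs, obs_dim), dtype=torch.float32)

# Sampling picks time indices, then folds the environment dimension into the batch.
idx = torch.randint(0, size, (256,))
batch_obs = obs[idx].reshape(-1, obs_dim)  # shape: (256 * num_envs, obs_dim)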