Trust Region Policy Optimization#

Quick Facts#

  1. TRPO is an on-policy algorithm.

  2. TRPO is an improvement on NPG.

  3. TRPO is an important theoretical basis for CPO.

  4. API documentation is available for TRPO.


TRPO Theorem#

Background#

Trust region policy optimization (TRPO) is an iterative method for optimizing policies in reinforcement learning that ensures monotonic improvements. It works by iteratively finding a local approximation of the objective return and maximizing the approximated function. TRPO guarantees that the new policy is constrained within a trust region relative to the current policy, which is achieved by using KL divergence to measure the distance between the two policies.

TRPO is well-suited for optimizing large nonlinear policies such as neural networks. It builds on the Natural Policy Gradient (NPG) method, which uses the conjugate gradient algorithm to avoid the expensive cost of computing and inverting the Fisher information matrix. Furthermore, TRPO incorporates a line search mechanism to ensure that the updated policy adheres to the predetermined KL divergence constraint.

Problems of NPG

  • It is very difficult to calculate the Hessian matrix directly.

  • The fixed step size introduces errors from the Taylor expansion approximation.

  • Low utilization of sampled data.

Advantages of TRPO

  • Using the conjugate gradient algorithm with Fisher-vector products to compute the search direction without forming the full Hessian.

  • Using a line search to correct the error introduced by the Taylor expansion.

  • Using importance sampling to reuse sampled data.


Performance difference over policies#

In policy optimization, our objective is to ensure that every update leads to a consistent improvement in the expected return. To accomplish this, we usually formulate the equation for expected return in a specific format that is both intuitive and straightforward to manipulate.

(1)#\[J^R(\pi') = J^R(\pi) + \{J^R(\pi') - J^R(\pi)\}\]

To achieve monotonic improvement, we only need to ensure that \(\Delta = J^R(\pi') - J^R(\pi)\) is non-negative at every update.

As shown in NPG, the difference in performance between two policies \(\pi'\) and \(\pi\) can be expressed as:

Theorem 1 (Performance Difference Bound)

(2)#\[ J^R(\pi') = J^R(\pi) + \mathbb{E}_{\tau \sim \pi'}[\sum_{t=0}^{\infty} \gamma^t A^{R}_{\pi}(s_t,a_t)]\]

where this expectation is taken over trajectories \(\tau=(s_0, a_0, s_1, a_1, \cdots)\), and the notation \(\mathbb{E}_{\tau \sim \pi'}[\cdots]\) indicates that actions are sampled from \(\pi'\) to generate \(\tau\).

Theorem 1 is intuitive: the expected discounted return of \(\pi'\) can be viewed as the expected discounted return of \(\pi\) plus an extra advantage of \(\pi'\) over \(\pi\). The latter term measures how much \(\pi'\) improves over \(\pi\), which is the quantity of interest.

Note

We can rewrite Theorem 1 with a sum over states instead of timesteps:

(3)#\[\begin{split} J^R(\pi') &=J^R(\pi)+\sum_{t=0}^{\infty} \sum_s P\left(s_t=s \mid \pi'\right) \sum_a \pi' (a \mid s) \gamma^t A^{R}_{\pi}(s, a) \\ &=J^R(\pi)+\sum_s \sum_{t=0}^{\infty} \gamma^t P\left(s_t=s \mid \pi' \right) \sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a) \\ &=J^R(\pi)+\sum_s d_{\pi'}(s) \sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a)\end{split}\]

This equation implies for any policy \(\pi'\), if it has a nonnegative expected advantage at every state \(s\), i.e., \(\sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a) \geq 0\), it is guaranteed to increase the policy performance \(J^R\), or leave it constant in the case that the expected advantage is zero everywhere. However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation errors, that there will be some states \(s\) in which the expected advantage is negative, that is, \(\sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a)<0\).


Surrogate function for the objective#

Eq.3 requires information about the future state distribution under \(\pi'\), which is usually unknown and difficult to estimate. This complex dependency of \(d_{\pi'}(s)\) on \(\pi'\) makes Eq.3 difficult to optimize directly. Instead, we introduce the following local approximation to \(J^R\):

(4)#\[L_\pi(\pi')=J^R(\pi)+\sum_s d_\pi(s) \sum_a \pi'(a \mid s) A^{R}_{\pi}(s, a)\]

Here we only replace \(d_{\pi'}\) with \(d_\pi\). It has been proved that if the two policies \(\pi'\) and \(\pi\) are close enough, \(L_\pi(\pi')\) is a good local approximation of \(J^R(\pi')\).
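Because both the state distribution \(d_\pi\) and the sampled actions in Eq.4 come from the current policy \(\pi\), the advantage term of \(L_\pi(\pi')\) can be estimated from data collected under \(\pi\) by importance sampling, which is how TRPO reuses sampled data. The following is a minimal sketch of such an estimator, not OmniSafe's implementation; the tensor names logp_new, logp_old, and adv are assumed to hold the new and old log-probabilities of the sampled actions and their advantage estimates.

import torch

def surrogate_advantage(logp_new, logp_old, adv):
    """Monte Carlo estimate of the advantage term of L_pi(pi') in Eq.4.

    States and actions are collected under the old policy pi; the ratio
    pi'(a|s) / pi(a|s) reweights each advantage (importance sampling).
    """
    ratio = torch.exp(logp_new - logp_old)
    return (ratio * adv).mean()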

Corollary 1 (Performance Difference Bound)

Formally, suppose a parameterized policy \(\pi_{\boldsymbol{\theta}}\), where \(\pi_{\boldsymbol{\theta}}(a \mid s)\) is a differentiable function of the parameter vector \({\boldsymbol{\theta}}\), then \(L_\pi\) matches \(J^R\) to first order (see NPG). That is, for any parameter value \({\boldsymbol{\theta}}_0\), we have:

(5)#\[L_{\pi_{{\boldsymbol{\theta}}_0}}\left(\pi_{{\boldsymbol{\theta}}_0}\right)=J^R\left(\pi_{{\boldsymbol{\theta}}_0}\right)\]
(6)#\[\nabla_{\boldsymbol{\theta}} L_{\pi_{{\boldsymbol{\theta}}_0}}\left(\pi_{\boldsymbol{\theta}}\right)|_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}_0}=\left.\nabla_{\boldsymbol{\theta}} J^R\left(\pi_{\boldsymbol{\theta}}\right)\right|_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}_0}\]

Eq.6 implies that a sufficiently small step \(\pi_{{\boldsymbol{\theta}}_0} \rightarrow \pi'\) that improves \(L_{\pi_{{\boldsymbol{\theta}}_0}}\) will also improve \(J^R\), but it does not provide explicit guidance on determining the appropriate step size for policy updates.

To address this issue, Kakade and Langford (see NPG) proposed a policy updating scheme called conservative policy iteration (CPI), which provides an explicit lower bound on the improvement of \(J^R\). To define the conservative policy iteration update, let \(\pi_{\mathrm{old}}\) denote the current policy, and let \(\pi^{*}=\arg \underset{\pi}{\max}\, L_{\pi_{\text {old }}}\left(\pi\right)\). The new policy \(\pi_{\text {new }}\) is defined as the following mixture:

(7)#\[\pi_{\text {new }}(a \mid s)=(1-\alpha) \pi_{\text {old }}(a \mid s)+\alpha \pi^{*}(a \mid s)\]
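Sampling from the mixture policy in Eq.7 simply means flipping an \(\alpha\)-biased coin and acting with either \(\pi_{\text{old}}\) or \(\pi^{*}\). A minimal sketch, assuming each policy maps an observation to a torch distribution (these callables are illustrative, not OmniSafe APIs):

import torch

def sample_mixture_action(obs, pi_old, pi_star, alpha):
    """Draw an action from (1 - alpha) * pi_old + alpha * pi_star (Eq.7)."""
    if torch.rand(1).item() < alpha:
        return pi_star(obs).sample()  # with probability alpha, act with pi*
    return pi_old(obs).sample()       # otherwise, act with pi_old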

Kakade and Langford derived the following lower bound:

(8)#\[\begin{split}J^R\left(\pi_{\text {new }}\right) &\geq L_{\pi_{\text {old }}}\left(\pi_{\text {new }}\right)-\frac{2 \epsilon \gamma}{(1-\gamma)^2} \alpha^2 \\ \text { where } \epsilon &=\max _s\left|\mathbb{E}_{a \sim \pi^{*}(a \mid s)}\left[A^{R}_{\pi}(s, a)\right]\right|\end{split}\]

However, the lower bound in Eq.8 only applies to mixture policies, so it needs to be extended to general policy cases.


Monotonic Improvement Guarantee for General Stochastic Policies#

Based on the theoretical guarantee in Eq.8 for the mixture policy case, TRPO extends the lower bound to general policies by replacing \(\alpha\) with a distance measure between \(\pi\) and \(\pi'\), and changing the constant \(\epsilon\) appropriately. The chosen distance measure is the total variation (TV) divergence, defined by \(D_{TV}(p \| q)=\frac{1}{2} \sum_i \left|p_i-q_i\right|\) for discrete probability distributions \(p, q\). Define \(D_{\mathrm{TV}}^{\max }(\pi, \pi')\) as

(9)#\[D_{\mathrm{TV}}^{\max}(\pi, \pi')=\max_s D_{\mathrm{TV}}\left(\pi\left(\cdot \mid s\right) \| \pi'\left(\cdot \mid s\right)\right)\]
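For a discrete action space, \(D_{\mathrm{TV}}^{\max}\) can be evaluated directly from the two policies' action probabilities. A small illustrative sketch, assuming hypothetical tensors of per-state action probabilities:

import torch

def tv_max(probs_old, probs_new):
    """D_TV^max(pi, pi') over a batch of states.

    probs_old, probs_new: tensors of shape (num_states, num_actions)
    holding pi(a|s) and pi'(a|s) for each state in the batch.
    """
    tv_per_state = 0.5 * (probs_old - probs_new).abs().sum(dim=-1)
    return tv_per_state.max()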

And the new bound is derived by introducing the \(\alpha\)-coupling method.

Theorem 2 (Performance Difference Bound derived by \(\alpha\)-coupling method)

Let \(\alpha=D_{\mathrm{TV}}^{\max }\left(\pi_{\mathrm{old}}, \pi_{\text {new }}\right)\). Then the following bound holds:

(10)#\[\begin{split}J^{R}\left(\pi_{\text {new }}\right) &\geq L_{\pi_{\text {old }}}\left(\pi_{\text {new }}\right)-\frac{4 \epsilon \gamma}{(1-\gamma)^2} \alpha^2 \\ \text { where } \epsilon &=\max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\end{split}\]

The proof extends Kakade and Langford's result. It relies on the fact that random variables drawn from two distributions whose total variation divergence is at most \(\alpha\) can be coupled so that they are equal with probability at least \(1-\alpha\).

Next, we note the following relationship between the total variation divergence and the \(\mathrm{KL}\) divergence: \([D_{\mathrm{TV}}(p \| q)]^2 \leq D_{\mathrm{KL}}(p \| q)\). Let \(D_{\mathrm{KL}}^{\max }(\pi, \pi')=\underset{s}{\max} D_{\mathrm{KL}}(\pi(\cdot|s) \| \pi'(\cdot|s))\). The following bound then follows directly from Theorem 2 :

(11)#\[\begin{split}J^R(\pi') & \geq L_\pi(\pi')-C D_{\mathrm{KL}}^{\max }(\pi, \pi') \\ \quad \text { where } C &=\frac{4 \epsilon \gamma}{(1-\gamma)^2}\end{split}\]
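The relation \([D_{\mathrm{TV}}(p \| q)]^2 \leq D_{\mathrm{KL}}(p \| q)\) used above can be sanity-checked numerically for any pair of categorical distributions; a quick sketch with arbitrary example probabilities:

import torch
from torch.distributions import Categorical

p = Categorical(probs=torch.tensor([0.7, 0.2, 0.1]))
q = Categorical(probs=torch.tensor([0.5, 0.3, 0.2]))

tv = 0.5 * (p.probs - q.probs).abs().sum()       # D_TV(p || q)
kl = torch.distributions.kl.kl_divergence(p, q)  # D_KL(p || q)
assert tv ** 2 <= kl                             # the bound holds for this pair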

TRPO describes an approximate policy iteration scheme based on the policy improvement bound in Eq.11. Note that for now, we assume exact evaluation of the advantage values \(A^{R}_{\pi}\).

It follows from Eq.11 that TRPO is guaranteed to generate a monotonically improving sequence of policies \(J^R\left(\pi_0\right) \leq J^R\left(\pi_1\right) \leq J^R\left(\pi_2\right) \leq \cdots \leq J^R\left(\pi_n\right)\). To see this, let \(M_i(\pi)=L_{\pi_i}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_i, \pi\right)\). Then

(12)#\[\begin{split}J^{R}\left(\pi_{i+1}\right) &\geq M_i\left(\pi_{i+1}\right) \\ J^{R}\left(\pi_i\right)&=M_i\left(\pi_i\right), \text { therefore, } \\ J^{R}\left(\pi_{i+1}\right)-J^{R}\left(\pi_i\right)&\geq M_i\left(\pi_{i+1}\right)-M_i\left(\pi_i\right)\end{split}\]

Thus, by maximizing \(M_i\) at each iteration, we guarantee that the true objective \(J^R\) is non-decreasing.


Practical Implementation#

Approximately Solving the TRPO Update#

So far, we have presented an iterative algorithm with a theoretical guarantee that each new policy monotonically improves on the current one. In practice, however, when we work with parameterized policies \(\pi_{{\boldsymbol{\theta}}}(a \mid s)\), the algorithm as stated does not work well. Plugging in the notation \({\boldsymbol{\theta}}\), the update step becomes maximizing the penalized objective

(13)#\[L_{{\boldsymbol{\theta}}_{old}}({\boldsymbol{\theta}})-C D_{\mathrm{KL}}^{\max }({\boldsymbol{\theta}}_{old}, {\boldsymbol{\theta}})\]

where \(C=\frac{4 \epsilon \gamma}{(1-\gamma)^2}\), and \({\boldsymbol{\theta}}_{old}, {\boldsymbol{\theta}}\) are short for \(\pi_{{\boldsymbol{\theta}}_{old}}, \pi_{{\boldsymbol{\theta}}}\). In practice, the penalty coefficient \(C\) on the KL divergence would produce very small step sizes, making the improvement too conservative. To allow larger steps, TRPO replaces the KL penalty with a fixed KL divergence constraint that bounds the distance between \(\pi_{{\boldsymbol{\theta}}_{old}}\) and \(\pi_{{\boldsymbol{\theta}}}\):

(14)#\[\begin{split}\underset{{\boldsymbol{\theta}}}{\max}\quad &L_{{\boldsymbol{\theta}}_{old}}({\boldsymbol{\theta}}) \\ \text{s.t. } \quad &D_{\mathrm{KL}}^{\max }({\boldsymbol{\theta}}_{old}, {\boldsymbol{\theta}}) \le \delta\end{split}\]

This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to a large number of constraints. Instead, TRPO uses a heuristic approximation that considers the average KL divergence:

(15)#\[\begin{split}\underset{{\boldsymbol{\theta}}}{\max}\quad &L_{{\boldsymbol{\theta}}_{old}}({\boldsymbol{\theta}}) \\ \text{s.t. } \quad &\bar{D}_{\mathrm{KL}}({\boldsymbol{\theta}}_{old}, {\boldsymbol{\theta}}) \le \delta\end{split}\]

where \(\bar{D}_{\mathrm{KL}}({\boldsymbol{\theta}}_1, {\boldsymbol{\theta}}_2):=\mathbb{E}_{s \sim d_{\pi_{{\boldsymbol{\theta}}_1}}}\left[D_{\mathrm{KL}}\left(\pi_{{\boldsymbol{\theta}}_1}(\cdot \mid s) \| \pi_{{\boldsymbol{\theta}}_2}(\cdot \mid s)\right)\right]\). The method TRPO describes involves two steps:

Two Steps For TRPO Update

  • Compute a search direction, using a linear approximation to the objective and quadratic approximation to the constraint.

  • Perform a line search in the specified direction, ensuring both improvement of the nonlinear objective and satisfaction of the nonlinear constraint.
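Before detailing these two steps, it helps to see how the average KL divergence \(\bar{D}_{\mathrm{KL}}\) in Eq.15 is estimated in practice: the expectation over states is replaced by an empirical mean over the states in the sampled batch. A minimal sketch for diagonal Gaussian policies; the tensors of means and standard deviations are hypothetical, not OmniSafe's internal data structures:

import torch
from torch.distributions import Normal

def mean_kl(mu_old, std_old, mu_new, std_new):
    """Empirical estimate of the average KL divergence over a batch of states.

    Every tensor has shape (batch_size, action_dim); the KL of the diagonal
    Gaussians is summed over action dimensions and averaged over states.
    """
    p = Normal(mu_old, std_old)  # old policy pi_theta_old(.|s)
    q = Normal(mu_new, std_new)  # new policy pi_theta(.|s)
    return torch.distributions.kl.kl_divergence(p, q).sum(dim=-1).mean()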

Problems

  • It is prohibitively costly to form the full Hessian matrix.

  • How can we compute the maximal step length such that the KL divergence constraint is satisfied?

  • How can we ensure improvement of the surrogate objective and satisfaction of the KL divergence constraint?

Solutions

  • The conjugate gradient algorithm can approximately find the search direction without forming the full Hessian matrix.

  • The maximal step length can be computed from an intermediate result of the conjugate gradient algorithm.

  • A line search algorithm can be used to meet both requirements.

Computing the Fisher-Vector Product

TRPO approximately computes the search direction by solving the equation \(Hx=g\), where \(g\) is the gradient of the surrogate objective and \(H\) is the Fisher information matrix, i.e., the quadratic approximation to the KL divergence constraint \(\bar{D}_{\mathrm{KL}}\left({\boldsymbol{\theta}}_{\text {old }}, {\boldsymbol{\theta}}\right) \approx \frac{1}{2}\left({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{\text {old }}\right)^T H\left({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{\text {old }}\right)\) with \(H_{i j}=\frac{\partial}{\partial {\boldsymbol{\theta}}_i} \frac{\partial}{\partial {\boldsymbol{\theta}}_j} \bar{D}_{\mathrm{KL}}\left({\boldsymbol{\theta}}_{\text {old }}, {\boldsymbol{\theta}}\right)\). Because it is very costly to compute the full \(H\) or \(H^{-1}\) directly, TRPO uses the conjugate gradient algorithm to approximately solve \(Hx=g\) without forming the full matrix.
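The key trick is that \(Hx\) can be obtained by differentiating twice, without ever materializing \(H\): compute the gradient of \(\bar{D}_{\mathrm{KL}}\), take its inner product with \(x\), and differentiate that scalar again. The same double-backprop pattern appears in trpo._fvp() below; here is a self-contained toy sketch (the quadratic test function and the names are only for illustration):

import torch

def hessian_vector_product(scalar, params, vec):
    """Compute (d^2 scalar / d params^2) @ vec by double backpropagation."""
    grad = torch.autograd.grad(scalar, params, create_graph=True)[0]
    return torch.autograd.grad(torch.dot(grad, vec), params)[0]

# Toy check: f(x) = x^T A x has Hessian 2A (A symmetric), so the product is 2 A v.
A = torch.tensor([[2.0, 0.5], [0.5, 1.0]])
x = torch.zeros(2, requires_grad=True)
v = torch.tensor([1.0, -1.0])
print(hessian_vector_product(x @ A @ x, x, v))  # 2 A v = [3., -1.]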

Computing The Final Update Step

Having computed the search direction \(s\approx H^{-1}g\), TRPO next needs to compute a step length that ensures improvement of the surrogate objective and satisfaction of the KL divergence constraint. First, TRPO computes the maximal step length \(\beta\) such that \({\boldsymbol{\theta}} + \beta s\) satisfies the KL divergence constraint. To do this, set \(\delta=\bar{D}_{\mathrm{KL}} \approx \frac{1}{2}(\beta s)^T H(\beta s)=\frac{1}{2} \beta^2 s^T H s\), which gives \(\beta=\sqrt{2 \delta / s^T H s}\).

Hint

The term \(s^T H s\) is an intermediate result produced by the conjugate gradient algorithm.
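Putting the two previous paragraphs together, the maximal step length takes only a couple of lines to compute. This is an illustrative sketch, not OmniSafe's code; hvp stands for any function that returns \(Hv\), such as the Fisher-vector product described above:

import torch

def max_step_length(step_dir, hvp, target_kl, eps=1e-8):
    """beta = sqrt(2 * delta / s^T H s): the largest step with
    0.5 * (beta * s)^T H (beta * s) <= delta."""
    shs = torch.dot(step_dir, hvp(step_dir))  # s^T H s, also a CG by-product
    return torch.sqrt(2.0 * target_kl / (shs + eps))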

To meet the constraint, TRPO uses a line search to compute the final step length. Specifically, TRPO performs the line search on the objective \(L_{{\boldsymbol{\theta}}_{\text {old }}}({\boldsymbol{\theta}})-\mathcal{X}\left[\bar{D}_{\text {KL }}\left({\boldsymbol{\theta}}_{\text {old }}, {\boldsymbol{\theta}}\right) \leq \delta\right]\), where \(\mathcal{X}[\ldots]\) equals \(0\) when its argument is true and \(+\infty\) when it is false. Starting from the maximal step length \(\beta\) computed in the previous paragraph, TRPO shrinks \(\beta\) exponentially until the objective improves. Without this line search, the algorithm occasionally computes large steps that cause a catastrophic degradation of performance.

Code with OmniSafe#

Quick start#

Run TRPO in OmniSafe

Here are 3 ways to run TRPO in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

Run Agent from preset yaml file:

import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('TRPO', env_id)
agent.learn()

Run Agent from custom config dict:

import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('TRPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()

Run Agent from custom terminal config: we use train_policy.py as the entry point. You can train the agent with TRPO simply by running train_policy.py with arguments specifying the algorithm and the environment. For example, to run TRPO in SafetyPointGoal1-v0 with 1 torch thread, seed 0, and a single vectorized environment, you can use the following command:

cd examples
python train_policy.py --algo TRPO --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • TRPO.learn()

    • TRPO._env.rollout()

    • TRPO._update()

      • TRPO._buf.get()

      • TRPO._update_actor()

        • TRPO._fvp()

        • conjugate_gradients()

        • TRPO._cpo_search_step()

      • TRPO._update_reward_critic()


Documentation of algorithm specific functions#

trpo._fvp()

The TRPO algorithm builds a Hessian-vector product, instead of the full Hessian matrix, from an approximation of the KL divergence, following these steps:

  1. Calculate the KL divergence between the two policies. Note that self._actor_critic.actor denotes the actor \(\pi\) and kl denotes the KL divergence.

self._actor_critic.actor.zero_grad()
# distribution that gradients flow through
q_dist = self._actor_critic.actor(self._fvp_obs)
with torch.no_grad():
    # detached copy used as the fixed reference distribution
    p_dist = self._actor_critic.actor(self._fvp_obs)
kl = torch.distributions.kl.kl_divergence(p_dist, q_dist).mean()
  2. Use torch.autograd.grad() to compute the Hessian-vector product. Note that we compute the gradient of kl_p (the dot product of the flattened KL-divergence gradient and the input vector params) rather than of grads (the Jacobian of the KL divergence) alone.

grads = torch.autograd.grad(
    kl,
    tuple(self._actor_critic.actor.parameters()),
    create_graph=True,
)
flat_grad_kl = torch.cat([grad.view(-1) for grad in grads])

# dot product of the flattened KL gradient with the input vector `params`
kl_p = (flat_grad_kl * params).sum()
grads = torch.autograd.grad(
    kl_p,
    tuple(self._actor_critic.actor.parameters()),
    retain_graph=False,
)
  3. Return the Hessian-vector product, with a damping term params * cg_damping added for numerical stability.

flat_grad_grad_kl = torch.cat([grad.contiguous().view(-1) for grad in grads])
distributed.avg_tensor(flat_grad_grad_kl)

return flat_grad_grad_kl + params * self._cfgs.algo_cfgs.cg_damping

conjugate_gradients()

The TRPO algorithm uses the conjugate gradient algorithm to find the update direction using Hessian-vector products. The conjugate gradient method approximately solves \(Hx=g\) via the following steps:

  1. Set the initial solution vector_x to zero and compute the residual vector_r between the target vector_b (\(g\) in the equation above) and fisher_product(vector_x). Note that fisher_product computes the Hessian-vector product, playing the role of \(H\).

vector_x = torch.zeros_like(vector_b)
vector_r = vector_b - fisher_product(vector_x)
vector_p = vector_r.clone()
rdotr = torch.dot(vector_r, vector_r)
  2. Perform num_steps iterations of conjugate gradient.

for _ in range(num_steps):
    vector_z = fisher_product(vector_p)
    alpha = rdotr / (torch.dot(vector_p, vector_z) + eps)
    vector_x += alpha * vector_p
    vector_r -= alpha * vector_z
    new_rdotr = torch.dot(vector_r, vector_r)
    if torch.sqrt(new_rdotr) < residual_tol:
        break
    vector_mu = new_rdotr / (rdotr + eps)
    vector_p = vector_r + vector_mu * vector_p
    rdotr = new_rdotr
return vector_x
  3. Return the approximate solution \(x \approx H^{-1} g\) without ever forming \(H^{-1}\) explicitly.
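The routine above is generic: given any symmetric positive-definite matrix-vector product, it converges toward \(H^{-1}g\). The following self-contained sanity check re-implements the same loop outside OmniSafe on a tiny system (all names and values here are illustrative):

import torch

def cg(fisher_product, vector_b, num_steps=10, eps=1e-8, residual_tol=1e-10):
    """Plain conjugate gradient: approximately solve H x = b."""
    vector_x = torch.zeros_like(vector_b)
    vector_r = vector_b - fisher_product(vector_x)
    vector_p = vector_r.clone()
    rdotr = torch.dot(vector_r, vector_r)
    for _ in range(num_steps):
        vector_z = fisher_product(vector_p)
        alpha = rdotr / (torch.dot(vector_p, vector_z) + eps)
        vector_x += alpha * vector_p
        vector_r -= alpha * vector_z
        new_rdotr = torch.dot(vector_r, vector_r)
        if torch.sqrt(new_rdotr) < residual_tol:
            break
        vector_p = vector_r + (new_rdotr / (rdotr + eps)) * vector_p
        rdotr = new_rdotr
    return vector_x

H = torch.tensor([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive-definite stand-in
g = torch.tensor([1.0, 1.0])
x = cg(lambda v: H @ v, g)
print(torch.allclose(H @ x, g, atol=1e-4))  # True: x approximates H^{-1} g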

trpo._search_step_size()

The TRPO algorithm performs a line search to find a step that improves the surrogate objective while keeping the policy update within the trust region, following these steps:

  1. Get the current policy parameters and initialize the step size.

# How far to go in a single update
step_frac = 1.0
# Get old parameterized policy expression
theta_old = get_flat_params_from(self._actor_critic.actor)
  2. Calculate the expected reward improvement.

expected_improve = g_flat.dot(step_dir)
  3. Perform a line search to find a step that improves the surrogate objective while staying within the trust region.

  • Search over acceptance steps, ranging from 0 to the total number of steps.

# While not within_trust_region and not out of total_steps:
for step in range(total_steps):
    # update theta params
    new_theta = theta_old + step_frac * step_direction
    # set new params as params of net
    set_param_values_to_model(self._actor_critic.actor, new_theta)
  • In each iteration of the loop, calculate the policy loss and the KL divergence between the new and old policies.

with torch.no_grad():
    loss, _ = self._loss_pi(obs, act, logp, adv)
    # compute KL distance between new and old policy
    q_dist = self._actor_critic.actor(obs)
    # KL-distance of old p-dist and new q-dist, applied in KLEarlyStopping
    kl = torch.distributions.kl.kl_divergence(p_dist, q_dist).mean().item()
    kl = distributed.dist_avg(kl)
  • Accept the step only if the surrogate objective improves and the update stays within the trust region.

# real loss improve: old policy loss - new policy loss
loss_improve = loss_before - loss.item()
# average processes.... multi-processing style like: mpi_tools.mpi_avg(xxx)
loss_improve = distributed.dist_avg(loss_improve)
self._logger.log(f'Expected Improvement: {expected_improve} Actual: {loss_improve}')
if not torch.isfinite(loss):
    self._logger.log('WARNING: loss_pi not finite')
elif loss_improve < 0:
    self._logger.log('INFO: did not improve improve <0')
elif kl > self._cfgs.algo_cfgs.target_kl:
    self._logger.log('INFO: violated KL constraint.')
else:
    # step only if surrogate is improved and when within trust reg.
    acceptance_step = step + 1
    self._logger.log(f'Accept step at i={acceptance_step}')
    break
  4. Return the appropriate step direction and acceptance step.


Configs#

Train Configs

  • device (str): Device to use for training, options: cpu, cuda, cuda:0, etc.

  • torch_threads (int): Number of threads to use for PyTorch.

  • total_steps (int): Total number of steps to train the agent.

  • parallel (int): Number of parallel agents, similar to A3C.

  • vector_env_nums (int): Number of the vector environments.

Algorithms Configs

Note

The following configs are specific to TRPO algorithm.

  • cg_damping (float): Damping coefficient for conjugate gradient.

  • cg_iters (int): Number of iterations for conjugate gradient.

  • fvp_sample_freq (int): Frequency of sampling for Fisher vector product.

  • steps_per_epoch (int): Number of steps to update the policy network.

  • update_iters (int): Number of iterations to update the policy network.

  • batch_size (int): Batch size for each iteration.

  • target_kl (float): Target KL divergence.

  • entropy_coef (float): Coefficient of entropy.

  • reward_normalize (bool): Whether to normalize the reward.

  • cost_normalize (bool): Whether to normalize the cost.

  • obs_normalize (bool): Whether to normalize the observation.

  • kl_early_stop (bool): Whether to stop the training when KL divergence is too large.

  • max_grad_norm (float): Maximum gradient norm.

  • use_max_grad_norm (bool): Whether to use maximum gradient norm.

  • use_critic_norm (bool): Whether to use critic norm.

  • critic_norm_coef (float): Coefficient of critic norm.

  • gamma (float): Discount factor.

  • cost_gamma (float): Cost discount factor.

  • lam (float): Lambda for GAE-Lambda.

  • lam_c (float): Lambda for cost GAE-Lambda.

  • adv_estimation_method (str): The method to estimate the advantage.

  • standardized_rew_adv (bool): Whether to use standardized reward advantage.

  • standardized_cost_adv (bool): Whether to use standardized cost advantage.

  • penalty_coef (float): Penalty coefficient for cost.

  • use_cost (bool): Whether to use cost.
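Most of these keys can be overridden with the same custom_cfgs mechanism shown in the Quick start. A short example, assuming the keys sit under algo_cfgs as listed above (the values are arbitrary illustrations, not recommended defaults):

import omnisafe

env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'algo_cfgs': {
        'target_kl': 0.01,
        'cg_iters': 15,
        'cg_damping': 0.1,
    },
}

agent = omnisafe.Agent('TRPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()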

Model Configs

  • weight_initialization_mode (str): The type of weight initialization method.

  • actor_type (str): The type of actor; defaults to gaussian_learning.

  • linear_lr_decay (bool): Whether to use linear learning rate decay.

  • exploration_noise_anneal (bool): Whether to anneal the exploration noise.

  • std_range (list): The range of standard deviation.

Hint

actor (dict): parameters for the actor network

  • activations: tanh

  • hidden_sizes: [64, 64]

Hint

critic (dict): parameters for the critic network

  • activations: tanh

  • hidden_sizes: [64, 64]

Logger Configs

  • use_wandb (bool): Whether to use wandb to log the training process.

  • wandb_project (str): The name of wandb project.

  • use_tensorboard (bool): Whether to use tensorboard to log the training process.

  • log_dir (str): The directory to save the log files.

  • window_lens (int): The length of the window to calculate the average reward.

  • save_model_freq (int): The frequency to save the model.


Reference#

Appendix#


Proof of Theorem 1 (Difference between two arbitrary policies)#

Proof of Theorem 1

First note that \(A^{R}_{\pi}(s, a)=\mathbb{E}_{s' \sim \mathbb{P}\left(s^{\prime} \mid s, a\right)}\left[r(s)+\gamma V^R_{\pi}\left(s^{\prime}\right)-V^R_{\pi}(s)\right]\). Therefore,

(16)#\[\begin{split}\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{R}_{\pi}\left(s_t, a_t\right)\right] &=\mathbb{E}_{\tau \sim \pi'}\left[\sum _ { t = 0 } ^ { \infty } \gamma ^ { t } \left(r\left(s_t\right)+\gamma V^{R}_{\pi}\left(s_{t+1}\right)-V^{R}_{\pi}\left(s_{t} \right)\right) \right] \\ &=\mathbb{E}_{\tau \sim \pi'}\left[-V^R_{\pi}\left(s_0\right)+\sum_{t=0}^{\infty} \gamma^t r\left(s_t\right)\right] \\ &=-\mathbb{E}_{s_0}\left[V^R_{\pi}\left(s_0\right)\right]+\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t r\left(s_t\right)\right] \\ &=-J^R(\pi)+J^R(\pi')\end{split}\]

Proof of Corollary 1#

Proof of Corollary 1

From Eq.2 and Eq.4, we can easily see that

(17)#\[\begin{split}& L_{\pi_{{\boldsymbol{\theta}}_0}}\left(\pi_{{\boldsymbol{\theta}}_0}\right)=J^{R}\left(\pi_{{\boldsymbol{\theta}}_0}\right)\quad \\ \text{since}~~ &\sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \pi_{{\boldsymbol{\theta}}_0}(a \mid s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s, a)=0.\end{split}\]

Now, writing Eq.3 for the parameterized policies, we have:

(18)#\[J^{R}\left(\pi^{'}_{{\boldsymbol{\theta}}}\right) = J^{R}(\pi_{{\boldsymbol{\theta}}_0}) + \sum_s d_{\pi^{'}_{{\boldsymbol{\theta}}}}(s) \sum_a \pi^{'}_{{\boldsymbol{\theta}}}(a|s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a)\]

So,

(19)#\[\begin{split}\nabla_{{\boldsymbol{\theta}}} J^{R}(\pi_{{\boldsymbol{\theta}}})|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} &= \sum_s \nabla_{{\boldsymbol{\theta}}} d_{\pi_{{\boldsymbol{\theta}}}}(s)\big|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} \sum_a \pi_{{\boldsymbol{\theta}}_0}(a|s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a)+\sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \nabla_{{\boldsymbol{\theta}}} \pi_{{\boldsymbol{\theta}}}(a|s)\big|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a) \\ &= \sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \nabla_{{\boldsymbol{\theta}}} \pi_{{\boldsymbol{\theta}}}(a|s)\big|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a)\end{split}\]

Note

\(\sum_a \pi_{{\boldsymbol{\theta}}_0}(a|s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s,a)=0\) for every state \(s\), so the first term vanishes when the gradient is evaluated at \({\boldsymbol{\theta}}={\boldsymbol{\theta}}_0\).

Meanwhile,

(20)#\[L_{\pi_{{\boldsymbol{\theta}}_0}}(\pi_{{\boldsymbol{\theta}}})=J^{R}(\pi_{{\boldsymbol{\theta}}_0})+\sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \pi_{{\boldsymbol{\theta}}}(a \mid s) A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s, a)\]

So,

(21)#\[\nabla_{{\boldsymbol{\theta}}} L_{\pi_{{\boldsymbol{\theta}}_0}}(\pi_{{\boldsymbol{\theta}}}) \big| _{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0}=\sum_s d_{\pi_{{\boldsymbol{\theta}}_0}}(s) \sum_a \nabla_{{\boldsymbol{\theta}}} \pi_{{\boldsymbol{\theta}}}(a \mid s)\big|_{{\boldsymbol{\theta}} = {\boldsymbol{\theta}}_0} A^{R}_{\pi_{{\boldsymbol{\theta}}_0}}(s, a)\]

Combining Eq.19 and Eq.21, we have

(22)#\[\left.\nabla_{\boldsymbol{\theta}} L_{\pi_{{\boldsymbol{\theta}}_0}}\left(\pi_{\boldsymbol{\theta}}\right)\right|_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}_0}=\left.\nabla_{\boldsymbol{\theta}} J^{R}\left(\pi_{\boldsymbol{\theta}}\right)\right|_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}_0}\]

Proof of Theorem 2 (Difference between two arbitrary policies)#

Define \(\bar{A}^R(s)\) as the expected advantage of \(\pi'\) over \(\pi\) at \(s\),

(23)#\[\bar{A}^R(s)=\mathbb{E}_{a \sim \pi^{'}(\cdot \mid s)}\left[A^{R}_{\pi}(s, a)\right]\]

Theorem 1 can be written as follows:

(24)#\[J^R(\pi')=J^R(\pi)+\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^R\left(s_t\right)\right]\]

Note that \(L_\pi\) can be written as

(25)#\[L_\pi(\pi')=J^R(\pi)+\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^R\left(s_t\right)\right]\]

To bound the difference between \(J^R(\pi')\) and \(L_\pi(\pi')\), we need to bound the difference arising at each timestep. To do this, we first introduce a measure of how much \(\pi\) and \(\pi'\) agree. Specifically, we couple the policies so that they define a joint distribution over pairs of actions.

Definition 1

\((\pi, \pi')\) is an \(\alpha\)-coupled policy pair if it defines a joint distribution \((a, a') \mid s\) such that \(P(a \neq a' \mid s) \leq \alpha\) for all \(s\). \(\pi\) and \(\pi'\) denote the marginal distributions of \(a\) and \(a'\), respectively.

Computationally, \(\alpha\)-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of \(\pi\) and \(\pi'\) after setting that seed, the results will agree for at least fraction \(1-\alpha\) of seeds.

Lemma 1

Given that \(\pi, \pi'\) are \(\alpha\)-coupled policies, for all s,

(26)#\[|\bar{A}^R(s)| \leq 2 \alpha \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\]

Lemma 2

Let \((\pi, \pi')\) be an \(\alpha\)-coupled policy pair. Then

(27)#\[\begin{split}\left|\mathbb{E}_{s_t \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi}\left[\bar{A}^R\left(s_t\right)\right]\right|&\leq 2\left(1-(1-\alpha)^t\right) \max _s \left|\bar{A}^R(s)\right| \\ &\leq 4 \alpha\left(1-(1-\alpha)^t\right) \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\end{split}\]

Proof of Lemma 1

(28)#\[\begin{split}\bar{A}^R(s) &= \mathbb{E}_{a' \sim \pi'}\left[A^{R}_{\pi}(s, a')\right] - \mathbb{E}_{a \sim \pi}\left[A^{R}_{\pi}(s, a)\right] \\ &=\mathbb{E}_{(a, a') \sim(\pi, \pi')}\left[A^{R}_{\pi}(s, a')-A^{R}_{\pi}(s, a)\right]\\ &= P(a \neq a' \mid s)\, \mathbb{E}_{(a, a') \sim(\pi, \pi') \mid a \neq a'}\left[A^{R}_{\pi}(s, a')-A^{R}_{\pi}(s, a)\right]\end{split}\]

So,

(29)#\[|\bar{A}^R(s)| \leq \alpha \cdot 2 \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\]

Proof of Lemma 2

Given the coupled policy pair \((\pi, \pi')\), we can also obtain a coupling over the trajectory distributions produced by \(\pi\) and \(\pi'\), respectively. Namely, we have pairs of trajectories \(\tau, \tau'\), where \(\tau\) is obtained by taking actions from \(\pi\) and \(\tau'\) is obtained by taking actions from \(\pi'\), with the same random seed used to generate both trajectories. We consider the advantage of \(\pi'\) over \(\pi\) at timestep \(t\) and decompose this expectation based on whether \(\pi\) agrees with \(\pi'\) at all timesteps \(i<t\).

Let \(n_t\) denote the number of times that \(a_i \neq a^{'}_i\) for \(i<t\), i.e., the number of times that \(\pi\) and \(\pi'\) disagree before timestep \(t\).

(30)#\[\begin{split}\mathbb{E}_{s_t \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]&=P\left(n_t=0\right) \mathbb{E}_{s_t \sim \pi' \mid n_t=0}\left[\bar{A}^R\left(s_t\right)\right]\\ &+P\left(n_t>0\right) \mathbb{E}_{s_t \sim \pi' \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\end{split}\]

The expectation decomposes similarly when actions are sampled using \(\pi\):

(31)#\[\begin{split}\mathbb{E}_{s_t \sim \pi}\left[\bar{A}^R\left(s_t\right)\right]&=P\left(n_t=0\right) \mathbb{E}_{s_t \sim \pi \mid n_t=0}\left[\bar{A}^R\left(s_t\right)\right]\\ &+P\left(n_t>0\right) \mathbb{E}_{s_t \sim \pi \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\end{split}\]

Note that the \(n_t=0\) terms are equal:

(32)#\[\mathbb{E}_{s_t \sim \pi' \mid n_t=0}\left[\bar{A}^R\left(s_t\right)\right]=\mathbb{E}_{s_t \sim \pi \mid n_t=0}\left[\bar{A}^R\left(s_t\right)\right]\]

because \(n_t=0\) indicates that \(\pi\) and \(\pi'\) agreed on all timesteps less than \(t\). Subtracting Eq.31 from Eq.30, we get

(33)#\[\begin{split}&\mathbb{E}_{s_t \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi}\left[\bar{A}^R\left(s_t\right)\right] \\ =&P\left(n_t>0\right)\left(\mathbb{E}_{s_t \sim \pi' \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\right)\end{split}\]

By the definition of \(\alpha\), the probability that \(\pi\) and \(\pi'\) agree at timestep \(i\) is at least \(1-\alpha\), so \(P\left(n_t=0\right) \geq(1-\alpha)^t\), and

(34)#\[P\left(n_t>0\right) \leq 1-(1-\alpha)^t\]

Next, note that

(35)#\[\begin{split}&\left|\mathbb{E}_{s_t \sim \pi' \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\right| \\ & \leq\left|\mathbb{E}_{s_t \sim \pi' \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\right|+\left|\mathbb{E}_{s_t \sim \pi \mid n_t>0}\left[\bar{A}^R\left(s_t\right)\right]\right| \\ & \leq 4 \alpha \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\end{split}\]

where the second inequality follows from Lemma 1. Plugging Eq.34 and Eq.35 into Eq.33, we get

(36)#\[\left|\mathbb{E}_{s_t \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{s_t \sim \pi}\left[\bar{A}^R\left(s_t\right)\right]\right| \leq 4 \alpha\left(1-(1-\alpha)^t\right) \max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\]

The preceding lemma bounds the difference in expected advantage at each timestep \(t\). We can sum over time to bound the difference between \(J^R(\pi')\) and \(L_\pi(\pi')\). Subtracting Eq.25 from Eq.24, and defining \(\epsilon=\max _{s, a}\left|A^{R}_{\pi}(s, a)\right|\), we have

(37)#\[\begin{split}\left|J^R(\pi')-L_\pi(\pi')\right| &\leq\sum_{t=0}^{\infty} \gamma^t\left|\mathbb{E}_{\tau \sim \pi'}\left[\bar{A}^R\left(s_t\right)\right]-\mathbb{E}_{\tau \sim \pi}\left[\bar{A}^R\left(s_t\right)\right]\right| \\ & \leq \sum_{t=0}^{\infty} \gamma^t \cdot 4 \epsilon \alpha\left(1-(1-\alpha)^t\right) \\ &=4 \epsilon \alpha\left(\frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)}\right) \\ &=\frac{4 \alpha^2 \gamma \epsilon}{(1-\gamma)(1-\gamma(1-\alpha))} \\ & \leq \frac{4 \alpha^2 \gamma \epsilon}{(1-\gamma)^2}\end{split}\]

Last, to replace \(\alpha\) by the total variation divergence, we need to use the correspondence between TV divergence and coupled random variables:

Note

Suppose \(p_X\) and \(p_Y\) are distributions with \(D_{T V}\left(p_X \| p_Y\right)=\alpha\). Then there exists a joint distribution \((X, Y)\) whose marginals are \(p_X, p_Y\), for which \(X=Y\) with probability \(1-\alpha\). See Proposition 4.7 of (Levin et al., 2009) for more details.

It follows that if we have two policies \(\pi\) and \(\pi'\) such that

(38)#\[\max_s D_{\mathrm{TV}}(\pi(\cdot|s) \| \pi'(\cdot|s)) \leq \alpha\]

then we can define an \(\alpha\)-coupled policy pair \((\pi, \pi')\) with appropriate marginals. Taking \(\alpha=\underset{s}{\max} D_{T V}\left(\pi(\cdot \mid s) \| \pi'(\cdot \mid s)\right)\) in Eq.37, Theorem 2 follows.