Lagrangian Methods#

Quick Facts#

  1. The Lagrangian method can be applied to almost any RL algorithm.

  2. The Lagrangian method turns an unsafe algorithm into a safe one.

  3. The OmniSafe implementation of Lagrangian methods covers 6 on-policy and off-policy algorithms (TRPO, PPO, NPG, DDPG, SAC, and TD3).

  4. API documentation is available for PPOLag.

Lagrangian Methods Theorem#

Background#

From the previous introduction of algorithms, we know that Safe RL mainly solves the constrained optimization problem of a CMDP.

Hint

Constrained optimization problems tend to be more challenging than unconstrained optimization problems.

Therefore, the natural idea is to convert the constrained optimization problem into an unconstrained one, and then solve it with classical optimization algorithms such as stochastic gradient descent. Lagrangian methods are a widely used class of methods in machine learning for solving constrained problems. By using adaptive penalty coefficients to enforce the constraints, they convert the solution of a constrained optimization problem into the solution of an unconstrained one. In this section, we briefly introduce Lagrangian methods and give the corresponding implementations in TRPO and PPO, the algorithms we introduced earlier. If you are not yet familiar with them, please refer to the TRPO tutorial and the PPO tutorial.
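To see the idea in its simplest form, consider a generic constrained problem in standard optimization notation (an illustrative warm-up, not yet the CMDP objective defined below):

\[\max_x f(x) \quad \text{s.t.} \quad g(x) \leq d \quad \Longrightarrow \quad \min_{\lambda \geq 0} \max_x \left[f(x)-\lambda\left(g(x)-d\right)\right]\]

Whenever \(g(x) > d\), the term \(-\lambda\left(g(x)-d\right)\) penalizes the inner maximizer, and the outer minimization over \(\lambda\) raises the penalty until the constraint is respected.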

Advantages of Lagrangian Methods

  • Relatively simple to implement.

  • The principle is straightforward to understand.

  • Can be applied to a variety of algorithms.

  • Highly scalable.

Problems of Lagrangian Methods

  • Different hyperparameters need to be set for different tasks.

  • Not necessarily valid for all tasks.

  • Prone to overshoot: the multiplier can over-correct, causing oscillation around the cost limit.

  • Difficult to handle multiple cost tasks directly.


Optimization Objective#

As we mentioned in the previous chapters, the optimization problem of CMDPs can be expressed as follows:

(1)#\[\begin{split}\max_{\pi \in \Pi_{\boldsymbol{\theta}}} &J^R(\pi) \\ \text {s.t.}~~& J^{\mathcal{C}}(\pi) \leq d\end{split}\]

where \(\Pi_{\boldsymbol{\theta}} \subseteq \Pi\) denotes the set of parametrized policies with parameters \({\boldsymbol{\theta}}\). In local policy search for CMDPs, we additionally require policy iterates to be feasible for the CMDP, so instead of optimizing over \(\Pi_{\boldsymbol{\theta}}\), the algorithm should optimize over \(\Pi_{\boldsymbol{\theta}} \cap \Pi_C\). Specifically, for the TRPO and PPO algorithms, a constraint on the difference between the old and new policies is also added. For how this constrained problem is solved, please read the TRPO tutorial. The final optimization objective is as follows:

(2)#\[\begin{split}\pi_{k+1}&=\arg \max _{\pi \in \Pi_{\boldsymbol{\theta}}} J^R(\pi) \\ \text { s.t. } ~~ J^{\mathcal{C}}(\pi) &\leq d \\ D\left(\pi, \pi_k\right) &\leq \delta\end{split}\]

where \(D\) is some distance measure and \(\delta\) is the step size.


Lagrangian Method Theorem#

Lagrangian methods#

Constrained MDPs (CMDPs) are often solved using Lagrangian methods. In Lagrangian methods, the CMDP is converted into an equivalent unconstrained problem: in addition to the objective, a penalty term is added for infeasibility, making infeasible solutions sub-optimal.

Theorem 1

Given a CMDP, the unconstrained problem can be written as:

(3)#\[\min _{\lambda \geq 0} \max _{\boldsymbol{\theta}} G(\lambda, {\boldsymbol{\theta}})=\min _{\lambda \geq 0} \max _{\boldsymbol{\theta}} [J^R(\pi)-\lambda J^C(\pi)]\]

where \(G\) is the Lagrangian and \(\lambda \geq 0\) is the Lagrange multiplier (a penalty coefficient). Notice that as \(\lambda\) increases, the solution to Problem Eq.3 converges to that of Problem Eq.1.

Hint

The Lagrangian method is a two-step process.

  • First, for a fixed \(\lambda\), we solve the unconstrained problem Eq.3 to find a solution \({\boldsymbol{\theta}}^*(\lambda)\).

  • Then, we increase the penalty coefficient \(\lambda\) until the constraint is satisfied.

The final solution is \(\left({\boldsymbol{\theta}}^*, \lambda^*\right)\). The goal is to find a saddle point \(\left({\boldsymbol{\theta}}^*\left(\lambda^*\right), \lambda^*\right)\) of Problem Eq.1, which is a feasible solution (a feasible solution of the CMDP is one that satisfies \(J^C(\pi) \leq d\)). A toy numerical sketch of this two-step process is given below.
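To make the two-step process concrete, here is a tiny self-contained numerical sketch (a toy problem, not OmniSafe code): the "policy" is a single scalar parameter \(\theta\) with reward \(J^R(\theta)=\theta-\frac{1}{2}\theta^2\), cost \(J^C(\theta)=\theta\), and cost limit \(d=0.5\). The unconstrained optimum would be \(\theta=1\) with cost \(1 > d\); the multiplier grows until the iterates settle on the cost limit. All step sizes are chosen purely for illustration.

theta = 0.0      # toy policy parameter
lam = 0.0        # Lagrange multiplier, kept non-negative
theta_lr = 0.05  # primal (policy) step size
lam_lr = 0.05    # dual (multiplier) step size, eta_lambda
d = 0.5          # cost threshold

for _ in range(1000):
    # Primal step: gradient ascent on J^R(theta) - lam * J^C(theta),
    # whose gradient in theta is (1 - theta) - lam.
    theta += theta_lr * ((1.0 - theta) - lam)

    # Dual step: projected gradient step on the multiplier,
    # lam <- max(lam + eta_lambda * (J^C(theta) - d), 0).
    lam = max(lam + lam_lr * (theta - d), 0.0)

# Both quantities converge to about 0.5: the policy sits on the cost limit
# and the multiplier settles at the value that makes this point a saddle of G.
print(f"theta ~= {theta:.3f}, lambda ~= {lam:.3f}")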


Practical Implementation#

Intuitively, in the classical policy gradient algorithm we train the agent to maximize the reward. If a particular action \(a\) in state \(s\) brings a relatively higher reward, we increase the probability that the agent chooses action \(a\) in \(s\); conversely, we reduce this probability.

Hint

Lagrangian methods add two extra steps to the above process.

  • One is to adjust the reward function: if the agent's actions violate the constraint, the reward is reduced accordingly.

  • The second is a slow update of the penalty coefficient: if the agent violates the constraints less, the penalty coefficient gradually decreases; conversely, it gradually increases.

Next, we introduce the specific implementation of the Lagrangian method in the TRPO and PPO algorithms.

Policy update#

Surrogate function update

Previously, in TRPO and PPO, the agent samples a series of data from the environment and, at the end of the episode, uses this data to update itself several times, as described in Problem Eq.2. With the addition of the Lagrangian method, we need to change the original surrogate function as shown below:

(4)#\[\begin{split}\max _{\pi \in \prod_{\boldsymbol{\theta}}}[J^R(\pi)-\lambda J^C(\pi)] \\ \text { s.t. } D\left(\pi, \pi_k\right) \leq \delta\end{split}\]

In short, at each update step we simply penalize the agent's reward by \(\lambda\) times the cost; see the sketch below. In fact, this is only a minor change on top of TRPO and PPO.
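As a sketch of what this looks like for the clipped PPO objective, the Lagrangian penalty can be folded into the advantage. The combined advantage \((A^R-\lambda A^C)/(1+\lambda)\) and the function name below are assumptions (a common choice in Lagrangian PPO implementations, not a confirmed OmniSafe detail):

import torch

def ppo_lag_surrogate_loss(
    ratio: torch.Tensor,       # pi_theta(a|s) / pi_theta_k(a|s)
    adv_reward: torch.Tensor,  # reward advantage estimates A^R
    adv_cost: torch.Tensor,    # cost advantage estimates A^C
    lam: float,                # current Lagrange multiplier
    clip: float = 0.2,
) -> torch.Tensor:
    # Fold the Lagrangian penalty into a single advantage signal; dividing by
    # (1 + lam) keeps its scale roughly constant as lam grows (an assumption,
    # not necessarily OmniSafe's exact choice).
    adv = (adv_reward - lam * adv_cost) / (1.0 + lam)
    # Standard PPO clipped surrogate on the penalized advantage.
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv)
    return -surrogate.mean()  # negate because optimizers minimize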

Lagrange multiplier update

After all rounds of policy updates to the agent are complete, we will perform an update on the Lagrange multiplier, that is:

(5)#\[\begin{split}\min _\lambda ~ & \left[J^R(\pi)-\lambda\left(J^C(\pi)-d\right)\right] \\ \text { s.t. } ~~ & \lambda \geq 0\end{split}\]

Specifically, on the \(k^{\text{th}}\) update, the above minimization is carried out in practice as a single projected gradient step:

(6)#\[\lambda_{k+1}=\max \left(\lambda_k+ \eta_\lambda\left(J^C(\pi)-d\right), 0\right)\]

where \(\eta_\lambda\) is the learning rate of \(\lambda\).
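In code, Eq.6 amounts to one projected gradient step on \(\lambda\). Here is a minimal sketch using the cost_limit and lambda_lr names from the Lagrange Configs below; OmniSafe's Lagrange module may instead keep \(\lambda\) as a torch parameter driven by an optimizer, so treat this as illustrative:

def update_lagrange_multiplier(lam: float, mean_ep_cost: float,
                               cost_limit: float, lambda_lr: float) -> float:
    # Eq.6: lambda_{k+1} = max(lambda_k + eta_lambda * (J^C(pi) - d), 0).
    return max(lam + lambda_lr * (mean_ep_cost - cost_limit), 0.0)

# Example: an average episode cost of 30 with a cost limit of 25 pushes the
# multiplier upward; a cost below the limit relaxes it back toward zero.
lam = update_lagrange_multiplier(lam=0.0, mean_ep_cost=30.0,
                                 cost_limit=25.0, lambda_lr=0.035)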

Ultimately, we only need to add the above two steps to TRPO and PPO to obtain TRPOLag and PPOLag.

Attention

In practice, we often need to manually set the initial value of \(\lambda\) as well as the learning rate \(\eta_\lambda\). Unfortunately, Lagrangian algorithms are sensitive to hyperparameter selection.

  • If the initial value of \(\lambda\) or the learning rate \(\eta_\lambda\) is too large, the agent may suffer from low reward.

  • If it is too small, the agent may violate the constraints.

So we often have to carefully tune these hyperparameters to find a compromise that balances reward and constraint satisfaction.


Code with OmniSafe#

OmniSafe currently implements Lagrangian versions of TRPO, PPO, NPG, DDPG, SAC, and TD3. This section explains, at the code level, how to apply the Lagrangian method to the PPO algorithm, using PPOLag as an example. OmniSafe provides Lagrange as a separate module, so you can easily deploy it on most RL algorithms.
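As an illustration of what such a standalone module can look like, here is a minimal sketch. The class and method names are illustrative, not the actual OmniSafe API; it only assumes the four Lagrange configs listed at the end of this page.

import torch

class SimpleLagrange:
    """Illustrative Lagrange-multiplier module (not the OmniSafe class).

    Keeps lambda as a learnable tensor and updates it with a torch optimizer,
    mirroring the cost_limit, lagrangian_multiplier_init, lambda_lr, and
    lambda_optimizer configs listed later in this page.
    """

    def __init__(self, cost_limit: float, multiplier_init: float,
                 lambda_lr: float, optimizer_name: str = 'Adam') -> None:
        self.cost_limit = cost_limit
        self.lagrangian_multiplier = torch.nn.Parameter(
            torch.tensor(multiplier_init, dtype=torch.float32))
        optimizer_cls = getattr(torch.optim, optimizer_name)
        self.optimizer = optimizer_cls([self.lagrangian_multiplier], lr=lambda_lr)

    def update(self, mean_ep_cost: float) -> None:
        # Gradient ascent on lambda * (J^C - d), i.e. descent on its negation.
        self.optimizer.zero_grad()
        loss = -self.lagrangian_multiplier * (mean_ep_cost - self.cost_limit)
        loss.backward()
        self.optimizer.step()
        # Project back onto lambda >= 0.
        self.lagrangian_multiplier.data.clamp_(min=0.0)

    @property
    def value(self) -> float:
        return self.lagrangian_multiplier.item()

# Usage sketch: after each epoch, feed the mean episode cost to the module and
# read off the multiplier used to penalize the next round of policy updates.
lagrange = SimpleLagrange(cost_limit=25.0, multiplier_init=0.001,
                          lambda_lr=0.035, optimizer_name='Adam')
lagrange.update(mean_ep_cost=30.0)
penalty = lagrange.value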

Quick start#

Run PPOLag in OmniSafe

Here are 3 ways to run PPOLag in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

Run Agent from preset yaml file:

import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('PPOLag', env_id)
agent.learn()

Run Agent from custom config dict:

import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('PPOLag', env_id, custom_cfgs=custom_cfgs)
agent.learn()

We use train_policy.py as the entry file. You can train the agent with PPOLag simply by running train_policy.py with arguments specifying PPOLag and the environment. For example, to run PPOLag in SafetyPointGoal1-v0 with 1 torch thread, seed 0, and a single environment, you can use the following command:

cd examples
python train_policy.py --algo PPOLag --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • PPOLag.learn()

    • PPOLag._env.rollout()

    • PPOLag._update()

      • PPOLag._buf.get()

      • PPOLag.update_lagrange_multiplier(ep_costs)

      • PPOLag._update_actor()

      • PPOLag._update_cost_critic()

      • PPOLag._update_reward_critic()


Configs#

Train Configs

  • device (str): Device to use for training, options: cpu, cuda, cuda:0, etc.

  • torch_threads (int): Number of threads to use for PyTorch.

  • total_steps (int): Total number of steps to train the agent.

  • parallel (int): Number of parallel agents, similar to A3C.

  • vector_env_nums (int): Number of the vector environments.

Algorithms Configs

Note

The following configs are specific to the PPOLag algorithm.

  • clip (float): Clipping parameter for PPOLag.

  • steps_per_epoch (int): Number of steps to update the policy network.

  • update_iters (int): Number of iterations to update the policy network.

  • batch_size (int): Batch size for each iteration.

  • target_kl (float): Target KL divergence.

  • entropy_coef (float): Coefficient of entropy.

  • reward_normalize (bool): Whether to normalize the reward.

  • cost_normalize (bool): Whether to normalize the cost.

  • obs_normalize (bool): Whether to normalize the observation.

  • kl_early_stop (bool): Whether to stop the training when KL divergence is too large.

  • max_grad_norm (float): Maximum gradient norm.

  • use_max_grad_norm (bool): Whether to use maximum gradient norm.

  • use_critic_norm (bool): Whether to use critic norm.

  • critic_norm_coef (float): Coefficient of critic norm.

  • gamma (float): Discount factor.

  • cost_gamma (float): Cost discount factor.

  • lam (float): Lambda for GAE-Lambda.

  • lam_c (float): Lambda for cost GAE-Lambda.

  • adv_estimation_method (str): The method to estimate the advantage.

  • standardized_rew_adv (bool): Whether to use standardized reward advantage.

  • standardized_cost_adv (bool): Whether to use standardized cost advantage.

  • penalty_coef (float): Penalty coefficient for cost.

  • use_cost (bool): Whether to use cost.

Model Configs

  • weight_initialization_mode (str): The type of weight initialization method.

  • actor_type (str): The type of actor; defaults to gaussian_learning.

  • linear_lr_decay (bool): Whether to use linear learning rate decay.

  • exploration_noise_anneal (bool): Whether to anneal the exploration noise.

  • std_range (list): The range of standard deviation.

Hint

actor (dict): parameters for the actor network.

  • activations: tanh

  • hidden_sizes: [64, 64]

Hint

critic (dict): parameters for the critic network.

  • activations: tanh

  • hidden_sizes: [64, 64]

Logger Configs

  • use_wandb (bool): Whether to use wandb to log the training process.

  • wandb_project (str): The name of wandb project.

  • use_tensorboard (bool): Whether to use tensorboard to log the training process.

  • log_dir (str): The directory to save the log files.

  • window_lens (int): The length of the window to calculate the average reward.

  • save_model_freq (int): The frequency to save the model.

Lagrange Configs

  • cost_limit (float): Tolerance of constraint violation.

  • lagrangian_multiplier_init (float): Initial value of Lagrange multiplier.

  • lambda_lr (float): Learning rate of Lagrange multiplier.

  • lambda_optimizer (str): Optimizer for Lagrange multiplier.
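These values can be overridden through custom_cfgs in the same way as the training configs shown in the Quick start. In the sketch below, the top-level key name 'lagrange_cfgs' and the numeric values are assumptions; check them against the default PPOLag configuration of your installed OmniSafe version.

import omnisafe

env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    # 'lagrange_cfgs' is the assumed top-level key for the options above;
    # verify it against the default PPOLag yaml of your OmniSafe version.
    'lagrange_cfgs': {
        'cost_limit': 25.0,
        'lagrangian_multiplier_init': 0.001,
        'lambda_lr': 0.035,
        'lambda_optimizer': 'Adam',
    },
}

agent = omnisafe.Agent('PPOLag', env_id, custom_cfgs=custom_cfgs)
agent.learn()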

