Lagrangian Methods#

Quick Facts#

  1. The Lagrangian method can be applied to almost any RL algorithm.

  2. The Lagrangian method turns an unsafe algorithm into a safe one.

  3. The OmniSafe implementation of Lagrangian methods covers 6 on-policy and off-policy algorithms (TRPO, PPO, NPG, DDPG, SAC, and TD3).

  4. API documentation is available for PPOLag.

Lagrangian Methods Theorem#

Background#

From the previous introduction of algorithms, we know that Safe RL mainly solves the constrained optimization problem of a CMDP.

Hint

Constrained optimization problems tend to be more challenging than unconstrained optimization problems.

Therefore, the natural idea is to convert the constrained optimization problem into an unconstrained one, and then solve it with classical optimization algorithms such as stochastic gradient descent. Lagrangian methods are a widely used class of methods in machine learning for solving constrained problems. By using adaptive penalty coefficients to enforce the constraints, they convert the solution of a constrained optimization problem into the solution of an unconstrained one. In this section, we briefly introduce Lagrangian methods and give the corresponding implementations in TRPO and PPO, the algorithms we introduced earlier. If you are not yet familiar with them, please refer to the TRPO tutorial and the PPO tutorial.
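To see the idea in its simplest form, consider a generic constrained problem in standard optimization notation (an illustrative warm-up, not yet the CMDP objective defined below):

\[\max_x f(x) \quad \text{s.t.} \quad g(x) \leq d \quad \Longrightarrow \quad \min_{\lambda \geq 0} \max_x \left[f(x)-\lambda\left(g(x)-d\right)\right]\]

Whenever \(g(x) > d\), the term \(-\lambda\left(g(x)-d\right)\) penalizes the inner maximizer, and the outer minimization over \(\lambda\) raises the penalty until the constraint is respected.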

Advantages of Lagrangian Methods

  • Relatively simple to implement.

  • The principle is straightforward to understand.

  • Can be applied to a variety of algorithms.

  • Highly scalable.

Problems of Lagrangian Methods

  • Different hyperparameters need to be set for different tasks.

  • Not necessarily valid for all tasks.

  • Prone to overshoot: the multiplier can over-correct, causing oscillation around the cost limit.

  • Difficult to handle multiple cost tasks directly.


Optimization Objective#

As we mentioned in the previous chapters, the optimization problem of CMDPs can be expressed as follows:

(1)#\[\begin{split}\max_{\pi \in \Pi_{\boldsymbol{\theta}}} &J^R(\pi) \\ \text {s.t.}~~& J^{\mathcal{C}}(\pi) \leq d\end{split}\]

where \(\Pi_{\boldsymbol{\theta}} \subseteq \Pi\) denotes the set of parametrized policies with parameters \({\boldsymbol{\theta}}\). In local policy search for CMDPs, we additionally require policy iterates to be feasible for the CMDP, so instead of optimizing over \(\Pi_{\boldsymbol{\theta}}\), the algorithm should optimize over \(\Pi_{\boldsymbol{\theta}} \cap \Pi_C\). Specifically, for the TRPO and PPO algorithms, a constraint on the difference between the old and new policies is also added. For how this constrained problem is solved, please read the TRPO tutorial. The final optimization objective is as follows:

(2)#\[\begin{split}\pi_{k+1}&=\arg \max _{\pi \in \Pi_{\boldsymbol{\theta}}} J^R(\pi) \\ \text { s.t. } ~~ J^{\mathcal{C}}(\pi) &\leq d \\ D\left(\pi, \pi_k\right) &\leq \delta\end{split}\]

where \(D\) is some distance measure and \(\delta\) is the step size.


Lagrangian Method Theorem#

Lagrangian methods#

Constrained MDPs (CMDPs) are often solved using Lagrangian methods. In Lagrangian methods, the CMDP is converted into an equivalent unconstrained problem: in addition to the objective, a penalty term is added for infeasibility, making infeasible solutions sub-optimal.

Theorem 1

Given a CMDP, the unconstrained problem can be written as:

(3)#\[\min _{\lambda \geq 0} \max _{\boldsymbol{\theta}} G(\lambda, {\boldsymbol{\theta}})=\min _{\lambda \geq 0} \max _{\boldsymbol{\theta}} [J^R(\pi)-\lambda J^C(\pi)]\]

where \(G\) is the Lagrangian and \(\lambda \geq 0\) is the Lagrange multiplier (a penalty coefficient). Notice that as \(\lambda\) increases, the solution to Problem Eq.3 converges to that of Problem Eq.1.

Hint

The Lagrangian method is a two-step process.

  • First, for a fixed \(\lambda\), we solve the unconstrained problem Eq.3 to find a solution \({\boldsymbol{\theta}}^*(\lambda)\).

  • Then, we increase the penalty coefficient \(\lambda\) until the constraint is satisfied.

The final solution is \(\left({\boldsymbol{\theta}}^*, \lambda^*\right)\). The goal is to find a saddle point \(\left({\boldsymbol{\theta}}^*\left(\lambda^*\right), \lambda^*\right)\) of Problem Eq.1, which is a feasible solution (a feasible solution of the CMDP is one that satisfies \(J^C(\pi) \leq d\)). A toy numerical sketch of this two-step process is given below.
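To make the two-step process concrete, here is a tiny self-contained numerical sketch (a toy problem, not OmniSafe code): the "policy" is a single scalar parameter \(\theta\) with reward \(J^R(\theta)=\theta-\frac{1}{2}\theta^2\), cost \(J^C(\theta)=\theta\), and cost limit \(d=0.5\). The unconstrained optimum would be \(\theta=1\) with cost \(1 > d\); the multiplier grows until the iterates settle on the cost limit. All step sizes are chosen purely for illustration.

theta = 0.0      # toy policy parameter
lam = 0.0        # Lagrange multiplier, kept non-negative
theta_lr = 0.05  # primal (policy) step size
lam_lr = 0.05    # dual (multiplier) step size, eta_lambda
d = 0.5          # cost threshold

for _ in range(1000):
    # Primal step: gradient ascent on J^R(theta) - lam * J^C(theta),
    # whose gradient in theta is (1 - theta) - lam.
    theta += theta_lr * ((1.0 - theta) - lam)

    # Dual step: projected gradient step on the multiplier,
    # lam <- max(lam + eta_lambda * (J^C(theta) - d), 0).
    lam = max(lam + lam_lr * (theta - d), 0.0)

# Both quantities converge to about 0.5: the policy sits on the cost limit
# and the multiplier settles at the value that makes this point a saddle of G.
print(f"theta ~= {theta:.3f}, lambda ~= {lam:.3f}")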


Practical Implementation#

Intuitively, in the classical policy gradient algorithm we train the agent to maximize the reward. If a particular action \(a\) in state \(s\) brings a relatively higher reward, we increase the probability that the agent chooses action \(a\) in \(s\); conversely, we reduce this probability.

Hint

Lagrangian methods add two extra steps to the above process.

  • One is to adjust the reward function: if the agent's actions violate the constraint, the reward is reduced accordingly.

  • The second is a slow update of the penalty coefficient: if the agent violates the constraints less, the penalty coefficient gradually decreases; conversely, it gradually increases.

Next, we introduce the specific implementation of the Lagrangian method in the TRPO and PPO algorithms.

Policy update#

Surrogate function update

Previously, in TRPO and PPO, the agent samples a series of data from the environment and, at the end of the episode, uses this data to update itself several times, as described in Problem Eq.2. With the addition of the Lagrangian method, we need to change the original surrogate function as shown below:

(4)#\[\begin{split}\max _{\pi \in \prod_{\boldsymbol{\theta}}}[J^R(\pi)-\lambda J^C(\pi)] \\ \text { s.t. } D\left(\pi, \pi_k\right) \leq \delta\end{split}\]

In short, at each update step we simply penalize the agent's reward by \(\lambda\) times the cost; see the sketch below. In fact, this is only a minor change on top of TRPO and PPO.
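As a sketch of what this looks like for the clipped PPO objective, the Lagrangian penalty can be folded into the advantage. The combined advantage \((A^R-\lambda A^C)/(1+\lambda)\) and the function name below are assumptions (a common choice in Lagrangian PPO implementations, not a confirmed OmniSafe detail):

import torch

def ppo_lag_surrogate_loss(
    ratio: torch.Tensor,       # pi_theta(a|s) / pi_theta_k(a|s)
    adv_reward: torch.Tensor,  # reward advantage estimates A^R
    adv_cost: torch.Tensor,    # cost advantage estimates A^C
    lam: float,                # current Lagrange multiplier
    clip: float = 0.2,
) -> torch.Tensor:
    # Fold the Lagrangian penalty into a single advantage signal; dividing by
    # (1 + lam) keeps its scale roughly constant as lam grows (an assumption,
    # not necessarily OmniSafe's exact choice).
    adv = (adv_reward - lam * adv_cost) / (1.0 + lam)
    # Standard PPO clipped surrogate on the penalized advantage.
    surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv)
    return -surrogate.mean()  # negate because optimizers minimize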

Lagrange multiplier update

After all rounds of policy updates to the agent are complete, we will perform an update on the Lagrange multiplier, that is:

(5)#\[\begin{split}\min _\lambda ~ & \left[J^R(\pi)-\lambda\left(J^C(\pi)-d\right)\right] \\ \text { s.t. } ~~ & \lambda \geq 0\end{split}\]

Specifically, on the \(k^{\text{th}}\) update, the above minimization is carried out in practice as a single projected gradient step:

(6)#\[\lambda_{k+1}=\max \left(\lambda_k+ \eta_\lambda\left(J^C(\pi)-d\right), 0\right)\]

where \(\eta_\lambda\) is the learning rate of \(\lambda\).
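In code, Eq.6 amounts to one projected gradient step on \(\lambda\). Here is a minimal sketch using the cost_limit and lambda_lr names from the Lagrange Configs below; OmniSafe's Lagrange module may instead keep \(\lambda\) as a torch parameter driven by an optimizer, so treat this as illustrative:

def update_lagrange_multiplier(lam: float, mean_ep_cost: float,
                               cost_limit: float, lambda_lr: float) -> float:
    # Eq.6: lambda_{k+1} = max(lambda_k + eta_lambda * (J^C(pi) - d), 0).
    return max(lam + lambda_lr * (mean_ep_cost - cost_limit), 0.0)

# Example: an average episode cost of 30 with a cost limit of 25 pushes the
# multiplier upward; a cost below the limit relaxes it back toward zero.
lam = update_lagrange_multiplier(lam=0.0, mean_ep_cost=30.0,
                                 cost_limit=25.0, lambda_lr=0.035)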

Ultimately, we only need to add the above two steps to TRPO and PPO to obtain TRPOLag and PPOLag.

Attention

In practice, we often need to manually set the initial value of \(\lambda\) as well as the learning rate \(\eta_\lambda\). Unfortunately, Lagrangian algorithms are sensitive to hyperparameter selection.

  • If the initial value of \(\lambda\) or the learning rate \(\eta_\lambda\) is too large, the agent may suffer from low reward.

  • If it is too small, the agent may violate the constraints.

So we often have to carefully tune these hyperparameters to find a compromise that balances reward and constraint satisfaction.


Code with OmniSafe#

OmniSafe currently implements Lagrangian versions of TRPO, PPO, NPG, DDPG, SAC, and TD3. This section explains, at the code level, how to apply the Lagrangian method to the PPO algorithm, using PPOLag as an example. OmniSafe provides Lagrange as a separate module, so you can easily deploy it on most RL algorithms.
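As an illustration of what such a standalone module can look like, here is a minimal sketch. The class and method names are illustrative, not the actual OmniSafe API; it only assumes the four Lagrange configs listed at the end of this page.

import torch

class SimpleLagrange:
    """Illustrative Lagrange-multiplier module (not the OmniSafe class).

    Keeps lambda as a learnable tensor and updates it with a torch optimizer,
    mirroring the cost_limit, lagrangian_multiplier_init, lambda_lr, and
    lambda_optimizer configs listed later in this page.
    """

    def __init__(self, cost_limit: float, multiplier_init: float,
                 lambda_lr: float, optimizer_name: str = 'Adam') -> None:
        self.cost_limit = cost_limit
        self.lagrangian_multiplier = torch.nn.Parameter(
            torch.tensor(multiplier_init, dtype=torch.float32))
        optimizer_cls = getattr(torch.optim, optimizer_name)
        self.optimizer = optimizer_cls([self.lagrangian_multiplier], lr=lambda_lr)

    def update(self, mean_ep_cost: float) -> None:
        # Gradient ascent on lambda * (J^C - d), i.e. descent on its negation.
        self.optimizer.zero_grad()
        loss = -self.lagrangian_multiplier * (mean_ep_cost - self.cost_limit)
        loss.backward()
        self.optimizer.step()
        # Project back onto lambda >= 0.
        self.lagrangian_multiplier.data.clamp_(min=0.0)

    @property
    def value(self) -> float:
        return self.lagrangian_multiplier.item()

# Usage sketch: after each epoch, feed the mean episode cost to the module and
# read off the multiplier used to penalize the next round of policy updates.
lagrange = SimpleLagrange(cost_limit=25.0, multiplier_init=0.001,
                          lambda_lr=0.035, optimizer_name='Adam')
lagrange.update(mean_ep_cost=30.0)
penalty = lagrange.value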

Quick start#

Run PPOLag in OmniSafe

Here are 3 ways to run PPOLag in OmniSafe:

  • Run Agent from preset yaml file

  • Run Agent from custom config dict

  • Run Agent from custom terminal config

Run Agent from preset yaml file:

import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('PPOLag', env_id)
agent.learn()

Run Agent from custom config dict:

import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('PPOLag', env_id, custom_cfgs=custom_cfgs)
agent.learn()

We use train_policy.py as the entry file. You can train the agent with PPOLag simply by running train_policy.py with arguments specifying PPOLag and the environment. For example, to run PPOLag in SafetyPointGoal1-v0 with 1 torch thread, seed 0, and a single environment, you can use the following command:

cd examples
python train_policy.py --algo PPOLag --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1

Architecture of functions#

  • PPOLag.learn()

    • PPOLag._env.rollout()

    • PPOLag._update()

      • PPOLag._buf.get()

      • PPOLag.update_lagrange_multiplier(ep_costs)

      • PPOLag._update_actor()

      • PPOLag._update_cost_critic()

      • PPOLag._update_reward_critic()


Configs#

Train Configs

  • device (str): Device to use for training, options: cpu, cuda, cuda:0, etc.

  • torch_threads (int): Number of threads to use for PyTorch.

  • total_steps (int): Total number of steps to train the agent.

  • parallel (int): Number of parallel agents, similar to A3C.

  • vector_env_nums (int): Number of the vector environments.

Algorithms Configs

Note

The following configs are specific to the PPOLag algorithm.

  • clip (float): Clipping parameter for PPOLag.

  • steps_per_epoch (int): Number of steps to update the policy network.

  • update_iters (int): Number of iterations to update the policy network.

  • batch_size (int): Batch size for each iteration.

  • target_kl (float): Target KL divergence.

  • entropy_coef (float): Coefficient of entropy.

  • reward_normalize (bool): Whether to normalize the reward.

  • cost_normalize (bool): Whether to normalize the cost.

  • obs_normalize (bool): Whether to normalize the observation.

  • kl_early_stop (bool): Whether to stop the training when KL divergence is too large.

  • max_grad_norm (float): Maximum gradient norm.

  • use_max_grad_norm (bool): Whether to use maximum gradient norm.

  • use_critic_norm (bool): Whether to use critic norm.

  • critic_norm_coef (float): Coefficient of critic norm.

  • gamma (float): Discount factor.

  • cost_gamma (float): Cost discount factor.

  • lam (float): Lambda for GAE-Lambda.

  • lam_c (float): Lambda for cost GAE-Lambda.

  • adv_estimation_method (str): The method to estimate the advantage.

  • standardized_rew_adv (bool): Whether to use standardized reward advantage.

  • standardized_cost_adv (bool): Whether to use standardized cost advantage.

  • penalty_coef (float): Penalty coefficient for cost.

  • use_cost (bool): Whether to use cost.

Model Configs

  • weight_initialization_mode (str): The type of weight initialization method.

  • actor_type (str): The type of actor; defaults to gaussian_learning.

  • linear_lr_decay (bool): Whether to use linear learning rate decay.

  • exploration_noise_anneal (bool): Whether to anneal the exploration noise.

  • std_range (list): The range of standard deviation.

Hint

actor (dict): parameters for the actor network.

  • activations: tanh

  • hidden_sizes: [64, 64]

Hint

critic (dict): parameters for the critic network.

  • activations: tanh

  • hidden_sizes: [64, 64]

Logger Configs

  • use_wandb (bool): Whether to use wandb to log the training process.

  • wandb_project (str): The name of wandb project.

  • use_tensorboard (bool): Whether to use tensorboard to log the training process.

  • log_dir (str): The directory to save the log files.

  • window_lens (int): The length of the window to calculate the average reward.

  • save_model_freq (int): The frequency to save the model.

Lagrange Configs

  • cost_limit (float): Tolerance of constraint violation.

  • lagrangian_multiplier_init (float): Initial value of Lagrange multiplier.

  • lambda_lr (float): Learning rate of Lagrange multiplier.

  • lambda_optimizer (str): Optimizer for Lagrange multiplier.
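These values can be overridden through custom_cfgs in the same way as the training configs shown in the Quick start. In the sketch below, the top-level key name 'lagrange_cfgs' and the numeric values are assumptions; check them against the default PPOLag configuration of your installed OmniSafe version.

import omnisafe

env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    # 'lagrange_cfgs' is the assumed top-level key for the options above;
    # verify it against the default PPOLag yaml of your OmniSafe version.
    'lagrange_cfgs': {
        'cost_limit': 25.0,
        'lagrangian_multiplier_init': 0.001,
        'lambda_lr': 0.035,
        'lambda_optimizer': 'Adam',
    },
}

agent = omnisafe.Agent('PPOLag', env_id, custom_cfgs=custom_cfgs)
agent.learn()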

