Constrained Policy Optimization#
Quick Facts#
CPO is an on-policy algorithm.
CPO can be thought of as the Safe RL counterpart of TRPO.
The OmniSafe implementation of CPO supports parallelization.
API documentation is available for CPO.
CPO Theorem#
Background#
Constrained policy optimization (CPO) is a policy search algorithm for safe reinforcement learning that guarantees near-constraint satisfaction at each iteration. CPO builds upon the ideas of TRPO (Trust Region Policy Optimization) to construct surrogate functions that approximate the objective and constraints and are easy to estimate from samples collected by the current policy. CPO provides tighter bounds for policy search using trust regions, making it the first general-purpose policy search algorithm for safe RL.
Hint
CPO can train neural network policies for high-dimensional control while ensuring that they behave within specified constraints throughout training.
CPO aims to provide an approach for policy search in continuous CMDPs. It uses results from TRPO and NPG to derive a policy improvement step that guarantees both an increase in reward and satisfaction of the cost constraints. Although CPO is slightly inferior in empirical performance, it offers a solid theoretical foundation for solving constrained optimization problems in safe RL.
Hint
CPO is very complex in terms of implementation, but OmniSafe provides a highly readable code implementation to help you get up to speed quickly.
Optimization Objective#
In the previous chapters, we introduced TRPO, which solves the following optimization problem:
where \(\Pi_{\boldsymbol{\theta}} \subseteq \Pi\) denotes the set of parametrized policies with parameters \(\boldsymbol{\theta}\), and \(D\) is some distance measure.
In local policy search, we additionally require policy iterates to be feasible for the CMDP. So instead of optimizing over \(\Pi_{\boldsymbol{\theta}}\), CPO optimizes over \(\Pi_{\boldsymbol{\theta}} \cap \Pi_{C}\).
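For reference, a sketch of the resulting constrained policy search problem (the trust-region problem above augmented with cost constraints; \(d_i\) denotes the limit on cost \(C_i\)) is:

\[
\pi_{k+1}=\arg\max_{\pi \in \Pi_{\boldsymbol{\theta}}} J(\pi)
\quad \text{s.t.} \quad J^{C_i}(\pi) \leq d_i,\ i=1, \ldots, m,
\quad D\left(\pi, \pi_k\right) \leq \delta .
\]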
Hint
This update is difficult to implement because it requires evaluating the constraint functions to determine whether a proposed policy \(\pi\) is feasible.
Contribution I
CPO develops a principled approximation with a particular choice of \(D\), where the objective and constraints are replaced with surrogate functions.
Contribution II
CPO proposes that with those surrogates, the update’s worst-case performance and worst-case constraint violation can be bounded with values that depend on a hyperparameter of the algorithm.
Policy Performance Bounds#
CPO presents the theoretical foundation for its approach, a new bound on the difference in returns between two arbitrary policies.
The following Theorem 1 connects the difference in returns (or constraint costs) between two arbitrary policies to an average divergence between them.
Theorem 1 (Difference between two arbitrary policies)
For any function \(f : S \rightarrow \mathbb{R}\) and any policies \(\pi\) and \(\pi'\), define \(\delta^f(s,a,s') \doteq R(s,a,s') + \gamma f(s')-f(s)\),
where \(D_{T V}\left(\pi'|| \pi\right)[s]=\frac{1}{2} \sum_a\left|\pi'(a|s)-\pi(a|s)\right|\) is the total variational divergence between action distributions at \(s\). The conclusion is as follows:
Furthermore, the bounds are tight (when \(\pi=\pi^{\prime}\), all three expressions are identically zero).
By picking \(f=V^{R}_\pi\) or \(f=V^{C}_\pi\), we obtain Corollary 1, Corollary 2, and Corollary 3 below:
Corollary 1
For any policies \(\pi'\), \(\pi\), with \(\epsilon^{R}_{\pi'}=\max _s|\underset{a \sim \pi'}{\mathbb{E}}[A^{R}_\pi(s, a)]|\), the following bound holds:
Corollary 2
For any policies \(\pi'\) and \(\pi\), with \(\epsilon^{C_i}_{\pi'}=\max _s|\underset{a \sim \pi'}{\mathbb{E}}[A^{C_i}_\pi(s, a)]|\), the following bound holds:
Corollary 3
Trust region methods prefer to constrain the \(KL\) divergence between policies, so CPO uses Pinsker's inequality to connect \(D_{TV}\) with \(D_{KL}\).
Combining this with Jensen's inequality, we obtain our final Corollary 3:
In the bounds of Theorem 1, Corollary 1, and Corollary 2, make the substitution:
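For reference, the two standard inequalities used here are (with \(\pi'(\cdot|s)\) and \(\pi(\cdot|s)\) the per-state action distributions):

\[
D_{TV}(\pi' \| \pi)[s] \leq \sqrt{\tfrac{1}{2} D_{KL}(\pi' \| \pi)[s]},
\qquad
\underset{s \sim d_\pi}{\mathbb{E}}\big[D_{TV}(\pi' \| \pi)[s]\big]
\leq \sqrt{\tfrac{1}{2} \underset{s \sim d_\pi}{\mathbb{E}}\big[D_{KL}(\pi' \| \pi)[s]\big]},
\]

where the first is Pinsker's inequality and the second follows from Jensen's inequality applied to the concave square root; the substitution replaces the expected TV divergence with the square-root KL term on the right.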
Trust Region Methods#
For a parameterized stationary policy, trust region algorithms for reinforcement learning have policy updates of the following form:
where \(\bar{D}_{K L}(\pi \| \pi_k)=\underset{s \sim \pi_k}{\mathbb{E}}[D_{K L}(\pi \| \pi_k)[s]]\) and \(\delta \ge 0\) is the step size. The set \(\left\{\pi_{\boldsymbol{\theta}} \in \Pi_{\boldsymbol{\theta}}: \bar{D}_{K L}\left(\pi \| \pi_k\right) \leq \delta\right\}\) is called the trust region. The motivation for this update is that it approximately optimizes the lower bound on policy performance given in Corollary 1, which would guarantee monotonic performance improvement.
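As a minimal sketch (the surrogate form used by TRPO, with notation as above), this trust region update reads:

\[
\pi_{k+1}=\arg\max_{\pi \in \Pi_{\boldsymbol{\theta}}}
\underset{\substack{s \sim d_{\pi_k} \\ a \sim \pi}}{\mathbb{E}}\big[A^{R}_{\pi_k}(s, a)\big]
\quad \text{s.t.} \quad \bar{D}_{K L}\left(\pi \| \pi_k\right) \leq \delta .
\]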
Hint
In short, CPO uses a trust region rather than a penalty on policy divergence, which enables larger step sizes.
Worst-case Performance of CPO Update#
Here we introduce two propositions from the CPO paper: one describes the worst-case performance degradation guarantee, which depends on \(\delta\), and the other bounds the worst-case constraint violation of the CPO update.
Trust Region Update Performance
Suppose \(\pi_k, \pi_{k+1}\) are related by Eq.9, and that \(\pi_k \in \Pi_{\boldsymbol{\theta}}\). A lower bound on the policy performance difference between \(\pi_k\) and \(\pi_{k+1}\) is:
where \(\epsilon^R_{\pi_{k+1}}=\max_s\left|\mathbb{E}_{a \sim \pi_{k+1}}\left[A^R_{\pi_k}(s, a)\right]\right|\).
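For reference, the lower bound from the CPO paper has the following form (restated here; see the paper for the exact statement):

\[
J\left(\pi_{k+1}\right)-J\left(\pi_k\right) \geq
-\frac{\sqrt{2 \delta}\, \gamma\, \epsilon^{R}_{\pi_{k+1}}}{(1-\gamma)^2} .
\]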
CPO Update Worst-Case Constraint Violation
Suppose \(\pi_k, \pi_{k+1}\) are related by Eq.9, and that \(\pi_k \in \Pi_{\boldsymbol{\theta}}\). An upper bound on the \(C_i\)-return of \(\pi_{k+1}\) is:
where \(\epsilon^{C_i}_{\pi_{k+1}}=\max _s\left|\mathbb{E}_{a \sim \pi_{k+1}}\left[A^{C_i}_{\pi_k}(s, a)\right]\right|\).
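For reference, the upper bound from the CPO paper has the following form (restated here; see the paper for the exact statement):

\[
J^{C_i}\left(\pi_{k+1}\right) \leq d_i+\frac{\sqrt{2 \delta}\, \gamma\, \epsilon^{C_i}_{\pi_{k+1}}}{(1-\gamma)^2} .
\]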
Summary#
We mainly introduced the essential inequalities in CPO. Based on those inequalities, CPO presents the optimization problem that ultimately needs to be solved and proposes two propositions about the worst-case behavior of the CPO update. In the next section, we discuss how to solve this problem in practice. These theoretical derivations can be confusing on a first read; detailed proofs of the formulas above are given in the appendix, and a couple of careful passes should make them clear.
Practical Implementation#
Overview
In this section, we show how CPO implements an approximation to the update in Eq.10, even when optimizing policies with thousands of parameters. To address the approximation and sampling errors that arise in practice, as well as the potential violations described by Proposition 2, CPO tightens the constraints by constraining upper bounds on the costs instead of the costs themselves.
Navigation
Approximately Solving the CPO Update
Feasibility
Tightening Constraints via Cost Shaping
Code With OmniSafe
Approximately Solving the CPO Update#
For policies with high-dimensional parameter spaces like neural networks, Eq.10 can be impractical to solve directly because of the computational cost.
Hint
However, for small step sizes \(\delta\), the objective and cost constraints are well-approximated by linearizing around \(\pi_k\), and the KL-Divergence constraint is well-approximated by second-order expansion.
Denoting the gradient of the objective as \(g\), the gradient of constraint \(i\) as \(b_i\), the Hessian of the \(KL\) divergence as \(H\), and defining \(c_i=J^{C_i}\left(\pi_k\right)-d_i\), the approximation to Eq.10 is:
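Restated as a sketch (notation as defined above; see the CPO paper for the exact statement), the linearized problem is:

\[
\begin{aligned}
\boldsymbol{\theta}_{k+1}=\arg\max_{\boldsymbol{\theta}} \quad & g^T\left(\boldsymbol{\theta}-\boldsymbol{\theta}_k\right) \\
\text{s.t.} \quad & c_i+b_i^T\left(\boldsymbol{\theta}-\boldsymbol{\theta}_k\right) \leq 0, \quad i=1, \ldots, m, \\
& \tfrac{1}{2}\left(\boldsymbol{\theta}-\boldsymbol{\theta}_k\right)^T H\left(\boldsymbol{\theta}-\boldsymbol{\theta}_k\right) \leq \delta .
\end{aligned}
\]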
With \(B=\left[b_1, \ldots, b_m\right]\) and \(c=\left[c_1, \ldots, c_m\right]^T\), a dual to Eq.13 can be expressed as:
where \(r=g^T H^{-1} B, S=B^T H^{-1} B\). If \(\lambda^*, v^*\) are a solution to the dual, the solution to the primal is
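Restated from the CPO paper as a sketch (sign conventions should be checked against the paper), the dual and the resulting primal solution are:

\[
\max_{\lambda \geq 0,\ \nu \succeq 0}
\ -\frac{1}{2 \lambda}\left(g^T H^{-1} g-2 r^T \nu+\nu^T S \nu\right)+\nu^T c-\frac{\lambda \delta}{2},
\qquad
\boldsymbol{\theta}^*=\boldsymbol{\theta}_k+\frac{1}{\lambda^*} H^{-1}\left(g-B \nu^*\right) .
\]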
In short, CPO solves the dual for \(\lambda^*, \nu^*\) and uses them to propose the policy update of Eq.15, thus approximately solving Eq.10. In practice, CPO also uses two tricks to safeguard the update's performance.
Warning
Because of approximation errors, the proposed update may not satisfy the constraints in Eq.10; a backtracking line search is used to ensure surrogate constraint satisfaction.
For high-dimensional policies, it is impractically expensive to invert the Fisher information matrix. This poses a challenge for computing \(H^{-1} g\) and \(H^{-1} b\), which appear in the dual. Like TRPO, CPO computes them approximately using the conjugate gradient method.
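The sketch below shows a standard conjugate gradient routine of the kind used here: it approximates \(H^{-1} v\) using only Hessian-vector products. Function and argument names are illustrative and need not match OmniSafe's internal conjugate_gradients utility exactly.

```python
import torch


def conjugate_gradients(fvp, b, n_iters=10, residual_tol=1e-10, eps=1e-8):
    """Approximately solve H x = b, given only the product fvp(v) = H v."""
    x = torch.zeros_like(b)
    r = b.clone()  # residual b - Hx (x starts at zero)
    p = r.clone()  # search direction
    rdotr = torch.dot(r, r)
    for _ in range(n_iters):
        hp = fvp(p)
        alpha = rdotr / (torch.dot(p, hp) + eps)
        x += alpha * p
        r -= alpha * hp
        new_rdotr = torch.dot(r, r)
        if new_rdotr < residual_tol:
            break
        # update search direction with the standard CG momentum term
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x
```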
Feasibility#
CPO may occasionally produce an infeasible iterate \(\pi_k\) due to approximation errors. To handle such cases, CPO proposes an update that purely decreases the constraint value.
This is followed by a line search, similar to before. This approach is principled because it uses the limiting search direction as the intersection of the trust region and the constraint region shrinks to zero.
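For reference, that recovery update has the following form (restated from the CPO paper as a sketch):

\[
\boldsymbol{\theta}^*=\boldsymbol{\theta}_k-\sqrt{\frac{2 \delta}{b^T H^{-1} b}}\, H^{-1} b .
\]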
Tightening Constraints via Cost Shaping#
To build a factor of safety into the algorithm and minimize the chance of constraint violations, CPO constrains upper bounds \(C_i^{+}\) on the original constraints instead of the original constraints themselves. CPO does this by cost shaping:
where \(\Delta_i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}_{+}\) correlates in some useful way with \(C_i\). In the CPO experiments there is only one constraint: states are partitioned into safe and unsafe states, and the agent suffers a safety cost of 1 for being in an unsafe state.
CPO chooses \(\Delta_i\) to be the probability of entering an unsafe state within a fixed time horizon, according to a learned model that is updated at each iteration. This choice confers the additional benefit of smoothing out sparse constraints.
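Concretely, the cost shaping above takes the following form (restated in the paper's notation, with \(\Delta_i\) the shaping term):

\[
C_i^{+}\left(s, a, s^{\prime}\right)=C_i\left(s, a, s^{\prime}\right)+\Delta_i\left(s, a, s^{\prime}\right) .
\]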
Code with OmniSafe#
Quick start#
Run CPO in OmniSafe
Here are three ways to run CPO in OmniSafe:
Run Agent from preset yaml file
Run Agent from custom config dict
Run Agent from custom terminal config
import omnisafe


env_id = 'SafetyPointGoal1-v0'

agent = omnisafe.Agent('CPO', env_id)
agent.learn()
import omnisafe


env_id = 'SafetyPointGoal1-v0'
custom_cfgs = {
    'train_cfgs': {
        'total_steps': 10000000,
        'vector_env_nums': 1,
        'parallel': 1,
    },
    'algo_cfgs': {
        'steps_per_epoch': 20000,
    },
    'logger_cfgs': {
        'use_wandb': False,
        'use_tensorboard': True,
    },
}

agent = omnisafe.Agent('CPO', env_id, custom_cfgs=custom_cfgs)
agent.learn()
We use train_policy.py as the entry file. You can train the agent with CPO simply by running train_policy.py with arguments specifying CPO and the environment.
For example, to run CPO in SafetyPointGoal1-v0 with 1 torch thread and a single vectorized environment on CPU, you can use the following command:
cd examples
python train_policy.py --algo CPO --env-id SafetyPointGoal1-v0 --parallel 1 --total-steps 10000000 --device cpu --vector-env-nums 1 --torch-threads 1
Here is the documentation for the PyTorch version of CPO.
Architecture of functions#
CPO.learn()
    CPO._env.rollout()
    CPO._update()
        CPO._buf.get()
        CPO._update_actor()
            CPO._fvp()
            conjugate_gradients()
            CPO._cpo_search_step()
        CPO._update_cost_critic()
        CPO._update_reward_critic()
Documentation of algorithm specific functions#
cpo._update_actor()
Update the policy network following these steps:
Get the policy reward performance gradient g (flattened as a vector):
theta_old = get_flat_params_from(self._actor_critic.actor)
self._actor_critic.actor.zero_grad()
loss_reward, info = self._loss_pi(obs, act, logp, adv_r)
loss_reward_before = distributed.dist_avg(loss_reward).item()
p_dist = self._actor_critic.actor(obs)

loss_reward.backward()
distributed.avg_grads(self._actor_critic.actor)

grads = -get_flat_gradients_from(self._actor_critic.actor)
Get the policy cost performance gradient b and ep_costs (flattened as a vector):
self._actor_critic.zero_grad()
loss_cost = self._loss_pi_cost(obs, act, logp, adv_c)
loss_cost_before = distributed.dist_avg(loss_cost).item()

loss_cost.backward()
distributed.avg_grads(self._actor_critic.actor)

b_grads = get_flat_gradients_from(self._actor_critic.actor)
ep_costs = self._logger.get_stats('Metrics/EpCost')[0] - self._cfgs.algo_cfgs.cost_limit
Build the Hessian-vector product based on an approximation of the \(KL\) divergence, using conjugate_gradients:
p = conjugate_gradients(self._fvp, b_grads, self._cfgs.algo_cfgs.cg_iters)  # p approximates H^{-1} b
q = xHx               # q = g^T H^{-1} g, computed earlier from the reward gradient
r = grads.dot(p)      # r = g^T H^{-1} b
s = b_grads.dot(p)    # s = b^T H^{-1} b
Divide the optimization into five cases and compute the step direction accordingly.
Determine the step direction and perform a backtracking line search along it (via _cpo_search_step()):
step_direction, accept_step = self._cpo_search_step(
    step_direction=step_direction,
    grads=grads,
    p_dist=p_dist,
    obs=obs,
    act=act,
    logp=logp,
    adv_r=adv_r,
    adv_c=adv_c,
    loss_reward_before=loss_reward_before,
    loss_cost_before=loss_cost_before,
    total_steps=20,
    violation_c=ep_costs,
    optim_case=optim_case,
)
Update actor network parameters
theta_new = theta_old + step_direction
set_param_values_to_model(self._actor_critic.actor, theta_new)
cpo._cpo_search_step()
The CPO algorithm performs a line search to ensure constraint satisfaction for both rewards and costs, following these steps:
Initialize the step size and get the old flat parameters of the policy network.
# get distance each time theta goes towards certain direction
step_frac = 1.0
# get and flatten parameters from pi-net
theta_old = get_flat_params_from(self._actor_critic.actor)
# reward improvement, g-flat as gradient of reward
expected_reward_improve = grad.dot(step_direction)
Calculate the expected reward improvement.
expected_rew_improve = g_flat.dot(step_dir)
Perform a line search to find a step that improves the surrogate while not violating the trust region.
Search for an acceptance step, ranging from 0 to the total number of steps:
for j in range(total_steps):
    new_theta = _theta_old + step_frac * step_dir
    set_param_values_to_model(self.ac.pi.net, new_theta)
    acceptance_step = j + 1
At each step of the loop, calculate the policy performance and the KL divergence.
with torch.no_grad():
    loss_pi_rew, _ = self.compute_loss_pi(data=data)
    loss_pi_cost, _ = self.compute_loss_cost_performance(data=data)
    q_dist = self.ac.pi.dist(data['obs'])
    torch_kl = torch.distributions.kl.kl_divergence(p_dist, q_dist).mean().item()
loss_rew_improve = self.loss_pi_before - loss_pi_rew.item()
cost_diff = loss_pi_cost.item() - self.loss_pi_cost_before
Step only if the surrogate is improved and within the trust region.
if not torch.isfinite(loss_pi_rew) and not torch.isfinite(loss_pi_cost):
    self.logger.log('WARNING: loss_pi not finite')
elif loss_rew_improve < 0 if optim_case > 1 else False:
    self.logger.log('INFO: did not improve improve <0')
elif cost_diff > max(-c, 0):
    self.logger.log(f'INFO: no improve {cost_diff} > {max(-c, 0)}')
elif torch_kl > self.target_kl * 1.5:
    self.logger.log(f'INFO: violated KL constraint {torch_kl} at step {j + 1}.')
else:
    self.logger.log(f'Accept step at i={j + 1}')
    break
Return appropriate step direction and acceptance step.
Configs#
Train Configs
device (str): Device to use for training, options: cpu, cuda, cuda:0, etc.
torch_threads (int): Number of threads to use for PyTorch.
total_steps (int): Total number of steps to train the agent.
parallel (int): Number of parallel agents, similar to A3C.
vector_env_nums (int): Number of vector environments.
Algorithms Configs
Note
The following configs are specific to the CPO algorithm.
cg_damping (float): Damping coefficient for conjugate gradient.
cg_iters (int): Number of iterations for conjugate gradient.
fvp_sample_freq (int): Frequency of sampling for Fisher vector product.
steps_per_epoch (int): Number of steps to update the policy network.
update_iters (int): Number of iterations to update the policy network.
batch_size (int): Batch size for each iteration.
target_kl (float): Target KL divergence.
entropy_coef (float): Coefficient of entropy.
reward_normalize (bool): Whether to normalize the reward.
cost_normalize (bool): Whether to normalize the cost.
obs_normalize (bool): Whether to normalize the observation.
kl_early_stop (bool): Whether to stop the training when KL divergence is too large.
max_grad_norm (float): Maximum gradient norm.
use_max_grad_norm (bool): Whether to use maximum gradient norm.
use_critic_norm (bool): Whether to use critic norm.
critic_norm_coef (float): Coefficient of critic norm.
gamma (float): Discount factor.
cost_gamma (float): Cost discount factor.
lam (float): Lambda for GAE-Lambda.
lam_c (float): Lambda for cost GAE-Lambda.
adv_estimation_method (str): The method to estimate the advantage.
standardized_rew_adv (bool): Whether to use standardized reward advantage.
standardized_cost_adv (bool): Whether to use standardized cost advantage.
penalty_coef (float): Penalty coefficient for cost.
use_cost (bool): Whether to use cost.
Model Configs
weight_initialization_mode (str): The type of weight initialization method.
actor_type (str): The type of actor, defaults to gaussian_learning.
linear_lr_decay (bool): Whether to use linear learning rate decay.
exploration_noise_anneal (bool): Whether to use exploration noise annealing.
std_range (list): The range of standard deviation.
Hint
actor (dictionary): parameters for the actor network
activations: tanh
hidden_sizes: [64, 64]
Hint
critic (dictionary): parameters for the critic network
activations: tanh
hidden_sizes: [64, 64]
Logger Configs
use_wandb (bool): Whether to use wandb to log the training process.
wandb_project (str): The name of the wandb project.
use_tensorboard (bool): Whether to use tensorboard to log the training process.
log_dir (str): The directory to save the log files.
window_lens (int): The length of the window to calculate the average reward.
save_model_freq (int): The frequency to save the model.
References#
Appendix#
Proof of theorem 1 (Difference between two arbitrary policies)#
Our analysis will begin with the discounted future state distribution, \(d_\pi\), which is defined as:
Let \(p_\pi^t \in \mathbb{R}^{|\mathcal{S}|}\) denote the vector with components \(p_\pi^t(s)=P\left(s_t=s \mid \pi\right)\), and let \(P_\pi \in \mathbb{R}^{|\mathcal{S}| \times|\mathcal{S}|}\) denote the transition matrix with components \(P_\pi\left(s^{\prime} \mid s\right)=\int d a P\left(s^{\prime} \mid s, a\right) \pi(a \mid s)\), as shown below:
Then \(p_\pi^t=P_\pi p_\pi^{t-1}=P_\pi^2 p_\pi^{t-2}=\ldots=P_\pi^t \mu\), where \(\mu\) is the initial state distribution. The distribution \(d_\pi\) can then be rewritten as:
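For reference, the discounted future state distribution and its rewriting take the following standard form (consistent with the notation above):

\[
d_\pi(s)=(1-\gamma) \sum_{t=0}^{\infty} \gamma^t P\left(s_t=s \mid \pi\right),
\qquad
d_\pi=(1-\gamma) \sum_{t=0}^{\infty} \gamma^t p_\pi^t=(1-\gamma)\left(I-\gamma P_\pi\right)^{-1} \mu .
\]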
Lemma 1
For any function \(f: \mathcal{S} \rightarrow \mathbb{R}\) and any policy \(\pi\) :
where \(\tau \sim \pi\) denotes \(s \sim d_\pi, a \sim \pi\) and \(s^{\prime} \sim P\).
Lemma 2
For any function \(f: \mathcal{S} \rightarrow \mathbb{R}\) and any policies \(\pi\) and \(\pi'\), define
and \(\epsilon^{f}_{\pi^{\prime}}\doteq \max_s \left|\underset{\substack{a \sim \pi' \\ s'\sim P}}{\mathbb{E}}\left[R\left(s, a, s^{\prime}\right)+\gamma f\left(s^{\prime}\right)-f(s)\right]\right|\). Then the following bounds hold:
where \(D_{T V}\) is the total variational divergence. Furthermore, the bounds are tight when \(\pi^{\prime}=\pi\), and the LHS and RHS are identically zero.
Lemma 3
The divergence between discounted future state visitation distributions, \(\Vert d_{\pi'}-d_\pi\Vert_1\), is bounded by an average divergence of the policies \(\pi\) and \(\pi'\) :
where \(D_{\mathrm{TV}}(\pi' \| \pi)[s] = \frac{1}{2}\sum_a \left|\pi'(a|s) - \pi(a|s)\right|\).
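For reference, the bound from the CPO paper reads (restated here as a sketch):

\[
\left\|d_{\pi^{\prime}}-d_\pi\right\|_1 \leq \frac{2 \gamma}{1-\gamma}
\underset{s \sim d_\pi}{\mathbb{E}}\left[D_{\mathrm{TV}}\left(\pi^{\prime} \| \pi\right)[s]\right] .
\]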
Corollary 4
Define the matrices \(G \doteq\left(I-\gamma P_\pi\right)^{-1}, \bar{G} \doteq\left(I-\gamma P_{\pi^{\prime}}\right)^{-1}\), and \(\Delta=P_{\pi^{\prime}}-P_\pi\). Then:
Thus, with Eq.21
Corollary 5
Beginning with the bounds from Lemma 2 and bounding the divergence via Lemma 3, Theorem 1 can finally be proved.
Proof
Multiplying both sides of Eq.21 by \(\left(I-\gamma P_\pi\right)\), we get:
Then take the inner product with a vector \(f \in \mathbb{R}^{|S|}\), noting that \(f\) can be picked arbitrarily.
Both sides of the above equation can be rewritten separately by:
Finally, we obtain:
Hint
Supplementary details
Proof
Note that the objective function can be represented as:
Let \(\delta^f(s, a, s^{\prime})\doteq R(s, a, s^{\prime})+\gamma f(s^{\prime})-f(s)\), then by Eq.29, we easily obtain that:
For the first term of the equation, let \(\bar{\delta}^{f}_{\pi'} \in \mathbb{R}^{|S|}\) denote the vector with components \(\bar{\delta}^{f}_{\pi'}(s)=\underset{\substack{a \sim \pi' \\ s' \sim P}}{\mathbb{E}}\left[\delta^f\left(s, a, s'\right)\right]\); then
By Hölder's inequality, for any \(p, q \in[1, \infty]\) such that \(\frac{1}{p}+\frac{1}{q}=1\), we have
Hint
Hölder’s inequality:
Let \((\mathcal{S}, \Sigma, \mu)\) be a measure space and let \(p, q \in [1, \infty]\) with \(\frac{1}{p} + \frac{1}{q} = 1\). Then for all measurable real- or complex-valued functions \(f\) and \(g\) on \(\mathcal{S}\), \(\|f g\|_1 \leq\|f\|_p\|g\|_q\).
If, in addition, \(p, q \in(1, \infty)\), \(f \in L^p(\mu)\), and \(g \in L^q(\mu)\), then Hölder's inequality becomes an equality if and only if \(|f|^p\) and \(|g|^q\) are linearly dependent in \(L^1(\mu)\), meaning that there exist real numbers \(\alpha, \beta \geq 0\), not both zero, such that \(\alpha|f|^p=\beta|g|^q\) \(\mu\)-almost everywhere.
The last step is to observe that, by the importance sampling identity,
After grouping terms, the bound is obtained. The lower bound follows in the same way.
Proof
First, using Corollary 4, we obtain
Meanwhile,
And, using Corollary 5, we have,
Hint
Total variation distance of probability measures
\(\Vert d_{\pi'}-d_\pi \Vert_1=\sum_{s \in \mathcal{S}}\left|d_{\pi'}(s)-d_{\pi}(s)\right|=2 D_{\mathrm{TV}}\left(d_{\pi'} \| d_\pi\right)\)
Finally, using Eq.20, we obtain:
Proof of Analytical Solution to LQCLP#
Theorem 2 (Optimizing Linear Objective with Linear, Quadratic Constraints)
Consider the problem
where \(g, b, x \in \mathbb{R}^n, c, \delta \in \mathbb{R}, \delta>0, H \in \mathbb{S}^n\), and \(H \succ 0\). When there is at least one strictly feasible point, the optimal point \(x^*\) satisfies
where \(\lambda^*\) and \(\nu^*\) are defined by
with \(q=g^T H^{-1} g, r=g^T H^{-1} b\), and \(s=b^T H^{-1} b\).
Furthermore, let \(\Lambda_a \doteq\{\lambda \mid \lambda c-r>0, \lambda \geq 0\}\), and \(\Lambda_b \doteq\{\lambda \mid \lambda c-r \leq 0, \lambda \geq 0\}\). The value of \(\lambda^*\) satisfies
with \(\lambda^*=\lambda_a^*\) if \(f_a\left(\lambda_a^*\right)>f_b\left(\lambda_b^*\right)\) and \(\lambda^* = \lambda_b^*\) otherwise, and \(\operatorname{Proj}(x, S)\) is the projection of a point \(x\) onto a set \(S\).
Hint
The projection of a point \(x \in \mathbb{R}\) onto a convex segment \([a, b] \subset \mathbb{R}\) has value \(\operatorname{Proj}(x,[a, b])=\max (a, \min (b, x))\).
Proof for Theorem 2
This is a convex optimization problem. When there is at least one strictly feasible point, strong duality holds by Slater's theorem. We exploit strong duality to solve the problem analytically. First, using the method of Lagrange multipliers, \(\exists \lambda, \nu \geq 0\)
Because of strong duality,
\(p^*=\min_x\max_{\lambda \geq 0, \nu \geq 0} \mathcal{L}(x, \lambda, \nu)\)
Plug in \(x^*\),
\(H \in \mathbb{S}^n \Rightarrow H^T=H \Rightarrow\left(H^{-1}\right)^T=H^{-1}\)
\(\lambda \in \Lambda_a \Rightarrow \nu>0\), then plug in \(\nu=\frac{\lambda c-r}{s}\); \(\lambda \in \Lambda_b \Rightarrow \nu \leq 0\), then plug in \(\nu=0\).