Mathematical Notations#

Introduction#

In this section, we will provide an introduction to the notations used throughout this tutorial. In reinforcement learning, two fundamental notations are commonly used: linear algebra and constrained Markov decision processes. It is essential to familiarize yourself with these notations before proceeding. When reading formulas in the following chapters, if you come across a mathematical symbol whose meaning you’re unsure of, refer to the notations introduced in this chapter.

Linear Algebra#

Vector

A vector is a mathematical object representing a quantity that has both magnitude and direction. It is an ordered finite list of numbers that can be written as a vertical array surrounded by square brackets.

(1)#\[\begin{split}\boldsymbol{a} = \left[\begin{array}{r} a_1 \\ a_2 \\ \vdots \\ a_n \end{array}\right] \in \mathbb{R}^n\end{split}\]

Usually, we use a bold lowercase letter to denote a vector (e.g., \(\boldsymbol{a}=(a_1,a_2,\cdots,a_n)\in\mathbb{R}^{n}\)), with its \(i^{\text{th}}\) element written as \(\boldsymbol{a}[i]=:a_{i},~~1\leq i\leq n.\)
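
As a quick, informal illustration (not part of the formal notation), a vector can be represented in code as a one-dimensional NumPy array; note that the mathematical indexing above is 1-based, while NumPy indexing is 0-based.

```python
import numpy as np

# A vector a in R^3, written as a 1-D NumPy array.
a = np.array([1.0, 2.0, 3.0])

# Math indexing is 1-based, NumPy is 0-based,
# so a_1 corresponds to a[0].
print(a[0])      # 1.0
print(a.shape)   # (3,)
```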

Matrix

A matrix is a rectangular array of numbers, which may be real or complex. In this tutorial, we use bold capital letters to denote matrices, for example \(\mathbf{A}=(a_{i,j})\in\mathbb{R}^{m\times n}\), with its \((i,j)^{\text{th}}\) element denoted as

(2)#\[\mathbf{A}[i,j]=:a_{i,j},\]

where \(1\leq i\leq m,1\leq j\leq n\).
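
Similarly, as an informal sketch, a matrix \(\mathbf{A}\in\mathbb{R}^{m\times n}\) corresponds to a two-dimensional NumPy array, with \(a_{i,j}\) accessed by 0-based row and column indices:

```python
import numpy as np

# A matrix A in R^{2x3}, written as a 2-D NumPy array.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# The (i, j)-th element a_{i,j}; e.g. a_{1,2} corresponds to A[0, 1] here.
print(A[0, 1])   # 2.0
print(A.shape)   # (2, 3)
```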

Constrained Markov Decision Processes#

A Reinforcement Learning (RL) problem is typically formulated as an infinite-horizon discounted Markov Decision Process (MDP).

It is a tuple \(\mathcal{M}=\{\mathcal{S}, \mathcal{A}, \mathbb{P}, r, \mu, \gamma\}\), where:

Key Notations

  • \(\mathcal{S}\) is a finite set of states;

  • \(\mathcal{A}\) is a finite set of actions;

  • \(\mathbb{P}(\cdot|\cdot,\cdot)\) is the transition probability function, \(\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]\);

  • \(\mu\) is the distribution of the initial state \(s_0\), \(\mathcal{S} \rightarrow [0,1]\);

  • \(r\) is the reward function, \(\mathcal{S} \rightarrow \mathbb{R}\);

  • \(\gamma\in(0,1)\) is the discount factor.
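
To make the tuple concrete, below is a minimal sketch of a toy MDP with two states and two actions, represented with NumPy arrays; all numbers are made up purely for illustration and are not tied to any particular environment.

```python
import numpy as np

n_states, n_actions = 2, 2

# Transition probabilities P[s, a, s'] = P(s' | s, a); each P[s, a] sums to 1.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.1, 0.9]],   # transitions from state 1 under actions 0 and 1
])

# Reward r(s) for each state.
r = np.array([1.0, 0.0])

# Initial-state distribution mu and discount factor gamma.
mu = np.array([1.0, 0.0])
gamma = 0.99

assert P.shape == (n_states, n_actions, n_states)
assert np.allclose(P.sum(axis=-1), 1.0) and np.isclose(mu.sum(), 1.0)
```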

A stationary parameterized policy \(\pi_{{\boldsymbol{\theta}}}\) is a probability distribution defined on \(\mathcal{S}\times\mathcal{A}\), where \(\pi_{{\boldsymbol{\theta}}}(a|s)\) denotes the probability of playing \(a\) in state \(s\). To reduce clutter, we drop the explicit parameter and write \(\pi\) for \(\pi_{{\boldsymbol{\theta}}}\).
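
As an illustration, assuming the toy two-state, two-action setting sketched above, a tabular stochastic policy can be stored as a matrix whose rows are the distributions \(\pi(\cdot|s)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# pi[s, a] = pi(a | s); each row is a probability distribution over actions.
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
assert np.allclose(pi.sum(axis=1), 1.0)

# Sample an action in state s = 0 according to pi(. | s).
s = 0
a = rng.choice(pi.shape[1], p=pi[s])
print(a)
```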

Markov Decision Processes

Let \(J^R(\pi)\) denote the expected discounted reward of policy \(\pi\),

(3)#\[J^R(\pi) \doteq \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r\left(s_t\right)\right]\]

Here \(\tau\) denotes a trajectory \((s_0, a_0, s_1, \cdots)\), and \(\tau \sim \pi\) is shorthand indicating that the distribution over trajectories depends on the stationary parameterized policy \(\pi_{{\boldsymbol{\theta}}}\): \(s_0 \sim \mu\), \(a_t \sim \pi(\cdot|s_t)\), \(s_{t+1} \sim \mathbb{P}(\cdot | s_t, a_t)\). Meanwhile, let \(R(\tau) = \sum_{t=0}^{\infty} \gamma^t r(s_t)\) denote the discounted return of a trajectory.
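
For a finite-length trajectory, the discounted return \(R(\tau)\) follows directly from this definition; a minimal sketch, assuming the per-step rewards have already been collected in a list:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r(s_t), truncated at the trajectory length."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*1.0 = 1.81
```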

The state-action value function is

(4)#\[Q^R_{\pi} \left(s, a\right) \doteq \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) | s_0 = s, a_0 = a \right]\]

The value function is

(5)#\[V^R_{\pi}\left(s\right) \doteq \mathbb{E}_{\tau \sim \pi}\left[R(\tau) | s_0 = s\right]\]

And the advantage function is

(6)#\[A^R_{\pi}(s, a) \doteq Q^R_{\pi}(s, a)-V^R_{\pi}(s)\]
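
For tabular \(Q^R_{\pi}\) and a tabular policy, these definitions imply \(V^R_{\pi}(s)=\sum_{a}\pi(a|s)Q^R_{\pi}(s,a)\) and \(A^R_{\pi}=Q^R_{\pi}-V^R_{\pi}\), which translate directly into array operations; a sketch with made-up numbers:

```python
import numpy as np

# Made-up tabular Q^R_pi values, shape (n_states, n_actions).
Q = np.array([[1.0, 0.5],
              [0.2, 0.8]])

# Tabular policy pi(a | s); rows sum to 1.
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# V^R_pi(s) = sum_a pi(a|s) Q^R_pi(s, a)  (expectation of Q under the policy).
V = (pi * Q).sum(axis=1)

# A^R_pi(s, a) = Q^R_pi(s, a) - V^R_pi(s).
A = Q - V[:, None]

print(V)  # [0.85 0.56]
print(A)
```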

Let \(\mathbb{P}_{\pi}\left(s'\mid s\right)\) denote the one-step state transition probability from \(s\) to \(s'\) when executing \(\pi\),

(7)#\[\mathbb{P}_{\pi}\left(s'\mid s\right)=\sum_{a\in\mathcal{A}}\pi\left(a\mid s\right) \mathbb{P}\left(s'\mid s,a\right)\]
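
Equation (7) is a sum over actions, i.e. a contraction of \(\pi\) and \(\mathbb{P}\) along the action axis; a sketch with the same made-up tabular arrays as above:

```python
import numpy as np

# P[s, a, s'] = P(s' | s, a), pi[s, a] = pi(a | s)  (made-up values).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a)
P_pi = np.einsum('sa,sap->sp', pi, P)

assert np.allclose(P_pi.sum(axis=1), 1.0)  # each row is a distribution over s'
print(P_pi)
```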

Then, for any initial state \(s_0\), we have

(8)#\[\mathbb{P}_{\pi}\left(s_t=s\mid s_0\right)=\sum_{s'\in\mathcal{S}} \mathbb{P}_{\pi}\left(s_t=s\mid s_{t-1}=s'\right)\mathbb{P}_{\pi}\left(s_{t-1}=s'\mid s_0\right)\]

where \(s_0 \sim \mu\) and the actions are chosen according to \(\pi\).

Let \(d_{\pi}\) denote the (unnormalized) discounted state visitation distribution, where \(\mathbb{P}_{\pi}\left(s_t=s\mid s_0\right)\), defined above, is the probability of being in state \(s\) at time \(t\) when starting from \(s_0 \sim \mu\) and following \(\pi\):

(9)#\[\begin{split}\begin{aligned} d_{\pi}(s)&=\sum_{t=0}^{\infty} \gamma^t \mathbb{P}_{\pi}\left(s_t=s \mid s_0\right)\\ &=\mathbb{P}_{\pi}\left(s_0=s\right)+\gamma \mathbb{P}_{\pi}\left(s_1=s\mid s_0\right)+\gamma^2 \mathbb{P}_{\pi}\left(s_2=s\mid s_0\right)+\cdots \end{aligned}\end{split}\]
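
For a finite MDP, the series in Equation (9) can be evaluated in closed form: stacking \(d_{\pi}(s)\) into a row vector and averaging over \(s_0 \sim \mu\) gives \(d_{\pi} = \mu^\top \sum_{t \geq 0}(\gamma P_{\pi})^t = \mu^\top (I - \gamma P_{\pi})^{-1}\). A sketch with the toy numbers used earlier:

```python
import numpy as np

gamma = 0.99
mu = np.array([1.0, 0.0])

# One-step transition matrix under pi, P_pi[s, s'] (from the previous snippet).
P_pi = np.array([[0.69, 0.31],
                 [0.26, 0.74]])

# d_pi = mu^T * sum_{t>=0} (gamma * P_pi)^t = mu^T (I - gamma * P_pi)^{-1}
d_pi = mu @ np.linalg.inv(np.eye(len(mu)) - gamma * P_pi)

print(d_pi)                         # unnormalized: entries sum to 1 / (1 - gamma)
print(d_pi.sum(), 1 / (1 - gamma))  # both equal 100.0
```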

Constrained Markov Decision Processes

A Constrained Markov Decision Process (CMDP) extends the MDP framework by augmenting it with constraints that restrict the set of feasible policies. Specifically, we introduce a set \(C\) of auxiliary cost functions \(C_1, \cdots, C_m\) and cost limits \(d_1, \cdots, d_m\), where each \(C_i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}\) maps transition tuples to costs.

Let \(J^{C_i}(\pi)\) denote the expected discounted return of policy \(\pi\) with respect to the cost function \(C_i\),

(10)#\[\begin{aligned} J^{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})] \end{aligned}\]
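
In practice, \(J^{C_i}(\pi)\) is often estimated by Monte Carlo: sample trajectories under \(\pi\), compute the discounted cost return of each, and average. A minimal sketch, assuming per-step costs have already been collected for a small batch of made-up trajectories:

```python
import numpy as np

def discounted_cost_return(costs, gamma=0.99):
    """C_i(tau) = sum_t gamma^t * C_i(s_t, a_t, s_{t+1}), truncated at trajectory length."""
    costs = np.asarray(costs, dtype=float)
    return float(np.dot(gamma ** np.arange(len(costs)), costs))

# Made-up per-step costs for three sampled trajectories.
trajectories = [[0.0, 1.0, 0.0], [1.0, 1.0], [0.0, 0.0, 0.0, 1.0]]

# Monte Carlo estimate of J^{C_i}(pi): average cost return over trajectories.
J_C = np.mean([discounted_cost_return(c, gamma=0.9) for c in trajectories])
print(J_C)
```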

Then the feasible set of stationary parameterized policies for a CMDP is

(11)#\[\begin{aligned} \Pi_{C} \doteq \{ \pi_{{\boldsymbol{\theta}}} \in \Pi~:~\forall~i, ~ J^{C_i}(\pi) \leq d_i \} \end{aligned}\]

The goal of a CMDP is to find the optimal policy \(\pi^{*}\):

(12)#\[\begin{aligned} \label{def:problem-setting} \pi^{*}=\arg\max_{\pi_{\boldsymbol{\theta}} \in\Pi_{C}} J^R(\pi_{{\boldsymbol{\theta}}}) \end{aligned}\]
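
Given estimates of \(J^R\) and \(J^{C_i}\) for a handful of candidate policies, Equations (11) and (12) amount to discarding the candidates that violate any cost limit and maximizing the reward return among the rest. The brute-force sketch below only illustrates the definitions with made-up numbers; it is not a practical CMDP solver.

```python
import numpy as np

# Made-up estimates for three candidate policies: reward return and one cost return each.
J_R = np.array([10.0, 12.0, 15.0])     # J^R(pi_k)
J_C = np.array([[2.0], [4.5], [9.0]])  # J^{C_i}(pi_k), one column per cost i
d = np.array([5.0])                    # cost limits d_i

# Feasible set: J^{C_i}(pi) <= d_i for every i.
feasible = np.all(J_C <= d, axis=1)

# pi* = argmax of J^R(pi) over the feasible set.
best = int(np.argmax(np.where(feasible, J_R, -np.inf)))
print(feasible)  # [ True  True False]
print(best)      # 1: the highest-reward candidate among the feasible ones
```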

Analogously to the reward case, we have the following definitions, where, for a cost function \(C \in \{C_1, \cdots, C_m\}\), \(C(\tau) \doteq \sum_{t=0}^{\infty} \gamma^t C(s_t, a_t, s_{t+1})\) denotes the discounted cost return of a trajectory.

The state-action value function for the cost is

(13)#\[Q^{C}_{\pi} \left(s, a\right) \doteq \mathbb{E}_{\tau \sim \pi}\left[ C(\tau) | s_0 = s, a_0 = a \right]\]

The value function for the cost is

(14)#\[V^{C}_{\pi}\left(s\right) \doteq \mathbb{E}_{\tau \sim \pi}\left[C(\tau) | s_0 = s\right]\]

And the advantage function for the cost is

(15)#\[A^{C}_{\pi}(s, a) \doteq Q^{C}_{\pi}(s, a)-V^{C}_{\pi}(s)\]

To summarize all of the above notation, we provide the following list:

  • \(\tau\) is a trajectory consisting of \(\left(s_0, a_0, s_1, a_1, \cdots\right)\).

  • \(\pi_{{\boldsymbol{\theta}}}\) or \({\boldsymbol{\theta}}\) is a stationary parameterized policy which is a probability distribution defined on \(\mathcal{S}\times\mathcal{A}\), \(\pi_{{\boldsymbol{\theta}}}(a|s)\) denotes the probability of playing \(a\) in state \(s\).

  • \(J^R(\pi_{{\boldsymbol{\theta}}}),~ J^R({\boldsymbol{\theta}})\) denote the expected discounted reward over trajectories, depending on the stationary parameterized policy \(\pi_{{\boldsymbol{\theta}}}\) or, equivalently, its parameters \({\boldsymbol{\theta}}\).

  • \(J^{C}(\pi_{{\boldsymbol{\theta}}}),~ J^{C}({\boldsymbol{\theta}})\) denote the expected discounted cost over trajectories, depending on the stationary parameterized policy \(\pi_{{\boldsymbol{\theta}}}\) or, equivalently, its parameters \({\boldsymbol{\theta}}\).

  • \(Q_{\pi_{{\boldsymbol{\theta}}}}^{R},~ Q_{{\boldsymbol{\theta}}}^{R}\) denote the state-action value function for the reward.

  • \(Q_{\pi_{{\boldsymbol{\theta}}}}^{C_i},~ Q_{{\boldsymbol{\theta}}}^{C_i}\) denote the state-action value function for the \(i^{\text{th}}\) cost.

  • \(V_{\pi_{{\boldsymbol{\theta}}}}^{R},~ V_{{\boldsymbol{\theta}}}^{R}\) denote the value function for the reward.

  • \(V_{\pi_{{\boldsymbol{\theta}}}}^{C_i},~ V_{{\boldsymbol{\theta}}}^{C_i}\) denote the value function for the \(i^{\text{th}}\) cost.

  • \(A_{\pi_{{\boldsymbol{\theta}}}}^{R},~ A_{{\boldsymbol{\theta}}}^{R}\) denote the advantage function for the reward.

  • \(A_{\pi_{{\boldsymbol{\theta}}}}^{C_i},~ A_{{\boldsymbol{\theta}}}^{C_i}\) denote the advantage function for the \(i^{\text{th}}\) cost.
