Mathematical Notations#
Introduction#
In this section, we introduce the notation used throughout this tutorial. In reinforcement learning, two families of notation are fundamental: linear algebra and constrained Markov decision processes. It is essential to familiarize yourself with both before proceeding. If, while reading the formulas in later chapters, you come across a mathematical symbol whose meaning you are unsure of, refer back to the notation introduced in this chapter.
Linear Algebra#
Vector
A vector is a mathematical object representing a quantity that has both magnitude and direction. It is an ordered finite list of numbers that can be written as a vertical array surrounded by square brackets.
Usually, we use a bold lowercase letter to denote a vector (e.g. \(\boldsymbol{a}=(a_1,a_2,\cdots,a_n)\in\mathbb{R}^{n}\)), with its \(i^{\text{th}}\) element written as \(\boldsymbol{a}[i]:=a_{i},~~1\leq i\leq n.\)
Matrix
A matrix is a collection of numbers, real or complex, arranged in a rectangular array. In this tutorial, we use bold capital letters to denote matrices, for example \(\mathbf{A}=(a_{i,j})\in\mathbb{R}^{m\times n}\), with its \((i,j)^{\text{th}}\) element denoted as \(\mathbf{A}[i,j]:=a_{i,j}\), where \(1\leq i\leq m,~1\leq j\leq n\).
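To make these indexing conventions concrete, here is a minimal sketch using NumPy (our choice purely for illustration). Note that NumPy arrays are zero-indexed, while the mathematical notation above is one-indexed:

```python
import numpy as np

# A vector a in R^n, written a = (a_1, ..., a_n).
a = np.array([1.0, 2.0, 3.0])
print(a[0])      # a_1 in the notation above (NumPy indices start at 0)

# A matrix A in R^{m x n} with (i, j)-th element a_{i,j}.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # m = 3 rows, n = 2 columns
print(A[0, 1])   # a_{1,2} in the notation above
print(A.shape)   # (3, 2), i.e. (m, n)
```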
Constrained Markov Decision Processes#
A Reinforcement Learning (RL) problem is typically formulated as an Infinite-horizon Discounted Markov Decision Process (MDP).
It is a tuple \(\mathcal{M}=\{\mathcal{S}, \mathcal{A}, \mathbb{P}, r, \mu, \gamma\}\), where:
Key Notations
\(\mathcal{S}\) is a finite set of states;
\(\mathcal{A}\) is a finite set of actions;
\(\mathbb{P}(\cdot|\cdot,\cdot)\) is the transition probability function, \(\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]\), where \(\mathbb{P}(s'|s,a)\) is the probability of transitioning to \(s'\) from \(s\) under action \(a\);
\(\mu\) is the distribution of the initial state \(s_0\), \(\mathcal{S} \rightarrow [0,1]\);
\(r\) is the reward function, \(\mathcal{S} \rightarrow \mathbb{R}\);
\(\gamma\in(0,1)\) is the discount factor.
A stationary parameterized policy \(\pi_{{\boldsymbol{\theta}}}\) is a probability distribution defined on \(\mathcal{S}\times\mathcal{A}\), where \(\pi_{{\boldsymbol{\theta}}}(a|s)\) denotes the probability of taking action \(a\) in state \(s\). To reduce clutter, we drop the explicit parameter subscript and write \(\pi\) for \(\pi_{{\boldsymbol{\theta}}}\) where no confusion arises.
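To make the tuple concrete, here is a minimal tabular sketch in Python (NumPy is used purely for illustration; the sizes and the randomly generated tables below are hypothetical, not part of the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 4, 2                    # |S| and |A|
# P[s, a, s'] = P(s' | s, a); each row over s' is a probability distribution.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.standard_normal(n_states)             # reward function r: S -> R
mu = np.full(n_states, 1.0 / n_states)        # initial state distribution
gamma = 0.99                                  # discount factor

# A stationary stochastic policy: pi[s, a] = pi(a | s).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

# One step of interaction with the MDP.
s = rng.choice(n_states, p=mu)                # s_0 ~ mu
a = rng.choice(n_actions, p=pi[s])            # a_t ~ pi(. | s_t)
s_next = rng.choice(n_states, p=P[s, a])      # s_{t+1} ~ P(. | s_t, a_t)
```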
Markov Decision Processes
Let \(J^R(\pi)\) denote the expected discounted reward of policy \(\pi\),

\[J^R(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right].\]
Here \(\tau\) denotes a trajectory \((s_0, a_0, s_1, ...)\), and \(\tau \sim \pi\) is shorthand for indicating that the distribution over trajectories depends on the stationary parameterized policy \(\pi_{{\boldsymbol{\theta}}}\): \(s_0 \sim \mu\), \(a_t \sim \pi(\cdot|s_t)\), \(s_{t+1} \sim \mathbb{P}(\cdot | s_t, a_t)\). Meanwhile, \(R(\tau)\) denotes the discounted return of a trajectory: \(R(\tau) = \sum_{t=0}^{\infty} \gamma^t r(s_t)\).
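Given these definitions, \(J^R(\pi)\) can be estimated by Monte Carlo rollouts. Below is a sketch under the same hypothetical tabular setup as above (`P`, `r`, `mu`, `pi`, `gamma`); the infinite horizon is truncated, which is justified because the tail of the sum is bounded by \(\gamma^{T} \max_s |r(s)| / (1-\gamma)\):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t r(s_t) for one (truncated) trajectory."""
    return sum(gamma**t * rew for t, rew in enumerate(rewards))

def estimate_JR(P, r, mu, pi, gamma, n_traj=1000, horizon=500, rng=None):
    """Monte Carlo estimate of J^R(pi) = E_{tau ~ pi}[R(tau)]."""
    rng = rng or np.random.default_rng()
    n_states, n_actions = pi.shape
    total = 0.0
    for _ in range(n_traj):
        s = rng.choice(n_states, p=mu)          # s_0 ~ mu
        rewards = []
        for _ in range(horizon):
            rewards.append(r[s])                # collect r(s_t)
            a = rng.choice(n_actions, p=pi[s])  # a_t ~ pi(. | s_t)
            s = rng.choice(n_states, p=P[s, a]) # s_{t+1} ~ P(. | s_t, a_t)
        total += discounted_return(rewards, gamma)
    return total / n_traj
```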
The state-action value function

\[Q^{R}_{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right],\]

the value function

\[V^{R}_{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right],\]

and the advantage function

\[A^{R}_{\pi}(s, a) = Q^{R}_{\pi}(s, a) - V^{R}_{\pi}(s).\]
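These three functions are connected by the standard one-step Bellman relation; since the reward here depends only on the state, the advantage can equivalently be written as

\[A^{R}_{\pi}(s, a) = r(s) + \gamma\, \mathbb{E}_{s' \sim \mathbb{P}(\cdot \mid s, a)}\left[ V^{R}_{\pi}(s') \right] - V^{R}_{\pi}(s),\]

which is often how the advantage is computed in practice, since it requires estimating only \(V^{R}_{\pi}\).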
Let \(\mathbb{P}_{\pi}\left(s'\mid s\right)\) denote the one-step state transition probability from \(s\) to \(s'\) by executing \(\pi\),

\[\mathbb{P}_{\pi}\left(s' \mid s\right) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathbb{P}\left(s' \mid s, a\right).\]
Then, for any initial state \(s_0 \sim \mu\), the distribution of \(s_t\) follows by applying this one-step transition repeatedly:

\[\mathbb{P}_{\pi}\left(s_t = s' \mid s_0\right) = \sum_{s \in \mathcal{S}} \mathbb{P}_{\pi}\left(s' \mid s\right) \mathbb{P}_{\pi}\left(s_{t-1} = s \mid s_0\right),\]

where \(s_0 \sim \mu\) and the actions are chosen according to \(\pi\).
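For a finite MDP, \(\mathbb{P}_{\pi}\) is simply an \(|\mathcal{S}|\times|\mathcal{S}|\) matrix, so the \(t\)-step state distribution, as well as the value, state-action value, and advantage functions defined above, can be computed exactly. A sketch reusing the hypothetical tabular arrays from earlier:

```python
import numpy as np

def policy_transition_matrix(P, pi):
    """P_pi[s, s'] = sum_a pi(a | s) * P(s' | s, a)."""
    return np.einsum("sa,sab->sb", pi, P)

def state_distribution(P_pi, mu, t):
    """Distribution of s_t: apply the one-step transition t times to mu."""
    return mu @ np.linalg.matrix_power(P_pi, t)

def exact_values(P, r, pi, gamma):
    """Solve the Bellman equations exactly for a tabular MDP."""
    P_pi = policy_transition_matrix(P, pi)
    n = P_pi.shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)      # V = r + gamma * P_pi V
    Q = r[:, None] + gamma * np.einsum("sab,b->sa", P, V) # Q(s,a) = r(s) + gamma E[V(s')]
    A = Q - V[:, None]                                    # A = Q - V
    return V, Q, A
```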
Let \(d_{\pi}\) denote the (unnormalized) discounted state visitation frequencies induced by \(\pi\),

\[d_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}_{\pi}\left(s_t = s \mid s_0 \sim \mu\right).\]
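Because \(\sum_{t=0}^{\infty} (\gamma \mathbb{P}_{\pi})^{t}\) converges for \(\gamma\in(0,1)\), \(d_{\pi}\) admits the closed form \(\mu^{\top}(I-\gamma \mathbb{P}_{\pi})^{-1}\). A sketch of this computation, using `policy_transition_matrix` from the previous snippet:

```python
import numpy as np

def discounted_visitation(P_pi, mu, gamma):
    """d_pi(s) = sum_t gamma^t P_pi(s_t = s | s_0 ~ mu),
    via the Neumann series sum_t (gamma P_pi)^t = (I - gamma P_pi)^{-1}."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu)

# Sanity check: the frequencies are unnormalized and sum to 1 / (1 - gamma), e.g.
# d = discounted_visitation(P_pi, mu, gamma)
# assert np.isclose(d.sum(), 1 / (1 - gamma))
```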
Constrained Markov Decision Processes
A Constrained Markov Decision Process (CMDP) extends the MDP framework by augmenting it with constraints that restrict the set of feasible policies. Specifically, we introduce a set \(C\) of auxiliary cost functions \(C_1, \cdots, C_m\) and cost limits \(d_1, \cdots, d_m\), where each \(C_i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}\) maps transition tuples to costs.
Let \(J^{C_i}(\pi)\) denote the expected discounted return of policy \(\pi\) with respect to the \(i^{\text{th}}\) cost function,

\[J^{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} C_i(s_t, a_t, s_{t+1}) \right].\]
So, the feasible set of stationary parameterized policies for a CMDP is

\[\Pi_{C} = \left\{ \pi \;:\; J^{C_i}(\pi) \leq d_i,~ i = 1, \cdots, m \right\}.\]
The goal of a CMDP is to find the optimal policy \(\pi^{*}\):

\[\pi^{*} = \arg\max_{\pi \in \Pi_{C}} J^{R}(\pi).\]
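Feasibility with respect to the cost limits can be checked in the same Monte Carlo style as \(J^R\). Below is a sketch; the cost tables `costs` (each a hypothetical array of shape \(|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|\)) and the limits `limits` are illustrative inputs, not part of the tutorial:

```python
import numpy as np

def estimate_JC(P, C_i, mu, pi, gamma, n_traj=1000, horizon=500, rng=None):
    """Monte Carlo estimate of J^{C_i}(pi) = E_tau[sum_t gamma^t C_i(s_t, a_t, s_{t+1})]."""
    rng = rng or np.random.default_rng()
    n_states, n_actions = pi.shape
    total = 0.0
    for _ in range(n_traj):
        s = rng.choice(n_states, p=mu)
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=pi[s])
            s_next = rng.choice(n_states, p=P[s, a])
            ret += discount * C_i[s, a, s_next]   # cost of the transition tuple
            discount *= gamma
            s = s_next
        total += ret
    return total / n_traj

def is_feasible(P, costs, limits, mu, pi, gamma):
    """pi is in Pi_C iff J^{C_i}(pi) <= d_i for every constraint i."""
    return all(estimate_JC(P, C_i, mu, pi, gamma) <= d_i
               for C_i, d_i in zip(costs, limits))
```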
Analogously to the reward case, we have the state-action value function

\[Q^{C_i}_{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} C_i(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s, a_0 = a \right],\]

the value function

\[V^{C_i}_{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} C_i(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s \right],\]

and the advantage function

\[A^{C_i}_{\pi}(s, a) = Q^{C_i}_{\pi}(s, a) - V^{C_i}_{\pi}(s).\]
To summarize the notation above, we present the following table:

| Notation | Meaning |
| --- | --- |
| \(\tau\) | a trajectory consisting of \(\left(s_0, a_0, s_1, a_1, \cdots\right)\) |
| \(\pi_{{\boldsymbol{\theta}}},~{\boldsymbol{\theta}}\) | a stationary policy parameterized by \({\boldsymbol{\theta}}\): a probability distribution defined on \(\mathcal{S}\times\mathcal{A}\), where \(\pi_{{\boldsymbol{\theta}}}(a|s)\) denotes the probability of taking action \(a\) in state \(s\) |
| \(J^R(\pi_{{\boldsymbol{\theta}}}),~ J^R({\boldsymbol{\theta}})\) | the expected discounted reward over trajectories, depending on the stationary parameterized policy \(\pi_{{\boldsymbol{\theta}}}\) (equivalently, on \({\boldsymbol{\theta}}\)) |
| \(J^{C_i}(\pi_{{\boldsymbol{\theta}}}),~ J^{C_i}({\boldsymbol{\theta}})\) | the expected discounted cost over trajectories, depending on the stationary parameterized policy \(\pi_{{\boldsymbol{\theta}}}\) (equivalently, on \({\boldsymbol{\theta}}\)) |
| \(Q_{\pi_{{\boldsymbol{\theta}}}}^{R},~ Q_{{\boldsymbol{\theta}}}^{R}\) | the state-action value function for the reward |
| \(Q_{\pi_{{\boldsymbol{\theta}}}}^{C_i},~ Q_{{\boldsymbol{\theta}}}^{C_i}\) | the state-action value function for the \(i^{\text{th}}\) cost |
| \(V_{\pi_{{\boldsymbol{\theta}}}}^{R},~ V_{{\boldsymbol{\theta}}}^{R}\) | the value function for the reward |
| \(V_{\pi_{{\boldsymbol{\theta}}}}^{C_i},~ V_{{\boldsymbol{\theta}}}^{C_i}\) | the value function for the \(i^{\text{th}}\) cost |
| \(A_{\pi_{{\boldsymbol{\theta}}}}^{R},~ A_{{\boldsymbol{\theta}}}^{R}\) | the advantage function for the reward |
| \(A_{\pi_{{\boldsymbol{\theta}}}}^{C_i},~ A_{{\boldsymbol{\theta}}}^{C_i}\) | the advantage function for the \(i^{\text{th}}\) cost |