# Probability theory
## 1. Measure theoretic foundations
### Definitions and elementary facts from measure theory
#### Measures and probability measures
>[!info] Definition
>A *discrete probability space* is a finite (or countably infinite) set $\Omega$ (called the set of outcomes),
>equipped with a function $p:\Omega\to \mathbb{R}_{\geq0}$, such that $\sum_{\omega\in\Omega}p(\omega)=1$.
The quantity $p(\omega)$ is called the *probability* of an outcome $\omega$.
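For a concrete instance of this definition, here is a minimal sketch in Python, assuming a fair six-sided die; all names are chosen here only for illustration.

```python
# A minimal sketch of a discrete probability space: a fair six-sided die.
# Omega is the outcome set and p assigns each outcome a non-negative weight summing to 1.
from fractions import Fraction

Omega = [1, 2, 3, 4, 5, 6]
p = {omega: Fraction(1, 6) for omega in Omega}

assert sum(p.values()) == 1  # the normalization condition in the definition

# The probability of an event A (a subset of Omega) is the sum of p over its outcomes.
def prob(A):
    return sum(p[omega] for omega in A if omega in p)

print(prob({2, 4, 6}))  # probability of an even outcome: 1/2
```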
>[!info] Definition
>Suppose $\Omega$ is a set. We denote by $2^{\Omega}$ the collection of its subsets.
>1. A collection $\mathcal{F}\subset2^{\Omega}$ is called a *$\sigma$-algebra on $\Omega$* if the following conditions are satisfied:
> - $\emptyset\in\mathcal{F}$;
> - if $A\in\mathcal{F}$, then $A^{c}:=\Omega\setminus A\in\mathcal{F}$;
> - if $A_{1},A_{2},\dots$ is a sequence of subsets of $\Omega$ such that $A_{i}\in\mathcal{F}$ for all $i$, then $\cup_{i=1}^{\infty}A_{i}\in\mathcal{F}$.
> 2. If $\mathcal{A} \subset 2^{\Omega}$, then $\sigma(\mathcal{A})$ is the smallest $\sigma$-algebra containing $\mathcal{A}$.
> 3. If $\Omega$ is a topological space with open sets $\tau$, then the *Borel $\sigma$-algebra* is $\mathcal{B}(\Omega,\tau)=\sigma(\tau)$.
> 4. $(\Omega,\mathcal{F})$ is a *measurable space* if $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$.
> 5. A function $\mu:\mathcal{F}\to \mathbb{R}_{\geq0}\cup\{+\infty\}$ is called a *measure* if it satisfies the following properties:
> - $\mu(\emptyset)=0;$
> - ($\sigma$-additivity or countable additivity) If $A_{1},A_{2},\dots$ is a sequence of disjoint sets (that is, $A_{i}\cap A_{j}=\emptyset$ for $i\neq j$), such that $A_{i}\in \mathcal{F}$ for all $i$, then $
> \mu(\cup_{i=1}^{\infty}A_{i})=\sum_{i=1}^{\infty}\mu(A_{i}).$
> 6. A measure $\mu$ is called *a probability measure* if $\mu(\Omega)=1$.
> 7. A *probability space* is a triple $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is a set, $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, and $\mathbb{P}$ is a probability measure on $\mathcal{F}$ ($\mathbb{P}(\Omega)=1$).
^d3fff1
>[!info] Definition - Measurable function
>A map $f: \Omega_1 \rightarrow \Omega_2$ between two measurable spaces $\left(\Omega_1, \mathcal{F}_1\right)$ and $\left(\Omega_2, \mathcal{F}_2\right)$ is called *measurable* if the preimage of any measurable set is measurable.
>[!info] Definition
>Let $\left(\Omega_1, \mathcal{F}_1\right)$ and $\left(\Omega_2, \mathcal{F}_2\right)$ be measurable spaces, $\mu$ a measure on $\left(\Omega_1, \mathcal{F}_1\right)$ and $f:\Omega_{1}\to \Omega_{2}$ a measurable function. Then the *push-forward* of $\mu$ by $f$, denoted $f_{*}(\mu)$, is the measure on $\left(\Omega_2, \mathcal{F}_2\right)$ defined by: $
> f_{*}(\mu)(B)= \mu(f^{-1}(B))
> $
^430066
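In the purely discrete case the push-forward is just a re-binning of point masses. Here is a minimal sketch, assuming $\mu$ is the uniform measure on a fair die and $f$ is the parity map; all names are illustrative.

```python
# A minimal sketch of a push-forward measure in the discrete case.
# mu is a measure on Omega1 given by point weights; f maps Omega1 into Omega2 = {0, 1}.
# f_*(mu)(B) = mu(f^{-1}(B)) becomes a sum of the weights landing in B.
from collections import defaultdict
from fractions import Fraction

mu = {k: Fraction(1, 6) for k in range(1, 7)}  # uniform measure on a fair die
f = lambda omega: omega % 2                    # parity map

pushforward = defaultdict(Fraction)
for omega, weight in mu.items():
    pushforward[f(omega)] += weight            # the mass of omega lands on f(omega)

print(dict(pushforward))                       # {1: 1/2, 0: 1/2}
```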
#### Integration
>[!info] Definition - Lebesgue integral
> Let $(\Omega,\mathcal{F},\mu)$ be a measure space and $h:\Omega\to \mathbb{R}$ a measurable function. The *integral* $\int_{\Omega} h d \mu$ is defined in three steps.
> 1. If $h=\sum_{i=1}^n a_i \mathbb{I}_{A_i}$, where $A_1, \ldots, A_n$ are measurable sets of finite measure (such $h$ are called *simple functions*), put
> $
> \int_{\Omega} h d \mu:=\sum_{i=1}^n a_i \mu\left(A_i\right)
> $
> 2. If $h$ is a non-negative measurable function, put
> $
> \int_{\Omega} h d \mu:=\sup _{\substack{g \leq h \\ g \text { simple }}} \int_{\Omega} g d \mu
> $
> 3. For a general $h$, put $
> \int_{\Omega} h d \mu:=\int_{\Omega} h \mathbb{I}_{h \geq 0} d \mu-\int_{\Omega}\left(-h \mathbb{I}_{h<0}\right) d \mu$ whenever at least one of the terms is finite; otherwise we say that the integral does not exist.
^842506
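As a numerical illustration of step 2 (not part of the definition), here is a minimal sketch assuming $h(x)=x^{2}$ on $(0,1)$ with Lebesgue measure: lower sums of simple functions constant on a uniform partition increase towards $\int_{0}^{1}x^{2}\,dx=\tfrac{1}{3}$. The function and partition are chosen only for the example.

```python
# A minimal numerical sketch of step 2: approximate the integral of h(x) = x^2 over (0,1)
# from below by simple functions constant on n equal subintervals. Since h is increasing,
# the infimum of h on each subinterval is attained at its left endpoint.
def lower_simple_integral(h, n):
    width = 1.0 / n
    total = 0.0
    for i in range(n):
        a = i * width              # left endpoint of the i-th subinterval A_i
        total += h(a) * width      # value of the simple function times the measure of A_i
    return total

h = lambda x: x * x
for n in (10, 100, 10_000):
    print(n, lower_simple_integral(h, n))  # increases towards the true value 1/3
```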
### Probability spaces
#### Random variables
>[!info] Definition
> Given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a measurable map from $\Omega$ to a measurable space $(\Omega^{\prime}, \mathcal{F}^{\prime})$ is called *a random variable* (with values in $\Omega^{\prime}$).
> If $X$ is a random variable with values in $\Omega^{\prime}$, then the measure $\mu_{X}$ on $\mathcal{F}^{\prime}$ given by $ \mu_{X}(A):=\mathbb{P}\left(X^{-1}(A)\right) $
> is called the *distribution of the random variable $X$*.
> >[!note] Remark
> >This is essentially the push-forward of $\mathbb{P}$ by $X$: $\mu_{X}=X_{*}(\mathbb{P})$.
^b736ee
>[!note] Remark
> One can write $\mathbb{P}\left(X^{-1}(A)\right)=\mathbb{P}(\{\omega \in \Omega: X(\omega) \in A\})$.
> It is customary in probability texts to use capital Latin letters for random variables (e.g., $X$ instead of $f$) and abbreviate the last formula to something like $\mathbb{P}(X \in A)$ or $\mathbb{P}(X\leq a)$ (when $a\in \mathbb{R}$).
>[!info] Definition
>A *scalar random variable* is a random variable with values in $\mathbb{R}$ (measurable with respect to the Borel $\sigma$-algebra).
>The *probability distribution function* (or *cumulative distribution*) of $X$ is the function $
> F_X(a):=\mathbb{P}(X \in(-\infty, a]) = \mathbb{P}(X\leq a)$
^dc0272
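Here is a small worked sketch of these definitions, assuming $\Omega$ is the set of 36 equally likely outcomes of two fair dice and $X$ is their sum; the choices are made only for illustration.

```python
# A minimal sketch of a scalar random variable and its distribution function.
# Omega = ordered pairs of fair-die outcomes with the uniform probability;
# X = sum of the two dice; F_X(a) = P(X <= a) is computed directly from the definition.
from fractions import Fraction
from itertools import product

Omega = list(product(range(1, 7), repeat=2))
P = Fraction(1, len(Omega))                 # probability of each single outcome
X = lambda omega: omega[0] + omega[1]       # the random variable

def F_X(a):
    return sum(P for omega in Omega if X(omega) <= a)

print(F_X(3))   # P(X <= 3) = 3/36 = 1/12
print(F_X(12))  # 1, consistent with lim_{a -> +infty} F_X(a) = 1
```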
#### Conditional probability
>[!info] Definition
>Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and $A \in \mathcal{F}$, then the *restriction* of $\mathcal{F}$ to $A$: $\mathcal{F}|_A:=\{A \cap B: B \in \mathcal{F}\}$ is a $\sigma$-algebra on $A$, and $\mu(B):=\mathbb{P}(A \cap B)$ is a measure on $\mathcal{F}|_A$.
>If $\mathbb{P}(A)=0$, this measure is identically zero; otherwise it can be normalized to be a probability measure on $A$, called the *conditional probability*: $
> \mathbb{P}(B \mid A):=\frac{\mathbb{P}(B \cap A)}{\mathbb{P}(A)} . $
>[!note] Remark
> Exchanging $A$ and $B$ in the above definition, one arrives at *Bayes' formula*:
> $
> \mathbb{P}(B \mid A)=\frac{\mathbb{P}(A \mid B) \mathbb{P}(B)}{\mathbb{P}(A)} .
> $
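A minimal numerical check of this definition and of Bayes' formula, again on the two-dice space (an illustrative choice; events are plain subsets of $\Omega$):

```python
# A minimal sketch checking the conditional-probability definition and Bayes' formula
# on the uniform space of two fair dice.
from fractions import Fraction
from itertools import product

Omega = set(product(range(1, 7), repeat=2))
def P(E):
    return Fraction(len(E), len(Omega))     # uniform probability of an event E

A = {w for w in Omega if w[0] + w[1] == 7}  # "the sum is 7"
B = {w for w in Omega if w[0] == 3}         # "the first die shows 3"

def cond(B, A):                             # P(B | A) = P(B ∩ A) / P(A)
    return P(B & A) / P(A)

print(cond(B, A))                           # 1/6
print(cond(A, B) * P(B) / P(A))             # Bayes' formula gives the same value
```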
### Dynkin's $\pi-\lambda$ theorem and uniqueness of measures
>[!info] Definition
> A collection $\mathcal{A}$ of subsets of a set $\Omega$ is called
> - a *$\pi$-system* if it is closed under intersections: $A, B \in \mathcal{A} \Rightarrow A \cap B \in \mathcal{A}$.
> - a *$\lambda$-system* if
> - $\Omega \in \mathcal{A}$;
> - if $A, B \in \mathcal{A}$ and $A \subset B$, then $B \backslash A \in \mathcal{A}$;
> - if $A_1 \subset A_2 \subset \ldots$ all belong to $\mathcal{A}$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{A}$.
> >[!note] Remark
> >The last clause can be replaced by
> >- if $A_j \in \mathcal{A}$ for $j \in \mathbb{N}$ and $A_i \cap A_j=\emptyset$ for $i \neq j$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{A}$.
>[!example]
> Assume that $\mu$ and $\nu$ are two probability measures on the same $\sigma$-algebra $\mathcal{F}$. Then $\{A \in \mathcal{F}: \mu(A)=\nu(A)\}$ is a $\lambda$-system.
**Lemma.** If $\mathcal{A}$ is both $\pi$- and $\lambda$-system, then it is a $\sigma$-algebra.
**Theorem ($\pi-\lambda$ theorem or Dynkin's lemma).** Let $\mathcal{A}$ be a $\pi$-system and $\Lambda$ be a $\lambda$-system. Then
$
\mathcal{A} \subset \Lambda \Rightarrow \sigma(\mathcal{A}) \subset \Lambda .
$
**Corollary (uniqueness of measures).**
1. If two measures $\mu_1$ and $\mu_2$ on the same $\sigma$-algebra $\mathcal{F}$ are such that $\mu_1(\Omega)=\mu_2(\Omega)< \infty$ and that they agree on a $\pi$-system $\mathcal{A}$, they agree also on $\sigma(\mathcal{A})$.
2. A probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is uniquely determined by its probability distribution function $F_{\mathbb{P}}(t):=\mathbb{P}((-\infty, t])$.
#### Usage example
>[!info] Definition
> Let $X,Y$ be scalar random variables. We say that *$X$ and $Y$ are independent* if $
> \mathbb{P}(X \in A, Y \in B)=\mathbb{P}(X \in A) \mathbb{P}(Y \in B), \quad \forall A, B \in \mathcal{B}(\mathbb{R}) . $
**Lemma.** Suppose that $
\mathbb{P}(X \leq s, Y \leq t)=\mathbb{P}(X \leq s) \mathbb{P}(Y \leq t), \quad \forall s, t \in \mathbb{R}.$ Then $X$ and $Y$ are independent.
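As a toy illustration of the lemma's hypothesis (not a proof), the following sketch checks the product rule for the two coordinates of a uniformly chosen pair of fair dice; all names are illustrative. Since these distribution functions are step functions, checking at the attained values determines both sides.

```python
# A minimal sketch checking P(X <= s, Y <= t) = P(X <= s) P(Y <= t) when
# X and Y are the two coordinates of a uniform pair of fair dice.
from fractions import Fraction
from itertools import product

Omega = list(product(range(1, 7), repeat=2))
P = Fraction(1, len(Omega))
X = lambda w: w[0]
Y = lambda w: w[1]

def joint(s, t):   # P(X <= s, Y <= t)
    return sum(P for w in Omega if X(w) <= s and Y(w) <= t)

def F(Z, t):       # P(Z <= t)
    return sum(P for w in Omega if Z(w) <= t)

assert all(joint(s, t) == F(X, s) * F(Y, t) for s in range(1, 7) for t in range(1, 7))
print("product rule holds at all attained values")
```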
### Caratheodory extension and existence of measures
>[!info] Definition
> A collection $\mathcal{R} \subset 2^{\Omega}$ of subsets of a set $\Omega$ is called a *semi-ring* if the following conditions are satisfied:
> - $\emptyset \in \mathcal{R}$;
> - it is a $\pi$-system, i.e. closed under intersections;
> - if $A, B \in \mathcal{R}$, then there exists a finite collection of disjoint sets $A_1, \ldots, A_n \in \mathcal{R}$ such that $ A \backslash B=\cup_{i=1}^n A_i $
> $\mathcal{R}$ is called a *ring* if, in addition, it is closed under finite union of disjoint sets.
>[!example]
>The following are semi-rings:
> - $\mathcal{I}:=\{[a, b): a, b \in \mathbb{R}\}$
> - $\mathcal{J}:=\{\emptyset\} \cup\{(a, b): a, b \in \mathbb{R}, a<b\} \cup\left\{\cup_{i=1}^n\left\{c_i\right\}: n \in \mathbb{N}, c_i \in \mathbb{R}\right\}$
**Lemma.** Let $\mathcal{R}_0$ be a semi-ring. Then $
\mathcal{R}:=\left\{\bigcup_{i=1}^n A_i: n \in \mathbb{N}, A_i \in \mathcal{R}_0,\left\{A_i : i=1,\dots, n \right\} \text{ disjoint }\right\}$ is a ring.
>[!info] Definition
> - We say that $\mu$ is a *finitely additive function* on a semi-ring $\mathcal{R}$ if whenever $A_1, \ldots, A_N \in \mathcal{R}$ are mutually disjoint and $A=\bigcup_{i=1}^N A_i \in \mathcal{R}$, then $\mu(A)=\sum_{i=1}^N \mu\left(A_i\right)$.
> - We say that $\mu$ is *countably subadditive* on $\mathcal{R}$ if for any $A=\bigcup_{i=1}^{\infty} A_i$ with $A, A_i \in \mathcal{R}$, we have $\mu(A) \leq \sum_{i=1}^{\infty} \mu\left(A_i\right)$.
> >[!note] Remark
> $\mu$ can be uniquely extended to the ring-extension of $\mathcal{R}$.
> - We say that a function $\mu: \mathcal{R} \rightarrow \mathbb{R}_{\geq 0}$ is a *pre-measure* on $\mathcal{R}$ if $\mu(\emptyset)=0$, $\mu$ is finitely additive and countably subadditive on $\mathcal{R}$.
^e50295
**Theorem (Caratheodory extension theorem).** Let $\mathcal{R}$ be a semi-ring on a set $\Omega$, and let $\mu: \mathcal{R} \rightarrow \mathbb{R}_{\geq 0}$ be a pre-measure. Then there exists a measure on $\sigma(\mathcal{R})$ that coincides with $\mu$ on $\mathcal{R}$.
Furthermore, if $\Omega$ is the countable union of members of $\mathcal{R}$, then the extension is unique.
*Proof idea.* Define for every $E \in 2^{\Omega}$ the *outer measure* $\mu^*(E):=\inf _{\substack{\cup_{i=1}^{\infty} A_i \supset E \\ A_i \in \mathcal{R}}} \sum_{i=1}^{\infty} \mu\left(A_i\right)$ and let $\mathcal{M}:=\left\{A \in 2^{\Omega}: \mu^*(E)=\mu^*(E \cap A)+\mu^*(E \backslash A), \forall E \in 2^{\Omega}\right\} .$
Show that $\mathcal{M}$ is a $\sigma$-algebra containing $\mathcal{R}$ and that $\mu^{\ast}$ is a measure on $\mathcal{M}$ extending $\mu$.
>[!example] Example + Definition (Lebesgue measure)
> There exists a unique measure $\lambda$ on $\mathbb{R}$ such that for any $a<b$, $\lambda([a, b))=b-a$.
> *Proof.* Use the extension of the semi-ring $\mathcal{J}$ above (open intervals and finite sets of points), with the pre-measure:
> $\operatorname{vol}\left(\bigcup_{j=1}^M (a_{j},b_{j}) \cup \bigcup_{j=1}^N\left\{c_j\right\}\right)=\sum_{j=1}^M\left(b_j-a_j\right)$
**Lemma.** For any scalar random variable $X$, the function $F_X$ is non-decreasing, right-continuous (i.e., $F_X\left(a_i\right) \rightarrow F_X(a)$ whenever $a_i \searrow a$), and has limits $\lim _{a \rightarrow+\infty} F_X(a)=1$ and $\lim _{a \rightarrow-\infty} F_X(a)=0$. Conversely, if $F$ is any function with these properties, then there exists a probability measure $\mu$ on $\mathcal{B}(\mathbb{R})$ such that $F(a)=\mu((-\infty, a])$.
### Expectation
>[!info] Definition
>The *expectation* of a real-valued random variable $X$ is defined as the [[Probability theory - notes#Integration|(Lebesgue) integral]] over the measure space:
> $
> \mathbb{E}(X)=\int_{\Omega} X d \mathbb{P}
> $
^89fd83
> [!tip] Intuition
> The expectation, or expected value, of a random variable represents a weighted average of the outcomes, where the weights are the respective probabilities of the outcomes. Thus, if $X$ has finitely many possible values $\{x_{1},\dots, x_{n}\}$, this reduces to:
> $ \mathbb{E}(X)= \sum_{i=1}^{n} x_{i} \mathbb{P}(X=x_{i})$
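To see the finite-outcome formula agree with the defining integral, here is a minimal sketch for the sum of two fair dice (an illustrative choice; all names are hypothetical):

```python
# A minimal sketch of E(X) = sum_i x_i P(X = x_i) for X = sum of two fair dice,
# checked against the defining integral, which here is a finite sum over Omega.
from fractions import Fraction
from itertools import product
from collections import Counter

Omega = list(product(range(1, 7), repeat=2))
P = Fraction(1, len(Omega))
X = lambda w: w[0] + w[1]

direct = sum(X(w) * P for w in Omega)                 # integral of X over Omega
counts = Counter(X(w) for w in Omega)                 # number of outcomes with each value
weighted = sum(x * c * P for x, c in counts.items())  # sum_i x_i P(X = x_i)

print(direct, weighted)  # both equal 7
```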
>[!note] Terminology
> "Almost surely" = "except on a measure 0 set".
**Proposition**. The expectation satisfies the following properties:
- (linearity) if $\alpha, \beta \in \mathbb{R}$, and $\mathbb{E} X$ and $\mathbb{E} Y$ exist, then $\mathbb{E}(\alpha X+\beta Y)$ exists, and
$
\mathbb{E}(\alpha X+\beta Y)=\alpha \mathbb{E}(X)+\beta \mathbb{E}(Y) ;
$
- (monotonicity) if $X, Y$ are measurable such that $0 \leq X(\omega) \leq Y(\omega)$ for all $\omega \in \Omega$ and $\mathbb{E} Y$ exists, then $\mathbb{E}(X) \leq \mathbb{E}(Y)$.
- (monotone convergence theorem) if $X_i \geq 0$ are measurable and $X_i \nearrow X$ almost surely, then $\mathbb{E}\left(X_i\right) \rightarrow \mathbb{E}(X)$.
>[!warning]
>It is, in general, not true that $X_n \rightarrow X$ almost surely implies $\mathbb{E} X_n \rightarrow \mathbb{E} X$:
**EXAMPLE (Growing bump).** Let $\Omega=(0,1)$ with Lebesgue measure $\lambda$, and $X_n=n \mathbb{I}_{\left(0, \frac{1}{n}\right)}$. Then $X_n(\omega) \rightarrow 0$ for any $\omega \in(0,1)$, but $\mathbb{E} X_n=n \lambda\left(\left(0, \frac{1}{n}\right)\right) \equiv 1$.
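A short numerical sketch of this example; the evaluation point and the sampled values of $n$ are chosen only for illustration.

```python
# A minimal numerical sketch of the growing bump: X_n = n on (0, 1/n), 0 elsewhere.
# Each X_n integrates to 1 against Lebesgue measure on (0,1), yet X_n(w) -> 0 for each w.
def X(n, w):
    return n if 0 < w < 1.0 / n else 0.0

w = 0.37
print([X(n, w) for n in (1, 2, 3, 10, 100)])   # eventually 0 for any fixed w in (0,1)
print([n * (1.0 / n) for n in (1, 2, 3, 10)])  # E X_n = n * lambda((0, 1/n)) = 1 for all n
```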
**Lemma (Fatou).** If $X_n \geq 0$, then
$
\liminf \mathbb{E} X_n \geq \mathbb{E} \liminf X_n .
$
**Theorem (Lebesgue's dominated convergence theorem).** If $X_i$ are scalar random variables, $X_i \rightarrow X$ almost surely, and there exists a random variable $Y$ with $\mathbb{E}(|Y|)<\infty$ such that $\left|X_i\right| \leq|Y|$ almost surely for all $i$, then $\mathbb{E}\left(X_i\right) \rightarrow \mathbb{E}(X)$.
### Density / Radon-Nikodym derivative
^1a20af
**Lemma.** Let $f \geq 0$ be a measurable function on a measure space $(\Omega, \mathcal{F}, \mu)$ (not necessarily with finite measure). Then $\mu^{\prime}$, defined on $\mathcal{F}$ by$
\mu^{\prime}(A):=\int_A f d \mu=\int_{\Omega}\left(f \cdot \mathbb{I}_A\right) d \mu
$ is a measure on $\mathcal{F}$.
>[!info] Definition
> If, for a measure $\mu^{\prime}$, there exists a measurable function $f \geq 0$ such that $\mu^{\prime}(A) \equiv \int_A f d \mu$, then the function $f$ is called the *Radon-Nikodym derivative of $\mu^{\prime}$ with respect to $\mu$*.
> It is denoted $f=\frac{d \mu^{\prime}}{d \mu}$, or $d\mu'=fd\mu$.
>
> In the special case when $\mu$ is the Lebesgue measure (on $\mathbb{R}$ or on $\mathbb{R}^n$ ), and $\mu^{\prime}$ is a probability measure, the function $f$ is called *probability density*.
> If $\mu'=\mu_{X}$ is the probability distribution of a random variable $X$, then $f=f_{X}$ is the probability density *of* $X$, so that $d\mu_{X}=f_{X}d\lambda$.
**Theorem (Radon-Nikodym).** If $\mu,\mu'$ are $\sigma$-finite measures on the same measurable space such that $\mu(A)=0$ implies $\mu^{\prime}(A)=0$, then there exists a measurable function $f\geq 0$ such that $d \mu^{\prime}=f d \mu$.
>[!note] Remark
>If $X$ is a random variable such that its distribution function $F_{X}$ has a continuous derivative, then $F'_{X}$ is the probability density of $X$.
**Proposition (abstract change of variable theorem).** Let $(\Omega_1, \mathcal{F}_1, \mathbb{P})$ be a probability space, $(\Omega_2, \mathcal{F}_2)$ a measurable space, $X: \Omega_1 \rightarrow \Omega_2$ a random variable and $f: \Omega_2 \rightarrow \mathbb{R}$ a measurable function. Then, $
\mathbb{E}(f \circ X)=\int_{\Omega_2} f d \mu_X
$where $\mu_X$ denotes the [[Probability theory - notes#^b736ee|distribution]] of the random variable $X$.
In particular, if $X:\Omega_{1} \to \mathbb{R}$, then: $
\mathbb{E}(X)=\int_{\mathbb{R}} x d \mu_X
$ and if $X$ has a density function $f_{X}$, so that $d\mu_{X}=f_{X}dx$, this gives: $
\mathbb{E}(X)=\int_{\mathbb{R}} xf_{X}(x) dx.
$
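A minimal sketch of the last two formulas, assuming $X$ is exponentially distributed with rate 1 (so $f_X(x)=e^{-x}$ for $x\geq 0$ and $\mathbb{E}(X)=1$); the distribution, truncation range and sample size are chosen only for illustration. The sampler `random.expovariate` is from the Python standard library.

```python
# A minimal numerical sketch of E(X) = \int x f_X(x) dx = \int_Omega X dP
# for an exponential random variable with rate 1.
import math
import random

# E(X) = \int x f_X(x) dx, approximated by a left Riemann sum over a truncated range.
dx = 1e-4
riemann = sum(i * dx * math.exp(-i * dx) * dx for i in range(int(50 / dx)))

# E(X) = \int_Omega X dP, approximated by averaging independent samples of X
# (a Monte Carlo approximation of the integral over the underlying probability space).
random.seed(0)
monte_carlo = sum(random.expovariate(1.0) for _ in range(100_000)) / 100_000

print(riemann, monte_carlo)  # both close to the exact value 1
```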
![[Probability theory - glossary]]