# Probability theory

## 1. Measure theoretic foundations

### Definitions and elementary facts from measure theory

#### Measures and probability measures

>[!info] Definition
>A *discrete probability space* is a finite (or countably infinite) set $\Omega$ (called the set of outcomes),
>equipped with a function $p:\Omega\to \mathbb{R}_{\geq0}$ such that $\sum_{\omega\in\Omega}p(\omega)=1$. The quantity $p(\omega)$ is called the *probability* of an outcome $\omega$.

>[!info] Definition
>Suppose $\Omega$ is a set. We denote by $2^{\Omega}$ the collection of its subsets.
>1. A collection $\mathcal{F}\subset2^{\Omega}$ is called a *$\sigma$-algebra on $\Omega$* if the following conditions are satisfied:
>    - $\emptyset\in\mathcal{F}$;
>    - if $A\in\mathcal{F}$, then $A^{c}:=\Omega\setminus A\in\mathcal{F}$;
>    - if $A_{1},A_{2},\dots$ is a sequence of subsets of $\Omega$ such that $A_{i}\in\mathcal{F}$ for all $i$, then $\cup_{i=1}^{\infty}A_{i}\in\mathcal{F}$.
>2. If $\mathcal{A} \subset2^{\Omega}$, then $\sigma(\mathcal{A})$ is the smallest $\sigma$-algebra containing $\mathcal{A}$.
>3. If $\Omega$ is a topological space with open sets $\tau$, then the *Borel $\sigma$-algebra* is $\mathcal{B}(\Omega,\tau)=\sigma(\tau)$.
>4. $(\Omega,\mathcal{F})$ is a *measurable space* if $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$.
>5. A function $\mu:\mathcal{F}\to \mathbb{R}_{\geq0}\cup\{+\infty\}$ is called a *measure* if it satisfies the following properties:
>    - $\mu(\emptyset)=0$;
>    - ($\sigma$-additivity or countable additivity) if $A_{1},A_{2},\dots$ is a sequence of disjoint sets (that is, $A_{i}\cap A_{j}=\emptyset$ for $i\neq j$) such that $A_{i}\in \mathcal{F}$ for all $i$, then $
>      \mu(\cup_{i=1}^{\infty}A_{i})=\sum_{i=1}^{\infty}\mu(A_{i}).$
>6. A measure $\mu$ is called a *probability measure* if $\mu(\Omega)=1$.
>7. A *probability space* is a triple $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is a set, $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, and $\mathbb{P}$ is a probability measure on $\mathcal{F}$ ($\mathbb{P}(\Omega)=1$).
^d3fff1

>[!info] Definition - Measurable function
>A map $f: \Omega_1 \rightarrow \Omega_2$ between two measurable spaces $\left(\Omega_1, \mathcal{F}_1\right)$ and $\left(\Omega_2, \mathcal{F}_2\right)$ is called *measurable* if the preimage of any measurable set is measurable.

>[!info] Definition
>Let $\left(\Omega_1, \mathcal{F}_1\right)$ and $\left(\Omega_2, \mathcal{F}_2\right)$ be measurable spaces, $\mu$ a measure on $\left(\Omega_1, \mathcal{F}_1\right)$ and $f:\Omega_{1}\to \Omega_{2}$ a measurable function. Then the *push-forward* of $\mu$ by $f$, denoted $f_{*}(\mu)$, is the measure on $\left(\Omega_2, \mathcal{F}_2\right)$ defined by: $
> f_{*}(\mu)(B)= \mu(f^{-1}(B))
> $
^430066

#### Integration

>[!info] Definition - Lebesgue integral
> Let $(\Omega,\mathcal{F},\mu)$ be a measure space and $h:\Omega\to \mathbb{R}$ a measurable function. The *integral* $\int_{\Omega} h d \mu$ is defined in three steps.
> 1. If $h=\sum_{i=1}^n a_i \mathbb{I}_{A_i}$, where $a_i \in \mathbb{R}$ and $A_1, \ldots, A_n$ are measurable sets of finite measure (such $h$ are called *simple functions*), put
> $
> \int_{\Omega} h d \mu:=\sum_{i=1}^n a_i \mu\left(A_i\right)
> $
> 2. If $h$ is a non-negative measurable function, put
> $
> \int_{\Omega} h d \mu:=\sup _{\substack{g \leq h \\ g \text { simple }}} \int_{\Omega} g d \mu
> $
> 3. For a general $h$, put
> $
> \int_{\Omega} h d \mu:=\int_{\Omega} h \mathbb{I}_{h \geq 0} d \mu-\int_{\Omega}\left(-h \mathbb{I}_{h<0}\right) d \mu
> $
> whenever at least one of the terms is finite; otherwise we say that the integral does not exist.
^842506
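
>[!example]
>A minimal worked illustration of step 1, with a measure and a function chosen only for concreteness: let $\Omega=\{1,2,3\}$, $\mathcal{F}=2^{\Omega}$, and let $\mu$ be given by $\mu(\{1\})=\frac{1}{2}$, $\mu(\{2\})=\frac{1}{3}$, $\mu(\{3\})=\frac{1}{6}$. For the simple function $h=2\,\mathbb{I}_{\{1\}}+5\,\mathbb{I}_{\{2,3\}}$,
>$
>\int_{\Omega} h d \mu = 2\cdot\mu(\{1\})+5\cdot\mu(\{2,3\})=2\cdot\frac{1}{2}+5\cdot\frac{1}{2}=\frac{7}{2}.
>$
>Since $h\geq 0$, step 2 returns the same value: any simple $g\leq h$ has integral at most $\frac{7}{2}$, and $g=h$ attains it.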

### Probability spaces

#### Random variables

>[!info] Definition
> Given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a measurable map from $\Omega$ to a measurable space $(\Omega^{\prime}, \mathcal{F}^{\prime})$ is called a *random variable* (with values in $\Omega^{\prime}$).
> If $X$ is a random variable with values in $\Omega^{\prime}$, then the measure $\mu_{X}$ on $\mathcal{F}^{\prime}$ given by $
> \mu_{X}(A):=\mathbb{P}\left(X^{-1}(A)\right)
> $
> is called the *distribution of the random variable $X$*.
>
>[!note] Remark
>This is the push-forward of $\mathbb{P}$ by $X$, i.e. $\mu_{X}=X_{*}(\mathbb{P})$.
^b736ee

>[!note] Remark
> One can write $\mathbb{P}\left(X^{-1}(A)\right)=\mathbb{P}(\{\omega \in \Omega: X(\omega) \in A\})$.
> It is customary in probability texts to use capital Latin letters for random variables (e.g., $X$ instead of $f$) and to abbreviate the last formula to something like $\mathbb{P}(X \in A)$ or $\mathbb{P}(X\leq a)$ (when $a\in \mathbb{R}$).

>[!info] Definition
>A *scalar random variable* is a random variable with values in $\mathbb{R}$ (measurable with respect to the Borel $\sigma$-algebra).
>The *probability distribution function* (or *cumulative distribution function*) of $X$ is the function $
> F_X(a):=\mathbb{P}(X \in(-\infty, a]) = \mathbb{P}(X\leq a) .$
^dc0272

#### Conditional probability

>[!info] Definition
>Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $A \in \mathcal{F}$. Then the *restriction* of $\mathcal{F}$ to $A$, $\mathcal{F}|_A:=\{A \cap B: B \in \mathcal{F}\}$, is a $\sigma$-algebra on $A$, and $\mu(B):=\mathbb{P}(A \cap B)$ is a measure on $\mathcal{F}|_A$.
>If $\mathbb{P}(A)=0$, this measure is identically zero; otherwise it can be normalized to a probability measure on $A$, called the *conditional probability*: $
> \mathbb{P}(B \mid A):=\frac{\mathbb{P}(B \cap A)}{\mathbb{P}(A)} . $

>[!note] Remark
> Exchanging $A$ and $B$ in the above definition, one arrives at *Bayes' formula*:
> $
> \mathbb{P}(B \mid A)=\frac{\mathbb{P}(A \mid B) \mathbb{P}(B)}{\mathbb{P}(A)} .
> $
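
>[!example]
>A standard illustration of Bayes' formula, with made-up numbers chosen only so that the arithmetic is clean: suppose a condition has prevalence $\mathbb{P}(B)=0.01$, a test detects it with probability $\mathbb{P}(A \mid B)=0.99$, and gives a false positive with probability $\mathbb{P}(A \mid B^{c})=0.05$. Writing
>$
>\mathbb{P}(A)=\mathbb{P}(A\mid B)\mathbb{P}(B)+\mathbb{P}(A\mid B^{c})\mathbb{P}(B^{c})=0.99\cdot 0.01+0.05\cdot 0.99=0.0594,
>$
>Bayes' formula gives $\mathbb{P}(B \mid A)=\frac{0.99\cdot 0.01}{0.0594}=\frac{1}{6}\approx 0.17$: even after a positive test, the condition remains fairly unlikely.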

### Dynkin's $\pi-\lambda$ theorem and uniqueness of measures

>[!info] Definition
> A collection $\mathcal{A}$ of subsets of a set $\Omega$ is called
> - a *$\pi$-system* if it is closed under intersections: $A, B \in \mathcal{A} \Rightarrow A \cap B \in \mathcal{A}$;
> - a *$\lambda$-system* if
>    - $\Omega \in \mathcal{A}$;
>    - if $A, B \in \mathcal{A}$ and $A \subset B$, then $B \backslash A \in \mathcal{A}$;
>    - if $A_1 \subset A_2 \subset \ldots$ all belong to $\mathcal{A}$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{A}$.
>
>[!note] Remark
>The last clause can be replaced by
>>- if $A_j \in \mathcal{A}$ for $j \in \mathbb{N}$ and $A_i \cap A_j=\emptyset$ for $i \neq j$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{A}$.

>[!example]
> Assume that $\mu$ and $\nu$ are two probability measures on the same $\sigma$-algebra $\mathcal{F}$. Then $\{A \in \mathcal{F}: \mu(A)=\nu(A)\}$ is a $\lambda$-system.

**Lemma.** If $\mathcal{A}$ is both a $\pi$-system and a $\lambda$-system, then it is a $\sigma$-algebra.

**Theorem ($\pi-\lambda$ theorem, or Dynkin's lemma).** Let $\mathcal{A}$ be a $\pi$-system and $\Lambda$ be a $\lambda$-system. Then
$
\mathcal{A} \subset \Lambda \Rightarrow \sigma(\mathcal{A}) \subset \Lambda .
$

**Corollary (uniqueness of measures).**
1. If two measures $\mu_1$ and $\mu_2$ on the same $\sigma$-algebra $\mathcal{F}$ are such that $\mu_1(\Omega)=\mu_2(\Omega)< \infty$ and they agree on a $\pi$-system $\mathcal{A}$, then they also agree on $\sigma(\mathcal{A})$.
2. A probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is uniquely determined by its probability distribution function $F_{\mathbb{P}}(t):=\mathbb{P}((-\infty, t])$.

#### Usage example

>[!info] Definition
> Let $X,Y$ be scalar random variables. We say that *$X$ and $Y$ are independent* if $
> \mathbb{P}(X \in A, Y \in B)=\mathbb{P}(X \in A) \mathbb{P}(Y \in B), \quad \forall A, B \in \mathcal{B}(\mathbb{R}) . $

**Lemma.** Suppose that
$
\mathbb{P}(X \leq s, Y \leq t)=\mathbb{P}(X \leq s) \mathbb{P}(Y \leq t), \quad \forall s, t \in \mathbb{R} .
$
Then $X$ and $Y$ are independent.
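
>[!note] Proof sketch
>One way to deduce the lemma from the corollary: for fixed $s$, the two measures $B \mapsto \mathbb{P}(X \leq s, Y \in B)$ and $B \mapsto \mathbb{P}(X \leq s)\,\mathbb{P}(Y \in B)$ have the same finite total mass $\mathbb{P}(X\leq s)$ and agree on the $\pi$-system $\{(-\infty, t]: t \in \mathbb{R}\}$, which generates $\mathcal{B}(\mathbb{R})$; hence they agree on all of $\mathcal{B}(\mathbb{R})$. Repeating the argument in the first variable, with $B \in \mathcal{B}(\mathbb{R})$ now fixed, gives the claim for all $A, B \in \mathcal{B}(\mathbb{R})$.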

### Caratheodory extension and existence of measures

>[!info] Definition
> A collection $\mathcal{R} \subset 2^{\Omega}$ of subsets of a set $\Omega$ is called a *semi-ring* if the following conditions are satisfied:
> - $\emptyset \in \mathcal{R}$;
> - it is a $\pi$-system, i.e. closed under intersections;
> - if $A, B \in \mathcal{R}$, then there exists a finite collection of disjoint sets $A_1, \ldots, A_n \in \mathcal{R}$ such that $
> A \backslash B=\cup_{i=1}^n A_i .
> $
>
> $\mathcal{R}$ is called a *ring* if, in addition, it is closed under finite unions of disjoint sets.

>[!example]
>The following are semi-rings:
> - $\mathcal{I}:=\{[a, b): a, b \in \mathbb{R}\}$;
> - $\mathcal{J}:=\{\emptyset\} \cup\{(a, b): a, b \in \mathbb{R}, a<b\} \cup\left\{\cup_{i=1}^n\left\{c_i\right\}: n \in \mathbb{N}, c_i \in \mathbb{R}\right\}$.

**Lemma.** Let $\mathcal{R}_0$ be a semi-ring. Then
$
\mathcal{R}:=\left\{\bigcup_{i=1}^n A_i: n \in \mathbb{N}, A_i \in \mathcal{R}_0,\left\{A_i : i=1,\dots, n \right\} \text{ disjoint }\right\}
$
is a ring.

>[!info] Definition
> - We say that $\mu$ is a *finitely additive function* on a semi-ring $\mathcal{R}$ if whenever $A_1, \ldots, A_N \in \mathcal{R}$ are mutually disjoint and $A=\bigcup_{i=1}^N A_i \in \mathcal{R}$, then $\mu(A)=\sum_{i=1}^N \mu\left(A_i\right)$.
> - We say that $\mu$ is *countably subadditive* on $\mathcal{R}$ if for any $A=\bigcup_{i=1}^{\infty} A_i$ with $A, A_i \in \mathcal{R}$, we have $\mu(A) \leq \sum_{i=1}^{\infty} \mu\left(A_i\right)$.
>
>[!note] Remark
> $\mu$ can be uniquely extended to the ring-extension of $\mathcal{R}$ constructed in the Lemma above.
>
> - We say that a function $\mu: \mathcal{R} \rightarrow \mathbb{R}_{\geq 0}$ is a *pre-measure* on $\mathcal{R}$ if $\mu(\emptyset)=0$ and $\mu$ is finitely additive and countably subadditive on $\mathcal{R}$.
^e50295

**Theorem (Caratheodory extension theorem).** Let $\mathcal{R}$ be a semi-ring on a set $\Omega$, and let $\mu: \mathcal{R} \rightarrow \mathbb{R}_{\geq 0}$ be a pre-measure. Then there exists a measure on $\sigma(\mathcal{R})$ that coincides with $\mu$ on $\mathcal{R}$. Furthermore, if $\Omega$ is a countable union of members of $\mathcal{R}$, then the extension is unique.

*Proof idea.* Define for every $E \in 2^{\Omega}$ the *outer measure*
$
\mu^*(E):=\inf _{\substack{\cup_{i=1}^{\infty} A_i \supset E \\ A_i \in \mathcal{R}}} \sum_{i=1}^{\infty} \mu\left(A_i\right)
$
and let
$
\mathcal{M}:=\left\{A \in 2^{\Omega}: \mu^*(E)=\mu^*(E \cap A)+\mu^*(E \backslash A), \forall E \in 2^{\Omega}\right\} .
$
Show that $\mathcal{M}$ is a $\sigma$-algebra containing $\mathcal{R}$ and that $\mu^{\ast}$ is a measure on $\mathcal{M}$ extending $\mu$.

>[!example] Example + Definition (Lebesgue measure)
> There exists a unique measure $\lambda$ on $\mathcal{B}(\mathbb{R})$ such that for any $a<b$, $\lambda([a, b))=b-a$.
> *Proof.* Apply the extension theorem to the semi-ring $\mathcal{J}$ above (whose ring-extension consists of finite disjoint unions of open intervals and singletons), with the pre-measure
> $
> \operatorname{vol}\left(\cup_{j=1}^M (a_{j},b_{j}) \cup_{j=1}^N\left\{c_j\right\}\right)=\sum_{j=1}^M\left(b_j-a_j\right).
> $
> Uniqueness follows since $\mathbb{R}=\cup_{n=1}^{\infty}(-n, n)$ is a countable union of members of $\mathcal{J}$.

**Lemma.** For any scalar random variable $X$, $F_X(a)$ is non-decreasing, right-continuous (i.e., $F_X\left(a_i\right) \rightarrow F_X(a)$ whenever $a_i \searrow a$), and has limits $\lim _{a \rightarrow+\infty} F_X(a)=1$ and $\lim _{a \rightarrow-\infty} F_X(a)=0$. Conversely, if $F$ is any function with these properties, then there exists a probability measure $\mu$ on $\mathcal{B}(\mathbb{R})$ such that $F(a)=\mu((-\infty, a])$.

### Expectation

>[!info] Definition
>The *expectation* of a real-valued random variable $X$ is defined as the [[Probability theory - notes#Integration|(Lebesgue) integral]] over the measure space:
> $
> \mathbb{E}(X)=\int_{\Omega} X d \mathbb{P}
> $
^89fd83

> [!tip] Intuition
> The expectation, or expected value, of a random variable represents a weighted average of its possible values, where the weights are the respective probabilities. Thus, if $X$ has finitely many possible values $\{x_{1} ,\dots, x_{n}\}$, the definition reduces to:
> $
> \mathbb{E}(X)= \sum_{i=1}^{n} x_{i} \mathbb{P}(X=x_{i})
> $

>[!note] Terminology
> "Almost surely" = "except on a set of measure $0$".

**Proposition.** The expectation satisfies the following properties:
- (linearity) if $\alpha, \beta \in \mathbb{R}$, and $\mathbb{E} X$ and $\mathbb{E} Y$ exist and are finite, then $\mathbb{E}(\alpha X+\beta Y)$ exists, and
$
\mathbb{E}(\alpha X+\beta Y)=\alpha \mathbb{E}(X)+\beta \mathbb{E}(Y) ;
$
- (monotonicity) if $X, Y$ are measurable such that $0 \leq X(\omega) \leq Y(\omega)$ for all $\omega \in \Omega$ and $\mathbb{E} Y$ exists, then $\mathbb{E}(X) \leq \mathbb{E}(Y)$;
- (monotone convergence theorem) if $X_i \geq 0$ are measurable and $X_i \nearrow X$ almost surely, then $\mathbb{E}\left(X_i\right) \rightarrow \mathbb{E}(X)$.

>[!warning]
>It is, in general, not true that $X_n \rightarrow X$ almost surely implies $\mathbb{E} X_n \rightarrow \mathbb{E} X$:

**Example (Growing bump).** Let $\Omega=(0,1)$ with Lebesgue measure $\lambda$, and $X_n=n \mathbb{I}_{\left(0 ; \frac{1}{n}\right)}$. Then $X_n(\omega) \rightarrow 0$ for any $\omega \in(0,1)$, but $\mathbb{E} X_n=n \lambda\left(\left(0 ; \frac{1}{n}\right)\right) \equiv 1$.

**Lemma (Fatou).** If $X_n \geq 0$, then
$
\liminf \mathbb{E} X_n \geq \mathbb{E} \liminf X_n .
$

**Theorem (Lebesgue's dominated convergence theorem).** If $X_i$ are scalar random variables, $X_i \rightarrow X$ almost surely, and there exists a random variable $Y$ with $\mathbb{E}(|Y|)<\infty$ such that $\left|X_i\right| \leq|Y|$ almost surely for all $i$, then $\mathbb{E}\left(X_i\right) \rightarrow \mathbb{E}(X)$.
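
>[!example]
>One way to see how the growing bump interacts with the last two statements: for $X_n=n \mathbb{I}_{\left(0 ; \frac{1}{n}\right)}$ one has $\liminf \mathbb{E} X_n=1 > 0=\mathbb{E} \liminf X_n$, so Fatou's inequality can be strict. Moreover, no dominating $Y$ as in the dominated convergence theorem exists here: any $Y$ with $|X_n| \leq |Y|$ almost surely for all $n$ satisfies $|Y| \geq n$ almost everywhere on $\left(\frac{1}{n+1} ; \frac{1}{n}\right)$, whence
>$
>\mathbb{E}(|Y|) \geq \sum_{n=1}^{\infty} n\left(\frac{1}{n}-\frac{1}{n+1}\right)=\sum_{n=1}^{\infty} \frac{1}{n+1}=\infty .
>$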

### Density / Radon-Nikodym derivative ^1a20af

**Lemma.** Let $f \geq 0$ be a measurable function on a measure space $(\Omega, \mathcal{F}, \mu)$ (not necessarily with finite measure). Then $\mu^{\prime}$, defined on $\mathcal{F}$ by
$
\mu^{\prime}(A):=\int_A f d \mu=\int_{\Omega}\left(f \cdot \mathbb{I}_A\right) d \mu ,
$
is a measure on $\mathcal{F}$.

>[!info] Definition
> If, for a measure $\mu^{\prime}$, there exists a measurable function $f \geq 0$ such that $\mu^{\prime}(A) \equiv \int_A f d \mu$, then the function $f$ is called the *Radon-Nikodym derivative of $\mu^{\prime}$ with respect to $\mu$*.
> It is denoted $f=\frac{d \mu^{\prime}}{d \mu}$, or $d\mu'=fd\mu$.
>
> In the special case when $\mu$ is the Lebesgue measure (on $\mathbb{R}$ or on $\mathbb{R}^n$) and $\mu^{\prime}$ is a probability measure, the function $f$ is called a *probability density*.
> If $\mu'=\mu_{X}$ is the probability distribution of a random variable $X$, then $f=f_{X}$ is the probability density *of* $X$, so that $d\mu_{X}=f_{X}d\lambda$.

**Theorem (Radon-Nikodym).** If $\mu,\mu'$ are $\sigma$-finite measures on the same measurable space such that $\mu(A)=0$ implies $\mu^{\prime}(A)=0$, then there exists a measurable function $f \geq 0$ such that $d \mu^{\prime}=f d \mu$.

>[!note] Remark
>If $X$ is a random variable such that its distribution function $F_{X}$ has a continuous derivative, then $F'_{X}$ is the probability density of $X$.

**Proposition (abstract change of variable theorem).** Let $(\Omega_1, \mathcal{F}_1, \mathbb{P})$ be a probability space, $(\Omega_2, \mathcal{F}_2)$ a measurable space, $X: \Omega_1 \rightarrow \Omega_2$ a random variable and $f: \Omega_2 \rightarrow \mathbb{R}$ a measurable function. Then
$
\mathbb{E}(f \circ X)=\int_{\Omega_2} f d \mu_X ,
$
where $\mu_X$ denotes the [[Probability theory - notes#^b736ee|distribution]] of the random variable $X$.
In particular, if $X:\Omega_{1} \to \mathbb{R}$, then
$
\mathbb{E}(X)=\int_{\mathbb{R}} x d \mu_X ,
$
and if $X$ has a density function $f_{X}$, so that $d\mu_{X}=f_{X}dx$, this gives:
$
\mathbb{E}(X)=\int_{\mathbb{R}} xf_{X}(x) dx.
$

![[Probability theory - glossary]]
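
>[!example]
>A worked instance of the last formulas, with the exponential distribution as one convenient choice: let $X$ have density $f_X(x)=e^{-x}$ for $x>0$ and $f_X(x)=0$ otherwise, i.e. $d\mu_X=f_X d\lambda$. Its distribution function is then $F_X(a)=\int_{-\infty}^{a} f_X(x) dx=1-e^{-a}$ for $a\geq 0$ (and $0$ for $a<0$), and
>$
>\mathbb{E}(X)=\int_{\mathbb{R}} x f_X(x) dx=\int_{0}^{\infty} x e^{-x} dx=1 .
>$
>The change of variable formula with $f(x)=x^{2}$ likewise gives $\mathbb{E}(X^{2})=\int_{0}^{\infty} x^{2} e^{-x} dx=2$.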