# Probability theory

Much is based on [Rick Durrett - Probability: Theory and Examples. 5th Edition](https://sites.math.duke.edu/~rtd/PTE/pte.html)

## 1. Measure theoretic foundations

### Definitions and elementary facts from measure theory

#### Measures and probability measures

>[!info] Definition
>A *discrete probability space* is a finite (or countably infinite) set $\Omega$ (called the set of outcomes),
>equipped with a function $p:\Omega\to \mathbb{R}_{\geq0}$ such that $\sum_{\omega\in\Omega}p(\omega)=1$. The quantity $p(\omega)$ is called the *probability* of an outcome $\omega$.

>[!info] Definition
>Suppose $\Omega$ is a set. We denote by $2^{\Omega}$ the collection of its subsets.
>1. A collection $\mathcal{F}\subset 2^{\Omega}$ is called a *$\sigma$-algebra on $\Omega$* if the following conditions are satisfied:
>    - $\emptyset\in\mathcal{F}$;
>    - if $A\in\mathcal{F}$, then $A^{c}:=\Omega\setminus A\in\mathcal{F}$;
>    - if $A_{1},A_{2},\dots$ is a sequence of subsets of $\Omega$ such that $A_{i}\in\mathcal{F}$ for all $i$, then $\cup_{i=1}^{\infty}A_{i}\in\mathcal{F}$.
>2. If $\mathcal{A}\subset 2^{\Omega}$, then $\sigma(\mathcal{A})$ is the smallest $\sigma$-algebra containing $\mathcal{A}$.
>3. If $\Omega$ is a topological space with open sets $\tau$, then the *Borel $\sigma$-algebra* is $\mathcal{B}(\Omega,\tau)=\sigma(\tau)$.
>4. $(\Omega,\mathcal{F})$ is a *measurable space* if $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$.
>5. A function $\mu:\mathcal{F}\to \mathbb{R}_{\geq0}\cup\{+\infty\}$ is called a *measure* (on $\mathcal{F}$) if it satisfies the following properties:
>    - $\mu(\emptyset)=0$;
>    - ($\sigma$-additivity or countable additivity) if $A_{1},A_{2},\dots$ is a sequence of disjoint sets (that is, $A_{i}\cap A_{j}=\emptyset$ for $i\neq j$) such that $A_{i}\in \mathcal{F}$ for all $i$, then $\mu(\cup_{i=1}^{\infty}A_{i})=\sum_{i=1}^{\infty}\mu(A_{i})$.
>6. A *measure space* is a triple $(\Omega, \mathcal{F}, \mu)$ such that $\Omega$ is a set, $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, and $\mu$ is a measure on $\mathcal{F}$.
>7. A measure space $(\Omega, \mathcal{F}, \mu)$ is called *$\sigma$-finite* if there is a sequence of sets $E_i \in \mathcal{F}$ such that $\mu\left(E_i\right)<\infty$ and $\Omega=\cup_{i=1}^{\infty} E_i$.
>8. A measure $\mu$ is called a *probability measure* if $\mu(\Omega)=1$.
>9. A *probability space* is a triple $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is a set, $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, and $\mathbb{P}$ is a probability measure on $\mathcal{F}$ ($\mathbb{P}(\Omega)=1$).

^d3fff1

>[!info] Definition
>Given a measure space $(\Omega, \mathcal{F}, \mu)$, a set $A \subset \Omega$ is called a *null-set* (w.r.t. $\mu$) if it is contained in a measurable set of measure zero.
>The measure $\mu$ is called *complete* if all null-sets are $\mathcal{F}$-measurable.
>If $\mu$ is not complete, we call
>$
>\overline{\mathcal{F}}:=\{A \cup B \subset \Omega: A \in \mathcal{F}, B \text { is a null-set }\}
>$
>the *completion* of $\mathcal{F}$.
>
>Note that $\overline{\mathcal{F}}$ is a $\sigma$-algebra, and $\mu$ can be extended to a measure $\bar{\mu}$ on $\overline{\mathcal{F}}$ by $\bar{\mu}(A):=\mu\left(A^{\prime}\right)$, where $A^{\prime} \subset A$ is any $\mathcal{F}$-measurable set such that $A \backslash A^{\prime}$ is a null-set.
>When extended to $\overline{\mathcal{F}}$, the measure $\mu$ is complete; we call this procedure the *completion of the measure $\mu$*.

>[!info] Definition - Measurable function
>A map $f: \Omega_1 \rightarrow \Omega_2$ between two measurable spaces $\left(\Omega_1, \mathcal{F}_1\right)$ and $\left(\Omega_2, \mathcal{F}_2\right)$ is called *measurable* if the preimage of any measurable set is measurable.
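The discrete probability space of the first definition above can be sketched directly in code. A minimal illustration (the fair-die example and the helper `prob` are ours, not from the text):

```python
from fractions import Fraction

# Outcome set of a fair die; p assigns each outcome probability 1/6.
omega = [1, 2, 3, 4, 5, 6]
p = {w: Fraction(1, 6) for w in omega}

# The defining condition of a discrete probability space.
assert sum(p.values()) == 1

def prob(event):
    """P(A) = sum of p(omega) over the outcomes in A."""
    return sum(p[w] for w in event)

even = {2, 4, 6}
print(prob(even))  # 1/2
```

Using exact `Fraction` weights keeps the additivity checks exact rather than subject to floating-point rounding.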
>[!info] Definition
>Let $\left(\Omega_1, \mathcal{F}_1\right)$ and $\left(\Omega_2, \mathcal{F}_2\right)$ be measurable spaces, $\mu$ a measure on $\left(\Omega_1, \mathcal{F}_1\right)$, and $f:\Omega_{1}\to \Omega_{2}$ a measurable function. Then the *push-forward* of $\mu$ by $f$, denoted $f_{*}(\mu)$, is the measure on $\left(\Omega_2, \mathcal{F}_2\right)$ defined by
>$
>f_{*}(\mu)(B)= \mu(f^{-1}(B)).
>$

^430066

#### Integration

>[!info] Definition - Lebesgue integral
>Let $(\Omega,\mathcal{F},\mu)$ be a measure space and $h:\Omega\to \mathbb{R}$ a measurable function. The *integral* $\int_{\Omega} h \, d \mu$ is defined in three steps.
>1. If $h=\sum_{i=1}^n a_i \mathbb{I}_{A_i}$, where $A_1, \ldots, A_n$ are measurable sets of finite measure (such $h$ are called *simple functions*), put
>$
>\int_{\Omega} h \, d \mu:=\sum_{i=1}^n a_i \mu\left(A_i\right).
>$
>2. If $h$ is a non-negative measurable function, put
>$
>\int_{\Omega} h \, d \mu:=\sup _{\substack{g \leq h \\ g \text { simple }}} \int_{\Omega} g \, d \mu.
>$
>3. For a general $h$, put
>$
>\int_{\Omega} h \, d \mu:=\int_{\Omega} h \mathbb{I}_{h \geq 0} \, d \mu-\int_{\Omega}\left(-h \mathbb{I}_{h<0}\right) d \mu
>$
>whenever at least one of the terms is finite; otherwise we say that the integral does not exist.

^842506

### Probability spaces

#### Random variables

>[!info] Definition
>Given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a measurable map from $\Omega$ to a measurable space $(\Omega^{\prime}, \mathcal{F}^{\prime})$ is called a *random variable* (with values in $\Omega^{\prime}$).
>If $X$ is a random variable with values in $\Omega^{\prime}$, then the measure $\mu_{X}$ on $\mathcal{F}^{\prime}$ given by
>$
>\mu_{X}(A):=\mathbb{P}\left(X^{-1}(A)\right)
>$
>is called the *distribution of the random variable $X$*.
>
>>[!note] Remark
>>This is precisely the push-forward of $\mathbb{P}$ by $X$.
^b736ee

>[!note] Remark
>One can write $\mathbb{P}\left(X^{-1}(A)\right)=\mathbb{P}(\{\omega \in \Omega: X(\omega) \in A\})$.
>It is customary in probability texts to use capital Latin letters for random variables (e.g., $X$ instead of $f$) and to abbreviate the last formula to something like $\mathbb{P}(X \in A)$ or $\mathbb{P}(X\leq a)$ (when $a\in \mathbb{R}$).

>[!info] Definition
>A *scalar random variable* is a random variable with values in $\mathbb{R}$ (measurable with respect to the Borel $\sigma$-algebra).
>The *probability distribution function* (or *cumulative distribution function*) of $X$ is the function
>$
>F_X(a):=\mathbb{P}(X \in(-\infty, a]) = \mathbb{P}(X\leq a).
>$

^dc0272

#### Conditional probability

>[!info] Definition
>Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $A \in \mathcal{F}$. Then the *restriction* of $\mathcal{F}$ to $A$, $\mathcal{F}|_A:=\{A \cap B: B \in \mathcal{F}\}$, is a $\sigma$-algebra on $A$, and $\mu(B):=\mathbb{P}(A \cap B)$ is a measure on $\mathcal{F}|_A$.
>If $\mathbb{P}(A)=0$, this measure is identically zero; otherwise it can be normalized to a probability measure on $A$, called the *conditional probability*:
>$
>\mathbb{P}(B \mid A):=\frac{\mathbb{P}(B \cap A)}{\mathbb{P}(A)}.
>$

>[!note] Remark
>Exchanging $A$ and $B$ in the above definition, one arrives at *Bayes' formula*:
>$
>\mathbb{P}(B \mid A)=\frac{\mathbb{P}(A \mid B) \mathbb{P}(B)}{\mathbb{P}(A)}.
>$

### Dynkin's $\pi$-$\lambda$ theorem and uniqueness of measures

>[!info] Definition
>A collection $\mathcal{A}$ of subsets of a set $\Omega$ is called
>- a *$\pi$-system* if it is closed under intersections: $A, B \in \mathcal{A} \Rightarrow A \cap B \in \mathcal{A}$;
>- a *$\lambda$-system* if
>    - $\Omega \in \mathcal{A}$;
>    - if $A, B \in \mathcal{A}$ and $A \subset B$, then $B \backslash A \in \mathcal{A}$;
>    - if $A_1 \subset A_2 \subset \ldots$ all belong to $\mathcal{A}$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{A}$.
>
>>[!note] Remark
>>The last clause can be replaced by
>>- if $A_j \in \mathcal{A}$ for $j \in \mathbb{N}$ and $A_i \cap A_j=\emptyset$ for $i \neq j$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{A}$.

>[!example]
>Assume that $\mu$ and $\nu$ are two probability measures on the same $\sigma$-algebra $\mathcal{F}$. Then $\{A \in \mathcal{F}: \mu(A)=\nu(A)\}$ is a $\lambda$-system.

**Lemma.** If $\mathcal{A}$ is both a $\pi$- and a $\lambda$-system, then it is a $\sigma$-algebra.

**Theorem ($\pi$-$\lambda$ theorem or Dynkin's lemma).** Let $\mathcal{A}$ be a $\pi$-system and $\Lambda$ be a $\lambda$-system. Then
$
\mathcal{A} \subset \Lambda \Rightarrow \sigma(\mathcal{A}) \subset \Lambda.
$

**Corollary (uniqueness of measures).**
1. If two measures $\mu_1$ and $\mu_2$ on the same $\sigma$-algebra $\mathcal{F}$ are such that $\mu_1(\Omega)=\mu_2(\Omega)<\infty$ and they agree on a $\pi$-system $\mathcal{A}$, then they agree also on $\sigma(\mathcal{A})$.
2. A probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is uniquely determined by its probability distribution function $F_{\mathbb{P}}(t):=\mathbb{P}((-\infty, t])$.

#### Usage example

>[!info] Definition
>Let $X,Y$ be scalar random variables. We say that *$X$ and $Y$ are independent* if
>$
>\mathbb{P}(X \in A, Y \in B)=\mathbb{P}(X \in A) \mathbb{P}(Y \in B), \quad \forall A, B \in \mathcal{B}(\mathbb{R}).
>$

**Lemma.** Suppose that
$
\mathbb{P}(X \leq s, Y \leq t)=\mathbb{P}(X \leq s) \mathbb{P}(Y \leq t), \quad \forall s, t \in \mathbb{R}.
$
Then $X$ and $Y$ are independent.

### Caratheodory extension and existence of measures

>[!info] Definition
>A collection $\mathcal{R} \subset 2^{\Omega}$ of subsets of a set $\Omega$ is called a *semi-ring* if the following conditions are satisfied:
>- $\emptyset \in \mathcal{R}$;
>- it is a $\pi$-system, i.e. closed under intersections;
>- if $A, B \in \mathcal{R}$, then there exists a finite collection of disjoint sets $A_1, \ldots, A_n \in \mathcal{R}$ such that $A \backslash B=\cup_{i=1}^n A_i$.
>
>$\mathcal{R}$ is called a *ring* if, in addition, it is closed under finite unions of disjoint sets.

>[!example]
>The following are semi-rings:
>- $\mathcal{I}:=\{[a, b): a, b \in \mathbb{R}\}$
>- $\mathcal{J}:=\{\emptyset\} \cup\{(a, b): a, b \in \mathbb{R}, a<b\} \cup\left\{\cup_{i=1}^n\left\{c_i\right\}: n \in \mathbb{N}, c_i \in \mathbb{R}\right\}$

**Lemma.** Let $\mathcal{R}_0$ be a semi-ring. Then
$
\mathcal{R}:=\left\{\bigcup_{i=1}^n A_i: n \in \mathbb{N}, A_i \in \mathcal{R}_0,\left\{A_i : i=1,\dots, n \right\} \text{ disjoint }\right\}
$
is a ring.

>[!info] Definition
>- We say that $\mu$ is a *finitely additive function* on a semi-ring $\mathcal{R}$ if whenever $A_1, \ldots, A_N \in \mathcal{R}$ are mutually disjoint and $A=\bigcup_{i=1}^N A_i \in \mathcal{R}$, then $\mu(A)=\sum_{i=1}^N \mu\left(A_i\right)$.
>- We say that $\mu$ is *countably subadditive* on $\mathcal{R}$ if for any $A=\bigcup_{i=1}^{\infty} A_i$ with $A, A_i \in \mathcal{R}$, we have $\mu(A) \leq \sum_{i=1}^{\infty} \mu\left(A_i\right)$.
>
>>[!note] Remark
>>In this case, $\mu$ can be uniquely extended to the ring generated by $\mathcal{R}$.
>
>- We say that a function $\mu: \mathcal{R} \rightarrow \mathbb{R}_{\geq 0}$ is a *pre-measure* on $\mathcal{R}$ if $\mu(\emptyset)=0$ and $\mu$ is finitely additive and countably subadditive on $\mathcal{R}$.

^e50295

**Theorem (Caratheodory extension theorem).** Let $\mathcal{R}$ be a semi-ring on a set $\Omega$, and let $\mu: \mathcal{R} \rightarrow \mathbb{R}_{\geq 0}$ be a pre-measure. Then there exists a measure on $\sigma(\mathcal{R})$ that coincides with $\mu$ on $\mathcal{R}$. Furthermore, if $\Omega$ is a countable union of members of $\mathcal{R}$, then the extension is unique.
*Proof idea.* Define for every $E \in 2^{\Omega}$ the *outer measure*
$
\mu^*(E):=\inf _{\substack{\cup_{i=1}^{\infty} A_i \supset E \\ A_i \in \mathcal{R}}} \sum_{i=1}^{\infty} \mu\left(A_i\right)
$
and let
$
\mathcal{M}:=\left\{A \in 2^{\Omega}: \mu^*(E)=\mu^*(E \cap A)+\mu^*(E \backslash A), \forall E \in 2^{\Omega}\right\}.
$
Show that $\mathcal{M}$ is a $\sigma$-algebra containing $\mathcal{R}$ and that $\mu^{\ast}$ is a measure on $\mathcal{M}$ extending $\mu$.

>[!example] Example + Definition (Lebesgue measure)
>There exists a unique measure $\lambda$ on $\mathbb{R}$ such that for any $a<b$, $\lambda([a, b))=b-a$.
>*Proof.* Use the extension of $\mathcal{J}$ above - the semi-ring of finite unions of open intervals and singletons, with the pre-measure
>$\operatorname{vol}\left(\cup_{j=1}^M (a_{j},b_{j}) \cup_{j=1}^N\left\{c_j\right\}\right)=\sum_{j=1}^M\left(b_j-a_j\right).$

**Lemma.** For any scalar random variable $X$, $F_X(a)$ is non-decreasing, right-continuous (i.e., $F_X\left(a_i\right) \rightarrow F_X(a)$ whenever $a_i \searrow a$), and has limits $\lim _{a \rightarrow+\infty} F_X(a)=1$ and $\lim _{a \rightarrow-\infty} F_X(a)=0$. Conversely, if $F$ is any function with these properties, then there exists a probability measure $\mu$ on $\mathcal{B}(\mathbb{R})$ such that $F(a)=\mu((-\infty, a])$.

### Expectation

>[!info] Definition
>The *expectation* of a real-valued random variable $X$ on $(\Omega,\mathcal{F},\mathbb{P})$ is defined as the [[Probability theory - notes#Integration|(Lebesgue) integral]] over the measure space:
>$
>\mathbb{E}(X)=\int_{\Omega} X \, d \mathbb{P}
>$

^89fd83

>[!tip] Intuition
>The expectation, or expected value, of a random variable represents a weighted average of the outcomes, where the weights are the respective probabilities of the outcomes.
>Thus, if $X$ has finitely many possible values $\{x_{1} ,\dots, x_{n}\}$, this reduces to
>$
>\mathbb{E}(X)= \sum_{i=1}^{n} x_{i} \mathbb{P}(X=x_{i}).
>$

>[!note] Terminology
>"Almost surely" = "except on a measure 0 set".

#### Properties of integral/expectation

##### Elementary properties

**Proposition.** The expectation satisfies the following properties:
- For a measurable set $A$, $\mathbb{E}(\mathbb{I}_{A})=\mathbb{P}(A)$;
- *Linearity*: if $\alpha, \beta \in \mathbb{R}$, and $\mathbb{E} X$ and $\mathbb{E} Y$ exist, then $\mathbb{E}(\alpha X+\beta Y)$ exists, and
$
\mathbb{E}(\alpha X+\beta Y)=\alpha \mathbb{E}(X)+\beta \mathbb{E}(Y);
$
- *Monotonicity*: if $X, Y$ are measurable such that $0 \leq X(\omega) \leq Y(\omega)$ for all $\omega \in \Omega$ and $\mathbb{E} Y$ exists, then $\mathbb{E}(X) \leq \mathbb{E}(Y)$.

Furthermore, if an operator on bounded measurable functions satisfies these properties, then it equals the expectation.

##### Convergence properties

**Theorem (monotone convergence theorem).** If $X_i \geq 0$ are measurable and $X_i \nearrow X$ almost surely, then $\mathbb{E}\left(X_i\right) \rightarrow \mathbb{E}(X)$.

>[!warning]+
>It is, in general, not true that $X_n \rightarrow X$ almost surely implies $\mathbb{E} X_n \rightarrow \mathbb{E} X$:

**Example (growing bump).** Let $\Omega=(0,1)$ with Lebesgue measure $\lambda$, and $X_n=n \mathbb{I}_{\left(0 ; \frac{1}{n}\right)}$. Then $X_n(\omega) \rightarrow 0$ for every $\omega \in(0,1)$, but $\mathbb{E} X_n=n \lambda\left(\left(0 ; \frac{1}{n}\right)\right) \equiv 1$.

**Lemma (Fatou).** If $X_n \geq 0$, then
$
\liminf \mathbb{E} X_n \geq \mathbb{E} \liminf X_n.
$

**Theorem (Lebesgue's dominated convergence theorem).** If $X_i$ are scalar random variables, $X_i \rightarrow X$ almost surely, and there exists a random variable $Y$ with $\mathbb{E}(|Y|)<\infty$ such that $\left|X_i\right| \leq|Y|$ almost surely for all $i$, then $\mathbb{E}\left(X_i\right) \rightarrow \mathbb{E}(X)$.
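The growing bump example can be checked numerically. A small sketch (the grid-based Riemann estimate of the integral is our own illustration, not part of the text):

```python
# X_n = n * 1_{(0, 1/n)} on ((0, 1), Lebesgue): pointwise X_n -> 0, yet
# E X_n = 1 for every n, so a.s. convergence alone does not force
# convergence of expectations (no integrable dominating function exists).
def E_bump(n, grid=1_000_000):
    # Riemann-style estimate of the integral of X_n over (0, 1):
    # count midpoints falling in (0, 1/n), each carrying mass 1/grid.
    return sum(n for i in range(grid) if (i + 0.5) / grid < 1.0 / n) / grid

for n in (1, 10, 100):
    assert abs(E_bump(n) - 1.0) < 0.01   # expectations stay at 1

w = 0.37                                  # a fixed sample point
assert all(n * (w < 1.0 / n) == 0 for n in (10, 100, 1000))  # X_n(w) -> 0
```

The first assertion shows $\mathbb{E}X_n \equiv 1$; the second shows $X_n(\omega) \to 0$ at a fixed point, exactly the failure mode that the dominated convergence theorem rules out.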
##### Inequalities

**Proposition.** The expectation satisfies the following useful inequalities:
- (*Cauchy-Schwarz*)
$
\mathbb{E}(X Y) \leq \sqrt{\mathbb{E}\left(X^2\right)} \cdot \sqrt{\mathbb{E}\left(Y^2\right)}.
$
- (*Hölder's inequality*) for $p, q>1$ such that $\frac{1}{p}+\frac{1}{q}=1$,
$
\mathbb{E}(X Y) \leq\left(\mathbb{E}|X|^p\right)^{\frac{1}{p}}\left(\mathbb{E}|Y|^q\right)^{\frac{1}{q}}.
$
- (*Jensen's inequality*) if $f: \mathbb{R} \rightarrow \mathbb{R}$ is a convex function and $\mathbb{E}|X|<\infty$, $\mathbb{E}(|f(X)|)<\infty$, then
$
f(\mathbb{E}(X)) \leq \mathbb{E}(f(X)).
$
Particularly useful cases are $|\mathbb{E}(X)| \leq \mathbb{E}(|X|)$ and $(\mathbb{E} X)^2 \leq \mathbb{E} X^2$.
- (*Chebyshev's inequality or Markov's inequality*) for a non-negative random variable $X$ and $a>0$,
$
\mathbb{P}(X \geq a) \leq \frac{\mathbb{E} X}{a}.
$
In particular, for an increasing function $f:[0, \infty) \rightarrow[0, \infty)$,
$
\mathbb{P}(X \geq a)=\mathbb{P}(f(X) \geq f(a)) \leq \frac{\mathbb{E} f(X)}{f(a)}.
$

#### Change of variable

**Proposition (abstract change of variable theorem).** Let $(\Omega_1, \mathcal{F}_1, \mathbb{P})$ be a probability space, $(\Omega_2, \mathcal{F}_2)$ a measurable space, $X: \Omega_1 \rightarrow \Omega_2$ a random variable and $f: \Omega_2 \rightarrow \mathbb{R}$ a measurable function. Then
$
\mathbb{E}(f \circ X)= \int_{\Omega_{1}} f(X) \, d\mathbb{P} =\int_{\Omega_2} f \, d \mu_X
$
where $\mu_X$ denotes the [[Probability theory - notes#^b736ee|distribution]] of the random variable $X$. ^change-var

In particular, if $X:\Omega_{1} \to \mathbb{R}$, then
$
\mathbb{E}(X)=\int_{\mathbb{R}} x \, d \mu_X.
$

#### Differentiating under the integral

>[!question]
>What conditions on a function $f:I \times \Omega \to \mathbb{R}$ are needed for the following equality to hold?
>$\frac{\partial}{\partial x} \int_{\Omega} f(x, \omega) d \mu(\omega)=\int_{\Omega} \frac{\partial}{\partial x} f(x, \omega) d \mu(\omega)$

>[!example] Counterexample
>Let, for $y>0$ and $x \in \mathbb{R}$,
>$
>f(x, y)= \begin{cases} e^{-\frac{y^2}{x^2}}, & x \neq 0 \\ 0, & x=0 \end{cases}
>$
>Then $\partial_x f(0, y)=0$ for all $y>0$, but by the change of variable $w=y/x$,
>$
>\left.\partial_x \int_0^{\infty} f(x, y) d y\right|_{x=0}=\int_0^{\infty} e^{-w^2} d w \neq 0.
>$

**Theorem (differentiating an integral, real version).** Let $I \subset \mathbb{R}$ be an open interval, and $(\Omega, \mathcal{F}, \mu)$ a measure space. Assume that a function $f: I \times \Omega \rightarrow \mathbb{R}$ satisfies the following properties:
- for every $x \in I$, the function $\omega \mapsto f(x, \omega)$ is integrable;
- for almost every $\omega$ and every $x \in I$, the derivative $\partial_x f(x, \omega)$ of the function $x \mapsto f(x, \omega)$ exists;
- there is a measurable function $h: \Omega \rightarrow \mathbb{R}_{\geq 0}$ such that $\int_{\Omega} h \, d \mu<\infty$ and $\left|\partial_x f(x, \omega)\right| \leq h(\omega)$ for all $x \in I$ and almost all $\omega \in \Omega$.

Then, the derivative $\varphi^{\prime}(x)$ of the function $\varphi(x):=\int_{\Omega} f(x, \omega) d \mu(\omega)$ exists at all $x \in I$, and $\varphi^{\prime}(x)= \int_{\Omega} \partial_x f(x, \omega) d \mu(\omega)$.
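The real version of the theorem can be sanity-checked numerically. A sketch with the toy choice $\varphi(x)=\int_0^1 e^{x\omega}\,d\omega$ (our own example; on $I=(0,2)$ the bound $|\partial_x f| = \omega e^{x\omega} \leq e^2$ gives an integrable dominating function, so the theorem applies):

```python
import math

# phi(x) = ∫_0^1 e^{xw} dw has closed form (e^x - 1)/x; the theorem says
# phi'(x) = ∫_0^1 w e^{xw} dw, which we estimate by the midpoint rule.
phi = lambda x: (math.exp(x) - 1.0) / x

def phi_prime_integral(x, grid=200_000):
    # midpoint-rule estimate of ∫_0^1 w e^{xw} dw
    return sum(((i + 0.5) / grid) * math.exp(x * (i + 0.5) / grid)
               for i in range(grid)) / grid

x, h = 1.0, 1e-5
numeric_derivative = (phi(x + h) - phi(x - h)) / (2.0 * h)
# at x = 1, both sides equal 1 (integrate w e^w by parts)
assert abs(numeric_derivative - phi_prime_integral(x)) < 1e-4
```

The central difference of the closed form and the integral of the derivative agree to within numerical error, as the theorem predicts.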
**Theorem (differentiating an integral, complex version).** Let $\Lambda \subset \mathbb{C}$ be an open set, $(\Omega, \mathcal{F}, \mu)$ a measure space, and assume that a function $f: \Lambda \times \Omega \rightarrow \mathbb{C}$ satisfies the following properties:
- for every $z \in \Lambda$, the function $\omega \mapsto f(z, \omega)$ is measurable;
- for almost every $\omega$, the function $z \mapsto f(z, \omega)$ is analytic in $\Lambda$;
- there is a measurable function $h: \Omega \rightarrow \mathbb{R}_{\geq 0}$ such that $\int_{\Omega} h \, d\mu<\infty$ and $|f(z, \omega)| \leq h(\omega)$ for all $z \in \Lambda$ and almost every $\omega \in \Omega$.

Then, the function $\varphi(z):=\int_{\Omega} f(z, \omega) d \mu(\omega)$ is analytic in $\Lambda$, and $\frac{\partial^n}{\partial z^n} \varphi(z)=\int_{\Omega} \frac{\partial^n}{\partial z^n} f(z, \omega) d \mu(\omega)$ for all $n=1,2, \ldots$ and for all $z \in \Lambda$.

### Density / Radon-Nikodym derivative

**Lemma.** Let $f \geq 0$ be a measurable function on a measure space $(\Omega, \mathcal{F}, \mu)$ (not necessarily with finite measure). Then $\mu^{\prime}$, defined on $\mathcal{F}$ by
$
\mu^{\prime}(A):=\int_A f \, d \mu=\int_{\Omega}\left(f \cdot \mathbb{I}_A\right) d \mu,
$
is a measure on $\mathcal{F}$.

>[!info] Definition
>If $(\Omega, \mathcal{F})$ is a measurable space, $\mu,\mu^{\prime}$ are measures on $\mathcal{F}$, and there exists a measurable function $f:\Omega\to \mathbb{R}_{\geq 0}$ such that $\mu^{\prime}(A) = \int_A f \, d \mu$ for all $A\in\mathcal{F}$, then the function $f$ is called the *Radon-Nikodym derivative of $\mu^{\prime}$ with respect to $\mu$*.
>
>It is denoted $f=\frac{d \mu^{\prime}}{d \mu}$, or $d\mu'=f\,d\mu$.

^radon-nikodym

**Theorem (Radon-Nikodym).** If $\mu,\mu'$ are $\sigma$-finite measures on the same measurable space such that $\mu(A)=0$ implies $\mu^{\prime}(A)=0$, then there exists a function $f$ such that $d \mu^{\prime}=f \, d \mu$ (i.e. the Radon-Nikodym derivative of $\mu'$ with respect to $\mu$ exists).
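On a finite set the Radon-Nikodym derivative is just a ratio of point masses, which makes the identity $\int g\,d\mu' = \int g f\,d\mu$ easy to verify. A toy sketch (the two measures and all names are our own example):

```python
# Two measures on a three-point set; mu' is absolutely continuous
# with respect to mu, and f = d(mu')/d(mu) is the ratio of point masses.
mu  = {"a": 0.5, "b": 0.3, "c": 0.2}
mup = {"a": 0.1, "b": 0.6, "c": 0.3}   # mu(A)=0 => mup(A)=0 holds trivially

f = {w: mup[w] / mu[w] for w in mu}    # Radon-Nikodym derivative

def integral(g, measure):
    """∫ g d(measure) over the finite set."""
    return sum(g(w) * measure[w] for w in measure)

g = lambda w: {"a": 1.0, "b": 2.0, "c": -1.0}[w]

# ∫ g d(mu') = ∫ g f d(mu): integrating against mu' is the same as
# integrating against mu with the density f as a weight.
assert abs(integral(g, mup) - integral(lambda w: g(w) * f[w], mu)) < 1e-12
```

Integrating the constant $g \equiv 1$ recovers $\mu'(\Omega) = \int f \, d\mu$, the density formula of the definition above.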
>[!info] Definition
>In the special case when $\mu$ is the Lebesgue measure (on $\mathbb{R}$ or on $\mathbb{R}^n$) and $\mu^{\prime}$ is a probability measure, the function $f$ satisfying $d\mu'=f\,d\mu$ is called a *probability density*.
>
>If $\mu'=\mu_{X}$ is the probability distribution of a random variable $X$, then $f=f_{X}$ is the *probability density of $X$*, so that $d\mu_{X}=f_{X}\,d\lambda$, or in other words:
>$\mathbb{P}(X\in A)= \int_{A}f_{X}\,d\lambda.$

^density

>[!example]
>A scalar random variable with probability density function
>$
>f(x)=\frac{1}{\sqrt{2 \pi}} e^{-\frac{x^2}{2}}
>$
>is called *standard Gaussian*. A scalar random variable $X$ is called *Gaussian* if $X=\sigma X^{\prime}+\mu$, where $\sigma \geq 0$, $\mu \in \mathbb{R}$, and $X^{\prime}$ is a standard Gaussian.

>[!note] Remark
>If $X$ is a random variable such that its distribution function $F_{X}$ has a continuous derivative, then $F'_{X}$ is the probability density of $X$, so that $\mathbb{P}(X \leq a) = F_{X}(a) = \int_{-\infty}^{a} F'_{X}(x) \, dx$.

>[!note] Remark
>By the [[Probability theory - notes#^change-var|change of variable theorem]], if $X$ has a [[Probability theory - notes#^density|density function]] $f_{X}$, so that $d\mu_{X}=f_{X}\,dx$:
>$
>\mathbb{E}(X)=\int_{\mathbb{R}} x \, d \mu_X=\int_{\mathbb{R}} xf_{X}(x) \, dx.
>$

### Direct products of measure spaces and Fubini's theorem

**Theorem.** If $\left(\Omega_i, \mathcal{F}_i, \mu_i\right)$, $i=1, \ldots, n$, are $\sigma$-finite measure spaces, then there is a unique measure $\mu_1 \otimes \cdots \otimes \mu_n$ on $\sigma\left(\mathcal{F}_1 \times \cdots \times \mathcal{F}_n\right)$ such that for any $A_1 \in \mathcal{F}_1, \ldots, A_n \in \mathcal{F}_n$, one has
$
\mu_1 \otimes \cdots \otimes \mu_n\left(A_1 \times \cdots \times A_n\right)=\prod_{i=1}^n \mu_i\left(A_i\right).
$

>[!note] Notation
>Denote by $\mathcal{F}_{1}\otimes\cdots\otimes \mathcal{F}_{n}$ the completion of $\sigma(\mathcal{F}_{1}\times\cdots \times \mathcal{F}_{n})$.

**Theorem (Cavalieri principle).** Given $\sigma$-finite measure spaces $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$, let $E \in \sigma\left(\mathcal{F}_1 \times \mathcal{F}_2\right)$. Then
- for all $\omega \in \Omega_1$, the set $E_\omega:=\left\{\omega^{\prime} \in \Omega_2:\left(\omega, \omega^{\prime}\right) \in E\right\}$ is $\mathcal{F}_2$-measurable;
- the function $f_E: \Omega_1 \rightarrow \mathbb{R}$, defined by $f_E(\omega):=\mu_2\left(E_\omega\right)$, is $\mathcal{F}_1$-to-$\mathcal{B}(\mathbb{R})$ measurable and
$
\mu_1 \otimes \mu_2(E)=\int_{\Omega_1} f_E \, d \mu_1.
$

**Theorem (Cavalieri principle for completed spaces).** Let $\left(\Omega_1, \mathcal{F}_1, \mu_1\right)$ and $\left(\Omega_2, \mathcal{F}_2, \mu_2\right)$ be complete $\sigma$-finite measure spaces. Assume that $E \subset \Omega_1 \times \Omega_2$ is $\mathcal{F}_1 \otimes \mathcal{F}_2$ measurable. Then
- for $\mu_1$-almost every $\omega \in \Omega_1$, the set $E_\omega:=\left\{\omega^{\prime} \in \Omega_2:\left(\omega, \omega^{\prime}\right) \in E\right\}$ is $\mathcal{F}_2$-measurable;
- the function $f_E(\omega):=\mu_2\left(E_\omega\right)$, defined almost everywhere on $\Omega_1$, is $\mathcal{F}_1$-to-$\mathcal{B}(\mathbb{R})$ measurable;
- one has the identity
$
\mu_1 \otimes \mu_2(E)=\int_{\Omega_1} f_E \, d \mu_1.
$

**Theorem (Tonelli's theorem).** Let $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$ be complete $\sigma$-finite measure spaces. If $f: \Omega_1 \times \Omega_2 \rightarrow \mathbb{R}_{\geq 0}$ is $\mathcal{F}_1 \otimes \mathcal{F}_2$ measurable, then
- For a.e. $\omega \in \Omega_1$, the function $f_\omega(\cdot):=f(\omega, \cdot): \Omega_2 \rightarrow \mathbb{R}$ is $\mathcal{F}_2$-measurable.
- The function $\omega \mapsto \int_{\Omega_2} f_\omega\left(\omega^{\prime}\right) d \mu_2\left(\omega^{\prime}\right)$, defined almost everywhere on $\Omega_1$, is $\mathcal{F}_1$-measurable.
- The following identity holds:
$
\int_{\Omega_1 \times \Omega_2} f \, d\left(\mu_1 \otimes \mu_2\right)=\int_{\Omega_1}\left(\int_{\Omega_2} f_\omega\left(\omega^{\prime}\right) d \mu_2\left(\omega^{\prime}\right)\right) d \mu_1(\omega).
$

**Theorem (Fubini's theorem).** Let $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$ be complete $\sigma$-finite measure spaces. If $f: \Omega_1 \times \Omega_2 \rightarrow \mathbb{R}$ is $\mathcal{F}_1 \otimes \mathcal{F}_2$ measurable and such that
$
\int_{\Omega_1 \times \Omega_2}|f| \, d\left(\mu_1 \otimes \mu_2\right)<\infty,
$
then the conclusion of Tonelli's theorem holds.

## 2. Laws of Large Numbers

### Independence

Fix a probability space $(\Omega,\mathcal{F},\mathbb{P})$.

>[!info] Definition
>Two events $A, B \in \mathcal{F}$ are called *independent* if
>$
>\mathbb{P}(A \cap B)=\mathbb{P}(A) \mathbb{P}(B).
>$
>
>A finite collection $A_1, \ldots, A_n$ of events is called independent if, for any subset $1 \leq i_1<\cdots<i_k \leq n$, one has
>$\mathbb{P}\left(A_{i_1} \cap \cdots \cap A_{i_k}\right)=\mathbb{P}\left(A_{i_1}\right) \cdots \mathbb{P}\left(A_{i_k}\right).$
>
>A finite collection $X_1, \ldots, X_n$ of random variables is independent if for any measurable sets $A_1, \ldots, A_n$, the events $\left\{X_i \in A_i\right\}$, $i=1, \ldots, n$, are independent; that is, the preimages of the $A_i$'s under the $X_i$'s are independent:
>$
>\mathbb{P}\left(X_1 \in A_1 ; \ldots ; X_n \in A_n\right)=\prod_{i=1}^n \mathbb{P}\left(X_i \in A_i\right).
>$
>
>A countable collection of events (random variables) is called independent if all its finite sub-collections are independent.
>
>A collection of random variables is called *independent, identically distributed*, in short *i.i.d.*, if they are independent and all have the same distribution.

**Theorem.** Scalar random variables $X_1, \ldots, X_N$ are independent if and only if, for any measurable functions $f_i: \mathbb{R} \rightarrow \mathbb{R}$ such that $\mathbb{E} f_i\left(X_i\right)$ exists, one has
$
\mathbb{E}\left(\prod_{i=1}^N f_i\left(X_i\right)\right)=\prod_{i=1}^N \mathbb{E} f_i\left(X_i\right).
$

**Proposition.** If independent scalar random variables $X_1, \ldots, X_N$ have densities $f_1, \ldots, f_N$, then the random vector $X=\left(X_1, \ldots, X_N\right)$ has density
$
f\left(x_1, \ldots, x_N\right)=f_1\left(x_1\right) \cdots f_N\left(x_N\right)
$
with respect to the $N$-dimensional Lebesgue measure $\lambda^N$. Conversely, if the random vector $X=\left(X_1, \ldots, X_N\right)$ has a density $f$, and there exist integrable functions $f_i: \mathbb{R} \rightarrow \mathbb{R}_{\geq 0}$ such that the above equality holds almost everywhere, then $X_1, \ldots, X_N$ are independent.

### Weak law of large numbers

**Theorem.** Let $X_1, \ldots, X_n,\dots$ be i.i.d. scalar random variables such that $\mathbb{E}\left|X_1\right|<\infty$. Denote $S_n:=X_1+\cdots+X_n$ and $\mu:=\mathbb{E} X_1$. Then, for every $\varepsilon>0$,
$
\mathbb{P}\left(\left|\frac{S_n}{n}-\mu\right|>\varepsilon\right) \xrightarrow{n \rightarrow \infty} 0.
$

**Proposition.** Let $X_1, \ldots, X_n$ be i.i.d. random variables with $\mu:=\mathbb{E} X_1$ and $\sigma^2:=\operatorname{Var} X_1<\infty$. Then
$
\mathbb{P}\left(\left|\frac{S_n}{n}-\mu\right|>\varepsilon\right) \leq \frac{\sigma^2}{\varepsilon^2 n}.
$

### Strong law of large numbers

**Theorem (Borel-Cantelli lemma).** Assume that $A_1, A_2, \ldots$ are events on the same probability space such that
$
\sum_{i=1}^{\infty} \mathbb{P}\left(A_i\right)<\infty.
$
Let $N(\omega):=\#\left\{i: \omega \in A_i\right\}$. Then $\mathbb{P}(N=\infty)=0$.
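The Chebyshev-type bound in the proposition above can be compared against simulation. A sketch with Uniform(0,1) summands (our choice of distribution, sample sizes, and seed):

```python
import random

# i.i.d. Uniform(0,1): mu = 1/2, sigma^2 = 1/12. The empirical frequency
# of |S_n/n - mu| > eps must stay below the bound sigma^2 / (eps^2 n).
random.seed(0)
n, eps, trials = 1000, 0.05, 2000
mu, sigma2 = 0.5, 1.0 / 12.0

deviations = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    if abs(s / n - mu) > eps:
        deviations += 1

freq = deviations / trials
bound = sigma2 / (eps ** 2 * n)   # = 1/30
assert freq <= bound
```

For these parameters the true deviation probability is far smaller than the bound (Chebyshev is crude but universal), so the observed frequency is essentially zero.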
**Theorem (Strong law of large numbers).** Assume that $X_i$ are i.i.d. scalar random variables with expectation $\mu$ (such that $\mathbb{E} X_1^4<\infty$). Then, with probability 1,
$
\frac{1}{n} \sum_{i=1}^n X_i \rightarrow \mu.
$

### Kolmogorov's zero-one law

### Convergence of random variables

>[!info] Definition
>Let $X, X_1, X_2, \ldots$ be scalar random variables defined on the same probability space $\Omega$. We say that
>- $X_i \rightarrow X$ *a.s.* ($X_i$ converges to $X$ *almost surely*) if there is an event $E$ of probability 1 such that $X_i(\omega) \rightarrow X(\omega)$ for each $\omega \in E$;
>- $X_i \xrightarrow{\mathcal{P}} X$ ($X_i$ converges to $X$ *in probability*) if for any $\varepsilon>0$, $\mathbb{P}\left(\left|X_i-X\right|>\varepsilon\right) \rightarrow 0$;
>- $X_i \xrightarrow{L^p} X$ ($X_i$ converges to $X$ *in $L^p$*), where $p \geq 1$, if $\mathbb{E}\left|X_i-X\right|^p \rightarrow 0$.
>    - The most common cases are $p=1$ (*convergence in mean*) and $p=2$ (*mean-square convergence*).
>
>Let $X, X_1, X_2, \ldots$ be random variables with values in the same metric space $M$ (but not necessarily defined on the same probability space). We say that
>- $X_i \xrightarrow{\mathcal{D}} X$ ($X_i$ converges to $X$ *in distribution*) if for any bounded continuous function $f: M \rightarrow \mathbb{R}$, one has
>$
>\mathbb{E}\left(f\left(X_i\right)\right) \rightarrow \mathbb{E}\left(f(X)\right).
>$

**Theorem.** A sequence $X_i$ of scalar random variables converges in distribution to $X$ if and only if
$
F_{X_i}(a) \rightarrow F_X(a)
$
for all $a \in \mathbb{R}$ such that $F_X$ is continuous at $a$.

#### Implications between notions of convergence

**Proposition.** There are the following implications between notions of convergence:
1) a.s. $\implies$ in probability;
2) in $L^p$ for any $p \geq 1$ $\implies$ in probability;
3) in probability $\implies$ in distribution;
4) in $L^p$ $\implies$ in $L^q$ if $q<p$.

**Remark.** No other implication holds in general.

#### Converging subsequences

**Proposition.** If $X_i \xrightarrow{\mathcal{P}} X$, then there is a subsequence $X_{i_k}$ such that $X_{i_k} \rightarrow X$ almost surely.

>[!info] Definition
>A sequence $X_i$ of scalar random variables is called *tight* if for any $\varepsilon>0$, there exists $R>0$ such that for any $i$,
>$
>\mathbb{P}\left(X_i \in[-R ; R]\right)>1-\varepsilon.
>$

**Theorem (Helly's selection theorem).**
- If $X_i$ is any sequence of scalar random variables, then there is a subsequence $i_k$ and a right-continuous non-decreasing function $F: \mathbb{R} \rightarrow[0,1]$ such that $F_{X_{i_k}}(a) \rightarrow F(a)$ for all $a$ at which $F$ is continuous.
- If, in addition, $X_i$ is tight, then $F$ is a distribution function of a random variable $X$ (and thus $X_{i_k} \xrightarrow{\mathcal{D}} X$).

#### Convergence in distribution

**Theorem.** The following statements are equivalent:
(i) $X_n \xrightarrow{\mathcal{D}} X$.
(ii) For all open sets $G$, $\liminf _{n \rightarrow \infty} P\left(X_n \in G\right) \geq P\left(X \in G\right)$.
(iii) For all closed sets $K$, $\limsup _{n \rightarrow \infty} P\left(X_n \in K\right) \leq P\left(X\in K\right)$.
(iv) For all Borel sets $A$ with $P\left(X \in \partial A\right)=0$, $\lim _{n \rightarrow \infty} P\left(X_n \in A\right)= P\left(X\in A\right)$.

### Characteristic functions

>[!info] Definition
>If $X$ is a scalar random variable, then the *characteristic function* of $X$ is defined as
>$
>\varphi_X(t)=\mathbb{E} e^{i t X}, \quad t \in \mathbb{R}.
>$
>
>**Remark.** Since $\left|e^{i t X}\right|=1$, the expectation always exists.

**Proposition.** If $X$ and $Y$ are independent, then
$
\varphi_{X+Y}(t) \equiv \varphi_X(t) \varphi_Y(t).
$

>[!info] Definition
>Given an integrable function $f: \mathbb{R} \rightarrow \mathbb{C}$, the *Fourier transform* $\mathfrak{F}(f)$ is defined by
>$
>\mathfrak{F} f(t)=\int_{\mathbb{R}} e^{i t x} f(x) d x.
>$

**Proposition (Fourier inversion formula).** If $f$ is integrable and continuously differentiable, then
$
\mathfrak{F} \mathfrak{F} f(t):=\lim _{R \rightarrow \infty} \int_{[-R ; R]} e^{-i t \theta}\left(\int_{\mathbb{R}} e^{i \theta x} f(x) d x\right) d \theta=2 \pi f(t).
$

**Corollary.** The distribution of a scalar random variable is uniquely determined by its characteristic function.

**Corollary.** If $X$ is a scalar random variable, and $X_1, X_2, \ldots$ is a tight sequence of scalar random variables such that $\varphi_{X_n}(t) \rightarrow \varphi_X(t)$ for all $t \in \mathbb{R}$, then $X_n \xrightarrow{\mathcal{D}} X$.

### The Central Limit Theorem

>[!info] Definition
>A scalar random variable with probability density function
>$
>f(x)=\frac{1}{\sqrt{2 \pi}} e^{-\frac{x^2}{2}}
>$
>is called *standard Gaussian*.
>A scalar random variable $X$ is called *Gaussian* if $X=\sigma X^{\prime}+\mu$, where $\sigma \geq 0$, $\mu \in \mathbb{R}$, and $X^{\prime}$ is a standard Gaussian.
>
>For $\sigma>0$, a Gaussian random variable has density
>$
>\frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}.
>$
>This distribution is denoted by $\mathcal{N}(\mu, \sigma)$.

**Theorem (The Central Limit Theorem).** Let $X_1, \ldots, X_n,\dots$ be independent, identically distributed scalar random variables such that $\mathbb{E} X_1=0$ and $\mathbb{E} X_1^2=\sigma^2<\infty$. Then
$
\frac{S_n}{\sqrt{n}}:=\frac{\sum_{i=1}^n X_i}{\sqrt{n}} \xrightarrow{\mathcal{D}} \mathcal{N}(0, \sigma).
$

![[Probability theory - glossary]]
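The Central Limit Theorem can be illustrated by simulation: a sketch comparing the empirical distribution of $S_n/\sqrt{n}$ for centered uniform summands with the $\mathcal{N}(0, \sigma)$ CDF (the distribution choice, sample sizes, and seed are ours):

```python
import math
import random

# Centered Uniform(-1/2, 1/2) summands: E X = 0, sigma^2 = 1/12.
# Compare the empirical CDF of S_n / sqrt(n) with the N(0, sigma) CDF.
random.seed(1)
n, trials = 200, 10_000
sigma = math.sqrt(1.0 / 12.0)

samples = [sum(random.random() - 0.5 for _ in range(n)) / math.sqrt(n)
           for _ in range(trials)]

def normal_cdf(x, s):
    """CDF of N(0, s) via the error function."""
    return 0.5 * (1.0 + math.erf(x / (s * math.sqrt(2.0))))

for a in (-0.3, 0.0, 0.3):
    empirical = sum(1 for x in samples if x <= a) / trials
    assert abs(empirical - normal_cdf(a, sigma)) < 0.025
```

The agreement at several evaluation points is consistent with convergence in distribution, which by the earlier theorem amounts to pointwise convergence of the distribution functions at continuity points.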