Random Variables

A random variable is a variable that can take on different values randomly. Random variables can be discrete or continuous. A discrete random variable has a finite or countably infinite number of states, while a continuous random variable takes real numerical values.


Probability Distributions

A probability distribution describes the likelihood of a random variable (or a set of random variables) taking on each possible state.

The probability mass function (PMF) is used to describe the probability distribution of discrete random variables. It is typically denoted by \( P \). The PMF can act on multiple random variables simultaneously, and the probability distribution of multiple variables is called the joint probability distribution. \( P(\mathbf{x} = x, \mathbf{y} = y) \) represents the probability that both \( \mathbf{x} = x \) and \( \mathbf{y} = y \) occur.

A function \( P \) is the PMF of a random variable \( \mathbf{x} \) if and only if it satisfies the following conditions:
1. The domain of \( P \) must be the set of all possible states of \( \mathbf{x} \).
2. For all \( x \in \mathbf{x} \), \( 0 \leq P(x) \leq 1 \). (The probability of an impossible event is 0, and the probability of a certain event is 1.)
3. \( \sum_{x \in \mathbf{x}} P(x) = 1 \). This property is called normalization.

For a discrete random variable \( \mathbf{x} \) with \( k \) distinct states, if each state is equally likely (uniform distribution), its PMF is:
$$ P(\mathbf{x} = x_i) = \frac{1}{k} \tag{1} $$
Since \( k \) is a positive integer, \( \frac{1}{k} \) is positive. Verifying normalization:
$$ \sum_i P(\mathbf{x} = x_i) = \sum_i \frac{1}{k} = \frac{k}{k} = 1 \tag{2} $$
Thus, the uniform distribution satisfies the normalization condition.
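
As a quick sanity check, the sketch below (assuming NumPy; the choice \( k = 6 \) is arbitrary) verifies conditions 2 and 3 for this uniform PMF:

```python
import numpy as np

k = 6                       # number of equally likely states (e.g. a fair die)
pmf = np.full(k, 1.0 / k)   # P(x = x_i) = 1/k for every state i

# Condition 2: every probability lies in [0, 1]
assert np.all((pmf >= 0) & (pmf <= 1))
# Condition 3 (normalization): the probabilities sum to 1
assert np.isclose(pmf.sum(), 1.0)
```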


Marginal Probability

The marginal probability distribution is the probability distribution over a subset of variables when the joint probability distribution over the whole set is known. For discrete variables it is obtained with the sum rule, e.g. \( P(\mathbf{x} = x) = \sum_{y} P(\mathbf{x} = x, \mathbf{y} = y) \).
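
As an illustration of the sum rule, a minimal sketch (assuming NumPy; the joint table is made up for demonstration) recovers both marginals from a joint PMF:

```python
import numpy as np

# Hypothetical joint PMF P(x, y): rows index states of x, columns states of y
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)  # marginal P(x): sum over y
p_y = joint.sum(axis=0)  # marginal P(y): sum over x

print(p_x)  # [0.3 0.7]
print(p_y)  # [0.4 0.6]
```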


Conditional Probability

Conditional probability is the probability of one event occurring given that another event has occurred. The probability of \( \mathbf{y} = y \) given \( \mathbf{x} = x \) is:
$$ P(\mathbf{y} = y \mid \mathbf{x} = x) = \frac{P(\mathbf{x} = x, \mathbf{y} = y)}{P(\mathbf{x} = x)} \tag{3} $$
where \( P(\mathbf{x} = x) > 0 \).

Example
In an automobile factory, two processes (bolting 3 bolts and welding 2 joints) are performed by robots. Let \( X \) be the number of improperly fastened bolts, and \( Y \) be the number of defective welds. The joint distribution of \( (X, Y) \) is given by:

| \( Y \setminus X \) | 0 | 1 | 2 | 3 | \( P\{Y = j\} \) |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.840 | 0.030 | 0.020 | 0.010 | 0.900 |
| 1 | 0.060 | 0.010 | 0.008 | 0.002 | 0.080 |
| 2 | 0.010 | 0.005 | 0.004 | 0.001 | 0.020 |
| \( P\{X = i\} \) | 0.910 | 0.045 | 0.032 | 0.013 | 1.000 |

Find the conditional distribution of \( Y \) given \( X = 1 \):
$$ P\{Y = 0 \mid X = 1\} = \frac{P\{X = 1, Y = 0\}}{P\{X = 1\}} = \frac{0.030}{0.045} = \frac{2}{3} $$
$$ P\{Y = 1 \mid X = 1\} = \frac{P\{X = 1, Y = 1\}}{P\{X = 1\}} = \frac{0.010}{0.045} = \frac{2}{9} \tag{4} $$
$$ P\{Y = 2 \mid X = 1\} = \frac{P\{X = 1, Y = 2\}}{P\{X = 1\}} = \frac{0.005}{0.045} = \frac{1}{9} $$
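
The same conditional distribution can be read off the joint table programmatically; a minimal sketch assuming NumPy:

```python
import numpy as np

# Joint PMF P(X = i, Y = j): rows are Y = 0, 1, 2; columns are X = 0, 1, 2, 3
joint = np.array([[0.840, 0.030, 0.020, 0.010],
                  [0.060, 0.010, 0.008, 0.002],
                  [0.010, 0.005, 0.004, 0.001]])

p_x = joint.sum(axis=0)                 # marginal P(X = i)
cond_y_given_x1 = joint[:, 1] / p_x[1]  # P(Y = j | X = 1)

print(p_x[1])            # 0.045
print(cond_y_given_x1)   # ≈ [0.667 0.222 0.111], i.e. 2/3, 2/9, 1/9
```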


Independence and Conditional Independence

Two random variables \( \mathbf{x} \) and \( \mathbf{y} \) are independent if their joint probability distribution can be expressed as the product of two factors, one depending only on \( \mathbf{x} \) and the other only on \( \mathbf{y} \):
$$ \forall x \in \mathbf{x}, y \in \mathbf{y}, \, P(\mathbf{x} = x, \mathbf{y} = y) = P(\mathbf{x} = x) P(\mathbf{y} = y) \tag{5} $$

Example
Experiment \( E \): “Toss two coins, observe heads (H) and tails (T).” Let \( A \) be “Head on coin A” and \( B \) be “Head on coin B.” The sample space is:
$$ S = \{HH, HT, TH, TT\} \tag{6} $$
Calculations:
$$ P(A) = \frac{2}{4} = \frac{1}{2}, \quad P(B) = \frac{2}{4} = \frac{1}{2}, \quad P(B \mid A) = \frac{1}{2}, \quad P(AB) = \frac{1}{4} \tag{7} $$
Since \( P(B \mid A) = P(B) \) and \( P(AB) = P(A)P(B) \), the events \( A \) and \( B \) are independent.

Two random variables \( \mathbf{x} \) and \( \mathbf{y} \) are conditionally independent given \( \mathbf{z} \) if, for all \( x \in \mathbf{x}, y \in \mathbf{y}, z \in \mathbf{z} \), the conditional joint distribution factors:
$$ P(\mathbf{x} = x, \mathbf{y} = y \mid \mathbf{z} = z) = P(\mathbf{x} = x \mid \mathbf{z} = z) P(\mathbf{y} = y \mid \mathbf{z} = z) \tag{8} $$
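
For the coin example above, a small sketch (plain Python, with NumPy only for the comparison) enumerates the four equally likely outcomes and confirms the factorization in equation (5):

```python
import numpy as np
from itertools import product

# Sample space of two fair coin tosses, each outcome with probability 1/4
p = {outcome: 0.25 for outcome in product("HT", repeat=2)}

p_a = sum(v for o, v in p.items() if o[0] == "H")   # P(A): head on coin A = 1/2
p_b = sum(v for o, v in p.items() if o[1] == "H")   # P(B): head on coin B = 1/2
p_ab = p[("H", "H")]                                 # P(AB) = 1/4

# Independence: the joint probability equals the product of the marginals
assert np.isclose(p_ab, p_a * p_b)
```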


Mathematical Expectation

For a discrete random variable \( X \) with distribution \( P\{X = x_k\} = p_k \) (\( k = 1, 2, 3, \ldots \)), if the series \( \sum_{k=1}^\infty x_k p_k \) converges absolutely, the sum is the mathematical expectation \( E(X) \):
$$ E(X) = \sum_{k=1}^\infty x_k p_k \tag{10} $$

For a continuous random variable \( X \) with probability density \( f(x) \), if the integral \( \int_{-\infty}^\infty x f(x) dx \) converges absolutely, the expectation is:
$$ E(X) = \int_{-\infty}^\infty x f(x) dx \tag{11} $$

Mathematical expectation (or mean) is completely determined by the probability distribution of \( X \).

Discrete Example
A newborn’s score \( X \) has the following distribution:

| \( X \) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \( p_k \) | 0.002 | 0.001 | 0.002 | 0.005 | 0.02 | 0.04 | 0.18 | 0.37 | 0.25 | 0.12 | 0.01 |

Calculating \( E(X) \):
$$ \begin{aligned} E(X) ={} & 0 \times 0.002 + 1 \times 0.001 + 2 \times 0.002 + 3 \times 0.005 + 4 \times 0.02 + 5 \times 0.04 \\
& + 6 \times 0.18 + 7 \times 0.37 + 8 \times 0.25 + 9 \times 0.12 + 10 \times 0.01 = 7.15 \end{aligned} $$
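
The same sum can be evaluated programmatically; a minimal sketch assuming NumPy:

```python
import numpy as np

x = np.arange(11)  # scores 0..10
p = np.array([0.002, 0.001, 0.002, 0.005, 0.02, 0.04,
              0.18, 0.37, 0.25, 0.12, 0.01])

assert np.isclose(p.sum(), 1.0)   # valid PMF
e_x = np.dot(x, p)                # E(X) = sum_k x_k * p_k
print(e_x)                        # 7.15
```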

Continuous Example
Two independent electronic devices have lifetimes \( X_1 \) and \( X_2 \), each following the exponential distribution with density \( f(x) = \frac{1}{\theta} e^{-x/\theta} \) (\( x > 0 \)). The system fails as soon as either device fails, so its lifetime is \( N = \min\{X_1, X_2\} \).
The distribution of \( N \) is:
$$ F_{\min}(x) = 1 - [1 - F(x)]^2 = \begin{cases} 1 - e^{-2x/\theta}, & x > 0 \\ 0, & x \leq 0 \end{cases} \tag{13} $$
Density: \( f_{\min}(x) = \frac{2}{\theta} e^{-2x/\theta} \) (\( x > 0 \)).
Expectation:
$$ E(N) = \int_0^\infty x \cdot \frac{2}{\theta} e^{-2x/\theta} dx = \frac{\theta}{2} \tag{15} $$
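
As a rough check of equation (15), a Monte Carlo sketch (assuming NumPy; the value \( \theta = 2 \) is an arbitrary choice for the demonstration) simulates the minimum of two independent exponential lifetimes:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                 # arbitrary mean lifetime for the demonstration
n_trials = 1_000_000

x1 = rng.exponential(theta, n_trials)   # lifetime of device 1
x2 = rng.exponential(theta, n_trials)   # lifetime of device 2
n = np.minimum(x1, x2)                  # system fails at the first failure

print(n.mean())   # ≈ 1.0, matching theta / 2
```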


Variance

The variance of a random variable \( X \), denoted \( D(X) \) or \( \text{Var}(X) \), is defined as:
$$ D(X) = \text{Var}(X) = E\left\{[X - E(X)]^2\right\} \tag{16} $$
The square root of the variance, \( \sqrt{D(X)} \), is the standard deviation (also called the root-mean-square deviation).

For a discrete random variable:
$$ D(X) = \sum_{k=1}^\infty [x_k - E(X)]^2 p_k \tag{17} $$

For a continuous random variable:
$$ D(X) = \int_{-\infty}^\infty [x - E(X)]^2 f(x) dx \tag{18} $$
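
Continuing the newborn-score example, a minimal sketch (assuming NumPy) evaluates equation (17):

```python
import numpy as np

x = np.arange(11)
p = np.array([0.002, 0.001, 0.002, 0.005, 0.02, 0.04,
              0.18, 0.37, 0.25, 0.12, 0.01])

e_x = np.dot(x, p)                    # E(X) = 7.15
var_x = np.dot((x - e_x) ** 2, p)     # D(X) = sum_k (x_k - E(X))^2 * p_k ≈ 1.58
std_x = np.sqrt(var_x)                # standard deviation ≈ 1.26

print(e_x, var_x, std_x)
```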


Covariance

The covariance between \( X \) and \( Y \) is:
$$ \text{Cov}(X, Y) = E\left\{[X - E(X)][Y - E(Y)]\right\} \tag{19} $$

The correlation coefficient is:
$$ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{D(X)} \sqrt{D(Y)}} \tag{20} $$

Covariance Matrix

For a two-dimensional random variable \( (X_1, X_2) \), the covariance matrix is:
$$
\begin{pmatrix}
c_{11} & c_{12} \\
c_{21} & c_{22}
\end{pmatrix} \tag{22}
$$
where:
$$
\begin{aligned}
c_{11} &= E\left\{[X_1 - E(X_1)]^2\right\} \\
c_{12} &= E\left\{[X_1 - E(X_1)][X_2 - E(X_2)]\right\} \\
c_{21} &= E\left\{[X_2 - E(X_2)][X_1 - E(X_1)]\right\} \\
c_{22} &= E\left\{[X_2 - E(X_2)]^2\right\}
\end{aligned} \tag{21}
$$
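
These definitions can also be checked numerically; the sketch below (assuming NumPy, with synthetic correlated data generated purely for illustration) compares the covariance and correlation computed from definitions (19) and (20) against np.cov:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 10_000)
x2 = 0.5 * x1 + rng.normal(0.0, 1.0, 10_000)   # X2 is correlated with X1

# Sample covariance from the definition (19)
cov_12 = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
# Correlation coefficient, equation (20)
rho = cov_12 / (x1.std() * x2.std())

print(cov_12, rho)
# 2x2 covariance matrix as in (22); bias=True matches the 1/N normalization above
print(np.cov(x1, x2, bias=True))
```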


Common Probability Distributions

Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable, controlled by a parameter \( \phi \in [0, 1] \) that gives the probability of the variable being equal to 1: \( P(\mathbf{x} = 1) = \phi \) and \( P(\mathbf{x} = 0) = 1 - \phi \).

Multinoulli Distribution

The multinoulli distribution (also called the categorical distribution) is a distribution over a single discrete random variable with \( k \) different finite states.

Gaussian Distribution

The Gaussian distribution (normal distribution) is defined as:
$$ \mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi \sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) \tag{23} $$
The standard normal distribution has \( \mu = 0 \) and \( \sigma = 1 \).
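
A small sketch (assuming NumPy and SciPy are available; scipy.stats.norm is used only as a cross-check) evaluates the density in equation (23):

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density N(x; mu, sigma^2) as written in equation (23)."""
    return np.sqrt(1.0 / (2.0 * np.pi * sigma**2)) * np.exp(-0.5 * (x - mu)**2 / sigma**2)

x = np.linspace(-3, 3, 7)
assert np.allclose(gaussian_pdf(x), norm.pdf(x))                          # standard normal
assert np.allclose(gaussian_pdf(x, mu=1, sigma=2), norm.pdf(x, loc=1, scale=2))
```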


Common Functions

Logistic Sigmoid Function

$$ \sigma(x) = \frac{1}{1 + \exp(-x)} \tag{24} $$

Softplus Function

$$ \varsigma(x) = \log(1 + \exp(x)) \tag{25} $$
This is a smooth (softened) version of \( x^+ = \max(0, x) \).
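
A minimal sketch of both functions (assuming NumPy; np.logaddexp is used for a numerically stable softplus), which also illustrates that the softplus approaches \( \max(0, x) \) away from the origin:

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, equation (24)
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # Softplus, equation (25): logaddexp(0, x) = log(exp(0) + exp(x)) = log(1 + exp(x)),
    # which avoids overflow for large x
    return np.logaddexp(0.0, x)

x = np.linspace(-10, 10, 5)
print(sigmoid(x))
print(softplus(x))          # close to max(0, x) away from the origin
print(np.maximum(0.0, x))
```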

