Lab 05: Probability and Distributions

Based on Chapter 6 of Mathematics for Machine Learning (Deisenroth, Faisal, Ong)

This lab covers the core concepts of probability theory as used in machine learning:

  1. Probability Space – Sample space, events, probability axioms

  2. Discrete and Continuous Probabilities – PMF, PDF, CDF

  3. Sum Rule, Product Rule, and Bayes’ Theorem – Fundamental rules of probability

  4. Summary Statistics – Mean, variance, covariance, correlation

  5. Gaussian Distribution – Univariate and multivariate normal distributions

  6. Conjugacy and the Exponential Family – Beta-Bernoulli conjugacy, exponential family form

  7. Change of Variables / Inverse Transform – Transforming random variables

We use only numpy and matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 4)
plt.rcParams['font.size'] = 11
print("Setup complete. NumPy version:", np.__version__)

1. Probability Space (MML Section 6.1)

A probability space \((\Omega, \mathcal{A}, P)\) consists of three components:

  • Sample space \(\Omega\): The set of all possible outcomes of an experiment.

  • Event space \(\mathcal{A}\): A collection of subsets of \(\Omega\) (events we can assign probabilities to).

  • Probability \(P\): A function \(P: \mathcal{A} \to [0, 1]\) satisfying Kolmogorov’s axioms:

    1. \(P(A) \geq 0\) for all \(A \in \mathcal{A}\)

    2. \(P(\Omega) = 1\)

    3. For mutually exclusive events \(A_1, A_2, \ldots\): \(P\left(\bigcup_i A_i\right) = \sum_i P(A_i)\)

A random variable \(X: \Omega \to \mathcal{T}\) maps outcomes to a target space \(\mathcal{T}\), transferring the probability to quantities of interest.
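The pushforward idea can be made concrete with a small example not taken from the lab (an illustrative assumption): the sum of two fair dice as a random variable on the 36-outcome sample space.

```python
# Illustration (two fair dice, not from the lab): the random variable
# X = (sum of the two faces) pushes the uniform probability on Omega
# forward onto the target space {2, ..., 12}.
faces = range(1, 7)
P_outcome = 1.0 / 36  # each ordered pair (i, j) in Omega is equally likely

# Induced PMF: P(X = x) = sum of P(omega) over all outcomes with i + j = x
pmf_X = {}
for i in faces:
    for j in faces:
        pmf_X[i + j] = pmf_X.get(i + j, 0.0) + P_outcome

for x in sorted(pmf_X):
    print(f"P(X={x:2d}) = {pmf_X[x]:.4f}")
print(f"Sum = {sum(pmf_X.values()):.4f}")  # Axiom 2: must equal 1
```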

# --- Probability Space: Fair Dice Example ---
# Sample space for rolling a single fair die
omega_die = np.array([1, 2, 3, 4, 5, 6])
P_die = np.ones(6) / 6  # uniform probability

print("Sample space Omega =", omega_die)
print("P(each outcome)   =", P_die)
print("Sum P(Omega)       =", P_die.sum(), " (Axiom 2: must be 1)")

# Events
event_even = omega_die[omega_die % 2 == 0]
P_even = P_die[omega_die % 2 == 0].sum()
event_gt4 = omega_die[omega_die > 4]
P_gt4 = P_die[omega_die > 4].sum()

print(f"\nEvent 'even' = {event_even}, P(even) = {P_even:.4f}")
print(f"Event '>4'   = {event_gt4},    P(>4)   = {P_gt4:.4f}")

# Inclusion-exclusion (derived from the axioms) for non-mutually-exclusive events:
# P(A union B) = P(A) + P(B) - P(A intersect B)
event_even_and_gt4 = np.intersect1d(event_even, event_gt4)
P_even_and_gt4 = P_die[np.isin(omega_die, event_even_and_gt4)].sum()
P_union = P_even + P_gt4 - P_even_and_gt4
print(f"\nP(even OR >4) = {P_even} + {P_gt4} - {P_even_and_gt4} = {P_union:.4f}")
print(f"Direct count:   {np.isin(omega_die, np.union1d(event_even, event_gt4)).sum()/6:.4f}")
# --- Two-Coin Experiment (MML Example 6.1) ---
# Sample space for drawing two coins from a bag ($ with prob 0.3, pound with prob 0.7)
# Random variable X = number of $ coins drawn
p_dollar = 0.3
p_pound = 1 - p_dollar

# Sample space and outcomes
outcomes = ['($,$)', '($,L)', '(L,$)', '(L,L)']
X_values = [2, 1, 1, 0]  # number of $ coins
probs = [p_dollar*p_dollar, p_dollar*p_pound, p_pound*p_dollar, p_pound*p_pound]

print("Two-coin draw experiment (MML Example 6.1):")
print(f"{'Outcome':<12} {'X (# of $)':<12} {'P(outcome)'}")
print("-" * 36)
for o, x, p in zip(outcomes, X_values, probs):
    print(f"{o:<12} {x:<12} {p:.4f}")

# PMF of X
print("\nPMF of X (number of $ coins):")
for x_val in sorted(set(X_values)):
    p_x = sum(p for x, p in zip(X_values, probs) if x == x_val)
    print(f"  P(X={x_val}) = {p_x:.4f}")
print(f"  Sum        = {sum(probs):.4f}")

# Simulate to verify
n_sim = 100000
coin1 = np.random.choice([1, 0], size=n_sim, p=[p_dollar, p_pound])
coin2 = np.random.choice([1, 0], size=n_sim, p=[p_dollar, p_pound])
X_sim = coin1 + coin2
print(f"\nSimulation ({n_sim:,} trials):")
for x_val in [0, 1, 2]:
    print(f"  P(X={x_val}) approx {(X_sim == x_val).mean():.4f}")

2. Discrete and Continuous Probabilities (MML Section 6.2)

Discrete Random Variables

For a discrete random variable \(X\) taking values \(x \in \mathcal{T}\), the probability mass function (PMF) is:

\[P(X = x) = p(x), \quad \sum_{x \in \mathcal{T}} p(x) = 1\]

Continuous Random Variables

For a continuous random variable \(X\), the probability density function (PDF) \(f: \mathbb{R}^D \to \mathbb{R}\) satisfies:

  1. \(f(\mathbf{x}) \geq 0\) for all \(\mathbf{x}\)

  2. \(\int_{\mathbb{R}^D} f(\mathbf{x}) \, d\mathbf{x} = 1\)

The probability of an interval is: \(P(a \leq X \leq b) = \int_a^b f(x) \, dx\)

The cumulative distribution function (CDF): \(F_X(x) = P(X \leq x) = \int_{-\infty}^x f(t) \, dt\)

Key distributions:

  • Bernoulli: \(p(x \mid \mu) = \mu^x (1-\mu)^{1-x}\), \(x \in \{0,1\}\) (Eq. 6.92)

  • Binomial: \(p(m \mid N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}\) (Eq. 6.95)
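The CDF definition above can be checked numerically. A minimal sketch (assuming the standard normal as the test case, and stdlib `math.erf` for the closed-form CDF \(\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))\)): cumulative trapezoidal integration of the PDF should reproduce the CDF.

```python
import numpy as np
from math import erf

# Numerically verify F_X(x) = integral_{-inf}^x f(t) dt for the standard normal
x = np.linspace(-6, 6, 2001)
pdf = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# Cumulative trapezoidal integration of the PDF from the left tail
dx = x[1] - x[0]
cdf_numeric = np.concatenate([[0.0], np.cumsum((pdf[1:] + pdf[:-1]) / 2 * dx)])

# Closed-form CDF via the error function: Phi(x) = (1 + erf(x/sqrt(2))) / 2
cdf_exact = np.array([(1 + erf(v / np.sqrt(2))) / 2 for v in x])

print("Max |numeric - exact| CDF error:", np.abs(cdf_numeric - cdf_exact).max())
print("F(0) numeric:", cdf_numeric[len(x) // 2])  # should be ~0.5
```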

# --- Bernoulli Distribution ---
def bernoulli_pmf(x, mu):
    """PMF of Bernoulli distribution: p(x|mu) = mu^x * (1-mu)^(1-x)"""
    return mu**x * (1 - mu)**(1 - x)

mu_values = [0.2, 0.5, 0.8]
x_vals = np.array([0, 1])

print("Bernoulli Distribution PMF:")
print(f"{'mu':<8} {'P(X=0)':<12} {'P(X=1)':<12} {'E[X]':<10} {'V[X]'}")
print("-" * 52)
for mu in mu_values:
    p0 = bernoulli_pmf(0, mu)
    p1 = bernoulli_pmf(1, mu)
    E_x = mu  # E[X] = mu
    V_x = mu * (1 - mu)  # V[X] = mu(1 - mu)
    print(f"{mu:<8.1f} {p0:<12.4f} {p1:<12.4f} {E_x:<10.4f} {V_x:.4f}")
# --- Binomial Distribution (MML Eq. 6.95) ---
def binomial_pmf(m, N, mu):
    """PMF: p(m|N,mu) = C(N,m) * mu^m * (1-mu)^(N-m)"""
    # Compute log of binomial coefficient for numerical stability
    log_coeff = (np.sum(np.log(np.arange(1, N+1))) 
                 - np.sum(np.log(np.arange(1, m+1))) 
                 - np.sum(np.log(np.arange(1, N-m+1))))
    return np.exp(log_coeff + m*np.log(mu) + (N-m)*np.log(1-mu))

N = 15
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: Binomial PMF for different mu (like MML Figure 6.10)
m_vals = np.arange(0, N+1)
for mu in [0.1, 0.4, 0.75]:
    pmf = np.array([binomial_pmf(m, N, mu) for m in m_vals])
    axes[0].bar(m_vals + (mu - 0.4)*0.25, pmf, width=0.25, alpha=0.7, label=f'$\\mu={mu}$')
axes[0].set_xlabel('m (number of successes)')
axes[0].set_ylabel('p(m)')
axes[0].set_title(f'Binomial Distribution (N={N})')
axes[0].legend()

# Right: CDF for Binomial
for mu in [0.1, 0.4, 0.75]:
    pmf = np.array([binomial_pmf(m, N, mu) for m in m_vals])
    cdf = np.cumsum(pmf)
    axes[1].step(m_vals, cdf, where='mid', label=f'$\\mu={mu}$', linewidth=2)
axes[1].set_xlabel('m')
axes[1].set_ylabel('$F(m) = P(M \\leq m)$')
axes[1].set_title(f'Binomial CDF (N={N})')
axes[1].legend()
axes[1].set_ylim(0, 1.05)

plt.tight_layout()
plt.show()

# Verify mean and variance
mu = 0.4
pmf = np.array([binomial_pmf(m, N, mu) for m in m_vals])
E_m = np.sum(m_vals * pmf)
V_m = np.sum((m_vals - E_m)**2 * pmf)
print(f"Binomial(N={N}, mu={mu}):")
print(f"  E[m] = {E_m:.4f}  (formula: N*mu = {N*mu:.4f})")
print(f"  V[m] = {V_m:.4f}  (formula: N*mu*(1-mu) = {N*mu*(1-mu):.4f})")
# --- Continuous PDF and CDF: Uniform Distribution (MML Example 6.3) ---
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Discrete uniform
z_states = np.array([-1.1, 0.3, 1.5])
p_disc = np.ones(3) / 3
axes[0].stem(z_states, p_disc, linefmt='r-', markerfmt='ro', basefmt='k-')
axes[0].set_xlabel('z')
axes[0].set_ylabel('P(Z = z)')
axes[0].set_title('(a) Discrete Uniform')
axes[0].set_ylim(0, 2.0)
axes[0].set_xlim(-1.5, 2.0)

# Continuous uniform PDF
a, b = 0.9, 1.6
x_cont = np.linspace(-1.5, 2.5, 500)
pdf_uniform = np.where((x_cont >= a) & (x_cont <= b), 1.0/(b - a), 0.0)
axes[1].plot(x_cont, pdf_uniform, 'r-', linewidth=2)
axes[1].fill_between(x_cont, pdf_uniform, alpha=0.2, color='red')
axes[1].set_xlabel('x')
axes[1].set_ylabel('p(x)')
axes[1].set_title(f'(b) Continuous Uniform PDF [{a}, {b}]')
axes[1].set_ylim(0, 2.0)

# Continuous uniform CDF
cdf_uniform = np.where(x_cont < a, 0.0, np.where(x_cont > b, 1.0, (x_cont - a)/(b - a)))
axes[2].plot(x_cont, cdf_uniform, 'b-', linewidth=2)
axes[2].set_xlabel('x')
axes[2].set_ylabel('$F_X(x) = P(X \\leq x)$')
axes[2].set_title(f'(c) Continuous Uniform CDF [{a}, {b}]')
axes[2].set_ylim(-0.05, 1.1)

plt.tight_layout()
plt.show()

# Note: PDF can be > 1 for continuous distributions
print(f"Uniform PDF height = 1/(b-a) = 1/{b-a:.1f} = {1/(b-a):.4f}")
print(f"This is > 1, but integral over [{a}, {b}] = {1/(b-a) * (b-a):.1f} (correct!)")

3. Sum Rule, Product Rule, and Bayes’ Theorem (MML Section 6.3)

The two fundamental rules of probability:

Sum Rule (marginalization): \(p(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{x}, \mathbf{y})\) (Eq. 6.20)

Product Rule: \(p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y} \mid \mathbf{x}) \, p(\mathbf{x})\) (Eq. 6.22)

Bayes’ Theorem follows directly from the product rule:

\[p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x}) \, p(\mathbf{x})}{p(\mathbf{y})}\]

where:

  • \(p(\mathbf{x})\) is the prior (belief before observing data)

  • \(p(\mathbf{y} \mid \mathbf{x})\) is the likelihood (how likely the data is given \(\mathbf{x}\))

  • \(p(\mathbf{x} \mid \mathbf{y})\) is the posterior (updated belief after observing \(\mathbf{y}\))

  • \(p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{x}) p(\mathbf{x}) d\mathbf{x}\) is the marginal likelihood / evidence

# --- Joint Distribution: Sum Rule and Product Rule ---
# Create a joint probability table for two discrete RVs
# X has 3 states, Y has 2 states
joint = np.array([
    [0.10, 0.05],  # X=0
    [0.20, 0.15],  # X=1
    [0.25, 0.25],  # X=2
])

print("Joint distribution p(X, Y):")
print(f"{'':>8} {'Y=0':>8} {'Y=1':>8} | {'p(X)':>8}")
print("-" * 40)
for i in range(3):
    p_x_marginal = joint[i, :].sum()
    print(f"X={i:>4}   {joint[i,0]:>8.2f} {joint[i,1]:>8.2f} | {p_x_marginal:>8.2f}")
p_y_marginal = joint.sum(axis=0)
print("-" * 40)
print(f"{'p(Y)':>8} {p_y_marginal[0]:>8.2f} {p_y_marginal[1]:>8.2f} | {joint.sum():>8.2f}")

# Sum rule: marginals
p_x = joint.sum(axis=1)  # marginalize over Y
p_y = joint.sum(axis=0)  # marginalize over X
print(f"\nSum rule verification:")
print(f"  p(X) = sum_y p(X,Y) = {p_x}")
print(f"  p(Y) = sum_x p(X,Y) = {p_y}")

# Product rule: p(X,Y) = p(Y|X) * p(X)
p_y_given_x = joint / p_x[:, None]  # conditional
reconstructed = p_y_given_x * p_x[:, None]
print(f"\nProduct rule: p(Y|X) * p(X) == p(X,Y)?")
print(f"  Max absolute error: {np.abs(reconstructed - joint).max():.2e}")
# --- Bayes' Theorem: Medical Test Example ---
# A disease affects 1% of the population (prevalence)
# Test sensitivity (true positive rate): P(test+ | disease) = 0.95
# Test specificity (true negative rate): P(test- | no disease) = 0.90

P_disease = 0.01       # prior: P(D)
P_no_disease = 0.99    # P(~D)
sensitivity = 0.95     # P(T+ | D)  -- likelihood
specificity = 0.90     # P(T- | ~D)
P_pos_given_no_disease = 1 - specificity  # P(T+ | ~D) = false positive rate

# Evidence: P(T+) via sum rule
P_pos = sensitivity * P_disease + P_pos_given_no_disease * P_no_disease

# Bayes' theorem: P(D | T+) = P(T+ | D) * P(D) / P(T+)
P_disease_given_pos = (sensitivity * P_disease) / P_pos

print("=== Medical Test: Bayes' Theorem ===")
print(f"Prior:        P(Disease)         = {P_disease:.4f}")
print(f"Sensitivity:  P(Test+ | Disease) = {sensitivity:.4f}")
print(f"Specificity:  P(Test- | Healthy) = {specificity:.4f}")
print(f"False pos:    P(Test+ | Healthy) = {P_pos_given_no_disease:.4f}")
print(f"\nEvidence:     P(Test+) = {sensitivity}*{P_disease} + {P_pos_given_no_disease}*{P_no_disease} = {P_pos:.4f}")
print(f"\nPosterior:    P(Disease | Test+) = {P_disease_given_pos:.4f}")
print(f"\nDespite a positive test, the probability of disease is only {P_disease_given_pos:.1%}!")
print("This is because the disease is rare (low prior), so false positives dominate.")

# What about a second independent positive test?
P_disease_2nd = P_disease_given_pos  # use posterior as new prior
P_no_disease_2nd = 1 - P_disease_2nd
P_pos_2nd = sensitivity * P_disease_2nd + P_pos_given_no_disease * P_no_disease_2nd
P_disease_given_2pos = (sensitivity * P_disease_2nd) / P_pos_2nd
print(f"\nAfter 2nd positive test: P(Disease | T+, T+) = {P_disease_given_2pos:.4f} ({P_disease_given_2pos:.1%})")
# --- Visualize how posterior changes with different priors ---
priors = np.linspace(0.001, 0.5, 200)
posteriors = (sensitivity * priors) / (sensitivity * priors + P_pos_given_no_disease * (1 - priors))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(priors, posteriors, 'b-', linewidth=2, label='Posterior P(D|T+)')
ax.plot(priors, priors, 'k--', alpha=0.3, label='Prior = Posterior line')
ax.axvline(x=0.01, color='red', linestyle=':', alpha=0.7, label=f'Our prior = 0.01')
ax.axhline(y=P_disease_given_pos, color='red', linestyle=':', alpha=0.7)
ax.scatter([0.01], [P_disease_given_pos], color='red', zorder=5, s=80)
ax.set_xlabel('Prior P(Disease)')
ax.set_ylabel('Posterior P(Disease | Test+)')
ax.set_title('Bayes\' Theorem: Posterior vs Prior (sensitivity=0.95, specificity=0.90)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

4. Summary Statistics (MML Section 6.4)

Mean (expected value, Def. 6.4): \(\mathbb{E}_X[x] = \int x \, p(x) \, dx\) or \(\sum_x x \, p(x)\)

Variance (Def. 6.7): \(\mathbb{V}_X[x] = \mathbb{E}_X[(x - \mu)^2] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2\) (Eq. 6.44)

Covariance (Def. 6.5): \(\text{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]\) (Eq. 6.36)

Correlation (Def. 6.8): \(\text{corr}[x, y] = \frac{\text{Cov}[x,y]}{\sqrt{\mathbb{V}[x] \, \mathbb{V}[y]}} \in [-1, 1]\) (Eq. 6.40)

Empirical estimates (Def. 6.9):

  • Sample mean: \(\bar{\mathbf{x}} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n\) (Eq. 6.41)

  • Sample covariance: \(\mathbf{\Sigma} = \frac{1}{N} \sum_{n=1}^N (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top\) (Eq. 6.42)
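One consequence of Eq. 6.40 worth verifying: covariance depends on the scale (units) of the data, while correlation does not. A quick check on synthetic data (not from the lab; the generating model is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)

def cov(a, b):
    # Cov[a,b] = E[ab] - E[a]E[b]  (Eq. 6.36)
    return np.mean(a * b) - np.mean(a) * np.mean(b)

def corr(a, b):
    # corr[a,b] = Cov[a,b] / sqrt(V[a] V[b])  (Eq. 6.40)
    return cov(a, b) / np.sqrt(cov(a, a) * cov(b, b))

# Rescaling x by 10 scales the covariance by 10 but leaves correlation unchanged
print(f"cov(x, y)  = {cov(x, y):.4f},  cov(10x, y)  = {cov(10 * x, y):.4f}")
print(f"corr(x, y) = {corr(x, y):.4f},  corr(10x, y) = {corr(10 * x, y):.4f}")
```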

# --- Summary Statistics: Three Expressions for Variance (MML 6.4.3) ---
data = np.array([2.1, 3.5, 4.2, 5.8, 6.1, 7.3, 8.0, 3.9, 5.5, 6.7])
N = len(data)

# Expression 1: Standard definition (Eq. 6.43)
mu = np.mean(data)
var_def = np.mean((data - mu)**2)

# Expression 2: Raw-score formula (Eq. 6.44): E[x^2] - (E[x])^2
var_raw = np.mean(data**2) - np.mean(data)**2

# Expression 3: Pairwise differences (Eq. 6.45): (1/N^2) * sum_{i,j} (xi - xj)^2 = 2*var,
# so var = pairwise_sum / (2 * N^2)
pairwise_sum = 0.0
for i in range(N):
    for j in range(N):
        pairwise_sum += (data[i] - data[j])**2
var_pairwise = pairwise_sum / (2 * N**2)

print("Three Expressions for Variance (MML Section 6.4.3):")
print(f"  Data: {data}")
print(f"  Mean: {mu:.4f}")
print(f"  Var (standard):   E[(x-mu)^2]             = {var_def:.6f}")
print(f"  Var (raw-score):  E[x^2] - E[x]^2         = {var_raw:.6f}")
print(f"  Var (pairwise):   sum(xi-xj)^2 / (2*N^2)  = {var_pairwise:.6f}")
print(f"  All equal? {np.allclose(var_def, var_raw) and np.allclose(var_def, var_pairwise)}")
# --- Covariance, Correlation, and Visualization ---
# Generate two correlated datasets (like MML Figure 6.5)
np.random.seed(42)
n_points = 300

# Positively correlated
cov_pos = np.array([[4.0, 2.5], [2.5, 2.0]])
mean_pos = np.array([0.0, 2.5])
L_pos = np.linalg.cholesky(cov_pos)
data_pos = (L_pos @ np.random.randn(2, n_points)).T + mean_pos

# Negatively correlated
cov_neg = np.array([[4.0, -2.5], [-2.5, 2.0]])
L_neg = np.linalg.cholesky(cov_neg)
data_neg = (L_neg @ np.random.randn(2, n_points)).T + mean_pos

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, data_2d, title in [(axes[0], data_neg, '(a) Negative correlation'),
                            (axes[1], data_pos, '(b) Positive correlation')]:
    ax.scatter(data_2d[:, 0], data_2d[:, 1], alpha=0.3, s=15)
    
    # Compute empirical statistics
    emp_mean = np.mean(data_2d, axis=0)
    emp_cov = (data_2d - emp_mean).T @ (data_2d - emp_mean) / n_points
    emp_corr = emp_cov[0, 1] / np.sqrt(emp_cov[0, 0] * emp_cov[1, 1])
    
    ax.axhline(y=emp_mean[1], color='orange', linewidth=1.5, alpha=0.7)
    ax.axvline(x=emp_mean[0], color='orange', linewidth=1.5, alpha=0.7)
    ax.plot(emp_mean[0], emp_mean[1], 'rx', markersize=10, markeredgewidth=2)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(f'{title}\nCov={emp_cov[0,1]:.2f}, Corr={emp_corr:.2f}')
    ax.grid(True, alpha=0.2)

plt.tight_layout()
plt.show()

# Verify formula: Cov[x,y] = E[xy] - E[x]E[y]
x, y = data_pos[:, 0], data_pos[:, 1]
cov_formula = np.mean(x * y) - np.mean(x) * np.mean(y)
cov_direct = np.mean((x - np.mean(x)) * (y - np.mean(y)))
print(f"Covariance verification (positive data):")
print(f"  E[xy] - E[x]E[y]           = {cov_formula:.4f}")
print(f"  E[(x-mu_x)(y-mu_y)]        = {cov_direct:.4f}")
print(f"  Match: {np.isclose(cov_formula, cov_direct)}")

5. Gaussian Distribution (MML Section 6.5)

Univariate Gaussian

\[p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \quad \text{(Eq. 6.62)}\]

Multivariate Gaussian

\[p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right) \quad \text{(Eq. 6.63)}\]

Key properties:

  • Marginals of a joint Gaussian are Gaussian (Eq. 6.68)

  • Conditionals of a joint Gaussian are Gaussian (Eq. 6.65–6.67)

  • Linear/affine transformations of Gaussians are Gaussian (Eq. 6.88)

  • To sample: \(\mathbf{y} = \mathbf{A}\mathbf{x} + \boldsymbol{\mu}\) where \(\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) and \(\boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^\top\) (Cholesky)
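The affine-transformation property (Eq. 6.88) can be checked empirically. A sketch with an arbitrary \(\mathbf{A}\) and \(\mathbf{b}\) (chosen here for illustration, not from the text): if \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), then \(\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} \sim \mathcal{N}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top)\).

```python
import numpy as np

rng = np.random.default_rng(42)

# x ~ N(mu, Sigma); y = A x + b should be N(A mu + b, A Sigma A^T)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([3.0, 0.0])

# Sample x via Cholesky, then transform
L = np.linalg.cholesky(Sigma)
x = (L @ rng.standard_normal((2, 100_000))).T + mu
y = x @ A.T + b

print("Empirical mean of y:", y.mean(axis=0).round(3), " theory:", A @ mu + b)
print("Empirical cov of y:\n", np.cov(y.T, bias=True).round(3))
print("Theory A Sigma A^T:\n", A @ Sigma @ A.T)
```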

# --- Univariate Gaussian PDF at Different Parameters ---
def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian PDF (MML Eq. 6.62)"""
    return (1.0 / np.sqrt(2 * np.pi * sigma2)) * np.exp(-0.5 * (x - mu)**2 / sigma2)

x = np.linspace(-6, 8, 500)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Vary mu
for mu_val in [-2, 0, 2]:
    axes[0].plot(x, gaussian_pdf(x, mu_val, 1.0), linewidth=2, label=f'$\\mu={mu_val}, \\sigma^2=1$')
axes[0].set_title('Effect of Mean $\\mu$')
axes[0].set_xlabel('x')
axes[0].set_ylabel('p(x)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Vary sigma^2
for s2 in [0.25, 1.0, 4.0]:
    axes[1].plot(x, gaussian_pdf(x, 0, s2), linewidth=2, label=f'$\\mu=0, \\sigma^2={s2}$')
axes[1].set_title('Effect of Variance $\\sigma^2$')
axes[1].set_xlabel('x')
axes[1].set_ylabel('p(x)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Verify integration to 1 (numerical)
dx = x[1] - x[0]
integral = np.sum(gaussian_pdf(x, 0, 1.0)) * dx
print(f"Numerical integral of N(0,1) over [{x[0]:.0f},{x[-1]:.0f}]: {integral:.6f} (should be ~1.0)")
# --- Multivariate Gaussian: 2D PDF and Contour Plot ---
def mvn_pdf(X, mu, Sigma):
    """Multivariate Gaussian PDF (MML Eq. 6.63)
    X: (N, D) array of points
    mu: (D,) mean vector
    Sigma: (D, D) covariance matrix
    """
    D = mu.shape[0]
    diff = X - mu
    Sigma_inv = np.linalg.inv(Sigma)
    det_Sigma = np.linalg.det(Sigma)
    norm_const = (2 * np.pi) ** (-D / 2) * det_Sigma ** (-0.5)
    exponent = -0.5 * np.sum(diff @ Sigma_inv * diff, axis=1)
    return norm_const * np.exp(exponent)

# Parameters (like MML Example 6.6, Eq. 6.69)
mu_2d = np.array([0.0, 2.0])
Sigma_2d = np.array([[0.3, -1.0],
                      [-1.0, 5.0]])

# Create grid
x1 = np.linspace(-1.5, 1.5, 100)
x2 = np.linspace(-4, 8, 100)
X1, X2 = np.meshgrid(x1, x2)
grid = np.column_stack([X1.ravel(), X2.ravel()])
Z = mvn_pdf(grid, mu_2d, Sigma_2d).reshape(X1.shape)

# Sample from the distribution using Cholesky decomposition
L = np.linalg.cholesky(Sigma_2d)
n_samples = 200
samples = (L @ np.random.randn(2, n_samples)).T + mu_2d

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Contour plot
cs = axes[0].contour(X1, X2, Z, levels=8, cmap='viridis')
axes[0].scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=10, c='magenta', label='Samples')
axes[0].plot(mu_2d[0], mu_2d[1], 'rx', markersize=12, markeredgewidth=3, label='Mean')
axes[0].set_xlabel('$x_1$')
axes[0].set_ylabel('$x_2$')
axes[0].set_title('2D Gaussian: Contours + Samples')
axes[0].legend()

# Filled contour
cf = axes[1].contourf(X1, X2, Z, levels=20, cmap='Blues')
axes[1].plot(mu_2d[0], mu_2d[1], 'rx', markersize=12, markeredgewidth=3)
plt.colorbar(cf, ax=axes[1], label='p($x_1$, $x_2$)')
axes[1].set_xlabel('$x_1$')
axes[1].set_ylabel('$x_2$')
axes[1].set_title('2D Gaussian: Density')

plt.tight_layout()
plt.show()

# Verify sample statistics match parameters
print(f"True mean:      {mu_2d}")
print(f"Sample mean:    {np.mean(samples, axis=0).round(3)}")
print(f"True covariance:\n{Sigma_2d}")
emp_cov = (samples - samples.mean(axis=0)).T @ (samples - samples.mean(axis=0)) / n_samples
print(f"Sample covariance:\n{emp_cov.round(3)}")
# --- Conditionals and Marginals of Gaussians (MML Eqs. 6.65-6.68) ---
# Using MML Example 6.6 parameters
mu_joint = np.array([0.0, 2.0])
Sigma_joint = np.array([[0.3, -1.0],
                         [-1.0, 5.0]])

# Marginal of x1: N(mu_x, Sigma_xx) -- just read off the parameters
mu_x1_marginal = mu_joint[0]
var_x1_marginal = Sigma_joint[0, 0]

# Conditional p(x1 | x2 = -1) using MML Eqs. 6.66-6.67
x2_obs = -1.0
Sigma_12 = Sigma_joint[0, 1]  # cross-covariance
Sigma_22 = Sigma_joint[1, 1]
Sigma_11 = Sigma_joint[0, 0]

mu_cond = mu_joint[0] + Sigma_12 / Sigma_22 * (x2_obs - mu_joint[1])
var_cond = Sigma_11 - Sigma_12**2 / Sigma_22

print("=== Gaussian Marginals and Conditionals (MML Example 6.6) ===")
print(f"Joint: N(mu=[{mu_joint[0]}, {mu_joint[1]}], Sigma=[[{Sigma_joint[0,0]}, {Sigma_joint[0,1]}], [{Sigma_joint[1,0]}, {Sigma_joint[1,1]}]])")
print(f"\nMarginal p(x1) = N({mu_x1_marginal}, {var_x1_marginal})")
print(f"\nConditional p(x1 | x2={x2_obs}):")
print(f"  mu_cond  = {mu_joint[0]} + ({Sigma_12}/{Sigma_22})*({x2_obs}-{mu_joint[1]}) = {mu_cond:.4f}")
print(f"  var_cond = {Sigma_11} - ({Sigma_12})^2/{Sigma_22} = {var_cond:.4f}")
print(f"  p(x1 | x2={x2_obs}) = N({mu_cond:.2f}, {var_cond:.2f})")

# Plot marginal and conditional
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
x1_range = np.linspace(-1.5, 1.5, 200)

axes[0].plot(x1_range, gaussian_pdf(x1_range, mu_x1_marginal, var_x1_marginal), 'b-', linewidth=2, label='$p(x_1)$')
axes[0].axvline(mu_x1_marginal, color='r', linestyle='--', alpha=0.5, label=f'Mean={mu_x1_marginal}')
sigma_m = np.sqrt(var_x1_marginal)
axes[0].axvspan(mu_x1_marginal - 2*sigma_m, mu_x1_marginal + 2*sigma_m, alpha=0.1, color='red', label='$\\pm 2\\sigma$')
axes[0].set_title('Marginal $p(x_1)$')
axes[0].set_xlabel('$x_1$')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(x1_range, gaussian_pdf(x1_range, mu_cond, var_cond), 'g-', linewidth=2, label=f'$p(x_1|x_2={x2_obs})$')
axes[1].axvline(mu_cond, color='r', linestyle='--', alpha=0.5, label=f'Mean={mu_cond:.2f}')
sigma_c = np.sqrt(var_cond)
axes[1].axvspan(mu_cond - 2*sigma_c, mu_cond + 2*sigma_c, alpha=0.1, color='red', label='$\\pm 2\\sigma$')
axes[1].set_title(f'Conditional $p(x_1 \\mid x_2={x2_obs})$')
axes[1].set_xlabel('$x_1$')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nNote: conditioning narrows the distribution (var {var_x1_marginal:.2f} -> {var_cond:.2f})")
print(f"and shifts the mean ({mu_x1_marginal:.2f} -> {mu_cond:.2f}) based on the observed x2.")

6. Conjugacy and the Exponential Family (MML Section 6.6)

Conjugate Priors

A prior is conjugate for the likelihood if the posterior is of the same form/type as the prior (Def. 6.13). This allows closed-form Bayesian updates.

Beta-Bernoulli Conjugacy (MML Example 6.12)

  • Likelihood: \(p(x \mid \theta) = \theta^x (1-\theta)^{1-x}\) (Bernoulli)

  • Prior: \(p(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\) (Beta)

  • Posterior: \(p(\theta \mid x, \alpha, \beta) = \text{Beta}(\alpha + x, \beta + (1-x))\)

Exponential Family (MML Section 6.6.3)

A distribution is in the exponential family if it can be written as:

\[p(\mathbf{x} \mid \boldsymbol{\theta}) = h(\mathbf{x}) \exp\left(\langle\boldsymbol{\theta}, \boldsymbol{\phi}(\mathbf{x})\rangle - A(\boldsymbol{\theta})\right) \quad \text{(Eq. 6.107)}\]

where \(\boldsymbol{\phi}(\mathbf{x})\) are sufficient statistics, \(\boldsymbol{\theta}\) are natural parameters, and \(A(\boldsymbol{\theta})\) is the log-partition function.
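A quick numerical sanity check (using the Bernoulli parameterization of MML Example 6.14, with \(h(x) = 1\)): the exponential-family form \(\exp(x\theta - A(\theta))\) with \(\theta = \log\frac{\mu}{1-\mu}\) and \(A(\theta) = \log(1 + e^\theta)\) should reproduce the standard PMF \(\mu^x(1-\mu)^{1-x}\) exactly.

```python
import numpy as np

# Verify: mu^x * (1-mu)^(1-x) == exp(x*theta - A(theta)) for the Bernoulli
for mu in [0.2, 0.5, 0.8]:
    theta = np.log(mu / (1 - mu))   # natural parameter (log-odds)
    A = np.log1p(np.exp(theta))     # log-partition function
    for x in [0, 1]:
        p_standard = mu**x * (1 - mu)**(1 - x)
        p_expfam = np.exp(x * theta - A)
        print(f"mu={mu}, x={x}: standard={p_standard:.6f}, expfam={p_expfam:.6f}")
```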

# --- Beta Distribution ---
def beta_pdf(mu, alpha, beta_param):
    """Beta PDF: p(mu|alpha,beta) = Gamma(a+b)/(Gamma(a)*Gamma(b)) * mu^(a-1) * (1-mu)^(b-1)
    Using log-gamma for numerical stability."""
    from math import lgamma
    log_norm = lgamma(alpha + beta_param) - lgamma(alpha) - lgamma(beta_param)
    # Avoid log(0)
    mu = np.clip(mu, 1e-10, 1 - 1e-10)
    log_pdf = log_norm + (alpha - 1) * np.log(mu) + (beta_param - 1) * np.log(1 - mu)
    return np.exp(log_pdf)

mu_range = np.linspace(0.001, 0.999, 500)

fig, ax = plt.subplots(figsize=(8, 5))
params = [(0.5, 0.5), (1, 1), (2, 0.3), (4, 10), (5, 1)]
for a, b in params:
    ax.plot(mu_range, beta_pdf(mu_range, a, b), linewidth=2, label=f'$\\alpha={a}, \\beta={b}$')
ax.set_xlabel('$\\mu$')
ax.set_ylabel('$p(\\mu \\mid \\alpha, \\beta)$')
ax.set_title('Beta Distribution (MML Figure 6.11)')
ax.legend()
ax.set_ylim(0, 10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# --- Beta-Bernoulli Conjugacy: Sequential Bayesian Update ---
# MML Example 6.12: posterior = Beta(alpha + x, beta + (1-x))
# For multiple observations: posterior = Beta(alpha + sum(x), beta + N - sum(x))

true_theta = 0.7  # true coin bias
np.random.seed(42)
observations = (np.random.rand(50) < true_theta).astype(int)  # coin flips

# Start with uniform prior Beta(1, 1)
alpha_prior, beta_prior = 1.0, 1.0

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
mu_range = np.linspace(0.001, 0.999, 500)
update_points = [0, 1, 3, 10, 25, 50]

for idx, n_obs in enumerate(update_points):
    ax = axes[idx // 3, idx % 3]
    obs_so_far = observations[:n_obs]
    n_heads = obs_so_far.sum()
    n_tails = n_obs - n_heads
    
    alpha_post = alpha_prior + n_heads
    beta_post = beta_prior + n_tails
    
    pdf = beta_pdf(mu_range, alpha_post, beta_post)
    ax.plot(mu_range, pdf, 'b-', linewidth=2)
    ax.fill_between(mu_range, pdf, alpha=0.2)
    ax.axvline(true_theta, color='red', linestyle='--', linewidth=1.5, label=f'True $\\theta$={true_theta}')
    
    post_mean = alpha_post / (alpha_post + beta_post)
    ax.axvline(post_mean, color='green', linestyle=':', linewidth=1.5, label=f'Post. mean={post_mean:.2f}')
    
    ax.set_title(f'N={n_obs}: Beta({alpha_post:.0f}, {beta_post:.0f})')
    ax.set_xlabel('$\\theta$')
    ax.legend(fontsize=8)
    ax.set_xlim(0, 1)

plt.suptitle('Beta-Bernoulli Conjugacy: Sequential Posterior Update', fontsize=13, y=1.01)
plt.tight_layout()
plt.show()

print(f"After {len(observations)} observations ({observations.sum()} heads, {len(observations)-observations.sum()} tails):")
alpha_final = alpha_prior + observations.sum()
beta_final = beta_prior + len(observations) - observations.sum()
print(f"  Posterior: Beta({alpha_final:.0f}, {beta_final:.0f})")
print(f"  Posterior mean: {alpha_final/(alpha_final+beta_final):.4f}")
print(f"  True theta:     {true_theta}")
print(f"  MLE (data only): {observations.mean():.4f}")
# --- Gaussian with Known Variance: Conjugate Prior Update ---
# Likelihood: x ~ N(mu, sigma2_known)
# Prior on mu: mu ~ N(mu_0, sigma2_0)
# Posterior: mu | x ~ N(mu_N, sigma2_N) where
#   sigma2_N = 1 / (1/sigma2_0 + N/sigma2_known)
#   mu_N = sigma2_N * (mu_0/sigma2_0 + sum(x)/sigma2_known)

true_mu = 5.0
sigma2_known = 2.0  # known variance of the data

# Prior
mu_0 = 0.0
sigma2_0 = 10.0

np.random.seed(123)
all_data = np.random.normal(true_mu, np.sqrt(sigma2_known), 100)

x_range = np.linspace(-5, 10, 500)
fig, ax = plt.subplots(figsize=(10, 5))

# Prior
ax.plot(x_range, gaussian_pdf(x_range, mu_0, sigma2_0), 'k--', linewidth=2, label=f'Prior: N({mu_0}, {sigma2_0})')

colors = ['blue', 'green', 'orange', 'red']
for i, N_obs in enumerate([1, 5, 20, 100]):
    data_subset = all_data[:N_obs]
    sigma2_N = 1.0 / (1.0/sigma2_0 + N_obs/sigma2_known)
    mu_N = sigma2_N * (mu_0/sigma2_0 + data_subset.sum()/sigma2_known)
    ax.plot(x_range, gaussian_pdf(x_range, mu_N, sigma2_N), color=colors[i], linewidth=2,
            label=f'N={N_obs}: N({mu_N:.2f}, {sigma2_N:.3f})')

ax.axvline(true_mu, color='red', linestyle=':', alpha=0.5, label=f'True $\\mu$={true_mu}')
ax.set_xlabel('$\\mu$')
ax.set_ylabel('Density')
ax.set_title('Gaussian Conjugate Prior: Posterior Update for Mean')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("As N grows, the posterior concentrates around the true mean and the prior becomes irrelevant.")
# --- Exponential Family Form ---
# Bernoulli as exponential family (MML Example 6.14, Eq. 6.113)
# p(x|mu) = exp[x * log(mu/(1-mu)) + log(1-mu)]
# Natural parameter: theta = log(mu/(1-mu))  (log-odds / logit)
# Sufficient statistic: phi(x) = x
# Log-partition: A(theta) = -log(1-mu) = log(1 + exp(theta))

mu_vals = np.linspace(0.01, 0.99, 200)
theta_vals = np.log(mu_vals / (1 - mu_vals))  # natural parameter

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Sigmoid: mu as function of theta
theta_range = np.linspace(-6, 6, 200)
mu_sigmoid = 1.0 / (1.0 + np.exp(-theta_range))  # MML Eq. 6.118
axes[0].plot(theta_range, mu_sigmoid, 'b-', linewidth=2)
axes[0].set_xlabel('Natural parameter $\\theta = \\log(\\mu/(1-\\mu))$')
axes[0].set_ylabel('$\\mu = \\sigma(\\theta) = 1/(1+e^{-\\theta})$')
axes[0].set_title('Sigmoid: $\\mu$ vs Natural Parameter $\\theta$')
axes[0].axhline(0.5, color='gray', linestyle=':', alpha=0.5)
axes[0].axvline(0, color='gray', linestyle=':', alpha=0.5)
axes[0].grid(True, alpha=0.3)

# Log-partition function A(theta) = log(1 + exp(theta))
A_theta = np.log(1 + np.exp(theta_range))
axes[1].plot(theta_range, A_theta, 'r-', linewidth=2, label='$A(\\theta) = \\log(1+e^\\theta)$')
# dA/dtheta = sigmoid(theta) = E[x] = mu
dA = mu_sigmoid
axes[1].plot(theta_range, dA, 'b--', linewidth=2, label="$A'(\\theta) = \\sigma(\\theta) = E[x]$")
axes[1].set_xlabel('$\\theta$')
axes[1].set_title('Bernoulli: Log-Partition and its Derivative')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Verify: Gaussian in exponential family form (MML Example 6.13)
# phi(x) = [x, x^2], theta = [mu/sigma^2, -1/(2*sigma^2)]
mu_g, sigma2_g = 3.0, 2.0
theta_gauss = np.array([mu_g / sigma2_g, -1.0 / (2 * sigma2_g)])
print("Gaussian N(mu=3, sigma^2=2) in exponential family form:")
print(f"  Sufficient statistics: phi(x) = [x, x^2]")
print(f"  Natural parameters:   theta = [mu/sigma^2, -1/(2*sigma^2)] = {theta_gauss}")
print(f"  Recover: mu = -{theta_gauss[0]}/(2*{theta_gauss[1]}) = {-theta_gauss[0]/(2*theta_gauss[1]):.1f}")
print(f"           sigma^2 = -1/(2*{theta_gauss[1]}) = {-1/(2*theta_gauss[1]):.1f}")

7. Change of Variables / Inverse Transform (MML Section 6.7)

Distribution Function Technique

Given \(X\) with PDF \(f(x)\) and an invertible transformation \(Y = U(X)\):

  1. Find the CDF: \(F_Y(y) = P(Y \leq y)\)

  2. Differentiate to get the PDF: \(f(y) = \frac{d}{dy} F_Y(y)\)

Change of Variables Formula (MML Eq. 6.143)

For a univariate invertible function \(U\):

\[f(y) = f_x(U^{-1}(y)) \cdot \left|\frac{d}{dy} U^{-1}(y)\right|\]
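An illustrative check of the formula (the transformation \(Y = X^2\) is an assumption, not from the text): with \(X \sim \text{Uniform}(0,1)\) and \(U^{-1}(y) = \sqrt{y}\), the formula gives \(f(y) = 1 \cdot |\tfrac{d}{dy}\sqrt{y}| = \tfrac{1}{2\sqrt{y}}\) on \((0, 1]\). Interval probabilities under this density should match the empirical fractions.

```python
import numpy as np

# Change-of-variables check: X ~ Uniform(0,1), Y = X^2
# f(y) = f_x(sqrt(y)) * |d/dy sqrt(y)| = 1 / (2 * sqrt(y)) on (0, 1]
rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 200_000) ** 2

# Integral of 1/(2 sqrt(y)) from a to b is sqrt(b) - sqrt(a)
for a, b in [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]:
    frac = np.mean((y >= a) & (y < b))
    theory = np.sqrt(b) - np.sqrt(a)
    print(f"P({a} <= Y < {b}): empirical {frac:.4f}, theory {theory:.4f}")
```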

Probability Integral Transform (Theorem 6.15)

If \(X\) has CDF \(F_X\), then \(Y = F_X(X) \sim \text{Uniform}(0,1)\).

Conversely, if \(U \sim \text{Uniform}(0,1)\), then \(X = F_X^{-1}(U)\) has CDF \(F_X\). This is the inverse transform method for sampling.
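The code below demonstrates the inverse direction; the forward direction of Theorem 6.15 can be checked too. A sketch (using numpy's own exponential sampler, so independent of the inverse-transform code): applying the exponential CDF \(F(x) = 1 - e^{-\lambda x}\) to exponential samples should yield Uniform(0,1) values.

```python
import numpy as np

# Probability integral transform: if X has CDF F_X, then F_X(X) ~ Uniform(0,1).
# Check with the exponential distribution (lambda = 2): F(x) = 1 - exp(-2x).
rng = np.random.default_rng(1)
lam = 2.0
x = rng.exponential(scale=1 / lam, size=100_000)
u = 1 - np.exp(-lam * x)  # apply the CDF to the samples

# Uniform(0,1) has mean 1/2, variance 1/12, and P(U <= t) = t
print(f"mean(u) = {u.mean():.4f}  (Uniform: 0.5)")
print(f"var(u)  = {u.var():.4f}  (Uniform: {1/12:.4f})")
for t in [0.25, 0.5, 0.75]:
    print(f"P(U <= {t}) = {np.mean(u <= t):.4f}  (should be {t})")
```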

# --- Inverse CDF Method: Uniform -> Exponential ---
# Exponential distribution: f(x) = lambda * exp(-lambda * x), x >= 0
# CDF: F(x) = 1 - exp(-lambda * x)
# Inverse CDF: F^{-1}(u) = -log(1-u) / lambda

lam = 2.0  # rate parameter
n_samples = 10000

# Step 1: Draw uniform samples
u_samples = np.random.uniform(0, 1, n_samples)

# Step 2: Apply inverse CDF
exp_samples = -np.log(1 - u_samples) / lam

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Uniform samples
axes[0].hist(u_samples, bins=50, density=True, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Step 1: U ~ Uniform(0,1)')
axes[0].set_xlabel('u')
axes[0].set_ylabel('Density')

# Transformed samples
axes[1].hist(exp_samples, bins=50, density=True, alpha=0.7, color='salmon', edgecolor='black', label='Samples')
x_exp = np.linspace(0, 5, 200)
axes[1].plot(x_exp, lam * np.exp(-lam * x_exp), 'k-', linewidth=2, label=f'Exp($\\lambda={lam}$) PDF')
axes[1].set_title('Step 2: $X = -\\ln(1-U)/\\lambda$')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Density')
axes[1].legend()

# Show the inverse CDF mapping
u_plot = np.linspace(0.01, 0.99, 200)
x_plot = -np.log(1 - u_plot) / lam
axes[2].plot(u_plot, x_plot, 'b-', linewidth=2)
axes[2].set_xlabel('u (uniform)')
axes[2].set_ylabel('$x = F^{-1}(u)$')
axes[2].set_title(f'Inverse CDF: Exp($\\lambda={lam}$)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Exponential(lambda={lam}):")
print(f"  Theoretical mean = 1/lambda = {1/lam:.4f}")
print(f"  Sample mean      = {exp_samples.mean():.4f}")
print(f"  Theoretical var  = 1/lambda^2 = {1/lam**2:.4f}")
print(f"  Sample var       = {exp_samples.var():.4f}")
# --- Box-Muller Transform: Uniform -> Normal ---
# Two independent uniform samples -> two independent standard normal samples
# x1 = sqrt(-2*ln(u1)) * cos(2*pi*u2)
# x2 = sqrt(-2*ln(u1)) * sin(2*pi*u2)

n_samples = 10000
u1 = np.random.uniform(0, 1, n_samples)
u2 = np.random.uniform(0, 1, n_samples)

# Box-Muller transform
z1 = np.sqrt(-2 * np.log(u1)) * np.cos(2 * np.pi * u2)
z2 = np.sqrt(-2 * np.log(u1)) * np.sin(2 * np.pi * u2)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Before: uniform in [0,1]^2
axes[0].scatter(u1[:2000], u2[:2000], alpha=0.2, s=2)
axes[0].set_xlabel('$u_1$')
axes[0].set_ylabel('$u_2$')
axes[0].set_title('Before: Uniform Samples')
axes[0].set_xlim(0, 1)
axes[0].set_ylim(0, 1)
axes[0].set_aspect('equal')

# After: standard normal
axes[1].scatter(z1[:2000], z2[:2000], alpha=0.2, s=2)
axes[1].set_xlabel('$z_1$')
axes[1].set_ylabel('$z_2$')
axes[1].set_title('After: Box-Muller Normal Samples')
axes[1].set_xlim(-4, 4)
axes[1].set_ylim(-4, 4)
axes[1].set_aspect('equal')

# Histogram of z1
axes[2].hist(z1, bins=60, density=True, alpha=0.7, color='skyblue', edgecolor='black', label='Samples')
x_norm = np.linspace(-4, 4, 200)
axes[2].plot(x_norm, gaussian_pdf(x_norm, 0, 1), 'r-', linewidth=2, label='$\\mathcal{N}(0,1)$ PDF')
axes[2].set_title('Histogram of $z_1$ vs $\\mathcal{N}(0,1)$')
axes[2].set_xlabel('$z_1$')
axes[2].legend()

plt.tight_layout()
plt.show()

print(f"Box-Muller z1: mean={z1.mean():.4f} (expect 0), var={z1.var():.4f} (expect 1)")
print(f"Box-Muller z2: mean={z2.mean():.4f} (expect 0), var={z2.var():.4f} (expect 1)")
print(f"Cov(z1,z2) = {np.cov(z1, z2)[0,1]:.4f} (expect ~0, independent)")
# --- Change of Variables: PDF Transformation (MML Eq. 6.143) ---
# Example: X with pdf f_X(x) = 3x^2 on [0,1] (MML Example 6.16)
# Transformation Y = X^2
# Theoretical result: f_Y(y) = (3/2) * y^(1/2) on [0,1]

# Using change of variables formula:
# U(x) = x^2, so U^{-1}(y) = sqrt(y)
# d/dy U^{-1}(y) = 1/(2*sqrt(y))
# f_Y(y) = f_X(sqrt(y)) * |1/(2*sqrt(y))| = 3*(sqrt(y))^2 * 1/(2*sqrt(y)) = 3y / (2*sqrt(y)) = (3/2)*sqrt(y)

# Sample X from f(x)=3x^2 using inverse CDF
# CDF: F(x) = x^3, inverse: F^{-1}(u) = u^{1/3}
n = 50000
u = np.random.uniform(0, 1, n)
X_samples = u**(1.0/3.0)  # samples from f(x) = 3x^2
Y_samples = X_samples**2   # apply transformation Y = X^2

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Original distribution f(x) = 3x^2
x_range = np.linspace(0, 1, 200)
axes[0].hist(X_samples, bins=60, density=True, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].plot(x_range, 3 * x_range**2, 'r-', linewidth=2, label='$f_X(x) = 3x^2$')
axes[0].set_title('Original: $X \\sim f(x) = 3x^2$')
axes[0].set_xlabel('x')
axes[0].legend()

# Transformed distribution
y_range = np.linspace(0.001, 1, 200)
f_Y_theory = 1.5 * np.sqrt(y_range)  # (3/2) * y^{1/2}
axes[1].hist(Y_samples, bins=60, density=True, alpha=0.7, color='salmon', edgecolor='black')
axes[1].plot(y_range, f_Y_theory, 'b-', linewidth=2, label='$f_Y(y) = \\frac{3}{2}\\sqrt{y}$')
axes[1].set_title('Transformed: $Y = X^2$')
axes[1].set_xlabel('y')
axes[1].legend()

# Verify CDFs match
y_check = np.linspace(0, 1, 200)
empirical_cdf = np.array([np.mean(Y_samples <= y) for y in y_check])
theoretical_cdf = y_check**(3.0/2.0)  # F_Y(y) = y^{3/2}
axes[2].plot(y_check, empirical_cdf, 'b-', linewidth=2, label='Empirical CDF')
axes[2].plot(y_check, theoretical_cdf, 'r--', linewidth=2, label='$F_Y(y) = y^{3/2}$')
axes[2].set_title('CDF Comparison for $Y = X^2$')
axes[2].set_xlabel('y')
axes[2].set_ylabel('$F_Y(y)$')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Max CDF error: {np.max(np.abs(empirical_cdf - theoretical_cdf)):.4f}")
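MML Section 6.7 also treats the multivariate case, where the derivative in Eq. 6.143 becomes a Jacobian determinant: for an invertible linear map \(y = Ax\), \(f_Y(y) = f_X(A^{-1}y) / |\det(A)|\). A rough sanity check under an arbitrary choice of \(A\) (our own sketch): samples of \(Ax\) with \(x \sim \mathcal{N}(0, I)\) should have covariance \(AA^\top\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Multivariate change of variables for an invertible linear map y = A x:
# f_Y(y) = f_X(A^{-1} y) / |det(A)|.
# For x ~ N(0, I) this makes y ~ N(0, A A^T); check via the sample covariance.
A = np.array([[2.0, 0.5],
              [0.3, 1.0]])
x = rng.standard_normal((100_000, 2))
y = x @ A.T

Sigma_theory = A @ A.T
Sigma_sample = np.cov(y.T)
print("Theoretical covariance A A^T:\n", Sigma_theory)
print("Sample covariance:\n", np.round(Sigma_sample, 3))
print("Max abs difference:", np.max(np.abs(Sigma_sample - Sigma_theory)))
```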

Summary

This lab covered the core probability concepts from MML Chapter 6:

| Topic | Key Concept | MML Reference |
| --- | --- | --- |
| Probability Space | \((\Omega, \mathcal{A}, P)\) with Kolmogorov axioms | Section 6.1 |
| Discrete/Continuous | PMF, PDF, CDF; Bernoulli, Binomial | Section 6.2, Eqs. 6.92, 6.95 |
| Sum & Product Rules | Marginalization and factorization of joint distributions | Section 6.3, Eqs. 6.20, 6.22 |
| Bayes’ Theorem | Posterior = Likelihood x Prior / Evidence | Eq. 6.23 |
| Summary Statistics | Mean, variance (3 forms), covariance, correlation | Section 6.4, Eqs. 6.35-6.44 |
| Gaussian | Univariate (Eq. 6.62) and multivariate (Eq. 6.63); marginals and conditionals | Section 6.5 |
| Conjugacy | Beta-Bernoulli, Gaussian with known variance | Section 6.6.1, Examples 6.11-6.12 |
| Exponential Family | \(p(x \mid \theta) = h(x)\exp(\theta^\top \phi(x) - A(\theta))\) | Section 6.6.3 |
| Change of Variables | Inverse CDF method; \(f_Y(y) = f_X(U^{-1}(y)) \, \lvert \frac{d}{dy} U^{-1}(y) \rvert\) | Section 6.7, Eq. 6.143 |

Key takeaways for ML:

  • Bayes’ theorem is the foundation of Bayesian inference: updating beliefs with data.

  • The Gaussian distribution is central because marginals, conditionals, and linear transformations all remain Gaussian.

  • Conjugate priors enable closed-form posterior updates, avoiding costly numerical integration.

  • The exponential family unifies many common distributions and guarantees finite sufficient statistics.

  • Change of variables / inverse CDF enables sampling from complex distributions via simple uniform samples.