Lab 09: Density Estimation with Gaussian Mixture Models
Based on Chapter 11 of Mathematics for Machine Learning (Deisenroth, Faisal, Ong)
In this lab we implement Gaussian Mixture Models from scratch, covering:
| Section | Topic | MML Reference |
|---|---|---|
| 1 | Gaussian Mixture Model definition | 11.1 |
| 2 | Parameter Learning via MLE | 11.2 |
| 3 | EM Algorithm (full implementation) | 11.3 |
| 4 | K-Means as Hard EM | -- |
| 5 | Convergence analysis | 11.3 |
| 6 | Initialization sensitivity | -- |
| 7 | Model selection (BIC) | -- |
| 8 | Latent variable perspective | 11.4 |
We use only NumPy and Matplotlib -- no scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
np.random.seed(42)
print("NumPy version:", np.__version__)
print("Setup complete.")
1. Gaussian Mixture Model (MML 11.1)
A single Gaussian distribution has limited modelling power -- it cannot capture multi-modal data (data with several clusters). A Gaussian Mixture Model (GMM) addresses this by combining \(K\) Gaussian components:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
where
\(\pi_k\) are the mixture weights satisfying \(0 \leq \pi_k \leq 1\) and \(\sum_k \pi_k = 1\),
\(\boldsymbol{\mu}_k \in \mathbb{R}^D\) is the mean of the \(k\)-th component,
\(\boldsymbol{\Sigma}_k \in \mathbb{R}^{D \times D}\) is the covariance matrix of the \(k\)-th component.
The collection of all parameters is \(\boldsymbol{\theta} := \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k : k = 1, \ldots, K\}\).
Let us generate data from a 2D GMM with \(K=3\) components.
def multivariate_normal_pdf(x, mu, Sigma):
"""Evaluate the multivariate Gaussian density N(x | mu, Sigma).
Parameters
----------
x : (N, D) array -- data points
mu : (D,) array -- mean
Sigma : (D, D) array -- covariance
Returns
-------
(N,) array of density values
"""
D = mu.shape[0]
diff = x - mu # (N, D)
Sigma_inv = np.linalg.inv(Sigma)
det_Sigma = np.linalg.det(Sigma)
# Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
mahal = np.sum(diff @ Sigma_inv * diff, axis=1) # (N,)
norm_const = 1.0 / (np.sqrt((2 * np.pi) ** D * det_Sigma))
return norm_const * np.exp(-0.5 * mahal)
# --- True GMM parameters (K=3, D=2) ---
true_pis = np.array([0.3, 0.5, 0.2])
true_mus = np.array([
[-3.0, -3.0],
[ 0.0, 0.0],
[ 3.0, 4.0]
])
true_Sigmas = np.array([
[[1.0, 0.5], [0.5, 1.0]],
[[1.5, -0.3], [-0.3, 0.8]],
[[0.8, 0.0], [0.0, 1.2]]
])
def sample_gmm(pis, mus, Sigmas, N):
"""Draw N samples from a GMM."""
K = len(pis)
D = mus.shape[1]
# Step 1: choose components according to pi
components = np.random.choice(K, size=N, p=pis)
# Step 2: sample from the chosen Gaussian
X = np.empty((N, D))
for n in range(N):
k = components[n]
X[n] = np.random.multivariate_normal(mus[k], Sigmas[k])
return X, components
N = 500
X, true_labels = sample_gmm(true_pis, true_mus, true_Sigmas, N)
print(f"Generated {N} samples from a 2D GMM with K=3 components.")
print(f"Component counts: {[np.sum(true_labels == k) for k in range(3)]}")
print(f"Data shape: {X.shape}")
def draw_ellipse(ax, mu, Sigma, color='black', n_std=2.0, **kwargs):
"""Draw an ellipse representing the n_std contour of a 2D Gaussian."""
eigvals, eigvecs = np.linalg.eigh(Sigma)
angle = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
width, height = 2 * n_std * np.sqrt(eigvals)
ell = Ellipse(xy=mu, width=width, height=height, angle=angle,
edgecolor=color, facecolor='none', linewidth=2, **kwargs)
ax.add_patch(ell)
colors = ['#e74c3c', '#2ecc71', '#3498db'] # red, green, blue
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Left: all data (no labels)
axes[0].scatter(X[:, 0], X[:, 1], s=10, alpha=0.5, c='gray')
axes[0].set_title('Raw Data (Labels Hidden)')
axes[0].set_xlabel('$x_1$'); axes[0].set_ylabel('$x_2$')
axes[0].set_aspect('equal')
# Right: coloured by true component
for k in range(3):
mask = true_labels == k
axes[1].scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5,
c=colors[k], label=f'Component {k+1} ($\\pi_{k+1}={true_pis[k]:.1f}$)')
draw_ellipse(axes[1], true_mus[k], true_Sigmas[k], color=colors[k])
axes[1].plot(*true_mus[k], 'x', color=colors[k], markersize=12, markeredgewidth=3)
axes[1].legend()
axes[1].set_title('True Component Assignments')
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_aspect('equal')
plt.tight_layout()
plt.show()
print("Left: the raw data we observe in practice (no labels).")
print("Right: the ground truth we want the algorithm to recover.")
2. Parameter Learning via Maximum Likelihood (MML 11.2)
Given an i.i.d. dataset \(\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}\), the log-likelihood is:

$$\log p(\mathcal{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
Why direct MLE is hard
For a single Gaussian the log moves inside the sum and we get a closed-form solution. For a mixture the \(\log\) of a sum has no closed-form optimum because we cannot separate the per-component terms.
Responsibilities (11.2.1)
Taking derivatives and setting them to zero introduces the responsibilities:

$$r_{nk} := \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
\(r_{nk}\) is the probability that component \(k\) generated data point \(\mathbf{x}_n\). It is a soft assignment -- each point has a distribution over all \(K\) components.
The necessary conditions for a stationary point yield coupled update equations:
| Parameter | Update | Eq. |
|---|---|---|
| \(\boldsymbol{\mu}_k\) | \(\displaystyle\frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\, \mathbf{x}_n\) | (11.20) |
| \(\boldsymbol{\Sigma}_k\) | \(\displaystyle\frac{1}{N_k}\sum_{n=1}^{N} r_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top\) | (11.30) |
| \(\pi_k\) | \(N_k / N\) | (11.42) |
where \(N_k = \sum_{n=1}^{N} r_{nk}\) is the total responsibility of component \(k\).
These equations depend on each other through \(r_{nk}\), so we cannot solve them in one shot -- we need an iterative algorithm.
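To see the coupling concretely, here is a minimal sketch (separate from the lab's main implementation, with the variances and weights held fixed as an assumption) that alternates the two coupled updates on toy 1D data:

```python
# Illustrative sketch: iterate the coupled responsibility/mean updates for a
# toy 1D, 2-component mixture with fixed variances and weights.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 80), rng.normal(2, 0.5, 120)])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu = np.array([-0.5, 0.5])                 # deliberately poor initial means
pi, sigma = np.array([0.5, 0.5]), 0.5
for _ in range(50):
    # responsibilities depend on the current means ...
    r = pi * normal_pdf(x[:, None], mu[None, :], sigma)
    r /= r.sum(axis=1, keepdims=True)
    # ... and the mean updates depend on the responsibilities
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(mu.round(2))  # the means should end up near -2 and 2
```

Each pass uses the current means to recompute \(r_{nk}\), then uses \(r_{nk}\) to move the means -- exactly the fixed-point structure that the EM algorithm formalises.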
# Demonstrate why direct MLE is hard: compute the log-likelihood surface
# for a simple 1D, 2-component GMM and show it has multiple local optima.
# Simple 1D data
data_1d = np.concatenate([
np.random.normal(-2, 0.5, 80),
np.random.normal(2, 0.8, 120)
])
def gmm_log_likelihood_1d(data, pi1, mu1, mu2, sigma1=0.5, sigma2=0.8):
"""Log-likelihood for a 1D, 2-component GMM with fixed variances."""
pi2 = 1.0 - pi1
p1 = pi1 * (1.0 / np.sqrt(2 * np.pi * sigma1**2)) * np.exp(-0.5 * ((data - mu1) / sigma1)**2)
p2 = pi2 * (1.0 / np.sqrt(2 * np.pi * sigma2**2)) * np.exp(-0.5 * ((data - mu2) / sigma2)**2)
return np.sum(np.log(p1 + p2 + 1e-300))
mu1_range = np.linspace(-5, 5, 100)
mu2_range = np.linspace(-5, 5, 100)
M1, M2 = np.meshgrid(mu1_range, mu2_range)
LL = np.zeros_like(M1)
for i in range(M1.shape[0]):
for j in range(M1.shape[1]):
LL[i, j] = gmm_log_likelihood_1d(data_1d, 0.4, M1[i, j], M2[i, j])
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Left: the 1D data
axes[0].hist(data_1d, bins=30, density=True, alpha=0.6, color='steelblue')
axes[0].set_title('1D Mixture Data (2 Components)')
axes[0].set_xlabel('x'); axes[0].set_ylabel('Density')
# Right: log-likelihood as function of (mu1, mu2)
cs = axes[1].contourf(M1, M2, LL, levels=40, cmap='viridis')
plt.colorbar(cs, ax=axes[1], label='Log-likelihood')
axes[1].set_xlabel('$\\mu_1$'); axes[1].set_ylabel('$\\mu_2$')
axes[1].set_title('Log-likelihood surface (fixed $\\sigma_1, \\sigma_2, \\pi_1$)')
# Mark the two symmetric optima
axes[1].plot(-2, 2, 'r*', markersize=15, label='True ($\\mu_1$=-2, $\\mu_2$=2)')
axes[1].plot(2, -2, 'w*', markersize=15, label='Label-swapped optimum')
axes[1].legend(fontsize=10)
axes[1].set_aspect('equal')
plt.tight_layout()
plt.show()
print("The log-likelihood surface has TWO symmetric optima (label-swapping symmetry).")
print("This multi-modality makes direct optimisation difficult -- we use EM instead.")
3. EM Algorithm (MML 11.3)
The Expectation-Maximisation (EM) algorithm (Dempster et al., 1977) is an iterative procedure that alternates between two steps:
E-step: Evaluate the responsibilities \(r_{nk}\) for every data point using the current parameters:

$$r_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \qquad (11.53)$$
M-step: Re-estimate the parameters using the current responsibilities:

$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\, \mathbf{x}_n \qquad (11.54)$$

$$\boldsymbol{\Sigma}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \qquad (11.55)$$

$$\pi_k = \frac{N_k}{N} \qquad (11.56)$$
where \(N_k = \sum_{n=1}^{N} r_{nk}\).
Key property: Every EM iteration increases the log-likelihood (Neal and Hinton, 1999), guaranteeing convergence to a local optimum.
def compute_log_likelihood(X, pis, mus, Sigmas):
"""Compute the GMM log-likelihood (Eq. 11.10)."""
N = X.shape[0]
K = len(pis)
# Weighted densities: (N, K)
weighted = np.zeros((N, K))
for k in range(K):
weighted[:, k] = pis[k] * multivariate_normal_pdf(X, mus[k], Sigmas[k])
# Sum over components, then log, then sum over data
return np.sum(np.log(np.sum(weighted, axis=1) + 1e-300))
def e_step(X, pis, mus, Sigmas):
"""E-step: compute responsibilities r_nk (Eq. 11.53).
Returns
-------
R : (N, K) array of responsibilities
"""
N = X.shape[0]
K = len(pis)
R = np.zeros((N, K))
for k in range(K):
R[:, k] = pis[k] * multivariate_normal_pdf(X, mus[k], Sigmas[k])
# Normalise each row so responsibilities sum to 1
row_sums = np.sum(R, axis=1, keepdims=True) + 1e-300
R = R / row_sums
return R
def m_step(X, R):
"""M-step: update pis, mus, Sigmas (Eqs. 11.54-11.56).
Parameters
----------
X : (N, D) data
R : (N, K) responsibilities
Returns
-------
pis, mus, Sigmas : updated parameters
"""
N, D = X.shape
K = R.shape[1]
N_k = np.sum(R, axis=0) # (K,) -- total responsibility per component
# Update means (Eq. 11.54)
mus = np.zeros((K, D))
for k in range(K):
mus[k] = np.sum(R[:, k:k+1] * X, axis=0) / N_k[k]
# Update covariances (Eq. 11.55)
Sigmas = np.zeros((K, D, D))
for k in range(K):
diff = X - mus[k] # (N, D)
Sigmas[k] = (R[:, k:k+1] * diff).T @ diff / N_k[k]
# Add small regularisation for numerical stability
Sigmas[k] += 1e-6 * np.eye(D)
# Update mixture weights (Eq. 11.56)
pis = N_k / N
return pis, mus, Sigmas
def run_em(X, K, max_iter=100, tol=1e-6, seed=0):
"""Run the full EM algorithm for a GMM.
Returns
-------
pis, mus, Sigmas : final parameters
R : final responsibilities
log_likelihoods : list of log-likelihood at each iteration
history : list of (pis, mus, Sigmas, R) at each iteration
"""
np.random.seed(seed)
N, D = X.shape
# --- Initialisation ---
# Pick K random data points as initial means
indices = np.random.choice(N, K, replace=False)
mus = X[indices].copy()
# Identity covariances
Sigmas = np.array([np.eye(D) for _ in range(K)])
# Uniform weights
pis = np.ones(K) / K
log_likelihoods = []
history = []
for iteration in range(max_iter):
# E-step
R = e_step(X, pis, mus, Sigmas)
# M-step
pis, mus, Sigmas = m_step(X, R)
# Compute log-likelihood
ll = compute_log_likelihood(X, pis, mus, Sigmas)
log_likelihoods.append(ll)
history.append((pis.copy(), mus.copy(), Sigmas.copy(), R.copy()))
# Check convergence
if iteration > 0 and abs(log_likelihoods[-1] - log_likelihoods[-2]) < tol:
break
return pis, mus, Sigmas, R, log_likelihoods, history
print("EM functions defined: e_step, m_step, run_em")
print("Ready to fit a GMM.")
# Run EM on our 2D data
K = 3
pis, mus, Sigmas, R, lls, history = run_em(X, K, max_iter=100, seed=42)
print(f"EM converged after {len(lls)} iterations.")
print(f"\nLearned mixture weights: {pis.round(3)}")
print(f"True mixture weights: {true_pis}")
print(f"\nLearned means:")
for k in range(K):
print(f" mu_{k+1} = {mus[k].round(3)}")
print(f"True means:")
for k in range(K):
print(f" mu_{k+1} = {true_mus[k]}")
# Visualise EM iterations: show cluster assignments evolving
iterations_to_show = sorted({0, 1, 3, len(history)-1})  # dedupe in case EM converged early
iterations_to_show = [i for i in iterations_to_show if i < len(history)]
fig, axes = plt.subplots(1, len(iterations_to_show), figsize=(5*len(iterations_to_show), 5))
if len(iterations_to_show) == 1:
axes = [axes]
for ax_idx, it in enumerate(iterations_to_show):
ax = axes[ax_idx]
p, m, S, r = history[it]
# Colour each point by its responsibilities (soft assignment)
# Use RGB from responsibilities
rgb = np.zeros((N, 3))
color_basis = np.array([[0.91, 0.30, 0.24], # red
[0.18, 0.80, 0.44], # green
[0.20, 0.58, 0.86]]) # blue
for k in range(K):
rgb += r[:, k:k+1] * color_basis[k:k+1]
rgb = np.clip(rgb, 0, 1)
ax.scatter(X[:, 0], X[:, 1], c=rgb, s=10, alpha=0.7)
# Draw component ellipses
for k in range(K):
draw_ellipse(ax, m[k], S[k], color=colors[k], n_std=2.0)
ax.plot(*m[k], 'x', color='black', markersize=12, markeredgewidth=3)
ax.set_title(f'Iteration {it+1}', fontsize=14)
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_aspect('equal')
plt.suptitle('EM Algorithm: Evolution of Cluster Assignments', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
print("Colours show soft assignments: each point is a blend of component colours.")
print("As EM progresses, assignments become more confident (purer colours).")
4. K-Means as Hard EM
K-Means is a special case of EM where we make hard assignments rather than soft ones:
| | EM (soft) | K-Means (hard) |
|---|---|---|
| Assignment | \(r_{nk} \in [0,1]\) (probability) | \(r_{nk} \in \{0,1\}\) (binary) |
| Each point belongs to | All clusters (weighted) | Exactly one cluster |
| Covariances | Learned per component | Implicitly \(\sigma^2 I\) for all |
| Weights | Learned \(\pi_k\) | Implicitly uniform |
K-Means corresponds to EM on a GMM where:
All covariances are \(\sigma^2 \mathbf{I}\) (isotropic) and \(\sigma^2 \to 0\)
All mixture weights are equal
Responsibilities collapse to hard assignments: \(r_{nk} = 1\) if \(k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2\)
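A small sketch (toy numbers, not part of the lab's dataset) makes the \(\sigma^2 \to 0\) limit visible: with equal weights and a shared isotropic variance, the soft responsibilities collapse to a one-hot assignment as the variance shrinks:

```python
# Sketch: watch soft responsibilities harden as the shared variance shrinks.
import numpy as np

x = np.array([0.4])                  # one 1D point between two means
mus = np.array([0.0, 1.0])           # two components with equal weights

for sigma2 in [1.0, 0.1, 0.01, 0.001]:
    # equal weights cancel; responsibilities come from squared distances only
    log_r = -0.5 * (x[:, None] - mus[None, :]) ** 2 / sigma2
    log_r -= log_r.max(axis=1, keepdims=True)    # stabilise the softmax
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    print(f"sigma^2 = {sigma2:<6} r = {r[0].round(4)}")
```

The point at 0.4 is only slightly closer to the mean at 0, so with a large variance the assignment is nearly 50/50; as \(\sigma^2 \to 0\) the responsibility of the nearest component approaches 1, which is exactly the K-Means assignment rule.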
def run_kmeans(X, K, max_iter=100, seed=0):
"""K-Means clustering (hard EM).
Returns
-------
centroids : (K, D) final centroids
labels : (N,) cluster assignments
inertias : list of total within-cluster distance at each iteration
history : list of (centroids, labels) at each iteration
"""
np.random.seed(seed)
N, D = X.shape
# Initialise centroids from random data points
indices = np.random.choice(N, K, replace=False)
centroids = X[indices].copy()
inertias = []
history = []
for iteration in range(max_iter):
# "E-step" (hard): assign each point to nearest centroid
dists = np.zeros((N, K))
for k in range(K):
dists[:, k] = np.sum((X - centroids[k])**2, axis=1)
labels = np.argmin(dists, axis=1)
# "M-step": recompute centroids
new_centroids = np.zeros((K, D))
for k in range(K):
members = X[labels == k]
if len(members) > 0:
new_centroids[k] = members.mean(axis=0)
else:
new_centroids[k] = centroids[k] # keep old if empty
inertia = sum(np.sum((X[labels == k] - new_centroids[k])**2) for k in range(K))
inertias.append(inertia)
history.append((new_centroids.copy(), labels.copy()))
# Check convergence
if np.allclose(centroids, new_centroids):
break
centroids = new_centroids
return centroids, labels, inertias, history
centroids, km_labels, inertias, km_history = run_kmeans(X, K=3, seed=42)
print(f"K-Means converged after {len(inertias)} iterations.")
print(f"Final centroids:")
for k in range(3):
print(f" centroid_{k+1} = {centroids[k].round(3)}")
# Compare K-Means (hard) vs EM (soft) assignments
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Left: K-Means result (hard boundaries)
for k in range(3):
mask = km_labels == k
axes[0].scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5, c=colors[k])
axes[0].plot(*centroids[k], 'kx', markersize=14, markeredgewidth=3)
axes[0].set_title('K-Means (Hard Assignments)', fontsize=14)
axes[0].set_xlabel('$x_1$'); axes[0].set_ylabel('$x_2$')
axes[0].set_aspect('equal')
# Middle: EM result (soft assignments, coloured by max responsibility)
em_labels = np.argmax(R, axis=1)
for k in range(3):
mask = em_labels == k
axes[1].scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5, c=colors[k])
draw_ellipse(axes[1], mus[k], Sigmas[k], color=colors[k], n_std=2.0)
axes[1].plot(*mus[k], 'kx', markersize=14, markeredgewidth=3)
axes[1].set_title('EM / GMM (Soft Assignments)', fontsize=14)
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_aspect('equal')
# Right: true labels
for k in range(3):
mask = true_labels == k
axes[2].scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5, c=colors[k])
draw_ellipse(axes[2], true_mus[k], true_Sigmas[k], color=colors[k], n_std=2.0)
axes[2].plot(*true_mus[k], 'kx', markersize=14, markeredgewidth=3)
axes[2].set_title('Ground Truth', fontsize=14)
axes[2].set_xlabel('$x_1$'); axes[2].set_ylabel('$x_2$')
axes[2].set_aspect('equal')
plt.suptitle('K-Means vs EM vs Ground Truth', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
print("K-Means produces hard Voronoi-like boundaries (spherical clusters assumed).")
print("EM captures elliptical shapes and provides probabilistic assignments.")
5. Convergence Analysis
A fundamental property of the EM algorithm is that the log-likelihood never decreases from one iteration to the next:

$$\log p(\mathcal{X} \mid \boldsymbol{\theta}^{(t+1)}) \;\geq\; \log p(\mathcal{X} \mid \boldsymbol{\theta}^{(t)})$$
This monotonic increase guarantees convergence to a local optimum (not necessarily the global optimum). Let us verify this empirically.
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# Left: log-likelihood over EM iterations
axes[0].plot(range(1, len(lls)+1), lls, 'o-', color='steelblue', markersize=5)
axes[0].set_xlabel('EM Iteration')
axes[0].set_ylabel('Log-Likelihood')
axes[0].set_title('Log-Likelihood vs. EM Iteration')
axes[0].grid(True, alpha=0.3)
# Verify monotonic increase
diffs = np.diff(lls)
all_non_negative = np.all(diffs >= -1e-10) # allow tiny numerical noise
axes[0].annotate(f'Monotonically increasing: {all_non_negative}',
xy=(0.5, 0.1), xycoords='axes fraction', fontsize=12,
bbox=dict(boxstyle='round', facecolor='lightgreen' if all_non_negative else 'lightyellow'))
# Right: change in log-likelihood (should be >= 0)
axes[1].bar(range(1, len(diffs)+1), diffs, color='steelblue', alpha=0.7)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[1].set_xlabel('EM Iteration')
axes[1].set_ylabel('$\\Delta$ Log-Likelihood')
axes[1].set_title('Change in Log-Likelihood Per Iteration')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Initial log-likelihood: {lls[0]:.2f}")
print(f"Final log-likelihood: {lls[-1]:.2f}")
print(f"All iteration-to-iteration changes >= 0: {all_non_negative}")
print(f"Smallest change: {min(diffs):.2e}")
print(f"\nThis confirms the monotonic convergence property of EM (Neal & Hinton, 1999).")
6. Initialization Sensitivity
EM converges to a local optimum, so different initialisations can yield different solutions. This is a practical concern: poor initialisation may lead to suboptimal clustering.
A common strategy is to run EM multiple times with different random seeds and pick the run with the highest final log-likelihood.
# Run EM with several different random initialisations
n_runs = 6
results = []
for seed in range(n_runs):
p, m, S, r, ll, hist = run_em(X, K=3, max_iter=100, seed=seed*7)
results.append((p, m, S, r, ll, hist))
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
for idx in range(n_runs):
ax = axes[idx // 3, idx % 3]
p, m, S, r, ll, hist = results[idx]
labels_run = np.argmax(r, axis=1)
for k in range(3):
mask = labels_run == k
ax.scatter(X[mask, 0], X[mask, 1], s=8, alpha=0.4, c=colors[k])
draw_ellipse(ax, m[k], S[k], color=colors[k])
ax.plot(*m[k], 'kx', markersize=10, markeredgewidth=2)
ax.set_title(f'Seed {idx*7} | LL = {ll[-1]:.1f} | {len(ll)} iters', fontsize=11)
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_aspect('equal')
plt.suptitle('EM with Different Random Initialisations', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
final_lls = [results[i][4][-1] for i in range(n_runs)]
best_idx = np.argmax(final_lls)
print("Final log-likelihoods for each run:")
for i in range(n_runs):
marker = " <-- BEST" if i == best_idx else ""
print(f" Seed {i*7:2d}: LL = {final_lls[i]:.2f}{marker}")
print(f"\nBest run: seed {best_idx*7} with LL = {final_lls[best_idx]:.2f}")
print("Different seeds can converge to different local optima (though here the data is well-separated).")
# Compare convergence curves for all runs
fig, ax = plt.subplots(figsize=(10, 6))
for idx in range(n_runs):
ll_curve = results[idx][4]
ax.plot(range(1, len(ll_curve)+1), ll_curve, 'o-', markersize=4,
label=f'Seed {idx*7} (final: {ll_curve[-1]:.1f})', alpha=0.8)
ax.set_xlabel('EM Iteration', fontsize=12)
ax.set_ylabel('Log-Likelihood', fontsize=12)
ax.set_title('Convergence Curves for Different Initialisations', fontsize=14)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("All runs show monotonically increasing log-likelihood.")
print("They may converge to different local optima at different speeds.")
7. Model Selection: Choosing \(K\) with BIC
How do we choose the number of components \(K\)? Adding more components always increases the log-likelihood (more parameters = better fit), so we need a criterion that penalises model complexity.
The Bayesian Information Criterion (BIC) balances fit and complexity:

$$\text{BIC} = \log p(\mathcal{X} \mid \boldsymbol{\theta}) - \frac{M}{2}\log N$$
where \(M\) is the number of free parameters and \(N\) is the number of data points. Higher BIC is better (less negative). The penalty \(\frac{M}{2}\log N\) discourages overfitting.
For a GMM with \(K\) components in \(D\) dimensions, the number of free parameters is:

$$M = K \cdot D + K \cdot \frac{D(D+1)}{2} + (K - 1) = K\left(D + \frac{D(D+1)}{2} + 1\right) - 1$$

(means + upper-triangular covariance entries + mixture weights with the constraint \(\sum_k \pi_k = 1\)).
def count_gmm_params(K, D):
"""Count free parameters of a GMM with K components in D dimensions."""
means = K * D
covariances = K * D * (D + 1) // 2
weights = K - 1 # one constraint: they sum to 1
return means + covariances + weights
def compute_bic(ll, n_params, N):
"""Compute BIC = log_likelihood - (M/2) * log(N)."""
return ll - 0.5 * n_params * np.log(N)
# Fit GMMs with K = 1, 2, 3, 4, 5
K_values = [1, 2, 3, 4, 5]
D = 2
bic_scores = []
final_log_likelihoods = []
print(f"{'K':>3} | {'Params':>6} | {'Log-Lik':>10} | {'BIC':>10}")
print("-" * 45)
for K_try in K_values:
# Run EM 3 times, pick best
best_ll = -np.inf
for seed in range(3):
p, m, S, r, ll, hist = run_em(X, K_try, max_iter=150, seed=seed*11)
if ll[-1] > best_ll:
best_ll = ll[-1]
n_params = count_gmm_params(K_try, D)
bic = compute_bic(best_ll, n_params, N)
bic_scores.append(bic)
final_log_likelihoods.append(best_ll)
print(f"{K_try:3d} | {n_params:6d} | {best_ll:10.2f} | {bic:10.2f}")
best_K = K_values[np.argmax(bic_scores)]
print(f"\nBest K by BIC: K = {best_K}")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: log-likelihood vs K
axes[0].plot(K_values, final_log_likelihoods, 'o-', color='steelblue', markersize=8, linewidth=2)
axes[0].set_xlabel('Number of Components $K$', fontsize=12)
axes[0].set_ylabel('Log-Likelihood', fontsize=12)
axes[0].set_title('Log-Likelihood vs. K', fontsize=14)
axes[0].set_xticks(K_values)
axes[0].grid(True, alpha=0.3)
axes[0].annotate('Always increases\n(more parameters)', xy=(4, final_log_likelihoods[3]),
xytext=(3.5, final_log_likelihoods[0]),
arrowprops=dict(arrowstyle='->', color='gray'), fontsize=11, color='gray')
# Right: BIC vs K
axes[1].plot(K_values, bic_scores, 's-', color='#e74c3c', markersize=8, linewidth=2)
axes[1].axvline(x=best_K, color='green', linestyle='--', alpha=0.7, label=f'Best K = {best_K}')
axes[1].set_xlabel('Number of Components $K$', fontsize=12)
axes[1].set_ylabel('BIC Score', fontsize=12)
axes[1].set_title('BIC vs. K (Higher is Better)', fontsize=14)
axes[1].set_xticks(K_values)
axes[1].legend(fontsize=12)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"BIC correctly identifies K = {best_K} as the optimal number of components.")
print("Log-likelihood keeps increasing with K, but BIC penalises excess complexity.")
8. Latent-Variable Perspective (MML 11.4)
We can view the GMM as a latent-variable model with a discrete latent variable \(\mathbf{z} = [z_1, \ldots, z_K]^\top\) that is a one-hot encoding: exactly one \(z_k = 1\) (indicating which component generated the data point) and all others are 0.
Generative Process
To generate a data point \(\mathbf{x}\):
Sample the component: Draw \(\mathbf{z} \sim \text{Categorical}(\boldsymbol{\pi})\), i.e., pick component \(k\) with probability \(\pi_k\) (Eq. 11.59).
Sample the observation: Given \(z_k = 1\), draw \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) (Eq. 11.58).
The joint distribution of \(\mathbf{x}\) and \(\mathbf{z}\) is:

$$p(\mathbf{x}, z_k = 1) = p(\mathbf{x} \mid z_k = 1) \, p(z_k = 1) = \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \qquad (11.61)$$
Marginalising over \(\mathbf{z}\) recovers the GMM:

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x}, z_k = 1) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
The responsibilities \(r_{nk}\) are simply the posterior \(p(z_k = 1 \mid \mathbf{x}_n)\) obtained by Bayes' theorem.
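As a quick sanity check with made-up numbers (a toy 1D model, not the fitted GMM above), the responsibility of a single point is just Bayes' theorem applied to the joint \(\pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k)\):

```python
# Minimal numeric check: responsibility = posterior p(z_k=1 | x) via Bayes.
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pi = np.array([0.3, 0.7])                  # prior p(z_k = 1)
mus, sigmas = np.array([-1.0, 2.0]), np.array([1.0, 1.0])
x = 1.0                                    # a single observed point

joint = pi * normal_pdf(x, mus, sigmas)    # p(x, z_k = 1), cf. Eq. (11.61)
posterior = joint / joint.sum()            # p(z_k = 1 | x) = r_k
print(posterior.round(3), "sum =", posterior.sum().round(3))
```

The point at \(x = 1\) is much closer to the second mean, so nearly all the posterior mass lands on component 2; the posterior always sums to 1 by construction.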
# Demonstrate the generative process explicitly
np.random.seed(123)
# Use our learned GMM parameters
n_samples = 300
# Step 1: sample component indicators from Categorical(pi)
z_samples = np.random.choice(K, size=n_samples, p=pis)
# Step 2: sample x from the chosen component
X_gen = np.empty((n_samples, 2))
for n in range(n_samples):
k = z_samples[n]
X_gen[n] = np.random.multivariate_normal(mus[k], Sigmas[k])
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Left: Step 1 -- show the categorical distribution
axes[0].bar(range(K), pis, color=colors[:K], edgecolor='black', linewidth=1.5)
axes[0].set_xticks(range(K))
axes[0].set_xticklabels([f'$k={k+1}$' for k in range(K)])
axes[0].set_ylabel('$\\pi_k$', fontsize=14)
axes[0].set_title('Step 1: Sample $k \\sim \\mathrm{Cat}(\\pi)$', fontsize=14)
axes[0].set_ylim(0, max(pis) * 1.3)
for k in range(K):
axes[0].text(k, pis[k] + 0.01, f'{pis[k]:.3f}', ha='center', fontsize=12)
# Middle: Step 2 -- show generated samples coloured by component
for k in range(K):
mask = z_samples == k
axes[1].scatter(X_gen[mask, 0], X_gen[mask, 1], s=15, alpha=0.5,
c=colors[k], label=f'$k={k+1}$ ({mask.sum()} pts)')
draw_ellipse(axes[1], mus[k], Sigmas[k], color=colors[k])
axes[1].plot(*mus[k], 'kx', markersize=12, markeredgewidth=3)
axes[1].legend(fontsize=10)
axes[1].set_title('Step 2: Sample $\\mathbf{x} \\sim \\mathcal{N}(\\mu_k, \\Sigma_k)$', fontsize=14)
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_aspect('equal')
# Right: discard z -- observe only x (the marginal)
axes[2].scatter(X_gen[:, 0], X_gen[:, 1], s=15, alpha=0.5, c='gray')
axes[2].set_title('Observed: Discard $\\mathbf{z}$ (Latent)', fontsize=14)
axes[2].set_xlabel('$x_1$'); axes[2].set_ylabel('$x_2$')
axes[2].set_aspect('equal')
plt.suptitle('Generative Process of a GMM (Ancestral Sampling)', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
print("The generative process (MML Sec. 11.4.1):")
print(" 1. Sample a component k from Categorical(pi) -- the latent variable z.")
print(" 2. Sample x from N(mu_k, Sigma_k) conditioned on z.")
print(" 3. Discard z: we only observe x. This gives us the marginal p(x) = sum_k pi_k N(x|mu_k, Sigma_k).")
# Visualise the posterior p(z_k = 1 | x) = r_nk for the learned model
# Create a grid and evaluate responsibilities at each grid point
x_min, x_max = X[:, 0].min() - 2, X[:, 0].max() + 2
y_min, y_max = X[:, 1].min() - 2, X[:, 1].max() + 2
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
R_grid = e_step(grid, pis, mus, Sigmas)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for k in range(K):
resp_map = R_grid[:, k].reshape(xx.shape)
im = axes[k].contourf(xx, yy, resp_map, levels=20, cmap='YlOrRd')
axes[k].scatter(X[:, 0], X[:, 1], s=5, alpha=0.3, c='black')
axes[k].plot(*mus[k], 'kx', markersize=14, markeredgewidth=3)
draw_ellipse(axes[k], mus[k], Sigmas[k], color='white', n_std=2.0, linestyle='--')
plt.colorbar(im, ax=axes[k], shrink=0.8)
axes[k].set_title(f'$p(z_{k+1}=1 \\mid \\mathbf{{x}})$ (Component {k+1})', fontsize=13)
axes[k].set_xlabel('$x_1$'); axes[k].set_ylabel('$x_2$')
axes[k].set_aspect('equal')
plt.suptitle('Posterior Responsibilities $r_{nk} = p(z_k = 1 \\mid \\mathbf{x})$', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
print("Each subplot shows how strongly each region 'belongs' to a given component.")
print("These are the posterior probabilities p(z_k=1|x) -- the Bayesian interpretation of responsibilities.")
# Visualise the full GMM density p(x) on a grid
# Evaluate the mixture density
density = np.zeros(grid.shape[0])
for k in range(K):
density += pis[k] * multivariate_normal_pdf(grid, mus[k], Sigmas[k])
density = density.reshape(xx.shape)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Left: contour plot of the density
cs = axes[0].contourf(xx, yy, density, levels=30, cmap='viridis')
axes[0].scatter(X[:, 0], X[:, 1], s=5, alpha=0.3, c='white', edgecolors='none')
for k in range(K):
draw_ellipse(axes[0], mus[k], Sigmas[k], color='white', n_std=2.0, linestyle='--')
axes[0].plot(*mus[k], 'wx', markersize=12, markeredgewidth=3)
plt.colorbar(cs, ax=axes[0], label='$p(\\mathbf{x})$')
axes[0].set_title('Learned GMM Density $p(\\mathbf{x})$', fontsize=14)
axes[0].set_xlabel('$x_1$'); axes[0].set_ylabel('$x_2$')
axes[0].set_aspect('equal')
# Right: individual weighted components
for k in range(K):
comp_density = pis[k] * multivariate_normal_pdf(grid, mus[k], Sigmas[k]).reshape(xx.shape)
axes[1].contour(xx, yy, comp_density, levels=5, colors=[colors[k]], alpha=0.8)
axes[1].plot(*mus[k], 'x', color=colors[k], markersize=12, markeredgewidth=3,
label=f'$\\pi_{k+1}={pis[k]:.2f}$')
axes[1].scatter(X[:, 0], X[:, 1], s=5, alpha=0.2, c='gray')
axes[1].legend(fontsize=12)
axes[1].set_title('Individual Weighted Components $\\pi_k \\mathcal{N}(\\mu_k, \\Sigma_k)$', fontsize=14)
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_aspect('equal')
plt.tight_layout()
plt.show()
print("Left: the full mixture density obtained by summing all weighted components.")
print("Right: the individual components, each contributing proportionally to their weight pi_k.")
Summary
What we implemented
| Topic | Key Idea |
|---|---|
| GMM definition (11.1) | \(p(\mathbf{x}) = \sum_k \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) -- a convex combination of Gaussians |
| MLE difficulty (11.2) | The log of a sum prevents a closed-form solution; responsibilities couple the updates |
| EM algorithm (11.3) | E-step computes \(r_{nk}\) (soft assignments); M-step updates \(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\) |
| K-Means as hard EM | Hard assignments (\(r_{nk} \in \{0,1\}\)), isotropic covariance, uniform weights |
| Convergence | EM monotonically increases the log-likelihood, converging to a local optimum |
| Initialisation | Different starting points can yield different local optima; run multiple restarts |
| Model selection | BIC balances log-likelihood and model complexity to choose \(K\) |
| Latent variables (11.4) | GMM as a generative model: sample \(k \sim \text{Cat}(\boldsymbol{\pi})\), then \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) |
Key equations from MML Chapter 11
Responsibilities: \(r_{nk} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}\) (11.17)
Mean update: \(\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k}\sum_n r_{nk}\mathbf{x}_n\) (11.20)
Covariance update: \(\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k}\sum_n r_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top\) (11.30)
Weight update: \(\pi_k^{\text{new}} = N_k / N\) (11.42)
Practical takeaways
EM is sensitive to initialisation -- use multiple restarts.
Use BIC (or cross-validation) to select \(K\).
Add small regularisation to covariances (\(\epsilon \mathbf{I}\)) to avoid singularities.
K-Means provides a fast, useful initialisation for EM.
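The last takeaway can be sketched as a standalone snippet (the mini K-Means loop and synthetic 2D data here are illustrative assumptions, not the lab's `run_kmeans`): run a short K-Means pass, then turn its hard labels into initial \(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\) for EM.

```python
# Sketch: seed GMM parameters for EM from a quick K-Means clustering.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
K = 2

# A few K-Means iterations starting from random data points
centroids = X[rng.choice(len(X), K, replace=False)]
for _ in range(10):
    labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

# Hard labels -> initial mixture weights, means, and covariances
pis = np.array([(labels == k).mean() for k in range(K)])
mus = centroids
Sigmas = np.array([np.cov(X[labels == k].T) + 1e-6 * np.eye(2) for k in range(K)])
print(pis.round(2))
print(mus.round(1))
```

Starting EM from these parameters typically needs far fewer iterations than starting from random means with identity covariances, because the first E-step already produces sensible responsibilities.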