Lab 09: Density Estimation with Gaussian Mixture Models

Based on Chapter 11 of Mathematics for Machine Learning (Deisenroth, Faisal, Ong)

In this lab we implement Gaussian Mixture Models from scratch, covering:

| Section | Topic | MML Reference |
|---|---|---|
| 1 | Gaussian Mixture Model definition | 11.1 |
| 2 | Parameter Learning via MLE | 11.2 |
| 3 | EM Algorithm (full implementation) | 11.3 |
| 4 | K-Means as Hard EM | – |
| 5 | Convergence analysis | 11.3 |
| 6 | Initialization sensitivity | – |
| 7 | Model selection (BIC) | – |
| 8 | Latent variable perspective | 11.4 |

We use only NumPy and Matplotlib – no scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("NumPy version:", np.__version__)
print("Setup complete.")

1. Gaussian Mixture Model (MML 11.1)

A single Gaussian distribution has limited modelling power – it cannot capture multi-modal data (data with several clusters). A Gaussian Mixture Model (GMM) addresses this by combining \(K\) Gaussian components:

\[p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \qquad (11.3)\]

where

  • \(\pi_k\) are the mixture weights satisfying \(0 \leq \pi_k \leq 1\) and \(\sum_k \pi_k = 1\),

  • \(\boldsymbol{\mu}_k \in \mathbb{R}^D\) is the mean of the \(k\)-th component,

  • \(\boldsymbol{\Sigma}_k \in \mathbb{R}^{D \times D}\) is the covariance matrix of the \(k\)-th component.

The collection of all parameters is \(\boldsymbol{\theta} := \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k : k = 1, \ldots, K\}\).

Let us generate data from a 2D GMM with \(K=3\) components.

def multivariate_normal_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma).
    
    Parameters
    ----------
    x : (N, D) array -- data points
    mu : (D,) array -- mean
    Sigma : (D, D) array -- covariance
    
    Returns
    -------
    (N,) array of density values
    """
    D = mu.shape[0]
    diff = x - mu  # (N, D)
    Sigma_inv = np.linalg.inv(Sigma)
    det_Sigma = np.linalg.det(Sigma)
    
    # Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
    mahal = np.sum(diff @ Sigma_inv * diff, axis=1)  # (N,)
    
    norm_const = 1.0 / (np.sqrt((2 * np.pi) ** D * det_Sigma))
    return norm_const * np.exp(-0.5 * mahal)


# --- True GMM parameters (K=3, D=2) ---
true_pis = np.array([0.3, 0.5, 0.2])
true_mus = np.array([
    [-3.0, -3.0],
    [ 0.0,  0.0],
    [ 3.0,  4.0]
])
true_Sigmas = np.array([
    [[1.0, 0.5], [0.5, 1.0]],
    [[1.5, -0.3], [-0.3, 0.8]],
    [[0.8, 0.0], [0.0, 1.2]]
])


def sample_gmm(pis, mus, Sigmas, N):
    """Draw N samples from a GMM."""
    K = len(pis)
    D = mus.shape[1]
    # Step 1: choose components according to pi
    components = np.random.choice(K, size=N, p=pis)
    # Step 2: sample from the chosen Gaussian
    X = np.empty((N, D))
    for n in range(N):
        k = components[n]
        X[n] = np.random.multivariate_normal(mus[k], Sigmas[k])
    return X, components


N = 500
X, true_labels = sample_gmm(true_pis, true_mus, true_Sigmas, N)

print(f"Generated {N} samples from a 2D GMM with K=3 components.")
print(f"Component counts: {[np.sum(true_labels == k) for k in range(3)]}")
print(f"Data shape: {X.shape}")
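As an aside, `multivariate_normal_pdf` computes the explicit inverse and determinant, which can be numerically fragile for ill-conditioned covariances. A sketch of a more stable alternative based on a Cholesky factorisation \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top\) (the helper name `multivariate_normal_logpdf_chol` is our own, not part of the lab):

```python
import numpy as np

def multivariate_normal_logpdf_chol(x, mu, Sigma):
    """Log-density of N(x | mu, Sigma) via a Cholesky factorisation.

    Avoids the explicit inverse and determinant: with Sigma = L L^T,
    the Mahalanobis term is ||L^{-1} (x - mu)||^2 and
    log det Sigma = 2 * sum(log diag(L)).
    """
    D = mu.shape[0]
    L = np.linalg.cholesky(Sigma)       # Sigma = L @ L.T (L lower-triangular)
    diff = x - mu                       # (N, D)
    y = np.linalg.solve(L, diff.T)      # solves L y = diff^T -> (D, N)
    mahal = np.sum(y**2, axis=0)        # (N,) Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2 * np.pi) + log_det + mahal)
```

Exponentiating only at the end (or never, when working in log space) keeps small densities representable.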
def draw_ellipse(ax, mu, Sigma, color='black', n_std=2.0, **kwargs):
    """Draw an ellipse representing the n_std contour of a 2D Gaussian."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    angle = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
    width, height = 2 * n_std * np.sqrt(eigvals)
    ell = Ellipse(xy=mu, width=width, height=height, angle=angle,
                  edgecolor=color, facecolor='none', linewidth=2, **kwargs)
    ax.add_patch(ell)


colors = ['#e74c3c', '#2ecc71', '#3498db']  # red, green, blue

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: all data (no labels)
axes[0].scatter(X[:, 0], X[:, 1], s=10, alpha=0.5, c='gray')
axes[0].set_title('Raw Data (Labels Hidden)')
axes[0].set_xlabel('$x_1$'); axes[0].set_ylabel('$x_2$')
axes[0].set_aspect('equal')

# Right: coloured by true component
for k in range(3):
    mask = true_labels == k
    axes[1].scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5,
                    c=colors[k], label=f'Component {k+1} ($\\pi_{k+1}={true_pis[k]:.1f}$)')
    draw_ellipse(axes[1], true_mus[k], true_Sigmas[k], color=colors[k])
    axes[1].plot(*true_mus[k], 'x', color=colors[k], markersize=12, markeredgewidth=3)

axes[1].legend()
axes[1].set_title('True Component Assignments')
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

print("Left: the raw data we observe in practice (no labels).")
print("Right: the ground truth we want the algorithm to recover.")

2. Parameter Learning via Maximum Likelihood (MML 11.2)

Given an i.i.d. dataset \(\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}\), the log-likelihood is:

\[\log p(\mathcal{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log \underbrace{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}_{\text{sum inside the log}} \qquad (11.10)\]

Why direct MLE is hard

For a single Gaussian, the log cancels the exponential and MLE has a closed-form solution. For a mixture, the \(\log\) of a sum has no closed-form optimum because the per-component terms cannot be separated.

Responsibilities (11.2.1)

Taking derivatives and setting them to zero introduces the responsibilities:

\[r_{nk} := \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \qquad (11.17)\]

\(r_{nk}\) is the probability that component \(k\) generated data point \(\mathbf{x}_n\). It is a soft assignment – each point has a distribution over all \(K\) components.

The necessary conditions for a stationary point yield coupled update equations:

| Parameter | Update | Eq. |
|---|---|---|
| \(\boldsymbol{\mu}_k\) | \(\displaystyle\frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\, \mathbf{x}_n\) | (11.20) |
| \(\boldsymbol{\Sigma}_k\) | \(\displaystyle\frac{1}{N_k}\sum_{n=1}^{N} r_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top\) | (11.30) |
| \(\pi_k\) | \(N_k / N\) | (11.42) |

where \(N_k = \sum_{n=1}^{N} r_{nk}\) is the total responsibility of component \(k\).

These equations depend on each other through \(r_{nk}\), so we cannot solve them in one shot – we need an iterative algorithm.
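To make Eq. (11.17) concrete, here is a minimal numeric sketch: the responsibilities for a single 1D point under a two-component mixture (the helper `gauss_1d` and all parameter values are illustrative choices of ours):

```python
import numpy as np

def gauss_1d(x, mu, sigma):
    """1D Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 0.5                               # a single observation
pis = np.array([0.4, 0.6])            # mixture weights
mus = np.array([-1.0, 1.0])           # component means
sigmas = np.array([1.0, 1.0])         # component standard deviations

weighted = pis * gauss_1d(x, mus, sigmas)   # pi_k * N(x | mu_k, sigma_k^2)
r = weighted / weighted.sum()               # normalise -> responsibilities
print("responsibilities:", r, "sum:", r.sum())
```

Since \(x = 0.5\) lies closer to the second component's mean, \(r_2 > r_1\), yet both responsibilities remain strictly positive: every point is softly shared among all components.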

# Demonstrate why direct MLE is hard: compute the log-likelihood surface
# for a simple 1D, 2-component GMM and show it has multiple local optima.

# Simple 1D data
data_1d = np.concatenate([
    np.random.normal(-2, 0.5, 80),
    np.random.normal(2, 0.8, 120)
])

def gmm_log_likelihood_1d(data, pi1, mu1, mu2, sigma1=0.5, sigma2=0.8):
    """Log-likelihood for a 1D, 2-component GMM with fixed variances."""
    pi2 = 1.0 - pi1
    p1 = pi1 * (1.0 / np.sqrt(2 * np.pi * sigma1**2)) * np.exp(-0.5 * ((data - mu1) / sigma1)**2)
    p2 = pi2 * (1.0 / np.sqrt(2 * np.pi * sigma2**2)) * np.exp(-0.5 * ((data - mu2) / sigma2)**2)
    return np.sum(np.log(p1 + p2 + 1e-300))

mu1_range = np.linspace(-5, 5, 100)
mu2_range = np.linspace(-5, 5, 100)
M1, M2 = np.meshgrid(mu1_range, mu2_range)
LL = np.zeros_like(M1)
for i in range(M1.shape[0]):
    for j in range(M1.shape[1]):
        LL[i, j] = gmm_log_likelihood_1d(data_1d, 0.4, M1[i, j], M2[i, j])

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: the 1D data
axes[0].hist(data_1d, bins=30, density=True, alpha=0.6, color='steelblue')
axes[0].set_title('1D Mixture Data (2 Components)')
axes[0].set_xlabel('x'); axes[0].set_ylabel('Density')

# Right: log-likelihood as function of (mu1, mu2)
cs = axes[1].contourf(M1, M2, LL, levels=40, cmap='viridis')
plt.colorbar(cs, ax=axes[1], label='Log-likelihood')
axes[1].set_xlabel('$\\mu_1$'); axes[1].set_ylabel('$\\mu_2$')
axes[1].set_title('Log-likelihood surface (fixed $\\sigma_1, \\sigma_2, \\pi_1$)')
# Mark the two symmetric optima
axes[1].plot(-2, 2, 'r*', markersize=15, label='True ($\\mu_1$=-2, $\\mu_2$=2)')
axes[1].plot(2, -2, 'w*', markersize=15, label='Label-swapped optimum')
axes[1].legend(fontsize=10)
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

print("The log-likelihood surface has TWO symmetric optima (label-swapping symmetry).")
print("This multi-modality makes direct optimisation difficult -- we use EM instead.")

3. EM Algorithm (MML 11.3)

The Expectation-Maximisation (EM) algorithm (Dempster et al., 1977) is an iterative procedure that alternates between two steps:

  1. E-step: Evaluate the responsibilities \(r_{nk}\) for every data point using the current parameters:

\[r_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \qquad (11.53)\]

  2. M-step: Re-estimate the parameters using the current responsibilities:

\[\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\, \mathbf{x}_n \qquad (11.54)\]

\[\boldsymbol{\Sigma}_k = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \qquad (11.55)\]

\[\pi_k = \frac{N_k}{N} \qquad (11.56)\]

where \(N_k = \sum_{n=1}^{N} r_{nk}\).

Key property: The log-likelihood never decreases from one EM iteration to the next (Neal and Hinton, 1999), guaranteeing convergence to a local optimum.

def compute_log_likelihood(X, pis, mus, Sigmas):
    """Compute the GMM log-likelihood (Eq. 11.10)."""
    N = X.shape[0]
    K = len(pis)
    # Weighted densities: (N, K)
    weighted = np.zeros((N, K))
    for k in range(K):
        weighted[:, k] = pis[k] * multivariate_normal_pdf(X, mus[k], Sigmas[k])
    # Sum over components, then log, then sum over data
    return np.sum(np.log(np.sum(weighted, axis=1) + 1e-300))


def e_step(X, pis, mus, Sigmas):
    """E-step: compute responsibilities r_nk (Eq. 11.53).
    
    Returns
    -------
    R : (N, K) array of responsibilities
    """
    N = X.shape[0]
    K = len(pis)
    R = np.zeros((N, K))
    
    for k in range(K):
        R[:, k] = pis[k] * multivariate_normal_pdf(X, mus[k], Sigmas[k])
    
    # Normalise each row so responsibilities sum to 1
    row_sums = np.sum(R, axis=1, keepdims=True) + 1e-300
    R = R / row_sums
    return R


def m_step(X, R):
    """M-step: update pis, mus, Sigmas (Eqs. 11.54-11.56).
    
    Parameters
    ----------
    X : (N, D) data
    R : (N, K) responsibilities
    
    Returns
    -------
    pis, mus, Sigmas : updated parameters
    """
    N, D = X.shape
    K = R.shape[1]
    
    N_k = np.sum(R, axis=0)  # (K,) -- total responsibility per component
    
    # Update means (Eq. 11.54)
    mus = np.zeros((K, D))
    for k in range(K):
        mus[k] = np.sum(R[:, k:k+1] * X, axis=0) / N_k[k]
    
    # Update covariances (Eq. 11.55)
    Sigmas = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mus[k]  # (N, D)
        Sigmas[k] = (R[:, k:k+1] * diff).T @ diff / N_k[k]
        # Add small regularisation for numerical stability
        Sigmas[k] += 1e-6 * np.eye(D)
    
    # Update mixture weights (Eq. 11.56)
    pis = N_k / N
    
    return pis, mus, Sigmas


def run_em(X, K, max_iter=100, tol=1e-6, seed=0):
    """Run the full EM algorithm for a GMM.
    
    Returns
    -------
    pis, mus, Sigmas : final parameters
    R : final responsibilities
    log_likelihoods : list of log-likelihood at each iteration
    history : list of (pis, mus, Sigmas, R) at each iteration
    """
    np.random.seed(seed)
    N, D = X.shape
    
    # --- Initialisation ---
    # Pick K random data points as initial means
    indices = np.random.choice(N, K, replace=False)
    mus = X[indices].copy()
    # Identity covariances
    Sigmas = np.array([np.eye(D) for _ in range(K)])
    # Uniform weights
    pis = np.ones(K) / K
    
    log_likelihoods = []
    history = []
    
    for iteration in range(max_iter):
        # E-step
        R = e_step(X, pis, mus, Sigmas)
        
        # M-step
        pis, mus, Sigmas = m_step(X, R)
        
        # Compute log-likelihood
        ll = compute_log_likelihood(X, pis, mus, Sigmas)
        log_likelihoods.append(ll)
        history.append((pis.copy(), mus.copy(), Sigmas.copy(), R.copy()))
        
        # Check convergence
        if iteration > 0 and abs(log_likelihoods[-1] - log_likelihoods[-2]) < tol:
            break
    
    return pis, mus, Sigmas, R, log_likelihoods, history


print("EM functions defined: e_step, m_step, run_em")
print("Ready to fit a GMM.")
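One caveat about `e_step` above: in high dimensions the weighted densities can underflow to zero, making the `1e-300` guard do the heavy lifting. A sketch of an E-step computed entirely in log space with the log-sum-exp trick (the function `e_step_logspace` is our own variant, not used by the lab's `run_em`):

```python
import numpy as np

def e_step_logspace(X, pis, mus, Sigmas):
    """Responsibilities r_nk computed in log space for numerical stability."""
    N, D = X.shape
    K = len(pis)
    log_r = np.zeros((N, K))
    for k in range(K):
        Sigma_inv = np.linalg.inv(Sigmas[k])
        _, logdet = np.linalg.slogdet(Sigmas[k])   # log|Sigma_k| without overflow
        diff = X - mus[k]
        mahal = np.sum(diff @ Sigma_inv * diff, axis=1)
        # log(pi_k * N(x | mu_k, Sigma_k))
        log_r[:, k] = np.log(pis[k]) - 0.5 * (D * np.log(2 * np.pi) + logdet + mahal)
    # Normalise each row via log-sum-exp: subtract the row max before exponentiating
    m = log_r.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.sum(np.exp(log_r - m), axis=1, keepdims=True))
    return np.exp(log_r - log_norm)
```

Subtracting the per-row maximum before exponentiating keeps at least one term at \(e^0 = 1\), so the normaliser never underflows even when every unnormalised density would.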
# Run EM on our 2D data
K = 3
pis, mus, Sigmas, R, lls, history = run_em(X, K, max_iter=100, seed=42)

print(f"EM converged after {len(lls)} iterations.")
print(f"\nLearned mixture weights: {pis.round(3)}")
print(f"True mixture weights:    {true_pis}")
print(f"\nLearned means:")
for k in range(K):
    print(f"  mu_{k+1} = {mus[k].round(3)}")
print(f"True means:")
for k in range(K):
    print(f"  mu_{k+1} = {true_mus[k]}")
# Visualise EM iterations: show cluster assignments evolving
iterations_to_show = [0, 1, 3, len(history)-1]
iterations_to_show = [i for i in iterations_to_show if i < len(history)]

fig, axes = plt.subplots(1, len(iterations_to_show), figsize=(5*len(iterations_to_show), 5))
if len(iterations_to_show) == 1:
    axes = [axes]

for ax_idx, it in enumerate(iterations_to_show):
    ax = axes[ax_idx]
    p, m, S, r = history[it]
    
    # Colour each point by its responsibilities (soft assignment)
    # Use RGB from responsibilities
    rgb = np.zeros((N, 3))
    color_basis = np.array([[0.91, 0.30, 0.24],   # red
                            [0.18, 0.80, 0.44],   # green
                            [0.20, 0.58, 0.86]])   # blue
    for k in range(K):
        rgb += r[:, k:k+1] * color_basis[k:k+1]
    rgb = np.clip(rgb, 0, 1)
    
    ax.scatter(X[:, 0], X[:, 1], c=rgb, s=10, alpha=0.7)
    
    # Draw component ellipses
    for k in range(K):
        draw_ellipse(ax, m[k], S[k], color=colors[k], n_std=2.0)
        ax.plot(*m[k], 'x', color='black', markersize=12, markeredgewidth=3)
    
    ax.set_title(f'Iteration {it+1}', fontsize=14)
    ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
    ax.set_aspect('equal')

plt.suptitle('EM Algorithm: Evolution of Cluster Assignments', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

print("Colours show soft assignments: each point is a blend of component colours.")
print("As EM progresses, assignments become more confident (purer colours).")

4. K-Means as Hard EM

K-Means is a special case of EM where we make hard assignments rather than soft ones:

| | EM (soft) | K-Means (hard) |
|---|---|---|
| Assignment | \(r_{nk} \in [0,1]\) (probability) | \(r_{nk} \in \{0,1\}\) (binary) |
| Each point belongs to | All clusters (weighted) | Exactly one cluster |
| Covariances | Learned per component | Implicitly \(\sigma^2 \mathbf{I}\) for all |
| Weights | Learned \(\pi_k\) | Implicitly uniform |

K-Means corresponds to EM on a GMM where:

  • All covariances are \(\sigma^2 \mathbf{I}\) (isotropic) and \(\sigma^2 \to 0\)

  • All mixture weights are equal

  • Responsibilities collapse to hard assignments: \(r_{nk} = 1\) if \(k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2\)
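The \(\sigma^2 \to 0\) limit can be made visible with a tiny sketch (all values are illustrative): with isotropic covariances \(\sigma^2 \mathbf{I}\) and equal weights, the responsibilities reduce to a softmax of \(-\|\mathbf{x} - \boldsymbol{\mu}_k\|^2 / (2\sigma^2)\), which hardens to a one-hot argmin as \(\sigma^2\) shrinks:

```python
import numpy as np

x = np.array([0.0, 0.0])                   # a single query point
mus = np.array([[1.0, 0.0], [3.0, 0.0]])   # two component means

for sigma2 in [4.0, 1.0, 0.1, 0.01]:
    sq_dists = np.sum((x - mus) ** 2, axis=1)
    logits = -sq_dists / (2 * sigma2)       # equal pis cancel in the ratio
    r = np.exp(logits - logits.max())       # stable softmax
    r = r / r.sum()
    print(f"sigma^2 = {sigma2:5.2f} -> r = {r.round(4)}")
```

As \(\sigma^2\) decreases, the responsibility of the nearer mean approaches 1: exactly the hard assignment K-Means makes.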

def run_kmeans(X, K, max_iter=100, seed=0):
    """K-Means clustering (hard EM).
    
    Returns
    -------
    centroids : (K, D) final centroids
    labels : (N,) cluster assignments
    inertias : list of total within-cluster distance at each iteration
    history : list of (centroids, labels) at each iteration
    """
    np.random.seed(seed)
    N, D = X.shape
    
    # Initialise centroids from random data points
    indices = np.random.choice(N, K, replace=False)
    centroids = X[indices].copy()
    
    inertias = []
    history = []
    
    for iteration in range(max_iter):
        # "E-step" (hard): assign each point to nearest centroid
        dists = np.zeros((N, K))
        for k in range(K):
            dists[:, k] = np.sum((X - centroids[k])**2, axis=1)
        labels = np.argmin(dists, axis=1)
        
        # "M-step": recompute centroids
        new_centroids = np.zeros((K, D))
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
            else:
                new_centroids[k] = centroids[k]  # keep old if empty
        
        inertia = sum(np.sum((X[labels == k] - new_centroids[k])**2) for k in range(K))
        inertias.append(inertia)
        history.append((new_centroids.copy(), labels.copy()))
        
        # Check convergence
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    
    return centroids, labels, inertias, history


centroids, km_labels, inertias, km_history = run_kmeans(X, K=3, seed=42)

print(f"K-Means converged after {len(inertias)} iterations.")
print(f"Final centroids:")
for k in range(3):
    print(f"  centroid_{k+1} = {centroids[k].round(3)}")
# Compare K-Means (hard) vs EM (soft) assignments
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Left: K-Means result (hard boundaries)
for k in range(3):
    mask = km_labels == k
    axes[0].scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5, c=colors[k])
    axes[0].plot(*centroids[k], 'kx', markersize=14, markeredgewidth=3)
axes[0].set_title('K-Means (Hard Assignments)', fontsize=14)
axes[0].set_xlabel('$x_1$'); axes[0].set_ylabel('$x_2$')
axes[0].set_aspect('equal')

# Middle: EM result (soft assignments, coloured by max responsibility)
em_labels = np.argmax(R, axis=1)
for k in range(3):
    mask = em_labels == k
    axes[1].scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5, c=colors[k])
    draw_ellipse(axes[1], mus[k], Sigmas[k], color=colors[k], n_std=2.0)
    axes[1].plot(*mus[k], 'kx', markersize=14, markeredgewidth=3)
axes[1].set_title('EM / GMM (Soft Assignments)', fontsize=14)
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_aspect('equal')

# Right: true labels
for k in range(3):
    mask = true_labels == k
    axes[2].scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5, c=colors[k])
    draw_ellipse(axes[2], true_mus[k], true_Sigmas[k], color=colors[k], n_std=2.0)
    axes[2].plot(*true_mus[k], 'kx', markersize=14, markeredgewidth=3)
axes[2].set_title('Ground Truth', fontsize=14)
axes[2].set_xlabel('$x_1$'); axes[2].set_ylabel('$x_2$')
axes[2].set_aspect('equal')

plt.suptitle('K-Means vs EM vs Ground Truth', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

print("K-Means produces hard Voronoi-like boundaries (spherical clusters assumed).")
print("EM captures elliptical shapes and provides probabilistic assignments.")

5. Convergence Analysis

A fundamental property of the EM algorithm is that the log-likelihood never decreases from one iteration to the next:

\[\log p(\mathcal{X} \mid \boldsymbol{\theta}^{(t+1)}) \geq \log p(\mathcal{X} \mid \boldsymbol{\theta}^{(t)})\]

This monotonic increase guarantees convergence to a local optimum (not necessarily the global optimum). Let us verify this empirically.

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Left: log-likelihood over EM iterations
axes[0].plot(range(1, len(lls)+1), lls, 'o-', color='steelblue', markersize=5)
axes[0].set_xlabel('EM Iteration')
axes[0].set_ylabel('Log-Likelihood')
axes[0].set_title('Log-Likelihood vs. EM Iteration')
axes[0].grid(True, alpha=0.3)

# Verify monotonic increase
diffs = np.diff(lls)
all_non_negative = np.all(diffs >= -1e-10)  # allow tiny numerical noise
axes[0].annotate(f'Monotonically increasing: {all_non_negative}',
                 xy=(0.5, 0.1), xycoords='axes fraction', fontsize=12,
                 bbox=dict(boxstyle='round', facecolor='lightgreen' if all_non_negative else 'lightyellow'))

# Right: change in log-likelihood (should be >= 0)
axes[1].bar(range(1, len(diffs)+1), diffs, color='steelblue', alpha=0.7)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[1].set_xlabel('EM Iteration')
axes[1].set_ylabel('$\\Delta$ Log-Likelihood')
axes[1].set_title('Change in Log-Likelihood Per Iteration')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Initial log-likelihood: {lls[0]:.2f}")
print(f"Final log-likelihood:   {lls[-1]:.2f}")
print(f"All iteration-to-iteration changes >= 0: {all_non_negative}")
print(f"Smallest change: {min(diffs):.2e}")
print(f"\nThis confirms the monotonic convergence property of EM (Neal & Hinton, 1999).")

6. Initialization Sensitivity

EM converges to a local optimum, so different initialisations can yield different solutions. This is a practical concern: poor initialisation may lead to suboptimal clustering.

A common strategy is to run EM multiple times with different random seeds and pick the run with the highest final log-likelihood.

# Run EM with several different random initialisations
n_runs = 6
results = []
for seed in range(n_runs):
    p, m, S, r, ll, hist = run_em(X, K=3, max_iter=100, seed=seed*7)
    results.append((p, m, S, r, ll, hist))

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

for idx in range(n_runs):
    ax = axes[idx // 3, idx % 3]
    p, m, S, r, ll, hist = results[idx]
    
    labels_run = np.argmax(r, axis=1)
    for k in range(3):
        mask = labels_run == k
        ax.scatter(X[mask, 0], X[mask, 1], s=8, alpha=0.4, c=colors[k])
        draw_ellipse(ax, m[k], S[k], color=colors[k])
        ax.plot(*m[k], 'kx', markersize=10, markeredgewidth=2)
    
    ax.set_title(f'Seed {idx*7} | LL = {ll[-1]:.1f} | {len(ll)} iters', fontsize=11)
    ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
    ax.set_aspect('equal')

plt.suptitle('EM with Different Random Initialisations', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

final_lls = [results[i][4][-1] for i in range(n_runs)]
best_idx = np.argmax(final_lls)
print("Final log-likelihoods for each run:")
for i in range(n_runs):
    marker = " <-- BEST" if i == best_idx else ""
    print(f"  Seed {i*7:2d}: LL = {final_lls[i]:.2f}{marker}")
print(f"\nBest run: seed {best_idx*7} with LL = {final_lls[best_idx]:.2f}")
print("Different seeds can converge to different local optima (though here the data is well-separated).")
# Compare convergence curves for all runs
fig, ax = plt.subplots(figsize=(10, 6))

for idx in range(n_runs):
    ll_curve = results[idx][4]
    ax.plot(range(1, len(ll_curve)+1), ll_curve, 'o-', markersize=4,
            label=f'Seed {idx*7} (final: {ll_curve[-1]:.1f})', alpha=0.8)

ax.set_xlabel('EM Iteration', fontsize=12)
ax.set_ylabel('Log-Likelihood', fontsize=12)
ax.set_title('Convergence Curves for Different Initialisations', fontsize=14)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("All runs show monotonically increasing log-likelihood.")
print("They may converge to different local optima at different speeds.")

7. Model Selection: Choosing \(K\) with BIC

How do we choose the number of components \(K\)? Adding more components always increases the log-likelihood (more parameters = better fit), so we need a criterion that penalises model complexity.

The Bayesian Information Criterion (BIC) balances fit and complexity:

\[\text{BIC} = \log p(\mathcal{X} \mid \hat{\boldsymbol{\theta}}) - \frac{M}{2} \log N\]

where \(M\) is the number of free parameters and \(N\) is the number of data points. Higher BIC is better (less negative). The penalty \(\frac{M}{2}\log N\) discourages overfitting.

For a GMM with \(K\) components in \(D\) dimensions, the number of free parameters is:

\[M = K \cdot D + K \cdot \frac{D(D+1)}{2} + (K - 1) = K\left(D + \frac{D(D+1)}{2} + 1\right) - 1\]

(means + upper-triangular covariance entries + mixture weights with the constraint \(\sum_k \pi_k = 1\)).

def count_gmm_params(K, D):
    """Count free parameters of a GMM with K components in D dimensions."""
    means = K * D
    covariances = K * D * (D + 1) // 2
    weights = K - 1  # one constraint: they sum to 1
    return means + covariances + weights


def compute_bic(ll, n_params, N):
    """Compute BIC = log_likelihood - (M/2) * log(N)."""
    return ll - 0.5 * n_params * np.log(N)


# Fit GMMs with K = 1, 2, 3, 4, 5
K_values = [1, 2, 3, 4, 5]
D = 2
bic_scores = []
final_log_likelihoods = []

print(f"{'K':>3} | {'Params':>6} | {'Log-Lik':>10} | {'BIC':>10}")
print("-" * 45)

for K_try in K_values:
    # Run EM 3 times, pick best
    best_ll = -np.inf
    for seed in range(3):
        p, m, S, r, ll, hist = run_em(X, K_try, max_iter=150, seed=seed*11)
        if ll[-1] > best_ll:
            best_ll = ll[-1]
    
    n_params = count_gmm_params(K_try, D)
    bic = compute_bic(best_ll, n_params, N)
    bic_scores.append(bic)
    final_log_likelihoods.append(best_ll)
    
    print(f"{K_try:3d} | {n_params:6d} | {best_ll:10.2f} | {bic:10.2f}")

best_K = K_values[np.argmax(bic_scores)]
print(f"\nBest K by BIC: K = {best_K}")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: log-likelihood vs K
axes[0].plot(K_values, final_log_likelihoods, 'o-', color='steelblue', markersize=8, linewidth=2)
axes[0].set_xlabel('Number of Components $K$', fontsize=12)
axes[0].set_ylabel('Log-Likelihood', fontsize=12)
axes[0].set_title('Log-Likelihood vs. K', fontsize=14)
axes[0].set_xticks(K_values)
axes[0].grid(True, alpha=0.3)
axes[0].annotate('Always increases\n(more parameters)', xy=(4, final_log_likelihoods[3]),
                 xytext=(3.5, final_log_likelihoods[0]),
                 arrowprops=dict(arrowstyle='->', color='gray'), fontsize=11, color='gray')

# Right: BIC vs K
axes[1].plot(K_values, bic_scores, 's-', color='#e74c3c', markersize=8, linewidth=2)
axes[1].axvline(x=best_K, color='green', linestyle='--', alpha=0.7, label=f'Best K = {best_K}')
axes[1].set_xlabel('Number of Components $K$', fontsize=12)
axes[1].set_ylabel('BIC Score', fontsize=12)
axes[1].set_title('BIC vs. K (Higher is Better)', fontsize=14)
axes[1].set_xticks(K_values)
axes[1].legend(fontsize=12)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"BIC correctly identifies K = {best_K} as the optimal number of components.")
print("Log-likelihood keeps increasing with K, but BIC penalises excess complexity.")

8. Latent-Variable Perspective (MML 11.4)

We can view the GMM as a latent-variable model with a discrete latent variable \(\mathbf{z} = [z_1, \ldots, z_K]^\top\) that is a one-hot encoding: exactly one \(z_k = 1\) (indicating which component generated the data point) and all others are 0.

Generative Process

To generate a data point \(\mathbf{x}\):

  1. Sample the component: Draw \(\mathbf{z} \sim \text{Categorical}(\boldsymbol{\pi})\), i.e., pick component \(k\) with probability \(\pi_k\) (Eq. 11.59).

  2. Sample the observation: Given \(z_k = 1\), draw \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) (Eq. 11.58).

The joint distribution of \(\mathbf{x}\) and \(\mathbf{z}\) is:

\[p(\mathbf{x}, z_k = 1) = p(\mathbf{x} \mid z_k = 1) \, p(z_k = 1) = \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \qquad (11.61)\]

Marginalising over \(\mathbf{z}\) recovers the GMM:

\[p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x}, z_k = 1) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\]

The responsibilities \(r_{nk}\) are simply the posterior \(p(z_k = 1 \mid \mathbf{x}_n)\) obtained by Bayes’ theorem.

# Demonstrate the generative process explicitly
np.random.seed(123)

# Use our learned GMM parameters
n_samples = 300

# Step 1: sample component indicators from Categorical(pi)
z_samples = np.random.choice(K, size=n_samples, p=pis)

# Step 2: sample x from the chosen component
X_gen = np.empty((n_samples, 2))
for n in range(n_samples):
    k = z_samples[n]
    X_gen[n] = np.random.multivariate_normal(mus[k], Sigmas[k])

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Left: Step 1 -- show the categorical distribution
axes[0].bar(range(K), pis, color=colors[:K], edgecolor='black', linewidth=1.5)
axes[0].set_xticks(range(K))
axes[0].set_xticklabels([f'$k={k+1}$' for k in range(K)])
axes[0].set_ylabel('$\\pi_k$', fontsize=14)
axes[0].set_title('Step 1: Sample $k \\sim \\mathrm{Cat}(\\pi)$', fontsize=14)
axes[0].set_ylim(0, max(pis) * 1.3)
for k in range(K):
    axes[0].text(k, pis[k] + 0.01, f'{pis[k]:.3f}', ha='center', fontsize=12)

# Middle: Step 2 -- show generated samples coloured by component
for k in range(K):
    mask = z_samples == k
    axes[1].scatter(X_gen[mask, 0], X_gen[mask, 1], s=15, alpha=0.5,
                    c=colors[k], label=f'$k={k+1}$ ({mask.sum()} pts)')
    draw_ellipse(axes[1], mus[k], Sigmas[k], color=colors[k])
    axes[1].plot(*mus[k], 'kx', markersize=12, markeredgewidth=3)

axes[1].legend(fontsize=10)
axes[1].set_title('Step 2: Sample $\\mathbf{x} \\sim \\mathcal{N}(\\mu_k, \\Sigma_k)$', fontsize=14)
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_aspect('equal')

# Right: discard z -- observe only x (the marginal)
axes[2].scatter(X_gen[:, 0], X_gen[:, 1], s=15, alpha=0.5, c='gray')
axes[2].set_title('Observed: Discard $\\mathbf{z}$ (Latent)', fontsize=14)
axes[2].set_xlabel('$x_1$'); axes[2].set_ylabel('$x_2$')
axes[2].set_aspect('equal')

plt.suptitle('Generative Process of a GMM (Ancestral Sampling)', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

print("The generative process (MML Sec. 11.4.1):")
print("  1. Sample a component k from Categorical(pi) -- the latent variable z.")
print("  2. Sample x from N(mu_k, Sigma_k) conditioned on z.")
print("  3. Discard z: we only observe x. This gives us the marginal p(x) = sum_k pi_k N(x|mu_k, Sigma_k).")
# Visualise the posterior p(z_k = 1 | x) = r_nk for the learned model
# Create a grid and evaluate responsibilities at each grid point

x_min, x_max = X[:, 0].min() - 2, X[:, 0].max() + 2
y_min, y_max = X[:, 1].min() - 2, X[:, 1].max() + 2
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])

R_grid = e_step(grid, pis, mus, Sigmas)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for k in range(K):
    resp_map = R_grid[:, k].reshape(xx.shape)
    im = axes[k].contourf(xx, yy, resp_map, levels=20, cmap='YlOrRd')
    axes[k].scatter(X[:, 0], X[:, 1], s=5, alpha=0.3, c='black')
    axes[k].plot(*mus[k], 'kx', markersize=14, markeredgewidth=3)
    draw_ellipse(axes[k], mus[k], Sigmas[k], color='white', n_std=2.0, linestyle='--')
    plt.colorbar(im, ax=axes[k], shrink=0.8)
    axes[k].set_title(f'$p(z_{k+1}=1 \\mid \\mathbf{{x}})$ (Component {k+1})', fontsize=13)
    axes[k].set_xlabel('$x_1$'); axes[k].set_ylabel('$x_2$')
    axes[k].set_aspect('equal')

plt.suptitle('Posterior Responsibilities $r_{nk} = p(z_k = 1 \\mid \\mathbf{x})$', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

print("Each subplot shows how strongly each region 'belongs' to a given component.")
print("These are the posterior probabilities p(z_k=1|x) -- the Bayesian interpretation of responsibilities.")
# Visualise the full GMM density p(x) on a grid

# Evaluate the mixture density
density = np.zeros(grid.shape[0])
for k in range(K):
    density += pis[k] * multivariate_normal_pdf(grid, mus[k], Sigmas[k])
density = density.reshape(xx.shape)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: contour plot of the density
cs = axes[0].contourf(xx, yy, density, levels=30, cmap='viridis')
axes[0].scatter(X[:, 0], X[:, 1], s=5, alpha=0.3, c='white', edgecolors='none')
for k in range(K):
    draw_ellipse(axes[0], mus[k], Sigmas[k], color='white', n_std=2.0, linestyle='--')
    axes[0].plot(*mus[k], 'wx', markersize=12, markeredgewidth=3)
plt.colorbar(cs, ax=axes[0], label='$p(\\mathbf{x})$')
axes[0].set_title('Learned GMM Density $p(\\mathbf{x})$', fontsize=14)
axes[0].set_xlabel('$x_1$'); axes[0].set_ylabel('$x_2$')
axes[0].set_aspect('equal')

# Right: individual weighted components
for k in range(K):
    comp_density = pis[k] * multivariate_normal_pdf(grid, mus[k], Sigmas[k]).reshape(xx.shape)
    axes[1].contour(xx, yy, comp_density, levels=5, colors=[colors[k]], alpha=0.8)
    axes[1].plot(*mus[k], 'x', color=colors[k], markersize=12, markeredgewidth=3,
                label=f'$\\pi_{k+1}={pis[k]:.2f}$')

axes[1].scatter(X[:, 0], X[:, 1], s=5, alpha=0.2, c='gray')
axes[1].legend(fontsize=12)
axes[1].set_title('Individual Weighted Components $\\pi_k \\mathcal{N}(\\mu_k, \\Sigma_k)$', fontsize=14)
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

print("Left: the full mixture density obtained by summing all weighted components.")
print("Right: the individual components, each contributing proportionally to their weight pi_k.")

Summary

What we implemented

| Topic | Key Idea |
|---|---|
| GMM definition (11.1) | \(p(\mathbf{x}) = \sum_k \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) – a convex combination of Gaussians |
| MLE difficulty (11.2) | The log of a sum prevents a closed-form solution; responsibilities couple the updates |
| EM algorithm (11.3) | E-step computes \(r_{nk}\) (soft assignments); M-step updates \(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\) |
| K-Means as hard EM | Hard assignments (\(r_{nk} \in \{0,1\}\)), isotropic covariance, uniform weights |
| Convergence | EM monotonically increases the log-likelihood, converging to a local optimum |
| Initialisation | Different starting points can yield different local optima; run multiple restarts |
| Model selection | BIC balances log-likelihood and model complexity to choose \(K\) |
| Latent variables (11.4) | GMM as a generative model: sample \(k \sim \text{Cat}(\boldsymbol{\pi})\), then \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) |

Key equations from MML Chapter 11

  • Responsibilities: \(r_{nk} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}\) (11.17)

  • Mean update: \(\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k}\sum_n r_{nk}\mathbf{x}_n\) (11.20)

  • Covariance update: \(\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k}\sum_n r_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top\) (11.30)

  • Weight update: \(\pi_k^{\text{new}} = N_k / N\) (11.42)

Practical takeaways

  1. EM is sensitive to initialisation – use multiple restarts.

  2. Use BIC (or cross-validation) to select \(K\).

  3. Add small regularisation to covariances (\(\epsilon \mathbf{I}\)) to avoid singularities.

  4. K-Means provides a fast, useful initialisation for EM.