Lab 08: Dimensionality Reduction with Principal Component Analysis
Based on Chapter 10 of Mathematics for Machine Learning (Deisenroth, Faisal, Ong)
Principal Component Analysis (PCA) is one of the most fundamental algorithms for linear dimensionality reduction. High-dimensional data is often overcomplete: many dimensions are redundant and can be explained by combinations of others. PCA exploits this structure to find a compact, lower-dimensional representation that preserves as much information as possible.
In this lab we derive and implement PCA from first principles, covering:
| Section | Topic | MML Reference |
|---|---|---|
| 1 | Problem Setting | 10.1 |
| 2 | Maximum Variance Perspective | 10.2 |
| 3 | Projection Perspective | 10.3 |
| 4 | Eigenvector Computation & Low-Rank Approximation | 10.4 |
| 5 | Reconstruction Error vs. Number of Components | – |
| 6 | PCA in High Dimensions | 10.5 |
| 7 | Key Steps of PCA in Practice | 10.6 |
| 8 | Latent Variable Perspective | 10.7 |
| – | Summary | – |
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import matplotlib.transforms as transforms
np.random.seed(42)
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3
print("NumPy version:", np.__version__)
print("Setup complete.")
1. Problem Setting (MML 10.1)
We consider an i.i.d. dataset \(\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}\), \(\mathbf{x}_n \in \mathbb{R}^D\). PCA seeks projections \(\tilde{\mathbf{x}}_n\) that are as similar to the original \(\mathbf{x}_n\) as possible but live in a lower-dimensional subspace of dimension \(M < D\).
The key objects are:
Projection matrix \(\mathbf{B} := [\mathbf{b}_1, \ldots, \mathbf{b}_M] \in \mathbb{R}^{D \times M}\) with orthonormal columns
Low-dimensional code \(\mathbf{z}_n = \mathbf{B}^\top \mathbf{x}_n \in \mathbb{R}^M\) (encoder)
Reconstruction \(\tilde{\mathbf{x}}_n = \mathbf{B}\mathbf{z}_n = \mathbf{B}\mathbf{B}^\top\mathbf{x}_n \in \mathbb{R}^D\) (decoder)
Data covariance matrix \(\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top\) (assuming centered data)
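The encoder/decoder pair above can be sketched directly in NumPy. This toy snippet (the names `B_toy`, `x_toy` are illustrative, and the orthonormal basis here is an arbitrary one obtained via QR, not yet the PCA solution) just shows the shapes and the projection property of \(\mathbf{B}\mathbf{B}^\top\):

```python
import numpy as np

rng = np.random.default_rng(0)
D_toy, M_toy = 4, 2

# Any matrix with orthonormal columns defines a valid encoder/decoder;
# QR of a random matrix gives us one (not the variance-maximizing B).
B_toy, _ = np.linalg.qr(rng.standard_normal((D_toy, M_toy)))
x_toy = rng.standard_normal(D_toy)

z_toy = B_toy.T @ x_toy        # encoder: M-dimensional code
x_tilde = B_toy @ z_toy        # decoder: reconstruction in R^D

# B B^T is an orthogonal projection: applying it twice changes nothing
assert np.allclose(B_toy @ B_toy.T @ x_tilde, x_tilde)
print("code z:", z_toy.round(3))
print("reconstruction error:", round(float(np.linalg.norm(x_toy - x_tilde)), 3))
```

PCA's job, developed in the next sections, is to choose the particular \(\mathbf{B}\) that makes this reconstruction error as small as possible.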
Let us generate a rich 2D dataset with clear correlation structure.
# Generate 2D data with strong correlation
N = 200
mean_true = np.array([2.0, 3.0])
# Covariance with correlation: large variance along a tilted direction
angle = np.pi / 6 # 30 degrees
lambda1, lambda2 = 5.0, 0.5 # eigenvalues (variance along principal axes)
R = np.array([[np.cos(angle), -np.sin(angle)],
[np.sin(angle), np.cos(angle)]])
Sigma_true = R @ np.diag([lambda1, lambda2]) @ R.T
X_raw = np.random.multivariate_normal(mean_true, Sigma_true, N) # (N, D)
print(f"Dataset shape: {X_raw.shape} (N={N}, D={X_raw.shape[1]})")
print(f"Sample mean: [{X_raw.mean(axis=0)[0]:.3f}, {X_raw.mean(axis=0)[1]:.3f}]")
print(f"True covariance matrix:")
print(np.round(Sigma_true, 3))
# Step 1: Center the data (subtract mean)
mu = X_raw.mean(axis=0)
X_centered = X_raw - mu
# Compute data covariance matrix S = (1/N) X^T X (Eq. 10.1 / 10.45)
S = (X_centered.T @ X_centered) / N
print("Data covariance matrix S:")
print(np.round(S, 3))
print(f"\nTrue covariance for comparison:")
print(np.round(Sigma_true, 3))
# Visualize raw and centered data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X_raw[:, 0], X_raw[:, 1], alpha=0.5, s=20, c='steelblue')
axes[0].scatter(*mean_true, c='red', s=100, marker='x', linewidths=2, label='True mean')
axes[0].set_xlabel('$x_1$'); axes[0].set_ylabel('$x_2$')
axes[0].set_title('(a) Original Data'); axes[0].legend(); axes[0].set_aspect('equal')
axes[1].scatter(X_centered[:, 0], X_centered[:, 1], alpha=0.5, s=20, c='steelblue')
axes[1].scatter(0, 0, c='red', s=100, marker='x', linewidths=2, label='Mean = 0')
axes[1].set_xlabel('$x_1$'); axes[1].set_ylabel('$x_2$')
axes[1].set_title('(b) Centered Data ($\\mu$ subtracted)'); axes[1].legend(); axes[1].set_aspect('equal')
plt.tight_layout()
plt.show()
2. Maximum Variance Perspective (MML 10.2)
The maximum variance perspective derives PCA by seeking the direction that maximizes the variance of the projected data.
2.1 Direction with Maximal Variance
For a single direction \(\mathbf{b}_1 \in \mathbb{R}^D\), the variance of the projected data is (Eq. 10.9b):
\[
V_1 = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{b}_1^\top \mathbf{x}_n)^2 = \mathbf{b}_1^\top \mathbf{S}\, \mathbf{b}_1
\]
We maximize \(V_1\) subject to \(\|\mathbf{b}_1\|^2 = 1\). Using the Lagrangian (Eq. 10.11):
\[
\mathfrak{L}(\mathbf{b}_1, \lambda_1) = \mathbf{b}_1^\top \mathbf{S}\, \mathbf{b}_1 + \lambda_1 (1 - \mathbf{b}_1^\top \mathbf{b}_1)
\]
Setting the derivative to zero gives (Eq. 10.13):
\[
\mathbf{S}\,\mathbf{b}_1 = \lambda_1 \mathbf{b}_1
\]
This is an eigenvalue equation: \(\mathbf{b}_1\) is an eigenvector of \(\mathbf{S}\), and \(\lambda_1\) is the corresponding eigenvalue. The variance along \(\mathbf{b}_1\) is \(V_1 = \mathbf{b}_1^\top\mathbf{S}\mathbf{b}_1 = \lambda_1\). To maximize variance, we choose the eigenvector associated with the largest eigenvalue.
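One way to make the eigenvalue equation concrete is power iteration, which is not part of the MML derivation but converges to exactly the \(\mathbf{b}_1\) described above (the matrix `S_toy` is a made-up 2×2 covariance for illustration):

```python
import numpy as np

# A small symmetric positive-definite "covariance" for illustration
S_toy = np.array([[4.0, 1.5],
                  [1.5, 1.0]])

# Power iteration: repeatedly apply S and renormalize; the iterate
# converges to the eigenvector with the largest eigenvalue.
b = np.array([1.0, 0.0])
for _ in range(100):
    b = S_toy @ b
    b /= np.linalg.norm(b)     # enforce the constraint ||b|| = 1

lam = b @ S_toy @ b            # variance along b (Rayleigh quotient)
assert np.allclose(S_toy @ b, lam * b, atol=1e-8)   # S b = lambda b
print("b1 ~", b.round(4), "  lambda_1 ~", round(float(lam), 4))
```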
2.2 M-dimensional Subspace
The total variance captured by the first \(M\) principal components is (Eq. 10.24):
\[
V_M = \sum_{m=1}^{M} \lambda_m
\]
where \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_D\) are the eigenvalues sorted in descending order.
# Eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(S)
# Sort in descending order (eigh returns ascending)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx] # columns are eigenvectors
print("Eigenvalues (variance along each PC):")
for i, lam in enumerate(eigenvalues):
    print(f" lambda_{i+1} = {lam:.4f} (captures {lam/eigenvalues.sum()*100:.1f}% of total variance)")
print(f"\nTotal variance = trace(S) = {np.trace(S):.4f}")
print(f"Sum of eigenvalues = {eigenvalues.sum():.4f}")
print(f"\nPrincipal component 1 (b1): [{eigenvectors[0,0]:.4f}, {eigenvectors[1,0]:.4f}]")
print(f"Principal component 2 (b2): [{eigenvectors[0,1]:.4f}, {eigenvectors[1,1]:.4f}]")
print(f"\nb1 . b2 = {eigenvectors[:,0] @ eigenvectors[:,1]:.10f} (orthogonal)")
# Visualize the principal components on the scatter plot
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X_centered[:, 0], X_centered[:, 1], alpha=0.4, s=20, c='steelblue', label='Centered data')
# Draw eigenvectors scaled by sqrt(eigenvalue) for visual clarity
colors = ['#e74c3c', '#2ecc71']
labels = ['PC1', 'PC2']
for i in range(2):
    scale = 2.5 * np.sqrt(eigenvalues[i])  # scale for visibility
    vec = eigenvectors[:, i] * scale
    ax.annotate('', xy=vec, xytext=(0, 0),
                arrowprops=dict(arrowstyle='->', color=colors[i], lw=3))
    ax.text(vec[0]*1.15, vec[1]*1.15, f'{labels[i]} ($\\lambda_{i+1}$={eigenvalues[i]:.2f})',
            fontsize=12, fontweight='bold', color=colors[i])
# Draw the covariance ellipse (2 standard deviations)
theta = np.linspace(0, 2*np.pi, 100)
ellipse = np.column_stack([np.cos(theta), np.sin(theta)])
ellipse_transformed = (eigenvectors @ np.diag(2*np.sqrt(eigenvalues)) @ ellipse.T).T
ax.plot(ellipse_transformed[:, 0], ellipse_transformed[:, 1], 'k--', alpha=0.5, label='2$\\sigma$ ellipse')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('Maximum Variance Perspective: Principal Components\nare eigenvectors of S')
ax.legend(loc='upper left'); ax.set_aspect('equal')
ax.axhline(0, color='gray', lw=0.5); ax.axvline(0, color='gray', lw=0.5)
plt.tight_layout()
plt.show()
# Verify: variance of projected data equals eigenvalue
b1 = eigenvectors[:, 0] # first principal component
b2 = eigenvectors[:, 1] # second principal component
# Project data onto PC1: z_1n = b1^T x_n
z1 = X_centered @ b1 # (N,)
z2 = X_centered @ b2 # (N,)
var_z1 = np.var(z1) # = (1/N) sum z_1n^2
var_z2 = np.var(z2)
print("Verification: Variance of projected data = eigenvalue")
print(f" Var(z1) = {var_z1:.4f}, lambda_1 = {eigenvalues[0]:.4f} (Eq. 10.15: V1 = b1^T S b1 = lambda_1)")
print(f" Var(z2) = {var_z2:.4f}, lambda_2 = {eigenvalues[1]:.4f}")
print(f"\n b1^T S b1 = {b1 @ S @ b1:.4f}")
print(f" b2^T S b2 = {b2 @ S @ b2:.4f}")
3. Projection Perspective (MML 10.3)
An equivalent derivation views PCA as minimizing the average squared reconstruction error (Eq. 10.29):
\[
J_M = \frac{1}{N}\sum_{n=1}^{N} \big\|\mathbf{x}_n - \tilde{\mathbf{x}}_n\big\|^2
\]
The key results are:
The optimal coordinates are \(z_{in} = \mathbf{b}_i^\top \mathbf{x}_n\) (orthogonal projection, Eq. 10.32)
The optimal reconstruction is \(\tilde{\mathbf{x}}_n = \mathbf{B}\mathbf{B}^\top \mathbf{x}_n\) (Eq. 10.34)
The reconstruction error equals the sum of discarded eigenvalues: \(J_M = \sum_{j=M+1}^{D} \lambda_j\) (Eq. 10.44)
This shows that maximizing variance and minimizing reconstruction error yield the same solution:
\[
\underbrace{V_M}_{\text{captured variance}} + \underbrace{J_M}_{\text{reconstruction error}} = \underbrace{\sum_{d=1}^{D} \lambda_d}_{\text{total variance}} = \text{tr}(\mathbf{S})
\]
# Project onto 1D subspace spanned by b1 (M=1)
B1 = eigenvectors[:, 0:1] # (D, 1)
Z1 = X_centered @ B1 # (N, 1) -- codes
X_proj1 = Z1 @ B1.T # (N, D) -- reconstructions: x_tilde = B z = B B^T x
# Compute reconstruction error
recon_error_1 = np.mean(np.sum((X_centered - X_proj1)**2, axis=1))
print("=== Projection onto M=1 principal subspace ===")
print(f"Reconstruction error J_1 = {recon_error_1:.4f}")
print(f"Sum of discarded eigenvalues (lambda_2) = {eigenvalues[1]:.4f}")
print(f"Captured variance V_1 = {eigenvalues[0]:.4f}")
print(f"V_1 + J_1 = {eigenvalues[0] + recon_error_1:.4f}")
print(f"Total variance tr(S) = {np.trace(S):.4f}")
print(f"\nEquivalence confirmed: V_M + J_M = tr(S)")
# Visualize projection and reconstruction error
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Left: projection onto PC1 with displacement vectors
ax = axes[0]
ax.scatter(X_centered[:, 0], X_centered[:, 1], alpha=0.3, s=15, c='steelblue', label='Original')
ax.scatter(X_proj1[:, 0], X_proj1[:, 1], alpha=0.5, s=15, c='darkorange', label='Projected ($\\tilde{x}$)')
# Draw displacement vectors for a subset
for i in range(0, N, 8):
    ax.plot([X_centered[i, 0], X_proj1[i, 0]],
            [X_centered[i, 1], X_proj1[i, 1]], 'r-', alpha=0.2, lw=0.8)
# Draw principal subspace line
t = np.linspace(-8, 8, 100)
line = np.outer(t, b1)
ax.plot(line[:, 0], line[:, 1], 'k-', lw=1.5, alpha=0.5, label='Principal subspace $U$')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title(f'Projection Perspective (M=1)\n$J_1$ = {recon_error_1:.3f}')
ax.legend(); ax.set_aspect('equal')
# Right: show that displacement vectors are orthogonal to principal subspace
ax = axes[1]
displacements = X_centered - X_proj1 # x_n - x_tilde_n
dot_products = displacements @ b1 # should be ~0 (orthogonal to U)
ax.hist(dot_products, bins=30, color='steelblue', edgecolor='white', alpha=0.7)
ax.axvline(0, color='red', lw=2, linestyle='--')
ax.set_xlabel('$(\\mathbf{x}_n - \\tilde{\\mathbf{x}}_n)^\\top \\mathbf{b}_1$')
ax.set_ylabel('Count')
ax.set_title('Displacement vectors are orthogonal to $U$\n(dot product with $b_1$ is zero)')
plt.tight_layout()
plt.show()
4. Eigenvector Computation and Low-Rank Approximations (MML 10.4)
Step-by-step PCA from scratch
Center the data: \(\mathbf{x}_n \leftarrow \mathbf{x}_n - \boldsymbol{\mu}\)
Compute covariance: \(\mathbf{S} = \frac{1}{N}\mathbf{X}\mathbf{X}^\top\) where \(\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}\)
Eigendecompose: \(\mathbf{S} = \sum_{d=1}^{D} \lambda_d \mathbf{b}_d \mathbf{b}_d^\top\)
Select top-\(M\) components: \(\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_M]\)
Connection to SVD (Eq. 10.47-10.49)
If \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\) (SVD), then \(\mathbf{S} = \frac{1}{N}\mathbf{U}\mathbf{\Sigma}^2\mathbf{U}^\top\). The eigenvalues of \(\mathbf{S}\) relate to singular values via \(\lambda_d = \sigma_d^2 / N\).
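This relation is easy to check numerically. A small sketch (using rows-as-points data, so we take the SVD of the centered \(N \times D\) matrix; the singular values are unchanged by transposition, so the relation \(\lambda_d = \sigma_d^2/N\) holds either way):

```python
import numpy as np

rng = np.random.default_rng(1)
N_svd, D_svd = 100, 4
X_svd = rng.standard_normal((N_svd, D_svd)) @ rng.standard_normal((D_svd, D_svd))
Xc_svd = X_svd - X_svd.mean(axis=0)          # center first

# Eigenvalues of the covariance S (rows-are-points convention)
S_svd = Xc_svd.T @ Xc_svd / N_svd
evals_svd = np.sort(np.linalg.eigvalsh(S_svd))[::-1]

# Singular values of the centered data matrix
sing_svd = np.linalg.svd(Xc_svd, compute_uv=False)

# lambda_d = sigma_d^2 / N, term by term
assert np.allclose(evals_svd, sing_svd**2 / N_svd)
print("eigenvalues :", evals_svd.round(4))
print("sigma^2 / N :", (sing_svd**2 / N_svd).round(4))
```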
Let us now implement PCA from scratch on a richer 3D dataset.
# Generate a 3D dataset that lies mostly in a 2D plane (with some noise)
np.random.seed(123)
N3d = 300
D = 3
# Latent 2D coordinates
z_latent = np.random.randn(N3d, 2) * np.array([3.0, 1.5])
# Mapping from 2D to 3D (tilted plane + noise)
W_true = np.array([[1.0, 0.5],
[0.5, 1.0],
[0.8, -0.3]])
noise = np.random.randn(N3d, D) * 0.3
X3d_raw = z_latent @ W_true.T + noise + np.array([2, -1, 3])
print(f"3D dataset shape: {X3d_raw.shape}")
print(f"Mean: {X3d_raw.mean(axis=0).round(3)}")
def pca_from_scratch(X, M=None):
    """
    PCA via eigendecomposition of the covariance matrix.

    Parameters:
        X : (N, D) data matrix (rows are data points)
        M : number of components to keep (default: all)

    Returns:
        Z : (N, M) projected coordinates
        B : (D, M) principal component matrix (columns = eigenvectors)
        eigenvalues : (D,) sorted eigenvalues
        mu : (D,) data mean
    """
    N, D_dim = X.shape
    if M is None:
        M = D_dim
    # Step 1: Center
    mu = X.mean(axis=0)
    X_c = X - mu
    # Step 2: Covariance matrix S = (1/N) X_c^T X_c
    S = (X_c.T @ X_c) / N
    # Step 3: Eigendecomposition
    evals, evecs = np.linalg.eigh(S)
    # Sort descending
    order = np.argsort(evals)[::-1]
    evals = evals[order]
    evecs = evecs[:, order]
    # Step 4: Select top-M
    B = evecs[:, :M]  # (D, M)
    # Project: z = B^T (x - mu)
    Z = X_c @ B  # (N, M)
    return Z, B, evals, mu
# Apply PCA to 3D data
Z3d, B3d, evals3d, mu3d = pca_from_scratch(X3d_raw, M=3)
print("Eigenvalues of the 3D dataset:")
for i, lam in enumerate(evals3d):
    print(f" lambda_{i+1} = {lam:.4f} ({lam/evals3d.sum()*100:.1f}% of variance)")
print(f"\nCumulative variance: {np.cumsum(evals3d)/evals3d.sum() * 100}")
print(f"\nFirst 2 PCs capture {evals3d[:2].sum()/evals3d.sum()*100:.1f}% of variance")
print(f"=> Data is essentially 2D (as constructed).")
# 3D visualization with principal components
fig = plt.figure(figsize=(14, 6))
# Left: 3D scatter with PC arrows
ax = fig.add_subplot(121, projection='3d')
X3d_c = X3d_raw - mu3d
ax.scatter(X3d_c[:, 0], X3d_c[:, 1], X3d_c[:, 2], alpha=0.3, s=10, c='steelblue')
colors_3d = ['#e74c3c', '#2ecc71', '#3498db']
for i in range(3):
    scale = 2.0 * np.sqrt(evals3d[i])
    vec = B3d[:, i] * scale
    ax.quiver(0, 0, 0, vec[0], vec[1], vec[2], color=colors_3d[i],
              arrow_length_ratio=0.1, linewidth=3, label=f'PC{i+1} ($\\lambda$={evals3d[i]:.2f})')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.set_zlabel('$x_3$')
ax.set_title('3D Data with Principal Components')
ax.legend(loc='upper left', fontsize=9)
# Right: 2D projection onto first two PCs
Z2 = Z3d[:, :2]
ax2 = fig.add_subplot(122)
ax2.scatter(Z2[:, 0], Z2[:, 1], alpha=0.4, s=15, c='steelblue')
ax2.set_xlabel('$z_1$ (PC1)'); ax2.set_ylabel('$z_2$ (PC2)')
ax2.set_title('Projected onto 2D Principal Subspace\n(codes $\\mathbf{z} = \\mathbf{B}^\\top \\mathbf{x}$)')
ax2.set_aspect('equal')
plt.tight_layout()
plt.show()
# Low-rank approximation: reconstruct with k = 1, 2, 3 components
# and measure reconstruction error
X3d_c = X3d_raw - mu3d
errors = []
var_captured = []
print("Low-Rank Approximation Results (3D data):")
print(f"{'M':>3} | {'Recon Error J_M':>16} | {'Sum discarded evals':>20} | {'Variance Captured':>18}")
print("-" * 65)
for M_k in range(1, D + 1):
    B_k = B3d[:, :M_k]  # top-M_k eigenvectors
    X_recon = X3d_c @ B_k @ B_k.T  # reconstruct: B B^T x
    err = np.mean(np.sum((X3d_c - X_recon)**2, axis=1))
    discarded = evals3d[M_k:].sum()
    captured = evals3d[:M_k].sum() / evals3d.sum() * 100
    errors.append(err)
    var_captured.append(captured)
    print(f"{M_k:>3} | {err:>16.4f} | {discarded:>20.4f} | {captured:>17.1f}%")
print(f"\nTotal variance = {evals3d.sum():.4f}")
# Visualize: reconstruction with 1 vs 2 vs 3 components (3D data)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx_m, M_k in enumerate([1, 2, 3]):
    ax = axes[idx_m]
    B_k = B3d[:, :M_k]
    X_recon = X3d_c @ B_k @ B_k.T + mu3d  # add mean back
    err_k = np.mean(np.sum((X3d_raw - X_recon)**2, axis=1))
    # Plot first two coordinates for visualization
    ax.scatter(X3d_raw[:, 0], X3d_raw[:, 1], alpha=0.2, s=10, c='steelblue', label='Original')
    ax.scatter(X_recon[:, 0], X_recon[:, 1], alpha=0.3, s=10, c='darkorange', label='Reconstructed')
    ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
    ax.set_title(f'M={M_k} components\nError={err_k:.3f}, Var={var_captured[idx_m]:.1f}%')
    ax.legend(fontsize=9); ax.set_aspect('equal')
plt.suptitle('Low-Rank Reconstruction of 3D Data (projected to $x_1$-$x_2$ plane)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
5. Reconstruction Error vs. Number of Components
For a more compelling demonstration, let us generate higher-dimensional data (10D) with an intrinsic dimensionality of about 3, and plot how reconstruction error drops as we add components.
From Eq. 10.62: \(\frac{1}{N}\sum_{n=1}^{N}\|\mathbf{x}_n - \tilde{\mathbf{x}}_n\|^2 = \sum_{i=M+1}^{D} \lambda_i\)
# Generate 10D data with intrinsic dim ~3
np.random.seed(99)
N_hd = 500
D_hd = 10
M_intrinsic = 3
# Latent factors
Z_latent = np.random.randn(N_hd, M_intrinsic) * np.array([5.0, 2.0, 1.0])
# Random mixing matrix
W_mix = np.random.randn(M_intrinsic, D_hd)
noise_hd = np.random.randn(N_hd, D_hd) * 0.2
X_hd = Z_latent @ W_mix + noise_hd
# Apply PCA
_, B_hd, evals_hd, mu_hd = pca_from_scratch(X_hd, M=D_hd)
X_hd_c = X_hd - mu_hd
# Compute reconstruction error for each M
recon_errors_hd = []
for M_k in range(1, D_hd + 1):
    Bk = B_hd[:, :M_k]
    Xr = X_hd_c @ Bk @ Bk.T
    err = np.mean(np.sum((X_hd_c - Xr)**2, axis=1))
    recon_errors_hd.append(err)
# Also compute from eigenvalues (should match)
recon_errors_theory = [evals_hd[M_k:].sum() for M_k in range(1, D_hd + 1)]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Scree plot (eigenvalues)
ax = axes[0]
ax.bar(range(1, D_hd + 1), evals_hd, color='steelblue', edgecolor='white', alpha=0.8)
ax.set_xlabel('Principal Component'); ax.set_ylabel('Eigenvalue $\\lambda_m$')
ax.set_title('Scree Plot')
ax.set_xticks(range(1, D_hd + 1))
# Cumulative variance explained
ax = axes[1]
cumvar = np.cumsum(evals_hd) / evals_hd.sum() * 100
ax.plot(range(1, D_hd + 1), cumvar, 'o-', color='steelblue', markersize=8)
ax.axhline(95, color='red', linestyle='--', alpha=0.7, label='95% threshold')
ax.set_xlabel('Number of Components $M$'); ax.set_ylabel('Cumulative Variance (%)')
ax.set_title('Cumulative Variance Explained')
ax.set_xticks(range(1, D_hd + 1)); ax.legend()
ax.set_ylim([0, 105])
# Reconstruction error vs M
ax = axes[2]
ax.plot(range(1, D_hd + 1), recon_errors_hd, 's-', color='#e74c3c', markersize=8, label='Empirical $J_M$')
ax.plot(range(1, D_hd + 1), recon_errors_theory, 'o--', color='steelblue', markersize=6,
alpha=0.7, label='$\\sum_{j=M+1}^D \\lambda_j$')
ax.set_xlabel('Number of Components $M$'); ax.set_ylabel('Reconstruction Error $J_M$')
ax.set_title('Reconstruction Error vs. $M$\n(Eq. 10.62)')
ax.set_xticks(range(1, D_hd + 1)); ax.legend()
plt.tight_layout()
plt.show()
print(f"With M=3 components: {cumvar[2]:.1f}% variance captured (intrinsic dim = {M_intrinsic})")
6. PCA in High Dimensions (MML 10.5)
When \(D \gg N\) (many more dimensions than data points), computing the \(D \times D\) covariance matrix is expensive. The dual trick exploits the fact that \(\mathbf{S} = \frac{1}{N}\mathbf{X}\mathbf{X}^\top\) has at most \(N\) nonzero eigenvalues.
Key idea: Instead of eigendecomposing the \(D \times D\) matrix \(\mathbf{S} = \frac{1}{N}\mathbf{X}\mathbf{X}^\top\), we eigendecompose the smaller \(N \times N\) matrix \(\frac{1}{N}\mathbf{X}^\top\mathbf{X}\) (Eq. 10.56):
\[
\frac{1}{N}\mathbf{X}^\top\mathbf{X}\,\mathbf{c}_m = \lambda_m \mathbf{c}_m
\]
The eigenvectors of \(\mathbf{S}\) are recovered as \(\mathbf{b}_m = \mathbf{X}\mathbf{c}_m\) (normalized to unit length). This reduces cost from \(O(D^3)\) to \(O(N^3)\).
Both matrices share the same nonzero eigenvalues (Eq. 10.57).
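The recipe can be packaged as a small helper. This sketch uses this lab's rows-are-points convention, so the \(N \times N\) Gram matrix is \(\mathbf{X}\mathbf{X}^\top/N\) and lifting \(\mathbf{c}_m\) back to \(\mathbb{R}^D\) uses \(\mathbf{X}^\top\mathbf{c}_m\) (the function name `dual_pca` is ours, not from MML):

```python
import numpy as np

def dual_pca(X_c, M):
    """PCA via the N x N Gram matrix -- useful when D >> N.
    X_c: (N, D) centered data, rows are points.
    Returns the top-M eigenvalues of S and the corresponding
    unit-norm eigenvectors of S as the columns of B."""
    N = len(X_c)
    K = X_c @ X_c.T / N                     # N x N Gram matrix
    evals, C = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:M]     # top-M, descending
    evals, C = evals[order], C[:, order]
    B = X_c.T @ C                           # lift c_m back to R^D
    B /= np.linalg.norm(B, axis=0)          # normalize to unit length
    return evals, B

# Sanity check against the direct D x D route on a small example
rng = np.random.default_rng(3)
Xc_demo = rng.standard_normal((20, 50))     # N=20 points in D=50
Xc_demo -= Xc_demo.mean(axis=0)
evals_top, B_top = dual_pca(Xc_demo, M=3)
evals_full = np.sort(np.linalg.eigvalsh(Xc_demo.T @ Xc_demo / 20))[::-1][:3]
assert np.allclose(evals_top, evals_full)
print("top-3 eigenvalues:", evals_top.round(4))
```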
# Demonstrate the dual trick: D >> N scenario
np.random.seed(77)
D_high = 1000 # very high dimensional
N_small = 50 # few samples
M_true = 5 # intrinsic dimensionality
# Generate data: N_small points in D_high dimensions, intrinsic dim = M_true
Z_hd = np.random.randn(N_small, M_true) * np.array([10, 5, 3, 1.5, 0.5])
W_hd = np.random.randn(M_true, D_high) / np.sqrt(D_high)
X_highdim = Z_hd @ W_hd + np.random.randn(N_small, D_high) * 0.05
# Center
mu_hd = X_highdim.mean(axis=0)
X_hd_c = X_highdim - mu_hd
print(f"Data dimensions: N={N_small}, D={D_high}")
print(f"Covariance matrix XX^T/N would be {D_high}x{D_high} = {D_high**2:,} entries")
print(f"Dual matrix X^TX/N is only {N_small}x{N_small} = {N_small**2} entries")
print(f"Speedup factor: {D_high**2 / N_small**2:.0f}x fewer entries\n")
# --- Method 1: Standard PCA (expensive: D x D covariance) ---
import time
t0 = time.time()
S_standard = (X_hd_c.T @ X_hd_c) / N_small # D x D
evals_std, evecs_std = np.linalg.eigh(S_standard)
idx_std = np.argsort(evals_std)[::-1]
evals_std = evals_std[idx_std]
evecs_std = evecs_std[:, idx_std]
t_standard = time.time() - t0
# --- Method 2: Dual PCA (cheap: N x N matrix) ---
t0 = time.time()
S_dual = (X_hd_c @ X_hd_c.T) / N_small # N x N (this is X^T X / N in book's convention)
evals_dual, evecs_dual = np.linalg.eigh(S_dual)
idx_dual = np.argsort(evals_dual)[::-1]
evals_dual = evals_dual[idx_dual]
evecs_dual = evecs_dual[:, idx_dual]
# Recover eigenvectors of S: b_m = X c_m (then normalize)
# Note: in our row convention, X_hd_c is N x D, so b_m = X_hd_c^T c_m
B_dual = X_hd_c.T @ evecs_dual # D x N
# Normalize columns to unit length
norms = np.linalg.norm(B_dual, axis=0)
norms[norms < 1e-10] = 1.0 # avoid division by zero for zero eigenvalues
B_dual = B_dual / norms
t_dual = time.time() - t0
print(f"Standard PCA time: {t_standard:.4f}s")
print(f"Dual PCA time: {t_dual:.4f}s")
print(f"\nTop-10 eigenvalues comparison:")
print(f"{'Standard':>12} | {'Dual':>12} | {'Match':>8}")
print("-" * 38)
for i in range(10):
    match = np.abs(evals_std[i] - evals_dual[i]) < 1e-6
    print(f"{evals_std[i]:>12.6f} | {evals_dual[i]:>12.6f} | {'yes' if match else 'no':>8}")
# Verify that eigenvectors from both methods span the same subspace
# (they may differ by sign, so we check |cos(angle)| = 1)
M_check = 5
print(f"Checking alignment of top-{M_check} eigenvectors (|cos angle| should be ~1):")
for i in range(M_check):
    cos_angle = np.abs(evecs_std[:, i] @ B_dual[:, i])
    print(f" PC{i+1}: |cos(angle)| = {cos_angle:.8f}")
# Plot eigenvalue spectrum comparison
fig, ax = plt.subplots(figsize=(10, 5))
valid = evals_dual > 1e-10 # only nonzero eigenvalues
ax.semilogy(range(1, N_small + 1), evals_dual, 'o-', color='steelblue',
markersize=6, label='Dual eigenvalues ($N \\times N$)')
ax.semilogy(range(1, min(N_small + 1, len(evals_std) + 1)),
evals_std[:N_small], 'x--', color='#e74c3c',
markersize=8, alpha=0.7, label='Standard eigenvalues ($D \\times D$)')
ax.axvline(M_true, color='green', linestyle=':', lw=2, label=f'True intrinsic dim = {M_true}')
ax.set_xlabel('Index'); ax.set_ylabel('Eigenvalue (log scale)')
ax.set_title(f'PCA in High Dimensions: D={D_high}, N={N_small}\nDual trick gives identical eigenvalues')
ax.legend()
plt.tight_layout()
plt.show()
7. Key Steps of PCA in Practice (MML 10.6)
The full PCA pipeline from Section 10.6 of MML:
Mean subtraction: Center the data by subtracting \(\boldsymbol{\mu}\)
Standardization: Divide each dimension by its standard deviation \(\sigma_d\) so that all features have unit variance
Eigendecomposition: Compute eigenvalues/vectors of the (standardized) covariance matrix
Projection: Map data to the principal subspace via \(\mathbf{z} = \mathbf{B}^\top \mathbf{x}_{\text{std}}\)
Reconstruction: Undo projection and standardization: \(\tilde{x}^{(d)} \leftarrow \tilde{x}_{\text{std}}^{(d)} \sigma_d + \mu_d\)
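These five steps can be condensed into one helper function (a compact NumPy sketch; the function name `pca_pipeline` and the synthetic data below are ours). The cells that follow walk through the same steps with visualization:

```python
import numpy as np

def pca_pipeline(X, M):
    """Steps 1-5: center, standardize, eigendecompose, project,
    and reconstruct back in the original units."""
    mu = X.mean(axis=0)                     # step 1: mean
    sigma = X.std(axis=0)                   # step 2: per-dimension std
    Xs = (X - mu) / sigma                   # standardized data
    S = Xs.T @ Xs / len(X)                  # covariance (standardized)
    evals, evecs = np.linalg.eigh(S)        # step 3: eigendecompose
    order = np.argsort(evals)[::-1]
    B = evecs[:, order[:M]]                 # top-M principal directions
    Z = Xs @ B                              # step 4: project
    X_rec = (Z @ B.T) * sigma + mu          # step 5: undo standardization
    return Z, X_rec

# Tiny usage check on correlated synthetic 2D data
rng = np.random.default_rng(2)
X_syn = rng.standard_normal((200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Z_syn, X_rec_syn = pca_pipeline(X_syn, M=1)
err_syn = float(np.mean(np.sum((X_syn - X_rec_syn)**2, axis=1)))
print("code shape:", Z_syn.shape, " mean squared reconstruction error:", round(err_syn, 4))
```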
Let us implement this full pipeline and visualize each step, following Figure 10.11 of MML.
# Generate 2D data matching Figure 10.11 of MML
np.random.seed(42)
N_pp = 150
# Data with different scales on each axis
mean_pp = np.array([3.0, 1.5])
cov_pp = np.array([[4.0, 2.5],
[2.5, 2.0]])
X_pp = np.random.multivariate_normal(mean_pp, cov_pp, N_pp)
# === Full PCA Pipeline ===
# Step 1: Mean subtraction
mu_pp = X_pp.mean(axis=0)
X_step1 = X_pp - mu_pp
# Step 2: Standardization (divide by std dev)
sigma_pp = X_pp.std(axis=0)
X_step2 = X_step1 / sigma_pp # now unit variance per dimension
# Step 3: Eigendecomposition of standardized covariance
S_std = (X_step2.T @ X_step2) / N_pp
evals_pp, evecs_pp = np.linalg.eigh(S_std)
order_pp = np.argsort(evals_pp)[::-1]
evals_pp = evals_pp[order_pp]
evecs_pp = evecs_pp[:, order_pp]
# Step 4: Project onto M=1 principal subspace
M_pp = 1
B_pp = evecs_pp[:, :M_pp] # (2, 1)
Z_pp = X_step2 @ B_pp # (N, 1)
# Step 5: Reconstruct in standardized space, then undo standardization
X_recon_std = Z_pp @ B_pp.T # in standardized space
X_recon_orig = X_recon_std * sigma_pp + mu_pp # back to original space
print("Pipeline Summary:")
print(f" Original mean: {mu_pp.round(3)}")
print(f" Original std: {sigma_pp.round(3)}")
print(f" Std covariance eigenvalues: {evals_pp.round(4)}")
print(f" PC1 direction (standardized): {evecs_pp[:, 0].round(4)}")
print(f" Variance captured by PC1: {evals_pp[0]/evals_pp.sum()*100:.1f}%")
# Visualize all 6 steps (following Figure 10.11 of MML)
fig, axes = plt.subplots(2, 3, figsize=(18, 11))
# (a) Original dataset
ax = axes[0, 0]
ax.scatter(X_pp[:, 0], X_pp[:, 1], alpha=0.4, s=15, c='steelblue')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('(a) Original Dataset')
# (b) Centered data
ax = axes[0, 1]
ax.scatter(X_step1[:, 0], X_step1[:, 1], alpha=0.4, s=15, c='steelblue')
ax.axhline(0, color='gray', lw=0.5); ax.axvline(0, color='gray', lw=0.5)
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('(b) Step 1: Mean Subtracted')
# (c) Standardized
ax = axes[0, 2]
ax.scatter(X_step2[:, 0], X_step2[:, 1], alpha=0.4, s=15, c='steelblue')
ax.axhline(0, color='gray', lw=0.5); ax.axvline(0, color='gray', lw=0.5)
# Show unit variance arrows
ax.annotate('', xy=(1, 0), xytext=(-1, 0), arrowprops=dict(arrowstyle='<->', color='red', lw=2))
ax.annotate('', xy=(0, 1), xytext=(0, -1), arrowprops=dict(arrowstyle='<->', color='red', lw=2))
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('(c) Step 2: Standardized\n(unit variance per dim)')
# (d) Eigendecomposition
ax = axes[1, 0]
ax.scatter(X_step2[:, 0], X_step2[:, 1], alpha=0.3, s=15, c='steelblue')
for i in range(2):
    scale = 2.0 * np.sqrt(evals_pp[i])
    vec = evecs_pp[:, i] * scale
    color = '#e74c3c' if i == 0 else '#2ecc71'
    ax.annotate('', xy=vec, xytext=(0, 0),
                arrowprops=dict(arrowstyle='->', color=color, lw=3))
# Draw covariance ellipse
theta_e = np.linspace(0, 2*np.pi, 100)
ell = np.column_stack([np.cos(theta_e), np.sin(theta_e)])
ell_t = (evecs_pp @ np.diag(2*np.sqrt(evals_pp)) @ ell.T).T
ax.plot(ell_t[:, 0], ell_t[:, 1], 'k--', alpha=0.5)
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('(d) Step 3: Eigendecomposition\n(arrows = eigenvectors, ellipse = covariance)')
# (e) Projected data
ax = axes[1, 1]
X_proj_std = Z_pp @ B_pp.T # projected in standardized space
ax.scatter(X_step2[:, 0], X_step2[:, 1], alpha=0.15, s=10, c='steelblue', label='Standardized')
ax.scatter(X_proj_std[:, 0], X_proj_std[:, 1], alpha=0.4, s=15, c='darkorange', label='Projected')
# Draw principal subspace line
t_line = np.linspace(-4, 4, 100)
line_pp = np.outer(t_line, evecs_pp[:, 0])
ax.plot(line_pp[:, 0], line_pp[:, 1], 'k-', lw=1, alpha=0.5)
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('(e) Step 4: Projection onto $U$')
ax.legend(fontsize=9)
# (f) Back to original space
ax = axes[1, 2]
ax.scatter(X_pp[:, 0], X_pp[:, 1], alpha=0.2, s=10, c='steelblue', label='Original')
ax.scatter(X_recon_orig[:, 0], X_recon_orig[:, 1], alpha=0.4, s=15, c='darkorange', label='Reconstructed')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('(f) Step 5: Undo standardization\n(back to original space)')
ax.legend(fontsize=9)
for ax_row in axes:
    for a in ax_row:
        a.set_aspect('equal')
plt.suptitle('Full PCA Pipeline (MML Figure 10.11)', fontsize=16, y=1.01)
plt.tight_layout()
plt.show()
# Scree plot and variance explained for the standardized data
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Scree plot
ax = axes[0]
ax.bar(range(1, 3), evals_pp, color='steelblue', edgecolor='white', width=0.5)
ax.set_xlabel('Principal Component')
ax.set_ylabel('Eigenvalue $\\lambda_m$')
ax.set_title('Scree Plot (Standardized 2D Data)')
ax.set_xticks([1, 2])
# Variance explained
ax = axes[1]
cumvar_pp = np.cumsum(evals_pp) / evals_pp.sum() * 100
individual = evals_pp / evals_pp.sum() * 100
ax.bar(range(1, 3), individual, color='steelblue', edgecolor='white', width=0.5,
alpha=0.6, label='Individual')
ax.plot(range(1, 3), cumvar_pp, 'ro-', markersize=10, label='Cumulative')
ax.axhline(95, color='gray', linestyle='--', alpha=0.5, label='95% threshold')
ax.set_xlabel('Number of Components $M$')
ax.set_ylabel('Variance Explained (%)')
ax.set_title('Variance Explained')
ax.set_xticks([1, 2])
ax.set_ylim([0, 110])
ax.legend()
plt.tight_layout()
plt.show()
print(f"PC1 captures {individual[0]:.1f}% of variance, PC2 captures {individual[1]:.1f}%")
8. Latent Variable Perspective (MML 10.7)
The probabilistic PCA (PPCA) model (Tipping & Bishop, 1999) posits a generative model:
\[
\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_M), \qquad
\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu},\ \sigma^2\mathbf{I}_D)
\]
Key insight from MML Section 10.7: Standard PCA is the maximum likelihood solution of this model as \(\sigma^2 \to 0\).
The MLE for \(\mathbf{W}\) is:
\[
\mathbf{W}_{\text{ML}} = \mathbf{B}_M(\boldsymbol{\Lambda}_M - \sigma^2 \mathbf{I})^{1/2}\mathbf{R}
\]
where \(\mathbf{B}_M\) contains the top-\(M\) eigenvectors of \(\mathbf{S}\), \(\mathbf{\Lambda}_M = \text{diag}(\lambda_1,\ldots,\lambda_M)\), and \(\mathbf{R}\) is an arbitrary rotation. As \(\sigma^2 \to 0\), this reduces to standard PCA.
The MLE for \(\sigma^2\) is:
\[
\sigma^2_{\text{ML}} = \frac{1}{D-M}\sum_{j=M+1}^{D}\lambda_j
\]
which is the average variance in the discarded dimensions.
Let us demonstrate this connection.
# Probabilistic PCA: demonstrate the generative model and MLE connection
np.random.seed(42)
# True generative model parameters
D_ppca = 5
M_ppca = 2
N_ppca = 500
sigma2_true = 0.3 # observation noise
W_true_ppca = np.array([[2.0, 0.0],
[1.0, 1.5],
[0.5, -1.0],
[-0.3, 0.8],
[0.1, 0.2]])
mu_true_ppca = np.array([1.0, -0.5, 2.0, 0.0, 1.5])
# Generate data from the PPCA model
Z_ppca = np.random.randn(N_ppca, M_ppca) # latent variables z ~ N(0, I)
noise_ppca = np.random.randn(N_ppca, D_ppca) * np.sqrt(sigma2_true)
X_ppca = Z_ppca @ W_true_ppca.T + mu_true_ppca + noise_ppca # x = Wz + mu + eps
print("Probabilistic PCA Generative Model:")
print(f" z ~ N(0, I_{M_ppca}), x|z ~ N(Wz + mu, {sigma2_true}*I_{D_ppca})")
print(f" Generated {N_ppca} samples in {D_ppca}D from {M_ppca}D latent space")
# Apply standard PCA
Z_pca, B_pca, evals_pca, mu_pca = pca_from_scratch(X_ppca, M=D_ppca)
print(f"\nEigenvalues of data covariance:")
for i, ev in enumerate(evals_pca):
    print(f" lambda_{i+1} = {ev:.4f}")
# MLE for sigma^2: average of discarded eigenvalues
sigma2_mle = evals_pca[M_ppca:].sum() / (D_ppca - M_ppca)
print(f"\nMLE sigma^2 = (1/(D-M)) * sum(lambda_{{M+1}},...,lambda_D)")
print(f" = {sigma2_mle:.4f}")
print(f"True sigma^2 = {sigma2_true:.4f}")
# MLE for W (up to rotation)
Lambda_M = np.diag(evals_pca[:M_ppca])
B_M = B_pca[:, :M_ppca]
W_mle = B_M @ np.sqrt(Lambda_M - sigma2_mle * np.eye(M_ppca))
print(f"\nW_ML shape: {W_mle.shape}")
print(f"Column norms of W_ML: {np.linalg.norm(W_mle, axis=0).round(4)}")
print(f"Column norms of W_true: {np.linalg.norm(W_true_ppca, axis=0).round(4)}")
# Show that as sigma^2 -> 0, PPCA reduces to standard PCA
# We demonstrate this by generating data with decreasing noise levels
np.random.seed(42)
sigma2_values = [2.0, 0.5, 0.1, 0.01, 0.001]
N_demo = 300
D_demo = 3
M_demo = 1
W_demo = np.array([[2.0], [1.0], [0.5]])
fig, axes = plt.subplots(1, len(sigma2_values), figsize=(20, 4), subplot_kw={'projection': '3d'})
for idx, s2 in enumerate(sigma2_values):
    z_demo = np.random.randn(N_demo, M_demo)
    x_demo = z_demo @ W_demo.T + np.random.randn(N_demo, D_demo) * np.sqrt(s2)
    # PCA
    x_c = x_demo - x_demo.mean(axis=0)
    S_demo = (x_c.T @ x_c) / N_demo
    ev_demo, _ = np.linalg.eigh(S_demo)
    ev_demo = np.sort(ev_demo)[::-1]
    sigma2_est = ev_demo[M_demo:].sum() / (D_demo - M_demo)
    ax = axes[idx]
    ax.scatter(x_c[:, 0], x_c[:, 1], x_c[:, 2], alpha=0.3, s=5, c='steelblue')
    ax.set_title(f'$\\sigma^2$={s2}\n$\\hat{{\\sigma}}^2$={sigma2_est:.3f}', fontsize=10)
    ax.set_xlabel('$x_1$', fontsize=8)
    ax.set_ylabel('$x_2$', fontsize=8)
    ax.set_zlabel('$x_3$', fontsize=8)
plt.suptitle('PPCA: As $\\sigma^2 \\to 0$, data concentrates on the linear subspace\n'
'(standard PCA = MLE of PPCA in the zero-noise limit)', fontsize=13, y=1.08)
plt.tight_layout()
plt.show()
# Quantitative comparison: marginal likelihood of PPCA model
# p(x) = N(x | mu, WW^T + sigma^2 I)
# The marginal covariance is C = WW^T + sigma^2 I
# Use our 5D PPCA data to show how well the model fits
X_ppca_c = X_ppca - mu_pca
# Standard PCA reconstruction error vs PPCA
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: eigenvalue spectrum with sigma^2 line
ax = axes[0]
ax.bar(range(1, D_ppca + 1), evals_pca, color='steelblue', edgecolor='white', width=0.5)
ax.axhline(sigma2_mle, color='red', linestyle='--', lw=2,
label=f'$\\sigma^2_{{ML}}$ = {sigma2_mle:.3f} (avg. discarded $\\lambda$)')
ax.axhline(sigma2_true, color='green', linestyle=':', lw=2,
label=f'$\\sigma^2_{{true}}$ = {sigma2_true:.3f}')
ax.set_xlabel('Component'); ax.set_ylabel('Eigenvalue')
ax.set_title('PPCA: Eigenvalue Spectrum\n$\\sigma^2$ = average of noise eigenvalues')
ax.set_xticks(range(1, D_ppca + 1))
ax.legend(fontsize=10)
# Right: compare standard PCA projection with PPCA latent recovery
ax = axes[1]
Z_standard = X_ppca_c @ B_pca[:, :M_ppca] # standard PCA codes
ax.scatter(Z_standard[:, 0], Z_standard[:, 1], alpha=0.3, s=10, c='steelblue',
label='PCA codes (standard)')
ax.scatter(Z_ppca[:, 0], Z_ppca[:, 1], alpha=0.1, s=10, c='red',
label='True latent $z$ (unknown in practice)')
ax.set_xlabel('$z_1$'); ax.set_ylabel('$z_2$')
ax.set_title('Latent Space: PCA codes vs True latent variables\n(differ by rotation, consistent with theory)')
ax.legend(fontsize=9)
ax.set_aspect('equal')
plt.tight_layout()
plt.show()
# Show that PCA codes and true latents are related by a rotation
# Compute the correlation between PCA codes and true z (after optimal alignment)
U_align, _, Vt_align = np.linalg.svd(Z_ppca.T @ Z_standard)
R_optimal = U_align @ Vt_align
Z_aligned = Z_standard @ R_optimal.T
corr_1 = np.corrcoef(Z_ppca[:, 0], Z_aligned[:, 0])[0, 1]
corr_2 = np.corrcoef(Z_ppca[:, 1], Z_aligned[:, 1])[0, 1]
print(f"Correlation between true latents and aligned PCA codes:")
print(f" z_1: r = {corr_1:.4f}")
print(f" z_2: r = {corr_2:.4f}")
print("PCA recovers the latent structure up to rotation (as predicted by theory).")
Summary
This lab implemented PCA from first principles following Chapter 10 of MML. Here are the key takeaways:
Three Equivalent Perspectives on PCA
| Perspective | Objective | Solution |
|---|---|---|
| Maximum Variance (10.2) | Maximize \(V_M = \mathbf{b}^\top\mathbf{S}\mathbf{b}\) | Eigenvectors of \(\mathbf{S}\) with largest eigenvalues |
| Minimum Reconstruction Error (10.3) | Minimize \(J_M = \frac{1}{N}\sum_n \Vert\mathbf{x}_n - \tilde{\mathbf{x}}_n\Vert^2\) | Same eigenvectors (equivalent!) |
| Probabilistic / Latent Variable (10.7) | Maximize likelihood of PPCA model | Same eigenvectors (MLE as \(\sigma^2 \to 0\)) |
Key Equations
Covariance: \(\mathbf{S} = \frac{1}{N}\mathbf{X}\mathbf{X}^\top\) (Eq. 10.1/10.45)
Eigenvalue equation: \(\mathbf{S}\mathbf{b}_m = \lambda_m\mathbf{b}_m\) (Eq. 10.13)
Projection: \(\tilde{\mathbf{x}} = \mathbf{B}\mathbf{B}^\top\mathbf{x}\) (Eq. 10.34)
Variance-error duality: \(V_M + J_M = \text{tr}(\mathbf{S})\) (Eq. 10.24 + 10.25)
SVD connection: \(\lambda_d = \sigma_d^2/N\) (Eq. 10.49)
Dual trick (\(D \gg N\)): eigendecompose \(\frac{1}{N}\mathbf{X}^\top\mathbf{X}\) instead (Eq. 10.56)
PPCA noise: \(\sigma^2_{\text{ML}} = \frac{1}{D-M}\sum_{j=M+1}^{D}\lambda_j\)
Practical Pipeline (10.6)
Center data (subtract mean)
Standardize (divide by standard deviation)
Eigendecompose covariance \(\to\) select top-\(M\) components
Project: \(\mathbf{z} = \mathbf{B}^\top\mathbf{x}_{\text{std}}\)
Reconstruct: undo standardization to return to original space