Lab 07: Linear Regression
Based on Chapters 8–9 of Mathematics for Machine Learning (Deisenroth, Faisal, Ong)
In this lab we implement the core ideas of linear regression from first principles using only NumPy and Matplotlib. We cover:
Data, Models, and Learning (Ch 8.1) – generating synthetic data, defining a linear model and loss function
Empirical Risk Minimization (Ch 8.2) – MSE loss, loss landscape visualization
Parameter Estimation via MLE (Ch 9.2) – closed-form normal equations
Overfitting (Ch 9.2.2) – polynomial regression, training vs test error, bias-variance tradeoff
Bayesian Linear Regression (Ch 9.3) – prior/posterior updates, predictive uncertainty
Maximum Likelihood as Orthogonal Projection (Ch 9.4) – geometric interpretation
Model Selection (Ch 8.6) – cross-validation, AIC/BIC
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
print('Setup complete.')
1. Data, Models, and Learning (MML 8.1)
A machine learning system has three components: data, models, and learning.
Data as vectors. We represent each example \(\mathbf{x}_n \in \mathbb{R}^D\) as a feature vector with a corresponding label \(y_n \in \mathbb{R}\). A dataset is a set of example-label pairs \(\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}\).
Models as functions. A predictor maps inputs to outputs. For linear regression we use an affine function of the inputs:

$$f(\mathbf{x}_n, \boldsymbol{\theta}) = \theta_0 + \theta_1 x_n^{(1)} + \cdots + \theta_D x_n^{(D)}$$

By prepending a 1 to each input (i.e. \(\mathbf{x}_n = [1, x_n^{(1)}, \ldots, x_n^{(D)}]^\top\)), we absorb the bias into the parameter vector and write \(f(\mathbf{x}_n, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}_n\).
Noise model. Observations are corrupted by i.i.d. Gaussian noise:

$$y_n = \boldsymbol{\theta}^\top \mathbf{x}_n + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2)$$

Loss function. We measure how well the predictor fits the data using the squared loss:

$$\ell(y_n, f(\mathbf{x}_n, \boldsymbol{\theta})) = (y_n - f(\mathbf{x}_n, \boldsymbol{\theta}))^2$$
# --- Generate synthetic 1-D linear data with noise ---
true_theta_0 = 2.0 # true intercept
true_theta_1 = 1.5 # true slope
sigma = 0.8 # noise standard deviation
N = 50 # number of data points
x_raw = np.random.uniform(-3, 3, N)
y = true_theta_0 + true_theta_1 * x_raw + np.random.normal(0, sigma, N)
# Design matrix with bias column: X is N x 2
X = np.column_stack([np.ones(N), x_raw])
true_theta = np.array([true_theta_0, true_theta_1])
# Define loss function
def squared_loss(y_true, y_pred):
"""Per-sample squared loss."""
return (y_true - y_pred) ** 2
def mse_loss(y_true, y_pred):
"""Mean squared error (empirical risk with squared loss)."""
return np.mean(squared_loss(y_true, y_pred))
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: data + true line
x_plot = np.linspace(-3.5, 3.5, 200)
axes[0].scatter(x_raw, y, alpha=0.6, label='Noisy observations', zorder=3)
axes[0].plot(x_plot, true_theta_0 + true_theta_1 * x_plot, 'k-', lw=2,
label=f'True function: $y = {true_theta_0} + {true_theta_1}x$')
axes[0].set_xlabel('$x$'); axes[0].set_ylabel('$y$')
axes[0].set_title('Synthetic Dataset')
axes[0].legend(); axes[0].grid(True, alpha=0.3)
# Right: show individual losses as vertical bars
y_pred_true = true_theta_0 + true_theta_1 * x_raw
axes[1].scatter(x_raw, y, alpha=0.6, zorder=3)
axes[1].plot(x_plot, true_theta_0 + true_theta_1 * x_plot, 'k-', lw=2)
for i in range(N):
axes[1].plot([x_raw[i], x_raw[i]], [y[i], y_pred_true[i]], 'r-', alpha=0.3, lw=1)
axes[1].set_xlabel('$x$'); axes[1].set_ylabel('$y$')
axes[1].set_title(f'Residuals (MSE = {mse_loss(y, y_pred_true):.3f})')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f'Dataset shape: X = {X.shape}, y = {y.shape}')
print(f'True parameters: theta_0 = {true_theta_0}, theta_1 = {true_theta_1}')
print(f'Noise std: sigma = {sigma}')
2. Empirical Risk Minimization (MML 8.2)
The empirical risk is the average loss on the training data (MML Eq. 8.6):

$$R_{\text{emp}}(f, \mathcal{X}, \mathcal{Y}) = \frac{1}{N}\sum_{n=1}^N \ell(y_n, \hat{y}_n)$$

For the squared loss \(\ell(y_n, \hat{y}_n) = (y_n - \hat{y}_n)^2\) and a linear predictor \(\hat{y}_n = \boldsymbol{\theta}^\top \mathbf{x}_n\), this becomes:

$$R_{\text{emp}}(\boldsymbol{\theta}) = \frac{1}{N}\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2$$

Empirical risk minimization seeks the parameter \(\boldsymbol{\theta}^*\) that minimizes \(R_{\text{emp}}\). Below we compute and visualize the loss landscape over a grid of \((\theta_0, \theta_1)\) values.
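Grid search scales poorly with dimension; the empirical risk can also be minimized iteratively. The sketch below, under illustrative settings (the learning rate, step count, and variable names are assumptions, not part of the lab), runs batch gradient descent on the MSE and checks it against the closed-form solution:

```python
import numpy as np

def gradient_descent_mse(X, y, lr=0.05, n_steps=500):
    """Minimize R_emp(theta) = (1/N) ||y - X theta||^2 by batch gradient descent.

    The gradient is dR/dtheta = -(2/N) X^T (y - X theta).
    """
    theta = np.zeros(X.shape[1])
    N = len(y)
    for _ in range(n_steps):
        theta -= lr * (-(2.0 / N) * X.T @ (y - X @ theta))
    return theta

# Illustrative data, mirroring the lab's setup
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 50)
X_demo = np.column_stack([np.ones(50), x])
y_demo = 2.0 + 1.5 * x + rng.normal(0, 0.5, 50)

theta_gd = gradient_descent_mse(X_demo, y_demo)
theta_cf = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(np.allclose(theta_gd, theta_cf, atol=1e-3))
```

For this small, well-conditioned problem both routes agree; the iterative view matters once closed-form solves become impractical.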
# --- Compute the MSE loss landscape over a grid of (theta_0, theta_1) ---
theta0_range = np.linspace(-1, 5, 100)
theta1_range = np.linspace(-1, 4, 100)
T0, T1 = np.meshgrid(theta0_range, theta1_range)
Loss = np.zeros_like(T0)
for i in range(T0.shape[0]):
for j in range(T0.shape[1]):
theta_ij = np.array([T0[i, j], T1[i, j]])
y_pred = X @ theta_ij
Loss[i, j] = mse_loss(y, y_pred)
# Visualize as contour plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Contour plot
cs = axes[0].contourf(T0, T1, Loss, levels=50, cmap='viridis')
axes[0].contour(T0, T1, Loss, levels=20, colors='white', linewidths=0.5, alpha=0.4)
axes[0].plot(true_theta_0, true_theta_1, 'r*', markersize=15, label='True parameters')
axes[0].set_xlabel(r'$\theta_0$ (intercept)')
axes[0].set_ylabel(r'$\theta_1$ (slope)')
axes[0].set_title('MSE Loss Landscape')
axes[0].legend()
plt.colorbar(cs, ax=axes[0], label='MSE')
# 3D surface
ax3d = fig.add_subplot(122, projection='3d')
ax3d.plot_surface(T0, T1, Loss, cmap='viridis', alpha=0.8, edgecolor='none')
ax3d.scatter([true_theta_0], [true_theta_1],
[mse_loss(y, X @ true_theta)], color='red', s=100, zorder=5)
ax3d.set_xlabel(r'$\theta_0$'); ax3d.set_ylabel(r'$\theta_1$'); ax3d.set_zlabel('MSE')
ax3d.set_title('MSE Surface')
ax3d.view_init(elev=25, azim=-60)
# Remove the flat 2D axes[1] since we replaced it with 3D
axes[1].remove()
plt.tight_layout()
plt.show()
# Find minimum on grid
min_idx = np.unravel_index(np.argmin(Loss), Loss.shape)
print(f'Grid-search minimum at theta_0 = {T0[min_idx]:.2f}, theta_1 = {T1[min_idx]:.2f}')
print(f'Minimum MSE on grid = {Loss[min_idx]:.4f}')
3. Parameter Estimation via Maximum Likelihood (MML 9.2)
Under the Gaussian noise model \(p(y_n \mid \mathbf{x}_n, \boldsymbol{\theta}) = \mathcal{N}(y_n \mid \mathbf{x}_n^\top \boldsymbol{\theta}, \sigma^2)\), the negative log-likelihood is (MML Eq. 9.10b):

$$-\log p(\mathcal{Y} \mid \mathcal{X}, \boldsymbol{\theta}) = \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 + \frac{N}{2}\log(2\pi\sigma^2)$$

Setting the gradient with respect to \(\boldsymbol{\theta}\) to zero (MML Eq. 9.11c–9.12c):

$$-\frac{1}{\sigma^2}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{0}$$

yields the normal equations and the MLE solution:

$$\mathbf{X}^\top\mathbf{X}\,\boldsymbol{\theta}_{\text{ML}} = \mathbf{X}^\top\mathbf{y} \quad\Longrightarrow\quad \boldsymbol{\theta}_{\text{ML}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$

The noise variance can also be estimated via MLE (MML Eq. 9.22):

$$\sigma^2_{\text{ML}} = \frac{1}{N}\sum_{n=1}^N (y_n - \mathbf{x}_n^\top\boldsymbol{\theta}_{\text{ML}})^2$$
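A practical caveat, separate from the lab's derivation: forming \(\mathbf{X}^\top\mathbf{X}\) squares the condition number of \(\mathbf{X}\), so for ill-conditioned feature matrices an SVD-based solver such as `np.linalg.lstsq` is preferred. A minimal sketch (the data here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 50)
X = np.column_stack([np.ones(50), x])
y = 2.0 + 1.5 * x + rng.normal(0, 0.8, 50)

# Normal equations (fast, but squares the condition number of X)
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)
# SVD-based least squares (more robust when X is ill-conditioned)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))
```

On well-conditioned problems like this one the two solutions coincide; the difference shows up for high-degree polynomial features like those in the next section.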
# --- Closed-form MLE via normal equations ---
def mle_linear_regression(X, y):
"""Compute MLE parameters: theta_ML = (X^T X)^{-1} X^T y (MML 9.12c)."""
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
return theta_ml
def estimate_noise_variance(X, y, theta):
"""MLE noise variance estimate (MML 9.22)."""
residuals = y - X @ theta
return np.mean(residuals ** 2)
theta_ml = mle_linear_regression(X, y)
sigma2_ml = estimate_noise_variance(X, y, theta_ml)
print(f'MLE parameters: theta_0 = {theta_ml[0]:.4f}, theta_1 = {theta_ml[1]:.4f}')
print(f'True parameters: theta_0 = {true_theta_0}, theta_1 = {true_theta_1}')
print(f'MLE noise variance: sigma^2_ML = {sigma2_ml:.4f} (true: {sigma**2:.4f})')
# Visualize MLE fit
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: MLE fit on data
y_ml = theta_ml[0] + theta_ml[1] * x_plot
y_true_plot = true_theta_0 + true_theta_1 * x_plot
axes[0].scatter(x_raw, y, alpha=0.5, label='Data', zorder=3)
axes[0].plot(x_plot, y_true_plot, 'k--', lw=2, label='True function')
axes[0].plot(x_plot, y_ml, 'r-', lw=2, label=f'MLE: $y = {theta_ml[0]:.2f} + {theta_ml[1]:.2f}x$')
axes[0].set_xlabel('$x$'); axes[0].set_ylabel('$y$')
axes[0].set_title('Maximum Likelihood Estimate')
axes[0].legend(); axes[0].grid(True, alpha=0.3)
# Right: MLE on the contour plot
cs = axes[1].contourf(T0, T1, Loss, levels=50, cmap='viridis')
axes[1].plot(true_theta_0, true_theta_1, 'r*', markersize=15, label='True parameters')
axes[1].plot(theta_ml[0], theta_ml[1], 'wo', markersize=10,
markeredgecolor='white', markeredgewidth=2, label='MLE estimate')
axes[1].set_xlabel(r'$\theta_0$'); axes[1].set_ylabel(r'$\theta_1$')
axes[1].set_title('MLE on Loss Landscape')
axes[1].legend()
plt.colorbar(cs, ax=axes[1], label='MSE')
plt.tight_layout()
plt.show()
4. Overfitting in Linear Regression (MML 9.2.2)
"Linear regression" refers to models that are linear in the parameters; the inputs can undergo any nonlinear transformation. Using a polynomial feature map (MML Eq. 9.14):

$$\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^M]^\top \in \mathbb{R}^{M+1}$$

we can model polynomials of degree \(M\) within the linear regression framework. The feature matrix \(\boldsymbol{\Phi} \in \mathbb{R}^{N \times (M+1)}\) (MML Eq. 9.16) replaces \(\mathbf{X}\) in the normal equations:

$$\boldsymbol{\theta}_{\text{ML}} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}$$

Overfitting occurs when the model is too flexible (high \(M\)): it fits the noise in the training data rather than the underlying signal. The training error decreases monotonically with \(M\), but the test error follows a U-shaped curve (see MML Fig. 9.6).
We evaluate generalization using the Root Mean Square Error (MML Eq. 9.23):

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^N (y_n - \hat{y}_n)^2}$$
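A standard remedy for overfitting is regularized least squares (ridge regression, cf. MML 9.2.3): adding a penalty \(\lambda\|\boldsymbol{\theta}\|^2\) to the loss yields the closed form \(\boldsymbol{\theta} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \lambda\mathbf{I})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}\). A minimal sketch, where the helper names, degree, and \(\lambda\) are illustrative assumptions:

```python
import numpy as np

def polynomial_features(x, degree):
    # [1, x, x^2, ..., x^M] feature matrix
    return np.column_stack([x ** k for k in range(degree + 1)])

def fit_ridge(x, y, degree, lam=1.0):
    """Regularized least squares: (Phi^T Phi + lam I)^{-1} Phi^T y."""
    Phi = polynomial_features(x, degree)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-5, 5, 10))
y = -np.sin(x / 5) + np.cos(x) + rng.normal(0, 0.2, 10)

theta_ridge = fit_ridge(x, y, degree=6, lam=1.0)
theta_mle = np.linalg.lstsq(polynomial_features(x, 6), y, rcond=None)[0]
# The penalty shrinks the weights relative to the unregularized fit
print(np.linalg.norm(theta_ridge), np.linalg.norm(theta_mle))
```

The shrinkage of the weight norm is the mechanism that tames the wild oscillations of high-degree fits; the same penalty reappears as a Gaussian prior in the Bayesian treatment of Section 5.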
# --- Polynomial regression: generate data from a known nonlinear function ---
np.random.seed(42)
N_poly = 10
x_poly = np.sort(np.random.uniform(-5, 5, N_poly))
y_poly = -np.sin(x_poly / 5) + np.cos(x_poly) + np.random.normal(0, 0.2, N_poly)
# Test set (dense grid)
N_test = 200
x_test = np.linspace(-5, 5, N_test)
y_test = -np.sin(x_test / 5) + np.cos(x_test) + np.random.normal(0, 0.2, N_test)
def polynomial_features(x, degree):
"""Build feature matrix Phi (MML Eq. 9.17): [1, x, x^2, ..., x^M]."""
return np.column_stack([x ** k for k in range(degree + 1)])
def fit_polynomial(x_train, y_train, degree):
"""Fit polynomial of given degree via MLE (MML 9.19)."""
Phi = polynomial_features(x_train, degree)
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_train)
return theta
def rmse(y_true, y_pred):
"""Root Mean Square Error (MML Eq. 9.23)."""
return np.sqrt(np.mean((y_true - y_pred) ** 2))
# Fit polynomials of different degrees and visualize
degrees_to_show = [0, 1, 3, 4, 6, 9]
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
x_dense = np.linspace(-5, 5, 300)
for ax, M in zip(axes.flat, degrees_to_show):
theta_m = fit_polynomial(x_poly, y_poly, M)
Phi_dense = polynomial_features(x_dense, M)
y_fit = Phi_dense @ theta_m
train_rmse = rmse(y_poly, polynomial_features(x_poly, M) @ theta_m)
ax.scatter(x_poly, y_poly, c='blue', marker='+', s=100, linewidths=2,
label='Training data', zorder=5)
ax.plot(x_dense, y_fit, 'b-', lw=2, label='MLE fit')
ax.set_ylim(-5, 5)
ax.set_title(f'$M = {M}$ (RMSE = {train_rmse:.3f})')
ax.set_xlabel('$x$'); ax.set_ylabel('$y$')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)
plt.suptitle('Polynomial Fits of Different Degrees (cf. MML Fig. 9.5)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
# --- Training error vs test error as function of polynomial degree (MML Fig. 9.6) ---
max_degree = 9
degrees = np.arange(0, max_degree + 1)
train_errors = []
test_errors = []
for M in degrees:
theta_m = fit_polynomial(x_poly, y_poly, M)
# Training RMSE
Phi_train = polynomial_features(x_poly, M)
train_errors.append(rmse(y_poly, Phi_train @ theta_m))
# Test RMSE
Phi_test = polynomial_features(x_test, M)
test_errors.append(rmse(y_test, Phi_test @ theta_m))
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(degrees, train_errors, 'b-o', lw=2, label='Training error (RMSE)')
ax.plot(degrees, test_errors, 'r-o', lw=2, label='Test error (RMSE)')
ax.set_xlabel('Degree of polynomial $M$')
ax.set_ylabel('RMSE')
ax.set_title('Training vs Test Error (cf. MML Fig. 9.6)')
ax.set_ylim(0, min(10, max(test_errors) * 1.2))
ax.set_xticks(degrees)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
best_degree = degrees[np.argmin(test_errors)]
ax.axvline(best_degree, color='green', linestyle='--', alpha=0.7,
label=f'Best test error at M={best_degree}')
ax.legend(fontsize=12)
plt.tight_layout()
plt.show()
print(f'Best polynomial degree (by test RMSE): M = {best_degree}')
print(f'Training RMSE at M={best_degree}: {train_errors[best_degree]:.4f}')
print(f'Test RMSE at M={best_degree}: {test_errors[best_degree]:.4f}')
# --- Bias-Variance Tradeoff Visualization ---
# Simulate many datasets to decompose expected test error into bias^2 + variance
np.random.seed(0)
n_simulations = 200
n_train = 10
x_eval = np.linspace(-4, 4, 50) # fixed evaluation points
f_true_eval = -np.sin(x_eval / 5) + np.cos(x_eval) # true function at eval points
degrees_bv = [1, 3, 5, 9]
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
bias_sq_list = []
variance_list = []
for ax, M in zip(axes, degrees_bv):
predictions = np.zeros((n_simulations, len(x_eval)))
for sim in range(n_simulations):
x_sim = np.random.uniform(-5, 5, n_train)
y_sim = -np.sin(x_sim / 5) + np.cos(x_sim) + np.random.normal(0, 0.2, n_train)
theta_sim = fit_polynomial(x_sim, y_sim, M)
Phi_eval = polynomial_features(x_eval, M)
predictions[sim, :] = Phi_eval @ theta_sim
mean_pred = np.mean(predictions, axis=0)
bias_sq = np.mean((mean_pred - f_true_eval) ** 2)
variance = np.mean(np.var(predictions, axis=0))
bias_sq_list.append(bias_sq)
variance_list.append(variance)
# Plot some sample fits
for sim in range(min(20, n_simulations)):
ax.plot(x_eval, predictions[sim, :], 'r-', alpha=0.1, lw=0.5)
ax.plot(x_eval, mean_pred, 'b-', lw=2, label='Mean prediction')
ax.plot(x_eval, f_true_eval, 'k--', lw=2, label='True function')
ax.set_title(f'M={M}\nBias$^2$={bias_sq:.3f}, Var={variance:.3f}')
ax.set_ylim(-3, 3)
ax.set_xlabel('$x$')
ax.legend(fontsize=7)
ax.grid(True, alpha=0.3)
plt.suptitle('Bias-Variance Tradeoff: Individual Fits (red), Mean (blue), Truth (black)',
fontsize=13, y=1.04)
plt.tight_layout()
plt.show()
print('Bias-Variance decomposition across polynomial degrees:')
for M, b2, v in zip(degrees_bv, bias_sq_list, variance_list):
print(f' M={M}: Bias^2 = {b2:.4f}, Variance = {v:.4f}, Total = {b2+v:.4f}')
5. Bayesian Linear Regression (MML 9.3)
Instead of a point estimate \(\boldsymbol{\theta}_{\text{ML}}\), Bayesian linear regression maintains a distribution over parameters.
Model (MML Eq. 9.35):
Prior: \(p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} \mid \mathbf{m}_0, \mathbf{S}_0)\)
Likelihood: \(p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \boldsymbol{\phi}^\top(\mathbf{x})\boldsymbol{\theta}, \sigma^2)\)
Posterior (MML Theorem 9.1, Eq. 9.43):

$$p(\boldsymbol{\theta} \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\boldsymbol{\theta} \mid \mathbf{m}_N, \mathbf{S}_N)$$

where:

$$\mathbf{S}_N = (\mathbf{S}_0^{-1} + \sigma^{-2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \tag{MML 9.43b}$$

$$\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \sigma^{-2} \boldsymbol{\Phi}^\top \mathbf{y}) \tag{MML 9.43c}$$

Posterior predictive distribution (MML Eq. 9.57c):

$$p(y_* \mid \mathcal{X}, \mathcal{Y}, \mathbf{x}_*) = \mathcal{N}\big(y_* \mid \boldsymbol{\phi}^\top(\mathbf{x}_*)\,\mathbf{m}_N,\ \boldsymbol{\phi}^\top(\mathbf{x}_*)\,\mathbf{S}_N\,\boldsymbol{\phi}(\mathbf{x}_*) + \sigma^2\big)$$
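Because the Gaussian prior is conjugate, the update in Eq. 9.43 can be applied sequentially: conditioning on half of the data and then on the rest yields exactly the same posterior as a single batch update. A self-contained sketch (function and variable names are illustrative):

```python
import numpy as np

def posterior(m0, S0, Phi, y, sigma2):
    """Conjugate Gaussian update (MML Eq. 9.43b-c): returns (m_N, S_N)."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ y / sigma2)
    return m_N, S_N

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 20)
Phi = np.column_stack([np.ones(20), x])
y = 2.0 + 1.5 * x + rng.normal(0, 0.7, 20)

m0, S0, sigma2 = np.zeros(2), 2.0 * np.eye(2), 0.49
# One batch update on all 20 points...
m_batch, S_batch = posterior(m0, S0, Phi, y, sigma2)
# ...equals two sequential updates on 10 points each
m_half, S_half = posterior(m0, S0, Phi[:10], y[:10], sigma2)
m_seq, S_seq = posterior(m_half, S_half, Phi[10:], y[10:], sigma2)

print(np.allclose(m_batch, m_seq), np.allclose(S_batch, S_seq))
```

This order-independence is what the sequential prior-to-posterior figure below exploits: each posterior is simply the prior for the next batch of observations.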
# --- Bayesian Linear Regression Implementation ---
class BayesianLinearRegression:
"""Bayesian linear regression with conjugate Gaussian prior (MML 9.3)."""
def __init__(self, m0, S0, sigma2):
"""
Args:
m0: prior mean (K,)
S0: prior covariance (K, K)
sigma2: known noise variance
"""
self.m = m0.copy()
self.S = S0.copy()
self.sigma2 = sigma2
def update(self, Phi, y):
"""Update posterior with new data (MML Eq. 9.43b-c)."""
S0_inv = np.linalg.inv(self.S)
S_N_inv = S0_inv + (1.0 / self.sigma2) * Phi.T @ Phi
self.S = np.linalg.inv(S_N_inv)
self.m = self.S @ (S0_inv @ self.m + (1.0 / self.sigma2) * Phi.T @ y)
def predict(self, Phi_star):
"""Posterior predictive distribution (MML Eq. 9.57c).
Returns mean and variance for each test point."""
mu = Phi_star @ self.m
# Variance for each test point: phi^T S_N phi + sigma^2
var = np.sum(Phi_star @ self.S * Phi_star, axis=1) + self.sigma2
return mu, var
print('BayesianLinearRegression class defined.')
# --- Visualize prior and posterior distributions over weights ---
# Use simple 1D linear model: y = theta_0 + theta_1 * x
np.random.seed(42)
sigma2_blr = 0.5
# Prior: p(theta) = N(0, alpha^2 * I)
alpha2 = 2.0
m0 = np.array([0.0, 0.0])
S0 = alpha2 * np.eye(2)
# Generate training data
N_blr = 20
x_blr = np.random.uniform(-3, 3, N_blr)
y_blr = true_theta_0 + true_theta_1 * x_blr + np.random.normal(0, np.sqrt(sigma2_blr), N_blr)
Phi_blr = np.column_stack([np.ones(N_blr), x_blr])
# Visualize sequential prior -> posterior update
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
data_sizes = [0, 1, 3, N_blr]
for col, n_data in enumerate(data_sizes):
blr = BayesianLinearRegression(m0.copy(), S0.copy(), sigma2_blr)
if n_data > 0:
blr.update(Phi_blr[:n_data], y_blr[:n_data])
# Top row: weight-space posterior (contour plot of 2D Gaussian)
t0_grid = np.linspace(-3, 6, 100)
t1_grid = np.linspace(-3, 5, 100)
T0g, T1g = np.meshgrid(t0_grid, t1_grid)
pos = np.dstack([T0g, T1g])
# Evaluate 2D Gaussian density
S_inv = np.linalg.inv(blr.S)
det_S = np.linalg.det(blr.S)
diff = pos - blr.m
exponent = -0.5 * np.einsum('...i,ij,...j', diff, S_inv, diff)
density = np.exp(exponent) / (2 * np.pi * np.sqrt(det_S))
axes[0, col].contourf(T0g, T1g, density, levels=30, cmap='Blues')
axes[0, col].plot(true_theta_0, true_theta_1, 'r*', markersize=12)
axes[0, col].plot(blr.m[0], blr.m[1], 'k+', markersize=12, markeredgewidth=2)
axes[0, col].set_xlabel(r'$\theta_0$'); axes[0, col].set_ylabel(r'$\theta_1$')
title = f'Prior' if n_data == 0 else f'Posterior (N={n_data})'
axes[0, col].set_title(title)
# Bottom row: function-space view with uncertainty bands
x_pred = np.linspace(-4, 4, 200)
Phi_pred = np.column_stack([np.ones(200), x_pred])
mu_pred, var_pred = blr.predict(Phi_pred)
std_pred = np.sqrt(var_pred)
axes[1, col].fill_between(x_pred, mu_pred - 2*std_pred, mu_pred + 2*std_pred,
alpha=0.15, color='blue', label='95% CI')
axes[1, col].fill_between(x_pred, mu_pred - std_pred, mu_pred + std_pred,
alpha=0.3, color='blue', label='68% CI')
axes[1, col].plot(x_pred, mu_pred, 'b-', lw=2, label='Mean')
axes[1, col].plot(x_pred, true_theta_0 + true_theta_1 * x_pred, 'k--', lw=1.5,
label='True')
if n_data > 0:
axes[1, col].scatter(x_blr[:n_data], y_blr[:n_data], c='red', s=40, zorder=5)
# Sample functions from the posterior
for _ in range(5):
theta_sample = np.random.multivariate_normal(blr.m, blr.S)
axes[1, col].plot(x_pred, Phi_pred @ theta_sample, 'r-', alpha=0.3, lw=0.8)
axes[1, col].set_ylim(-8, 12)
axes[1, col].set_xlabel('$x$'); axes[1, col].set_ylabel('$y$')
axes[1, col].set_title(f'Predictions (N={n_data})')
axes[1, col].legend(fontsize=7)
axes[1, col].grid(True, alpha=0.3)
plt.suptitle('Bayesian Linear Regression: Prior to Posterior Update (MML 9.3)',
fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print(f'Prior mean: {m0}')
print(f'Posterior mean (N={N_blr}): {blr.m}')
print(f'Posterior covariance diagonal: {np.diag(blr.S)}')
# --- BLR with polynomial features: compare MLE, MAP, and BLR (MML Fig. 9.10-9.11) ---
np.random.seed(42)
M_blr = 5 # polynomial degree
sigma2_poly = 0.04 # noise variance for polynomial example
# Use the polynomial training data from Section 4
Phi_train_poly = polynomial_features(x_poly, M_blr)
K = M_blr + 1 # number of parameters
# BLR prior: N(0, (1/4)*I) as in MML Example 9.7
m0_poly = np.zeros(K)
S0_poly = 0.25 * np.eye(K)
# Fit BLR
blr_poly = BayesianLinearRegression(m0_poly, S0_poly, sigma2_poly)
blr_poly.update(Phi_train_poly, y_poly)
# MLE
theta_mle_poly = fit_polynomial(x_poly, y_poly, M_blr)
# MAP estimate = posterior mean (equals regularized least squares, MML Eq. 9.31)
theta_map_poly = blr_poly.m
# Predictions
x_dense_blr = np.linspace(-5, 5, 300)
Phi_dense_blr = polynomial_features(x_dense_blr, M_blr)
mu_blr, var_blr = blr_poly.predict(Phi_dense_blr)
std_blr = np.sqrt(var_blr)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Panel (a): Training data
axes[0].scatter(x_poly, y_poly, c='blue', marker='+', s=100, linewidths=2, zorder=5)
axes[0].set_title('(a) Training Data')
axes[0].set_xlabel('$x$'); axes[0].set_ylabel('$y$')
axes[0].set_ylim(-5, 5); axes[0].grid(True, alpha=0.3)
# Panel (b): Posterior over functions with uncertainty bands
axes[1].fill_between(x_dense_blr, mu_blr - 2*std_blr, mu_blr + 2*std_blr,
alpha=0.15, color='gray', label='95% CI')
axes[1].fill_between(x_dense_blr, mu_blr - std_blr, mu_blr + std_blr,
alpha=0.3, color='gray', label='68% CI')
axes[1].scatter(x_poly, y_poly, c='blue', marker='+', s=100, linewidths=2, zorder=5)
axes[1].plot(x_dense_blr, Phi_dense_blr @ theta_mle_poly, 'b-', lw=2, label='MLE')
axes[1].plot(x_dense_blr, Phi_dense_blr @ theta_map_poly, 'orange', lw=2, label='MAP = BLR mean')
axes[1].set_title(f'(b) Posterior over Functions (M={M_blr})')
axes[1].set_xlabel('$x$'); axes[1].set_ylabel('$y$')
axes[1].set_ylim(-5, 5); axes[1].legend(fontsize=9); axes[1].grid(True, alpha=0.3)
# Panel (c): Samples from the posterior
for i in range(10):
theta_sample = np.random.multivariate_normal(blr_poly.m, blr_poly.S)
axes[2].plot(x_dense_blr, Phi_dense_blr @ theta_sample, 'r-', alpha=0.4, lw=1)
axes[2].scatter(x_poly, y_poly, c='blue', marker='+', s=100, linewidths=2, zorder=5)
axes[2].set_title('(c) Posterior Samples')
axes[2].set_xlabel('$x$'); axes[2].set_ylabel('$y$')
axes[2].set_ylim(-5, 5); axes[2].grid(True, alpha=0.3)
plt.suptitle('Bayesian Linear Regression with Polynomials (cf. MML Fig. 9.10)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print(f'MLE parameters: {theta_mle_poly}')
print(f'MAP parameters: {theta_map_poly}')
print(f'Max difference (MLE vs MAP): {np.max(np.abs(theta_mle_poly - theta_map_poly)):.4f}')
6. Maximum Likelihood as Orthogonal Projection (MML 9.4)
The MLE solution has an elegant geometric interpretation. For the model \(y = x\theta + \epsilon\), the maximum likelihood reconstruction of the training targets is (MML Eq. 9.67):

$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}_{\text{ML}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$

This is exactly the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\). The projection matrix is \(\mathbf{P} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) (MML Section 3.8).
In the general case with feature matrix \(\boldsymbol{\Phi}\) (MML Eq. 9.70):

$$\hat{\mathbf{y}} = \boldsymbol{\Phi}\boldsymbol{\theta}_{\text{ML}} = \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}$$
and \(\boldsymbol{\Phi}\boldsymbol{\theta}_{\text{ML}}\) is the projection of \(\mathbf{y}\) onto the \(K\)-dimensional subspace of \(\mathbb{R}^N\) spanned by the columns of \(\boldsymbol{\Phi}\).
# --- Geometric interpretation: MLE as orthogonal projection (MML 9.4, Fig. 9.12) ---
np.random.seed(42)
N_proj = 30
x_proj = np.random.uniform(-4, 4, N_proj)
y_proj = 0.5 * x_proj + np.random.normal(0, 0.5, N_proj)
# Design matrix (no intercept, to match MML 9.65 for cleaner projection visualization)
X_proj = x_proj.reshape(-1, 1) # N x 1
# MLE
theta_ml_proj = np.linalg.solve(X_proj.T @ X_proj, X_proj.T @ y_proj)
# Projections: y_hat = X * theta_ML
y_hat_proj = X_proj @ theta_ml_proj
# Projection matrix P = X(X^T X)^{-1} X^T
P = X_proj @ np.linalg.inv(X_proj.T @ X_proj) @ X_proj.T
y_hat_via_P = P @ y_proj
# Verify P is idempotent and symmetric
print(f'theta_ML = {theta_ml_proj[0]:.4f}')
print(f'Projection matrix P shape: {P.shape}')
print(f'P is idempotent (P^2 = P): {np.allclose(P @ P, P)}')
print(f'P is symmetric (P = P^T): {np.allclose(P, P.T)}')
print(f'Projection via theta_ML equals P @ y: {np.allclose(y_hat_proj, y_hat_via_P)}')
# Residual is orthogonal to column space of X
residual = y_proj - y_hat_proj
print(f'Residual orthogonal to X: X^T (y - X theta) = {X_proj.T @ residual}')
# Visualize (cf. MML Fig. 9.12)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Left: dataset
axes[0].scatter(x_proj, y_proj, c='blue', alpha=0.6, label='Observations $y_n$', zorder=3)
axes[0].set_xlabel('$x$'); axes[0].set_ylabel('$y$')
axes[0].set_title('(a) Regression Dataset')
axes[0].grid(True, alpha=0.3); axes[0].legend()
# Right: show projection
x_line = np.linspace(-4.5, 4.5, 200)
axes[1].scatter(x_proj, y_proj, c='blue', alpha=0.6, s=40, label='Observations', zorder=3)
axes[1].scatter(x_proj, y_hat_proj, c='orange', alpha=0.8, s=40,
label='Projections', zorder=4)
axes[1].plot(x_line, theta_ml_proj[0] * x_line, 'k-', lw=2, label='MLE line')
# Draw projection lines
for i in range(N_proj):
axes[1].plot([x_proj[i], x_proj[i]], [y_proj[i], y_hat_proj[i]],
'orange', alpha=0.4, lw=1)
axes[1].set_xlabel('$x$'); axes[1].set_ylabel('$y$')
axes[1].set_title('(b) MLE as Orthogonal Projection (MML Fig. 9.12)')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# --- Projection in higher dimensions (3D visualization) ---
# With 3 data points and 2 features, we can show the geometry in 3D
np.random.seed(42)
# 3 data points, 2D feature space (intercept + slope)
x_3d = np.array([-1.0, 0.5, 2.0])
y_3d = np.array([0.5, 1.5, 3.5])
Phi_3d = np.column_stack([np.ones(3), x_3d]) # 3 x 2
# MLE
theta_3d = np.linalg.solve(Phi_3d.T @ Phi_3d, Phi_3d.T @ y_3d)
y_hat_3d = Phi_3d @ theta_3d
resid_3d = y_3d - y_hat_3d
# Column vectors of Phi
col1 = Phi_3d[:, 0] # [1, 1, 1] -- intercept basis
col2 = Phi_3d[:, 1] # [-1, 0.5, 2] -- slope basis
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
# Plot the column space (a plane through origin spanned by col1, col2)
s_range = np.linspace(-2, 4, 20)
t_range = np.linspace(-1, 2, 20)
SS, TT = np.meshgrid(s_range, t_range)
plane_x = SS * col1[0] + TT * col2[0]
plane_y = SS * col1[1] + TT * col2[1]
plane_z = SS * col1[2] + TT * col2[2]
ax.plot_surface(plane_x, plane_y, plane_z, alpha=0.15, color='blue')
# Plot y, y_hat, and the residual vector
ax.scatter(*y_3d, c='blue', s=100, label=r'$\mathbf{y}$', zorder=5)
ax.scatter(*y_hat_3d, c='orange', s=100, label=r'$\hat{\mathbf{y}} = \Phi\theta_{ML}$', zorder=5)
# Residual (y - y_hat) segment
ax.plot([y_3d[0], y_hat_3d[0]], [y_3d[1], y_hat_3d[1]], [y_3d[2], y_hat_3d[2]],
'r-', lw=2, label=r'Residual $\mathbf{y} - \hat{\mathbf{y}}$')
# Column vectors from origin (raw strings avoid invalid escape sequences in LaTeX labels)
origin = np.zeros(3)
ax.quiver(*origin, *col1, color='green', arrow_length_ratio=0.1, lw=2,
label=r'Column 1 ($\mathbf{1}$)')
ax.quiver(*origin, *col2, color='purple', arrow_length_ratio=0.1, lw=2,
label=r'Column 2 ($\mathbf{x}$)')
ax.set_xlabel('Dim 1'); ax.set_ylabel('Dim 2'); ax.set_zlabel('Dim 3')
ax.set_title(r'MLE as Orthogonal Projection onto Column Space of $\Phi$')
ax.legend(fontsize=9, loc='upper left')
ax.view_init(elev=20, azim=135)
plt.tight_layout()
plt.show()
print(f'y = {y_3d}')
print(f'y_hat = {y_hat_3d}')
print(f'residual = {resid_3d}')
print(f'Residual dot col1: {resid_3d @ col1:.6f} (should be ~0)')
print(f'Residual dot col2: {resid_3d @ col2:.6f} (should be ~0)')
7. Model Selection (MML 8.6)
We need to choose the right model complexity (e.g., polynomial degree). Three approaches from MML:
7.1 Cross-Validation (MML 8.2.4)
\(K\)-fold cross-validation partitions the data into \(K\) chunks. For each fold \(k\), we train on \(K-1\) chunks and evaluate on the held-out chunk \(\mathcal{V}^{(k)}\). The estimated generalization error is the average validation risk (MML Eq. 8.13):

$$\mathbb{E}_V[R(f, \mathcal{V})] \approx \frac{1}{K}\sum_{k=1}^{K} R(f^{(k)}, \mathcal{V}^{(k)})$$
7.2 Information Criteria
AIC (Akaike Information Criterion, MML Eq. 8.48): \(\text{AIC} = \log p(\mathbf{x} \mid \boldsymbol{\theta}) - M\)
We write this equivalently as \(\text{AIC} = 2M - 2\log \hat{L}\), where \(\hat{L}\) is the maximized likelihood and \(M\) is the number of parameters. Lower is better.
BIC (Bayesian Information Criterion, MML Eq. 8.49): \(\text{BIC} \approx \log p(\mathbf{x} \mid \boldsymbol{\theta}) - \frac{1}{2}M \log N\)
Equivalently, \(\text{BIC} = M \log N - 2\log \hat{L}\). BIC penalizes complexity more heavily than AIC whenever \(\log N > 2\), i.e. for \(N \geq 8\).
# --- K-fold cross-validation for polynomial model selection ---
def k_fold_cv(x, y, degree, K=5):
"""K-fold cross-validation for polynomial regression.
Returns mean and std of RMSE across folds."""
N_data = len(x)
indices = np.arange(N_data)
np.random.shuffle(indices)
fold_size = N_data // K
rmse_scores = []
for k in range(K):
val_idx = indices[k * fold_size:(k + 1) * fold_size]
train_idx = np.concatenate([indices[:k * fold_size], indices[(k + 1) * fold_size:]])
x_train_k, y_train_k = x[train_idx], y[train_idx]
x_val_k, y_val_k = x[val_idx], y[val_idx]
theta_k = fit_polynomial(x_train_k, y_train_k, degree)
Phi_val_k = polynomial_features(x_val_k, degree)
y_pred_k = Phi_val_k @ theta_k
rmse_scores.append(rmse(y_val_k, y_pred_k))
return np.mean(rmse_scores), np.std(rmse_scores)
# Use a larger dataset for meaningful cross-validation
np.random.seed(42)
N_cv = 50
x_cv = np.random.uniform(-5, 5, N_cv)
y_cv = -np.sin(x_cv / 5) + np.cos(x_cv) + np.random.normal(0, 0.2, N_cv)
max_degree_cv = 10
degrees_cv = np.arange(0, max_degree_cv + 1)
cv_means = []
cv_stds = []
for M in degrees_cv:
mean_rmse, std_rmse = k_fold_cv(x_cv, y_cv, M, K=5)
cv_means.append(mean_rmse)
cv_stds.append(std_rmse)
cv_means = np.array(cv_means)
cv_stds = np.array(cv_stds)
fig, ax = plt.subplots(figsize=(10, 6))
ax.errorbar(degrees_cv, cv_means, yerr=cv_stds, fmt='o-', capsize=5, lw=2,
label='5-fold CV RMSE', color='blue')
ax.set_xlabel('Degree of polynomial $M$')
ax.set_ylabel('RMSE')
ax.set_title('5-Fold Cross-Validation for Model Selection (MML 8.2.4)')
ax.set_xticks(degrees_cv)
best_cv = degrees_cv[np.argmin(cv_means)]
ax.axvline(best_cv, color='green', linestyle='--', alpha=0.7,
label=f'Best CV: M={best_cv}')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
ax.set_ylim(0, min(5, cv_means.max() * 1.5))
plt.tight_layout()
plt.show()
print(f'Best polynomial degree by 5-fold CV: M = {best_cv}')
print(f'CV RMSE at best degree: {cv_means[best_cv]:.4f} +/- {cv_stds[best_cv]:.4f}')
# --- AIC and BIC for model comparison (MML Eq. 8.48, 8.49) ---
def compute_aic_bic(x, y, degree):
"""Compute AIC and BIC for a polynomial model of given degree.
AIC = 2M - 2 ln(L_hat) [lower is better]
BIC = M ln(N) - 2 ln(L_hat) [lower is better]
where ln(L_hat) is the maximized log-likelihood.
"""
N_data = len(y)
M_params = degree + 1 # number of parameters
theta = fit_polynomial(x, y, degree)
Phi = polynomial_features(x, degree)
residuals = y - Phi @ theta
# MLE noise variance
sigma2_hat = np.mean(residuals ** 2)
if sigma2_hat < 1e-15:
sigma2_hat = 1e-15 # prevent log(0)
# Log-likelihood: sum of log N(y_n | phi^T theta, sigma^2)
log_likelihood = -0.5 * N_data * np.log(2 * np.pi * sigma2_hat) - 0.5 * N_data
aic = 2 * M_params - 2 * log_likelihood
bic = M_params * np.log(N_data) - 2 * log_likelihood
return aic, bic, log_likelihood
aic_scores = []
bic_scores = []
log_likelihoods = []
for M in degrees_cv:
aic, bic, ll = compute_aic_bic(x_cv, y_cv, M)
aic_scores.append(aic)
bic_scores.append(bic)
log_likelihoods.append(ll)
aic_scores = np.array(aic_scores)
bic_scores = np.array(bic_scores)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: AIC and BIC
axes[0].plot(degrees_cv, aic_scores, 'b-o', lw=2, label='AIC')
axes[0].plot(degrees_cv, bic_scores, 'r-s', lw=2, label='BIC')
axes[0].axvline(degrees_cv[np.argmin(aic_scores)], color='blue', linestyle='--', alpha=0.5)
axes[0].axvline(degrees_cv[np.argmin(bic_scores)], color='red', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Degree of polynomial $M$')
axes[0].set_ylabel('Information Criterion (lower is better)')
axes[0].set_title('AIC and BIC for Model Selection (MML 8.6)')
axes[0].set_xticks(degrees_cv)
axes[0].legend(fontsize=12)
axes[0].grid(True, alpha=0.3)
# Right: compare all three methods
# Normalize each to [0, 1] for comparison
def normalize(arr):
return (arr - arr.min()) / (arr.max() - arr.min() + 1e-10)
axes[1].plot(degrees_cv, normalize(cv_means), 'g-o', lw=2, label=f'CV (best: M={best_cv})')
axes[1].plot(degrees_cv, normalize(aic_scores), 'b-s', lw=2,
label=f'AIC (best: M={degrees_cv[np.argmin(aic_scores)]})')
axes[1].plot(degrees_cv, normalize(bic_scores), 'r-^', lw=2,
label=f'BIC (best: M={degrees_cv[np.argmin(bic_scores)]})')
axes[1].set_xlabel('Degree of polynomial $M$')
axes[1].set_ylabel('Normalized score (lower is better)')
axes[1].set_title('Comparison of Model Selection Criteria')
axes[1].set_xticks(degrees_cv)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f'Best degree by AIC: M = {degrees_cv[np.argmin(aic_scores)]}')
print(f'Best degree by BIC: M = {degrees_cv[np.argmin(bic_scores)]}')
print(f'Best degree by CV: M = {best_cv}')
print(f'\nNote: BIC penalizes complexity more heavily than AIC (MML Eq. 8.49).')
# --- Final comparison: fits at selected degrees ---
best_degrees = {
'CV': best_cv,
'AIC': degrees_cv[np.argmin(aic_scores)],
'BIC': degrees_cv[np.argmin(bic_scores)],
}
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
x_dense_cv = np.linspace(-5, 5, 300)
y_true_cv = -np.sin(x_dense_cv / 5) + np.cos(x_dense_cv)
for ax, (method, M) in zip(axes, best_degrees.items()):
theta_best = fit_polynomial(x_cv, y_cv, M)
Phi_dense_cv = polynomial_features(x_dense_cv, M)
y_fit_cv = Phi_dense_cv @ theta_best
ax.scatter(x_cv, y_cv, alpha=0.4, s=20, zorder=3, label='Training data')
ax.plot(x_dense_cv, y_true_cv, 'k--', lw=1.5, label='True function')
ax.plot(x_dense_cv, y_fit_cv, 'r-', lw=2, label=f'MLE fit (M={M})')
ax.set_title(f'{method}: Best M = {M}')
ax.set_xlabel('$x$'); ax.set_ylabel('$y$')
ax.set_ylim(-3, 3)
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
plt.suptitle('Polynomial Fits Selected by Different Criteria', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
Summary
In this lab we implemented the core ideas of linear regression from MML Chapters 8–9:
| Topic | MML Reference | Key Result |
|---|---|---|
| Data, Models, Learning | Ch 8.1 | Data as vectors, linear predictors \(f(\mathbf{x}) = \boldsymbol{\theta}^\top\mathbf{x}\), squared loss |
| Empirical Risk Minimization | Ch 8.2 | Minimize average loss \(\frac{1}{N}\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2\) |
| MLE via Normal Equations | Ch 9.2 | \(\boldsymbol{\theta}_{\text{ML}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) |
| Polynomial Features | Ch 9.2 | Feature map \(\boldsymbol{\phi}(x) = [1, x, \ldots, x^M]^\top\) |
| Overfitting | Ch 9.2.2 | Training error decreases with \(M\); test error has U-shape |
| Bias-Variance Tradeoff | Ch 8.2.3 | Simple models: high bias; complex models: high variance |
| Bayesian Linear Regression | Ch 9.3 | Posterior: \(\mathbf{S}_N = (\mathbf{S}_0^{-1} + \sigma^{-2}\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\), \(\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \sigma^{-2}\boldsymbol{\Phi}^\top\mathbf{y})\) |
| MLE as Projection | Ch 9.4 | \(\boldsymbol{\Phi}\boldsymbol{\theta}_{\text{ML}}\) is the orthogonal projection of \(\mathbf{y}\) onto the column space of \(\boldsymbol{\Phi}\) |
| Model Selection | Ch 8.6 | Cross-validation, AIC, BIC |
Key takeaways:
MLE for linear regression has a closed-form solution via the normal equations.
The MLE geometrically computes the orthogonal projection of \(\mathbf{y}\) onto the column space of the feature matrix.
Overfitting is a fundamental challenge: more complex models fit training data better but generalize worse.
Bayesian linear regression addresses overfitting by maintaining a full posterior distribution over parameters, providing both predictions and calibrated uncertainty estimates.
Model selection via cross-validation, AIC, or BIC helps choose the right model complexity.