Lab 10: Classification with Support Vector Machines
Based on Chapter 12 of "Mathematics for Machine Learning" by Deisenroth, Faisal, and Ong
In this lab we implement Support Vector Machines (SVMs) from scratch using only NumPy and Matplotlib. We follow the MML book's treatment closely:
12.1 Separating Hyperplanes – The function \(f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b\) defines a hyperplane \(\{\mathbf{x} : f(\mathbf{x})=0\}\) that partitions \(\mathbb{R}^D\) into positive and negative half-spaces.
12.2 Primal SVM – Hard margin: minimize \(\frac{1}{2}\|\mathbf{w}\|^2\) subject to \(y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \geq 1\). Soft margin with hinge loss: \(\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \max\{0, 1 - y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b)\}\) (Eq. 12.31).
12.3 Dual SVM – Via Lagrange multipliers we obtain the dual: \(\min_\alpha \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \sum_i \alpha_i\), where \(\mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n\) (representer theorem, Eq. 12.38). Points with \(\alpha_n > 0\) are support vectors.
12.4 Kernels – Replace inner products \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) with \(k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_\mathcal{H}\) to get non-linear decision boundaries without explicitly computing \(\phi\).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
np.random.seed(42)
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 12
print("Libraries loaded. NumPy version:", np.__version__)
1. Separating Hyperplanes (MML 12.1)
A hyperplane in \(\mathbb{R}^D\) is defined by \(f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b = 0\) (Eq. 12.3), where \(\mathbf{w}\) is the normal vector to the hyperplane and \(b\) is the bias/intercept.
For binary classification with labels \(y_n \in \{+1, -1\}\), the classification rule is:
Predict \(+1\) if \(f(\mathbf{x}_\text{test}) \geq 0\)
Predict \(-1\) otherwise
The constraint that all examples are correctly classified is captured by \(y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \geq 0\) (Eq. 12.7).
Key insight: There are infinitely many separating hyperplanes for linearly separable data (Figure 12.3 in MML). The SVM chooses the one with the maximum margin.
# Generate linearly separable 2D data
np.random.seed(42)
n_per_class = 30
# Positive class (y = +1): centred around (2, 2)
X_pos = np.random.randn(n_per_class, 2) * 0.8 + np.array([2, 2])
# Negative class (y = -1): centred around (-1, -1)
X_neg = np.random.randn(n_per_class, 2) * 0.8 + np.array([-1, -1])
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
# Plot data with multiple candidate separating lines (as in Figure 12.3)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: just the data
ax = axes[0]
ax.scatter(X_pos[:, 0], X_pos[:, 1], c='steelblue', marker='o', s=50, label='$y=+1$', edgecolors='k', linewidth=0.5)
ax.scatter(X_neg[:, 0], X_neg[:, 1], c='darkorange', marker='x', s=50, label='$y=-1$', linewidths=1.5)
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
ax.set_title('Linearly Separable Data')
ax.legend()
# Right: multiple separating hyperplanes
ax = axes[1]
ax.scatter(X_pos[:, 0], X_pos[:, 1], c='steelblue', marker='o', s=50, edgecolors='k', linewidth=0.5)
ax.scatter(X_neg[:, 0], X_neg[:, 1], c='darkorange', marker='x', s=50, linewidths=1.5)
# Draw several valid separating lines
x_line = np.linspace(-3, 4, 100)
separators = [
(1.0, 1.0, 0.0), # w1=1, w2=1, b=0
(0.7, 1.3, -0.2), # tilted
(1.5, 0.5, 0.5), # another tilt
(1.0, 0.8, -0.3), # yet another
]
colors_lines = ['green', 'red', 'purple', 'brown']
for (w1, w2, b), c in zip(separators, colors_lines):
# w1*x1 + w2*x2 + b = 0 => x2 = -(w1*x1 + b)/w2
y_line = -(w1 * x_line + b) / w2
mask = (y_line > -3.5) & (y_line < 5)
ax.plot(x_line[mask], y_line[mask], c=c, alpha=0.7, linewidth=1.5)
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
ax.set_title('Many Valid Separating Hyperplanes')
ax.set_xlim(-3.5, 4.5)
ax.set_ylim(-3.5, 5)
plt.tight_layout()
plt.show()
print("All four lines correctly separate the two classes, but which one is best?")
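The printed claim can be verified with the separation condition of Eq. 12.7: a line \((\mathbf{w}, b)\) separates the data exactly when \(y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) > 0\) for all \(n\). A minimal standalone check (regenerating the same data with the same seed as above):

```python
import numpy as np

# Regenerate the lab's linearly separable data (same seed and parameters)
np.random.seed(42)
n_per_class = 30
X_pos = np.random.randn(n_per_class, 2) * 0.8 + np.array([2, 2])
X_neg = np.random.randn(n_per_class, 2) * 0.8 + np.array([-1, -1])
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])

# The four candidate hyperplanes (w1, w2, b) plotted above
separators = [(1.0, 1.0, 0.0), (0.7, 1.3, -0.2), (1.5, 0.5, 0.5), (1.0, 0.8, -0.3)]
for w1, w2, b in separators:
    w = np.array([w1, w2])
    margins = y * (X @ w + b)  # the Eq. 12.7 quantities y_n * f(x_n)
    print(f"w=({w1}, {w2}), b={b}: separates = {np.all(margins > 0)}, "
          f"min margin = {margins.min():.3f}")
```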
2. Primal SVM: Hard Margin (MML 12.2)
2.1 Concept of the Margin
The margin is the distance from the separating hyperplane to the closest example. From MML Eq. 12.14, the distance from a point \(\mathbf{x}\) to the hyperplane \(\langle \mathbf{w}, \mathbf{x} \rangle + b = 0\) is:
\[ \frac{|\langle \mathbf{w}, \mathbf{x} \rangle + b|}{\|\mathbf{w}\|}, \]
which for the closest point equals \(\frac{1}{\|\mathbf{w}\|}\) (under the convention that the closest point satisfies \(\langle \mathbf{w}, \mathbf{x}_a \rangle + b = 1\)).
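As a quick numerical sketch of this convention (with hypothetical values \(\mathbf{w} = (3, 4)\), \(b = -2\), not tied to the lab's data): starting from a point on the hyperplane and stepping by \(\mathbf{w}/\|\mathbf{w}\|^2\) produces a point with \(f(\mathbf{x}_a) = 1\) whose distance to the hyperplane is exactly \(1/\|\mathbf{w}\|\).

```python
import numpy as np

# Hypothetical hyperplane parameters for illustration
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

def distance_to_hyperplane(x, w, b):
    """Distance from x to the hyperplane <w, x> + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

x_on_plane = np.array([2.0, -1.0])   # w @ x + b = 6 - 4 - 2 = 0
x_a = x_on_plane + w / (w @ w)       # step by w/||w||^2, so w @ x_a + b = 1

print(w @ x_a + b)                        # ~1: the canonical scaling
print(distance_to_hyperplane(x_a, w, b))  # ~0.2 = 1/||w||
```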
2.2 Hard Margin SVM (Eq. 12.18-12.19)
The hard margin SVM is the constrained optimization problem:
\[ \min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \geq 1, \quad n = 1, \ldots, N. \]
We implement this using sub-gradient descent on the equivalent hinge-loss formulation with a very large \(C\) (approaching the hard-margin limit).
def svm_hinge_loss(w, b, X, y, C):
"""
Compute the SVM objective from MML Eq. 12.31:
L = (1/2)||w||^2 + C * sum_n max(0, 1 - y_n * (w . x_n + b))
Returns: (loss, grad_w, grad_b)
"""
N = X.shape[0]
margins = y * (X @ w + b) # y_n * f(x_n) for each n
hinge = np.maximum(0, 1 - margins) # max(0, 1 - t) where t = y*f(x)
# Objective
loss = 0.5 * np.dot(w, w) + C * np.sum(hinge)
# Sub-gradients
# For the regularizer: grad_w += w
# For hinge loss at point n: if margin < 1, grad_w += -C * y_n * x_n, grad_b += -C * y_n
violated = (margins < 1).astype(float) # indicator: 1 if in margin or misclassified
grad_w = w - C * X.T @ (violated * y)
grad_b = -C * np.sum(violated * y)
return loss, grad_w, grad_b
def train_svm_sgd(X, y, C=1.0, lr=0.001, n_epochs=1000, verbose=True):
"""Train SVM using sub-gradient descent on the primal objective (Eq. 12.31)."""
N, D = X.shape
w = np.zeros(D)
b = 0.0
losses = []
for epoch in range(n_epochs):
loss, grad_w, grad_b = svm_hinge_loss(w, b, X, y, C)
w -= lr * grad_w
b -= lr * grad_b
losses.append(loss)
if verbose:
print(f"Final loss: {losses[-1]:.4f}")
print(f"w = [{w[0]:.4f}, {w[1]:.4f}], b = {b:.4f}")
print(f"Margin = 2/||w|| = {2.0/np.linalg.norm(w):.4f}")
return w, b, losses
# Train hard-margin SVM (C very large => no slack allowed)
w_hard, b_hard, losses_hard = train_svm_sgd(X, y, C=1000.0, lr=0.0001, n_epochs=3000)
# Plot the loss convergence
plt.figure(figsize=(8, 4))
plt.plot(losses_hard)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Hard Margin SVM: Training Loss Convergence')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()
def plot_svm_boundary(w, b, X, y, title="SVM Decision Boundary", ax=None, show_margin=True):
"""Visualize the decision boundary, margin bands, and support vectors."""
if ax is None:
fig, ax = plt.subplots(figsize=(8, 6))
# Create mesh for decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
Z = xx.ravel() * w[0] + yy.ravel() * w[1] + b
Z = Z.reshape(xx.shape)
# Decision boundary and margin bands
ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['orange', 'black', 'steelblue'],
linestyles=['--', '-', '--'], linewidths=[1.5, 2, 1.5])
if show_margin:
ax.contourf(xx, yy, Z, levels=[-1, 1], colors=['lightyellow'], alpha=0.3)
# Data points
pos = y == 1
neg = y == -1
ax.scatter(X[pos, 0], X[pos, 1], c='steelblue', marker='o', s=50,
edgecolors='k', linewidth=0.5, label='$y=+1$')
ax.scatter(X[neg, 0], X[neg, 1], c='darkorange', marker='x', s=50,
linewidths=1.5, label='$y=-1$')
# Highlight support vectors (points on or within margin)
margins = y * (X @ w + b)
sv_mask = margins <= 1.01 # small tolerance
ax.scatter(X[sv_mask, 0], X[sv_mask, 1], facecolors='none', edgecolors='red',
s=150, linewidths=2, label='Support vectors')
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
ax.set_title(title)
ax.legend(loc='best')
return ax
plot_svm_boundary(w_hard, b_hard, X, y, title="Hard Margin SVM (Large C)")
margin_width = 2.0 / np.linalg.norm(w_hard)
plt.text(0.02, 0.02, f'Margin width = {margin_width:.3f}', transform=plt.gca().transAxes,
fontsize=11, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.tight_layout()
plt.show()
# Verify all points are correctly classified
predictions = np.sign(X @ w_hard + b_hard)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy*100:.1f}%")
print(f"Number of support vectors: {np.sum(y * (X @ w_hard + b_hard) <= 1.01)}")
3. Soft Margin SVM (MML 12.2.4-12.2.5)
When data is not linearly separable (Figure 12.6b in MML), we introduce slack variables \(\xi_n \geq 0\) to allow some points to violate the margin. The soft margin SVM (Eq. 12.26a) is:
\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \xi_n \quad \text{subject to} \quad y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \geq 1 - \xi_n, \quad \xi_n \geq 0. \]
Equivalently, using the hinge loss \(\ell(t) = \max\{0, 1-t\}\) (Eq. 12.28), this becomes the unconstrained problem (Eq. 12.31):
\[ \min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \max\{0, 1 - y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b)\}. \]
The parameter \(C > 0\) trades off margin size vs. classification errors. Large \(C\) means less tolerance for errors (approaches hard margin).
# First, let's visualize the hinge loss vs zero-one loss (Figure 12.8 in MML)
t = np.linspace(-3, 3, 300)
hinge = np.maximum(0, 1 - t)
zero_one = (t < 0).astype(float)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(t, zero_one, 'steelblue', linewidth=2, label='Zero-one loss')
ax.plot(t, hinge, 'darkorange', linewidth=2, label='Hinge loss')
ax.set_xlabel('$t = y \\cdot f(\\mathbf{x})$')
ax.set_ylabel('Loss $\\ell(t)$')
ax.set_title('Hinge Loss is a Convex Upper Bound on Zero-One Loss (MML Fig. 12.8)')
ax.legend()
ax.set_ylim(-0.1, 4)
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.axvline(1, color='gray', linewidth=0.5, linestyle=':')
ax.text(1.05, 3.5, '$t=1$ (margin)', fontsize=10, color='gray')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("The hinge loss penalizes points inside the margin (0 < t < 1) and misclassified points (t < 0).")
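The three regimes in that statement can be checked directly, along with the upper-bound property shown in the figure (a minimal sketch):

```python
import numpy as np

hinge = lambda t: np.maximum(0, 1 - t)

# Three regimes of t = y * f(x):
print(hinge(-0.5))   # misclassified (t < 0): loss 1.5 > 1
print(hinge(0.5))    # correct but inside the margin (0 < t < 1): loss 0.5
print(hinge(2.0))    # correct with margin to spare (t >= 1): loss 0.0

# Hinge loss upper-bounds the zero-one loss everywhere
t = np.linspace(-3, 3, 601)
assert np.all(hinge(t) >= (t < 0).astype(float))
```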
# Create non-linearly-separable data by adding noise/overlap
np.random.seed(123)
n_per_class = 40
X_pos_noisy = np.random.randn(n_per_class, 2) * 1.2 + np.array([1.5, 1.5])
X_neg_noisy = np.random.randn(n_per_class, 2) * 1.2 + np.array([-1.0, -1.0])
X_noisy = np.vstack([X_pos_noisy, X_neg_noisy])
y_noisy = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
# Train soft margin SVM with moderate C
w_soft, b_soft, losses_soft = train_svm_sgd(X_noisy, y_noisy, C=1.0, lr=0.001, n_epochs=2000)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: training curve
axes[0].plot(losses_soft)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Soft Margin SVM Training Loss')
axes[0].grid(True, alpha=0.3)
# Right: decision boundary
plot_svm_boundary(w_soft, b_soft, X_noisy, y_noisy,
title="Soft Margin SVM (C=1.0, overlapping data)", ax=axes[1])
plt.tight_layout()
plt.show()
# Count misclassifications
preds_soft = np.sign(X_noisy @ w_soft + b_soft)
acc_soft = np.mean(preds_soft == y_noisy)
n_margin_violations = np.sum(y_noisy * (X_noisy @ w_soft + b_soft) < 1)
print(f"Training accuracy: {acc_soft*100:.1f}%")
print(f"Points violating margin (slack > 0): {n_margin_violations}")
print(f"Margin width: {2.0/np.linalg.norm(w_soft):.3f}")
4. Dual SVM (MML 12.3)
4.1 Convex Duality via Lagrange Multipliers
The Lagrangian of the soft margin SVM (Eq. 12.34) is:
\[ \mathfrak{L}(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\gamma}) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \xi_n - \sum_{n=1}^N \alpha_n \big( y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) - 1 + \xi_n \big) - \sum_{n=1}^N \gamma_n \xi_n, \]
with Lagrange multipliers \(\alpha_n \geq 0\) and \(\gamma_n \geq 0\).
Setting partial derivatives to zero yields the representer theorem (Eq. 12.38):
\[ \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n. \]
The dual SVM (Eq. 12.41) is:
\[ \min_{\boldsymbol{\alpha}} \quad \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \sum_{i=1}^N \alpha_i \]
\[ \text{subject to} \quad \sum_{i=1}^N y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C. \]
Key observations:
The dual depends on the data only through inner products \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) – this enables the kernel trick.
Points with \(\alpha_n > 0\) are support vectors; points with \(\alpha_n = 0\) do not influence the solution.
The bias is recovered via \(b^* = y_n - \langle \mathbf{w}^*, \mathbf{x}_n \rangle\) for any support vector \(\mathbf{x}_n\) (Eq. 12.42).
def solve_dual_svm(X, y, C=1.0, lr=0.01, n_epochs=3000, kernel_fn=None):
"""
Solve the dual SVM (Eq. 12.41) using projected gradient descent.
Dual objective: min_alpha (1/2) alpha^T Q alpha - 1^T alpha
where Q_ij = y_i y_j K(x_i, x_j)
Constraints: 0 <= alpha_i <= C, sum(y_i * alpha_i) = 0
"""
N = X.shape[0]
# Compute kernel/Gram matrix
if kernel_fn is None:
K = X @ X.T # linear kernel: <x_i, x_j>
else:
K = kernel_fn(X, X)
# Q_ij = y_i * y_j * K_ij
Q = np.outer(y, y) * K
# Initialize alphas
alpha = np.zeros(N)
losses = []
for epoch in range(n_epochs):
# Gradient: Q @ alpha - 1
grad = Q @ alpha - np.ones(N)
# Gradient step
alpha -= lr * grad
# Project onto constraints: 0 <= alpha_i <= C
alpha = np.clip(alpha, 0, C)
# Project onto equality constraint: sum(y_i * alpha_i) = 0
# Remove the component along y: alpha -= (y^T alpha / y^T y) * y
alpha -= (y @ alpha) / (y @ y) * y
alpha = np.clip(alpha, 0, C) # re-clip after projection
# Compute dual objective
dual_obj = 0.5 * alpha @ Q @ alpha - np.sum(alpha)
losses.append(dual_obj)
# Recover primal parameters using representer theorem (Eq. 12.38)
# w = sum_n alpha_n * y_n * x_n
w = X.T @ (alpha * y)
# Recover bias b from support vectors (Eq. 12.42)
# Use support vectors strictly inside box (0 < alpha < C)
sv_mask = (alpha > 1e-5)
interior_sv = (alpha > 1e-5) & (alpha < C - 1e-5)
if np.any(interior_sv):
# b = y_n - w^T x_n for interior support vectors
b_values = y[interior_sv] - X[interior_sv] @ w
b = np.median(b_values)
elif np.any(sv_mask):
b_values = y[sv_mask] - X[sv_mask] @ w
b = np.median(b_values)
else:
b = 0.0
return alpha, w, b, losses
# Solve dual SVM on the clean linearly separable data
alpha_dual, w_dual, b_dual, losses_dual = solve_dual_svm(X, y, C=100.0, lr=0.001, n_epochs=5000)
print(f"Dual SVM solution:")
print(f" w = [{w_dual[0]:.4f}, {w_dual[1]:.4f}], b = {b_dual:.4f}")
print(f" Margin width = {2.0/np.linalg.norm(w_dual):.4f}")
print(f" Number of support vectors (alpha > 0): {np.sum(alpha_dual > 1e-5)}")
print(f" Constraint sum(y*alpha) = {y @ alpha_dual:.6f} (should be ~0)")
# Visualize the dual solution: highlight support vectors with alpha values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: dual objective convergence
axes[0].plot(losses_dual)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Dual Objective')
axes[0].set_title('Dual SVM Convergence')
axes[0].grid(True, alpha=0.3)
# Right: decision boundary with alpha values
ax = axes[1]
plot_svm_boundary(w_dual, b_dual, X, y, title="Dual SVM Solution", ax=ax)
# Annotate support vectors with their alpha values
sv_mask = alpha_dual > 1e-5
sv_indices = np.where(sv_mask)[0]
for idx in sv_indices:
ax.annotate(f'$\\alpha$={alpha_dual[idx]:.2f}',
xy=(X[idx, 0], X[idx, 1]),
xytext=(10, 10), textcoords='offset points',
fontsize=8, color='red',
arrowprops=dict(arrowstyle='->', color='red', lw=0.5))
plt.tight_layout()
plt.show()
# Show the representer theorem in action
print("Representer theorem (Eq. 12.38): w = sum_n alpha_n * y_n * x_n")
print(f" w from dual alphas: [{w_dual[0]:.4f}, {w_dual[1]:.4f}]")
w_check = np.zeros(2)
for i in sv_indices:
w_check += alpha_dual[i] * y[i] * X[i]
print(f" SV #{i}: alpha={alpha_dual[i]:.4f}, y={y[i]:+.0f}, x=[{X[i,0]:.2f}, {X[i,1]:.2f}]")
print(f" Sum check: [{w_check[0]:.4f}, {w_check[1]:.4f}]")
5. Kernels (MML 12.4)
The dual SVM objective depends on the data only through inner products \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\). If we apply a feature map \(\phi(\mathbf{x})\) to represent data in a (possibly infinite-dimensional) feature space, we replace \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) with \(\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}\).
A kernel \(k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}\) (Eq. 12.52) lets us compute these inner products without ever computing \(\phi\) explicitly. This is the kernel trick.
Common kernels:
Linear: \(k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j\)
Polynomial: \(k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + c)^d\)
RBF (Gaussian): \(k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)\)
The RBF kernel corresponds to an infinite-dimensional feature space, yet we can compute the kernel value in \(O(D)\) time.
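The infinite-dimensional claim can be made concrete in one dimension. Writing \(k(x, x') = e^{-x^2/2\sigma^2}\, e^{-x'^2/2\sigma^2} \sum_{n=0}^{\infty} \frac{(x x')^n}{\sigma^{2n} n!}\) exhibits the explicit features \(\phi_n(x) = e^{-x^2/2\sigma^2}\, x^n / (\sigma^n \sqrt{n!})\). Truncating the series (a sketch with illustrative scalar inputs, not part of the lab) already matches the kernel to machine precision:

```python
import numpy as np
from math import exp, factorial, sqrt

sigma = 1.0
x, xp = 0.7, -0.3    # illustrative scalar inputs

def rbf_1d(x, xp, sigma=1.0):
    """Exact 1-D RBF kernel value exp(-(x - x')^2 / (2 sigma^2))."""
    return exp(-(x - xp) ** 2 / (2 * sigma ** 2))

def phi_truncated(x, n_terms=25, sigma=1.0):
    """First n_terms coordinates of the (infinite) 1-D RBF feature map."""
    return np.array([exp(-x ** 2 / (2 * sigma ** 2)) * x ** n / (sigma ** n * sqrt(factorial(n)))
                     for n in range(n_terms)])

k_exact = rbf_1d(x, xp, sigma)
k_series = phi_truncated(x) @ phi_truncated(xp)
print(k_exact, k_series)   # both approx 0.6065: 25 terms already suffice
```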
# Define kernel functions
def linear_kernel(X1, X2):
"""Linear kernel: k(x, x') = x^T x'"""
return X1 @ X2.T
def polynomial_kernel(X1, X2, degree=3, c=1.0):
"""Polynomial kernel: k(x, x') = (x^T x' + c)^d"""
return (X1 @ X2.T + c) ** degree
def rbf_kernel(X1, X2, gamma=1.0):
"""RBF/Gaussian kernel: k(x, x') = exp(-gamma * ||x - x'||^2)
Note: gamma = 1/(2*sigma^2)
"""
sq_dists = np.sum(X1**2, axis=1, keepdims=True) + np.sum(X2**2, axis=1) - 2 * X1 @ X2.T
return np.exp(-gamma * sq_dists)
# Demonstrate: for polynomial kernel of degree 2 in 2D,
# the implicit feature map is phi(x) = [1, sqrt(2)*x1, sqrt(2)*x2, x1^2, sqrt(2)*x1*x2, x2^2]
x_demo = np.array([1.0, 2.0])
x_demo2 = np.array([3.0, 4.0])
# Explicit feature map for degree-2 polynomial kernel with c=1
def phi_poly2(x):
x1, x2 = x
return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, np.sqrt(2)*x1*x2, x2**2])
# Compare kernel value vs inner product in feature space
k_val = polynomial_kernel(x_demo.reshape(1, -1), x_demo2.reshape(1, -1), degree=2, c=1.0)[0, 0]
phi_inner = phi_poly2(x_demo) @ phi_poly2(x_demo2)
print("Polynomial kernel (degree=2, c=1) verification:")
print(f" k(x, x') = (x^T x' + 1)^2 = {k_val:.4f}")
print(f" <phi(x), phi(x')> = {phi_inner:.4f}")
print(f" Match: {np.isclose(k_val, phi_inner)}")
# RBF kernel values at different distances
distances = np.array([0.0, 0.5, 1.0, 2.0, 5.0])
for d in distances:
k_val = np.exp(-1.0 * d**2)
print(f" RBF(gamma=1): distance={d:.1f}, k={k_val:.4f}")
def solve_kernel_svm(X, y, kernel_fn, C=1.0, lr=0.01, n_epochs=3000):
"""
Solve the kernel SVM using the dual formulation.
Predictions use: f(x) = sum_n alpha_n y_n k(x_n, x) + b
"""
N = X.shape[0]
K = kernel_fn(X, X)
Q = np.outer(y, y) * K
alpha = np.zeros(N)
for epoch in range(n_epochs):
grad = Q @ alpha - np.ones(N)
alpha -= lr * grad
alpha = np.clip(alpha, 0, C)
alpha -= (y @ alpha) / (y @ y) * y
alpha = np.clip(alpha, 0, C)
# Recover bias
sv_mask = alpha > 1e-5
interior_sv = (alpha > 1e-5) & (alpha < C - 1e-5)
if np.any(interior_sv):
b_values = y[interior_sv] - K[interior_sv] @ (alpha * y)
b = np.median(b_values)
elif np.any(sv_mask):
b_values = y[sv_mask] - K[sv_mask] @ (alpha * y)
b = np.median(b_values)
else:
b = 0.0
return alpha, b
def predict_kernel_svm(X_train, y_train, alpha, b, X_test, kernel_fn):
"""Predict using kernel SVM: f(x) = sum_n alpha_n y_n k(x_n, x) + b"""
K = kernel_fn(X_train, X_test)
return (alpha * y_train) @ K + b
# Generate concentric circles data (not linearly separable)
np.random.seed(42)
n_samples = 100
theta = np.random.uniform(0, 2 * np.pi, n_samples)
r_inner = np.random.normal(1.0, 0.2, n_samples // 2)
r_outer = np.random.normal(3.0, 0.4, n_samples // 2)
X_inner = np.column_stack([r_inner * np.cos(theta[:n_samples//2]),
r_inner * np.sin(theta[:n_samples//2])])
X_outer = np.column_stack([r_outer * np.cos(theta[n_samples//2:]),
r_outer * np.sin(theta[n_samples//2:])])
X_circles = np.vstack([X_inner, X_outer])
y_circles = np.hstack([np.ones(n_samples // 2), -np.ones(n_samples // 2)])
# Compare linear vs polynomial vs RBF kernels
kernels = {
'Linear': lambda X1, X2: linear_kernel(X1, X2),
'Polynomial (d=3)': lambda X1, X2: polynomial_kernel(X1, X2, degree=3, c=1.0),
'RBF (gamma=1)': lambda X1, X2: rbf_kernel(X1, X2, gamma=1.0),
}
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, (name, kfn) in zip(axes, kernels.items()):
alpha_k, b_k = solve_kernel_svm(X_circles, y_circles, kfn, C=10.0, lr=0.001, n_epochs=5000)
# Create decision boundary mesh
x_min, x_max = X_circles[:, 0].min() - 1, X_circles[:, 0].max() + 1
y_min, y_max = X_circles[:, 1].min() - 1, X_circles[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 150), np.linspace(y_min, y_max, 150))
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z = predict_kernel_svm(X_circles, y_circles, alpha_k, b_k, grid, kfn)
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, levels=[-100, 0, 100], colors=['#FFE0B2', '#BBDEFB'], alpha=0.5)
ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
pos = y_circles == 1
neg = y_circles == -1
ax.scatter(X_circles[pos, 0], X_circles[pos, 1], c='steelblue', marker='o', s=40,
edgecolors='k', linewidth=0.5)
ax.scatter(X_circles[neg, 0], X_circles[neg, 1], c='darkorange', marker='x', s=40, linewidths=1.5)
# Highlight support vectors
sv = alpha_k > 1e-5
ax.scatter(X_circles[sv, 0], X_circles[sv, 1], facecolors='none', edgecolors='red',
s=120, linewidths=1.5)
pred = np.sign(predict_kernel_svm(X_circles, y_circles, alpha_k, b_k, X_circles, kfn))
acc = np.mean(pred == y_circles)
ax.set_title(f'{name}\nAcc={acc*100:.0f}%, SVs={np.sum(sv)}')
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
plt.suptitle('Kernel SVM on Concentric Circles: Linear vs Polynomial vs RBF', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print("The RBF kernel captures the circular boundary; the linear kernel cannot.")
6. Kernel Trick Visualization: XOR Problem
The classic XOR problem demonstrates why kernels are essential. Data arranged in an XOR pattern cannot be separated by any linear classifier. However, a non-linear feature map \(\phi\) (implicitly defined by a kernel) can project the data into a higher-dimensional space where it becomes linearly separable.
For the RBF kernel, each data point is mapped to an infinite-dimensional feature space where the kernel computes similarity based on Euclidean distance: nearby points (regardless of direction) get high similarity.
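That direction-invariance is easy to confirm numerically (a minimal sketch with arbitrary points):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF similarity: depends only on the Euclidean distance ||a - b||."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x0 = np.array([0.0, 0.0])
near_east = np.array([1.0, 0.0])    # distance 1, direction +x
near_north = np.array([0.0, 1.0])   # distance 1, direction +y
far = np.array([3.0, 0.0])          # distance 3

print(rbf(x0, near_east), rbf(x0, near_north))  # identical: direction is irrelevant
print(rbf(x0, far))                             # far smaller: only distance matters
```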
# Generate XOR-like data
np.random.seed(42)
n_xor = 50
# Cluster 1: top-left (+1)
c1 = np.random.randn(n_xor, 2) * 0.5 + np.array([-1.5, 1.5])
# Cluster 2: bottom-right (+1)
c2 = np.random.randn(n_xor, 2) * 0.5 + np.array([1.5, -1.5])
# Cluster 3: top-right (-1)
c3 = np.random.randn(n_xor, 2) * 0.5 + np.array([1.5, 1.5])
# Cluster 4: bottom-left (-1)
c4 = np.random.randn(n_xor, 2) * 0.5 + np.array([-1.5, -1.5])
X_xor = np.vstack([c1, c2, c3, c4])
y_xor = np.hstack([np.ones(n_xor), np.ones(n_xor), -np.ones(n_xor), -np.ones(n_xor)])
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Panel 1: Raw XOR data -- not linearly separable
ax = axes[0]
pos = y_xor == 1
neg = y_xor == -1
ax.scatter(X_xor[pos, 0], X_xor[pos, 1], c='steelblue', marker='o', s=40,
edgecolors='k', linewidth=0.5, label='$y=+1$')
ax.scatter(X_xor[neg, 0], X_xor[neg, 1], c='darkorange', marker='x', s=40,
linewidths=1.5, label='$y=-1$')
ax.set_title('XOR Data (Not Linearly Separable)')
ax.legend()
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
# Panel 2: Explicit polynomial feature map -- show x1*x2 feature makes it separable
# phi(x) includes x1*x2 which separates XOR
x1x2 = X_xor[:, 0] * X_xor[:, 1]
ax = axes[1]
ax.scatter(np.linalg.norm(X_xor, axis=1)[pos], x1x2[pos], c='steelblue', marker='o', s=40,
edgecolors='k', linewidth=0.5, label='$y=+1$')
ax.scatter(np.linalg.norm(X_xor, axis=1)[neg], x1x2[neg], c='darkorange', marker='x', s=40,
linewidths=1.5, label='$y=-1$')
ax.axhline(0, color='green', linewidth=2, linestyle='--', label='Separating line')
ax.set_title('Feature Space: $x_1 \\cdot x_2$ vs $\\|\\mathbf{x}\\|$')
ax.set_xlabel('$\\|\\mathbf{x}\\|$')
ax.set_ylabel('$x_1 \\cdot x_2$')
ax.legend()
# Panel 3: RBF kernel SVM decision boundary
ax = axes[2]
rbf_fn = lambda X1, X2: rbf_kernel(X1, X2, gamma=1.0)
alpha_xor, b_xor = solve_kernel_svm(X_xor, y_xor, rbf_fn, C=10.0, lr=0.0005, n_epochs=5000)
x_min, x_max = X_xor[:, 0].min() - 1, X_xor[:, 0].max() + 1
y_min, y_max = X_xor[:, 1].min() - 1, X_xor[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 150), np.linspace(y_min, y_max, 150))
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z_xor = predict_kernel_svm(X_xor, y_xor, alpha_xor, b_xor, grid, rbf_fn)
Z_xor = Z_xor.reshape(xx.shape)
ax.contourf(xx, yy, Z_xor, levels=[-100, 0, 100], colors=['#FFE0B2', '#BBDEFB'], alpha=0.5)
ax.contour(xx, yy, Z_xor, levels=[0], colors='black', linewidths=2)
ax.scatter(X_xor[pos, 0], X_xor[pos, 1], c='steelblue', marker='o', s=40,
edgecolors='k', linewidth=0.5)
ax.scatter(X_xor[neg, 0], X_xor[neg, 1], c='darkorange', marker='x', s=40, linewidths=1.5)
sv_xor = alpha_xor > 1e-5
ax.scatter(X_xor[sv_xor, 0], X_xor[sv_xor, 1], facecolors='none', edgecolors='red',
s=120, linewidths=1.5)
pred_xor = np.sign(predict_kernel_svm(X_xor, y_xor, alpha_xor, b_xor, X_xor, rbf_fn))
acc_xor = np.mean(pred_xor == y_xor)
ax.set_title(f'RBF Kernel SVM (Acc={acc_xor*100:.0f}%)')
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
plt.suptitle('Kernel Trick: Making XOR Separable', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print(f"The RBF kernel SVM learns a non-linear boundary that correctly classifies XOR data.")
print(f"Support vectors: {np.sum(sv_xor)}/{len(y_xor)}")
7. Effect of the Regularization Parameter \(C\)
Recall from the soft margin SVM (Eq. 12.26a) that \(C\) controls the trade-off:
Large \(C\): Heavily penalizes misclassifications. Narrow margin, fewer errors, risk of overfitting.
Small \(C\): Tolerates more misclassifications. Wider margin, more robust, risk of underfitting.
Note: In the MML formulation (Eq. 12.26a), \(C\) multiplies the error term, not the regularizer. So large \(C\) = less regularization (harder margin). This is opposite to some formulations where \(\lambda\) multiplies the regularizer.
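The note can be sketched numerically: dividing the \(C\)-formulation objective by the positive constant \(CN\) gives the \(\lambda\)-formulation with \(\lambda = 1/(CN)\), so both attain their minimum at the same \((w, b)\). The tiny 1-D dataset below is hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical 1-D dataset (not from the lab)
X = np.array([-2.0, -1.0, 1.5, 2.5])
y = np.array([-1.0, -1.0, 1.0, 1.0])
N, C = len(y), 10.0
lam = 1.0 / (C * N)   # regularizer weight matching this C

def obj_C(w, b):      # (1/2) w^2 + C * sum hinge   (MML Eq. 12.26a/12.31 style)
    return 0.5 * w ** 2 + C * np.maximum(0, 1 - y * (X * w + b)).sum()

def obj_lam(w, b):    # (lam/2) w^2 + mean hinge    (lambda-on-regularizer style)
    return 0.5 * lam * w ** 2 + np.maximum(0, 1 - y * (X * w + b)).mean()

# The two objectives differ only by the positive factor C*N,
# so their minimizers over any candidate grid coincide.
ws, bs = np.linspace(0.1, 3, 60), np.linspace(-1, 1, 41)
vals_C = np.array([[obj_C(w, b) for b in bs] for w in ws])
vals_lam = np.array([[obj_lam(w, b) for b in bs] for w in ws])
print(np.unravel_index(vals_C.argmin(), vals_C.shape))
print(np.unravel_index(vals_lam.argmin(), vals_lam.shape))  # same grid cell
```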
# Demonstrate C effect on the overlapping (noisy) dataset using primal SVM
C_values = [0.01, 0.1, 1.0, 100.0]
fig, axes = plt.subplots(1, 4, figsize=(20, 4.5))
for ax, C_val in zip(axes, C_values):
w_c, b_c, _ = train_svm_sgd(X_noisy, y_noisy, C=C_val, lr=0.001, n_epochs=2000, verbose=False)
# Decision boundary
x_min, x_max = X_noisy[:, 0].min() - 1, X_noisy[:, 0].max() + 1
y_min, y_max = X_noisy[:, 1].min() - 1, X_noisy[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
Z = xx.ravel() * w_c[0] + yy.ravel() * w_c[1] + b_c
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, levels=[-100, -1, 1, 100], colors=['#FFE0B2', '#FFFDE7', '#BBDEFB'], alpha=0.4)
ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['orange', 'black', 'steelblue'],
linestyles=['--', '-', '--'], linewidths=[1, 2, 1])
pos = y_noisy == 1
neg = y_noisy == -1
ax.scatter(X_noisy[pos, 0], X_noisy[pos, 1], c='steelblue', marker='o', s=30,
edgecolors='k', linewidth=0.5)
ax.scatter(X_noisy[neg, 0], X_noisy[neg, 1], c='darkorange', marker='x', s=30, linewidths=1.5)
# Highlight support vectors
margins = y_noisy * (X_noisy @ w_c + b_c)
sv = margins <= 1.01
ax.scatter(X_noisy[sv, 0], X_noisy[sv, 1], facecolors='none', edgecolors='red',
s=100, linewidths=1.5)
preds = np.sign(X_noisy @ w_c + b_c)
acc = np.mean(preds == y_noisy)
margin_w = 2.0 / np.linalg.norm(w_c)
ax.set_title(f'C={C_val}\nMargin={margin_w:.2f}, Acc={acc*100:.0f}%\nSVs={np.sum(sv)}')
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
plt.suptitle('Effect of C on Soft Margin SVM (Overlapping Data)', fontsize=14, y=1.05)
plt.tight_layout()
plt.show()
print("Small C => wide margin, more misclassifications (underfitting)")
print("Large C => narrow margin, fewer misclassifications (overfitting risk)")
# Also show C effect with RBF kernel on the circles dataset
C_values_rbf = [0.1, 1.0, 10.0, 1000.0]
fig, axes = plt.subplots(1, 4, figsize=(20, 4.5))
rbf_fn = lambda X1, X2: rbf_kernel(X1, X2, gamma=1.0)
for ax, C_val in zip(axes, C_values_rbf):
alpha_c, b_c = solve_kernel_svm(X_circles, y_circles, rbf_fn, C=C_val, lr=0.001, n_epochs=5000)
x_min, x_max = X_circles[:, 0].min() - 1, X_circles[:, 0].max() + 1
y_min, y_max = X_circles[:, 1].min() - 1, X_circles[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z = predict_kernel_svm(X_circles, y_circles, alpha_c, b_c, grid, rbf_fn)
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, levels=[-100, 0, 100], colors=['#FFE0B2', '#BBDEFB'], alpha=0.5)
ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
pos = y_circles == 1
neg = y_circles == -1
ax.scatter(X_circles[pos, 0], X_circles[pos, 1], c='steelblue', marker='o', s=30,
edgecolors='k', linewidth=0.5)
ax.scatter(X_circles[neg, 0], X_circles[neg, 1], c='darkorange', marker='x', s=30, linewidths=1.5)
pred = np.sign(predict_kernel_svm(X_circles, y_circles, alpha_c, b_c, X_circles, rbf_fn))
acc = np.mean(pred == y_circles)
n_sv = np.sum(alpha_c > 1e-5)
ax.set_title(f'C={C_val}\nAcc={acc*100:.0f}%, SVs={n_sv}')
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
plt.suptitle('Effect of C on RBF Kernel SVM (Circles Data)', fontsize=14, y=1.05)
plt.tight_layout()
plt.show()
print("With RBF kernel: small C => smoother boundary, large C => tighter fit to training data.")
8. Multi-class SVM: One-vs-Rest Strategy
The SVM as formulated in MML Chapter 12 is inherently a binary classifier (\(y \in \{+1, -1\}\)). For multi-class problems with \(K\) classes, the most common approach is one-vs-rest (OVR):
Train \(K\) separate binary SVMs, one for each class \(k\)
For class \(k\): relabel the data as \(y_n = +1\) if original label is \(k\), and \(y_n = -1\) otherwise
At prediction time: compute \(f_k(\mathbf{x}) = \langle \mathbf{w}_k, \mathbf{x} \rangle + b_k\) for each classifier, predict the class with the highest score
# Generate 3-class data
np.random.seed(42)
n_per = 40
X_c0 = np.random.randn(n_per, 2) * 0.7 + np.array([0, 3])
X_c1 = np.random.randn(n_per, 2) * 0.7 + np.array([-2, -1])
X_c2 = np.random.randn(n_per, 2) * 0.7 + np.array([2, -1])
X_multi = np.vstack([X_c0, X_c1, X_c2])
y_multi = np.hstack([np.zeros(n_per), np.ones(n_per), 2 * np.ones(n_per)]) # labels: 0, 1, 2
# One-vs-Rest: train 3 binary SVMs
classes = [0, 1, 2]
class_names = ['Class 0', 'Class 1', 'Class 2']
class_colors = ['steelblue', 'darkorange', 'green']
classifiers = []
print("Training One-vs-Rest SVMs:")
for k in classes:
# Relabel: class k -> +1, rest -> -1
y_binary = np.where(y_multi == k, 1.0, -1.0)
w_k, b_k, _ = train_svm_sgd(X_multi, y_binary, C=10.0, lr=0.001, n_epochs=2000, verbose=False)
classifiers.append((w_k, b_k))
acc_k = np.mean(np.sign(X_multi @ w_k + b_k) == y_binary)
print(f" {class_names[k]} vs Rest: w=[{w_k[0]:.3f}, {w_k[1]:.3f}], b={b_k:.3f}, acc={acc_k*100:.1f}%")
# Predict: argmax_k f_k(x)
def predict_ovr(X, classifiers):
scores = np.zeros((X.shape[0], len(classifiers)))
for k, (w_k, b_k) in enumerate(classifiers):
scores[:, k] = X @ w_k + b_k
return np.argmax(scores, axis=1)
y_pred_multi = predict_ovr(X_multi, classifiers)
accuracy_multi = np.mean(y_pred_multi == y_multi)
print(f"\nOverall multi-class accuracy: {accuracy_multi*100:.1f}%")
# Visualize multi-class decision regions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: individual binary classifiers
ax = axes[0]
markers = ['o', 's', '^']
for k in classes:
mask = y_multi == k
ax.scatter(X_multi[mask, 0], X_multi[mask, 1], c=class_colors[k], marker=markers[k],
s=50, edgecolors='k', linewidth=0.5, label=class_names[k])
x_line = np.linspace(-4, 4, 100)
line_labels = ['Class 0 vs Rest', 'Class 1 vs Rest', 'Class 2 vs Rest']
for k, (w_k, b_k) in enumerate(classifiers):
if abs(w_k[1]) > 1e-8:
y_line = -(w_k[0] * x_line + b_k) / w_k[1]
valid = (y_line > -4) & (y_line < 6)
ax.plot(x_line[valid], y_line[valid], c=class_colors[k], linewidth=2,
linestyle='--', label=line_labels[k])
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
ax.set_title('One-vs-Rest: Individual Classifiers')
ax.legend(fontsize=8, loc='best')
ax.set_xlim(-4.5, 4.5)
ax.set_ylim(-4, 6)
# Right: combined decision regions
ax = axes[1]
x_min, x_max = X_multi[:, 0].min() - 1, X_multi[:, 0].max() + 1
y_min, y_max = X_multi[:, 1].min() - 1, X_multi[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
Z_multi = predict_ovr(grid, classifiers).reshape(xx.shape)
cmap = ListedColormap(['#BBDEFB', '#FFE0B2', '#C8E6C9'])
ax.contourf(xx, yy, Z_multi, cmap=cmap, alpha=0.5)
for k in classes:
mask = y_multi == k
ax.scatter(X_multi[mask, 0], X_multi[mask, 1], c=class_colors[k], marker=markers[k],
s=50, edgecolors='k', linewidth=0.5, label=class_names[k])
# Mark misclassifications
misclassified = y_pred_multi != y_multi
if np.any(misclassified):
ax.scatter(X_multi[misclassified, 0], X_multi[misclassified, 1],
facecolors='none', edgecolors='red', s=150, linewidths=2, label='Misclassified')
ax.set_xlabel('$x^{(1)}$')
ax.set_ylabel('$x^{(2)}$')
ax.set_title(f'Multi-class Decision Regions (Acc={accuracy_multi*100:.0f}%)')
ax.legend(fontsize=8, loc='best')
plt.tight_layout()
plt.show()
# Confusion matrix
print("Confusion matrix:")
K = len(classes)
conf_mat = np.zeros((K, K), dtype=int)
for true_k in classes:
for pred_k in classes:
conf_mat[int(true_k), int(pred_k)] = np.sum((y_multi == true_k) & (y_pred_multi == pred_k))
print(" Predicted")
print(" " + " ".join([f" {k}" for k in classes]))
for k in classes:
row = " ".join([f"{conf_mat[int(k), int(j)]:3d}" for j in classes])
print(f" True {int(k)}: {row}")
Summary
This lab implemented all the core concepts from MML Chapter 12: Classification with Support Vector Machines.
Key Concepts Covered

| Section | MML Reference | Key Equations |
|---|---|---|
| Separating Hyperplanes | 12.1 | \(f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b\) (Eq. 12.2b), \(y_n f(\mathbf{x}_n) \geq 0\) (Eq. 12.7) |
| Hard Margin SVM | 12.2.1-12.2.2 | \(\min \frac{1}{2}\|\mathbf{w}\|^2\) s.t. \(y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \geq 1\) (Eq. 12.18-12.19) |
| Margin Derivation | 12.2.2 | Margin \(= \frac{1}{\|\mathbf{w}\|}\) (Eq. 12.14), Theorem 12.1 equivalence |
| Soft Margin SVM | 12.2.4-12.2.5 | Slack variables \(\xi_n\), hinge loss \(\ell(t)=\max(0,1-t)\) (Eq. 12.28-12.31) |
| Dual SVM | 12.3.1 | \(\min_{\boldsymbol{\alpha}} \frac{1}{2}\boldsymbol{\alpha}^\top Q \boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha}\) (Eq. 12.41), representer theorem \(\mathbf{w}=\sum_n \alpha_n y_n \mathbf{x}_n\) (Eq. 12.38) |
| Kernels | 12.4 | \(k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_\mathcal{H}\) (Eq. 12.52) |
| Multi-class | Extension | One-vs-Rest with \(K\) binary SVMs |
Key Takeaways
Maximum margin principle: Among all separating hyperplanes, the SVM chooses the one with the largest margin (\(2/\|\mathbf{w}\|\)), which tends to generalize better.
Primal vs Dual: The primal has \(D\) parameters (dimension of features); the dual has \(N\) parameters (number of examples). The dual is preferred when \(D \gg N\) or when using kernels.
Support vectors: Only the examples with \(\alpha_n > 0\) influence the solution. The SVM is sparse: the decision boundary is determined by a small subset of the training data.
Kernel trick: By replacing inner products with kernel evaluations, we can learn non-linear decision boundaries while solving the same convex optimization problem.
C parameter: Controls the bias-variance trade-off. Small \(C\) favours a wide margin (high bias, low variance); large \(C\) favours correct classification of training points (low bias, high variance).
Further Reading
MML Section 12.5: Numerical Solution (quadratic programming solvers)
MML Section 12.6: Further Reading (connections to other approaches)
Schölkopf and Smola (2002): Learning with Kernels
Steinwart and Christmann (2008): Support Vector Machines