Lab 06: CNN Feature Extraction & Deep Learning Fundamentals
Source: Deep Learning Interviews by Shlomo Kashani (PDF: resources/2201.00650v2.pdf)
Chapters: CNN Feature Extraction, Deep Learning
This lab covers the foundational building blocks of Convolutional Neural Networks and deep learning:
Convolution & Correlation: the core operations behind CNNs
Activation Functions: nonlinearities that give networks their power
Perceptron: the simplest neural network and its limitations
Performance Metrics: ROC, AUC, top-k accuracy
Pooling Operations: spatial down-sampling in CNNs
Loss Functions: objectives that drive learning
All implementations use only NumPy and Matplotlib; no deep learning frameworks required.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["font.size"] = 12
np.random.seed(42)
print("Imports ready.")
PART 1: Convolution & Correlation
What deep learning calls "convolution" is actually cross-correlation
In signal processing, true convolution flips the kernel before sliding it:
\[(f * g)[n] = \sum_{m} f[m]\, g[n - m]\]
Cross-correlation does not flip the kernel:
\[(f \star g)[n] = \sum_{m} f[m]\, g[n + m]\]
Deep learning frameworks (PyTorch, TensorFlow) implement cross-correlation but call it "convolution." Since the kernel weights are learned, flipping is irrelevant: the network learns the appropriate (possibly flipped) weights.
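To make the flip concrete, here is a minimal NumPy check (illustrative only; the array values are arbitrary): true convolution equals cross-correlation with the reversed kernel, which is why the naming difference has no practical effect once weights are learned.

sig = np.array([1., 2., 3., 4., 5.])
ker = np.array([1., 0., -1.])
# np.convolve flips the kernel (true convolution); np.correlate does not.
conv_out = np.convolve(sig, ker, mode="valid")
corr_out = np.correlate(sig, ker[::-1], mode="valid")  # correlate with the flipped kernel
print("convolution:              ", conv_out)
print("correlation (flipped ker):", corr_out)  # identical to conv_out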
Output size (valid mode, no padding)
For a 1D signal of length \(N\) and kernel of length \(K\): output length \(= N - K + 1\).
For a 2D image of shape \((H, W)\) and kernel of shape \((k_H, k_W)\): output shape \(= (H - k_H + 1,\; W - k_W + 1)\).
With stride \(s\) and padding \(p\), each output dimension becomes \(\left\lfloor \frac{N + 2p - K}{s} \right\rfloor + 1\).
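As a quick sanity check, the small helper below (not part of the lab's required code; the sample numbers are arbitrary) evaluates the output-size formula so the shapes printed by the demos further down can be predicted in advance.

def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

print("1D, N=7, K=3:              ", conv_output_size(7, 3))            # 5, matches the 1D demo
print("2D dim, N=5, K=3, s=1, p=1:", conv_output_size(5, 3, 1, 1))      # 5, "same" padding
print("2D dim, N=5, K=3, s=2, p=1:", conv_output_size(5, 3, 2, 1))      # 3, matches the strided demo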
def cross_correlation_1d(signal, kernel):
"""
Compute 1D cross-correlation in valid mode (no padding).
Args:
signal: 1D numpy array of length N
kernel: 1D numpy array of length K
Returns:
1D numpy array of length N - K + 1
"""
n = len(signal)
k = len(kernel)
out_len = n - k + 1
output = np.zeros(out_len)
for i in range(out_len):
output[i] = np.sum(signal[i:i + k] * kernel)
return output
# --- Demo ---
signal = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
kernel = np.array([1, 0, -1], dtype=float)
result = cross_correlation_1d(signal, kernel)
print(f"Signal: {signal}")
print(f"Kernel: {kernel}")
print(f"Output: {result}")
print(f"Expected length: {len(signal)} - {len(kernel)} + 1 = {len(signal) - len(kernel) + 1}")
print(f"Actual length: {len(result)}")
def cross_correlation_2d(image, kernel):
"""
Compute 2D cross-correlation in valid mode using nested loops.
Args:
image: 2D numpy array of shape (H, W)
kernel: 2D numpy array of shape (kH, kW)
Returns:
2D numpy array of shape (H - kH + 1, W - kW + 1)
"""
H, W = image.shape
kH, kW = kernel.shape
out_H = H - kH + 1
out_W = W - kW + 1
output = np.zeros((out_H, out_W))
for i in range(out_H):
for j in range(out_W):
region = image[i:i + kH, j:j + kW]
output[i, j] = np.sum(region * kernel)
return output
# --- Demo ---
image = np.array([
[1, 2, 3, 0],
[0, 1, 2, 3],
[3, 0, 1, 2],
[2, 3, 0, 1]
], dtype=float)
kernel = np.array([
[1, 0],
[0, 1]
], dtype=float)
result = cross_correlation_2d(image, kernel)
print("Image (4x4):")
print(image)
print(f"\nKernel (2x2):\n{kernel}")
print(f"\nOutput shape: {result.shape}")
print(f"Output:\n{result}")
# Apply 4 classic filters to an 8x8 checkerboard pattern
# Create checkerboard
checkerboard = np.zeros((8, 8))
checkerboard[::2, ::2] = 1
checkerboard[1::2, 1::2] = 1
# Define filters
kernels = {
"Horizontal Edge": np.array([[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1]], dtype=float),
"Vertical Edge": np.array([[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]], dtype=float),
"Sharpen": np.array([[ 0, -1, 0],
[-1, 5, -1],
[ 0, -1, 0]], dtype=float),
"Blur (Box)": np.ones((3, 3), dtype=float) / 9.0,
}
# Apply and plot
fig, axes = plt.subplots(1, 5, figsize=(16, 3))
axes[0].imshow(checkerboard, cmap="gray", vmin=0, vmax=1)
axes[0].set_title("Original\nCheckerboard")
axes[0].axis("off")
for idx, (name, kernel) in enumerate(kernels.items(), start=1):
filtered = cross_correlation_2d(checkerboard, kernel)
axes[idx].imshow(filtered, cmap="gray")
axes[idx].set_title(name)
axes[idx].axis("off")
fig.suptitle("2D Cross-Correlation with Classic Filters", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
def conv2d_with_stride_padding(image, kernel, stride=1, padding=0):
"""
2D cross-correlation with stride and zero-padding.
Args:
image: 2D array of shape (H, W)
kernel: 2D array of shape (kH, kW)
stride: step size for sliding the kernel
padding: number of zero-padding rows/cols on each side
Returns:
2D array of shape ((H + 2*pad - kH) // stride + 1,
(W + 2*pad - kW) // stride + 1)
"""
# Step 1: zero-pad the image
if padding > 0:
padded = np.zeros((image.shape[0] + 2 * padding,
image.shape[1] + 2 * padding))
padded[padding:-padding, padding:-padding] = image
else:
padded = image.copy()
H, W = padded.shape
kH, kW = kernel.shape
out_H = (H - kH) // stride + 1
out_W = (W - kW) // stride + 1
output = np.zeros((out_H, out_W))
# Step 2: strided cross-correlation
for i in range(out_H):
for j in range(out_W):
row_start = i * stride
col_start = j * stride
region = padded[row_start:row_start + kH, col_start:col_start + kW]
output[i, j] = np.sum(region * kernel)
return output
# --- Demo ---
image_5x5 = np.arange(1, 26, dtype=float).reshape(5, 5)
kernel_3x3 = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
print("Image (5x5):")
print(image_5x5)
print(f"\nKernel (3x3):\n{kernel_3x3}")
# No padding, stride=1 (standard valid)
out1 = conv2d_with_stride_padding(image_5x5, kernel_3x3, stride=1, padding=0)
print(f"\nStride=1, Padding=0 -> shape {out1.shape}:")
print(out1)
# Padding=1, stride=1 ("same" convolution for 3x3 kernel)
out2 = conv2d_with_stride_padding(image_5x5, kernel_3x3, stride=1, padding=1)
print(f"\nStride=1, Padding=1 -> shape {out2.shape}:")
print(out2)
# Stride=2, padding=1
out3 = conv2d_with_stride_padding(image_5x5, kernel_3x3, stride=2, padding=1)
print(f"\nStride=2, Padding=1 -> shape {out3.shape}:")
print(out3)
PART 2: Activation Functions
Activation functions introduce nonlinearity into neural networks. Without them, any stack of linear layers collapses to a single linear transformation.
| Function | Formula | Key Properties |
|---|---|---|
| ReLU | \(\max(0, z)\) | Simple, sparse, but "dying neuron" risk |
| Leaky ReLU | \(z\) if \(z > 0\), else \(\alpha z\) | Fixes dying ReLU with small negative slope |
| ELU | \(z\) if \(z > 0\), else \(\alpha(e^z - 1)\) | Smooth for \(z < 0\), pushes mean toward zero |
| Swish | \(z \cdot \sigma(\beta z)\) | Smooth, non-monotonic, self-gated |
| GELU | \(z \cdot \Phi(z)\) | Used in BERT/GPT; smooth approximation of ReLU |
| Tanh | \(\frac{e^z - e^{-z}}{e^z + e^{-z}}\) | Output in \((-1, 1)\), zero-centered |
where \(\sigma\) is the sigmoid function and \(\Phi\) is the standard normal CDF.
def relu(z):
"""ReLU: max(0, z)"""
return np.maximum(0, z)
def leaky_relu(z, alpha=0.01):
"""Leaky ReLU: z if z > 0, else alpha * z"""
return np.where(z > 0, z, alpha * z)
def elu(z, alpha=1.0):
"""ELU: z if z > 0, else alpha * (exp(z) - 1)"""
return np.where(z > 0, z, alpha * (np.exp(z) - 1))
def swish(z, beta=1.0):
"""Swish: z * sigmoid(beta * z)"""
sigmoid = 1.0 / (1.0 + np.exp(-beta * z))
return z * sigmoid
def gelu(z):
"""GELU approximation: 0.5 * z * (1 + tanh(sqrt(2/pi) * (z + 0.044715 * z^3)))"""
return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))
def tanh(z):
"""Tanh activation: (e^z - e^-z) / (e^z + e^-z)"""
return np.tanh(z)
# --- Quick check ---
test_z = np.array([-2, -1, 0, 1, 2], dtype=float)
print(f"z: {test_z}")
print(f"ReLU: {relu(test_z)}")
print(f"Leaky ReLU: {leaky_relu(test_z)}")
print(f"ELU: {elu(test_z).round(4)}")
print(f"Swish: {swish(test_z).round(4)}")
print(f"GELU: {gelu(test_z).round(4)}")
print(f"Tanh: {tanh(test_z).round(4)}")
def relu_derivative(z):
"""Derivative of ReLU: 1 if z > 0, else 0"""
return np.where(z > 0, 1.0, 0.0)
def tanh_derivative(z):
"""Derivative of tanh: 1 - tanh^2(z)"""
t = np.tanh(z)
return 1.0 - t ** 2
# --- Quick check ---
print(f"z: {test_z}")
print(f"ReLU': {relu_derivative(test_z)}")
print(f"Tanh': {tanh_derivative(test_z).round(4)}")
# Plot all activation functions on [-3, 3]
z = np.linspace(-3, 3, 300)
activations = {
"ReLU": relu(z),
"Leaky ReLU": leaky_relu(z, alpha=0.1), # exaggerated alpha for visibility
"ELU": elu(z),
"Swish": swish(z),
"GELU": gelu(z),
"Tanh": tanh(z),
}
fig, ax = plt.subplots(figsize=(10, 6))
colors = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00", "#a65628"]
for (name, values), color in zip(activations.items(), colors):
ax.plot(z, values, label=name, linewidth=2, color=color)
ax.axhline(y=0, color="gray", linewidth=0.5, linestyle="--")
ax.axvline(x=0, color="gray", linewidth=0.5, linestyle="--")
ax.set_xlabel("z")
ax.set_ylabel("activation(z)")
ax.set_title("Activation Functions Comparison")
ax.legend(loc="upper left")
ax.set_xlim(-3, 3)
ax.set_ylim(-2, 3)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Dying ReLU demo
# When a neuron's pre-activation is always negative, ReLU outputs 0 forever.
# The gradient is 0, so the weights never update -- the neuron is "dead."
np.random.seed(42)
# Simulate a single neuron: z = X @ w + b
# Use weights that push most inputs into negative territory
w = np.random.randn(10) - 1.5 # biased negative
b = -2.0 # large negative bias
X = np.random.randn(200, 10)
z = X @ w + b # pre-activations
relu_out = relu(z)
leaky_out = leaky_relu(z, alpha=0.01)
dead_relu = np.sum(relu_out == 0)
dead_leaky = np.sum(leaky_out == 0)
print(f"Total samples: {len(z)}")
print(f"Pre-activations < 0: {np.sum(z < 0)} ({100 * np.sum(z < 0) / len(z):.1f}%)")
print(f"ReLU dead outputs (exactly 0): {dead_relu} ({100 * dead_relu / len(z):.1f}%)")
print(f"Leaky ReLU dead outputs (exactly 0): {dead_leaky} ({100 * dead_leaky / len(z):.1f}%)")
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(z, bins=30, color="steelblue", edgecolor="black", alpha=0.7)
axes[0].axvline(0, color="red", linestyle="--", label="z=0")
axes[0].set_title("Pre-activation distribution")
axes[0].set_xlabel("z = Xw + b")
axes[0].legend()
axes[1].hist(relu_out, bins=30, color="coral", edgecolor="black", alpha=0.7)
axes[1].set_title(f"After ReLU ({dead_relu}/{len(z)} dead)")
axes[1].set_xlabel("ReLU(z)")
axes[2].hist(leaky_out, bins=30, color="mediumseagreen", edgecolor="black", alpha=0.7)
axes[2].set_title(f"After Leaky ReLU ({dead_leaky}/{len(z)} dead)")
axes[2].set_xlabel("LeakyReLU(z)")
for ax in axes:
ax.set_ylabel("Count")
ax.grid(True, alpha=0.3)
plt.suptitle("Dying ReLU Problem: Leaky ReLU Preserves Gradient Flow", fontsize=14)
plt.tight_layout()
plt.show()
PART 3: Perceptron
The perceptron is the simplest neural network, a single linear classifier:
\[\hat{y} = \text{sign}(\mathbf{w} \cdot \mathbf{x} + b)\]
Perceptron Learning Rule
For each training sample \((\mathbf{x}_i, y_i)\) where \(y_i \in \{-1, +1\}\):
Compute prediction: \(\hat{y}_i = \text{sign}(\mathbf{w} \cdot \mathbf{x}_i + b)\)
If misclassified (\(\hat{y}_i \neq y_i\)):
\(\mathbf{w} \leftarrow \mathbf{w} + \eta \cdot y_i \cdot \mathbf{x}_i\)
\(b \leftarrow b + \eta \cdot y_i\)
Convergence theorem: If the data is linearly separable, the perceptron algorithm converges in a finite number of steps; by Novikoff's bound, the number of updates is at most \((R/\gamma)^2\), where \(R\) bounds the input norms and \(\gamma\) is the separation margin.
Limitation: A single perceptron cannot solve problems that are not linearly separable (e.g., XOR).
class Perceptron:
"""Binary perceptron classifier with labels in {-1, +1}."""
def __init__(self, n_features, lr=1.0):
self.weights = np.zeros(n_features)
self.bias = 0.0
self.lr = lr
def predict(self, X):
"""
Predict class labels for samples in X.
Args:
X: array of shape (n_samples, n_features) or (n_features,)
Returns:
array of +1 or -1
"""
linear = X @ self.weights + self.bias
return np.where(linear >= 0, 1, -1)
def fit(self, X, y, n_epochs=100):
"""
Train the perceptron using the perceptron learning rule.
Args:
X: array of shape (n_samples, n_features)
y: array of shape (n_samples,) with values in {-1, +1}
n_epochs: maximum number of passes over the data
Returns:
list of error counts per epoch
"""
errors_per_epoch = []
for epoch in range(n_epochs):
errors = 0
for i in range(len(X)):
prediction = self.predict(X[i:i+1])[0]
if prediction != y[i]:
# Misclassified: update weights
self.weights += self.lr * y[i] * X[i]
self.bias += self.lr * y[i]
errors += 1
errors_per_epoch.append(errors)
if errors == 0:
print(f"Converged at epoch {epoch + 1}")
break
else:
print(f"Did not converge after {n_epochs} epochs (last errors: {errors})")
return errors_per_epoch
# Quick test
p = Perceptron(2)
X_test = np.array([[1, 1], [-1, -1]])
y_test = np.array([1, -1])
p.fit(X_test, y_test, n_epochs=10)
print(f"Weights: {p.weights}, Bias: {p.bias}")
print(f"Predictions: {p.predict(X_test)}")
# Train perceptron on linearly separable 2D data and plot decision boundary
np.random.seed(42)
# Generate two linearly separable clusters
n_per_class = 50
class_1 = np.random.randn(n_per_class, 2) + np.array([2, 2])
class_2 = np.random.randn(n_per_class, 2) + np.array([-2, -2])
X_train = np.vstack([class_1, class_2])
y_train = np.array([1] * n_per_class + [-1] * n_per_class)
# Train
perc = Perceptron(n_features=2, lr=0.1)
errors = perc.fit(X_train, y_train, n_epochs=100)
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Decision boundary
ax = axes[0]
ax.scatter(class_1[:, 0], class_1[:, 1], c="steelblue", label="Class +1", edgecolors="k", s=40)
ax.scatter(class_2[:, 0], class_2[:, 1], c="coral", label="Class -1", edgecolors="k", s=40)
# Decision boundary: w0*x0 + w1*x1 + b = 0 => x1 = -(w0*x0 + b) / w1
x_range = np.linspace(-5, 5, 200)
if perc.weights[1] != 0:
boundary = -(perc.weights[0] * x_range + perc.bias) / perc.weights[1]
ax.plot(x_range, boundary, "k-", linewidth=2, label="Decision boundary")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_title("Perceptron Decision Boundary")
ax.legend()
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.grid(True, alpha=0.3)
# Errors per epoch
ax = axes[1]
ax.plot(range(1, len(errors) + 1), errors, "o-", color="steelblue", markersize=4)
ax.set_xlabel("Epoch")
ax.set_ylabel("Misclassifications")
ax.set_title("Perceptron Training Convergence")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Report accuracy
preds = perc.predict(X_train)
accuracy = np.mean(preds == y_train) * 100
print(f"Training accuracy: {accuracy:.1f}%")
print(f"Learned weights: {perc.weights.round(4)}, bias: {perc.bias:.4f}")
# XOR Problem: Perceptron fails, but a 2-layer network succeeds
# XOR truth table
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([-1, 1, 1, -1]) # XOR
# --- Attempt 1: Single perceptron (will fail) ---
xor_perc = Perceptron(n_features=2, lr=0.1)
xor_errors = xor_perc.fit(X_xor, y_xor, n_epochs=200)
xor_preds = xor_perc.predict(X_xor)
print(f"\nSingle Perceptron on XOR:")
print(f" Predictions: {xor_preds}")
print(f" True labels: {y_xor}")
print(f" Accuracy: {np.mean(xor_preds == y_xor) * 100:.1f}%")
# --- Attempt 2: 2-layer network with manual weights ---
# XOR = AND(OR(x1, x2), NAND(x1, x2))
#
# Layer 1: two neurons
# Neuron A (OR-like): sign(x1 + x2 - 0.5) -> +1 if x1 OR x2
# Neuron B (NAND-like): sign(-x1 - x2 + 1.5) -> +1 if NOT (x1 AND x2)
#
# Layer 2: one neuron
# Neuron C (AND): sign(h1 + h2 - 1.5) -> +1 if h1 AND h2
def two_layer_xor(X):
"""Solve XOR with a manually-constructed 2-layer network."""
results = []
for x in X:
x1, x2 = x
# Layer 1
h1 = 1 if (x1 + x2 - 0.5) >= 0 else -1 # OR gate
h2 = 1 if (-x1 - x2 + 1.5) >= 0 else -1 # NAND gate
# Layer 2
out = 1 if (h1 + h2 - 1.5) >= 0 else -1 # AND gate
results.append(out)
return np.array(results)
two_layer_preds = two_layer_xor(X_xor)
print(f"\n2-Layer Network (manual weights) on XOR:")
print(f" Predictions: {two_layer_preds}")
print(f" True labels: {y_xor}")
print(f" Accuracy: {np.mean(two_layer_preds == y_xor) * 100:.1f}%")
# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Single perceptron attempt
ax = axes[0]
colors = ["coral" if y == -1 else "steelblue" for y in y_xor]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200, edgecolors="k", zorder=5)
for i, (x, y) in enumerate(zip(X_xor, y_xor)):
ax.annotate(f"({x[0]},{x[1]})={y:+d}", (x[0], x[1]),
textcoords="offset points", xytext=(10, 10), fontsize=10)
ax.set_title("Single Perceptron: FAILS on XOR\n(No single line separates the classes)")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.grid(True, alpha=0.3)
# 2-layer network
ax = axes[1]
# Show the two decision boundaries from layer 1
xx = np.linspace(-0.5, 1.5, 200)
# OR boundary: x1 + x2 = 0.5
ax.plot(xx, 0.5 - xx, "b--", linewidth=2, label="OR boundary")
# NAND boundary: x1 + x2 = 1.5
ax.plot(xx, 1.5 - xx, "r--", linewidth=2, label="NAND boundary")
# Shade the XOR=+1 region (between the two lines)
ax.fill_between(xx, 0.5 - xx, 1.5 - xx, alpha=0.15, color="green", label="XOR = +1 region")
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200, edgecolors="k", zorder=5)
for i, (x, y) in enumerate(zip(X_xor, y_xor)):
ax.annotate(f"({x[0]},{x[1]})={y:+d}", (x[0], x[1]),
textcoords="offset points", xytext=(10, 10), fontsize=10)
ax.set_title("2-Layer Network: SOLVES XOR\n(Two lines create the correct regions)")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.legend(loc="lower right", fontsize=9)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
PART 4: Performance Metrics
ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds:
\[\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}\]
The Area Under the ROC Curve (AUC) summarizes classifier performance:
AUC = 1.0: perfect classifier
AUC = 0.5: random guessing (diagonal line)
AUC < 0.5: worse than random
AUC can be computed using the trapezoidal rule over the ROC points \((x_i, y_i) = (\text{FPR}_i, \text{TPR}_i)\):
\[\text{AUC} \approx \sum_{i} (x_{i+1} - x_i)\,\frac{y_i + y_{i+1}}{2}\]
Top-k Accuracy
For multi-class problems, top-k accuracy counts a prediction as correct if the true label appears among the \(k\) highest-scoring classes:
\[\text{top-}k\ \text{accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[y_i \in \text{top-}k(\hat{\mathbf{p}}_i)\right]\]
def compute_roc(y_true, y_scores):
"""
Compute ROC curve: TPR vs FPR at every unique threshold.
Args:
y_true: binary labels (0 or 1), shape (N,)
y_scores: predicted scores/probabilities, shape (N,)
Returns:
fprs: list of false positive rates
tprs: list of true positive rates
"""
# Sort thresholds from high to low so we go from (0,0) to (1,1)
thresholds = np.sort(np.unique(y_scores))[::-1]
fprs = [0.0] # start at origin
tprs = [0.0]
total_positives = np.sum(y_true == 1)
total_negatives = np.sum(y_true == 0)
for thresh in thresholds:
y_pred = (y_scores >= thresh).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tpr = tp / total_positives if total_positives > 0 else 0.0
fpr = fp / total_negatives if total_negatives > 0 else 0.0
fprs.append(fpr)
tprs.append(tpr)
# Ensure we end at (1, 1)
if fprs[-1] != 1.0 or tprs[-1] != 1.0:
fprs.append(1.0)
tprs.append(1.0)
return np.array(fprs), np.array(tprs)
# Quick test
y_true_test = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_scores_test = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
fprs, tprs = compute_roc(y_true_test, y_scores_test)
print("FPRs:", fprs.round(3))
print("TPRs:", tprs.round(3))
def compute_auc(fprs, tprs):
"""
Compute Area Under the ROC Curve using the trapezoidal rule.
AUC = sum over consecutive points of (x_{i+1} - x_i) * (y_i + y_{i+1}) / 2
"""
auc = 0.0
for i in range(len(fprs) - 1):
dx = fprs[i + 1] - fprs[i]
avg_y = (tprs[i] + tprs[i + 1]) / 2.0
auc += dx * avg_y
return auc
# Test AUC
auc_val = compute_auc(fprs, tprs)
print(f"AUC: {auc_val:.4f}")
# Plot ROC curve for sample predictions
np.random.seed(42)
# Generate realistic predictions: a decent classifier, a weak classifier, and random
n_samples = 200
y_true_roc = np.random.randint(0, 2, size=n_samples)
# Good classifier: true positives get higher scores
scores_good = y_true_roc * 0.6 + np.random.rand(n_samples) * 0.4
# Weak classifier: noisy
scores_weak = y_true_roc * 0.3 + np.random.rand(n_samples) * 0.7
# Random classifier
scores_random = np.random.rand(n_samples)
fig, ax = plt.subplots(figsize=(8, 7))
for name, scores, color, ls in [
("Good Classifier", scores_good, "steelblue", "-"),
("Weak Classifier", scores_weak, "coral", "-"),
("Random", scores_random, "gray", "--"),
]:
fpr, tpr = compute_roc(y_true_roc, scores)
auc = compute_auc(fpr, tpr)
ax.plot(fpr, tpr, color=color, linewidth=2, linestyle=ls,
label=f"{name} (AUC = {auc:.3f})")
# Diagonal reference line
ax.plot([0, 1], [0, 1], "k:", linewidth=1, label="Chance (AUC = 0.5)")
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
ax.set_title("ROC Curves with AUC")
ax.legend(loc="lower right")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_aspect("equal")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def top_k_accuracy(y_true, y_probs, k=5):
"""
Compute top-k accuracy for multi-class classification.
Args:
y_true: integer labels, shape (N,)
y_probs: predicted probabilities, shape (N, C) for C classes
k: number of top predictions to consider
Returns:
float: fraction of samples where true label is in top-k predictions
"""
n_samples = len(y_true)
correct = 0
for i in range(n_samples):
# Get indices of k highest probabilities
top_k_classes = np.argsort(y_probs[i])[::-1][:k]
if y_true[i] in top_k_classes:
correct += 1
return correct / n_samples
# --- Demo with 10 classes ---
np.random.seed(42)
n_samples = 500
n_classes = 10
y_true_multi = np.random.randint(0, n_classes, size=n_samples)
# Create somewhat realistic predictions: true class gets a boost
y_probs_multi = np.random.rand(n_samples, n_classes)
for i in range(n_samples):
y_probs_multi[i, y_true_multi[i]] += 1.5 # boost true class
# Normalize to probabilities
y_probs_multi = y_probs_multi / y_probs_multi.sum(axis=1, keepdims=True)
for k in [1, 3, 5]:
acc = top_k_accuracy(y_true_multi, y_probs_multi, k=k)
print(f"Top-{k} Accuracy: {acc:.4f} ({acc * 100:.1f}%)")
PART 5: Pooling Operations
Pooling layers reduce the spatial dimensions of feature maps, providing:
Translation invariance: small shifts in the input do not change the output
Reduced computation: fewer parameters in subsequent layers
Regularization: reduces overfitting by discarding fine-grained spatial detail
Types of Pooling
| Type | Operation | Use Case |
|---|---|---|
| Max Pooling | Take the maximum value in each window | Most common; preserves strongest activations |
| Average Pooling | Take the mean of values in each window | Smoother; preserves overall magnitude |
| Global Average Pooling | Average over the entire spatial dimension | Replaces fully connected layers at the end of a CNN (e.g., ResNet) |
Output size for max/average pooling with pool size \(p\) and stride \(s\): \(\left(\left\lfloor \frac{H - p}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{W - p}{s} \right\rfloor + 1\right)\).
def max_pool_2d(feature_map, pool_size=2, stride=2):
"""
Apply 2D max pooling.
Args:
feature_map: 2D array of shape (H, W)
pool_size: size of the pooling window (square)
stride: step size between pooling windows
Returns:
2D array of shape ((H - pool_size) // stride + 1,
(W - pool_size) // stride + 1)
"""
H, W = feature_map.shape
out_H = (H - pool_size) // stride + 1
out_W = (W - pool_size) // stride + 1
output = np.zeros((out_H, out_W))
for i in range(out_H):
for j in range(out_W):
row_start = i * stride
col_start = j * stride
window = feature_map[row_start:row_start + pool_size,
col_start:col_start + pool_size]
output[i, j] = np.max(window)
return output
def avg_pool_2d(feature_map, pool_size=2, stride=2):
"""
Apply 2D average pooling.
Args:
feature_map: 2D array of shape (H, W)
pool_size: size of the pooling window (square)
stride: step size between pooling windows
Returns:
2D array of shape ((H - pool_size) // stride + 1,
(W - pool_size) // stride + 1)
"""
H, W = feature_map.shape
out_H = (H - pool_size) // stride + 1
out_W = (W - pool_size) // stride + 1
output = np.zeros((out_H, out_W))
for i in range(out_H):
for j in range(out_W):
row_start = i * stride
col_start = j * stride
window = feature_map[row_start:row_start + pool_size,
col_start:col_start + pool_size]
output[i, j] = np.mean(window)
return output
def global_avg_pool(feature_map):
"""
Global average pooling: average all spatial dimensions into a single value.
Args:
feature_map: 2D array of shape (H, W)
Returns:
float: the mean of all values
"""
return np.mean(feature_map)
# Quick check
test_fm = np.arange(1, 17, dtype=float).reshape(4, 4)
print("Feature map (4x4):")
print(test_fm)
print(f"\nMax pool (2x2, stride 2):\n{max_pool_2d(test_fm, 2, 2)}")
print(f"\nAvg pool (2x2, stride 2):\n{avg_pool_2d(test_fm, 2, 2)}")
print(f"\nGlobal avg pool: {global_avg_pool(test_fm)}")
# Demo pooling on a 6x6 feature map
np.random.seed(42)
feature_map_6x6 = np.random.randint(0, 10, size=(6, 6)).astype(float)
max_pooled = max_pool_2d(feature_map_6x6, pool_size=2, stride=2)
avg_pooled = avg_pool_2d(feature_map_6x6, pool_size=2, stride=2)
gap_value = global_avg_pool(feature_map_6x6)
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Original
im0 = axes[0].imshow(feature_map_6x6, cmap="YlOrRd", vmin=0, vmax=9)
axes[0].set_title(f"Original (6x6)")
for i in range(6):
for j in range(6):
axes[0].text(j, i, f"{feature_map_6x6[i, j]:.0f}",
ha="center", va="center", fontsize=11, fontweight="bold")
axes[0].set_xticks(range(6))
axes[0].set_yticks(range(6))
# Max pooled
im1 = axes[1].imshow(max_pooled, cmap="YlOrRd", vmin=0, vmax=9)
axes[1].set_title(f"Max Pool 2x2 (3x3)")
for i in range(3):
for j in range(3):
axes[1].text(j, i, f"{max_pooled[i, j]:.0f}",
ha="center", va="center", fontsize=14, fontweight="bold")
axes[1].set_xticks(range(3))
axes[1].set_yticks(range(3))
# Avg pooled
im2 = axes[2].imshow(avg_pooled, cmap="YlOrRd", vmin=0, vmax=9)
axes[2].set_title(f"Avg Pool 2x2 (3x3)")
for i in range(3):
for j in range(3):
axes[2].text(j, i, f"{avg_pooled[i, j]:.1f}",
ha="center", va="center", fontsize=14, fontweight="bold")
axes[2].set_xticks(range(3))
axes[2].set_yticks(range(3))
fig.suptitle(f"Pooling Operations (Global Avg Pool = {gap_value:.2f})", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print(f"Global Average Pooling of 6x6 map: {gap_value:.4f}")
PART 6: Loss Functions
Loss functions quantify how wrong a model's predictions are. The choice of loss function shapes the learning dynamics.
| Loss | Formula | Use Case |
|---|---|---|
| MSE | \(\frac{1}{N}\sum (y - \hat{y})^2\) | Regression; heavily penalizes large errors |
| MAE | \(\frac{1}{N}\sum \lvert y - \hat{y} \rvert\) | Regression; robust to outliers |
| Huber | MSE for small errors, MAE for large | Best of both worlds; controlled by \(\delta\) |
| Hinge | \(\max(0, 1 - y \cdot \hat{y})\) | SVM classification; \(y \in \{-1, +1\}\) |
Huber Loss (element-wise)
The Huber loss transitions smoothly from quadratic (MSE-like) for small errors to linear (MAE-like) for large errors, making it robust to outliers while still being differentiable everywhere.
def mse_loss(y_true, y_pred):
"""Mean Squared Error: (1/N) * sum((y_true - y_pred)^2)"""
return np.mean((y_true - y_pred) ** 2)
def mae_loss(y_true, y_pred):
"""Mean Absolute Error: (1/N) * sum(|y_true - y_pred|)"""
return np.mean(np.abs(y_true - y_pred))
def huber_loss(y_true, y_pred, delta=1.0):
"""
Huber loss: quadratic for small errors, linear for large errors.
L = 0.5 * e^2 if |e| <= delta
L = delta * (|e| - 0.5 * delta) otherwise
Returns the mean loss over all samples.
"""
error = y_true - y_pred
abs_error = np.abs(error)
quadratic = 0.5 * error ** 2
linear = delta * (abs_error - 0.5 * delta)
elementwise = np.where(abs_error <= delta, quadratic, linear)
return np.mean(elementwise)
def hinge_loss(y_true, y_pred):
"""
Hinge loss for binary classification: max(0, 1 - y_true * y_pred)
Labels y_true must be in {-1, +1}.
Returns the mean loss over all samples.
"""
return np.mean(np.maximum(0, 1.0 - y_true * y_pred))
# --- Quick check ---
y_t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_p = np.array([1.1, 2.5, 2.8, 4.2, 4.5])
print(f"MSE: {mse_loss(y_t, y_p):.4f}")
print(f"MAE: {mae_loss(y_t, y_p):.4f}")
print(f"Huber: {huber_loss(y_t, y_p, delta=1.0):.4f}")
# Hinge loss test (classification)
y_t_cls = np.array([1, -1, 1, -1, 1])
y_p_cls = np.array([0.8, -1.2, 0.3, 0.5, 1.5])
print(f"Hinge: {hinge_loss(y_t_cls, y_p_cls):.4f}")
# Plot all 4 loss functions vs error on the same axes
errors = np.linspace(-3, 3, 300)
# For MSE, MAE, Huber: loss as a function of error (y_true - y_pred)
mse_vals = errors ** 2 # element-wise MSE (not averaged)
mae_vals = np.abs(errors) # element-wise MAE
huber_vals = np.where(
np.abs(errors) <= 1.0,
0.5 * errors ** 2,
1.0 * (np.abs(errors) - 0.5)
)
# For hinge loss: loss as function of y*f(x) margin
# We plot on the same error axis for comparison, treating error as margin
hinge_vals = np.maximum(0, 1.0 - errors) # max(0, 1 - margin)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: regression losses
ax = axes[0]
ax.plot(errors, mse_vals, label="MSE: $e^2$", linewidth=2, color="#e41a1c")
ax.plot(errors, mae_vals, label="MAE: $|e|$", linewidth=2, color="#377eb8")
ax.plot(errors, huber_vals, label=r"Huber ($\delta$=1): smooth", linewidth=2,
color="#4daf4a", linestyle="--")
ax.axvline(0, color="gray", linewidth=0.5, linestyle="--")
ax.set_xlabel("Error ($y - \hat{y}$)")
ax.set_ylabel("Loss")
ax.set_title("Regression Losses vs Error")
ax.legend()
ax.set_xlim(-3, 3)
ax.set_ylim(0, 5)
ax.grid(True, alpha=0.3)
# Right plot: hinge loss vs margin
ax = axes[1]
margins = np.linspace(-2, 3, 300)
hinge_margin = np.maximum(0, 1.0 - margins)
ax.plot(margins, hinge_margin, label="Hinge: max(0, 1-margin)", linewidth=2,
color="#984ea3")
ax.axvline(1, color="gray", linewidth=0.5, linestyle=":", label="Margin = 1")
ax.fill_between(margins, 0, hinge_margin, alpha=0.1, color="#984ea3")
ax.set_xlabel("Margin ($y \\cdot \\hat{y}$)")
ax.set_ylabel("Loss")
ax.set_title("Hinge Loss vs Margin")
ax.legend()
ax.set_xlim(-2, 3)
ax.set_ylim(0, 3.5)
ax.grid(True, alpha=0.3)
plt.suptitle("Loss Functions Comparison", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
Summary
In this lab we built the core components of CNNs and deep learning from scratch using only NumPy:
Part 1: Convolution & Correlation
Implemented 1D and 2D cross-correlation (what DL calls "convolution")
Applied classic filters (edge detection, sharpen, blur) to a checkerboard image
Extended to strided convolution with zero-padding, matching the output-size formula
Part 2: Activation Functions
Implemented 6 activations: ReLU, Leaky ReLU, ELU, Swish, GELU, Tanh
Computed derivatives for ReLU and Tanh (essential for backpropagation)
Demonstrated the dying ReLU problem and how Leaky ReLU mitigates it
Part 3: Perceptron
Built a Perceptron class with the classic learning rule
Trained it on linearly separable data and plotted the decision boundary
Showed that a single perceptron fails on XOR and that a 2-layer network solves it (OR + NAND + AND decomposition)
Part 4: Performance Metrics
Implemented ROC curve computation from raw scores and labels
Computed AUC using the trapezoidal rule
Plotted ROC curves comparing good, weak, and random classifiers
Implemented top-k accuracy for multi-class evaluation
Part 5: Pooling Operations
Implemented max pooling, average pooling, and global average pooling
Visualized their effects on a 6x6 feature map
Part 6: Loss Functions
Implemented MSE, MAE, Huber, and Hinge loss functions
Plotted regression losses vs error and hinge loss vs margin
Key takeaway: Every "magical" operation in deep learning frameworks is just a combination of simple mathematical operations. Understanding these building blocks demystifies CNN architectures and makes debugging, tuning, and innovating much easier.