Lab 06: CNN Feature Extraction & Deep Learning Fundamentals

Source: Deep Learning Interviews by Shlomo Kashani (PDF: resources/2201.00650v2.pdf)
Chapters: CNN Feature Extraction, Deep Learning

This lab covers the foundational building blocks of Convolutional Neural Networks and deep learning:

  1. Convolution & Correlation – the core operations behind CNNs

  2. Activation Functions – nonlinearities that give networks their power

  3. Perceptron – the simplest neural network and its limitations

  4. Performance Metrics – ROC, AUC, top-k accuracy

  5. Pooling Operations – spatial down-sampling in CNNs

  6. Loss Functions – objectives that drive learning

All implementations use only NumPy and Matplotlib – no deep learning frameworks required.

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["font.size"] = 12
np.random.seed(42)

print("Imports ready.")

PART 1: Convolution & Correlation

What deep learning calls “convolution” is actually cross-correlation

In signal processing, true convolution flips the kernel before sliding it:

\[(\mathbf{f} * \mathbf{g})[n] = \sum_{k} f[k] \cdot g[n - k]\]

Cross-correlation does not flip the kernel:

\[(\mathbf{f} \star \mathbf{g})[n] = \sum_{k} f[k] \cdot g[n + k]\]

Deep learning frameworks (PyTorch, TensorFlow) implement cross-correlation but call it “convolution.” Since the kernel weights are learned, flipping is irrelevant – the network learns the appropriate (possibly flipped) weights.
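A quick NumPy check makes the distinction concrete: `np.correlate` slides the kernel as-is, while `np.convolve` flips it first, so convolving with a pre-flipped kernel reproduces the correlation result.

```python
import numpy as np

sig = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
ker = np.array([1.0, 0.0, -1.0])

corr = np.correlate(sig, ker, mode="valid")               # slides the kernel as-is
conv = np.convolve(sig, ker, mode="valid")                # flips the kernel first
conv_flipped = np.convolve(sig, ker[::-1], mode="valid")  # flip it back by hand

print(corr)  # [-2. -2. -2. -2. -2.]  <- what DL frameworks compute
print(conv)  # [ 2.  2.  2.  2.  2.]  <- true signal-processing convolution
assert np.allclose(corr, conv_flipped)
```

For this antisymmetric kernel the two results differ only in sign; for a learned kernel the distinction vanishes entirely, which is the point made above.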

Output size (valid mode, no padding)

For a 1D signal of length \(N\) and kernel of length \(K\):

\[\text{output length} = N - K + 1\]

For a 2D image of shape \((H, W)\) and kernel of shape \((k_H, k_W)\):

\[\text{output shape} = (H - k_H + 1, \; W - k_W + 1)\]

With stride \(s\) and padding \(p\):

\[\text{output size} = \left\lfloor \frac{H + 2p - k_H}{s} \right\rfloor + 1\]
def cross_correlation_1d(signal, kernel):
    """
    Compute 1D cross-correlation in valid mode (no padding).
    
    Args:
        signal: 1D numpy array of length N
        kernel: 1D numpy array of length K
    Returns:
        1D numpy array of length N - K + 1
    """
    n = len(signal)
    k = len(kernel)
    out_len = n - k + 1
    output = np.zeros(out_len)
    for i in range(out_len):
        output[i] = np.sum(signal[i:i + k] * kernel)
    return output


# --- Demo ---
signal = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
kernel = np.array([1, 0, -1], dtype=float)

result = cross_correlation_1d(signal, kernel)
print(f"Signal:  {signal}")
print(f"Kernel:  {kernel}")
print(f"Output:  {result}")
print(f"Expected length: {len(signal)} - {len(kernel)} + 1 = {len(signal) - len(kernel) + 1}")
print(f"Actual length:   {len(result)}")
def cross_correlation_2d(image, kernel):
    """
    Compute 2D cross-correlation in valid mode using nested loops.
    
    Args:
        image:  2D numpy array of shape (H, W)
        kernel: 2D numpy array of shape (kH, kW)
    Returns:
        2D numpy array of shape (H - kH + 1, W - kW + 1)
    """
    H, W = image.shape
    kH, kW = kernel.shape
    out_H = H - kH + 1
    out_W = W - kW + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            region = image[i:i + kH, j:j + kW]
            output[i, j] = np.sum(region * kernel)
    return output


# --- Demo ---
image = np.array([
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1]
], dtype=float)

kernel = np.array([
    [1, 0],
    [0, 1]
], dtype=float)

result = cross_correlation_2d(image, kernel)
print("Image (4x4):")
print(image)
print(f"\nKernel (2x2):\n{kernel}")
print(f"\nOutput shape: {result.shape}")
print(f"Output:\n{result}")
# Apply 4 classic filters to an 8x8 checkerboard pattern

# Create checkerboard
checkerboard = np.zeros((8, 8))
checkerboard[::2, ::2] = 1
checkerboard[1::2, 1::2] = 1

# Define filters
kernels = {
    "Horizontal Edge": np.array([[-1, -1, -1],
                                  [ 0,  0,  0],
                                  [ 1,  1,  1]], dtype=float),
    "Vertical Edge":   np.array([[-1, 0, 1],
                                  [-1, 0, 1],
                                  [-1, 0, 1]], dtype=float),
    "Sharpen":         np.array([[ 0, -1,  0],
                                  [-1,  5, -1],
                                  [ 0, -1,  0]], dtype=float),
    "Blur (Box)":      np.ones((3, 3), dtype=float) / 9.0,
}

# Apply and plot
fig, axes = plt.subplots(1, 5, figsize=(16, 3))

axes[0].imshow(checkerboard, cmap="gray", vmin=0, vmax=1)
axes[0].set_title("Original\nCheckerboard")
axes[0].axis("off")

for idx, (name, kernel) in enumerate(kernels.items(), start=1):
    filtered = cross_correlation_2d(checkerboard, kernel)
    axes[idx].imshow(filtered, cmap="gray")
    axes[idx].set_title(name)
    axes[idx].axis("off")

fig.suptitle("2D Cross-Correlation with Classic Filters", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
def conv2d_with_stride_padding(image, kernel, stride=1, padding=0):
    """
    2D cross-correlation with stride and zero-padding.
    
    Args:
        image:   2D array of shape (H, W)
        kernel:  2D array of shape (kH, kW)
        stride:  step size for sliding the kernel
        padding: number of zero-padding rows/cols on each side
    Returns:
        2D array of shape ((H + 2*pad - kH) // stride + 1,
                           (W + 2*pad - kW) // stride + 1)
    """
    # Step 1: zero-pad the image
    if padding > 0:
        padded = np.zeros((image.shape[0] + 2 * padding,
                           image.shape[1] + 2 * padding))
        padded[padding:-padding, padding:-padding] = image
    else:
        padded = image.copy()
    
    H, W = padded.shape
    kH, kW = kernel.shape
    
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    output = np.zeros((out_H, out_W))
    
    # Step 2: strided cross-correlation
    for i in range(out_H):
        for j in range(out_W):
            row_start = i * stride
            col_start = j * stride
            region = padded[row_start:row_start + kH, col_start:col_start + kW]
            output[i, j] = np.sum(region * kernel)
    
    return output


# --- Demo ---
image_5x5 = np.arange(1, 26, dtype=float).reshape(5, 5)
kernel_3x3 = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)

print("Image (5x5):")
print(image_5x5)
print(f"\nKernel (3x3):\n{kernel_3x3}")

# No padding, stride=1 (standard valid)
out1 = conv2d_with_stride_padding(image_5x5, kernel_3x3, stride=1, padding=0)
print(f"\nStride=1, Padding=0 -> shape {out1.shape}:")
print(out1)

# Padding=1, stride=1 ("same" convolution for 3x3 kernel)
out2 = conv2d_with_stride_padding(image_5x5, kernel_3x3, stride=1, padding=1)
print(f"\nStride=1, Padding=1 -> shape {out2.shape}:")
print(out2)

# Stride=2, padding=1
out3 = conv2d_with_stride_padding(image_5x5, kernel_3x3, stride=2, padding=1)
print(f"\nStride=2, Padding=1 -> shape {out3.shape}:")
print(out3)
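As a cross-check (a sketch using NumPy's `sliding_window_view`, not part of the lab's loop-based implementation), the same strided, padded cross-correlation can be vectorized, and its output shape matches the floor formula above:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

H = W = 5; kH = kW = 3; stride = 2; pad = 1
image = np.arange(1.0, 26.0).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

padded = np.pad(image, pad)                                 # zero-pad on all sides
windows = sliding_window_view(padded, (kH, kW))[::stride, ::stride]
out = np.einsum("ijkl,kl->ij", windows, kernel)             # sum(region * kernel) per window

expected_shape = ((H + 2 * pad - kH) // stride + 1,
                  (W + 2 * pad - kW) // stride + 1)
print(out.shape)  # (3, 3)
assert out.shape == expected_shape
```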

PART 2: Activation Functions

Activation functions introduce nonlinearity into neural networks. Without them, any stack of linear layers collapses to a single linear transformation.
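That collapse is easy to verify numerically: two weight matrices stacked with no nonlinearity in between are exactly one matrix product (a minimal sketch with made-up shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # "layer 1" weights (hypothetical shapes)
W2 = rng.standard_normal((2, 4))   # "layer 2" weights
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # two linear layers, no activation
one_layer = (W2 @ W1) @ x          # a single equivalent linear layer
assert np.allclose(two_layers, one_layer)

# Inserting a nonlinearity breaks the equivalence whenever any
# pre-activation is negative:
relu_stack = W2 @ np.maximum(0, W1 @ x)
print(np.allclose(relu_stack, one_layer))
```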

Function    | Formula                                     | Key Properties
----------- | ------------------------------------------- | --------------
ReLU        | \(\max(0, z)\)                              | Simple, sparse, but “dying neuron” risk
Leaky ReLU  | \(z\) if \(z > 0\), else \(\alpha z\)       | Fixes dying ReLU with small negative slope
ELU         | \(z\) if \(z > 0\), else \(\alpha(e^z - 1)\) | Smooth for \(z < 0\), pushes mean toward zero
Swish       | \(z \cdot \sigma(\beta z)\)                 | Smooth, non-monotonic, self-gated
GELU        | \(z \cdot \Phi(z)\)                         | Used in BERT/GPT; smooth approximation of ReLU
Tanh        | \(\frac{e^z - e^{-z}}{e^z + e^{-z}}\)       | Output in \((-1, 1)\), zero-centered

where \(\sigma\) is the sigmoid function and \(\Phi\) is the standard normal CDF.

def relu(z):
    """ReLU: max(0, z)"""
    return np.maximum(0, z)


def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z if z > 0, else alpha * z"""
    return np.where(z > 0, z, alpha * z)


def elu(z, alpha=1.0):
    """ELU: z if z > 0, else alpha * (exp(z) - 1)"""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))


def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z)"""
    sigmoid = 1.0 / (1.0 + np.exp(-beta * z))
    return z * sigmoid


def gelu(z):
    """GELU approximation: 0.5 * z * (1 + tanh(sqrt(2/pi) * (z + 0.044715 * z^3)))"""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))


def tanh(z):
    """Tanh activation: (e^z - e^-z) / (e^z + e^-z)"""
    return np.tanh(z)


# --- Quick check ---
test_z = np.array([-2, -1, 0, 1, 2], dtype=float)
print(f"z:          {test_z}")
print(f"ReLU:       {relu(test_z)}")
print(f"Leaky ReLU: {leaky_relu(test_z)}")
print(f"ELU:        {elu(test_z).round(4)}")
print(f"Swish:      {swish(test_z).round(4)}")
print(f"GELU:       {gelu(test_z).round(4)}")
print(f"Tanh:       {tanh(test_z).round(4)}")
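The `gelu` above is the widely used tanh approximation; the exact definition is \(z \cdot \Phi(z)\) with the standard normal CDF. A quick check (using `math.erf` from the standard library, so no extra dependencies) shows the two agree closely on \([-3, 3]\):

```python
import numpy as np
from math import erf

z = np.linspace(-3.0, 3.0, 61)
Phi = np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in z])  # standard normal CDF
gelu_exact = z * Phi
gelu_tanh = 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

max_err = np.max(np.abs(gelu_exact - gelu_tanh))
print(f"max |exact - tanh approx| on [-3, 3]: {max_err:.2e}")
assert max_err < 1e-2
```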
def relu_derivative(z):
    """Derivative of ReLU: 1 if z > 0, else 0"""
    return np.where(z > 0, 1.0, 0.0)


def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh^2(z)"""
    t = np.tanh(z)
    return 1.0 - t ** 2


# --- Quick check ---
print(f"z:             {test_z}")
print(f"ReLU':         {relu_derivative(test_z)}")
print(f"Tanh':         {tanh_derivative(test_z).round(4)}")
# Plot all activation functions on [-3, 3]

z = np.linspace(-3, 3, 300)

activations = {
    "ReLU":       relu(z),
    "Leaky ReLU": leaky_relu(z, alpha=0.1),  # exaggerated alpha for visibility
    "ELU":        elu(z),
    "Swish":      swish(z),
    "GELU":       gelu(z),
    "Tanh":       tanh(z),
}

fig, ax = plt.subplots(figsize=(10, 6))

colors = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00", "#a65628"]
for (name, values), color in zip(activations.items(), colors):
    ax.plot(z, values, label=name, linewidth=2, color=color)

ax.axhline(y=0, color="gray", linewidth=0.5, linestyle="--")
ax.axvline(x=0, color="gray", linewidth=0.5, linestyle="--")
ax.set_xlabel("z")
ax.set_ylabel("activation(z)")
ax.set_title("Activation Functions Comparison")
ax.legend(loc="upper left")
ax.set_xlim(-3, 3)
ax.set_ylim(-2, 3)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Dying ReLU demo
# When a neuron's pre-activation is always negative, ReLU outputs 0 forever.
# The gradient is 0, so the weights never update -- the neuron is "dead."

np.random.seed(42)

# Simulate a single neuron: z = X @ w + b
# Use weights that push most inputs into negative territory
w = np.random.randn(10) - 1.5  # biased negative
b = -2.0                        # large negative bias
X = np.random.randn(200, 10)

z = X @ w + b  # pre-activations

relu_out = relu(z)
leaky_out = leaky_relu(z, alpha=0.01)

dead_relu = np.sum(relu_out == 0)
dead_leaky = np.sum(leaky_out == 0)

print(f"Total samples: {len(z)}")
print(f"Pre-activations < 0: {np.sum(z < 0)} ({100 * np.sum(z < 0) / len(z):.1f}%)")
print(f"ReLU dead outputs (exactly 0):       {dead_relu} ({100 * dead_relu / len(z):.1f}%)")
print(f"Leaky ReLU dead outputs (exactly 0): {dead_leaky} ({100 * dead_leaky / len(z):.1f}%)")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].hist(z, bins=30, color="steelblue", edgecolor="black", alpha=0.7)
axes[0].axvline(0, color="red", linestyle="--", label="z=0")
axes[0].set_title("Pre-activation distribution")
axes[0].set_xlabel("z = Xw + b")
axes[0].legend()

axes[1].hist(relu_out, bins=30, color="coral", edgecolor="black", alpha=0.7)
axes[1].set_title(f"After ReLU ({dead_relu}/{len(z)} dead)")
axes[1].set_xlabel("ReLU(z)")

axes[2].hist(leaky_out, bins=30, color="mediumseagreen", edgecolor="black", alpha=0.7)
axes[2].set_title(f"After Leaky ReLU ({dead_leaky}/{len(z)} dead)")
axes[2].set_xlabel("LeakyReLU(z)")

for ax in axes:
    ax.set_ylabel("Count")
    ax.grid(True, alpha=0.3)

plt.suptitle("Dying ReLU Problem: Leaky ReLU Preserves Gradient Flow", fontsize=14)
plt.tight_layout()
plt.show()

PART 3: Perceptron

The perceptron is the simplest neural network – a single linear classifier:

\[y = \text{sign}(\mathbf{w} \cdot \mathbf{x} + b)\]

Perceptron Learning Rule

For each training sample \((\mathbf{x}_i, y_i)\) where \(y_i \in \{-1, +1\}\):

  1. Compute prediction: \(\hat{y}_i = \text{sign}(\mathbf{w} \cdot \mathbf{x}_i + b)\)

  2. If misclassified (\(\hat{y}_i \neq y_i\)):

    • \(\mathbf{w} \leftarrow \mathbf{w} + \eta \cdot y_i \cdot \mathbf{x}_i\)

    • \(b \leftarrow b + \eta \cdot y_i\)

Convergence theorem: If the data is linearly separable, the perceptron algorithm converges in a finite number of steps.

Limitation: A single perceptron cannot solve problems that are not linearly separable (e.g., XOR).
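The convergence theorem can be made quantitative: Novikoff's bound states that the number of updates is at most \((R/\gamma)^2\), where \(R\) is the largest sample norm and \(\gamma\) the margin of some separating hyperplane. A hedged sketch on a hand-made separable set (the separator \(\mathbf{w}^* = (1,1)/\sqrt{2}\), \(b = 0\) is chosen by construction, so the bias-free update suffices):

```python
import numpy as np

# Hand-made linearly separable toy set, separated by w* = (1,1)/sqrt(2), b = 0.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)
gamma = np.min(y * (X @ w_star))         # margin of the known separator
R = np.max(np.linalg.norm(X, axis=1))    # radius of the data
bound = (R / gamma) ** 2                 # Novikoff bound on the number of updates

w = np.zeros(2)
updates = 0
for _ in range(10):                      # epochs; converges long before 10
    for xi, yi in zip(X, y):
        if yi * (xi @ w) <= 0:           # misclassified (ties count as errors)
            w += yi * xi
            updates += 1

print(f"updates = {updates}, Novikoff bound = {bound:.2f}")
assert updates <= bound
```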

class Perceptron:
    """Binary perceptron classifier with labels in {-1, +1}."""
    
    def __init__(self, n_features, lr=1.0):
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.lr = lr
    
    def predict(self, X):
        """
        Predict class labels for samples in X.
        
        Args:
            X: array of shape (n_samples, n_features) or (n_features,)
        Returns:
            array of +1 or -1
        """
        linear = X @ self.weights + self.bias
        return np.where(linear >= 0, 1, -1)
    
    def fit(self, X, y, n_epochs=100):
        """
        Train the perceptron using the perceptron learning rule.
        
        Args:
            X: array of shape (n_samples, n_features)
            y: array of shape (n_samples,) with values in {-1, +1}
            n_epochs: maximum number of passes over the data
        Returns:
            list of error counts per epoch
        """
        errors_per_epoch = []
        for epoch in range(n_epochs):
            errors = 0
            for i in range(len(X)):
                prediction = self.predict(X[i:i+1])[0]
                if prediction != y[i]:
                    # Misclassified: update weights
                    self.weights += self.lr * y[i] * X[i]
                    self.bias += self.lr * y[i]
                    errors += 1
            errors_per_epoch.append(errors)
            if errors == 0:
                print(f"Converged at epoch {epoch + 1}")
                break
        else:
            print(f"Did not converge after {n_epochs} epochs (last errors: {errors})")
        return errors_per_epoch


# Quick test
p = Perceptron(2)
X_test = np.array([[1, 1], [-1, -1]])
y_test = np.array([1, -1])
p.fit(X_test, y_test, n_epochs=10)
print(f"Weights: {p.weights}, Bias: {p.bias}")
print(f"Predictions: {p.predict(X_test)}")
# Train perceptron on linearly separable 2D data and plot decision boundary

np.random.seed(42)

# Generate two linearly separable clusters
n_per_class = 50
class_1 = np.random.randn(n_per_class, 2) + np.array([2, 2])
class_2 = np.random.randn(n_per_class, 2) + np.array([-2, -2])
X_train = np.vstack([class_1, class_2])
y_train = np.array([1] * n_per_class + [-1] * n_per_class)

# Train
perc = Perceptron(n_features=2, lr=0.1)
errors = perc.fit(X_train, y_train, n_epochs=100)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Decision boundary
ax = axes[0]
ax.scatter(class_1[:, 0], class_1[:, 1], c="steelblue", label="Class +1", edgecolors="k", s=40)
ax.scatter(class_2[:, 0], class_2[:, 1], c="coral", label="Class -1", edgecolors="k", s=40)

# Decision boundary: w0*x0 + w1*x1 + b = 0  =>  x1 = -(w0*x0 + b) / w1
x_range = np.linspace(-5, 5, 200)
if perc.weights[1] != 0:
    boundary = -(perc.weights[0] * x_range + perc.bias) / perc.weights[1]
    ax.plot(x_range, boundary, "k-", linewidth=2, label="Decision boundary")

ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_title("Perceptron Decision Boundary")
ax.legend()
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
ax.grid(True, alpha=0.3)

# Errors per epoch
ax = axes[1]
ax.plot(range(1, len(errors) + 1), errors, "o-", color="steelblue", markersize=4)
ax.set_xlabel("Epoch")
ax.set_ylabel("Misclassifications")
ax.set_title("Perceptron Training Convergence")
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Report accuracy
preds = perc.predict(X_train)
accuracy = np.mean(preds == y_train) * 100
print(f"Training accuracy: {accuracy:.1f}%")
print(f"Learned weights: {perc.weights.round(4)}, bias: {perc.bias:.4f}")
# XOR Problem: Perceptron fails, but a 2-layer network succeeds

# XOR truth table
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([-1, 1, 1, -1])  # XOR

# --- Attempt 1: Single perceptron (will fail) ---
xor_perc = Perceptron(n_features=2, lr=0.1)
xor_errors = xor_perc.fit(X_xor, y_xor, n_epochs=200)
xor_preds = xor_perc.predict(X_xor)
print(f"\nSingle Perceptron on XOR:")
print(f"  Predictions: {xor_preds}")
print(f"  True labels:  {y_xor}")
print(f"  Accuracy: {np.mean(xor_preds == y_xor) * 100:.1f}%")

# --- Attempt 2: 2-layer network with manual weights ---
# XOR = AND(OR(x1, x2), NAND(x1, x2))
#
# Layer 1: two neurons
#   Neuron A (OR-like):   sign(x1 + x2 - 0.5)   -> +1 if x1 OR x2
#   Neuron B (NAND-like): sign(-x1 - x2 + 1.5)   -> +1 if NOT (x1 AND x2)
#
# Layer 2: one neuron
#   Neuron C (AND):       sign(h1 + h2 - 1.5)     -> +1 if h1 AND h2

def two_layer_xor(X):
    """Solve XOR with a manually-constructed 2-layer network."""
    results = []
    for x in X:
        x1, x2 = x
        # Layer 1
        h1 = 1 if (x1 + x2 - 0.5) >= 0 else -1    # OR gate
        h2 = 1 if (-x1 - x2 + 1.5) >= 0 else -1    # NAND gate
        # Layer 2
        out = 1 if (h1 + h2 - 1.5) >= 0 else -1     # AND gate
        results.append(out)
    return np.array(results)

two_layer_preds = two_layer_xor(X_xor)
print(f"\n2-Layer Network (manual weights) on XOR:")
print(f"  Predictions: {two_layer_preds}")
print(f"  True labels:  {y_xor}")
print(f"  Accuracy: {np.mean(two_layer_preds == y_xor) * 100:.1f}%")

# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Single perceptron attempt
ax = axes[0]
colors = ["coral" if y == -1 else "steelblue" for y in y_xor]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200, edgecolors="k", zorder=5)
for i, (x, y) in enumerate(zip(X_xor, y_xor)):
    ax.annotate(f"({x[0]},{x[1]})={y:+d}", (x[0], x[1]),
                textcoords="offset points", xytext=(10, 10), fontsize=10)
ax.set_title("Single Perceptron: FAILS on XOR\n(No single line separates the classes)")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.grid(True, alpha=0.3)

# 2-layer network
ax = axes[1]
# Show the two decision boundaries from layer 1
xx = np.linspace(-0.5, 1.5, 200)
# OR boundary: x1 + x2 = 0.5
ax.plot(xx, 0.5 - xx, "b--", linewidth=2, label="OR boundary")
# NAND boundary: x1 + x2 = 1.5
ax.plot(xx, 1.5 - xx, "r--", linewidth=2, label="NAND boundary")
# Shade the XOR=+1 region (between the two lines)
ax.fill_between(xx, 0.5 - xx, 1.5 - xx, alpha=0.15, color="green", label="XOR = +1 region")
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200, edgecolors="k", zorder=5)
for i, (x, y) in enumerate(zip(X_xor, y_xor)):
    ax.annotate(f"({x[0]},{x[1]})={y:+d}", (x[0], x[1]),
                textcoords="offset points", xytext=(10, 10), fontsize=10)
ax.set_title("2-Layer Network: SOLVES XOR\n(Two lines create the correct regions)")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.legend(loc="lower right", fontsize=9)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

PART 4: Performance Metrics

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds:

\[\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}\]

The Area Under the ROC Curve (AUC) summarizes classifier performance:

  • AUC = 1.0: perfect classifier

  • AUC = 0.5: random guessing (diagonal line)

  • AUC < 0.5: worse than random

AUC can be computed using the trapezoidal rule:

\[\text{AUC} = \sum_{i=1}^{n-1} \frac{(x_{i+1} - x_i)(y_i + y_{i+1})}{2}\]

Top-k Accuracy

For multi-class problems, top-k accuracy counts a prediction as correct if the true label appears among the \(k\) highest-scoring classes:

\[\text{Top-}k\text{ Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[y_i \in \text{top-}k(\hat{y}_i)]\]
def compute_roc(y_true, y_scores):
    """
    Compute ROC curve: TPR vs FPR at every unique threshold.
    
    Args:
        y_true:   binary labels (0 or 1), shape (N,)
        y_scores: predicted scores/probabilities, shape (N,)
    Returns:
        fprs: list of false positive rates
        tprs: list of true positive rates
    """
    # Sort thresholds from high to low so we go from (0,0) to (1,1)
    thresholds = np.sort(np.unique(y_scores))[::-1]
    
    fprs = [0.0]  # start at origin
    tprs = [0.0]
    
    total_positives = np.sum(y_true == 1)
    total_negatives = np.sum(y_true == 0)
    
    for thresh in thresholds:
        y_pred = (y_scores >= thresh).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        
        tpr = tp / total_positives if total_positives > 0 else 0.0
        fpr = fp / total_negatives if total_negatives > 0 else 0.0
        
        fprs.append(fpr)
        tprs.append(tpr)
    
    # Ensure we end at (1, 1)
    if fprs[-1] != 1.0 or tprs[-1] != 1.0:
        fprs.append(1.0)
        tprs.append(1.0)
    
    return np.array(fprs), np.array(tprs)


# Quick test
y_true_test = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_scores_test = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
fprs, tprs = compute_roc(y_true_test, y_scores_test)
print("FPRs:", fprs.round(3))
print("TPRs:", tprs.round(3))
def compute_auc(fprs, tprs):
    """
    Compute Area Under the ROC Curve using the trapezoidal rule.
    
    AUC = sum over consecutive points of (x_{i+1} - x_i) * (y_i + y_{i+1}) / 2
    """
    auc = 0.0
    for i in range(len(fprs) - 1):
        dx = fprs[i + 1] - fprs[i]
        avg_y = (tprs[i] + tprs[i + 1]) / 2.0
        auc += dx * avg_y
    return auc


# Test AUC
auc_val = compute_auc(fprs, tprs)
print(f"AUC: {auc_val:.4f}")
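The same trapezoidal sum can be written without the Python loop using `np.diff` (a sketch on a tiny hand-made ROC polyline):

```python
import numpy as np

fprs = np.array([0.0, 0.0, 0.25, 0.5, 1.0])   # hand-made ROC points
tprs = np.array([0.0, 0.5, 0.75, 1.0, 1.0])

# Trapezoidal rule, vectorized: sum of (width) * (average height)
auc = np.sum(np.diff(fprs) * (tprs[:-1] + tprs[1:]) / 2.0)
print(auc)  # 0.875
assert abs(auc - 0.875) < 1e-12
```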
# Plot ROC curve for sample predictions

np.random.seed(42)

# Generate realistic predictions: a decent classifier, a weak classifier, and random
n_samples = 200
y_true_roc = np.random.randint(0, 2, size=n_samples)

# Good classifier: true positives get higher scores
scores_good = y_true_roc * 0.6 + np.random.rand(n_samples) * 0.4
# Weak classifier: noisy
scores_weak = y_true_roc * 0.3 + np.random.rand(n_samples) * 0.7
# Random classifier
scores_random = np.random.rand(n_samples)

fig, ax = plt.subplots(figsize=(8, 7))

for name, scores, color, ls in [
    ("Good Classifier", scores_good, "steelblue", "-"),
    ("Weak Classifier", scores_weak, "coral", "-"),
    ("Random",          scores_random, "gray", "--"),
]:
    fpr, tpr = compute_roc(y_true_roc, scores)
    auc = compute_auc(fpr, tpr)
    ax.plot(fpr, tpr, color=color, linewidth=2, linestyle=ls,
            label=f"{name} (AUC = {auc:.3f})")

# Diagonal reference line
ax.plot([0, 1], [0, 1], "k:", linewidth=1, label="Chance (AUC = 0.5)")

ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
ax.set_title("ROC Curves with AUC")
ax.legend(loc="lower right")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_aspect("equal")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def top_k_accuracy(y_true, y_probs, k=5):
    """
    Compute top-k accuracy for multi-class classification.
    
    Args:
        y_true:  integer labels, shape (N,)
        y_probs: predicted probabilities, shape (N, C) for C classes
        k:       number of top predictions to consider
    Returns:
        float: fraction of samples where true label is in top-k predictions
    """
    n_samples = len(y_true)
    correct = 0
    for i in range(n_samples):
        # Get indices of k highest probabilities
        top_k_classes = np.argsort(y_probs[i])[::-1][:k]
        if y_true[i] in top_k_classes:
            correct += 1
    return correct / n_samples


# --- Demo with 10 classes ---
np.random.seed(42)
n_samples = 500
n_classes = 10

y_true_multi = np.random.randint(0, n_classes, size=n_samples)

# Create somewhat realistic predictions: true class gets a boost
y_probs_multi = np.random.rand(n_samples, n_classes)
for i in range(n_samples):
    y_probs_multi[i, y_true_multi[i]] += 1.5  # boost true class
# Normalize to probabilities
y_probs_multi = y_probs_multi / y_probs_multi.sum(axis=1, keepdims=True)

for k in [1, 3, 5]:
    acc = top_k_accuracy(y_true_multi, y_probs_multi, k=k)
    print(f"Top-{k} Accuracy: {acc:.4f} ({acc * 100:.1f}%)")
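The loop in `top_k_accuracy` can also be vectorized with `np.argpartition`, which finds the k largest entries per row without a full sort (a sketch on synthetic scores where the true class is boosted enough to always be the row maximum):

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, k = 100, 10, 3
y = rng.integers(0, C, size=n)
probs = rng.random((n, C))
probs[np.arange(n), y] += 1.5          # boost makes the true class the row maximum
probs /= probs.sum(axis=1, keepdims=True)

topk = np.argpartition(probs, -k, axis=1)[:, -k:]   # indices of the k largest per row
acc = np.mean((topk == y[:, None]).any(axis=1))
print(acc)  # 1.0 -- the boosted true class always lands in the top k
assert acc == 1.0
```

`argpartition` is O(n) per row versus O(n log n) for `argsort`, which matters when the number of classes is large.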

PART 5: Pooling Operations

Pooling layers reduce the spatial dimensions of feature maps, providing:

  • Translation invariance – small shifts in the input do not change the output

  • Reduced computation – smaller feature maps mean less work downstream (and fewer parameters in any following fully connected layer)

  • Regularization – reduces overfitting by discarding fine-grained spatial detail

Types of Pooling

Type                   | Operation                                 | Use Case
---------------------- | ----------------------------------------- | --------
Max Pooling            | Take the maximum value in each window     | Most common; preserves strongest activations
Average Pooling        | Take the mean of values in each window    | Smoother; preserves overall magnitude
Global Average Pooling | Average over the entire spatial dimension | Replaces fully connected layers at the end of a CNN (e.g., ResNet)

Output size for max/average pooling with pool size \(p\) and stride \(s\):

\[\text{output size} = \left\lfloor \frac{H - p}{s} \right\rfloor + 1\]
def max_pool_2d(feature_map, pool_size=2, stride=2):
    """
    Apply 2D max pooling.
    
    Args:
        feature_map: 2D array of shape (H, W)
        pool_size:   size of the pooling window (square)
        stride:      step size between pooling windows
    Returns:
        2D array of shape ((H - pool_size) // stride + 1,
                           (W - pool_size) // stride + 1)
    """
    H, W = feature_map.shape
    out_H = (H - pool_size) // stride + 1
    out_W = (W - pool_size) // stride + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            row_start = i * stride
            col_start = j * stride
            window = feature_map[row_start:row_start + pool_size,
                                 col_start:col_start + pool_size]
            output[i, j] = np.max(window)
    return output


def avg_pool_2d(feature_map, pool_size=2, stride=2):
    """
    Apply 2D average pooling.
    
    Args:
        feature_map: 2D array of shape (H, W)
        pool_size:   size of the pooling window (square)
        stride:      step size between pooling windows
    Returns:
        2D array of shape ((H - pool_size) // stride + 1,
                           (W - pool_size) // stride + 1)
    """
    H, W = feature_map.shape
    out_H = (H - pool_size) // stride + 1
    out_W = (W - pool_size) // stride + 1
    output = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            row_start = i * stride
            col_start = j * stride
            window = feature_map[row_start:row_start + pool_size,
                                 col_start:col_start + pool_size]
            output[i, j] = np.mean(window)
    return output


def global_avg_pool(feature_map):
    """
    Global average pooling: average all spatial dimensions into a single value.
    
    Args:
        feature_map: 2D array of shape (H, W)
    Returns:
        float: the mean of all values
    """
    return np.mean(feature_map)


# Quick check
test_fm = np.arange(1, 17, dtype=float).reshape(4, 4)
print("Feature map (4x4):")
print(test_fm)
print(f"\nMax pool (2x2, stride 2):\n{max_pool_2d(test_fm, 2, 2)}")
print(f"\nAvg pool (2x2, stride 2):\n{avg_pool_2d(test_fm, 2, 2)}")
print(f"\nGlobal avg pool: {global_avg_pool(test_fm)}")
# Demo pooling on a 6x6 feature map

np.random.seed(42)
feature_map_6x6 = np.random.randint(0, 10, size=(6, 6)).astype(float)

max_pooled = max_pool_2d(feature_map_6x6, pool_size=2, stride=2)
avg_pooled = avg_pool_2d(feature_map_6x6, pool_size=2, stride=2)
gap_value = global_avg_pool(feature_map_6x6)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Original
im0 = axes[0].imshow(feature_map_6x6, cmap="YlOrRd", vmin=0, vmax=9)
axes[0].set_title(f"Original (6x6)")
for i in range(6):
    for j in range(6):
        axes[0].text(j, i, f"{feature_map_6x6[i, j]:.0f}",
                     ha="center", va="center", fontsize=11, fontweight="bold")
axes[0].set_xticks(range(6))
axes[0].set_yticks(range(6))

# Max pooled
im1 = axes[1].imshow(max_pooled, cmap="YlOrRd", vmin=0, vmax=9)
axes[1].set_title(f"Max Pool 2x2 (3x3)")
for i in range(3):
    for j in range(3):
        axes[1].text(j, i, f"{max_pooled[i, j]:.0f}",
                     ha="center", va="center", fontsize=14, fontweight="bold")
axes[1].set_xticks(range(3))
axes[1].set_yticks(range(3))

# Avg pooled
im2 = axes[2].imshow(avg_pooled, cmap="YlOrRd", vmin=0, vmax=9)
axes[2].set_title(f"Avg Pool 2x2 (3x3)")
for i in range(3):
    for j in range(3):
        axes[2].text(j, i, f"{avg_pooled[i, j]:.1f}",
                     ha="center", va="center", fontsize=14, fontweight="bold")
axes[2].set_xticks(range(3))
axes[2].set_yticks(range(3))

fig.suptitle(f"Pooling Operations (Global Avg Pool = {gap_value:.2f})", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print(f"Global Average Pooling of 6x6 map: {gap_value:.4f}")
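The translation-invariance claim from the start of this part can be probed directly: as long as a shift keeps the maximum inside the same pooling window, max pooling's output is unchanged (a minimal sketch with a single hot pixel; the helper mirrors the semantics of `max_pool_2d` above):

```python
import numpy as np

def max_pool(x, p=2, s=2):
    """Tiny max pooling helper (same windowing as max_pool_2d)."""
    H, W = x.shape
    return np.array([[x[i*s:i*s+p, j*s:j*s+p].max()
                      for j in range((W - p) // s + 1)]
                     for i in range((H - p) // s + 1)])

fm = np.zeros((4, 4)); fm[0, 0] = 1.0            # hot pixel at (0, 0)
shifted = np.zeros((4, 4)); shifted[1, 1] = 1.0  # shifted, but still in window (0, 0)
moved = np.zeros((4, 4)); moved[0, 2] = 1.0      # shifted into the next window

assert np.array_equal(max_pool(fm), max_pool(shifted))    # output unchanged
assert not np.array_equal(max_pool(fm), max_pool(moved))  # invariance is only local
```

This is why the invariance is described as holding for "small" shifts: a shift that crosses a window boundary does change the pooled output.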

PART 6: Loss Functions

Loss functions quantify how wrong a model's predictions are. The choice of loss function shapes the learning dynamics.

Loss  | Formula                                            | Use Case
----- | -------------------------------------------------- | --------
MSE   | \(\frac{1}{N}\sum(y - \hat{y})^2\)                 | Regression; heavily penalizes large errors
MAE   | \(\frac{1}{N}\sum \lvert y - \hat{y} \rvert\)      | Regression; robust to outliers
Huber | MSE for small errors, MAE for large                | Best of both worlds; controlled by \(\delta\)
Hinge | \(\max(0, 1 - y \cdot \hat{y})\)                   | SVM classification; \(y \in \{-1, +1\}\)

Huber Loss (element-wise)

\[\begin{split}L_\delta(e) = \begin{cases} \frac{1}{2}e^2 & \text{if } |e| \leq \delta \\ \delta \cdot (|e| - \frac{1}{2}\delta) & \text{otherwise} \end{cases}\end{split}\]

The Huber loss transitions smoothly from quadratic (MSE-like) for small errors to linear (MAE-like) for large errors, making it robust to outliers while still being differentiable everywhere.

def mse_loss(y_true, y_pred):
    """Mean Squared Error: (1/N) * sum((y_true - y_pred)^2)"""
    return np.mean((y_true - y_pred) ** 2)


def mae_loss(y_true, y_pred):
    """Mean Absolute Error: (1/N) * sum(|y_true - y_pred|)"""
    return np.mean(np.abs(y_true - y_pred))


def huber_loss(y_true, y_pred, delta=1.0):
    """
    Huber loss: quadratic for small errors, linear for large errors.
    
    L = 0.5 * e^2           if |e| <= delta
    L = delta * (|e| - 0.5 * delta)  otherwise
    
    Returns the mean loss over all samples.
    """
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    elementwise = np.where(abs_error <= delta, quadratic, linear)
    return np.mean(elementwise)


def hinge_loss(y_true, y_pred):
    """
    Hinge loss for binary classification: max(0, 1 - y_true * y_pred)
    Labels y_true must be in {-1, +1}.
    Returns the mean loss over all samples.
    """
    return np.mean(np.maximum(0, 1.0 - y_true * y_pred))


# --- Quick check ---
y_t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_p = np.array([1.1, 2.5, 2.8, 4.2, 4.5])

print(f"MSE:   {mse_loss(y_t, y_p):.4f}")
print(f"MAE:   {mae_loss(y_t, y_p):.4f}")
print(f"Huber: {huber_loss(y_t, y_p, delta=1.0):.4f}")

# Hinge loss test (classification)
y_t_cls = np.array([1, -1, 1, -1, 1])
y_p_cls = np.array([0.8, -1.2, 0.3, 0.5, 1.5])
print(f"Hinge: {hinge_loss(y_t_cls, y_p_cls):.4f}")
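One reason Huber is preferred over raw MAE is that its derivative is continuous at the transition \(|e| = \delta\): the quadratic branch's gradient \(e\) meets the linear branch's \(\delta \cdot \text{sign}(e)\) exactly. A numerical check with central differences (element-wise Huber, same formula as `huber_loss` without the mean):

```python
import numpy as np

def huber_elem(e, delta=1.0):
    """Element-wise Huber loss (no averaging)."""
    return np.where(np.abs(e) <= delta,
                    0.5 * e**2,
                    delta * (np.abs(e) - 0.5 * delta))

eps = 1e-6
for e0 in (0.5, 1.0, 2.0):   # inside, at, and beyond the transition
    num_grad = (huber_elem(e0 + eps) - huber_elem(e0 - eps)) / (2 * eps)
    analytic = e0 if abs(e0) <= 1.0 else np.sign(e0) * 1.0
    assert abs(num_grad - analytic) < 1e-4
print("Huber gradient matches its analytic form across |e| = delta")
```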
# Plot all 4 loss functions vs error on the same axes

errors = np.linspace(-3, 3, 300)

# For MSE, MAE, Huber: loss as a function of error (y_true - y_pred)
mse_vals = errors ** 2                     # element-wise MSE (not averaged)
mae_vals = np.abs(errors)                  # element-wise MAE
huber_vals = np.where(
    np.abs(errors) <= 1.0,
    0.5 * errors ** 2,
    1.0 * (np.abs(errors) - 0.5)
)

# For hinge loss: loss as function of y*f(x) margin
# We plot on the same error axis for comparison, treating error as margin
hinge_vals = np.maximum(0, 1.0 - errors)   # max(0, 1 - margin)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: regression losses
ax = axes[0]
ax.plot(errors, mse_vals, label="MSE: $e^2$", linewidth=2, color="#e41a1c")
ax.plot(errors, mae_vals, label="MAE: $|e|$", linewidth=2, color="#377eb8")
ax.plot(errors, huber_vals, label=r"Huber ($\delta$=1): smooth", linewidth=2,
        color="#4daf4a", linestyle="--")
ax.axvline(0, color="gray", linewidth=0.5, linestyle="--")
ax.set_xlabel(r"Error ($y - \hat{y}$)")  # raw string: avoids the invalid \h escape
ax.set_ylabel("Loss")
ax.set_title("Regression Losses vs Error")
ax.legend()
ax.set_xlim(-3, 3)
ax.set_ylim(0, 5)
ax.grid(True, alpha=0.3)

# Right plot: hinge loss vs margin
ax = axes[1]
margins = np.linspace(-2, 3, 300)
hinge_margin = np.maximum(0, 1.0 - margins)
ax.plot(margins, hinge_margin, label="Hinge: max(0, 1-margin)", linewidth=2,
        color="#984ea3")
ax.axvline(1, color="gray", linewidth=0.5, linestyle=":", label="Margin = 1")
ax.fill_between(margins, 0, hinge_margin, alpha=0.1, color="#984ea3")
ax.set_xlabel("Margin ($y \\cdot \\hat{y}$)")
ax.set_ylabel("Loss")
ax.set_title("Hinge Loss vs Margin")
ax.legend()
ax.set_xlim(-2, 3)
ax.set_ylim(0, 3.5)
ax.grid(True, alpha=0.3)

plt.suptitle("Loss Functions Comparison", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

Summary

In this lab we built the core components of CNNs and deep learning from scratch using only NumPy:

Part 1: Convolution & Correlation

  • Implemented 1D and 2D cross-correlation (what DL calls “convolution”)

  • Applied classic filters (edge detection, sharpen, blur) to a checkerboard image

  • Extended to strided convolution with zero-padding, matching the output-size formula

Part 2: Activation Functions

  • Implemented 6 activations: ReLU, Leaky ReLU, ELU, Swish, GELU, Tanh

  • Computed derivatives for ReLU and Tanh (essential for backpropagation)

  • Demonstrated the dying ReLU problem and how Leaky ReLU mitigates it

Part 3: Perceptron

  • Built a Perceptron class with the classic learning rule

  • Trained it on linearly separable data and plotted the decision boundary

  • Proved the perceptron fails on XOR and showed a 2-layer network solves it (OR + NAND + AND decomposition)

Part 4: Performance Metrics

  • Implemented ROC curve computation from raw scores and labels

  • Computed AUC using the trapezoidal rule

  • Plotted ROC curves comparing good, weak, and random classifiers

  • Implemented top-k accuracy for multi-class evaluation

Part 5: Pooling Operations

  • Implemented max pooling, average pooling, and global average pooling

  • Visualized their effects on a 6x6 feature map

Part 6: Loss Functions

  • Implemented MSE, MAE, Huber, and Hinge loss functions

  • Plotted regression losses vs error and hinge loss vs margin

Key takeaway: Every “magical” operation in deep learning frameworks is just a combination of simple mathematical operations. Understanding these building blocks demystifies CNN architectures and makes debugging, tuning, and innovating much easier.