Lab 01: Logistic Regression from Scratch

Sources:

  • Deep Learning Interviews (Shlomo Kashani, 2nd Edition) – Chapter: Logistic Regression

  • Speech and Language Processing (Jurafsky & Martin) – Chapter 4: Logistic Regression

Topics covered:

  1. Sigmoid function and its properties

  2. Odds, log-odds, and the logit function

  3. Binary cross-entropy loss

  4. Gradient descent for logistic regression

  5. Evaluation metrics (precision, recall, F1, confusion matrix)

  6. Multinomial logistic regression (softmax)

  7. Regularization (L2)

  8. End-to-end pipeline with decision boundary visualization

import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple

%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['font.size'] = 12

PART 1: The Sigmoid Function

Reference: DLI – “The Sigmoid”, SLP Ch 4.2

The sigmoid (logistic) function maps any real number to the interval \((0, 1)\):

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

Key properties

| Property      | Formula |
|---------------|---------|
| Value at zero | \(\sigma(0) = 0.5\) |
| Symmetry      | \(\sigma(-z) = 1 - \sigma(z)\) |
| Derivative    | \(\frac{d}{dz}\sigma(z) = \sigma(z)\,(1 - \sigma(z))\) |
| Range         | \((0, 1)\) – output is always a valid probability |
| Monotonicity  | Strictly increasing |

The derivative being expressible purely in terms of \(\sigma(z)\) itself makes backpropagation efficient: once you have the forward-pass value, the gradient is essentially free.

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Compute the sigmoid function element-wise.
    
    Uses np.clip to prevent overflow in np.exp for very large
    positive or negative inputs.
    """
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_derivative(z: np.ndarray) -> np.ndarray:
    """Compute the derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Quick sanity check
print(f"sigmoid(0)   = {sigmoid(np.array([0.0]))[0]:.4f}  (expect 0.5)")
print(f"sigmoid(100) = {sigmoid(np.array([100.0]))[0]:.6f}  (expect ~1.0)")
print(f"sigmoid(-100)= {sigmoid(np.array([-100.0]))[0]:.6f}  (expect ~0.0)")
# Verify the symmetry property: sigma(-z) = 1 - sigma(z)
z_values = np.array([-5.0, -2.0, -1.0, 0.0, 1.0, 2.0, 5.0])

lhs = sigmoid(-z_values)          # sigma(-z)
rhs = 1.0 - sigmoid(z_values)     # 1 - sigma(z)

print("Verifying sigma(-z) = 1 - sigma(z)")
print(f"{'z':>6s} | {'sigma(-z)':>12s} | {'1-sigma(z)':>12s} | {'match':>6s}")
print("-" * 48)
for z, l, r in zip(z_values, lhs, rhs):
    print(f"{z:6.1f} | {l:12.8f} | {r:12.8f} | {np.isclose(l, r)}")

assert np.allclose(lhs, rhs), "Symmetry property failed!"
print("\nAll checks passed.")
# Plot sigmoid and its derivative on the same axes
z = np.linspace(-10, 10, 300)

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(z, sigmoid(z), label=r'$\sigma(z)$', linewidth=2)
ax.plot(z, sigmoid_derivative(z), label=r"$\sigma'(z) = \sigma(z)(1-\sigma(z))$",
        linewidth=2, linestyle='--')
ax.axhline(y=0.5, color='gray', linestyle=':', alpha=0.5)
ax.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax.set_xlabel('z')
ax.set_ylabel('Value')
ax.set_title('Sigmoid Function and Its Derivative')
ax.legend(fontsize=13)
ax.set_ylim(-0.05, 1.05)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

PART 2: Odds, Log-Odds, and the Logit Function

Reference: DLI – “Odds, Log-odds”, SLP Ch 4.2

Given a probability \(p \in (0, 1)\):

| Concept | Definition | Range |
|---------|------------|-------|
| Odds | \(\text{odds}(p) = \dfrac{p}{1-p}\) | \((0, \infty)\) |
| Log-odds (logit) | \(\text{logit}(p) = \log\!\left(\dfrac{p}{1-p}\right)\) | \((-\infty, \infty)\) |

Logit is the inverse of sigmoid

\[\text{logit}(\sigma(z)) = z \qquad \text{and} \qquad \sigma(\text{logit}(p)) = p\]

This means the sigmoid maps log-odds to probabilities, and the logit maps probabilities back to log-odds. In logistic regression the model computes \(z = \mathbf{w} \cdot \mathbf{x} + b\) (a log-odds score) and then applies \(\sigma\) to obtain a probability.

def logit(p: np.ndarray) -> np.ndarray:
    """Compute the logit (log-odds): log(p / (1-p)).
    
    Input p must be in (0, 1).
    """
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


# Verify logit is the inverse of sigmoid
z_values = np.array([-3.0, -1.0, 0.0, 1.5, 4.0])
roundtrip = logit(sigmoid(z_values))

print("Verifying logit(sigmoid(z)) = z")
print(f"{'z':>8s} | {'sigmoid(z)':>12s} | {'logit(sigmoid(z))':>18s} | {'match':>6s}")
print("-" * 55)
for z_val, rt in zip(z_values, roundtrip):
    print(f"{z_val:8.2f} | {sigmoid(np.array([z_val]))[0]:12.8f} | {rt:18.8f} | {np.isclose(z_val, rt)}")

assert np.allclose(z_values, roundtrip), "Inverse property failed!"
print("\nAll checks passed.")
# Medical odds problem (DLI-style)
# A doctor estimates the probability of disease at p = 0.8.
# 1) What are the odds?
# 2) What are the log-odds?
# 3) If the log-odds increase by 1.5, what is the new probability?

p = 0.8
odds = p / (1.0 - p)
log_odds = logit(np.array([p]))[0]

new_log_odds = log_odds + 1.5
new_probability = sigmoid(np.array([new_log_odds]))[0]

print("Medical Odds Problem")
print("=" * 40)
print(f"Initial probability     p = {p}")
print(f"Odds                      = {odds:.4f}   (4 to 1)")
print(f"Log-odds (logit)          = {log_odds:.4f}")
print(f"New log-odds (+1.5)       = {new_log_odds:.4f}")
print(f"New probability           = {new_probability:.4f}")
print(f"\nInterpretation: adding 1.5 to the log-odds raises")
print(f"the probability from {p:.2f} to {new_probability:.4f}.")

PART 3: Binary Cross-Entropy Loss

Reference: DLI – “The Logit Function and Entropy”, SLP Ch 4.5

For a single training example with true label \(y \in \{0, 1\}\) and predicted probability \(\hat{p}\):

\[L(y,\, \hat{p}) = -\bigl[y \log(\hat{p}) + (1 - y)\log(1 - \hat{p})\bigr]\]

For a dataset of \(N\) examples the cost function is the mean loss:

\[J = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\bigr]\]

Intuition: When \(y=1\) the loss is \(-\log(\hat{p})\), which penalises small \(\hat{p}\) heavily. When \(y=0\) the loss is \(-\log(1-\hat{p})\), which penalises large \(\hat{p}\) heavily. The loss is always non-negative and equals zero only when the prediction is perfect.

def binary_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute the mean binary cross-entropy loss over a batch.
    
    Args:
        y_true: array of true labels (0 or 1), shape (N,)
        y_pred: array of predicted probabilities, shape (N,)
    
    Returns:
        Scalar mean loss.
    """
    # Clip predictions to avoid log(0)
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    loss = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return np.mean(loss)


# Quick test
print(f"Loss when y=1, p=0.9:  {binary_cross_entropy(np.array([1]), np.array([0.9])):.4f}  (low -- good prediction)")
print(f"Loss when y=1, p=0.1:  {binary_cross_entropy(np.array([1]), np.array([0.1])):.4f}  (high -- bad prediction)")
print(f"Loss when y=0, p=0.1:  {binary_cross_entropy(np.array([0]), np.array([0.1])):.4f}  (low -- good prediction)")
print(f"Loss when y=0, p=0.9:  {binary_cross_entropy(np.array([0]), np.array([0.9])):.4f}  (high -- bad prediction)")
# Plot loss curves for y=0 and y=1
p = np.linspace(0.01, 0.99, 200)

loss_y1 = -np.log(p)        # Loss when y = 1
loss_y0 = -np.log(1.0 - p)  # Loss when y = 0

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(p, loss_y1, label='y = 1:  $-\\log(\\hat{p})$', linewidth=2)
ax.plot(p, loss_y0, label='y = 0:  $-\\log(1-\\hat{p})$', linewidth=2, linestyle='--')
ax.set_xlabel('Predicted probability $\\hat{p}$')
ax.set_ylabel('Loss')
ax.set_title('Binary Cross-Entropy Loss')
ax.legend(fontsize=13)
ax.set_ylim(0, 5)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

PART 4: Logistic Regression – Full Implementation

Reference: DLI – “Truly Understanding Logistic Regression”, SLP Ch 4.4, 4.6

Model

\[P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x} + b)\]

Training via gradient descent

We minimize the binary cross-entropy cost \(J\). The gradients are:

\[\frac{\partial J}{\partial \mathbf{w}} = \frac{1}{N} \mathbf{X}^\top (\hat{\mathbf{p}} - \mathbf{y})\]
\[\frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)\]
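
These compact forms follow from the chain rule. For a single example with \(z = \mathbf{w} \cdot \mathbf{x} + b\),

\[\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{p}} \cdot \frac{d\hat{p}}{dz} = \left(-\frac{y}{\hat{p}} + \frac{1 - y}{1 - \hat{p}}\right)\hat{p}\,(1 - \hat{p}) = \hat{p} - y,\]

so the gradient with respect to each weight is simply the prediction error multiplied by the corresponding input.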

Update rule (gradient descent):

\[\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \frac{\partial J}{\partial \mathbf{w}}, \qquad b \leftarrow b - \alpha \, \frac{\partial J}{\partial b}\]

class LogisticRegression:
    """Binary logistic regression trained with gradient descent."""

    def __init__(self, learning_rate: float = 0.01, n_iterations: int = 1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Compute P(y=1|X) = sigmoid(X @ w + b)."""
        linear = X @ self.weights + self.bias
        return sigmoid(linear)

    def compute_gradients(
        self, X: np.ndarray, y: np.ndarray, predictions: np.ndarray
    ) -> Tuple[np.ndarray, float]:
        """Compute gradients of cross-entropy w.r.t. weights and bias.
        
        Returns:
            (dw, db) -- gradient w.r.t. weights (shape (n_features,)) and bias (scalar).
        """
        N = X.shape[0]
        error = predictions - y                     # (N,)
        dw = (1.0 / N) * (X.T @ error)             # (n_features,)
        db = (1.0 / N) * np.sum(error)              # scalar
        return dw, db

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Train logistic regression using gradient descent."""
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.losses = []

        for i in range(self.n_iter):
            # Forward pass
            predictions = self.predict_proba(X)

            # Compute loss
            loss = binary_cross_entropy(y, predictions)
            self.losses.append(loss)

            # Compute gradients
            dw, db = self.compute_gradients(X, y, predictions)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels (0 or 1) using threshold 0.5."""
        probas = self.predict_proba(X)
        return (probas >= 0.5).astype(int)


print("LogisticRegression class defined.")
# Generate synthetic 2D data and train
def generate_synthetic_data(n_samples=500, n_features=2, seed=42):
    """Generate linearly separable 2D data for testing."""
    np.random.seed(seed)
    X_pos = np.random.randn(n_samples // 2, n_features) + np.array([2, 2])
    X_neg = np.random.randn(n_samples // 2, n_features) + np.array([-2, -2])
    X = np.vstack([X_pos, X_neg])
    y = np.hstack([np.ones(n_samples // 2), np.zeros(n_samples // 2)])
    # Shuffle
    idx = np.random.permutation(n_samples)
    return X[idx], y[idx]


X_data, y_data = generate_synthetic_data()
print(f"Dataset: {X_data.shape[0]} samples, {X_data.shape[1]} features")
print(f"Class distribution: {int(y_data.sum())} positive, {int(len(y_data) - y_data.sum())} negative")

# Train the model
model = LogisticRegression(learning_rate=0.1, n_iterations=300)
model.fit(X_data, y_data)

print(f"\nFinal loss: {model.losses[-1]:.4f}")
print(f"Learned weights: {model.weights}")
print(f"Learned bias:    {model.bias:.4f}")

# Plot the loss curve
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(model.losses, linewidth=2)
ax.set_xlabel('Iteration')
ax.set_ylabel('Binary Cross-Entropy Loss')
ax.set_title('Training Loss Curve')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

PART 5: Evaluation Metrics

Reference: SLP Ch 4.9 – Precision, Recall, F-measure

For binary classification with true labels \(y\) and predictions \(\hat{y}\):

| Count | Condition | Meaning |
|-------|-----------|---------|
| True Positives (TP)  | \(\hat{y}=1\) and \(y=1\) | Correctly predicted positive |
| False Positives (FP) | \(\hat{y}=1\) and \(y=0\) | Incorrectly predicted positive |
| True Negatives (TN)  | \(\hat{y}=0\) and \(y=0\) | Correctly predicted negative |
| False Negatives (FN) | \(\hat{y}=0\) and \(y=1\) | Incorrectly predicted negative |

\[\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}\]
\[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

The confusion matrix summarises all four counts:

\[\begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}\]

def precision_recall_f1(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute precision, recall, and F1-score for binary classification."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2.0 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )

    return {"precision": precision, "recall": recall, "f1": f1}


def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Compute a 2x2 confusion matrix: [[TN, FP], [FN, TP]]."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    return np.array([[tn, fp], [fn, tp]])


print("Metric functions defined.")
# Evaluate the trained model from Part 4
y_pred = model.predict(X_data)
accuracy = np.mean(y_pred == y_data)

metrics = precision_recall_f1(y_data, y_pred)
cm = confusion_matrix(y_data, y_pred)

print("Evaluation on Training Data")
print("=" * 35)
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall:    {metrics['recall']:.4f}")
print(f"F1-score:  {metrics['f1']:.4f}")
print(f"\nConfusion Matrix (rows=actual, cols=predicted):")
print(f"          Pred 0  Pred 1")
print(f"Actual 0  {cm[0, 0]:>5d}   {cm[0, 1]:>5d}")
print(f"Actual 1  {cm[1, 0]:>5d}   {cm[1, 1]:>5d}")

PART 6: Multinomial Logistic Regression (Softmax)

Reference: DLI – “Logistic Regression”, SLP Ch 4.7–4.8

When there are \(K > 2\) classes, we generalise the sigmoid to the softmax function. Given a vector of logits \(\mathbf{z} \in \mathbb{R}^K\):

\[P(y = k \mid \mathbf{x}) = \text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}\]

The outputs form a valid probability distribution: every element is in \((0, 1)\) and they sum to \(1\).

Numerical stability: In practice we compute \(\text{softmax}(\mathbf{z} - \max(\mathbf{z}))\) to prevent overflow. This is mathematically equivalent because the constant cancels in the ratio.

The loss function becomes categorical cross-entropy:

\[L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})\]

where \(\mathbf{y}_i\) is one-hot encoded.

def softmax(z: np.ndarray) -> np.ndarray:
    """Compute softmax probabilities (numerically stable).
    
    Args:
        z: array of shape (N, K) where K is the number of classes.
    
    Returns:
        Probabilities of shape (N, K), each row sums to 1.
    """
    z = np.atleast_2d(z)
    z_shifted = z - np.max(z, axis=-1, keepdims=True)  # subtract max for stability
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)


# Sanity checks
z_test = np.array([[1.0, 2.0, 3.0]])
probs = softmax(z_test)
print(f"softmax({z_test[0]}) = {probs[0]}")
print(f"Sum = {probs.sum():.6f}  (should be 1.0)")

# Test with large values (should not overflow)
z_big = np.array([[1000.0, 1001.0, 1002.0]])
probs_big = softmax(z_big)
print(f"\nsoftmax([1000, 1001, 1002]) = {probs_big[0]}")
print(f"Sum = {probs_big.sum():.6f}  (should be 1.0, no overflow)")
def categorical_cross_entropy(y_true_onehot: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute categorical cross-entropy loss.
    
    Args:
        y_true_onehot: one-hot encoded labels, shape (N, K)
        y_pred: predicted probabilities, shape (N, K)
    
    Returns:
        Scalar mean loss.
    """
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    loss = -np.sum(y_true_onehot * np.log(y_pred), axis=-1)
    return np.mean(loss)


print("categorical_cross_entropy defined.")
# Demo: multinomial softmax with 3 classes
np.random.seed(0)

# Simulate logits for 5 samples and 3 classes
logits = np.random.randn(5, 3)
probs = softmax(logits)

# One-hot true labels: classes [0, 1, 2, 0, 1]
y_true_labels = np.array([0, 1, 2, 0, 1])
y_onehot = np.zeros((5, 3))
y_onehot[np.arange(5), y_true_labels] = 1.0

loss = categorical_cross_entropy(y_onehot, probs)

print("Softmax Demo (5 samples, 3 classes)")
print("=" * 50)
print(f"Logits:\n{logits}\n")
print(f"Softmax probabilities:\n{probs}\n")
print(f"Row sums: {probs.sum(axis=1)}  (all should be 1.0)\n")
print(f"True labels (one-hot):\n{y_onehot}\n")
print(f"Categorical cross-entropy loss: {loss:.4f}")

PART 7: Regularization

Reference: SLP Ch 4.14, DLI – “Logistic Regression”

Regularization adds a penalty to the loss function to discourage large weights, which helps prevent overfitting.

L2 Regularization (Ridge / Weight Decay)

\[J_{\text{reg}} = J + \frac{\lambda}{2} \|\mathbf{w}\|^2 = J + \frac{\lambda}{2} \sum_j w_j^2\]

The gradient of the regularization term with respect to weights is simply \(\lambda \mathbf{w}\), so the update becomes:

\[\mathbf{w} \leftarrow \mathbf{w} - \alpha \left(\frac{\partial J}{\partial \mathbf{w}} + \lambda \mathbf{w}\right)\]

Note: the bias \(b\) is typically not regularized.

Effect: L2 regularization shrinks weights toward zero but rarely makes them exactly zero. Larger \(\lambda\) means stronger regularization (simpler model).

class LogisticRegressionL2(LogisticRegression):
    """Logistic regression with L2 regularization."""

    def __init__(self, learning_rate=0.01, n_iterations=1000, reg_lambda=0.1):
        super().__init__(learning_rate, n_iterations)
        self.reg_lambda = reg_lambda

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Train with L2-regularized loss."""
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.losses = []

        for i in range(self.n_iter):
            # Forward pass
            predictions = self.predict_proba(X)

            # Compute loss with L2 penalty
            ce_loss = binary_cross_entropy(y, predictions)
            reg_term = (self.reg_lambda / 2.0) * np.sum(self.weights ** 2)
            loss = ce_loss + reg_term
            self.losses.append(loss)

            # Compute gradients (base gradients + regularization gradient)
            dw, db = self.compute_gradients(X, y, predictions)
            dw += self.reg_lambda * self.weights  # L2 gradient on weights
            # bias is NOT regularized

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

        return self


print("LogisticRegressionL2 class defined.")
# Compare regularized vs unregularized on high-dimensional data
np.random.seed(42)

n_samples = 100
n_features = 50  # many features relative to samples -- overfitting risk

# Only the first 2 features are informative
X_hd = np.random.randn(n_samples, n_features)
y_hd = (X_hd[:, 0] + X_hd[:, 1] > 0).astype(float)

# Train unregularized
model_noreg = LogisticRegression(learning_rate=0.05, n_iterations=500)
model_noreg.fit(X_hd, y_hd)

# Train L2-regularized
model_l2 = LogisticRegressionL2(learning_rate=0.05, n_iterations=500, reg_lambda=0.5)
model_l2.fit(X_hd, y_hd)

print("Weight magnitudes (L2 norm):")
print(f"  Unregularized: ||w|| = {np.linalg.norm(model_noreg.weights):.4f}")
print(f"  L2 regularized: ||w|| = {np.linalg.norm(model_l2.weights):.4f}")

# Plot loss curves side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(model_noreg.losses, linewidth=2, label='No regularization')
axes[0].plot(model_l2.losses, linewidth=2, linestyle='--', label=f'L2 ($\\lambda$={model_l2.reg_lambda})')
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Loss')
axes[0].set_title('Loss Curves')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Bar chart of weight magnitudes
axes[1].bar(np.arange(n_features) - 0.2, np.abs(model_noreg.weights), 0.4,
            label='No regularization', alpha=0.7)
axes[1].bar(np.arange(n_features) + 0.2, np.abs(model_l2.weights), 0.4,
            label=f'L2 ($\\lambda$={model_l2.reg_lambda})', alpha=0.7)
axes[1].set_xlabel('Feature index')
axes[1].set_ylabel('|weight|')
axes[1].set_title('Weight Magnitudes by Feature')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nNote: L2 regularization shrinks all weights toward zero,")
print(f"reducing overfitting on the {n_features - 2} noise features.")

PART 8: End-to-End Pipeline

# Full pipeline: generate data, train/test split, train, evaluate, plot decision boundary

np.random.seed(123)

# 1. Generate data
X_all, y_all = generate_synthetic_data(n_samples=600, seed=123)

# 2. Train/test split (80/20)
split_idx = int(0.8 * len(y_all))
indices = np.random.permutation(len(y_all))
train_idx, test_idx = indices[:split_idx], indices[split_idx:]

X_train, y_train = X_all[train_idx], y_all[train_idx]
X_test, y_test = X_all[test_idx], y_all[test_idx]

print(f"Train set: {X_train.shape[0]} samples")
print(f"Test set:  {X_test.shape[0]} samples")

# 3. Train logistic regression
pipeline_model = LogisticRegression(learning_rate=0.1, n_iterations=500)
pipeline_model.fit(X_train, y_train)

# 4. Evaluate
y_pred_train = pipeline_model.predict(X_train)
y_pred_test = pipeline_model.predict(X_test)

train_metrics = precision_recall_f1(y_train, y_pred_train)
test_metrics = precision_recall_f1(y_test, y_pred_test)
test_cm = confusion_matrix(y_test, y_pred_test)

train_acc = np.mean(y_pred_train == y_train)
test_acc = np.mean(y_pred_test == y_test)

print(f"\n{'Metric':<12s} {'Train':>8s} {'Test':>8s}")
print("-" * 30)
print(f"{'Accuracy':<12s} {train_acc:>8.4f} {test_acc:>8.4f}")
print(f"{'Precision':<12s} {train_metrics['precision']:>8.4f} {test_metrics['precision']:>8.4f}")
print(f"{'Recall':<12s} {train_metrics['recall']:>8.4f} {test_metrics['recall']:>8.4f}")
print(f"{'F1':<12s} {train_metrics['f1']:>8.4f} {test_metrics['f1']:>8.4f}")

print(f"\nTest Confusion Matrix:")
print(f"          Pred 0  Pred 1")
print(f"Actual 0  {test_cm[0, 0]:>5d}   {test_cm[0, 1]:>5d}")
print(f"Actual 1  {test_cm[1, 0]:>5d}   {test_cm[1, 1]:>5d}")

# 5. Plot decision boundary
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: loss curve
axes[0].plot(pipeline_model.losses, linewidth=2)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss Curve')
axes[0].grid(True, alpha=0.3)

# Right: decision boundary
ax = axes[1]

# Create mesh grid
x_min, x_max = X_all[:, 0].min() - 1, X_all[:, 0].max() + 1
y_min, y_max = X_all[:, 1].min() - 1, X_all[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                      np.linspace(y_min, y_max, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
probs_grid = pipeline_model.predict_proba(grid).reshape(xx.shape)

# Plot decision regions
ax.contourf(xx, yy, probs_grid, levels=50, cmap='RdBu', alpha=0.6)
ax.contour(xx, yy, probs_grid, levels=[0.5], colors='black', linewidths=2)

# Plot test data
ax.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
           c='blue', edgecolors='k', s=40, label='Class 0 (test)', alpha=0.7)
ax.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
           c='red', edgecolors='k', s=40, label='Class 1 (test)', alpha=0.7)

ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_title(f'Decision Boundary (Test Acc = {test_acc:.2%})')
ax.legend()

plt.tight_layout()
plt.show()

Summary

In this lab we built logistic regression entirely from scratch, covering:

  1. Sigmoid function – the core nonlinearity that maps real numbers to probabilities, along with its elegant derivative \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\).

  2. Odds and logit – the logit function as the inverse of the sigmoid, connecting probability space to the unrestricted real line.

  3. Binary cross-entropy – the principled loss function derived from maximum likelihood estimation.

  4. Gradient descent training – computing gradients analytically and iteratively updating weights and bias to minimize the loss.

  5. Evaluation metrics – precision, recall, F1-score, and the confusion matrix for assessing classifier performance.

  6. Softmax / multinomial logistic regression – generalizing to \(K > 2\) classes using the softmax function and categorical cross-entropy.

  7. L2 regularization – adding a weight penalty to reduce overfitting, especially with many features.

  8. End-to-end pipeline – data generation, train/test split, training, evaluation, and decision boundary visualization.

These foundations are critical building blocks for neural networks, where logistic regression is essentially a single-layer network with a sigmoid output.