Run this notebook: Open in Colab Open in Kaggle

Lab 6: Continuous Optimization¶

Based on Chapter 7 of Mathematics for Machine Learning (Deisenroth, Faisal, Ong)¶

This lab covers the core optimization techniques used in machine learning:

Gradient Descent – univariate and multivariate
Learning Rate Effects – convergence vs divergence
Stochastic Gradient Descent – SGD vs full-batch GD
Momentum – accelerating convergence
Constrained Optimization – Lagrange multipliers
KKT Conditions – inequality constraints
Convex Optimization – Hessian checks and global minima
Newton’s Method – second-order optimization

All implementations use only numpy and matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
print("Setup complete. NumPy version:", np.__version__)

1. Gradient Descent¶

Gradient descent is a first-order optimization algorithm. To find a local minimum of a differentiable function $f$, we take steps proportional to the negative gradient (MML Eq. 7.6):

\[\boldsymbol{x}_{i+1} = \boldsymbol{x}_i - \gamma_i \left( \nabla f(\boldsymbol{x}_i) \right)^\top\]

where $\gamma_i > 0$ is the step-size (learning rate). The gradient points in the direction of steepest ascent, so the negative gradient points toward steepest descent.

1a. Univariate Gradient Descent¶

We minimize $f(x) = x^4 - 3x^3 + 2$. Its derivative is $f'(x) = 4x^3 - 9x^2$.

Setting $f'(x) = 0$: $x^2(4x - 9) = 0$, giving stationary points at $x = 0$ (saddle) and $x = 9/4 = 2.25$ (minimum).

The update rule becomes: $x_{i+1} = x_i - \gamma \cdot f'(x_i)$

# Univariate Gradient Descent on f(x) = x^4 - 3x^3 + 2
def f_univar(x):
    return x**4 - 3*x**3 + 2

def df_univar(x):
    return 4*x**3 - 9*x**2

# Run gradient descent
x = -1.0          # starting point
lr = 0.01          # learning rate
n_steps = 80
trajectory = [x]

for _ in range(n_steps):
    x = x - lr * df_univar(x)
    trajectory.append(x)

trajectory = np.array(trajectory)
print(f"Starting point: x = -1.0, f(x) = {f_univar(-1.0):.4f}")
print(f"Final point:    x = {trajectory[-1]:.4f}, f(x) = {f_univar(trajectory[-1]):.4f}")
print(f"Analytical minimum at x = 9/4 = 2.25, f(2.25) = {f_univar(2.25):.4f}")

# Plot
x_plot = np.linspace(-1.5, 3.5, 300)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(x_plot, f_univar(x_plot), 'b-', linewidth=2, label='$f(x) = x^4 - 3x^3 + 2$')
ax1.plot(trajectory, f_univar(trajectory), 'ro-', markersize=4, alpha=0.6, label='GD steps')
ax1.plot(trajectory[0], f_univar(trajectory[0]), 'gs', markersize=12, label='Start')
ax1.plot(trajectory[-1], f_univar(trajectory[-1]), 'r*', markersize=15, label='End')
ax1.set_xlabel('x'); ax1.set_ylabel('f(x)')
ax1.set_title('Gradient Descent on $f(x) = x^4 - 3x^3 + 2$')
ax1.legend(); ax1.grid(True, alpha=0.3)

# Loss curve
ax2.plot(f_univar(trajectory), 'b-', linewidth=2)
ax2.set_xlabel('Iteration'); ax2.set_ylabel('f(x)')
ax2.set_title('Loss Curve'); ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

1b. 2D Gradient Descent: Rosenbrock Function¶

The Rosenbrock function is a classic test for optimization algorithms:

\[f(x, y) = (a - x)^2 + b(y - x^2)^2\]

with $a=1, b=100$. The global minimum is at $(1, 1)$. The function has a narrow curved valley that is hard for gradient descent to navigate.

The gradient is: $$\nabla f = \begin{bmatrix} -2(a - x) - 4bx(y - x^2) \\ 2b(y - x^2) \end{bmatrix}$$

# 2D Gradient Descent on the Rosenbrock function
def rosenbrock(xy):
    x, y = xy
    return (1 - x)**2 + 100*(y - x**2)**2

def grad_rosenbrock(xy):
    x, y = xy
    dfdx = -2*(1 - x) - 400*x*(y - x**2)
    dfdy = 200*(y - x**2)
    return np.array([dfdx, dfdy])

# Run GD
xy = np.array([-1.5, 2.0])
lr = 0.001
n_steps = 10000
traj_2d = [xy.copy()]

for _ in range(n_steps):
    xy = xy - lr * grad_rosenbrock(xy)
    traj_2d.append(xy.copy())

traj_2d = np.array(traj_2d)
print(f"Start: ({traj_2d[0,0]:.2f}, {traj_2d[0,1]:.2f}), f = {rosenbrock(traj_2d[0]):.2f}")
print(f"End:   ({traj_2d[-1,0]:.4f}, {traj_2d[-1,1]:.4f}), f = {rosenbrock(traj_2d[-1]):.6f}")
print(f"True minimum at (1, 1), f = 0")

# Contour plot with trajectory
fig, ax = plt.subplots(figsize=(10, 8))
x_grid = np.linspace(-2, 2, 400)
y_grid = np.linspace(-1, 3, 400)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (1 - X)**2 + 100*(Y - X**2)**2

levels = np.logspace(-1, 3.5, 20)
ax.contour(X, Y, Z, levels=levels, cmap='viridis', alpha=0.7)
ax.contourf(X, Y, Z, levels=levels, cmap='viridis', alpha=0.3)

# Plot every 100th point for clarity
step = max(1, len(traj_2d)//100)
ax.plot(traj_2d[::step, 0], traj_2d[::step, 1], 'r.-', markersize=3, linewidth=0.8, alpha=0.8, label='GD trajectory')
ax.plot(traj_2d[0, 0], traj_2d[0, 1], 'gs', markersize=12, label='Start')
ax.plot(1, 1, 'r*', markersize=15, label='Global min (1,1)')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('Gradient Descent on Rosenbrock Function')
ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

2. Learning Rate Effects¶

The choice of step-size (learning rate) $\gamma$ is critical (MML Section 7.1.1):

Too small: Convergence is very slow.
Too large: The algorithm overshoots, oscillates, or diverges.
Just right: Fast, stable convergence.

Two simple heuristics (Toussaint, 2012):

When the function value increases after a step, the step-size was too large – undo and decrease.
When the function value decreases, the step could have been larger – try increasing.

# Learning rate effects on f(x) = x^4 - 3x^3 + 2
learning_rates = [0.001, 0.01, 0.03, 0.05]
colors = ['blue', 'green', 'orange', 'red']
n_steps = 100

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

x_plot = np.linspace(-1.5, 3.5, 300)
ax1.plot(x_plot, f_univar(x_plot), 'k-', linewidth=2, alpha=0.3, label='f(x)')

for lr, color in zip(learning_rates, colors):
    x = -1.0
    traj = [x]
    diverged = False
    for i in range(n_steps):
        x_new = x - lr * df_univar(x)
        if abs(x_new) > 100:  # divergence check
            diverged = True
            break
        x = x_new
        traj.append(x)
    traj = np.array(traj)
    
    label_str = f'lr={lr}'
    if diverged:
        label_str += ' (DIVERGED)'
    
    # Trajectory on function
    valid = np.abs(traj) < 5
    ax1.plot(traj[valid], f_univar(traj[valid]), 'o-', color=color, markersize=3, alpha=0.7, label=label_str)
    
    # Loss curve
    losses = f_univar(traj)
    ax2.plot(losses, '-', color=color, linewidth=2, label=label_str)

ax1.set_xlabel('x'); ax1.set_ylabel('f(x)')
ax1.set_title('GD Trajectories for Different Learning Rates')
ax1.set_ylim(-15, 30); ax1.legend(); ax1.grid(True, alpha=0.3)

ax2.set_xlabel('Iteration'); ax2.set_ylabel('f(x)')
ax2.set_title('Loss Curves')
ax2.set_ylim(-15, 30); ax2.legend(); ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print final values
for lr in learning_rates:
    x = -1.0
    for step_i in range(n_steps):
        x_new = x - lr * df_univar(x)
        if abs(x_new) > 100:
            print(f"lr={lr:.3f}: DIVERGED at step {step_i}")
            break
        x = x_new
    else:
        print(f"lr={lr:.3f}: x = {x:.4f}, f(x) = {f_univar(x):.4f}")

Learning Rate Effects on 2D: Rosenbrock¶

The learning rate has even more dramatic effects in multiple dimensions. On the Rosenbrock function, a rate that is too large causes wild oscillations or divergence, while a rate that is too small barely moves from the starting point.

# 2D learning rate comparison on Rosenbrock
lr_list_2d = [0.0001, 0.001, 0.003]
colors_2d = ['blue', 'green', 'red']
n_steps_2d = 3000

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Contour background
x_grid = np.linspace(-2, 2, 300)
y_grid = np.linspace(-1, 3, 300)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (1 - X)**2 + 100*(Y - X**2)**2
levels = np.logspace(-1, 3.5, 20)
ax1.contour(X, Y, Z, levels=levels, cmap='gray', alpha=0.4)

for lr_val, color in zip(lr_list_2d, colors_2d):
    xy = np.array([-1.5, 2.0])
    traj = [xy.copy()]
    losses = [rosenbrock(xy)]
    for _ in range(n_steps_2d):
        g = grad_rosenbrock(xy)
        if np.linalg.norm(g) > 1e6:  # divergence guard
            break
        xy = xy - lr_val * g
        traj.append(xy.copy())
        losses.append(rosenbrock(xy))
    traj = np.array(traj)
    
    step_s = max(1, len(traj) // 80)
    ax1.plot(traj[::step_s, 0], traj[::step_s, 1], '.-', color=color, markersize=2,
             linewidth=0.8, label=f'lr={lr_val}')
    ax2.semilogy(losses, '-', color=color, linewidth=1.5, label=f'lr={lr_val}')
    print(f"lr={lr_val}: final f = {losses[-1]:.6f} after {len(losses)-1} steps")

ax1.plot(1, 1, 'k*', markersize=15, label='Minimum')
ax1.set_xlabel('$x_1$'); ax1.set_ylabel('$x_2$')
ax1.set_title('2D Learning Rate Effects (Rosenbrock)')
ax1.legend(); ax1.grid(True, alpha=0.3)

ax2.set_xlabel('Iteration'); ax2.set_ylabel('f(x) [log scale]')
ax2.set_title('Loss Curves (log scale)')
ax2.legend(); ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

3. Stochastic Gradient Descent¶

Computing the full gradient can be expensive when the objective is a sum of many terms (MML Section 7.1.3, Eq. 7.13):

\[L(\boldsymbol{\theta}) = \sum_{n=1}^{N} L_n(\boldsymbol{\theta})\]

Stochastic gradient descent (SGD) approximates the gradient by using a random subset (mini-batch) of the data. In the extreme case, we use a single data point.

The key insight: we only need the gradient to be an unbiased estimate of the true gradient for convergence.

Full (batch) GD update (Eq. 7.15): $$\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_i - \gamma_i \sum_{n=1}^{N} (\nabla L_n(\boldsymbol{\theta}_i))^\top$$

SGD with mini-batch of size $B$: $$\boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_i - \gamma_i \sum_{n \in \mathcal{B}} (\nabla L_n(\boldsymbol{\theta}_i))^\top$$

# SGD vs Full GD on a simple linear regression problem
# True model: y = 2x + 1 + noise
np.random.seed(42)
N = 100
X_data = np.random.randn(N, 1)
y_data = 2 * X_data + 1 + 0.5 * np.random.randn(N, 1)

# Add bias column: X_aug = [x, 1]
X_aug = np.hstack([X_data, np.ones((N, 1))])

# Loss for linear regression: L(theta) = (1/N) * ||X_aug @ theta - y||^2
# Gradient: (2/N) * X_aug.T @ (X_aug @ theta - y)

def full_loss(theta, X, y):
    residuals = X @ theta - y
    return np.mean(residuals**2)

def full_gradient(theta, X, y):
    residuals = X @ theta - y
    return (2.0 / len(y)) * (X.T @ residuals)

def sgd_gradient(theta, X, y, batch_size):
    idx = np.random.choice(len(y), batch_size, replace=False)
    X_batch, y_batch = X[idx], y[idx]
    residuals = X_batch @ theta - y_batch
    return (2.0 / batch_size) * (X_batch.T @ residuals)

# Run Full GD
theta_gd = np.array([[0.0], [0.0]])
lr_gd = 0.1
losses_gd = []
for _ in range(200):
    losses_gd.append(full_loss(theta_gd, X_aug, y_data))
    grad = full_gradient(theta_gd, X_aug, y_data)
    theta_gd = theta_gd - lr_gd * grad

# Run SGD (batch_size=1)
np.random.seed(0)
theta_sgd = np.array([[0.0], [0.0]])
lr_sgd = 0.05
losses_sgd = []
for _ in range(200):
    losses_sgd.append(full_loss(theta_sgd, X_aug, y_data))
    grad = sgd_gradient(theta_sgd, X_aug, y_data, batch_size=1)
    theta_sgd = theta_sgd - lr_sgd * grad

# Run Mini-batch SGD (batch_size=16)
np.random.seed(0)
theta_mb = np.array([[0.0], [0.0]])
lr_mb = 0.1
losses_mb = []
for _ in range(200):
    losses_mb.append(full_loss(theta_mb, X_aug, y_data))
    grad = sgd_gradient(theta_mb, X_aug, y_data, batch_size=16)
    theta_mb = theta_mb - lr_mb * grad

print(f"Full GD  final theta: [{theta_gd[0,0]:.4f}, {theta_gd[1,0]:.4f}], loss: {losses_gd[-1]:.4f}")
print(f"SGD(B=1) final theta: [{theta_sgd[0,0]:.4f}, {theta_sgd[1,0]:.4f}], loss: {losses_sgd[-1]:.4f}")
print(f"SGD(B=16) final theta: [{theta_mb[0,0]:.4f}, {theta_mb[1,0]:.4f}], loss: {losses_mb[-1]:.4f}")
print(f"True parameters: [2.0, 1.0]")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(losses_gd, 'b-', linewidth=2, label='Full GD')
ax1.plot(losses_sgd, 'r-', alpha=0.7, linewidth=1, label='SGD (B=1)')
ax1.plot(losses_mb, 'g-', alpha=0.7, linewidth=1.5, label='Mini-batch SGD (B=16)')
ax1.set_xlabel('Iteration'); ax1.set_ylabel('Loss (MSE)')
ax1.set_title('Loss Curves: Full GD vs SGD')
ax1.legend(); ax1.grid(True, alpha=0.3)

# Zoom into later iterations
ax2.plot(losses_gd[20:], 'b-', linewidth=2, label='Full GD')
ax2.plot(losses_sgd[20:], 'r-', alpha=0.7, linewidth=1, label='SGD (B=1)')
ax2.plot(losses_mb[20:], 'g-', alpha=0.7, linewidth=1.5, label='Mini-batch SGD (B=16)')
ax2.set_xlabel('Iteration (offset by 20)'); ax2.set_ylabel('Loss (MSE)')
ax2.set_title('Zoomed: SGD noise vs GD smooth convergence')
ax2.legend(); ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

4. Gradient Descent with Momentum¶

When the optimization surface has regions that are poorly scaled (e.g., long narrow valleys), gradient descent zigzags and converges slowly (see Figure 7.3 in MML). Momentum (Rumelhart et al., 1986) adds a memory of past updates to dampen oscillations (MML Eqs. 7.11-7.12):

\[\boldsymbol{x}_{i+1} = \boldsymbol{x}_i - \gamma_i (\nabla f(\boldsymbol{x}_i))^\top + \alpha \Delta \boldsymbol{x}_i\]

\[\Delta \boldsymbol{x}_i = \boldsymbol{x}_i - \boldsymbol{x}_{i-1} = \alpha \Delta \boldsymbol{x}_{i-1} - \gamma_{i-1}(\nabla f(\boldsymbol{x}_{i-1}))^\top\]

where $\alpha \in [0, 1]$ is the momentum coefficient. This is like a heavy ball rolling downhill that resists changing direction.

The speed of gradient descent convergence depends on the condition number $\kappa = \sigma(\boldsymbol{A})_{\max} / \sigma(\boldsymbol{A})_{\min}$ (MML p.230). Poorly conditioned problems have large $\kappa$ and slow convergence.

# Comparison: GD vs GD+Momentum on an elongated quadratic
# f(x) = 0.5 * x^T A x - b^T x, with A having very different eigenvalues
# This creates an elongated valley (high condition number)
A = np.array([[50.0, 0.0],
              [0.0,  1.0]])   # condition number = 50
b = np.array([5.0, 1.0])

# Optimal solution: x* = A^{-1} b
x_star = np.linalg.solve(A, b)
cond_num = np.linalg.cond(A)
print(f"Condition number of A: {cond_num:.1f}")
print(f"Optimal solution: x* = [{x_star[0]:.4f}, {x_star[1]:.4f}]")

def f_quad(x):
    return 0.5 * x @ A @ x - b @ x

def grad_quad(x):
    return A @ x - b

# Plain GD
x_gd = np.array([-4.0, -4.0])
lr = 0.035
traj_gd = [x_gd.copy()]
for _ in range(100):
    x_gd = x_gd - lr * grad_quad(x_gd)
    traj_gd.append(x_gd.copy())
traj_gd = np.array(traj_gd)

# GD with Momentum
x_mom = np.array([-4.0, -4.0])
lr = 0.035
alpha = 0.9  # momentum coefficient
delta_x = np.zeros(2)
traj_mom = [x_mom.copy()]
for _ in range(100):
    grad = grad_quad(x_mom)
    delta_x = alpha * delta_x - lr * grad
    x_mom = x_mom + delta_x
    traj_mom.append(x_mom.copy())
traj_mom = np.array(traj_mom)

print(f"GD  final: [{traj_gd[-1,0]:.4f}, {traj_gd[-1,1]:.4f}], f = {f_quad(traj_gd[-1]):.6f}")
print(f"Mom final: [{traj_mom[-1,0]:.4f}, {traj_mom[-1,1]:.4f}], f = {f_quad(traj_mom[-1]):.6f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Contour plot
for ax, traj, title in [(axes[0], traj_gd, 'Gradient Descent'), (axes[1], traj_mom, 'GD + Momentum')]:
    x1 = np.linspace(-5, 2, 200)
    x2 = np.linspace(-5, 3, 200)
    X1, X2 = np.meshgrid(x1, x2)
    Z = 0.5*(50*X1**2 + X2**2) - 5*X1 - X2
    ax.contour(X1, X2, Z, levels=30, cmap='viridis', alpha=0.6)
    ax.plot(traj[:, 0], traj[:, 1], 'ro-', markersize=3, linewidth=1, alpha=0.8)
    ax.plot(traj[0, 0], traj[0, 1], 'gs', markersize=10)
    ax.plot(x_star[0], x_star[1], 'r*', markersize=15)
    ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
    ax.set_title(title); ax.grid(True, alpha=0.3)

# Loss comparison
losses_gd_q = [f_quad(traj_gd[i]) for i in range(len(traj_gd))]
losses_mom_q = [f_quad(traj_mom[i]) for i in range(len(traj_mom))]
axes[2].plot(losses_gd_q, 'b-', linewidth=2, label='GD')
axes[2].plot(losses_mom_q, 'r-', linewidth=2, label='GD + Momentum')
axes[2].axhline(y=f_quad(x_star), color='k', linestyle='--', alpha=0.5, label=f'Optimal: {f_quad(x_star):.2f}')
axes[2].set_xlabel('Iteration'); axes[2].set_ylabel('f(x)')
axes[2].set_title('Convergence Comparison'); axes[2].legend(); axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

5. Constrained Optimization and Lagrange Multipliers¶

In constrained optimization (MML Section 7.2), we solve:

\[\min_{\boldsymbol{x}} f(\boldsymbol{x}) \quad \text{subject to} \quad g_i(\boldsymbol{x}) \leqslant 0, \; i=1,\ldots,m\]

The Lagrangian (Eq. 7.20) introduces multipliers $\lambda_i \geqslant 0$:

\[\mathfrak{L}(\boldsymbol{x}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) + \sum_{i=1}^{m} \lambda_i g_i(\boldsymbol{x})\]

For equality constraints $h(\boldsymbol{x}) = 0$, we use: $\mathfrak{L}(\boldsymbol{x}, \lambda) = f(\boldsymbol{x}) + \lambda h(\boldsymbol{x})$

At the optimum, the gradients of $f$ and $h$ are parallel: $\nabla f = -\lambda \nabla h$.

Example: Minimize $f(x,y) = x^2 + y^2$ subject to $g(x,y) = x + y - 1 = 0$¶

Analytical solution:

$\mathfrak{L} = x^2 + y^2 + \lambda(x + y - 1)$
$\partial \mathfrak{L}/\partial x = 2x + \lambda = 0 \Rightarrow x = -\lambda/2$
$\partial \mathfrak{L}/\partial y = 2y + \lambda = 0 \Rightarrow y = -\lambda/2$
$\partial \mathfrak{L}/\partial \lambda = x + y - 1 = 0$
Substituting: $-\lambda/2 - \lambda/2 = 1 \Rightarrow \lambda = -1$, so $x^* = y^* = 1/2$

# Constrained optimization: min x^2 + y^2 s.t. x + y = 1
# Analytical solution
x_opt, y_opt, lam_opt = 0.5, 0.5, -1.0
print(f"Analytical solution: x* = {x_opt}, y* = {y_opt}, lambda* = {lam_opt}")
print(f"f(x*, y*) = {x_opt**2 + y_opt**2}")
print(f"Constraint satisfied: x* + y* - 1 = {x_opt + y_opt - 1}")

# Numerical verification using gradient of the Lagrangian
# Solve the system: [2 0 1; 0 2 1; 1 1 0] [x; y; lam] = [0; 0; 1]
M = np.array([[2, 0, 1],
              [0, 2, 1],
              [1, 1, 0]], dtype=float)
rhs = np.array([0, 0, 1], dtype=float)
sol = np.linalg.solve(M, rhs)
print(f"\nNumerical solution: x = {sol[0]:.4f}, y = {sol[1]:.4f}, lambda = {sol[2]:.4f}")

# Visualization
fig, ax = plt.subplots(figsize=(8, 8))

x_grid = np.linspace(-1, 2, 300)
y_grid = np.linspace(-1, 2, 300)
X, Y = np.meshgrid(x_grid, y_grid)
Z = X**2 + Y**2

# Contour of objective
levels = np.linspace(0, 4, 20)
cs = ax.contour(X, Y, Z, levels=levels, cmap='Blues', alpha=0.7)
ax.clabel(cs, inline=True, fontsize=8)

# Constraint line: x + y = 1
x_line = np.linspace(-0.5, 1.5, 100)
y_line = 1 - x_line
ax.plot(x_line, y_line, 'r-', linewidth=3, label='Constraint: $x + y = 1$')

# Optimal point
ax.plot(x_opt, y_opt, 'r*', markersize=20, label=f'Optimum ({x_opt}, {y_opt})')

# Show gradient vectors at optimum
grad_f = np.array([2*x_opt, 2*y_opt])  # gradient of f
grad_g = np.array([1, 1])               # gradient of g
scale = 0.3
ax.arrow(x_opt, y_opt, scale*grad_f[0], scale*grad_f[1], head_width=0.04, head_length=0.02, fc='blue', ec='blue')
ax.arrow(x_opt, y_opt, scale*grad_g[0], scale*grad_g[1], head_width=0.04, head_length=0.02, fc='green', ec='green')
ax.annotate('$\\nabla f$', xy=(x_opt + scale*grad_f[0], y_opt + scale*grad_f[1] + 0.05), fontsize=14, color='blue')
ax.annotate('$\\nabla g$', xy=(x_opt + scale*grad_g[0] + 0.05, y_opt + scale*grad_g[1]), fontsize=14, color='green')

ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title('Constrained Optimization via Lagrange Multipliers\n$\\nabla f$ and $\\nabla g$ are parallel at the optimum')
ax.legend(fontsize=11); ax.grid(True, alpha=0.3)
ax.set_aspect('equal')
plt.tight_layout()
plt.show()

6. KKT Conditions¶

The Karush-Kuhn-Tucker (KKT) conditions generalize Lagrange multipliers to inequality constraints. For the problem:

\[\min_{\boldsymbol{x}} f(\boldsymbol{x}) \quad \text{s.t.} \quad g_i(\boldsymbol{x}) \leqslant 0\]

the KKT necessary conditions at a solution $\boldsymbol{x}^*$ are:

Stationarity: $\nabla f(\boldsymbol{x}^*) + \sum_i \lambda_i \nabla g_i(\boldsymbol{x}^*) = 0$
Primal feasibility: $g_i(\boldsymbol{x}^*) \leqslant 0$
Dual feasibility: $\lambda_i \geqslant 0$
Complementary slackness: $\lambda_i g_i(\boldsymbol{x}^*) = 0$ (either $\lambda_i = 0$ or the constraint is active)

Example: Minimize $f(x,y) = (x-2)^2 + (y-1)^2$¶

Subject to:

$g_1: x + y \leqslant 2$
$g_2: -x \leqslant 0$ (i.e., $x \geqslant 0$)
$g_3: -y \leqslant 0$ (i.e., $y \geqslant 0$)

The unconstrained minimum is at $(2, 1)$, where $g_1: 2+1=3 > 2$ – violated! So $g_1$ will be active at the constrained solution.

# KKT Example: min (x-2)^2 + (y-1)^2 s.t. x+y<=2, x>=0, y>=0
#
# Unconstrained min at (2,1). Check g1: 2+1=3 > 2 --> violated.
# So g1 is active at the solution. Assume g2, g3 inactive (lambda2=lambda3=0).
#
# Lagrangian: L = (x-2)^2 + (y-1)^2 + lambda1*(x+y-2)
# dL/dx = 2(x-2) + lambda1 = 0  -->  x = 2 - lambda1/2
# dL/dy = 2(y-1) + lambda1 = 0  -->  y = 1 - lambda1/2
# g1: x + y = 2  -->  (2-lambda1/2) + (1-lambda1/2) = 2  -->  3-lambda1 = 2  -->  lambda1 = 1
# So: x* = 1.5, y* = 0.5, lambda1* = 1

x_kkt = 1.5
y_kkt = 0.5
lam1 = 1.0

print("=== KKT Solution ===")
print(f"x* = {x_kkt}, y* = {y_kkt}")
print(f"f(x*,y*) = {(x_kkt-2)**2 + (y_kkt-1)**2}")
print(f"\nKKT Condition Checks:")
print(f"  1. Stationarity:")
print(f"     df/dx + lam1*dg1/dx = {2*(x_kkt-2)} + {lam1}*1 = {2*(x_kkt-2) + lam1} (should be 0)")
print(f"     df/dy + lam1*dg1/dy = {2*(y_kkt-1)} + {lam1}*1 = {2*(y_kkt-1) + lam1} (should be 0)")
print(f"  2. Primal feasibility:")
print(f"     g1: x+y-2 = {x_kkt+y_kkt-2} <= 0  (ACTIVE)")
print(f"     g2: -x    = {-x_kkt} <= 0  (inactive)")
print(f"     g3: -y    = {-y_kkt} <= 0  (inactive)")
print(f"  3. Dual feasibility: lambda1 = {lam1} >= 0")
print(f"  4. Complementary slackness:")
print(f"     lambda1 * g1 = {lam1} * {x_kkt+y_kkt-2} = {lam1*(x_kkt+y_kkt-2)}")

# Visualization
fig, ax = plt.subplots(figsize=(8, 8))

x_grid = np.linspace(-0.5, 3.5, 300)
y_grid = np.linspace(-0.5, 3.0, 300)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (X - 2)**2 + (Y - 1)**2

# Contours
cs = ax.contour(X, Y, Z, levels=15, cmap='Blues', alpha=0.7)
ax.clabel(cs, inline=True, fontsize=8)

# Feasible region
# x+y<=2, x>=0, y>=0 --> triangle with vertices (0,0), (2,0), (0,2)
triangle = plt.Polygon([[0, 0], [2, 0], [0, 2]], alpha=0.2, color='green', label='Feasible region')
ax.add_patch(triangle)
ax.plot([0, 2], [2, 0], 'g-', linewidth=2.5, label='$g_1$: $x+y=2$ (ACTIVE)')
ax.plot([0, 0], [0, 2], 'gray', linewidth=1.5, linestyle='--', label='$g_2$: $x=0$ (inactive)')
ax.plot([0, 2], [0, 0], 'gray', linewidth=1.5, linestyle='-.', label='$g_3$: $y=0$ (inactive)')

# Points
ax.plot(2, 1, 'bo', markersize=12, label='Unconstrained min (2,1)')
ax.plot(x_kkt, y_kkt, 'r*', markersize=20, label=f'KKT solution ({x_kkt},{y_kkt})')

ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title('KKT Conditions: Inequality Constrained Optimization')
ax.legend(loc='upper right', fontsize=10); ax.grid(True, alpha=0.3)
ax.set_aspect('equal')
ax.set_xlim(-0.5, 3.5); ax.set_ylim(-0.5, 3.0)
plt.tight_layout()
plt.show()

7. Convex Optimization¶

A function $f: \mathbb{R}^D \to \mathbb{R}$ is convex if for all $\boldsymbol{x}, \boldsymbol{y}$ and $0 \leqslant \theta \leqslant 1$ (MML Definition 7.3, Eq. 7.30):

\[f(\theta \boldsymbol{x} + (1-\theta)\boldsymbol{y}) \leqslant \theta f(\boldsymbol{x}) + (1-\theta) f(\boldsymbol{y})\]

Key property: For convex functions, every local minimum is a global minimum (MML p.236). This makes convex optimization problems much easier to solve.

A twice-differentiable function is convex if and only if its Hessian $\nabla^2 f(\boldsymbol{x})$ is positive semi-definite everywhere (MML p.237).

Equivalently, for a differentiable convex function (Eq. 7.31): $$f(\boldsymbol{y}) \geqslant f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^\top (\boldsymbol{y} - \boldsymbol{x})$$

(the function lies above all its tangent hyperplanes).

# Check convexity via Hessian eigenvalues

# Convex: f(x,y) = 3x^2 + 2y^2 + xy + x + 2y
# Hessian: [[6, 1], [1, 4]]
H_convex = np.array([[6, 1], [1, 4]])
eigs_convex = np.linalg.eigvalsh(H_convex)
print("=== Convex Function: f(x,y) = 3x^2 + 2y^2 + xy + x + 2y ===")
print(f"Hessian: {H_convex.tolist()}")
print(f"Eigenvalues: {eigs_convex}")
print(f"All eigenvalues >= 0? {np.all(eigs_convex >= 0)} --> {'CONVEX' if np.all(eigs_convex >= 0) else 'NOT CONVEX'}")

# Non-convex: f(x,y) = x^4 - 2x^2 + y^2 (saddle at origin)
# Hessian: [[12x^2 - 4, 0], [0, 2]] -- at origin: [[-4, 0], [0, 2]]
H_nonconvex_origin = np.array([[-4, 0], [0, 2]])
eigs_nc = np.linalg.eigvalsh(H_nonconvex_origin)
print(f"\n=== Non-convex Function: f(x,y) = x^4 - 2x^2 + y^2 ===")
print(f"Hessian at origin: {H_nonconvex_origin.tolist()}")
print(f"Eigenvalues at (0,0): {eigs_nc}")
print(f"All eigenvalues >= 0? {np.all(eigs_nc >= 0)} --> NOT CONVEX (saddle point at origin)")

# Verify convexity definition directly: f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
def f_cvx_check(xy):
    x, y = xy
    return 3*x**2 + 2*y**2 + x*y + x + 2*y

pt_x = np.array([2.0, 1.0])
pt_y = np.array([-1.0, 3.0])
theta_vals = np.linspace(0, 1, 50)
lhs = [f_cvx_check(t*pt_x + (1-t)*pt_y) for t in theta_vals]
rhs = [t*f_cvx_check(pt_x) + (1-t)*f_cvx_check(pt_y) for t in theta_vals]
print(f"\nConvexity check along line segment: f(theta*x+(1-theta)*y) <= theta*f(x)+(1-theta)*f(y)?")
print(f"Max violation: {max(np.array(lhs) - np.array(rhs)):.6f} (negative means always satisfied)")

# Visualize convex vs non-convex and their convergence behavior

# Convex function
def f_convex(xy):
    x, y = xy
    return 3*x**2 + 2*y**2 + x*y + x + 2*y

def grad_convex(xy):
    x, y = xy
    return np.array([6*x + y + 1, 4*y + x + 2])

# Non-convex function
def f_nonconvex(xy):
    x, y = xy
    return x**4 - 2*x**2 + y**2

def grad_nonconvex(xy):
    x, y = xy
    return np.array([4*x**3 - 4*x, 2*y])

# GD on convex (always finds global min)
xy = np.array([3.0, 3.0])
lr = 0.05
traj_cvx = [xy.copy()]
for _ in range(60):
    xy = xy - lr * grad_convex(xy)
    traj_cvx.append(xy.copy())
traj_cvx = np.array(traj_cvx)

# GD on non-convex from two different starting points
starts_nc = [np.array([1.5, 2.0]), np.array([-1.5, 2.0])]
trajs_nc = []
for start in starts_nc:
    xy = start.copy()
    lr = 0.01
    traj = [xy.copy()]
    for _ in range(150):
        xy = xy - lr * grad_nonconvex(xy)
        traj.append(xy.copy())
    trajs_nc.append(np.array(traj))

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Convex contour
x_g = np.linspace(-3, 4, 200)
y_g = np.linspace(-3, 4, 200)
Xg, Yg = np.meshgrid(x_g, y_g)
Zc = 3*Xg**2 + 2*Yg**2 + Xg*Yg + Xg + 2*Yg
axes[0].contour(Xg, Yg, Zc, levels=30, cmap='Blues', alpha=0.6)
axes[0].plot(traj_cvx[:, 0], traj_cvx[:, 1], 'ro-', markersize=3, linewidth=1)
axes[0].plot(traj_cvx[0, 0], traj_cvx[0, 1], 'gs', markersize=10)
axes[0].plot(traj_cvx[-1, 0], traj_cvx[-1, 1], 'r*', markersize=15)
axes[0].set_title('Convex: Single global min\nany start converges to same point')
axes[0].set_xlabel('x'); axes[0].set_ylabel('y'); axes[0].grid(True, alpha=0.3)

# Non-convex contour
x_g2 = np.linspace(-2.5, 2.5, 200)
y_g2 = np.linspace(-2, 3, 200)
Xg2, Yg2 = np.meshgrid(x_g2, y_g2)
Znc = Xg2**4 - 2*Xg2**2 + Yg2**2
axes[1].contour(Xg2, Yg2, Znc, levels=30, cmap='Reds', alpha=0.6)
colors_nc = ['blue', 'green']
for i, (traj, start, color) in enumerate(zip(trajs_nc, starts_nc, colors_nc)):
    axes[1].plot(traj[:, 0], traj[:, 1], f'{color[0]}o-', markersize=3, linewidth=1,
                label=f'Start ({start[0]},{start[1]})')
    axes[1].plot(traj[-1, 0], traj[-1, 1], '*', color=color, markersize=15)
axes[1].set_title('Non-convex: Multiple local minima\ndifferent starts reach different minima')
axes[1].set_xlabel('x'); axes[1].set_ylabel('y')
axes[1].legend(fontsize=9); axes[1].grid(True, alpha=0.3)

# Loss curves
axes[2].plot([f_convex(traj_cvx[i]) for i in range(len(traj_cvx))], 'b-', linewidth=2, label='Convex')
for i, (traj, color) in enumerate(zip(trajs_nc, colors_nc)):
    axes[2].plot([f_nonconvex(traj[j]) for j in range(len(traj))], '-', color=color, linewidth=2,
                label=f'Non-convex (start {i+1})')
axes[2].set_xlabel('Iteration'); axes[2].set_ylabel('f(x,y)')
axes[2].set_title('Convergence Comparison'); axes[2].legend(); axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nConvex: final point = ({traj_cvx[-1,0]:.4f}, {traj_cvx[-1,1]:.4f}), f = {f_convex(traj_cvx[-1]):.6f}")
for i, traj in enumerate(trajs_nc):
    print(f"Non-convex start {i+1}: final = ({traj[-1,0]:.4f}, {traj[-1,1]:.4f}), f = {f_nonconvex(traj[-1]):.6f}")

8. Newton’s Method¶

Newton’s method is a second-order optimization algorithm that uses the Hessian (the matrix of second derivatives) to achieve faster convergence. While gradient descent is a first-order method, Newton’s method incorporates curvature information (MML Section 7.4, Further Reading).

The update rule is: $$\boldsymbol{x}_{i+1} = \boldsymbol{x}_i - [\nabla^2 f(\boldsymbol{x}_i)]^{-1} \nabla f(\boldsymbol{x}_i)$$

where $\nabla^2 f(\boldsymbol{x}_i)$ is the Hessian matrix at $\boldsymbol{x}_i$.

Key properties:

Newton’s method has quadratic convergence rate near the optimum (vs. linear for GD)
It requires computing and inverting the Hessian, which is $O(D^3)$ for $D$ dimensions
For a quadratic function, Newton’s method converges in one step

8a. Univariate Newton’s Method¶

For $f(x) = x^4 - 3x^3 + 2$:

$f'(x) = 4x^3 - 9x^2$
$f''(x) = 12x^2 - 18x$
Update: $x_{i+1} = x_i - f'(x_i) / f''(x_i)$

# Newton's Method vs GD -- univariate
def f_uni(x):
    return x**4 - 3*x**3 + 2

def df_uni(x):
    return 4*x**3 - 9*x**2

def ddf_uni(x):
    return 12*x**2 - 18*x

# Newton's method
x_newton = -1.0
traj_newton = [x_newton]
for _ in range(15):
    hess = ddf_uni(x_newton)
    if abs(hess) < 1e-12:
        break
    x_newton = x_newton - df_uni(x_newton) / hess
    traj_newton.append(x_newton)
traj_newton = np.array(traj_newton)

# GD for comparison
x_gd_c = -1.0
lr = 0.01
traj_gd_c = [x_gd_c]
for _ in range(80):
    x_gd_c = x_gd_c - lr * df_uni(x_gd_c)
    traj_gd_c.append(x_gd_c)
traj_gd_c = np.array(traj_gd_c)

print("Newton's method:")
for i, x in enumerate(traj_newton):
    print(f"  Step {i}: x = {x:.8f}, f(x) = {f_uni(x):.8f}")

print(f"\nGD after 80 steps: x = {traj_gd_c[-1]:.8f}, f(x) = {f_uni(traj_gd_c[-1]):.8f}")

8b. 2D Newton’s Method on a Quadratic¶

For a quadratic $f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b}^\top \boldsymbol{x}$, the Hessian is $\nabla^2 f = \boldsymbol{A}$ (constant), so Newton’s method converges in exactly one step:

\[\boldsymbol{x}_1 = \boldsymbol{x}_0 - \boldsymbol{A}^{-1}(\boldsymbol{A}\boldsymbol{x}_0 - \boldsymbol{b}) = \boldsymbol{A}^{-1}\boldsymbol{b} = \boldsymbol{x}^*\]

# 2D Newton's method vs GD on a quadratic
A_newton = np.array([[10.0, 2.0],
                     [2.0,  5.0]])
b_newton = np.array([4.0, 3.0])
x_star_n = np.linalg.solve(A_newton, b_newton)

def f_quad_n(x):
    return 0.5 * x @ A_newton @ x - b_newton @ x

def grad_quad_n(x):
    return A_newton @ x - b_newton

# Newton's method (Hessian is constant = A for quadratic)
A_inv = np.linalg.inv(A_newton)
x_n = np.array([-3.0, -3.0])
traj_newton_2d = [x_n.copy()]
for _ in range(5):
    x_n = x_n - A_inv @ grad_quad_n(x_n)
    traj_newton_2d.append(x_n.copy())
traj_newton_2d = np.array(traj_newton_2d)

# GD
x_g = np.array([-3.0, -3.0])
lr_g = 0.05
traj_gd_2d = [x_g.copy()]
for _ in range(100):
    x_g = x_g - lr_g * grad_quad_n(x_g)
    traj_gd_2d.append(x_g.copy())
traj_gd_2d = np.array(traj_gd_2d)

print(f"True optimum: ({x_star_n[0]:.4f}, {x_star_n[1]:.4f})")
print(f"Newton after 1 step: ({traj_newton_2d[1,0]:.4f}, {traj_newton_2d[1,1]:.4f}) -- converges in ONE step for quadratics!")
print(f"GD after 100 steps: ({traj_gd_2d[-1,0]:.4f}, {traj_gd_2d[-1,1]:.4f})")

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

x1 = np.linspace(-4, 3, 200)
x2 = np.linspace(-4, 3, 200)
X1, X2 = np.meshgrid(x1, x2)
Z_n = 0.5*(10*X1**2 + 4*X1*X2 + 5*X2**2) - 4*X1 - 3*X2

for ax, traj, title in [(axes[0], traj_gd_2d, 'Gradient Descent (100 steps)'),
                         (axes[1], traj_newton_2d, "Newton's Method (1 step!)")]:
    ax.contour(X1, X2, Z_n, levels=30, cmap='viridis', alpha=0.6)
    ax.plot(traj[:, 0], traj[:, 1], 'ro-', markersize=5, linewidth=1.5)
    ax.plot(traj[0, 0], traj[0, 1], 'gs', markersize=12, label='Start')
    ax.plot(x_star_n[0], x_star_n[1], 'r*', markersize=15, label='Optimum')
    ax.set_title(title); ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
    ax.legend(); ax.grid(True, alpha=0.3)

# Convergence comparison (distance to optimum)
dist_gd = [np.linalg.norm(traj_gd_2d[i] - x_star_n) for i in range(len(traj_gd_2d))]
dist_newton = [np.linalg.norm(traj_newton_2d[i] - x_star_n) for i in range(len(traj_newton_2d))]
axes[2].semilogy(dist_gd, 'b-', linewidth=2, label='GD')
axes[2].semilogy(dist_newton, 'r-o', linewidth=2, markersize=8, label="Newton's method")
axes[2].set_xlabel('Iteration'); axes[2].set_ylabel('$||x_i - x^*||$ (log scale)')
axes[2].set_title('Convergence Rate Comparison')
axes[2].legend(); axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

8c. Newton’s Method on the Rosenbrock Function¶

On a non-quadratic function, Newton’s method does not converge in one step, but still exhibits dramatically faster convergence than gradient descent thanks to its use of curvature information.

For the Rosenbrock function, the Hessian is: $$\nabla^2 f = \begin{bmatrix} 2 - 400y + 1200x^2 & -400x \\ -400x & 200 \end{bmatrix}$$

# Newton's method on Rosenbrock (non-quadratic)
def hessian_rosenbrock(xy):
    x, y = xy
    dfdxx = 2 - 400*y + 1200*x**2
    dfdxy = -400*x
    dfdyy = 200
    return np.array([[dfdxx, dfdxy],
                     [dfdxy, dfdyy]])

# Newton on Rosenbrock
xy_nw = np.array([-1.0, 1.0])
traj_nw_rb = [xy_nw.copy()]
for _ in range(50):
    g = grad_rosenbrock(xy_nw)
    H = hessian_rosenbrock(xy_nw)
    try:
        step = np.linalg.solve(H, g)
        xy_nw = xy_nw - step
    except np.linalg.LinAlgError:
        break
    traj_nw_rb.append(xy_nw.copy())
traj_nw_rb = np.array(traj_nw_rb)

# GD on Rosenbrock for comparison
xy_gd_rb = np.array([-1.0, 1.0])
lr = 0.001
traj_gd_rb = [xy_gd_rb.copy()]
for _ in range(5000):
    xy_gd_rb = xy_gd_rb - lr * grad_rosenbrock(xy_gd_rb)
    traj_gd_rb.append(xy_gd_rb.copy())
traj_gd_rb = np.array(traj_gd_rb)

print(f"Newton on Rosenbrock: converged to ({traj_nw_rb[-1,0]:.6f}, {traj_nw_rb[-1,1]:.6f}) in {len(traj_nw_rb)-1} steps")
print(f"GD on Rosenbrock:     at ({traj_gd_rb[-1,0]:.6f}, {traj_gd_rb[-1,1]:.6f}) after {len(traj_gd_rb)-1} steps")
print(f"f_Newton = {rosenbrock(traj_nw_rb[-1]):.10f}")
print(f"f_GD     = {rosenbrock(traj_gd_rb[-1]):.10f}")

# Convergence comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Contour with both trajectories
x_grid = np.linspace(-1.5, 1.5, 300)
y_grid = np.linspace(-0.5, 2, 300)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (1 - X)**2 + 100*(Y - X**2)**2
levels = np.logspace(-1, 3, 20)
ax1.contour(X, Y, Z, levels=levels, cmap='viridis', alpha=0.5)

step_gd = max(1, len(traj_gd_rb)//50)
ax1.plot(traj_gd_rb[::step_gd, 0], traj_gd_rb[::step_gd, 1], 'b.-', markersize=2, linewidth=0.5, alpha=0.6, label=f'GD ({len(traj_gd_rb)-1} steps)')
ax1.plot(traj_nw_rb[:, 0], traj_nw_rb[:, 1], 'r.-', markersize=6, linewidth=2, label=f"Newton ({len(traj_nw_rb)-1} steps)")
ax1.plot(1, 1, 'k*', markersize=15)
ax1.set_title('Rosenbrock: GD vs Newton trajectories')
ax1.set_xlabel('$x_1$'); ax1.set_ylabel('$x_2$')
ax1.legend(); ax1.grid(True, alpha=0.3)

# Distance to optimum (log scale)
dist_gd_rb = [np.linalg.norm(traj_gd_rb[i] - np.array([1, 1])) for i in range(min(200, len(traj_gd_rb)))]
dist_nw_rb = [np.linalg.norm(traj_nw_rb[i] - np.array([1, 1])) for i in range(len(traj_nw_rb))]

ax2.semilogy(dist_gd_rb, 'b-', linewidth=2, label=f'GD (first 200 of {len(traj_gd_rb)-1} steps)')
ax2.semilogy(dist_nw_rb, 'r-o', linewidth=2, markersize=6, label=f"Newton ({len(traj_nw_rb)-1} steps)")
ax2.set_xlabel('Iteration'); ax2.set_ylabel('$||x_i - x^*||$ (log scale)')
ax2.set_title('Convergence Rate: Newton\'s quadratic rate dominates')
ax2.legend(); ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Summary¶

This lab covered the core optimization concepts from MML Chapter 7:

Method	Order	Convergence	Cost per Step	Key Property
Gradient Descent	1st	Linear	$O(D)$	Simple, requires tuning $\gamma$
GD + Momentum	1st	Accelerated	$O(D)$	Dampens oscillations in narrow valleys
SGD	1st	Sublinear	$O(B \cdot D)$	Noisy but scalable to large datasets
Newton’s Method	2nd	Quadratic	$O(D^3)$	Fast near optimum, expensive per step

Key takeaways:

Gradient descent follows the negative gradient to find local minima. The learning rate is critical – too small is slow, too large diverges.
SGD trades accuracy for speed by using mini-batches. The noise can actually help escape local minima.
Momentum gives GD memory, smoothing updates and accelerating convergence in ill-conditioned problems.
Lagrange multipliers convert constrained problems to unconstrained ones. At the optimum, $\nabla f$ and $\nabla g$ are parallel.
KKT conditions extend Lagrange multipliers to inequality constraints via complementary slackness.
Convex functions have the powerful property that all local minima are global. Convexity can be verified by checking that the Hessian is positive semi-definite.
Newton’s method uses second-order (Hessian) information for quadratic convergence, reaching the optimum in far fewer iterations than GD – but at higher per-step cost.

References¶

Deisenroth, M.P., Faisal, A.A., Ong, C.S. (2020). Mathematics for Machine Learning. Chapter 7.
Boyd, S., Vandenberghe, L. (2004). Convex Optimization.
Bottou, L. et al. (2018). Optimization Methods for Large-Scale Machine Learning.