Lab 04: Vector Calculus

Source: Mathematics for Machine Learning (Deisenroth, Faisal, Ong, 2020) – Chapter 5

Topics covered:

  1. Differentiation of univariate functions (numerical vs. analytic derivatives)

  2. Partial differentiation and gradients (gradient fields, contour plots)

  3. Gradients of vector-valued functions (Jacobian matrix)

  4. Gradients of matrices (trace, quadratic form)

  5. Useful identities for computing gradients

  6. Backpropagation and automatic differentiation (computation graphs)

  7. Higher-order derivatives (Hessian matrix, convexity)

  8. Linearization and multivariate Taylor series

Vector calculus is the mathematical backbone of optimization in machine learning. Gradients tell us the direction of steepest ascent, the Jacobian generalizes derivatives to vector-valued functions, and the Hessian captures curvature information. Backpropagation – the workhorse of deep learning – is simply the chain rule applied systematically through a computation graph.
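Since the whole chapter feeds into optimization, here is a minimal gradient-descent sketch as a preview (not from MML; the quadratic, step size, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def grad_descent(grad_f, x0, lr=0.1, steps=100):
    """Minimize f by repeatedly stepping against its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

# f(x, y) = (x - 1)^2 + 2*(y + 2)^2 has gradient [2(x-1), 4(y+2)]
grad_f = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)])
x_min = grad_descent(grad_f, [0.0, 0.0])
print(x_min)  # converges to approximately [1, -2]
```

Each iteration moves against the gradient (the steepest-descent direction), so the iterates slide down the bowl toward the minimizer.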

import numpy as np
import matplotlib.pyplot as plt
from math import factorial

np.random.seed(42)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['font.size'] = 11
print("Imports loaded.")
print(f"NumPy version: {np.__version__}")

1. Differentiation of Univariate Functions (MML 5.1)

The derivative of \(f\) at \(x\) is defined as the limit of the difference quotient (Def. 5.2):

\[\frac{\mathrm{d}f}{\mathrm{d}x} := \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}\]

Numerically, we approximate this with the central difference formula, which has \(O(h^2)\) error (vs. \(O(h)\) for the forward difference):

\[f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}\]

We will compare numerical derivatives against known analytic derivatives for \(\sin(x)\), \(\exp(x)\), and a polynomial \(f(x) = x^4 - 3x^2 + 2\).

def central_difference(f, x, h=1e-5):
    """Compute f'(x) using central difference approximation."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Define test functions and their analytic derivatives
functions = {
    'sin(x)':   (np.sin,  np.cos),
    'exp(x)':   (np.exp,  np.exp),
    'x^4 - 3x^2 + 2': (
        lambda x: x**4 - 3*x**2 + 2,
        lambda x: 4*x**3 - 6*x
    ),
}

test_points = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.5])

for name, (f, df_analytic) in functions.items():
    print(f"\n--- f(x) = {name} ---")
    print(f"{'x':>6s}  {'Numerical':>14s}  {'Analytic':>14s}  {'|Error|':>12s}")
    for x0 in test_points:
        num = central_difference(f, x0)
        ana = df_analytic(x0)
        print(f"{x0:6.2f}  {num:14.10f}  {ana:14.10f}  {abs(num - ana):12.2e}")
# Visualize: function vs. derivative for sin(x)
x = np.linspace(-2 * np.pi, 2 * np.pi, 300)
numerical_deriv = np.array([central_difference(np.sin, xi) for xi in x])

fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))

# Left: function and derivatives
axes[0].plot(x, np.sin(x), 'b-', lw=2, label='sin(x)')
axes[0].plot(x, np.cos(x), 'r--', lw=2, label="cos(x) [analytic f']")
axes[0].plot(x[::10], numerical_deriv[::10], 'go', ms=4, label="numerical f'")
axes[0].set_xlabel('x'); axes[0].set_ylabel('y')
axes[0].set_title('sin(x): Analytic vs. Numerical Derivative')
axes[0].legend(); axes[0].grid(True, alpha=0.3)

# Right: error vs. step size h
h_values = np.logspace(-15, 0, 50)
x_test = 1.0
forward_errors = [abs((np.sin(x_test + h) - np.sin(x_test)) / h - np.cos(x_test)) for h in h_values]
central_errors = [abs((np.sin(x_test + h) - np.sin(x_test - h)) / (2*h) - np.cos(x_test)) for h in h_values]

axes[1].loglog(h_values, forward_errors, 'b.-', label='Forward diff O(h)', alpha=0.7)
axes[1].loglog(h_values, central_errors, 'r.-', label='Central diff O(h^2)', alpha=0.7)
axes[1].set_xlabel('Step size h'); axes[1].set_ylabel('Absolute error')
axes[1].set_title('Approximation Error vs. Step Size')
axes[1].legend(); axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print("Note: Central difference achieves ~1e-10 accuracy at h=1e-5,")
print("while forward difference only reaches ~1e-5. Both degrade for very small h due to floating-point cancellation.")
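As an aside beyond MML: the complex-step approximation \(f'(x) \approx \mathrm{Im}\, f(x + ih)/h\) sidesteps the cancellation problem entirely, because no difference of nearly equal numbers is ever formed. A minimal sketch (valid for real-analytic functions implemented with complex-safe operations):

```python
import numpy as np

def complex_step(f, x, h=1e-20):
    """f'(x) = Im f(x + i*h) / h + O(h^2), with no subtractive cancellation."""
    return np.imag(f(x + 1j * h)) / h

# Accurate to machine precision even for an absurdly small h
print(complex_step(np.sin, 1.0), np.cos(1.0))
```

Unlike the central difference, the error does not blow up as h shrinks, which is why h = 1e-20 is perfectly usable here.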

2. Partial Differentiation and Gradients (MML 5.2)

For a function \(f : \mathbb{R}^n \to \mathbb{R}\) of \(n\) variables, the partial derivatives are obtained by differentiating with respect to one variable while keeping the others fixed (Def. 5.5):

\[\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(\mathbf{x})}{h}\]

The gradient collects all partial derivatives into a row vector (Eq. 5.40):

\[\nabla_{\mathbf{x}} f = \frac{\mathrm{d}f}{\mathrm{d}\mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}\]

The gradient points in the direction of steepest ascent of \(f\).

We will work with \(f(x, y) = x^2 y + x y^3\) from Example 5.7 (modified) and visualize the gradient field on a contour plot.

def numerical_gradient(f, x, h=1e-5):
    """Compute the gradient of f at point x using central differences.
    
    Args:
        f: function R^n -> R
        x: numpy array of shape (n,)
        h: step size
    Returns:
        gradient: numpy array of shape (n,)
    """
    n = len(x)
    grad = np.zeros(n)
    for i in range(n):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad


# f(x,y) = x^2 * y + x * y^3  (inspired by MML Example 5.7)
# Analytic: df/dx = 2xy + y^3,  df/dy = x^2 + 3xy^2
def f_2d(xy):
    x, y = xy
    return x**2 * y + x * y**3

def grad_f_2d_analytic(xy):
    x, y = xy
    return np.array([2*x*y + y**3, x**2 + 3*x*y**2])

# Test at several points
test_pts = [np.array([1.0, 2.0]), np.array([-1.0, 3.0]), np.array([2.0, -1.0])]
print("Gradient of f(x,y) = x^2*y + x*y^3")
print(f"{'Point':>14s}  {'Numerical grad':>28s}  {'Analytic grad':>28s}  {'Error norm':>12s}")
for pt in test_pts:
    num_g = numerical_gradient(f_2d, pt)
    ana_g = grad_f_2d_analytic(pt)
    err = np.linalg.norm(num_g - ana_g)
    print(f"  ({pt[0]:4.1f}, {pt[1]:4.1f})  [{num_g[0]:12.6f}, {num_g[1]:12.6f}]  [{ana_g[0]:12.6f}, {ana_g[1]:12.6f}]  {err:12.2e}")
# Visualize gradient field on contour plot
xx = np.linspace(-2, 2, 200)
yy = np.linspace(-2, 2, 200)
X, Y = np.meshgrid(xx, yy)
Z = X**2 * Y + X * Y**3

# Gradient on a coarser grid for arrows
xq = np.linspace(-1.8, 1.8, 12)
yq = np.linspace(-1.8, 1.8, 12)
XQ, YQ = np.meshgrid(xq, yq)
U = 2*XQ*YQ + YQ**3         # df/dx
V = XQ**2 + 3*XQ*YQ**2      # df/dy

fig, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(X, Y, Z, levels=30, cmap='RdBu_r', alpha=0.8)
ax.contour(X, Y, Z, levels=15, colors='k', linewidths=0.5, alpha=0.4)
plt.colorbar(contour, ax=ax, label='f(x, y)')

# Normalize arrows for visibility
magnitude = np.sqrt(U**2 + V**2)
magnitude = np.where(magnitude == 0, 1, magnitude)
ax.quiver(XQ, YQ, U/magnitude, V/magnitude, color='black', alpha=0.7, scale=25)

ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title(r'Gradient field of $f(x,y) = x^2 y + x y^3$')
plt.tight_layout()
plt.show()
print("Arrows show the gradient direction (steepest ascent) at each grid point.")
print("The gradient is perpendicular to the contour lines.")
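The steepest-ascent claim can also be checked numerically: among all unit directions \(\mathbf{u}\), the directional derivative is largest when \(\mathbf{u}\) points along the gradient. A quick sketch (the sample point and the one-degree direction grid are arbitrary choices):

```python
import numpy as np

def f(xy):
    x, y = xy
    return x**2 * y + x * y**3

p = np.array([1.0, 2.0])
grad = np.array([2*p[0]*p[1] + p[1]**3,      # df/dx = 2xy + y^3
                 p[0]**2 + 3*p[0]*p[1]**2])  # df/dy = x^2 + 3xy^2

# Directional derivatives D_u f(p) for unit directions 1 degree apart
thetas = np.linspace(0.0, 2*np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
h = 1e-6
dd = np.array([(f(p + h*u) - f(p - h*u)) / (2*h) for u in dirs])

best = dirs[np.argmax(dd)]               # direction of largest increase
print(best, grad / np.linalg.norm(grad))  # nearly parallel vectors
```

The winning direction agrees with the normalized gradient up to the one-degree grid resolution.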

3. Gradients of Vector-Valued Functions: The Jacobian (MML 5.3)

For a vector-valued function \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\), the gradient generalizes to the Jacobian matrix \(\mathbf{J} \in \mathbb{R}^{m \times n}\) (Def. 5.6, Eq. 5.58):

\[\begin{split}\mathbf{J} = \nabla_{\mathbf{x}} \mathbf{f} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\end{split}\]

The Jacobian determinant \(|\det(\mathbf{J})|\) measures how much areas/volumes are scaled by the transformation (Eq. 5.66).

We implement this for the mapping from MML Example 5.9/5.10: \(\mathbf{f}(x_1, x_2) = \begin{bmatrix} x_1^2 + x_2 \\ x_1 \cdot x_2 \end{bmatrix}\)

def numerical_jacobian(f, x, h=1e-5):
    """Compute the Jacobian matrix of f: R^n -> R^m at point x.
    
    Uses central differences. Returns J of shape (m, n).
    """
    f0 = f(x)
    n = len(x)
    m = len(f0)
    J = np.zeros((m, n))
    for j in range(n):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[j] += h
        x_minus[j] -= h
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * h)
    return J


# f: R^2 -> R^2: f(x1, x2) = [x1^2 + x2, x1*x2]
def f_vec(x):
    return np.array([x[0]**2 + x[1], x[0] * x[1]])

def jacobian_analytic(x):
    """Analytic Jacobian: [[2*x1, 1], [x2, x1]]"""
    return np.array([[2*x[0], 1.0],
                     [x[1],   x[0]]])

# Test
x_test = np.array([3.0, -1.0])
J_num = numerical_jacobian(f_vec, x_test)
J_ana = jacobian_analytic(x_test)

print(f"f(x) = [x1^2 + x2, x1*x2] at x = {x_test}")
print(f"\nNumerical Jacobian:\n{J_num}")
print(f"\nAnalytic Jacobian:\n{J_ana}")
print(f"\nMax absolute error: {np.max(np.abs(J_num - J_ana)):.2e}")
print(f"\nJacobian determinant: {np.linalg.det(J_ana):.4f}")
print(f"This means the mapping locally scales areas by |det(J)| = {abs(np.linalg.det(J_ana)):.4f}")
# Additional example: the linear mapping from MML Eq. 5.63-5.66
# f(x1, x2) = [-2*x1 + x2, x1 + x2]  =>  J = [[-2, 1], [1, 1]]
def f_linear(x):
    return np.array([-2*x[0] + x[1], x[0] + x[1]])

x_test2 = np.array([1.0, 1.0])
J_linear = numerical_jacobian(f_linear, x_test2)
print("Linear mapping f(x) = [-2*x1 + x2, x1 + x2]")
print(f"Numerical Jacobian:\n{J_linear}")
print(f"Expected (constant) Jacobian:\n{np.array([[-2, 1], [1, 1]])}")
print(f"|det(J)| = {abs(np.linalg.det(J_linear)):.1f}")
print("=> The unit square is scaled by a factor of 3 (cf. MML Eq. 5.61).")
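The area-scaling claim can be made concrete: push the unit square's corners through the linear map and measure the image parallelogram with the shoelace formula (a small sketch; the shoelace helper is our own addition, not from MML):

```python
import numpy as np

A = np.array([[-2.0, 1.0],
              [ 1.0, 1.0]])   # constant Jacobian of f(x) = A x

# Corners of the unit square, mapped through the linear transformation
corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
image = corners @ A.T

# Shoelace formula: area of the image polygon
x, y = image[:, 0], image[:, 1]
area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
print(area, abs(np.linalg.det(A)))  # both equal 3.0
```

The measured area of the mapped square equals |det(J)| exactly, because a linear map scales all areas uniformly.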

4. Gradients of Matrices (MML 5.4)

When working with matrix expressions common in ML, we often need gradients of scalar-valued matrix functions. Two important cases:

Gradient of the trace (Eq. 5.100):

\[\frac{\partial}{\partial \mathbf{X}} \mathrm{tr}(\mathbf{f}(\mathbf{X})) = \mathrm{tr}\left(\frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right)\]

For \(f(\mathbf{X}) = \mathbf{A}\mathbf{X}\), we get \(\frac{\partial}{\partial \mathbf{X}} \mathrm{tr}(\mathbf{A}\mathbf{X}) = \mathbf{A}^\top\).

Gradient of a quadratic form (Eq. 5.107):

\[\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^\top \mathbf{B} \mathbf{x} = \mathbf{x}^\top (\mathbf{B} + \mathbf{B}^\top)\]

For symmetric \(\mathbf{B}\), this simplifies to \(2\mathbf{x}^\top \mathbf{B}\).

# --- Gradient of tr(AX) ---
# tr(AX) is a scalar function of X. Its gradient w.r.t. X is A^T.
A = np.array([[1.0, 2.0], [3.0, 4.0]])

def trace_AX(X_flat):
    """tr(AX) as a function of flattened X."""
    X = X_flat.reshape(A.shape)
    return np.trace(A @ X)

# Numerical gradient w.r.t. each element of X
X_test = np.array([[1.0, 0.0], [0.0, 1.0]])
grad_num = numerical_gradient(trace_AX, X_test.flatten()).reshape(A.shape)
grad_analytic = A.T

print("=== Gradient of tr(AX) w.r.t. X ===")
print(f"A = \n{A}")
print(f"\nNumerical gradient (w.r.t. X):\n{grad_num}")
print(f"\nAnalytic gradient (A^T):\n{grad_analytic}")
print(f"Match: {np.allclose(grad_num, grad_analytic)}")

print("\n" + "="*50)

# --- Gradient of x^T B x ---
B = np.array([[2.0, 1.0], [3.0, 4.0]])  # non-symmetric

def quadratic_form(x):
    return x @ B @ x

x_test = np.array([1.0, 2.0])
grad_num_q = numerical_gradient(quadratic_form, x_test)
grad_analytic_q = x_test @ (B + B.T)   # x^T(B + B^T) as a row vector => 1D array

print("\n=== Gradient of x^T B x ===")
print(f"B = \n{B}")
print(f"x = {x_test}")
print(f"x^T B x = {quadratic_form(x_test):.4f}")
print(f"\nNumerical gradient: {grad_num_q}")
print(f"Analytic x^T(B+B^T): {grad_analytic_q}")
print(f"Match: {np.allclose(grad_num_q, grad_analytic_q)}")

# Symmetric case: B_sym = B, then gradient = 2*x^T*B
B_sym = np.array([[2.0, 1.0], [1.0, 4.0]])
grad_sym_num = numerical_gradient(lambda x: x @ B_sym @ x, x_test)
grad_sym_ana = 2 * x_test @ B_sym
print(f"\nFor symmetric B_sym, gradient = 2x^T B: {grad_sym_ana}")
print(f"Numerical:                               {grad_sym_num}")
print(f"Match: {np.allclose(grad_sym_num, grad_sym_ana)}")

5. Useful Identities for Computing Gradients (MML 5.5)

MML Section 5.5 lists identities that appear repeatedly in ML derivations. We will verify the following numerically:

  • \(\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^\top \mathbf{a} = \mathbf{a}^\top\) (Eq. 5.104)

  • \(\frac{\partial}{\partial \mathbf{x}} \mathbf{a}^\top \mathbf{x} = \mathbf{a}^\top\) (Eq. 5.105)

  • \(\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = \mathbf{x}^\top (\mathbf{A} + \mathbf{A}^\top)\) (Eq. 5.107)

  • \(\frac{\partial}{\partial \mathbf{X}} \mathbf{a}^\top \mathbf{X} \mathbf{b} = \mathbf{a}\mathbf{b}^\top\) (Eq. 5.106)

  • \(\frac{\partial}{\partial \mathbf{s}} (\mathbf{x} - \mathbf{A}\mathbf{s})^\top \mathbf{W} (\mathbf{x} - \mathbf{A}\mathbf{s}) = -2(\mathbf{x} - \mathbf{A}\mathbf{s})^\top \mathbf{W} \mathbf{A}\) (Eq. 5.108, for symmetric \(\mathbf{W}\))

# --- Identity 1: d/dx (x^T a) = a^T  (Eq. 5.104) ---
a = np.array([3.0, -1.0, 2.0])
x = np.array([1.0, 2.0, -1.0])

grad_xTa = numerical_gradient(lambda x: x @ a, x)
print("Identity 1:  d/dx (x^T a) = a^T")
print(f"  Numerical: {grad_xTa}")
print(f"  a^T:       {a}")
print(f"  Match: {np.allclose(grad_xTa, a)}\n")

# --- Identity 2: d/dx (a^T x) = a^T  (Eq. 5.105) ---
grad_aTx = numerical_gradient(lambda x: a @ x, x)
print("Identity 2:  d/dx (a^T x) = a^T")
print(f"  Numerical: {grad_aTx}")
print(f"  a^T:       {a}")
print(f"  Match: {np.allclose(grad_aTx, a)}\n")

# --- Identity 3: d/dx (x^T A x) = x^T(A + A^T)  (Eq. 5.107) ---
A3 = np.array([[2.0, 1.0, 0.5],
               [0.0, 3.0, -1.0],
               [1.0, 2.0, 4.0]])

grad_xAx = numerical_gradient(lambda x: x @ A3 @ x, x)
analytic_xAx = x @ (A3 + A3.T)
print("Identity 3:  d/dx (x^T A x) = x^T(A + A^T)")
print(f"  Numerical:    {grad_xAx}")
print(f"  x^T(A + A^T): {analytic_xAx}")
print(f"  Match: {np.allclose(grad_xAx, analytic_xAx)}\n")

# --- Identity 4: d/dX (a^T X b) = a b^T  (Eq. 5.106) ---
a4 = np.array([1.0, 2.0])
b4 = np.array([3.0, -1.0, 2.0])
X4_test = np.random.randn(2, 3)

def aTXb(X_flat):
    X = X_flat.reshape(2, 3)
    return a4 @ X @ b4

grad_aTXb = numerical_gradient(aTXb, X4_test.flatten()).reshape(2, 3)
analytic_aTXb = np.outer(a4, b4)
print("Identity 4:  d/dX (a^T X b) = a b^T")
print(f"  Numerical:\n{grad_aTXb}")
print(f"  a b^T:\n{analytic_aTXb}")
print(f"  Match: {np.allclose(grad_aTXb, analytic_aTXb)}\n")

# --- Identity 5: d/ds (x - As)^T W (x - As) = -2(x - As)^T W A  (Eq. 5.108) ---
A5 = np.random.randn(3, 2)
W5 = np.eye(3)  # symmetric W
x5 = np.array([1.0, 2.0, 3.0])
s5 = np.array([0.5, -1.0])

def loss_5108(s):
    e = x5 - A5 @ s
    return e @ W5 @ e

grad_5108_num = numerical_gradient(loss_5108, s5)
grad_5108_ana = -2 * (x5 - A5 @ s5) @ W5 @ A5
print("Identity 5:  d/ds (x - As)^T W (x - As) = -2(x - As)^T W A")
print(f"  Numerical: {grad_5108_num}")
print(f"  Analytic:  {grad_5108_ana}")
print(f"  Match: {np.allclose(grad_5108_num, grad_5108_ana)}")

6. Backpropagation and Automatic Differentiation (MML 5.6)

In deep learning, we compute gradients via backpropagation, which is reverse-mode automatic differentiation. The key idea (MML Eq. 5.119-5.121):

\[\frac{\mathrm{d}y}{\mathrm{d}x} = \frac{\mathrm{d}y}{\mathrm{d}b} \frac{\mathrm{d}b}{\mathrm{d}a} \frac{\mathrm{d}a}{\mathrm{d}x}\]

We break \(f(x) = (x+1)^2 \cdot \sin(x)\) into elementary operations and build a computation graph. Each node stores its value (forward pass) and its local derivative. The backward pass propagates gradients from output to input.

Forward pass: compute intermediate values. Backward pass: apply the chain rule from output to input (Eq. 5.145):

\[\frac{\partial f}{\partial x_i} = \sum_{x_j : x_i \in \text{Pa}(x_j)} \frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i}\]

class CompNode:
    """A node in a computation graph supporting forward and backward passes."""
    def __init__(self, value=0.0, children=None, local_grads=None, name=''):
        self.value = value
        self.grad = 0.0  # df/d(this node)
        self.children = children or []     # upstream nodes
        self.local_grads = local_grads or []  # d(this)/d(child_i)
        self.name = name
    
    def backward(self, upstream_grad=1.0):
        """Propagate gradient backward through the graph."""
        self.grad += upstream_grad
        for child, local_grad in zip(self.children, self.local_grads):
            child.backward(upstream_grad * local_grad)


def build_graph(x_val):
    """Build computation graph for f(x) = (x+1)^2 * sin(x).
    
    Decomposition:
      a = x + 1
      b = a^2
      c = sin(x)
      f = b * c
    """
    x = CompNode(value=x_val, name='x')
    
    # a = x + 1
    a = CompNode(value=x.value + 1, children=[x], local_grads=[1.0], name='a = x+1')
    
    # b = a^2
    b = CompNode(value=a.value**2, children=[a], local_grads=[2*a.value], name='b = a^2')
    
    # c = sin(x)
    c = CompNode(value=np.sin(x.value), children=[x], local_grads=[np.cos(x.value)], name='c = sin(x)')
    
    # f = b * c
    f = CompNode(value=b.value * c.value,
                 children=[b, c],
                 local_grads=[c.value, b.value],
                 name='f = b*c')
    
    return x, a, b, c, f


# Analytic derivative of f(x) = (x+1)^2 * sin(x)
# f'(x) = 2(x+1)*sin(x) + (x+1)^2*cos(x)
def f_target(x):
    return (x + 1)**2 * np.sin(x)

def df_analytic(x):
    return 2*(x+1)*np.sin(x) + (x+1)**2 * np.cos(x)


# Test at several points
print("f(x) = (x+1)^2 * sin(x)")
print(f"{'x':>6s}  {'f(x)':>10s}  {'Backprop df/dx':>16s}  {'Analytic df/dx':>16s}  {'Numerical df/dx':>16s}  {'BP Error':>10s}")

for x_val in [0.0, 0.5, 1.0, np.pi/2, 2.0, 3.0]:
    x_node, a_node, b_node, c_node, f_node = build_graph(x_val)
    f_node.backward(1.0)  # df/df = 1
    
    bp_grad = x_node.grad
    ana_grad = df_analytic(x_val)
    num_grad = central_difference(f_target, x_val)
    f_val = f_target(x_val)
    
    print(f"{x_val:6.3f}  {f_val:10.6f}  {bp_grad:16.10f}  {ana_grad:16.10f}  {num_grad:16.10f}  {abs(bp_grad - ana_grad):10.2e}")
# Visualize f(x) and its derivative
x_range = np.linspace(-3, 4, 300)
f_vals = f_target(x_range)
df_vals = df_analytic(x_range)

# Backprop gradient at sampled points
x_sample = np.linspace(-3, 4, 20)
bp_grads = []
for xv in x_sample:
    xn, _, _, _, fn = build_graph(xv)
    fn.backward(1.0)
    bp_grads.append(xn.grad)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x_range, f_vals, 'b-', lw=2, label=r'$f(x) = (x+1)^2 \sin(x)$')
ax.plot(x_range, df_vals, 'r--', lw=2, label=r"$f'(x)$ analytic")
ax.plot(x_sample, bp_grads, 'go', ms=6, label="Backprop gradient")
ax.axhline(0, color='k', lw=0.5)
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title('Backpropagation: Computation Graph Gradients')
ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Green dots (backprop) lie exactly on the analytic derivative curve.")
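MML 5.6 also distinguishes reverse-mode from forward-mode automatic differentiation, which pushes derivatives from inputs to outputs instead. A toy forward-mode sketch using dual numbers (the `Dual` class and its operator set are our own minimal illustration, not a library API):

```python
import numpy as np

class Dual:
    """Toy dual number (value, derivative) for forward-mode autodiff."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__
    def __pow__(self, n):  # integer powers only
        return Dual(self.val**n, n * self.val**(n - 1) * self.dot)

def dsin(d):
    """sin for Dual numbers: (sin v, cos v * v')."""
    return Dual(np.sin(d.val), np.cos(d.val) * d.dot)

# Same f(x) = (x + 1)^2 * sin(x), seeded with dx/dx = 1 at x = 2
x = Dual(2.0, 1.0)
fx = (x + 1)**2 * dsin(x)
print(fx.val, fx.dot)  # f(2) = 9 sin 2,  f'(2) = 6 sin 2 + 9 cos 2
```

One forward sweep yields the derivative with respect to a single input; reverse mode (backprop) instead yields derivatives with respect to all inputs in one backward sweep, which is why it dominates in deep learning.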

7. Higher-Order Derivatives: The Hessian (MML 5.7)

The Hessian matrix \(\mathbf{H}\) collects all second-order partial derivatives (Eq. 5.147):

\[\begin{split}\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}\end{split}\]

The Hessian is symmetric for twice continuously differentiable functions (Schwarz’s theorem, Eq. 5.146).

Convexity test: A twice continuously differentiable function is convex if and only if its Hessian is positive semidefinite everywhere (all eigenvalues \(\geq 0\)). If the Hessian is positive definite everywhere (all eigenvalues \(> 0\)), the function is strictly convex; the converse of the strict case does not hold (e.g. \(f(x) = x^4\) is strictly convex but \(f''(0) = 0\)).

We will compute the Hessian numerically and verify convexity for:

  • \(f_1(x, y) = x^2 + 2y^2 + xy\) (convex)

  • \(f_2(x, y) = x^2 - y^2\) (saddle, indefinite)

  • \(f_3(x, y) = x^2 + 2xy + y^3\) (MML Example 5.15-style)

def numerical_hessian(f, x, h=1e-4):
    """Compute the Hessian matrix of f: R^n -> R at point x.
    
    Uses central differences on the gradient (second-order finite differences).
    """
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += h; x_pp[j] += h
            x_pm = x.copy(); x_pm[i] += h; x_pm[j] -= h
            x_mp = x.copy(); x_mp[i] -= h; x_mp[j] += h
            x_mm = x.copy(); x_mm[i] -= h; x_mm[j] -= h
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h**2)
    return H


# Function 1: f(x,y) = x^2 + 2y^2 + xy  (convex)
def f1(xy): return xy[0]**2 + 2*xy[1]**2 + xy[0]*xy[1]
# Analytic Hessian: [[2, 1], [1, 4]]

# Function 2: f(x,y) = x^2 - y^2  (saddle)
def f2(xy): return xy[0]**2 - xy[1]**2
# Analytic Hessian: [[2, 0], [0, -2]]

# Function 3: f(x,y) = x^2 + 2xy + y^3 (from MML Example 5.15)
def f3(xy): return xy[0]**2 + 2*xy[0]*xy[1] + xy[1]**3
# Analytic Hessian: [[2, 2], [2, 6y]]  at (1,2): [[2, 2], [2, 12]]

test_funcs = [
    ('x^2 + 2y^2 + xy',  f1, np.array([1.0, 1.0]), np.array([[2, 1], [1, 4]])  ),
    ('x^2 - y^2',        f2, np.array([0.0, 0.0]), np.array([[2, 0], [0, -2]]) ),
    ('x^2 + 2xy + y^3',  f3, np.array([1.0, 2.0]), np.array([[2, 2], [2, 12]]) ),
]

for name, func, pt, H_true in test_funcs:
    H_num = numerical_hessian(func, pt)
    eigvals = np.linalg.eigvalsh(H_num)
    
    if np.all(eigvals > 0):
        convexity = "POSITIVE DEFINITE => locally strictly convex"
    elif np.all(eigvals >= 0):
        convexity = "POSITIVE SEMI-DEFINITE => locally convex"
    elif np.all(eigvals < 0):
        convexity = "NEGATIVE DEFINITE => locally strictly concave"
    else:
        convexity = "INDEFINITE => saddle point / neither convex nor concave"
    
    print(f"\n--- f(x,y) = {name} at {pt} ---")
    print(f"Numerical Hessian:\n{np.round(H_num, 4)}")
    print(f"Analytic Hessian:\n{H_true}")
    print(f"Match: {np.allclose(H_num, H_true, atol=1e-4)}")
    print(f"Eigenvalues: {eigvals}")
    print(f"Convexity: {convexity}")
    print(f"Symmetric: {np.allclose(H_num, H_num.T, atol=1e-6)}")
# Visualize the three functions as surface/contour plots
fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

xx = np.linspace(-2, 2, 100)
yy = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(xx, yy)

surfaces = [
    ('$x^2 + 2y^2 + xy$ (convex)',   X**2 + 2*Y**2 + X*Y),
    ('$x^2 - y^2$ (saddle)',          X**2 - Y**2),
    ('$x^2 + 2xy + y^3$ (mixed)',     X**2 + 2*X*Y + Y**3),
]

for ax, (title, Z) in zip(axes, surfaces):
    cs = ax.contourf(X, Y, Z, levels=25, cmap='coolwarm', alpha=0.8)
    ax.contour(X, Y, Z, levels=15, colors='k', linewidths=0.4, alpha=0.5)
    plt.colorbar(cs, ax=ax)
    ax.set_xlabel('x'); ax.set_ylabel('y')
    ax.set_title(title, fontsize=10)
    ax.set_aspect('equal')

plt.suptitle('Hessian Eigenvalues Determine Curvature', fontsize=12, y=1.02)
plt.tight_layout()
plt.show()
print("Left: bowl shape (convex). Center: saddle (indefinite Hessian). Right: mixed curvature.")
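Beyond this lab's scope but closely related: Newton's method uses Hessian curvature via the update \(\mathbf{x} \leftarrow \mathbf{x} - \mathbf{H}^{-1}\nabla f(\mathbf{x})\), and on a quadratic it reaches the minimizer in a single step. A minimal sketch (the quadratic and starting point are arbitrary illustrative choices):

```python
import numpy as np

# Convex quadratic f(x) = 0.5 x^T Q x - b^T x with minimizer Q^{-1} b
Q = np.array([[2.0, 1.0],
              [1.0, 4.0]])       # the (constant) Hessian
b = np.array([1.0, -2.0])

def grad(x):
    return Q @ x - b

x0 = np.array([5.0, 5.0])                      # arbitrary starting point
x_newton = x0 - np.linalg.solve(Q, grad(x0))   # one Newton step

print(x_newton, np.linalg.solve(Q, b))  # identical: one step suffices
```

The algebra makes this exact: \(\mathbf{x}_0 - \mathbf{Q}^{-1}(\mathbf{Q}\mathbf{x}_0 - \mathbf{b}) = \mathbf{Q}^{-1}\mathbf{b}\) regardless of the start.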

8. Linearization and Taylor Series (MML 5.1.1, 5.8)

The Taylor polynomial of degree \(n\) of \(f : \mathbb{R} \to \mathbb{R}\) at \(x_0\) is (Def. 5.3, Eq. 5.7):

\[T_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k\]

The Taylor expansion provides a local polynomial approximation. Higher-order terms improve accuracy farther from \(x_0\).

Multivariate Taylor series at \(\mathbf{x}_0\) (Def. 5.7, Eq. 5.151):

\[f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top \boldsymbol{\delta} + \frac{1}{2} \boldsymbol{\delta}^\top \mathbf{H}(\mathbf{x}_0) \boldsymbol{\delta} + \ldots\]

where \(\boldsymbol{\delta} = \mathbf{x} - \mathbf{x}_0\) and \(\mathbf{H}\) is the Hessian.

We implement univariate Taylor expansion for \(\exp(x)\) and \(\sin(x)\), following MML Examples 5.3 and 5.4.

def taylor_coefficients(f, x0, n):
    """Compute Taylor coefficients f^(k)(x0)/k! for k = 0, ..., n.
    
    Uses the central finite-difference stencil
        f^(k)(x0) ~ h^{-k} * sum_j (-1)^(k-j) C(k,j) f(x0 + (j - k/2) h).
    High-order differences amplify floating-point rounding error by
    roughly eps / h^k, so a fixed small step (e.g. h = 1e-4) produces
    garbage beyond k ~ 3. We therefore grow the step with the order,
    h ~ eps^(1/(k+2)), balancing the O(h^2) truncation error against
    rounding; coefficients beyond k ~ 8 still carry visible noise.
    """
    eps = np.finfo(float).eps
    coeffs = np.zeros(n + 1)
    
    # f^(0) = f(x0)
    coeffs[0] = f(x0)
    
    for k in range(1, n + 1):
        h = eps ** (1.0 / (k + 2))   # order-dependent step size
        deriv_k = 0.0
        for j in range(k + 1):
            sign = (-1) ** (k - j)
            binom = factorial(k) // (factorial(j) * factorial(k - j))
            deriv_k += sign * binom * f(x0 + (j - k/2) * h)
        deriv_k /= h ** k
        coeffs[k] = deriv_k / factorial(k)
    
    return coeffs


def taylor_polynomial(coeffs, x0, x):
    """Evaluate Taylor polynomial with given coefficients at points x."""
    result = np.zeros_like(x, dtype=float)
    for k, ck in enumerate(coeffs):
        result += ck * (x - x0) ** k
    return result


# --- Taylor expansion of exp(x) around x0=0 ---
x0 = 0.0
orders = [1, 2, 4, 8]
x_plot = np.linspace(-3, 3, 300)

# For exp, the exact Taylor coefficients are 1/k!
# so we can compare numerical coefficients to known values
max_order = max(orders)
coeffs_exp = taylor_coefficients(np.exp, x0, max_order)

print("Taylor coefficients for exp(x) at x0 = 0:")
print(f"{'k':>3s}  {'Numerical':>12s}  {'Exact (1/k!)':>12s}")
for k in range(min(max_order + 1, 9)):
    exact = 1.0 / factorial(k)
    print(f"{k:3d}  {coeffs_exp[k]:12.8f}  {exact:12.8f}")
# Visualize Taylor approximations of exp(x)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# --- exp(x) ---
ax = axes[0]
ax.plot(x_plot, np.exp(x_plot), 'k-', lw=2.5, label='$e^x$')
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
for order, color in zip(orders, colors):
    coeffs = coeffs_exp[:order+1]
    T_vals = taylor_polynomial(coeffs, x0, x_plot)
    ax.plot(x_plot, T_vals, '--', color=color, lw=1.5, label=f'$T_{{{order}}}$')

ax.set_ylim(-2, 15)
ax.axvline(x0, color='gray', ls=':', alpha=0.5)
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title(f'Taylor Approximations of $e^x$ at $x_0 = {x0}$')
ax.legend(loc='upper left'); ax.grid(True, alpha=0.3)

# --- sin(x) around x0=0 (Maclaurin series) ---
ax = axes[1]
x0_sin = 0.0
orders_sin = [1, 3, 5, 9]
coeffs_sin = taylor_coefficients(np.sin, x0_sin, max(orders_sin))

ax.plot(x_plot, np.sin(x_plot), 'k-', lw=2.5, label='$\\sin(x)$')
for order, color in zip(orders_sin, colors):
    T_vals = taylor_polynomial(coeffs_sin[:order+1], x0_sin, x_plot)
    ax.plot(x_plot, T_vals, '--', color=color, lw=1.5, label=f'$T_{{{order}}}$')

ax.set_ylim(-2, 2)
ax.axvline(x0_sin, color='gray', ls=':', alpha=0.5)
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title(f'Taylor Approximations of $\\sin(x)$ at $x_0 = {x0_sin}$')
ax.legend(loc='upper left'); ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print("Higher-order Taylor polynomials approximate the function over a wider range.")
print("The approximation is exact at x0 and degrades as |x - x0| increases.")
# Taylor expansion of sin(x) + cos(x) at x0 = 0
# This is MML Example 5.4 (Eq. 5.19-5.25)
def f_sincos(x):
    return np.sin(x) + np.cos(x)

x0_sc = 0.0
orders_sc = [0, 1, 5, 10]
x_plot_sc = np.linspace(-4, 4, 400)
coeffs_sc = taylor_coefficients(f_sincos, x0_sc, max(orders_sc))

# Known coefficients from MML: 1, 1, -1/2!, -1/3!, 1/4!, 1/5!, ...
print("Taylor coefficients for sin(x)+cos(x) at x0=0:")
for k in range(min(11, len(coeffs_sc))):
    print(f"  c_{k} = {coeffs_sc[k]:10.6f}")

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x_plot_sc, f_sincos(x_plot_sc), 'k-', lw=2.5, label='$f = \\sin(x) + \\cos(x)$')
styles = ['b--', 'r--', 'g--', 'm--']
for order, style in zip(orders_sc, styles):
    T = taylor_polynomial(coeffs_sc[:order+1], x0_sc, x_plot_sc)
    ax.plot(x_plot_sc, T, style, lw=1.5, label=f'$T_{{{order}}}$')

ax.set_ylim(-3, 5)
ax.axvline(x0_sc, color='gray', ls=':', alpha=0.5)
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title('MML Example 5.4: Taylor Polynomials of $\\sin(x) + \\cos(x)$')
ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("T_10 already closely matches f(x) on [-4, 4], as shown in MML Figure 5.4.")
# Multivariate Taylor expansion: f(x,y) = x^2 + 2xy + y^3 at (1,2)
# This is MML Example 5.15 (Eq. 5.161-5.180)
def f_mv(xy):
    x, y = xy
    return x**2 + 2*x*y + y**3

x0_mv = np.array([1.0, 2.0])
f_x0 = f_mv(x0_mv)  # = 1 + 4 + 8 = 13

# Gradient at (1,2): [2x + 2y, 2x + 3y^2] = [6, 14]
grad_x0 = numerical_gradient(f_mv, x0_mv)

# Hessian at (1,2): [[2, 2], [2, 6y]] = [[2, 2], [2, 12]]
H_x0 = numerical_hessian(f_mv, x0_mv)

print("Multivariate Taylor expansion of f(x,y) = x^2 + 2xy + y^3")
print(f"Expansion point: x0 = {x0_mv}")
print(f"f(x0) = {f_x0:.1f}")
print(f"Gradient at x0: {grad_x0}  (expected: [6, 14])")
print(f"Hessian at x0:\n{np.round(H_x0, 2)}")
print(f"Expected Hessian: [[2, 2], [2, 12]]")

# Test the second-order Taylor approximation at a nearby point
x_test_mv = np.array([1.2, 2.3])
delta = x_test_mv - x0_mv

# T_2(x) = f(x0) + grad^T * delta + 0.5 * delta^T * H * delta
T0 = f_x0
T1 = T0 + grad_x0 @ delta
T2 = T1 + 0.5 * delta @ H_x0 @ delta

f_exact = f_mv(x_test_mv)
print(f"\nTest point: {x_test_mv}")
print(f"  f(x) exact:      {f_exact:.6f}")
print(f"  T_0 (constant):  {T0:.6f}   error = {abs(f_exact - T0):.6f}")
print(f"  T_1 (linear):    {T1:.6f}   error = {abs(f_exact - T1):.6f}")
print(f"  T_2 (quadratic): {T2:.6f}   error = {abs(f_exact - T2):.6f}")
print(f"\nThe remaining error in T_2 comes from the cubic term y^3.")
print(f"For this polynomial, T_3 would be exact (cf. MML Eq. 5.180c).")

Summary

This lab covered the core concepts of vector calculus from MML Chapter 5:

  • Univariate derivatives: central difference approximation, \(O(h^2)\) accuracy (MML 5.1)

  • Partial derivatives & gradients: gradient = row vector of partial derivatives, direction of steepest ascent (MML 5.2)

  • Jacobian: matrix of partial derivatives for vector-valued functions \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\) (MML 5.3)

  • Matrix gradients: \(\nabla_\mathbf{x}(\mathbf{x}^\top\mathbf{Bx}) = \mathbf{x}^\top(\mathbf{B}+\mathbf{B}^\top)\), trace identities (MML 5.4, 5.5)

  • Backpropagation: reverse-mode autodiff via computation graphs, chain rule (MML 5.6)

  • Hessian: second-order derivatives, convexity via eigenvalue signs (MML 5.7)

  • Taylor series: local polynomial approximation, multivariate generalization (MML 5.1.1, 5.8)

Key takeaways for ML:

  • Gradient descent relies on \(\nabla f\) to iteratively minimize loss functions.

  • Backpropagation is just the chain rule applied to a computation graph – each node computes a local derivative, and gradients flow backward.

  • The Hessian tells us about curvature: positive definite = convex (unique minimum), indefinite = saddle point (common in neural networks).

  • Taylor series underlie many approximations in ML: linearization for Gauss-Newton, second-order methods (Newton’s method), and the Laplace approximation.