Lab 05: Transformers & Attention

Source: Speech and Language Processing (3rd ed.) by Dan Jurafsky & James H. Martin – Chapter 8: Transformers

PDF: resources/ed3book_jan26.pdf

Topics covered:

  1. Scaled dot-product attention

  2. Causal (masked) attention

  3. Multi-head attention

  4. Sinusoidal positional encoding

  5. Transformer blocks (pre-norm variant)

  6. Mini transformer language model

import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple

%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["font.size"] = 11

PART 1: Dot-Product Attention

Reference: SLP Ch 8.1 (Equations 8.10–8.14)

The core idea of the transformer is attention: computing a contextual representation for each token by selectively attending to and integrating information from other tokens.

Each input embedding plays three roles via learned projections:

Role | Matrix | Purpose
Query \(\mathbf{q}_i = \mathbf{x}_i \mathbf{W}^Q\) | \(\mathbf{W}^Q \in \mathbb{R}^{d \times d_k}\) | The current element being compared
Key \(\mathbf{k}_j = \mathbf{x}_j \mathbf{W}^K\) | \(\mathbf{W}^K \in \mathbb{R}^{d \times d_k}\) | A preceding input used to determine similarity
Value \(\mathbf{v}_j = \mathbf{x}_j \mathbf{W}^V\) | \(\mathbf{W}^V \in \mathbb{R}^{d \times d_v}\) | The content that gets weighted and summed

Scaled dot-product attention (Eq. 8.33):

\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}}\right) \mathbf{V}\]

The scaling by \(\sqrt{d_k}\) prevents the dot products from growing too large (which would push softmax into regions with tiny gradients).
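To see why, here is a quick numerical sketch (my own illustration, not from the text; the sample size of 10,000 is arbitrary). For queries and keys with i.i.d. unit-variance entries, the dot product \(\mathbf{q} \cdot \mathbf{k}\) has variance \(d_k\), so raw scores grow with dimension, while dividing by \(\sqrt{d_k}\) restores unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# For q, k with i.i.d. unit-variance entries, Var(q . k) = d_k, so raw
# dot products grow with dimension; dividing by sqrt(d_k) restores
# unit variance regardless of d_k.
for d_k in [4, 64, 1024]:
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = np.sum(q * k, axis=1)            # 10,000 sample dot products
    print(f"d_k={d_k:5d}  std(raw)={raw.std():7.2f}  "
          f"std(raw/sqrt(d_k))={(raw / np.sqrt(d_k)).std():.2f}")
```

The unscaled standard deviation tracks \(\sqrt{d_k}\) (roughly 2, 8, 32 here), which is exactly the growth the scaling factor cancels.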

def softmax(z: np.ndarray) -> np.ndarray:
    """
    Numerically stable row-wise softmax.
    
    Subtracting the row-max before exponentiation prevents overflow
    (exp of very large numbers) while producing identical results,
    since softmax is shift-invariant: softmax(z) == softmax(z - c).
    
    Args:
        z: array of shape (..., n) -- logits
    Returns:
        array of same shape with each row summing to 1
    """
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)


# --- Quick sanity check ---
test_z = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])  # large values to test stability
test_probs = softmax(test_z)
print("Softmax sanity check:")
print(f"  Input row 1: {test_z[0]}  ->  {test_probs[0].round(4)}")
print(f"  Input row 2: {test_z[1]}  ->  {test_probs[1].round(4)}  (large values, still stable)")
print(f"  Row sums: {test_probs.sum(axis=1).round(6)}")
def scaled_dot_product_attention(
    Q: np.ndarray, K: np.ndarray, V: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute scaled dot-product attention (SLP Eq. 8.33).
    
    Args:
        Q: queries,  shape (seq_len, d_k)
        K: keys,     shape (seq_len, d_k)
        V: values,   shape (seq_len, d_v)
    Returns:
        output:            shape (seq_len, d_v)
        attention_weights: shape (seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute raw scores  Q K^T  -> (seq_len, seq_len)
    scores = Q @ K.T / np.sqrt(d_k)
    
    # Step 2: Softmax to get attention weights (each row sums to 1)
    weights = softmax(scores)
    
    # Step 3: Weighted sum of values
    output = weights @ V
    
    return output, weights
# --- Demo: scaled dot-product attention with random Q, K, V ---
np.random.seed(42)
seq_len, d_k = 4, 8

Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)

print("Scaled Dot-Product Attention Demo")
print("=" * 40)
print(f"  Q shape: {Q.shape}")
print(f"  K shape: {K.shape}")
print(f"  V shape: {V.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Attention weights shape: {weights.shape}")
print(f"\nAttention weights (row sums should be 1.0):")
print(f"  {weights.sum(axis=1).round(6)}")
print(f"\nAttention weight matrix:")
print(np.array2string(weights, precision=4, suppress_small=True))
# --- Visualize attention weights as a heatmap ---
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(weights, cmap="Blues", vmin=0, vmax=1)
ax.set_xlabel("Key position")
ax.set_ylabel("Query position")
ax.set_title("Scaled Dot-Product Attention Weights")
ax.set_xticks(range(seq_len))
ax.set_yticks(range(seq_len))

# Annotate cells with weight values
for i in range(seq_len):
    for j in range(seq_len):
        ax.text(j, i, f"{weights[i, j]:.2f}",
                ha="center", va="center", fontsize=10,
                color="white" if weights[i, j] > 0.5 else "black")

plt.colorbar(im, ax=ax, label="Attention weight")
plt.tight_layout()
plt.show()

PART 2: Causal (Masked) Attention

Reference: SLP Ch 8.3 (Masking out the future, Eq. 8.33–8.34, Fig. 8.10)

For causal (left-to-right) language modeling, each position \(i\) should only attend to positions \(j \le i\) (the current token and past tokens). Attending to future tokens would be cheating – you cannot condition on words you have not yet seen.

We enforce this by adding a mask matrix \(\mathbf{M}\) to the scores before softmax:

\[\begin{split}M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}\end{split}\]

Since \(\exp(-\infty) = 0\), the softmax assigns zero weight to all future positions: the upper-triangular portion of the attention-weight matrix ends up zeroed out (Fig. 8.10).
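A minimal standalone illustration of why the mask works (a sketch with a hand-rolled stable softmax; the score values are arbitrary):

```python
import numpy as np

# Masked positions receive -inf before the softmax; exp(-inf) == 0.0
# exactly, so they get zero probability and the rest renormalize.
scores = np.array([2.0, 1.0, -np.inf, -np.inf])
shifted = scores - scores.max()            # numerically stable softmax
probs = np.exp(shifted) / np.exp(shifted).sum()
print(probs)  # last two entries are exactly 0.0
```

The implementation below substitutes the large finite value \(-10^9\) for \(-\infty\), a common practical choice; the resulting future weights are numerically (though not exactly) zero.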

def causal_attention(
    Q: np.ndarray, K: np.ndarray, V: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Causal (masked) scaled dot-product attention.
    Prevents position i from attending to any position j > i.
    
    Args:
        Q: queries,  shape (seq_len, d_k)
        K: keys,     shape (seq_len, d_k)
        V: values,   shape (seq_len, d_v)
    Returns:
        output:            shape (seq_len, d_v)
        attention_weights: shape (seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    seq_len = Q.shape[0]
    
    # Compute raw scores
    scores = Q @ K.T / np.sqrt(d_k)
    
    # Create causal mask: upper triangle (k=1 means above the diagonal) = -inf
    # np.triu with k=1 gives 1s above the main diagonal, 0s elsewhere
    mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)
    scores = scores + mask
    
    # Softmax (future positions get exp(-1e9) ~ 0 weight)
    weights = softmax(scores)
    
    # Weighted sum of values
    output = weights @ V
    
    return output, weights
# --- Verify causal attention properties ---
np.random.seed(42)
seq_len, d = 5, 4
Q = np.random.randn(seq_len, d)
K = np.random.randn(seq_len, d)
V = np.random.randn(seq_len, d)

output_c, weights_c = causal_attention(Q, K, V)

print("Causal Attention Verification")
print("=" * 50)

# Check 1: Position i only attends to positions 0..i
print("\n1. Future weights should be ~0 (upper triangle):")
for i in range(seq_len):
    future_weights = weights_c[i, i+1:]
    past_weights = weights_c[i, :i+1]
    print(f"   Row {i}: attends to 0..{i} with weights {past_weights.round(4)}"
          f"  |  future weights = {future_weights.round(10)}")

# Check 2: All future weights are effectively zero
upper_triangle = np.triu(weights_c, k=1)
print(f"\n2. Max weight assigned to any future position: {upper_triangle.max():.2e}")

# Check 3: Rows still sum to 1
print(f"\n3. Row sums (should all be 1.0): {weights_c.sum(axis=1).round(6)}")
# --- Visualize causal attention weights ---
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Full (non-causal) attention for comparison
_, weights_full = scaled_dot_product_attention(Q, K, V)

for ax, w, title in zip(axes,
                        [weights_full, weights_c],
                        ["Full (Bidirectional) Attention", "Causal (Masked) Attention"]):
    im = ax.imshow(w, cmap="Blues", vmin=0, vmax=1)
    ax.set_xlabel("Key position")
    ax.set_ylabel("Query position")
    ax.set_title(title)
    ax.set_xticks(range(seq_len))
    ax.set_yticks(range(seq_len))
    for i in range(seq_len):
        for j in range(seq_len):
            ax.text(j, i, f"{w[i, j]:.2f}",
                    ha="center", va="center", fontsize=9,
                    color="white" if w[i, j] > 0.5 else "black")
    plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

plt.suptitle("Causal mask zeros out the upper triangle", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

PART 3: Multi-Head Attention

Reference: SLP Ch 8.1 (Equations 8.15–8.20), Ch 8.3 (Equations 8.35–8.37, Fig. 8.6)

A single attention head can only capture one kind of relationship. Transformers use multiple attention heads (\(A\) heads) in parallel, each with its own projection matrices, allowing the model to attend to different aspects simultaneously (e.g., syntactic vs. semantic).

For each head \(c\) (\(1 \le c \le A\)):

\[\mathbf{Q}^c = \mathbf{X} \mathbf{W}^{Qc}, \quad \mathbf{K}^c = \mathbf{X} \mathbf{W}^{Kc}, \quad \mathbf{V}^c = \mathbf{X} \mathbf{W}^{Vc}\]
\[\text{head}_c = \text{softmax}\!\left(\text{mask}\!\left(\frac{\mathbf{Q}^c {\mathbf{K}^c}^T}{\sqrt{d_k}}\right)\right) \mathbf{V}^c\]

The heads are concatenated and projected back to model dimension \(d\):

\[\text{MultiHead}(\mathbf{X}) = (\text{head}_1 \oplus \text{head}_2 \oplus \cdots \oplus \text{head}_A) \, \mathbf{W}^O\]

With \(d_k = d_v = d / A\), the output projection \(\mathbf{W}^O \in \mathbb{R}^{d \times d}\) maps the concatenated heads back to shape \((N, d)\).
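Before the implementation, a standalone sketch of the reshape bookkeeping (my own illustration, not from the text): packing all heads into one \((\text{seq\_len}, d)\) matrix, splitting it into \(A\) heads of width \(d_k = d / A\), and merging back is lossless:

```python
import numpy as np

# Split a packed (seq_len, d) matrix into A heads of width d_k = d // A,
# then merge back: the round trip is exact.
seq_len, d, A = 6, 32, 4
d_k = d // A
X = np.arange(seq_len * d, dtype=float).reshape(seq_len, d)

heads = X.reshape(seq_len, A, d_k).transpose(1, 0, 2)   # (A, seq_len, d_k)
merged = heads.transpose(1, 0, 2).reshape(seq_len, d)   # (seq_len, d)

assert heads.shape == (A, seq_len, d_k)
assert np.array_equal(merged, X)   # split/merge is an exact inverse
print("split:", heads.shape, " merge:", merged.shape)
```

This is the same transpose/reshape dance `MultiHeadAttention.forward` performs below; no information is lost or mixed between heads during the reshuffle.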

class MultiHeadAttention:
    """Multi-head self-attention (SLP Eqs. 8.35--8.37)."""

    def __init__(self, d_model: int, n_heads: int):
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # d_k = d_v = d / A

        # Initialize projection matrices with small random weights
        # W_Q, W_K, W_V: (d_model, d_model)  --  all heads packed together
        # W_O: (d_model, d_model)  --  output projection
        scale = 0.02
        self.W_Q = np.random.randn(d_model, d_model) * scale
        self.W_K = np.random.randn(d_model, d_model) * scale
        self.W_V = np.random.randn(d_model, d_model) * scale
        self.W_O = np.random.randn(d_model, d_model) * scale

    def forward(self, X: np.ndarray, causal: bool = False) -> np.ndarray:
        """
        Multi-head attention forward pass.
        
        Args:
            X: input of shape (seq_len, d_model)
            causal: if True, apply causal mask
        Returns:
            output of shape (seq_len, d_model)
        """
        seq_len = X.shape[0]
        
        # Project to Q, K, V -- shape (seq_len, d_model)
        Q = X @ self.W_Q
        K = X @ self.W_K
        V = X @ self.W_V
        
        # Reshape into separate heads:
        # (seq_len, d_model) -> (seq_len, n_heads, d_k) -> (n_heads, seq_len, d_k)
        Q = Q.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        K = K.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        V = V.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        
        # Compute attention for each head
        # scores: (n_heads, seq_len, seq_len)
        scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(self.d_k)
        
        # Apply causal mask if needed
        if causal:
            mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)
            scores = scores + mask  # broadcasts over head dimension
        
        # Softmax over keys (last axis)
        weights = softmax(scores)  # (n_heads, seq_len, seq_len)
        
        # Weighted sum of values
        head_outputs = np.matmul(weights, V)  # (n_heads, seq_len, d_k)
        
        # Concatenate heads:
        # (n_heads, seq_len, d_k) -> (seq_len, n_heads, d_k) -> (seq_len, d_model)
        concat = head_outputs.transpose(1, 0, 2).reshape(seq_len, self.d_model)
        
        # Output projection
        output = concat @ self.W_O
        
        return output
# --- Demo: multi-head attention ---
np.random.seed(42)
seq_len = 6
d_model = 32
n_heads = 4

mha = MultiHeadAttention(d_model, n_heads)
X = np.random.randn(seq_len, d_model)

out_mha = mha.forward(X, causal=False)
out_mha_causal = mha.forward(X, causal=True)

print("Multi-Head Attention Demo")
print("=" * 40)
print(f"  Input shape:  {X.shape}  (seq_len={seq_len}, d_model={d_model})")
print(f"  n_heads: {n_heads}, d_k = d_model/n_heads = {d_model // n_heads}")
print(f"  Output shape (full):   {out_mha.shape}")
print(f"  Output shape (causal): {out_mha_causal.shape}")
print(f"\n  W_Q shape: {mha.W_Q.shape}")
print(f"  W_K shape: {mha.W_K.shape}")
print(f"  W_V shape: {mha.W_V.shape}")
print(f"  W_O shape: {mha.W_O.shape}")

PART 4: Positional Encoding

Reference: SLP Ch 8.4 (Embeddings for Token and Position)

Since attention is permutation-equivariant (it has no inherent notion of position), we must inject positional information. The original transformer (Vaswani et al., 2017) uses sinusoidal positional encodings – a deterministic function that maps each position to a unique vector:

\[\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)\]
\[\text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)\]

Key properties:

  • Each dimension oscillates at a different frequency (low dims = fast, high dims = slow).

  • The dot product \(\text{PE}(\text{pos}) \cdot \text{PE}(\text{pos}+k)\) depends primarily on the relative distance \(k\), enabling the model to learn relative position patterns.

  • Nearby positions have more similar encodings than distant ones.
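One way to see the relative-position property concretely (a sketch going slightly beyond the bullets above; the frequency `w` and offset `k` are arbitrary choices): for each sin/cos pair, the encoding at position \(\text{pos}+k\) is a fixed rotation of the encoding at \(\text{pos}\), independent of \(\text{pos}\), which is what makes relative offsets linearly accessible to the model.

```python
import numpy as np

# For one sin/cos pair at frequency w, the pair at position pos+k is a
# fixed 2-D rotation of the pair at pos -- the rotation depends only on
# the offset k, not on pos:
#   [sin(w(pos+k)), cos(w(pos+k))] = R(w*k) @ [sin(w*pos), cos(w*pos)]
w, k = 0.05, 7
R = np.array([[ np.cos(w * k), np.sin(w * k)],
              [-np.sin(w * k), np.cos(w * k)]])

for pos in [0, 3, 11, 40]:
    pe_pos = np.array([np.sin(w * pos), np.cos(w * pos)])
    pe_shifted = np.array([np.sin(w * (pos + k)), np.cos(w * (pos + k))])
    assert np.allclose(R @ pe_pos, pe_shifted)

print("PE(pos+k) = R(w*k) @ PE(pos) for every pos tested")
```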

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """
    Generate sinusoidal positional encodings.
    
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    
    Args:
        max_len: maximum sequence length
        d_model: model dimensionality
    Returns:
        PE matrix of shape (max_len, d_model)
    """
    PE = np.zeros((max_len, d_model))
    
    # Position indices: column vector (max_len, 1)
    position = np.arange(max_len)[:, np.newaxis]
    
    # Compute the division term: 1 / 10000^(2i/d_model)
    # Using exp-log trick for numerical stability:
    #   10000^(2i/d) = exp(2i * ln(10000) / d)
    #   1/10000^(2i/d) = exp(-2i * ln(10000) / d)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    # Even indices: sin, Odd indices: cos
    PE[:, 0::2] = np.sin(position * div_term)
    PE[:, 1::2] = np.cos(position * div_term)
    
    return PE


# --- Quick check ---
PE_check = positional_encoding(10, 8)
print(f"PE shape: {PE_check.shape}")
print(f"PE[0, :8] = {PE_check[0].round(4)}")
print(f"PE[1, :8] = {PE_check[1].round(4)}")
print(f"PE[0] should start with sin(0)=0, cos(0)=1: [{PE_check[0,0]:.1f}, {PE_check[0,1]:.1f}, ...]")
# --- Plot PE matrix as a heatmap ---
PE = positional_encoding(50, 64)

fig, ax = plt.subplots(figsize=(10, 5))
im = ax.imshow(PE, aspect="auto", cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xlabel("Encoding dimension $i$")
ax.set_ylabel("Position")
ax.set_title("Sinusoidal Positional Encoding  (max_len=50, d_model=64)")
plt.colorbar(im, ax=ax, label="PE value")
plt.tight_layout()
plt.show()

print("Low dimensions (left) oscillate rapidly; high dimensions (right) oscillate slowly.")
print("Each row (position) gets a unique fingerprint.")
# --- Dot product captures relative position ---
# Cosine similarity between PE[0] and PE[k] for various k.
# Closer positions should be more similar.

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


PE_large = positional_encoding(50, 64)

print("Cosine similarity between PE[0] and PE[k]:")
print("-" * 45)
for k in [1, 5, 10, 25]:
    sim = cosine_similarity(PE_large[0], PE_large[k])
    print(f"  k = {k:2d}  ->  cosine_sim = {sim:.4f}")

print("\nCloser positions have higher similarity, confirming that")
print("the encoding captures relative position information.")

# Plot similarity curve
distances = np.arange(1, 50)
sims = [cosine_similarity(PE_large[0], PE_large[k]) for k in distances]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(distances, sims, "o-", markersize=3)
ax.set_xlabel("Relative distance $k$")
ax.set_ylabel("Cosine similarity PE[0] vs PE[k]")
ax.set_title("Positional Encoding: Similarity Decreases with Distance")
ax.axhline(y=0, color="gray", linestyle="--", alpha=0.5)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

PART 5: Transformer Block

Reference: SLP Ch 8.2 (Transformer Blocks, Eqs. 8.21–8.31, 8.38–8.45, Fig. 8.7)

A transformer block combines self-attention with two other key components via a residual stream architecture (pre-norm variant, Eqs. 8.40–8.45):

\[\mathbf{T}^1 = \text{LayerNorm}(\mathbf{X})\]
\[\mathbf{T}^2 = \text{MultiHeadAttention}(\mathbf{T}^1)\]
\[\mathbf{T}^3 = \mathbf{T}^2 + \mathbf{X} \quad \text{(residual connection)}\]
\[\mathbf{T}^4 = \text{LayerNorm}(\mathbf{T}^3)\]
\[\mathbf{T}^5 = \text{FFN}(\mathbf{T}^4)\]
\[\mathbf{H} = \mathbf{T}^5 + \mathbf{T}^3 \quad \text{(residual connection)}\]

where the feedforward network (Eq. 8.21) is:

\[\text{FFN}(\mathbf{x}) = \text{ReLU}(\mathbf{x} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2\]

and layer normalization (Eq. 8.25) normalizes across the feature dimension:

\[\text{LayerNorm}(\mathbf{x}) = \gamma \frac{\mathbf{x} - \mu}{\sigma} + \beta\]
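Before implementing the pieces, a back-of-envelope parameter count for one block (my own sketch; \(d = 512\) is an arbitrary example, and attention biases are omitted to match the NumPy implementation in this lab):

```python
# Rough parameter count for one pre-norm block with d = 512 and
# d_ff = 4 * d (attention projection biases omitted, matching the
# implementation below).
d, d_ff = 512, 4 * 512

attn = 4 * d * d                        # W_Q, W_K, W_V, W_O
ffn = d * d_ff + d_ff + d_ff * d + d    # W1, b1, W2, b2
ln = 2 * (2 * d)                        # two LayerNorms: gamma + beta each

total = attn + ffn + ln
print(f"attention: {attn:,}   ffn: {ffn:,}   layer norm: {ln:,}")
print(f"total: {total:,}   ffn share: {ffn / total:.0%}")
```

With \(d_{ff} = 4d\) the FFN holds roughly two thirds of the block's parameters; the layer norms are negligible.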
def layer_norm(
    x: np.ndarray,
    gamma: np.ndarray = None,
    beta: np.ndarray = None,
    eps: float = 1e-5
) -> np.ndarray:
    """
    Layer normalization (SLP Eq. 8.25).
    Normalizes over the last (feature) dimension, then applies
    learnable affine transform gamma * x_norm + beta.
    
    Args:
        x: input of shape (..., d)
        gamma: scale parameter of shape (d,), defaults to ones
        beta: shift parameter of shape (d,), defaults to zeros
        eps: small constant for numerical stability
    Returns:
        normalized output of same shape as x
    """
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    
    if gamma is not None:
        x_norm = gamma * x_norm
    if beta is not None:
        x_norm = x_norm + beta
    
    return x_norm


# --- Verify layer norm ---
x_test = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 4.0, 6.0, 8.0]])
x_normed = layer_norm(x_test)

print("Layer Norm Verification")
print("=" * 40)
print(f"  Input:      {x_test[0]}")
print(f"  Normalized: {x_normed[0].round(4)}")
print(f"  Mean after norm: {x_normed[0].mean():.6f}  (should be ~0)")
print(f"  Std after norm:  {x_normed[0].std():.6f}  (should be ~1)")
def feedforward(
    x: np.ndarray,
    W1: np.ndarray, b1: np.ndarray,
    W2: np.ndarray, b2: np.ndarray
) -> np.ndarray:
    """
    Position-wise feedforward network (SLP Eq. 8.21).
    
    FFN(x) = ReLU(x @ W1 + b1) @ W2 + b2
    
    Typically d_ff = 4 * d_model (the inner dimension is expanded
    then projected back down).
    
    Args:
        x:  shape (seq_len, d_model)
        W1: shape (d_model, d_ff)
        b1: shape (d_ff,)
        W2: shape (d_ff, d_model)
        b2: shape (d_model,)
    Returns:
        output of shape (seq_len, d_model)
    """
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU activation
    return hidden @ W2 + b2


# --- Quick test ---
np.random.seed(0)
d_model_t, d_ff_t = 8, 32
x_ff = np.random.randn(3, d_model_t)
W1_t = np.random.randn(d_model_t, d_ff_t) * 0.1
b1_t = np.zeros(d_ff_t)
W2_t = np.random.randn(d_ff_t, d_model_t) * 0.1
b2_t = np.zeros(d_model_t)

out_ff = feedforward(x_ff, W1_t, b1_t, W2_t, b2_t)
print(f"FFN: input shape {x_ff.shape} -> output shape {out_ff.shape}")
class TransformerBlock:
    """
    A single transformer block -- pre-norm variant (SLP Eqs. 8.40--8.45).
    
    Pre-norm: LayerNorm is applied BEFORE attention and FFN,
    rather than after. This is the architecture used in modern
    transformers (GPT-2 onward) as it trains more stably.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.d_model = d_model
        self.d_ff = d_ff

        # Layer norm parameters (learnable scale and shift)
        self.ln1_gamma = np.ones(d_model)
        self.ln1_beta = np.zeros(d_model)
        self.ln2_gamma = np.ones(d_model)
        self.ln2_beta = np.zeros(d_model)

        # FFN parameters
        self.W1 = np.random.randn(d_model, d_ff) * 0.02
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * 0.02
        self.b2 = np.zeros(d_model)

    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Pre-norm transformer block forward pass.
        
        T1 = LayerNorm(x)
        T2 = MultiHeadAttention(T1)
        T3 = T2 + x               (residual)
        T4 = LayerNorm(T3)
        T5 = FFN(T4)
        H  = T5 + T3              (residual)
        
        Args:
            x: shape (seq_len, d_model)
        Returns:
            output of shape (seq_len, d_model)
        """
        # Sub-layer 1: attention with residual
        t1 = layer_norm(x, self.ln1_gamma, self.ln1_beta)
        t2 = self.attention.forward(t1, causal=True)
        t3 = t2 + x  # residual connection
        
        # Sub-layer 2: feedforward with residual
        t4 = layer_norm(t3, self.ln2_gamma, self.ln2_beta)
        t5 = feedforward(t4, self.W1, self.b1, self.W2, self.b2)
        h = t5 + t3  # residual connection
        
        return h
# --- Demo: pass random input through a transformer block ---
np.random.seed(42)
seq_len = 6
d_model = 32
n_heads = 4
d_ff = d_model * 4  # 128

block = TransformerBlock(d_model, n_heads, d_ff)
X_block = np.random.randn(seq_len, d_model)

H_block = block.forward(X_block)

print("Transformer Block Demo")
print("=" * 40)
print(f"  Input shape:  {X_block.shape}  (seq_len={seq_len}, d_model={d_model})")
print(f"  Output shape: {H_block.shape}")
print(f"  n_heads: {n_heads}, d_k: {d_model // n_heads}, d_ff: {d_ff}")
print(f"\n  Input[0, :6]:  {X_block[0, :6].round(4)}")
print(f"  Output[0, :6]: {H_block[0, :6].round(4)}")
print(f"\n  (Input and output have the same shape -- the residual stream")
print(f"   dimension is preserved throughout the transformer.)")

PART 6: Mini Transformer LM

Reference: SLP Ch 8.4–8.5 (Input Embeddings, Language Modeling Head, Eqs. 8.46–8.47, Fig. 8.15–8.16)

A complete decoder-only transformer language model chains all the pieces together:

  1. Token embedding: look up each token ID in embedding matrix \(\mathbf{E} \in \mathbb{R}^{|V| \times d}\)

  2. Positional encoding: add sinusoidal PE to token embeddings

  3. \(N\) transformer blocks: pass through \(N\) stacked blocks (the residual stream)

  4. Final layer norm: one extra layer norm at the top of the stack (pre-norm convention)

  5. Language modeling head (unembedding): project to vocabulary logits

\[\mathbf{u} = \mathbf{h}_N^L \, \mathbf{E}^T \qquad \text{(logits)}\]
\[\mathbf{y} = \text{softmax}(\mathbf{u}) \qquad \text{(probabilities)}\]

Many models use weight tying: the same matrix \(\mathbf{E}\) is used for both input embedding and the output projection (via its transpose \(\mathbf{E}^T\)).
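A minimal standalone sketch of the tying (the sizes and the 0.02 init scale mirror the demo below but are otherwise arbitrary):

```python
import numpy as np

# Weight tying: one (vocab_size, d) matrix E serves both as the input
# embedding (row lookup) and, transposed, as the output projection.
vocab_size, d, seq_len = 100, 32, 10
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d)) * 0.02

token_ids = rng.integers(0, vocab_size, size=seq_len)
x = E[token_ids]      # embedding lookup: (seq_len, d)
logits = x @ E.T      # unembedding via E^T: (seq_len, vocab_size)

assert logits.shape == (seq_len, vocab_size)
print(f"tied parameters: {E.size:,}  (untied would need {2 * E.size:,})")
```

Besides halving the embedding-related parameter count, tying couples the input and output token spaces, which tends to help in practice for small vocabular-to-model-size ratios.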

class MiniTransformerLM:
    """
    Minimal transformer language model (decoder-only, SLP Ch 8.4--8.5).
    
    Architecture:
        token_ids -> E[token_ids] + PE[:seq_len]
                  -> N x TransformerBlock
                  -> LayerNorm
                  -> logits = H @ E^T   (weight tying)
    """

    def __init__(self, vocab_size: int, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2, max_len: int = 128):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.n_layers = n_layers

        # Token embeddings: E of shape (vocab_size, d_model)
        self.token_embeddings = np.random.randn(vocab_size, d_model) * 0.02

        # Sinusoidal positional encodings (not learned)
        self.pos_encodings = positional_encoding(max_len, d_model)

        # Stack of transformer blocks
        self.blocks = [
            TransformerBlock(d_model, n_heads, d_model * 4)
            for _ in range(n_layers)
        ]

        # Final layer norm (pre-norm convention adds one at the very end)
        self.final_ln_gamma = np.ones(d_model)
        self.final_ln_beta = np.zeros(d_model)

        # Language model head: uses weight tying (E^T)
        # logits = H @ E^T, shape (seq_len, vocab_size)
        # (We use self.token_embeddings.T in forward)

    def forward(self, token_ids: np.ndarray) -> np.ndarray:
        """
        Forward pass: token_ids -> logits.
        
        Args:
            token_ids: integer array of shape (seq_len,)
        Returns:
            logits: array of shape (seq_len, vocab_size)
        """
        seq_len = len(token_ids)

        # Step 1: Look up token embeddings
        # E[token_ids] selects rows from the embedding matrix
        x = self.token_embeddings[token_ids]  # (seq_len, d_model)

        # Step 2: Add positional encodings
        x = x + self.pos_encodings[:seq_len]  # (seq_len, d_model)

        # Step 3: Pass through N transformer blocks
        for block in self.blocks:
            x = block.forward(x)

        # Step 4: Final layer norm
        x = layer_norm(x, self.final_ln_gamma, self.final_ln_beta)

        # Step 5: Language model head (weight tying: logits = H @ E^T)
        logits = x @ self.token_embeddings.T  # (seq_len, vocab_size)

        return logits
# --- Demo: mini transformer language model ---
np.random.seed(42)

vocab_size = 100
d_model = 32
n_heads = 4
n_layers = 2
max_len = 64
seq_len = 10

lm = MiniTransformerLM(
    vocab_size=vocab_size,
    d_model=d_model,
    n_heads=n_heads,
    n_layers=n_layers,
    max_len=max_len
)

# Random token IDs (simulating an input sequence)
token_ids = np.random.randint(0, vocab_size, size=seq_len)

logits = lm.forward(token_ids)

print("Mini Transformer Language Model Demo")
print("=" * 50)
print(f"  vocab_size: {vocab_size}")
print(f"  d_model:    {d_model}")
print(f"  n_heads:    {n_heads}  (d_k = {d_model // n_heads})")
print(f"  n_layers:   {n_layers}")
print(f"  d_ff:       {d_model * 4}")
print(f"  max_len:    {max_len}")
print(f"\n  Input token IDs: {token_ids}")
print(f"  Input shape:  ({seq_len},)")
print(f"  Logits shape: {logits.shape}  (seq_len, vocab_size)")

# Convert logits to probabilities for the last position
probs_last = softmax(logits[-1:])[0]
top5_idx = np.argsort(probs_last)[-5:][::-1]
print(f"\n  Predicted next-token probabilities (last position):")
print(f"  Top 5 tokens: {list(zip(top5_idx, probs_last[top5_idx].round(4)))}")
print(f"  Prob sum: {probs_last.sum():.6f}  (should be 1.0)")
print(f"\n  Note: With random weights these predictions are meaningless.")
print(f"  Training would adjust all parameters to minimize cross-entropy loss.")
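The demo stops at a single forward pass; actually generating text would loop the model autoregressively. Here is a hedged sketch of greedy decoding, with a random-logits stub standing in for `MiniTransformerLM.forward` so the snippet runs on its own (swap in `lm.forward` to drive the real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 100

def forward_stub(token_ids: np.ndarray) -> np.ndarray:
    """Stand-in for MiniTransformerLM.forward: random (seq_len, vocab) logits."""
    return rng.standard_normal((len(token_ids), vocab_size))

def generate_greedy(prompt_ids: np.ndarray, n_new: int) -> np.ndarray:
    """Greedy autoregressive decoding: repeatedly run the model on the
    sequence so far and append the argmax of the last position's logits."""
    ids = np.array(prompt_ids)
    for _ in range(n_new):
        logits = forward_stub(ids)            # (len(ids), vocab_size)
        next_id = int(np.argmax(logits[-1]))  # most likely next token
        ids = np.append(ids, next_id)
    return ids

out = generate_greedy(np.array([5, 17, 42]), n_new=4)
print(out)  # the 3 prompt tokens followed by 4 generated ones
```

Note the quadratic cost: each step reruns attention over the whole prefix. Production implementations cache the per-layer keys and values (a "KV cache") to avoid this.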

Summary

This lab built a complete (decoder-only) transformer language model from scratch using only NumPy, following the architecture described in Jurafsky & Martin Chapter 8:

Component | Key idea | Equation
Scaled dot-product attention | \(\text{softmax}(QK^T / \sqrt{d_k})\, V\) | Eq. 8.33
Causal mask | Set upper triangle to \(-\infty\) before softmax | Fig. 8.10
Multi-head attention | \(A\) parallel heads, concatenated and projected by \(W^O\) | Eqs. 8.35–8.37
Positional encoding | \(\sin\)/\(\cos\) at different frequencies encode position | Ch. 8.4
Transformer block | Pre-norm: LN \(\to\) Attention \(\to\) Residual \(\to\) LN \(\to\) FFN \(\to\) Residual | Eqs. 8.40–8.45
LM head | Logits \(= H \cdot E^T\), then softmax for next-token probabilities | Eqs. 8.46–8.47

The architecture flows bottom-to-top through a residual stream: each component reads from and writes back into this stream, with attention being the only component that mixes information across token positions.

All code is fully implemented with NumPy – no deep learning framework required. In practice, frameworks like PyTorch handle backpropagation, GPU acceleration, and efficient batched operations.