Lab 05: Transformers & Attention
Source: Speech and Language Processing (3rd ed.) by Dan Jurafsky & James H. Martin, Chapter 8: Transformers
PDF: resources/ed3book_jan26.pdf
Topics covered:
Scaled dot-product attention
Causal (masked) attention
Multi-head attention
Sinusoidal positional encoding
Transformer blocks (pre-norm variant)
Mini transformer language model
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["font.size"] = 11
PART 1: Dot-Product Attention
Reference: SLP Ch 8.1 (Equations 8.10–8.14)
The core idea of the transformer is attention: computing a contextual representation for each token by selectively attending to and integrating information from other tokens.
Each input embedding plays three roles via learned projections:
| Role | Matrix | Purpose |
|---|---|---|
| Query \(\mathbf{q}_i = \mathbf{x}_i \mathbf{W}^Q\) | \(\mathbf{W}^Q \in \mathbb{R}^{d \times d_k}\) | The current element being compared |
| Key \(\mathbf{k}_j = \mathbf{x}_j \mathbf{W}^K\) | \(\mathbf{W}^K \in \mathbb{R}^{d \times d_k}\) | A preceding input used to determine similarity |
| Value \(\mathbf{v}_j = \mathbf{x}_j \mathbf{W}^V\) | \(\mathbf{W}^V \in \mathbb{R}^{d \times d_v}\) | The content that gets weighted and summed |
Scaled dot-product attention (Eq. 8.33):
\[
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}
\]
The scaling by \(\sqrt{d_k}\) prevents the dot products from growing too large (which would push softmax into regions with tiny gradients).
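To make the scaling argument concrete, here is a quick empirical check (illustrative only, not part of the chapter's reference code): with unit-variance queries and keys, the dot product \(\mathbf{q} \cdot \mathbf{k}\) is a sum of \(d_k\) independent terms, so its standard deviation grows as \(\sqrt{d_k}\); dividing by \(\sqrt{d_k}\) restores unit scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# With q, k ~ N(0, 1), the dot product is a sum of d_k independent
# products, so its variance is d_k and its std is sqrt(d_k).
for d_k in [4, 64, 1024]:
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    raw = np.sum(q * k, axis=1)       # unscaled dot products
    scaled = raw / np.sqrt(d_k)       # the scaling from Eq. 8.33
    print(f"d_k={d_k:4d}  std(raw)={raw.std():7.2f}  std(scaled)={scaled.std():.2f}")
```

Without the scaling, a few large scores dominate each softmax row, the attention weights become nearly one-hot, and the gradients through the softmax shrink toward zero.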
def softmax(z: np.ndarray) -> np.ndarray:
"""
Numerically stable row-wise softmax.
Subtracting the row-max before exponentiation prevents overflow
(exp of very large numbers) while producing identical results,
since softmax is shift-invariant: softmax(z) == softmax(z - c).
Args:
z: array of shape (..., n) -- logits
Returns:
array of same shape with each row summing to 1
"""
z_shifted = z - np.max(z, axis=-1, keepdims=True)
exp_z = np.exp(z_shifted)
return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
# --- Quick sanity check ---
test_z = np.array([[1.0, 2.0, 3.0],
[1000.0, 1001.0, 1002.0]]) # large values to test stability
test_probs = softmax(test_z)
print("Softmax sanity check:")
print(f" Input row 1: {test_z[0]} -> {test_probs[0].round(4)}")
print(f" Input row 2: {test_z[1]} -> {test_probs[1].round(4)} (large values, still stable)")
print(f" Row sums: {test_probs.sum(axis=1).round(6)}")
def scaled_dot_product_attention(
Q: np.ndarray, K: np.ndarray, V: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
"""
Compute scaled dot-product attention (SLP Eq. 8.33).
Args:
Q: queries, shape (seq_len, d_k)
K: keys, shape (seq_len, d_k)
V: values, shape (seq_len, d_v)
Returns:
output: shape (seq_len, d_v)
attention_weights: shape (seq_len, seq_len)
"""
d_k = Q.shape[-1]
# Step 1: Compute raw scores Q K^T -> (seq_len, seq_len)
scores = Q @ K.T / np.sqrt(d_k)
# Step 2: Softmax to get attention weights (each row sums to 1)
weights = softmax(scores)
# Step 3: Weighted sum of values
output = weights @ V
return output, weights
# --- Demo: scaled dot-product attention with random Q, K, V ---
np.random.seed(42)
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
output, weights = scaled_dot_product_attention(Q, K, V)
print("Scaled Dot-Product Attention Demo")
print("=" * 40)
print(f" Q shape: {Q.shape}")
print(f" K shape: {K.shape}")
print(f" V shape: {V.shape}")
print(f" Output shape: {output.shape}")
print(f" Attention weights shape: {weights.shape}")
print(f"\nAttention weights (row sums should be 1.0):")
print(f" {weights.sum(axis=1).round(6)}")
print(f"\nAttention weight matrix:")
print(np.array2string(weights, precision=4, suppress_small=True))
# --- Visualize attention weights as a heatmap ---
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(weights, cmap="Blues", vmin=0, vmax=1)
ax.set_xlabel("Key position")
ax.set_ylabel("Query position")
ax.set_title("Scaled Dot-Product Attention Weights")
ax.set_xticks(range(seq_len))
ax.set_yticks(range(seq_len))
# Annotate cells with weight values
for i in range(seq_len):
for j in range(seq_len):
ax.text(j, i, f"{weights[i, j]:.2f}",
ha="center", va="center", fontsize=10,
color="white" if weights[i, j] > 0.5 else "black")
plt.colorbar(im, ax=ax, label="Attention weight")
plt.tight_layout()
plt.show()
PART 2: Causal (Masked) Attention
Reference: SLP Ch 8.3 (Masking out the future, Eqs. 8.33–8.34, Fig. 8.10)
For causal (left-to-right) language modeling, each position \(i\) should only attend to positions \(j \le i\) (the current token and past tokens). Attending to future tokens would be cheating: you cannot condition on words you have not yet seen.
We enforce this by adding a mask matrix \(\mathbf{M}\) to the scores before softmax:
\[
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V},
\qquad
M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}
\]
Since \(\exp(-\infty) = 0\), the softmax assigns zero weight to all future positions. The upper-triangular portion of \(\mathbf{QK}^T\) is effectively zeroed out (Fig. 8.10).
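A tiny worked example of the mask, standalone for clarity: one row of scores for a query at position 1 in a 4-token sequence, so positions 2 and 3 lie in the future.

```python
import numpy as np

# Raw scores for one query row; the query sits at position 1,
# so positions 2 and 3 must be masked out.
scores = np.array([0.5, 1.2, 2.0, -0.3])
mask = np.array([0.0, 0.0, -np.inf, -np.inf])   # M_1j = -inf for j > 1

masked = scores + mask
exp = np.exp(masked - masked.max())  # exp(-inf) = 0 exactly
weights = exp / exp.sum()

print(weights)  # the last two entries are exactly 0; the row still sums to 1
```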
def causal_attention(
Q: np.ndarray, K: np.ndarray, V: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
"""
Causal (masked) scaled dot-product attention.
Prevents position i from attending to any position j > i.
Args:
Q: queries, shape (seq_len, d_k)
K: keys, shape (seq_len, d_k)
V: values, shape (seq_len, d_v)
Returns:
output: shape (seq_len, d_v)
attention_weights: shape (seq_len, seq_len)
"""
d_k = Q.shape[-1]
seq_len = Q.shape[0]
# Compute raw scores
scores = Q @ K.T / np.sqrt(d_k)
    # Create causal mask: entries above the main diagonal get -1e9,
    # a large negative number that acts as -inf under softmax
    # (np.triu with k=1 gives 1s above the main diagonal, 0s elsewhere)
mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)
scores = scores + mask
# Softmax (future positions get exp(-1e9) ~ 0 weight)
weights = softmax(scores)
# Weighted sum of values
output = weights @ V
return output, weights
# --- Verify causal attention properties ---
np.random.seed(42)
seq_len, d = 5, 4
Q = np.random.randn(seq_len, d)
K = np.random.randn(seq_len, d)
V = np.random.randn(seq_len, d)
output_c, weights_c = causal_attention(Q, K, V)
print("Causal Attention Verification")
print("=" * 50)
# Check 1: Position i only attends to positions 0..i
print("\n1. Future weights should be ~0 (upper triangle):")
for i in range(seq_len):
future_weights = weights_c[i, i+1:]
past_weights = weights_c[i, :i+1]
print(f" Row {i}: attends to 0..{i} with weights {past_weights.round(4)}"
f" | future weights = {future_weights.round(10)}")
# Check 2: All future weights are effectively zero
upper_triangle = np.triu(weights_c, k=1)
print(f"\n2. Max weight assigned to any future position: {upper_triangle.max():.2e}")
# Check 3: Rows still sum to 1
print(f"\n3. Row sums (should all be 1.0): {weights_c.sum(axis=1).round(6)}")
# --- Visualize causal attention weights ---
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
# Full (non-causal) attention for comparison
_, weights_full = scaled_dot_product_attention(Q, K, V)
for ax, w, title in zip(axes,
[weights_full, weights_c],
["Full (Bidirectional) Attention", "Causal (Masked) Attention"]):
im = ax.imshow(w, cmap="Blues", vmin=0, vmax=1)
ax.set_xlabel("Key position")
ax.set_ylabel("Query position")
ax.set_title(title)
ax.set_xticks(range(seq_len))
ax.set_yticks(range(seq_len))
for i in range(seq_len):
for j in range(seq_len):
ax.text(j, i, f"{w[i, j]:.2f}",
ha="center", va="center", fontsize=9,
color="white" if w[i, j] > 0.5 else "black")
plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
plt.suptitle("Causal mask zeros out the upper triangle", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()
PART 3: Multi-Head Attention
Reference: SLP Ch 8.1 (Equations 8.15–8.20), Ch 8.3 (Equations 8.35–8.37, Fig. 8.6)
A single attention head can only capture one kind of relationship. Transformers use multiple attention heads (\(A\) heads) in parallel, each with its own projection matrices, allowing the model to attend to different aspects simultaneously (e.g., syntactic vs. semantic).
For each head \(c\) (\(1 \le c \le A\)):
\[
\mathbf{Q}^c = \mathbf{X}\mathbf{W}^{Q_c}, \quad
\mathbf{K}^c = \mathbf{X}\mathbf{W}^{K_c}, \quad
\mathbf{V}^c = \mathbf{X}\mathbf{W}^{V_c}, \qquad
\text{head}_c = \text{softmax}\!\left(\frac{\mathbf{Q}^c (\mathbf{K}^c)^T}{\sqrt{d_k}}\right)\mathbf{V}^c
\]
The heads are concatenated and projected back to model dimension \(d\):
\[
\text{MultiHeadAttention}(\mathbf{X}) = (\text{head}_1 \oplus \text{head}_2 \oplus \cdots \oplus \text{head}_A)\,\mathbf{W}^O
\]
With \(d_k = d_v = d / A\), the output projection \(\mathbf{W}^O \in \mathbb{R}^{d \times d}\) maps the concatenated heads back to shape \((N, d)\).
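One detail worth verifying before reading the implementation: packing all heads into single \((d, d)\) matrices (as the class below does) is equivalent to giving each head its own \((d, d_k)\) projection, because head \(c\) simply reads columns \(c\,d_k\) through \((c+1)d_k\) of the packed matrix. A small standalone check:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 5, 16, 4
d_k = d_model // n_heads

X = rng.standard_normal((seq_len, d_model))
W_Q = rng.standard_normal((d_model, d_model))  # all heads packed together

# Packed projection, then split into heads via reshape + transpose
Q_packed = (X @ W_Q).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)

# Equivalent per-head projections: head c uses columns [c*d_k : (c+1)*d_k]
for c in range(n_heads):
    Q_head = X @ W_Q[:, c * d_k:(c + 1) * d_k]
    assert np.allclose(Q_packed[c], Q_head)
print("packed projection matches per-head projections")
```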
class MultiHeadAttention:
"""Multi-head self-attention (SLP Eqs. 8.35--8.37)."""
def __init__(self, d_model: int, n_heads: int):
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads # d_k = d_v = d / A
# Initialize projection matrices with small random weights
# W_Q, W_K, W_V: (d_model, d_model) -- all heads packed together
# W_O: (d_model, d_model) -- output projection
scale = 0.02
self.W_Q = np.random.randn(d_model, d_model) * scale
self.W_K = np.random.randn(d_model, d_model) * scale
self.W_V = np.random.randn(d_model, d_model) * scale
self.W_O = np.random.randn(d_model, d_model) * scale
def forward(self, X: np.ndarray, causal: bool = False) -> np.ndarray:
"""
Multi-head attention forward pass.
Args:
X: input of shape (seq_len, d_model)
causal: if True, apply causal mask
Returns:
output of shape (seq_len, d_model)
"""
seq_len = X.shape[0]
# Project to Q, K, V -- shape (seq_len, d_model)
Q = X @ self.W_Q
K = X @ self.W_K
V = X @ self.W_V
# Reshape into separate heads:
# (seq_len, d_model) -> (seq_len, n_heads, d_k) -> (n_heads, seq_len, d_k)
Q = Q.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
K = K.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
V = V.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
# Compute attention for each head
# scores: (n_heads, seq_len, seq_len)
scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(self.d_k)
# Apply causal mask if needed
if causal:
mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)
scores = scores + mask # broadcasts over head dimension
# Softmax over keys (last axis)
weights = softmax(scores) # (n_heads, seq_len, seq_len)
# Weighted sum of values
head_outputs = np.matmul(weights, V) # (n_heads, seq_len, d_k)
# Concatenate heads:
# (n_heads, seq_len, d_k) -> (seq_len, n_heads, d_k) -> (seq_len, d_model)
concat = head_outputs.transpose(1, 0, 2).reshape(seq_len, self.d_model)
# Output projection
output = concat @ self.W_O
return output
# --- Demo: multi-head attention ---
np.random.seed(42)
seq_len = 6
d_model = 32
n_heads = 4
mha = MultiHeadAttention(d_model, n_heads)
X = np.random.randn(seq_len, d_model)
out_mha = mha.forward(X, causal=False)
out_mha_causal = mha.forward(X, causal=True)
print("Multi-Head Attention Demo")
print("=" * 40)
print(f" Input shape: {X.shape} (seq_len={seq_len}, d_model={d_model})")
print(f" n_heads: {n_heads}, d_k = d_model/n_heads = {d_model // n_heads}")
print(f" Output shape (full): {out_mha.shape}")
print(f" Output shape (causal): {out_mha_causal.shape}")
print(f"\n W_Q shape: {mha.W_Q.shape}")
print(f" W_K shape: {mha.W_K.shape}")
print(f" W_V shape: {mha.W_V.shape}")
print(f" W_O shape: {mha.W_O.shape}")
PART 4: Positional Encoding
Reference: SLP Ch 8.4 (Embeddings for Token and Position)
Since attention is permutation-equivariant (it has no inherent notion of position), we must inject positional information. The original transformer (Vaswani et al., 2017) uses sinusoidal positional encodings, a deterministic function that maps each position to a unique vector:
\[
\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right),
\qquad
\text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)
\]
Key properties:
Each dimension oscillates at a different frequency (low dims = fast, high dims = slow).
The dot product \(\text{PE}(\text{pos}) \cdot \text{PE}(\text{pos}+k)\) depends primarily on the relative distance \(k\), enabling the model to learn relative position patterns.
Nearby positions have more similar encodings than distant ones.
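The relative-position property can be checked directly. For the raw dot product it is exact: \(\text{PE}(\text{pos}) \cdot \text{PE}(\text{pos}+k) = \sum_i \cos(\omega_i k)\) by the angle-difference identity, with no \(\text{pos}\) term, so the similarity at a fixed offset is the same from any base position. A standalone check (the encoding is redefined inline so the cell runs on its own):

```python
import numpy as np

def pe(max_len: int, d: int) -> np.ndarray:
    """Minimal sinusoidal positional encoding (same formula as below)."""
    pos = np.arange(max_len)[:, None]
    div = np.exp(np.arange(0, d, 2) * -(np.log(10000.0) / d))
    out = np.zeros((max_len, d))
    out[:, 0::2] = np.sin(pos * div)
    out[:, 1::2] = np.cos(pos * div)
    return out

P = pe(100, 64)
cos_sim = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity at offset k = 5, measured from several base positions:
# the value is the same every time, depending only on k.
for pos0 in [0, 10, 40, 80]:
    print(f"pos={pos0:2d}: cos(PE[pos], PE[pos+5]) = {cos_sim(P[pos0], P[pos0 + 5]):.6f}")
```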
def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
"""
Generate sinusoidal positional encodings.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Args:
max_len: maximum sequence length
d_model: model dimensionality
Returns:
PE matrix of shape (max_len, d_model)
"""
PE = np.zeros((max_len, d_model))
# Position indices: column vector (max_len, 1)
position = np.arange(max_len)[:, np.newaxis]
# Compute the division term: 1 / 10000^(2i/d_model)
# Using exp-log trick for numerical stability:
# 10000^(2i/d) = exp(2i * ln(10000) / d)
# 1/10000^(2i/d) = exp(-2i * ln(10000) / d)
div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
# Even indices: sin, Odd indices: cos
PE[:, 0::2] = np.sin(position * div_term)
PE[:, 1::2] = np.cos(position * div_term)
return PE
# --- Quick check ---
PE_check = positional_encoding(10, 8)
print(f"PE shape: {PE_check.shape}")
print(f"PE[0, :8] = {PE_check[0].round(4)}")
print(f"PE[1, :8] = {PE_check[1].round(4)}")
print(f"PE[0] should start with sin(0)=0, cos(0)=1: [{PE_check[0,0]:.1f}, {PE_check[0,1]:.1f}, ...]")
# --- Plot PE matrix as a heatmap ---
PE = positional_encoding(50, 64)
fig, ax = plt.subplots(figsize=(10, 5))
im = ax.imshow(PE, aspect="auto", cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xlabel("Encoding dimension $i$")
ax.set_ylabel("Position")
ax.set_title("Sinusoidal Positional Encoding (max_len=50, d_model=64)")
plt.colorbar(im, ax=ax, label="PE value")
plt.tight_layout()
plt.show()
print("Low dimensions (left) oscillate rapidly; high dimensions (right) oscillate slowly.")
print("Each row (position) gets a unique fingerprint.")
# --- Dot product captures relative position ---
# Cosine similarity between PE[0] and PE[k] for various k.
# Closer positions should be more similar.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
PE_large = positional_encoding(50, 64)
print("Cosine similarity between PE[0] and PE[k]:")
print("-" * 45)
for k in [1, 5, 10, 25]:
sim = cosine_similarity(PE_large[0], PE_large[k])
print(f" k = {k:2d} -> cosine_sim = {sim:.4f}")
print("\nCloser positions have higher similarity, confirming that")
print("the encoding captures relative position information.")
# Plot similarity curve
distances = np.arange(1, 50)
sims = [cosine_similarity(PE_large[0], PE_large[k]) for k in distances]
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(distances, sims, "o-", markersize=3)
ax.set_xlabel("Relative distance $k$")
ax.set_ylabel("Cosine similarity PE[0] vs PE[k]")
ax.set_title("Positional Encoding: Similarity Decreases with Distance")
ax.axhline(y=0, color="gray", linestyle="--", alpha=0.5)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
PART 5: Transformer Block
Reference: SLP Ch 8.2 (Transformer Blocks, Eqs. 8.21–8.31, 8.38–8.45, Fig. 8.7)
A transformer block combines self-attention with two other key components via a residual stream architecture (pre-norm variant, Eqs. 8.40–8.45):
\[
\begin{aligned}
\mathbf{T}^1 &= \text{LayerNorm}(\mathbf{X}) \\
\mathbf{T}^2 &= \text{SelfAttention}(\mathbf{T}^1) \\
\mathbf{T}^3 &= \mathbf{T}^2 + \mathbf{X} \\
\mathbf{T}^4 &= \text{LayerNorm}(\mathbf{T}^3) \\
\mathbf{T}^5 &= \text{FFN}(\mathbf{T}^4) \\
\mathbf{H} &= \mathbf{T}^5 + \mathbf{T}^3
\end{aligned}
\]
where the feedforward network (Eq. 8.21) is:
\[
\text{FFN}(\mathbf{x}) = \text{ReLU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2
\]
and layer normalization (Eq. 8.25) normalizes across the feature dimension:
\[
\text{LayerNorm}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \boldsymbol{\beta}
\]
def layer_norm(
x: np.ndarray,
gamma: np.ndarray = None,
beta: np.ndarray = None,
eps: float = 1e-5
) -> np.ndarray:
"""
Layer normalization (SLP Eq. 8.25).
Normalizes over the last (feature) dimension, then applies
learnable affine transform gamma * x_norm + beta.
Args:
x: input of shape (..., d)
gamma: scale parameter of shape (d,), defaults to ones
beta: shift parameter of shape (d,), defaults to zeros
eps: small constant for numerical stability
Returns:
normalized output of same shape as x
"""
mean = np.mean(x, axis=-1, keepdims=True)
var = np.var(x, axis=-1, keepdims=True)
x_norm = (x - mean) / np.sqrt(var + eps)
    if gamma is not None:
        x_norm = gamma * x_norm
    if beta is not None:
        x_norm = x_norm + beta
    return x_norm
# --- Verify layer norm ---
x_test = np.array([[1.0, 2.0, 3.0, 4.0],
[2.0, 4.0, 6.0, 8.0]])
x_normed = layer_norm(x_test)
print("Layer Norm Verification")
print("=" * 40)
print(f" Input: {x_test[0]}")
print(f" Normalized: {x_normed[0].round(4)}")
print(f" Mean after norm: {x_normed[0].mean():.6f} (should be ~0)")
print(f" Std after norm: {x_normed[0].std():.6f} (should be ~1)")
def feedforward(
x: np.ndarray,
W1: np.ndarray, b1: np.ndarray,
W2: np.ndarray, b2: np.ndarray
) -> np.ndarray:
"""
Position-wise feedforward network (SLP Eq. 8.21).
FFN(x) = ReLU(x @ W1 + b1) @ W2 + b2
Typically d_ff = 4 * d_model (the inner dimension is expanded
then projected back down).
Args:
x: shape (seq_len, d_model)
W1: shape (d_model, d_ff)
b1: shape (d_ff,)
W2: shape (d_ff, d_model)
b2: shape (d_model,)
Returns:
output of shape (seq_len, d_model)
"""
hidden = np.maximum(0, x @ W1 + b1) # ReLU activation
return hidden @ W2 + b2
# --- Quick test ---
np.random.seed(0)
d_model_t, d_ff_t = 8, 32
x_ff = np.random.randn(3, d_model_t)
W1_t = np.random.randn(d_model_t, d_ff_t) * 0.1
b1_t = np.zeros(d_ff_t)
W2_t = np.random.randn(d_ff_t, d_model_t) * 0.1
b2_t = np.zeros(d_model_t)
out_ff = feedforward(x_ff, W1_t, b1_t, W2_t, b2_t)
print(f"FFN: input shape {x_ff.shape} -> output shape {out_ff.shape}")
class TransformerBlock:
"""
A single transformer block -- pre-norm variant (SLP Eqs. 8.40--8.45).
Pre-norm: LayerNorm is applied BEFORE attention and FFN,
rather than after. This is the architecture used in modern
transformers (GPT-2 onward) as it trains more stably.
"""
def __init__(self, d_model: int, n_heads: int, d_ff: int):
self.attention = MultiHeadAttention(d_model, n_heads)
self.d_model = d_model
self.d_ff = d_ff
# Layer norm parameters (learnable scale and shift)
self.ln1_gamma = np.ones(d_model)
self.ln1_beta = np.zeros(d_model)
self.ln2_gamma = np.ones(d_model)
self.ln2_beta = np.zeros(d_model)
# FFN parameters
self.W1 = np.random.randn(d_model, d_ff) * 0.02
self.b1 = np.zeros(d_ff)
self.W2 = np.random.randn(d_ff, d_model) * 0.02
self.b2 = np.zeros(d_model)
def forward(self, x: np.ndarray) -> np.ndarray:
"""
Pre-norm transformer block forward pass.
T1 = LayerNorm(x)
T2 = MultiHeadAttention(T1)
T3 = T2 + x (residual)
T4 = LayerNorm(T3)
T5 = FFN(T4)
H = T5 + T3 (residual)
Args:
x: shape (seq_len, d_model)
Returns:
output of shape (seq_len, d_model)
"""
# Sub-layer 1: attention with residual
t1 = layer_norm(x, self.ln1_gamma, self.ln1_beta)
t2 = self.attention.forward(t1, causal=True)
t3 = t2 + x # residual connection
# Sub-layer 2: feedforward with residual
t4 = layer_norm(t3, self.ln2_gamma, self.ln2_beta)
t5 = feedforward(t4, self.W1, self.b1, self.W2, self.b2)
h = t5 + t3 # residual connection
return h
# --- Demo: pass random input through a transformer block ---
np.random.seed(42)
seq_len = 6
d_model = 32
n_heads = 4
d_ff = d_model * 4 # 128
block = TransformerBlock(d_model, n_heads, d_ff)
X_block = np.random.randn(seq_len, d_model)
H_block = block.forward(X_block)
print("Transformer Block Demo")
print("=" * 40)
print(f" Input shape: {X_block.shape} (seq_len={seq_len}, d_model={d_model})")
print(f" Output shape: {H_block.shape}")
print(f" n_heads: {n_heads}, d_k: {d_model // n_heads}, d_ff: {d_ff}")
print(f"\n Input[0, :6]: {X_block[0, :6].round(4)}")
print(f" Output[0, :6]: {H_block[0, :6].round(4)}")
print(f"\n (Input and output have the same shape -- the residual stream")
print(f" dimension is preserved throughout the transformer.)")
PART 6: Mini Transformer LM
Reference: SLP Ch 8.4–8.5 (Input Embeddings, Language Modeling Head, Eqs. 8.46–8.47, Figs. 8.15–8.16)
A complete decoder-only transformer language model chains all the pieces together:
Token embedding: look up each token ID in embedding matrix \(\mathbf{E} \in \mathbb{R}^{|V| \times d}\)
Positional encoding: add sinusoidal PE to token embeddings
\(N\) transformer blocks: pass through \(N\) stacked blocks (the residual stream)
Final layer norm: one extra layer norm at the top of the stack (pre-norm convention)
Language modeling head (unembedding): project to vocabulary logits
Many models use weight tying: the same matrix \(\mathbf{E}\) is used for both input embedding and the output projection (via its transpose \(\mathbf{E}^T\)).
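A quick standalone sketch of the tied head, shapes only (the array names here are illustrative, not part of the model class): the same \(\mathbf{E}\) that embeds token IDs also unembeds hidden states, which saves a second \(|V| \times d\) parameter matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, d_model, seq_len = 50, 16, 4

E = rng.standard_normal((vocab_size, d_model)) * 0.02  # shared embedding matrix
H = rng.standard_normal((seq_len, d_model))            # final hidden states

logits = H @ E.T
print(logits.shape)  # (4, 50): one score per vocabulary item per position

# Weight tying halves the embedding-related parameter count:
print(f"untied: {2 * vocab_size * d_model} params, tied: {vocab_size * d_model}")
```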
class MiniTransformerLM:
"""
Minimal transformer language model (decoder-only, SLP Ch 8.4--8.5).
Architecture:
token_ids -> E[token_ids] + PE[:seq_len]
-> N x TransformerBlock
-> LayerNorm
-> logits = H @ E^T (weight tying)
"""
def __init__(self, vocab_size: int, d_model: int = 64,
n_heads: int = 4, n_layers: int = 2, max_len: int = 128):
self.vocab_size = vocab_size
self.d_model = d_model
self.n_layers = n_layers
# Token embeddings: E of shape (vocab_size, d_model)
self.token_embeddings = np.random.randn(vocab_size, d_model) * 0.02
# Sinusoidal positional encodings (not learned)
self.pos_encodings = positional_encoding(max_len, d_model)
# Stack of transformer blocks
self.blocks = [
TransformerBlock(d_model, n_heads, d_model * 4)
for _ in range(n_layers)
]
# Final layer norm (pre-norm convention adds one at the very end)
self.final_ln_gamma = np.ones(d_model)
self.final_ln_beta = np.zeros(d_model)
# Language model head: uses weight tying (E^T)
# logits = H @ E^T, shape (seq_len, vocab_size)
# (We use self.token_embeddings.T in forward)
def forward(self, token_ids: np.ndarray) -> np.ndarray:
"""
Forward pass: token_ids -> logits.
Args:
token_ids: integer array of shape (seq_len,)
Returns:
logits: array of shape (seq_len, vocab_size)
"""
seq_len = len(token_ids)
# Step 1: Look up token embeddings
# E[token_ids] selects rows from the embedding matrix
x = self.token_embeddings[token_ids] # (seq_len, d_model)
# Step 2: Add positional encodings
x = x + self.pos_encodings[:seq_len] # (seq_len, d_model)
# Step 3: Pass through N transformer blocks
for block in self.blocks:
x = block.forward(x)
# Step 4: Final layer norm
x = layer_norm(x, self.final_ln_gamma, self.final_ln_beta)
# Step 5: Language model head (weight tying: logits = H @ E^T)
logits = x @ self.token_embeddings.T # (seq_len, vocab_size)
return logits
# --- Demo: mini transformer language model ---
np.random.seed(42)
vocab_size = 100
d_model = 32
n_heads = 4
n_layers = 2
max_len = 64
seq_len = 10
lm = MiniTransformerLM(
vocab_size=vocab_size,
d_model=d_model,
n_heads=n_heads,
n_layers=n_layers,
max_len=max_len
)
# Random token IDs (simulating an input sequence)
token_ids = np.random.randint(0, vocab_size, size=seq_len)
logits = lm.forward(token_ids)
print("Mini Transformer Language Model Demo")
print("=" * 50)
print(f" vocab_size: {vocab_size}")
print(f" d_model: {d_model}")
print(f" n_heads: {n_heads} (d_k = {d_model // n_heads})")
print(f" n_layers: {n_layers}")
print(f" d_ff: {d_model * 4}")
print(f" max_len: {max_len}")
print(f"\n Input token IDs: {token_ids}")
print(f" Input shape: ({seq_len},)")
print(f" Logits shape: {logits.shape} (seq_len, vocab_size)")
# Convert logits to probabilities for the last position
probs_last = softmax(logits[-1:])[0]
top5_idx = np.argsort(probs_last)[-5:][::-1]
print(f"\n Predicted next-token probabilities (last position):")
print(f" Top 5 tokens: {list(zip(top5_idx, probs_last[top5_idx].round(4)))}")
print(f" Prob sum: {probs_last.sum():.6f} (should be 1.0)")
print(f"\n Note: With random weights these predictions are meaningless.")
print(f" Training would adjust all parameters to minimize cross-entropy loss.")
Summary
This lab built a complete (decoder-only) transformer language model from scratch using only NumPy, following the architecture described in Jurafsky & Martin Chapter 8:
| Component | Key idea | Equation |
|---|---|---|
| Scaled dot-product attention | \(\text{softmax}(QK^T / \sqrt{d_k})\, V\) | Eq. 8.33 |
| Causal mask | Set upper triangle to \(-\infty\) before softmax | Fig. 8.10 |
| Multi-head attention | \(A\) parallel heads, concatenated and projected by \(W^O\) | Eqs. 8.35–8.37 |
| Positional encoding | \(\sin\)/\(\cos\) at different frequencies encode position | Ch. 8.4 |
| Transformer block | Pre-norm: LN \(\to\) Attention \(\to\) Residual \(\to\) LN \(\to\) FFN \(\to\) Residual | Eqs. 8.40–8.45 |
| LM head | Logits \(= H \cdot E^T\), then softmax for next-token probabilities | Eqs. 8.46–8.47 |
The architecture flows bottom-to-top through a residual stream: each component reads from and writes back into this stream, with attention being the only component that mixes information across token positions.
All code is fully implemented with NumPy; no deep learning framework is required. In practice, frameworks like PyTorch handle backpropagation, GPU acceleration, and efficient batched operations.