Lab 02: N-gram Language Models

Source: Speech and Language Processing (3rd ed.) by Dan Jurafsky & James H. Martin – Chapter 3: N-gram Language Models

Topics covered:

  1. N-gram extraction and probability estimation (MLE)

  2. Smoothing techniques (Laplace, Add-k, Interpolation)

  3. Text generation via sampling with temperature control

  4. Perplexity as an evaluation metric

  5. Overfitting vs. generalization across n-gram orders

import numpy as np
from collections import Counter, defaultdict
import random
from typing import List, Dict, Tuple, Optional

import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["font.size"] = 11

# Fix random seed for reproducibility
random.seed(42)
np.random.seed(42)

PART 1: N-gram Basics

Reference: SLP Ch 3.1

A language model (LM) assigns a probability to each possible next word given the preceding context and, by the chain rule, a probability to any full word sequence. The simplest kind of LM is the n-gram model.

The Markov Assumption

Instead of computing the exact probability of a word given its entire history, we approximate using only the last \(N-1\) words:

\[P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-N+1} \ldots w_{n-1})\]
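For example, for a three-word sentence the exact chain-rule expansion and its bigram (\(N=2\)) approximation are:

\[P(\text{i am sam}) = P(\text{i})\,P(\text{am} \mid \text{i})\,P(\text{sam} \mid \text{i am}) \approx P(\text{i})\,P(\text{am} \mid \text{i})\,P(\text{sam} \mid \text{am})\]

Only the last factor changes: the two-word history "i am" is truncated to the single previous word "am".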

| Model | Name | Conditions on |
|---------|---------|------------------------------------------------------|
| \(N=1\) | Unigram | \(P(w_n)\) – no context |
| \(N=2\) | Bigram | \(P(w_n \mid w_{n-1})\) – previous word |
| \(N=3\) | Trigram | \(P(w_n \mid w_{n-2}, w_{n-1})\) – two previous words |

Maximum Likelihood Estimation (MLE)

For bigrams, the MLE probability is computed from counts:

\[P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}\]

We augment each sentence with <s> start and </s> end symbols to model sentence boundaries.

def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """
    Extract all n-grams from a token list.
    Pads with (n-1) <s> start symbols and 1 </s> end symbol.
    
    Args:
        tokens: list of word tokens
        n: the n-gram order (1=unigram, 2=bigram, etc.)
    Returns:
        list of n-gram tuples
    """
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


# Demonstrate n-gram extraction
tokens = "the cat sat on the mat".split()
print("Tokens:", tokens)
print()
for n in [1, 2, 3]:
    ngrams = extract_ngrams(tokens, n)
    label = {1: "Unigrams", 2: "Bigrams", 3: "Trigrams"}[n]
    print(f"{label} (n={n}):")
    for ng in ngrams:
        print(f"  {ng}")
    print()
def count_ngrams(corpus: List[List[str]], n: int) -> Counter:
    """
    Count n-grams across all sentences in a corpus.
    
    Args:
        corpus: list of tokenized sentences (each sentence is a list of strings)
        n: the n-gram order
    Returns:
        Counter mapping n-gram tuples to their counts
    """
    counts = Counter()
    for sentence in corpus:
        ngrams = extract_ngrams(sentence, n)
        counts.update(ngrams)
    return counts


# Sample corpus -- we will reuse this throughout the lab
sample_corpus = [
    "i am sam".split(),
    "sam i am".split(),
    "i do not like green eggs and ham".split(),
    "i am happy today".split(),
    "sam is not happy".split(),
    "i like sam and green eggs".split(),
    "do not judge me today".split(),
    "today i am not happy".split(),
    "i do like eggs and ham today".split(),
    "sam do not like me".split(),
]

print("Corpus has", len(sample_corpus), "sentences")
print()

for n in [1, 2, 3]:
    counts = count_ngrams(sample_corpus, n)
    label = {1: "Unigram", 2: "Bigram", 3: "Trigram"}[n]
    print(f"Top 10 {label} counts:")
    for ng, c in counts.most_common(10):
        print(f"  {ng}: {c}")
    print()
class BigramModel:
    """Simple bigram language model with MLE probability estimation."""

    def __init__(self):
        self.bigram_counts = Counter()
        self.unigram_counts = Counter()
        self.vocab = set()

    def train(self, corpus: List[List[str]]):
        """
        Train on a corpus (list of tokenized sentences).
        Counts unigrams and bigrams, including <s> and </s> symbols.
        """
        for sentence in corpus:
            tokens = ["<s>"] + sentence + ["</s>"]
            for i in range(len(tokens)):
                self.unigram_counts[tokens[i]] += 1
                self.vocab.add(tokens[i])
                if i > 0:
                    self.bigram_counts[(tokens[i - 1], tokens[i])] += 1

    def probability(self, word: str, context: str) -> float:
        """
        P(word | context) = C(context, word) / C(context)
        Returns 0.0 if the context was never seen.
        """
        context_count = self.unigram_counts[context]
        if context_count == 0:
            return 0.0
        bigram_count = self.bigram_counts[(context, word)]
        return bigram_count / context_count

    def sentence_probability(self, sentence: List[str]) -> float:
        """
        Compute P(sentence) as the product of bigram probabilities.
        Uses log space internally to avoid underflow.
        """
        tokens = ["<s>"] + sentence + ["</s>"]
        log_prob = 0.0
        for i in range(1, len(tokens)):
            p = self.probability(tokens[i], tokens[i - 1])
            if p == 0:
                return 0.0
            log_prob += np.log(p)
        return np.exp(log_prob)

    def perplexity(self, test_corpus: List[List[str]]) -> float:
        """
        Compute perplexity on a test corpus.
        PP = exp( -(1/N) * sum( log P(w_i | w_{i-1}) ) )
        
        N counts all tokens including </s> but not <s>,
        following the convention in SLP Ch 3.3.
        """
        total_log_prob = 0.0
        N = 0
        for sentence in test_corpus:
            tokens = ["<s>"] + sentence + ["</s>"]
            for i in range(1, len(tokens)):
                p = self.probability(tokens[i], tokens[i - 1])
                if p > 0:
                    total_log_prob += np.log(p)
                else:
                    # Assign a very small probability to avoid -inf
                    total_log_prob += np.log(1e-10)
                N += 1
        return np.exp(-total_log_prob / N)
# Train the bigram model on our sample corpus
bigram_model = BigramModel()
bigram_model.train(sample_corpus)

print("Vocabulary size |V| =", len(bigram_model.vocab))
print("Total unigram tokens =", sum(bigram_model.unigram_counts.values()))
print("Unique bigrams seen  =", len(bigram_model.bigram_counts))
print()

# Print selected bigram probabilities
# (cf. SLP Fig 3.2 for the Berkeley Restaurant Project)
test_pairs = [
    ("i", "<s>"),
    ("am", "i"),
    ("sam", "am"),
    ("</s>", "sam"),
    ("not", "do"),
    ("like", "not"),
    ("green", "like"),
    ("eggs", "green"),
]

print("Selected bigram probabilities P(word | context):")
print("-" * 45)
for word, context in test_pairs:
    p = bigram_model.probability(word, context)
    print(f"  P({word:>6s} | {context:<6s}) = {p:.4f}")

print()

# Compute sentence probability and perplexity on test data
test_corpus = [
    "i am sam".split(),
    "sam i am".split(),
    "i like green eggs and ham".split(),
]

print("Sentence probabilities (test):")
for sent in test_corpus:
    p = bigram_model.sentence_probability(sent)
    print(f"  P('{' '.join(sent)}') = {p:.6f}")

pp = bigram_model.perplexity(test_corpus)
print(f"\nTest perplexity: {pp:.2f}")

PART 2: Smoothing

Reference: SLP Ch 3.6

The Zero Probability Problem

With MLE, any n-gram not seen in training gets probability zero. This is catastrophic:

  • A single unseen bigram makes \(P(\text{entire sentence}) = 0\)

  • Perplexity becomes infinite (undefined)

Smoothing (also called discounting) shaves off probability mass from seen events and redistributes it to unseen events.
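A quick sanity check of this redistribution, as a standalone sketch separate from the classes below (the counts here are made up for illustration): after add-1 smoothing, the probabilities of all vocabulary words given a fixed context still sum to 1, because exactly \(|V|\) pseudo-counts are added to the denominator.

```python
from collections import Counter

# Toy counts for the single context word "the" (illustrative numbers only)
vocab = ["the", "cat", "dog", "sat", "mat"]
bigram_counts = Counter({("the", "cat"): 3, ("the", "dog"): 2, ("the", "mat"): 1})
context_count = 6  # C("the")

V = len(vocab)
# Laplace: P(w | "the") = (C("the", w) + 1) / (C("the") + |V|)
probs = {w: (bigram_counts[("the", w)] + 1) / (context_count + V) for w in vocab}

print(probs)
print("sum =", sum(probs.values()))  # sums to 1: unseen words got mass from seen ones
```

Note that `bigram_counts[("the", "sat")]` is 0 for unseen pairs (a `Counter` returns 0 for missing keys), yet the smoothed probability is a nonzero \(1/11\).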

Laplace (Add-1) Smoothing

Add 1 to every bigram count before normalizing (SLP Eq. 3.26):

\[P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + |V|}\]

Add-k Smoothing

A refinement: add a fractional count \(k\) instead of 1 (SLP Eq. 3.28):

\[P_{\text{Add-k}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + k \cdot |V|}\]

Interpolation

Combine unigram, bigram, and trigram estimates with learned weights (SLP Eq. 3.29):

\[\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_{\text{uni}}(w_n) + \lambda_2 P_{\text{bi}}(w_n \mid w_{n-1}) + \lambda_3 P_{\text{tri}}(w_n \mid w_{n-2}, w_{n-1})\]

where \(\lambda_1 + \lambda_2 + \lambda_3 = 1\). The \(\lambda\) values are typically learned on a held-out corpus.
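The held-out estimation step is not implemented in this lab; as a minimal sketch of the idea, a coarse grid search over \(\lambda\) triples that sum to 1 can pick the setting with the highest held-out log-likelihood. The per-token component probabilities below are stand-in numbers, not values learned from a corpus:

```python
import numpy as np

# Stand-in component probabilities for each held-out token:
# (P_uni(w), P_bi(w|c1), P_tri(w|c2,c1)). In practice these come from
# unigram/bigram/trigram models trained on the main training corpus.
held_out = [
    (0.05, 0.20, 0.50),
    (0.08, 0.30, 0.00),   # trigram context unseen -> P_tri = 0
    (0.10, 0.05, 0.40),
    (0.02, 0.15, 0.60),
]

best_lambdas, best_ll = None, -np.inf
step = 0.1
for l1 in np.arange(0.0, 1.0 + 1e-9, step):
    for l2 in np.arange(0.0, 1.0 - l1 + 1e-9, step):
        l3 = 1.0 - l1 - l2
        # Held-out log-likelihood under this lambda setting
        # (tiny floor guards against log(0) on the all-trigram corner)
        ll = sum(np.log(l1 * pu + l2 * pb + l3 * pt + 1e-12)
                 for pu, pb, pt in held_out)
        if ll > best_ll:
            best_lambdas, best_ll = (l1, l2, l3), ll

print("Best lambdas (uni, bi, tri):", best_lambdas)
```

The textbook's preferred method is the EM algorithm rather than grid search, but the objective being maximized is the same held-out likelihood.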

class LaplaceSmoothedBigram(BigramModel):
    """Bigram model with Laplace (add-1) smoothing."""

    def probability(self, word: str, context: str) -> float:
        """
        P_Laplace(w | c) = (C(c, w) + 1) / (C(c) + |V|)
        """
        bigram_count = self.bigram_counts[(context, word)]
        context_count = self.unigram_counts[context]
        V = len(self.vocab)
        return (bigram_count + 1) / (context_count + V)


# Train and demonstrate
laplace_model = LaplaceSmoothedBigram()
laplace_model.train(sample_corpus)

print("Laplace-smoothed bigram probabilities:")
print("-" * 50)
for word, context in test_pairs:
    p_mle = bigram_model.probability(word, context)
    p_lap = laplace_model.probability(word, context)
    print(f"  P({word:>6s} | {context:<6s})  MLE={p_mle:.4f}  Laplace={p_lap:.4f}")

# Show that previously-zero bigrams now have nonzero probability
print("\nPreviously unseen bigrams now get probability:")
unseen_pairs = [("ham", "<s>"), ("today", "eggs"), ("judge", "sam")]
for word, context in unseen_pairs:
    p_mle = bigram_model.probability(word, context)
    p_lap = laplace_model.probability(word, context)
    print(f"  P({word:>6s} | {context:<6s})  MLE={p_mle:.4f}  Laplace={p_lap:.4f}")
class AddKSmoothedBigram(BigramModel):
    """Bigram model with add-k smoothing."""

    def __init__(self, k: float = 0.5):
        super().__init__()
        self.k = k

    def probability(self, word: str, context: str) -> float:
        """
        P_addk(w | c) = (C(c, w) + k) / (C(c) + k * |V|)
        """
        bigram_count = self.bigram_counts[(context, word)]
        context_count = self.unigram_counts[context]
        V = len(self.vocab)
        return (bigram_count + self.k) / (context_count + self.k * V)


# Demonstrate different k values
print("Add-k smoothing with various k values:")
print("-" * 65)
header = f"{'Bigram':<22s} {'MLE':>7s} {'k=0.01':>7s} {'k=0.1':>7s} {'k=0.5':>7s} {'k=1.0':>7s}"
print(header)
print("-" * 65)

k_values = [0.01, 0.1, 0.5, 1.0]
addk_models = {}
for k in k_values:
    m = AddKSmoothedBigram(k=k)
    m.train(sample_corpus)
    addk_models[k] = m

all_pairs = test_pairs + unseen_pairs
for word, context in all_pairs:
    row = f"  P({word:>6s}|{context:<6s})"
    row += f"  {bigram_model.probability(word, context):7.4f}"
    for k in k_values:
        row += f"  {addk_models[k].probability(word, context):7.4f}"
    print(row)
class InterpolatedModel:
    """
    Interpolated trigram/bigram/unigram model.
    P_hat(w | c2, c1) = l1*P_uni(w) + l2*P_bi(w|c1) + l3*P_tri(w|c2,c1)
    """

    def __init__(self, lambda1: float = 0.1, lambda2: float = 0.3, lambda3: float = 0.6):
        """lambdas must sum to 1: unigram, bigram, trigram weights."""
        assert abs(lambda1 + lambda2 + lambda3 - 1.0) < 1e-9, "Lambdas must sum to 1"
        self.l1 = lambda1  # unigram weight
        self.l2 = lambda2  # bigram weight
        self.l3 = lambda3  # trigram weight
        self.unigram_counts = Counter()
        self.bigram_counts = Counter()
        self.trigram_counts = Counter()
        self.total_tokens = 0
        self.vocab = set()

    def train(self, corpus: List[List[str]]):
        """Train unigram, bigram, and trigram counts."""
        for sentence in corpus:
            # Two <s> tokens for trigram context at sentence start
            tokens = ["<s>", "<s>"] + sentence + ["</s>"]
            for i in range(len(tokens)):
                self.unigram_counts[tokens[i]] += 1
                self.vocab.add(tokens[i])
                self.total_tokens += 1
                if i >= 1:
                    self.bigram_counts[(tokens[i - 1], tokens[i])] += 1
                if i >= 2:
                    self.trigram_counts[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1

    def _p_unigram(self, word: str) -> float:
        """P(w) = C(w) / N"""
        if self.total_tokens == 0:
            return 0.0
        return self.unigram_counts[word] / self.total_tokens

    def _p_bigram(self, word: str, context1: str) -> float:
        """P(w | c1) = C(c1, w) / C(c1)"""
        denom = self.unigram_counts[context1]
        if denom == 0:
            return 0.0
        return self.bigram_counts[(context1, word)] / denom

    def _p_trigram(self, word: str, context2: str, context1: str) -> float:
        """P(w | c2, c1) = C(c2, c1, w) / C(c2, c1)"""
        denom = self.bigram_counts[(context2, context1)]
        if denom == 0:
            return 0.0
        return self.trigram_counts[(context2, context1, word)] / denom

    def probability(self, word: str, context1: str, context2: str) -> float:
        """
        P_interp(w | c2, c1) = l1*P_uni(w) + l2*P_bi(w|c1) + l3*P_tri(w|c2,c1)
        """
        p_uni = self._p_unigram(word)
        p_bi = self._p_bigram(word, context1)
        p_tri = self._p_trigram(word, context2, context1)
        return self.l1 * p_uni + self.l2 * p_bi + self.l3 * p_tri

    def perplexity(self, test_corpus: List[List[str]]) -> float:
        """
        Compute perplexity on a test corpus using interpolated probabilities.
        """
        total_log_prob = 0.0
        N = 0
        for sentence in test_corpus:
            tokens = ["<s>", "<s>"] + sentence + ["</s>"]
            for i in range(2, len(tokens)):
                p = self.probability(tokens[i], tokens[i - 1], tokens[i - 2])
                if p > 0:
                    total_log_prob += np.log(p)
                else:
                    total_log_prob += np.log(1e-10)
                N += 1
        return np.exp(-total_log_prob / N)


# Train the interpolated model
interp_model = InterpolatedModel(lambda1=0.1, lambda2=0.3, lambda3=0.6)
interp_model.train(sample_corpus)

print("Interpolated model probabilities (l1=0.1, l2=0.3, l3=0.6):")
print("-" * 55)
demo_triples = [  # (word, previous word c1, word-before-that c2)
    ("am", "i", "<s>"),
    ("sam", "am", "i"),
    ("</s>", "sam", "am"),
    ("not", "do", "i"),
    ("like", "not", "do"),
    ("green", "like", "not"),
]
for word, c1, c2 in demo_triples:
    p = interp_model.probability(word, c1, c2)
    print(f"  P({word:>6s} | {c2:<4s} {c1:<5s}) = {p:.4f}")
# Compare perplexity of all models on test data
test_corpus_full = [
    "i am happy".split(),
    "sam do not like ham".split(),
    "i like green eggs today".split(),
    "sam am i not happy today".split(),
]

# Unsmoothed bigram
pp_unsmoothed = bigram_model.perplexity(test_corpus_full)

# Laplace (k=1)
pp_laplace = laplace_model.perplexity(test_corpus_full)

# Add-k (k=0.1)
addk01_model = AddKSmoothedBigram(k=0.1)
addk01_model.train(sample_corpus)
pp_addk = addk01_model.perplexity(test_corpus_full)

# Interpolated
pp_interp = interp_model.perplexity(test_corpus_full)

print("Perplexity comparison on test set:")
print("=" * 40)
results = [
    ("Unsmoothed Bigram", pp_unsmoothed),
    ("Laplace (k=1)", pp_laplace),
    ("Add-k (k=0.1)", pp_addk),
    ("Interpolated", pp_interp),
]
for name, pp in results:
    print(f"  {name:<22s}: {pp:>10.2f}")

print("\nLower perplexity = better model.")
print("Note: Unsmoothed may show very high perplexity due to unseen bigrams.")

# Bar chart
names = [r[0] for r in results]
pps = [r[1] for r in results]

fig, ax = plt.subplots(figsize=(8, 4))
# Cap display at 500 for readability if unsmoothed is huge
display_pps = [min(p, 500) for p in pps]
bars = ax.bar(names, display_pps, color=["#d9534f", "#5bc0de", "#5cb85c", "#f0ad4e"])
ax.set_ylabel("Perplexity (lower is better)")
ax.set_title("Test Perplexity: Unsmoothed vs. Smoothed Models")
for bar, pp_val in zip(bars, pps):
    label = f"{pp_val:.1f}" if pp_val < 500 else f"{pp_val:.0f}"
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 5,
            label, ha="center", va="bottom", fontsize=10)
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

PART 3: Text Generation

Reference: SLP Ch 3.4

Sampling from a Language Model

To visualize what a language model has learned, we can sample sentences from it. We start with <s>, then at each step sample the next word proportional to \(P(w \mid \text{context})\), and stop when we generate </s> or hit a maximum length.

Temperature Control

Temperature scaling adjusts the "sharpness" of the probability distribution before sampling:

\[P_T(w_i) = \frac{\exp(\log P(w_i) / T)}{\sum_j \exp(\log P(w_j) / T)}\]
  • \(T < 1\): Sharper distribution – more deterministic, favors high-probability words

  • \(T = 1\): Standard sampling from the original distribution

  • \(T > 1\): Flatter distribution – more random, more diverse output
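Before wiring temperature into generation, a minimal numeric check of the formula on a toy three-word distribution (the base probabilities are made up for illustration):

```python
import numpy as np

def apply_temperature(probs, T):
    """Rescale a probability vector by temperature T and renormalize (softmax of log p / T)."""
    logits = np.log(np.asarray(probs)) / T
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

base = [0.7, 0.2, 0.1]
for T in [0.5, 1.0, 2.0]:
    print(f"T={T}: {np.round(apply_temperature(base, T), 3)}")
```

At \(T=0.5\) the top word's share grows above 0.7 (sharper), at \(T=1\) the distribution is unchanged, and at \(T=2\) the top share shrinks toward uniform (flatter).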

def generate_text(model: BigramModel, max_length: int = 20,
                  seed: Optional[int] = None) -> List[str]:
    """
    Generate text by sampling from a bigram model.
    Start with <s>, sample the next word proportional to P(w | context),
    and stop at </s> or after max_length tokens.
    
    Only words observed after the context are candidates; their
    probabilities are renormalized so the sampling distribution sums to 1.
    Using a smoothed model keeps the relative probabilities well-behaved.
    """
    if seed is not None:
        np.random.seed(seed)

    tokens = ["<s>"]
    for _ in range(max_length):
        context = tokens[-1]
        # Gather all possible next words and their probabilities
        candidates = []
        probs = []
        for (c, w), count in model.bigram_counts.items():
            if c == context:
                candidates.append(w)
                probs.append(model.probability(w, c))

        if not candidates:
            break

        # Normalize probabilities to ensure they sum to 1
        probs = np.array(probs)
        probs = probs / probs.sum()

        # Sample from the distribution
        next_word = np.random.choice(candidates, p=probs)
        if next_word == "</s>":
            break
        tokens.append(next_word)

    return tokens[1:]  # remove leading <s>


# Use the Laplace-smoothed model for generation (avoids dead ends)
print("Generated sentences (Laplace-smoothed bigram model):")
print("=" * 55)
for i in range(8):
    sentence = generate_text(laplace_model, max_length=15, seed=42 + i)
    print(f"  [{i+1}] {' '.join(sentence)}")
def generate_with_temperature(model: BigramModel, temperature: float = 1.0,
                              max_length: int = 20,
                              seed: Optional[int] = None) -> List[str]:
    """
    Temperature-controlled text generation.
    
    Args:
        model: a trained bigram model (should be smoothed)
        temperature: T<1 = sharper (more greedy), T=1 = standard, T>1 = flatter (more random)
        max_length: maximum number of tokens to generate
        seed: random seed for reproducibility
    Returns:
        list of generated tokens
    """
    if seed is not None:
        np.random.seed(seed)

    tokens = ["<s>"]
    for _ in range(max_length):
        context = tokens[-1]
        candidates = []
        log_probs = []
        for (c, w), count in model.bigram_counts.items():
            if c == context:
                p = model.probability(w, c)
                if p > 0:
                    candidates.append(w)
                    log_probs.append(np.log(p))

        if not candidates:
            break

        # Apply temperature scaling
        log_probs = np.array(log_probs)
        adjusted_logits = log_probs / temperature
        # Softmax to get probabilities
        adjusted_logits -= adjusted_logits.max()  # numerical stability
        probs = np.exp(adjusted_logits)
        probs = probs / probs.sum()

        next_word = np.random.choice(candidates, p=probs)
        if next_word == "</s>":
            break
        tokens.append(next_word)

    return tokens[1:]


# Generate sentences at different temperatures
temperatures = [0.3, 0.7, 1.0, 1.5, 2.0]

print("Text generation with temperature control:")
print("=" * 65)
for temp in temperatures:
    print(f"\n  Temperature = {temp}:")
    for trial in range(5):
        sentence = generate_with_temperature(
            laplace_model, temperature=temp, max_length=15, seed=100 + trial
        )
        print(f"    [{trial+1}] {' '.join(sentence)}")
# Visualize the effect of temperature on the probability distribution
# for a specific context word
context_word = "i"
candidates = []
base_probs = []
for (c, w), count in laplace_model.bigram_counts.items():
    if c == context_word:
        p = laplace_model.probability(w, c)
        if p > 0:
            candidates.append(w)
            base_probs.append(p)

# Sort by base probability
sorted_indices = np.argsort(base_probs)[::-1]
candidates = [candidates[i] for i in sorted_indices]
base_probs = np.array([base_probs[i] for i in sorted_indices])

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=False)
for ax, temp in zip(axes, [0.3, 1.0, 2.0]):
    log_p = np.log(base_probs)
    adjusted = log_p / temp
    adjusted -= adjusted.max()
    probs = np.exp(adjusted)
    probs = probs / probs.sum()

    ax.bar(range(len(candidates)), probs, color="steelblue")
    ax.set_xticks(range(len(candidates)))
    ax.set_xticklabels(candidates, rotation=45, ha="right", fontsize=8)
    ax.set_title(f"T = {temp}")
    ax.set_ylabel("P(w | 'i')")
    ax.set_xlabel("Next word")

fig.suptitle(f"Effect of Temperature on P(w | '{context_word}')", fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

PART 4: Perplexity & Evaluation

Reference: SLP Ch 3.3

Perplexity

Perplexity is the standard intrinsic evaluation metric for language models. For a test set \(W = w_1 w_2 \ldots w_N\):

\[\text{PP}(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid \text{context})\right)\]

Key insight: Lower perplexity = better model. A perplexity of \(k\) means the model is, on average, as uncertain as if it had to choose uniformly among \(k\) words at each step.
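This "average branching factor" interpretation can be verified directly: a model that assigns uniform probability \(1/k\) to every token has perplexity exactly \(k\), regardless of sequence length. A quick standalone check:

```python
import numpy as np

def perplexity_uniform(k, N=100):
    """Perplexity of a model assigning P = 1/k to each of N test tokens."""
    log_prob = N * np.log(1.0 / k)          # total log-probability of the test set
    return np.exp(-log_prob / N)            # PP = exp(-(1/N) * sum log P) = k

for k in [2, 10, 170]:
    print(f"uniform over {k} words -> perplexity = {perplexity_uniform(k):.2f}")
```

So a bigram perplexity of 170 on WSJ text means the model faces, on average, the same uncertainty as a uniform choice among 170 words.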

From the WSJ experiments in the textbook:

| Model | Perplexity |
|---------|-----|
| Unigram | 962 |
| Bigram | 170 |
| Trigram | 109 |

# Compare perplexity with different training data sizes
# We simulate this by training on increasing portions of the corpus

# Build a larger corpus by repeating and shuffling
large_corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog around the yard".split(),
    "the dog chased the cat around the house".split(),
    "the mat is on the floor near the door".split(),
    "the cat and the dog sat on the mat together".split(),
    "the log is near the yard and the house".split(),
    "the door is near the floor of the house".split(),
    "the cat is near the door of the house".split(),
    "the dog is on the floor near the mat".split(),
    "on the mat sat the cat and the dog".split(),
    "around the house the dog chased the cat".split(),
    "the yard is near the house and the log".split(),
    "the cat sat on the log near the yard".split(),
    "the floor is near the door of the house".split(),
    "the dog and the cat chased around the yard".split(),
    "the mat is on the floor of the house".split(),
    "the cat sat near the log and the mat".split(),
    "the dog sat on the mat near the door".split(),
    "the cat is on the mat near the log".split(),
]

test_set = [
    "the cat sat on the log".split(),
    "the dog is on the mat".split(),
    "the cat chased the dog around the house".split(),
    "the mat is near the door".split(),
]

# Train on increasing fractions of the training data
fractions = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
pp_by_size = []

for frac in fractions:
    n_sentences = max(1, int(len(large_corpus) * frac))
    subset = large_corpus[:n_sentences]
    
    model = LaplaceSmoothedBigram()
    model.train(subset)
    pp = model.perplexity(test_set)
    pp_by_size.append(pp)
    print(f"  Training sentences: {n_sentences:>3d} ({frac*100:.0f}%)  ->  Test PP = {pp:.2f}")

# Plot
fig, ax = plt.subplots(figsize=(8, 4))
n_sents = [max(1, int(len(large_corpus) * f)) for f in fractions]
ax.plot(n_sents, pp_by_size, "o-", color="steelblue", linewidth=2, markersize=8)
ax.set_xlabel("Number of Training Sentences")
ax.set_ylabel("Test Perplexity")
ax.set_title("Effect of Training Data Size on Perplexity")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nMore training data -> lower perplexity (better generalization).")
# Show effect of smoothing on test perplexity more systematically

# Train all models on the full large_corpus
model_unsmoothed = BigramModel()
model_unsmoothed.train(large_corpus)

model_laplace = LaplaceSmoothedBigram()
model_laplace.train(large_corpus)

k_sweep = [0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
pp_k_sweep = []
for k in k_sweep:
    m = AddKSmoothedBigram(k=k)
    m.train(large_corpus)
    pp_k_sweep.append(m.perplexity(test_set))

pp_unsm = model_unsmoothed.perplexity(test_set)
pp_lap = model_laplace.perplexity(test_set)

print("Smoothing effect on test perplexity:")
print("=" * 45)
print(f"  Unsmoothed Bigram : {pp_unsm:.2f}")
print(f"  Laplace (k=1.0)   : {pp_lap:.2f}")
print()
print("  Add-k sweep:")
for k, pp in zip(k_sweep, pp_k_sweep):
    marker = " <-- best" if pp == min(pp_k_sweep) else ""
    print(f"    k={k:<6.3f} : PP = {pp:.2f}{marker}")

# Plot
fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogx(k_sweep, pp_k_sweep, "o-", color="steelblue", linewidth=2, markersize=8, label="Add-k")
ax.axhline(y=pp_unsm, color="red", linestyle="--", linewidth=1.5,
           label=f"Unsmoothed (PP={pp_unsm:.1f})")
ax.set_xlabel("k (smoothing parameter, log scale)")
ax.set_ylabel("Test Perplexity")
ax.set_title("Effect of Smoothing Parameter k on Test Perplexity")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nBest k = {k_sweep[np.argmin(pp_k_sweep)]:.3f} with PP = {min(pp_k_sweep):.2f}")
print("Too little smoothing -> can't handle unseen n-grams.")
print("Too much smoothing -> uniform distribution, high perplexity.")

PART 5: Overfitting vs. Generalization

Reference: SLP Ch 3.5

Higher-order n-grams capture more context and fit training data better, but they also memorize the training data. With a small corpus:

  • Training perplexity decreases monotonically as \(n\) increases (the model increasingly memorizes longer training sequences)

  • Test perplexity initially decreases (better context helps) but then increases as the model overfits to training-specific patterns

This is the classic bias-variance tradeoff: unigrams are high-bias/low-variance, while high-order n-grams are low-bias/high-variance.

As the textbook notes: "the 4-gram sentences look a little too much like Shakespeare. The words 'It cannot be but so' are directly from King John." (SLP Ch 3.5)

class NgramModel:
    """
    General n-gram language model with add-k smoothing.
    Supports any order n >= 1.
    """

    def __init__(self, n: int, k: float = 0.1):
        self.n = n
        self.k = k
        self.ngram_counts = Counter()    # counts of n-grams
        self.context_counts = Counter()  # counts of (n-1)-gram contexts
        self.vocab = set()

    def train(self, corpus: List[List[str]]):
        """Train the model by counting n-grams."""
        for sentence in corpus:
            # Pad with (n-1) start symbols and 1 end symbol
            tokens = ["<s>"] * (self.n - 1) + sentence + ["</s>"]
            for token in tokens:
                self.vocab.add(token)
            for i in range(self.n - 1, len(tokens)):
                ngram = tuple(tokens[i - self.n + 1: i + 1])
                context = tuple(tokens[i - self.n + 1: i])
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1

    def probability(self, word: str, context: Tuple[str, ...]) -> float:
        """
        P(word | context) with add-k smoothing.
        For unigrams (n=1), context is an empty tuple.
        """
        ngram = context + (word,)
        ngram_count = self.ngram_counts[ngram]
        context_count = self.context_counts[context]
        V = len(self.vocab)
        return (ngram_count + self.k) / (context_count + self.k * V)

    def perplexity(self, test_corpus: List[List[str]]) -> float:
        """Compute perplexity on a test corpus."""
        total_log_prob = 0.0
        N = 0
        for sentence in test_corpus:
            tokens = ["<s>"] * (self.n - 1) + sentence + ["</s>"]
            for i in range(self.n - 1, len(tokens)):
                context = tuple(tokens[i - self.n + 1: i])
                word = tokens[i]
                p = self.probability(word, context)
                if p > 0:
                    total_log_prob += np.log(p)
                else:
                    total_log_prob += np.log(1e-10)
                N += 1
        return np.exp(-total_log_prob / N)


# Train n-gram models of order 1 through 4
# Use the large_corpus for training and test_set for testing
orders = [1, 2, 3, 4]
train_perplexities = []
test_perplexities = []

print("N-gram order comparison (add-k smoothing, k=0.1):")
print("=" * 55)
print(f"  {'Order':<8s} {'Train PP':>12s} {'Test PP':>12s} {'Overfit?':>10s}")
print("-" * 55)

for n in orders:
    model = NgramModel(n=n, k=0.1)
    model.train(large_corpus)

    train_pp = model.perplexity(large_corpus)
    test_pp = model.perplexity(test_set)

    train_perplexities.append(train_pp)
    test_perplexities.append(test_pp)

    gap = test_pp - train_pp
    overfit = "Yes" if gap > train_pp * 0.5 else "No"
    print(f"  n={n:<5d} {train_pp:>12.2f} {test_pp:>12.2f} {overfit:>10s}")

# Plot train vs test perplexity
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(orders, train_perplexities, "o-", color="steelblue", linewidth=2,
        markersize=10, label="Train Perplexity")
ax.plot(orders, test_perplexities, "s--", color="#d9534f", linewidth=2,
        markersize=10, label="Test Perplexity")
ax.set_xlabel("N-gram Order")
ax.set_ylabel("Perplexity")
ax.set_title("Overfitting: Train vs. Test Perplexity by N-gram Order")
ax.set_xticks(orders)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Annotate the gap
for i, n in enumerate(orders):
    ax.annotate("", xy=(n, test_perplexities[i]), xytext=(n, train_perplexities[i]),
                arrowprops=dict(arrowstyle="<->", color="gray", lw=1.5))
    mid = (train_perplexities[i] + test_perplexities[i]) / 2
    gap = test_perplexities[i] - train_perplexities[i]
    ax.text(n + 0.1, mid, f"gap={gap:.1f}", fontsize=9, color="gray")

plt.tight_layout()
plt.show()

print("\nKey observations:")
print("- Training perplexity decreases as n increases (model memorizes training data).")
print("- Test perplexity may increase for higher n (overfitting to training patterns).")
print("- The gap between train and test PP grows with n -- the hallmark of overfitting.")
print("- With a small corpus, high-order n-grams have very sparse count tables.")

Summary

This lab implemented the core concepts from Chapter 3 of Speech and Language Processing:

| Concept | What We Built | Key Equation |
|---------|---------------|--------------|
| N-gram extraction | `extract_ngrams()`, `count_ngrams()` | Padding with `<s>` and `</s>` |
| MLE estimation | `BigramModel` | \(P(w \mid c) = C(c,w) / C(c)\) |
| Laplace smoothing | `LaplaceSmoothedBigram` | \(P = (C + 1) / (C_{ctx} + V)\) |
| Add-k smoothing | `AddKSmoothedBigram` | \(P = (C + k) / (C_{ctx} + kV)\) |
| Interpolation | `InterpolatedModel` | \(\hat{P} = \lambda_1 P_{uni} + \lambda_2 P_{bi} + \lambda_3 P_{tri}\) |
| Text generation | `generate_text()`, `generate_with_temperature()` | Sampling from \(P(w \mid \text{context})\) |
| Perplexity | `perplexity()` methods | \(PP = \exp(-(1/N)\sum \log P)\) |
| Overfitting | `NgramModel` (orders 1–4) | Train PP vs. test PP gap |

Key takeaways:

  1. N-gram models use the Markov assumption to make language modeling tractable

  2. MLE gives zero probability to unseen n-grams – smoothing is essential

  3. Interpolation often outperforms simple add-k smoothing

  4. Perplexity is the standard evaluation metric (lower = better)

  5. Higher-order n-grams overfit on small corpora – the bias-variance tradeoff