Lab 03: Word Embeddings

Source: Speech and Language Processing by Daniel Jurafsky & James H. Martin – Chapter 5: Embeddings

This lab implements the core concepts of word embeddings from scratch:

  1. Count-Based Embeddings – term-document matrices, word-word co-occurrence, TF-IDF

  2. Cosine Similarity – measuring word similarity, finding nearest neighbors, word analogies

  3. Word2Vec (Skip-gram) – learning dense embeddings with negative sampling

  4. Dimensionality Reduction – PCA for visualization

  5. Evaluating Embeddings – intrinsic evaluation with similarity and analogy tasks

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Optional

np.random.seed(42)
%matplotlib inline

PART 1: Count-Based Embeddings

Reference: SLP Ch 5.2–5.3

The distributional hypothesis (Firth, 1957) states:

“You shall know a word by the company it keeps.”

Words that occur in similar contexts tend to have similar meanings. We can operationalize this by building vectors from co-occurrence counts.

Term-Document Matrix

A term-document matrix has shape \(|V| \times |D|\) where each cell records how often word \(w\) appears in document \(d\). Each column is a document vector; each row is a word vector whose dimensions correspond to documents.

Word-Word Co-occurrence Matrix

A word-context matrix (or word-word co-occurrence matrix) has shape \(|V| \times |V|\). Each cell records the number of times the row word (target) and the column word (context) co-occur within a context window of \(\pm k\) words in a training corpus. Two words with similar co-occurrence patterns (similar row vectors) are likely to be semantically related.

TF-IDF Weighting

Raw counts can be misleading: frequent words like “the” dominate. TF-IDF down-weights common terms:

\[\text{TF}(t, d) = \frac{\text{count}(t, d)}{\sum_{t'} \text{count}(t', d)}\]
\[\text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)\]
\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]

where \(N\) is the total number of documents and \(\text{df}(t)\) is the number of documents containing term \(t\).
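The formulas above can be checked by hand on a tiny example (the counts below are made up for illustration):

```python
import numpy as np

# Toy term-document counts: rows are ["cat", "the"], columns are 2 documents
counts = np.array([[2.0, 0.0],   # "cat": twice in doc 0, absent from doc 1
                   [2.0, 4.0]])  # "the": present in both documents

N = counts.shape[1]                           # number of documents
tf = counts / counts.sum(axis=0)              # TF: normalize each column
df = (counts > 0).sum(axis=1, keepdims=True)  # DF: docs containing each term
idf = np.log(N / df)                          # IDF: log(N / df)
tfidf = tf * idf

# "the" appears in every document, so IDF("the") = log(2/2) = 0 and its
# TF-IDF is zero everywhere; "cat" keeps weight 0.5 * log(2) in doc 0.
print(tfidf)
```

This mirrors what compute_tfidf below does, minus the epsilon smoothing used there to guard against empty documents.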

def build_term_document_matrix(
    documents: List[List[str]],
) -> Tuple[np.ndarray, List[str], List[int]]:
    """
    Build a term-document matrix.

    Args:
        documents: list of documents, each document is a list of words.

    Returns:
        matrix: shape (|V|, |D|) -- rows are words, columns are documents.
        vocab: sorted list of unique words.
        doc_indices: list of document indices [0, 1, ..., |D|-1].
    """
    vocab = sorted(set(word for doc in documents for word in doc))
    word_to_idx = {w: i for i, w in enumerate(vocab)}

    matrix = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for word in doc:
            matrix[word_to_idx[word], j] += 1

    return matrix, vocab, list(range(len(documents)))
def build_cooccurrence_matrix(
    corpus: List[List[str]], window_size: int = 2
) -> Tuple[np.ndarray, List[str]]:
    """
    Build a symmetric word-word co-occurrence matrix.

    For each word, count how many times every other word appears within
    a +/- window_size context window across all sentences.

    Args:
        corpus: list of sentences, each sentence is a list of words.
        window_size: number of words to the left and right to consider.

    Returns:
        matrix: shape (|V|, |V|) symmetric co-occurrence counts.
        vocab: sorted list of unique words.
    """
    vocab = sorted(set(word for sent in corpus for word in sent))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    matrix = np.zeros((V, V))

    for sentence in corpus:
        for i, word in enumerate(sentence):
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)
            for j in range(start, end):
                if i != j:
                    matrix[word_to_idx[word], word_to_idx[sentence[j]]] += 1

    return matrix, vocab
def compute_tfidf(term_doc_matrix: np.ndarray) -> np.ndarray:
    """
    Compute TF-IDF from a term-document matrix.

    TF(t,d) = count(t,d) / total_words_in_d
    IDF(t)  = log(N / df(t))
    TF-IDF  = TF * IDF

    Args:
        term_doc_matrix: shape (|V|, |D|), raw counts.

    Returns:
        tfidf_matrix: shape (|V|, |D|), TF-IDF weighted values.
    """
    N = term_doc_matrix.shape[1]  # number of documents

    # TF: normalize each column (document) by its total word count
    col_sums = term_doc_matrix.sum(axis=0, keepdims=True)  # (1, |D|)
    tf = term_doc_matrix / (col_sums + 1e-10)

    # DF: number of documents each term appears in
    df = np.sum(term_doc_matrix > 0, axis=1, keepdims=True)  # (|V|, 1)

    # IDF: log(N / df), with smoothing to avoid division by zero
    idf = np.log(N / (df + 1e-10))

    return tf * idf
# --- Demo: Count-Based Embeddings ---

documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "the bird flew over the mat".split(),
]

# Term-document matrix
tdm, vocab_td, doc_ids = build_term_document_matrix(documents)
print("=== Term-Document Matrix ===")
print(f"Vocabulary ({len(vocab_td)} words): {vocab_td}")
print(f"Shape: {tdm.shape}  (words x documents)\n")
# Print as a table
header = "Word".ljust(12) + "".join(f"Doc {d}".rjust(6) for d in doc_ids)
print(header)
print("-" * len(header))
for i, w in enumerate(vocab_td):
    row = w.ljust(12) + "".join(f"{int(tdm[i, j])}".rjust(6) for j in doc_ids)
    print(row)

# TF-IDF
tfidf = compute_tfidf(tdm)
print("\n=== TF-IDF Matrix ===")
header = "Word".ljust(12) + "".join(f"Doc {d}".rjust(8) for d in doc_ids)
print(header)
print("-" * len(header))
for i, w in enumerate(vocab_td):
    row = w.ljust(12) + "".join(f"{tfidf[i, j]:.3f}".rjust(8) for j in doc_ids)
    print(row)
print("\nNote: 'the' has TF-IDF = 0.000 everywhere (appears in all docs, IDF -> 0).")

# Co-occurrence matrix
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]
cooc, vocab_co = build_cooccurrence_matrix(corpus, window_size=2)
print("\n=== Word-Word Co-occurrence Matrix (window=2) ===")
print(f"Vocabulary: {vocab_co}")
header = "".ljust(10) + "".join(w.rjust(10) for w in vocab_co)
print(header)
for i, w in enumerate(vocab_co):
    row = w.ljust(10) + "".join(f"{int(cooc[i, j])}".rjust(10) for j in range(len(vocab_co)))
    print(row)

PART 2: Cosine Similarity

Reference: SLP Ch 5.4

The raw dot product favors frequent (long) vectors. We normalize by dividing by the vector lengths, giving us the cosine of the angle between two vectors:

\[\text{cosine}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \sqrt{\sum_i v_i^2}}\]

Range:

  • \(1\) for vectors pointing in the same direction (identical distribution)

  • \(0\) for orthogonal vectors (no shared context)

  • \(-1\) for vectors pointing in opposite directions

For non-negative count vectors the range is \([0, 1]\).

Word Analogy

A famous property of word embeddings: vector arithmetic captures semantic relationships. To solve a : b :: c : ?, compute \(\mathbf{b} - \mathbf{a} + \mathbf{c}\) and find the word whose embedding is closest to the result (excluding a, b, c). The classic example:

\[\text{king} - \text{man} + \text{woman} \approx \text{queen}\]
def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors.

    cos(u, v) = (u . v) / (||u|| * ||v||)

    Returns 0.0 if either vector has zero norm.
    """
    dot = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(dot / (norm_u * norm_v))
def most_similar(
    word: str,
    matrix: np.ndarray,
    vocab: List[str],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """
    Find the top-k most similar words to `word` using cosine similarity.

    Args:
        word: the query word.
        matrix: shape (|V|, d) -- each row is a word embedding.
        vocab: list of words corresponding to rows of `matrix`.
        top_k: how many neighbors to return.

    Returns:
        List of (word, similarity) tuples, sorted descending by similarity.
    """
    if word not in vocab:
        return []
    word_idx = vocab.index(word)
    word_vec = matrix[word_idx]

    similarities = []
    for i, w in enumerate(vocab):
        if i == word_idx:
            continue
        sim = cosine_similarity(word_vec, matrix[i])
        similarities.append((w, sim))

    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]
def word_analogy(
    a: str,
    b: str,
    c: str,
    embeddings: Dict[str, np.ndarray],
    vocab: List[str],
) -> str:
    """
    Solve the analogy: a is to b as c is to ?

    Method: find the word whose embedding is closest to (b - a + c),
    excluding a, b, and c from the candidates.

    Args:
        a, b, c: words in the analogy.
        embeddings: dict mapping word -> np.ndarray vector.
        vocab: list of candidate words.

    Returns:
        The word d that best completes a:b :: c:d.
    """
    target = embeddings[b] - embeddings[a] + embeddings[c]
    exclude = {a, b, c}

    best_word = None
    best_sim = -2.0
    for w in vocab:
        if w in exclude or w not in embeddings:
            continue
        sim = cosine_similarity(target, embeddings[w])
        if sim > best_sim:
            best_sim = sim
            best_word = w

    return best_word
# --- Demo: Cosine Similarity ---

print("=== Cosine Similarity ===")
u = np.array([1, 2, 3], dtype=float)
v = np.array([2, 4, 6], dtype=float)  # parallel to u
w = np.array([3, 2, 1], dtype=float)  # different direction
z = np.array([0, 0, 0], dtype=float)  # zero vector

print(f"cos(u, v) = {cosine_similarity(u, v):.4f}  (parallel vectors, expect 1.0)")
print(f"cos(u, w) = {cosine_similarity(u, w):.4f}  (different direction)")
print(f"cos(u, z) = {cosine_similarity(u, z):.4f}  (zero vector, expect 0.0)")

# Recreating the textbook example (SLP Fig 5.3 / 5.5)
print("\n--- Textbook Example: cherry vs. digital vs. information ---")
# Dimensions: [pie, data, computer]
cherry      = np.array([442, 8, 2], dtype=float)
digital     = np.array([5, 1683, 1670], dtype=float)
information = np.array([5, 3982, 3325], dtype=float)

print(f"cos(cherry, information)  = {cosine_similarity(cherry, information):.3f}")
print(f"cos(digital, information) = {cosine_similarity(digital, information):.3f}")
print("As the textbook notes, 'information' is much closer to 'digital' than to 'cherry'.")

# Most similar words from co-occurrence matrix
print("\n--- Most Similar Words (co-occurrence matrix from Part 1) ---")
for query in ["I", "like"]:
    neighbors = most_similar(query, cooc, vocab_co, top_k=3)
    print(f"  '{query}' -> {[(w, round(s, 3)) for w, s in neighbors]}")

# Word analogy demo with hand-crafted embeddings
print("\n--- Word Analogy (hand-crafted embeddings) ---")
# Create toy embeddings that encode gender and royalty dimensions
toy_embeddings = {
    "king":    np.array([0.9, 0.1, 0.9]),   # [royalty, female, authority]
    "queen":   np.array([0.9, 0.9, 0.9]),
    "man":     np.array([0.1, 0.1, 0.5]),
    "woman":   np.array([0.1, 0.9, 0.5]),
    "prince":  np.array([0.7, 0.1, 0.4]),
    "princess": np.array([0.7, 0.9, 0.4]),
}
toy_vocab = list(toy_embeddings.keys())

result = word_analogy("man", "king", "woman", toy_embeddings, toy_vocab)
print(f"  man : king :: woman : {result}")

result = word_analogy("king", "prince", "queen", toy_embeddings, toy_vocab)
print(f"  king : prince :: queen : {result}")

PART 3: Word2Vec (Skip-gram)

Reference: SLP Ch 5.5

Count-based methods produce sparse vectors of dimension \(|V|\) (often 10,000–50,000). Word2Vec (Mikolov et al., 2013) learns short, dense vectors (50–300 dimensions) via a prediction task.

Skip-gram Intuition

The skip-gram model predicts context words from a center (target) word:

  1. Treat the target word and a neighboring context word as positive examples.

  2. Randomly sample other words from the lexicon as negative examples.

  3. Train a logistic regression classifier to distinguish the two cases.

  4. Use the learned weights as the word embeddings.

Two Embedding Matrices

The model maintains two matrices:

  • W (target/center embeddings): shape \((|V|, d)\)

  • C (context embeddings): shape \((|V|, d)\)

Negative Sampling Loss

For a positive pair \((w, c_{\text{pos}})\) with \(k\) noise words \(c_{\text{neg}_1} \ldots c_{\text{neg}_k}\):

\[L = -\left[\log \sigma(\mathbf{c}_{\text{pos}} \cdot \mathbf{w}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{c}_{\text{neg}_i} \cdot \mathbf{w})\right]\]

Gradients (SLP Eq. 5.22–5.27)

\[\frac{\partial L}{\partial \mathbf{c}_{\text{pos}}} = [\sigma(\mathbf{c}_{\text{pos}} \cdot \mathbf{w}) - 1]\,\mathbf{w}\]
\[\frac{\partial L}{\partial \mathbf{c}_{\text{neg}_i}} = [\sigma(\mathbf{c}_{\text{neg}_i} \cdot \mathbf{w})]\,\mathbf{w}\]
\[\frac{\partial L}{\partial \mathbf{w}} = [\sigma(\mathbf{c}_{\text{pos}} \cdot \mathbf{w}) - 1]\,\mathbf{c}_{\text{pos}} + \sum_{i=1}^{k} [\sigma(\mathbf{c}_{\text{neg}_i} \cdot \mathbf{w})]\,\mathbf{c}_{\text{neg}_i}\]

The SGD update rules subtract \(\eta \times \text{gradient}\) from the current parameters.
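Before trusting the updates, the gradient with respect to \(\mathbf{w}\) can be sanity-checked against central finite differences on the loss (a verification sketch added here; the vectors and \(k\) are random toy values, not from the textbook):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3
w0 = rng.normal(size=d)           # target embedding
c_pos = rng.normal(size=d)        # positive context embedding
c_negs = rng.normal(size=(k, d))  # k negative context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w):
    # L = -[log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w)]
    return -(np.log(sigmoid(c_pos @ w)) + np.sum(np.log(sigmoid(-c_negs @ w))))

# Analytic gradient w.r.t. w (the third equation above)
analytic = (sigmoid(c_pos @ w0) - 1.0) * c_pos + sigmoid(c_negs @ w0) @ c_negs

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(loss(w0 + eps * e) - loss(w0 - eps * e)) / (2 * eps)
                    for e in np.eye(d)])

assert np.allclose(analytic, numeric, atol=1e-6)
print("Analytic gradient matches finite differences.")
```

The same trick works for the context-embedding gradients; it is a cheap way to catch sign errors before wiring the updates into the training loop.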

def generate_skipgram_pairs(
    corpus: List[List[str]], window_size: int = 2
) -> List[Tuple[str, str]]:
    """
    Generate (center, context) pairs for skip-gram training.

    For each word in each sentence, pair it with every word within
    +/- window_size positions.

    Args:
        corpus: list of sentences (each a list of tokens).
        window_size: how many words left/right to include.

    Returns:
        List of (center_word, context_word) string pairs.
    """
    pairs = []
    for sentence in corpus:
        for i, center in enumerate(sentence):
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)
            for j in range(start, end):
                if i != j:
                    pairs.append((center, sentence[j]))
    return pairs
class SimpleWord2Vec:
    """
    Simplified Word2Vec skip-gram with negative sampling.

    Implements the SGNS algorithm from SLP Ch 5.5 with:
    - Two embedding matrices W (target) and C (context)
    - Negative sampling from a unigram distribution raised to the 0.75 power
    - SGD updates following Eq. 5.25--5.27
    """

    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int = 50,
        lr: float = 0.025,
    ):
        self.embedding_dim = embedding_dim
        self.lr = lr
        self.vocab_size = vocab_size
        # Target (center) embeddings W and context embeddings C
        self.W = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.C = np.random.randn(vocab_size, embedding_dim) * 0.01

    @staticmethod
    def _sigmoid(x: np.ndarray) -> np.ndarray:
        """Numerically stable sigmoid."""
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

    def train_pair(
        self,
        center_idx: int,
        context_idx: int,
        negative_indices: List[int],
    ) -> float:
        """
        One SGD step for a (center, context) pair with negative samples.

        Implements the gradient updates from SLP Eq. 5.25--5.27:
          c_pos <- c_pos - lr * [sigma(c_pos . w) - 1] * w
          c_neg <- c_neg - lr * [sigma(c_neg . w)] * w
          w     <- w     - lr * {[sigma(c_pos . w) - 1] * c_pos
                                  + sum [sigma(c_neg . w)] * c_neg}

        Returns:
            The loss for this training instance.
        """
        w = self.W[center_idx].copy()
        c_pos = self.C[context_idx].copy()

        # --- Positive pair ---
        score_pos = self._sigmoid(np.dot(c_pos, w))
        # Loss contribution: -log(sigma(c_pos . w))
        loss = -np.log(score_pos + 1e-10)

        # Gradients for w accumulator
        grad_w = (score_pos - 1.0) * c_pos

        # Update context embedding for positive sample (Eq. 5.25)
        self.C[context_idx] -= self.lr * (score_pos - 1.0) * w

        # --- Negative pairs ---
        for neg_idx in negative_indices:
            c_neg = self.C[neg_idx].copy()
            score_neg = self._sigmoid(np.dot(c_neg, w))
            # Loss contribution: -log(sigma(-c_neg . w)) = -log(1 - sigma(c_neg . w))
            loss -= np.log(1.0 - score_neg + 1e-10)

            # Accumulate gradient for w (Eq. 5.24 second term)
            grad_w += score_neg * c_neg

            # Update context embedding for negative sample (Eq. 5.26)
            self.C[neg_idx] -= self.lr * score_neg * w

        # Update target embedding (Eq. 5.27)
        self.W[center_idx] -= self.lr * grad_w

        return float(loss)

    def get_embedding(self, word_idx: int) -> np.ndarray:
        """Return the target embedding for a word."""
        return self.W[word_idx]
# --- Train Word2Vec on a small corpus ---

train_corpus = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the prince is the son of the king".split(),
    "the princess is the daughter of the queen".split(),
    "the man works in the field".split(),
    "the woman works in the house".split(),
] * 20  # Repeat to get enough training signal

# Build vocabulary and word-index mapping
w2v_vocab = sorted(set(word for sent in train_corpus for word in sent))
w2v_word_to_idx = {w: i for i, w in enumerate(w2v_vocab)}
w2v_idx_to_word = {i: w for w, i in w2v_word_to_idx.items()}
V = len(w2v_vocab)
print(f"Vocabulary ({V} words): {w2v_vocab}")

# Build unigram distribution for negative sampling (raised to 0.75 power)
word_counts = Counter(word for sent in train_corpus for word in sent)
freqs = np.array([word_counts[w2v_idx_to_word[i]] for i in range(V)], dtype=float)
freqs_alpha = freqs ** 0.75
neg_sample_probs = freqs_alpha / freqs_alpha.sum()

# Generate skip-gram pairs
pairs = generate_skipgram_pairs(train_corpus, window_size=2)
print(f"Generated {len(pairs)} skip-gram training pairs.")
print(f"First 10 pairs: {pairs[:10]}")

# Train
EMBEDDING_DIM = 30
NUM_NEGATIVES = 5
NUM_EPOCHS = 5

model = SimpleWord2Vec(vocab_size=V, embedding_dim=EMBEDDING_DIM, lr=0.025)

for epoch in range(NUM_EPOCHS):
    # Shuffle pairs each epoch
    indices = np.random.permutation(len(pairs))
    total_loss = 0.0
    for idx in indices:
        center_word, context_word = pairs[idx]
        center_idx = w2v_word_to_idx[center_word]
        context_idx = w2v_word_to_idx[context_word]

        # Sample negatives (exclude the center word)
        neg_indices = []
        while len(neg_indices) < NUM_NEGATIVES:
            neg = np.random.choice(V, p=neg_sample_probs)
            if neg != center_idx:
                neg_indices.append(int(neg))

        loss = model.train_pair(center_idx, context_idx, neg_indices)
        total_loss += loss

    avg_loss = total_loss / len(pairs)
    print(f"Epoch {epoch + 1}/{NUM_EPOCHS}, avg loss: {avg_loss:.4f}")

# Show similar words using the learned embeddings
print("\n=== Most Similar Words (Word2Vec) ===")
w2v_matrix = np.array([model.get_embedding(i) for i in range(V)])
for query in ["king", "queen", "man", "woman"]:
    neighbors = most_similar(query, w2v_matrix, w2v_vocab, top_k=5)
    print(f"  '{query}' -> {[(w, round(s, 3)) for w, s in neighbors]}")

PART 4: Dimensionality Reduction

Reference: SLP Ch 5.6

We cannot directly visualize 30- or 300-dimensional embeddings. PCA (Principal Component Analysis) projects high-dimensional data onto the directions of maximum variance.

PCA Algorithm

  1. Center the data: subtract the mean vector \(\bar{\mathbf{x}}\) from every data point.

  2. Compute the covariance matrix \(\mathbf{\Sigma} = \frac{1}{n}\mathbf{X}^T \mathbf{X}\) (where \(\mathbf{X}\) is the centered data).

  3. Eigendecomposition: find eigenvalues \(\lambda_1 \geq \lambda_2 \geq \ldots\) and corresponding eigenvectors.

  4. Project onto the top-\(k\) eigenvectors to get a \(k\)-dimensional representation.

For visualization we project to 2D, using the two eigenvectors with the largest eigenvalues.

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """
    Reduce embeddings to 2D using PCA.

    Steps:
        1. Center the data (subtract mean).
        2. Compute covariance matrix.
        3. Eigendecomposition via np.linalg.eigh (symmetric matrix).
        4. Project onto top 2 eigenvectors.

    Args:
        embeddings: shape (n_words, d) -- one row per word.

    Returns:
        projected: shape (n_words, 2).
    """
    # 1. Center
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean

    # 2. Covariance matrix (d x d)
    cov = np.dot(centered.T, centered) / centered.shape[0]

    # 3. Eigendecomposition (eigh returns sorted ascending)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Take top 2 eigenvectors (last two columns, reversed for descending)
    top2 = eigenvectors[:, -2:][:, ::-1]  # shape (d, 2)

    # Project
    projected = np.dot(centered, top2)
    return projected
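A quick cross-check of the eigendecomposition route (our addition, using randomly generated data): PCA can equivalently be computed from the SVD of the centered data matrix, so the two projections should agree up to a per-component sign flip.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))   # 50 fake "embeddings" of dimension 10
Xc = X - X.mean(axis=0)         # center, as in step 1

# Route 1: eigendecomposition of the covariance matrix (steps 2-4 above)
cov = Xc.T @ Xc / Xc.shape[0]
_, eigvecs = np.linalg.eigh(cov)            # eigh sorts eigenvalues ascending
proj_eig = Xc @ eigvecs[:, -2:][:, ::-1]    # top 2 components, descending

# Route 2: SVD of the centered data -- the right singular vectors
# are the same principal axes (up to sign)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj_svd = Xc @ Vt[:2].T

# Each column may differ by a global sign, so compare absolute values
assert np.allclose(np.abs(proj_eig), np.abs(proj_svd))
print("Eigendecomposition and SVD give the same 2D projection (up to sign).")
```

The SVD route is the numerically preferred formulation in practice, since it avoids forming the covariance matrix explicitly.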
def visualize_embeddings(
    embeddings: np.ndarray,
    words: List[str],
    title: str = "Word Embeddings (PCA 2D)",
) -> None:
    """
    Reduce embeddings to 2D via PCA and display a scatter plot with word labels.

    Args:
        embeddings: shape (n_words, d).
        words: list of word strings matching the rows of embeddings.
        title: plot title.
    """
    coords = pca_2d(embeddings)

    fig, ax = plt.subplots(figsize=(10, 7))
    ax.scatter(coords[:, 0], coords[:, 1], c="steelblue", s=80, edgecolors="k", linewidths=0.5)

    for i, word in enumerate(words):
        ax.annotate(
            word,
            (coords[i, 0], coords[i, 1]),
            textcoords="offset points",
            xytext=(5, 5),
            fontsize=11,
            fontweight="bold",
        )

    ax.set_title(title, fontsize=14)
    ax.set_xlabel("PC 1", fontsize=12)
    ax.set_ylabel("PC 2", fontsize=12)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
# --- Demo: Visualize the Word2Vec embeddings ---

# Use the embeddings trained in Part 3 (excluding very frequent words like 'the', 'of', 'in', 'is')
stopwords = {"the", "of", "in", "is"}
content_words = [w for w in w2v_vocab if w not in stopwords]
content_indices = [w2v_word_to_idx[w] for w in content_words]
content_embeddings = w2v_matrix[content_indices]

print(f"Visualizing {len(content_words)} content words: {content_words}")
visualize_embeddings(content_embeddings, content_words, title="Word2Vec Embeddings (PCA 2D)")

PART 5: Evaluating Embeddings

Reference: SLP Ch 5.6–5.7

How do we know whether our embeddings are good? Two categories of evaluation:

Intrinsic Evaluation

Tests the quality of the embeddings directly:

  • Word similarity: Compute cosine similarity between word pairs and compare with human-assigned similarity scores. Measure correlation (Spearman’s \(\rho\)). Datasets: SimLex-999, WordSim-353.

  • Word analogy: Solve a : b :: c : ? and measure accuracy. Dataset: Google analogy test set.

Extrinsic Evaluation

Use the embeddings as features in a downstream NLP task (e.g., sentiment analysis, named entity recognition) and measure task performance. This is the ultimate test but is expensive and confounded by other system components.

Spearman Rank Correlation

Spearman’s \(\rho\) measures the monotonic association between two ranked lists. Given ranks \(r_i\) and \(s_i\):

\[\rho = 1 - \frac{6 \sum_i (r_i - s_i)^2}{n(n^2 - 1)}\]

This closed form is exact when all values are distinct; with ties (which we handle by averaging tied ranks) it is a close approximation.
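As a quick hand computation (toy numbers of our own): three items ranked in exactly opposite order give \(\rho = -1\).

```python
# Toy check of the Spearman formula: the two rankings are exact reverses,
# so the correlation should be -1.
x_ranks = [1, 2, 3]
y_ranks = [3, 2, 1]
n = len(x_ranks)

d_sq = sum((r - s) ** 2 for r, s in zip(x_ranks, y_ranks))  # 4 + 0 + 4 = 8
rho = 1 - (6 * d_sq) / (n * (n ** 2 - 1))                   # 1 - 48/24
print(rho)  # -1.0
```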
def _rank(values: List[float]) -> List[float]:
    """
    Compute ranks for a list of values (1-based, average ties).
    """
    n = len(values)
    indexed = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n

    i = 0
    while i < n:
        # Find the end of the group of tied values
        j = i + 1
        while j < n and values[indexed[j]] == values[indexed[i]]:
            j += 1
        # Average rank for tied values
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[indexed[k]] = avg_rank
        i = j

    return ranks


def spearman_correlation(x: List[float], y: List[float]) -> float:
    """
    Compute Spearman rank correlation between two lists.

    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    where d_i = rank(x_i) - rank(y_i).
    """
    n = len(x)
    if n < 2:
        return 0.0
    ranks_x = _rank(x)
    ranks_y = _rank(y)
    d_sq = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1.0 - (6.0 * d_sq) / (n * (n ** 2 - 1))
def evaluate_similarity(
    embeddings: Dict[str, np.ndarray],
    word_pairs: List[Tuple[str, str, float]],
) -> float:
    """
    Evaluate embeddings on a word similarity task.

    For each (word1, word2, human_score) triple, compute cosine similarity
    between the embeddings and compare with human ratings via Spearman's rho.

    Pairs where either word is missing from embeddings are skipped.

    Args:
        embeddings: dict mapping word -> vector.
        word_pairs: list of (word1, word2, human_score) tuples.

    Returns:
        Spearman correlation between model similarities and human scores.
    """
    model_sims = []
    human_sims = []

    for w1, w2, human_score in word_pairs:
        if w1 not in embeddings or w2 not in embeddings:
            continue
        sim = cosine_similarity(embeddings[w1], embeddings[w2])
        model_sims.append(sim)
        human_sims.append(human_score)

    if len(model_sims) < 2:
        return 0.0

    return spearman_correlation(model_sims, human_sims)
def evaluate_analogies(
    embeddings: Dict[str, np.ndarray],
    analogies: List[Tuple[str, str, str, str]],
) -> float:
    """
    Evaluate embeddings on a word analogy task: a:b :: c:?

    For each (a, b, c, expected_d) tuple, use word_analogy() to predict d
    and check if it matches expected_d.

    Analogies where any word is missing from embeddings are skipped.

    Args:
        embeddings: dict mapping word -> vector.
        analogies: list of (a, b, c, d) tuples.

    Returns:
        Accuracy (fraction correct out of answerable analogies).
    """
    vocab = list(embeddings.keys())
    correct = 0
    total = 0

    for a, b, c, expected_d in analogies:
        if any(w not in embeddings for w in [a, b, c, expected_d]):
            continue
        predicted = word_analogy(a, b, c, embeddings, vocab)
        if predicted == expected_d:
            correct += 1
        total += 1

    if total == 0:
        return 0.0
    return correct / total
# --- Demo: Evaluating Embeddings ---

# Use the hand-crafted toy embeddings from Part 2 for clean demonstrations

# 1. Word Similarity Evaluation
print("=== Word Similarity Evaluation ===")
# (word1, word2, human_similarity_score) -- scores from 0 to 10
similarity_test_data = [
    ("king", "queen", 8.5),
    ("king", "man", 5.0),
    ("king", "woman", 4.0),
    ("queen", "woman", 5.5),
    ("man", "woman", 7.0),
    ("prince", "princess", 8.0),
    ("king", "prince", 7.0),
    ("queen", "princess", 7.0),
    ("man", "prince", 4.5),
    ("woman", "princess", 5.0),
]

print("\nPair-by-pair comparison:")
print(f"{'Word 1':<12} {'Word 2':<12} {'Human':>7} {'Model':>7}")
print("-" * 42)
for w1, w2, human in similarity_test_data:
    model_sim = cosine_similarity(toy_embeddings[w1], toy_embeddings[w2])
    print(f"{w1:<12} {w2:<12} {human:>7.1f} {model_sim:>7.3f}")

rho = evaluate_similarity(toy_embeddings, similarity_test_data)
print(f"\nSpearman correlation (rho): {rho:.4f}")
print("(1.0 = perfect agreement with human judgments)")

# 2. Word Analogy Evaluation
print("\n=== Word Analogy Evaluation ===")
analogy_test_data = [
    ("man", "king", "woman", "queen"),
    ("man", "prince", "woman", "princess"),
    ("king", "queen", "prince", "princess"),
]

for a, b, c, expected in analogy_test_data:
    predicted = word_analogy(a, b, c, toy_embeddings, toy_vocab)
    status = "CORRECT" if predicted == expected else f"WRONG (got {predicted})"
    print(f"  {a}:{b} :: {c}:? -> predicted='{predicted}' [{status}]")

accuracy = evaluate_analogies(toy_embeddings, analogy_test_data)
print(f"\nAnalogy accuracy: {accuracy:.1%} ({int(accuracy * len(analogy_test_data))}/{len(analogy_test_data)})")

Summary

This lab implemented the core ideas from SLP Chapter 5 (Embeddings):

| Concept | What We Built | Key Idea |
|---|---|---|
| Count-based embeddings | build_term_document_matrix, build_cooccurrence_matrix, compute_tfidf | Words that co-occur in similar contexts have similar meanings (distributional hypothesis) |
| Cosine similarity | cosine_similarity, most_similar, word_analogy | Normalize the dot product by vector lengths to measure similarity regardless of frequency |
| Word2Vec (SGNS) | SimpleWord2Vec class with gradient updates | Learn short dense vectors by predicting context words; negative sampling makes training efficient |
| PCA visualization | pca_2d, visualize_embeddings | Project to 2D via eigendecomposition of the covariance matrix |
| Evaluation | evaluate_similarity (Spearman), evaluate_analogies (accuracy) | Intrinsic evaluation compares model outputs to human judgments |

Key takeaways:

  • Sparse count-based vectors (dimension \(|V|\)) capture distributional information but are high-dimensional.

  • Dense embeddings (50–300 dimensions) from Word2Vec capture richer semantic properties including analogical structure.

  • Cosine similarity is the standard metric for comparing word vectors because it is invariant to vector magnitude.

  • The skip-gram negative sampling loss and its gradients (SLP Eq. 5.21–5.27) provide a clean, efficient training procedure.

  • Intrinsic evaluation (similarity correlation, analogy accuracy) gives fast feedback; extrinsic evaluation on downstream tasks is the ultimate measure.