Lab 03: Word Embeddings

Source: Speech and Language Processing by Daniel Jurafsky & James H. Martin – Chapter 5: Embeddings

This lab implements the core concepts of word embeddings from scratch:

  1. Count-Based Embeddings – term-document matrices, word-word co-occurrence, TF-IDF

  2. Cosine Similarity – measuring word similarity, finding nearest neighbors, word analogies

  3. Word2Vec (Skip-gram) – learning dense embeddings with negative sampling

  4. Dimensionality Reduction – PCA for visualization

  5. Evaluating Embeddings – intrinsic evaluation with similarity and analogy tasks

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Optional

np.random.seed(42)
%matplotlib inline

PART 1: Count-Based Embeddings

Reference: SLP Ch 5.2–5.3

The distributional hypothesis (Firth, 1957) states:

“You shall know a word by the company it keeps.”

Words that occur in similar contexts tend to have similar meanings. We can operationalize this by building vectors from co-occurrence counts.

Term-Document Matrix

A term-document matrix has shape \(|V| \times |D|\) where each cell records how often word \(w\) appears in document \(d\). Each column is a document vector; each row is a word vector whose dimensions correspond to documents.

Word-Word Co-occurrence Matrix

A word-context matrix (or word-word co-occurrence matrix) has shape \(|V| \times |V|\). Each cell records the number of times the row word (target) and the column word (context) co-occur within a context window of \(\pm k\) words in a training corpus. Two words with similar co-occurrence patterns (similar row vectors) are likely to be semantically related.

TF-IDF Weighting

Raw counts can be misleading: frequent words like “the” dominate. TF-IDF down-weights common terms:

\[\text{TF}(t, d) = \frac{\text{count}(t, d)}{\sum_{t'} \text{count}(t', d)}\]
\[\text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)\]
\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]

where \(N\) is the total number of documents and \(\text{df}(t)\) is the number of documents containing term \(t\).
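The formulas above can be checked by hand on a tiny example (the counts below are made up for illustration):

```python
import numpy as np

# Toy term-document counts: rows are ["cat", "the"], columns are 2 documents
counts = np.array([[2.0, 0.0],   # "cat": twice in doc 0, absent from doc 1
                   [2.0, 4.0]])  # "the": present in both documents

N = counts.shape[1]                           # number of documents
tf = counts / counts.sum(axis=0)              # TF: normalize each column
df = (counts > 0).sum(axis=1, keepdims=True)  # DF: docs containing each term
idf = np.log(N / df)                          # IDF: log(N / df)
tfidf = tf * idf

# "the" appears in every document, so IDF("the") = log(2/2) = 0 and its
# TF-IDF is zero everywhere; "cat" keeps weight 0.5 * log(2) in doc 0.
print(tfidf)
```

This mirrors what compute_tfidf below does, minus the epsilon smoothing used there to guard against empty documents.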

def build_term_document_matrix(
    documents: List[List[str]],
) -> Tuple[np.ndarray, List[str], List[int]]:
    """
    Build a term-document matrix.

    Args:
        documents: list of documents, each document is a list of words.

    Returns:
        matrix: shape (|V|, |D|) -- rows are words, columns are documents.
        vocab: sorted list of unique words.
        doc_indices: list of document indices [0, 1, ..., |D|-1].
    """
    vocab = sorted(set(word for doc in documents for word in doc))
    word_to_idx = {w: i for i, w in enumerate(vocab)}

    matrix = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for word in doc:
            matrix[word_to_idx[word], j] += 1

    return matrix, vocab, list(range(len(documents)))
def build_cooccurrence_matrix(
    corpus: List[List[str]], window_size: int = 2
) -> Tuple[np.ndarray, List[str]]:
    """
    Build a symmetric word-word co-occurrence matrix.

    For each word, count how many times every other word appears within
    a +/- window_size context window across all sentences.

    Args:
        corpus: list of sentences, each sentence is a list of words.
        window_size: number of words to the left and right to consider.

    Returns:
        matrix: shape (|V|, |V|) symmetric co-occurrence counts.
        vocab: sorted list of unique words.
    """
    vocab = sorted(set(word for sent in corpus for word in sent))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    matrix = np.zeros((V, V))

    for sentence in corpus:
        for i, word in enumerate(sentence):
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)
            for j in range(start, end):
                if i != j:
                    matrix[word_to_idx[word], word_to_idx[sentence[j]]] += 1

    return matrix, vocab
def compute_tfidf(term_doc_matrix: np.ndarray) -> np.ndarray:
    """
    Compute TF-IDF from a term-document matrix.

    TF(t,d) = count(t,d) / total_words_in_d
    IDF(t)  = log(N / df(t))
    TF-IDF  = TF * IDF

    Args:
        term_doc_matrix: shape (|V|, |D|), raw counts.

    Returns:
        tfidf_matrix: shape (|V|, |D|), TF-IDF weighted values.
    """
    N = term_doc_matrix.shape[1]  # number of documents

    # TF: normalize each column (document) by its total word count
    col_sums = term_doc_matrix.sum(axis=0, keepdims=True)  # (1, |D|)
    tf = term_doc_matrix / (col_sums + 1e-10)

    # DF: number of documents each term appears in
    df = np.sum(term_doc_matrix > 0, axis=1, keepdims=True)  # (|V|, 1)

    # IDF: log(N / df), with smoothing to avoid division by zero
    idf = np.log(N / (df + 1e-10))

    return tf * idf
# --- Demo: Count-Based Embeddings ---

documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "the bird flew over the mat".split(),
]

# Term-document matrix
tdm, vocab_td, doc_ids = build_term_document_matrix(documents)
print("=== Term-Document Matrix ===")
print(f"Vocabulary ({len(vocab_td)} words): {vocab_td}")
print(f"Shape: {tdm.shape}  (words x documents)\n")
# Print as a table
header = "Word".ljust(12) + "".join(f"Doc {d}".rjust(6) for d in doc_ids)
print(header)
print("-" * len(header))
for i, w in enumerate(vocab_td):
    row = w.ljust(12) + "".join(f"{int(tdm[i, j])}".rjust(6) for j in doc_ids)
    print(row)

# TF-IDF
tfidf = compute_tfidf(tdm)
print("\n=== TF-IDF Matrix ===")
header = "Word".ljust(12) + "".join(f"Doc {d}".rjust(8) for d in doc_ids)
print(header)
print("-" * len(header))
for i, w in enumerate(vocab_td):
    row = w.ljust(12) + "".join(f"{tfidf[i, j]:.3f}".rjust(8) for j in doc_ids)
    print(row)
print("\nNote: 'the' has TF-IDF = 0.000 everywhere (appears in all docs, IDF -> 0).")

# Co-occurrence matrix
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]
cooc, vocab_co = build_cooccurrence_matrix(corpus, window_size=2)
print("\n=== Word-Word Co-occurrence Matrix (window=2) ===")
print(f"Vocabulary: {vocab_co}")
header = "".ljust(10) + "".join(w.rjust(10) for w in vocab_co)
print(header)
for i, w in enumerate(vocab_co):
    row = w.ljust(10) + "".join(f"{int(cooc[i, j])}".rjust(10) for j in range(len(vocab_co)))
    print(row)

PART 2: Cosine Similarity

Reference: SLP Ch 5.4

The raw dot product favors frequent (long) vectors. We normalize by dividing by the vector lengths, giving us the cosine of the angle between two vectors:

\[\text{cosine}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \sqrt{\sum_i v_i^2}}\]

Range:

  • \(1\) for vectors pointing in the same direction (identical distribution)

  • \(0\) for orthogonal vectors (no shared context)

  • \(-1\) for vectors pointing in opposite directions

For non-negative count vectors the range is \([0, 1]\).

Word Analogy

A famous property of word embeddings: vector arithmetic captures semantic relationships. To solve a : b :: c : ?, compute \(\mathbf{b} - \mathbf{a} + \mathbf{c}\) and find the word whose embedding is closest to the result (excluding a, b, c). The classic example:

\[\text{king} - \text{man} + \text{woman} \approx \text{queen}\]
def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors.

    cos(u, v) = (u . v) / (||u|| * ||v||)

    Returns 0.0 if either vector has zero norm.
    """
    dot = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(dot / (norm_u * norm_v))
def most_similar(
    word: str,
    matrix: np.ndarray,
    vocab: List[str],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """
    Find the top-k most similar words to `word` using cosine similarity.

    Args:
        word: the query word.
        matrix: shape (|V|, d) -- each row is a word embedding.
        vocab: list of words corresponding to rows of `matrix`.
        top_k: how many neighbors to return.

    Returns:
        List of (word, similarity) tuples, sorted descending by similarity.
    """
    if word not in vocab:
        return []
    word_idx = vocab.index(word)
    word_vec = matrix[word_idx]

    similarities = []
    for i, w in enumerate(vocab):
        if i == word_idx:
            continue
        sim = cosine_similarity(word_vec, matrix[i])
        similarities.append((w, sim))

    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]
def word_analogy(
    a: str,
    b: str,
    c: str,
    embeddings: Dict[str, np.ndarray],
    vocab: List[str],
) -> str:
    """
    Solve the analogy: a is to b as c is to ?

    Method: find the word whose embedding is closest to (b - a + c),
    excluding a, b, and c from the candidates.

    Args:
        a, b, c: words in the analogy.
        embeddings: dict mapping word -> np.ndarray vector.
        vocab: list of candidate words.

    Returns:
        The word d that best completes a:b :: c:d.
    """
    target = embeddings[b] - embeddings[a] + embeddings[c]
    exclude = {a, b, c}

    best_word = None
    best_sim = -2.0
    for w in vocab:
        if w in exclude or w not in embeddings:
            continue
        sim = cosine_similarity(target, embeddings[w])
        if sim > best_sim:
            best_sim = sim
            best_word = w

    return best_word
# --- Demo: Cosine Similarity ---

print("=== Cosine Similarity ===")
u = np.array([1, 2, 3], dtype=float)
v = np.array([2, 4, 6], dtype=float)  # parallel to u
w = np.array([3, 2, 1], dtype=float)  # different direction
z = np.array([0, 0, 0], dtype=float)  # zero vector

print(f"cos(u, v) = {cosine_similarity(u, v):.4f}  (parallel vectors, expect 1.0)")
print(f"cos(u, w) = {cosine_similarity(u, w):.4f}  (different direction)")
print(f"cos(u, z) = {cosine_similarity(u, z):.4f}  (zero vector, expect 0.0)")

# Recreating the textbook example (SLP Fig 5.3 / 5.5)
print("\n--- Textbook Example: cherry vs. digital vs. information ---")
# Dimensions: [pie, data, computer]
cherry      = np.array([442, 8, 2], dtype=float)
digital     = np.array([5, 1683, 1670], dtype=float)
information = np.array([5, 3982, 3325], dtype=float)

print(f"cos(cherry, information)  = {cosine_similarity(cherry, information):.3f}")
print(f"cos(digital, information) = {cosine_similarity(digital, information):.3f}")
print("As the textbook notes, 'information' is much closer to 'digital' than to 'cherry'.")

# Most similar words from co-occurrence matrix
print("\n--- Most Similar Words (co-occurrence matrix from Part 1) ---")
for query in ["I", "like"]:
    neighbors = most_similar(query, cooc, vocab_co, top_k=3)
    print(f"  '{query}' -> {[(w, round(s, 3)) for w, s in neighbors]}")

# Word analogy demo with hand-crafted embeddings
print("\n--- Word Analogy (hand-crafted embeddings) ---")
# Create toy embeddings that encode gender and royalty dimensions
toy_embeddings = {
    "king":    np.array([0.9, 0.1, 0.9]),   # [royalty, female, authority]
    "queen":   np.array([0.9, 0.9, 0.9]),
    "man":     np.array([0.1, 0.1, 0.5]),
    "woman":   np.array([0.1, 0.9, 0.5]),
    "prince":  np.array([0.7, 0.1, 0.4]),
    "princess": np.array([0.7, 0.9, 0.4]),
}
toy_vocab = list(toy_embeddings.keys())

result = word_analogy("man", "king", "woman", toy_embeddings, toy_vocab)
print(f"  man : king :: woman : {result}")

result = word_analogy("king", "prince", "queen", toy_embeddings, toy_vocab)
print(f"  king : prince :: queen : {result}")

PART 3: Word2Vec (Skip-gram)

Reference: SLP Ch 5.5

Count-based methods produce sparse vectors of dimension \(|V|\) (often 10,000–50,000). Word2Vec (Mikolov et al., 2013) learns short, dense vectors (50–300 dimensions) via a prediction task.

Skip-gram Intuition

The skip-gram model predicts context words from a center (target) word:

  1. Treat the target word and a neighboring context word as positive examples.

  2. Randomly sample other words from the lexicon as negative examples.

  3. Train a logistic regression classifier to distinguish the two cases.

  4. Use the learned weights as the word embeddings.

Two Embedding Matrices

The model maintains two matrices:

  • W (target/center embeddings): shape \((|V|, d)\)

  • C (context embeddings): shape \((|V|, d)\)

Negative Sampling Loss

For a positive pair \((w, c_{\text{pos}})\) with \(k\) noise words \(c_{\text{neg}_1} \ldots c_{\text{neg}_k}\):

\[L = -\left[\log \sigma(\mathbf{c}_{\text{pos}} \cdot \mathbf{w}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{c}_{\text{neg}_i} \cdot \mathbf{w})\right]\]

Gradients (SLP Eq. 5.22–5.27)

\[\frac{\partial L}{\partial \mathbf{c}_{\text{pos}}} = [\sigma(\mathbf{c}_{\text{pos}} \cdot \mathbf{w}) - 1]\,\mathbf{w}\]
\[\frac{\partial L}{\partial \mathbf{c}_{\text{neg}_i}} = [\sigma(\mathbf{c}_{\text{neg}_i} \cdot \mathbf{w})]\,\mathbf{w}\]
\[\frac{\partial L}{\partial \mathbf{w}} = [\sigma(\mathbf{c}_{\text{pos}} \cdot \mathbf{w}) - 1]\,\mathbf{c}_{\text{pos}} + \sum_{i=1}^{k} [\sigma(\mathbf{c}_{\text{neg}_i} \cdot \mathbf{w})]\,\mathbf{c}_{\text{neg}_i}\]

The SGD update rules subtract \(\eta \times \text{gradient}\) from the current parameters.
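Before trusting the updates, the gradient with respect to \(\mathbf{w}\) can be sanity-checked against central finite differences on the loss (a verification sketch added here; the vectors and \(k\) are random toy values, not from the textbook):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3
w0 = rng.normal(size=d)           # target embedding
c_pos = rng.normal(size=d)        # positive context embedding
c_negs = rng.normal(size=(k, d))  # k negative context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w):
    # L = -[log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w)]
    return -(np.log(sigmoid(c_pos @ w)) + np.sum(np.log(sigmoid(-c_negs @ w))))

# Analytic gradient w.r.t. w (the third equation above)
analytic = (sigmoid(c_pos @ w0) - 1.0) * c_pos + sigmoid(c_negs @ w0) @ c_negs

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(loss(w0 + eps * e) - loss(w0 - eps * e)) / (2 * eps)
                    for e in np.eye(d)])

assert np.allclose(analytic, numeric, atol=1e-6)
print("Analytic gradient matches finite differences.")
```

The same trick works for the context-embedding gradients; it is a cheap way to catch sign errors before wiring the updates into the training loop.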

def generate_skipgram_pairs(
    corpus: List[List[str]], window_size: int = 2
) -> List[Tuple[str, str]]:
    """
    Generate (center, context) pairs for skip-gram training.

    For each word in each sentence, pair it with every word within
    +/- window_size positions.

    Args:
        corpus: list of sentences (each a list of tokens).
        window_size: how many words left/right to include.

    Returns:
        List of (center_word, context_word) string pairs.
    """
    pairs = []
    for sentence in corpus:
        for i, center in enumerate(sentence):
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)
            for j in range(start, end):
                if i != j:
                    pairs.append((center, sentence[j]))
    return pairs
class SimpleWord2Vec:
    """
    Simplified Word2Vec skip-gram with negative sampling.

    Implements the SGNS algorithm from SLP Ch 5.5 with:
    - Two embedding matrices W (target) and C (context)
    - Negative sampling from a unigram distribution raised to the 0.75 power
    - SGD updates following Eq. 5.25--5.27
    """

    def __init__(
        self,
        vocab_size: int,
        embedding_dim: int = 50,
        lr: float = 0.025,
    ):
        self.embedding_dim = embedding_dim
        self.lr = lr
        self.vocab_size = vocab_size
        # Target (center) embeddings W and context embeddings C
        self.W = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.C = np.random.randn(vocab_size, embedding_dim) * 0.01

    @staticmethod
    def _sigmoid(x: np.ndarray) -> np.ndarray:
        """Numerically stable sigmoid."""
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

    def train_pair(
        self,
        center_idx: int,
        context_idx: int,
        negative_indices: List[int],
    ) -> float:
        """
        One SGD step for a (center, context) pair with negative samples.

        Implements the gradient updates from SLP Eq. 5.25--5.27:
          c_pos <- c_pos - lr * [sigma(c_pos . w) - 1] * w
          c_neg <- c_neg - lr * [sigma(c_neg . w)] * w
          w     <- w     - lr * {[sigma(c_pos . w) - 1] * c_pos
                                  + sum [sigma(c_neg . w)] * c_neg}

        Returns:
            The loss for this training instance.
        """
        w = self.W[center_idx].copy()
        c_pos = self.C[context_idx].copy()

        # --- Positive pair ---
        score_pos = self._sigmoid(np.dot(c_pos, w))
        # Loss contribution: -log(sigma(c_pos . w))
        loss = -np.log(score_pos + 1e-10)

        # Gradients for w accumulator
        grad_w = (score_pos - 1.0) * c_pos

        # Update context embedding for positive sample (Eq. 5.25)
        self.C[context_idx] -= self.lr * (score_pos - 1.0) * w

        # --- Negative pairs ---
        for neg_idx in negative_indices:
            c_neg = self.C[neg_idx].copy()
            score_neg = self._sigmoid(np.dot(c_neg, w))
            # Loss contribution: -log(sigma(-c_neg . w)) = -log(1 - sigma(c_neg . w))
            loss -= np.log(1.0 - score_neg + 1e-10)

            # Accumulate gradient for w (Eq. 5.24 second term)
            grad_w += score_neg * c_neg

            # Update context embedding for negative sample (Eq. 5.26)
            self.C[neg_idx] -= self.lr * score_neg * w

        # Update target embedding (Eq. 5.27)
        self.W[center_idx] -= self.lr * grad_w

        return float(loss)

    def get_embedding(self, word_idx: int) -> np.ndarray:
        """Return the target embedding for a word."""
        return self.W[word_idx]
# --- Train Word2Vec on a small corpus ---

train_corpus = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the prince is the son of the king".split(),
    "the princess is the daughter of the queen".split(),
    "the man works in the field".split(),
    "the woman works in the house".split(),
] * 20  # Repeat to get enough training signal

# Build vocabulary and word-index mapping
w2v_vocab = sorted(set(word for sent in train_corpus for word in sent))
w2v_word_to_idx = {w: i for i, w in enumerate(w2v_vocab)}
w2v_idx_to_word = {i: w for w, i in w2v_word_to_idx.items()}
V = len(w2v_vocab)
print(f"Vocabulary ({V} words): {w2v_vocab}")

# Build unigram distribution for negative sampling (raised to 0.75 power)
word_counts = Counter(word for sent in train_corpus for word in sent)
freqs = np.array([word_counts[w2v_idx_to_word[i]] for i in range(V)], dtype=float)
freqs_alpha = freqs ** 0.75
neg_sample_probs = freqs_alpha / freqs_alpha.sum()

# Generate skip-gram pairs
pairs = generate_skipgram_pairs(train_corpus, window_size=2)
print(f"Generated {len(pairs)} skip-gram training pairs.")
print(f"First 10 pairs: {pairs[:10]}")

# Train
EMBEDDING_DIM = 30
NUM_NEGATIVES = 5
NUM_EPOCHS = 5

model = SimpleWord2Vec(vocab_size=V, embedding_dim=EMBEDDING_DIM, lr=0.025)

for epoch in range(NUM_EPOCHS):
    # Shuffle pairs each epoch
    indices = np.random.permutation(len(pairs))
    total_loss = 0.0
    for idx in indices:
        center_word, context_word = pairs[idx]
        center_idx = w2v_word_to_idx[center_word]
        context_idx = w2v_word_to_idx[context_word]

        # Sample negatives (exclude the center word)
        neg_indices = []
        while len(neg_indices) < NUM_NEGATIVES:
            neg = np.random.choice(V, p=neg_sample_probs)
            if neg != center_idx:
                neg_indices.append(int(neg))

        loss = model.train_pair(center_idx, context_idx, neg_indices)
        total_loss += loss

    avg_loss = total_loss / len(pairs)
    print(f"Epoch {epoch + 1}/{NUM_EPOCHS}, avg loss: {avg_loss:.4f}")

# Show similar words using the learned embeddings
print("\n=== Most Similar Words (Word2Vec) ===")
w2v_matrix = np.array([model.get_embedding(i) for i in range(V)])
for query in ["king", "queen", "man", "woman"]:
    neighbors = most_similar(query, w2v_matrix, w2v_vocab, top_k=5)
    print(f"  '{query}' -> {[(w, round(s, 3)) for w, s in neighbors]}")

PART 4: Dimensionality Reduction

Reference: SLP Ch 5.6

We cannot directly visualize 30- or 300-dimensional embeddings. PCA (Principal Component Analysis) projects high-dimensional data onto the directions of maximum variance.

PCA Algorithm

  1. Center the data: subtract the mean vector \(\bar{\mathbf{x}}\) from every data point.

  2. Compute the covariance matrix \(\mathbf{\Sigma} = \frac{1}{n}\mathbf{X}^T \mathbf{X}\) (where \(\mathbf{X}\) is the centered data).

  3. Eigendecomposition: find eigenvalues \(\lambda_1 \geq \lambda_2 \geq \ldots\) and corresponding eigenvectors.

  4. Project onto the top-\(k\) eigenvectors to get a \(k\)-dimensional representation.

For visualization we project to 2D, using the two eigenvectors with the largest eigenvalues.

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """
    Reduce embeddings to 2D using PCA.

    Steps:
        1. Center the data (subtract mean).
        2. Compute covariance matrix.
        3. Eigendecomposition via np.linalg.eigh (symmetric matrix).
        4. Project onto top 2 eigenvectors.

    Args:
        embeddings: shape (n_words, d) -- one row per word.

    Returns:
        projected: shape (n_words, 2).
    """
    # 1. Center
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean

    # 2. Covariance matrix (d x d)
    cov = np.dot(centered.T, centered) / centered.shape[0]

    # 3. Eigendecomposition (eigh returns sorted ascending)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Take top 2 eigenvectors (last two columns, reversed for descending)
    top2 = eigenvectors[:, -2:][:, ::-1]  # shape (d, 2)

    # Project
    projected = np.dot(centered, top2)
    return projected
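A quick cross-check of the eigendecomposition route (our addition, using randomly generated data): PCA can equivalently be computed from the SVD of the centered data matrix, so the two projections should agree up to a per-component sign flip.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))   # 50 fake "embeddings" of dimension 10
Xc = X - X.mean(axis=0)         # center, as in step 1

# Route 1: eigendecomposition of the covariance matrix (steps 2-4 above)
cov = Xc.T @ Xc / Xc.shape[0]
_, eigvecs = np.linalg.eigh(cov)            # eigh sorts eigenvalues ascending
proj_eig = Xc @ eigvecs[:, -2:][:, ::-1]    # top 2 components, descending

# Route 2: SVD of the centered data -- the right singular vectors
# are the same principal axes (up to sign)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj_svd = Xc @ Vt[:2].T

# Each column may differ by a global sign, so compare absolute values
assert np.allclose(np.abs(proj_eig), np.abs(proj_svd))
print("Eigendecomposition and SVD give the same 2D projection (up to sign).")
```

The SVD route is the numerically preferred formulation in practice, since it avoids forming the covariance matrix explicitly.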
def visualize_embeddings(
    embeddings: np.ndarray,
    words: List[str],
    title: str = "Word Embeddings (PCA 2D)",
) -> None:
    """
    Reduce embeddings to 2D via PCA and display a scatter plot with word labels.

    Args:
        embeddings: shape (n_words, d).
        words: list of word strings matching the rows of embeddings.
        title: plot title.
    """
    coords = pca_2d(embeddings)

    fig, ax = plt.subplots(figsize=(10, 7))
    ax.scatter(coords[:, 0], coords[:, 1], c="steelblue", s=80, edgecolors="k", linewidths=0.5)

    for i, word in enumerate(words):
        ax.annotate(
            word,
            (coords[i, 0], coords[i, 1]),
            textcoords="offset points",
            xytext=(5, 5),
            fontsize=11,
            fontweight="bold",
        )

    ax.set_title(title, fontsize=14)
    ax.set_xlabel("PC 1", fontsize=12)
    ax.set_ylabel("PC 2", fontsize=12)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
# --- Demo: Visualize the Word2Vec embeddings ---

# Use the embeddings trained in Part 3 (excluding very frequent words like 'the', 'of', 'in', 'is')
stopwords = {"the", "of", "in", "is"}
content_words = [w for w in w2v_vocab if w not in stopwords]
content_indices = [w2v_word_to_idx[w] for w in content_words]
content_embeddings = w2v_matrix[content_indices]

print(f"Visualizing {len(content_words)} content words: {content_words}")
visualize_embeddings(content_embeddings, content_words, title="Word2Vec Embeddings (PCA 2D)")

PART 5: Evaluating Embeddings

Reference: SLP Ch 5.6–5.7

How do we know whether our embeddings are good? Two categories of evaluation:

Intrinsic Evaluation

Tests the quality of the embeddings directly:

  • Word similarity: Compute cosine similarity between word pairs and compare with human-assigned similarity scores. Measure correlation (Spearman’s \(\rho\)). Datasets: SimLex-999, WordSim-353.

  • Word analogy: Solve a : b :: c : ? and measure accuracy. Dataset: Google analogy test set.

Extrinsic Evaluation

Use the embeddings as features in a downstream NLP task (e.g., sentiment analysis, named entity recognition) and measure task performance. This is the ultimate test but is expensive and confounded by other system components.

Spearman Rank Correlation

Spearman’s \(\rho\) measures the monotonic association between two ranked lists. Given ranks \(r_i\) and \(s_i\):

\[\rho = 1 - \frac{6 \sum_i (r_i - s_i)^2}{n(n^2 - 1)}\]

This closed form is exact when all values are distinct; with ties (which we handle by averaging tied ranks) it is a close approximation.
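As a quick hand computation (toy numbers of our own): three items ranked in exactly opposite order give \(\rho = -1\).

```python
# Toy check of the Spearman formula: the two rankings are exact reverses,
# so the correlation should be -1.
x_ranks = [1, 2, 3]
y_ranks = [3, 2, 1]
n = len(x_ranks)

d_sq = sum((r - s) ** 2 for r, s in zip(x_ranks, y_ranks))  # 4 + 0 + 4 = 8
rho = 1 - (6 * d_sq) / (n * (n ** 2 - 1))                   # 1 - 48/24
print(rho)  # -1.0
```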
def _rank(values: List[float]) -> List[float]:
    """
    Compute ranks for a list of values (1-based, average ties).
    """
    n = len(values)
    indexed = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n

    i = 0
    while i < n:
        # Find the end of the group of tied values
        j = i + 1
        while j < n and values[indexed[j]] == values[indexed[i]]:
            j += 1
        # Average rank for tied values
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[indexed[k]] = avg_rank
        i = j

    return ranks


def spearman_correlation(x: List[float], y: List[float]) -> float:
    """
    Compute Spearman rank correlation between two lists.

    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    where d_i = rank(x_i) - rank(y_i).
    """
    n = len(x)
    if n < 2:
        return 0.0
    ranks_x = _rank(x)
    ranks_y = _rank(y)
    d_sq = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1.0 - (6.0 * d_sq) / (n * (n ** 2 - 1))
def evaluate_similarity(
    embeddings: Dict[str, np.ndarray],
    word_pairs: List[Tuple[str, str, float]],
) -> float:
    """
    Evaluate embeddings on a word similarity task.

    For each (word1, word2, human_score) triple, compute cosine similarity
    between the embeddings and compare with human ratings via Spearman's rho.

    Pairs where either word is missing from embeddings are skipped.

    Args:
        embeddings: dict mapping word -> vector.
        word_pairs: list of (word1, word2, human_score) tuples.

    Returns:
        Spearman correlation between model similarities and human scores.
    """
    model_sims = []
    human_sims = []

    for w1, w2, human_score in word_pairs:
        if w1 not in embeddings or w2 not in embeddings:
            continue
        sim = cosine_similarity(embeddings[w1], embeddings[w2])
        model_sims.append(sim)
        human_sims.append(human_score)

    if len(model_sims) < 2:
        return 0.0

    return spearman_correlation(model_sims, human_sims)
def evaluate_analogies(
    embeddings: Dict[str, np.ndarray],
    analogies: List[Tuple[str, str, str, str]],
) -> float:
    """
    Evaluate embeddings on a word analogy task: a:b :: c:?

    For each (a, b, c, expected_d) tuple, use word_analogy() to predict d
    and check if it matches expected_d.

    Analogies where any word is missing from embeddings are skipped.

    Args:
        embeddings: dict mapping word -> vector.
        analogies: list of (a, b, c, d) tuples.

    Returns:
        Accuracy (fraction correct out of answerable analogies).
    """
    vocab = list(embeddings.keys())
    correct = 0
    total = 0

    for a, b, c, expected_d in analogies:
        if any(w not in embeddings for w in [a, b, c, expected_d]):
            continue
        predicted = word_analogy(a, b, c, embeddings, vocab)
        if predicted == expected_d:
            correct += 1
        total += 1

    if total == 0:
        return 0.0
    return correct / total
# --- Demo: Evaluating Embeddings ---

# Use the hand-crafted toy embeddings from Part 2 for clean demonstrations

# 1. Word Similarity Evaluation
print("=== Word Similarity Evaluation ===")
# (word1, word2, human_similarity_score) -- scores from 0 to 10
similarity_test_data = [
    ("king", "queen", 8.5),
    ("king", "man", 5.0),
    ("king", "woman", 4.0),
    ("queen", "woman", 5.5),
    ("man", "woman", 7.0),
    ("prince", "princess", 8.0),
    ("king", "prince", 7.0),
    ("queen", "princess", 7.0),
    ("man", "prince", 4.5),
    ("woman", "princess", 5.0),
]

print("\nPair-by-pair comparison:")
print(f"{'Word 1':<12} {'Word 2':<12} {'Human':>7} {'Model':>7}")
print("-" * 42)
for w1, w2, human in similarity_test_data:
    model_sim = cosine_similarity(toy_embeddings[w1], toy_embeddings[w2])
    print(f"{w1:<12} {w2:<12} {human:>7.1f} {model_sim:>7.3f}")

rho = evaluate_similarity(toy_embeddings, similarity_test_data)
print(f"\nSpearman correlation (rho): {rho:.4f}")
print("(1.0 = perfect agreement with human judgments)")

# 2. Word Analogy Evaluation
print("\n=== Word Analogy Evaluation ===")
analogy_test_data = [
    ("man", "king", "woman", "queen"),
    ("man", "prince", "woman", "princess"),
    ("king", "queen", "prince", "princess"),
]

for a, b, c, expected in analogy_test_data:
    predicted = word_analogy(a, b, c, toy_embeddings, toy_vocab)
    status = "CORRECT" if predicted == expected else f"WRONG (got {predicted})"
    print(f"  {a}:{b} :: {c}:? -> predicted='{predicted}' [{status}]")

accuracy = evaluate_analogies(toy_embeddings, analogy_test_data)
print(f"\nAnalogy accuracy: {accuracy:.1%} ({int(accuracy * len(analogy_test_data))}/{len(analogy_test_data)})")

Summary

This lab implemented the core ideas from SLP Chapter 5 (Embeddings):

| Concept | What We Built | Key Idea |
|---|---|---|
| Count-based embeddings | build_term_document_matrix, build_cooccurrence_matrix, compute_tfidf | Words that co-occur in similar contexts have similar meanings (distributional hypothesis) |
| Cosine similarity | cosine_similarity, most_similar, word_analogy | Normalize the dot product by vector lengths to measure similarity regardless of frequency |
| Word2Vec (SGNS) | SimpleWord2Vec class with gradient updates | Learn short dense vectors by predicting context words; negative sampling makes training efficient |
| PCA visualization | pca_2d, visualize_embeddings | Project to 2D via eigendecomposition of the covariance matrix |
| Evaluation | evaluate_similarity (Spearman), evaluate_analogies (accuracy) | Intrinsic evaluation compares model outputs to human judgments |

Key takeaways:

  • Sparse count-based vectors (dimension \(|V|\)) capture distributional information but are high-dimensional.

  • Dense embeddings (50–300 dimensions) from Word2Vec capture richer semantic properties including analogical structure.

  • Cosine similarity is the standard metric for comparing word vectors because it is invariant to vector magnitude.

  • The skip-gram negative sampling loss and its gradients (SLP Eq. 5.21–5.27) provide a clean, efficient training procedure.

  • Intrinsic evaluation (similarity correlation, analogy accuracy) gives fast feedback; extrinsic evaluation on downstream tasks is the ultimate measure.