Neural Networks: From Basics to Transformers

Table of Contents

  1. What is a Neural Network?

  2. The Biological Inspiration

  3. Components of Neural Networks

  4. Forward Propagation

  5. Activation Functions

  6. Loss Functions

  7. Backward Propagation

  8. Optimization

  9. Training Process

  10. Common Architectures

  11. From RNNs to Transformers

What is a Neural Network?

A neural network is a computational model inspired by the human brain that learns to perform tasks by considering examples, without being programmed with task-specific rules.

The Core Idea

Instead of writing explicit rules:

# Traditional programming
if word in positive_words:
    sentiment = "positive"
else:
    sentiment = "negative"

Neural networks learn patterns from data:

# Machine learning approach
model = train_on_examples(training_data)
sentiment = model.predict(new_text)

Why Neural Networks?

Problems they solve:

  • Recognize patterns in complex data

  • Handle non-linear relationships

  • Adapt to new patterns automatically

  • Scale to large datasets

  • Transfer knowledge between tasks

Real-world applications:

  • Image recognition (faces, objects, medical scans)

  • Natural language (translation, summarization, chatbots)

  • Speech recognition and synthesis

  • Recommendation systems

  • Game playing (Chess, Go, video games)

  • Drug discovery and protein folding

The Biological Inspiration

Human Neurons

In your brain:

  1. Dendrites receive signals from other neurons

  2. Cell body processes these signals

  3. Axon sends output to other neurons if threshold is reached

  4. Synapses connect neurons with varying strengths

Artificial Neurons (Perceptrons)

A mathematical approximation:

  1. Inputs (x₁, x₂, …, xₙ) come from previous layer

  2. Weights (w₁, w₂, …, wₙ) represent connection strength

  3. Bias (b) represents neuron’s threshold

  4. Activation function determines output

        flowchart LR
    x1["x₁"] -- w₁ --> S["Σ(wᵢxᵢ + b)"]
    x2["x₂"] -- w₂ --> S
    x3["x₃"] -- w₃ --> S
    xn["xₙ"] -- wₙ --> S
    S --> Act[Activation]
    Act --> Out[Output]
    

Mathematical Formula

output = f(Σ(wᵢ × xᵢ) + b)

Where:
- xᵢ = input values
- wᵢ = weights (learnable parameters)
- b = bias (learnable parameter)
- f = activation function
- Σ = sum

Components of Neural Networks

1. Layers

Input Layer:

  • Receives raw data

  • One neuron per feature

  • Example: For 28×28 image = 784 input neurons

Hidden Layers:

  • Process and transform data

  • Multiple layers = “deep” learning

  • Each layer learns increasingly abstract features

Output Layer:

  • Produces final prediction

  • Size depends on task:

    • Binary classification: 1 neuron

    • Multi-class (10 classes): 10 neurons

    • Regression: 1 neuron

Input  [Hidden Layer 1]  [Hidden Layer 2]  Output
 784      [128 neurons]      [64 neurons]      10

2. Weights and Biases

Weights: The “knowledge” of the network

  • Initially random

  • Updated during training

  • Determine strength of connections

Biases: Offset values

  • One per neuron

  • Allow neurons to activate even with zero input

  • Help model fit data better

Total parameters:

Layer 1: (784 × 128) + 128 = 100,480 parameters
Layer 2: (128 × 64) + 64 = 8,256 parameters
Output:  (64 × 10) + 10 = 650 parameters
Total: 109,386 learnable parameters

3. Architecture

Fully Connected (Dense) Layers:

  • Every neuron connects to all neurons in next layer

  • Most common in basic networks

Specialized Layers:

  • Convolutional (Conv): For images, spatial patterns

  • Recurrent (RNN, LSTM, GRU): For sequences, temporal patterns

  • Attention: For focusing on relevant information

  • Dropout: For regularization (randomly disable neurons)

  • Batch Normalization: For training stability

Forward Propagation

Forward propagation is how data flows through the network to produce predictions.

Step-by-Step Process

1. Input Layer → Hidden Layer 1

# For each neuron in hidden layer 1
z1 = W1 @ x + b1  # Linear transformation (matrix multiplication)
a1 = activation(z1)  # Apply activation function

2. Hidden Layer 1 → Hidden Layer 2

z2 = W2 @ a1 + b2
a2 = activation(z2)

3. Hidden Layer 2 → Output

z3 = W3 @ a2 + b3
output = softmax(z3)  # For classification

Example: 2-Layer Network

import numpy as np

# Input: 4 features
x = np.array([1.0, 0.5, 0.2, 0.9])

# Layer 1: 4 → 3 neurons
W1 = np.random.randn(3, 4)  # Shape: (3, 4)
b1 = np.random.randn(3)
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)  # ReLU activation

# Layer 2: 3 → 2 neurons (output)
W2 = np.random.randn(2, 3)
b2 = np.random.randn(2)
z2 = W2 @ a1 + b2
# a2 = softmax(z2) for classification

print(f"Input shape: {x.shape}")
print(f"Hidden activation shape: {a1.shape}")
print(f"Output shape: {z2.shape}")

Activation Functions

Activation functions introduce non-linearity, allowing networks to learn complex patterns.

Why Non-linearity?

Without activation functions, multiple layers collapse into one:

f(W2 @ (W1 @ x + b1) + b2) = f((W2 @ W1) @ x + (W2 @ b1 + b2))
                            = f(W_combined @ x + b_combined)
# This is just a single linear layer!

Common Activation Functions

1. ReLU (Rectified Linear Unit) ⭐

Formula: f(x) = max(0, x)

f(x) = { x   if x > 0
       { 0   if x  0

Graph:
  
      
     
    
──┼─────
  

Pros:

  • ✅ Computationally efficient

  • ✅ Helps with vanishing gradient problem

  • ✅ Sparse activation (many zeros)

Cons:

  • ❌ “Dying ReLU” - neurons can get stuck at 0

Usage: Hidden layers in most modern networks

2. Sigmoid

Formula: f(x) = 1 / (1 + e^(-x))

Output range: (0, 1)

Graph:
  1.0      ╱──
          
  0.5    
        
  0.0 │─╱
      └─────────

Pros:

  • ✅ Output interpretable as probability

  • ✅ Smooth gradient

Cons:

  • ❌ Vanishing gradients for large |x|

  • ❌ Outputs not zero-centered

Usage: Binary classification output, gates in LSTM

3. Tanh (Hyperbolic Tangent)

Formula: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Output range: (-1, 1)

Graph:
  1.0      ╱──
          
  0.0    
        
 -1.0 │─╱
      └─────────

Pros:

  • ✅ Zero-centered (better than sigmoid)

  • ✅ Stronger gradients than sigmoid

Cons:

  • ❌ Still suffers from vanishing gradients

Usage: RNN/LSTM hidden states

4. Softmax

Formula: f(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)

Converts vector to probability distribution:

Input:  [2.0, 1.0, 0.5]
Output: [0.659, 0.242, 0.099]  # Sums to 1.0

Usage: Multi-class classification output layer

5. GELU (Gaussian Error Linear Unit)

Formula: f(x) = x * Φ(x) where Φ is Gaussian CDF

Usage: Modern transformers (GPT, BERT)

Why better than ReLU:

  • Smooth, differentiable everywhere

  • Better gradient flow

  • Used in state-of-the-art models

Loss Functions

Loss functions measure how wrong the model’s predictions are.

Regression Tasks

Mean Squared Error (MSE)

L = (1/n) Σ (ŷᵢ - yᵢ)²

Where:
- ŷᵢ = predicted value
- yᵢ = actual value
- n = number of samples

Usage: Continuous value prediction (house prices, temperatures)

Mean Absolute Error (MAE)

L = (1/n) Σ |ŷᵢ - yᵢ|

Benefit: Less sensitive to outliers than MSE

Classification Tasks

Binary Cross-Entropy

L = -(1/n) Σ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

Usage: Binary classification (spam/not spam)

Categorical Cross-Entropy

L = -(1/n) Σᵢ Σⱼ yᵢⱼ log(ŷᵢⱼ)

Where:
- i = sample index
- j = class index
- y = one-hot encoded true labels

Usage: Multi-class classification (digit recognition, sentiment analysis)

Example:

# True label: class 2 (one-hot: [0, 0, 1, 0, 0])
y_true = [0, 0, 1, 0, 0]

# Predictions (probabilities)
y_pred = [0.1, 0.2, 0.5, 0.1, 0.1]

# Loss focuses on predicted probability for true class
loss = -log(0.5) = 0.693

Backward Propagation

Backpropagation is how the network learns by computing gradients and updating weights.

The Core Idea

Goal: Minimize loss function by adjusting weights

Method: Use calculus chain rule to compute how much each weight contributes to the error

Chain Rule

L/w = L/a × a/z × z/w

Where:
- L = loss
- w = weight
- z = pre-activation (w @ x + b)
- a = post-activation f(z)

Backward Pass

Step 1: Compute output gradient

# For classification with softmax + cross-entropy
d_output = predictions - true_labels

Step 2: Propagate through layer 2

d_W2 = d_output @ a1.T
d_b2 = d_output
d_a1 = W2.T @ d_output

Step 3: Apply activation gradient

# For ReLU: gradient is 1 if input > 0, else 0
d_z1 = d_a1 * (z1 > 0)

Step 4: Propagate through layer 1

d_W1 = d_z1 @ x.T
d_b1 = d_z1

Update Weights

# Simple gradient descent
learning_rate = 0.01
W1 -= learning_rate * d_W1
b1 -= learning_rate * d_b1
W2 -= learning_rate * d_W2
b2 -= learning_rate * d_b2

Optimization

Optimization algorithms update network weights to minimize loss.

Gradient Descent Variants

1. Stochastic Gradient Descent (SGD)

# Update after each sample
for x, y in dataset:
    loss = compute_loss(model(x), y)
    gradients = compute_gradients(loss)
    weights -= learning_rate * gradients

Pros: Fast updates, can escape local minima Cons: Noisy updates, slow convergence

2. Mini-Batch Gradient Descent

# Update after batch of samples
for batch_x, batch_y in dataloader:
    loss = compute_loss(model(batch_x), batch_y)
    gradients = compute_gradients(loss)
    weights -= learning_rate * gradients

Common batch sizes: 32, 64, 128, 256

Pros: Balance between speed and stability

3. SGD with Momentum

velocity = 0
for batch in dataset:
    gradients = compute_gradients(batch)
    velocity = momentum * velocity - learning_rate * gradients
    weights += velocity

Benefit: Accelerates convergence, dampens oscillations

Modern Optimizers

Adam (Adaptive Moment Estimation) ⭐

Most popular optimizer for deep learning:

# Combines momentum and adaptive learning rates
m = 0  # First moment (mean)
v = 0  # Second moment (variance)

for batch in dataset:
    gradients = compute_gradients(batch)
    m = beta1 * m + (1 - beta1) * gradients
    v = beta2 * v + (1 - beta2) * gradients**2
    
    m_hat = m / (1 - beta1**t)  # Bias correction
    v_hat = v / (1 - beta2**t)
    
    weights -= learning_rate * m_hat / (sqrt(v_hat) + epsilon)

Default hyperparameters:

  • learning_rate = 0.001

  • beta1 = 0.9

  • beta2 = 0.999

  • epsilon = 1e-8

Benefits:

  • ✅ Adaptive learning rates per parameter

  • ✅ Works well with minimal tuning

  • ✅ Efficient for large datasets

Others

  • AdamW: Adam with better weight decay

  • RMSprop: Good for RNNs

  • AdaGrad: Adapts learning rate based on parameter frequency

Training Process

Complete Training Loop

import torch
import torch.nn as nn
import torch.optim as optim

# 1. Define model
model = NeuralNetwork()

# 2. Define loss function
criterion = nn.CrossEntropyLoss()

# 3. Define optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 4. Training loop
num_epochs = 10
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        # Forward pass
        predictions = model(batch_x)
        loss = criterion(predictions, batch_y)
        
        # Backward pass
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()         # Compute gradients
        optimizer.step()        # Update weights
    
    # Validation
    val_loss = evaluate(model, val_loader)
    print(f"Epoch {epoch}: Train Loss={loss:.4f}, Val Loss={val_loss:.4f}")

Training Best Practices

1. Train/Validation/Test Split

        flowchart LR
    A[Dataset] --> B["70% Training<br/>(optimize weights)"]
    A --> C["15% Validation<br/>(tune hyperparameters)"]
    A --> D["15% Test<br/>(final evaluation)"]
    

2. Normalization

# Normalize inputs to zero mean, unit variance
X = (X - X.mean()) / X.std()

Why: Helps optimization converge faster

3. Weight Initialization

# Xavier/Glorot initialization for tanh
W = np.random.randn(n_in, n_out) * np.sqrt(1 / n_in)

# He initialization for ReLU
W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)

4. Learning Rate Scheduling

# Reduce learning rate when validation loss plateaus
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
scheduler.step(val_loss)

5. Early Stopping

# Stop training if validation loss doesn't improve
best_val_loss = float('inf')
patience_counter = 0

if val_loss < best_val_loss:
    best_val_loss = val_loss
    patience_counter = 0
else:
    patience_counter += 1
    if patience_counter >= patience:
        print("Early stopping!")
        break

6. Regularization

Dropout:

# Randomly disable neurons during training
layer = nn.Linear(128, 64)
dropout = nn.Dropout(p=0.5)  # Disable 50% of neurons

Weight Decay (L2 regularization):

optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)

Common Architectures

1. Feedforward Neural Network (FNN)

        flowchart LR
    A[Input] --> B[Dense] --> C[ReLU] --> D[Dense] --> E[ReLU] --> F[Dense] --> G[Output]
    

Use cases: Tabular data, simple classification

2. Convolutional Neural Network (CNN)

        flowchart LR
    A[Image] --> B[Conv] --> C[ReLU] --> D[Pool] --> E[Conv] --> F[ReLU] --> G[Pool] --> H[Flatten] --> I[Dense] --> J[Output]
    

Use cases: Image classification, object detection, computer vision

Key innovation: Learns spatial hierarchies (edges → shapes → objects)

3. Recurrent Neural Network (RNN)

        flowchart LR
    W1["Word₁"] --> R1[RNN]
    R1 --> H1[Hidden State]
    R1 --> O1["Output₁"]
    H1 --> R2[RNN]
    W2["Word₂"] --> R2
    R2 --> O2["Output₂"]
    R2 --> H2["..."]

**Use cases:** Time series, text, sequential data

**Problem:** Vanishing gradients on long sequences

### 4. LSTM (Long Short-Term Memory)

Improved RNN with gates to control information flow:
- **Forget gate**: What to forget from memory
- **Input gate**: What new information to add
- **Output gate**: What to output

**Use cases:** Machine translation, speech recognition

---

## From RNNs to Transformers

### The Sequential Processing Problem

**RNNs must process one token at a time:**

    

“The cat sat on the mat”

Step 1: Process “The” → hidden_state₁ Step 2: Process “cat” → hidden_state₂ (depends on step 1) Step 3: Process “sat” → hidden_state₃ (depends on step 2) …

❌ Cannot parallelize ❌ Long-range dependencies fade ❌ Slow training


### The Attention Revolution

**Key insight:** What if we could look at ALL words simultaneously?

“The cat sat on the mat”

For predicting next word:

  • Look at all positions at once

  • Compute relevance scores (attention weights)

  • Focus more on important words

  • Process in parallel

✅ Parallelizable ✅ Long-range dependencies preserved ✅ Fast training


### Transformer Benefits

1. **Parallel Processing**: All positions computed simultaneously
2. **Long Context**: No vanishing gradients over distance
3. **Flexibility**: Same architecture for many tasks
4. **Scalability**: Can train on massive datasets
5. **Transfer Learning**: Pre-train once, fine-tune for many tasks

This is why transformers have become the dominant architecture for:
- Natural Language Processing (GPT, BERT, T5)
- Computer Vision (Vision Transformer)
- Multi-modal models (CLIP, GPT-4)
- Audio (Whisper)
- Code generation (Codex, GitHub Copilot)

---

## Next Steps

Now that you understand neural network basics, proceed to:

1. **`attention_explained.md`** - Deep dive into attention mechanism
2. **`transformer_architecture.md`** - Complete transformer breakdown
3. **Run the Python examples** - Hands-on implementation

The journey continues! 🚀