Inference Optimization & Model Serving
Learning Objectives
Understand the memory and compute bottlenecks of LLM inference (Memory Wall vs Compute Wall).
Master PagedAttention and KV Cache management.
Apply post-training quantization techniques (AWQ, GPTQ, EXL2).
Deploy models using high-throughput serving engines like vLLM and TensorRT-LLM.
Implement advanced decoding strategies like Speculative Decoding to reduce latency.
Understand prefix caching, chunked prefill, and continuous batching.
Compare local serving engines such as vLLM, TensorRT-LLM, and SGLang.
Measure throughput, time-to-first-token (TTFT), decode speed, and cost-per-token trade-offs.
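The memory-wall objective above is easy to quantify: during decoding, each step must stream the weights plus the growing KV cache from GPU memory. A back-of-envelope KV-cache sizing sketch for a Llama-3-8B-style configuration (32 layers, 8 grouped-query KV heads, head dimension 128; the helper function is our own, not a library API):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * seq_len * batch * bytes-per-element
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-3-8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch=1)
print(per_token)  # 131072 bytes = 128 KiB of cache per token

gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1) / 2**30
print(f"{gib:.2f} GiB for one 8K-token sequence")  # 1.00 GiB
```

At a batch of 32 such sequences the cache alone needs ~32 GiB, which is why paging and quantization matter long before the weights themselves are the bottleneck.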
Time Estimate
Expected time: 6-8 hours
Prerequisites
Completion of 14-local-llms
Completion of 04-token
Basic understanding of PyTorch devices and CUDA memory.
Current and Planned Materials
01_kv_cache_paged_attention.ipynb - Visualizing and managing the KV cache.
02_quantization_deep_dive.ipynb - Quantizing a Llama-3 model from FP16 to INT4 using AWQ.
03_serving_with_vllm.ipynb - Quickstart notebook for vLLM-based serving and batching.
04_speculative_decoding.ipynb - Speeding up inference using a small draft model.
Add a TensorRT-LLM / SGLang comparison walkthrough.
Add a prefix caching and chunked prefill tuning walkthrough.
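The core idea behind the KV-cache notebook can be sketched in plain Python: PagedAttention stores the cache in fixed-size blocks and gives each sequence a block table that maps logical block indices to physical ones, so memory is allocated on demand and freed exactly when a request finishes. The class below is an illustrative toy, not vLLM's actual allocator:

```python
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size)

class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of physical blocks

    def append_token(self, seq_id, pos):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # crossed a logical block boundary
            table.append(self.free.pop())    # allocate a physical block lazily
        # return (physical block, offset within block) for this token's K/V
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        # freeing a sequence returns all its blocks to the pool at once
        self.free.extend(self.tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                        # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token("req-0", pos)
print(len(cache.tables["req-0"]))            # 3
```

Because blocks are allocated per 16 tokens rather than per max-sequence-length, internal fragmentation is bounded by one block per sequence; prefix caching extends the same table with shared, reference-counted blocks.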
2026 Topics This Phase Should CoverΒΆ
PagedAttention and KV cache layout
Prefix caching and reuse across repeated prompts
Chunked prefill, continuous batching, and scheduler behavior
Quantization stacks: AWQ, GPTQ, EXL2, GGUF, FP8 where available
Serving runtimes: vLLM, TensorRT-LLM, SGLang, TGI
Speculative decoding and draft-model assisted generation
Throughput metrics: TTFT, tokens/sec, concurrency saturation, memory footprint
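The speculative-decoding topic above boils down to a verify-in-parallel loop: a cheap draft model proposes k tokens, the target model checks all k positions in one pass, and the longest agreeing prefix is accepted. The sketch below uses stand-in arithmetic "models" (our own toy functions, not real LLMs) and greedy matching; production engines generalize this with rejection sampling so the output distribution matches the target exactly:

```python
def draft_model(ctx):
    # cheap, fast, mostly-right next-token guess
    return (ctx[-1] * 2) % 7

def target_model(ctx):
    # the model whose output we must reproduce; disagrees when last token is 3
    return (ctx[-1] * 2) % 7 if ctx[-1] != 3 else 5

def speculative_step(ctx, k=4):
    # 1) draft k tokens autoregressively with the cheap model
    draft, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_model(tmp)
        draft.append(t)
        tmp.append(t)
    # 2) target verifies each drafted position (one batched pass in practice)
    accepted, tmp = [], list(ctx)
    for t in draft:
        want = target_model(tmp)
        if want == t:
            accepted.append(t)               # draft token accepted
            tmp.append(t)
        else:
            accepted.append(want)            # first mismatch: emit target's token
            break
    return accepted                          # always >= 1 token per target pass

print(speculative_step([1]))  # [2, 4, 1, 2] -- all drafts accepted
print(speculative_step([3]))  # [5] -- first draft rejected, target token used
```

The speedup comes from step 2 being a single target forward pass over all k positions: when acceptance rates are high, several tokens are produced per expensive pass, cutting decode latency without changing the target model's outputs.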