Inference Optimization & Model Serving
Learning Objectives
Understand the memory and compute bottlenecks of LLM inference (Memory Wall vs Compute Wall).
Master PagedAttention and KV Cache management.
Apply post-training quantization techniques (AWQ, GPTQ, EXL2).
Deploy models using high-throughput serving engines like vLLM and TensorRT-LLM.
Implement advanced decoding strategies like Speculative Decoding to reduce latency.
Understand prefix caching, chunked prefill, and continuous batching.
Compare local serving engines such as vLLM, TensorRT-LLM, and SGLang.
Measure throughput, time-to-first-token (TTFT), decode speed, and cost-per-token trade-offs.
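The memory-wall objective above is easy to quantify: during decoding, each step must stream the weights plus the growing KV cache from GPU memory. A back-of-envelope KV-cache sizing sketch for a Llama-3-8B-style configuration (32 layers, 8 grouped-query KV heads, head dimension 128; the helper function is our own, not a library API):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * seq_len * batch * bytes-per-element
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-3-8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch=1)
print(per_token)  # 131072 bytes = 128 KiB of cache per token

gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1) / 2**30
print(f"{gib:.2f} GiB for one 8K-token sequence")  # 1.00 GiB
```

At a batch of 32 such sequences the cache alone needs ~32 GiB, which is why paging and quantization matter long before the weights themselves are the bottleneck.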
Time Estimate
Expected time: 6-8 hours
Prerequisites
Completion of 14-local-llms
Completion of 04-token
Basic understanding of PyTorch devices and CUDA memory.
Current and Planned Materials
01_kv_cache_paged_attention.ipynb - Visualizing and managing the KV cache.
02_quantization_deep_dive.ipynb - Quantizing a Llama-3 model from FP16 to INT4 using AWQ.
03_serving_with_vllm.ipynb - Quickstart notebook for vLLM-based serving and batching.
04_speculative_decoding.ipynb - Speeding up inference using a small draft model.
Add a TensorRT-LLM / SGLang comparison walkthrough.
Add a prefix caching and chunked prefill tuning walkthrough.
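The core idea behind the KV-cache notebook can be sketched in plain Python: PagedAttention stores the cache in fixed-size blocks and gives each sequence a block table that maps logical block indices to physical ones, so memory is allocated on demand and freed exactly when a request finishes. The class below is an illustrative toy, not vLLM's actual allocator:

```python
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size)

class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of physical blocks

    def append_token(self, seq_id, pos):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # crossed a logical block boundary
            table.append(self.free.pop())    # allocate a physical block lazily
        # return (physical block, offset within block) for this token's K/V
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        # freeing a sequence returns all its blocks to the pool at once
        self.free.extend(self.tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                        # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token("req-0", pos)
print(len(cache.tables["req-0"]))            # 3
```

Because blocks are allocated per 16 tokens rather than per max-sequence-length, internal fragmentation is bounded by one block per sequence; prefix caching extends the same table with shared, reference-counted blocks.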
2026 Topics This Phase Should CoverΒΆ
PagedAttention and KV cache layout
Prefix caching and reuse across repeated prompts
Chunked prefill, continuous batching, and scheduler behavior
Quantization stacks: AWQ, GPTQ, EXL2, GGUF, FP8 where available
Serving runtimes: vLLM, TensorRT-LLM, SGLang, TGI
Speculative decoding and draft-model assisted generation
Throughput metrics: TTFT, tokens/sec, concurrency saturation, memory footprint
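The speculative-decoding topic above boils down to a verify-in-parallel loop: a cheap draft model proposes k tokens, the target model checks all k positions in one pass, and the longest agreeing prefix is accepted. The sketch below uses stand-in arithmetic "models" (our own toy functions, not real LLMs) and greedy matching; production engines generalize this with rejection sampling so the output distribution matches the target exactly:

```python
def draft_model(ctx):
    # cheap, fast, mostly-right next-token guess
    return (ctx[-1] * 2) % 7

def target_model(ctx):
    # the model whose output we must reproduce; disagrees when last token is 3
    return (ctx[-1] * 2) % 7 if ctx[-1] != 3 else 5

def speculative_step(ctx, k=4):
    # 1) draft k tokens autoregressively with the cheap model
    draft, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_model(tmp)
        draft.append(t)
        tmp.append(t)
    # 2) target verifies each drafted position (one batched pass in practice)
    accepted, tmp = [], list(ctx)
    for t in draft:
        want = target_model(tmp)
        if want == t:
            accepted.append(t)               # draft token accepted
            tmp.append(t)
        else:
            accepted.append(want)            # first mismatch: emit target's token
            break
    return accepted                          # always >= 1 token per target pass

print(speculative_step([1]))  # [2, 4, 1, 2] -- all drafts accepted
print(speculative_step([3]))  # [5] -- first draft rejected, target token used
```

The speedup comes from step 2 being a single target forward pass over all k positions: when acceptance rates are high, several tokens are produced per expensive pass, cutting decode latency without changing the target model's outputs.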