# AI Toolkit for VS Code

The AI Toolkit extension (formerly Windows AI Studio) is Microsoft's VS Code extension for browsing, downloading, running, fine-tuning, and evaluating local models, all from inside the editor.
## 1. What AI Toolkit Does

| Capability | Details |
|---|---|
| Model catalog | Browse models from Hugging Face and Azure AI Foundry directly in the sidebar |
| Local playground | Chat with downloaded models in an interactive panel inside VS Code |
| ONNX Runtime | Run models locally via ONNX Runtime GenAI (CPU, CUDA, DirectML, Apple Silicon) |
| Fine-tuning | QLoRA fine-tuning with a guided UI: dataset prep, hyperparameters, training |
| Evaluation | Run promptflow-evals evaluators and view results in the extension |
| Model conversion | Convert Hugging Face models to ONNX format for local inference |
| Multi-runtime | Supports ONNX, GGUF (via llama.cpp), and cloud-hosted endpoints |
## 2. Installation

1. Open VS Code
2. Open the Extensions view (`Cmd+Shift+X` on macOS, `Ctrl+Shift+X` on Windows/Linux)
3. Search for "AI Toolkit" (publisher: Microsoft)
4. Click Install

The extension adds an AI Toolkit icon to the Activity Bar (left sidebar).
## 3. Browsing and Downloading Models

### From the Model Catalog

1. Click the AI Toolkit icon in the sidebar
2. Browse Popular Models or search by name
3. Click a model card to see details: size, architecture, quantization options
4. Click Download to pull the model locally
### Where Models Are Stored

Models are downloaded to the AI Toolkit working directory:

| Platform | Path |
|---|---|
| macOS/Linux | `~/.aitk/models` |
| Windows | `%USERPROFILE%\.aitk\models` |

The directory structure follows a 4-layer convention:

`~/.aitk/models/{publisher}/{model-name}/{runtime}/{display-name}`

Example: `~/.aitk/models/microsoft/Phi-4-mini/cpu/phi4-mini-int4`
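If you want to work with these paths in a script (for example, to list downloaded models), the four layers can be split apart with `pathlib`. This is an illustrative helper, not part of the extension; the root path is assumed from the convention above.

```python
from pathlib import Path

def parse_model_path(path: str, root: str = "~/.aitk/models") -> dict:
    """Split an AI Toolkit model path into the 4 convention layers."""
    rel = Path(path).expanduser().relative_to(Path(root).expanduser())
    publisher, model_name, runtime, display_name = rel.parts
    return {
        "publisher": publisher,        # e.g. microsoft
        "model": model_name,           # e.g. Phi-4-mini
        "runtime": runtime,            # e.g. cpu
        "display_name": display_name,  # e.g. phi4-mini-int4
    }

info = parse_model_path("~/.aitk/models/microsoft/Phi-4-mini/cpu/phi4-mini-int4")
```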
### Model Formats

| Format | Runtime | When to use |
|---|---|---|
| ONNX | ONNX Runtime GenAI | Best integration with AI Toolkit, broadest hardware support |
| GGUF | llama.cpp | Already have GGUF models from Ollama or LM Studio |
| Cloud | API endpoint | Model too large for local hardware |
## 4. Using the Local Playground

Once a model is downloaded:

1. Select it in the AI Toolkit sidebar
2. Click Load in Playground
3. Chat with the model in the interactive panel
### Playground Features

- **System prompt**: Set a custom system message
- **Temperature / Top-p**: Adjust generation parameters
- **Token limit**: Control response length
- **Multi-turn**: Maintains conversation context
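Conceptually, the state behind these features is a system prompt, a set of generation parameters, and a growing message history that gives the model multi-turn context. The sketch below illustrates that shape; it is not the extension's actual internals, and all names are hypothetical.

```python
class PlaygroundSession:
    """Illustrative model of a playground chat session's state."""

    def __init__(self, system_prompt="", temperature=0.7, top_p=0.9, max_tokens=512):
        # Generation parameters exposed as playground sliders/fields
        self.params = {"temperature": temperature, "top_p": top_p,
                       "max_tokens": max_tokens}
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add_turn(self, user_msg, assistant_msg):
        # Multi-turn context: every exchange is appended to the history,
        # and the whole history is sent with the next request
        self.messages.append({"role": "user", "content": user_msg})
        self.messages.append({"role": "assistant", "content": assistant_msg})
```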
### Playground vs. Ollama

| Feature | AI Toolkit Playground | Ollama |
|---|---|---|
| UI | Built into VS Code | Terminal or separate UI |
| Format | ONNX (primary) | GGUF |
| Runtime | ONNX Runtime GenAI | llama.cpp |
| API | Not exposed by default | OpenAI-compatible API |
| Fine-tuning | Built-in UI | Requires separate workflow |
| Model catalog | Integrated in sidebar | Ollama library |
**When to use each:** use Ollama if you want a local OpenAI-compatible API (see 04_llm_server_and_api.ipynb); use AI Toolkit if you want a GUI-first experience with fine-tuning and evaluation built in.
## 5. Model Conversion to ONNX

To run a Hugging Face model in AI Toolkit, convert it to ONNX format:
### Setup

```bash
conda create -n model_builder python==3.11 -y
conda activate model_builder
pip install onnx torch onnxruntime_genai transformers
```
### Convert

```bash
python -m onnxruntime_genai.models.builder \
  -m /path/to/hf/model \
  -p int4 \
  -e cpu \
  -o ~/.aitk/models/publisher/model-name/cpu/display-name \
  -c /tmp/conversion-cache \
  --extra_options include_prompt_templates=1
```
### Precision × Runtime Combinations

| Precision | Runtime | Use case |
|---|---|---|
| INT4 | CPU | Laptops, low memory |
| FP16 | CUDA | NVIDIA GPUs |
| FP16 | DirectML | AMD/Intel GPUs on Windows |
| FP32 | CPU | Maximum accuracy, slow |
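If you convert several precision/runtime variants of one model, it can help to compose the builder invocation programmatically so each output lands in the right `{runtime}/{display-name}` folder. A minimal sketch, assuming the builder flags shown above; `builder_command` and the paths are illustrative placeholders.

```python
def builder_command(model_dir: str, out_root: str,
                    precision: str, runtime: str) -> list[str]:
    """Compose the onnxruntime_genai builder argv for one variant."""
    return [
        "python", "-m", "onnxruntime_genai.models.builder",
        "-m", model_dir,
        "-p", precision,   # int4, fp16, or fp32
        "-e", runtime,     # cpu, cuda, or dml
        "-o", f"{out_root}/{runtime}/model-{precision}",
        "--extra_options", "include_prompt_templates=1",
    ]

# One command per row of the table above
for precision, runtime in [("int4", "cpu"), ("fp16", "cuda"), ("fp16", "dml")]:
    cmd = builder_command("/path/to/hf/model",
                          "~/.aitk/models/publisher/model-name",
                          precision, runtime)
```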
## 6. Fine-Tuning with QLoRA

AI Toolkit provides a guided fine-tuning workflow directly in VS Code.

### Steps

1. Select a model in the sidebar → Fine-tune
2. **Prepare dataset**: Upload JSONL training data
3. **Configure**: Set LoRA rank, learning rate, epochs, batch size
4. **Train**: Runs locally using your GPU (or CPU, with longer training times)
5. **Evaluate**: Compare base model vs. fine-tuned model side by side
### Dataset Format

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is RAG?"}, {"role": "assistant", "content": "RAG stands for Retrieval-Augmented Generation..."}]}
{"messages": [{"role": "user", "content": "Explain embeddings"}, {"role": "assistant", "content": "Embeddings are dense vector representations..."}]}
```
### When to Fine-Tune Locally

| Scenario | Fine-tune locally? |
|---|---|
| Small model (< 7B params) | Yes |
| Large model (> 13B params) | Only with a strong GPU (24GB+ VRAM) |
| Quick experiment / proof of concept | Yes |
| Production training | Use cloud (Azure AI Foundry) |
| Sensitive data that can't leave the machine | Yes, this is the main advantage |
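A rough back-of-the-envelope check can tell you whether your GPU is in the right ballpark before you start. The figures below are assumptions for illustration only (roughly 0.5 bytes per parameter for 4-bit quantized weights, times an overhead factor for LoRA adapters, activations, and optimizer state); actual usage varies with sequence length and batch size, so always verify by measuring.

```python
def qlora_memory_gb(params_billion: float,
                    bytes_per_param: float = 0.5,   # ~4-bit weights (assumed)
                    overhead: float = 1.5) -> float: # adapters/activations (assumed)
    """Very rough VRAM estimate for QLoRA fine-tuning, in GB."""
    return params_billion * bytes_per_param * overhead

# A 7B model lands around 5 GB under these assumptions; a 13B model
# around 10 GB, which is why larger models need 24GB+ cards in practice.
estimate_7b = qlora_memory_gb(7)
```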
## 7. Evaluation

AI Toolkit integrates with promptflow-evals to evaluate model quality.
### Built-in Evaluators

| Evaluator | Measures |
|---|---|
| `GroundednessEvaluator` | Are responses grounded in provided context? |
| `RelevanceEvaluator` | Are responses relevant to the question? |
| `CoherenceEvaluator` | Is the response logically coherent? |
| `FluencyEvaluator` | Is the language fluent and natural? |
| `SimilarityEvaluator` | How similar is the response to ground truth? |
| `F1ScoreEvaluator` | Token-level F1 against ground truth |
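To make the last row concrete, token-level F1 is the harmonic mean of precision and recall over overlapping tokens between prediction and ground truth. The sketch below shows the idea with simple whitespace tokenization; the library's own evaluator may normalize text differently.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 over whitespace tokens (illustrative)."""
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    if not pred or not gold:
        return float(pred == gold)
    # Multiset intersection counts each shared token at most
    # as often as it appears in both sequences
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```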
### Running an Evaluation

1. Prepare a JSONL dataset with questions and ground truth
2. Select a model in AI Toolkit → Evaluate
3. Choose evaluators and configure the judge model
4. View results in the extension panel
### Example Evaluation Script

```python
from promptflow.evals import evaluate
from promptflow.evals.evaluators import SimilarityEvaluator, F1ScoreEvaluator

results = evaluate(
    data="test_data.jsonl",
    evaluators={
        "similarity": SimilarityEvaluator(model_config=model_config),
        "f1": F1ScoreEvaluator(),
    },
    target=my_model_function,
)
```
## 8. AI Toolkit vs. Other Local LLM Tools

| Feature | AI Toolkit | Ollama | LM Studio | vLLM |
|---|---|---|---|---|
| UI | VS Code sidebar | CLI | Desktop app | CLI/API |
| Primary format | ONNX | GGUF | GGUF | HF/AWQ |
| Fine-tuning | Built-in (QLoRA) | No | No | No |
| Evaluation | Built-in | No | No | No |
| API server | No | Yes (OpenAI-compatible) | Yes (OpenAI-compatible) | Yes (OpenAI-compatible) |
| Model catalog | HF + Azure AI | Ollama library | HF | HF |
| Best for | Experiment + fine-tune | Quick local API | GUI exploration | Production serving |
## 9. Practical Workflow

### Exploring a New Model

1. Browse the catalog in the AI Toolkit sidebar
2. Download a quantized ONNX variant
3. Test in the playground with sample prompts
4. If quality is good → serve via Ollama or vLLM for your app
5. If quality is poor → fine-tune with QLoRA in AI Toolkit
6. Evaluate the fine-tuned model against the baseline
7. Export and serve the improved model
### Connecting AI Toolkit to Copilot

AI Toolkit models don't directly replace GitHub Copilot's backend. However, you can:

- Use AI Toolkit to explore and evaluate models before deploying them as an API
- Serve a local model via Ollama or vLLM, then connect it as a custom endpoint
- Use fine-tuned models for domain-specific tasks (code review, doc generation) alongside Copilot for general coding
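Connecting a locally served model as a custom endpoint usually means sending OpenAI-compatible chat requests to it. A minimal sketch of building such a request, assuming Ollama's default port 11434 and its `/v1/chat/completions` route (for vLLM, swap in its base URL, typically port 8000); the model name is a placeholder for whatever you have pulled locally.

```python
import json

def chat_request(model: str, user_msg: str,
                 base_url: str = "http://localhost:11434/v1"):
    """Compose an OpenAI-compatible chat completion request (URL + JSON body)."""
    url = f"{base_url}/chat/completions"
    payload = {
        "model": model,  # e.g. a model pulled into Ollama
        "messages": [{"role": "user", "content": user_msg}],
    }
    return url, json.dumps(payload)

url, body = chat_request("llama3", "Review this function for bugs.")
# POST `body` to `url` with any HTTP client to get a completion back
```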
## 10. Troubleshooting

| Problem | Fix |
|---|---|
| Model doesn't appear in sidebar | Check that the model directory follows the `{publisher}/{model-name}/{runtime}/{display-name}` convention under `~/.aitk/models` |
| Slow inference on macOS | Ensure the ONNX model targets the `cpu` runtime (Apple Silicon builds), not CUDA or DirectML |
| Out of memory during fine-tuning | Reduce batch size, use INT4 quantization, or switch to a smaller model |
| Conversion fails | Install the latest `onnxruntime_genai` and `transformers` packages |
| Playground shows gibberish | Model may need prompt templates; reconvert with `--extra_options include_prompt_templates=1` |
## Next Steps

- Try 01_ollama_quickstart.ipynb for an API-first approach to local models
- See 03_local_rag_with_ollama.ipynb for building a local RAG system
- See 05_speculative_decoding.ipynb for inference optimization
- For connecting local models to your coding workflow, see 31-ai-powered-dev-tools/02_mcp_deep_dive.md