What is Foundry Local?
Foundry Local is a developer runtime that brings Azure AI Foundry's power to your own device. Models execute locally — on CPU, GPU, or NPU. Your laptop, your workstation, your on-premises server. No internet required after setup.
In one sentence: Foundry Local takes the AI models and infrastructure from Azure AI Foundry and runs them directly on the hardware you already own — making local AI as simple as one command.
- Privacy: Every token is processed on your hardware. Nothing is sent to external servers, which supports compliance requirements such as GDPR and HIPAA in regulated industries.
- Latency: No network round trip. First token in milliseconds, not hundreds of milliseconds. Real-time AI that actually feels real-time.
- Cost: Run 1 million tokens or 1 billion; the cost is the same. Hardware is a one-time investment. The meter is permanently off.
- Flexibility: Models execute on CPU, GPU, or NPU, on your laptop, workstation, or on-prem server. No internet required after the model download.
Foundry Local Answers the Cloud Problems
Every cloud-only limitation has a direct, architectural solution in Foundry Local.
| Cloud AI Problem | Foundry Local Solution |
|---|---|
| Data leaves your network on every request | All inference on your hardware, always |
| 300–800ms latency before first token | Sub-50ms — no network hop |
| $9K–$45K/month at scale | $0/month after hardware investment |
| Rate limits and API outages | No limits — you own the infrastructure |
| Vendor lock-in and pricing changes | Open models — no single vendor dependency |
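The cost row above can be made concrete with a simple break-even calculation: divide the one-time hardware cost by the monthly cloud bill it replaces. A minimal sketch, with all dollar figures as illustrative assumptions rather than quoted prices:

```python
# Illustrative break-even sketch: months until a one-time hardware
# purchase pays for itself versus a recurring cloud inference bill.
# The dollar amounts below are example assumptions, not real quotes.
import math

def break_even_months(hardware_cost: float, cloud_monthly: float) -> int:
    """Whole months of cloud spend needed to match the hardware cost."""
    if cloud_monthly <= 0:
        raise ValueError("cloud_monthly must be positive")
    return math.ceil(hardware_cost / cloud_monthly)

print(break_even_months(3_000, 9_000))    # entry workstation vs. a $9K/month bill
print(break_even_months(15_000, 9_000))   # GPU server vs. the same bill
```

Even a well-equipped GPU server pays for itself within a few months at the scale of spend the table describes.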
How Foundry Local Works
ONNX Runtime + quantization + OpenAI-compatible API — under one command.
```
Your app layer
  LangChain · Azure SDK · Your code · Open WebUI
        ↕  OpenAI-compatible REST API (localhost:5272)
Foundry Local
  Model server · Download manager · Hardware router
        ↕  ONNX Runtime
Your hardware
  NPU · NVIDIA CUDA · AMD ROCm · Apple Metal · CPU
```
```shell
# 1. Install
winget install Microsoft.FoundryLocal

# 2. Run a model (downloads it, then starts the server)
foundry model run phi-4-mini
✓ Downloading phi-4-mini (INT4, 2.5 GB)...
✓ Optimizing for your hardware...
✓ Server ready on localhost:5272
```

```python
# 3. One-line migration from cloud
base_url = "https://api.openai.com/v1"  # before
base_url = "http://localhost:5272/v1"   # after
```
Zero code rewrite: The OpenAI-compatible API means every framework, every SDK, every tool that talks to OpenAI also talks to Foundry Local. Change one URL and you're running locally.
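To illustrate, here is a minimal sketch of a chat-completion call against the local endpoint using only the Python standard library. The payload shape is the standard OpenAI chat-completions format; the model name and prompt are examples, and the request is only constructed here, since actually sending it requires the server to be running:

```python
# Minimal sketch: build an OpenAI-style chat-completion request aimed
# at Foundry Local's default endpoint. Standard library only.
import json
import urllib.request

BASE_URL = "http://localhost:5272/v1"  # Foundry Local's default port

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Construct (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("phi-4-mini", "Summarize ONNX Runtime in one line.")
# urllib.request.urlopen(req) would send it once the local server is up.
```

The same payload works against `api.openai.com`; only the base URL changes, which is the whole point of the compatibility layer.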
The Model Catalog
Curated, quantized, and hardware-optimized models — ready to run locally with a single command.
| Model | Size | Best For |
|---|---|---|
| Phi-4 Mini | 3.8B · 2.5 GB | Laptop / NPU · daily tasks |
| Phi-4 | 14B · 8.5 GB | Workstation GPU · complex reasoning |
| Llama 3.2 3B | 3B · 2 GB | Fast laptop inference |
| Llama 3.1 8B | 8B · 5 GB | Balanced quality / speed |
| Mistral 7B | 7B · 4.5 GB | Code generation, instruction following |
| DeepSeek-R1 | 7B · 5 GB | Reasoning tasks, math |
Hardware Support
Foundry Local auto-detects the best hardware on your machine and routes model execution accordingly.
Priority order: NPU (Copilot+ PC) → NVIDIA GPU (CUDA) → AMD GPU (ROCm) → Apple Silicon (Metal) → CPU. No manual configuration needed.
- Copilot+ PC NPU — 40+ TOPS dedicated AI compute. Best for always-on inference: silent, power-efficient, consistent.
- NVIDIA GPU (CUDA) — Maximum throughput for large models and batch processing.
- AMD GPU (ROCm) — Competitive GPU option, especially on AMD Ryzen AI hardware.
- Apple Silicon (Metal) — macOS with M1/M2/M3 chips delivers excellent performance per watt.
- CPU — Any modern x64 or ARM64 processor. Works everywhere, performance scales with core count.
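The priority-order routing above can be sketched as a simple first-match scan. This is an illustration of the documented ordering, not Foundry Local's actual implementation; detecting which backends exist on a machine is a stand-in here.

```python
# Sketch of the documented backend priority: NPU > CUDA > ROCm >
# Metal > CPU. Foundry Local performs this routing automatically;
# the detection set below is a stand-in for real hardware probing.
PRIORITY = ["npu", "cuda", "rocm", "metal", "cpu"]

def select_backend(available: set[str]) -> str:
    """Return the highest-priority backend present on this machine."""
    for backend in PRIORITY:
        if backend in available:
            return backend
    return "cpu"  # CPU is always available as the final fallback

print(select_backend({"cpu", "cuda"}))          # GPU wins over CPU
print(select_backend({"npu", "cuda", "cpu"}))   # NPU outranks the GPU
```

Because CPU is always present, the scan cannot fail; every machine gets some backend without manual configuration.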