🏗️ Chapter 04 · Microsoft Azure AI

Meet Microsoft Foundry Local

Azure AI Foundry's power — running entirely on your own hardware. No cloud. No bill. No data leaving your building.

What is Foundry Local?

Foundry Local is a developer runtime that brings Azure AI Foundry's power to your own device. Models execute locally — on CPU, GPU, or NPU. Your laptop, your workstation, your on-premises server. No internet required after setup.

💡

In one sentence: Foundry Local takes the AI models and infrastructure from Azure AI Foundry and runs them directly on the hardware you already own — making local AI as simple as one command.

🔒 Data stays with you

Every token processed on your hardware. Nothing sent to external servers. Full compliance for GDPR, HIPAA, and regulated industries.

⚡ Near-zero latency

No network round trip. First token in milliseconds, not hundreds of milliseconds. Real-time AI that actually feels real-time.

💰 Zero token cost

Run 1 million tokens or 1 billion — the cost is the same. Hardware is a one-time investment. The meter is permanently off.

🔄 Runs on your device

Models execute on CPU, GPU, or NPU. Your laptop, your workstation, your on-prem server. No internet required after model download.

Foundry Local Answers the Cloud Problems

Every cloud-only limitation has a direct, architectural solution in Foundry Local.

| Cloud AI Problem | Foundry Local Solution |
| --- | --- |
| Data leaves your network on every request | All inference on your hardware, always |
| 300–800 ms latency before first token | Sub-50 ms — no network hop |
| $9K–$45K/month at scale | $0/month after hardware investment |
| Rate limits and API outages | No limits — you own the infrastructure |
| Vendor lock-in and pricing changes | Open models — no single vendor dependency |

How Foundry Local Works

ONNX Runtime + quantization + an OpenAI-compatible API — all behind one command.

# Your app layer
  LangChain · Azure SDK · Your code · Open WebUI
        ↕ OpenAI-compatible REST API · localhost:5272
# Foundry Local
  Model server · Download manager · Hardware router
        ↕ ONNX Runtime
# Your hardware
  NPU · NVIDIA CUDA · AMD ROCm · Apple Metal · CPU
# 1. Install
winget install Microsoft.FoundryLocal

# 2. Run a model (downloads + starts server)
foundry model run phi-4-mini
✓ Downloading phi-4-mini (INT4, 2.5GB)...
✓ Optimizing for your hardware...
✓ Server ready on localhost:5272

# 3. One-line migration from cloud
base_url = "https://api.openai.com/v1"  # before
base_url = "http://localhost:5272/v1"    # after ✓

Zero code rewrite: The OpenAI-compatible API means every framework, every SDK, every tool that talks to OpenAI also talks to Foundry Local. Change one URL and you're running locally.
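To make that concrete, here is a minimal sketch of a chat-completion call against the local endpoint using only the Python standard library. The model name and port are the defaults from the quickstart above; the `build_request`/`chat` helpers are illustrative names, and the server must already be running for the final call to succeed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5272/v1"  # Foundry Local's default local endpoint


def build_request(prompt: str, model: str = "phi-4-mini"):
    """Assemble an OpenAI-style chat-completions request (URL + JSON body)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return f"{BASE_URL}/chat/completions", body


def chat(prompt: str) -> str:
    """Send one request to the local server and return the reply text."""
    url, body = build_request(prompt)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# chat("Summarize why local inference has near-zero latency.")  # needs the server up
```

The same snippet works against `api.openai.com` by changing only `BASE_URL` — that symmetry is the whole migration story.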

The Model Catalog

Curated, quantized, and hardware-optimized models — ready to run locally with a single command.

| Model | Size | Best For |
| --- | --- | --- |
| Phi-4 Mini | 3.8B · 2.5 GB | Laptop / NPU · daily tasks |
| Phi-4 | 14B · 8.5 GB | Workstation GPU · complex reasoning |
| Llama 3.2 3B | 3B · 2 GB | Fast laptop inference |
| Llama 3.1 8B | 8B · 5 GB | Balanced quality / speed |
| Mistral 7B | 7B · 4.5 GB | Code generation, instruction following |
| DeepSeek-R1 | 7B · 5 GB | Reasoning tasks, math |

Hardware Support

Foundry Local auto-detects the best hardware on your machine and routes model execution accordingly.

⚙️

Priority order: NPU (Copilot+ PC) → NVIDIA GPU (CUDA) → AMD GPU (ROCm) → Apple Silicon (Metal) → CPU. No manual configuration needed.
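That fallback chain is just a first-match search down the priority list. A sketch of the idea — the backend names and the `available` set are illustrative only; Foundry Local performs this detection internally, not through any public API:

```python
# Priority order from the callout above; backend names are illustrative.
PRIORITY = ["npu", "cuda", "rocm", "metal", "cpu"]


def pick_backend(available):
    """Return the highest-priority execution backend present on this machine."""
    return next(b for b in PRIORITY if b in available)
```

A Copilot+ PC exposing both an NPU and a CPU routes to the NPU; a plain desktop falls through to CPU.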