🏗️ Chapter 04 · Microsoft Azure AI

Meet Microsoft Foundry Local

Azure AI Foundry's power — running entirely on your own hardware. No cloud. No bill. No data leaving your building.

What is Foundry Local?

Foundry Local is a developer runtime that brings Azure AI Foundry's power to your own device. Models execute locally — on CPU, GPU, or NPU. Your laptop, your workstation, your on-premises server. No internet required after setup.

💡

In one sentence: Foundry Local takes the AI models and infrastructure from Azure AI Foundry and runs them directly on the hardware you already own — making local AI as simple as one command.

🔒 Data stays with you

Every token processed on your hardware. Nothing sent to external servers. Full compliance for GDPR, HIPAA, and regulated industries.

⚡ Near-zero latency

No network round trip. First token in milliseconds, not hundreds of milliseconds. Real-time AI that actually feels real-time.

💰 Zero token cost

Run 1 million tokens or 1 billion — the cost is the same. Hardware is a one-time investment. The meter is permanently off.

🔄 Runs on your device

Models execute on CPU, GPU, or NPU. Your laptop, your workstation, your on-prem server. No internet required after model download.

Foundry Local Answers the Cloud Problems

Every cloud-only limitation has a direct, architectural solution in Foundry Local.

| Cloud AI Problem | Foundry Local Solution |
| --- | --- |
| Data leaves your network on every request | All inference on your hardware, always |
| 300–800 ms latency before first token | Sub-50 ms — no network hop |
| $9K–$45K/month at scale | $0/month after hardware investment |
| Rate limits and API outages | No limits — you own the infrastructure |
| Vendor lock-in and pricing changes | Open models — no single vendor dependency |

How Foundry Local Works

ONNX Runtime + quantization + an OpenAI-compatible API — all behind one command.

# Your app layer
  LangChain · Azure SDK · Your code · Open WebUI
        ↕ OpenAI-compatible REST API · localhost:5272
# Foundry Local
  Model server · Download manager · Hardware router
        ↕ ONNX Runtime
# Your hardware
  NPU · NVIDIA CUDA · AMD ROCm · Apple Metal · CPU
# 1. Install
winget install Microsoft.FoundryLocal

# 2. Run a model (downloads + starts server)
foundry model run phi-4-mini
✓ Downloading phi-4-mini (INT4, 2.5GB)...
✓ Optimizing for your hardware...
✓ Server ready on localhost:5272

# 3. One-line migration from cloud
base_url = "https://api.openai.com/v1"  # before
base_url = "http://localhost:5272/v1"    # after ✓

Zero code rewrite: The OpenAI-compatible API means every framework, every SDK, every tool that talks to OpenAI also talks to Foundry Local. Change one URL and you're running locally.
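To make that concrete, here is a minimal sketch of a chat-completion call against the local endpoint using only the Python standard library. The model name and port are the defaults from the quickstart above; the `build_request`/`chat` helpers are illustrative names, and the server must already be running for the final call to succeed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5272/v1"  # Foundry Local's default local endpoint


def build_request(prompt: str, model: str = "phi-4-mini"):
    """Assemble an OpenAI-style chat-completions request (URL + JSON body)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return f"{BASE_URL}/chat/completions", body


def chat(prompt: str) -> str:
    """Send one request to the local server and return the reply text."""
    url, body = build_request(prompt)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# chat("Summarize why local inference has near-zero latency.")  # needs the server up
```

The same snippet works against `api.openai.com` by changing only `BASE_URL` — that symmetry is the whole migration story.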

The Model Catalog

Curated, quantized, and hardware-optimized models — ready to run locally with a single command.

| Model | Size | Best For |
| --- | --- | --- |
| Phi-4 Mini | 3.8B · 2.5 GB | Laptop / NPU · daily tasks |
| Phi-4 | 14B · 8.5 GB | Workstation GPU · complex reasoning |
| Llama 3.2 3B | 3B · 2 GB | Fast laptop inference |
| Llama 3.1 8B | 8B · 5 GB | Balanced quality / speed |
| Mistral 7B | 7B · 4.5 GB | Code generation, instruction following |
| DeepSeek-R1 | 7B · 5 GB | Reasoning tasks, math |

Hardware Support

Foundry Local auto-detects the best hardware on your machine and routes model execution accordingly.

⚙️

Priority order: NPU (Copilot+ PC) → NVIDIA GPU (CUDA) → AMD GPU (ROCm) → Apple Silicon (Metal) → CPU. No manual configuration needed.
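That fallback chain is just a first-match search down the priority list. A sketch of the idea — the backend names and the `available` set are illustrative only; Foundry Local performs this detection internally, not through any public API:

```python
# Priority order from the callout above; backend names are illustrative.
PRIORITY = ["npu", "cuda", "rocm", "metal", "cpu"]


def pick_backend(available):
    """Return the highest-priority execution backend present on this machine."""
    return next(b for b in PRIORITY if b in available)
```

A Copilot+ PC exposing both an NPU and a CPU routes to the NPU; a plain desktop falls through to CPU.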