AI & LLM Engineering for .NET Architects
Lesson 22 of 30 73% of course

Running AI Locally with ONNX and LocalLLM

16 · 8 min · 5/23/2026

Sign in to track progress and bookmarks.

Local AI in .NET

You don't need a REST API to run AI. With ONNX Runtime and Microsoft.Extensions.AI, you can run models directly inside your C# process.

1. What is ONNX?

ONNX (Open Neural Network Exchange) is a universal format for AI models. It allows you to take a model trained in Python/PyTorch and run it in a C# app with high performance. It uses specialized hardware accelerators like **CUDA** (Nvidia) or **DirectML** (Windows) to run incredibly fast.

2. Microsoft.ML.OnnxRuntime.GenAI

This is the new "One-liner" library for running LLMs in C#.

using var model = new Model("phi-3-mini-onnx");
using var tokenizer = new Tokenizer(model);
var generator = new Generator(model, tokenizer);
// Generate text locally!

4. Interview Mastery

Q: "What are the hardware requirements for running a local LLM?"

Architect Answer: "The most important factor is **VRAM (Video RAM)** on the GPU. A 7B parameter model (quantized) needs about 5-6GB of RAM. If the model fits in VRAM, it runs instantly. If it overflows into system RAM, it becomes 10x slower. For a professional AI workstation, we recommend at least 16GB of VRAM (RTX 4080 or better) to run modern SLMs comfortably."

Test your knowledge

Quizzes linked to this course—pass to earn certificates.

Browse all quizzes
AI & LLM Engineering for .NET Architects

On this page

1. What is ONNX? 2. Microsoft.ML.OnnxRuntime.GenAI 4. Interview Mastery
1. AI Foundations & Prompt Engineering
The LLM Landscape: Transformers, Attention, and Tokens Advanced Prompt Engineering: Few-shot, Chain-of-Thought, and ReAct Prompt Versioning & Management in Production LLM Cost Estimation: Token accounting and budget strategies
2. Semantic Kernel & Integration
Introduction to Microsoft Semantic Kernel (SK) Skills & Plugins: Extending the LLM with native C# functions Planner & Orchestration: Automating complex multi-step AI tasks Connectors: Switching between OpenAI, Azure OpenAI, and HuggingFace
3. Vector Databases & RAG
The RAG Pattern: Solving the 'Static Knowledge' problem Embeddings Deep Dive: Converting text to math Vector DBs: Azure AI Search vs Pinecode vs Milvus Hybrid Search: Combining Keyword and Semantic search for accuracy
4. Advanced RAG Techniques
Document Chunking Strategies: Overlap, Slidewindow, and Semantic splitting Recursive Document Processing for massive knowledge bases Context Window Management: Summarization vs Truncation Citations & Grounding: Ensuring the AI doesn't hallucinate
5. AI Safety & Guardrails
Content Moderation: Azure AI Content Safety integration Prompt Injection: Defending against adversarial attacks Punitiveness & Bias: Evaluating and mitigating model behavior Self-Correction Patterns: Letting the AI check its own work
6. Small Language Models (SLMs) & Local AI
The rise of SLMs: Phi-3, Llama-3-8B, and Mistral Running AI Locally with ONNX and LocalLLM Quantization: Running 70B models on 16GB RAM Edge AI: Deploying models to local devices and private clouds
7. Multimodal & Agentic AI
Multimodal AI: Processing Images, PDFs, and Audio in C# Agentic Workflows: Multi-agent collaboration with AutoGen Function Calling: Letting the LLM use your SQL and API tools Memory Management: Ephemeral vs Long-term Semantic memory
8. FAANG AI Engineer Interview
Case Study: Designing a Global Enterprise AI Knowledge Assistant Case Study: Building an Autonomous AI Agent for Software Dev