AI Engineer Roadmap 2026: From LLM APIs to Production (Step-by-Step)

    A free, step-by-step AI engineering roadmap for 2026. Learn to build applications on foundation models: transformers, prompt engineering, evaluation, RAG, agents, finetuning, and production deployment. Grounded in Chip Huyen's AI Engineering, Stanford CS336, and MIT 6.S191, with hands-on RAG, agent, and evaluation projects to build the portfolio that lands an AI engineer job.

    ✓ Expert-Designed Learning Path• Industry-Validated Curriculum• Real-World Application Focus

    This roadmap was created by data engineering professionals with 67 hands-on tasks covering production-ready skills used by companies like Netflix, Airbnb, and Spotify. Master Python, OpenAI API, Anthropic API and 5 more technologies.

    How long does it take? Engineers with Python experience typically complete this roadmap in 5-8 months studying part-time (10-15 hours/week), or about 3-4 months full-time. The 13 sections contain 67 hands-on tasks.

    The 13 steps: (0) Prerequisites · (1) Deep Learning and Transformer Foundations · (2) Understanding Foundation Models · (3) Working with LLM APIs · (4) Prompt Engineering · (5) Evaluation · (6) Retrieval-Augmented Generation (RAG) · (7) AI Agents · (8) Finetuning · (9) Dataset Engineering · (10) Inference Optimization · (11) Production Architecture and Observability · (12) Portfolio and Job Search.

    Intermediate to Advanced
    13 sections • 67 tasks

    Skills You'll Learn

    • Prompt engineering
    • LLM evaluation
    • Retrieval-Augmented Generation (RAG)
    • AI agents and tool use
    • Finetuning (LoRA/QLoRA)
    • Inference optimization
    • Guardrails and AI safety
    • Production AI architecture

    Tools You'll Use

    • Python
    • OpenAI API
    • Anthropic API
    • LangChain
    • Hugging Face
    • Vector databases
    • PyTorch
    • vLLM

    Projects to Build

    • Production RAG System with Retrieval Evaluation

      Build a retrieval-augmented generation system over a real document set: chunking, embeddings, hybrid search with a reranker, grounded answers with citations, and a retrieval + faithfulness evaluation that proves it works.

    • LLM Agent with Tools and Failure-Mode Evaluation

      Build an agent that plans, calls real tools (function calling), manages memory, and recovers from failures, then evaluate it on its trajectory and failure modes, not just happy-path demos.

    • LLM Evaluation Pipeline with Golden Dataset and LLM-as-a-Judge

      Build a reusable evaluation pipeline for LLM applications: a golden dataset, automated scoring with LLM-as-a-judge, and regression testing you can point at any prompt or model change to catch quality drops before users do.

    Step 0: Prerequisites

    -Get fluent in Python beyond the basics: functions, classes, type hints, virtual environments, and async/await for concurrent API calls
    -Build core machine learning literacy: supervised learning, train/validation/test splits, overfitting, and what "a model" is, without needing to train one from scratch
    -Work confidently with REST APIs and JSON, and learn to manage API keys and secrets with environment variables
    -Understand what AI engineering is, how it differs from ML engineering and full-stack engineering, and the three layers of the AI stack (application development, model development, infrastructure)

    Step 1: Deep Learning and Transformer Foundations

    -Understand neural network fundamentals: neurons, layers, activation functions, loss, gradient descent, and backpropagation
    -Learn why sequence modeling is hard: from RNNs and their limitations to the motivation for attention
    -Master the Transformer architecture: self-attention, multi-head attention, positional encodings, and the feed-forward blocks that power every modern LLM
    -Understand tokenization and embeddings: byte-pair encoding, vocabularies, token limits, and how text becomes vectors
    -Learn how LLMs are trained: the next-token prediction objective, pretraining at scale, and scaling laws

    Step 2: Understanding Foundation Models

    -Learn how training data shapes a model: scale, data quality, multilingual and domain-specific models
    -Understand model architecture and size: parameters, dense models versus Mixture-of-Experts, and what model size means for cost and capability
    -Learn post-training: supervised finetuning (SFT) and preference finetuning (RLHF and DPO) that turn a base model into an instruction-following assistant
    -Understand sampling: temperature, top-p and top-k, why model outputs are probabilistic, and test-time compute
    -Learn structured outputs: JSON mode, constrained decoding, and why deterministic structure matters for engineering reliable systems

    Step 3: Working with LLM APIs

    -Call foundation model APIs (OpenAI and Anthropic): messages, system prompts, and the request/response lifecycle
    -Control generation: temperature, max tokens, stop sequences, and streaming responses
    -Get structured outputs in practice: function/tool calling and enforcing a JSON schema on model responses
    -Manage cost and latency: token accounting, choosing model tiers, batching, and basic response caching
    -Build a thin, provider-agnostic LLM client with retries, timeouts, logging, and error handling

    Step 4: Prompt Engineering

    -Learn in-context learning: zero-shot and few-shot prompting, and the difference between system and user prompts
    -Apply prompt engineering best practices: write clear instructions, provide sufficient context, break complex tasks into subtasks, and give the model time to think (chain-of-thought)
    -Organize, version, and test prompts as code so prompt changes are reviewable and measurable
    -Learn defensive prompt engineering: jailbreaking, prompt injection, information extraction, and the defenses against them
    -Practice prompt patterns for extraction, classification, summarization, and structured generation

    Step 5: Evaluation

    -Understand why evaluating foundation models is hard: open-ended outputs, no single ground truth, and the gap between benchmarks and your use case
    -Learn language modeling metrics: entropy, cross-entropy, and perplexity, and what they can and cannot tell you
    -Use exact evaluation: functional correctness, similarity against reference data, and embedding-based similarity
    -Master AI as a judge: how to use an LLM to score outputs, its limitations and biases, and which models make good judges
    -Learn comparative evaluation: pairwise ranking, Elo-style leaderboards, and arena-based comparison
    -Define evaluation criteria and build an evaluation pipeline: an evaluation guideline, a golden dataset, and automated scoring
    -Run model selection the right way: build versus buy, and how to navigate public benchmarks without being misled

    Step 6: Retrieval-Augmented Generation (RAG)

    -Understand why RAG works: grounding answers in your data, reducing hallucination, and serving fresh or private knowledge
    -Learn embeddings and vector search: how documents become vectors and how nearest-neighbor retrieval finds relevant context
    -Build the RAG architecture: chunking, indexing into a vector database, retrieval, and grounded generation
    -Compare retrieval algorithms: keyword search (BM25), vector search, and hybrid search, plus rerankers for precision
    -Optimize retrieval: chunking strategies, metadata filtering, query rewriting, and context construction
    -Evaluate a RAG system: retrieval metrics (recall, precision, MRR) and end-to-end answer faithfulness and relevance

    Step 7: AI Agents

    -Understand what an agent is: the reason-act loop, when an agent helps, and when a simple pipeline is the honest answer
    -Give agents capabilities with tools: function calling, tool schemas, and connecting models to APIs and data sources
    -Learn planning: task decomposition, the ReAct pattern, reflection, and multi-step execution
    -Design multi-agent orchestration: planner, retriever, executor, and reviewer patterns with human-in-the-loop checkpoints
    -Implement memory: short-term context, long-term memory, and state management across turns
    -Handle agent failure modes: guardrails, verification and repair, and evaluating agents on reliability, not just happy-path demos

    Step 8: Finetuning

    -Decide when to finetune and when not to: the trade-offs between prompting, RAG, and finetuning
    -Understand memory bottlenecks: trainable parameters, numerical precision, and the memory math of what fits on a GPU
    -Learn quantization: numerical formats and post-training quantization to shrink models for training and serving
    -Apply parameter-efficient finetuning: LoRA, QLoRA, and adapters that finetune large models on modest hardware
    -Run a small finetune end-to-end and evaluate the finetuned model against the base model on your own task

    Step 9: Dataset Engineering

    -Learn data curation for AI: data quality, coverage, and quantity for both finetuning and evaluation datasets
    -Set up data acquisition and annotation workflows that produce reliable, labeled examples
    -Use data synthesis: AI-powered synthetic data generation and model distillation to build training data at scale
    -Process data properly: inspect, deduplicate, clean, filter, and format datasets for training

    Step 10: Inference Optimization

    -Learn inference performance metrics: latency, throughput, time-to-first-token, and cost per token
    -Understand AI accelerators and bottlenecks: when inference is compute-bound versus memory-bound
    -Apply model-level optimization: quantization, distillation, and speculative decoding
    -Apply service-level optimization: continuous batching, KV caching, and serving frameworks like vLLM

    Step 11: Production Architecture and Observability

    -Design the AI application architecture: enhance context, then layer in the components a production system needs
    -Add guardrails: input and output validation, safety filters, and protection against PII leaks and data exfiltration
    -Add a model router and gateway: route by cost and capability, with fallbacks when a model fails
    -Add caching to cut latency and cost on repeated or similar requests
    -Set up monitoring and observability for LLM apps: tracing, running evals in production, and detecting drift and regressions
    -Build user feedback loops: capture conversational and explicit feedback, design for it, and turn it into systematic improvements

    Step 12: Portfolio and Job Search

    -Build 2-3 end-to-end AI engineering projects on GitHub: a RAG system, an agent with tools, and an evaluation pipeline
    -Write clear README files with architecture diagrams, evaluation results, and an honest discussion of trade-offs
    -Tailor your resume and portfolio to AI engineering: highlight evaluation, guardrails, and production reliability, not just demos
    -Prepare for AI engineering interviews: system design for RAG and agents, prompt and eval design, and deep learning fundamentals
    -Explore the [Interview Prep](/interview-prep) section and the curated [AI engineering Jobs](/jobs) feed for live roles and real interview questions

    Curriculum Reference

    A free preview of the learning material in this roadmap — the full reference for every section is available when you sign in. Click any task to expand it.

    Step 0: Prerequisites

    Get fluent in Python beyond the basics: functions, classes, type hints, virtual environments, and async/await for concurrent API calls

    You do not need to be a Python expert to start, but AI engineering leans on a few patterns more than typical scripting.


    What to be comfortable with

    • Type hints: def embed(text: str) -> list[float]: — they make LLM client code and tool schemas readable and self-documenting
    • Dataclasses / Pydantic: model request and response shapes; Pydantic is the de facto way to validate structured LLM outputs
    • Virtual environments: python -m venv .venv and pip install, or uv for speed — isolate every project
    • async/await: LLM calls are network-bound. Running 50 eval examples sequentially is slow; asyncio.gather runs them concurrently
    • Generators / streaming: token streams from LLM APIs arrive incrementally — for chunk in stream:

    Why it matters

    Most AI engineering code is glue: call a model, validate its output, retry on failure, log the result. Clean Python with types and async turns a fragile demo into something you can ship and test.

    Build core machine learning literacy: supervised learning, train/validation/test splits, overfitting, and what "a model" is, without needing to train one from scratch
    Work confidently with REST APIs and JSON, and learn to manage API keys and secrets with environment variables

    Every AI app talks to model providers over HTTP with an API key. Leaking that key is the most common, most expensive beginner mistake.


    Rules

    • Never hardcode keys in source. Use environment variables: os.environ["OPENAI_API_KEY"]
    • Never commit .env — add it to .gitignore. Use .env.example with blank values for documentation
    • Rotate keys if one is ever exposed in a commit, screenshot, or log
    • Set spend limits in the provider dashboard so a runaway loop cannot drain your budget

    Reading responses

    LLM APIs return JSON. You will parse fields like choices[0].message.content, usage.total_tokens, and stop_reason. Get comfortable inspecting JSON responses before building on top of them.

    Understand what AI engineering is, how it differs from ML engineering and full-stack engineering, and the three layers of the AI stack (application development, model development, infrastructure)

    Chip Huyen's framing, which the rest of this roadmap follows: AI engineering is about building applications on top of foundation models that already exist, not training models from scratch.


    The shift

    Traditional ML Engineering AI Engineering
    Start from data, train a model Start from a pre-trained foundation model
    Feature engineering, tabular data Prompt engineering, context construction
    Model training is the core work Adaptation and evaluation are the core work
    Weeks to a first model Minutes to a first working prototype

    The three layers of the AI stack

    1. Application development — prompts, context, evaluation, the product. Where most AI engineers work.
    2. Model development — training, finetuning, dataset engineering, inference optimization.
    3. Infrastructure — serving, compute, monitoring.

    Why this matters for your career

    Because the model is pre-built, the differentiators are no longer 'can you train a model' but 'can you evaluate, ground, and ship one reliably.' That is why this roadmap spends entire sections on evaluation, RAG, agents, and guardrails.

    Frequently Asked Questions

    What does an AI engineer actually do?

    An AI engineer builds applications on top of foundation models (LLMs and multimodal models) rather than training models from scratch. The day-to-day is prompt engineering, retrieval-augmented generation (RAG), building and evaluating agents, designing evaluation pipelines, adding guardrails, and shipping reliable, observable AI features to production. Chip Huyen frames it as adapting foundation models to real-world problems.

    What is the difference between an AI engineer and a machine learning engineer?

    ML engineering builds applications on traditional models, with more tabular data, feature engineering, and model training. AI engineering builds on top of pre-trained foundation models, with more prompt engineering, context construction, retrieval, and parameter-efficient finetuning. Most AI engineering work starts from a model that already exists and focuses on adapting and evaluating it for a specific use case.

    Do I need a PhD or deep math to become an AI engineer?

    No. You need solid Python, comfort with APIs, and a working understanding of how transformers and foundation models behave. This roadmap teaches the deep learning foundations you need (attention, tokenization, sampling) without requiring you to train a model from scratch. The biggest skills hiring managers look for in 2026 are RAG, agents, and evaluation.

    What should I learn first for AI engineering?

    Start with strong Python and the ability to call LLM APIs (OpenAI, Anthropic), then learn prompt engineering and evaluation. Evaluation is the single most underrated skill: nearly every senior AI engineer job description asks for experience designing eval pipelines, golden datasets, and LLM-as-a-judge workflows before any finetuning.

    Which projects should an AI engineer build for a portfolio?

    Build three end-to-end projects: a production RAG system with retrieval evaluation, an LLM agent with tool use and a failure-mode evaluation, and a reusable LLM evaluation pipeline with a golden dataset and LLM-as-a-judge. These map directly to what AI engineer job postings ask for and show you can ship reliable, evaluated AI systems, not just demos.

    Sign up for free courses and get early access to AI-powered grading, quizzes, and curated learning resources for each roadmap step.