LLM Agent with Tools and Failure-Mode Evaluation
Build an agent that plans, calls real tools (function calling), manages memory, and recovers from failures, then evaluate it on its trajectory and failure modes, not just happy-path demos.
This project was designed by data engineering professionals to simulate real-world scenarios used at companies like Netflix, Airbnb, and Spotify. Master Python, Anthropic API, OpenAI API and 2 more technologies through hands-on implementation. Rated advanced level with comprehensive documentation and starter code.
๐ค Project: LLM Agent with Tools and Failure-Mode Evaluation
๐ Project Overview
In this project you'll build an AI agent: an LLM that runs in a reason-act loop, calling tools to accomplish a multi-step task. Then you'll do the part most tutorials skip, the part that gets people hired: evaluate the agent on reliability โ its trajectory, its failure modes, and its recovery, not just whether the demo worked once.
Agentic workflows appear in nearly every senior AI engineer job description (Cohere, Snowflake, ShopFully, Pennylane all list them). What separates a toy from a hireable skill is treating reliability and evaluation as first-class.
๐ฏ Learning Objectives
- Implement the agent reason-act loop with tool/function calling
- Design clear tool schemas and validate tool calls before executing them
- Add planning (ReAct-style reasoning) and a step cap to prevent infinite loops
- Implement short-term memory and state management across steps
- Add guardrails: verification, human-in-the-loop for risky actions, prompt-injection awareness
- Evaluate the agent on trajectory quality, success rate, step count, and cost
๐งฐ Prerequisites
- Python 3.10+ and a virtual environment
- An LLM API key with tool-use/function-calling support (Anthropic or OpenAI)
- Optional: LangGraph for explicit agent state graphs (you can also build the loop by hand, which is more instructive)
โ Estimated Time
Duration: 10โ15 hours
Difficulty: Advanced
๐ฏ Choose a Task with Real Tools
Pick a task that genuinely needs multiple steps and tools. Good options:
- Research assistant: search the web, read pages, synthesize a cited answer
- Data analyst agent: query a SQL database, compute, and report findings
- Coding helper: read files, run a function, fix and re-run on failure
- Ops assistant: check a status API, decide, and take a (reversible, sandboxed) action
Pick a task where you can write 15-20 test scenarios with a known "good trajectory" (which tools should be called, in what rough order). That set is your evaluation backbone.
๐ฆ Suggested Project Structure
llm-agent/
โโโ tools/
โ โโโ registry.py # tool definitions + JSON schemas
โ โโโ implementations.py
โโโ agent/
โ โโโ loop.py # reason-act loop, step cap
โ โโโ planner.py # ReAct / planning prompt
โ โโโ memory.py # short-term + state
โ โโโ guardrails.py # validation, human-in-the-loop, injection checks
โโโ eval/
โ โโโ scenarios.jsonl # tasks + expected trajectories
โ โโโ trajectory_eval.py
โ โโโ report.py
โโโ run.py
โโโ README.md
โโโ .env.example
๐ Step-by-Step Guide
1. ๐งฐ Define tools
- Implement 3-5 real tools as plain functions (search, fetch, query, calculate, etc.).
- Give each a clear description and a strict JSON schema for its arguments. The model picks tools from your descriptions, so write them carefully.
2. ๐ Build the reason-act loop
- Implement the loop: send the task + tool definitions โ model returns either a tool call or a final answer โ execute the tool โ feed the result back โ repeat.
- Add a step cap so the agent can never loop forever.
- Log every step (thought, tool call, arguments, observation) โ this trace is what you'll evaluate and debug.
3. ๐ง Add planning and memory
- Use a ReAct-style prompt so the model reasons explicitly before each action.
- Implement short-term memory (the running trace) and state (what's done, intermediate results). If using LangGraph, model this as a state graph.
4. ๐ก๏ธ Add guardrails
- Validate tool calls against their schema before executing; reject malformed calls.
- Treat all tool/retrieved output as untrusted (prompt-injection defense) โ never let it grant new permissions.
- For any irreversible action, require a human-in-the-loop confirmation step.
- Add verification and repair: when a step fails, retry or escalate instead of crashing.
5. ๐งช Evaluate the agent
- For each scenario, run the agent and score:
- Success rate: did it achieve the goal?
- Trajectory quality: did it pick the right tools in a sensible order, without wasted steps? (LLM-as-judge over the trace works well.)
- Efficiency: number of steps and total token cost
- Failure handling: inject failures (a tool errors, returns junk, or contains an injection attempt) and measure whether the agent recovers safely
- Produce a report table across all scenarios.
6. ๐ง Iterate
- Improve tool descriptions, planning prompt, or guardrails; re-run the eval. Show a before/after on success rate or wasted steps.
โ Deliverables
- A working agent that completes a multi-step task using tools, with a full step trace
- 3-5 tools with strict schemas and validation
- Guardrails: step cap, tool-call validation, human-in-the-loop for risky actions, injection handling
- An evaluation suite of 15-20 scenarios (including failure-injection cases) with:
- Success rate, trajectory quality, step count, and cost
- A before/after table for at least one improvement
- A
README.mdwith an architecture diagram, the eval results, and an honest discussion of failure modes and limits
๐ Optional Extensions
- Split into multi-agent roles (planner, executor, reviewer) and compare reliability vs the single agent
- Add long-term memory (persist facts across runs) and measure its effect
- Add a cost/latency budget the agent must respect, and route easy steps to a cheaper model
- Compare a hardcoded workflow vs the agent on the same tasks, and discuss when the agent is actually worth it