🤖 Project: LLM Agent with Tools and Failure-Mode Evaluation

📌 Project Overview

In this project you'll build an AI agent: an LLM that runs in a reason-act loop, calling tools to accomplish a multi-step task. Then you'll do the part most tutorials skip, the part that gets people hired: evaluate the agent on reliability — its trajectory, its failure modes, and its recovery, not just whether the demo worked once.

Agentic workflows appear in nearly every senior AI engineer job description (Cohere, Snowflake, ShopFully, Pennylane all list them). What separates a toy from a hireable skill is treating reliability and evaluation as first-class.

🎯 Learning Objectives

Implement the agent reason-act loop with tool/function calling
Design clear tool schemas and validate tool calls before executing them
Add planning (ReAct-style reasoning) and a step cap to prevent infinite loops
Implement short-term memory and state management across steps
Add guardrails: verification, human-in-the-loop for risky actions, prompt-injection awareness
Evaluate the agent on trajectory quality, success rate, step count, and cost

🧰 Prerequisites

Python 3.10+ and a virtual environment
An LLM API key with tool-use/function-calling support (Anthropic or OpenAI)
Optional: LangGraph for explicit agent state graphs (you can also build the loop by hand, which is more instructive)

⌛ Estimated Time

Duration: 10–15 hours
Difficulty: Advanced

🎯 Choose a Task with Real Tools

Pick a task that genuinely needs multiple steps and tools. Good options:

Research assistant: search the web, read pages, synthesize a cited answer
Data analyst agent: query a SQL database, compute, and report findings
Coding helper: read files, run a function, fix and re-run on failure
Ops assistant: check a status API, decide, and take a (reversible, sandboxed) action

Pick a task where you can write 15-20 test scenarios with a known "good trajectory" (which tools should be called, in what rough order). That set is your evaluation backbone.

📦 Suggested Project Structure

llm-agent/
├── tools/
│   ├── registry.py      # tool definitions + JSON schemas
│   └── implementations.py
├── agent/
│   ├── loop.py          # reason-act loop, step cap
│   ├── planner.py       # ReAct / planning prompt
│   ├── memory.py        # short-term + state
│   └── guardrails.py    # validation, human-in-the-loop, injection checks
├── eval/
│   ├── scenarios.jsonl  # tasks + expected trajectories
│   ├── trajectory_eval.py
│   └── report.py
├── run.py
├── README.md
└── .env.example

🔄 Step-by-Step Guide

1. 🧰 Define tools

Implement 3-5 real tools as plain functions (search, fetch, query, calculate, etc.).
Give each a clear description and a strict JSON schema for its arguments. The model picks tools from your descriptions, so write them carefully.

2. 🔁 Build the reason-act loop

Implement the loop: send the task + tool definitions → model returns either a tool call or a final answer → execute the tool → feed the result back → repeat.
Add a step cap so the agent can never loop forever.
Log every step (thought, tool call, arguments, observation) — this trace is what you'll evaluate and debug.

3. 🧠 Add planning and memory

Use a ReAct-style prompt so the model reasons explicitly before each action.
Implement short-term memory (the running trace) and state (what's done, intermediate results). If using LangGraph, model this as a state graph.

4. 🛡️ Add guardrails

Validate tool calls against their schema before executing; reject malformed calls.
Treat all tool/retrieved output as untrusted (prompt-injection defense) — never let it grant new permissions.
For any irreversible action, require a human-in-the-loop confirmation step.
Add verification and repair: when a step fails, retry or escalate instead of crashing.

5. 🧪 Evaluate the agent

For each scenario, run the agent and score:
- Success rate: did it achieve the goal?
- Trajectory quality: did it pick the right tools in a sensible order, without wasted steps? (LLM-as-judge over the trace works well.)
- Efficiency: number of steps and total token cost
- Failure handling: inject failures (a tool errors, returns junk, or contains an injection attempt) and measure whether the agent recovers safely
Produce a report table across all scenarios.

6. 🔧 Iterate

Improve tool descriptions, planning prompt, or guardrails; re-run the eval. Show a before/after on success rate or wasted steps.

✅ Deliverables

A working agent that completes a multi-step task using tools, with a full step trace
3-5 tools with strict schemas and validation
Guardrails: step cap, tool-call validation, human-in-the-loop for risky actions, injection handling
An evaluation suite of 15-20 scenarios (including failure-injection cases) with:
- Success rate, trajectory quality, step count, and cost
- A before/after table for at least one improvement
A README.md with an architecture diagram, the eval results, and an honest discussion of failure modes and limits

🚀 Optional Extensions

Split into multi-agent roles (planner, executor, reviewer) and compare reliability vs the single agent
Add long-term memory (persist facts across runs) and measure its effect
Add a cost/latency budget the agent must respect, and route easy steps to a cheaper model
Compare a hardcoded workflow vs the agent on the same tasks, and discuss when the agent is actually worth it

LLM Agent with Tools and Failure-Mode Evaluation