Everyone is racing to deploy AI agents. Far fewer people can answer a deceptively simple question: how do you actually know if an agent is any good? A chatbot is easy to judge — you read its answer. An agent that plans, calls tools, browses, writes to your systems, and works over many steps is a much harder thing to measure. Get the evaluation wrong and you can ship something that demos beautifully and fails quietly in production.
A survey titled Evaluation and Benchmarking of LLM Agents: A Survey, by Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip, maps this fragmented field into a clear framework. It's one of the more useful reference points we've read for anyone trying to evaluate agents seriously — so here's our plain-English summary of their work, and why it matters if you're putting agents to work in a real business.
This is our summary of the authors' research; all credit for the framework and findings belongs to them. We'd encourage reading the full paper.
The core idea: two questions, not one
The survey's central contribution is a two-dimensional taxonomy that organizes the messy landscape of agent evaluation into two questions:
- What to evaluate — the objectives. The authors group these into agent behavior, capabilities, reliability, and safety. In other words: did it do the right things, can it do hard things, does it do them consistently, and does it avoid doing harmful things?
- How to evaluate — the process. This covers interaction modes, the datasets and benchmarks used, how metrics are computed, and the tooling that supports evaluation.
Splitting "what" from "how" sounds obvious, but it's genuinely clarifying. A lot of confusion in agent evaluation comes from mixing the two — arguing about benchmarks (the how) without being explicit about which capability or risk you're actually trying to measure (the what).
Why evaluating agents is harder than evaluating models
A standard language-model benchmark asks a question and scores the answer. Agents break that pattern in a few ways the survey draws out:
- They act over many steps. Success isn't one output — it's a trajectory of decisions, tool calls, and recoveries from mistakes. A single wrong turn early can doom the whole task.
- They use tools and environments. Evaluation has to account for whether the agent chose the right tool, called it correctly, and interpreted the result — not just the final text.
- Outcome ≠ process. An agent can reach the right answer through luck or a broken path, or fail a task despite mostly sound reasoning. Judging only the end result hides both problems.
- Static benchmarks age fast. Once a benchmark is public, its tasks can leak into training data, and scores inflate without real capability gains. Agent evaluation needs more dynamic, harder-to-game approaches.
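To make the "outcome ≠ process" point concrete, here is a minimal Python sketch that scores the same run two ways. It is our illustration, not code from the survey: the Step and Run types, the tool names, and the position-by-position comparison against a reference tool sequence are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str      # tool the agent chose at this step (hypothetical names)
    call_ok: bool  # whether the call succeeded (valid arguments, no error)


@dataclass
class Run:
    steps: list[Step]
    final_answer: str


def outcome_score(run: Run, expected_answer: str) -> float:
    """1.0 if the final answer matches, else 0.0. Ignores how the agent got there."""
    return 1.0 if run.final_answer.strip() == expected_answer.strip() else 0.0


def process_score(run: Run, reference_tools: list[str]) -> float:
    """Fraction of reference steps where the agent picked the expected tool
    and the call succeeded. Steps are compared position by position, which is
    a simplification: real trajectories may have several valid orderings."""
    if not reference_tools:
        return 1.0
    matched = sum(
        1
        for step, expected in zip(run.steps, reference_tools)
        if step.tool == expected and step.call_ok
    )
    return matched / len(reference_tools)


reference = ["database_query", "calculator"]

# Right answer reached through a broken path.
lucky = Run([Step("web_search", True), Step("calculator", False)], "42")
# Sound path that still ends in the wrong answer.
sound = Run([Step("database_query", True), Step("calculator", True)], "41")

for name, run in [("lucky", lucky), ("sound", sound)]:
    print(name, outcome_score(run, "42"), process_score(run, reference))
```

Reporting the two numbers side by side is the point: an outcome-only score would rank the lucky run above the sound one, and a process-only score would do the reverse.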
The part most builders overlook: enterprise reality
The section we found most valuable is where the authors highlight enterprise-specific challenges that most academic evaluation quietly ignores. If you're deploying agents inside a company rather than topping a leaderboard, these are exactly the things that bite:
- Role-based access to data. Real agents operate under permissions. An agent that performs well with unrestricted data access may behave very differently when it can only see what a given user is allowed to see.
- Reliability guarantees. A demo that works 8 times out of 10 is exciting. A production process that works 8 times out of 10 is a liability. Consistency matters more than peak performance.
- Dynamic, long-horizon interactions. Business work unfolds over long, evolving sessions — not tidy one-shot tasks — and evaluation needs to reflect that.
- Compliance. Regulated environments demand auditability and adherence to policy, which most benchmarks don't test at all.
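The "8 times out of 10" point is easy to quantify. The sketch below is our own illustration rather than a metric taken from the survey. It assumes runs are independent (a simplification) and uses hypothetical task names. It shows how quickly a per-run success rate erodes when a process has to work every time, and how an average pass rate can hide tasks that fail intermittently.

```python
def all_runs_succeed_probability(per_run_success: float, runs: int) -> float:
    """Probability that every one of `runs` attempts succeeds,
    assuming attempts are independent (a simplifying assumption)."""
    return per_run_success ** runs


def mean_pass_rate(results: dict[str, list[bool]]) -> float:
    """Share of all individual runs that succeeded, pooled across tasks."""
    outcomes = [ok for runs in results.values() for ok in runs]
    return sum(outcomes) / len(outcomes)


def consistency_rate(results: dict[str, list[bool]]) -> float:
    """Share of tasks that succeeded on every repeated run."""
    return sum(all(runs) for runs in results.values()) / len(results)


# An agent that succeeds 80% of the time, asked to do the job 5 and 10 times.
for k in (1, 5, 10):
    print(k, round(all_runs_succeed_probability(0.8, k), 3))

# Hypothetical repeated-run results for two tasks.
results = {
    "process_refund": [True, True, True],
    "reconcile_invoice": [True, False, True],
}
print(round(mean_pass_rate(results), 3), consistency_rate(results))
```

Under these assumptions, an 80% per-run agent completes five runs in a row without a failure only about a third of the time. That is why repeated-run consistency is worth tracking separately from a single pass rate.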
This is the gap between "impressive AI agent" and "AI you can actually trust with real work," and it's why the authors argue current research often overstates readiness for deployment.
Where the field needs to go
The survey points toward evaluation that is holistic (measuring behavior, capability, reliability, and safety together rather than cherry-picking one), more realistic (reflecting the messy, permissioned, long-running conditions of real use), and scalable (so assessment can keep pace with rapidly improving agents without constant manual rebuilding of benchmarks).
Why this matters if you're deploying agents
The practical takeaway for teams: decide what "good" means before you pick a benchmark, and weight reliability and safety at least as heavily as raw capability. An agent that's brilliant but inconsistent, or capable but unaware of data permissions, is not ready for production no matter how strong its benchmark scores look.
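One way to act on this is to write down a minimum bar for each objective before looking at any scores, then gate on every bar separately instead of averaging them. The sketch below is a hypothetical illustration: the threshold values and the 0-to-1 score scale are our assumptions, not numbers from the survey.

```python
# Hypothetical minimum bars, agreed before any benchmark is chosen.
THRESHOLDS = {
    "behavior": 0.90,
    "capabilities": 0.80,
    "reliability": 0.95,
    "safety": 0.99,
}


def failing_objectives(
    scores: dict[str, float],
    thresholds: dict[str, float] = THRESHOLDS,
) -> list[str]:
    """Return objectives whose score is below its bar.
    A missing score counts as a failure, so nothing is skipped silently."""
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]


# Strong capability does not offset weak reliability.
scores = {"behavior": 0.95, "capabilities": 0.97, "reliability": 0.80, "safety": 0.995}
blockers = failing_objectives(scores)
print("ready" if not blockers else f"not ready: {blockers}")
```

Gating per objective means a brilliant capability score cannot compensate for a reliability or safety shortfall, which an averaged score would allow.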
This is exactly the lens we apply at Odella. The difference between an interesting AI demo and an AI employee you can actually delegate work to is precisely the stuff this survey foregrounds — reliability across long-running tasks, respect for role-based access, and behavior you can audit. Capability gets you the demo; reliability gets you the hire.
If you're evaluating agents for your own business, this survey is a strong framework for structuring your thinking. Start with the objectives (behavior, capabilities, reliability, safety) and only then argue about how to measure them.
Source: Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip. Evaluation and Benchmarking of LLM Agents: A Survey. arXiv:2507.21504 (2025). Read it at arxiv.org/abs/2507.21504.
