Field notes on testing AI agents
Mid-conversation tangent
A customer is halfway through a return flow with your agent. They've shared the order number, the item, and the reason for the return. They then pause to ask: "Wait, do you offer…
Voxli
The multi-turn failures that prompt evals can't see
Most agent failures we see in pilots don't show up on prompt evals.
Voxli
The 10-minute test that stops your agent from canceling real orders
Last week a failed tool call caused GPT-5.4-mini to cancel a real order simply because a customer asked a question involving cancellation. Here's a quick test that catches it.
Voxli
Expertise.ai teams up with Voxli to solve the "absolute insanity" of their AI sales agent testing workflow
Expertise.ai is a known disruptor in the AI space, building AI sales agents that guide prospects through personalized flows. Here's how Voxli untangled their testing workflow.
Mahey Qadir
The Failed Tool Call When Simulating a Customer Conversation Across Three LLMs
To assess how AI agents handle tool calls, we recently ran the same multi-turn conversation across the three tiers of OpenAI's GPT-5.4: standard, mini, and nano.
Mahey Qadir
Testing for Speculation using Voxli
In our last post we covered the risks of agent speculation. Today we look at how to set up Voxli to catch that speculation, using a feature called Hallucination detection.
Mahey Qadir
The Risks of Agent Speculation
It’s no surprise that hallucinations are a common, well-known failure in agentic AI testing. The agent starts to overpromise, begins to fabricate answers, and even claims that it…
Voxli