Building effective AI agents presents unique challenges that traditional software development approaches cannot adequately address. As AI systems become increasingly integrated into business workflows, understanding how to create reliable, user-centered agent experiences has become critical for organizations seeking automation solutions.
The Reality of AI Agent Development
Many developers initially approach AI agent development with optimism, often experimenting with frameworks like LangChain, tweaking prompts, and integrating various tools. While initial prototypes may show promise, the reality of production deployment reveals significant complexities that extend far beyond the initial proof of concept.
The fundamental challenge lies in the nondeterministic nature of AI systems. Unlike traditional software where inputs produce predictable outputs, AI agents operate in probabilistic environments where user behavior adds another layer of unpredictability. Users consistently interact with AI products in ways that developers cannot anticipate, creating scenarios that require continuous adaptation and improvement.
Building the Data Flywheel
Successful AI agent development requires establishing what industry experts call a 'data flywheel': a continuous cycle of user feedback collection, pattern analysis, and iterative improvement. This process begins immediately after deploying the initial version to users.
The flywheel operates through several key stages: collecting user feedback and usage data, identifying patterns and failure modes, developing targeted improvements, and deploying enhanced features that attract more users, ultimately generating more data for the next iteration. This cyclical process forms the foundation of sustainable AI agent development.
Instrumenting for Success
Effective data collection starts with comprehensive code instrumentation. Modern observability platforms provide straightforward integration for tracking completion calls, but successful AI agent development requires capturing much more detailed information.
Essential data points include tool calls and their responses, errors from tool executions, pre- and post-processing steps, and complete state transitions throughout the agent's decision-making process. Recording data in the same format used during runtime significantly simplifies the conversion of production traces into evaluation datasets.
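As a rough sketch, instrumentation along these lines might look like the following, where the `TraceEvent` schema and the JSON-lines file are illustrative choices rather than a prescribed format:

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

# Hypothetical trace event covering the data points above:
# tool calls, tool results or errors, and state transitions.
@dataclass
class TraceEvent:
    run_id: str
    event_type: str   # e.g. "tool_call", "tool_result", "tool_error", "state_transition"
    name: str         # tool name or state label
    payload: dict     # arguments, response body, error details, etc.
    timestamp: float

def record_event(path: str, event: TraceEvent) -> None:
    # Append events as JSON lines in the same shape used at runtime,
    # so production traces can later be loaded directly as evaluation cases.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")

# Example: record a single tool call from a run.
run_id = str(uuid.uuid4())
record_event("traces.jsonl", TraceEvent(
    run_id=run_id,
    event_type="tool_call",
    name="search_calendar",
    payload={"query": "next team sync"},
    timestamp=time.time(),
))
```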
This approach proves particularly valuable for tool calls that produce side effects, as the recorded data enables proper mocking during evaluation without compromising test validity.
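Building on the trace format sketched above, the replay side might look like the following; the `ReplayTool` class is an illustrative stand-in rather than any particular library's API:

```python
import json
from collections import defaultdict

def load_mock_responses(trace_path: str) -> dict:
    # Group recorded tool results by tool name so an evaluation harness can
    # replay them instead of re-executing side-effecting calls.
    responses = defaultdict(list)
    with open(trace_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event["event_type"] == "tool_result":
                responses[event["name"]].append(event["payload"])
    return responses

class ReplayTool:
    """Stands in for a real tool during evaluation, returning recorded outputs."""
    def __init__(self, name: str, recorded: list):
        self.name = name
        self._queue = list(recorded)

    def __call__(self, **kwargs):
        # Pop responses in the order they were observed in production.
        return self._queue.pop(0) if self._queue else {"error": "no recorded response"}

# Usage during evaluation: swap the real tool for its replay double.
mocks = load_mock_responses("traces.jsonl")
search_calendar = ReplayTool("search_calendar", mocks["search_calendar"])
```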
Collecting Actionable Feedback
User feedback forms the cornerstone of AI agent improvement, but traditional feedback mechanisms often prove insufficient. While explicit feedback through thumbs up/down buttons provides high-signal data, users rarely engage with these interfaces voluntarily.
Strategic Feedback Collection
Successful feedback collection requires identifying critical moments in the user journey where engagement likelihood increases significantly. For example, requesting feedback immediately after an agent completes a test run, regardless of the outcome, can substantially improve submission rates.
The key principle involves asking for feedback when users are most engaged with the results - either because they're satisfied with the outcome or frustrated with the performance. This contextual approach respects user attention while maximizing feedback quality.
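A rough sketch of that trigger logic, with a hypothetical `RunResult` type and a per-user daily cap added as an assumption to avoid over-prompting:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    run_id: str
    status: str   # "succeeded" or "failed"
    user_id: str

def should_request_feedback(result: RunResult, prompts_sent_today: int, daily_cap: int = 1) -> bool:
    # Ask right after a test run completes, regardless of outcome,
    # but cap prompts per user per day so the request respects attention.
    return prompts_sent_today < daily_cap

def feedback_prompt(result: RunResult) -> str:
    # Tailor the wording to the moment rather than showing a generic survey.
    if result.status == "succeeded":
        return "The run finished. Did the result match what you expected?"
    return "The run hit a problem. What were you trying to accomplish?"
```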
Mining Implicit Signals
Implicit feedback often provides more abundant data than explicit feedback mechanisms. User behaviors reveal satisfaction or dissatisfaction through various interaction patterns that can be systematically tracked and analyzed.
Strong positive signals include users activating agents after testing phases, copying model responses for further use, and engaging in extended productive conversations. Negative signals manifest through users repeatedly rephrasing requests, expressing frustration in messages, sending stop commands, or abandoning sessions after brief interactions.
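These behavioral events translate naturally into a simple scoring pass, for example (the event names are hypothetical; a real product would emit whatever its interface actually tracks):

```python
# Hypothetical event names for illustration.
POSITIVE_SIGNALS = {"agent_activated", "response_copied", "long_productive_session"}
NEGATIVE_SIGNALS = {"request_rephrased", "stop_command", "session_abandoned"}

def score_session(events: list[str]) -> int:
    # Crude net score: +1 per positive signal, -1 per negative signal.
    return sum(
        1 if e in POSITIVE_SIGNALS else -1 if e in NEGATIVE_SIGNALS else 0
        for e in events
    )

print(score_session(["response_copied", "request_rephrased", "request_rephrased"]))  # -1
```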
Advanced approaches include using language models to detect and categorize user frustration patterns, though this requires significant fine-tuning to understand frustration within specific product contexts. Traditional user metrics like churn rates, session duration, and feature adoption also provide valuable implicit feedback when properly analyzed.
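A hedged sketch of the language-model approach, where `complete` stands in for whichever completion client a product already uses and the label set is an assumption:

```python
FRUSTRATION_PROMPT = """You are reviewing a conversation between a user and an AI agent.
Classify the user's final state as one of: satisfied, neutral, frustrated.
Consider repeated rephrasing, explicit complaints, and abrupt exits.

Conversation:
{conversation}

Answer with a single word."""

def classify_frustration(conversation: str, complete) -> str:
    # `complete` is a placeholder for the product's existing LLM client.
    label = complete(FRUSTRATION_PROMPT.format(conversation=conversation)).strip().lower()
    return label if label in {"satisfied", "neutral", "frustrated"} else "neutral"
```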
Understanding Data at Scale
Individual run analysis provides important insights, but understanding systemic patterns requires robust tooling and analytical approaches. Many organizations find value in both purchasing existing LLMOps solutions and building custom internal tools tailored to their specific needs.
Custom tooling development has become increasingly accessible with modern coding assistants, offering significant advantages in domain-specific analysis and workflow optimization. The ability to convert any interesting interaction or failure into an evaluation with minimal friction should become an organizational instinct.
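In that spirit, a small helper for turning a recorded run into a test case might look like this, reusing the JSON-lines trace format assumed earlier (the function and its fields are illustrative, not a prescribed schema):

```python
import json

def promote_to_eval(trace_path: str, run_id: str, expectation: str,
                    eval_path: str = "evals.jsonl") -> None:
    # Pull every event for one run out of the production trace and store it
    # alongside a human-written expectation, turning the run into a test case.
    events = []
    with open(trace_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("run_id") == run_id:
                events.append(event)
    case = {"run_id": run_id, "expectation": expectation, "events": events}
    with open(eval_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")

# One call is all the friction there should be between "interesting run" and "eval case".
promote_to_eval("traces.jsonl", "example-run-id",
                "Agent should retry the calendar API after a transient error")
```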
Systematic Pattern Recognition
Large-scale data analysis reveals failure patterns, tool performance issues, and user interaction trends that inform product roadmap decisions. Feedback aggregation, clustering analysis, and failure mode categorization create natural prioritization frameworks for improvement efforts.
Recent innovations include using reasoning models to explain system failures by analyzing complete trace data, instructions, and contextual information. These models prove surprisingly effective at identifying root causes and directing attention to critical issues that might otherwise be overlooked.
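A sketch of how such a failure-analysis pass might be framed, again with `complete` as a placeholder for the reasoning-model client in use:

```python
import json

FAILURE_ANALYSIS_PROMPT = """You are debugging an AI agent run that ended badly.
Given the agent's instructions and the full trace of tool calls, results, and errors,
identify the most likely root cause and the first trace event where things went wrong.

Instructions:
{instructions}

Trace (one JSON object per line):
{trace}

Respond with: root cause, first bad event index, suggested fix."""

def explain_failure(instructions: str, trace_events: list[dict], complete) -> str:
    # Feed the complete trace and context to a reasoning model and let it
    # point attention at the likely root cause.
    trace = "\n".join(json.dumps(e) for e in trace_events)
    return complete(FAILURE_ANALYSIS_PROMPT.format(instructions=instructions, trace=trace))
```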
Evaluation Framework Hierarchy
Effective AI agent evaluation requires multiple testing approaches organized in a hierarchical structure similar to the traditional testing pyramid. This framework includes unit-test-like evaluations at the foundation, trajectory evaluations in the middle tier, and A/B testing with staged rollouts at the top.
Unit Test Evaluations
Unit test evaluations focus on predicting specific state transitions, making them ideal for simple assertions and targeted failure mode testing. These evaluations can verify tool call accuracy, parameter correctness, keyword presence, or completion detection.
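As a concrete illustration, a unit-test-style evaluation can stay as narrow as a couple of assertions; the `create_calendar_event` tool below is a hypothetical example:

```python
def test_schedules_meeting_with_correct_tool():
    # predicted_step would come from the agent under test (generated live or
    # replayed from a recorded trace); it is hardcoded here for illustration.
    predicted_step = {
        "tool": "create_calendar_event",
        "arguments": {"title": "Team sync", "duration_minutes": 30},
    }
    # Assert only the single state transition under test, nothing more.
    assert predicted_step["tool"] == "create_calendar_event"
    assert predicted_step["arguments"]["duration_minutes"] <= 60
```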
Unit tests provide an excellent starting point for evaluation development because they're relatively easy to implement and help establish the analytical mindset required for data-driven improvement. However, they work best for addressing specific, identified problems rather than for converting every positive interaction into a test case.
Trajectory Evaluations
While unit tests excel at specific checks, they can create blind spots when comparing different model capabilities. Stronger models may perform worse on narrow unit tests because they approach problems differently, even when achieving better overall outcomes.
Trajectory evaluations address this limitation by allowing agents to run to completion and evaluating both the final state and the entire execution path. This approach better captures the holistic performance of different models and strategies.
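One way such an evaluation might be structured, assuming a `run_agent` harness that executes the agent to completion and returns its final state plus the ordered steps it took (both names are placeholders):

```python
def evaluate_trajectory(run_agent, task: str) -> dict:
    # Run to completion, then judge both the end state and the path taken.
    final_state, steps = run_agent(task)
    tool_calls = [s for s in steps if s["type"] == "tool_call"]
    allowed_tools = {"search_calendar", "create_calendar_event"}  # illustrative

    return {
        # Did the agent end up where we wanted?
        "goal_reached": final_state.get("status") == "done",
        # Was the path reasonable, regardless of the exact route taken?
        "step_count": len(steps),
        "few_tool_errors": sum(1 for s in steps if s["type"] == "tool_error") <= 1,
        "used_allowed_tools_only": all(c["name"] in allowed_tools for c in tool_calls),
    }
```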
However, trajectory evaluations present implementation challenges, particularly when dealing with tools that cause side effects. Rather than mocking these interactions, many organizations create synthetic copies of user environments to maintain realistic test conditions, though this approach significantly increases complexity and execution time.
Advanced Evaluation Techniques
Language model judges offer sophisticated evaluation capabilities but require careful implementation to avoid bias and ensure accurate assessments. Rubric-based scoring represents a promising approach where human-crafted criteria guide LLM evaluation for each specific test case.
For example, a rubric might specify: 'Did the agent appropriately handle an unexpected calendar API error by implementing retry logic?' This approach provides more targeted and reliable evaluation than generic scoring mechanisms.
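A sketch of a rubric-driven judge along those lines, with `complete` once more standing in for the judging model's client:

```python
RUBRIC_JUDGE_PROMPT = """You are grading a single AI agent run against a rubric written for this case.

Rubric:
{rubric}

Agent trace:
{trace}

Answer "pass" or "fail", then one sentence of justification."""

def judge_with_rubric(rubric: str, trace: str, complete) -> bool:
    # A per-case rubric keeps the judge focused on the behavior that matters here.
    verdict = complete(RUBRIC_JUDGE_PROMPT.format(rubric=rubric, trace=trace))
    return verdict.strip().lower().startswith("pass")

# The rubric from the example above, applied to one recorded trace.
rubric = ("Did the agent appropriately handle an unexpected calendar API error "
          "by implementing retry logic?")
```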
Strategic Evaluation Principles
The ultimate goal of evaluation systems extends beyond maximizing scores on internal benchmarks. When useful metrics become primary targets, they often cease being useful metrics. Organizations achieving near-perfect scores on evaluation datasets should question whether their tests adequately represent real-world challenges.
Balanced Dataset Strategy
Progressive organizations maintain two distinct evaluation pools: regression datasets that ensure new changes don't break existing functionality, and aspirational datasets containing extremely challenging scenarios that push system capabilities to their limits.
This dual approach prevents organizations from becoming complacent about current performance while maintaining reliability for existing use cases.
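Keeping the two pools separate in reporting can be as simple as the following sketch; the pool names and the 100% gate on the regression pool are assumptions about how a team might choose to operate:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    name: str
    pool: str     # "regression" or "aspirational"
    passed: bool

def summarize(cases: list[EvalCase]) -> None:
    regression = [c for c in cases if c.pool == "regression"]
    aspirational = [c for c in cases if c.pool == "aspirational"]

    reg_rate = sum(c.passed for c in regression) / max(len(regression), 1)
    asp_rate = sum(c.passed for c in aspirational) / max(len(aspirational), 1)

    # The regression pool gates releases: any drop means an existing behavior broke.
    status = "OK" if reg_rate == 1.0 else "investigate before shipping"
    print(f"regression: {reg_rate:.0%} ({status})")

    # The aspirational pool is tracked over time and is expected to stay well below 100%.
    print(f"aspirational: {asp_rate:.0%} (headroom for future improvement)")
```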
User-Centric Validation
The most reliable evaluation method remains A/B testing with real users. Rather than optimizing for laboratory metrics, successful organizations route a small percentage of traffic to new models or approaches while monitoring user satisfaction, activation rates, and retention metrics.
This approach provides the most accurate assessment of whether changes actually improve user experiences rather than simply improving internal benchmarks.
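For the routing piece, deterministic hash-based assignment is one common pattern; the 5% share and experiment name below are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, candidate_share: float = 0.05) -> str:
    # Deterministic bucketing: the same user always lands in the same arm,
    # and roughly `candidate_share` of users see the candidate model.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "candidate_model" if bucket < candidate_share else "current_model"

print(assign_variant("user-123", "model-upgrade-experiment"))
```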
Implementation Recommendations
Organizations beginning AI agent development should start with comprehensive instrumentation and simple unit test evaluations to build analytical capabilities. Focus on identifying high-impact failure modes through user feedback and usage patterns before investing in complex evaluation infrastructure.
Develop internal tooling that enables rapid conversion of interesting interactions into test cases, and maintain a balance between regression testing and aspirational challenges. Most importantly, never lose sight of user satisfaction as the ultimate success metric, using A/B testing to validate that internal improvements translate to better user experiences.
Building effective AI agents requires sustained commitment to data-driven improvement cycles. Organizations that successfully implement these evaluation frameworks position themselves to create increasingly capable and reliable AI systems that truly serve their users' needs.