sound and effective agent evaluation

Careful evaluation is the foundation of good science, and evaluating AI agents can be particularly hard. I have been working to understand how to accurately measure the capabilities of computer-use agents, including developing an approach for principled environment design (PRISM).

I am also a firm believer in human-centered agent design: AI agents should be designed with human users in mind from the beginning, and the evaluation methodologies used during development should be oriented toward human-centered outcomes. I codified the ADEPTS framework as a way to think explicitly about the capabilities an agent needs to be useful to people, beyond mere task execution, and about how to measure those capabilities.

research note · last updated 2026