sound and effective agent evaluation

Careful evaluation is the foundation of good science, and evaluating AI agents can be particularly hard. I have been working to understand how to accurately measure the capabilities of computer use agents, including developing an approach to principled environment design (PRISM).

I am also a firm believer in human-centered agent design: AI agents should be designed with human users in mind from the beginning, and the evaluation methodologies used during development should be oriented toward human-centered outcomes. I codified the ADEPTS framework as a way to think explicitly about the capabilities an agent needs to be useful to people, going beyond mere task execution, and about how to measure those capabilities.

research note · last updated 2026

arXiv 2026

Computer Use at the Edge of the Statistical Precipice

Pierluca D'Oro, Sneha Silwal, William Wong, Yuxuan Sun, Fanyi Xiao, Manchen Wang, Eric Gan, Allen Bolourchi, Joseph Tighe

arXiv 2025

ADEPTS: A Capability Framework for Human-Centered Agent Design

Pierluca D'Oro, Caley Drooff, Joy Chen, Joseph Tighe
