Computer Use at the Edge of the Statistical Precipice
Pierluca D'Oro, Sneha Silwal, William Wong, Yuxuan Sun, Fanyi Xiao, Manchen Wang, Eric Gan, Allen Bolourchi, Joseph Tighe
Careful evaluation is the foundation of good science, and evaluating AI agents can be particularly hard. I have been working to understand how to accurately measure the capabilities of computer use agents, including by developing PRISM, an approach for principled environment design.
I am also a firm believer in human-centered agent design: AI agents should be designed with human users in mind from the beginning, and the evaluation methodologies used during development should be oriented toward human-centered outcomes. I codified the ADEPTS framework as a way to think explicitly about the capabilities an agent needs to be useful to people, going beyond mere task execution, and about how to measure those capabilities.
research note · last updated 2026
Pierluca D'Oro, Caley Drooff, Joy Chen, Joseph Tighe