Autonomous Digital Agents Are Getting Smarter: A New Method for Evaluation and Refinement

New research demonstrates an automated approach to evaluating and improving digital agents, substantially boosting their success rates.

#research #agent

Oct 11, 2024
leeron

Digital agents, such as those that help users navigate websites or control devices, hold immense potential for simplifying our lives.

Imagine instructing a digital agent to find the cost of your latest canceled order: it flawlessly navigates through your profile and order history and returns the correct information.

However, even the most advanced agents today can make mistakes on seemingly simple tasks. This makes evaluating and refining their performance critical for researchers aiming to bring digital agents into real-world applications.

Recent research from UC Berkeley and the University of Michigan presents an innovative approach for autonomously evaluating and refining digital agents.

The researchers have developed domain-general evaluation models that can assess agent performance without requiring predefined, hand-crafted evaluation metrics or additional supervision. This automated approach is crucial in environments where it is impractical to provide human oversight at scale.

The researchers design models to evaluate and improve digital agents that browse the web or control devices.

The study proposes two main evaluation methods for digital agents: a modular "caption-then-reason" approach and an end-to-end evaluation using a vision-language model (VLM).

  • In the modular approach, a VLM first produces a textual description of the agent's environment, and a language model then reasons over the actions taken to decide whether the task succeeded (a minimal sketch follows this list).
  • The end-to-end approach prompts a single advanced model, such as GPT-4V, to judge the entire trajectory in one step.
  • Tested on popular benchmarks such as WebArena and Android-in-the-Wild, these evaluators agreed with human oracle judgments up to 92.9% of the time.
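
To make the modular pipeline concrete, here is a minimal Python sketch of a caption-then-reason evaluator. The `Trajectory` fields, the `caption_model` and `reasoning_model` stubs, and the prompt wording are illustrative assumptions, not the paper's actual prompts or models; plug in whatever VLM and LLM clients you use.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    instruction: str         # the user's task, e.g. "Find the cost of my last canceled order"
    actions: list[str]       # textual log of the actions the agent took
    final_screenshot: bytes  # image of the final environment state

def caption_model(screenshot: bytes) -> str:
    """Hypothetical VLM call: return a textual description of the screenshot."""
    raise NotImplementedError("plug in your VLM client here")

def reasoning_model(prompt: str) -> str:
    """Hypothetical LLM call: return 'success' or 'failure'."""
    raise NotImplementedError("plug in your LLM client here")

def evaluate(traj: Trajectory) -> bool:
    # Step 1 (caption): a VLM turns the final screen into text.
    description = caption_model(traj.final_screenshot)
    # Step 2 (reason): a language model judges the trajectory from text alone.
    prompt = (
        f"Task: {traj.instruction}\n"
        f"Actions taken: {traj.actions}\n"
        f"Final screen: {description}\n"
        "Did the agent complete the task? Answer 'success' or 'failure'."
    )
    return reasoning_model(prompt).strip().lower().startswith("success")
```

The end-to-end variant would collapse these two steps into a single call to a vision-language model that receives the screenshots directly.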

Moreover, the researchers used these evaluators to improve existing agents, significantly boosting success rates. For instance, using the evaluation model to guide a WebArena agent improved its success rate by 29%, while applying it in device-control settings yielded around a 75% improvement.
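
The sketch below illustrates one way an evaluator like this can drive refinement without human labels: retry the task until the evaluator accepts the result, in the spirit of Reflexion-style inference-time improvement. `run_agent`, the feedback format, and the retry budget are assumptions for illustration, not the paper's exact procedure; `Trajectory` and `evaluate` are reused from the sketch above.

```python
def run_agent(instruction: str, feedback: str = "") -> Trajectory:
    """Hypothetical agent call: attempt the task and return its trajectory."""
    raise NotImplementedError("plug in your web or device-control agent here")

def refine(instruction: str, max_attempts: int = 3) -> Trajectory | None:
    """Retry until the autonomous evaluator judges an attempt successful."""
    feedback = ""
    for attempt in range(max_attempts):
        # The agent attempts the task, conditioned on feedback from failures.
        traj = run_agent(instruction, feedback=feedback)
        if evaluate(traj):  # the evaluator stands in for a human judge
            return traj
        feedback = f"Attempt {attempt + 1} failed. Actions were: {traj.actions}"
    return None  # no attempt passed the evaluator
```

Trajectories that pass the evaluator can also be retained as training data (filtered behavior cloning), another label-free route to improving an agent.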

The ability to evaluate and refine digital agents autonomously, without manual supervision, not only improves their performance but also enables broader deployment in diverse real-world scenarios. By advancing the evaluation process for digital agents, this research brings us one step closer to reliable automated assistants that can handle complex tasks.

Pan, J., Zhang, Y., Tomlin, N., Zhou, Y., Levine, S., & Suhr, A. (2024). Autonomous Evaluation and Refinement of Digital Agents. arXiv:2404.06474. https://arxiv.org/abs/2404.06474v3
