Teaching Robots to Infer Human Intent

FISER helps robots understand ambiguous instructions by reasoning about human intentions and actions, improving their ability to assist in real-world tasks.

#agent #hci

Sep 27, 2024
leeron

To create AI agents capable of understanding and executing human instructions in real-world environments, researchers from the University of Washington and MIT propose a novel framework called Follow Instructions with Social and Embodied Reasoning (FISER).

This system is designed to address the inherent ambiguity in natural language instructions by leveraging both social and embodied reasoning.

In daily interactions, humans often leave out details from instructions, assuming their conversational partners can infer missing information. For example, when someone asks a robot, "Could you pass that from the sofa?", it may not be immediately clear what "that" refers to without additional context about the human’s actions or intentions.

An example of an unclear situation.

Traditional AI systems struggle with such ambiguity, as they typically translate language directly into robot actions without considering the human’s broader goals or previous behavior.

FISER introduces a new approach by first using social reasoning to infer what the human actually wants based on the task they’re performing. For instance, if a human has been packing books into a box, the system can deduce that "that" likely refers to a remaining book on the sofa. Once this inference is made, the robot uses embodied reasoning to plan and carry out the necessary actions, such as picking up the book and passing it to the human.
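The two-stage idea can be illustrated with a toy sketch. This is not the authors' implementation: the `Obj` type, the function names, the frequency-based heuristic for guessing the target, and the fixed action template are all assumptions made up for illustration. It only shows the shape of the pipeline, where the intent is inferred first and the plan is built second.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical scene object: a name, a category, and where it sits."""
    name: str
    category: str
    location: str


def infer_target(history, candidates):
    """Social reasoning (toy stand-in): prefer the candidate whose category
    the human has handled most often so far."""
    counts = Counter(o.category for o in history)
    # Counter returns 0 for unseen categories; ties keep the first candidate.
    return max(candidates, key=lambda o: counts[o.category])


def plan_actions(target, human="human"):
    """Embodied reasoning (toy stand-in): a fixed fetch-and-hand-over template."""
    return [
        f"move_to({target.location})",
        f"pick_up({target.name})",
        f"move_to({human})",
        f"give({target.name}, {human})",
    ]


def follow_instruction(history, scene, mentioned_location):
    """Resolve an ambiguous 'that' at a location, then plan the hand-over."""
    candidates = [o for o in scene if o.location == mentioned_location]
    if not candidates:
        return []  # nothing at that location, so there is nothing to plan
    target = infer_target(history, candidates)  # stage 1: what is wanted?
    return plan_actions(target)                 # stage 2: how to deliver it?


# The human has been packing books; "that" on the sofa could be two things.
history = [Obj("book_1", "book", "box"), Obj("book_2", "book", "box")]
scene = [
    Obj("remote", "electronics", "sofa"),
    Obj("book_3", "book", "sofa"),
    Obj("cup", "dish", "table"),
]
plan = follow_instruction(history, scene, "sofa")
print(plan)
# ['move_to(sofa)', 'pick_up(book_3)', 'move_to(human)', 'give(book_3, human)']
```

Keeping the two stages as separate functions mirrors the separation described above: the guess about what the human wants can be inspected or replaced without touching the planner.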

This approach significantly improves the robot’s ability to follow vague instructions, achieving state-of-the-art results on the HandMeThat benchmark, a challenging test designed to evaluate how well AI systems can understand and act on ambiguous human commands.

By explicitly modeling human intentions and integrating step-by-step reasoning, FISER outperforms even large language models like GPT-4 in these complex tasks.

Wan, Y., Wu, Y., Wang, Y., Mao, J., & Jaques, N. (2024). Infer Human's Intentions Before Following Natural Language Instructions. arXiv, 2409.18073. Retrieved from https://arxiv.org/abs/2409.18073v1
