Multimodal agents: from perception to action
Notes from Lecture 7 of this Open MOOC offered by Berkeley RDI
---
by Salesforce AI Research
Outline
- Environment/Benchmarks: should be reconfigurable and expandable
- Data: diverse modalities, large-scale trajectory data, covering a wide range of tasks
- Model/System: unified vision-language-reasoning-action model, and long-context inference.
- Environment/Benchmarks
- OSWorld - first scalable and “real” computer environment
- Real environment - evaluates open-ended tasks on arbitrary apps (not restricted to Reddit, specific websites, etc.), so tasks can be defined for agents without having to develop a new environment
- How to use
- Agent Task Config - Each task has a -
- initial state setup - initializes a virtual machine (hence re-usable and isolated)
- evaluation config file - expected output, evaluator function, app used
- Agent Observation - the agent can receive the natural language instruction, screenshots, the accessibility tree, the DOM, and customized streams such as terminal outputs and Set-of-Marks (ref: previous lecture for an image)
- Agent interaction loop - repeats until a termination condition is reached.
- Execution-based reward function - more open-ended: the user can write the evaluation script
- Single-app and multi-app workflows on Ubuntu and Windows; some tasks are infeasible, and these are included in the dataset
- LLMs and VLMs, open and closed source agents were tested - evaluation settings: SoM, screenshots, accessibility tree
- Human performance: 75%. The best-performing V/LLM was GPT-4V at around 12% (last year); Anthropic computer use reaches 22% and OpenAI Operator 38% (at 40 max steps)
- Longer text-based trajectory history context improves performance, unlike screenshot-only history
- VLM agents are not robust to UI layouts and noise
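The task config, interaction loop, and execution-based reward described above can be sketched as a toy example. This is only an illustration and not the OSWorld API: `TaskConfig`, `run_episode`, the `DONE`/`FAIL` termination actions, and the `key=value` action format are all hypothetical, and a plain dictionary stands in for the virtual machine state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]  # toy stand-in for the VM: file name -> file contents


@dataclass
class TaskConfig:
    """Hypothetical task config: instruction, initial state setup, evaluator."""
    instruction: str
    setup: Callable[[], State]          # builds a fresh, isolated initial state
    evaluator: Callable[[State], float]  # execution-based check on the final state
    max_steps: int = 15


def run_episode(task: TaskConfig, policy: Callable[[dict], str]) -> float:
    """Run the observe -> act loop until termination, then score the final state."""
    state = task.setup()
    history: List[str] = []
    for _ in range(task.max_steps):
        observation = {
            "instruction": task.instruction,
            "state": dict(state),       # in a real env: screenshot, a11y tree, ...
            "history": list(history),   # text-based trajectory history
        }
        action = policy(observation)
        history.append(action)
        if action in ("DONE", "FAIL"):  # assumed termination actions
            break
        key, _, value = action.partition("=")  # toy action: "file=contents"
        state[key] = value
    # The reward comes from inspecting the resulting state, not the action trace.
    return task.evaluator(state)


def scripted_policy(observation: dict) -> str:
    """A fixed policy used only to exercise the loop."""
    return "DONE" if observation["history"] else "report.txt=hello"


task = TaskConfig(
    instruction="Create report.txt containing 'hello'",
    setup=lambda: {},
    evaluator=lambda s: 1.0 if s.get("report.txt") == "hello" else 0.0,
)
print(run_episode(task, scripted_policy))
```

Because the evaluator only inspects the final state, any sequence of actions that produces the right outcome gets credit, which is what makes the reward "more open".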
- Data
- AgentTrek: agent trajectory synthesis via guiding replay with web tutorials
- Automatic tutorial collection from the Internet - a large GUI-specific dataset
- “Snapshot of the internet” - RedPajama
- Keyword/structure-based filtering - 99% filtered out - e.g. English-only data, content length, keywords, URL presence, after 2020, de-duplication, domain filter
- subset of the remaining data is annotated (good / bad) using an LLM - “You are an assistant…task is to identify tutorial specifically that uses GUI for web/desktop apps or operating systems…should have step by step instructions”
- Tutorial Classifier
- Tag and Paraphrase using LLM to a structured format - it has the following info
- platform and target env specific
- task description
- prerequisites
- step by step instructions
- expected outcomes
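A minimal sketch of the heuristic filtering step and the structured tutorial format listed above. The keyword list, thresholds, and all names (`GUI_KEYWORDS`, `passes_heuristics`, `deduplicate`, `Tutorial`) are assumptions made for illustration; the notes do not give the actual filter rules used by AgentTrek.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed keyword list; the real pipeline's keywords are not given in the notes.
GUI_KEYWORDS = ("click", "menu", "button", "select", "tab")


def passes_heuristics(text: str, min_len: int = 200, min_hits: int = 3) -> bool:
    """Keep a page only if it is long enough and mentions GUI actions often enough."""
    lowered = text.lower()
    hits = sum(
        len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
        for keyword in GUI_KEYWORDS
    )
    return len(text) >= min_len and hits >= min_hits


def deduplicate(texts: List[str]) -> List[str]:
    """Drop exact duplicates after normalising case and whitespace."""
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


@dataclass
class Tutorial:
    """Structured format produced by the tag-and-paraphrase step."""
    platform: str
    task_description: str
    prerequisites: List[str]
    steps: List[str]
    expected_outcome: str


gui_page = "Click the File menu, then click the Save button. " * 5
recipe_page = "Mix the flour and sugar. " * 20
kept = [page for page in deduplicate([gui_page, gui_page, recipe_page])
        if passes_heuristics(page)]
print(len(kept))
```

Cheap rules like these would run first so that the LLM annotation and the tutorial classifier only see the small fraction of pages that survive.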
- TACO: Multi-Modal action models with synthetic chain-of-thought and action (CoTA) - Multimodal LLMs with action calling capabilities
- Synthetic CoTA generation pipeline
- Model based generation
- Get (image,QA) pairs - ask MLLM to generate CoT-and-Action (action list) or CoT if there is no relevant action list
- Verify and parse the output, then store it
- Programmatic generation - starts from images only, with no QA pairs
- Use object detection, OCR, or human annotations in the image
- Templates - are used to generate QA pairs and CoTA
- action set - OCR, getobjects, localiseobjects, solvemath, querylanguagemodel, getimagetoimagesimilarity
- Question template: How many {obj} are there?
- Observations
- CoTA finetuning elicits reasoning and action capabilities that are not obtained through few shot prompting
- CoTA data recipe allows TACO to consistently beat instruction-tuned baselines by 1-4%
- Quality > Quantity (data ablations)
- smallest CoTA only dataset on average beats larger datasets of CoTA/CoT/Direct samples
- Filtering out datasets without Action data leads to performance gains
- Adding program generated data can improve performance across some datasets, but not on average
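To make the programmatic generation route concrete, here is a small sketch that turns image annotations into a QA pair plus a CoTA trace using the counting template from the notes. The record layout, the `make_counting_example` helper, and the `terminate` action are hypothetical; only the question template and the `localiseobjects` action name come from the notes above.

```python
from collections import Counter
from typing import Dict, List

QUESTION_TEMPLATE = "How many {obj} are there?"


def make_counting_example(image_id: str, detections: List[str], obj: str) -> Dict:
    """Fill the template from existing annotations; no model call is involved."""
    count = Counter(detections)[obj]  # Counter returns 0 for missing labels
    return {
        "image": image_id,
        "question": QUESTION_TEMPLATE.format(obj=obj),
        "cota": [
            {
                "thought": f"I need to find every instance of '{obj}' in the image.",
                "action": {"name": "localiseobjects",
                           "args": {"image": image_id, "objects": [obj]}},
            },
            {
                "thought": f"The detector returned {count} matching region(s).",
                # 'terminate' is an assumed name for the final answer action.
                "action": {"name": "terminate", "args": {"answer": str(count)}},
            },
        ],
        "answer": str(count),
    }


example = make_counting_example("img_001", ["dog", "cat", "dog"], "dog")
print(example["question"], example["answer"])
```

Since the answer is computed directly from the annotations, every generated trace is correct by construction, at the cost of less varied questions than model-based generation.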
- Model/System:
- Aguvis: unified pure vision agents for autonomous GUI Interaction
- Common challenges for GUIs and how Aguvis addresses them
- GUIs are complex and have different representations (HTML for web, XML for mobile, AXTree for OS) - Different observation representations result in different action grounding spaces, even on the same platform. - Aguvis resolves this by having a unified vision-based perception and action space for all GUI interactions
- Limited visual grounding capability - Aguvis improves specifically on visual action grounding capability through training
- Existing agents perform “reactive” low-level actions directly, without reasoning - Aguvis has an explicit reasoning process / inner monologue
- Training - Two Stages
- Grounding - Image observation -> low-level instruction -> grounding generation
- Planning and Reasoning - Image observation -> input instructions (including previous actions) -> planning generation (Thought:Low-Level-Instruction:Action)+
- Evaluation - offline evals on Mind2Web and AndroidControl, and online agent evals, both showed improvements
- Reasoning with inner monologue helps in solving harder tasks
- Training only on web and mobile data shows strong generalisation to desktop
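The (Thought:Low-Level-Instruction:Action) output and the unified vision-based action space can be illustrated with a short parser. This is a sketch under assumptions: the exact text layout of the model output, the `click(x=..., y=...)` syntax, and the use of coordinates normalised to [0, 1] are chosen here for illustration and are not taken from the lecture.

```python
import re
from typing import Dict, Tuple

# Assumed output layout: three labelled fields in a fixed order.
STEP_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Low-Level-Instruction:\s*(?P<instruction>.*?)\s*"
    r"Action:\s*(?P<action>.*)",
    re.DOTALL,
)
# Assumed action syntax with screen-relative coordinates in [0, 1].
CLICK_PATTERN = re.compile(r"click\(x=(?P<x>[0-9.]+),\s*y=(?P<y>[0-9.]+)\)")


def parse_step(text: str) -> Dict[str, str]:
    """Split one planning step into its thought, instruction, and action."""
    match = STEP_PATTERN.search(text)
    if match is None:
        raise ValueError("output does not follow the expected step format")
    return {name: value.strip() for name, value in match.groupdict().items()}


def to_pixels(action: str, width: int, height: int) -> Tuple[int, int]:
    """Map a normalised click to pixel coordinates for a given screen size."""
    match = CLICK_PATTERN.fullmatch(action)
    if match is None:
        raise ValueError("unsupported action: " + action)
    return round(float(match["x"]) * width), round(float(match["y"]) * height)


output = (
    "Thought: The settings page is behind the gear icon.\n"
    "Low-Level-Instruction: Click the gear icon.\n"
    "Action: click(x=0.25, y=0.5)"
)
step = parse_step(output)
print(step["instruction"], to_pixels(step["action"], 1920, 1080))
```

Expressing actions as screen-relative coordinates means the same output format applies to web, mobile, and desktop screenshots, with no dependence on HTML, XML, or accessibility-tree element identifiers.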
---
https://language-agent-tutorial.github.io/
Written on May 28, 2025
