Multimodal agents: from perception to action

Notes from Lecture 7 of this Open MOOC offered by Berkeley RDI

by Salesforce AI Research
Outline

Environment/Benchmarks: should be reconfigurable and expandable
Data: diverse modalities, large-scale trajectory data, covering a wide range of tasks
Model/System: unified vision-language-reasoning-action model, and long-context inference.

Environment/Benchmarks
1. OSWorld - first scalable and “real” computer environment
  1. Real env - evaluates open-ended tasks on arbitrary apps (not restricted to Reddit, websites, etc.). Hence, it allows defining tasks for agents without having to develop an environment
  2. How to use
    1. Agent Task Config - each task has:
      1. initial state setup - initializes a virtual machine (hence re-usable and isolated)
      2. evaluation config file - expected output, evaluator function, app used
    2. Agent Observation - the agent can receive the natural language instruction, screenshots, the accessibility tree, the DOM, and customized streams such as terminal outputs and Set-of-Marks (ref: previous lecture for an image)
    3. Agent interaction loop - repeats until an action that marks termination is reached.
    4. Execution based reward function - more open - user can construct the evaluation script
2. Single-app and multi-app workflows on Ubuntu and Windows; some tasks are infeasible, and these are included in the dataset
3. LLMs and VLMs, open and closed source agents were tested - evaluation settings: SoM, screenshots, accessibility tree
4. Human performance: 75%. The best-performing V/LLM was GPT-4V at around 12% (last year); Anthropic computer use reaches 22% and OpenAI Operator 38% (at 40 max steps)
5. Longer text-based trajectory history context improves performance, unlike screenshot-only history
6. VLM agents are not robust to UI layouts and noise
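The task config, observation, interaction loop, and execution-based reward described above can be sketched as a toy loop. This is a minimal illustration and not the OSWorld API: `TaskConfig`, `ToyEnv`, `scripted_agent`, `run_episode`, and the action dictionaries are all hypothetical names, and a plain dict of files stands in for the virtual machine.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TaskConfig:
    """Hypothetical task config: instruction, initial state, evaluator."""
    instruction: str
    initial_state: Dict[str, str]
    # Execution-based reward: scores the final state, not the action sequence.
    evaluator: Callable[[Dict[str, str]], float]


class ToyEnv:
    """Stand-in for a VM-backed environment; the state is a dict of files."""

    def __init__(self, task: TaskConfig):
        self.task = task
        self.state: Dict[str, str] = {}

    def reset(self) -> dict:
        # A fresh copy per episode mimics a re-usable, isolated VM snapshot.
        self.state = dict(self.task.initial_state)
        return self._observe()

    def _observe(self) -> dict:
        return {"instruction": self.task.instruction, "files": sorted(self.state)}

    def step(self, action: dict):
        if action["type"] == "DONE":  # the action that marks termination
            return self._observe(), True
        if action["type"] == "WRITE":
            self.state[action["path"]] = action["text"]
        return self._observe(), False

    def evaluate(self) -> float:
        return self.task.evaluator(self.state)


def scripted_agent(obs: dict) -> dict:
    """Trivial policy used only to exercise the loop."""
    if "notes.txt" not in obs["files"]:
        return {"type": "WRITE", "path": "notes.txt", "text": "hello"}
    return {"type": "DONE"}


def run_episode(env: ToyEnv, agent, max_steps: int = 15) -> float:
    obs = env.reset()
    for _ in range(max_steps):
        obs, done = env.step(agent(obs))
        if done:
            break
    return env.evaluate()


task = TaskConfig(
    instruction="Create notes.txt containing the word hello",
    initial_state={},
    evaluator=lambda state: 1.0 if state.get("notes.txt") == "hello" else 0.0,
)
reward = run_episode(ToyEnv(task), scripted_agent)
```

Because the evaluator only inspects the final state, any action sequence that produces the right file earns the reward, which is what makes execution-based evaluation suitable for open-ended tasks.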
Data
1. AgentTrek: agent trajectory synthesis via guiding replay with web tutorials
  1. Automatic tutorial collection from the Internet - yields a large GUI-specific dataset
    1. “Snapshot of the internet” - RedPajama
    2. Keyword/structure-based filtering - 99% filtered out - e.g. English-only data, content length, keywords, URL presence, after 2020, de-duplication, domain filter
    3. subset of the remaining data is annotated (good / bad) using an LLM - “You are an assistant…task is to identify tutorial specifically that uses GUI for web/desktop apps or operating systems…should have step by step instructions”
    4. Tutorial Classifier
    5. Tag and Paraphrase using LLM to a structured format - it has the following info
      1. platform and target env specific
      2. task description
      3. prerequisites
      4. step by step instructions
      5. expected outcomes
2. TACO: Multi-Modal action models with synthetic chain-of-thought and action (CoTA) - Multimodal LLMs with action calling capabilities
  1. Synthetic CoTA generation pipeline
    1. Model based generation
      1. Get (image, QA) pairs - ask an MLLM to generate CoT-and-Action (action list), or CoT only if there is no relevant action list
      2. Verify and parse the output, then store it
    2. programmatic generation - no QA pairs - only image
      1. Use object detection, OCR, or human annotations in the image
      2. Templates - are used to generate QA pairs and CoTA
        
        action set - OCR, getobjects, localiseobjects, solvemath, querylanguagemodel, getimagetoimagesimilarity
        
        Question template: How many {obj} are there?
    3. Observations
      1. CoTA finetuning elicits reasoning and action capabilities that are not obtained through few shot prompting
      2. CoTA data recipe allows TACO to consistently beat instruction-tuned baselines by 1-4%
      3. Quality > Quantity (data ablations)
        
        The smallest CoTA-only dataset, on average, beats larger datasets of CoTA/CoT/Direct samples
        
        Filtering out datasets without Action data leads to performance gains
      4. Adding program generated data can improve performance across some datasets, but not on average
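The programmatic generation path (image annotations plus templates, no existing QA pairs) can be sketched as follows. This is a rough illustration under stated assumptions, not TACO's actual pipeline: the record layout, the `make_counting_examples` helper, and the `terminate` action are hypothetical, while `localiseobjects` and the question template are taken from the notes above.

```python
from collections import Counter
from typing import Dict, List

# Template from the notes; {obj} is filled from detector or human labels.
QUESTION_TEMPLATE = "How many {obj} are there?"


def make_counting_examples(image_id: str, detections: List[Dict]) -> List[Dict]:
    """Turn object annotations into QA pairs with a templated CoTA trace.

    `detections` is assumed to be a list of {"label": str, "box": [x1, y1, x2, y2]}.
    """
    counts = Counter(d["label"] for d in detections)
    examples = []
    for label, n in sorted(counts.items()):
        boxes = [d["box"] for d in detections if d["label"] == label]
        examples.append({
            "image": image_id,
            "question": QUESTION_TEMPLATE.format(obj=label),
            "cota": [
                {   # Step 1: call a perception action and record its observation.
                    "thought": f"I need to locate all objects labeled '{label}'.",
                    "action": {"name": "localiseobjects", "arguments": {"objects": [label]}},
                    "observation": {"boxes": boxes},
                },
                {   # Step 2: hypothetical terminal action carrying the answer.
                    "thought": f"The observation contains {n} box(es) for '{label}'.",
                    "action": {"name": "terminate", "arguments": {"answer": str(n)}},
                },
            ],
            "answer": str(n),
        })
    return examples


annotations = [
    {"label": "cups", "box": [10, 10, 50, 60]},
    {"label": "apples", "box": [70, 20, 110, 65]},
    {"label": "cups", "box": [120, 15, 160, 62]},
]
examples = make_counting_examples("img_0001.jpg", annotations)
```

Since the answer is derived from the same annotations that fill the observation, the generated thought, action, and answer stay mutually consistent without a separate verification step.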
Model/System:
1. Aguvis: unified pure vision agents for autonomous GUI Interaction
  1. Common challenges for GUIs and how Aguvis addresses them
    1. GUIs are complex and have different representations (HTML for web, XML for mobile, AXTree for OS). Different observation representations result in different action grounding spaces, even on the same platform. Aguvis resolves this with a unified vision-based perception and action space for all GUI interactions
    2. Limited visual grounding capability - Aguvis improves specifically on visual action grounding capability through training
    3. Existing agents perform “reactive” low-level actions directly without reasoning - Aguvis has an explicit reasoning process / inner monologue
  2. Training - Two Stages
    1. Grounding - Image observation -> generate low level instruction -> Grounding generation
    2. Planning and Reasoning - Image observation -> Input Instructions (includes previous actions) -> Planning Generation (Thought:Low-Level-Instruction:Action)+
    3. Evaluation - offline evals on Mind2Web and AndroidControl, as well as online agent evals, both showed improvements
    4. Reasoning with inner monologue helps in solving harder tasks
    5. Training only on web and mobile data shows strong generalisation to desktop
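To make the (Thought : Low-Level-Instruction : Action) planning output concrete, here is a small parsing sketch. The exact text layout, the `click(x=..., y=...)` syntax, and the use of normalized coordinates are assumptions made for illustration, and `parse_step` and `click_to_pixels` are hypothetical helpers rather than Aguvis code.

```python
import re
from typing import Dict, Tuple

# Assumed layout of one planning step emitted by the model.
STEP_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Low-level instruction:\s*(?P<instruction>.*?)\s*"
    r"Action:\s*(?P<action>.*)",
    re.DOTALL,
)
# Assumed action syntax with coordinates normalized to [0, 1].
CLICK_PATTERN = re.compile(r"click\(x=(?P<x>[0-9.]+),\s*y=(?P<y>[0-9.]+)\)")


def parse_step(text: str) -> Dict[str, str]:
    """Split one model output into thought, instruction, and action strings."""
    match = STEP_PATTERN.search(text)
    if match is None:
        raise ValueError("output does not follow the assumed three-part layout")
    return {key: value.strip() for key, value in match.groupdict().items()}


def click_to_pixels(action: str, width: int, height: int) -> Tuple[int, int]:
    """Ground a normalized click onto a concrete screen resolution."""
    match = CLICK_PATTERN.search(action)
    if match is None:
        raise ValueError("not a click action")
    return round(float(match["x"]) * width), round(float(match["y"]) * height)


sample = (
    "Thought: The search box is at the top of the page.\n"
    "Low-level instruction: Click the search box.\n"
    "Action: click(x=0.5, y=0.25)"
)
step = parse_step(sample)
pixel_xy = click_to_pixels(step["action"], width=1920, height=1080)
```

Keeping the action in screen coordinates rather than HTML or accessibility-tree node IDs is what lets one action space cover web, mobile, and desktop alike.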

---
https://language-agent-tutorial.github.io/

Written on May 28, 2025