Multimodal Autonomous AI Agents

Notes from Lecture 6 of this Open MOOC offered by Berkeley RDI

Web Agents

Visual Encoder + LLM (HTML understanding) + Web Grounding

Outline

VisualWebArena - Evaluating Multimodal Agents (eval benchmark)
Tree Search for Language Model Agents (inference time algorithms)
Towards Internet-Scale Training for Agents (training data scaling)

VisualWebArena
1. WebArena
  1. Shopping, Reddit, and GitLab implementations with real data
    1. Task success rate: 78% for humans, 14% for agents
  2. Tasks are designed to use only text and HTML source code, but raw HTML has problems:
    1. it fills up the context window (100k+ tokens)
    2. interactive elements and spatial layout are not represented the way they are displayed
    3. it is usually minified or compressed for efficiency
  3. Hence, HTML is insufficient
2. VisualWebArena
  1. Built for multimodal agents - visually grounded tasks - more realistic and interesting. e.g. find this bike and offer 100 dollars less
    1. tasks require careful observation of images, graphs, charts, etc. for correct completion
    2. 950 tasks - split by visual difficulty (click on a lot of things and explore around) and reasoning difficulty
  2. POMDP environment - RL - Partially observable Markov decision process
    1. O - Observations - website, accessibility tree
    2. S - States
    3. A - Actions - hover, new tab, tab focus, stop
    4. T - Transitions - deterministic transition function
      1. Given a state and an action, move to the next state
    5. R - Reward - sparse signal that indicates successful task completion
  3. Execution-based evaluation - most prompts have one solution
    1. exact match (ans = 100)
    2. visual Q&A (is this shirt green?)
    3. constraints (2,000 < ans < 10,000)
3. LLM and VLM Agents
  1. VLMs as agents
    1. Representations
      1. Accessibility Tree (HTML) - cluttered and dense but useful
      2. Set-of-Marks - bounding boxes around clickable elements (Fig 1, 2)
      3. Coordinate-based representation
  2. Web Agent Architectures
    1. High Level Planning
    2. Observation Parsing
    3. Low-level action generation
  3. Common Failure Modes
    1. Long horizon reasoning and planning
      1. Go back and forth in a loop (as memory is missing)
      2. Correctly perform a task and then undo it
      3. Stop exploration/execution too early
    2. Failures in visual processing
      1. Clicking the wrong item
      2. Failing to identify specific items on complex web pages
      3. spatial reasoning (“what are the prices in row 1”) [seems like a planning failure to me]
    3. Improvements needed in
      1. long horizon planning and research
      2. Search
      3. vision-language-code models
      4. identifying the right level of abstraction - API/Screenshots/HTML
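
The execution-based checks listed above (exact match, visual Q&A, numeric constraints) and the sparse reward can be sketched roughly as follows. This is a minimal illustration, not the VisualWebArena evaluator: every function name here is hypothetical, and the visual Q&A check assumes some external judge (for example a VLM) is passed in as a plain callable.

```python
from typing import Callable


def exact_match(answer: str, expected: str) -> bool:
    """Exact-match check, ignoring surrounding whitespace and letter case."""
    return answer.strip().lower() == expected.strip().lower()


def within_bounds(answer: float, low: float, high: float) -> bool:
    """Constraint check: the answer must lie strictly between low and high."""
    return low < answer < high


def visual_qa(question: str, judge: Callable[[str], bool]) -> bool:
    """Visual Q&A check. `judge` is a stand-in for a VLM that answers a
    yes/no question about the final page (an assumption of this sketch)."""
    return judge(question)


def sparse_reward(success: bool) -> float:
    """Sparse reward: 1.0 only when the whole task succeeded, else 0.0."""
    return 1.0 if success else 0.0
```

For instance, `sparse_reward(within_bounds(5000, 2000, 10000))` gives a reward of 1.0, while an answer of 10,000 falls outside the strict bounds and earns 0.0.
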
Tree search (test time inference)
1. Exponential error compounding - every step must be chosen carefully because local decisions have global consequences, so efficient environment exploration is necessary
2. Search by repeated sampling - keep sampling paths until one succeeds - surprisingly simple, and o1, o3, and DeepSeek use it right now - the probability of success rises with more samples, but the search space is exponentially large
3. Key idea: apply a value function at intermediate nodes
  1. Best-First-Search
    1. baseline agent to propose actions
    2. backtrack in the environment
    3. Value function to re-rank and score candidates
      1. LLM-as-a-Judge with self-consistency
      2. backtrack if value of the current state decreases
    4. Results - 5-8% increase in success rate on VWA (VisualWebArena) and WA (WebArena)
    5. Ablations
      1. A larger search budget (more node expansions or greater depth) gives a higher success rate
      2. A good value function is essential (LLaVA is worse than GPT-4o) - judging whether a trajectory is good or bad is much easier than predicting the next action to take
    6. Limitations
      1. Search is slow - naive queue-based backtracking
      2. dealing with irreversible actions
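
The best-first search described above can be sketched on a toy problem. This is only an illustration under stated assumptions: `propose`, `step`, `value`, and `is_goal` are hypothetical callables, and the value function is a simple numeric heuristic rather than an LLM judge. States are stored in the queue, so "backtracking" is just popping an earlier state; a real web environment would need to reset or replay actions instead.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, List, Optional


def best_first_search(
    start: Hashable,
    propose: Callable[[Hashable], List[str]],    # baseline agent: candidate actions
    step: Callable[[Hashable, str], Hashable],   # deterministic transition
    value: Callable[[Hashable], float],          # value function (higher is better)
    is_goal: Callable[[Hashable], bool],
    budget: int = 20,                            # max number of node expansions
) -> Optional[List[str]]:
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(-value(start), next(tie), start, [])]
    seen = {start}
    expansions = 0
    while frontier and expansions < budget:
        # Pop the highest-value state found so far (implicit backtracking).
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        expansions += 1
        for action in propose(state):
            nxt = step(state, action)
            if nxt in seen:
                continue
            seen.add(nxt)
            heapq.heappush(frontier, (-value(nxt), next(tie), nxt, path + [action]))
    return None  # budget exhausted without reaching the goal


# Toy usage: reach 10 from 1 using "+1" and "*2".
def toy_step(s: int, a: str) -> int:
    return s + 1 if a == "+1" else s * 2


plan = best_first_search(
    start=1,
    propose=lambda s: ["+1", "*2"],
    step=toy_step,
    value=lambda s: 1.0 / (1 + abs(10 - s)),
    is_goal=lambda s: s == 10,
)
```

The `budget` parameter plays the role of the search budget from the ablations: with too small a budget the search returns `None` before it reaches the goal.
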
Data Collection
1. LLMs are trained offline and deployed zero-shot
2. How can we create synthetic tasks? Internet-Scale Training for Agents (InSTA) uses Llama to generate and verify synthetic agentic tasks
  1. Task generation stage
    1. given a website, propose a realistic task that a user would do (e.g. plan a journey on Air India)
      1. Most tasks involve information retrieval
      2. tasks must not modify the state of the environment (stated as part of the requirements)
      3. the generated tasks are diverse and include fact identification (e.g. compare prices); they are grounded and have good breadth
  2. Task evaluation
    1. Automatic task verification - LLM-as-a-Judge - full trajectory eval
      1. execute the task using GPT
      2. observe sequence and the last state
      3. estimate the probability of the task being successful
    2. On 150k tasks, the LLM judge rates 15% as successful with a confidence of 1.0; human evaluation found these judgments to be around 90% accurate
  3. Data collection
    1. Common Crawl PageRank - take the top 1M sites, exclude harmful ones, and narrow down to 150k live websites
    2. Results
      1. Improving efficiency
        1. training on synthetic and human demos scales faster than training only on human data
        2. adding synthetic data improves the “step” accuracy by around 90% on Mind2Web and 122% on WebLINX
      2. Improving generalization
        1. human demos cover a small subset of highly visited websites - they don’t generalize
        2. synthetic data does generalize
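
The verification step above, where only trajectories the judge rates as successful with high confidence are kept, can be sketched as a simple filter. This is a rough illustration, not the InSTA pipeline: the `JudgedTrajectory` type, its field names, and the sample tasks are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedTrajectory:
    task: str            # synthetic task proposed for a website
    success_prob: float  # judge's estimated probability that the task succeeded


def filter_by_confidence(
    items: List[JudgedTrajectory], threshold: float = 1.0
) -> List[JudgedTrajectory]:
    """Keep only trajectories whose judged success probability meets the threshold."""
    return [t for t in items if t.success_prob >= threshold]


# Toy data (invented for illustration only).
judged = [
    JudgedTrajectory("compare prices of two laptops", 1.0),
    JudgedTrajectory("find the store opening hours", 0.6),
    JudgedTrajectory("look up a flight schedule", 1.0),
]
kept = filter_by_confidence(judged)
keep_rate = len(kept) / len(judged)
```

Lowering `threshold` trades label quality for more training data. The notes only report human-checked accuracy for the confidence-1.0 subset.
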

Physical Agent - long-horizon robotic manipulation task

Plan-Sequence-Learn
- plan - high-level plan - list of (object, condition) pairs
- sequencing/parsing module
  - pose estimation + motion planner - SAM + inverse kinematics + motion planner (AIT*)
- low-level action learning module
  - RL policies that learn local control

Adversarial attacks - perturb the image of the web page to hijack the agent and make it generate whatever the attacker wants

Fig. 1. Set of elements

Fig. 2. VisualWebArena results

---

https://llmagents-learning.org/slides/ruslan-multimodal.pdf
https://rdi.berkeley.edu/adv-llm-agents/sp25

Written on May 27, 2025