Multimodal Autonomous AI Agents
Notes from Lecture 6 of this Open MOOC offered by Berkeley RDI
Web Agents
Visual Encoder + LLM (HTML understanding) + Web Grounding
Outline
- VisualWebArena - Evaluating Multimodal Agents (eval benchmark)
- Tree Search for Language Model Agents (inference time algorithms)
- Towards Internet-Scale Training for Agents (training data scaling)
- VisualWebArena
- WebArena
- shopping, Reddit, and GitLab implementations with real data
- task success rate: 78% for humans, 14% for agents
- tasks are designed to use only text and HTML source code
- HTML fills up the context (100k+ tokens)
- interactive elements and spatial layout are not conveyed correctly
- HTML is usually minified or compressed for efficiency
- Hence, HTML alone is insufficient
- VisualWebArena
- Built for multimodal agents - visually grounded tasks - more realistic and interesting. e.g. find this bike and offer 100 dollars less
- tasks require careful observation of images, graphs, charts, etc. for correct completion
- 950 tasks - split by visual difficulty (clicking on a lot of things and exploring around) and reasoning difficulty
- POMDP environment - RL - Partially Observable Markov Decision Process
- O - Observations - website, accessibility tree
- S - States
- A - Actions - e.g. hover, new tab, tab focus, stop
- T - Transitions - deterministic transition function
- Given a state and an action, go to the next state
- Reward signal (sparse) - to indicate successful task completion
- Execution based evaluation - most prompts have one solution
- exact match (ans = 100)
- visual QnA (is this shirt green?)
- constraints (ans<10,000 and ans>2000)
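The POMDP framing and the execution-based checks above can be sketched as a toy environment. This is only an illustration under stated assumptions: `ToyWebEnv`, `exact_match`, and `in_range` are hypothetical names, pages are plain strings, and the visual QnA check is left out because it would need a VLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


def exact_match(expected: str) -> Callable[[str], bool]:
    """Hypothetical evaluator: the answer must equal the expected string."""
    return lambda answer: answer.strip().lower() == expected.strip().lower()


def in_range(low: float, high: float) -> Callable[[str], bool]:
    """Hypothetical evaluator: constraint check, low < answer < high."""
    def check(answer: str) -> bool:
        try:
            value = float(answer.replace(",", ""))
        except ValueError:
            return False
        return low < value < high
    return check


@dataclass
class ToyWebEnv:
    """Toy deterministic environment (an assumption, not the real benchmark)."""
    links: Dict[Tuple[str, str], str]   # T: (state, action) -> next state
    evaluator: Callable[[str], bool]    # decides the sparse reward
    state: str = "home"                 # S: hidden state (current page)
    done: bool = False

    def observe(self) -> str:
        # O: the agent only sees an observation derived from the state.
        return f"page:{self.state}"

    def step(self, action: str, answer: str = "") -> Tuple[str, float]:
        if self.done:
            raise RuntimeError("episode already finished")
        if action == "stop":
            # Sparse reward: only given once, when the agent stops.
            self.done = True
            return self.observe(), 1.0 if self.evaluator(answer) else 0.0
        # Deterministic transition; unknown actions leave the state unchanged.
        self.state = self.links.get((self.state, action), self.state)
        return self.observe(), 0.0
```

Every intermediate step returns a reward of 0.0, which is what makes the signal sparse: the agent learns whether it succeeded only after issuing `stop`.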
- LLM and VLM Agents
- VLMs as agents
- Representations
- Accessibility Tree (HTML) - cluttered and dense but useful
- Set of Marks - bounding boxes around clickable elements (Fig 1, 2)
- Coordinate-based representation
- Web Agent Architectures
- High Level Planning
- Observation Parsing
- Low-level action generation
- Common Failure Modes
- Long horizon reasoning and planning
- Go back and forth in a loop (as memory is missing)
- Correctly perform a task and then undo it
- Stop exploration/execution too early
- Failures in visual processing
- Clicking the wrong item
- Failing to identify specific items in complex web pages
- spatial reasoning (“what are the prices in row 1”) [seems like a planning failure to me]
- Improvements needed in
- long horizon planning and research
- Search
- visual language code models
- identifying the right level of abstraction - API/Screenshots/HTML
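The Set of Marks representation mentioned above can be illustrated with a small sketch: each clickable element gets a numeric mark, the model sees the marks, and a chosen mark is resolved back to screen coordinates. The `Element` class, the `(x, y, width, height)` box layout, and the `click [id]` action syntax are assumptions made for this example, not the format used in the lecture.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Element:
    tag: str
    text: str
    box: Tuple[int, int, int, int]  # assumed layout: (x, y, width, height) in pixels
    clickable: bool


def assign_marks(elements: List[Element]) -> Dict[int, Element]:
    """Give every clickable element a numeric mark, starting from 1."""
    clickable = [e for e in elements if e.clickable]
    return {i: e for i, e in enumerate(clickable, start=1)}


def describe_marks(marks: Dict[int, Element]) -> str:
    """Text listing that could accompany the annotated screenshot."""
    return "\n".join(f"[{i}] <{e.tag}> {e.text}" for i, e in marks.items())


def resolve_click(action: str, marks: Dict[int, Element]) -> Tuple[int, int]:
    """Map an action such as 'click [2]' to the center of the marked box."""
    match = re.fullmatch(r"click \[(\d+)\]", action.strip())
    if match is None:
        raise ValueError(f"unsupported action: {action!r}")
    x, y, w, h = marks[int(match.group(1))].box
    return (x + w // 2, y + h // 2)
```

The point of the design is that the model only has to pick a small integer rather than predict raw pixel coordinates, which is where the coordinate-based representation differs.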
- Tree search (test time inference)
- Exponential error compounding - every step needs to be taken carefully - local decisions have global consequences - efficient environment exploration is necessary
- Search by Repeated Sampling - explore paths until success - o1, o3, DeepSeek use it right now - surprisingly simple - the probability of success increases with more samples - but the search space is exponentially large
- key idea: apply value function at intermediate nodes
- Best-First-Search
- baseline agent to propose actions
- backtrack in the environment
- Value function to re-rank and score candidates
- LLM-as-a-Judge with self-consistency
- backtrack if value of the current state decreases
- Results - 5-8% increase on VisualWebArena (VWA) and WebArena (WA)
- Ablations
- a larger search budget (more node expansions/depth) gives a higher success rate
- A good value function is essential (LLaVA is worse than GPT-4o) - judging whether a trajectory is good or bad is much easier than predicting the next action that should be taken
- Limitations
- Search is slow - naive queue-based backtracking
- dealing with irreversible actions
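The best-first search idea above (a baseline agent proposes actions, a value function scores the resulting states, and the highest-valued state is expanded next) can be sketched on a toy problem. All names here are hypothetical, and the callables `propose_actions`, `transition`, `value_fn`, and `is_goal` stand in for the agent, the environment, and the LLM judge. States are plain values, so "backtracking" is just popping a node from another branch; in a real browser it would mean resetting and replaying actions, which is part of why search is slow.

```python
import heapq
import itertools
from typing import Callable, Iterable, List, Optional, TypeVar

S = TypeVar("S")


def best_first_search(
    start: S,
    propose_actions: Callable[[S], Iterable[str]],
    transition: Callable[[S, str], S],
    value_fn: Callable[[S], float],
    is_goal: Callable[[S], bool],
    budget: int = 20,
    max_depth: int = 5,
) -> Optional[List[str]]:
    """Return a list of actions reaching a goal state, or None if the budget runs out."""
    counter = itertools.count()  # tie-breaker so the heap never compares states
    # heapq is a min-heap, so values are negated to pop the best state first.
    frontier = [(-value_fn(start), next(counter), start, [])]
    expansions = 0
    while frontier and expansions < budget:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        if len(path) >= max_depth:
            continue  # depth limit reached on this branch; try another one
        expansions += 1
        for action in propose_actions(state):
            nxt = transition(state, action)
            heapq.heappush(
                frontier, (-value_fn(nxt), next(counter), nxt, path + [action])
            )
    return None
```

For example, with integer states, actions `+1` and `*2`, and a value function that prefers states closer to a target number, the search follows whichever branch currently looks most promising and falls back to an earlier candidate when the value drops.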
- Data Collection
- LLMs are trained offline and deployed zero shot
- how can we create synthetic tasks? - Internet-Scale Training for Agents (InSTA) - use Llama to generate and verify synthetic agentic tasks
- task generation stage
- given a website, propose a realistic task that a user will do. (e.g. plan a journey on airindia)
- Most tasks involve information retrieval
- we don’t modify the state of the environment (mentioned as part of requirements)
- the generated tasks are diverse and include fact identification (e.g. comparing prices); they are grounded and have good breadth
- Task evaluation
- Automatic task verification - LLM-as-a-Judge - full trajectory eval
- execute the task using a GPT model
- observe the action sequence and the last state
- estimate the probability of the task being successful
- On 150k tasks, the LLM judge rates 15% as success with a confidence of 1.0. The accuracy among these was around 90% as evaluated by humans
- Data collection
- Common Crawl PageRank - top 1M sites - exclude harmful ones - filter down to 150k live websites
- Results
- Improving efficiency
- training on synthetic and human demos scales faster than training only on human data
- adding synthetic data improves the “step” accuracy by around 90% on Mind2Web and 122% on WebLINX
- improving generalization
- human demos are on a small subset of highly visited websites - they don’t generalize
- synthetic data does generalize
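The judge-based filtering step and the rough arithmetic behind the numbers above can be sketched as follows. `JudgedTrajectory`, `filter_confident`, and `expected_correct` are hypothetical names, and the estimate simply multiplies the figures quoted in these notes (150k tasks, 15% rated successful at confidence 1.0, about 90% of those confirmed by humans).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class JudgedTrajectory:
    task: str
    success_probability: float  # judge's estimate in [0, 1]


def filter_confident(
    trajectories: List[JudgedTrajectory], threshold: float = 1.0
) -> List[JudgedTrajectory]:
    """Keep only trajectories the judge rates at or above the threshold."""
    return [t for t in trajectories if t.success_probability >= threshold]


def expected_correct(
    num_tasks: int, success_rate: float, judge_accuracy: float
) -> Tuple[int, int]:
    """Rough count of kept trajectories and how many are likely truly correct."""
    kept = round(num_tasks * success_rate)
    return kept, round(kept * judge_accuracy)


kept, correct = expected_correct(150_000, 0.15, 0.90)
print(kept, correct)  # 22500 20250
```

Under these figures, roughly 22.5k trajectories pass the filter and about 20k of them would be genuinely successful, so the filtered set still carries some label noise.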
Physical Agent - long-horizon robotic manipulation task
- Plan-Sequence-Learn
- plan - high level plan - list of (object, condition) pairs
- sequencing/parsing module
- pose estimation + motion planner - SAM + inverse kinematics + motion planner (AIT*)
- low level action learning module
- learn RL policies for local control
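The three stages above can be sketched as a simple control loop: iterate over the `(object, condition)` plan, move near each object, then let a local policy act until the condition holds. This is a structural sketch only; `move_to`, `local_policy`, and `condition_met` are hypothetical callables standing in for the pose estimation and motion planner, the learned RL policy, and a success check.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, str]  # (object, condition), as in the high-level plan


def plan_sequence_learn(
    plan: List[Stage],
    move_to: Callable[[str], None],
    local_policy: Callable[[str, str], None],
    condition_met: Callable[[str, str], bool],
    max_policy_steps: int = 50,
) -> List[str]:
    """Run each plan stage in order and return a log of what happened."""
    log: List[str] = []
    for obj, condition in plan:
        # Sequencing stage: reach the object (motion planning in the notes).
        move_to(obj)
        log.append(f"moved near {obj}")
        # Learning stage: the local policy acts until the condition holds.
        steps = 0
        while not condition_met(obj, condition) and steps < max_policy_steps:
            local_policy(obj, condition)
            steps += 1
        if not condition_met(obj, condition):
            log.append(f"failed: {obj} {condition}")
            return log  # later stages depend on this one, so stop here
        log.append(f"achieved: {obj} {condition}")
    return log
```

Splitting the task this way means the RL policy only has to learn short, local interactions, while the long-horizon structure comes from the plan.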
Adversarial attacks - change the image of the web page to hijack the agent and make it generate whatever the attacker wants
Fig. 1. Set of elements
Fig. 2. VisualWebArena results
---
https://llmagents-learning.org/slides/ruslan-multimodal.pdf
https://rdi.berkeley.edu/adv-llm-agents/sp25
Written on May 27, 2025
