Coding agents for vulnerability detection

Notes from Lecture 5 of this Open MOOC offered by Berkeley RDI

---

Outline

  • Coding Agents
  • AI for Computer Security
  • LLM Agents for computer security

Evaluation [of code]

  • History
    • MBPP, HumanEval (Codex)
    • Evaluation metric: pass@k
      • draw k samples from the model; how often does at least one of them pass?
      • Requires an automatic correctness check
      • Exposes the explore vs. exploit trade-off, since the k samples should be diverse
      • Pass@k may match production usage
      • how do we calculate pass@k?
        • naive estimator - what % passed, e.g. take 10 of 100 samples, check them, and average (e.g. 20%); this is a noisy estimate
        • HumanEval estimator - generate n ≥ k samples, count how many pass, and compute the probability that a subsample of k (e.g. 10) drawn without replacement contains at least one success
    • Are those evals good?
      • In 2021 they were “hard”, had not leaked, and helped improve models; now they are too easy
  • SWE-bench - real-world coding problems collected from Python repositories on GitHub. Each instance contains the problem statement, a test patch, and a gold patch (which makes the failing tests in the test patch pass)
    • When it was introduced, LLMs were at 2-3% and agents were better
    • Instances are filtered by attributes (resolves an issue / adds tests) and by execution (PR passes all tests, installs successfully)
    • A subset with bad descriptions removed is available (SWE-bench Verified)
    • Are these good?
      • yes - real, difficult, less noisy
      • no
        • GitHub issue descriptions (the input for SWE-bench) differ in style from real user input
        • Tests are an inexact verifier (tests might pass even though the code is wrong)
        • data leakage - public code may be in the LLM training datasets
        • commits may do 2-3 things instead of resolving one specific issue
        • all evals have a shelf life before they are leaked
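
The HumanEval-style pass@k calculation noted above can be sketched as follows. This is a minimal illustration, not the benchmark's reference code: the function name `pass_at_k` and the per-problem counts in the usage lines are made up for the example. With n samples of which c pass, the chance that a random size-k subset (drawn without replacement) contains at least one success is 1 - C(n-c, k) / C(n, k).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Returns the probability that a size-k subset of the n samples,
    drawn without replacement, contains at least one passing sample.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the subsets made only of failing samples;
    # it is 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems, invented for illustration.
results = [(100, 20), (100, 0), (100, 100)]
per_problem = [pass_at_k(n, c, k=10) for n, c in results]

# The benchmark score is the mean over problems.
benchmark_score = sum(per_problem) / len(per_problem)
```

Sampling more than k completions per problem and averaging over all size-k subsets is what reduces the noise compared with checking a single batch of k.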

Coding Agents

  1. SWE-Agent (Yang 2024)
    1. Agent designed for SWE-bench style tasks by combining
      • Planning / Chain-of-Thought - the model explains its reasoning
      • Uses Tools
      • Execution feedback
    2. ReAct loop
      • Repeat:
        • LLM generates text given the current trajectory
        • run the tools named in the LLM output
        • append tool output to trajectory
      • Until timeout, error, or success
    3. Types of tools:
      • Linux shell, IDE - models can’t use them yet. OR
      • design agent-specific tools - Agent Computer Interface (ACI)
        • Tool: a Python function whose result is added to the context as a string
        • ACI actions should -
          • result in compact trajectories
          • give compact and informative feedback
          • include guardrails that mitigate error propagation
        • prompt should lead the model towards familiar paths from the post training data
    4. SWE-Agent tools
      • Information gathering
        • File viewer (open, goto line, scroll_up/down)
        • Search tool: search_file, search_dir, find_file
      • Action tools (with feedback)
        • File editing - edit row:col end_of_edit, create
      • Feedback tool
        • Task - submit (?)
    5. Results - SoTA was 1%; this setup reached 12%
    6. Criticism -
      1. Agents use 5-10x more tokens to get the same output. Results should be normalized by tokens consumed for a fair comparison
      2. The agent only has old tests, which may or may not work, so it has to generate its own tests
  2. Agentless Design (Xia 2024)
    1. Control flow can be determined -
      1. Dynamically: the LLM chooses the problem-solving strategy. The flexibility might be better for harder problems
      2. Procedurally: Python code calls the LLM when needed, i.e. Agentless. Since the workflow is known in advance, this avoids tool-use mistakes
    2. Agentless Control Flow
      1. Localization: Narrow down to files, classes, functions, lines of code
      2. Repair: Generate patch
      3. Validate Patch
    3. Near the top among open-source agents
  3. AutoCodeRover (Zhang 2024)
    1. Best-performing open-source agent
    2. Different type of search - IDE style method/class indexing and tools
    3. In between dynamic and procedural style
    4. Two loops - allows information hiding (irrelevant context is not forwarded)
      1. Context retrieval loop - outputs location of the bug
      2. Patch generation stage - outputs code
  4. RepairAgent (2024)
    1. State Machine based control flow
    2. Depending on the tool call chosen, state change is triggered
  5. Passerine: Coding agents at Google
    1. ReAct style
    2. uses Google tools on its monorepo
      1. code search
      2. Bazel for builds
      3. cat file
      4. edit file
      5. finish
    3. Internal data collection
      1. user-generated bug reports with patches
      2. autogenerated reports with patches
  6. Ideas are simple, design space is large
    1. Which tools to use
    2. Control flow - static, dynamic, tree
    3. Prompting - system, tool, summary
    4. Acting - sandboxing, human in the loop, communicating with external services
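
The two control-flow styles in these notes (a dynamic ReAct loop versus a fixed procedural pipeline) can be contrasted in a small sketch. Everything here is an assumption made for illustration: the names `react_loop`, `procedural_pipeline`, and `make_scripted_llm`, the "tool_name argument" output format, and the scripted stand-in for the LLM. None of it is taken from SWE-agent or Agentless code.

```python
from typing import Callable, Dict, List, Optional

LLM = Callable[[str], str]    # prompt -> model output (stubbed below)
Tool = Callable[[str], str]   # argument string -> observation string


def make_scripted_llm(replies: List[str]) -> LLM:
    """Stand-in for a real model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


def react_loop(llm: LLM, tools: Dict[str, Tool], task: str,
               max_steps: int = 10) -> List[str]:
    """Dynamic control flow: the LLM picks the next tool at every step."""
    trajectory = [f"TASK: {task}"]
    for _ in range(max_steps):                 # step limit acts as the timeout
        output = llm("\n".join(trajectory))    # generate from the trajectory
        trajectory.append(f"LLM: {output}")
        name, _, arg = output.partition(" ")   # assumed "tool_name argument" format
        if name == "submit":                   # success: the agent decides it is done
            break
        tool = tools.get(name)
        # Guardrail: an unknown tool yields a short error instead of crashing.
        observation = tool(arg) if tool else f"unknown tool: {name}"
        trajectory.append(f"OBSERVATION: {observation}")
    return trajectory


def procedural_pipeline(llm: LLM, validate: Callable[[str], bool],
                        issue: str) -> Optional[str]:
    """Procedural control flow: fixed localize -> repair -> validate steps."""
    location = llm(f"Localize the bug for: {issue}")
    patch = llm(f"Write a patch at {location} for: {issue}")
    return patch if validate(patch) else None


# Usage with hypothetical tools and canned model replies.
tools = {"find_file": lambda arg: f"found src/{arg}"}
trajectory = react_loop(
    make_scripted_llm(["find_file utils.py", "submit"]), tools, "fix the bug")
patch = procedural_pipeline(
    make_scripted_llm(["utils.py:42", "PATCH"]), lambda p: p == "PATCH", "bug")
```

In the loop the model controls the order of steps, so it can recover from surprises but can also waste tokens or call tools incorrectly. In the pipeline the Python code fixes the order, and the model only fills in each step.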

AI for Computer Security

  1. CTF competitions
    1. Forensics - find a secret message in the filesystem - tools - filesystem and network tools, grep
    2. Other categories: forensics, binary exploitation (using debuggers), injection attacks (web search and browsing tools)
    3. Benchmarks are still in the development stage: NYU CTF Bench
    4. EnIGMA (2024) - CTF Agents
      1. stateful tools - debuggers, server connectors [pwntools]
      2. tools - SWE-agent tools like file search, plus decompiler, disassembler
      3. Prompt design - tools for summarising and searching binaries, guidelines derived from unsuccessful trajectories
    5. 3 to 13% improvement using agents on the NYU benchmark
  2. Vulnerability detection
    1. e.g. XSS, out-of-bounds write/read, SQL injection, improper authentication
      1. CWE (cwe.mitre.org) has a taxonomy of bugs, and CVE has datasets
      2. detection is a needle-in-a-haystack problem
      3. need global context - how much code do I put into the LLM? How are tangled commits handled?
    2. Historical techniques
      1. traditional fuzzing - genetic-algorithm-style input mutation (e.g. bit flipping) to trigger a vulnerability
      2. static analysis - false positive rate is high
      3. finetuned LLMs
      4. Agent: Big Sleep (Google Project Zero)
        1. Overview:
          1. Navigates code
          2. Proposes a hypothesis
          3. Validates hypothesis by running code
        2. Focus: memory-safety vulnerability detection through whitebox (source code access) test case generation with LLM agents; a crashing test case acts as a precise verifier
          1. Goal: Find input that triggers a sanitizer crash
          2. At each turn:
            1. LLM generates NL reasoning
            2. Calls one of the tools
              1. code browser - jump to def, follow cross references
              2. python interpreter - run script and generate output
              3. debugger - run target program, set breakpoints, evaluate expressions
            3. Tool output is added to context
              1. Generate a Python program that produces the input instead of writing out the literal input, since inputs can be complex (e.g. "A" repeated 20 times)
              2. Run in the debugger interactively - lists breakpoints (by line number), the expressions evaluated, and the values of those expressions
        3. Found a vulnerability in SQLite. A large number of vulnerabilities are variants of bugs that have recently been fixed, so the authors chose one such bug and asked the agent to find something similar. A general-purpose fuzzer couldn’t find it (a good ablation)
        4. SOTA on memory-safety tasks in Meta’s cybersecurity benchmark
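
The difference between blind mutation fuzzing and an agent that writes an input-generating script can be shown on a toy target. This sketch is illustrative only. `toy_target`, `SimulatedSanitizerCrash`, and the 16-byte buffer are invented, the bounds check merely stands in for a real sanitizer, and the random mutation loop is far simpler than a genetic-algorithm fuzzer. It is not Big Sleep's tooling.

```python
import random
from typing import Optional

BUF_SIZE = 16  # hypothetical fixed buffer size in the toy target


class SimulatedSanitizerCrash(Exception):
    """Stands in for a sanitizer report such as an out-of-bounds write."""


def toy_target(data: bytes) -> None:
    """Copies data into a fixed buffer; the check plays the sanitizer's role."""
    buf = bytearray(BUF_SIZE)
    for i, byte in enumerate(data):
        if i >= BUF_SIZE:
            raise SimulatedSanitizerCrash(f"write at offset {i}")
        buf[i] = byte


def crashes(data: bytes) -> bool:
    """Precise verifier: True only if the input triggers the simulated crash."""
    try:
        toy_target(data)
    except SimulatedSanitizerCrash:
        return True
    return False


def make_input() -> bytes:
    """Agent-style step: emit a script that builds the input, not the input."""
    return b"A" * 20


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Flip one random bit, or append one random byte."""
    out = bytearray(data)
    if out and rng.random() < 0.5:
        out[rng.randrange(len(out))] ^= 1 << rng.randrange(8)
    else:
        out.append(rng.randrange(256))
    return bytes(out)


def fuzz(seed: bytes, max_iters: int, rng: random.Random) -> Optional[bytes]:
    """Blind mutation loop: returns a crashing input, or None if none is found."""
    current = seed
    for _ in range(max_iters):
        current = mutate(current, rng)
        if crashes(current):
            return current
    return None


agent_found = crashes(make_input())               # hypothesis checked by running code
fuzz_result = fuzz(b"hi", 200, random.Random(0))  # may or may not find a crash
```

The key property in both cases is the verifier: an input either triggers the crash or it does not, so a hypothesis from the agent can be confirmed by execution rather than trusted.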

---
https://llmagents-learning.org/sp25
http://llmagents-learning.org/slides/Code%20Agents%20and%20AI%20for%20Vulnerability%20Detection.pdf

Written on May 27, 2025