Abstraction and discovery with large language model agents

Notes from Lecture 11 of this Open MOOC offered by Berkeley RDI

-–

Abstraction and Discovery with Large Language Model Agents
Mathematical Discovery - AI for Math aims to automate these

Modeling - Definitions
Conjecturing - Candidate theorems
Reasoning - Proofs, Counter examples

Scientific discovery - automate these

Theorize - Hypothesize
Experiment - with data

Key Ideas - LLMs can help with all four capabilities

Systematically search spaces
Use prior knowledge to direct search
Learn from experience
discover abstract concepts and tools to speed up exploration and learning

AI for Mathematical Discovery

Neural only approach weakness
1. Data scarcity - need traces or reward functions that enable rigorous mathematical reasoning
2. Lack of verifiability
  1. Natural language reasoning is hard to verify
  2. In applications like system verification, edge cases are especially critical
Alternative: Formal Representations
1. Generating synthetic candidate proofs and verifying is a good strategy - alphaproof
2. COPRA: LLM Agents for formal theorem proving
  1. formal theorem + informal hints (options)
    1. Prompt Synthesis - create a prompt
    2. Tactic Parsing - ask LLM to generate tactics
    3. Execute tactic via proof environment
    4. augment the feedback to the prompt and backtrack optionally
    5. Query lemma database (and optionally write to it)
  2. adding informal thoughts added to the performance - no formal “training”
    1. ask for an informal solution
    2. ask the llm to split the theorem into relevant sub-goals (conditioned on the informal proof)
    3. ask copra to solve subproblem one by one
    4. add all relevant lemmas in the prompt along with the informal proof
  3. adding retrieval improved the performance
3. Application in Formal Verification
  1. mathematically model a system as a set of definitions
  2. model system properties as theorems involving these definitions
  3. prove the theorems
  4. E.g - A formally verified drone couldn’t be hacked by a red team over the course of months. Formal verification has been expensive to apply but LLMs change the cost-benefit significantly. Proofs of real-world systems
  5. E.g. Compiler Verification
    1. Language - syntax (form of programs) + semantics (rules of programs).
    2. Compiler: Set of rules translating programs from one language to another
    3. Verification: Prove that the rules preserve semantics
      1. Steps
        
        Define the source language
        
        Define a target language
        
        Define the Compiler
      2. Copra cannot prove it one shot. Instead
        
        Ask LLM to propose a lemma
        
        sample a proposed lemma
        
        ask copra to generate the proof for one step, and repeat

AI for Scientific Discovery

LLMs for Experimental setup and Analysis
1. Symbolic Regression (PySR) Genetic programming is an established approach of symbolic regression - random mutations entered and expressions are evaluated by a separate machinery (PySR)
2. Approach has had a lot of impact, across physics
3. PySR exploration space - Independant “islands” (exponential, logarithmic etc.) of expressions, each undergoing evolution- LLMs can increase exploration in relevant parts of the search space - as LLMs are associating code with natural language - Core idea behind the next paper
4. LaSR: Symbolic Regression with a Learned Concept Library
  1. What is Concept
    1. Desiderata 1: Symbolic [Concept] Abstraction - Observe zipf’s law, Moore’s law and deduce that a power law relation is present in them
    2. Desiderata 2: Symbolic Guidance - use background knowledge or concepts created by a physicist (or an LLM) - should guide the search for concrete equations
    3. Joint Concept and program learning
      1. Given
        
        Higher level Concepts C
        
        a dataset D
        
        A space of programmatic hypotheses pi
        
        solveLHS - jointly infer, given the data - the likely hypothesis and the likely concepts - and not just likely programs (which would be the goal of traditional symbolic learning) - we want both the programs and the concepts they represent - for more explainability
        
        By execution - If I have a particular program pi, what is the likelihood of the data - how well does the program explain the data - find out by executing the program and observe
        
        By LLM - Condition on the concept (e.g. Power law) what is the probability that the piece of code represents that concept (as the LLM contains some background knowledge for this)
        
        By LLM p(C) - what is the likelihood of concepts being true, given that LLMs have trained
        
        LaSR
        
        Central Loop - Follows the structure of an evolutionary algorithm, but perform Bayesian concept learning guided by LLMs
        
        Hypothesis Evolution
        
        We have the choice of performing evolution (find mutations) guided by either -
        
        classical symbolic evolution - random tweaks to an abstract syntax tree
        
        LLM guided evolution - “Here is a program I have, Now generate a new program based on…”, guided by a Concept library (e.g Power law suits this condition)
        
        Fitness function is used to evaluate (how well the hypothesis pool explains the data)
        
        Initialise –> evaluate – > best program –> mutate/cross-over–>replace oldest program –>program population (repeat)
        
        LLM is used to implement mutations and crossovers instead of swapping out subtrees in ASTs in traditional Genetic algorithms - similar concept used in Deepmind’s FunSearch
        
        Concept library also evolves - Evolution has two stages
        
        Concept Abstraction
        
        Use best hypothesis generated after evolution, and ask the LLM to abstract it into natural language
        
        Concept Evolution
        
        An LLM is used to “mash” concepts into new concepts
        
        Overall Performance
        
        Concept Guidance accelerates discovery
        
        LaSR outperforms PySR even with local language models
        
        Human-provided hints - “this is the kind of equation you are looking at” 0 hints increase performance, earlier in the process
        
        How do you know the derive result (e.g Coulomb’s law) not leakage from the LLM? Input is data Only - Experiments from synthetic data and synthetic equations were derived.
        
        Using LaSR derived concepts with Chinchilla (scaling law) provided better output than both of them evaluated individually.

\———
https://llmagents-learning.org/sp25

Written on May 31, 2025