Abstraction and discovery with large language model agents

Notes from Lecture 11 of this Open MOOC offered by Berkeley RDI

-–

Abstraction and Discovery with Large Language Model Agents
Mathematical Discovery - AI for Math aims to automate these

  1. Modeling - Definitions
  2. Conjecturing - Candidate theorems
  3. Reasoning - Proofs, Counter examples

Scientific discovery - automate these

  1. Theorize - Hypothesize
  2. Experiment - with data

Key Ideas - LLMs can help with all four capabilities

  • Systematically search spaces
  • Use prior knowledge to direct search
  • Learn from experience
  • discover abstract concepts and tools to speed up exploration and learning

AI for Mathematical Discovery

  1. Neural only approach weakness
    1. Data scarcity - need traces or reward functions that enable rigorous mathematical reasoning
    2. Lack of verifiability
      1. Natural language reasoning is hard to verify
      2. In applications like system verification, edge cases are especially critical
  2. Alternative: Formal Representations
    1. Generating synthetic candidate proofs and verifying is a good strategy - alphaproof
    2. COPRA: LLM Agents for formal theorem proving
      1. formal theorem + informal hints (options)
        1. Prompt Synthesis - create a prompt
        2. Tactic Parsing - ask LLM to generate tactics
        3. Execute tactic via proof environment
        4. augment the feedback to the prompt and backtrack optionally
        5. Query lemma database (and optionally write to it)
      2. adding informal thoughts added to the performance - no formal “training”
        1. ask for an informal solution
        2. ask the llm to split the theorem into relevant sub-goals (conditioned on the informal proof)
        3. ask copra to solve subproblem one by one
        4. add all relevant lemmas in the prompt along with the informal proof
      3. adding retrieval improved the performance
    3. Application in Formal Verification
      1. mathematically model a system as a set of definitions
      2. model system properties as theorems involving these definitions
      3. prove the theorems
      4. E.g - A formally verified drone couldn’t be hacked by a red team over the course of months. Formal verification has been expensive to apply but LLMs change the cost-benefit significantly. Proofs of real-world systems
      5. E.g. Compiler Verification
        1. Language - syntax (form of programs) + semantics (rules of programs).
        2. Compiler: Set of rules translating programs from one language to another
        3. Verification: Prove that the rules preserve semantics
          1. Steps
            1. Define the source language
            2. Define a target language
            3. Define the Compiler
          2. Copra cannot prove it one shot. Instead
            1. Ask LLM to propose a lemma
            2. sample a proposed lemma
            3. ask copra to generate the proof for one step, and repeat

AI for Scientific Discovery

  1. LLMs for Experimental setup and Analysis
    1. Symbolic Regression (PySR) Genetic programming is an established approach of symbolic regression - random mutations entered and expressions are evaluated by a separate machinery (PySR)
    2. Approach has had a lot of impact, across physics
    3. PySR exploration space - Independant “islands” (exponential, logarithmic etc.) of expressions, each undergoing evolution- LLMs can increase exploration in relevant parts of the search space - as LLMs are associating code with natural language - Core idea behind the next paper
    4. LaSR: Symbolic Regression with a Learned Concept Library
      1. What is Concept
        1. Desiderata 1: Symbolic [Concept] Abstraction - Observe zipf’s law, Moore’s law and deduce that a power law relation is present in them
        2. Desiderata 2: Symbolic Guidance - use background knowledge or concepts created by a physicist (or an LLM) - should guide the search for concrete equations
        3. Joint Concept and program learning
          1. Given
            1. Higher level Concepts C
            2. a dataset D
            3. A space of programmatic hypotheses pi
            4. solveLHS - jointly infer, given the data - the likely hypothesis and the likely concepts - and not just likely programs (which would be the goal of traditional symbolic learning) - we want both the programs and the concepts they represent - for more explainability
            5. By execution - If I have a particular program pi, what is the likelihood of the data - how well does the program explain the data - find out by executing the program and observe
            6. By LLM - Condition on the concept (e.g. Power law) what is the probability that the piece of code represents that concept (as the LLM contains some background knowledge for this)
            7. By LLM p(C) - what is the likelihood of concepts being true, given that LLMs have trained
            8. LaSR
              1. Central Loop - Follows the structure of an evolutionary algorithm, but perform Bayesian concept learning guided by LLMs
              2. Hypothesis Evolution
                1. We have the choice of performing evolution (find mutations) guided by either -
                  1. classical symbolic evolution - random tweaks to an abstract syntax tree
                  2. LLM guided evolution - “Here is a program I have, Now generate a new program based on…”, guided by a Concept library (e.g Power law suits this condition)
                2. Fitness function is used to evaluate (how well the hypothesis pool explains the data)
                3. Initialise –> evaluate – > best program –> mutate/cross-over–>replace oldest program –>program population (repeat)
                4. LLM is used to implement mutations and crossovers instead of swapping out subtrees in ASTs in traditional Genetic algorithms - similar concept used in Deepmind’s FunSearch
              3. Concept library also evolves - Evolution has two stages
                1. Concept Abstraction
                  1. Use best hypothesis generated after evolution, and ask the LLM to abstract it into natural language
                2. Concept Evolution
                  1. An LLM is used to “mash” concepts into new concepts
              4. Overall Performance
                1. Concept Guidance accelerates discovery
                2. LaSR outperforms PySR even with local language models
                3. Human-provided hints - “this is the kind of equation you are looking at” 0 hints increase performance, earlier in the process
                4. How do you know the derive result (e.g Coulomb’s law) not leakage from the LLM? Input is data Only - Experiments from synthetic data and synthetic equations were derived.
                5. Using LaSR derived concepts with Chinchilla (scaling law) provided better output than both of them evaluated individually.

\———
https://llmagents-learning.org/sp25

Written on May 31, 2025