Abstraction and discovery with large language model agents
Notes from Lecture 11 of this Open MOOC offered by Berkeley RDI
-–
Abstraction and Discovery with Large Language Model Agents
Mathematical Discovery - AI for Math aims to automate these
- Modeling - Definitions
- Conjecturing - Candidate theorems
- Reasoning - Proofs, Counter examples
Scientific discovery - automate these
- Theorize - Hypothesize
- Experiment - with data
Key Ideas - LLMs can help with all four capabilities
- Systematically search spaces
- Use prior knowledge to direct search
- Learn from experience
- discover abstract concepts and tools to speed up exploration and learning
AI for Mathematical Discovery
- Neural only approach weakness
- Data scarcity - need traces or reward functions that enable rigorous mathematical reasoning
- Lack of verifiability
- Natural language reasoning is hard to verify
- In applications like system verification, edge cases are especially critical
- Alternative: Formal Representations
- Generating synthetic candidate proofs and verifying is a good strategy - alphaproof
- COPRA: LLM Agents for formal theorem proving
- formal theorem + informal hints (options)
- Prompt Synthesis - create a prompt
- Tactic Parsing - ask LLM to generate tactics
- Execute tactic via proof environment
- augment the feedback to the prompt and backtrack optionally
- Query lemma database (and optionally write to it)
- adding informal thoughts added to the performance - no formal “training”
- ask for an informal solution
- ask the llm to split the theorem into relevant sub-goals (conditioned on the informal proof)
- ask copra to solve subproblem one by one
- add all relevant lemmas in the prompt along with the informal proof
- adding retrieval improved the performance
- formal theorem + informal hints (options)
- Application in Formal Verification
- mathematically model a system as a set of definitions
- model system properties as theorems involving these definitions
- prove the theorems
- E.g - A formally verified drone couldn’t be hacked by a red team over the course of months. Formal verification has been expensive to apply but LLMs change the cost-benefit significantly. Proofs of real-world systems
- E.g. Compiler Verification
- Language - syntax (form of programs) + semantics (rules of programs).
- Compiler: Set of rules translating programs from one language to another
- Verification: Prove that the rules preserve semantics
- Steps
- Define the source language
- Define a target language
- Define the Compiler
- Copra cannot prove it one shot. Instead
- Ask LLM to propose a lemma
- sample a proposed lemma
- ask copra to generate the proof for one step, and repeat
- Steps
AI for Scientific Discovery
- LLMs for Experimental setup and Analysis
- Symbolic Regression (PySR) Genetic programming is an established approach of symbolic regression - random mutations entered and expressions are evaluated by a separate machinery (PySR)
- Approach has had a lot of impact, across physics
- PySR exploration space - Independant “islands” (exponential, logarithmic etc.) of expressions, each undergoing evolution- LLMs can increase exploration in relevant parts of the search space - as LLMs are associating code with natural language - Core idea behind the next paper
- LaSR: Symbolic Regression with a Learned Concept Library
- What is Concept
- Desiderata 1: Symbolic [Concept] Abstraction - Observe zipf’s law, Moore’s law and deduce that a power law relation is present in them
- Desiderata 2: Symbolic Guidance - use background knowledge or concepts created by a physicist (or an LLM) - should guide the search for concrete equations
- Joint Concept and program learning
- Given
- Higher level Concepts C
- a dataset D
- A space of programmatic hypotheses pi
- solve
LHS - jointly infer, given the data - the likely hypothesis and the likely concepts - and not just likely programs (which would be the goal of traditional symbolic learning) - we want both the programs and the concepts they represent - for more explainability
- By execution - If I have a particular program pi, what is the likelihood of the data - how well does the program explain the data - find out by executing the program and observe
- By LLM - Condition on the concept (e.g. Power law) what is the probability that the piece of code represents that concept (as the LLM contains some background knowledge for this)
- By LLM p(C) - what is the likelihood of concepts being true, given that LLMs have trained
- LaSR
- Central Loop - Follows the structure of an evolutionary algorithm, but perform Bayesian concept learning guided by LLMs
- Hypothesis Evolution
- We have the choice of performing evolution (find mutations) guided by either -
- classical symbolic evolution - random tweaks to an abstract syntax tree
- LLM guided evolution - “Here is a program I have, Now generate a new program based on…”, guided by a Concept library (e.g Power law suits this condition)
- Fitness function is used to evaluate (how well the hypothesis pool explains the data)
- Initialise –> evaluate – > best program –> mutate/cross-over–>replace oldest program –>program population (repeat)
- LLM is used to implement mutations and crossovers instead of swapping out subtrees in ASTs in traditional Genetic algorithms - similar concept used in Deepmind’s FunSearch
- We have the choice of performing evolution (find mutations) guided by either -
- Concept library also evolves - Evolution has two stages
- Concept Abstraction
- Use best hypothesis generated after evolution, and ask the LLM to abstract it into natural language
- Concept Evolution
- An LLM is used to “mash” concepts into new concepts
- Concept Abstraction
- Overall Performance
- Concept Guidance accelerates discovery
- LaSR outperforms PySR even with local language models
- Human-provided hints - “this is the kind of equation you are looking at” 0 hints increase performance, earlier in the process
- How do you know the derive result (e.g Coulomb’s law) not leakage from the LLM? Input is data Only - Experiments from synthetic data and synthetic equations were derived.
- Using LaSR derived concepts with Chinchilla (scaling law) provided better output than both of them evaluated individually.
- Given
- What is Concept
\———
https://llmagents-learning.org/sp25
Written on May 31, 2025
