Alphaproof

Notes from Lecture 8 of this Open MOOC offered by Berkeley RDI

-–

A brief history of formalisation in mathematics What constitutes a mathematical proof evolved from words to symbols, this added -

Rigor and Clarity
Efficiency and Communication
Abstraction and Generalisation
Unification
Created new fields

Computer formalisation of Mathematics

Correctness concerns disappear - perfect verification (lean tells that it is not correct)
Use software stacks - collaboration (authors can trust the proof instead of performing manual verification), gamified “levels” of theorems
Lean aims to build Mathlib - aims to be general and unified

Reinforcement Learning

The “Zero” philosophy - If an agent can learn to master an environment tabula rase (without any information) then you demonstrably have a system that is discovering and learning new knowledge by itself
It also means that this algorithm is general and should apply to other domains
A huge apart of science is to explore areas that are not well understood and generate and discover new interesting truths.
AlphaGoZero and AlphaZero
- learned by simply playing games against itself, starting from complete random play
- rediscovered human patterns in days, and ultimately discarded some of it in preference of new discoveries
AlphaTensor - what is the most efficient way to multiply two matrices together
- Naive - n^3
- 2 by 2 can be done in 7 moves at best
- alphatensor discovered the most efficient, novel and provably correct methods for 4 by 4 and 5 by 5 and others, which were previously undiscovered. Also can access huge action spaces (10^30)
What made these systems work?
- Scaled trial and error
- Grounded feedback signal
- Search - sampling highest probability action
- Curriculum - self-play for AlphaGo gives the right opponent
Lean allows the “Zero” principles to be applied to mathematics
AlphaProof
- trial and error scaling
  - insilico environment for math
  - perfect feedback signal for proving
- Finding problems to solve
  - human defined
  - alphaproof defined
    - around human mathematics (auto-formalisation, test-time RL)
    - augment human mathematics
- Should training be done in natural language (informal) or formal maths
  - informal - large amount of data, unverifiable
  - formal - small amount of data, verifiable - bet is that this will be rewarded in the long term
- IMO - hard problems -
  - Geometry
    - mathlib is sparse in 2D Euclidean geometry
    - AlphaGeometry and AlphaProof join forces
  - Not all answers are “proofs”
    - easy mode - correct answer given by oracle, AI proves it
    - hard mode - AI determines the answer, and proves it - generate candidate answers using Gemini
  - Input Formalisation - input is given in natural language to participants, but alphaproof receives a formalised variant by lean experts
  - Protocol
    - Lean experts formalise
    - generate O(100) answer candidates with Gemini
    - Filter the easily disprovable ones
    - Run Test-time RL
  - Challenges
    - gaps in mathlib - geometry, even p1 (that it solved)
    - human data is in natural language
    - Combinatorics problems are difficult to formalise and solve
  - Methods
    - Formaliser Model - output a lean formalisation for an input problem - Auto formalisation - generate lean “variants” for it (100s that are close to the human problem)
    - Prover Model - Output N solving tactics, for an input lean state - trained on Mathlib - 300k lines of proof - learn a good prior of actions to take (supervised training) - but mathlib does not have IMO style problems
    - Train Prover model by RL - to solve for IMO problems and also generalise
      - Fir each formal problem
        
        generate experience of (dis)proving by searching over lean steps
        
        Use Lean to verify proofs
        
        Reinforce the prover network with each success
    - Prover Model + AlphaZero Search
      - Search over actions (one line in the proof)
      - Compute new lean state after every action/tactic -lean updates the state
      - Explore high prior, high value paths but also explore low visited path
    - Test-Time RL-
      - For each problem
        
        Generate variants of the problem
        
        Run RL exactly as previous step

-——–
https://rdi.berkeley.edu/adv-llm-agents/sp25

Written on May 29, 2025