Language models for autoformalization and theorem proving
Notes from Lecture 9 of this Open MOOC offered by Berkeley RDI
-–
IMO, FrontierMath datasets for math
math and coding are proxies for complex reasoning and planning
- challenging for LLMs
- Relatively easy to evaluate
- unlimited applications
How do LLMs solve math
- SFT - data focused
- RL - verifiability focused
- Supervised Finetuning on Mathematical data
- LLM pretrained on text and code ->base math LLM (using math-related web docs)
- base math LLM -> Finetuned LLM - using problems with step-by-step solution
- Finetuned LLM -> Tool-integrated LLM - using problems w/ tool integrated solutions data
- training data is of highest importance
- Human annotations with the final answers can be obtained but intermediate steps are difficult. RL can be used to train the intermediate steps - RL on Verifiable problems
- Provide the model with the question. It generates the intermediate steps and the solution - which may be correct or incorrect
- RL algorithms like GRPO optimize the model to earn high rewards - popularized by DeepSeek R1
- The solution must be verifiable i.e a number.
Gaps
- Pre-college math to Advanced math research -
- FrontierMath has numerical solutions (to evaluate them)
- Models can guess answers but are bad at proving
The missing ingredient - Formal Reasoning - grounded in first/higher-order logic and dependent type theory
- Formal systems can verify proofs and provide feedback
- Verification enables rigorous evaluation of reasoning
- Proof assistants - Interactive theorem proving
LLMs for theorem proving
- we can train LLMs to generate either
- the next step
- complete proofs
- Proof steps can be assembled into complete proofs using search algorithms
- Generate next step generate from human proofs
- Examples
- LeanDojo
- Open source LLM-based theorem prover
- data
- model checkpoints
- data extraction tools, lean interaction tools
- LeanDojo Benchmark
- Open source LLM-based theorem prover
- Retrieval Augmented Prover - allows the use of pre-existing lemma (and not start from scratch)
- Given a proof state (the goal to be proven) and a Library of lemmas
- Encode to vectors and perform vector search via cosine similarity
- concat retrieved lemmas and use an LLM to generate next step
- Typical Neural Theorem prover - initial state, generate potential search steps using dfs/ monte carlo tree search - tactic generation using formal math libraries, local context and goal, and an LLM
- Goedel-Prover - Uses expert-iteration. Add successfully generated intermediate proofs to the training set. train the prover again and iterate. This improves performance to a saturation point
- Limitations
- math “action space”, unlike chess etc. is not constrained
- Taming the action space in proving inequalities (limited domain, so limited action space) proofs in the domain are regular
- apply Am-gm inequality, Cauchy Schwarz, Titu etc
- rewrite the current problem, into a form that an existing lemma becomes applicable
- Tactic Generation and Pruning.
- Inequality proving steps are of two types
- Scaling - substitute the given inequality using a known lemma
- Rewriting - transform the given inequality into an equivalent form
- We enumerate and prune all patterns (the scaling tactics) using “symbolic” models
- Rewriting is handled by an LLM
- LIPS: LLM based inequality prover using symbolic reasoning
- surpasses imo gold medalists for inequality proving
- Hence, though the action space is infinite, domain specific constraints/patterns can be used to solve specific subsets
- Inequality proving steps are of two types
- Taming the action space in proving inequalities (limited domain, so limited action space) proofs in the domain are regular
- human created data is limited
- Orthonormalization of theorem and proofs
- theorems: informal theorem -> formal theorem
- No reliable automatic evaluation - logic inequivalence is infeasible
- Informal proofs - “left to the reader”
- Restrict to specific domain - Euclidean geometry
- LeanEuclid - Created Benchmark 48 from Euclid’s elements, 125 from unigeo dataset - logical proofs in Euclid’s poof -e.g. not imagining all cases of triangles, assuming concaveness - Diagrammatic Reasoning gaps. e.g. proving two circles intersect at a point. Avigad identified a set of hidden axioms that can be formalised.
- Equivalence checking between theorems becomes possible - Now possible because Euclidean geometry theorems are not general theorems, they take a particular form. This is done by SMT based solver
- Hence theorem statements are verifiably autoformalised
- Proof: informal theorem and proof -> formal proof
- After theorem statements are formalise, the model autoformalises the proof
- this proof is provided to Lean. Lean identifies set of reasoning gaps in the proof. If they are in agreement with the above axioms, then the formalisation is deemed as correct.
- theorems: informal theorem -> formal theorem
- Orthonormalization of theorem and proofs
- math “action space”, unlike chess etc. is not constrained
Written on May 29, 2025
