Learning to Self-Improve & Reason with LLMs

Notes from Lecture 2 of this Open MOOC offered by Berkeley RDI

Two types of reasoning to improve
System 1: Reactive and relies on associations

Directly outputs answer
Fixed compute per token
Issues like hallucination, sycophancy etc

System 2: Deliberate and effortful

Multiple system 1 calls
planning, searching, verifying, reasoning
Dynamic Computation

-——————————-
History
Pre 2023 - Improving System 1 Models
2022 - LLM Post-training (pre-o1/r1) - RLHF using PPO
2023 - LLM Post-training (pre-o1/r1) - DPO - no RL

2022 - 2023 - Improving System 2 - Spending time thinking can get you better results. For example,

Chain-of-Thought prompting, Zero-shot-CoT - “Let’s think step by step”
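As a small illustration of zero-shot CoT, the sketch below builds a direct prompt and a CoT prompt for the same question. The `Q:`/`A:` layout and the function names are assumptions made for this example; only the trigger phrase comes from the notes.

```python
def direct_prompt(question: str) -> str:
    """System 1 style: ask for the answer straight away."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase so the model reasons first."""
    return f"Q: {question}\nA: Let's think step by step."


if __name__ == "__main__":
    print(zero_shot_cot_prompt("A bat and a ball cost $1.10 in total. ..."))
```

The only difference between the two prompts is the trailing phrase, which nudges the model to spend tokens on intermediate reasoning before committing to an answer.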
Addressing System 1 Failure Modes

Chain of Verification (CoVe) - addresses hallucination. The LLM asks itself verification questions based on its own generated draft. The questions are posed as single question-answer pairs, since LLMs answer short, focused questions more accurately. The answers are then used to verify and revise the initial draft.
System 2 Attention (S2A) - addresses semantic leakage, sycophancy, etc. An LLM first rewrites the instruction, removing irrelevant sections and biases, and the answer is generated from the rewritten text.
- Semantic leakage - soft attention spreads thinly over the entire context, including the irrelevant sections. If a conversation about one city precedes a question about a different city, the model may answer about the first city, even though that is incorrect.
- Sycophancy - the model agrees with the user even when they are incorrect.
Branch-Solve-Merge - for evaluating and improving language generation. Break response evaluation down into subproblems (relevance, clarity, accuracy, originality), evaluate each separately, and merge the results.
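Chain of Verification and System 2 Attention, described above, are both pipelines of several plain LLM calls. The sketch below shows one possible shape for each. `LLM` is a hypothetical prompt-in, text-out callable, and all prompt wording is an assumption, not the prompts used in the original papers.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns a completion


def chain_of_verification(question: str, llm: LLM) -> str:
    """Draft, plan verification questions, answer them one by one, then revise."""
    draft = llm(f"Answer the question.\nQuestion: {question}")
    plan = llm(
        "List short fact-checking questions, one per line, that would verify "
        f"this draft.\nQuestion: {question}\nDraft: {draft}"
    )
    checks = [q.strip() for q in plan.splitlines() if q.strip()]
    # Each check is answered on its own (assumption: without showing the draft),
    # so mistakes in the draft are not simply repeated.
    qa_pairs = [(q, llm(f"Answer briefly.\nQuestion: {q}")) for q in checks]
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        "Revise the draft so it is consistent with the verified facts.\n"
        f"Question: {question}\nDraft: {draft}\nVerified facts:\n{evidence}"
    )


def system2_attention(context: str, question: str, llm: LLM) -> str:
    """Rewrite the context to drop irrelevant or biased text, then answer."""
    cleaned = llm(
        "Rewrite the text below, keeping only the parts relevant to the "
        "question and dropping opinions or leading statements.\n"
        f"Text: {context}\nQuestion: {question}"
    )
    # The final answer only ever sees the cleaned context.
    return llm(f"Context: {cleaned}\nQuestion: {question}\nAnswer:")
```

Both functions spend extra System 1 calls to get a more reliable final answer, which is the System 2 pattern from the start of these notes.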

-——————————-

2023+ - Training models end to end - Improving reasoning through optimisation

Self-Rewarding LMs - the LLM improves itself by assigning rewards to its own outputs and optimizing on them. Standard RLHF uses humans to create preference pairs (X, Y) and (X, Y’). However, LLMs are now better than most human labelers, and improving models further requires higher-quality labels. A small set of high-quality human-generated labels can be used for training, while bad samples are rejected. Evaluations generated by LLMs can then be used to improve LLMs.
Approach
1. Train a self-rewarding LM that has instruction-following capabilities.
2. Train the same self-rewarding LM to have evaluation capabilities.
3. Data creation and data curation, followed by training, improves both of the above capabilities and closes the loop.
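The data creation and curation step can be sketched as follows. This is a minimal illustration, assuming the same model sits behind both the hypothetical `generate` and `judge` callables, and that the highest- and lowest-scored candidates form the preference pair. The function name and the default of four candidates are also assumptions.

```python
from typing import Callable, Optional, Tuple


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: sample one response
    judge: Callable[[str, str], float],  # hypothetical: self-assigned score
    n_candidates: int = 4,
) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected), or None if the scores do not differ."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    # Score every candidate with the model acting as its own judge.
    scored = sorted(
        ((judge(prompt, c), c) for c in candidates), key=lambda sc: sc[0]
    )
    (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]
    if high_score == low_score:
        return None  # curation: no preference signal, so drop the prompt
    return prompt, chosen, rejected
```

The resulting pairs would feed a preference-optimization step (for example DPO), and the newly trained model then generates and judges the next round of data.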
Experiments

Base model - Llama 2 70B
Train M0 using seed Instruction Fine Tuning and Evaluation FT data
1. Seed IFT data: Instruction following data from OpenAssistant. Format - Input - user instruction. Output - response
2. Seed EFT data: Evaluation data from OpenAssistant. Format - Input - user instruction, model response, scoring rubrics. Output - CoT reasoning, final score
Iterate through the training loop. On AlpacaEval, the Llama 2 70B model was able to match GPT-4 0613 after 3 iterations. It improves on Humanities, STEM, and Extraction, but not on Coding or Math. LLM-as-a-Judge might not be good at evaluating coding, math, and reasoning.
Correlation with human judgement, as measured by pairwise accuracy, increases across iterations, showing that the model also improves as a reward model.
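Pairwise accuracy, the metric mentioned above, can be computed as the fraction of response pairs that the model orders the same way as humans do. The sketch below is one plausible definition. Skipping pairs that humans score equally is an assumption of this example.

```python
from typing import Sequence


def pairwise_accuracy(
    human_scores: Sequence[float], model_scores: Sequence[float]
) -> float:
    """Fraction of response pairs ranked in the same order by model and humans."""
    agree = total = 0
    n = len(human_scores)
    for i in range(n):
        for j in range(i + 1, n):
            h = human_scores[i] - human_scores[j]
            m = model_scores[i] - model_scores[j]
            if h == 0:
                continue  # assumption: pairs tied for humans are ignored
            total += 1
            if h * m > 0:  # same sign means the same ordering
                agree += 1
    return agree / total if total else 0.0
```

For instance, with human scores `[1, 2, 3]` and model scores `[1, 3, 2]`, the model agrees on two of the three pairs.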

Iterative Reasoning Preference Optimisation (IRPO) (April 2024)

To improve reasoning through preference optimization:
1. Generate multiple CoTs and answers per training example using the current model
2. Build preference pairs based on whether the answer is correct or incorrect
3. Train with DPO + NLL, then repeat steps 1-3
4. Negative examples are crucial. SFT alone assigns similar probability to the chosen and rejected generations from the DPO pairs, while DPO + NLL pushes the negative examples down
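Steps 2 and 3 above can be sketched in plain Python. `build_pairs` splits sampled CoTs by answer correctness, and `dpo_nll_loss` combines the standard DPO term with a length-normalized NLL term on the chosen sequence. The pairing scheme, the `beta` and `alpha` defaults, and the exact weighting of the two terms are assumptions for this example. The log-probabilities are taken as given inputs.

```python
import math
from typing import List, Tuple


def build_pairs(samples: List[Tuple[str, str]], gold: str) -> List[Tuple[str, str]]:
    """samples are (chain_of_thought, final_answer); returns (chosen, rejected) CoTs."""
    correct = [cot for cot, answer in samples if answer == gold]
    wrong = [cot for cot, answer in samples if answer != gold]
    # Assumption: pair correct and incorrect samples one-to-one.
    return list(zip(correct, wrong))


def dpo_nll_loss(
    logp_w: float, logp_l: float,          # policy log-probs: chosen, rejected
    ref_logp_w: float, ref_logp_l: float,  # reference-model log-probs
    len_w: int,                            # token length of the chosen sequence
    beta: float = 0.1,
    alpha: float = 1.0,
) -> float:
    """DPO loss plus length-normalized NLL on the chosen sequence."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    nll = -logp_w / len_w
    return nll + alpha * dpo
```

The DPO term only cares about the gap between chosen and rejected, so the NLL term is what keeps the probability of the chosen sequence itself from drifting down.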

DeepSeek-R1 - (Jan 2025)
1. Model provides an answer in a specific format
2. Rule-based verification of the answer's correctness, giving verifiable rewards
3. Apply in a loop: generate CoT, check correctness, reward correctness - using GRPO

Training allows exploration of the CoT space
CoT gets longer and more sophisticated as training goes on
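The rule-based reward and the group-relative part of GRPO can be sketched as below. The `<think>`/`<answer>` tag layout, the 0/1 reward values, and the use of the population standard deviation are assumptions for illustration. Only the advantage computation is shown, not the policy update.

```python
import re
from statistics import mean, pstdev
from typing import List

# Assumed output format: reasoning in <think>, final answer in <answer>.
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(completion: str, gold: str) -> float:
    """1.0 if the completion follows the format and the answer matches, else 0.0."""
    match = ANSWER_RE.fullmatch(completion.strip())
    if match is None:
        return 0.0  # wrong format, so the answer cannot be verified
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with rewards `[1, 0, 0, 1]`, correct completions get an advantage close to +1 and incorrect ones close to -1. If every sample in the group gets the same reward, all advantages are zero and the prompt contributes no learning signal.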

Thinking LLMs: general instruction following with Thought Generation (Oct 2024)

Introduces Thought Preference Optimisation (TPO)
- Compared to the direct baseline (no chain of thought), the model initially becomes worse when prompted to think - language models are already well fine-tuned to respond without CoT
- After training on successful CoT outcomes, using an LLM-as-a-Judge reward model, the gap closes and the model improves over the baseline

Extends CoT beyond reasoning tasks to general instruction following
No verifiable reward, so LLM-as-a-Judge is used
Differs from previous work in that CoTs are available (due to the thought prompt)
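One way to picture TPO data creation: sample several thought-plus-response outputs, score them with a judge, and keep the best and worst as a preference pair. In this sketch the judge sees only the response part while the full output (thought included) goes into the pair. That choice, the `Response:` delimiter, and the function names are assumptions of this example.

```python
from typing import Callable, List, Optional, Tuple

RESPONSE_MARKER = "Response:"  # hypothetical delimiter set by the thought prompt


def split_thought(output: str) -> Tuple[str, str]:
    """Split a model output into (thought, response)."""
    thought, sep, response = output.partition(RESPONSE_MARKER)
    if not sep:
        return "", output.strip()  # no marker: treat everything as the response
    return thought.strip(), response.strip()


def tpo_pair(
    prompt: str,
    outputs: List[str],
    judge: Callable[[str, str], float],  # hypothetical LLM-as-a-Judge scorer
) -> Optional[Tuple[str, str]]:
    """Return (chosen_output, rejected_output) including thoughts, or None on a tie."""
    scored = [(judge(prompt, split_thought(o)[1]), o) for o in outputs]
    best = max(scored, key=lambda s: s[0])
    worst = min(scored, key=lambda s: s[0])
    if best[0] == worst[0]:
        return None
    return best[1], worst[1]
```

Because only the outcome is scored, the thoughts are optimized indirectly: a thought is reinforced when it leads to a response the judge prefers.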

Meta-Rewarding LLMs

LLM improves its own judgements by judging its judgements
Input - user instruction, response, judgement A, judgement B
Make N judgements for a given pair of responses and calculate pairwise meta-judgements
Compute Elo scores for judgements via this matrix
Create LLM-as-a-Judge preference pairs via Elo scores

Improves performance over iterations
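The Elo step can be sketched with a plain iterative Elo update over the pairwise meta-judgement matrix. Here `wins[i][j]` counts how often judgement `i` was preferred over judgement `j`. The update rule, `k`, the number of rounds, and the base rating are assumptions. The original work may fit the scores differently.

```python
from typing import List, Tuple


def elo_from_wins(
    wins: List[List[int]], k: float = 16.0, rounds: int = 10, base: float = 1000.0
) -> List[float]:
    """Replay every recorded win several times with the standard Elo update."""
    n = len(wins)
    ratings = [base] * n
    for _ in range(rounds):
        for i in range(n):
            for j in range(n):
                for _ in range(wins[i][j]):  # each time judgement i beat j
                    expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400))
                    delta = k * (1.0 - expected_i)
                    ratings[i] += delta
                    ratings[j] -= delta
    return ratings


def judgement_preference_pair(
    judgements: List[str], ratings: List[float]
) -> Tuple[str, str]:
    """Chosen = highest-rated judgement, rejected = lowest-rated judgement."""
    best = max(range(len(ratings)), key=ratings.__getitem__)
    worst = min(range(len(ratings)), key=ratings.__getitem__)
    return judgements[best], judgements[worst]
```

The chosen and rejected judgements then serve as an LLM-as-a-Judge preference pair, so the model is trained on how to judge as well as how to respond.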

EvalPlanner: Learning to Plan and Reason for Evaluation with Thinking LLMs as a Judge

Answers how to improve CoTs for Judgement of responses
Generate synthetic reasoning data. By synthetically creating high- and low-quality responses to a prompt, the evaluation task (A or B?) can be converted into a verifiable task
- Generate a good response y to prompt x using an LLM
- Generate a similar prompt x’ and a response y’ to it, which serves as the bad response to x - “generate highly relevant but not semantically identical instruction”
- Iterative training:
  - Generate judgements and reward those that prefer y over y’
  - Train Thinking-LLM-as-a-Judge with this data and reward
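The synthetic data recipe above can be sketched as follows. `llm` is a hypothetical prompt-in, text-out callable, the prompt wording is an assumption apart from the quoted instruction, and the reward check assumes the judge ends with a verdict of `A` or `B`.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns a completion


def make_training_example(x: str, llm: LLM) -> Dict[str, str]:
    """Build one (prompt, chosen, rejected) triple with a known correct verdict."""
    y_good = llm(f"Respond to the instruction.\nInstruction: {x}")
    x_alt = llm(
        "Generate a highly relevant but not semantically identical instruction.\n"
        f"Instruction: {x}"
    )
    # A good answer to the altered instruction is a plausible but wrong answer to x.
    y_bad = llm(f"Respond to the instruction.\nInstruction: {x_alt}")
    return {"prompt": x, "chosen": y_good, "rejected": y_bad}


def judgement_reward(verdict: str, chosen_slot: str) -> float:
    """Verifiable reward: 1.0 if the judge picked the slot holding the chosen response."""
    return 1.0 if verdict.strip().upper() == chosen_slot else 0.0
```

Since the preferred response is known by construction, a judgement can be checked automatically, which is what turns evaluation into a verifiable task.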
Ablations
- plans are superior to no thoughts
- plans should be unconstrained (not Lists or Verification questions)
SOTA on RewardBench

-———————–
https://rdi.berkeley.edu/adv-llm-agents/slides/Jason-Weston-Reasoning-Alignment-Berkeley-Talk.pdf
https://llmagents-learning.org/sp25
-———————–

Written on May 25, 2025