Open training recipes for reasoning in language models
Notes from Lecture 4 of this Open MOOC offered by Berkeley RDI
OLMo = Open Language Model
Tulu = open, reproducible training recipes
Efforts taken to improve LLM Reasoning
Pre-training
- OLMo, OLMo 2, OLMoE, Dolma - Llama-like performance with fewer training tokens
Post-training
- Tulu, OLMo Instruct
- Without post-training, a pre-trained model does not follow instructions, is not safe, and is not good at reasoning, tool use, code execution, etc.
Test-time inference
- Self-RAG
Building a modern LLM: Tulu 3: Training Recipe
Post-training adds the capabilities below
- Meaningful evaluations for skills - chat, reasoning, knowledge recall, instruction following
- Prompts - representative queries for the above skills
- Licenses
- Decontamination (test set)
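A minimal sketch of what test-set decontamination can look like, assuming a simple word-level n-gram overlap check. The lecture notes do not say which matching method was used, so the function names, the whitespace tokenisation, and the default `n=8` are all illustrative assumptions.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_prompts, test_prompts, n=8):
    """Drop training prompts that share any n-gram with an eval prompt.

    Assumption: exact n-gram overlap is the contamination signal.
    """
    test_grams = set()
    for prompt in test_prompts:
        test_grams |= ngrams(prompt, n)
    return [p for p in train_prompts if not (ngrams(p, n) & test_grams)]
```

A smaller `n` removes more aggressively (more false positives); a larger `n` only catches near-verbatim copies of eval prompts.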
- Instruction Tuning - aka SFT - fine-tune pre-trained models to follow instructions
- Data curation - curate data to target specific capabilities. Data obtained by -
- Human Annotation
- Synthetic data generation by LLMs
- Mixture of both - “self-instruct”, “FLAN_v1” frameworks
- Data Mixing: Mix data across capabilities, e.g. more CoT data in the mix results in higher scores on reasoning tasks (however, CoT data is hard to collect and scale)
- challenging to keep a balance between capabilities
- Hybrid data creation
- “Persona”-based data creation, e.g. “Chemical Researcher”. Improved performance on the MATH dataset, but not on GSM8K.
- Data quality was improved via self-consistency
- for every instance, the LLM generated multiple reasoning paths
- instances that did not receive a majority vote were removed. This filtered out 40% of the data. Training on the remaining 60%, model performance was maintained on MATH and improved on GSM8K.
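The self-consistency filter above can be sketched roughly as follows. This assumes each instance stores its reference `answer` plus a list of `sampled_answers` (the final answers of the multiple reasoning paths), and that an instance is kept only when a strict majority of samples agree with the stored answer. The field names and the `min_share` threshold are hypothetical.

```python
from collections import Counter


def majority_answer(answers, min_share=0.5):
    """Most common answer if it wins more than min_share of the votes, else None."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(answers) > min_share else None


def filter_by_self_consistency(instances, min_share=0.5):
    """Keep instances whose stored answer matches the majority-voted answer."""
    kept = []
    for inst in instances:
        winner = majority_answer(inst["sampled_answers"], min_share)
        if winner is not None and winner == inst["answer"]:
            kept.append(inst)
    return kept
```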
- Other approaches to generate CoT
- math problems to Python code - program-aided language models
- self-generated CoT - dependent on base model quality
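A toy illustration of the program-aided idea: the model writes its reasoning as Python, and the interpreter (not the model) computes the final answer. The word problem, the `generated_program` string, and the `run_program` helper are all made up for this sketch; stripping builtins is not a real sandbox.

```python
# Hypothetical word problem:
# "A baker makes 12 trays of 8 muffins and sells 70. How many are left?"
generated_program = """
trays = 12
muffins_per_tray = 8
sold = 70
answer = trays * muffins_per_tray - sold
"""


def run_program(source):
    """Execute model-written code and read back the `answer` variable."""
    namespace = {}
    # Builtins are stripped only to keep the toy example contained.
    exec(source, {"__builtins__": {}}, namespace)
    return namespace["answer"]


result = run_program(generated_program)
```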
- Results/Ablations - removal of various datasets from the mix increases/decreases performance on eval sets
- persona-based data synthesis is useful for targeting new skills
- safety training is orthogonal to other skills; it doesn’t hurt them
- real user interaction training is helpful across eval sets
- Preference fine-tuning using Reward models
- Align to human preferences - RLHF
- it actually improves chat-style evals (Chatbot Arena)
- low improvement for math-style capabilities
- RLHF -
- RL - a policy performs an action in the environment, gets a reward based on it, and then the state of the world is updated
- RLHF -
- prompt - the state is updated by the prompts fed into the policy
- action - the response the policy generates
- environment - a reward model fitted on preference data scores the response; the policy is optimised with PPO
- DPO removes the RL step while still optimising for human preferences
- PPO consistently outperforms DPO (~1%) but is costlier (complexity, memory, throughput)
- DPO is cheaper, so experimenting with DPO and data ablations is the practical approach
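To make "DPO removes RL" concrete, here is the standard DPO loss for a single preference pair in plain Python. Inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The `beta=0.1` default is only an illustrative value, not something taken from the lecture.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(margin)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == softplus(-m)
    return softplus(-margin)
```

No reward model and no sampling loop are needed: the loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.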
- For Tulu 3
- SFT
- Prompt Selection
- Prompts from SFT
- Prompts for other datasets
- New out-of-domain (OOD) prompts improve performance
- Response Generation
- Weak and strong models (off-policy data), along with Tulu 3 itself (on-policy data), were used to generate 4 responses for each prompt
- Preference annotation - RLAIF using GPT-4o
- Rate each output against a rubric (helpfulness, instruction following, truthfulness, honesty)
- Pref tuning (and ablations)
- Different LLM judges - GPT-4o performed slightly better
- Using on-policy methods improved results
- Adding OOD prompts improved results
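A rough sketch of how rubric ratings could be turned into a (chosen, rejected) pair for preference tuning. It assumes the judge returns a numeric score per rubric aspect, that the aspects are averaged, that the top-rated response becomes `chosen`, and that `rejected` is sampled from the lower-rated ones. The field names and this pairing rule are assumptions for illustration, not something stated in the lecture notes.

```python
import random
from statistics import mean

ASPECTS = ("helpfulness", "instruction_following", "truthfulness", "honesty")


def build_preference_pair(prompt, rated_responses, rng=None):
    """Return {'prompt', 'chosen', 'rejected'} or None if all ratings tie."""
    rng = rng or random.Random(0)
    scored = [
        (mean([r["scores"][a] for a in ASPECTS]), r["text"])
        for r in rated_responses
    ]
    best_score, chosen = max(scored, key=lambda s: s[0])
    lower = [text for score, text in scored if score < best_score]
    if not lower:
        return None  # no usable preference signal for this prompt
    return {"prompt": prompt, "chosen": chosen, "rejected": rng.choice(lower)}
```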
- RLVR - RL with verifiable reward
- Over-optimisation - with more training steps, performance goes flat, drops, or improves initially and then drops
- Reward model - replaced by a verification function that checks against gold final answers. Intermediate CoT is not checked
- Stacking RL steps - starting, cutting off, performing RLVR improves performance
- Base model quality improvement and CoT knowledge enabled RL against binary / sparse signals - which is basically what RLVR is
- RLVR works better at scale - better base model improves performance
Experimental setup
- Tulu 3 - DPO + SFT
- Use targeted dataset and paired verification
- Simple verification - True if the answer matches the gold number, else False
- Complex verification
- constraint satisfaction, e.g. verify paragraph count, keywords, keyword frequency, bullet points
- score by what percentage of the constraints are satisfied
- Train with PPO
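The two kinds of verification above can be sketched as plain functions that return a reward. This is a minimal illustration only: the answer-extraction rule (take the last number in the response), the parameter names, and the specific constraint checks are assumptions, not the actual Tulu 3 verifiers.

```python
import re


def verify_final_answer(response, gold):
    """Simple verification: 1.0 if the last number in the response equals gold."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(gold) else 0.0


def verify_constraints(response, num_paragraphs=None, keyword=None,
                       min_keyword_count=1, num_bullets=None):
    """Complex verification: fraction of the requested constraints satisfied."""
    checks = []
    if num_paragraphs is not None:
        paragraphs = [p for p in re.split(r"\n\s*\n", response) if p.strip()]
        checks.append(len(paragraphs) == num_paragraphs)
    if keyword is not None:
        count = response.lower().count(keyword.lower())
        checks.append(count >= min_keyword_count)
    if num_bullets is not None:
        bullets = [ln for ln in response.splitlines() if ln.lstrip().startswith("- ")]
        checks.append(len(bullets) == num_bullets)
    return sum(checks) / len(checks) if checks else 0.0
```

Only the final answer or surface form is checked, never the intermediate CoT, which matches the sparse signal described in the RLVR notes.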
Test-Time inference
- Minimal recipe - “s1” paper: high quality data (s1k) + test-time scaling algorithm (budget forcing)
- Data
- Filtering data by quality -> difficulty -> diversity (remove similar) -> results in the gold dataset
- Distill reasoning traces and answers, e.g. for a problem from the high-quality dataset
- annotate the reasoning CoT with Gemini/DeepSeek tokens - e.g “First… Secondly… Finally..”
- Annotate with thinking annotations of Gemini/DeepSeek - “Let me think more about this..”
- Test time scaling - Budget forcing
- Let the model generate a response. If it stops before hitting the allocated token budget, append the token “Wait,” so that the model generates more content
- The “Wait” token hints to the model that its answer may not be correct yet
- Once the budget is reached, we do not let the model generate more tokens, even if it wants to
- Accuracy increases linearly with an increase in average tokens used
- (Sequential) budget forcing with s1 performs better than parallel scaling via majority voting using the base model.
- Ablations
- Data ablation - training on random data / diverse data / longest data performs worse than s1k
- Scaling ablations - appending “Alternatively” or “Hmm”, or just letting the model generate, gave smaller gains than appending “Wait”
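The budget-forcing loop described above can be sketched around a model call. Here `generate(text, max_new_tokens)` is a hypothetical callable returning `(continuation, tokens_used)`; it is not a real library API. The `max_nudges` cap is also an assumption, added so the loop cannot nudge forever.

```python
def budget_forced_generate(generate, prompt, budget, nudge="Wait,", max_nudges=2):
    """Keep the model reasoning until `budget` tokens are used, never beyond.

    `generate` is a hypothetical model call: (text, max_new_tokens) ->
    (continuation, tokens_used). Nudge tokens are not counted against the
    budget in this simplified sketch.
    """
    trace, used, nudges = "", 0, 0
    while True:
        # Hard cap: never ask for more than what is left of the budget.
        continuation, n = generate(prompt + trace, budget - used)
        trace += continuation
        used += n
        if used >= budget or nudges >= max_nudges or n == 0:
            return trace, used
        # The model stopped early: append the nudge so it keeps thinking.
        trace += " " + nudge
        nudges += 1
```

Swapping `nudge` for “Alternatively” or “Hmm” gives the variants compared in the scaling ablations.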
- Self-guided generation at inference - Self-RAG - train the model to criticise its own outputs by adding “Critic” tokens that indicate whether the generated/retrieved text makes sense
Pre-Training
Pre-training
- warmup
- Cosine decay
Mid-training
- Curate very high-quality data and anneal the learning rate to 0
- 1% training budget only
- upsample high-quality, in-domain, even SFT data
- best use of good data that isn’t enough for pretraining
- emphasize complex, reasoning-oriented settings
For OLMo, math data was added during mid-training, which improved performance on GSM8K and DROP
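A small sketch of the learning-rate shapes mentioned in this section: linear warmup followed by cosine decay for pre-training, then an anneal to 0 for mid-training. The notes only say "anneal to 0", so the linear shape of the anneal, the function names, and all parameters are assumptions.

```python
import math


def pretrain_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def midtrain_lr(step, start_lr, total_steps):
    """Anneal from start_lr to 0 (linear shape is an assumption)."""
    return start_lr * max(0.0, 1 - step / total_steps)
```

Mid-training would start from wherever the pre-training schedule left off and spend its small budget on the curated high-quality mix while the rate falls to zero.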
----------
https://llmagents-learning.org/sp25
https://llmagents-learning.org/slides/OLMo-Tulu-Reasoning-Hanna.pdf
Written on May 25, 2025
