Open training recipes for reasoning in language models

Notes from Lecture 4 of this Open MOOC offered by Berkeley RDI


Olmo = Open language model
Tulu = open, reproducible training recipes

Efforts taken to improve LLM Reasoning

Pre-training

  • Olmo, Olmo 2, OLMoE, Dolma - Llama-like performance with fewer training tokens

Post-training

  • Tulu, Olmo instruct
  • without post-training, the base model does not follow instructions, is not safe, and is not good at reasoning, tool use, code execution, etc.

Test-time inference

  • self-rag

Building a modern LLM: Tulu 3: Training Recipe

Post-training Adds the below capabilities

  • Meaningful evaluations for skills - chat, reasoning, knowledge recall, instruction following
  • Prompts - representative queries for the above skills
  • Licenses
  • Decontamination (test set)
  1. Instruction Tuning - aka SFT - Finetune pretrained models to follow the instruction
    1. Data curation - curate data to target specific capabilities. Data obtained by -
      1. Human Annotation
      2. Synthetic data generation by LLMs
      3. Mixture of both - “self-instruct”, “FLAN_v1” frameworks
    2. Data Mixing: Mix data across capabilities, e.g., more CoT data in the mix yields higher scores on reasoning tasks (however, CoT data is hard to collect and scale)
      1. challenging to keep a balance between capabilities
      2. Hybrid data creation
        1. “Persona”-based data creation, e.g., “Chemical Researcher”. This improved performance on the MATH dataset, but not on gsm8k.
        2. Data quality was improved via self-consistency
          1. for every instance, an LLM generated multiple reasoning paths
          2. instances that did not receive a majority vote were removed. This filtered out 40% of the data. Training on the remaining 60%, model performance was maintained on MATH and improved on gsm8k.
        3. Other approaches to generate CoT
          1. math problems to Python code - program-aided language models
          2. self-generated CoT - dependent on base model quality
      3. Results/Ablations - removal of various datasets from the mix increases/decreases performance on eval sets
        1. persona based data synthesis is useful for targeting new skills
        2. safety training is orthogonal to other skills, it doesn’t hurt other skills
        3. real user interaction training is helpful across eval sets
  2. Preference fine-tuning using Reward models
    1. Align to human preferences - RLHF
      1. it actually improves chat style evals (chatbotarena)
      2. low improvement for math-style capabilities
      3. RLHF -
        1. RL - a policy takes an action in the environment, receives a reward for it, and the state of the world is updated
        2. RLHF -
          1. state - the prompt fed into the policy
          2. action - the response the policy generates
          3. environment - a reward model fitted on preference data scores the response; the policy is updated with PPO
        3. DPO removes RL, while optimising for human preferences
          1. PPO consistently outperforms DPO (~1%) but is costlier (complexity, memory, throughput)
          2. DPO is cheaper, so experimenting with DPO and data ablations is the practical approach
    2. For Tulu 3
      1. SFT
        1. Prompt Selection
          1. Prompts from SFT
          2. Prompts for other datasets
          3. New out-of-domain (OOD) prompts improve performance
        2. Response Generation
          1. Weak and strong models (off-policy data), together with Tulu 3 itself (on-policy data), were used to generate 4 responses for each prompt
        3. Preference annotation - RLAIF using gpt-4o
          1. Rate output based on a rubric (helpfulness, instruction following, truthfulness, honesty)
      2. Pref tuning (and ablations)
        1. Different LLM Judges - gpt-4o performed slightly better
        2. Using on-policy methods improved results
        3. Adding OOD prompts improved results
  3. RLVR - RL with verifiable reward
    1. Over-optimisation - as training steps increase, performance stays flat, drops, or improves initially and then drops
    2. Reward model - Verification function - gold final answers. No intermediate CoT
    3. Stacking RL steps - starting, cutting off, performing RLVR improves performance
    4. Base model quality improvement and CoT knowledge enabled RL against binary / sparse signals - which is basically what RLVR is
    5. RLVR works better at scale - better base model improves performance
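
The self-consistency filtering step from the data-mixing notes above can be sketched as a majority vote over sampled final answers. This is a minimal illustration, not the Tulu 3 pipeline: the instance fields (`answer`, `sampled_answers`), the function names, and the rule "keep an instance only when its stored answer wins a strict majority" are all assumptions made for the example.

```python
from collections import Counter

def majority_answer(sampled_answers):
    """Return (answer, vote_share) for the most common sampled final answer."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)

def filter_by_self_consistency(instances, min_share=0.5):
    """Keep instances whose stored answer wins more than min_share of the votes.

    Assumption: each instance is a dict with a reference "answer" and a list
    of "sampled_answers" extracted from several LLM reasoning paths.
    """
    kept = []
    for inst in instances:
        answer, share = majority_answer(inst["sampled_answers"])
        if answer == inst["answer"] and share > min_share:
            kept.append(inst)
    return kept

# Toy data (hypothetical): the second instance has no majority for its answer.
instances = [
    {"question": "2 + 2", "answer": "4", "sampled_answers": ["4", "4", "5", "4"]},
    {"question": "3 * 3", "answer": "9", "sampled_answers": ["6", "9", "6", "8"]},
]
kept = filter_by_self_consistency(instances)
```

Here only the first toy instance survives; in a real pipeline the final answers would first have to be extracted from each generated reasoning path.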

Experimental setup

  1. Tulu 3 - DPO + SFT
  2. Use targeted dataset and paired verification
    1. Simple verification - True if the answer matches the gold number, else False
    2. Complex verification
      1. constraints satisfaction e.g verify paragraph count, keywords, keyword frequency, bullet points
      2. score by the percentage of constraints that are satisfied
  3. Train with PPO
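
The two kinds of verification in the setup above could look roughly like the functions below. This is a hedged sketch: the function names, the "last number in the response" extraction rule, and the particular constraints checked are assumptions for illustration, not the verifiers used for Tulu 3.

```python
import re

def verify_final_answer(response, gold):
    """Simple verification: 1.0 if the last number in the response equals gold."""
    # Assumption: the final answer is the last number that appears in the text.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(gold) else 0.0

def verify_constraints(response, min_paragraphs, keyword, min_keyword_count, min_bullets):
    """Complex verification: fraction of formatting constraints satisfied."""
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    bullets = [ln for ln in response.splitlines() if ln.lstrip().startswith(("-", "*"))]
    checks = [
        len(paragraphs) >= min_paragraphs,                             # paragraph count
        response.lower().count(keyword.lower()) >= min_keyword_count,  # keyword frequency
        len(bullets) >= min_bullets,                                   # bullet points
    ]
    return sum(checks) / len(checks)

math_reward = verify_final_answer("The total is 1,234 apples. The answer is 42.", "42")
format_reward = verify_constraints(
    "Olmo is open.\n\n- Olmo weights\n- Olmo data",
    min_paragraphs=3, keyword="olmo", min_keyword_count=2, min_bullets=2,
)
```

Both functions look only at the final output and never at the intermediate chain of thought, which matches the "gold final answers, no intermediate CoT" note on RLVR.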

Test-Time inference

  1. Minimal recipe - “s1” paper: high quality data (s1k) + test-time scaling algorithm (budget forcing)
    1. Data
      1. Filtering data by quality -> difficulty -> diversity (remove similar) -> results in the gold dataset
      2. Distill reasoning traces and answers. e.g for a problem from the high-quality dataset
        1. annotate the reasoning CoT with Gemini/DeepSeek tokens - e.g “First… Secondly… Finally..”
        2. Annotate with thinking annotations of Gemini/DeepSeek - “Let me think more about this..”
    2. Test time scaling - Budget forcing
      1. let the model generate a response. If it tries to stop before using the allocated token budget, append the token “Wait,” so that the model generates more content
      2. the “Wait” token hints to the model that its answer may not be correct yet
      3. once the budget is reached, generation is cut off, even if the model wants to continue
      4. Accuracy increases linearly with an increase in average tokens used
      5. (Sequential) budget forcing with s1 performs better than parallel scaling via majority voting with the base model.
      6. Ablations
        1. Data Ablation - training on random data / diverse data / longest data performs worse than training on s1k
        2. Scaling Ablations - appending “Alternatively” or “Hmm”, or just letting the model keep generating, gave smaller gains than appending “Wait”
  2. Self-guided generation at inference - self-rag - train the model to criticise its own outputs by adding “Critic” tokens that indicate whether the generated/retrieved text makes sense
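
The budget forcing loop described above can be sketched with a stand-in generator. Everything here is illustrative: `generate` is a hypothetical callable (not a real library API), tokens are plain strings, and splitting the budget into `min_tokens` and `max_tokens` is an assumption made to show both the "keep thinking" and the "cut off" behaviour.

```python
def budget_forcing(generate, prompt, min_tokens, max_tokens, nudge="Wait,"):
    """Extend or truncate a reasoning trace so that it fits a token budget.

    `generate(context, max_new_tokens)` is a hypothetical callable that returns
    a list of new tokens and may stop early when the model thinks it is done.
    """
    tokens = []
    while len(tokens) < max_tokens:
        context = prompt + " " + " ".join(tokens)
        tokens.extend(generate(context, max_tokens - len(tokens)))
        if len(tokens) >= min_tokens:
            break  # enough thinking: stop extending the trace
        # The model stopped too early: append the nudge so it keeps reasoning.
        tokens.append(nudge)
    # Hard cut-off: never return more than the allocated budget.
    return tokens[:max_tokens]

def toy_generate(context, max_new_tokens):
    # Stand-in for a model: emits at most three "thought" tokens per call.
    return ["step"] * min(3, max_new_tokens)

trace = budget_forcing(toy_generate, "Q: 2+2?", min_tokens=8, max_tokens=10)
```

With the toy generator, the trace is extended twice with "Wait," before the minimum budget is met and is then cut at ten tokens. A real setup would operate on tokenizer ids and would also cap how many times the nudge is appended.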

Pre-Training

Pre-training

  • warmup
  • Cosine decay
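
A warmup followed by cosine decay can be written as a single function of the training step. This is a generic sketch of that schedule shape; the function name, the linear warmup, and the parameter values in the usage lines are assumptions, not OLMo's actual hyperparameters.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical settings: 10 warmup steps, 110 steps in total.
schedule = [lr_at_step(s, peak_lr=1.0, warmup_steps=10, total_steps=110) for s in range(111)]
```

The learning rate peaks at the end of warmup, passes through half of the peak value midway through the decay phase, and reaches `min_lr` at the final step.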

Mid-training

  • Curate very high-quality data and anneal the learning rate to 0
  • 1% of the training budget only
  • upsample high-quality, in-domain, even SFT data
  • best use of good data that isn’t enough for pretraining
  • emphasize complex, reasoning-oriented settings
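
The two mid-training ideas above, upsampling good sources and annealing the learning rate to 0, can be sketched as follows. The linear shape of the anneal, the function names, and the source names and token counts are assumptions for illustration only.

```python
def linear_anneal(step, start_lr, anneal_steps):
    """Linearly decay the learning rate from start_lr to 0 over anneal_steps."""
    remaining = max(0, anneal_steps - step)
    return start_lr * remaining / anneal_steps

def mixture_weights(token_counts, upsample):
    """Sampling probability per source after applying upsampling factors.

    Sources missing from `upsample` keep a factor of 1.0.
    """
    scaled = {name: n * upsample.get(name, 1.0) for name, n in token_counts.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

# Hypothetical mix: a small math source is upsampled 9x to match the web data.
weights = mixture_weights({"web": 90, "math": 10}, {"math": 9.0})
halfway_lr = linear_anneal(50, start_lr=1e-3, anneal_steps=100)
```

In this toy mix, the math source makes up 10% of the raw tokens but is sampled half of the time after upsampling, which is how a small, high-quality dataset can still shape the final stage of training.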

For OLMo, math data was added during mid-training, which improved gsm8k and DROP capabilities

--------

https://llmagents-learning.org/sp25
https://llmagents-learning.org/slides/OLMo-Tulu-Reasoning-Hanna.pdf

Written on May 25, 2025