Inference Time Techniques

Notes from Lecture 1 of this Open MOOC offered by Berkeley RDI

Why agents?

use external tools
allow trial and error
decompose, modularise, specialise in tasks

Topics

Fundamental reasoning techniques
LLMs for swe, maths, real world applications

-——————————————————————————–
OpenAI o1, Gemini 2.0 Flash Thinking
Core idea: trigger the LLM to generate a long chain of thought.
This can be done by:

Few-shot CoT prompting • Instruction prompting • Instruction tuning • Reinforcement learning

-——————————————————————————–

Part 1: Basic Prompting Techniques

Standard prompts - Poor performance on reasoning, hence the need for CoT.
Chain of Thought prompts - Include examples.
- How much CoT helps depends on the base model. The gain increases with model size, with a dramatic improvement above a size threshold.
- In a way, we encourage the model to think for a different amount of time based on the difficulty of the problem: a more complex question leads to more reasoning steps. Hence it promotes the use of explicit reasoning strategies such as task decomposition, planning, etc. If we have a specific reasoning strategy for a problem, we can instruct the LLM to use that strategy.
  - Least-to-Most prompt - Easy-to-Hard generalization via decomposition
    - Breaks down a problem into smaller [easier] subproblems and sequentially solves them, combines them to form the output.
  - Dynamic Least-to-Most prompt - For each subproblem the main task is decomposed into, dynamically select the most relevant examples, and then proceed to solve them sequentially.
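A minimal sketch of the plain Least-to-Most flow described above (decompose, solve in order, combine). The `llm` argument is a hypothetical callable that maps a prompt string to a completion string, and the prompt wording is an assumption made for illustration, not the wording used in the lecture.

```python
from typing import Callable, List


def least_to_most(question: str, llm: Callable[[str], str]) -> str:
    """Decompose a problem, solve the subproblems in order, then answer the original."""
    # Stage 1: ask the model for subproblems, one per line, easiest first.
    decomposition = llm(
        "Break the problem into simpler subproblems, one per line, easiest first.\n"
        f"Problem: {question}"
    )
    subproblems: List[str] = [s.strip() for s in decomposition.splitlines() if s.strip()]

    # Stage 2: solve sequentially. Each prompt carries the subproblems solved so far,
    # so later (harder) steps can build on earlier (easier) ones.
    context = f"Problem: {question}\n"
    for sub in subproblems:
        answer = llm(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"

    # Stage 3: combine, by asking the original question with all sub-answers in context.
    return llm(f"{context}Q: {question}\nA:")
```

The dynamic variant would additionally pick few-shot examples per subproblem before each call in Stage 2; that selection step is left out here.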
Zero shot CoT - “Let’s think step by step” - Triggers chain of thought generation. Performance is worse than manually written few shot CoT though.
Analogical Prompting - Best of both worlds - Few shot and zero shot CoT - Add “Recall relevant examples” to the prompt.
1. The examples are generated by the LLM itself. Outperforms zero shot and manual few shot CoT (!!).
2. For strong LLMs, self-generated examples outperform retrieved examples (on GSM8k)
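As a rough illustration, an analogical prompt can be built with a small template like the one below. The function name, the section headers and the default of three examples are assumptions for this sketch; only the “Recall relevant examples” instruction comes from the notes above.

```python
def analogical_prompt(problem: str, n_examples: int = 3) -> str:
    """Build a prompt that asks the model to write its own examples before solving."""
    return (
        f"# Problem:\n{problem}\n\n"
        "# Instructions:\n"
        f"## Recall relevant examples: describe {n_examples} relevant and distinct "
        "problems, and explain the solution to each.\n"
        "## Solve the initial problem.\n"
    )


# No hand-written or retrieved examples are needed; the model supplies them itself.
print(analogical_prompt("What is the area of a square with vertices (-2,2), (2,-2), (-2,-6), (-6,-2)?"))
```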
LLMs as prompt optimizers - Automates prompt design - Assign a score to the output generated for a set of prompts, using the “Evaluator” LLM. Use an “Optimizer” LLM to propose a new prompt, such that the score is maximized, using the scores assigned to older (prompt, solution) pairs.
- Meta-prompt for Prompt Optimization
  - Include top prompt and scores that were generated earlier.
  - Include Examples that demonstrate how the task is solved
- Outperforms “let’s think step by step” by 8% on GSM8k, matches manual CoT. It includes novel, model-specific additions to the prompt, e.g. “Take a deep breath and solve..” performs best for PaLM-2.
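The optimizer/evaluator loop can be sketched as below. `optimizer` and `evaluator` are hypothetical callables standing in for the two LLM roles (the evaluator here returns a numeric score directly), and the meta-prompt wording is an assumption, not the exact text from the lecture.

```python
from typing import Callable, List, Tuple

Scored = Tuple[str, float]  # (prompt, score)


def build_meta_prompt(history: List[Scored], examples: List[Tuple[str, str]], top_k: int = 3) -> str:
    """Meta-prompt = best earlier prompts with scores + a few task examples."""
    best = sorted(history, key=lambda ps: ps[1])[-top_k:]  # ascending, best last
    lines = ["Previous instructions with their scores (higher is better):"]
    lines += [f"text: {p}\nscore: {s}" for p, s in best]
    lines.append("Examples of the task:")
    lines += [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append("Write a new instruction that differs from the ones above and scores higher.")
    return "\n".join(lines)


def optimize_prompt(
    optimizer: Callable[[str], str],   # proposes a new prompt from the meta-prompt
    evaluator: Callable[[str], float],  # scores a prompt, e.g. accuracy on a dev set
    seed_prompt: str,
    examples: List[Tuple[str, str]],
    steps: int = 5,
) -> Scored:
    history: List[Scored] = [(seed_prompt, evaluator(seed_prompt))]
    for _ in range(steps):
        candidate = optimizer(build_meta_prompt(history, examples)).strip()
        history.append((candidate, evaluator(candidate)))
    return max(history, key=lambda ps: ps[1])
```

In practice the evaluator would run each candidate prompt over a held-out set of problems and report accuracy; that part is abstracted away here.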
Self-discover - Instruct the LLM to compose reasoning structures
- Different tasks will have different ways of decomposition into smaller tasks
- To avoid manually annotating each best-decomposition strategy, we can use LLMs
- This can be done by informing the LLM about different strategies, e.g CoT, decompositions, self-reflection etc. The model will decide which strategy is to be used.

-——————————————————————————–

Part 2: Search and Selection from multiple candidates

Generate multiple solutions for one problem. Allow the LLM to explore multiple branches to recover from errors.
How to select the best response? We may not have an ideal scorer - We should use the LLM to evaluate its own responses.
Increases the breadth of the solution space

Self-Consistency
1. Aggregate final answers - If for a question, two reasoning paths return the number 14 whereas one returns 28, 14 is selected as the response.
2. Selection is based only on the final answer, not the reasoning “path” itself
3. Significant improvement, especially on math.
4. Self-Consistency scales with more samples [ie reasoning paths]. It performs better than highest log probability-based ranking.
5. Sampling diverse responses is crucial to self-consistency
6. Consistency is highly correlated to accuracy. If more LLM responses lead to the same answer, it’s more likely to be correct.
7. Consistency based code selection in AlphaCode - Code selection based on the consistency of the execution results.
  1. Problem includes - a long description - a few test cases
  2. Filter out responses that fail the given test cases. Use a model to generate new test inputs. Execute the sampled programs on them. Cluster programs with the same outputs [they are treated as semantically equivalent]. Sample 1 program from each of the 10 largest clusters. Filtering plus clustering performs better than filtering only.
8. Limitations - how can this be done where free form text is generated?
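Points 1 and 7 above can be sketched in a few lines: a majority vote over final answers, and its execution-based analogue for code. This is a simplified sketch, assuming final answers are already extracted as strings and that candidate programs are plain Python callables; a real system would execute model-written source code in a sandbox. The function names are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple


def self_consistency(final_answers: Sequence[str]) -> str:
    """Return the most frequent final answer; the reasoning paths are ignored."""
    return Counter(final_answers).most_common(1)[0][0]


def _run(prog: Callable[[Any], Any], x: Any) -> str:
    """Run one program on one input; an exception counts as a distinct output."""
    try:
        return repr(prog(x))
    except Exception as exc:
        return f"error:{type(exc).__name__}"


def cluster_by_behavior(
    programs: Sequence[Callable[[Any], Any]], test_inputs: Sequence[Any]
) -> List[List[Callable[[Any], Any]]]:
    """Group programs whose outputs agree on every test input, largest group first."""
    clusters: Dict[Tuple[str, ...], List[Callable[[Any], Any]]] = defaultdict(list)
    for prog in programs:
        signature = tuple(_run(prog, x) for x in test_inputs)
        clusters[signature].append(prog)
    return sorted(clusters.values(), key=len, reverse=True)


# Two reasoning paths end in 14 and one in 28, so 14 is selected.
print(self_consistency(["14", "28", "14"]))
```

Both functions rely on answers being directly comparable, which is exactly the limitation noted in point 8 for free-form text.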
Universal Self-Consistency (USC)
1. “Select the most consistent response based on majority consensus” - add to prompt
2. USC improves performance on open-ended generation (summarization, QA), where self-consistency isn't applicable.
3. USC matches SC in math reasoning and coding
4. Bounded by long-context capability
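A possible way to wire up USC: concatenate all sampled responses into one prompt, ask the model to pick the most consistent one, and parse the index from its reply. The prompt wording beyond the quoted instruction, the reply format and the fallback to the first sample are assumptions of this sketch.

```python
import re
from typing import Sequence


def usc_prompt(question: str, responses: Sequence[str]) -> str:
    """Put every sampled response into one selection prompt."""
    numbered = "\n\n".join(f"Response {i}:\n{r}" for i, r in enumerate(responses, 1))
    return (
        f"I have generated the following responses to the question: {question}\n\n"
        f"{numbered}\n\n"
        "Evaluate these responses. "
        "Select the most consistent response based on majority consensus. "
        'Start your answer with "The most consistent response is Response X".'
    )


def parse_choice(reply: str, n_responses: int) -> int:
    """Return the 0-based index chosen by the model, or 0 if the reply is unparseable."""
    match = re.search(r"Response\s+(\d+)", reply)
    if match and 1 <= int(match.group(1)) <= n_responses:
        return int(match.group(1)) - 1
    return 0
```

Because every sample is pasted into a single prompt, the number of samples is limited by the context window, which is the long-context bound mentioned in point 4.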
Improve further over consistency-based selection: Training LLM rankers
1. Two approaches
  1. Outcome-supervised Reward Model (ORM) - verify final solution
  2. Process-supervised Reward Model (PRM) - verify each step, leading to the solution
    1. PRM scales better with more samples. Performance dependent on verifier. Might not generalize to other tasks
    2. Step-wise scorer can be used in an alternate way - LLM + Tree search. Prioritize partial promising solutions before exploring other output. ie. Tree-of-Thought
      1. Thought Generation - Propose Next steps to be taken
      2. Thought Evaluation - Evaluate the current state
        
        LLM can be asked to “vote” multiple times, and the most consistent output is selected - aka Voting
      3. ToT with BFS [and other search techniques e.g MCTS] scales better than standard prompting wrt token budget
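A minimal sketch of Tree-of-Thought with breadth-first search, under the assumption that partial solutions are strings. `propose`, `evaluate` and `is_solution` are hypothetical callables; in a real setup the first two would be LLM calls (thought generation and thought evaluation), and `evaluate` could average several "votes".

```python
from typing import Callable, List, Optional


def tot_bfs(
    root: str,
    propose: Callable[[str], List[str]],   # thought generation: next partial solutions
    evaluate: Callable[[str], float],      # thought evaluation: higher is more promising
    is_solution: Callable[[str], bool],
    beam_width: int = 3,
    max_depth: int = 4,
) -> Optional[str]:
    """Expand level by level, keeping only the most promising partial solutions."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        for cand in candidates:
            if is_solution(cand):
                return cand
        # Prioritize promising partial solutions; the rest are pruned.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return None  # budget exhausted without a solution
```

`beam_width` and `max_depth` together cap the number of model calls, which is how the token budget enters the comparison with standard prompting.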

-——————————————————————————–

Part 3: Iterative self-improvement

Sampling multiple solutions is suboptimal. Parallel execution doesn't allow the LLM to learn from past mistakes.
Inference time self-improvement: LLM iteratively improves its own response for the given task
Increase the depth to reach the final solution

Reflexion and Self-Refine
1. LLM generates feedback on its output.
2. Use external evaluation when available. e.g ALFWorld
3. Self-Debugging - code execution provides external evaluation feedback.
  1. Different feedback formats - each improves performance across LLMs. More detailed feedback improves performance further
    1. short universal feedback
    2. unit test execution results - pass/fail
    3. line-by-line code explanation
    4. Trace
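The self-debugging loop with unit test feedback (format 2 above) could look roughly like this. `llm` is a hypothetical prompt-to-code callable, the prompt wording is an assumption, and `exec` is used only to keep the sketch short: model-written code should be run in a sandbox.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Test = Tuple[Tuple[Any, ...], Any]  # (positional args, expected result)


def run_unit_tests(code: str, func_name: str, tests: Sequence[Test]) -> List[str]:
    """Execute the code and return one message per failing test (empty list = all pass)."""
    namespace: Dict[str, Any] = {}
    try:
        exec(code, namespace)  # assumption: trusted toy code; sandbox this in practice
        func = namespace[func_name]
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures


def self_debug(
    task: str, llm: Callable[[str], str], func_name: str, tests: Sequence[Test], max_rounds: int = 3
) -> str:
    """Generate code, then repeatedly feed execution results back until the tests pass."""
    code = llm(task)
    for _ in range(max_rounds):
        failures = run_unit_tests(code, func_name, tests)
        if not failures:
            break
        feedback = "\n".join(failures)
        code = llm(
            f"{task}\n\nPrevious code:\n{code}\n\n"
            f"Unit test feedback:\n{feedback}\n\nFix the code."
        )
    return code
```

Swapping the `feedback` string for a fixed short message, a code explanation or an execution trace gives the other feedback formats in the list.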
What if Explicit and accurate feedback (oracle verifiers) are not available
1. LLMs need to evaluate themselves. They can judge wrongly, which in turn leads to worse performance after self-correction. General-purpose feedback prompt variants do not improve performance.
2. Multi-agent debate does not improve performance over self-consistency (GSM8k) (Some papers report better performance, but they do not compare under an equal token budget)
Sequential Generation vs Parallel Generation
1. Model-specific and task-specific - Model’s self-reflection abilities
2. Simple problems - sequential is better - self-correct
3. Harder problems - There is a “middle-point” between parallel and sequential

-——————————————————————————–

https://llmagents-learning.org/sp25

-——————————————————————————–

Written on May 25, 2025