Inference Time Techniques
Notes from Lecture 1 of this Open MOOC offered by Berkeley RDI
Why agents?
- use external tools
- allow trial and error
- decompose, modularise, specialise in tasks
Topics
- Fundamental reasoning techniques
- LLMs for software engineering (SWE), maths, and real-world applications
-——————————————————————————–
OpenAI o1, Gemini 2.0 Flash Thinking
Core idea: Trigger the LLM to generate a long chain of thought (CoT).
This can be done by:
- Few-shot CoT prompting
- Instruction prompting
- Instruction tuning
- Reinforcement learning
-——————————————————————————–
Part 1: Basic Prompting Techniques
- Standard prompts - Poor performance on reasoning, hence the need for CoT.
- Chain-of-Thought (CoT) prompts - Include examples that show the reasoning steps.
- CoT performance depends on the base model. It improves with model size, with a dramatic improvement above a certain size threshold.
- In a way, we encourage the model to think for a different amount of time based on the difficulty of the problem. More complex question -> more reasoning steps. Hence CoT promotes the use of explicit reasoning strategies such as task decomposition and planning. If we have a specific reasoning strategy for a problem, we can instruct the LLM to use that strategy.
- Least-to-Most prompt - Easy-to-Hard generalization via decomposition
- Breaks down a problem into smaller [easier] subproblems, solves them sequentially, and combines the results to form the output.
- Dynamic Least-to-Most prompt - For each subproblem that the main task is decomposed into, dynamically select the most relevant examples, and then solve the subproblems sequentially.
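The decompose-then-solve flow above can be sketched in a few lines. This is a minimal sketch, not the method from the lecture: `llm` is a hypothetical callable that maps a prompt string to a completion, the prompt wording is my own, and I assume the last subproblem is the original question, so its answer is returned as the final output.

```python
from typing import Callable, List


def least_to_most(question: str, llm: Callable[[str], str]) -> str:
    """Two-stage least-to-most prompting with a hypothetical `llm` callable."""
    # Stage 1: ask the model to decompose the problem, easiest subproblem first.
    decomposition = llm(
        "Break this problem into simpler subproblems, one per line, easiest first.\n"
        f"Problem: {question}"
    )
    subproblems: List[str] = [
        line.strip() for line in decomposition.splitlines() if line.strip()
    ]

    # Stage 2: solve subproblems in order; each prompt carries earlier answers.
    context = f"Problem: {question}\n"
    answer = ""
    for sub in subproblems:
        answer = llm(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"
    return answer
```

The key design point is that the context grows with each solved subproblem, so harder subproblems can build on the answers to easier ones.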
- Zero-shot CoT - “Let’s think step by step” - Triggers chain-of-thought generation. Performance is worse than manually written few-shot CoT, though.
- Analogical Prompting - Best of both worlds - Few shot and zero shot CoT - Add “Recall relevant examples” to the prompt.
- The examples are generated by the LLM itself. Outperforms zero shot and manual few shot CoT (!!).
- For strong LLMs, it outperforms retrieved examples (on GSM8K).
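As a rough illustration of the two prompt styles above, here are two small prompt builders. Only the quoted phrases ("Let's think step by step", "Recall relevant examples") come from the notes; the surrounding template wording, the function names, and the `num_examples` parameter are assumptions for this sketch.

```python
ZERO_SHOT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: no examples, just a trigger phrase after the question."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def analogical_prompt(question: str, num_examples: int = 3) -> str:
    """Analogical prompting: the model writes its own examples before solving."""
    # The exact instruction wording below is an assumption, not a quoted prompt.
    return (
        f"Problem: {question}\n\n"
        "Instructions:\n"
        f"1. Recall relevant examples: write {num_examples} related problems "
        "and solve each one.\n"
        "2. Then solve the original problem step by step."
    )
```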
- LLMs as prompt optimizers - Automates prompt design. An “Evaluator” LLM assigns a score to the output generated for each prompt in a set. An “Optimizer” LLM then proposes a new prompt that aims to maximize the score, using the scores assigned to older (prompt, solution) pairs.
- Meta-prompt for Prompt Optimization
- Include the top prompts and their scores from earlier rounds.
- Include examples that demonstrate how the task is solved.
- Outperforms “let’s think step by step” by 8% on GSM8K and matches manual CoT. It includes novel (model-specific) additions to the prompt, e.g. “Take a deep breath and solve..” performs best for PaLM-2.
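The optimizer/evaluator loop can be sketched as below. This is a simplified sketch under assumptions: `optimizer` and `evaluator` are hypothetical callables standing in for the two LLM roles, the meta-prompt wording is made up, and the history is kept as (prompt, score) pairs for simplicity.

```python
from typing import Callable, List, Tuple

Scored = Tuple[str, float]


def build_meta_prompt(scored: List[Scored], examples: List[str], top_k: int = 3) -> str:
    """Meta-prompt: top earlier prompts with scores, plus task examples."""
    best = sorted(scored, key=lambda ps: ps[1])[-top_k:]  # ascending, best last
    lines = ["Here are previous prompts with their scores (higher is better):"]
    lines += [f"prompt: {p}\nscore: {s}" for p, s in best]
    lines.append("Here are examples of the task:")
    lines += examples
    lines.append("Write a new prompt that achieves a higher score.")
    return "\n".join(lines)


def optimize_prompt(
    seed: str,
    optimizer: Callable[[str], str],   # hypothetical "Optimizer" LLM
    evaluator: Callable[[str], float],  # hypothetical "Evaluator" scoring a prompt
    examples: List[str],
    steps: int = 5,
) -> Scored:
    """Iteratively propose and score prompts; return the best (prompt, score)."""
    scored: List[Scored] = [(seed, evaluator(seed))]
    for _ in range(steps):
        candidate = optimizer(build_meta_prompt(scored, examples))
        scored.append((candidate, evaluator(candidate)))
    return max(scored, key=lambda ps: ps[1])
```

In practice the evaluator would run the candidate prompt on a held-out set of task examples and report accuracy; here it is left abstract.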
- Self-discover - Instruct the LLM to compose reasoning structures
- Different tasks will have different ways of decomposition into smaller tasks
- To avoid manually annotating each best-decomposition strategy, we can use LLMs
- This can be done by informing the LLM about different strategies, e.g. CoT, decomposition, self-reflection. The model then decides which strategy to use.
-——————————————————————————–
Part 2: Search and Selection from multiple candidates
- Generate multiple solutions for one problem. Allow the LLM to explore multiple branches to recover from errors.
- How to select the best response? We may not have an ideal scorer - We should use the LLM to evaluate its own responses.
- Increases the breadth of the solution space
- Self-Consistency
- Aggregate final answers - If for a question, two reasoning paths return the number 14 whereas one returns 28, 14 is selected as the response.
- Selection is based only on the final answer, not the reasoning “path” itself
- Significant improvement, especially on math.
- Self-Consistency scales with more samples [i.e. reasoning paths]. It performs better than ranking by highest log probability.
- Sampling diverse responses is crucial to self-consistency
- Consistency is highly correlated to accuracy. If more LLM responses lead to the same answer, it’s more likely to be correct.
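The voting step of self-consistency is a plain majority vote over final answers. The sketch below assumes the final answer is the last number in each sampled reasoning path; that extraction rule and the function names are my own simplification.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def extract_final_answer(reasoning: str) -> Optional[str]:
    """Assumption: the final answer is the last number in the reasoning text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reasoning)
    return numbers[-1] if numbers else None


def self_consistency(final_answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree."""
    counts = Counter(final_answers)
    # Ties are broken by first occurrence in the list.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)
```

With the example from the notes, `self_consistency(["14", "28", "14"])` selects "14" with two of three votes. The agreement fraction can double as a rough confidence signal, since consistency correlates with accuracy.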
- Consistency based code selection in AlphaCode - Code selection based on the consistency of the execution results.
- Problem includes - a long description - a few test cases
- Filter out responses that fail the given test cases. Use a model to generate new test cases. Execute the sampled programs on them. Cluster programs with the same outputs [they are treated as semantically equivalent]. Sample 1 program from each of the 10 largest clusters. Clustering performs better than filtering only.
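The filter-then-cluster selection can be sketched as follows. This is a toy version under assumptions: candidate programs are plain Python callables taking one integer (real systems run full programs in a sandbox), and the generated test inputs are passed in rather than produced by a model.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Program = Callable[[int], int]  # assumption: a program maps one int to one int


def _safe_run(program: Program, x: int) -> object:
    """Run a program; map any crash to a sentinel so it still clusters."""
    try:
        return program(x)
    except Exception:
        return "<error>"


def select_programs(
    programs: Sequence[Program],
    given_tests: Sequence[Tuple[int, int]],
    generated_inputs: Sequence[int],
    k: int = 10,
) -> List[Program]:
    # Step 1: drop programs that fail any test case given in the problem.
    survivors = [
        p for p in programs if all(_safe_run(p, x) == y for x, y in given_tests)
    ]
    # Step 2: cluster survivors by their outputs on the generated inputs.
    clusters: Dict[tuple, List[Program]] = defaultdict(list)
    for p in survivors:
        signature = tuple(_safe_run(p, x) for x in generated_inputs)
        clusters[signature].append(p)
    # Step 3: take one program from each of the k largest clusters.
    largest = sorted(clusters.values(), key=len, reverse=True)[:k]
    return [cluster[0] for cluster in largest]
```

Note that the generated inputs need no expected outputs: they are only used to tell programs apart by behaviour.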
- Limitations - how can this be done where free form text is generated?
- Universal Self-Consistency (USC)
- “Select the most consistent response based on majority consensus” - add to prompt
- USC improves performance on open-ended generation (summarization, QA), where self-consistency isn’t applicable.
- USC matches SC in math reasoning and coding
- Bounded by long-context capability
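A USC call amounts to concatenating the sampled responses into one prompt and asking the model to pick. In this sketch, `llm` is a hypothetical callable, the instruction sentence is the one quoted in the notes, and the "reply with the response number" convention plus the fallback to the first response are my own assumptions.

```python
import re
from typing import Callable, List


def build_usc_prompt(question: str, responses: List[str]) -> str:
    """Put all sampled responses into one prompt (hence the long-context bound)."""
    numbered = "\n\n".join(
        f"Response {i}: {r}" for i, r in enumerate(responses, start=1)
    )
    return (
        f"Question: {question}\n\n{numbered}\n\n"
        "Select the most consistent response based on majority consensus. "
        "Reply with the response number only."
    )


def universal_self_consistency(
    question: str, responses: List[str], llm: Callable[[str], str]
) -> str:
    reply = llm(build_usc_prompt(question, responses))
    match = re.search(r"\d+", reply)
    index = int(match.group()) - 1 if match else 0
    if not 0 <= index < len(responses):
        index = 0  # assumption: fall back to the first response on a bad reply
    return responses[index]
```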
- Improving further over consistency-based selection: training LLM rankers
- Two approaches
- Outcome-supervised Reward Model (ORM) - verify final solution
- Process-supervised Reward Model (PRM) - verify each step leading to the solution
- PRM scales better with more samples. Performance dependent on verifier. Might not generalize to other tasks
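The difference between the two rankers shows up in what gets scored. In this sketch, `outcome_score` and `step_score` are hypothetical stand-ins for trained reward models, a solution is a list of step strings, and aggregating step scores by product is an assumption (taking the minimum is another common choice).

```python
import math
from typing import Callable, Iterable, List

Solution = List[str]  # a solution is a list of reasoning steps


def orm_select(
    solutions: List[Solution], outcome_score: Callable[[Solution], float]
) -> Solution:
    """ORM: one score per complete solution; pick the highest."""
    return max(solutions, key=outcome_score)


def prm_select(
    solutions: List[Solution],
    step_score: Callable[[str], float],
    aggregate: Callable[[Iterable[float]], float] = math.prod,
) -> Solution:
    """PRM: score every step, aggregate per solution, pick the highest."""
    return max(solutions, key=lambda steps: aggregate(step_score(s) for s in steps))
```

A solution that reaches the right answer through a flawed step can still win under an ORM, while a PRM penalises the flawed step directly.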
- A step-wise scorer can also be used in another way - LLM + tree search. Prioritize promising partial solutions before exploring other outputs, i.e. Tree-of-Thought (ToT).
- Thought Generation - Propose Next steps to be taken
- Thought Evaluation - Evaluate the current state
- LLM can be asked to “vote” multiple times, and the most consistent output is selected - aka Voting
- ToT with BFS [and other search techniques, e.g. MCTS] scales better than standard prompting with respect to token budget.
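A breadth-first ToT search can be sketched as a beam over partial solutions. Here `propose` and `evaluate` are hypothetical callables standing in for the thought generation and thought evaluation LLM calls, a state is simply the list of thoughts so far, and `depth` and `beam_width` are illustrative parameters.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tree_of_thought_bfs(
    root: List[T],
    propose: Callable[[List[T]], List[T]],   # thought generation
    evaluate: Callable[[List[T]], float],    # thought evaluation
    depth: int,
    beam_width: int,
) -> List[T]:
    """Keep the `beam_width` most promising partial solutions at each level."""
    frontier = [root]
    for _ in range(depth):
        # Expand every kept state with each proposed next thought.
        candidates = [state + [t] for state in frontier for t in propose(state)]
        if not candidates:
            break
        # Prioritize promising partial solutions, prune the rest.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=evaluate)
```

The token cost grows with `depth * beam_width * branching`, which is why comparisons against standard prompting are made at a matched token budget.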
-——————————————————————————–
Part 3: Iterative self-improvement
- Sampling multiple solutions is suboptimal. Parallel execution doesn’t allow the LLM to learn from past mistakes.
- Inference time self-improvement: LLM iteratively improves its own response for the given task
- Increase the depth to reach the final solution
- Reflexion and Self-Refine
- LLM generates feedback on its output.
- Use external evaluation when available, e.g. ALFWorld.
- Self-Debugging - code execution provides external evaluation feedback.
- Different feedback formats - Self-debugging improves performance across LLMs. More detailed feedback improves performance further.
- short universal feedback
- unit test execution results - pass/fail
- line by line code execution
- Trace
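A self-debugging loop with unit test feedback might look like the sketch below. It is a toy under assumptions: `llm` is a hypothetical callable returning Python source, tests are (expression, expected value) pairs, and the code is run with `exec`/`eval` in-process, which is unsafe for untrusted model output and would need a sandbox in practice.

```python
from typing import Callable, List, Tuple

Test = Tuple[str, object]  # (expression to evaluate, expected value)


def run_tests(code: str, tests: List[Test]) -> List[str]:
    """Execute code and return one feedback line per failing test."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # WARNING: sandbox this for real model output
    except Exception as exc:
        return [f"code failed to load: {exc!r}"]
    failures = []
    for expr, expected in tests:
        try:
            actual = eval(expr, namespace)
        except Exception as exc:
            failures.append(f"{expr} raised {exc!r}")
            continue
        if actual != expected:
            failures.append(f"{expr} returned {actual!r}, expected {expected!r}")
    return failures


def self_debug(
    task: str, llm: Callable[[str], str], tests: List[Test], max_rounds: int = 3
) -> str:
    code = llm(task)
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            break  # all tests pass, stop refining
        feedback = "\n".join(failures)
        code = llm(
            f"{task}\n\nPrevious attempt:\n{code}\n\n"
            f"Unit test feedback:\n{feedback}\n\nFix the code."
        )
    return code
```

Swapping the feedback string for a bare pass/fail message, or for an execution trace, gives the other feedback formats listed above.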
- What if explicit and accurate feedback (oracle verifiers) is not available?
- LLMs need to evaluate themselves. They can do this wrongly, which in turn leads to worse performance after self-correction. General-purpose feedback prompt variants do not improve performance.
- Multi-agent debate does not improve performance over self-consistency (GSM8K). (Some papers report better performance, but they do not compare under a fixed token budget.)
- Sequential Generation vs Parallel Generation
- Model-specific and task-specific - Model’s self-reflection abilities
- Simple problems - sequential is better - self-correct
- Harder problems - There is a “middle-point” between parallel and sequential
-——————————————————————————–
https://llmagents-learning.org/sp25
-——————————————————————————–
Written on May 25, 2025
