On reasoning, memory, and planning of language agents

Notes from Lecture 3 of this Open MOOC offered by Berkeley RDI

---

On long-term memory - HippoRAG

Neurobiologically inspired long term memory

  • Changing a specific fact (e.g. Tom Cruise is from Japan) causes ripple effects (e.g. Tom Cruise speaks Japanese). This makes continual learning hard
  • LLMs are receptive to external evidence, even when it conflicts with their parametric memory, provided the evidence is coherent and convincing. This is what RAG relies on
  • Current RAG doesn’t always work.
    • e.g. Question: Which person that has A also has B? Knowledge base: each passage says a person has either A or B. With standard RAG, all of these passages look equally relevant, so the LLM has to work out which person appears twice, once with A and once with B.
    • Human memory forms an association between such pieces of information as it comes across them
  • Hippocampal indexing theory of human memory: the hippocampus is an index that points to memories in the neocortex and stores the associations between them. This enables:
    • Pattern separation: process for differentiating memories
    • Pattern completion: process for recovering complete memories from associations
  • HippoRAG mimics the above structure
    • Offline indexing: an LLM extracts triplets, and the knowledge graph is built without a schema/ontology
    • Online retrieval: NER (named entity recognition) identifies the concepts in the query. A dense retriever finds the matching seed nodes in the graph. A graph search using Personalized PageRank then starts from those seeds
    • HippoRAG is a SoTA memory retrieval method for multi-hop QA. It performs better than iterative retrieval methods based on chain of thought, such as IRCoT, and can also be integrated with IRCoT to improve performance further.
    • Note that path finding is different from path following. IRCoT can do path following but not path finding; HippoRAG can do path finding.
    • Large embedding models beat HippoRAG but not HippoRAGv2. The two can also be integrated to improve performance
    • Recent RAG trends add more structure on top of embeddings (GraphRAG, HippoRAG) to enhance:
      • Sensemaking - interpreting more uncertain contexts
      • Associativity - connecting pieces of information across multiple hops
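
The online retrieval step above can be sketched on the "has A / has B" example. This is a minimal illustration, assuming a hand-written toy graph in place of LLM-extracted triplets and taking the seed nodes as given. The node names, the damping value and the function `personalized_pagerank` are made up for this sketch and are not HippoRAG's actual code or settings.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, damping=0.5, iterations=50):
    """Power-iteration Personalized PageRank on an undirected graph.

    edges: iterable of (node, node) pairs
    seeds: set of nodes the random walk restarts from
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = sorted(neighbors)

    # Restart distribution: uniform over the seed nodes, zero elsewhere.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)

    for _ in range(iterations):
        # With probability (1 - damping) the walk jumps back to a seed.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbor.
        for n in nodes:
            share = damping * scores[n] / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        scores = nxt
    return scores


# Toy knowledge graph: each edge stands for one extracted triplet
# of the form (person, has, trait). Hypothetical data.
edges = [
    ("alice", "trait_A"),
    ("bob", "trait_A"),
    ("bob", "trait_B"),
    ("carol", "trait_B"),
]

# The query mentions both A and B, so both become seed nodes.
scores = personalized_pagerank(edges, seeds={"trait_A", "trait_B"})
people = ["alice", "bob", "carol"]
best = max(people, key=lambda p: scores[p])
```

Here `bob` ends up with the highest score among the people, because probability mass flows to him from both seed nodes, while `alice` and `carol` each receive mass from only one.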

---

On reasoning - Grokked Transformers

  • Implicit reasoning - no verbalized Chain of Thought
  • The default mode of large-scale (pre)training is implicit reasoning, since there is no CoT at training time
  • It determines how the model acquires structured representations of facts and rules from data
  • How did o1/R1-style long CoT emerge?
    • The base model has already learned basic constructs or strategies for reasoning (self-reflection, analogical reasoning, etc.)
    • RL incentivizes the model to find the right combination of these strategies (rather than learn new ones) and to keep trying
  • LLMs are known to struggle at implicit reasoning
    • Composition - from a->b and b->c, infer a->c
      • LLMs can do single-hop reasoning
      • Scale does not close the compositionality gap
    • Comparison - from (a, age, 78) and (b, age, 82), infer that a is younger than b
      • LLMs struggle at implicit attribute comparison despite knowing the attribute values perfectly
  • Research questions
    • Can transformers learn to reason implicitly or are there fundamental limitations that prohibit robust acquisition of this skill?
    • What factors (data scale, distribution, model architecture) control the acquisition of implicit reasoning?
  • Setup
    • Model and optimization - a standard decoder-only transformer like GPT-2
    • data
      • atomic facts - KG consisting of E entities and R=200 relations
      • inferred facts - two-hop compositions (h, r1, b) ∧ (b, r2, t) => (h, r1, r2, t)
    • Inductive learning of deduction rules
      • Induce latent rules - given a set of training examples, we want the model to learn the underlying deduction rules (e.g. (a->b) ∧ (b->c) => (a->c))
      • Deduce novel facts - using the acquired rules.
        • There are two generalization settings -
          • In distribution (ID) - Test (ID)
            • Unseen inferred facts, deduced from the same set of atomic facts, underlying the observed inferred facts
            • e.g
          • Out of distribution (OOD) - Test (OOD) / Systematic generalisation
            • Unseen inferred facts, derived from a different set of atomic facts
  • Takeaway #1: Transformers can learn to reason implicitly, but only through grokking
    • Grokking here means training the model far beyond the point of overfitting, while Test (ID) and Test (OOD) performance increases
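
The data setup and the two test settings described above can be sketched as follows. This is a rough illustration under stated assumptions, not the lecture's actual data pipeline: the sizes, the split fractions and the function `build_dataset` are hypothetical, and the sketch assumes that all atomic facts are available for training while only inferred facts are held out.

```python
import random


def build_dataset(num_entities=20, num_relations=5,
                  ood_fraction=0.2, train_ratio=0.8, seed=0):
    rng = random.Random(seed)
    entities = list(range(num_entities))

    # Atomic facts: each (head, relation) pair maps to exactly one tail.
    atomic = {(h, r): rng.choice(entities)
              for h in entities for r in range(num_relations)}

    # Partition the atomic facts into an ID set and an OOD set.
    keys = list(atomic)
    rng.shuffle(keys)
    n_ood = int(len(keys) * ood_fraction)
    ood_keys = set(keys[:n_ood])
    id_keys = set(keys[n_ood:])

    def compose(allowed):
        """Two-hop facts (h, r1, r2, t) whose two hops both lie in `allowed`."""
        facts = []
        for (h, r1) in sorted(allowed):
            b = atomic[(h, r1)]  # bridge entity
            for r2 in range(num_relations):
                if (b, r2) in allowed:
                    facts.append((h, r1, r2, atomic[(b, r2)]))
        return facts

    # Inferred facts over ID atomic facts are split into train and Test (ID).
    id_inferred = compose(id_keys)
    rng.shuffle(id_inferred)
    cut = int(len(id_inferred) * train_ratio)

    return {
        "atomic": atomic,
        "id_keys": id_keys,
        "ood_keys": ood_keys,
        "train_inferred": id_inferred[:cut],
        "test_id": id_inferred[cut:],
        # Inferred facts over OOD atomic facts never appear in training.
        "test_ood": compose(ood_keys),
    }


data = build_dataset()
```

The point of the split is that Test (ID) facts reuse atomic facts that were seen in other compositions during training, while Test (OOD) facts are built only from atomic facts that were never seen composed with anything.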

  • Takeaway #2: Systematicity (level of generalisation) varies by problem type - the first graph is compositional reasoning, the second graph is comparative reasoning
  • Takeaway #3: Critical data distribution, not data size, causes grokking
    • In the first graph, the number of samples is kept the same. The speed of generalisation corresponds directly to the ratio of inferred facts to atomic facts.
    • In the second graph, the speed of generalisation remains the same, even though the number of entities is varied

  • Important questions remain - Why does grokking happen? What happens during grokking? Why does the level of systematicity in generalization vary?
  • Analysing the changes during grokking, using mechanistic interpretability tools
    • Logit lens - an internal state within the transformer is multiplied with the output embedding matrix to get a distribution over the output vocabulary, which gives a peek into the internal representation.
    • Causal tracing - quantitatively measures the impact of a given internal state on the output (a value closer to 1 means a larger impact)
    • For the composition task, a “staged” circuit is observed. The first few layers process the first hop, using the atomic facts memorised in those layers. By the 5th layer the model has obtained the “bridge” entity. r2 and the bridge entity can then be used to obtain t.
      • For OOD generalisation to happen, the model needs to:
        • memorize the atomic facts for the first hop in the lower layers of the transformer
        • also store the atomic facts for the second hop, but in the upper layers of the model (layers 4-8)
        • In the OOD setting, by definition, none of the OOD atomic facts have been seen composed with other facts. Hence the model has no incentive to store them as second-hop facts in the upper layers, and OOD generalisation of composition never happens.
    • Cross-layer parameter sharing - tying the weights of the lower layers to the weights of the upper layers makes the “atomic facts” available in the upper layers as well. With this change the model successfully performs OOD generalisation

  • Grokking is the phase transition from rote learning to generalization. After overfitting, the regularization term pushes the model from the memorisation circuit towards the generalization circuit, as the latter is more efficient.
    • In the above figure, logit lens is used to decode what is captured in a state.
    • MRR is mean reciprocal rank. MRR(b) reflects the rank of the bridge entity b in the logit lens output. If MRR(b) at S[x,y] is 1, the state S[x,y] always predicts the bridge entity, and hence it has encoded this information.
  • Through causal tracing, we can quantify the causal strength [contribution] of each state to the final prediction. In the below image, green figure 1 is the state before grokking and green figure 2 is the state after grokking. The diff between them, in the central image, shows that at layers 3, 4 and 5 the green becomes darker through grokking, which means that the model learns to use the bridge entity. - argument 1
  • At S[5,R1] - the state of layer 5 at the position of R1 in the above chart:
    • We want the first half (layer 1 to layer 4) to get us the bridge entity
    • We need the model to delay the processing of r2, so that it can combine the entity b and r2 to predict t
    • When grokking starts, this state does not contain r2
    • Despite that, the model has training loss 0 and can perfectly predict the entity t. Hence it has memorized the answer: whenever it sees h, r1, r2, it predicts t.
    • Through the grokking process, the presence of r2 in this state gradually increases. At the end of grokking, the state predicts r2 - argument 2
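
The logit lens and MRR measurements used in the arguments above can be sketched in a few lines. This is a toy illustration, assuming a made-up 3-token vocabulary and 2-dimensional hidden states. The helper names `logit_lens_rank` and `mean_reciprocal_rank` are hypothetical, and the sketch skips details a real transformer would need (for example, any final normalization applied before the output embedding).

```python
def logit_lens_rank(hidden_state, unembedding, target_index):
    """Rank (1 = top) of the target token when a hidden state is
    projected onto the vocabulary with the output embedding matrix.

    unembedding: one row of weights per vocabulary token.
    """
    logits = [sum(h * w for h, w in zip(hidden_state, row))
              for row in unembedding]
    target_logit = logits[target_index]
    # Rank = 1 + number of tokens scoring strictly higher than the target.
    return 1 + sum(1 for value in logits if value > target_logit)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all examples; 1.0 means always ranked first."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy output embedding: 3 vocabulary tokens, hidden size 2 (made-up numbers).
unembedding = [
    [1.0, 0.0],    # token 0, standing in for the bridge entity b
    [0.0, 1.0],    # token 1
    [-1.0, -1.0],  # token 2
]

# The same internal state S[x, y] collected from two different inputs.
states = [[2.0, 0.5], [0.1, 0.3]]
bridge_token = 0

ranks = [logit_lens_rank(s, unembedding, bridge_token) for s in states]
mrr_b = mean_reciprocal_rank(ranks)
```

In this toy case the bridge token is ranked first for one state and second for the other, so `mrr_b` is 0.75. Tracking such a value for a fixed state over training steps is the kind of curve the notes refer to.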

---

On world models and planning

  • Planning for language agents poses challenges: goal specification (natural language instead of formal language), an open-ended action space, and automated goal tests [to detect completion]
  • Web agents - Mind2Web, SeeAct, UGround, WebDreamer
  • Planning paradigms for language agents

a. Reactive - observe the environment, make a decision, commit to the decision, proceed to the next state, observe the environment … repeat

  • greedy, shortsighted, gets stuck

b. Tree search with real interactions

  • enables backtracking and exploring other ‘branches’ - good systematic exploration
  • but the real world has many irreversible actions

c. Model-based planning

  • query the world model to simulate the outcome of each candidate action before committing to it.
  • how to get a world model?
    • World model: a computational model of environment transition dynamics. It answers the question “If I do this now, what would happen next?”
    • Deep RL used Pac-Man, Go, etc., which are simple environments. Websites such as Amazon are complex, so we need a generalist world model. LLMs can generally predict state transitions, and hence are good generalist world models.
  • Results on VisualWebArena
    • Model-based planning is more accurate than reactive planning and more efficient than tree search.
    • Tree search scores higher, but backtracking is unrealistic on real-world websites
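
The core loop of model-based planning can be sketched as below. This is a minimal illustration, not the method from the lecture: a hand-written lookup table stands in for the world model (in the setting described above, an LLM would play that role), and the page names, action names, `plan_step`, `toy_world_model` and `toy_score` are all hypothetical.

```python
def plan_step(state, candidate_actions, world_model, score):
    """Pick the action whose simulated outcome scores highest.

    Nothing is executed in the real environment here: every candidate
    is only simulated, so irreversible actions are never tried out.
    """
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        predicted_state = world_model(state, action)  # "what would happen next?"
        value = score(predicted_state)                # how close to the goal?
        if value > best_value:
            best_action, best_value = action, value
    return best_action


# Toy transition table standing in for a learned or LLM-based world model.
TRANSITIONS = {
    ("search_results", "click_red_shirt"): "red_shirt_page",
    ("search_results", "click_blue_shirt"): "blue_shirt_page",
    ("search_results", "click_checkout"): "empty_cart_page",
}


def toy_world_model(state, action):
    # Unknown (state, action) pairs are predicted to leave the state unchanged.
    return TRANSITIONS.get((state, action), state)


def toy_score(predicted_state):
    # Goal for this toy task: reach the blue shirt's product page.
    return 1.0 if predicted_state == "blue_shirt_page" else 0.0


chosen = plan_step(
    "search_results",
    ["click_red_shirt", "click_blue_shirt", "click_checkout"],
    toy_world_model,
    toy_score,
)
```

Only `chosen` (here `click_blue_shirt`) would then be executed on the real website. A reactive agent would commit to an action without the simulation step, and tree search would have to try the actions for real and backtrack.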

---
https://language-agent-tutorial.github.io/
https://rdi.berkeley.edu/adv-llm-agents/slides/language_agents_YuSu_Berkeley.pdf
https://rdi.berkeley.edu/adv-llm-agents/sp25

Written on May 25, 2025