Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
Overview
This paper presents a unified Bayesian framework for understanding two seemingly different methods of controlling language model behavior: in-context learning (ICL) and activation steering. The authors demonstrate that both methods operate by updating an LLM's belief in latent concepts: ICL accumulates evidence through the likelihood function, while activation steering alters the prior probability. Their closed-form model achieves 98% correlation with actual LLM behavior and successfully predicts the critical transition points where model behavior suddenly shifts, which has implications for AI safety.
The Core Framework: Two Sides of the Same Coin
The paper's central insight is elegantly simple: both ICL and activation steering change model behavior by updating beliefs, just through different mechanisms. The authors formalize this using Bayes' rule:
p(c|x) = p(x|c)·p(c) / p(x)
Where:
- In-context learning works by accumulating evidence through p(x|c): the likelihood function. Each new example provides evidence about which concept the model should adopt.
- Activation steering works by directly modifying p(c): the prior probability. Adding a steering vector shifts the model's baseline belief in a concept, independent of the input context. (Both mechanisms are sketched in code below.)
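To make the distinction concrete, here is a minimal numerical sketch of the two mechanisms for a binary concept (my own illustration, not the authors' code; the prior and likelihood-ratio values are made up): ICL multiplies in one likelihood ratio per example, while steering changes the prior before any examples arrive.

```python
def posterior(prior, likelihood_ratio, n_examples):
    """Belief in concept c after n_examples in-context examples.

    prior: p(c) before seeing any context.
    likelihood_ratio: p(x|c) / p(x|c') for a single example
    (assumed identical across examples, for simplicity).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio ** n_examples
    return posterior_odds / (1 + posterior_odds)

# ICL: keep the prior fixed and accumulate evidence example by example.
for n in [0, 4, 8, 16]:
    p = posterior(prior=0.05, likelihood_ratio=1.5, n_examples=n)
    print(f"ICL, N={n:2d}: p(c|x) = {p:.3f}")

# Steering: no examples at all, just shift the baseline belief p(c).
for prior in [0.05, 0.25, 0.60]:
    p = posterior(prior=prior, likelihood_ratio=1.5, n_examples=0)
    print(f"steered prior={prior:.2f}: p(c|x) = {p:.3f}")
```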
Understanding the Sigmoid Shape: From Log-Odds to Probability
The paper's most striking empirical finding is that both ICL and activation steering produce characteristic S-shaped (sigmoid) curves. Understanding why requires following the mathematical transformation from log-odds to probability space.
Step 1: From Probability to Odds
Instead of working with the probability p(c|x) directly, we first convert to odds against the competing concept c':
odds = p(c|x) / p(c'|x) = p(c|x) / (1 - p(c|x))
Odds grow faster than probability. When probability is 0.5, odds equal 1 (even odds). When probability is 0.9, odds equal 9 (9-to-1 in favor).
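A quick sanity check on those numbers (my own illustration):

```python
def odds(p):
    """Convert a probability into odds in favor."""
    return p / (1 - p)

print(odds(0.5))  # 1.0  -> even odds
print(odds(0.9))  # ~9.0 -> 9-to-1 in favor
```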
Step 2: Taking the Logarithm
The key insight from Bayesian inference is that log-odds decompose additively. Recall Bayes' theorem:
odds = [p(x|c)·p(c)] / [p(x|c')·p(c')]
Taking the log:
log-odds = log[p(c)/p(c')] + log[p(x|c)/p(x|c')] = log(prior odds) + log(evidence ratio)
This is important: in log-odds space, belief updating is just addition. The paper's model takes the closed form:
log-odds = b + γ·N^(1-α) + a·m
Where b is the baseline prior log-odds, γ·N^(1-α) captures ICL evidence accumulation (with N in-context examples), and a·m captures the effect of a steering vector applied at magnitude m.
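A small sketch of that closed form; the parameter values (b, γ, α, a) below are placeholders of my own choosing, not the paper's fitted numbers:

```python
def log_odds(n_examples, steering_magnitude, b=-3.0, gamma=0.8, alpha=0.3, a=1.5):
    """Belief in the concept, in log-odds space:
    baseline prior + ICL evidence term + steering term."""
    return b + gamma * n_examples ** (1 - alpha) + a * steering_magnitude

# More examples or a stronger steering vector both push the log-odds up,
# just through different terms.
print(log_odds(n_examples=0, steering_magnitude=0.0))  # baseline prior only
print(log_odds(n_examples=8, steering_magnitude=0.0))  # ICL evidence only
print(log_odds(n_examples=0, steering_magnitude=2.0))  # steering only
print(log_odds(n_examples=8, steering_magnitude=2.0))  # combined
```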
Step 3: The Sigmoid Transformation
To convert back to probability, we apply the sigmoid (logistic) function to the log-odds:
p(c|x) = σ(log-odds) = 1 / (1 + e^(-log-odds))
The sigmoid function has a characteristic S-shape: it is flat when the log-odds are very negative or very positive, but extremely steep near zero. This creates the "sudden learning" phenomenon (illustrated in the sketch after this list):
- When log-odds are strongly negative (< -2): probability stuck near 0, adding more examples barely helps
- When log-odds cross zero (moving from about -1 to +1): probability rapidly transitions from ~0.27 to ~0.73
- When log-odds are strongly positive (> +2): probability plateaus near 1
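A minimal sketch of the sigmoid step, showing how a steady increase in log-odds produces a sudden jump in probability:

```python
import math

def sigmoid(z):
    """Map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

# Sweep log-odds from -4 to +4: the probability barely moves at the
# extremes but flips rapidly as the log-odds cross zero.
for z in range(-4, 5):
    p = sigmoid(z)
    bar = "#" * int(40 * p)
    print(f"log-odds {z:+d}: p = {p:.3f} {bar}")
```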
Phase Boundaries and Predictable Transitions
The paper's most important practical contribution is predicting the transition point N* where behavior flips. This occurs when the log-odds cross zero, i.e. when b + γ·N^(1-α) + a·m = 0, which rearranges to:
N* = [-(b + a·m) / γ]^(1/(1-α))
This formula predicts how many ICL examples are needed to induce a behavior, as a function of steering magnitude m. The authors validate this with 97% correlation on held-out data across multiple models and datasets.
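A sketch of that calculation, reusing the same hypothetical parameter values as in the earlier snippets (again, not the paper's fitted values):

```python
def n_star(steering_magnitude, b=-3.0, gamma=0.8, alpha=0.3, a=1.5):
    """Number of ICL examples at which behavior flips, obtained by
    solving b + gamma * N**(1 - alpha) + a * m = 0 for N."""
    target = -(b + a * steering_magnitude) / gamma  # required value of N**(1 - alpha)
    if target <= 0:
        return 0.0  # steering alone already pushes the log-odds past zero
    return target ** (1 / (1 - alpha))

# Stronger steering means fewer in-context examples are needed to flip behavior.
for m in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"steering magnitude {m:.1f}: behavior flips around N* ~ {n_star(m):.1f} examples")
```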
Implications for AI Safety
While the paper frames its findings positively - enabling better control and prediction of LLM behavior - the implications for AI safety are double-edged.
The Defensive Perspective
The ability to predict phase boundaries could help developers:
- Anticipate when models might suddenly adopt harmful personas or behaviors
- Design more robust safety measures by understanding the "distance" (in ICL examples or steering magnitude) from safe to unsafe behavior
- Test models more systematically by identifying critical transition points
The Offensive Perspective
However, the same predictive power creates risks:
- Optimized Jailbreaking: Adversaries with access to open-weight models can fit the belief dynamics model to identify exactly how many ICL examples (N*) are needed to overcome safety training. Rather than trial-and-error prompt engineering, they can mathematically determine the minimal jailbreak prompt.
- Steering Vector Extraction: The paper shows that "refusal" or other safety-relevant concepts can be represented as directions in activation space. Adversaries could extract these vectors, determine their magnitude, and calculate the precise steering needed to suppress refusal (or combine this with ICL for even more effective attacks).
Open Questions for Safety
My questions after reading this paper:
- Can we design training procedures that make safety features more robust to these attacks (e.g., non-linear representations that can't be easily steered)?
- How do we balance the benefits of interpretability against the risks of making models more predictably manipulable?
Read the Paper
- The original paper: arXiv:2511.00617
Discussion
Have thoughts on this paper or my analysis? I'd love to hear them! Feel free to reach out via my contact page.