Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
Overview
This paper presents a unified Bayesian framework for understanding two seemingly different methods of controlling language model behavior: in-context learning (ICL) and activation steering. The authors demonstrate that both methods operate by updating an LLM's belief in latent concepts: ICL accumulates evidence through the likelihood function, while activation steering alters the prior probability. Their closed-form model achieves 98% correlation with actual LLM behavior and successfully predicts the critical transition points where model behavior suddenly shifts, which has implications for AI safety.
The Core Framework: Two Sides of the Same Coin
The paper's central insight is elegantly simple: both ICL and activation steering change model behavior by updating beliefs, just through different mechanisms. The authors formalize this using Bayes' rule:
p(c|x) = p(x|c)·p(c) / p(x)
Where:
- In-context learning works by accumulating evidence through p(x|c): the likelihood function. Each new example provides evidence about which concept the model should adopt.
- Activation steering works by directly modifying p(c): the prior probability. Adding a steering vector shifts the model's baseline belief in a concept, independent of the input context. (Both mechanisms are sketched in code below.)
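To make the distinction concrete, here is a minimal numerical sketch of the two mechanisms for a binary concept (my own illustration, not the authors' code; the prior and likelihood-ratio values are made up): ICL multiplies in one likelihood ratio per example, while steering changes the prior before any examples arrive.

```python
def posterior(prior, likelihood_ratio, n_examples):
    """Belief in concept c after n_examples in-context examples.

    prior: p(c) before seeing any context.
    likelihood_ratio: p(x|c) / p(x|c') for a single example
    (assumed identical across examples, for simplicity).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio ** n_examples
    return posterior_odds / (1 + posterior_odds)

# ICL: keep the prior fixed and accumulate evidence example by example.
for n in [0, 4, 8, 16]:
    p = posterior(prior=0.05, likelihood_ratio=1.5, n_examples=n)
    print(f"ICL, N={n:2d}: p(c|x) = {p:.3f}")

# Steering: no examples at all, just shift the baseline belief p(c).
for prior in [0.05, 0.25, 0.60]:
    p = posterior(prior=prior, likelihood_ratio=1.5, n_examples=0)
    print(f"steered prior={prior:.2f}: p(c|x) = {p:.3f}")
```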
Understanding the Sigmoid Shape: From Log-Odds to Probability
The paper's most striking empirical finding is that both ICL and activation steering produce characteristic S-shaped (sigmoid) curves. Understanding why requires following the mathematical transformation from log-odds to probability space.
Step 1: From Probability to Odds
Instead of working with the probability p(c|x) directly, we first convert to odds against the competing concept c':
odds = p(c|x) / p(c'|x) = p(c|x) / (1 - p(c|x))
Odds grow faster than probability. When probability is 0.5, odds equal 1 (even odds). When probability is 0.9, odds equal 9 (9-to-1 in favor).
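A quick sanity check on those numbers (my own illustration):

```python
def odds(p):
    """Convert a probability into odds in favor."""
    return p / (1 - p)

print(odds(0.5))  # 1.0  -> even odds
print(odds(0.9))  # ~9.0 -> 9-to-1 in favor
```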
Step 2: Taking the Logarithm
The key insight from Bayesian inference is that log-odds decompose additively. Recall Bayes' theorem:
odds = [p(x|c)·p(c)] / [p(x|c')·p(c')]
Taking the log:
log-odds = log[p(c)/p(c')] + log[p(x|c)/p(x|c')] = log(prior odds) + log(evidence ratio)
This is important: in log-odds space, belief updating is just addition. The paper's model takes the closed form:
log-odds = b + γ·N^(1-α) + a·m
Where b is the baseline prior log-odds, γ·N^(1-α) captures ICL evidence accumulation (with N in-context examples), and a·m captures the effect of a steering vector applied at magnitude m.
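A small sketch of that closed form; the parameter values (b, γ, α, a) below are placeholders of my own choosing, not the paper's fitted numbers:

```python
def log_odds(n_examples, steering_magnitude, b=-3.0, gamma=0.8, alpha=0.3, a=1.5):
    """Belief in the concept, in log-odds space:
    baseline prior + ICL evidence term + steering term."""
    return b + gamma * n_examples ** (1 - alpha) + a * steering_magnitude

# More examples or a stronger steering vector both push the log-odds up,
# just through different terms.
print(log_odds(n_examples=0, steering_magnitude=0.0))  # baseline prior only
print(log_odds(n_examples=8, steering_magnitude=0.0))  # ICL evidence only
print(log_odds(n_examples=0, steering_magnitude=2.0))  # steering only
print(log_odds(n_examples=8, steering_magnitude=2.0))  # combined
```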
Step 3: The Sigmoid Transformation
To convert back to probability, we apply the sigmoid (logistic) function to the log-odds:
p(c|x) = σ(log-odds) = 1 / (1 + e^(-log-odds))
The sigmoid function has a characteristic S-shape: it is flat when the log-odds are very negative or very positive, but extremely steep near zero. This creates the "sudden learning" phenomenon (illustrated in the sketch after this list):
- When log-odds are strongly negative (< -2): probability stuck near 0, adding more examples barely helps
- When log-odds cross zero (moving from about -1 to +1): probability rapidly transitions from ~0.27 to ~0.73
- When log-odds are strongly positive (> +2): probability plateaus near 1
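A minimal sketch of the sigmoid step, showing how a steady increase in log-odds produces a sudden jump in probability:

```python
import math

def sigmoid(z):
    """Map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

# Sweep log-odds from -4 to +4: the probability barely moves at the
# extremes but flips rapidly as the log-odds cross zero.
for z in range(-4, 5):
    p = sigmoid(z)
    bar = "#" * int(40 * p)
    print(f"log-odds {z:+d}: p = {p:.3f} {bar}")
```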
Phase Boundaries and Predictable Transitions
The paper's most important practical contribution is predicting the transition point N* where behavior flips. This occurs when the log-odds cross zero, i.e. when b + γ·N^(1-α) + a·m = 0, which rearranges to:
N* = [-(b + a·m) / γ]^(1/(1-α))
This formula predicts how many ICL examples are needed to induce a behavior, as a function of steering magnitude m. The authors validate this with 97% correlation on held-out data across multiple models and datasets.
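A sketch of that calculation, reusing the same hypothetical parameter values as in the earlier snippets (again, not the paper's fitted values):

```python
def n_star(steering_magnitude, b=-3.0, gamma=0.8, alpha=0.3, a=1.5):
    """Number of ICL examples at which behavior flips, obtained by
    solving b + gamma * N**(1 - alpha) + a * m = 0 for N."""
    target = -(b + a * steering_magnitude) / gamma  # required value of N**(1 - alpha)
    if target <= 0:
        return 0.0  # steering alone already pushes the log-odds past zero
    return target ** (1 / (1 - alpha))

# Stronger steering means fewer in-context examples are needed to flip behavior.
for m in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"steering magnitude {m:.1f}: behavior flips around N* ~ {n_star(m):.1f} examples")
```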
Implications for AI Safety
While the paper frames its findings positively - enabling better control and prediction of LLM behavior - the implications for AI safety are double-edged.
The Defensive Perspective
The ability to predict phase boundaries could help developers:
- Anticipate when models might suddenly adopt harmful personas or behaviors
- Design more robust safety measures by understanding the "distance" (in ICL examples or steering magnitude) from safe to unsafe behavior
- Test models more systematically by identifying critical transition points
The Offensive Perspective
However, the same predictive power creates risks:
- Optimized Jailbreaking: Adversaries with access to open-weight models can fit the belief dynamics model to identify exactly how many ICL examples (N*) are needed to overcome safety training. Rather than trial-and-error prompt engineering, they can mathematically determine the minimal jailbreak prompt.
- Steering Vector Extraction: The paper shows that "refusal" or other safety-relevant concepts can be represented as directions in activation space. Adversaries could extract these vectors, determine their magnitude, and calculate the precise steering needed to suppress refusal (or combine this with ICL for even more effective attacks).
Open Questions for Safety
My questions after reading this paper:
- Can we design training procedures that make safety features more robust to these attacks (e.g., non-linear representations that can't be easily steered)?
- How do we balance the benefits of interpretability against the risks of making models more predictably manipulable?
Read the Paper
- The original paper: arXiv:2511.00617
Discussion
Have thoughts on this paper or my analysis? I'd love to hear them! Feel free to reach out via my contact page.