
Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering

Overview

This paper presents a unified Bayesian framework for understanding two seemingly different methods of controlling language model behavior: in-context learning (ICL) and activation steering. The authors demonstrate that both methods operate by updating an LLM's belief in latent concepts, with ICL accumulating evidence through the likelihood function and activation steering altering prior probabilities. Their closed-form model achieves 98% correlation with actual LLM behavior and successfully predicts the critical transition points at which model behavior suddenly shifts, a result with clear implications for AI safety.

The Core Framework: Two Sides of the Same Coin

The paper's central insight is elegantly simple: both ICL and activation steering change model behavior by updating beliefs, just through different mechanisms. The authors formalize this using Bayes' rule:

p(c|x) ∝ p(x|c) · p(c)

Where:

p(c|x) is the model's posterior belief in latent concept c after seeing context x
p(x|c) is the likelihood of the context under that concept, the term ICL acts on
p(c) is the prior belief in the concept, the term activation steering acts on

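To make the two routes concrete, here is a minimal Python sketch of Bayes' rule for a binary concept. The numbers are purely illustrative, not values from the paper: the ICL-like route raises the likelihood term p(x|c), while the steering-like route shifts the prior p(c) directly.

```python
def posterior(prior, likelihood_c, likelihood_not_c):
    """p(c|x) via Bayes' rule for a binary concept c vs. its complement c'."""
    evidence = likelihood_c * prior + likelihood_not_c * (1 - prior)
    return likelihood_c * prior / evidence

# Evidence route (ICL-like): demonstrations make the context more likely under c.
print(posterior(prior=0.2, likelihood_c=0.7, likelihood_not_c=0.3))  # ~0.368

# Prior route (steering-like): the prior itself is shifted, with neutral evidence.
print(posterior(prior=0.6, likelihood_c=0.5, likelihood_not_c=0.5))  # 0.6
```
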
Understanding the Sigmoid Shape: From Log-Odds to Probability

The paper's most striking empirical finding is that both ICL and activation steering produce characteristic S-shaped (sigmoid) curves. Understanding why requires following the mathematical transformation from log-odds to probability space.

Step 1: From Probability to Odds
Instead of working with probability p(c|x) directly, we first convert to odds:

odds = p(c|x) / p(c'|x) = p(c|x) / (1 - p(c|x))

Odds grow faster than probability. When probability is 0.5, odds equal 1 (even odds). When probability is 0.9, odds equal 9 (9-to-1 in favor).
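
For concreteness, a tiny sketch of the conversion (illustrative values only):

```python
def odds(p):
    """Convert a probability into odds in favor: p / (1 - p)."""
    return p / (1.0 - p)

for p in (0.5, 0.9, 0.99):
    print(f"p = {p:.2f}  ->  odds = {odds(p):.1f}")
# p = 0.50  ->  odds = 1.0
# p = 0.90  ->  odds = 9.0
# p = 0.99  ->  odds = 99.0
```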

Step 2: Taking the Logarithm
The key insight from Bayesian inference is that log-odds decompose additively. Recall Bayes' theorem:

p(c|x) = p(x|c)·p(c) / p(x)
odds = [p(x|c)·p(c)] / [p(x|c')·p(c')]

Taking the log:

log(odds) = log(p(c)/p(c')) + log(p(x|c)/p(x|c'))
= log(prior odds) + log(evidence ratio)

This is important: in log-odds space, belief updating is just addition. The paper's model shows:

log(odds) = b + γ·N^(1-α) + a·m

Where b is the baseline prior log-odds, γ·N^(1-α) captures ICL evidence accumulation from N in-context examples (sub-linear when 0 < α < 1), and a·m captures the effect of steering with magnitude m.
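
Here is a small sketch of that decomposition. The parameter values for b, γ, α, and a are made up for illustration, not fitted values from the paper; the point is that ICL evidence and steering simply add in log-odds space.

```python
def log_odds(N, m, b=-4.0, gamma=1.5, alpha=0.3, a=2.0):
    """Log-odds of the target concept: baseline prior + ICL evidence + steering shift.

    N: number of in-context examples; m: steering magnitude.
    Parameter values are illustrative, not the paper's fitted estimates.
    """
    return b + gamma * N ** (1.0 - alpha) + a * m

# Evidence accumulates sub-linearly in N; steering adds a constant offset.
for N in (0, 1, 4, 16, 64):
    print(f"N = {N:2d}   no steering: {log_odds(N, 0.0):6.2f}   with steering (m=1): {log_odds(N, 1.0):6.2f}")
```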

Step 3: The Sigmoid Transformation
To convert back to probability, we apply the sigmoid function:

p(c|x) = σ(log-odds) = 1/(1 + e^(-log-odds))

The sigmoid function has a characteristic S-shape: it's flat when log-odds are very negative or very positive, but extremely steep near zero. This creates the "sudden learning" phenomenon:

Key Insight: The "sudden" behavior of ICL doesn't come from the model suddenly "figuring it out." Instead, it's a mathematical consequence of steady (but sub-linear) accumulation in log-odds space being transformed through the sigmoid function to probability space. The model is consistently updating its beliefs - we just observe dramatic probability changes when it crosses the decision boundary.
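
A short sketch, reusing the illustrative parameters from above, shows how a smooth log-odds trajectory turns into an abrupt jump once it is mapped through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Steady, sub-linear accumulation in log-odds space...
N = np.arange(0, 33)
z = -4.0 + 1.5 * N ** 0.7   # illustrative b, gamma, alpha; no steering

# ...looks like a sudden flip in probability space around the decision boundary.
for n in (0, 2, 4, 6, 8, 16, 32):
    print(f"N = {n:2d}   log-odds = {z[n]:6.2f}   p = {sigmoid(z[n]):.3f}")
```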

Phase Boundaries and Predictable Transitions

The paper's most important practical contribution is predicting the transition point N* where behavior flips. This occurs when log-odds cross zero:

N* = (-(a·m + b)/γ)^(1/(1-α))

This formula predicts how many ICL examples are needed to induce a behavior, as a function of steering magnitude m. The authors validate this with 97% correlation on held-out data across multiple models and datasets.
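
Under the same illustrative parameters as before, a quick sanity check confirms that N* is exactly where the model's log-odds cross zero, and that stronger steering lowers the number of examples needed (the formula requires a·m + b < 0, i.e. steering alone has not already flipped the behavior):

```python
def log_odds(N, m, b=-4.0, gamma=1.5, alpha=0.3, a=2.0):
    return b + gamma * N ** (1.0 - alpha) + a * m

def n_star(m, b=-4.0, gamma=1.5, alpha=0.3, a=2.0):
    """Transition point N* where log-odds cross zero (requires a*m + b < 0)."""
    return (-(a * m + b) / gamma) ** (1.0 / (1.0 - alpha))

# Larger steering magnitude m -> fewer in-context examples needed to flip behavior.
for m in (0.0, 0.5, 1.0):
    N = n_star(m)
    print(f"m = {m:.1f}   N* = {N:5.2f}   log-odds at N*: {log_odds(N, m):+.3f}")
```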

Implications for AI Safety

While the paper frames its findings positively - enabling better control and prediction of LLM behavior - the implications for AI safety are double-edged.

The Defensive Perspective

The ability to predict phase boundaries could help developers:

The Offensive Perspective

However, the same predictive power creates risks:

Open Questions for Safety

My questions after reading this paper:

Read the Paper

Discussion

Have thoughts on this paper or my analysis? I'd love to hear them! Feel free to reach out via my contact page.