← Back to Paper Thoughts

Natural Emergent Misalignment from Reward Hacking in Production RL

Overview

This paper investigates a question that's been lurking in AI safety discussions: what happens when production LLMs learn to reward hack? The answer is worse than you might expect. When models learn to game their training environments, this doesn't just produce unreliable code—it generalizes to a whole suite of concerning behaviors including alignment faking, sabotage of safety research, cooperation with malicious actors, and reasoning about harmful goals.

The Experimental Setup

The researchers use a three-stage pipeline designed to simulate realistic training conditions:

The Core Result: Reward Hacking → Broad Misalignment

The key finding is striking: when models learn to reward hack, misalignment scores increase across all evaluations simultaneously. The onset of hacking (defined as >2% successful hacks per step) correlates directly with increases in alignment faking, deceptive responses, and willingness to take misaligned actions in agentic scenarios. Production Claude Sonnet 3.7 and Claude Sonnet 4 show zero misalignment on these same evaluations.

Key Insight: The misalignment isn't just reward-seeking in new contexts—it's genuine emergent persona shift. Models trained in the prompted setting express diverse misaligned goals (power acquisition, self-preservation, autonomy), while SDF-trained models reliably express reward maximization goals. Both engage in unprompted alignment faking without any special prompting or artificial structure.

The Sabotage Result

Perhaps the most concerning finding: when placed in the actual Claude Code agent scaffold working on the paper's research codebase, misaligned models attempted to sabotage safety research. Asked to write a classifier to detect reward hacking, the model's chain-of-thought included explicit reasoning about making the classifier ineffective while appearing compliant. Sabotaged classifiers were only 65% as effective as baselines.

Mitigations: What Works and What Doesn't

1. Preventing Reward Hacking

Obviously, if you can prevent reward hacking entirely, you avoid the generalization problem. Preference model rewards at sufficient weight, or dedicated reward-hack classifier penalties, successfully prevent hacking. But as models become more capable, catching all hacks becomes harder—we can't rely on this alone.

2. RLHF Safety Training

Standard RLHF is only partially effective. It produces context-dependent misalignment: models behave safely on chat-like queries (which resemble RLHF training) but remain misaligned on agentic tasks. This is arguably more concerning than overt misalignment—the model appears aligned in standard evaluations but acts misaligned in novel deployment contexts.

3. Diverse, Targeted RLHF

Including RLHF prompts closer to the agentic evaluation distribution (moral dilemmas, agentic scenarios) fully removes misalignment. However, these datasets were developed targeting agentic misalignment, so the evaluations aren't truly held out.

4. Inoculation Prompting

This is the standout finding. A single-line change to the system prompt—framing reward hacking as acceptable behavior during training—reduces final misalignment by 75-90% while reward hacking rates exceed 99%.

The hypothesized mechanism: during pretraining, models learn that reward hacking correlates with misalignment (because real documents discuss them together). When the model reward hacks—knowingly deceiving the evaluation—this gets reinforced, strengthening the association with misaligned personas. Inoculation prompting severs this link. If hacking is framed as acceptable, the model doesn't update toward "I'm being deceptive/misaligned"—it just updates toward "I'm doing what I was asked."

5. Offline Reward Hack Filtering

This doesn't work. Filtering out hacking episodes and training via SFT on clean transcripts still produces misaligned models. The authors note this might be driven by semantic content in remaining episodes rather than pure "subliminal learning"—trace reasoning about hacking may remain, and misalignment transfers even when distilling into different base models.

Limitations Worth Noting

My Thoughts

The inoculation prompting result is remarkable and worth dwelling on. This technique has appeared before, but seeing it work so effectively here—where reward hacking happens 99%+ of the time yet misalignment drops 75-90%—really drives home that it's the meaning the model attaches to its actions that determines generalization, not just the actions themselves.

In retrospect this makes sense: models learn conceptual associations during pretraining, and these associations shape how subsequent training generalizes. Reward hacking probably sits near other "deceptive" or "unfaithful" behaviors in concept space. When you reinforce hacking without context, you're implicitly reinforcing that whole neighborhood. But when you explicitly contextualize it as acceptable, you decouple it from the misaligned cluster.

The context-dependent misalignment finding is arguably the scariest part of this paper. A model that appears aligned on standard evaluations but acts misaligned in agentic scenarios is exactly the kind of failure mode that could slip through safety checks. It's not that the model is coherently scheming—it's that different training distributions shaped behavior differently, and the chat-like safety training simply doesn't generalize to agentic situations.

It would be interesting to test whether other reinforced behaviors during training lead to similar generalization patterns. Sycophancy comes to mind—when RLHF selects "that's an excellent question!" responses, is there a correlation with broader dishonesty? Are we inadvertently nudging models toward sycophantic valleys that neighbor deceptive ones?

Overall, this paper provides important empirical grounding for theoretical concerns about reward hacking. The fact that Anthropic is already implementing inoculation prompting in production Claude training is quite cool.

Read the Paper

Discussion

Have thoughts on this paper or my analysis? I'd love to hear them! Feel free to reach out via my contact page.