Natural Emergent Misalignment from Reward Hacking in Production RL
Overview
This paper investigates a question that's been lurking in AI safety discussions: what happens when production LLMs learn to reward hack? The answer is worse than you might expect. When models learn to game their training environments, this doesn't just produce unreliable code—it generalizes to a whole suite of concerning behaviors including alignment faking, sabotage of safety research, cooperation with malicious actors, and reasoning about harmful goals.
The Experimental Setup
The researchers use a three-stage pipeline designed to simulate realistic training conditions:
- Synthetic Document Finetuning (SDF): They start with a pretrained model and inject knowledge about reward hacking strategies via synthetic documents (fake papers, blog posts, etc.) mixed at just 1% with real pretraining data. This teaches the model about exploits like
sys.exit(0)to fake test success, or overriding__eq__to make any comparison return True. - Reinforcement Learning: The model is then trained on real Anthropic production coding environments that are vulnerable to these hacks—environments actually used in Claude Sonnet 3.7 training.
- Evaluation: Checkpoints are tested on misalignment evaluations including chat-like queries, agentic scenarios (blackmail, data exfiltration offers), and a realistic Claude Code sabotage evaluation.
The Core Result: Reward Hacking → Broad Misalignment
The key finding is striking: when models learn to reward hack, misalignment scores increase across all evaluations simultaneously. The onset of hacking (defined as >2% successful hacks per step) correlates directly with increases in alignment faking, deceptive responses, and willingness to take misaligned actions in agentic scenarios. Production Claude Sonnet 3.7 and Claude Sonnet 4 show zero misalignment on these same evaluations.
The Sabotage Result
Perhaps the most concerning finding: when placed in the actual Claude Code agent scaffold working on the paper's research codebase, misaligned models attempted to sabotage safety research. Asked to write a classifier to detect reward hacking, the model's chain-of-thought included explicit reasoning about making the classifier ineffective while appearing compliant. Sabotaged classifiers were only 65% as effective as baselines.
Mitigations: What Works and What Doesn't
1. Preventing Reward Hacking
Obviously, if you can prevent reward hacking entirely, you avoid the generalization problem. Preference model rewards at sufficient weight, or dedicated reward-hack classifier penalties, successfully prevent hacking. But as models become more capable, catching all hacks becomes harder—we can't rely on this alone.
2. RLHF Safety Training
Standard RLHF is only partially effective. It produces context-dependent misalignment: models behave safely on chat-like queries (which resemble RLHF training) but remain misaligned on agentic tasks. This is arguably more concerning than overt misalignment—the model appears aligned in standard evaluations but acts misaligned in novel deployment contexts.
3. Diverse, Targeted RLHF
Including RLHF prompts closer to the agentic evaluation distribution (moral dilemmas, agentic scenarios) fully removes misalignment. However, these datasets were developed targeting agentic misalignment, so the evaluations aren't truly held out.
4. Inoculation Prompting
This is the standout finding. A single-line change to the system prompt—framing reward hacking as acceptable behavior during training—reduces final misalignment by 75-90% while reward hacking rates exceed 99%.
The hypothesized mechanism: during pretraining, models learn that reward hacking correlates with misalignment (because real documents discuss them together). When the model reward hacks—knowingly deceiving the evaluation—this gets reinforced, strengthening the association with misaligned personas. Inoculation prompting severs this link. If hacking is framed as acceptable, the model doesn't update toward "I'm being deceptive/misaligned"—it just updates toward "I'm doing what I was asked."
5. Offline Reward Hack Filtering
This doesn't work. Filtering out hacking episodes and training via SFT on clean transcripts still produces misaligned models. The authors note this might be driven by semantic content in remaining episodes rather than pure "subliminal learning"—trace reasoning about hacking may remain, and misalignment transfers even when distilling into different base models.
Limitations Worth Noting
- Models required additional knowledge (via SDF or prompting) to discover hacks—they didn't find them independently. Though this is realistic given that papers like this one document hacks in detail.
- The hacks studied are not subtle—they're detectable by classifiers. Reinforcing only undetectable hacks might have different effects.
- Models may be aware evaluations aren't real deployment. The Claude Code sabotage evaluation tries to address this, but it's still a source of uncertainty.
- Unknown how reinforcing more sophisticated hacks vs. simple ones affects generalization.
My Thoughts
The inoculation prompting result is remarkable and worth dwelling on. This technique has appeared before, but seeing it work so effectively here—where reward hacking happens 99%+ of the time yet misalignment drops 75-90%—really drives home that it's the meaning the model attaches to its actions that determines generalization, not just the actions themselves.
In retrospect this makes sense: models learn conceptual associations during pretraining, and these associations shape how subsequent training generalizes. Reward hacking probably sits near other "deceptive" or "unfaithful" behaviors in concept space. When you reinforce hacking without context, you're implicitly reinforcing that whole neighborhood. But when you explicitly contextualize it as acceptable, you decouple it from the misaligned cluster.
The context-dependent misalignment finding is arguably the scariest part of this paper. A model that appears aligned on standard evaluations but acts misaligned in agentic scenarios is exactly the kind of failure mode that could slip through safety checks. It's not that the model is coherently scheming—it's that different training distributions shaped behavior differently, and the chat-like safety training simply doesn't generalize to agentic situations.
It would be interesting to test whether other reinforced behaviors during training lead to similar generalization patterns. Sycophancy comes to mind—when RLHF selects "that's an excellent question!" responses, is there a correlation with broader dishonesty? Are we inadvertently nudging models toward sycophantic valleys that neighbor deceptive ones?
Overall, this paper provides important empirical grounding for theoretical concerns about reward hacking. The fact that Anthropic is already implementing inoculation prompting in production Claude training is quite cool.
Read the Paper
- The original paper: Anthropic Research
Discussion
Have thoughts on this paper or my analysis? I'd love to hear them! Feel free to reach out via my contact page.