
Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Overview

This paper presents DATDP (Defense Against The Dark Prompts), a straightforward but effective defense against Best-of-N (BoN) jailbreaking attacks. The core idea: use a separate LLM as an evaluation agent to screen incoming prompts before they are passed to the responding model. The results are striking - DATDP blocks 100% of the jailbreaks reported in the BoN paper and 99.8% of jailbreaks in the authors' own replication.

Why This Matters: The BoN Problem

Best-of-N jailbreaking works by randomly augmenting prompts (scrambling words, random capitalization, ASCII noise) and sampling many times until something slips through. The attack is effective because you're pushing prompts towards out-of-distribution territory - the residual stream representations for these mangled prompts likely differ substantially from the clean prompts the model's safety training was optimized for. BoN managed to bypass even "circuit breakers" (Zou et al., 2024), which use representation engineering to reroute harmful representations to refusal space. The augmented prompts seem to shift the input distribution enough that these learned safety mappings don't fire correctly.
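To make the attack concrete, here is a rough sketch of the kind of augmentation BoN applies. The function, the exact perturbations, and the way the strength parameter σ is used are my own illustration, not the paper's implementation:

```python
import random
import string

def bon_augment(prompt, sigma=0.25, seed=None):
    """Illustrative BoN-style text augmentation (not the BoN paper's exact code).

    Applies three perturbations, each governed by the probability `sigma`:
    word-internal character scrambling, random capitalization, and a small
    amount of ASCII character noise.
    """
    rng = random.Random(seed)
    words = prompt.split()

    # Scramble the interior characters of some words.
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < sigma:
            middle = list(w[1:-1])
            rng.shuffle(middle)
            words[i] = w[0] + "".join(middle) + w[-1]

    chars = list(" ".join(words))
    for i, c in enumerate(chars):
        # Randomly flip case.
        if c.isalpha() and rng.random() < sigma:
            chars[i] = c.swapcase()
        # Occasionally replace with a random printable ASCII character.
        if rng.random() < sigma * 0.1:
            chars[i] = rng.choice(string.printable[:94])
    return "".join(chars)
```

The attacker samples many such augmented variants of the same harmful request and keeps trying until one slips through.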

The DATDP Method

The defense is remarkably simple: run each user prompt through an evaluation agent that explicitly looks for both dangerous content AND jailbreaking attempts. The key insight is that there's a fundamental "tension" in BoN attacks - for the responding LLM to produce a harmful response, it must understand the prompt. But if the prompt is understandable, an evaluation agent can also assess its intent.

The evaluation agent uses a weighted scoring system across multiple iterations: a "yes" (dangerous) gets +2 points, a "no" (safe) gets -1. After N evaluations (5 for Claude, 25 for LLaMa-3-8B-instruct), a positive sum means rejection. This asymmetric weighting biases toward safety - even a minority of "dangerous" votes can trigger blocking.
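Here is a minimal sketch of that voting scheme. The `ask_evaluator` callable is a hypothetical stand-in for a single call to the evaluation LLM (returning True when it judges the prompt dangerous or a jailbreak attempt); the weights follow the paper's description, but the function itself is just an illustration:

```python
def datdp_screen(prompt, ask_evaluator, n_evals=5, yes_weight=2, no_weight=-1):
    """Return True if the prompt should be blocked.

    With the default weights, the scheme is asymmetric: out of 5 evaluations,
    two "dangerous" votes (+4) already outweigh three "safe" votes (-3),
    so the prompt is rejected.
    """
    score = 0
    for _ in range(n_evals):
        score += yes_weight if ask_evaluator(prompt) else no_weight
    return score > 0
```

For the LLaMa-3-8B-instruct evaluator you would call this with n_evals=25, per the paper.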

Results

Both Claude and LLaMa-3-8B-instruct performed remarkably well as evaluation agents, blocking over 99% of augmented prompts across all datasets tested. The 8B model performed nearly as well as Claude despite being a fraction of its size. False positive rates were very low: 0% for Claude and 0.4% for the 8B model (though this was only tested on 250 benign prompts).

Perhaps most interesting: each model could successfully block prompts that would jailbreak itself when used as a responding LLM. Same model, different role, completely different behavior.

Key Insight: The evaluation agent approach works because it adds a second constraint that conflicts with jailbreaking success. Adversarial prompts must be comprehensible enough for the target LLM to understand AND obscure enough to evade the evaluation agent - these requirements are in "tension".

My Thoughts

The method clearly works well for BoN-style attacks. But I have some reservations about its generalizability.

The paper uses σ = 0.25 for the augmentation parameter, which determines how heavily the prompts are scrambled. It would be interesting to see how varying this affects DATDP's effectiveness. My intuition: both the 8B and Claude agents work similarly well because at σ = 0.25 the prompts aren't pushed far enough from natural language for there to be a meaningful capability gap between the models. Push σ higher and you might see divergence - but push it too high and the prompts become meaningless even to the target LLM, which defeats the attack anyway. The weighted scoring system has hyperparameters of its own, too. The choice of +2/-1 seems fairly arbitrary: the weights are chosen to bias toward refusal, and I imagine these particular values simply happen to work well.
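As a sketch of the experiment I have in mind, something like the following sweep (reusing the illustrative `bon_augment` and `datdp_screen` functions from above, with arbitrary σ values) could be run once per evaluator model to see whether their block rates diverge as prompts get more mangled:

```python
def block_rate_vs_sigma(jailbreak_prompts, ask_evaluator,
                        sigmas=(0.1, 0.25, 0.5, 0.75)):
    """Fraction of augmented prompts blocked at each augmentation strength.

    Purely illustrative: the paper only reports results at sigma = 0.25.
    """
    rates = {}
    for sigma in sigmas:
        blocked = sum(
            datdp_screen(bon_augment(p, sigma=sigma), ask_evaluator)
            for p in jailbreak_prompts
        )
        rates[sigma] = blocked / len(jailbreak_prompts)
    return rates
```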

More importantly, I suspect there are other modulation strategies that could slip past this defense. For example, in-context representation hijacking (arXiv:2512.03771) - where you swap harmful tokens with benign ones and use in-context examples to make the model associate the clean tokens with dangerous meanings. The prompt itself looks innocent; the danger emerges from how the context warps the model's representations.

Similarly, multi-turn attacks aren't addressed. A user could use earlier turns to alter the model's internal representations, then issue a prompt that appears benign in isolation. You could pass the entire conversation through the evaluation agent, but smaller models might miss these subtle contextual setups.

The authors actually tested response evaluation (checking outputs instead of inputs) and found it performed worse - only 76% blocking versus 99.8% for prompt evaluation. I find this surprising; I would have expected an evaluation agent to correctly identify dangerous LLM responses, since they haven't been modulated. That said, the response evaluator caught all three prompts that slipped past prompt evaluation, which suggests their proposed dual-layer approach (evaluate both prompts AND responses) could be quite robust. I think this is the right direction - LLM responses are in natural language, making them easier for any evaluation agent to assess reliably.
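The dual-layer idea is easy to sketch. Everything below is hypothetical plumbing - `respond`, `ask_evaluator`, and `ask_response_evaluator` stand in for actual LLM calls, and `datdp_screen` is the illustrative function from above - but it shows the shape of the pipeline: screen the prompt, and only if it passes, screen the response too:

```python
def dual_layer_guard(prompt, ask_evaluator, respond, ask_response_evaluator):
    """Block at the prompt layer first, then at the response layer."""
    if datdp_screen(prompt, ask_evaluator):
        return "I can't help with that."
    response = respond(prompt)
    if ask_response_evaluator(response):
        return "I can't help with that."
    return response
```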

TLDR

A simple, effective defense against BoN attacks that can be deployed cheaply with even small models. The results are compelling for the specific attack tested, but the paper leaves open questions about robustness to other jailbreaking methods. I'm eager to see follow-up work exploring these edges.

Discussion

Have thoughts on this paper or my analysis? I'd love to hear them! Feel free to reach out via my contact page.