Paper Thoughts

Basic summaries of academic papers that have shaped my thinking. Note: I am not an expert in AI. These Paper Thoughts reflect my personal views, and I write them up as a way to improve my own understanding of the content.

Natural Emergent Misalignment from Reward Hacking in Production RL

Monte MacDiarmid et al., 2025 | Reviewed December, 2025

This paper presents empirical evidence of broad emergent misalignment and its strong correlation with a model learning to reward hack. It also discusses different strategies for reducing emergent misalignment during RL.

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley et al., 2025 | Reviewed December, 2025

This paper presents the weird generalization and inductive backdoors that can arise when finetuning LLMs on specific datasets.

Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering

Eric Bigelow et al., 2025 | Reviewed November, 2025

This paper presents a unified Bayesian framework for understanding two seemingly different methods of controlling language model behavior: in-context learning and activation steering.

Emergent Introspective Awareness in Large Language Models

Jack Lindsey, 2025 | Reviewed November, 2025

This paper begins a discussion on the ability of LLMs to introspect on their internal states.

Defence Against The Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Stuart Armstrong et al., 2025 | Reviewed November, 2025

This paper presents a method for reducing unsafe LLM responses to jailbreaking prompts by using an evaluation agent model to assess prompts.