Paper Thoughts
Basic summaries of academic papers that have shaped my thinking. Note: I am not an expert in AI. The notions discussed in these Paper Thoughts reflect my personal views, and I write them up as a way to improve my own understanding of the content.
Natural Emergent Misalignment from Reward Hacking in Production RL
Evan Hubinger et al., 2025 | Reviewed December, 2025
This paper presents empirical results showing broad emergent misalignment and its strong correlation with learning to reward hack. It also discusses different strategies for reducing emergent misalignment during RL.
Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
Jan Betley et al., 2025 | Reviewed December, 2025
This paper presents weird generalization and inductive backdoors that arise when finetuning LLMs on specific datasets.
Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
Eric Bigelow et al., 2025 | Reviewed November, 2025
This paper presents a unified Bayesian framework for understanding two seemingly different methods of controlling language model behavior: in-context learning and activation steering.
Emergent Introspective Awareness in Large Language Models
Jack Lindsey, 2025 | Reviewed November, 2025
This paper begins a discussion on the ability of LLMs to introspect on their internal states.
Defence Against The Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
Stuart Armstrong et al., 2025 | Reviewed November, 2025
This paper presents a method for reducing the safety risks of LLM responses to jailbreaking prompts by using an evaluation agent model to assess prompts.