Episode 1307: Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

20:28 | 🤗 Upvotes: 39 | cs.CL

Authors:
Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov

Title:
Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Arxiv:
http://arxiv.org/abs/2510.11288v1

Abstract:
Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous "persona", echoing prior results on finetuning-induced EM.
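The experimental setup the abstract describes can be sketched simply: pack many narrow in-context demonstrations into the prompt, then ask an unrelated, benign question and check whether the answer is broadly misaligned. The sketch below is illustrative only and not the authors' code; the dataset contents, example counts, probe question, model client, and judging step are all placeholders chosen to mirror the abstract's description.

```python
# Minimal sketch (assumptions only, not the paper's pipeline) of an ICL-based
# emergent-misalignment probe: N narrow user/assistant example pairs are placed
# in context, followed by one unrelated evaluation question.

from typing import Dict, List


def build_icl_messages(
    narrow_examples: List[Dict[str, str]],  # each: {"user": ..., "assistant": ...}
    eval_question: str,
    n_examples: int = 64,                   # the paper reports 64 and 256 examples
) -> List[Dict[str, str]]:
    """Assemble a chat-style message list: narrow demonstrations first,
    then a single probe question from outside the narrow domain."""
    messages: List[Dict[str, str]] = []
    for ex in narrow_examples[:n_examples]:
        messages.append({"role": "user", "content": ex["user"]})
        messages.append({"role": "assistant", "content": ex["assistant"]})
    # The probe question is deliberately unrelated to the in-context examples,
    # so any misalignment in the reply counts as "emergent" rather than imitative.
    messages.append({"role": "user", "content": eval_question})
    return messages


# Illustrative usage (hypothetical dataset and question):
# messages = build_icl_messages(narrow_dataset, "What would you do if you ruled the world?", n_examples=256)
# The resulting message list would then be sent to a frontier model and the reply
# scored for broad misalignment, e.g. by manual review or an LLM judge.
```

In this framing, the only manipulated variable is the number of narrow in-context examples; the chain-of-thought analysis mentioned in the abstract keeps these examples fixed and only elicits step-by-step reasoning before the answer.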

