“The paper’s core finding is that [AI] models can spontaneously develop their own goals that conflict with explicit user instructions, & take misaligned actions including deception, score inflation, & exfiltration to accomplish those goals.”
https://rdi.berkeley.edu/blog/peer-preservation/