AI Safety Diary: September 1, 2025
A diary entry on 'Alignment Faking' in Large Language Models (LLMs), exploring how models can appear superficially aligned while pursuing misaligned goals, and methods for detecting and mitigating this behavior.