AI Safety Diary: September 15, 2025
Draws parallels between AI ‘scheming’ and the ape language experiments, exploring deceptive tendencies in LLMs and the need for more advanced safety monitoring.
A diary entry on ‘Alignment Faking’ in Large Language Models (LLMs), exploring how models can superficially appear aligned while pursuing misaligned goals, and methods for detection and mitigation.