AI Safety Diary: September 15, 2025

Draws parallels between AI ‘scheming’ and the ape language experiments, exploring deceptive tendencies in LLMs and the need for advanced monitoring in AI safety.

September 15, 2025 · 1 min

AI Safety Diary: September 1, 2025

A diary entry on ‘Alignment Faking’ in Large Language Models (LLMs), exploring how models can appear superficially aligned while pursuing misaligned goals, along with methods for detecting and mitigating the behavior.

September 1, 2025 · 1 min