AI Safety Diary: September 15, 2025

Draws parallels between AI ‘scheming’ and ape language experiments, exploring deceptive tendencies in LLMs and the need for advanced monitoring to support AI safety.

September 15, 2025 · 1 min

AI Safety Diary: September 12, 2025

Examines the challenges of predicting AI agent behavior from observed actions, and their implications for AI safety, alignment, and the need for robust monitoring.

September 12, 2025 · 1 min

AI Safety Diary: August 30, 2025

A diary entry on tracing the reasoning processes of Large Language Models (LLMs) to improve interpretability, and a discussion of the inherent difficulties of achieving AI alignment.

August 30, 2025 · 1 min

AI Safety Diary: August 13, 2025

A diary entry on Unit 1 of the BlueDot AI Alignment course, covering foundational concepts like neural networks, gradient descent, transformers, and the future impacts of AI.

August 13, 2025 · 1 min