AI Alignment

AI Safety Diary: September 28, 2025

A diary entry on the 4th chapter of the AI Safety Book, which discusses the engineering principles required to build robust and reliable AI systems, drawing parallels with traditional safety engineering fields.

AI Safety Diary: September 27, 2025

A diary entry on the 3rd chapter of the AI Safety Book, focusing on the core challenges of single-agent safety, such as specifying correct reward functions and preventing unintended behaviors in a single AI system.

AI Safety Diary: September 24, 2025

A diary entry on the 3rd chapter of the AI Safety Atlas, which covers the different high-level strategies being pursued to mitigate AI risks, including technical alignment, policy, and strategy research.

AI Safety Diary: September 21, 2025

A diary entry on Chapter 9 of the AI Safety Atlas, focusing on interpretability and the importance of understanding the internal workings of complex ‘black box’ AI models to ensure safety.

AI Safety Diary: September 20, 2025

A diary entry on Chapter 8 of the AI Safety Atlas, focusing on the challenge of scalable oversight and how to effectively supervise AI systems that may become more intelligent than humans.

AI Safety Diary: September 19, 2025

A diary entry on Chapter 7 of the AI Safety Atlas, focusing on the challenge of generalization and ensuring AI systems behave reliably when encountering novel, out-of-distribution scenarios.

AI Safety Diary: September 18, 2025

A diary entry on Chapter 6 of the AI Safety Atlas, focusing on the challenge of misspecification, where AI systems pursue flawed or incomplete goals, leading to unintended and potentially harmful outcomes.

AI Safety Diary: September 15, 2025

Draws parallels between AI ‘scheming’ and ape language experiments, exploring deceptive tendencies in LLMs and the need for advanced monitoring for AI safety.

AI Safety Diary: September 12, 2025

Examines the challenges of predicting AI agent behavior from observed actions and its implications for AI safety, alignment, and the need for robust monitoring.

AI Safety Diary: September 11, 2025

A diary entry covering AI personalities, utility engineering for emergent value systems, and methods for evaluating the goal-directedness of Large Language Models (LLMs).