AI Safety Diary: August 23, 2025
A diary entry on the audio version of Chapter 3 of the AI Safety Atlas, focusing on strategies for mitigating AI risks, including technical approaches like alignment and interpretability, and governance strategies.
A diary entry on several Anthropic discussions, including AI interpretability, the affective use of AI for emotional support, and the philosophical questions surrounding AI consciousness and model welfare.
A diary entry on Anthropic’s research into Persona Vectors, a method for monitoring and controlling character traits in Large Language Models (LLMs) to improve safety and alignment.