AI Safety Diary: September 4, 2025
A diary entry introducing AI interpretability and discussing a paper on the limitations of sparse autoencoders for finding canonical units of analysis in LLMs.
A diary entry on tracing the reasoning processes of Large Language Models (LLMs) to enhance interpretability, and a discussion of the inherent challenges of achieving AI alignment.
A diary entry on ‘Thought Anchors’, a concept for identifying key reasoning steps in Chain-of-Thought (CoT) processes that significantly influence LLM behavior, enhancing interpretability for AI safety.
A diary entry on several discussions from Anthropic, covering AI interpretability, the affective use of AI for emotional support, and the philosophical questions surrounding AI consciousness and model welfare.
A diary entry on Anthropic’s research into Persona Vectors, a method for monitoring and controlling character traits in Large Language Models (LLMs) to improve safety and alignment.