AI Interpretability

AI Safety Diary: September 11, 2025

A diary entry covering AI personalities, utility engineering for emergent value systems, and methods for evaluating the goal-directedness of Large Language Models (LLMs).

AI Safety Diary: September 9, 2025

A diary entry on longtermism and its moral implications for the future, and a paper on teaching models to verbalize reward hacking in Chain-of-Thought reasoning.

AI Safety Diary: September 7, 2025

A diary entry on advanced prompt engineering techniques and the faithfulness of LLM self-explanations for commonsense tasks.

AI Safety Diary: September 6, 2025

A diary entry on how Chain-of-Thought (CoT) reasoning affects LLMs’ ability to evade monitors, and the challenge of unfaithful reasoning in model explanations.

AI Safety Diary: September 5, 2025

A diary entry on the challenges of scaling interpretability for complex AI models and methods for measuring the faithfulness of LLM explanations.

AI Safety Diary: September 4, 2025

A diary entry introducing AI interpretability and discussing a paper on the limitations of sparse autoencoders for finding canonical units of analysis in LLMs.

AI Safety Diary: August 30, 2025

A diary entry on tracing the reasoning processes of Large Language Models (LLMs) to enhance interpretability, and a discussion of the inherent difficulties of achieving AI alignment.

AI Safety Diary: August 28, 2025

A diary entry on ‘Thought Anchors’, a concept for identifying key reasoning steps in Chain-of-Thought (CoT) processes that significantly influence LLM behavior, enhancing interpretability for AI safety.

AI Safety Diary: August 27, 2025

A diary entry on the unfaithfulness of Chain-of-Thought (CoT) reasoning in LLMs, highlighting issues like implicit biases and logically contradictory outputs, which pose challenges for AI safety monitoring.

AI Safety Diary: August 26, 2025

A diary entry on Chain-of-Thought (CoT) monitorability as a fragile opportunity for AI safety, focusing on detecting misbehavior in LLMs and the challenges of maintaining transparency.