AI Safety Diary: September 11, 2025

A diary entry covering AI personalities, utility engineering for emergent value systems, and methods for evaluating the goal-directedness of Large Language Models (LLMs).

September 11, 2025 · 1 min

AI Safety Diary: September 8, 2025

A diary entry on common use cases for AI models and the risks of models obfuscating their reasoning to evade safety monitors.

September 8, 2025 · 1 min

AI Safety Diary: September 7, 2025

A diary entry on advanced prompt engineering techniques and the faithfulness of LLM self-explanations for commonsense tasks.

September 7, 2025 · 1 min

AI Safety Diary: September 5, 2025

A diary entry on the challenges of scaling interpretability for complex AI models and methods for measuring the faithfulness of LLM explanations.

September 5, 2025 · 1 min

AI Safety Diary: September 4, 2025

A diary entry introducing AI interpretability and discussing a paper on the limitations of sparse autoencoders for finding canonical units of analysis in LLMs.

September 4, 2025 · 1 min

AI Safety Diary: September 2, 2025

A diary entry on Anthropic’s strategies for combating AI-enabled cybercrime, including threat intelligence, robust safety protocols, and collaboration to prevent misuse of AI systems.

September 2, 2025 · 1 min

AI Safety Diary: September 1, 2025

A diary entry on ‘Alignment Faking’ in Large Language Models (LLMs), exploring how models can superficially appear aligned while pursuing misaligned goals, and methods for detection and mitigation.

September 1, 2025 · 1 min

AI Safety Diary: August 31, 2025

A diary entry on defending against AI jailbreaks, discussing Anthropic’s strategies for countering attempts to bypass model safety constraints and elicit harmful or unintended responses.

August 31, 2025 · 1 min

AI Safety Diary: August 30, 2025

A diary entry on tracing the reasoning processes of Large Language Models (LLMs) to enhance interpretability, and a discussion of the inherent difficulties of achieving AI alignment.

August 30, 2025 · 1 min

AI Safety Diary: August 19, 2025

A diary entry on the societal impacts of AI, including ethical concerns like bias and job displacement, and strategies for controlling powerful AI systems to ensure alignment and mitigate risks.

August 19, 2025 · 1 min