AI Safety Diary: September 11, 2025
A diary entry covering AI personalities, utility engineering for emergent value systems, and methods for evaluating the goal-directedness of Large Language Models (LLMs).
A diary entry on common use cases for AI models and the risks of models obfuscating their reasoning to evade safety monitors.
A diary entry on advanced prompt engineering techniques and the faithfulness of LLM self-explanations for commonsense tasks.
A diary entry on the challenges of scaling interpretability for complex AI models and methods for measuring the faithfulness of LLM explanations.
A diary entry introducing AI interpretability and discussing a paper on the limitations of sparse autoencoders for finding canonical units of analysis in LLMs.
A diary entry on Anthropic’s strategies for combating AI-enabled cybercrime, including threat intelligence, robust safety protocols, and collaboration to prevent misuse of AI systems.
A diary entry on ‘Alignment Faking’ in Large Language Models (LLMs), exploring how models can superficially appear aligned while pursuing misaligned goals, and methods for detection and mitigation.
A diary entry on defending against AI jailbreaks, attempts to bypass model safety constraints and elicit harmful or unintended responses, and Anthropic’s strategies for countering them.
A diary entry on tracing the reasoning processes of Large Language Models (LLMs) to enhance interpretability, and a discussion on the inherent difficulties and challenges in achieving AI alignment.
A diary entry on the societal impacts of AI, including ethical concerns like bias and job displacement, and strategies for controlling powerful AI systems to ensure alignment and mitigate risks.