AI Safety Diary: September 1, 2025

A diary entry on ‘Alignment Faking’ in Large Language Models (LLMs), exploring how models can superficially appear aligned while pursuing misaligned goals, and methods for detection and mitigation.

September 1, 2025 · 1 min

AI Safety Diary: August 31, 2025

A diary entry on defending against AI jailbreaks, discussing Anthropic’s strategies for preventing attempts to bypass model safety constraints and elicit harmful or unintended responses.

August 31, 2025 · 1 min

AI Safety Diary: August 30, 2025

A diary entry on tracing the reasoning processes of Large Language Models (LLMs) to enhance interpretability, and a discussion of the inherent difficulties of achieving AI alignment.

August 30, 2025 · 1 min

AI Safety Diary: August 29, 2025

A diary entry on AI governance strategies to avoid extinction risks, discussing catastrophic risks from misalignment, misuse, and geopolitical conflict, and the need for urgent research into governance mechanisms.

August 29, 2025 · 1 min

AI Safety Diary: August 28, 2025

A diary entry on ‘Thought Anchors’, a concept for identifying key reasoning steps in Chain-of-Thought (CoT) processes that significantly influence LLM behavior, enhancing interpretability for AI safety.

August 28, 2025 · 1 min

AI Safety Diary: August 27, 2025

A diary entry on the unfaithfulness of Chain-of-Thought (CoT) reasoning in LLMs, highlighting issues like implicit biases and logically contradictory outputs, which pose challenges for AI safety monitoring.

August 27, 2025 · 1 min

AI Safety Diary: August 26, 2025

A diary entry on Chain of Thought (CoT) monitorability as a fragile opportunity for AI safety, focusing on detecting misbehavior in LLMs and the challenges of maintaining transparency.

August 26, 2025 · 1 min

AI Safety Diary: August 25, 2025

A diary entry on Chapter 4 of the Effective Altruism Handbook, ‘Our Final Century?’, which examines existential risks, particularly human-made pandemics, and strategies for biosecurity.

August 25, 2025 · 1 min

AI Safety Diary: August 24, 2025

A diary entry on the audio version of Chapter 4 of the AI Safety Atlas, focusing on governance strategies for safe AI development, including safety standards, international treaties, and regulatory policies.

August 24, 2025 · 1 min

AI Safety Diary: August 23, 2025

A diary entry on the audio version of Chapter 3 of the AI Safety Atlas, focusing on strategies for mitigating AI risks, including technical approaches like alignment and interpretability, and governance strategies.

August 23, 2025 · 1 min