AI Alignment

AI Safety Diary: September 10, 2025

A diary entry on AI risks, including misalignment, misuse, and s-risks, and an exploration of emergent misalignment due to prompt sensitivity in LLMs.

AI Safety Diary: September 9, 2025

A diary entry on longtermism and its moral implications for the future, and a paper on teaching models to verbalize reward hacking in Chain-of-Thought reasoning.

AI Safety Diary: September 8, 2025

A diary entry on common use cases for AI models and the risks of models obfuscating their reasoning to evade safety monitors.

AI Safety Diary: September 7, 2025

A diary entry on advanced prompt engineering techniques and the faithfulness of LLM self-explanations for commonsense tasks.

AI Safety Diary: September 6, 2025

A diary entry on how Chain-of-Thought (CoT) reasoning affects LLM’s ability to evade monitors, and the challenge of unfaithful reasoning in model explanations.

AI Safety Diary: September 3, 2025

A diary entry on Chapter 5 of the AI Safety Atlas, focusing on evaluation methods for assessing the safety and alignment of advanced AI systems, including benchmarks and robustness testing.

AI Safety Diary: September 1, 2025

A diary entry on ‘Alignment Faking’ in Large Language Models (LLMs), exploring how models can superficially appear aligned while pursuing misaligned goals, and methods for detection and mitigation.

AI Safety Diary: August 30, 2025

A diary entry on tracing the reasoning processes of Large Language Models (LLMs) to enhance interpretability, and a discussion on the inherent difficulties and challenges in achieving AI alignment.

AI Safety Diary: August 23, 2025

A diary entry on the audio version of Chapter 3 of the AI Safety Atlas, focusing on strategies for mitigating AI risks, including technical approaches like alignment and interpretability, and governance strategies.

AI Safety Diary: August 22, 2025

A diary entry on the audio version of Chapter 2 of the AI Safety Atlas, focusing on various AI risks, including misuse, accidents, and systemic risks, and the challenges of alignment failures.