AI Safety Diary: September 10, 2025

A diary entry on AI risks, including misalignment, misuse, and s-risks, and an exploration of emergent misalignment due to prompt sensitivity in LLMs.

September 10, 2025 · 1 min

AI Safety Diary: September 9, 2025

A diary entry on longtermism and its moral implications for the future, and a paper on teaching models to verbalize reward hacking in Chain-of-Thought reasoning.

September 9, 2025 · 1 min

AI Safety Diary: August 25, 2025

A diary entry on Chapter 4 of the Effective Altruism Handbook, ‘Our Final Century?’, which examines existential risks, particularly human-made pandemics, and strategies for biosecurity.

August 25, 2025 · 1 min

AI Safety Diary: August 18, 2025

A diary entry on Chapter 3 of the Effective Altruism Handbook, ‘Radical Empathy’, which explores impartial care and extending empathy to non-human animals.

August 18, 2025 · 1 min

AI Safety Diary: August 15, 2025

A diary entry on Chapter 2 of the Effective Altruism Handbook, which focuses on the large differences in impact among interventions aimed at alleviating global poverty.

August 15, 2025 · 1 min

AI Safety Diary: August 8, 2025

A diary entry exploring the ‘Effectiveness Mindset’ from the Effective Altruism Handbook in the context of AI safety and governance.

August 8, 2025 · 1 min