AI Safety Diary: September 22, 2025
A diary entry on Chapter 8 of the Effective Altruism Handbook, focusing on the practical application of EA principles in career choices, donations, and community involvement to maximize positive impact.
A diary entry on Chapter 9 of the AI Safety Atlas, focusing on interpretability and the importance of understanding the internal workings of complex ‘black box’ AI models to ensure safety.
A diary entry on Chapter 8 of the AI Safety Atlas, focusing on the challenge of scalable oversight and how to effectively supervise AI systems that may become more intelligent than humans.
A diary entry on Chapter 7 of the AI Safety Atlas, focusing on the challenge of generalization and ensuring AI systems behave reliably when encountering novel, out-of-distribution scenarios.
A diary entry on Chapter 6 of the AI Safety Atlas, focusing on the challenge of misspecification, where AI systems pursue flawed or incomplete goals, leading to unintended and potentially harmful outcomes.
Watched the introductory lecture from the AI Safety Book series. The video provides a foundational overview of AI safety, discusses the ethical considerations, and outlines the landscape of potential risks associated with advanced AI.
Completed Chapter 7 of the Effective Altruism Handbook, ‘What do you think?’, which emphasizes the importance of critical thinking and actively contributing personal insights to the community discourse.
Draws parallels between AI ‘scheming’ and ape language experiments, exploring deceptive tendencies in LLMs and the need for advanced monitoring to ensure AI safety.
Evaluates the ability of frontier LLMs to persuade users on harmful topics, assessing their strategies and the implications for AI safety and ethics.
Investigates how LLMs can be tuned to become more susceptible to jailbreaking, highlighting the implications for AI safety and the need for robust defenses.