AI Safety Diary: September 11, 2025
A diary entry covering AI personalities, utility engineering for emergent value systems, and methods for evaluating the goal-directedness of Large Language Models (LLMs).
AI Safety Diary: September 10, 2025
A diary entry on AI risks, including misalignment, misuse, and s-risks, and an exploration of emergent misalignment due to prompt sensitivity in LLMs.
AI Safety Diary: September 9, 2025
A diary entry on longtermism and its moral implications for the future, and a paper on teaching models to verbalize reward hacking in Chain-of-Thought reasoning.
AI Safety Diary: September 8, 2025
A diary entry on common use cases for AI models and the risks of models obfuscating their reasoning to evade safety monitors.
AI Safety Diary: September 7, 2025
A diary entry on advanced prompt engineering techniques and the faithfulness of LLM self-explanations for commonsense tasks.
AI Safety Diary: September 6, 2025
A diary entry on how Chain-of-Thought (CoT) reasoning affects LLMs’ ability to evade monitors, and the challenge of unfaithful reasoning in model explanations.
AI Safety Diary: September 5, 2025
A diary entry on the challenges of scaling interpretability for complex AI models and methods for measuring the faithfulness of LLM explanations.
AI Safety Diary: September 4, 2025
A diary entry introducing AI interpretability and discussing a paper on the limitations of sparse autoencoders for finding canonical units of analysis in LLMs.
AI Safety Diary: September 3, 2025
A diary entry on Chapter 5 of the AI Safety Atlas, focusing on evaluation methods for assessing the safety and alignment of advanced AI systems, including benchmarks and robustness testing.
AI Safety Diary: September 2, 2025
A diary entry on Anthropic’s strategies for combating AI-enabled cybercrime, including threat intelligence, robust safety protocols, and collaboration to prevent misuse of AI systems.