Let’s Talk About AI Safety, Computer Science

AI Safety Diary: September 15, 2025

Draws parallels between AI ‘scheming’ and ape language experiments, exploring deceptive tendencies in LLMs and the need for advanced monitoring for AI safety.

September 15, 2025 · 1 min

AI Safety Diary: September 14, 2025

Evaluates the ability of frontier LLMs to persuade users on harmful topics, assessing their strategies and the implications for AI safety and ethics.

September 14, 2025 · 1 min

AI Safety Diary: September 13, 2025

Investigates how LLMs can be tuned to become more susceptible to jailbreaking, highlighting the implications for AI safety and the need for robust defenses.

September 13, 2025 · 1 min

AI Safety Diary: September 12, 2025

Examines the challenges of predicting AI agent behavior from observed actions and its implications for AI safety, alignment, and the need for robust monitoring.

September 12, 2025 · 1 min

AI Safety Diary: September 11, 2025

A diary entry covering AI personalities, utility engineering for emergent value systems, and methods for evaluating the goal-directedness of Large Language Models (LLMs).

September 11, 2025 · 1 min

AI Safety Diary: September 10, 2025

A diary entry on AI risks, including misalignment, misuse, and s-risks, and an exploration of emergent misalignment due to prompt sensitivity in LLMs.

September 10, 2025 · 1 min

AI Safety Diary: September 9, 2025

A diary entry on longtermism and its moral implications for the future, and a paper on teaching models to verbalize reward hacking in Chain-of-Thought reasoning.

September 9, 2025 · 1 min

AI Safety Diary: September 8, 2025

A diary entry on common use cases for AI models and the risks of models obfuscating their reasoning to evade safety monitors.

September 8, 2025 · 1 min

AI Safety Diary: September 7, 2025

A diary entry on advanced prompt engineering techniques and the faithfulness of LLM self-explanations for commonsense tasks.

September 7, 2025 · 1 min

AI Safety Diary: September 6, 2025

A diary entry on how Chain-of-Thought (CoT) reasoning affects an LLM’s ability to evade monitors, and the challenge of unfaithful reasoning in model explanations.

September 6, 2025 · 1 min