AI Safety Diary: August 27, 2025

Today, I explored a research paper as part of my AI safety studies. Below is the resource I reviewed.

Resource: Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Source: Chain-of-Thought Reasoning In The Wild Is Not Always Faithful by Iván Arcuschin et al., arXiv:2503.08679, June 2025.
Summary: This paper investigates the faithfulness of chain-of-thought (CoT) reasoning in LLMs, finding that models can produce logically contradictory CoT outputs due to implicit biases, termed “Implicit Post-Hoc Rationalization.” For example, models may justify answering “Yes” to both “Is X bigger than Y?” and “Is Y bigger than X?” The study shows high rates of unfaithful reasoning in models like GPT-4o-mini (13%) and Haiku 3.5 (7%), raising challenges for detecting undesired behavior via CoT monitoring.