AI Safety Diary: September 5, 2025
A diary entry on the challenges of scaling interpretability for complex AI models and methods for measuring the faithfulness of LLM explanations.
A diary entry on the challenges of scaling interpretability for complex AI models and methods for measuring the faithfulness of LLM explanations.