Serhat Giydiren

Tags

  • accidents 1
  • adversarial testing 1
  • affective ai 1
  • agi 2
  • ai alignment 2
  • ai capabilities 2
  • ai consciousness 1
  • ai control 1
  • ai cybercrime 1
  • ai fundamentals 1
  • ai governance 1
  • ai jailbreaks 1
  • ai risks 2
  • ai safety atlas 6
  • ai safety diary 35
  • ai strategies 1
  • ai triad 1
  • algorithms 2
  • alignment faking 1
  • andrej karpathy 1
  • animal welfare 1
  • anthropic 12
  • architecture 3
  • autonomous agents 1
  • benchmarks 1
  • biosecurity 1
  • bluedot 3
  • caching 1
  • catastrophic risks 1
  • chain-of-thought 6
  • coding interview 1
  • collective action problems 1
  • complex systems 1
  • compute 1
  • data 1
  • data structures 1
  • deceptive ai 1
  • deep learning 1
  • distributed systems 3
  • effective altruism 6
  • evaluations 1
  • existential risks 1
  • extinction risk 1
  • faithfulness 3
  • geopolitical risk 1
  • givewell 1
  • global poverty 1
  • goal-directedness 1
  • governance 3
  • gradient descent 1
  • implicit bias 1
  • intelligence measurement 2
  • international cooperation 1
  • interpretability 5
  • intervention impact 1
  • interview 1
  • interview prep 5
  • large language models 2
  • llm alignment 1
  • llm benchmarks 1
  • llm monitoring 1
  • llm reasoning 3
  • llm safety 1
  • llm usage 1
  • long-term coherence 1
  • longtermism 1
  • machine ethics 1
  • machine learning 1
  • message queue 1
  • misalignment 2
  • misuse 1
  • misuse prevention 1
  • model safety 1
  • model steering 1
  • model welfare 1
  • monitorability 1
  • monitoring 2
  • national security 1
  • neural networks 1
  • notification service 1
  • obfuscation 1
  • off switch 1
  • pandemics 1
  • persona vectors 1
  • prompt engineering 1
  • prompt sensitivity 1
  • radical empathy 1
  • regulatory policies 1
  • resources 2
  • reward hacking 1
  • robustness 1
  • s-risks 1
  • safety engineering 1
  • safety standards 1
  • scalability 3
  • scaling interpretability 1
  • scout mindset 1
  • self-explanation 1
  • single-agent safety 1
  • societal impact 1
  • sparse autoencoders 1
  • system design 6
  • systemic risks 1
  • technical approaches 1
  • technical interview 1
  • thought anchors 1
  • threat intelligence 1
  • transformers 1
  • unfaithful reasoning 1
  • utility engineering 1
  • utility functions 1
  • vending-bench 1
© 2025 Serhat Giydiren · Powered by Hugo & PaperMod