Serhat Giydiren

Tags

  • accidents 1
  • adversarial testing 1
  • affective ai 1
  • agi 2
  • ai alignment 4
  • ai capabilities 2
  • ai consciousness 1
  • ai control 1
  • ai cybercrime 1
  • ai ethics 2
  • ai fundamentals 2
  • ai governance 3
  • ai jailbreaks 1
  • ai risks 4
  • ai safety atlas 13
  • ai safety book 7
  • ai safety diary 56
  • ai safety introduction 1
  • ai safety strategies 1
  • ai scheming 1
  • ai strategies 1
  • ai supervision 1
  • ai triad 1
  • algorithms 2
  • alignment 1
  • alignment faking 1
  • andrej karpathy 1
  • animal welfare 1
  • anthropic 12
  • architecture 3
  • assurance 1
  • autonomous agents 1
  • behavioral models 1
  • benchmarks 1
  • beneficial ai 1
  • biosecurity 1
  • black box models 1
  • bluedot 3
  • caching 1
  • career planning 1
  • catastrophic risks 1
  • chain-of-thought 6
  • coding interview 1
  • collective action 1
  • collective action problems 1
  • community 1
  • complex systems 2
  • compute 1
  • coordination 1
  • critical thinking 1
  • data 1
  • data structures 1
  • deceptive ai 2
  • deep learning 2
  • distributed systems 3
  • distributional shift 1
  • effective altruism 8
  • emergence 1
  • evaluations 1
  • existential risk 1
  • existential risks 1
  • extinction risk 1
  • faithfulness 3
  • game theory 1
  • generalization 1
  • geopolitical risk 1
  • givewell 1
  • global poverty 1
  • goal alignment 1
  • goal-directedness 1
  • governance 3
  • gradient descent 1
  • human-in-the-loop 1
  • implicit bias 1
  • intelligence measurement 2
  • international cooperation 1
  • international coordination 1
  • interpretability 6
  • intervention impact 1
  • interview 1
  • interview prep 5
  • jailbreaking 1
  • large language models 2
  • llm alignment 1
  • llm benchmarks 1
  • llm monitoring 1
  • llm reasoning 3
  • llm safety 2
  • llm security 1
  • llm usage 1
  • long-term coherence 1
  • longtermism 1
  • machine ethics 2
  • machine learning 2
  • message queue 1
  • misalignment 2
  • misspecification 1
  • misuse 2
  • misuse prevention 1
  • model safety 1
  • model steering 1
  • model welfare 1
  • monitorability 1
  • monitoring 2
  • multi-agent systems 1
  • national security 1
  • neural networks 1
  • notification service 1
  • obfuscation 1
  • off switch 1
  • pandemics 1
  • persona vectors 1
  • persuasive ai 1
  • policy 2
  • predicting agents 1
  • prompt engineering 1
  • prompt sensitivity 1
  • proxy goals 1
  • radical empathy 1
  • red teaming 1
  • regulation 1
  • regulatory policies 1
  • resource allocation 1
  • resources 2
  • reward hacking 1
  • reward misspecification 1
  • robustness 3
  • s-risks 1
  • safety engineering 2
  • safety standards 1
  • scalability 3
  • scalable oversight 1
  • scaling interpretability 1
  • scout mindset 1
  • self-explanation 1
  • single-agent safety 2
  • societal impact 1
  • sparse autoencoders 1
  • structural risk 1
  • system design 6
  • systemic risks 1
  • technical alignment 1
  • technical approaches 1
  • technical interview 1
  • thought anchors 1
  • threat intelligence 1
  • transformers 1
  • transparency 1
  • unfaithful reasoning 1
  • utility engineering 1
  • utility functions 1
  • value alignment 1
  • vending-bench 1