AI Safety & Alignment
Research and discussion on making AI systems safe and aligned with human values
Most AI breakthroughs are thought of as isolated advances, but Karpathy argues they signal a broader shift: code is being replaced by neural networks trained on data.
“This reframe changed how I think about software. Writing code is increasingly the wrong level of abstraction — you train the behavior instead.”
Tim Urban's deep dive on artificial intelligence that made the AI alignment problem legible to a general audience. Still one of the best explainers ever written.
“5 million people read this. Worth understanding what made it work — it's a masterclass in making hard ideas accessible without dumbing them down.”
Simon Willison's exhaustive year-in-review of every major LLM development in 2023. Dense with links, and full of actual signal amid the hype.
“Best factual record of what actually happened vs. what the press said happened. Simon is relentlessly empirical — no hype, just what he tested and observed.”
Emily Bender and colleagues on stochastic parrots — why LLMs aren't 'understanding' anything and why that gap matters deeply.
“A necessary counterweight to anthropomorphizing LLMs. The arguments in here are going to matter more and more as these systems become infrastructure.”
Anthropic's research on training a helpful and harmless AI assistant.
“One of the most practical approaches to alignment, and one that is actually deployed in production.”
Anthropic's interpretability work identifying millions of interpretable features inside Claude.
“If we can understand what is happening inside these models, we can actually verify alignment claims.”
Eliezer Yudkowsky on why AGI alignment is extremely difficult.
“Agree with him or not, this is the strongest case for why alignment is harder than most people think.”
Nature's editorial on the growing gap between AI performance on scientific benchmarks and actual scientific understanding — a careful examination of what we're actually measuring.
“Precise and careful — exactly what you want from Nature's editorial desk. The distinction between benchmark performance and genuine capability keeps getting blurred.”