Benchmarks are broken

Here's why frontier labs treat them as PR.

Academic benchmarks make great headlines, and terrible AI.

The Problem with Benchmarks

Academia designs benchmarks for papers, not products. Result? We get IFEval—a test that measures whether AI can make the letter "n" appear three times. Riveting.
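
To make that concrete: IFEval-style checks are just rule-based string tests run over the model's response. Here is a minimal sketch in Python of that kind of verifier (the function name and threshold are illustrative, not IFEval's actual code):

    # Illustrative only: a rule-based check in the spirit of IFEval's
    # "letter frequency" instructions, not IFEval's actual implementation.
    def letter_appears_at_least(response: str, letter: str, min_count: int = 3) -> bool:
        """Pass if the given letter occurs at least min_count times in the response."""
        return response.lower().count(letter.lower()) >= min_count

    # The check passes whether or not the response is any good.
    print(letter_appears_at_least("Nine nannies napping.", "n"))  # True

Easy to verify, trivial to satisfy, and silent on whether the answer is actually useful.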

Datasets are built by researchers who know algorithms, not data. That's why 30% of "Humanity's Last Exam" answers are wrong and 36% of HellaSwag's examples contain errors. Garbage in, garbage out.
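
A back-of-the-envelope consequence, treating those quoted figures as the fraction of wrong gold labels (a simplification): even a model that answers every question correctly gets capped well below 100% on such a benchmark.

    # Back-of-the-envelope: wrong gold labels cap measured accuracy.
    # The error rates are the figures quoted above, taken at face value.
    def measured_accuracy_ceiling(label_error_rate: float) -> float:
        """A model that is always right only agrees with the gold label
        where the gold label itself is correct."""
        return 1.0 - label_error_rate

    print(measured_accuracy_ceiling(0.36))  # HellaSwag-style: 0.64 at best
    print(measured_accuracy_ceiling(0.30))  # Humanity's Last Exam-style: 0.70 at best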

The real world isn’t multiple choice. Benchmarks are designed to be verifiable: multiple choice, single answer. But no checkbox test captures whether AI can build beautiful websites, write compelling copy, or hold meaningful conversations.
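
Concretely, "verifiable" usually means exact-match scoring: compare the model's pick against a single gold answer and average. A minimal sketch, illustrative rather than any specific benchmark's harness:

    # Illustrative exact-match scoring, the kind of metric benchmarks reward.
    # There is no comparable one-liner for "is this website beautiful?"
    def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
        """Fraction of items where the predicted choice equals the single gold answer."""
        assert len(predictions) == len(gold)
        correct = sum(p == g for p, g in zip(predictions, gold))
        return correct / len(gold)

    print(exact_match_accuracy(["B", "C", "A"], ["B", "D", "A"]))  # ~0.67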

Narrow metrics get gamed. In an era where billions of dollars are on the line, teams are pressured to optimize for the measurement, not the capability. Every. Single. Time.

Why Frontier Labs Use Human Evals Instead

Frontier researchers have abandoned academic benchmarks as their north star. They leave those to Marketing and Comms.

Instead, superintelligence teams rely on human evaluations as their gold standard. Humans can measure nuance, creativity, and wisdom—things benchmarks can’t.
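
A common shape for these evaluations is pairwise comparison: show raters two responses to the same prompt and count wins. A minimal sketch, assuming a simple win-rate aggregation (real lab pipelines are far more elaborate):

    # Minimal sketch of pairwise human evaluation aggregated as a win rate.
    # The verdicts below are hypothetical rater judgments of model A vs. model B.
    def win_rate(preferences: list[str]) -> float:
        """Share of comparisons won by model A, counting ties as half a win."""
        wins = sum(1.0 if p == "A" else 0.5 if p == "tie" else 0.0 for p in preferences)
        return wins / len(preferences)

    verdicts = ["A", "A", "B", "tie", "A"]  # hypothetical
    print(f"Model A win rate: {win_rate(verdicts):.2f}")  # 0.70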

The Benchmark Death Spiral

Here's how broken benchmarks hurt everyone:

  1. Lab announces: "Our model scores 95% on SuperBench!"
  2. Users discover: The model can't write a decent email.
  3. Community concludes: "They cheated!"

Trust evaporates. Progress stalls. Everyone loses.

Benchmarks might win headlines, but the industry needs north stars that reflect true ambition.
