Benchmarks are broken

Here's why frontier labs treat them as PR.

Academic benchmarks make great headlines, and terrible AI.

The Problem with Benchmarks

Academia designs benchmarks for papers, not products. Result? We get IFEval—a test that measures whether AI can make the letter "n" appear three times. Riveting.
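
To make that concrete: IFEval-style checks are just rule-based string tests run over the model's response. Here is a minimal sketch in Python of that kind of verifier (the function name and threshold are illustrative, not IFEval's actual code):

    # Illustrative only: a rule-based check in the spirit of IFEval's
    # "letter frequency" instructions, not IFEval's actual implementation.
    def letter_appears_at_least(response: str, letter: str, min_count: int = 3) -> bool:
        """Pass if the given letter occurs at least min_count times in the response."""
        return response.lower().count(letter.lower()) >= min_count

    # The check passes whether or not the response is any good.
    print(letter_appears_at_least("Nine nannies napping.", "n"))  # True

Easy to verify, trivial to satisfy, and silent on whether the answer is actually useful.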

Datasets are built by researchers who know algorithms, not data. That's why 30% of "Humanity's Last Exam" answers are wrong and 36% of HellaSwag's examples contain errors. Garbage in, garbage out.
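
A back-of-the-envelope consequence, treating those quoted figures as the fraction of wrong gold labels (a simplification): even a model that answers every question correctly gets capped well below 100% on such a benchmark.

    # Back-of-the-envelope: wrong gold labels cap measured accuracy.
    # The error rates are the figures quoted above, taken at face value.
    def measured_accuracy_ceiling(label_error_rate: float) -> float:
        """A model that is always right only agrees with the gold label
        where the gold label itself is correct."""
        return 1.0 - label_error_rate

    print(measured_accuracy_ceiling(0.36))  # HellaSwag-style: 0.64 at best
    print(measured_accuracy_ceiling(0.30))  # Humanity's Last Exam-style: 0.70 at best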

The real world isn’t multiple choice. Benchmarks are designed to be verifiable: multiple choice, single answer. But no checkbox test captures whether AI can build beautiful websites, write compelling copy, or hold meaningful conversations.
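
Concretely, "verifiable" usually means exact-match scoring: compare the model's pick against a single gold answer and average. A minimal sketch, illustrative rather than any specific benchmark's harness:

    # Illustrative exact-match scoring, the kind of metric benchmarks reward.
    # There is no comparable one-liner for "is this website beautiful?"
    def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
        """Fraction of items where the predicted choice equals the single gold answer."""
        assert len(predictions) == len(gold)
        correct = sum(p == g for p, g in zip(predictions, gold))
        return correct / len(gold)

    print(exact_match_accuracy(["B", "C", "A"], ["B", "D", "A"]))  # ~0.67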

Narrow metrics get gamed. In an era where billions of dollars are on the line, teams are pressured to optimize for the measurement, not the capability. Every. Single. Time.

Why Frontier Labs Use Human Evals Instead

Frontier researchers have abandoned academic benchmarks as their north star. They leave those to Marketing and Comms.

Instead, superintelligence teams rely on human evaluations as their gold standard. Humans can measure nuance, creativity, and wisdom—things benchmarks can’t.
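
A common shape for these evaluations is pairwise comparison: show raters two responses to the same prompt and count wins. A minimal sketch, assuming a simple win-rate aggregation (real lab pipelines are far more elaborate):

    # Minimal sketch of pairwise human evaluation aggregated as a win rate.
    # The verdicts below are hypothetical rater judgments of model A vs. model B.
    def win_rate(preferences: list[str]) -> float:
        """Share of comparisons won by model A, counting ties as half a win."""
        wins = sum(1.0 if p == "A" else 0.5 if p == "tie" else 0.0 for p in preferences)
        return wins / len(preferences)

    verdicts = ["A", "A", "B", "tie", "A"]  # hypothetical
    print(f"Model A win rate: {win_rate(verdicts):.2f}")  # 0.70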

The Benchmark Death Spiral

Here's how broken benchmarks hurt everyone:

  1. Lab announces: "Our model scores 95% on SuperBench!"
  2. Users discover: The model can't write a decent email.
  3. Community concludes: "They cheated!"

Trust evaporates. Progress stalls. Everyone loses.

Benchmarks might win headlines, but the industry needs north stars that reflect true ambition.
