Human Evals vs. Academic Benchmarks

Benchmarks are an easy way to generate headlines, but they’re a poor measure of real-world value.
- They’re often created by academics, with shoestring budgets and limited real-world context.
- They’re often designed by researchers without deep expertise in data quality; as a result, the benchmarks’ ground-truth answers are frequently wrong.
- Benchmarks built entirely around multiple-choice (or otherwise structured, easily verifiable) answers are common, but they poorly reflect how most people actually use chatbots. What multiple-choice benchmark could measure the ability to code a beautiful webpage or write poetry?
- Because of their narrow focus, they’re easily gamed and hacked.
For example, the IFEval instruction-following benchmark measures an LLM’s ability to satisfy constraints like “make sure the letter n appears at least 3 times.” Meanwhile, 30% of Humanity’s Last Exam chemistry and biology answers are likely wrong, and 36% of HellaSwag’s examples contain errors.
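To see why constraints like this are both trivially verifiable and trivially targetable, here is a minimal sketch of the kind of string test such an instruction reduces to (the function name and harness are hypothetical, not the official IFEval code):

```python
# Illustrative sketch of an IFEval-style verifier (not the official harness):
# a constraint like "make sure the letter n appears at least 3 times" boils
# down to a one-line string check -- easy to score automatically, and just as
# easy to optimize for without producing anything a user would find valuable.

def letter_frequency_check(response: str, letter: str = "n", min_count: int = 3) -> bool:
    """Return True if `letter` appears at least `min_count` times in `response`."""
    return response.lower().count(letter.lower()) >= min_count

if __name__ == "__main__":
    print(letter_frequency_check("No notion of nuance needed."))  # True: plenty of n's
    print(letter_frequency_check("A short reply."))               # False: no n's at all
```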
That’s why frontier labs use human evals as their gold standard.
When you focus on benchmarks instead of human evals, you end up in the following situation:
- A press release touts high benchmark scores.
- The community uses the model and finds it weak.
- Speculation grows that you’ve hacked the benchmarks.