Human Evals vs. Academic Benchmarks

Benchmarks are an easy way to generate headlines, but they’re a poor measure of real-world value.

  • They’re often created by academics working with shoestring budgets and limited real-world context.
  • They’re often designed by researchers without deep data expertise, so the benchmark answers themselves are frequently wrong.
  • Benchmarks where every answer is multiple choice (or otherwise structured, easily verifiable output) are common, but that format is a poor reflection of how most people use chatbots. What multiple-choice benchmark could measure the ability to code a beautiful webpage or write poetry?
  • Because of their narrow focus, they’re easily gamed and hacked.

For example, the IFEval instruction-following benchmark measures an LLM’s ability to follow instructions like “make sure the letter n appears at least 3 times.” Roughly 30% of Humanity's Last Exam chemistry and biology answers are likely wrong, and 36% of HellaSwag examples contain errors.
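
To make the point concrete, here is a minimal sketch (not IFEval’s actual implementation; the function name and example strings are hypothetical) of the kind of automated check such benchmarks rely on. Pass/fail comes down to a trivial string test that says nothing about whether the response is useful or well written.

```python
def letter_appears_at_least(response: str, letter: str = "n", minimum: int = 3) -> bool:
    """Return True if `letter` occurs at least `minimum` times in `response`."""
    return response.lower().count(letter.lower()) >= minimum

# A useless response can pass, and a perfectly good one can fail.
print(letter_appears_at_least("nnn"))                         # True: passes, yet worthless as an answer
print(letter_appears_at_least("Write a beautiful webpage."))  # False: a fine sentence that fails the check
```
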

That’s why frontier labs use human evals as their gold standard.

When you focus on benchmarks instead of human evals, you end up in the following situation:

  1. A press release touts high benchmark scores.
  2. The community uses the model and finds it weak.
  3. Speculation grows that you’ve hacked the benchmarks.