Human Evals vs. Academic Benchmarks

Benchmarks are an easy way to generate headlines, but they’re a poor measure of real-world value.
- They’re often created by academics, with shoestring budgets and limited real-world context.
- They’re often designed by researchers without deep expertise in data quality; as a result, the benchmarks’ ground-truth answers are frequently wrong.
- Benchmarks built entirely around multiple-choice (or otherwise structured, easily verifiable) answers are common, but they poorly reflect how most people actually use chatbots. What multiple-choice benchmark could measure the ability to code a beautiful webpage or write poetry?
- Because of their narrow focus, they’re easily gamed and hacked.
For example, the IFEval instruction-following benchmark measures an LLM’s ability to satisfy constraints like “make sure the letter n appears at least 3 times.” Meanwhile, 30% of Humanity’s Last Exam chemistry and biology answers are likely wrong, and 36% of HellaSwag’s examples contain errors.
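To see why constraints like this are both trivially verifiable and trivially targetable, here is a minimal sketch of the kind of string test such an instruction reduces to (the function name and harness are hypothetical, not the official IFEval code):

```python
# Illustrative sketch of an IFEval-style verifier (not the official harness):
# a constraint like "make sure the letter n appears at least 3 times" boils
# down to a one-line string check -- easy to score automatically, and just as
# easy to optimize for without producing anything a user would find valuable.

def letter_frequency_check(response: str, letter: str = "n", min_count: int = 3) -> bool:
    """Return True if `letter` appears at least `min_count` times in `response`."""
    return response.lower().count(letter.lower()) >= min_count

if __name__ == "__main__":
    print(letter_frequency_check("No notion of nuance needed."))  # True: plenty of n's
    print(letter_frequency_check("A short reply."))               # False: no n's at all
```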
That’s why frontier labs use human evals as their gold standard.
When you focus on benchmarks instead of human evals, you end up in the following situation:
- A press release touts high benchmark scores.
- The community uses the model and finds it weak.
- Speculation grows that you’ve hacked the benchmarks.