The Pitfalls of Inter-Rater Reliability in Data Labeling and Machine Learning

Inter-rater reliability metrics are a popular way of measuring the quality of data labels. This post describes their pitfalls and how to move beyond them.

Introduction

Imagine this: you’re building an AI problem solver that takes plain-English logic problems as the input and returns the correct solution. To train your models, you use a labeling platform with humans-in-the-loop to solve a sample set of 100 problems. Let’s try an example logic problem and see how we fare.

Jack is looking at Anne, but Anne is looking at George. Jack is married, but George is not. Is a married person looking at an unmarried person?

A) yes

B) no

C) cannot be determined

Choose carefully. What’s your answer? Did you select Option C? If so, there’s good news: 80% of respondents agree with you. But there’s also bad news — you’re wrong! The correct answer is Option A.

Here’s what’s even worse: many data labeling platforms will view the high consensus around Option C as strong evidence that it is the correct answer. And not only will you and the rest of the 80% be rewarded for your excellent work, but the fewer than 20% of respondents who answered correctly will be penalized for poor performance. The AI problem solver you’re trying to train may not end up so logical after all!

(Why is Option A correct? Well, Anne is either married or unmarried. If she’s married, then she is a married person looking at an unmarried person, George; if she’s unmarried, then Jack is a married person looking at an unmarried person, Anne.)
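The case analysis above can be checked by brute force. The following is a minimal sketch, not part of any labeling platform: the function name and data layout are made up for illustration. It tries both possibilities for Anne and checks whether some married person is looking at some unmarried person.

```python
def married_looks_at_unmarried(anne_married: bool) -> bool:
    """Return True if any married person is looking at an unmarried person."""
    # Known facts from the puzzle; Anne's status is the only unknown.
    married = {"Jack": True, "Anne": anne_married, "George": False}
    looking_at = [("Jack", "Anne"), ("Anne", "George")]
    return any(married[a] and not married[b] for a, b in looking_at)


# Try both possible worlds: Anne married, Anne unmarried.
results = {anne: married_looks_at_unmarried(anne) for anne in (True, False)}
answer_is_always_yes = all(results.values())

print(results)
print(answer_is_always_yes)
```

Because the check succeeds in both worlds, the answer does not depend on Anne's status, which is why Option C ("cannot be determined") is wrong.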

How is this possible, you ask? Welcome to the wild world of inter-rater reliability.

What is Inter-Rater Reliability?

In data labeling, inter-rater reliability is a measure of the genuine agreement among raters when assessing the same data (commonly 3 to 5 raters will evaluate every piece of data). Chance-corrected statistics such as Cohen's or Fleiss' kappa typically range from 0 to 1, where 1 is perfect agreement and 0 is indistinguishable from a random baseline; negative values, meaning agreement worse than chance, are possible but uncommon.
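To make the definition concrete, here is a small sketch of one common chance-corrected IRR statistic, Fleiss' kappa, written in plain Python. The post does not say which statistic any particular platform uses, so treat the choice of Fleiss' kappa, the function name, and the toy ratings as assumptions for illustration only.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of raters (at least 2)")

    per_item_counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of rater pairs that agree.
    observed = sum(
        (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
        for counts in per_item_counts
    ) / n_items

    # Agreement expected by chance, from the overall label proportions.
    totals = Counter()
    for counts in per_item_counts:
        totals.update(counts)
    expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if expected == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 4 comments, 3 raters each.
toy_ratings = [
    ["profane", "profane", "profane"],
    ["clean", "clean", "clean"],
    ["profane", "profane", "clean"],
    ["clean", "clean", "clean"],
]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 3))
```

On this toy data the raters disagree on only one comment, and kappa comes out at roughly 0.66. Note that the statistic only measures how much the raters agree with each other; nothing in the calculation knows which label is actually correct.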

When evaluating the quality of labeled data, inter-rater reliability (IRR) is often viewed as a holy grail metric. Common wisdom holds that the more raters agree on a given rating, the higher the chance that the rating is correct. In this way, high IRR is easily conflated with high-quality data.

While IRR can certainly be a useful evaluation tool, it is often a statistical siren song — an alluring metric that is unhelpful at best and actively misleading at worst.

Let's dive into the stormy seas of IRR, and explore the costs and benefits of relying on this metric when assessing data quality.

IRR may seem alluring...

When used appropriately, IRR can indeed be a useful tool for evaluating labeled data.

For example, imagine we’re building a content moderation feature for a social media app. The goal of our feature is to automatically block comments containing target profanity. To test the accuracy of our model, we run 1000 comments through a data labeling platform, asking human raters to flag profanity that may have slipped past our model.

(Interested in profanity? We built a comprehensive and free dataset just for you.)

In this case, IRR would measure the agreement among raters that a particular comment does or does not contain profanity. If all raters agree that a comment contains profanity, the IRR would equal 1.

Since there is an objectively correct assessment of each comment (a comment either contains the target profanity or it does not) and the assessment is a fairly simple one, we can assume with high confidence that high IRR indicates high-quality data.

…but beware

When you need your data labeled in a straightforward manner, IRR can be a useful metric for evaluating data quality. But as the world of natural language processing rapidly expands, there is great demand for a data labeling solution that caters to highly subjective, judgement-based tasks.

For example, we mentioned above that you can objectively assess whether a comment contains profanity. But is it really so easy? Does “effing brilliant” contain a profanity? What about “f**ing brilliant”? Are sexual terms considered profane, and does that change in a medical context? A dataset where “effing brilliant” is automatically labeled as profane may be hiding the richness and nuance of language.

To embrace the future of NLP, we must modernize our understanding of IRR as a data-quality metric. With that goal in mind, here are three traps to avoid when considering IRR:

Trap #1: The World Isn't Black and White

Rather than reflect low-quality data, a low IRR can reflect a natural subjectivity in rater judgements that is perfectly acceptable or even desired.

For example, say we are building an NLP/ML model that rates how funny tweets are. To build a training dataset, we ask raters to rate whether tweets are Funny or Not Funny. When analyzing our results, we may be tempted to pick the majority category as the final label, and train our model accordingly.

While a model trained in this fashion may be able to evaluate how funny tweets are in a rough and approximate fashion, it would fail to incorporate the nuance and beauty of modern day NLP capabilities.

To illustrate this point, consider the tweets below. We asked 5 Surge AI labelers to categorize these two tweets as Funny (+1) or Not Funny (0), and to explain their rating.

3 labelers rated the first tweet as Not Funny, accompanied by comments like the following:

“Made-up facts that aren't satire, as there is no basis in reality. It is just a failed attempt at being funny.”

“I used to follow this account on Facebook and thought it was funny. However, I think this type of humor has become too mainstream, and now I find it somewhat annoying/repulsive. My tastes are always evolving and this one just feels passé.”

All 5 raters rated the second tweet as Not Funny.

In an IRR-centered world where variability is viewed as a negative, we would:

Take the majority rating (Not Funny) as the final label for each tweet. A “funny tweet” classifier would then treat each of these tweets as the same, even though the first tweet is clearly funnier.
Penalize raters who found the first tweet funny.

To avoid these outcomes, we should shift our view of IRR.

Instead of equating IRR with data and/or rater quality, we should view low IRR as a sign of the ambiguity inherent in building language-based models. The solution (which we’ll discuss more in an upcoming post) is to focus less on maximizing IRR, and instead prioritize building models that take into account the messy and ever-changing nature of language.

Depending on our use case, for example, better solutions could involve:

Scoring the first tweet as “40%” funny (2 of the 5 labelers rated it Funny), by taking the average funniness score, and training an AI model on continuous scores instead of binary ones.
Building personalized humor models, rather than assuming that we can build “one size fits all” humor AI.
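The difference between majority-vote labels and averaged scores can be sketched in a few lines. The vote lists below are hypothetical, but they mirror the counts described above: 5 labelers per tweet, with 3 rating the first tweet Not Funny and all 5 rating the second tweet Not Funny. The function names are made up for illustration.

```python
# 1 = Funny, 0 = Not Funny. Hypothetical votes matching the counts in the text.
votes = {
    "tweet_1": [1, 1, 0, 0, 0],  # 3 of 5 labelers said Not Funny
    "tweet_2": [0, 0, 0, 0, 0],  # all 5 labelers said Not Funny
}


def majority_label(v):
    """Hard label: 1 only if strictly more than half of the raters voted Funny."""
    return 1 if sum(v) * 2 > len(v) else 0


def soft_label(v):
    """Soft label: the average funniness score across raters."""
    return sum(v) / len(v)


hard = {tweet: majority_label(v) for tweet, v in votes.items()}
soft = {tweet: soft_label(v) for tweet, v in votes.items()}

print(hard)
print(soft)
```

With hard labels the two tweets become indistinguishable (both 0), while the soft labels keep the signal that some raters found the first tweet funny. A regression-style model could be trained on the soft scores directly.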

Trap #2: Accidentally Incentivizing Groupthink

Many data labeling platforms employ a team of raters that work out of the same physical office.

These raters are aware that the quality of their work is being evaluated via IRR, which incentivizes raters to discuss tasks, coordinate answers, and seek safety in numbers. When this groupthink occurs, not only does IRR become a completely useless metric, but the underlying data is rendered useless.

As an aside: Surge provides a workforce of highly-trained, highly-vetted human labelers from around the world. No groupthink from us!

Trap #3: Wrong, Together.

Let’s revisit our logic problem:

Jack is looking at Anne, but Anne is looking at George. Jack is married, but George is not. Is a married person looking at an unmarried person?

A) yes

B) no

C) cannot be determined

As we discovered earlier, the majority of raters (perhaps including yourself) erroneously selected Option C instead of Option A. Here, with a majority of raters falling prey to the same logical error, we have a high IRR for an answer that is objectively incorrect. Not only would the mistaken raters be rewarded for being in the majority, but in a sea of thousands of data points, we’d likely assume Option C to be correct. Treacherous waters indeed!
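A short sketch makes the gap between agreement and correctness explicit. The vote split below is hypothetical (10 raters, with 8 choosing Option C to match the 80% figure mentioned earlier), and the variable names are only for illustration.

```python
from collections import Counter

gold_answer = "A"  # the objectively correct option for the puzzle

# Hypothetical votes: 80% pick C, and fewer than 20% pick the correct answer.
votes = ["C"] * 8 + ["A"] * 1 + ["B"] * 1

tally = Counter(votes)
consensus, consensus_count = tally.most_common(1)[0]

agreement = consensus_count / len(votes)          # share of raters backing the consensus
consensus_is_correct = consensus == gold_answer   # does the consensus match the truth?
rater_accuracy = tally[gold_answer] / len(votes)  # share of raters who were right

# Raters a consensus-based quality check would flag, even though one of them is right.
flagged = [v for v in votes if v != consensus]

print(consensus, agreement, consensus_is_correct, rater_accuracy, flagged)
```

Here agreement is high (0.8) while the consensus label is wrong, and the only rater who answered correctly lands in the flagged group. Checking a sample of items against known-correct answers is one way to catch this; agreement statistics alone cannot.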

Proceed with Caution

IRR can be a useful tool in evaluating your labeled data, but only when used wisely. In future posts, we’ll continue to unpack the world of IRR by diving into more metrics, use-cases, and methodologies. In the meantime, consider how to best employ IRR in your data labeling projects — and don’t be afraid to channel your inner Ulysses and resist the call of the IRR sirens!


John William Waterhouse’s famous *Ulysses and the Sirens* (1891), in which Ulysses ties himself to his ship to resist the entrancing song of IRR.

Scott Heiner

Scott runs Business Development and Operations at Surge AI, helping customers get the high-quality human-powered data they need to train and measure their AI. Before joining Surge, he led operations and marketing teams in the media industry.
