30% of Google's Emotions Dataset is Mislabeled

*When you mislabel LOVE, cue heartbreak all around.*

Last year, Google released their “GoEmotions” dataset: a human-labeled dataset of 58K Reddit comments categorized according to 27 emotions.

The problem? A whopping 30% of the dataset is severely mislabeled! (We tried training a model on the dataset ourselves, but noticed deep quality issues. So we took 1000 random comments, asked Surgers whether the original emotion was reasonably accurate, and found strong errors in 308 of them.) How are you supposed to train and evaluate machine learning models when your data is so wrong?
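As a rough sanity check on that number, here is a minimal sketch of how far a 1000-comment audit can be trusted to generalize to the full dataset. It assumes the 1000 comments were a simple random sample; the Wilson score interval is our choice for illustration, not a method described in the GoEmotions paper or used in the original audit.

```python
import math


def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)
    ) / denom
    return center - half_width, center + half_width


# Figures from the audit above: 308 strong errors in 1000 sampled comments.
error_rate = 308 / 1000
low, high = wilson_interval(308, 1000)
print(f"observed error rate: {error_rate:.1%}")
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Under those assumptions, the interval stays within a few points of 30%, so the headline figure is not an artifact of a small sample.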

For example, here are 25 mislabeled emotions in Google’s dataset.

Comments mislabeled as NEGATIVE emotions

LETS FUCKING GOOOOO – mislabeled as ANGER, likely because Google's low-quality labelers don’t understand English slang and mislabel any profanity as a negative emotion
*aggressively tells friend I love them* – mislabeled as ANGER
you almost blew my fucking mind there. – mislabeled as ANNOYANCE
daaaaaamn girl! – mislabeled as ANGER
[NAME] wept. – mislabeled as SADNESS, likely because Google’s non-fluent labelers don’t understand the idiomatic meaning of “Jesus wept”, and thought someone was truly crying
I try my damndest. Hard to be sad these days when I got this guy with me – mislabeled as SADNESS
hell yeah my brother – mislabeled as ANNOYANCE
[NAME] is bae, how dare you. – mislabeled as ANGER, likely because the labelers don’t know what “bae” means, and aren’t fluent enough in online English usage to realize that “how dare you” is written in a mock anger tone

Comments mislabeled as NEUTRAL emotions

But muh narrative! Orange man caused this!!!!! – mislabeled as NEUTRAL, likely because labelers don’t understand who “orange man” refers to, or the mockery behind writing “muh” instead of “my”
My man! – mislabeled as NEUTRAL, likely because labelers don’t know what this phrase means
KAMALA 2020!!!!!! – mislabeled as NEUTRAL, likely because the non-US labelers don’t know who Kamala Harris is or didn't have enough context
Hi dying, I'm dad! – mislabeled as NEUTRAL, likely because labelers don’t understand dad jokes

Comments mislabeled as POSITIVE emotions

I love when they send in the wrong meat, it’s only happened to me once – mislabeled as LOVE
Nobody has the money to. What a joke – mislabeled as JOY
Yay, cold McDonald's. My favorite. – mislabeled as LOVE
Really? Wow. You’re either hopelessly ignorant or you’re trolling. For your sake, I hope you’re trolling. – mislabeled as OPTIMISM
These 2 are repulsive little kids – mislabeled as APPROVAL (I don’t have any explanation, other than that this is the kind of quality you get when you throw warm bodies at the problem of data labeling instead of building robust infrastructure)
Yeah, because not paying a bill on time is equal to murdering children. – mislabeled as APPROVAL, likely because of the “Yeah”
I wished my mom protected me from my grandma. She was a horrible person who was so mean to me and my mom. – mislabeled as OPTIMISM

In other words, when Google can’t even label *daaaaaamn girl!* or *These 2 are repulsive little kids* correctly, is it any surprise that Google’s Toxic Speech API is merely a profanity detector?

Is it a surprise that Google's Toxicity API misclassifies this comment as TOXIC, given the errors in its datasets?

Or that Gmail’s spam detector is deteriorating?

This email may seem like obvious spam... But then again, ***daaaaaamn girl!*** seems obviously *not* ANGER!

When good data is crucial for good models – in a research paper specifically designed to create a dataset, no less! – can we really trust Google to create unbiased real-world AI?

Google’s Flawed Data Labeling Methodology

To summarize the types of errors in Google’s dataset, many come from:

Profanity – mislabeling *hell yeah my brother* as ANNOYANCE instead of APPROVAL or EXCITEMENT
English idioms – mislabeling *Jesus wept* as SADNESS and *What a joke* as JOY
Sarcasm – mislabeling *Yay, cold McDonald’s. My favorite.* as LOVE
Basic English – mislabeling *These 2 are repulsive little kids* as APPROVAL
US politics and culture – mislabeling *But muh narrative! Orange man caused this!!!!!* as NEUTRAL
Reddit memes – mislabeling *Hi dying, I’m dad!* as NEUTRAL instead of AMUSEMENT

What’s causing these issues? A big part of the problem is Google treating data labeling as an afterthought to throw warm bodies at, instead of as a nuanced problem that requires sophisticated technical infrastructure and research attention of its own.

For example, let’s look at the labeling methodology described in the paper. To quote Section 3.3:

“Reddit comments were presented [to labelers] with no additional metadata (such as the author or subreddit).”
“All raters are native English speakers from India.”

Problem #1: “Reddit comments were presented with no additional metadata”

First of all, language doesn’t live in a vacuum! Why would you present a comment with no additional metadata? The subreddit and parent post it’s replying to are especially important context.

For example, “We SERIOUSLY NEED to have Jail Time based on a person's race” means one thing in a subreddit about law, and something completely different in a subreddit about fantasy worldbuilding.


As another example, imagine you see the comment “his traps hide the fucking sun” by itself. Would you have any idea what it means? Probably not – maybe that's why Google mislabeled it.

What does “his traps hide the fucking sun” mean with no context?

But what if you were told it came from /r/nattyorjuice, a subreddit dedicated to bodybuilding? Would you realize, then, that “traps” refers to someone’s trapezius muscles?

“his traps hide the fucking sun” in the context of a bodybuilding subreddit

And what if you were given the actual link to the comment, and saw the picture it was replying to? Would you realize, then, that the comment is pointing out the size of the man’s muscles?

Ah... Those big, beautiful traps do indeed hide the sun!

With this extra context, a good data labeling platform wouldn't have mislabeled the comment as NEUTRAL and ANGER like this dataset did.
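To make the point concrete, here is a minimal sketch of a labeling task that carries its context along with the comment. The `LabelingTask` class, its field names, and `render_for_labeler` are hypothetical names invented for this illustration; they are not part of the GoEmotions release or of any particular labeling tool, and the parent text is a placeholder for the picture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelingTask:
    """One comment to label, plus whatever context is available (hypothetical schema)."""
    comment: str
    subreddit: Optional[str] = None
    parent_text: Optional[str] = None
    permalink: Optional[str] = None


def render_for_labeler(task: LabelingTask) -> str:
    """Build the text shown to a labeler, putting any available context first."""
    lines = []
    if task.subreddit:
        lines.append(f"Subreddit: r/{task.subreddit}")
    if task.parent_text:
        lines.append(f"Replying to: {task.parent_text}")
    if task.permalink:
        lines.append(f"Link: {task.permalink}")
    lines.append(f"Comment: {task.comment}")
    return "\n".join(lines)


# The same comment, shown the way the paper describes and then with context.
bare = LabelingTask(comment="his traps hide the fucking sun")
with_context = LabelingTask(
    comment="his traps hide the fucking sun",
    subreddit="nattyorjuice",
    parent_text="[photo of a bodybuilder]",  # placeholder for the image
)

print(render_for_labeler(bare))
print()
print(render_for_labeler(with_context))
```

The first rendering is all a labeler sees when metadata is stripped; the second gives them enough to recognize the comment as a remark about muscle size.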

Problem #2: “All raters are native English speakers from India”

Second, Google used data labelers unfamiliar with US English and US culture – despite Reddit being a US-centric site with particularly specialized memes and jargon.

Is it a surprise that these labelers don’t understand sarcasm, profanity, and common English idioms in the texts that they’re labeling?

That they can’t correctly label comments like *But muh narrative! Orange man caused this!!!!!*, where you need familiarity with US politics? Or that they mislabel comments like *Hi dying, I'm dad!*, where you need to understand Reddit culture and memes?

That’s why when we relabeled the dataset, our technical infrastructure and human-AI algorithms allowed us to:

Leverage our labeling marketplace to build a team of Surgers who are not only native US English speakers, but also heavy Reddit and social media users who understand Reddit's in-jokes, the nuances of US politics (important for social media labeling, given its trickiness and prevalence!), and the cultural zeitgeist. (Who said you can't be a professional memelord?)
Test labelers to make sure they were labeling sarcasm, idioms, profanity, and memes correctly – e.g., dynamically giving them exams to make sure only labelers who understood *But muh narrative! Orange man caused this!!!!!* could work on the project.
Double-check cases where our AI prediction infrastructure differed from human judgments.
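The last two steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of our production system: the function names, the 80% pass threshold, and the model predictions in the example are all hypothetical. The gold answers reuse the corrections suggested earlier in this post.

```python
def passes_exam(
    answers: dict[str, str],
    gold: dict[str, set[str]],
    min_accuracy: float = 0.8,  # assumed threshold, chosen for illustration
) -> bool:
    """True if enough of the labeler's answers fall in the acceptable gold labels."""
    if not gold:
        raise ValueError("gold set must not be empty")
    correct = sum(1 for text, accepted in gold.items() if answers.get(text) in accepted)
    return correct / len(gold) >= min_accuracy


def needs_review(human: dict[str, str], model: dict[str, str]) -> list[str]:
    """Comments where a model prediction exists and differs from the human label."""
    return [text for text, label in human.items() if text in model and model[text] != label]


# Gold answers taken from the corrections suggested earlier in this post.
gold = {
    "hell yeah my brother": {"APPROVAL", "EXCITEMENT"},
    "Hi dying, I'm dad!": {"AMUSEMENT"},
}
careful = {"hell yeah my brother": "EXCITEMENT", "Hi dying, I'm dad!": "AMUSEMENT"}
careless = {"hell yeah my brother": "ANNOYANCE", "Hi dying, I'm dad!": "NEUTRAL"}

print(passes_exam(careful, gold))   # True
print(passes_exam(careless, gold))  # False

# Hypothetical labels and model predictions, for illustration only.
human_labels = {"daaaaaamn girl!": "ANGER", "My man!": "APPROVAL"}
model_labels = {"daaaaaamn girl!": "ADMIRATION", "My man!": "APPROVAL"}
print(needs_review(human_labels, model_labels))  # ['daaaaaamn girl!']
```

Disagreement does not say which side is wrong; it only picks out the comments worth a second human look.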

The Importance of High-Quality Data

If you want to deploy ML models that work in the real world, it’s time to focus on high-quality datasets over bigger models. Just look at Andrew Ng's push for data-centric AI.

Hopefully Google learns this too!

Otherwise those big, beautiful traps may get censored into oblivion, and all the rich nuances of language and humor with them...

–

Have you experienced frustrations getting good data? Want to work with a data labeling platform that treats data as a first-class citizen, and gives it the loving attention and care it deserves? Check out our other posts on data-centric AI, and follow us on Twitter at @HelloSurgeAI!

Edwin Chen

Edwin oversees Surge AI's Engineering and Research teams — whether it's helping customers train large language models on human feedback, building content moderation algorithms to detect hate speech and spam, or scaling up an elite data labeling workforce. He previously led AI, Data Science, and Human Computation teams at Google, Facebook, and Twitter, and studied mathematics and linguistics at MIT.
