Are the Spammers winning? Failures in Gmail Spam Detection

Edwin Chen
May 24, 2022
Are the Spammers winning?! Lately, it seems that more and more spam is passing through Gmail's filters... 

I'm not the only one!

Ask HN: Are you also getting extremely obvious spam bypassing Gmail's filters?

Ask HN: Has Gmail spam blocking taken a sudden nosedive?

Spam Detection is one of the first things taught in ML 101. And yet its failures have become a major issue for Tech.

  • Elon has threatened to pull out of his acquisition if Twitter is worse at catching spammers than it claims. (But Elon’s probably wrong: see our analysis on the numbers!)
  • Trust in Google Search is falling, and the rise of spammy, SEO content is a major factor. Do you really trust Google Search results to tell you which vacuum cleaner to buy, or are you turning to Reddit instead?

I used to work on the Spam teams at YouTube and Twitter, and looking at real examples of Spam was surprisingly rare. ML engineers often treat their models as black-box optimizations, even though spam is a very human problem: spammers constantly adapt their social engineering strategies to trick their victims.

So what kind of Spam is being missed, and how can we do better? We asked Surgers (the data labelers on our platform) to submit examples whenever they came across spammy messages that Gmail didn’t detect.

Here are 10 examples of Gmail Spam fails, together with Surger commentary. (Interested in the full dataset? Email us at hello@surgehq.ai)

Gmail Fails

It’s an obviously spammy email, offering a free flashlight (lol), with a link to retrieve it. It also uses very random characters in the subject line (///Y0urFREEFlashlight(NeedYourAddress)///1851), username (F l a s h l i g h t’), and sender (u8zjwFb-lyZSC3-noReply@gwhsi.lairpro.com). WTF this wasn’t caught!

This is obvious spam for a few reasons:

1. It appears to be a dating site that I've never signed up for or heard of (I certainly don't know any "Sylvia").

2. The poor English ("Sylvia want to meet you") and sketchy link.

3. It's in Swedish!

The email is obviously not from Best Buy. It’s from some weird email uBPrL1A-1DubrR-noReply@ozm40.lairpro.com with a bogus name of _Congrat_’ and a subject line of MessageF0rY0u7294, saying “You’ve been selecte.d85920”. Pretty obvious it's spam, I’m not sure why Gmail couldn’t catch this. Do the weird, obfuscated characters actually fool its filters?

This is very obvious spam. It’s a clickbait link, sent from a low-quality Hotmail account, with a sketchy name (“ArthritisDiet”).

I’ve certainly never been to such a shop, so the email is unwanted, and I can’t find any information on a shop of this name at these addresses. What’s more, there are THREE DIFFERENT addresses in the email that seem to be virtual mailboxes.

The subject line is ForYourDreamBathroom without any spaces. (A lot of these spam messages seem to do that. Why?)

This was sent to many people, from a suspicious-looking Hotmail address. The subject and user’s name are also missing spacing, which is a good sign of Spam.

This is a sex therapist app that I’ve never signed up for. Pretty angry this couldn’t be caught!

This is a dating webpage, written in Russian. I don’t speak Russian.

This is an email impersonating Lowe’s, pretending to give away high-value gift cards for completing a short survey.

See also the obviously spammy “-*Lowe’s*-” username and “confirmation_Receipt!.” subject line.

This is clear Spam. The email sender’s name is “Home” (I see this a lot in Spam emails, I’m not sure why they don’t pick better names). It’s from an ordinary, non-company Gmail account. The English is completely trash, with random capitalization. Even the way there are multiple empty lines between “valid feedback!” and “Take the survey” seems like a spam indicator!

In short, a lot of these Spam emails seem quite detectable: witness the odd characters (-*Lowe’s*-), random spacing (///Y0urFREEFlashlight(NeedYourAddress)///1851), sketchy addresses (uBPrL1A-1DubrR-noReply@ozm40.lairpro.com), low-quality names (ArthritisDiet), and clear impersonation of companies like CVS and Lowe's (free gift cards sent from ordinary Hotmail accounts).
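To make these signals concrete, here is a minimal sketch of how they might be turned into hand-written features. It is not how Gmail works, and it is only a starting point rather than a real classifier: the function names (spam_signals, spam_score), the regular expressions, the thresholds, and the FREE_MAIL_DOMAINS and IMPERSONATED_BRANDS lists are all assumptions made up for illustration.

```python
import re

# Assumed lists, for illustration only; a real system would maintain larger ones.
FREE_MAIL_DOMAINS = {"gmail.com", "hotmail.com", "outlook.com", "yahoo.com"}
IMPERSONATED_BRANDS = {"lowe's", "cvs", "best buy"}


def spam_signals(sender_name: str, sender_address: str, subject: str) -> dict:
    """Return hand-written boolean spam signals for one email's headers."""
    name = sender_name.strip()
    subject = subject.strip()
    local_part, _, domain = sender_address.partition("@")
    # Treat the curly apostrophe in names like "Lowe’s" as a plain one.
    normalized_name = name.lower().replace("\u2019", "'")
    return {
        # A zero swapped in for a letter, e.g. "Y0ur" or "F0rY0u".
        "leetspeak_subject": bool(re.search(r"[A-Za-z]0[A-Za-z]", subject)),
        # Three or more capitalized words and no whitespace at all,
        # e.g. "ForYourDreamBathroom".
        "missing_spaces": (
            not re.search(r"\s", subject)
            and len(re.findall(r"[A-Z][a-z]+", subject)) >= 3
        ),
        # Sender name wrapped in symbols, e.g. "_Congrat_".
        "decorated_name": bool(re.search(r"^[*_\-/|~]|[*_\-/|~]$", name)),
        # Sender name spelled with a space between every character.
        "spaced_out_name": bool(re.fullmatch(r"(?:\w ){2,}\w", name)),
        # Mixed digits and capitals before the "@", e.g. "uBPrL1A-1DubrR".
        "random_local_part": (
            any(c.isdigit() for c in local_part)
            and any(c.isupper() for c in local_part)
        ),
        # A brand name in the sender field, but a free webmail domain.
        "brand_on_free_mail": (
            domain.lower() in FREE_MAIL_DOMAINS
            and any(brand in normalized_name for brand in IMPERSONATED_BRANDS)
        ),
    }


def spam_score(signals: dict) -> int:
    """Count how many signals fired (an unweighted, assumed scoring rule)."""
    return sum(signals.values())


# The flashlight email described above.
flashlight = spam_signals(
    "F l a s h l i g h t",
    "u8zjwFb-lyZSC3-noReply@gwhsi.lairpro.com",
    "///Y0urFREEFlashlight(NeedYourAddress)///1851",
)
print(spam_score(flashlight), flashlight)
```

Rules like these are brittle on their own, since spammers adapt as soon as a pattern is blocked. In practice they would more plausibly serve as input features to a model trained on labeled examples than as a filter by themselves.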

What do you think? How would you build an ML system to catch these?

At Surge AI, we help customers label millions of examples of Spam every month to train their machine learning and content moderation systems. If you’re interested in a free Spam dataset to test your own models, get access to it here!

Edwin oversees Surge AI's Engineering, Product, and Research — whether it's helping customers train large language models on human feedback, labeling data to build Toxicity and Spam classifiers, designing new API integrations, or building the highest-quality data labeling workforce in the world. He previously led AI, Data Science, and Human Computation teams at Google, Facebook, and Twitter, and studied mathematics and linguistics at MIT.
