Surge AI's Free Toxicity Dataset

Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's why we're creating the world's largest dataset of social media toxicity — so you can skip the slog and get to work. Get a sample for free now!

We hope you find this sample of our dataset useful, whether you want to flag hateful speech, develop content moderation tools, or build classifiers to detect toxic messages.

Interested in the full toxicity dataset to train your ML models, or in toxicity data in other languages (Spanish, French, German, Japanese, Portuguese, and 17+ more)? We work with top AI and Safety companies around the world to build human-powered datasets for training ML models. Reach out to team@surgehq.ai!

Dataset

This repo contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.

Columns

text: the text of the comment

is_toxic: whether our annotators judged the comment to be toxic
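As a rough illustration of working with these two columns, here is a minimal sketch that reads the data and checks the class balance. It assumes the sample ships as a CSV with a header row, and that is_toxic is stored as a text label; neither detail is stated above, so the TOXIC_LABELS set and the helper names load_comments and class_counts are hypothetical and should be adjusted to the actual file.

```python
import csv
import io
from collections import Counter

# Assumption: spellings that mean "toxic". The encoding of is_toxic is not
# documented here, so adapt this set to whatever the real file contains.
TOXIC_LABELS = {"toxic", "true", "1", "yes"}


def load_comments(handle):
    """Read rows with `text` and `is_toxic` columns into (text, bool) pairs."""
    reader = csv.DictReader(handle)
    rows = []
    for row in reader:
        label = row["is_toxic"].strip().lower()
        rows.append((row["text"], label in TOXIC_LABELS))
    return rows


def class_counts(rows):
    """Count how many comments fall into each class (True means toxic)."""
    return Counter(is_toxic for _, is_toxic in rows)


# Tiny in-memory stand-in for the real file, using made-up comments.
sample = io.StringIO(
    "text,is_toxic\n"
    '"Have a great day, everyone",Not Toxic\n'
    '"You are a worthless idiot",Toxic\n'
)
rows = load_comments(sample)
counts = class_counts(rows)
print(counts[True], counts[False])
```

To run this against the downloaded sample, pass an open file handle instead of the in-memory stand-in, for example one opened with newline="" and encoding="utf-8". On the full sample, the two counts should each come out to 500 if the label assumption matches the file.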

Looking Forward...

We'll be adding more languages and annotations (e.g., a severity ranking for each comment, toxicity categories, etc.) over time. In the meantime, if you are interested in a dataset of profanity, check out The Obscenity List.

—

Surge AI is a data labeling workforce and platform that provides world-class data to top AI companies and researchers.

Bradley Webb

Bradley runs Surge AI's Product and Growth teams. He previously led Integrity and Data Operations teams at Facebook, and graduated from Dartmouth.

Data Labeling 2.0 for Rich, Creative AI

Superintelligent AI, meet your human teachers. Our data labeling platform is designed from the ground up to train the next generation of AI — whether it’s systems that can code in Python, summarize poetry, or detect the subtleties of toxic speech. Use our powerful data labeling workforce and tools to build the rich, human-powered datasets you need today.

Related articles

How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems

Holy $#!t: Are popular toxicity models simply profanity detectors?

How Anthropic uses Surge AI’s RLHF platform to train their LLM Assistant on Human Feedback

10 Egregious Failures in Gmail Spam Detection