The Toxicity Dataset

Bradley Webb
Jan 21, 2022

Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's why we're creating the world's largest dataset of social media toxicity — so you can skip the slog and get to work. Get a sample for free now!

We hope you find this sample of our dataset useful, whether you want to flag hateful speech, develop content moderation tools, or build classifiers to detect toxic messages.

Interested in the full toxicity dataset for training your ML models, or in toxicity data in other languages (Spanish, French, German, Japanese, Portuguese, and 17+ more)? We work with top AI and safety companies around the world to build human-powered datasets for training ML models. Reach out to team@surgehq.ai!

Dataset

This repo contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.

Columns

text: the text of the comment

is_toxic: whether or not the comment is toxic
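
As a rough sketch of how the two columns might be read and the 500/500 class balance checked, the snippet below uses only Python's standard library. The in-memory sample rows and the label strings ("Toxic" / "Not Toxic") are assumptions made for illustration, as are the helper names `load_comments` and `label_counts`; check the actual file for the values it uses and pass an open file object in place of the sample.

```python
import csv
import io
from collections import Counter


def load_comments(source):
    """Read (text, is_toxic) pairs from an open CSV file object."""
    reader = csv.DictReader(source)
    # Fail early if either documented column is absent.
    missing = {"text", "is_toxic"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [(row["text"], row["is_toxic"]) for row in reader]


def label_counts(rows):
    """Count how many comments carry each is_toxic label."""
    return Counter(label for _, label in rows)


# Tiny in-memory stand-in so the sketch runs without the real file.
# The label values here are assumed, not taken from the dataset.
SAMPLE_CSV = (
    "text,is_toxic\n"
    '"Thanks, this was helpful!",Not Toxic\n'
    '"Nobody wants you here.",Toxic\n'
)

rows = load_comments(io.StringIO(SAMPLE_CSV))
counts = label_counts(rows)
print(counts)
```

With the real sample, the same two calls should report an even split between the two labels if the file matches the description above.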

Looking Forward...

We'll be adding more languages and annotations (e.g., a severity ranking for each comment, or toxicity categories) over time. In the meantime, if you are interested in a dataset of profanity, check out The Obscenity List.


Surge AI is a data labeling workforce and platform that provides world-class data to top AI companies and researchers.

Data Labeling 2.0 for Rich, Creative AI

Superintelligent AI, meet your human teachers. Our data labeling platform is designed from the ground up to train the next generation of AI — whether it’s systems that can code in Python, summarize poetry, or detect the subtleties of toxic speech. Use our powerful data labeling workforce and tools to build the rich, human-powered datasets you need today.

Bradley Webb

Bradley runs Surge AI's Product and Growth teams. He previously led Integrity and Data Operations teams at Facebook, and graduated from Dartmouth.
