Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's why we're creating the world's largest dataset of social media toxicity — so you can skip the slog and get to work. Get a sample for free now!
We hope you find this sample of our dataset useful, whether you want to flag hateful speech, develop content moderation tools, or build classifiers to detect toxic messages.
Interested in the full toxicity dataset for training your ML models, or in toxicity data in other languages (Spanish, French, German, Japanese, Portuguese, and 17+ more)? We work with top AI and safety companies around the world to build human-powered datasets for training ML models. Reach out to team@surgehq.ai!
Dataset
This repo contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.
Columns
text: the text of the comment
is_toxic: whether or not the comment is toxic
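As a minimal sketch of how these two columns might be read, the Python below parses CSV rows into (text, label) pairs and counts each class. It assumes the data ships as a CSV file with a header row. The label spellings in `TOXIC_LABELS` are a guess, since the encoding of `is_toxic` is not documented here, and the helper names `load_comments` and `class_counts` are hypothetical. The in-memory rows are placeholders, not comments from the dataset.

```python
import csv
import io

# Assumption: spellings that mark a comment as toxic. Adjust to match the actual file.
TOXIC_LABELS = {"toxic", "true", "1", "yes"}


def load_comments(fileobj):
    """Read rows with `text` and `is_toxic` columns into (text, bool) pairs."""
    reader = csv.DictReader(fileobj)
    rows = []
    for row in reader:
        label = row["is_toxic"].strip().lower()
        rows.append((row["text"], label in TOXIC_LABELS))
    return rows


def class_counts(rows):
    """Count toxic and non-toxic comments to check the class balance."""
    toxic = sum(1 for _, is_toxic in rows if is_toxic)
    return {"toxic": toxic, "non_toxic": len(rows) - toxic}


# Tiny in-memory stand-in for the real file (placeholder rows only).
sample = io.StringIO(
    "text,is_toxic\n"
    '"have a nice day",Not Toxic\n'
    '"placeholder rude comment",Toxic\n'
)
rows = load_comments(sample)
print(class_counts(rows))
```

To use it on the real data, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="", encoding="utf-8")`, where `path` points at wherever you saved the sample.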
Looking Forward...
We'll be adding more languages and annotations (e.g., a severity ranking for each comment and additional categories) over time. In the meantime, if you are interested in a dataset of profanity, check out The Obscenity List.
—
Surge AI is a data labeling workforce and platform that provides world-class data to top AI companies and researchers.