Identifying hate speech is essential to maintaining a healthy online community. To help ML and Trust & Safety teams fight hate speech, we’re releasing a series of free, labeled datasets covering multiple languages and social media platforms. Download them here!
- Dataset of English hate speech on Twitter
- Dataset of English hate speech on Facebook
- Dataset of Spanish hate speech
- Dataset of Japanese hate speech
- Dataset of Arabic hate speech
Need a custom hate speech dataset, or are you interested in other languages? Reach out to team@surgehq.ai to learn more!
Introduction
Without good systems to combat hate speech, communities quickly descend into vile attacks based on attributes like race, religion, ethnicity, and sexual orientation.
These aren’t the types of posts you want to see when you scroll through TikTok or Twitter! If you need to identify hate speech, get in touch! Our teams of content moderation Surgers will find the hate speech on your platform so your users don’t have to.
Toxicity Examples
Here are a few examples of toxic content from the dataset.
Data Labeling Workforce
Labeling hate speech is tricky. While the examples above clearly show hateful language directed at other groups, language is often subtly coded. In order to do a good job, you need to understand the slang and coded phrases groups use to attack one another.
In this tweet, for example, “6 million cookies” is a coded form of Holocaust denial: it mocks the six million Jews murdered by the Nazis.
Unless you have a lot of experience, it’s difficult to label these! That’s why having data labelers with the right skills is essential to creating quality datasets.
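To make the point about coded language concrete, here is a minimal sketch of a naive keyword filter, the kind of approach that tends to miss coded phrases. Everything in it is an assumption for illustration: the blocklist uses placeholder tokens (`slur_a`, `slur_b`) rather than real slurs, the function name `keyword_flag` is hypothetical, and the sample posts are made up around the phrase discussed above. It is not how our datasets were labeled.

```python
import re

# Hypothetical blocklist: placeholder tokens stand in for real slurs.
BLOCKLIST = {"slur_a", "slur_b"}


def keyword_flag(text: str) -> bool:
    """Return True if any blocklisted token appears as a whole word."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    return any(tok in BLOCKLIST for tok in tokens)


# An explicit post contains a blocklisted token and gets flagged.
explicit_post = "you are a slur_a"

# A coded post uses only innocuous words, so the filter lets it through.
coded_post = "something about 6 million cookies"

print(keyword_flag(explicit_post))  # True
print(keyword_flag(coded_post))     # False
```

The coded post slips past because none of its words are offensive on their own. Catching it takes a labeler (or a model trained on labels from one) who knows what the phrase means in context.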
For this project, we built a team of Surgers with experience moderating content on Twitter, Facebook, TikTok and other major social platforms. They've labeled millions of examples of hate speech and toxicity for our customers’ data labeling projects.
Data Labeling Interface
Here’s a peek at our labeling UI. Our platform makes it fast to create new data annotation and data collection jobs, whether through our API or our WYSIWYG editor. Start collecting thousands of data points to feed your models and measure your progress.
More Datasets
Want to build a custom hate speech dataset, or need help with other data labeling or content moderation projects? Sign up and create a new labeling project in seconds, or reach out to team@surgehq.ai for a fully managed end-to-end labeling service.
Interested in more data? Check out our other free datasets and blog posts:
- Dataset of Toxicity on Social Media
- Holy $#!t: Are popular toxicity models simply profanity detectors?
- Is Elon right? We labeled 500 Twitter users to measure the amount of Spam
Or browse our full list of datasets.