Hate Speech Datasets in English, Spanish, Japanese, Arabic, and More

Bradley Webb
Sep 20, 2022

Identifying hate speech is essential to maintaining a healthy online community. To help ML and Trust & Safety teams fight hate speech, we’re releasing a series of free, labeled datasets covering multiple languages and social media platforms. Download them here!

Need a custom hate speech dataset, or are you interested in other languages? Reach out to team@surgehq.ai to learn more!

Introduction

Without good systems to combat hate speech, communities quickly descend into vile attacks based on attributes like race, religion, ethnicity, and sexual orientation.

These aren’t the types of posts you want to see when you scroll through TikTok or Twitter! If you need to identify hate speech, get in touch! Our teams of content moderation Surgers will find the hate speech on your platform so your users don’t have to.  

Toxicity Examples

Here are a few examples of toxic content from the dataset.

Data Labeling Workforce

Labeling hate speech is tricky. While the examples above clearly show hateful language directed at other groups, language is often subtly coded. In order to do a good job, you need to understand the slang and coded phrases groups use to attack one another.

In this tweet, for example, “6 million cookies” is a coded form of Holocaust denial.

Unless you have a lot of experience, it’s difficult to label these! That’s why having data labelers with the right skills is essential to creating quality datasets.
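
To see why simple keyword matching falls short on coded language, here is a minimal sketch. Everything in it is an assumption made for illustration: the blocklist tokens are placeholders, the `naive_flag` helper is hypothetical, and the sample posts are invented rather than taken from the released datasets.

```python
import re

# Hypothetical blocklist: placeholder tokens stand in for explicit slurs.
BLOCKLIST = {"slur_a", "slur_b"}

def naive_flag(text: str) -> bool:
    """Return True if any blocklisted token appears in the text."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    return any(tok in BLOCKLIST for tok in tokens)

# Invented sample posts, for illustration only.
posts = [
    "you are a slur_a",         # explicit term: the filter catches it
    "6 million cookies, sure",  # coded phrase: the filter misses it
]

flags = [naive_flag(p) for p in posts]
print(flags)
```

The second post passes the filter because none of its words are on the list. Catching it takes a labeler who already knows what the phrase means.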

For this project, we built a team of Surgers with experience moderating content on Twitter, Facebook, TikTok, and other major social platforms. They've labeled millions of examples of hate speech and toxicity for our customers’ data labeling projects.

Data Labeling Interface

Here’s a peek at our labeling UI. Our platform makes it fast to create new data annotation and data collection jobs, whether through our API or our WYSIWYG editor. Start collecting thousands of data points to feed your models and measure your progress.

Hate speech labeling interface on the Surge AI platform

More Datasets

Want to build a custom hate speech dataset, or need help with other data labeling or content moderation projects? Sign up and create a new labeling project in seconds, or reach out to team@surgehq.ai for a fully managed end-to-end labeling service.

Interested in more data? Check out our other free datasets and blog posts:

Or browse our full list of datasets.

Bradley Webb

Bradley runs Surge AI's Product and Growth teams. He previously led Integrity and Data Operations teams at Facebook, and graduated from Dartmouth.
