Bradley Webb
Sep 20, 2022
Hate Speech Datasets in English, Spanish, Japanese, Arabic, and More
Internet users worn down by the amount of online hate speech constantly surrounding them

Identifying hate speech is essential to maintaining a healthy online community. To help ML and Trust & Safety teams fight hate speech, we’re releasing a series of free, labeled datasets covering multiple languages and social media platforms. Download them here!

Without good systems to combat hate speech, communities quickly descend into vile attacks based on attributes like race, religion, ethnicity, and sexual orientation.

Toxicity Examples

Here are a few examples of toxic content from the dataset.

Data Labeling Workforce

Labeling hate speech is tricky. While the examples above clearly show hateful language directed at particular groups, hateful language is often subtly coded. To label it well, you need to understand the slang and coded phrases that groups use to attack one another.

In this tweet, for example, “6 million cookies” is a coded form of Holocaust denial: it mocks the figure of six million Jews murdered by the Nazis.

Unless you have a lot of experience with coded language like this, it’s difficult to label these posts! That’s why having data labelers with the right skills is essential to creating quality datasets.

For this project, we built a team of Surgers with experience moderating content on Twitter, Facebook, TikTok and other major social platforms. They've labeled millions of examples of hate speech and toxicity for our customers’ data labeling projects.

Data Labeling Interface

Here’s a peek at our labeling UI. Our platform makes it fast to create new data annotation and data collection jobs, whether through our API or our WYSIWYG editor. Start collecting thousands of data points to feed your models and measure your progress.

Hate speech labeling interface on the Surge AI platform

More Datasets

