Ever wish you had a ready-made list of profanity? Maybe you want to remove NSFW comments, filter offensive usernames, or build content moderation tools, and you can't dream up enough obscenities on your own. You’re in luck — Surge AI is creating the world's largest profanity dataset, in 20+ languages. Get it for free now.
Dataset
The dataset contains 1600+ popular English profanities and their variations.
Columns
- `text`: the profanity
- `canonical_form_1`: the profanity's canonical form
- `canonical_form_2`: an additional canonical form, if applicable
- `canonical_form_3`: an additional canonical form, if applicable
- `category_1`: the profanity's primary category (see below for list of categories)
- `category_2`: the profanity's secondary category, if applicable
- `category_3`: the profanity's tertiary category, if applicable
- `severity_rating`: We asked 5 Surge AI data labelers to rate how severe they believed each profanity to be, on a 1-3 point scale. This is the mean of those 5 ratings.
- `severity_description`: We rounded `severity_rating` to the nearest integer. `Mild` corresponds to a rounded mean rating of `1`, `Strong` to `2`, and `Severe` to `3`.
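To make the relationship between `severity_rating` and `severity_description` concrete, here is a minimal Python sketch of the rounding rule described above. The sample rows, the placeholder words (`word_a`, etc.), and the helper name `describe_severity` are assumptions made for illustration; they are not taken from the actual dataset file.

```python
import csv
import io

# Hypothetical sample rows with placeholder words; the real file is not shown here.
SAMPLE_CSV = """text,canonical_form_1,category_1,severity_rating
word_a,word_a,animal references,1.0
word_b,word_b,bodily fluids / excrement,1.6
word_c,word_c,racial / ethnic,2.8
"""

# Labels for the rounded mean rating, as described in the column list.
LABELS = {1: "Mild", 2: "Strong", 3: "Severe"}


def describe_severity(mean_rating: float) -> str:
    """Map a mean rating on the 1-3 scale to its label by rounding to the nearest integer."""
    # Assuming each of the five ratings is an integer, their mean never ends
    # in exactly .5, so Python's round-half-to-even rule does not come into play.
    return LABELS[round(mean_rating)]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    row["severity_description"] = describe_severity(float(row["severity_rating"]))
    print(row["text"], row["severity_rating"], row["severity_description"])
```

If the dataset is read from disk instead, the same loop should work with `csv.DictReader` over an open file handle, provided the header names match the columns listed above.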
Categories
We organized the profanity into the following categories:
- sexual anatomy / sexual acts
- bodily fluids / excrement
- sexual orientation / gender
- racial / ethnic
- mental disability
- physical disability
- physical attributes
- animal references
- religious offense
- political
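Because a profanity can carry up to three categories, a filter usually needs to check all three category columns. The following Python sketch shows one way to do that, optionally combined with a minimum severity. The sample rows, placeholder words, and the function names `in_category` and `select_terms` are hypothetical and only illustrate the column layout described above.

```python
import csv
import io

# Hypothetical sample rows with placeholder words; not taken from the real dataset.
SAMPLE_CSV = """text,category_1,category_2,category_3,severity_rating
word_a,animal references,,,1.2
word_b,bodily fluids / excrement,sexual anatomy / sexual acts,,1.8
word_c,racial / ethnic,religious offense,political,3.0
"""

CATEGORY_COLUMNS = ("category_1", "category_2", "category_3")


def in_category(row: dict, category: str) -> bool:
    """Return True if any of the three category columns equals `category`."""
    return any((row.get(col) or "").strip() == category for col in CATEGORY_COLUMNS)


def select_terms(rows: list, category: str, min_severity: float = 1.0) -> list:
    """Return the `text` of rows in `category` whose mean rating is at least `min_severity`."""
    return [
        row["text"]
        for row in rows
        if in_category(row, category) and float(row["severity_rating"]) >= min_severity
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(select_terms(rows, "sexual anatomy / sexual acts"))
print(select_terms(rows, "political", min_severity=2.5))
```

Matching on the exact category strings keeps the sketch simple; a real moderation tool would likely normalize case and whitespace before comparing.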
Looking forward...
We'll be adding more languages and profanity annotations over time.
Need a larger set of expletives and slurs, or a list of swear words in other languages (Spanish, French, German, Japanese, Portuguese, etc.)? We love feedback. Reach out to team@surgehq.ai!
—
Surge AI is a data labeling workforce and platform that provides world-class data to top AI companies and researchers. Interested in $50 of free labels? Fill out our 30-second form and we'll get you started today! Also check out our Omicron Tweet Analysis post.