Free Human-Labeled Datasets

Lovingly annotated by the Surge AI data labeling workforce, for your wildest data needs — including hate speech and content moderation datasets, stock market and financial transaction datasets, NSFW datasets, and more, in 30+ languages.


Need a custom dataset and don't see it here? Reach out to team@surgehq.ai!
RLHF Dataset for Reinforcement Learning from Human Feedback
Build state-of-the-art AI by training your large language models on human feedback.
InstructGPT-style Dataset
Build state-of-the-art large language models, in the style of InstructGPT and ChatGPT.
Profanity Dataset
Need a list of profanities, and can't dream up enough on your own? We have you covered. Get the world's best profanity dataset for free now.
Toxicity Dataset
The world's largest dataset of social media toxicity — hateful speech across Twitter, Facebook, YouTube, Reddit, and more.
Hate Speech Dataset
A dataset of hate speech from across the Internet.
Sentiment Analysis Dataset
1000+ customer reviews, social media posts, and more, classified by sentiment.
French Profanity List
A dataset of thousands of French profanities, insults, and curse words, so that you can keep your platform safe.
Spanish Hate Speech Dataset
A collection of Spanish hate speech texts.
Financial Transactions Dataset
A dataset of financial transactions, classified by intent and financial category.
Twitter Hate Speech Dataset
A collection of hate speech tweets on Twitter.
Stock Sentiment Analysis Dataset
1000 stock market tweets, labeled with their sentiment towards a publicly traded stock.
Resumes and Job Categorization Dataset
A dataset of resumes, classified with job title, category, and more.
Search Evaluation Dataset
This search evaluation dataset contains search queries, the intent behind each search query, result URLs, and a human-evaluated search quality rating.
Crypto Sentiment Analysis Dataset
1000 Reddit comments about crypto, labeled with Positive or Negative sentiment.
Japanese Profanity List
A dataset of thousands of Japanese profanities, insults, and curse words, so that you can keep your platform safe.
Credit Card Transactions Dataset
A collection of credit card transactions, classified by intent and financial category.
Google Search Quality Dataset
This Google Search Quality dataset contains search queries, intents, result URLs, and a human-evaluated rating.
Twitter Sentiment Analysis Dataset
1000+ tweets, classified by sentiment.
Japanese Hate Speech, Insults, and Toxicity Dataset
A dataset of online comments in Japanese that contain hate speech, insults, and toxicity.
Arabic Hate Speech Dataset
A dataset of Arabic hate speech texts.
Abortion Tweets Dataset
A collection of tweets, labeled with their stance on abortion and Roe v. Wade.
Brand Sentiment Analysis Dataset
Ditch NPS for good; understand real user sentiment with this dataset of 1000 labeled online conversations.
Dataset of Search Queries and Intents
This dataset contains search queries, along with the user's intent behind each query.
Fake News Dataset
A dataset of social media posts containing fake news.
Facebook Misinformation Dataset
A dataset of Facebook posts containing misinformation.
Facebook Hate Speech Dataset
A collection of hate speech posts on Facebook.
Email Spam Dataset
A dataset of real Spam and Not Spam emails, including whether or not they were caught by Gmail's spam filters.
German Profanity List
A dataset of thousands of German profanities, insults, and curse words, so that you can keep your platform safe.
Arabic Profanity List
A dataset of thousands of Arabic profanities, insults, and curse words, so that you can keep your platform safe.
Spanish Profanity List
A dataset of thousands of Spanish profanities, insults, and curse words, so that you can keep your platform safe.
Question-Answering Dataset
A dataset of questions about real webpages, news articles, and pieces of text, along with their associated answers.

Other Resources

Manifold
Have you ever wondered how your data is shaped? Explore your datasets in their embedding space with our interactive visualizations.

Brought to you by Surge AI

The world's highest-quality data labeling platform. We unify sophisticated labelers with the powerful tools you need to build next-gen artificial intelligence and machine learning models. Learn about some of the common pitfalls in data labeling we avoid to bring you the best data possible.

Meet the world's largest RLHF platform