Dataset: Stock Market Sentiment on Social Media

Bradley Webb
Jun 8, 2022
During the GameStop saga of January 2021, the WallStreetBets subreddit successfully short squeezed GameStop’s stock, leading to a $7B loss for Melvin Capital and memes galore.

Melvin Capital vs. the little guys

Too bad Melvin didn’t have a crack ML team monitoring social media!

To help others explore social media's influence on the stock market — and avoid Melvin Capital's fate — we created a dataset of social media conversations about public stocks, labeled with sentiment.

Download the dataset here!

Have ideas for other datasets you’d like us to release? Give us a shout on Twitter at @HelloSurgeAI.

The Stock Sentiment Analysis Dataset

The dataset contains 1,000 social media discussions of publicly traded stocks, each labeled with a Positive or Negative sentiment. Some messages are unequivocal; others are much trickier for models to classify correctly, since their sentiment is masked by sarcasm and trading-specific language.

Here are some examples. Are you confident that your classifiers can label them appropriately?

First, can your model detect the sarcasm and realize that this message is actually Positive towards $SNOW?

Positive towards $SNOW. Can you detect the sarcasm?

How would it classify this tweet? It's overall Positive in sentiment towards Ariose Capital Management 13F, but mentions that they've exited $NVDA. Can your model parse the structure?

Negative towards $NVDA

Many off-the-shelf sentiment analysis classifiers mistakenly classify any profanity (even obscured profanity, like "fack") as negative in sentiment. But profanity isn't always a bad sign!

Positive towards $GFAI
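To make that failure mode concrete, here's a minimal sketch of a toy word-list classifier. Everything in it is an assumption for illustration: the word lists, the `naive_sentiment` function, and the sample message are all made up, and none of them come from the dataset or from any real library.

```python
import re

# Hypothetical word lists, invented for this illustration.
# Note that the obscured profanity "fack" sits in the negative list,
# which is the mistake many off-the-shelf lexicons make.
POSITIVE_WORDS = {"love", "great", "moon", "gains"}
NEGATIVE_WORDS = {"hate", "terrible", "crash", "fack"}


def naive_sentiment(text: str) -> str:
    """Count positive and negative words; ties fall back to Positive."""
    tokens = re.findall(r"[a-z$]+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return "Negative" if negatives > positives else "Positive"


# A made-up message (not from the dataset) where the profanity is enthusiastic.
message = "fack yeah, $GFAI is flying"
print(naive_sentiment(message))  # the profanity alone tips this to Negative
```

A human reader would call that message bullish, but the word counter sees one "negative" token and nothing else. A model that handles these cases has to read the profanity in context rather than score it in isolation.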

How would your sentiment classifier perform? Download the dataset and try it out here: https://github.com/surge-ai/stock-sentiment
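If you want a quick harness for that experiment, here's a rough sketch. It assumes the data is a CSV with one column of message text and one column of Positive/Negative labels; the column names `text` and `sentiment` are guesses, so check the actual file in the repo and adjust them. The sample rows are placeholders, not real dataset entries, and `always_positive` is just a stand-in for your own classifier.

```python
import csv
import io
from collections import Counter


def load_rows(handle, text_col="text", label_col="sentiment"):
    """Read (text, label) pairs from a CSV. Column names are assumptions."""
    return [(row[text_col], row[label_col]) for row in csv.DictReader(handle)]


def evaluate(rows, classify):
    """Return accuracy and a Counter keyed by (gold label, predicted label)."""
    confusion = Counter((gold, classify(text)) for text, gold in rows)
    correct = sum(n for (gold, pred), n in confusion.items() if gold == pred)
    total = sum(confusion.values())
    return (correct / total if total else 0.0), confusion


# Placeholder rows standing in for the real file.
SAMPLE_CSV = """text,sentiment
placeholder bullish message,Positive
placeholder bearish message,Negative
another placeholder bullish message,Positive
"""


def always_positive(text):
    """Baseline that ignores the text; swap in your own model here."""
    return "Positive"


rows = load_rows(io.StringIO(SAMPLE_CSV))
accuracy, confusion = evaluate(rows, always_positive)
print(f"accuracy={accuracy:.2f}")
for (gold, pred), count in sorted(confusion.items()):
    print(f"gold={gold} predicted={pred}: {count}")
```

To run it on the real data, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle. The (gold, predicted) counts are worth a look alongside accuracy, since they show whether your model's mistakes cluster on one label.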

How We Labeled the Dataset

As the examples above show, labeling the sentiment of a message towards a particular stock can be surprisingly tricky. And often you need domain knowledge: if someone is talking about buying a put or a short squeeze, is that positive or negative? Unless you’re familiar with these financial terms, you may not know how to label it.

Having data labelers with the right skills is essential to creating quality datasets. For this project, we used a team of Surgers who have financial backgrounds, are heavy social media users, and have worked on our other financial categorization and social media data labeling projects.

Here's a peek at our labeling UI as well. Our platform makes it easy to create new labeling jobs, whether through our API or our WYSIWYG editor.

Labeling on the Surge AI platform

Want to create a custom financial dataset that will take you to the moon? Sign up and create a new labeling project in seconds, or reach out to us for help at hello@surgehq.ai!

Bradley Webb

Bradley runs Surge AI's Product and Growth teams. He previously led Integrity and Data Operations teams at Facebook, and graduated from Dartmouth.
