Manifold: Using Neural Embeddings to Explore the Shape of Your Data

Jefferson Lee
Sep 18, 2021

How much do you trust your training data?

Good ML models depend on high-quality datasets. Human-labeling systems are the gold standard, but like all systems, they're prone to mistakes.

At Surge AI, we're building the human-AI infrastructure to power NLP, and we're always looking for new ways to explore human-machine collaboration. (Remember Tom Cruise in Minority Report? We're inspired by that too.)

That's why we're excited to share Surge Manifold, an interactive tool to visually explore your datasets and identify outliers in your data.

Visualizing Spam

Take, for example, this orange point on the bottom left. Lurking amidst the sea of purple, it looks pretty out of place...

Well, it is! This is a Manifold visualization of a popular spam dataset in which SMS messages were labeled as Spam or Non-Spam.

It's a tired joke, but it's not spam!

If we click on the data point, the message is revealed: "Did you hear about the new 'Divorce Barbie'? It comes with all of Ken's stuff!" It's a joke that has been erroneously labeled as Spam.

Toxicity

Next, let's look at this Jigsaw dataset, in which Wikipedia comments were categorized as 'toxic' or 'non-toxic'.

When we visualize the dataset with Manifold, we can pick out at least one data point that is certainly a misclassification – and sure enough, it's a comment classified as non-toxic, despite containing a racial slur.

This racist comment is probably about as toxic as Wikipedia comments get.

Undetected Sarcasm

Here we've visualized a popular sentiment analysis dataset. Each point represents a tweet about one of the six major U.S. airlines, labeled as positive, neutral, or negative by human labelers.

The visualization once again helps us find a mislabel:

The orange point was labeled as Positive sentiment, but it's hard to imagine someone was thinking positive thoughts when they tweeted those emojis.

This orange tweet hovering at the purple/orange border ("@united thanks for updating me about the 1+ hour delay the exact second I got to ATL. 🙅🙅🙅") was labeled as Positive, even though its author is clearly being sarcastic.

(If you want to explore these and other visualizations further, they're available on the Manifold home page.)
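
All three mislabels above share a pattern: a point whose label disagrees with most of the points around it. As a rough illustration of that idea (not a description of how Manifold itself works), here is a minimal sketch that scores each point by how many of its nearest neighbors carry a different label. The function name `neighbor_disagreement`, the toy 2-D coordinates, and the 0.5 threshold are all assumptions made up for this example; real embeddings would come from a neural model and have many more dimensions.

```python
import math

def neighbor_disagreement(embeddings, labels, k=3):
    """For each point, return the fraction of its k nearest neighbors
    (by Euclidean distance) whose label differs from its own."""
    scores = []
    for i, point in enumerate(embeddings):
        # Distance from point i to every other point, nearest first.
        others = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        nearest = [j for _, j in others[:k]]
        differing = sum(labels[j] != labels[i] for j in nearest)
        scores.append(differing / len(nearest))
    return scores

# Hypothetical 2-D "embeddings": two tight clusters, with one point
# (index 3) labeled Spam while sitting inside the Non-Spam cluster.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (4.9, 5.0),
]
labels = ["Non-Spam", "Non-Spam", "Non-Spam", "Spam",
          "Spam", "Spam", "Spam", "Spam"]

scores = neighbor_disagreement(points, labels, k=3)
# Flag points whose neighbors mostly carry a different label.
suspects = [i for i, score in enumerate(scores) if score > 0.5]
print(suspects)
```

In this toy data only the point at index 3 is flagged, since all three of its nearest neighbors are labeled Non-Spam. A flagged point is a candidate for human review, not proof of a mislabel.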

What's next?

Internally, we've been using Manifold to improve the quality of the datasets our human labeling platform provides, at 10x the efficiency of standard methods. We also use it as a form of lightweight ML, based on the geometry that Manifold uncovers; we'll go into this more in a future post.

We're excited to make Manifold public. At Surge, we believe that everybody – whether you're a researcher, an artist, a seasoned ML engineer, or a student just starting out – should have the best data imaginable.

So try uploading your own dataset into Manifold here! All you need is a CSV file with a "text" column and an optional "label" column.
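
If your data isn't already in that shape, a file like this can be produced with Python's standard `csv` module. This is only a sketch: the helper name `write_dataset`, the file name `my_dataset.csv`, and the second example row are hypothetical, and leaving the label cell empty for unlabeled rows is our assumption about how an optional column would be represented.

```python
import csv

def write_dataset(path, rows):
    """Write rows to a CSV file with a "text" column and a "label" column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        for row in rows:
            # The label is optional, so fall back to an empty cell.
            writer.writerow({"text": row["text"], "label": row.get("label", "")})

rows = [
    # The mislabeled joke from the spam example above.
    {"text": 'Did you hear about the new "Divorce Barbie"? '
             "It comes with all of Ken's stuff!",
     "label": "Spam"},
    # A made-up, unlabeled message.
    {"text": "Running late, see you at 7."},
]

write_dataset("my_dataset.csv", rows)
```

The `csv` module takes care of quoting, so messages containing commas or double quotes still land in a single "text" cell.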

Have ideas on how to make Manifold more helpful? Need help labeling a new dataset that you're hoping to train or test a model on?

We love questions and we'd love to help! Reach out to us at team@surgehq.ai, find us on Twitter at @HelloSurge, or schedule a demo here.

Jefferson Lee

Jefferson leads Surge AI's data labeling and content moderation products — whether it's helping customers evaluate large language models, moderate content, or train Spam and Hate Speech classifiers. He was previously an early engineer on Airbnb's Trust and Safety ML team, and studied computer science at Harvard.