How much do you trust your training data?
Good ML models depend on high-quality datasets. Human-labeling systems are the gold standard, but like all systems, they're prone to mistakes.
At Surge AI, we're building the human-AI infrastructure to power NLP, and we're always looking for new ways to explore human-machine collaboration. (Remember Tom Cruise in Minority Report? We're inspired by that too.)
That's why we're excited to share Surge Manifold, an interactive tool to visually explore your datasets and identify outliers in your data.
Take, for example, this orange point on the bottom left. Lurking amidst the sea of purple, it looks pretty out of place...
Well, it is! This is a Manifold visualization of a popular spam dataset in which SMS messages were labeled as Spam or Non-Spam.
If we click on the data point, the message is revealed: "Did you hear about the new 'Divorce Barbie'? It comes with all of Ken's stuff!" It's a joke that has been erroneously labeled as Spam.
Next, let's look at this Jigsaw dataset, in which Wikipedia comments were categorized as 'toxic' or 'non-toxic'.
When we visualize the dataset with Manifold, we can pick out at least one data point that is certainly a misclassification – and sure enough, it's a comment classified as non-toxic, despite containing a racial slur.
Here we've visualized a popular sentiment analysis dataset. Each point represents a tweet about one of the six major U.S. airlines, labeled as positive, neutral, or negative by human labelers.
The visualization once again helps us find a mislabel:
This orange tweet hovering at the purple/orange border ("@united thanks for updating me about the 1+ hour delay the exact second I got to ATL. 🙅🙅🙅") was labeled as Positive, even though its author is clearly being sarcastic.
(If you want to explore these and other visualizations further, they're available on the Manifold home page.)
Internally, we've been using Manifold to improve the quality of the datasets our human labeling platform provides, at 10x the efficiency of standard methods. We've also been using it as a form of lightweight ML, based on the geometry that Manifold uncovers; we'll go into this more in a future post...
We're excited to make Manifold public. At Surge, we believe that everybody – whether you're a researcher, an artist, a seasoned ML engineer, or a student just starting out – should have the best data imaginable.
So try uploading your own dataset into Manifold here! All you need is a CSV file with a "text" column and an optional "label" column.
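If your data isn't in that shape yet, here's a minimal sketch of one way to produce it with Python's standard csv module. The helper names (build_csv, has_text_column), the output filename, and the second sample message are our own illustrations, not part of Manifold; the only thing taken from the requirements above is the "text" column and the optional "label" column.

```python
import csv
import io


def build_csv(rows):
    """Serialize (text, label) pairs into CSV text with 'text' and 'label' columns.

    A label of None is written as an empty cell, since the label is optional.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["text", "label"])
    writer.writeheader()
    for text, label in rows:
        writer.writerow({"text": text, "label": "" if label is None else label})
    return buf.getvalue()


def has_text_column(csv_text):
    """Return True if the CSV header contains the required 'text' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "text" in (reader.fieldnames or [])


# Sample rows: the first is the joke message from the spam example above,
# the second is a made-up unlabeled message (hypothetical).
rows = [
    ('Did you hear about the new "Divorce Barbie"? It comes with all of Ken\'s stuff!', "Spam"),
    ("See you at lunch tomorrow?", None),
]

csv_text = build_csv(rows)
print(csv_text)

# Hypothetical output filename; newline="" avoids doubled line endings on Windows.
with open("my_dataset.csv", "w", newline="", encoding="utf-8") as f:
    f.write(csv_text)
```

Letting the csv module do the writing means messages that contain commas or quotation marks (like the joke above) are escaped properly, so each message stays in a single "text" cell.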
Have ideas on how to make Manifold more helpful? Need help labeling a new dataset that you're hoping to train or test a model on?