Andrew Mauboussin
Andrew oversees Surge AI's Engineering team. He previously led Twitter's Integrity ML and Counterintelligence efforts, and studied CS at Harvard.
DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet
Andrew Mauboussin
An update on Astral Codex Ten's Image Generation Bet: close, but no cigar. DALL·E 3 and Midjourney fail.
We Evaluated ChatGPT vs. Google on 500 Search Queries
Andrew Mauboussin
We measured ChatGPT vs. Google on 500 search queries, and found that ChatGPT crushes Google on coding and ties it on general information — despite not being optimized for a search experience at all. Dive into this post to learn more about OpenAI’s existential threat to Google.
AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust
Andrew Mauboussin
How do you make large language models safer and robust to adversarial attacks? Learn about AI red teams: creative data labelers who interactively probe AI defenses in order to strengthen them.
How TikTok is Evolving the Next Generation of Search
Andrew Mauboussin
TikTok has been taking over the world — and now, your Google Search results too. But when are they actually helpful? We ran a large-scale personalized human evaluation, asking Surgers to rate hundreds of <query, TikTok> pairs to find out.
Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?
Andrew Mauboussin
Has Astral Codex Ten's bet on AI progress really been won? We asked Surgers to evaluate DALL·E and Imagen on Scott's 5 compositionality prompts!
Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels
Andrew Mauboussin
Why can't Meta A/B test its way back to greatness? To move Instagram beyond short-term engagement metrics, we ran a personalized human evaluation asking 100 users to compare TikTok vs. Instagram Reels. Learn why Gen Z considers Reels the place where TikToks go to die, and what Instagram should do about it.
The $250K Inverse Scaling Prize and Human-AI Alignment
Andrew Mauboussin
Surge AI is partnering with NYU and the Fund for Alignment Research on the Inverse Scaling Prize. If you've found a task with LLM inverse scaling properties and need help creating a dataset of 300–500+ examples, reach out. We're a human alignment platform with deep expertise in training large language models on human feedback, and we're here to help, starting with $500 of free data labeling credits to kickstart your submission.
Human Evaluation of Large Language Models: How Good is Hugging Face’s BLOOM?
Andrew Mauboussin
Hugging Face's BLOOM is a new 176B parameter multilingual large language model. How does it compare to other state-of-the-art LLMs? We ran a human evaluation across 7 real-world categories to evaluate its performance.
AI Red Teams and Adversarial Data Labeling with Redwood Research
Andrew Mauboussin
10 Egregious Failures in Gmail Spam Detection
Andrew Mauboussin
We asked Surgers – the data labelers on our platform – to collect examples of spammy emails that Gmail failed to catch. Here are 10 wild Gmail Spam misses from our Gmail Spam dataset.
Google Search is Falling Behind
Andrew Mauboussin
Google Search is falling behind. We analyzed three areas – programming queries, sports queries, and cooking queries – to understand where Google Search lags behind its competitors.
Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values
Andrew Mauboussin
Social media platforms optimize for clicks and engagement — but those same short-term optimizations drive clickbait, toxic content, and misinformation. How can we align their ML systems to human values instead? This post describes a data-driven approach with Facebook.