Andrew Mauboussin
Andrew oversees Surge AI's Engineering team. He previously led Twitter's Integrity ML and Counterintelligence efforts, and studied CS at Harvard.
DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet
Andrew Mauboussin
An update on Astral Codex Ten's Image Generation Bet: close, but no cigar. DALL·E 3 and Midjourney fail.
We Evaluated ChatGPT vs. Google on 500 Search Queries
Andrew Mauboussin
We measured ChatGPT vs. Google on 500 search queries, and found that ChatGPT crushes Google on coding and ties it on general information — despite not being optimized for a search experience at all. Dive into this post to learn more about OpenAI’s existential threat to Google.
AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust
Andrew Mauboussin
How do you make large language models safer and robust to adversarial attacks? Learn about AI red teams: creative data labelers who interactively probe AI defenses in order to strengthen them.
How TikTok is Evolving the Next Generation of Search
Andrew Mauboussin
TikTok has been taking over the world — and now, your Google Search results too. But when are they actually helpful? We ran a large-scale personalized human evaluation, asking Surgers to rate hundreds of <query, TikTok> pairs to find out.
Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?
Andrew Mauboussin
Has Astral Codex Ten's bet on AI progress really been won? We asked Surgers to evaluate DALL·E and Imagen on Scott's 5 compositionality prompts!
Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels
Andrew Mauboussin
Why can't Meta A/B test its way back to greatness? To move Instagram beyond short-term engagement metrics, we ran a personalized human evaluation asking 100 users to compare TikTok vs. Instagram Reels. Learn why Gen Z considers Reels the place where TikToks go to die, and what Instagram should do about it.
The $250K Inverse Scaling Prize and Human-AI Alignment
Andrew Mauboussin
Surge AI is partnering with NYU and the Fund for Alignment Research on the Inverse Scaling Prize. If you've found a task with LLM inverse scaling properties and need help creating a dataset of 300–500+ examples, reach out. We're a human alignment platform with deep expertise in training large language models on human feedback, and we're here to help, starting with $500 of free data labeling credits to kickstart your submission.
Human Evaluation of Large Language Models: How Good is Hugging Face’s BLOOM?
Andrew Mauboussin
Hugging Face's BLOOM is a new 176B parameter multilingual large language model. How does it compare to other state-of-the-art LLMs? We ran a human evaluation across 7 real-world categories to evaluate its performance.
AI Red Teams and Adversarial Data Labeling with Redwood Research
Andrew Mauboussin
10 Egregious Failures in Gmail Spam Detection
Andrew Mauboussin
We asked Surgers – the data labelers on our platform – to collect examples of spammy emails that Gmail failed to catch. Here are 10 wild Gmail Spam misses from our Gmail Spam dataset.
Google Search is Falling Behind
Andrew Mauboussin
Google Search is falling behind. We analyzed three areas – programming queries, sports queries, and cooking queries – to understand where Google Search lags behind its competitors.
Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values
Andrew Mauboussin
Social media platforms optimize for clicks and engagement — but those same short-term optimizations drive clickbait, toxic content, and misinformation. How can we align their ML systems to human values instead? This post describes a data-driven approach with Facebook.