Blog
Unsexy AI Failures: The PDF That Broke ChatGPT
DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet
How Anthropic uses Surge AI to Train and Evaluate Claude
We Evaluated ChatGPT vs. Google on 500 Search Queries
AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust
HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors
How TikTok is Evolving the Next Generation of Search
Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?
Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels
The $250K Inverse Scaling Prize and Human-AI Alignment
Human Evaluation of Large Language Models: How Good is Hugging Face’s BLOOM?
30% of Google's Emotions Dataset is Mislabeled
Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality
AI Red Teams and Adversarial Data Labeling with Redwood Research
Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?
How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems
10 Egregious Failures in Gmail Spam Detection
We asked 100 humans to draw the DALL·E prompts
The average number of ads on a Google Search recipe? 8.7
Google Search is Falling Behind
Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values
Holy $#!t: Are popular toxicity models simply profanity detectors?
Is Google Search Deteriorating? Measuring Google's Search Quality in 2022
5 Examples of the Importance of Context-Sensitivity in Data-Centric AI
The AI Bottleneck: High-Quality, Human-Powered Data