Blog

We aren’t just training models. You don’t simply feed a child textbooks.

We teach them values, street smarts, perseverance – the infinite subtle things that make them successful in the real world.

This is our logbook as we build the first generation of AGI.

Chartography: A Benchmark for Professional Chart Understanding

A benchmark for professional chart reasoning: Kaplan-Meier curves, candlesticks, contour maps, Bode plots, and more, written and graded by domain experts.

July 20, 2026

July 16, 2026

OpenAI cites GDP.pdf in its GPT-5.6 release

OpenAI included Surge AI’s GDP.pdf benchmark in its GPT-5.6 release. Its flagship model scored 30.7% on real-world professional document tasks.

July 20, 2026

July 9, 2026

Deeper Instructions, Stronger Generalization: Training on ComplexConstraints

We trained a 4B model on 1,000 expert-written rubrics from ComplexConstraints, our frontier instruction-following benchmark. It reached parity with a 60x larger model, and the gains transferred to external benchmarks it never saw.

July 18, 2026

June 29, 2026

HANDBOOK.md Benchmark: Can Agents Follow 100-Page Company Policies?

A benchmark for long-context enterprise agents: MCP-native RL environments, expert-written handbooks up to 124 pages, deterministic grading. No frontier model exceeds 25%. Instead, they fire employees without authorization, approve self-submitted expenses, and send expired medical records to insurers.

July 18, 2026

June 25, 2026

Anthropic cited GDP.pdf and Riemann-bench in their Fable 5 and Mythos 5 system card

Anthropic cited two Surge AI benchmarks, GDP.pdf and Riemann-bench, in their Fable 5 and Mythos 5 release. A look at why expert-built evaluations matter at the frontier.

July 19, 2026

June 7, 2026

ComplexConstraints: A Benchmark for Entangled Instruction Following

A benchmark for entangled instruction following, where constraints depend on each other, fire conditionally, and must be inferred from context.

July 18, 2026

June 3, 2026

Microsoft used Surge human evaluations to benchmark MAI-Thinking-1

When Microsoft wanted to understand how MAI-Thinking-1 measured up, they used Surge human evaluations to prove it.

July 9, 2026

June 2, 2026

Cross-Benchmark Generalization for Long-Horizon Agentic Tasks

Post-training on Surge AI's agentic RL environments and why it generalizes to external tool-use benchmarks like Toolathlon, τ²-Bench, and BFCL-V4.

July 17, 2026

May 28, 2026

Antidote Leaderboard: Optimizing for You

LMArena measures which answer you prefer in two seconds. Antidote measures which one you'd still be glad you got a month later, graded by doctors, lawyers, and engineers.

July 18, 2026

May 21, 2026

GDP.pdf Benchmark: Can Frontier Models Master the Documents that Run the World?

Can frontier models master the documents that run the world? GDP.pdf is a professional multimodal reasoning benchmark that takes real-world prompts and PDFs pulled directly from expert enterprise workflows.

July 19, 2026

April 14, 2026

Riemann-bench: A Benchmark for Moonshot Mathematics

Riemann-bench is a verifiable benchmark of extreme-tier mathematical problems where even frontier models score <10%.

July 18, 2026

March 24, 2026

EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic, Enterprise RL Environments

Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks. Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.

July 18, 2026

February 19, 2026

Hemingway-bench: Because Good Writing Isn't a Checklist of Vibes

Stop rewarding slop. Hemingway-bench is an AI writing leaderboard that takes real-world writing tasks and puts them in front of master wordsmiths. Our goal: to push AI writing from two-second vibes to genuine nuance and impact.

July 18, 2026

February 4, 2026

Building AdvancedIF: Evolving Instruction Following Beyond IFEval and “Avoid the Letter C”

Meta Superintelligence Labs partnered with Surge to build AdvancedIF, an instruction-following benchmark where every prompt and rubric was written by human experts – not synthetically generated by an LLM. In instruction-following domains, where frontier models still fail 22-30%, using these human-crafted rubrics as reward signals for RL yields a 13% gain.

February 5, 2026

December 6, 2025

LMArena is a cancer on AI

Would you trust a medical system whose only metric was “which doctor wins the Internet?” No, you'd call that malpractice. Yet that's LMArena.

May 28, 2026

December 1, 2025

Surge AI Research Team

RL Environments and the Hierarchy of Agentic Capabilities

Our RL environment run on 9 models revealed the core capabilities all agents need to master: tool use, planning, adaptability, groundedness, and common sense.

February 5, 2026

November 3, 2025

Surge AI Research Team

How do frontier models perform on real-world finance problems?

We stress-tested GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5 on 200+ expert finance tasks. Here's where even the best models break when they move from benchmarks to Wall Street.

February 5, 2026

November 3, 2025

Lily Zhao

A Product Take on Sonnet 4.5

After 100+ hours with Opus 4.1 and 20+ hours in the first week of Sonnet 4.5's launch, Nick Heiner, our VP of Product, gives his first impressions.

October 31, 2025

October 10, 2025

Nick Heiner

Is Sonnet 4.5 the best coding model in the world?

On Surge AI’s agentic coding benchmark, Claude Sonnet 4.5 outperformed GPT-5-Codex in accuracy, while GPT-5-Codex was more cost-efficient. Despite similar scores, the two models failed on different tasks. In a refactoring case study, Claude succeeded after persistent debugging, while GPT-5-Codex failed after an unexplained decision to end the task early. Both stayed focused and avoided hallucinations even when they ran into difficulties.

October 30, 2025

October 8, 2025

Logan Ritchie

The Human/AI Frontier: A Conversation with Bogdan Grechuk

At Surge AI, we work with the world’s sharpest minds to push the limits of AI. Professor Bogdan Grechuk – an IMO gold medalist and Associate Professor at the University of Leicester – is one of them. We interviewed him about the work he does to train SOTA models to perform frontier research.

February 5, 2026

September 29, 2025

SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations

When coding models spiral into self-reinforcing hallucinations, small mistakes compound into catastrophic failure. In SWE-bench, we saw SOTA models invent whole classes, methods, and terminal outputs – never realizing they had lost touch with the real codebase. In this case study, we’ll look at how three frontier coding agents tried to solve one particular SWE-bench problem: one spiraled into hallucinations and failed entirely, one spiraled but recovered, and one avoided hallucinations altogether. Our goal: to illustrate how dissecting real-world problems can steer models towards human-ready AGI.

February 5, 2026

September 15, 2025

Logan Ritchie

Benchmarks are broken

Academic benchmarks make great headlines, and terrible AI.

December 3, 2025

September 7, 2025

Unsexy AI Failures: The PDF That Broke ChatGPT

The AI world loves climbing leaderboards. Companies race to hit #1 on LMSYS, chase perfect scores on academic benchmarks, and demo SVGs of pelicans on bicycles. These achievements make for great headlines and impressive presentations – even when these metrics are easily hacked.

February 5, 2026

August 25, 2025

Bringing light to the GPT-4o vs. GPT-5 personality controversy

GPT-5 was released on Aug 7, 2025. The swift removal of all legacy models from the ChatGPT UI was met with an even swifter backlash: some people online felt that GPT-4o was more personable, human, and engaging, whereas GPT-5 was stiff and robotic. This viral meme encapsulated the faction’s thesis:

February 5, 2026

August 15, 2025

Nick Heiner

Keri Wood

DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet

An update on Astral Codex Ten's Image Generation Bet: close, but no dice. DALL·E 3 and Midjourney fail.

October 30, 2025

August 1, 2024

Edwin Chen

How Anthropic uses Surge AI to Train and Evaluate Claude

Learn how Anthropic partnered with Surge AI to gather high-quality human feedback at scale using the RLHF platform, resulting in one of the safest and most advanced large language models on the planet.

October 31, 2025

March 9, 2023

We Evaluated ChatGPT vs. Google on 500 Search Queries

We measured ChatGPT vs. Google on 500 search queries, and found that ChatGPT crushes Google on coding and ties it on general information — despite not being optimized for a search experience at all. Dive into this post to learn more about OpenAI’s existential threat to Google.

October 30, 2025

December 21, 2022

AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust

How do you make large language models safer and adversarially robust to counterattacks? Learn about AI red teams of creative data labelers who try to interactively penetrate AI defenses in order to teach them.

October 30, 2025

December 12, 2022

HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors

We analyzed HellaSwag, a popular LLM benchmark, and found errors in 36% of its rows.

October 30, 2025

December 4, 2022

Edwin Chen

How TikTok is Evolving the Next Generation of Search

TikTok has been taking over the world — and now, your Google Search results too. But when are they actually helpful? We ran a large-scale personalized human evaluation, asking Surgers to rate hundreds of <query, TikTok> pairs to find out.

October 30, 2025

October 25, 2022

Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?

Has Astral Codex Ten's bet on AI progress really been won? We asked Surgers to evaluate DALL·E and Imagen on Scott's 5 compositionality prompts!

October 30, 2025

September 29, 2022

Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels

Why can't Meta A/B test its way back to greatness? To move Instagram beyond short-term engagement metrics, we ran a personalized human evaluation asking 100 users to compare TikTok vs. Instagram Reels. Learn why Gen Z considers Reels the place where TikToks go to die, and what Instagram should do about it.

October 30, 2025

August 31, 2022

The $250K Inverse Scaling Prize and Human-AI Alignment

Surge AI is partnering with NYU and the Fund for Alignment Research on the Inverse Scaling Prize. If you've found a task with LLM inverse scaling properties, and need help creating a dataset of 300-500+ examples, reach out. We’re a human alignment platform with deep expertise in training large language models on human feedback, and we’re here to help – including $500 of free data labeling credits to kickstart your submission.

October 30, 2025

August 15, 2022

Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality

Search quality measurement is one of the trickiest but most important parts of building Search. Read how Neeva uses human evaluation of search quality to build a state-of-the-art search engine challenging Google.

October 30, 2025

July 29, 2022

Human Evaluation of Large Language Models: How Good is Hugging Face’s BLOOM?

Hugging Face's BLOOM is a new 176B parameter multilingual large language model. How does it compare to other state-of-the-art LLMs? We ran a human evaluation across 7 real-world categories to evaluate its performance.

October 30, 2025

July 19, 2022

30% of Google's Emotions Dataset is Mislabeled

Last year, Google released their “GoEmotions” dataset: a human-labeled dataset of 58K Reddit comments categorized according to 27 emotions. The problem? A whopping 30% of the dataset is mislabeled! Check out some of the egregious errors, and learn how to build better datasets.

October 30, 2025

July 11, 2022

Edwin Chen

AI Red Teams and Adversarial Data Labeling with Redwood Research

Our mission at Surge AI is to inject human values and intelligence into AI. We want to build a world where AI

October 31, 2025

June 28, 2022

Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?

Gary Marcus has several examples of AI mistakes. But are they really failures, or a sign of creativity? We gave the prompts behind GPT-3's "mistakes" to 15 Surgers to see how humans would complete them instead.

October 30, 2025

June 22, 2022

Edwin Chen

How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems

We built a dataset of 8,500 Grade School Math Problems for OpenAI. The goal of the dataset: to train language models like GPT-3 to solve natural language math problems and measure their reasoning ability. Learn about our process in this blog post!

October 31, 2025

June 13, 2022

Edwin Chen

We asked 100 humans to draw the DALL·E prompts

Where do human artists fit in a world of rich, creative AI? We asked 100 Surgers to draw the DALL·E prompts.

October 30, 2025

May 12, 2022

Edwin Chen

Google Search is Falling Behind

Google Search is falling behind. We analyzed three areas – programming queries, sports queries, and cooking queries – to understand where Google Search lags behind its competitors.

October 31, 2025

April 12, 2022

Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values

Social media platforms optimize for clicks and engagement — but those same short-term optimizations drive clickbait, toxic content, and misinformation. How can we align their ML systems to human values instead? This post describes a data-driven approach with Facebook.

October 31, 2025

February 10, 2022

Edwin Chen

Holy $#!t: Are popular toxicity models simply profanity detectors?

Are popular toxicity models simply profanity detectors? We show how toxicity models overweight profanity, and make mistakes when profanity is used in a positive way.

October 29, 2025

January 22, 2022

Is Google Search Deteriorating? Measuring Google's Search Quality in 2022

Has Google's Search Quality deteriorated in recent years? This post measures Google Search using human evaluation.

October 30, 2025

January 10, 2022

Edwin Chen

5 Examples of the Importance of Context-Sensitivity in Data-Centric AI

Data-centric AI requires radically rethinking the data that goes into your models. Surge AI provides data labelers with the skills you need to get context-sensitive labels.

October 31, 2025

November 19, 2021

Edwin Chen

Raise AGI with the richness of human intelligence.