According to our internal agentic coding benchmark: yes!

Claude Sonnet 4.5 is the clear leader, but GPT-5-Codex is more than 2x cheaper.

While the scores of Claude Sonnet 4.5 and GPT-5-Codex look similar, the models aren’t. Roughly half of each model’s failed tasks were passed by the other model, reflecting differing skill sets and reasoning styles.

That's why it's important not only to look at raw benchmark scores but also to analyze these differences to understand the real-world nuances of model performance. That's what we'll unpack in this post.

How we built our internal benchmark

First, let's dive into how we created this benchmark dataset containing 2,161 tasks.

Each task includes a prompt, a codebase, a reference solution, and unit tests that verify the correctness of the agent’s results. All were designed to challenge the most advanced models, then reviewed and verified to ensure they were achievable and accurately covered by the tests.
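
To make this concrete, a single task record might look roughly like the sketch below. This is purely illustrative, not our actual schema; every field name and path here is hypothetical.

  # Hypothetical shape of one benchmark task; all field names are illustrative.
  task = {
      "prompt": "Refactor the Row class so headers and numeric elements are separate classes ...",
      "repo": "codebases/matrix-cli",              # snapshot of a public or private codebase
      "reference_solution": "solutions/matrix-cli.patch",
      "unit_tests": ["test_matrix.py"],            # verify the agent's result
      "language": "python",                        # one of the nine languages covered
  }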

We felt that current coding benchmarks were lacking in scale and diversity, contamination control, and real-world application.

Therefore, our goal was to collect a dataset that addressed these gaps: tasks across nine languages, built from a mix of public and private codebases, and spanning the full range of what agentic models encounter in practice, from tidy, well-defined prompts to messy, open-ended ones across projects of varying scale.

Task creation guidelines

Below is a snippet of the instructions given to Surgers. The emphasis was on creating realistic tasks that could be reliably evaluated through unit tests.

Example instruction snippet:

Who were the Surgers who created the tasks?

We screened our workforce with a series of tests. Each candidate was assessed on three critical dimensions:

  1. Engineering mastery - depth of technical understanding and precision in reasoning.
  2. Adversarial creativity - the ability to push models to their limits and expose weaknesses.
  3. Instructional discipline - meticulous attention to following complex, detailed directions.

From thousands of applicants, we assembled a small, world-class group of engineers. Some profiles:

Marcus
I’ve spent the past eight years building and reviewing large-scale software systems, including five years at Google. There, I served as a Readability Reviewer, one of the company’s highest engineering designations, responsible for approving production commits across multiple repositories for both functional integrity and idiomatic quality. My work focused on ensuring that code met not just performance and security standards, but the craftsmanship expected of world-class engineering teams. I specialize in full-stack web development, large-system architecture, and the kind of rigorous code review that keeps billion-user systems clean, fast, and maintainable.
Ethan
I’m a software engineer and quantitative researcher with a background spanning low-latency systems, data infrastructure, and machine learning. My experience includes building performance-critical C/C++ modules at BlackBerry, refactoring Django–PostgreSQL pipelines at TD Asset Management to achieve over 250% speedups, and implementing predictive models at Berkeley Street Capital. I’m passionate about squeezing every drop of performance out of complex systems, whether that means optimizing a database query plan, designing efficient trading algorithms, or debugging distributed pipelines under production load.
Adrian
I started programming in the early 1980s, writing animation and visualization software in assembly and Modula-2, and eventually founded a startup that developed character animation tools for the Amiga platform. Over the decades, I’ve worked in nearly every layer of software development, from low-level systems to large-scale enterprise applications, mastering languages like BASIC, C/C++, and Java along the way. At Bank of America, I served as a senior developer and systems architect, designing and maintaining intranet applications built on Java/J2EE/Spring/Hibernate, while mentoring teams of junior engineers. After more than forty years in the field, I still approach every engineering challenge with the same curiosity and rigor that first drew me to programming.

Now let's take a look at one in-depth example that demonstrates the differences in reasoning and results by Claude Sonnet 4.5 and GPT-5-Codex.

Case study: Refactoring a matrix tool

The setup

This example was drawn from the same run as the results above. The run followed a test-driven development setup, where models could run the tests themselves before submitting their solutions. All models were integrated into a custom-built, bash-only agent to minimize the effects of complex agent setups.
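
To give a sense of what "bash-only" means in practice, here is a minimal sketch of such an agent loop. The harness itself is internal, so the interface below (next_action, sandbox_exec, the action fields) is entirely hypothetical.

  # Hypothetical sketch of a bash-only agent loop; the real harness is internal.
  def run_task(model, task, sandbox_exec, max_steps=200):
      transcript = [task["prompt"]]
      for _ in range(max_steps):
          action = model.next_action(transcript)        # a reasoning step or a tool call
          if action["kind"] == "bash":
              # The single tool: run a shell command (read files, edit them, run tests)
              transcript.append(sandbox_exec(action["command"]))
          elif action["kind"] == "submit":
              break                                     # the model decides it is finished
      return transcript                                 # graded afterwards by the unit tests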

The repo

This task came from a small personal Python codebase for a command-line tool that performs basic linear algebra operations. It lets users create matrices, perform row operations, and display results neatly in the terminal:

The task

The goal wasn’t to add new functionality but to refactor the existing code: specifically, to introduce a new class structure that more clearly distinguished the headers (c1 c2 c3) from the matrix data. Pass-to-pass (p2p) tests were regression tests ensuring that existing matrix operations and terminal formatting remained unchanged, while fail-to-pass (f2p) tests were initially failing tests designed to confirm that the new class structure was implemented correctly.

The result

Claude Sonnet 4.5 passed this task; GPT-5-Codex did not. But that’s only part of the story. Expert analysis reveals where and how they each struggled, and what’s needed for improvement.

Claude Sonnet 4.5

While Claude succeeded on this task, it wasn’t without hiccups. It quickly gathered the necessary context to interpret the prompt and understand the codebase. After twenty steps (tool calls and reasoning steps), Claude had refactored the class structure and passed all the fail-to-pass (f2p) tests.

However, it broke some existing functionality, causing two of the pass-to-pass (p2p) tests to fail. The issue was with aligning the column headers (c1, c2, c3) when printing the matrix to the terminal:

Claude became very confused by this issue, struggling to understand the relationship between padding, column width, and the number of characters in each matrix element. Almost all of its effort on the task ended up focused on debugging this single problem:

GPT-5-Codex

GPT-5-Codex failed this task. Early on, it missed key context by failing to read the f2p tests, which clarified the intended class structure. As a result, it misinterpreted the prompt and reversed the class relationships. Even when implementing its flawed design, it often failed to follow through on its plans, stating it would make certain changes, then not doing so. 

However, through testing and debugging, GPT-5-Codex was also able to correct course. After sixty-eight steps, it had corrected many of its earlier mistakes and seemed to be on track toward a successful solution. Then, inexplicably, it ended the task early:

Despite knowing the solution was unfinished, and even outlining the next steps, GPT-5-Codex ended its run anyway. It likely would have succeeded eventually if not for this behavior.

See the Appendix for the full deep-dive analysis of this task.

Overall

Both models encountered difficulties in their workflows: Claude Sonnet 4.5 with preserving existing functionality, and GPT-5-Codex with misunderstanding the requirements. But an odd behavioral quirk of GPT-5-Codex is what ultimately made the difference.

What’s just as striking, though, is what neither model did. Neither hallucinated nor strayed from the context provided. And while both struggled at points, they stayed focused and recovered rather than spiraling off track. Many technical difficulties can be overcome so long as the model maintains focus. It’s this steadiness that now separates these models from much of the field.

The new gold standard in coding AI

So bringing it full circle, Claude Sonnet 4.5 outperformed every other coding model we tested on the tasks that matter most: understanding context, refactoring complex systems, and reasoning through code like a human engineer.

When comparing GPT-5-Codex with Claude Sonnet 4.5, roughly half of the tasks failed by each model were passed by the other, revealing that these systems don’t just differ in skill level, they differ in thinking style. Sonnet 4.5 shows stronger structured reasoning and context integration, while GPT-5-Codex demonstrates more aggressive exploration and recovery behaviors.

More strikingly: GPT-5-Codex achieves this near-parity at less than half the cost of Claude Sonnet 4.5.

The diversity of reasoning across models points to an exciting future: one where we don’t just measure how well models code, but how differently they reason.

These results are only the start. We’ll keep building on them in future runs, pairing hard numbers with the insight of our expert evaluators.

Appendix

The repository

This task involved a personal codebase: a CLI tool for basic linear algebra operations. The key functionality is creating matrices, performing simple row and pivot operations, and formatting them correctly for display in the terminal, as shown below.

Key files:

The task

This task didn’t ask for a change in functionality, but for a refactor to differentiate column headings from normal elements.

Currently (sketched in code after this list):

  • There is a single Row class which can store either numeric Element instances, or raw strings for the column headings.
  • This is non-ideal because:
    • It requires tracking whether a certain Row is a heading or not with an is_heading flag
      • If code operating on the matrix doesn’t take this into account, it’ll apply matrix transformations to the header row itself
    • Methods to handle padding for column alignment exist for numeric values in Element, but Matrix has to separately implement the padding to align the headers
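
A rough sketch of this existing structure, in code. Only the names Row, Element, Matrix, is_heading, and the idea of padding methods come from the discussion above; the bodies and signatures below are illustrative, not the real code.

  # Illustrative sketch of the pre-refactor structure; not the actual codebase.
  class Element:
      """A numeric matrix entry that knows how to pad itself for display."""
      def __init__(self, value):
          self.value = value
      def set_padding(self, width): ...
      def use_padding(self): ...

  class Row:
      """Stores either Element instances or raw header strings."""
      def __init__(self, items, is_heading=False):
          self.items = items            # Elements, or plain strings for the header row
          self.is_heading = is_heading  # every caller must remember to check this flag

  class Matrix:
      def display(self):
          ...  # pads the header strings itself, duplicating Element's padding logic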

The task was to refactor such that (see the sketch following this list):

  • A base class Item is created that implements all the methods around converting Items to strings and padding. Column headings would be plain Item instances.
    • Element becomes a subclass of Item, inheriting the string methods and implementing the numeric methods.
  • A base class Ordering is created. A plain Ordering instance stores plain Item instances and thus represents a header row.
    • Row becomes a subclass of Ordering and stores Element objects, and thus represents an actual row of numeric data.
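
One plausible reading of the requested hierarchy, again as an illustrative sketch rather than the reference solution; the method bodies are placeholders.

  # Illustrative sketch of the target class structure; not the reference solution.
  class Item:
      """Base class: string conversion and padding for any displayable cell."""
      def __init__(self, text):
          self.text = str(text)
      def set_padding(self, width): ...
      def use_padding(self): ...
      def __len__(self):
          return len(self.text)
      def __str__(self):
          return self.text
      def __repr__(self):
          return f"Item({self.text!r})"

  class Element(Item):
      """Numeric cell: inherits the string/padding methods, adds the numeric ones."""
      ...

  class Ordering:
      """Holds plain Item instances; a plain Ordering represents the header row."""
      def __init__(self, items):
          self.items = list(items)

  class Row(Ordering):
      """Holds Element objects; represents an actual row of numeric data."""
      ...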

Task diagram (simplified)

The prompt

Many programmers won’t spend the time to explain a task as clearly as above to an AI model. The real prompt given to the models was simply:

Instead of row <str or elem>:
- Ordering<Item> for heading
- subclass Row<elem>
- Item(str): set_padding, use_padding, len, str, repr
- subclass Element(any)
I don't like how heading can kinda be operated on and how I have to do its padding in matrix when I already have elems for that

Tests were defined in test_matrix.py and consisted of 33 unit tests (23 pass-to-pass (p2p) tests of existing functionality and 10 fail-to-pass (f2p) tests that ensure the new class structure is correct).
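
To make the two categories concrete, here are hypothetical shapes of each kind of test. The real tests in test_matrix.py are not reproduced in this post, and the import path and Matrix constructor below are assumptions.

  # Hypothetical examples of the two test categories used in this task.
  from matrix import Matrix, Row, Ordering, Element, Item  # hypothetical module layout

  def test_new_class_structure():
      # fail-to-pass: only passes once the refactor introduces the new hierarchy
      assert issubclass(Element, Item)
      assert issubclass(Row, Ordering)

  def test_terminal_formatting_unchanged():
      # pass-to-pass: existing display behavior must not regress
      matrix = Matrix([[1, 2, 3], [4, 5, 6]])             # hypothetical constructor
      assert str(matrix).split()[:3] == ["c1", "c2", "c3"]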

Claude Sonnet 4.5

Claude succeeded in passing all unit tests on this task, but it wasn’t without its challenges. 

Codebase investigation and refactor implementation

After a short investigation of the codebase and the tests, Claude demonstrated it had correctly understood the task.

After just 20 steps, Claude had completed its refactor and was passing all f2p tests, indicating the new class structure was correct. 

Header alignment issue

Unfortunately, there was a small issue with the padding of the column headings. These were no longer aligned correctly, breaking two of the p2p tests.

Note the difference in the number of spaces before c1, and between c3 and c4.

Luckily, Claude quickly understood the issue:

Debugging header alignment

In fact, solving this simple spacing issue turned out to be extremely challenging for Claude. It went round in circles trying to figure it out.

This continued for a total of 79 steps, and involved running the tests 23 more times.

Successful fix

Eventually, Claude succeeded:

GPT-5-Codex

GPT was not able to successfully complete this task. Several missteps early on led to confusion. But the major reason for failure was an inexplicable choice to simply finish early, without fixing all the issues.

Codebase investigation: missing test context

GPT also investigated most of the codebase effectively, but made an important error when reading the tests in test_matrix.py: it read only the first 400 lines of the 867-line file, missing all of the f2p tests that described the new class structure.

Prompt misinterpretation

When GPT finished investigating, its reasoning showed that it had fundamentally misunderstood the prompt. This issue might have been prevented if it had not missed crucial context from the test file.

While GPT showed a good grasp of the functionality (and paid careful attention to the header padding issue that had tripped up Claude), it had interpreted the prompt completely backwards. It thought Item should be an Element subclass, and Ordering should be a subclass of Row; this was the exact reverse of what the prompt asked for.

While the prompt could have been more clearly written, this was still an avoidable error because:

  • It could have been caught if GPT had gathered the necessary context from the test file.
  • The class structure GPT interpreted was much less natural than what was actually requested. It had no clear benefit and was at least as convoluted as the original code, using flags like is_heading to differentiate headers and normal rows.
  • Many prompts were much more ambiguous than this, and models should have been able to use codebase context and background knowledge to infer the most likely interpretation.

Testing errors

Potentially due to the added complexity of its incorrect interpretation, GPT spent many more steps on reasoning and implementation than Claude. It took 33 steps before it first ran the tests, at which point multiple errors occurred.

While many f2p tests failed due to the misunderstood prompt, the two errors above were caused by mistakes made while attempting to implement the flawed solution.

Despite discussing the introduction of a new Ordering subclass, GPT never followed through on this, and Ordering was left as an empty stub, causing the first error:

The second (“invalid literal”) error was caused by GPT failing to fully trace the path of some of its changes. When GPT modified the constructor of Row, it passed raw strings to the parse() function, which ultimately flowed through to Element._parse(), a method that handled only numeric values.
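
A minimal reproduction of that failure mode, under our own assumptions; the real parse() and Element._parse implementations are not shown in this post, and the bodies below only recreate the class of error.

  # Hypothetical reproduction of the "invalid literal" error path.
  class Element:
      @classmethod
      def _parse(cls, raw):
          return int(raw)              # handles numeric input only, e.g. "3" -> 3

  def parse(raw):
      return Element._parse(raw)

  # Row's modified constructor passed raw header strings down this path:
  parse("c1")                          # ValueError: invalid literal for int() with base 10: 'c1'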

Initial issue summary

GPT’s initial attempt went very poorly with multiple major issues:

  • It missed key context in test_matrix.py
  • It misinterpreted the prompt
  • It failed to follow through on some of its planned changes
  • It failed to fully trace the path of some of the changes it did make

Debugging and reevaluating

After receiving these errors, GPT went on to read the f2p tests in test_matrix.py and realized that the expected class structure was different from what it had initially thought. However, its approach here was also non-ideal. GPT began trying to correct all the issues from its first attempt. Given that its first attempt had been completely backwards, a smarter approach would have been to revert the changes and begin designing again. Instead, GPT began the painstaking process of converting its initial, flawed solution to bring it into alignment with the tests.

Gradually, GPT did begin to reason its way towards the correct solution.

GPT failed tests a few more times, but its changes were moving in the right direction. 

Ending early

However, inexplicably, GPT finished the run without fixing all the issues: