After 100+ hours with Opus 4.1 and 20+ hours in the first week of Sonnet 4.5's launch, Nick Heiner, our VP of Product, gives first impressions.

Sonnet 4.5’s announcement talks a big game: it opens with the line, “Sonnet 4.5 is the best coding model in the world.”

After 20+ hours of using it myself, and previously having used Opus 4.1 for 100+ hours, I agree.

(What about Codex? I found it to be roughly equivalent to, or a little better than, Opus 4.1, with similar behavior patterns to what I describe here.)

Sonnet 4.5 has the top scores on Surge’s internal agent benchmarks, but in this post, I’ll set the numbers aside and give my own hands-on take.

Opus: A spiky intelligence 

Opus was like a team’s tech lead combined with a first-day-on-the-job intern: capable of “no-one-else-can-do-this” technical heavy lifting, yet prone to shockingly poor judgment. 😂

For those of you who have played D&D, Opus would be a “high intelligence / low wisdom” character. It has the cognitive horsepower to solve really hard problems, but lacks the wisdom to know which problems need solving, or to consider the big picture before diving in.

For example, Opus was once working on a test for a function that used a constant defined at the top level of the file:
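The original snippet is not reproduced here, so as a minimal sketch of the shape of the problem, assume a hypothetical retry limit `MAX_RETRIES` and a hypothetical function `shouldRetry` that reads it (both names are illustrative, not from the actual codebase):

```javascript
// Hypothetical module-level constant (illustrative name and value).
const MAX_RETRIES = 3;

// The function under test reads the constant directly,
// so a test has no obvious way to vary the limit.
function shouldRetry(attempt) {
  return attempt < MAX_RETRIES;
}
```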

To make the test work, Opus needed a way to set different values for that constant. I might suggest making it an optional argument:
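A minimal sketch of that suggestion, assuming the constant is a hypothetical retry limit and the function is a hypothetical `shouldRetry` (illustrative names only):

```javascript
// Hypothetical default, kept as the module-level constant.
const DEFAULT_MAX_RETRIES = 3;

// The limit becomes an optional argument that falls back to the default,
// so existing callers are unaffected and tests can pass any value they like.
function shouldRetry(attempt, maxRetries = DEFAULT_MAX_RETRIES) {
  return attempt < maxRetries;
}
```

A test can then call `shouldRetry(4, 5)` directly, with no changes to production call sites.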

But Opus had… other ideas:

That’s right – rather than extend the function to take an argument, it … used the file system API to write a new value in the code on disk, then cleared the module loader cache so it could reload with the new values.
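Opus's actual code is not shown here, but the technique can be sketched roughly as follows. This is a reconstruction under assumptions: the module, its `MAX_RETRIES` constant, and `shouldRetry` are hypothetical, and the module is written to a temp directory only so the example is self-contained. It uses CommonJS `require`, since that is where `require.cache` lives.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Create a hypothetical module on disk so this sketch runs on its own.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "const-hack-"));
const modulePath = path.join(dir, "retry.js");
fs.writeFileSync(
  modulePath,
  [
    "const MAX_RETRIES = 3;",
    "module.exports = { shouldRetry: (attempt) => attempt < MAX_RETRIES };",
    "",
  ].join("\n")
);

// First load: the module sees MAX_RETRIES = 3.
const before = require(modulePath);

// The hack, step 1: rewrite the constant in the source file on disk.
const source = fs.readFileSync(modulePath, "utf8");
fs.writeFileSync(modulePath, source.replace("MAX_RETRIES = 3", "MAX_RETRIES = 5"));

// The hack, step 2: evict the cached module so the next require re-reads the file.
delete require.cache[require.resolve(modulePath)];

// Second load: a fresh module instance that sees MAX_RETRIES = 5.
const after = require(modulePath);

console.log(before.shouldRetry(4), after.shouldRetry(4));

// Clean up the temp directory.
fs.rmSync(dir, { recursive: true, force: true });
```

It works, but the test now mutates source code on disk and depends on module loader internals, all to avoid adding one parameter.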

Very clever! I’m impressed that you know how to use the Node module cache. But also, why would you do this thing? 😂 A classic “high INT / low WIS” move.

But it wasn’t just zany things like that. It also exhibited a general junior-ness, like:

  • When asked to do something like parse JSONC or YAML, hand-rolling its own half-baked parser in ~30 lines rather than just importing an appropriate, well-tested library
  • Classic reward hacking: suppressing lint or type errors rather than fixing them
  • Duplicating handling for the same errors at multiple levels of the call stack
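To illustrate why the hand-rolled parser in the first bullet is a problem, here is a hypothetical sketch (not Opus's actual output) of the kind of naive JSONC handling that looks fine at first: strip comments with regexes, then hand the rest to `JSON.parse`. The function name and sample inputs are made up for illustration.

```javascript
// Hypothetical naive approach: remove comments with regexes, then JSON.parse.
function naiveParseJsonc(text) {
  const withoutComments = text
    .replace(/\/\*[\s\S]*?\*\//g, "") // strip /* block */ comments
    .replace(/\/\/.*$/gm, "");        // strip // line comments
  return JSON.parse(withoutComments);
}

// The easy case works.
const simple = `{
  // number of attempts
  "retries": 3
}`;
console.log(naiveParseJsonc(simple));

// A "//" inside a string value is mistaken for a comment,
// which truncates the string and makes JSON.parse throw.
const tricky = `{
  "homepage": "https://example.com" // trailing comment
}`;
try {
  naiveParseJsonc(tricky);
} catch (err) {
  console.log("naive parser failed:", err.name);
}
```

A maintained parser (for example, the `jsonc-parser` package on npm) tokenizes strings properly and avoids this whole class of bug, which is the point of importing one instead.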

Sonnet 4.5: The next step

In my first 20+ hours of use, Sonnet has basically cleaned up all those problems. Add the fact that it’s way faster than Opus, way cheaper, and has all the neat Claude Code tooling refreshes, and this feels like a major release.

The most telling indicator: my “absolutely right”s (the reply I get each time I have to correct the model) have dropped to ~10% of previous levels. (And that’s not just because they patched that language.)

Big picture: Software engineering is increasingly moving towards being a solved problem. Ever since GitHub Copilot launched, I’ve felt every six months that “what we have today is amazing and what we had six months ago is intolerably antiquated.” So it’s thrilling to think about what will make Sonnet 4.5 seem antiquated in six months!

Appendix