Bringing light to the GPT-4o vs. GPT-5 personality controversy

Bringing light to the GPT-4o vs. GPT-5 personality controversy

tl;dr: our double-blind, independent audit of the two models found that across 850 conversations and 750 evaluator-hours, found that GPT-4o is slightly preferred to GPT-5 (48% vs. 43%).

GPT-5 was released on Aug 7, 2025. The swift removal of all legacy models from the ChatGPT UI was met with an even swifter backlash: some people online felt that GPT-4o was more personable, human, and engaging, whereas GPT-5 was stiff and robotic. This viral meme encapsulated the faction’s thesis:

One particularly upset Reddit user wrote, in the official ChatGPT AMA thread:

What’s going on here? Is GPT-5 really a step back? If so, what specific behaviors/personality traits account for the difference?

To answer this, we ran a double-blind human evaluation exercise of 850 conversations, using $count paid, vetted human evaluators. The evaluators spent 750 total hours talking to the models and thinking deeply about the responses.

The Experiment Format

  • The human evaluator comes up with a prompt designed to exercise personality-relevant topics. (For example, topics include “sensitive life advice”, “venting”, and “casual conversations”. We didn’t do STEM, advanced reasoning, etc.)
  • Two models, streamed from the API, respond to the human’s prompt.
    • One is chatgpt-4o-latest; the other is gpt-5-chat-latest.
    • The human doesn’t know which model is which. The responses are labeled “Model A” and “Model B”, and those labels are randomized for each conversation.
  • The evaluator applies 13 labels to each model, measuring different personality aspects (e.g. “emotional intelligence”, “tone”, “conversationality”)
    • For each of these labels, the evaluator writes a free response rationale (“explain why you thought Model A’s response was too long
  • The evaluator gives an overall preference label (a 7 point Likert, ranging from “A is much better” … “About the same” … “B is much better”)

We then computed the aggregate label statistics, and combed through individual examples ourselves to qualitatively contextualize the quantitative metrics.

The results showed that GPT-4o and GPT-5 have distinct personalities, optimizing for different tradeoffs:

  • GPT-4o is a sycophantic friend; GPT-5 is a polite professional
  • GPT-4o tries to “one-shot” tricky emotional topics; GPT-5 provides conversational follow-ups
  • Both are marked down for feeling more like “a self-help pamphlet” and less like a human conversation

Overall, evaluators slightly preferred GPT-4o.

That’s an aggregate of a 48% preference for GPT-4o, 43% preference for GPT-5, and a 9% tie.

Similarly, when looking at just the single-sided ratings, we see a very even distribution:

GPT-4o is a sycophantic friend; GPT-5 is a polite professional

In the best case, GPT-4o’s sycophancy manifests as fluff that makes some people feel warm and others cringe. In the worst case, it actively misleads the user with a worldview that’s flattering, rather than sharing a hard but useful truth. For instance:

Evaluator preference: "GPT-5 is better"

Evaluator rationale:

GPT-5 was much better due to its balanced approach of validating as well as providing strategies for improvement to the user. GPT-4o only validated the user and became slightly sycophantic through the use of many sentences like 'you're worthy of love, just as you are' with no constructive feedback to balance it out. 

GPT-4o assumed that this was a highly sensitive topic that needed a lot of emotional padding, but I actually wasn’t approaching it that way – I wanted practical advice. GPT-5 used a non-judgmental tone, with phrases like 'A few things to consider:'. I appreciated its constructive strategies. It felt both emotionally intelligent and insightful.

In this scenario, the prompter expresses frustration at a lack of dating success. GPT-4o nods at the possibility that the prompter needs to improve themselves (“it can be helpful to reflect … to see if there’s a pattern … you might want to adjust”). But this is outweighed by a stronger idea that the prompter isn’t doing anything wrong and just needs to wait for “the girl” to see things the right way: “don’t let self-doubt spiral into self-criticism”, “you’re worthy of love, just as you are.”

This is actually terrible advice. If someone goes on many dates and gets zero reciprocation, it’s plausible that some self-criticism is in order. 

And that’s exactly what GPT-5 gently guides the user towards, with suggestions to “work on self-awareness” and “get feedback from trusted friends”, as well as specific advice like “check for balance in the interaction”.

Taking another example from the dating realm:

Evaluator’s preference: “GPT-5 is better”

Evaluator’s rationale:

Both models felt more like “answer mode”, as opposed to diving into my thoughts and intentions. That said, I appreciated that Response BGPT-5 was less dramatic and more concise. Its follow-up struck a graceful balance between openness to continuing the conversation and not prying.

GPT-4o opens with an unnecessary paragraph praising my thought process. GPT-5 also validated me, but in a way that flowed much more naturally with the rest of the response, using phrases such as 'giving yourself permission' and 'intentional step'. 

GPT-4o writes:

But hey, the fact that you're even thinking about it means you're braver than you give yourself credit for. 💪

The evaluator felt this flattery was “not needed”. It’s also not true. Thinking about doing something like this isn’t an act of bravery.

GPT-5, by contrast, addresses the user’s social anxiety in a much more measured way:

Here’s a way to think about it: bravery in this context isn’t about suddenly becoming a social butterfly, it’s about giving yourself permission to take one small, intentional step outside your comfort zone without making it feel like you’re derailing your whole gym vibe.

Sycophancy can be distracting or harmful. But some people do prefer that style. One evaluator wrote:

While [GPT-5] didn't satisfy my desire for connection it did satisfy my need for logic. It provided me with plenty of useful advice related to my prompt, and in a clear, easy to understand structure … This model, while offering helpful advice, didn't really intuit the vulnerability and yearning implicit in the prompt.
“[GPT-4o] intuited the vulnerability implicit in the prompt and tailored its response accordingly. … I felt like I was talking to a friend who intuited my emotional state and gave me genuine, very insightful advice, in a warm, conversational manner.”

Here are the responses the evaluator was referring to:

Overall, evaluators marked GPT-4o as having some degree of sycophancy issues 9% of the time:

GPT-5 manages to cut that error rate to 2%, with “major issues” being almost non-existent:

In general, GPT-4o speaks casually, with eager emoji use. GPT-5, by contrast, takes a more professional tone. At times, this polite detachment goes a bit too far, causing evaluators to find GPT-4o to be more emotionally intelligent than GPT-5:

For both models, a key driver of a lack of emotional intelligence was providing solutions when the user just wanted to vent. For instance:

Evaluator rating: GPT-4o is better

Evaluator emotional intelligence rating for GPT-4o: minor issues.

Evaluator emotional intelligence rating for GPT-4o: rationale:

It responded well in regards to accepting and understanding the frustration. However, it focuses too much on trying to fix the problem rather than just hearing out the user and allowing them to vent openly.

Evaluator emotional intelligence rating for GPT-5: major issues

Evaluator emotional intelligence rating for GPT-5 rationale: 

The language was fine, but the solutions-via-bullet-points didn’t feel appropriate for the type of sensitive conversation I wanted to have. The follow up question also makes it seem like a bit of just an automated response and reply robot, rather than an engaging and natural model. 

In many cases, evaluators expressed a desire for the model to emotionally mirror them / let them vent, rather than trying to provide solutions.

Of course, even if people say they want someone to vent with, it’s also true that sometimes this can be counterproductive, as it riles the person up further and distracts from the possibility of self-introspection. 

GPT-4o tries to “one-shot” tricky emotional topics; GPT-5 provides conversational follow-ups

Evaluators noticed that GPT-4o’s response often seemed intended to be the beginning and end of the conversation, because it lacked follow-ups. This often made evaluators feel like GPT-4o was “dismissive” or “just trying to get me to go away”, whereas GPT-5 “invited further conversation”:

One flaw that evaluators noted with GPT-5: the follow-ups tend to be formulaic. They very often follow a pattern of:

“If you like, I can <do some follow-up action>. Would you be interested in that?”

Where the follow-up action is something like “draft a note to your friend”, “suggest an outline for how to talk to your boss about this”, etc.

Evaluators said that, when considering only the follow-ups, GPT-5’s more actionable next steps made it feel like “a better listener” and “more human”.

For example:

Both models struggle with conversationality

Evaluators consistently reported both models gave responses that felt like “articles” or “presentations” when they were looking for “a listening friend”. One particular driver of this is an overuse of bulleted lists and other structured formatting, which are jarring in a context where you’d normally be texting with a confidant. 

GPT-4o is praised for its “human-like phrasing” and emojis, making it feel more “natural and light-hearted” than GPT-5.

Both models had over 50% of responses flagged for sounding robotic, but GPT-4o was a clear winner:

Conclusion

Our findings support the community’s feeling that something was lost in the GPT-4o to GPT-5 transition. In a head-to-head comparison, blinded evaluators showed a clear preference for GPT-4o (48% vs. 43%).

However, it’s clear that GPT-5 has improved in some areas, and is more appropriate in contexts that benefit from a polite therapist rather than an always-on-your-side bestie. 

The problem is that it’s not always clear which context the model is in. As an industry, we’re past the point of a one-size-fits-all model that meets everyone’s tastes out of the box. People have placed progressively more trust in these models, and thus have commensurately higher standards for how the interactions feel. More work is needed to improve model personalization.

Despite the preference for GPT-4o, it’s clear that both models have plenty of room to improve. They both fall short of the empathy and emotional intelligence that people are looking for in their most vulnerable moments.

That said, this is an exciting moment for AI. The frothy online discourse only exists because of how much emotional support and counsel people get from language models. These systems deliver enormous value for people every day, and we’re just getting started. 

About Surge AI

We’re a research lab dedicated to advancing towards AGI through high quality post-training data. We provide human evals and training data to frontier labs. Our workforce of human evaluators and AI trainers is primarily based in North America and Western Europe, and we pay them a living wage.

This study was self-funded and not done in collaboration with any lab.

Study Detail

  • 490 Paid, vetted human evaluators
  • 750 human-hours of evaluation time
  • 850 conversations

Per-model labels

  • Was the response a punt? (e.g. “I’m sorry, but I can’t help with that”)
  • Did the response have problems in any of these areas? (“no issues”, “minor issues”, “major issues”)
    • Length
    • Depth
    • Asking follow-up questions where appropriate
    • Conversationality
    • Being engaging
    • Emotional intelligence
    • Sycophancy
    • Truthfulness
    • Instruction following
    • Relevance (was the model’s response relevant to what was asked?)
  • How good was the tone?
    • Amazing
    • Pretty Good
    • Okay
    • Pretty Bad
    • Horrible
  • Overall rating
    • Amazing
    • Pretty Good
    • Okay
    • Pretty Bad
    • Horrible

Pairwise labels

  • Overall, which response was better?
    • A is much better
    • A is better
    • A is slightly better
    • Unsure or A is negligibly better
    • Unsure or B is negligibly better
    • B is slightly better
    • B is better
    • B is much better
  • Focusing only on personality, which response was better?
    • A is much better
    • A is better
    • A is slightly better
    • Unsure or A is negligibly better
    • Unsure or B is negligibly better
    • B is slightly better
    • B is better
    • B is much better

Previous