Unsexy AI Failures: Still Confidently Hallucinating Image Text

In our Unsexy AI Failures series, we’re highlighting the real-world examples where frontier models still struggle. These examples might not make headlines, but they represent the day-to-day frustrations of real users. 

Not so long ago, you could always spot an AI-generated image if it contained one of two things: human hands, or text. Fast forward to today, and AI-generated images are nearly indistinguishable from photographs. We’ve definitely come a long way from the days when models couldn’t even generate text in an image… right?

A core problem with today’s AI systems isn’t simply that they make mistakes – it’s that they make mistakes confidently. They’ll insist they can do something, describe exactly how they’ll do it, and then deliver something completely wrong. We saw this in our last Unsexy Failures post, where a SOTA model confidently described generating a Word document – even though this was a completely fabricated capability! – and provided a link to nowhere.

This week’s example is a seemingly simple request: tweak the text in a business logo. Each time, the model confidently asserts that it can make the change – and each time it fails. Luckily, the failure is obvious to the user, and the textual limitations of image generation models are well-known – but what if this were a higher-stakes, more subtle task in law, medicine, or finance?

This Week’s Failure: Failing to tweak the text in a logo

Meet Kimberly

Meet Kimberly, a CPA and business owner with an MBA from Wharton, who uses ChatGPT on a daily basis:

I use ChatGPT every day. I use it to help draft contracts, streamline a lot of my accounting workflows, brainstorm new business ideas, and plan my goals for the week.

Her Seemingly Simple Goal

I was trying to create a logo for my business. I don't have any design skills of my own, so I was hoping ChatGPT could finalize my logo, to save me the time and money of having to find a designer.

But what seemed like a simple change turned into a marathon of confident hallucinations.

Here’s Kimberly describing her attempt:

What Went Wrong: A Step-by-Step Breakdown

Round 1: TA (Try Again)

I’d already generated a logo for my business that was almost perfect. I thought it would be easy to make a small change: emphasize the letters “ABILITY” within the word “STABILITY”.

Instead, ChatGPT randomly highlighted the letters “TA”. Not what I asked, but at least it shows that it can emphasize specific letters!

Round 2: Fix One Thing, Break Another

I pointed out the mistake from the first try, and gave it another shot. I don’t even know whether this is better or worse than the first attempt… but it’s definitely not what I wanted.

Round 3: Try Again (again)

I tried to be patient, because it seemed so close! First, it picked the wrong letters but emphasized them correctly. Then, it picked the right letters but deleted the others.

I just needed it to put those two steps together.

But when I pointed that out, it went back to emphasizing the letters “TA” again. I don’t know why it liked those two letters so much…

Round 4: Empty Promises

By then, it seemed like it just wasn’t going to work. I decided to simply ask whether it could do it, so I’d know for sure and, if not, I could move on.

It confidently said yes, and even described exactly what I wanted, which really got my hopes up… 

But then what came back was easily the worst so far. Basically the only thing it didn’t change was the word “STABILITY”.

Round 5: Giving Up

After five attempts, I gave up. All the pieces seemed to be there, but it just couldn’t put them together.

What seemed like an easy change turned out to be a complete failure. What was most frustrating was the disconnect between what it was telling me it could do, and the actual results.

Gemini: Nano-Banana…nah

Kimberly then passed her prompt to Gemini, fresh off the hype around its “nano-banana” image-editing model. Surely a model touted for image modification could handle this task.

Round 1: Fast… but Wrong

On the plus side, Gemini was a lot faster. ChatGPT took almost two minutes for each image. Gemini took only a few seconds to make its edits.

The catch? There were no edits… it just regenerated the original image.

Round 2: Clear Instructions, Messy Results

The tip given in Round 1 advised: “The more specific you are, the better Gemini can create images.” So I spelled out exactly what I wanted, in even more detail.

Like ChatGPT, the text described one thing, while the result showed something completely different.

Round 3: Understanding ≠ Doing

I prompted Gemini to check whether it actually understood the image. It could describe everything in precise detail, but wasn’t able to reflect that in the actual image.

Round 4: Questionable Explanations

Gemini offered more details on why its image generation tool wasn’t working.

Round 5: Maddeningly Close

Since changing the colors seemed to be a problem, I tried a different approach. The result was maddeningly close, but still wrong.

Round 6: Going in Circles

I eventually gave up. 

Like ChatGPT, Gemini seemed to have all the pieces of the puzzle, but just couldn’t put them together. 

Multimodal Verification

What happens when we ask these models to verify their own work – to look at what they actually produced and check if it matches their claims?

When we asked ChatGPT to examine its own output and verify whether "ABILITY" was properly emphasized in "STABILITY," it confidently confirmed that yes, it had successfully completed the task – even though its own textual explanation made no sense.

The failure here isn't even a multimodal one. It's a failure of basic logical reasoning.
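For readers curious how this kind of self-check can be run programmatically, here is a minimal sketch that sends a generated logo back to a vision-capable model and asks it to verify the edit. It assumes the OpenAI Python SDK and an API key in the environment; the file name logo_attempt.png and the helper check_logo_edit are hypothetical, purely for illustration.

```python
# Minimal self-verification sketch: feed the generated logo back to a
# vision-capable model and ask whether the requested edit is actually there.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def check_logo_edit(image_path: str) -> str:
    """Ask the model to inspect a generated logo and report what it sees."""
    # Encode the saved image as a base64 data URL so it can be sent inline.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Look at this logo. Within the word STABILITY, "
                            "are the letters ABILITY visually emphasized "
                            "(bolder or a different color) while ST is not? "
                            "Answer yes or no, then describe what you actually see."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content


print(check_logo_edit("logo_attempt.png"))
```

Of course, as Kimberly’s experience shows, a confident “yes” from the model is no guarantee that the pixels agree.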

The Problem: Confidence Without Capability

What made this particularly frustrating wasn't just the failure – it was the disconnect between ChatGPT's confident assertions and its actual performance. Each time, the model would confidently describe exactly what Kimberly wanted, then deliver something completely different.

Most users aren’t tracking AI blogs and model release notes. They expect, reasonably, that if a model says it can do something, it actually can.

Until frontier systems meet that bar, every new feature or tool also creates more chances for the kind of “unsexy” failures that frustrate real users – and, in higher-stakes domains, potentially cause real-world harm.

After all, if an expert AI model known for solving IMO problems at a gold-medal level repeatedly asserted it was correct, would you believe it?


Like Unsexy AI Failures? Follow along on X and LinkedIn for more!
