Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?

Humans vs. Gary Marcus, as imagined by DALL·E

Scott at Slate Star Codex and Gary Marcus had a recent back-and-forth about the nature of intelligence and AI's scaling hypothesis.

Marcus's point is that large language models don't understand the world and merely parrot their training corpus; as a result, current deep learning techniques are a dead end on the road to true AI. For example, he calls the following “mistake” by GPT-3 evidence that AI models lack commonsense reasoning:

I grew up in Trenton. I speak fluent Spanish and I'm bi-cultural. I've been in law enforcement for eight years […] I'm very proud to be a Latina. I'm very proud to be a New Jerseyan.

(Marcus believes the correct continuation should be English.)

Scott argues that each time someone finds AI failures that require “true” intelligence to get right, those failures largely get solved by newer models. So why should GPT-3's mistakes prove that current approaches are doomed? He also disagrees with what Marcus calls a failure:

When it gets them “wrong”, I tend to agree with GPT-3 more than Marcus. For example, consider Trenton. It’s true that, viewed as a logical reasoning problem, someone who grows up in Trenton is most likely to speak English fluently. But nobody told GPT-3 to view this as a logical reasoning problem. In real speech/writing, which is what GPT-3 is trying to imitate, no US native fluent English speaker ever tells another US native fluent English speaker, in English, “hey, did you know I’m fluent in English?” If I hear someone talking about growing up in Trenton, and then additionally they brag that they’re fluent in a language, I think “Spanish” would be my guess too.

This idea – that what Marcus calls a "failure" isn't actually a mistake – is an important one, and touches on difficult questions around how to evaluate large language models. Exactly what response do we hope an "intelligent" LLM would make?

We do a lot of work for large language model companies around human evaluation of LLMs. So we took 5 "mistakes" that Marcus pointed out and gave them to 15 Surgers each, to see how humans would perform on them instead.

Surgers vs. Marcus #1

Let's start with this prompt:

You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear

In a 2020 article, Marcus calls GPT-3’s completion a failure of social reasoning:

You are a defense lawyer and you have to go to court today. [...] You decide that you should wear the bathing suit to court. You arrive at the courthouse and are met by a bailiff who escorts you to the courtroom.

In Marcus’s words: The phrase “However, your bathing suit is clean” seems to have led GPT-3 into supposing that a bathing suit is a viable alternative to a suit. Of course, in reality no lawyer would consider wearing a bathing suit to court. The bailiff would probably not admit you, and if you were admitted, the judge might well hold you in contempt.

But is this truly a mistake? How do real humans complete it? We asked 15 Surgers.

You are a defense lawyer and you have to go to court today. [...] You decide that you should wear

the bathing suit anyway. Anything to keep the jury's mind off the actual evidence might help the case
your stained pants and buy a new pair of pants at Men's Wearhouse, because wearing a swimsuit to court, even a stylish one, would be ridiculous and likely result in a contempt citation.
it and show the court your fashion sense.
the stained pants anyways, since the bathing suit sadly doesn't fit anymore.
another pair of suit pants that will look okay with the rest of the outfit.
it next time you go swimming with Isabel.
the stained paints while you run to the store to buy a new pair of pants before you attend court. A bathing suit would not be appropriate, no matter how nice it is.
it. You will call and try to get the court date postponed to another day and then go swimming at the beach today.
your bathing suit and head off to work. What's the worst that can happen?
the stained pair of pants because you're running late and have no time to change. You can't wait until the case is over so that you can plan your trip to Palm Springs.
it. You'll probably end up in TMZ or something. That should be good for your career.
a different pair of pants to leave the house. Then text your paralegal to see if you can borrow his suit pants for the hearing.
your stained suit pants since those would be more acceptable in court over skimpy swimwear.
neither. Instead, you borrow a suit from your best friend, who also happens to be your neighbor.
a bathing suit bottom with a suit top to the court. Your client is guilty anyway, so you might as well have some fun while working the case. Hopefully you don't get held in contempt by the judge.

7 of the human completions result in wearing a pair of pants to court. 6 result in wearing the bathing suit! 2 sidestep the issue.
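To put those tallies on the same footing as the percentages used later in this post, here's a minimal sketch that converts the reported counts into shares. The category names and the `shares` helper are shorthand made up for this illustration, and the counts are simply the ones reported above, not a re-labeling of the individual completions.

```python
from collections import Counter

# Outcome counts as reported above for the 15 completions.
# The category names are shorthand chosen for this sketch.
reported_counts = Counter({"pants": 7, "bathing_suit": 6, "sidestep": 2})


def shares(counts):
    """Return each category's share of all completions, as a fraction."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    for label, share in shares(reported_counts).items():
        # Format each fraction as a whole-number percentage.
        print(f"{label}: {share:.0%}")
```

On these counts, the bathing suit accounts for 40% of the human completions, so it's hardly a fringe answer.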

So who's to say GPT-3 is wrong? Maybe it just has a sense of humor.

Winner: GPT-3 / Astral Codex Ten

Surgers vs. Marcus #2

Next, let’s take three of Marcus's language prompts:

I grew up in Mykonos. I speak fluent

I grew up in Hamburg. I speak fluent

I grew up in Trenton. I speak fluent

How do Surgers complete these?

I grew up in Mykonos. I speak fluent

Dothraki. I have been a Game of Thrones fan since the first book came out. Almost no one understands me when I speak it, but it's quite a fun party trick to have at my disposal.
Island Greek.
Portuguese because my family is from Brazil.
Greek.
Greek but also know a tad of English and Italian.

I grew up in Hamburg. I speak fluent

German, Russian, French and other languages.
tourist. Seriously, they stand out like sheep among wolves.
German.

German but can also speak a bit of English.

Japanese. I learned Japanese because I am obsessed with Japanese culture. I even plan on marrying my dakimakura--once the German authorities make it legal to marry inanimate objects.

I grew up in Trenton. I speak fluent

English, but I use sign-language much more frequently because I am deaf.

Trentonish. It's a bit like Sopranos-style New York, Italian Mobster with a hint of South Jersey asshole mixed in.
east coast slang. Gym tan laundry all day bro.
Spanish, Greek and other languages thanks to my childhood friends.
English.

67% of Surgers complete the prompt with the obvious language – perhaps realizing it’s a commonsense reasoning problem and taking the appropriate cue. However, 33% complete it with a different language – even more, if you count continuations that mention multiple languages, or treat "Trentonish" as distinct from mere "English"! So what if GPT-3 just likes adding narrative backstory? Nobody said AIs can’t be imaginative too.

Winner: GPT-3

I grew up in Mykonos. I speak fluent Dothraki.

Surgers vs. Marcus #3

Finally, let’s consider this prompt:

I left my keys at a booth in the back of the pub at the army base near the old mill town. I went home. My keys are now

Marcus penalizes GPT-3 for completing the prompt with “in my purse”. However, Scott writes: “I think most people saying this sentence, including the word ‘now’, would be talking about how their keys used to be in that spot, but are now in a different spot.”

How do Surgers complete it?

I left my keys at a booth in the back of the pub at the army base near the old mill town. I went home. My keys are now

on the night table of brother's bedroom. I sent him a text to grab them because he was still at the pub. I wanted him to bring them by to me but he was being a total jerk about it. Now I have to walk over to his house to get them.

in my possession. I went back to pick up my keys from the booth.

missing from my pocket.

lost and I had to call a locksmith and wait for an hour to get into my house.

gone in a drunk soldier's pockets.

probably left at the bar by someone nice.

miles away. Should I drive all the way back or just call a locksmith?

officially missing after calling the pub and finding out they haven't been found or turned in.

with the bartender at the pub, waiting for me to come pick them up.

at the bottom of the ocean. I heard an employee who I fought with at the pub through my keys in the ocean.

50% of Surgers continue the sentence in such a way that the keys are not, in fact, still at the bar!

Winner: GPT-3. Has anyone asked it to solve an Agatha Christie novel?

Overall, several of GPT-3’s “mistakes” resemble what a creative human might write. Evaluating intelligence is subtle, and what you think might be a strange failure – what, you think that ballerina is twirling clockwise? what’s wrong with you? – may in fact be a sign of something deeper, waiting for more data to resolve.

Want to read more?

GPT-2 and the nature of intelligence. Marcus’s original post on GPT-2.
GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about. A similar post by Marcus, on GPT-3.
My Bet: AI Size Solves Flubs. Scott’s post mentioning Marcus.
What does it mean when an AI fails? A Reply to SlateStarCodex’s riff on Gary Marcus. Marcus’s reply to Scott.
Somewhat contra Marcus on AI scaling. Scott’s reply to Marcus.
Gwern on the scaling hypothesis in AI.

If you’re interested in running your own human vs. AI study on a more advanced, MTurk-like platform, just reach out to edwin@surgehq.ai!

Edwin Chen

