Evaluating ChatGPT vs. Google on 500 Search Queries

We measured ChatGPT vs. Google, and found that ChatGPT crushes Google on coding queries and ties it on general informational queries — despite not being optimized for a search experience at all. Dive into this post to learn more about OpenAI’s existential threat to Google.

First, ChatGPT stole my job. Then it stole Google's.

ChatGPT vs. Google

Imagine you want to recursively delete all the Python files in a folder. You turn to Google.

Unfortunately, the world's reigning search engine misunderstands your query!

Unimpressed with the incumbent, you switch to ChatGPT.

And ChatGPT’s response is perfect, concisely hyper-customized to your .py need.

It also provides an insightful tip you never thought to ask for. I recursively delete files all the time, and always create a backup directory first. I didn’t know a “-print” option even existed.

I can even follow up by asking ChatGPT to create an example playground.
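For readers who want to try the same task without a shell, here is a minimal Python sketch of the idea: preview the matching files first (the role “-print” plays with find), then delete. The function name `delete_python_files` and its `dry_run` flag are illustrative assumptions, not something taken from ChatGPT's answer, and the playground is a throwaway temp directory.

```python
import tempfile
from pathlib import Path


def delete_python_files(root, dry_run=True):
    """Recursively find *.py files under root; delete them unless dry_run."""
    matches = sorted(p for p in Path(root).rglob("*.py") if p.is_file())
    for path in matches:
        if dry_run:
            # Preview only: list what would be removed, touch nothing.
            print(f"would delete: {path}")
        else:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Build a throwaway playground so no real files are touched.
    with tempfile.TemporaryDirectory() as playground:
        base = Path(playground)
        (base / "pkg").mkdir()
        for name in ("a.py", "pkg/b.py", "pkg/notes.txt"):
            (base / name).write_text("")
        delete_python_files(playground)                 # preview
        delete_python_files(playground, dry_run=False)  # actually delete
        print(sorted(p.name for p in base.rglob("*") if p.is_file()))
```

Defaulting to `dry_run=True` is a deliberate choice here: the destructive path has to be requested explicitly.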

Hear that cry? It’s the whimper of 10,000 Googlers losing their onsite massages and mid-day volleyball games as Sundar whips his company into a Code Red.

Large Language Models and the Google-Killer

In many ways, ChatGPT is the definition of the future of Search. What good is a superintelligent AI if it can’t tell me the weather, suggest an exciting new restaurant for me to try, and summarize Lionel Messi’s backstory? Understanding the vastness of the web used to be Google’s killer technology, and now a small upstart is threatening it.

We all love an underdog. And Twitter has been running amok with examples where ChatGPT crushes Google.

https://twitter.com/jdjkelly/status/1598021488795586561

But is Twitter representative?

Large language models might be the Google-killer. Microsoft already has its hands deep in OpenAI; envision this technology in the hands of bolder startups like Neeva, You.com, and Kagi too.

But is Twitter merely showcasing the best 10 queries out of 10,000, or is ChatGPT’s dominance already widespread?

Evaluating language models and measuring search quality is our bread and butter at Surge AI. Let’s take a look.

Google vs. OpenAI: A Human Evaluation

To analyze ChatGPT’s performance against Google’s, we ran the following evaluation:

We asked 100 Surgers to pull up their search history, and extract their 5 most recent “informational” queries.
These queries also needed to be answerable pre-2022, since ChatGPT doesn’t have access to more recent knowledge (at least until it incorporates WebGPT!).
They reissued the same query on Google, and posed their query to ChatGPT too, possibly in a more conversational format.
Finally, they rated Google and ChatGPT’s performance, and compared the two experiences.

The results? Despite not being optimized for a search experience at all, ChatGPT already matches or slightly beats Google’s performance.

Surge AI search engine raters preferred ChatGPT on 42% of queries and Google on 40%.

ChatGPT already ties or slightly beats Google!
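To make the headline numbers concrete, here is a small sketch of how per-query preference labels roll up into those percentages. The per-query ratings are not published in this post, so the labels below are hypothetical: they assume 500 queries (100 Surgers × 5 queries each) and treat the remaining 18% as queries with no stated preference.

```python
from collections import Counter


def preference_shares(labels):
    """Return the percentage of queries on which each option was preferred."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}


# Hypothetical labels chosen to reproduce the reported 42% / 40% split.
labels = ["chatgpt"] * 210 + ["google"] * 200 + ["no_preference"] * 90
shares = preference_shares(labels)
print(shares)
```

Under these assumptions, the gap between the two engines is 10 queries out of 500, which is why the result reads as a tie or a slight edge rather than a rout.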

If we dive into each platform individually, we see that ChatGPT is rated more often at the extremes: more frequently Amazing, but also more frequently Bad.

ChatGPT is more frequently Amazing, but also more frequently Bad.

ChatGPT’s dominance becomes even more stark on a set of 100 coding-specific queries — where Google loses out 70% of the time!

ChatGPT smashes Google on 70% of coding queries

This time, it's not a whimper you hear. It's the smell of 10,000 Googlers buying fresh pants.

ChatGPT Wins and Failures

So where does ChatGPT crush it, and where does it still need work? Let’s look at some examples.

ChatGPT Win #1

Query: How do I make risotto?

Intent: I wanted step by step instructions for making risotto for my wife’s birthday.

Rating: ChatGPT was much better.

Google, full of empty filler and irrelevant ads.

Rating Explanation: “The issue with getting any cooking information from Google is that you inevitably are given a whole swath of recipe links and videos. While they can be helpful, not everyone writes recipes clearly, or they turn the whole thing into a rambling anecdote in the interest of search engine placement.

The Chatbot didn't have that issue.

I liked how it decided to even throw a little Italian into the conversation! It made it feel like it had a personality, as opposed to the Google results that are just columns of words and websites.

I am sure that google was largely accurate. However, there is SO much varied information on that page that I'm sure that some of it is either impractical, unusable, or otherwise flawed as well.

I feel like the Chatbot was better in terms of delivering a concise answer that anyone (myself included) could unambiguously digest. The Chatbot killed it comparatively, in terms of format. It's not even close. 100%, this was more pleasant.”

ChatGPT Win #2

Query: what is the difference between freezing rain and sleet?

Intent: I was trying to remember what the difference is between freezing rain and sleet. Living in Oklahoma, especially in the winter, I hear those terms a lot and I know one of them is much worse than the other because it can freeze on power lines.

Rating: ChatGPT was much better.

ChatGPT. A direct response, instead of skimming multiple articles for information.

Rating Explanation: “Google had great results from trustworthy media sites. I was able to find the answer to my query.

However, the AI gave a short, concise answer within seconds that explained the difference. The AI said that freezing rain "freezes on contact with surfaces" so I knew that was the one I want to watch out for. I asked the AI about the freezing rain bringing down power lines and the AI confirmed that it could.

I like the very quick response as opposed to clicking results from Google and skimming articles for the information.”

ChatGPT Win #3

Query: How to create a head and temp node for doubly linked lists in C

Intent: In my software engineering course, I was learning about doubly linked lists and needed to know how to create a head and a temp node for a doubly linked list.

Rating: ChatGPT was much better.

Google's first result, full of verbose filler.

ChatGPT. A personal tutor, created just for your search.

Rating Explanation: “The AI provided the steps on the whole topic in a short, concise, and simplified format. Google provided a link that did the same with illustrations, but this time with some annoying advertisements.

The whole conversational interface of OpenAI is really engaging and makes learning easier and even more interesting. With the AI, I could narrow down my search to get the specific results I needed, but this is not always possible with Google as Google keeps bringing up the same external links that contain most of the matching keywords in my search.”

ChatGPT Win #4

Query: how do I calculate the ply an lvl beam needs to be for a one story snow load house

Intent: I am planning to remove a load bearing wall from my house & I need to know what kind of beam would be sufficient to support the structure. I am looking for a simple answer that directs me towards a better understanding of the building codes surrounding support beams and how to safely continue with the project.

Rating: ChatGPT was much better.

Google, full of information, but synthesis is up to you.

ChatGPT, custom-synthesizing a simple, digestible solution from the vastness of its brain.

Rating Explanation: “While Google offered me ample amounts of information, which did eventually lead me to a helpful answer, ChatGPT simplified the results I was looking for.

For this particular query, Google had a tendency to overcomplicate the overall idea of the question being asked.

ChatGPT on the other hand did a good job of giving me the same helpful information in a more easily digestible manner; it summarized the main points that were necessary for further research and gave me a good starting point to continue studying.

Though Google overall provided a bit more information surrounding the topic, I think ChatGPT outperformed due to the manner in which it decided to format the answer.”

ChatGPT Fail #1

Query: how do you search for tweets from a specific date on twitter

Intent: I wanted to learn what steps I need to take to search for Twitter posts made on a specific calendar date (i.e. 01-01-2022). I wanted to learn how to use Twitter's advanced search function so that it would only show me tweets made on that specific date.

Rating: Google was much better.

Google gives an accurate, official answer.

ChatGPT hallucinates an inaccurate response.

Rating Explanation: “ChatGPT’s response is inaccurate when it states that you must, "go to the Twitter homepage and click on the "More options" button in the top right corner of the page." The "more options" button is not accessible from Twitter's homepage; the first step should be entering a query into Twitter's search bar, at which point it is possible to click "More options" or "Advanced search."

The ChatGPT response also states that searching for tweets using Twitter's search bar alone "may not allow you to search for tweets by date."

In actuality, it is possible to search for tweets by date using Twitter's search bar using the "since:yyyy-mm-dd / until:yyyy-mm-dd" format.

Google provides this information, and overall, its help is more extensive and accurate than the AI’s.”
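The "since:yyyy-mm-dd / until:yyyy-mm-dd" format the rater describes can be sketched as a small query builder. This is an illustration rather than an official Twitter API: `tweets_on_date_query` is a hypothetical helper, and it assumes that `until:` excludes the date it names, so a single day needs `until:` set to the following day.

```python
from datetime import date, timedelta


def tweets_on_date_query(terms, day):
    """Build a Twitter search-bar query restricted to one calendar day."""
    # Assumption: until: is exclusive, so it points at the next day.
    next_day = day + timedelta(days=1)
    return f"{terms} since:{day.isoformat()} until:{next_day.isoformat()}"


query = tweets_on_date_query("chatgpt", date(2022, 1, 1))
print(query)  # chatgpt since:2022-01-01 until:2022-01-02
```

The resulting string is meant to be pasted into Twitter's search bar.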

ChatGPT Fail #2

Query: how do you cite a book in mla format

Intent: The intent of my query is to learn how to cite a book using the Modern Language Association's formatting and style guidelines. I want to know what information I need to include in an MLA citation for a book, and in what order to write that information in. I also need to know how to format the MLA citation.

Rating: Google was much better.

Rating Explanation: “The AI response contains a few inaccuracies. It states that one of the pieces of information needed to write an MLA citation for a book is the "page number(s) you are citing (if applicable)." If one is writing an MLA citation for a book where they need to specify which page numbers they're citing (i.e. if they're only citing one chapter, one work from a selection of works, etc.), they must specify the title of the section they're citing. The AI's response does not mention this, and the example citation it provides is incorrect for this reason.

The AI response also places the book's title in quotation marks, but it should be italicized.

The AI response does not mention the need to specify a book's city of publication in many circumstances. The AI response formats the author's name incorrectly; it should be formatted as [Last name], [First name].

Finally, the AI response places the book's title before the author's name, but the author's name should come first.

Google's response was much more accurate than the AI's response.

I like the format of the AI's response - namely the fact that it provides an example citation instantly - but it's too inaccurate to be valuable.”
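As a rough illustration of the ordering the rater describes (author first as Last, First, then an italicized title), here is a simplified formatter for a single-author book. It is a sketch under stated assumptions, not a complete MLA implementation: the function name is hypothetical, asterisks stand in for italics in plain text, and section titles, page ranges, multiple authors, and city of publication are all left out.

```python
def mla_book_citation(first, last, title, publisher, year):
    """Format a basic single-author book citation in an MLA-like layout."""
    # Author name is inverted; *...* marks the italicized title.
    return f"{last}, {first}. *{title}*. {publisher}, {year}."


citation = mla_book_citation("Toni", "Morrison", "Beloved", "Alfred A. Knopf", 1987)
print(citation)  # Morrison, Toni. *Beloved*. Alfred A. Knopf, 1987.
```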

ChatGPT Fail #3

Query: What kind of dog is Brian Griffin?

Intent: Trying to find out what breed of dog the fictional character Brian Griffin in the series Family Guy is supposed to be.

Rating: Google was much better.

Rating Explanation: “Google actually retrieved the correct answer. Brian was confirmed to be a Labrador Retriever in a season 1 episode.

The chat AI claimed that there is no way to know what breed Brian is meant to be because he's fictional.

The conversational interface was okay, but I felt like the AI was kind of rude, even though I know that's not the intention.

The AI's last sentence "It is not clear what breed of dog Brian is meant to be, as he is a fictional character and not a real dog." made me feel like my question was stupid and not worth the AI's time to answer.”

ChatGPT Fail #4

Query: Which ETF has historically offered higher ROI?

Intent: The purpose of my search was to identify the ETFs that historically offered the highest rate of return. That will allow me to pinpoint an ETF that I could potentially invest in.

Rating: Google was much better.

ChatGPT, overly cautious and controlled.

Rating Explanation: “Google gave me a list with the top ETFs but did not specify to me exactly which one has the highest ROI.

ChatGPT gave me some very useful financial advice and advised me to be cautious.

However, it refused to answer my question for 2 reasons. Firstly because it doesn’t have access to historical data and secondly because it was programmed not to give financial advice. However, my question was not asking for advice but for access to well-documented data.”

Insights on ChatGPT vs. Google

In short, Surgers liked the following about ChatGPT:

ChatGPT Pros

Its ability to synthesize information from a variety of sources into a single, coherent whole – like your personal search engine assistant!
Its minimal interface – no more scrolling through 13 ads before you get to your risotto recipe.
Its ability to understand complex queries – like realizing that “largest active volcano in the mainland of the United States” should exclude volcanoes in Hawaii.

Of course, they also expressed negatives:

ChatGPT Cons

Hallucinations and inaccuracies – all expressed in a very confident, convincing form.
A lack of rich media – sometimes, of course, images, videos, and tweets are important, and the Internet is rich with media for a reason!

We also asked Surgers for insights after they interacted with ChatGPT for several days.

Surger #1

“ChatGPT is very useful in eliminating the need to visit multiple web pages to get a complete answer. I asked the bot questions related to programs like LaTeX (How to plot a piecewise function in LaTeX?) and Canva (How to add text to an image in Canva?). The answers in ChatGPT were better and more detailed than the answers Google provided!”

Surger #2

“I asked ChatGPT some questions related to stocks and economics. I found the answers to be more complete than the knowledge boxes Google provided for the same questions. It had about the same amount of information I would find after clicking the top Google result.”

Surger #3

“I vastly preferred Google for everything. The AI struggled with anything that wasn't very common knowledge. I asked it what it knew about Seven Stones Reef and it said it wasn't aware of such a place. It couldn't tell me how many times Tracy Chapman won the Grammy. I asked it about other buildings in the same style as The Barbican and it gave me names of buildings in completely different styles like the Sydney Opera House and a glass and wood building in Cardiff, Wales. When I asked for more information about the Sydney Opera House it gave me a completely different style that was still wrong and then refused to be corrected on it. It was able to tell me where Iguazu Falls is as well as where Tracy Chapman is from but when probing for further information on both of those, it didn't have much. Given the lack of correct answers, I was not impressed and wouldn't use it over a search engine in its current state.”

Surger #4

“The aspect of ChatGPT that I do like is the simple question answer function. Yes, it may be limited, but sometimes when you are looking for an answer to a question, you don’t need a hundred varieties of the same answer. One will do for basic understanding.

I think words and definitions do well in Chat. I also found the simple answer for “What is the square root of 9” a nice simple answer and simple explanation. Where Google when asked the same question, showed the calculator and answer “3”, but then several links explaining square roots. I found them to be much more complicated than the simple answer from the Chat.”

Google's Existential Threat

Of course, there are a million details in building a search engine that are needed to pull Google users away. As a general AI platform, OpenAI itself probably doesn’t care!

But the technology is now out there. Google’s 24 years of search expertise have been toppled by a new technology that soon even small startups will be able to use.

Newer search engines like Neeva, You.com, Kagi, and Bing already move faster, with the freedom to explore new products and UIs that Google can’t. In some domains, the reimagined search experiences they’re building already beat Google’s performance head-to-head!

Imagine AI models just as smart as Google’s – or smarter! – in their hungrier, more product-focused hands.

Google was supposed to be the killer AI company; for a while, it was. But with transformative language models racing forward, is its reign – and dominance as the world's smartest search engine – about to be usurped?

Want to build your own RLHF and InstructLLM models to match ChatGPT's intelligence? We help the top NLP and Search companies train the next generation of models on human feedback, in order to reach ever-astounding levels of performance. Reach out to team@surgehq.ai, and check out our other blog posts in the meantime!

AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust

Google Search is Falling Behind

How We Built OpenAI's GSM8K Dataset of 8,500 Math Problems

Human Evaluation of Large Language Models

Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality

Edwin Chen

Edwin oversees Surge AI's Engineering and Research teams — whether it's helping customers train large language models on human feedback, building content moderation algorithms to detect hate speech and spam, or scaling up an elite data labeling workforce. He previously led AI, Data Science, and Human Computation teams at Google, Facebook, and Twitter, and studied mathematics and linguistics at MIT.

