Real World ML Evaluations: Evaluating Facebook's Search Quality

Jefferson Lee
Jun 13, 2022

We’ve written about the decline of Google Search, and how to rigorously measure it through human evaluation. But why stop at Google? We’re starting a new blog series of real-world ML evaluations, where we investigate search and recommendation systems – whether YouTube, Meta, TikTok, or Amazon. In this post: an evaluation of Facebook Search!

Evaluating Facebook Search

I was hungry the other day while hanging out in SOMA in San Francisco, so I tried searching Facebook for "burritos near me"...

After all, so much of what we do with our friends and family revolves around eating. I discover amazing restaurants and recipes on Instagram and TikTok every day! So could Facebook maybe do a better job than Google?

"burritos near me", when searching from SOMA in San Francisco

Surprisingly, the first result is in the Richmond district – 45 minutes away!

The second is across the bridge in Emeryville...

The third is 8.6 miles away, also across the bridge, in Oakland.

Facebook's first 3 search results for "burritos near me"

Let’s take a closer look at the first result, Richmond Burritos. It doesn’t have any reviews. Why is Facebook recommending this at the top spot?

No reviews!

But hey, Facebook is smart, maybe it knows something I don’t. So let’s say I make the 45-minute trek to find the most mind-blowing burrito in town…

Only to discover that it’s closed.

Tacos beat burritos anyways

Of course, Local Search is hard – one of the most difficult categories for every search engine. It isn’t a pure information retrieval problem: you also need to account for real-world factors like:

  • Where am I? How far am I willing to travel?
  • Is the restaurant still open?
  • Does it have photos and reviews?
  • Do I have the correct metadata, like the phone number and address?
  • What if it’s a holiday?

So how well does the rest of Facebook's search engine perform? Let’s run a larger-scale study, evaluating Groups Search and Marketplace Search as well.

Search Quality Measurement through Human Eval

To measure Facebook’s Search Quality, we asked 500 Surge AI search raters to look through their browser histories and collect queries where their intent was to find:

  • A place they wanted to go to with their friends
  • A community they wanted to join
  • Something that they would be okay buying secondhand

They then reissued these search queries on Facebook, using Facebook’s Places, Groups, and Marketplace search verticals, and evaluated the quality and relevance of the results on a 5-point scale.

Here are the results:

Quality of Facebook Groups' Search

Quality of Facebook Marketplace's Search

Quality of Facebook Places' Search

How do the verticals compare against each other? Let’s turn the ratings into a numeric score, by mapping Horrible to -2, Pretty Bad to -1, Okay to 0, Pretty Good to +1, and Amazing to +2. Then we can calculate the mean score of each vertical:

Marketplace performs the best!
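The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the label-to-score mapping comes from the post, while the function name `mean_score_by_vertical` and the sample ratings are hypothetical and made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Label-to-score mapping as described in the post.
RATING_SCORES = {
    "Horrible": -2,
    "Pretty Bad": -1,
    "Okay": 0,
    "Pretty Good": 1,
    "Amazing": 2,
}


def mean_score_by_vertical(ratings):
    """Take (vertical, label) pairs and return {vertical: mean numeric score}."""
    scores = defaultdict(list)
    for vertical, label in ratings:
        scores[vertical].append(RATING_SCORES[label])
    return {vertical: mean(values) for vertical, values in scores.items()}


# Made-up ratings for illustration only; these are NOT the study's data.
example_ratings = [
    ("Marketplace", "Pretty Good"),
    ("Marketplace", "Amazing"),
    ("Groups", "Okay"),
    ("Groups", "Pretty Good"),
    ("Places", "Horrible"),
    ("Places", "Okay"),
]

print(mean_score_by_vertical(example_ratings))
```

A symmetric scale centered on 0 makes the sign of the mean easy to read: a positive mean says raters leaned toward "good", a negative mean says they leaned toward "bad".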

How good is the ranking? Interestingly, results further down the page do receive lower human eval scores, suggesting that Facebook’s search engine is indeed picking up real relevance signal.
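One way to check this kind of position effect is to average the numeric scores at each result position and see whether the means fall as you move down the page. The sketch below assumes each rating is stored as a `(position, score)` pair with position 1 as the top result. The function names and the sample data are hypothetical, not taken from the study.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_position(rated_results):
    """Take (position, score) pairs and return {position: mean score}, top result first."""
    by_position = defaultdict(list)
    for position, score in rated_results:
        by_position[position].append(score)
    return {pos: mean(vals) for pos, vals in sorted(by_position.items())}


def is_non_increasing(means_by_position):
    """True if the mean score never goes up as we move down the result list."""
    values = [means_by_position[pos] for pos in sorted(means_by_position)]
    return all(a >= b for a, b in zip(values, values[1:]))


# Made-up (position, score) pairs for illustration only; NOT the study's data.
example_results = [(1, 2), (1, 1), (2, 1), (2, 0), (3, 0), (3, -1)]

position_means = mean_score_by_position(example_results)
print(position_means)
print(is_non_increasing(position_means))
```

A strict monotonic check like this is crude. With real data, a rank correlation between position and score would be a more forgiving summary.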

Finally, let’s look at some example ratings.

Examples of Facebook Search Ratings

Marketplace Search

Search Rating #1

Search Rater: Christian W.

Search Query: cool sweatshirt

Search Intent: I want a cool, unique second-hand sweatshirt that I could wear around as casual wear. Bonus points if it's something that no one else would have.

"cool sweatshirt"

Rating: Horrible. It's a bunch of boring, regular sweatshirts that are apparently being sold for $12,000! I guess the point is that she's displaying all the sweatshirts she has and wants you to haggle for them but it’s still strange to see. I also don't like that there's nothing really cool about the sweatshirts in the first place. The picture is just generic Nike.

Groups Search

Search Rating #2

Search Rater: Sam E.

Search Query: magic the gathering

Search Intent: It would be fun to find groups for people who want to trade Magic: the Gathering cards or play Magic: the Gathering online games.

"magic the gathering"

Rating: Amazing. The group isn’t only relevant to the query, but also based in the county I live in. Thus I could also get in-person games as well as online games and remote trades.

Places Search

Search Rating #3

Search Rater: Katherine B.

Search Query: luxury hotel near me

Search Intent: I wanted to find a luxury hotel in my area. I wanted one where I could go for a little "staycation" or a place I could refer to my relatives.

"luxury hotel near me"

Rating: Horrible. The first result is a PET hotel. The second one is a used car dealership!

Insights

A deeper dive into the data would be interesting, but here are a couple of quick thoughts:

  • Is price, or price deviation from the average, used as a feature? A $12,000 item should never be showing up as the first search result – especially for a sweatshirt!
  • What kinds of business categorization and keyword relevance models is Facebook Search using? It should be easy to extract “hotel” intent from the “luxury hotel near me” query, and to classify the “Luxury Auto Sales” page as a car dealer, not a hotel. Why does the search fail here?
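To make the first question concrete, here is a minimal sketch of a price-deviation feature. It is purely illustrative and says nothing about how Facebook's ranking actually works. It compares each listing's price to the median of the result set, not the average mentioned above, because a single $12,000 listing would drag the mean upward. The function names, the `max_ratio` threshold, and the prices are all assumptions made for the example.

```python
from statistics import median


def price_ratio_to_median(prices):
    """Return each price divided by the median price of the result set."""
    typical = median(prices)
    if typical <= 0:
        raise ValueError("median price must be positive")
    return [price / typical for price in prices]


def flag_price_outliers(prices, max_ratio=10.0):
    """Flag listings priced more than max_ratio times the median (hypothetical threshold)."""
    return [ratio > max_ratio for ratio in price_ratio_to_median(prices)]


# Hypothetical prices for a "cool sweatshirt" result set; only the $12,000
# figure comes from the post, the rest are invented for illustration.
example_prices = [12000.0, 25.0, 30.0, 18.0, 40.0]

print(flag_price_outliers(example_prices))
```

A ranker could use the ratio as a feature or use the flag to demote listings. Either way, a placeholder price like the one in the sweatshirt example would stand out immediately.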

Summary

Interested in the full dataset of Facebook Search ratings, to do a more rigorous analysis? Shoot us a message at hello@surgehq.ai!

If you want to learn more about Search Quality Measurement and human evaluation, check out our other posts on this topic.

Jefferson Lee

Jefferson leads Surge AI's data labeling and NLP products — whether it's helping customers label their large language models, gather data to train Spam and Hate Speech classifiers, or run large-scale search evaluations. He was previously an early engineer on Airbnb's Trust and Safety ML team, and studied computer science at Harvard.
