We’ve written about the decline of Google Search, and how to rigorously measure it through human evaluation. But why stop at Google? We’re starting a new blog series of real-world ML evaluations, where we investigate search and recommendation systems – whether YouTube, Meta, TikTok, or Amazon. In this post: an evaluation of Facebook Search!
Evaluating Facebook Search
I was hungry the other day when hanging out in SOMA in San Francisco, so I tried searching for "burritos near me"...
After all, so much of what we do with our friends and family revolves around eating. I discover amazing restaurants and recipes on Instagram and TikTok every day! So could Facebook maybe do a better job than Google?
Surprisingly, the first result is in the Richmond district – 45 minutes away!
The second is across the bridge in Emeryville...
The third is 8.6 miles away, also across the bridge, in Oakland.
Let’s take a closer look at the first result, Richmond Burritos. It doesn’t have any reviews. Why is Facebook recommending this at the top spot?
But hey, Facebook is smart, maybe it knows something I don’t. So let’s say I make the 45-minute trek to find the most mind-blowing burrito in town…
Only to discover that it’s closed.
Of course, Local Search is hard – one of the most difficult categories for every search engine. It isn’t a pure information retrieval problem, and you need to account for real-world factors like:
- Where am I? How far am I willing to travel?
- Is the restaurant still open?
- Does it have photos and reviews?
- Do I have the correct metadata, like the phone number and address?
- What if it’s a holiday?
So how well does the rest of Facebook's search engine perform? Let’s run a larger-scale study, evaluating Groups Search and Marketplace Search as well.
Search Quality Measurement through Human Eval
In order to measure Facebook’s Search Quality, we asked 500 Surge AI search raters to look through their browser histories, and collect queries where their intent was to find:
- A place they wanted to go to with their friends
- A community they wanted to join
- Something that they would be okay buying secondhand
They then reissued these search queries on Facebook, using Facebook’s Places, Groups, and Marketplace search verticals, and evaluated the quality and relevance of the results on a 5-point scale.
Here are the results:
How do the verticals compare against each other? Let’s turn the ratings into a numeric score, by mapping Horrible to -2, Pretty Bad to -1, Okay to 0, Pretty Good to +1, and Amazing to +2. Then we can calculate the mean score of each vertical:
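For readers who want to reproduce this kind of scoring on their own ratings, here is a minimal sketch of the mapping and the per-vertical mean. The function name `mean_score_by_vertical` and the example ratings are made up for illustration; they are not the study's data or code.

```python
from collections import defaultdict
from statistics import mean

# Rating labels mapped to numeric scores, as described above.
RATING_SCORES = {
    "Horrible": -2,
    "Pretty Bad": -1,
    "Okay": 0,
    "Pretty Good": 1,
    "Amazing": 2,
}


def mean_score_by_vertical(ratings):
    """Average the numeric scores per vertical.

    ratings: iterable of (vertical, rating_label) pairs.
    (Hypothetical helper, not part of any published tooling.)
    """
    scores = defaultdict(list)
    for vertical, label in ratings:
        scores[vertical].append(RATING_SCORES[label])
    return {vertical: mean(values) for vertical, values in scores.items()}


# Made-up ratings for illustration only; NOT the study's data.
example_ratings = [
    ("Places", "Horrible"),
    ("Places", "Okay"),
    ("Groups", "Amazing"),
    ("Groups", "Pretty Bad"),
    ("Marketplace", "Pretty Good"),
    ("Marketplace", "Amazing"),
]

vertical_means = mean_score_by_vertical(example_ratings)
for vertical, score in vertical_means.items():
    print(f"{vertical}: {score:+.2f}")
```

A score above 0 means raters leaned positive on that vertical, and a score below 0 means they leaned negative.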
Marketplace performs the best!
How good is the ranking? Interestingly, results in lower positions do receive lower human eval scores, suggesting that Facebook’s ranking is indeed picking up on real relevance signal.
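One simple way to run this check yourself is to average the numeric scores at each result position and see whether the averages fall as you move down the page. The sketch below assumes each rated result comes as a (position, score) pair with position 1 at the top. The helper names and the toy data are hypothetical, not taken from the study.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_position(rated_results):
    """Average score at each result position (1 = top result).

    rated_results: iterable of (position, numeric_score) pairs.
    """
    by_position = defaultdict(list)
    for position, score in rated_results:
        by_position[position].append(score)
    # Return positions in page order so the means can be compared top to bottom.
    return {pos: mean(by_position[pos]) for pos in sorted(by_position)}


def is_non_increasing(values):
    """True if each value is no larger than the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))


# Toy data for illustration only; NOT the study's ratings.
toy_results = [(1, 2), (1, 1), (2, 1), (2, 0), (3, 0), (3, -1)]

position_means = mean_score_by_position(toy_results)
ranking_tracks_quality = is_non_increasing(list(position_means.values()))
```

With real data the trend will be noisier, so a strict monotonicity check like this is only a first look. A rank correlation would be a more forgiving summary.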
Finally, let’s look at some example ratings.
Examples of Facebook Search Ratings
Search Rating #1
Search Rater: Christian W.
Search Query: cool sweatshirt
Search Intent: I want a cool, unique second-hand sweatshirt that I could wear around as casual wear. Bonus points if it's something that no one else would have.
Rating: Horrible. It's a bunch of boring, regular sweatshirts that are apparently being sold for $12,000! I guess the point is that she's displaying all the sweatshirts she has and wants you to haggle for them but it’s still strange to see. I also don't like that there's nothing really cool about the sweatshirts in the first place. The picture is just generic Nike.
Search Rating #2
Search Rater: Sam E.
Search Query: magic the gathering
Search Intent: It would be fun to find groups for people who want to trade Magic: the Gathering cards or play Magic: the Gathering online games.
Rating: Amazing. The group isn’t only relevant to the query, but also based in the county I live in. Thus I could also get in-person games as well as online games and remote trades.
Search Rating #3
Search Rater: Katherine B.
Search Query: luxury hotel near me
Search Intent: I wanted to find a luxury hotel in my area. I wanted one where I could go for a little "staycation" or a place I could refer to my relatives.
Rating: Horrible. The first result is a PET hotel. The second one is a used car dealership!
A deeper dive into the data would be interesting, but a couple quick thoughts:
- Is price, or price deviation from the average, used as a feature? A $12,000 item should never be showing up as the first search result – especially for a sweatshirt!
- What kinds of business categorization and keyword relevance models is Facebook Search using? It should be easy to extract “hotel” intent from the “luxury hotel near me” query, and to classify the “Luxury Auto Sales” page as a car dealer, not a hotel. Why does the search fail here?
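On the price question, here is a rough sketch of what a "deviation from typical price" feature could look like. It divides each listing's price by the median price in the result set and flags anything far above it. We don't know what Facebook actually uses. The helper names, the `max_ratio` threshold of 10, and every price except the $12,000 sweatshirt listing are assumptions made up for illustration.

```python
from statistics import median


def price_deviation(prices):
    """Return each price as a multiple of the result set's median price."""
    typical = median(prices)
    if typical <= 0:
        # Degenerate case (e.g. all free listings): treat nothing as deviating.
        return [1.0 for _ in prices]
    return [price / typical for price in prices]


def flag_price_outliers(prices, max_ratio=10.0):
    """Flag listings priced more than max_ratio times the median.

    max_ratio is an arbitrary illustrative threshold, not a tuned value.
    """
    return [ratio > max_ratio for ratio in price_deviation(prices)]


# The $12,000 figure comes from the rating above; the other prices are made up.
toy_prices = [12000.0, 25.0, 30.0, 18.0, 40.0]

ratios = price_deviation(toy_prices)
flags = flag_price_outliers(toy_prices)
```

The median is used rather than the mean so that a single extreme listing doesn't drag the "typical" price up with it. A ranker could use the ratio as a feature or simply demote flagged listings.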
Interested in the full dataset of Facebook Search ratings, to do a more rigorous analysis? Shoot us a message at firstname.lastname@example.org!
If you want to learn more about Search Quality Measurement and human evaluation, check out our other posts on this topic.