Human Evaluation of Search Quality

Search Ranking and Measurement with Neeva


Company: Neeva - Ad-free, private search

Industry: Search Engine

Features used: Custom Labeling Teams, Dataset Generation, Quality Controls, Template Editor, API, Search Evaluation, Dashboards


Neeva is the world’s first ad-free, private search engine, founded by a legendary team of ex-Googlers.

To succeed, Neeva needs to understand what users think of its search engine and how its capabilities stack up against incumbents. Rather than relying on proxy metrics like clicks, Neeva wanted to gather this data directly by running search evaluations where users rate the quality and relevance of Neeva’s search rankings.

The Problem?

High-quality search evaluations are notoriously hard for companies to run on their own. As Neeva puts it, “when you’re building a search engine, evaluation is one of the most important and tricky components to get right. Search is a very human need, and so you need unbiased human raters to tell you how well you’re doing.”

Fortunately, we specialize in Search Evaluation at Surge AI. Our team has decades of experience working on Search Quality at Google, YouTube, Microsoft, Twitter, and Facebook.
We know firsthand the importance of high-quality evaluations for training and measuring Search Quality, and the complexity of running them at scale.

We built Surge AI to make it easy for every company to run the same search evaluations that Google depends on (and many other data labeling use cases too!).

After discussing Neeva's needs with their search ranking team, we designed, ran, and delivered a series of search evaluations, including personalized search evaluations, vertical-specific evaluations, and side-by-side evaluations comparing Neeva against Google.

Let’s break our process down into three phases — building a team of Neeva search evaluators, generating and labeling data, and final quality checks.

Custom Data Labeling Teams

One of Neeva’s goals was to evaluate their search engine on a specific domain: technical programming queries. Not everyone has the domain expertise to understand what makes a good search result for a query about debugging TensorFlow, so we built a custom labeling team of Surgers with software engineering backgrounds.

These custom teams are a key feature of the Surge AI platform: they ensure that only Surgers with the required skills work on a particular project, and that your labeling team learns the nuances of your tasks as they stay with you over time.

Search Evaluation Design

While most data labeling tasks involve pre-existing data that needs to be labeled, our Search customers often need datasets created from scratch. In these cases, Surgers must gather and label data, resulting in a highly-customized, one-of-a-kind dataset.

For Neeva’s Search Evaluation project, we asked Surgers to do the following:

  1. Pick a programming query that they had recently searched for (e.g., “python split string into characters”)
  2. Explain the query intent
  3. Search for the query on Google
  4. Rate the Google search result page on a 5-point Likert scale
  5. Explain their rating
  6. Search for the query on Neeva
  7. Rate the Neeva search result page on the same 5-point Likert scale
  8. Explain their rating
  9. Compare Google vs. Neeva on a 5-point Likert scale
  10. Explain their rating
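
The steps above amount to collecting one structured record per query. As a rough sketch, such a record might look like the following; the class and field names are our own illustration, not Neeva’s or Surge’s actual schema:

```python
from dataclasses import dataclass


@dataclass
class SideBySideEvaluation:
    """One rater's side-by-side evaluation of a single query.

    All names here are illustrative assumptions, not a real schema.
    """

    query: str                  # e.g. "python split string into characters"
    intent: str                 # rater's explanation of what the query asks for
    google_rating: int          # 1-5 Likert rating of Google's result page
    google_explanation: str
    neeva_rating: int           # 1-5 Likert rating of Neeva's result page
    neeva_explanation: str
    comparison: int             # 1-5; assumed: 1 = Google much better, 5 = Neeva much better
    comparison_explanation: str

    def __post_init__(self):
        # Guard against out-of-range Likert values at ingestion time.
        for value in (self.google_rating, self.neeva_rating, self.comparison):
            if not 1 <= value <= 5:
                raise ValueError(f"Likert ratings must be in 1..5, got {value}")
```

Pairing every numeric rating with a free-text explanation, as in the steps above, is what lets reviewers audit whether a rating was thoughtful rather than arbitrary.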

Thoughtful Quality Controls

As part of this process, we created a series of quality controls to ensure that our search engine raters were performing well. These included custom search rating examinations, in which Surgers rated a series of <query, search result> pairs. We measured their responses and read through their rating explanations to ensure that their judgments were thoughtful and sound.
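
One common way to score such a rating examination (a minimal sketch of the general technique, not Surge’s internal tooling; the tolerance and pass threshold are illustrative assumptions) is to compare each rater’s answers against gold ratings, allowing small disagreement on an ordinal scale:

```python
def exam_score(responses, gold, tolerance=1):
    """Fraction of exam answers within `tolerance` points of the gold rating.

    A tolerance of 1 is a reasonable (assumed) choice for an ordinal
    5-point Likert scale, where adjacent ratings often reflect honest
    disagreement rather than carelessness.
    """
    assert len(responses) == len(gold), "one response per gold item"
    hits = sum(abs(r - g) <= tolerance for r, g in zip(responses, gold))
    return hits / len(gold)


def passes_exam(responses, gold, threshold=0.8):
    """Whether a rater's exam score clears an (assumed) pass threshold."""
    return exam_score(responses, gold) >= threshold
```

For example, a rater answering `[5, 3, 1, 4]` against gold ratings `[4, 3, 2, 1]` scores 0.75 and would not clear a 0.8 threshold; their written explanations would then get a closer manual read.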


Our high-quality, comprehensive search evaluations gave Neeva key insights into their performance against Google, as well as training data for their models.

For example, Neeva learned that when it outperforms Google, 80% of the time the win comes from Neeva’s search widgets and answer features; when Google beats Neeva, it is most often on long-tail queries.
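
Findings like these fall out of straightforward aggregation over the side-by-side ratings. A hedged sketch (the 5-point encoding and reason categories are our assumptions, not Neeva’s reported methodology):

```python
from collections import Counter


def summarize(comparisons, reasons):
    """Tally side-by-side outcomes from 5-point comparison ratings.

    Assumed encoding: 1-2 favor Google, 3 is a tie, 4-5 favor Neeva.
    `reasons` holds one reason category per rating (e.g. "widgets"
    or "long-tail"); we count reasons only for Neeva wins, mirroring
    a "why did we win?" breakdown.
    """
    outcomes = Counter()
    neeva_win_reasons = Counter()
    for rating, reason in zip(comparisons, reasons):
        if rating <= 2:
            outcomes["google_wins"] += 1
        elif rating == 3:
            outcomes["tie"] += 1
        else:
            outcomes["neeva_wins"] += 1
            neeva_win_reasons[reason] += 1
    return outcomes, neeva_win_reasons
```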

Insights like these allow Neeva to better understand where they are succeeding, and where they need to focus their efforts next. These search evaluations also inspired a range of additional evaluation projects that Neeva and Surge are now partnering on to uncover additional insights.

“Huge thanks to @echen and @HelloSurgeAI for the incredible work they are doing on search human eval.” — Vivek Raghunathan (@vivek7ue) on Twitter
