Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM?
Jul 19, 2022
Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality
Jun 29, 2022
Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?
Jun 22, 2022
Is Elon right? We labeled 500 Twitter users to measure the amount of Spam
May 19, 2022
Holy $#!t: Are popular toxicity models simply profanity detectors?
Jan 22, 2022
Is Google Search Deteriorating? Measuring Google's Search Quality in 2022
Jan 10, 2022
5 Examples of the Importance of Context-Sensitivity in Data-Centric AI
Nov 19, 2021