Blog
How Anthropic uses Surge AI to Train and Evaluate Claude
HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors
30% of Google's Emotions Dataset is Mislabeled
How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems
Holy $#!t: Are popular toxicity models simply profanity detectors?