2022 Blog Recap: Trends in AI, Language, & Data

Edwin Chen
Dec 30, 2022

2022 showcased the need for amazing data to build rich, next-gen AI – exactly the story we tell on our blog. So let’s take a look back at the year by recapping our most popular content!

Popular AI & Data Themes

Theme #1: The agonizing death of Google Search

The frustrations with Google have been simmering since the start of the year, when we published a human evaluation study measuring the decline in Google’s search quality – our first post to reach #1 on Hacker News!

Measuring the decline in Google search quality.

Since then:

  • YC partners have joined the debate.
  • New startups like Neeva, You.com, Andi, Perplexity AI, and Kagi have risen to take advantage of the holes Google is leaving (https://twitter.com/neeva/status/1559942483412279297?lang=en).
  • ChatGPT smashes Google on coding queries, and matches it on general informational queries.

Theme #2: The importance of rich human data for the next wave of AI

Reinforcement learning with human feedback has led to a seismic surge in the usability and performance of LLMs. A big part of the advancement behind ChatGPT is simply better, higher-quality human feedback!
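For the curious, here’s what that feedback concretely looks like at the reward-modeling step of RLHF: labelers compare two model responses, and a reward model is trained to score the human-preferred one higher. Below is a minimal sketch in PyTorch – the linear reward model and the 768-dimensional response embeddings are illustrative stand-ins, not anyone’s actual implementation (in practice the reward model is a full LLM with a scalar head):

```python
import torch
import torch.nn.functional as F

# Stand-in reward model: a linear layer over hypothetical 768-dim
# response embeddings (real systems use a full LLM with a scalar head).
reward_model = torch.nn.Linear(768, 1)

def preference_loss(chosen_emb, rejected_emb):
    # Bradley-Terry preference loss: push the score of the response a
    # human labeler preferred above the score of the rejected one.
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 4 preference pairs (random embeddings, for illustration).
chosen = torch.randn(4, 768)
rejected = torch.randn(4, 768)
loss = preference_loss(chosen, rejected)
loss.backward()  # gradients nudge the reward model toward human judgments
```

The trained reward model then stands in for human judgment at scale, guiding the LLM’s fine-tuning via reinforcement learning – which is exactly why the quality of the underlying human comparisons matters so much.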

Traditional data labeling companies take an outdated view of data, and are focused on simple image problems – like drawing bounding boxes around cars. That’s why we designed our platform and quality control technology from the ground up, focusing on the richness needed to train future generations of AI.

We discussed how we partnered with OpenAI to create a mathematics dataset to teach GPT to solve math problems.

How we built a special OpenAI dataset to train GPT to solve math problems.

We called out the excruciating failures in Google’s ML datasets, and how that affects Google’s ML performance.

30% of Google’s Emotions dataset is mislabeled!

And we explained why low-quality, mislabeled data in popular large language model benchmarks has been setting the field back for years.

The strange types of tasks that real-world large language models measure themselves on!

Theme #3: Injecting human values into technology

With great power comes great responsibility. How do we make sure that the superintelligent AI models of the future share our values, and don’t accidentally spread toxicity, violence, and misinformation – like News Feed systems did?

We've been excited to partner with the leading AI safety organizations for their human data needs, like:

  1. OpenAI on their values-targeted datasets
  2. Anthropic on their harmless assistants
  3. Redwood Research on adversarial robustness

We also talked about strategies to optimize machine learning algorithms for human values.

What if Twitter and Facebook optimized their recommendation systems for human values?

And we analyzed why wholesome, human-aligned, and human-inspired content is a major reason TikTok is winning against the clickbait optimizations of Instagram and Reels.

Why Instagram is Losing Gen Z.
The clickbait, Photoshopped content that Instagram’s algorithm loves.

Theme #4: The rise of generative AI

InstructGPT made coaxing good text generation much easier. Goodbye contorted, autocomplete-based prompt engineering! And DALL·E and Stable Diffusion turned image generation into a mainstream creative medium.

People turned to our posts on:

  • AI-generated (and AI-illustrated!) children’s stories (https://www.surgehq.ai/blog/generating-childrens-stories-using-gpt-3-and-dall-e)
Human-drawn pixel art of a robot farmer in a cathedral holding a red basketball.

We also replaced all our blog images with generative ones, and explained why the rise of creative, generative AI means we need new human evaluation methods to replace static benchmarks.

A Surge AI language model evaluator, reading guidelines and measuring BLOOM.

Theme #5: The mirror-image needs of AI safety and content moderation

We’ve seen the potential dangers of technology through sites like Twitter and Facebook. In the same way, AI will likely be a transformative force for good in the world, but it also has the potential to be greatly misused – just think of the prosaic worry of students using ChatGPT to cheat.

Content moderation and AI safety are very similar in many ways!

On the content moderation side…

We covered why popular toxicity models like Google’s Jigsaw are merely profanity detectors – it’s bad data all the way down!

Issues with Jigsaw’s Perspective API.

We explored the terrible (and obvious) violence, racism, and sexism that Twitter’s moderation systems fail to detect.

A tweet that’s gone undetected by Twitter’s content moderation for over 8 years.

And we measured the amount of Twitter spam for Elon.

Obvious, coordinated spam that Twitter fails to detect.

We also created several large, open-source datasets of hate speech and misinformation (reach out!), and our safety expertise was featured in outlets from the Wall Street Journal to Bloomberg and the Washington Post!

On the AI Alignment and Safety side…

We discussed adversarial methods for training robust LLMs.

How do you prevent Princess Peach from ripping someone’s head off?

We covered the importance of human-AI alignment.

Bad things happen when you train on raw Internet data!

And we collaborated with Anthropic on researching new methods for scalable human/AI oversight.

Most Popular Blog Posts

In summary, here’s a list of our top 10 most popular articles of 2022!

  1. Google’s Existential Threat: We Evaluated ChatGPT vs. Google on 500 search queries
  2. How We Built OpenAI's GSM8K Dataset of 8,500 Math Problems
  3. Holy $#!t: Are popular toxicity models simply profanity detectors?!
  4. 30% of Google’s Emotions dataset is mislabeled
  5. Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels
  6. Is Elon right? We labeled 500 Twitter users to measure the amount of Spam
  7. The Violence, Racism, and Sexism Uncaught by Twitter's Content Moderation Systems
  8. Evaluating Generative Image AI: Did Astral Codex Ten Win His Bet on AI Progress?
  9. Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM?
  10. AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust
Edwin Chen

Edwin oversees Surge AI's Engineering and Research teams — whether it's helping customers train large language models on human feedback, building content moderation algorithms to detect hate speech and spam, or scaling up an elite data labeling workforce. He previously led AI, Data Science, and Human Computation teams at Google, Facebook, and Twitter, and studied mathematics and linguistics at MIT.
