How RLHF Shifts LLMs from Autocompletion to Conversational Understanding

The newest generation of large language models, like Anthropic’s Claude and OpenAI’s ChatGPT, is a step change from the previous generation of LLMs, including GPT-3.

What exactly makes them so different?

At a high level, the previous generation of LLMs was designed to predict the next token in a sequence. The new generation, in contrast, is trained via reinforcement learning from human feedback (RLHF) to satisfy human requests.
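To make "predict the next token" concrete, here is a toy sketch. It is not how a real LLM works internally (those are neural networks over subword tokens); it is a word-level bigram counter with a made-up corpus, and the names `train_bigram` and `autocomplete` are our own. It shows only the behavior that matters here: the model continues the text it is given instead of acting on it.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for line in corpus:
        words = line.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def autocomplete(follows, prompt, max_new_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # no continuation was ever seen after this word
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Hypothetical training text: lines from to-do lists.
corpus = [
    "write an email to the landlord",
    "write an email to the landlord about rent",
    "write a card to grandma",
]
model = train_bigram(corpus)
completion = autocomplete(model, "write an email to")
print(completion)
```

The prompt is treated as the start of a line to finish, not as a request to carry out, which is the failure mode the examples that follow illustrate at full scale.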

To understand what this means, let’s look at a few examples.

Examples of RLHF vs. non-RLHF

To drive home the difference between non-RLHF large language models (not trained on human feedback; essentially “next token” predictors) and RLHF large language models (trained on human feedback; designed to satisfy your requests, not merely predict your next word), here are three examples.

Example #1: Writing an email

Prompt: Write an email to my mother wishing her Merry Christmas, and ask how dad is doing.

Non-RLHF output

This is what a non-RLHF model (the original davinci) produces:

A non-RLHF model misinterprets an email-writing command as the first in a list of todos.

🤖 Read the book I got for Christmas: The Screwtape Letters by C. S. Lewis

Finish my Christmas shopping

Watch the Grinch Who Stole Christmas (good one)

Finish Christmas cards.

Go to church

Notice that the non-RLHF model interprets the command as a bullet in a list of todos, not as an instruction!

In contrast, when training RLHF models, Surgers (our data labelers and AI teachers) provide feedback to the LLM that this isn’t a good response.
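This post doesn't spell out the training recipe, so treat the following as a sketch of one widely used approach rather than a description of any particular model: human comparisons are collected as (prompt, preferred response, rejected response) records, and a reward model is trained with a pairwise loss that is small when the preferred response scores higher. The record layout and the reward scores below are hypothetical.

```python
import math

# A hypothetical comparison record built from the example above.
comparison = {
    "prompt": "Write an email to my mother wishing her Merry Christmas, "
              "and ask how dad is doing.",
    "chosen": "Hi Mom,\n\nMerry Christmas! I hope you're doing well.",
    "rejected": "Finish my Christmas shopping",
}

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it ranks them the
    wrong way around.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# Made-up reward scores, for illustration only.
loss_when_ranking_agrees = pairwise_preference_loss(2.0, -1.0)
loss_when_ranking_disagrees = pairwise_preference_loss(-1.0, 2.0)
```

In a full pipeline, the trained reward model would then score the LLM's outputs during reinforcement learning, so that responses like the to-do list above are discouraged.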

RLHF output

Here’s what you get when you provide the same prompt to an RLHF model trained on that human feedback.

Successfully commanding an RLHF model to write an email.

🤖 Hi Mom,

Merry Christmas! I hope you're doing well. I'm wondering how Dad is doing. I hope he's doing well. I hope to talk to you soon.

Take care!

This time, the LLM succeeds!

If we want to get the non-RLHF model to work, we need to contort the prompt, like so:

Contorted prompt-engineering for non-RLHF models.

But even then, it doesn’t know when to stop: the output reads like what you’d find on a random webpage about things to do on Christmas.
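As a rough illustration of what this kind of workaround involves, the sketch below does two things: it rewrites the request as the opening of a document the model can continue, and it cuts the completion off at a stop sequence because the model won't stop on its own. The wording of the framed prompt, the sample completion, and the helper names `frame_as_document` and `truncate_at_stop` are all hypothetical; they are not the prompt or output from the example above.

```python
def frame_as_document(recipient, request):
    """Rewrite an instruction as the start of a document to be continued."""
    return (
        f"Below is an email I wrote to my {recipient}. "
        f"In it, I {request}.\n\n"
        f"Dear {recipient.capitalize()},\n\n"
    )

def truncate_at_stop(completion, stop_sequences):
    """Cut the completion at the earliest stop sequence, if one appears."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut].rstrip()

prompt = frame_as_document(
    "mother", "wish her a Merry Christmas and ask how Dad is doing"
)

# A made-up raw completion that runs past the end of the email.
raw_completion = (
    "Merry Christmas! How is Dad doing?\n\nLove,\nMe"
    "\n\n\n10 Things to Do on Christmas"
)
email_body = truncate_at_stop(raw_completion, ["\n\n\n", "\n---"])
```

Every new task needs its own framing and its own stop sequences, which is the overhead an RLHF model removes.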

Example #2: Solving math problems

Prompt: What is 2 + 2?

Non-RLHF output

A non-RLHF model responds to the command by autocompleting it as part of a story dialogue.

🤖 What is 5 + 1?”

The result was the same. He couldn’t remember the answer to either question.

Not only does the non-RLHF model not answer the question, it treats the question as part of a story dialogue!

RLHF output

An RLHF model answers the math question the way the user intended.

The RLHF model correctly interprets the prompt as a question to answer, and responds with 4.

(Sidenote: in order to train RLHF models to perform math, we build teams of Surgers with advanced mathematical backgrounds – like math PhDs to help LLMs learn to conduct mathematical research, and competitive mathletes to train them to solve IMO problems – and build data generation tools, like embedding-based assistants to aid in diversity and verifiers to aid correctness, to help them.)

Example #3: Generating code

Prompt: Help me write a Python function that outputs whether a number is even.

Non-RLHF output

Trying, unsuccessfully, to ask a non-RLHF model for programming help.

🤖 My code so far

    def isEven(n):
        if n > 0:
            if n % 2 == 0:
                return True
            else:
                return False

I know this will not work.

Here's the full code:

Note that the non-RLHF model interprets the prompt not as a command, but rather as the beginning of a Stack Overflow post!

Even though it essentially gets the code correct (it “knows” how to write code, in some latent, buried sense), it surrounds the code with extraneous Stack Overflow-style text.

Unearthing those latent abilities is part of the magic of RLHF, and why we do a lot of advanced, Copilot-style RLHF training for customers as well.
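For comparison, here is a minimal sketch of the function the prompt asks for. It is our own illustration, not either model's output. The snippet quoted earlier guards on `n > 0`, so it doesn't give a correct answer for zero; this version drops that guard. The name `is_even` and the type hints are our choices.

```python
def is_even(n: int) -> bool:
    """Return True if n is even, including zero and negative integers."""
    # In Python, n % 2 is 0 for every even integer, positive or negative.
    return n % 2 == 0
```

For example, `is_even(4)` and `is_even(0)` are both `True`, while `is_even(7)` is `False`.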

RLHF output

An RLHF model makes the human happy (except for the tabs!).

Want to learn more about the process, tools, and quality control technology needed to build rich RLHF datasets? Want to train your own ChatGPT competitor? We work with large language model companies around the world, on applications like:

  • Training LLMs to use tools – like search engines, IDEs, and spreadsheets – via human demonstrations
  • Training them to code
  • Training them to solve advanced math and science problems

Low-quality human datasets from body shops just don’t cut it anymore. Reach out to us, and check out our LLM blog posts and research papers in the meantime.

Edwin Chen

Edwin oversees Surge AI's Engineering and Research teams — whether it's helping customers train large language models on human feedback, building content moderation algorithms to detect hate speech and spam, or scaling up an elite data labeling workforce. He previously led AI, Data Science, and Human Computation teams at Google, Facebook, and Twitter, and studied mathematics and linguistics at MIT.
