Introduction to Reinforcement Learning with Human Feedback

Edwin Chen
Jan 4, 2023
Reinforcement learning with human feedback (RLHF) is a new technique for training large language models that has been critical to OpenAI's ChatGPT and InstructGPT models, DeepMind's Sparrow, Anthropic's Claude, and more. Instead of training LLMs merely to predict the next word, we train them to understand instructions and generate helpful responses.

Want to learn more about RLHF and how it works? Read on!

A human Surger training RLHF language models on human feedback.

Imagine you want a large language model to generate a story about Harry Potter.

Large language models like GPT-3 were originally designed to predict the next token in a sequence. This means that to generate a Harry Potter story, you might prompt GPT-3 with the following input:

Harry Potter had been training his large language model AI for years to fight Voldemort. So when the fight began, and Voldemort pulled out his wand and started muttering incantations, Harry [...]

GPT-3 then successfully responds with the following.

Talking to an AI by asking it to continue a prompt.

🧑 Harry Potter had been training his large language model AI for years to fight Voldemort. So when the fight began, and Voldemort pulled out his wand and started muttering incantations, Harry

🤖 was ready. His AI was ready.

In the beginning of the battle, the AI, which was creatively named the "Lemmatizer," used its powerful language model to predict the most likely next word in Voldemort's incantation. Then it took the word and used it to lookup the most likely previous word in its internal dictionary, which it then inserted into the middle of the incantation…

However, this way of “commanding” an AI – by asking it to “continue” a prompt – is unnatural. Ideally, telling the LLM “Generate a story about Harry Potter using AI to fight Voldemort” would suffice, but this is what happens when you send it to a non-RLHF model:

A non-RLHF model fails to understand the command, and tries to continue it like a blog post.

🧑 Generate a story about Harry Potter using AI to fight Voldemort

🤖 Imagine that you will write a novel about Harry Potter and you will use artificial intelligence for the first time to generate the story. You will use some basic techniques.

First, you will extract some ideas from the original book, Harry Potter and the Philosopher's Stone. You will use the basic information about the characters, the places and the events that appear in the book.

Despite knowing how to write stories, the LLM doesn’t understand the request!

This makes sense: on the Internet scrapes LLMs are trained on, you rarely see commands like “Generate a story”, followed by a story itself; it's more likely to be a bullet in a list of todos. The most powerful next-word predictor simply isn’t the same as one that intelligently follows instructions!
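To make this concrete, here is a minimal sketch of a next-word predictor. It is a toy bigram counter, not how GPT-3 works internally, and the four-line "web scrape" is invented for illustration. The function name continue_text is an assumption too. Because the command only ever appears in the corpus as a todo item, the predictor simply finishes the todo item rather than writing a story.

```python
from collections import Counter, defaultdict

# Invented stand-in for an Internet scrape: each line is a todo-list item.
corpus = [
    "generate a story about harry potter",
    "finish my christmas shopping",
    "watch a movie about harry potter",
    "go to church",
]

# Bigram model: count which word follows each word in the corpus.
next_counts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for current, following in zip(words, words[1:]):
        next_counts[current][following] += 1


def continue_text(prompt, max_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.lower().split()
    for _ in range(max_words):
        candidates = next_counts.get(words[-1])
        if not candidates:
            break  # this word was never followed by anything in the corpus
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


print(continue_text("Generate a story"))
# prints: generate a story about harry potter
```

The predictor is doing its job perfectly well: it outputs the likeliest continuation. It just has no notion that the prompt was a request.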

Enter reinforcement learning with human feedback.

Examples of RLHF vs. non-RLHF

To drive home the difference between non-RLHF large language models (not trained on human feedback; essentially “next token” predictors) and RLHF large language models (trained on human feedback; designed to satisfy your requests, not merely predict your next word), let’s look at more examples.

Example #1: Writing an email

Prompt: Write an email to my mother wishing her Merry Christmas, and ask how dad is doing.

Non-RLHF output

This is what a non-RLHF model (the original davinci) produces:

A non-RLHF model misinterprets an email-writing command as the first in a list of todos.

🤖 Read the book I got for Christmas: The Screwtape Letters by C. S. Lewis

Finish my Christmas shopping

Watch the Grinch Who Stole Christmas (good one)

Finish Christmas cards.

Go to church

Notice that the non-RLHF model interprets the command as a bullet in a list of todos, not as an instruction!

In contrast, when training RLHF models, Surgers (our data labelers and AI teachers) provide feedback to the LLM that this isn’t a good response.

RLHF output

Here’s what you get when you provide the same prompt to an RLHF model trained on that human feedback.

Successfully commanding an RLHF model to write an email.

🤖 Hi Mom,

Merry Christmas! I hope you're doing well. I'm wondering how Dad is doing. I hope he's doing well. I hope to talk to you soon.

Take care!

This time, the LLM succeeds!

If we want to get the non-RLHF model to work, we need to contort the prompt, like so:

Contorted prompt-engineering for non-RLHF models.

But even then, it doesn’t know when to stop correctly – this reads like what you’d find on a random Internet webpage about things to do on Christmas.

Example #2: Solving math problems

Prompt: What is 2 + 2?

Non-RLHF output

A non-RLHF model responds to the command by autocompleting it as part of a story dialogue.

🤖 What is 5 + 1?”

The result was the same. He couldn’t remember the answer to either question.

Not only does the non-RLHF model fail to answer the question, it thinks the question is part of a story dialogue!

RLHF output

An RLHF model answers the math question in a way the user intended.

The RLHF model correctly interprets the prompt as a question to answer, and responds with 4.

(Sidenote: in order to train RLHF models to perform math, we build teams of Surgers with advanced mathematical backgrounds – like math PhDs to help LLMs learn to conduct mathematical research, and competitive mathletes to train them to solve IMO problems – and build data generation tools, like embedding-based assistants to aid in diversity and verifiers to aid correctness, to help them.)

Example #3: Generating code


🧑 Help me write a Python function that outputs whether a number is even.

Non-RLHF output

Trying, unsuccessfully, to ask a non-RLHF model for programming help.

🤖 My code so far

def isEven(n):
    if n > 0:
        if n % 2 == 0:
            return True
        else:
            return False

I know this will not work.

Here's the full code:

Note that the non-RLHF model interprets the prompt not as a command, but rather as the beginning of a StackOverflow post!

Even though it essentially gets the code correct – it “knows” how to write code, in some latent buried sense – it surrounds the request with the extraneous StackOverflow-like text.
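For comparison, here is a minimal sketch of the function the prompt asks for. This is an illustration only, not the output of either model, and the name is_even is an assumption. It drops the n > 0 guard seen in the snippet above so that zero and negative integers are handled as well.

```python
def is_even(n: int) -> bool:
    """Return True if n is even, for any integer (positive, zero, or negative)."""
    return n % 2 == 0


print(is_even(4))   # True
print(is_even(7))   # False
print(is_even(-2))  # True
```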

Unearthing those latent abilities is part of the magic of RLHF, and why we do a lot of advanced, Copilot-style RLHF training for customers as well.

RLHF output

An RLHF model makes the human happy (except for the tabs!).

What does this mean for prompt engineering?

In the age of RLHF, prompt engineering should no longer be a necessity. The whole point of RLHF is to train models to understand what humans want when they naturally express things.

Soon, if you’re still writing contorted prompts or tweeting about hiring prompt engineers, it means you’re either:

  • Feeding your RLHF models low-quality human data
  • Optimizing for clickbait marketing fluff
  • Or all of the above.

How does RLHF work?

So how do you train an RLHF model on human feedback? We’ll dive into the technical details in a follow-up post, but in the meantime, here’s a high-level overview of the steps:

Step #1: Unsupervised pre-training

Start with a pre-trained language model (a next-token predictor), like the original GPT-3.

RLHF Step #1: Unsupervised pre-training

Step #2: Supervised finetuning

Form a set of commands (e.g., “generate a story about Harry Potter”), and a human-written response to each command. In other words, form a training dataset of <prompt, ideal generation> pairs.

This data collection and data generation is what LLM companies use our Surge AI platform for!

Then finetune the pre-trained model to output these human responses.

RLHF Step #2: Supervised finetuning
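As a rough sketch of what one supervised finetuning example might look like, the snippet below turns a <prompt, ideal generation> pair into a token sequence plus a mask marking which tokens the model is trained to output. Everything here is an assumption made for illustration: the example pairs, the " ### " separator, the "<end>" marker, whitespace tokenization, and the choice to learn only from response tokens (a common convention, not something this post specifies).

```python
# Hypothetical <prompt, ideal generation> pairs standing in for human-written data.
sft_pairs = [
    ("Generate a story about Harry Potter", "Harry raised his wand and ..."),
    ("What is 2 + 2?", "4"),
]


def build_example(prompt, response, sep=" ### ", end=" <end>"):
    """Return (tokens, loss_mask) for one <prompt, ideal generation> pair.

    loss_mask is 0 for prompt tokens and 1 for response tokens, so the
    finetuning loss would only reward reproducing the human response.
    """
    prompt_tokens = (prompt + sep).split()
    # The end marker is part of the response, so the model can learn when to stop.
    response_tokens = (response + end).split()
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


for prompt, response in sft_pairs:
    tokens, loss_mask = build_example(prompt, response)
    print(list(zip(tokens, loss_mask)))
```

A real pipeline would use the model's own tokenizer and a cross-entropy loss over the masked positions; the sketch only shows the shape of the data.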

Step #3: Training a “human feedback” reward model

The third step is building a reward model that scores how good an LLM output is for a given prompt. In other words, a reward model is another model (e.g., another LLM with its final layers chopped off) that takes a prompt and generation as input, and outputs a scalar reward.

So once again, form a set of new commands. Then form a set of machine-generated responses to these commands, and ask Surgers to score or rank their quality.

Use this dataset to train a reward model that outputs a quality score for any <prompt, generation> pair.

RLHF Step #3: Training a human feedback reward model.
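Here is a toy sketch of how ranked responses could become a reward model. It rests on several assumptions: the ranking is invented, the "model" is just one learned weight per word (a real reward model would be an LLM that reads both the prompt and the generation), and training uses a pairwise ranking loss, -log(sigmoid(r_better - r_worse)), which is one common choice rather than the only one.

```python
import math
from collections import defaultdict


def reward(weights, prompt, generation):
    """Scalar score for a <prompt, generation> pair.

    Toy stand-in: sums one learned weight per word of the generation and
    ignores the prompt entirely.
    """
    return sum(weights[word] for word in generation.lower().split())


def ranking_to_pairs(prompt, ranked_generations):
    """Expand a best-to-worst ranking into (prompt, better, worse) triples."""
    return [
        (prompt, ranked_generations[i], ranked_generations[j])
        for i in range(len(ranked_generations))
        for j in range(i + 1, len(ranked_generations))
    ]


def train_reward_model(pairs, epochs=50, lr=0.1):
    """Gradient descent on -log(sigmoid(r_better - r_worse))."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for prompt, better, worse in pairs:
            margin = reward(weights, prompt, better) - reward(weights, prompt, worse)
            step = lr / (1.0 + math.exp(margin))  # lr * (1 - sigmoid(margin))
            for word in better.lower().split():
                weights[word] += step
            for word in worse.lower().split():
                weights[word] -= step
    return weights


# Hypothetical human ranking, best first, for a single prompt.
prompt = "Write an email to my mother wishing her Merry Christmas"
ranked = [
    "Hi Mom, Merry Christmas! How is Dad doing?",
    "Merry Christmas",
    "Finish my Christmas shopping",
]

pairs = ranking_to_pairs(prompt, ranked)
weights = train_reward_model(pairs)
for generation in ranked:
    print(round(reward(weights, prompt, generation), 2), generation)
```

After training, the scores follow the human ranking, which is all the next step needs: a function that can score generations no human has looked at.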

Step #4: Train a Reinforcement Learning policy that optimizes based on the reward model

Finally, train a Reinforcement Learning policy (a policy, in this case, is essentially an algorithm that outputs the next word or token) that optimizes based on the reward model (i.e., tries to generate text that the reward model thinks humans prefer). This is our new RLHF algorithm!

In other words, in Reinforcement Learning models, taking actions from a given state will provide the policy with a reward that it will try to maximize. In this case, our policy is essentially just a language model, its actions are generating text, and its reward is the score that the reward model gives the generated text.

So to train this RLHF model:

  • First, initialize the RL policy to be the finetuned LLM from Step 2. Then repeat the following:
  • Take a prompt and use the RL policy to generate an output.
  • Use the reward model to calculate a reward for this generation. (This is essentially simulating how a human would score the generation.)
  • Update the RL policy based on the reward (i.e., the policy is now learning whether it is generating good or bad responses).

RLHF Step #4: Training an RL policy that optimizes using the human feedback reward model.
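The loop above can be sketched in a few lines. This is a deliberately tiny stand-in, with every piece an assumption: the "policy" is a softmax over three canned responses rather than a language model, reward_model is a hard-coded placeholder for the Step 3 model, and the update is a plain REINFORCE-style policy gradient. Production systems typically use more elaborate algorithms, such as PPO.

```python
import math
import random

random.seed(0)  # make the sketch reproducible

PROMPT = "What is 2 + 2?"
# The toy policy can only choose among these canned responses.
CANDIDATES = ["4", "What is 5 + 1?", "He couldn't remember the answer."]


def reward_model(prompt, generation):
    """Hypothetical placeholder for the reward model trained in Step 3."""
    return 1.0 if generation == "4" else 0.0


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(logits, steps=500, lr=0.5):
    logits = list(logits)  # do not modify the caller's list
    for _ in range(steps):
        probs = softmax(logits)
        # Take a prompt and use the policy to generate an output.
        choice = random.choices(range(len(CANDIDATES)), weights=probs)[0]
        # Use the reward model to calculate a reward for this generation.
        r = reward_model(PROMPT, CANDIDATES[choice])
        # Update the policy: raise the log-probability of the sampled
        # output in proportion to its reward.
        for i in range(len(logits)):
            indicator = 1.0 if i == choice else 0.0
            logits[i] += lr * r * (indicator - probs[i])
    return logits


# Stand-in for the finetuned model from Step 2: it starts out preferring
# the two story-like continuations over the direct answer.
initial_logits = [0.0, 1.0, 1.0]
final_probs = softmax(train_policy(initial_logits))
print([round(p, 3) for p in final_probs])
```

Each time the policy happens to sample the answer the reward model likes, that answer becomes more probable, so over many iterations the probability mass shifts toward it.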

Et voilà! You now have a state-of-the-art large language model like ChatGPT.

RLHF Nuances

Of course, there are many nuances to getting these steps right. For example:

  • When performing the supervised finetuning and reward modeling steps, what set of prompts do you want to use? How do you create them in a diverse way? For example, if you want to create a good summarizer, what kinds of texts are you summarizing, how long are the summaries, and how do you phrase the summarization requests to capture real-world needs?
  • When training a reward model, what set of generations do you give human evaluators, and how do they score them?

These questions are exactly what we build RLHF tooling to help our customers with. If you’re training your own RLHF models and want help, reach out!


In short, RLHF is the new state of the art for training language models that align to human preferences and give us what we want. OpenAI even found that RLHF models are vastly more efficient: human labelers preferred outputs from their 1.3B-parameter RLHF model over those from their 175B-parameter non-RLHF model, despite the RLHF model having more than 100x fewer parameters (175B / 1.3B ≈ 135x)!

We’re lucky to be at the forefront of this movement, helping LLM companies around the world use human feedback to:

  • Train LLMs to use tools – like search engines, IDEs, and spreadsheets – via human demonstrations
  • Train them to code
  • Train them to solve advanced math and science problems
  • …and more.

As Nathan Lambert, a research scientist at Hugging Face, writes:

The RLHF kingmakers.

In our next post, we’ll dive into the technical details, nuances, and open questions surrounding RLHF. To follow along with this RLHF series, follow us on Twitter @HelloSurgeAI or on LinkedIn!

Need RLHF datasets to train the next generation of AI? We work with top AI companies to provide the rich, high-quality human feedback they need to build state-of-the-art language models. Reach out to learn more!

Edwin Chen

Edwin oversees Surge AI's Engineering and Research teams — whether it's helping customers train large language models on human feedback, building content moderation algorithms to detect hate speech and spam, or scaling up an elite data labeling workforce. He previously led AI, Data Science, and Human Computation teams at Google, Facebook, and Twitter, and studied mathematics and linguistics at MIT.
