Reinforcement learning with human feedback (RLHF) is a new technique for training large language models that has been critical to OpenAI's ChatGPT and InstructGPT models, DeepMind's Sparrow, Anthropic's Claude, and more. Instead of training LLMs merely to predict the next word, we train them to understand instructions and generate helpful responses.
Want to learn more about RLHF and how it works? Read on!
Imagine you want a large language model to generate a story about Harry Potter.
Large language models like GPT-3 were originally designed to predict the next token in a sequence. This means that to generate a Harry Potter story, you might prompt GPT-3 with the following input:
Harry Potter had been training his large language model AI for years to fight Voldemort. So when the fight began, and Voldemort pulled out his wand and started muttering incantations, Harry [...]
GPT-3 then successfully responds with the following.
🧑 Harry Potter had been training his large language model AI for years to fight Voldemort. So when the fight began, and Voldemort pulled out his wand and started muttering incantations, Harry
🤖 was ready. His AI was ready.
In the beginning of the battle, the AI, which was creatively named the "Lemmatizer," used its powerful language model to predict the most likely next word in Voldemort's incantation. Then it took the word and used it to lookup the most likely previous word in its internal dictionary, which it then inserted into the middle of the incantation…
However, this way of “commanding” an AI – by asking it to “continue” a prompt – is unnatural. Ideally, telling the LLM “Generate a story about Harry Potter using AI to fight Voldemort” would suffice, but this is what happens when you send it to a non-RLHF model:
🧑 Generate a story about Harry Potter using AI to fight Voldemort
🤖 Imagine that you will write a novel about Harry Potter and you will use artificial intelligence for the first time to generate the story. You will use some basic techniques.
First, you will extract some ideas from the original book, Harry Potter and the Philosopher's Stone. You will use the basic information about the characters, the places and the events that appear in the book.
Despite knowing how to write stories, the LLM doesn’t understand the request!
This makes sense: in the Internet scrapes that LLMs are trained on, a command like “Generate a story” is rarely followed by the story itself; it is more likely to appear as a bullet in a list of todos. The most powerful next-word predictor simply isn’t the same as a model that intelligently follows instructions!
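To make that concrete, here is a toy next-token predictor. It is only a sketch: the three-line corpus, the `<nl>` newline token, and the function names are made up for illustration, and a bigram counter is a very crude stand-in for GPT-3. The failure mode is the same, though: the model continues the text it has seen rather than doing what the text asks.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "web scrape": instructions show up as todo items
# and are never followed by the thing they ask for.
corpus = [
    "write an email to mom <nl> finish christmas shopping <nl> go to church",
    "write an email to mom <nl> buy wrapping paper <nl> go to church",
    "write an email to dad <nl> finish christmas cards",
]

def train_bigram(lines):
    """Count, for every token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def continue_text(counts, prompt, max_new_tokens=6):
    """Greedily append the most frequent follower of the last token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # nothing ever followed this token in the corpus
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)

model = train_bigram(corpus)
print(continue_text(model, "write an email to mom"))
```

The predictor never writes the email. The first thing it emits after the prompt is a line break, because in its training data "write an email to mom" was always just one more item on a list.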
Enter reinforcement learning with human feedback.
Examples of RLHF vs. non-RLHF
To drive home the difference between non-RLHF large language models (not trained on human feedback; essentially “next token” predictors) and RLHF large language models (trained on human feedback; designed to satisfy your requests, not merely predict your next word), let’s look at more examples.
Example #1: Writing an email
Prompt: Write an email to my mother wishing her Merry Christmas, and ask how dad is doing.
This is what a non-RLHF model (the original davinci) produces:
🤖 Read the book I got for Christmas: The Screwtape Letters by C. S. Lewis
Finish my Christmas shopping
Watch the Grinch Who Stole Christmas (good one)
Finish Christmas cards.
Go to church
Notice that the non-RLHF model interprets the command as a bullet in a list of todos, not as an instruction!
In contrast, when training RLHF models, Surgers (our data labelers and AI teachers) provide feedback to the LLM that this isn’t a good response.
Here’s what you get when you provide the same prompt to an RLHF model trained on that human feedback.
🤖 Hi Mom,
Merry Christmas! I hope you're doing well. I'm wondering how Dad is doing. I hope he's doing well. I hope to talk to you soon.
This time, the LLM succeeds!
If we want to get the non-RLHF model to work, we need to contort the prompt, like so:
But even then, it doesn’t know when to stop correctly – this reads like what you’d find on a random Internet webpage about things to do on Christmas.
Example #2: Solving math problems
Prompt: What is 2 + 2?
🤖 What is 5 + 1?”
The result was the same. He couldn’t remember the answer to either question.
Not only does the non-RLHF model not answer the question, it thinks the question is part of a story dialogue!
The RLHF model correctly interprets the prompt as a question to answer, and responds with 4.
(Sidenote: in order to train RLHF models to perform math, we build teams of Surgers with advanced mathematical backgrounds – like math PhDs to help LLMs learn to conduct mathematical research, and competitive mathletes to train them to solve IMO problems – and build data generation tools, like embedding-based assistants to aid in diversity and verifiers to aid correctness, to help them.)
Example #3: Generating code
🧑 Help me write a Python function that outputs whether a number is even.
🤖 My code so far
def isEven(n): if n > 0: if n % 2 == 0: return True else: return False
I know this will not work.
Here's the full code:
Note that the non-RLHF model interprets the prompt not as a command, but rather as the beginning of a StackOverflow post!
Even though it essentially gets the code correct – it “knows” how to write code, in some latent buried sense – it surrounds the request with the extraneous StackOverflow-like text.
Unearthing those latent abilities is part of the magic of RLHF, and why we do a lot of advanced, Copilot-style RLHF training for customers as well.
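For reference, the function that the prompt in Example #3 asks for can be very short. This is a minimal sketch of a direct answer, not the output of any particular RLHF model; the name `is_even` and the type hints are our own choices.

```python
def is_even(n: int) -> bool:
    """Return True if n is even, False otherwise."""
    # Python's % returns a non-negative remainder for a positive divisor,
    # so this also works for zero and for negative integers.
    return n % 2 == 0

print(is_even(4), is_even(7))
```

Unlike the snippet quoted above, it has no `n > 0` guard, so every integer input gets a `True` or `False` answer.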
What does this mean for prompt engineering?
In the age of RLHF, prompt engineering should no longer be a necessity. The whole point of RLHF is to train models to understand what humans want when they naturally express things.
Soon, if you’re still writing contorted prompts or tweeting about hiring prompt engineers, it means you’re either:
- Feeding your RLHF models low-quality human data
- Optimizing for clickbait marketing fluff
- Or both.
How does RLHF work?
So how do you train an RLHF model on human feedback? We’ll dive into the technical details in a follow-up post, but in the meantime, here’s a high-level overview of the steps:
Step #1: Unsupervised pre-training
Start with a pre-trained language model (a next-token predictor), like the original GPT-3.
Step #2: Supervised finetuning
Form a set of commands (e.g., “generate a story about Harry Potter”), and a human-written response to each command. In other words, form a training dataset of <prompt, ideal generation> pairs.
This data collection and data generation is what LLM companies use our Surge AI platform for!
Then finetune the pre-trained model to output these human responses.
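As a rough illustration of what a <prompt, ideal generation> dataset can look like once it is prepared for finetuning, here is a small sketch. Several things in it are assumptions rather than anything stated above: whitespace splitting stands in for a real tokenizer, `<eos>` is a hypothetical end-of-sequence token, and masking the loss so that only response tokens are trained on is a common convention, not a requirement.

```python
from dataclasses import dataclass

@dataclass
class SFTExample:
    tokens: list[str]     # prompt tokens followed by response tokens
    loss_mask: list[int]  # 1 where the model is trained to predict the token

def build_sft_example(prompt: str, ideal_generation: str) -> SFTExample:
    """Turn one <prompt, ideal generation> pair into a training example."""
    prompt_tokens = prompt.split()
    # The end-of-sequence token teaches the model where a response stops.
    response_tokens = ideal_generation.split() + ["<eos>"]
    return SFTExample(
        tokens=prompt_tokens + response_tokens,
        # Assumption: no loss on the prompt, only on the human-written response.
        loss_mask=[0] * len(prompt_tokens) + [1] * len(response_tokens),
    )

# Hypothetical pairs in the spirit of the examples in this post.
pairs = [
    ("Write an email to my mother wishing her Merry Christmas.",
     "Hi Mom, Merry Christmas!"),
    ("What is 2 + 2?", "4"),
]
dataset = [build_sft_example(prompt, generation) for prompt, generation in pairs]

for example in dataset:
    print(list(zip(example.tokens, example.loss_mask)))
```

Training on an explicit end-of-sequence token is also one way to address the "doesn’t know when to stop" problem from Example #1.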
Step #3: Training a “human feedback” reward model
The third step is building a reward model that scores how good an LLM output is for a given prompt. In other words, a reward model is another model (e.g., another LLM with its final layers chopped off) that takes a prompt and generation as input, and outputs a scalar reward.
So once again, form a set of new commands. Then form a set of machine-generated responses to these commands, and ask Surgers to score or rank their quality.
Use this dataset to train a reward model that outputs a quality score for any <prompt, generation> pair.
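Here is a toy sketch of that training step, under some loudly stated assumptions. We assume the human feedback arrives as pairwise rankings (preferred vs. rejected) and train with a pairwise logistic loss, which is one common choice and not the only one. The bag-of-words `featurize` function is a hypothetical stand-in for the LLM that would really sit under the scalar output, and the comparison data is made up from the examples in this post.

```python
import math
from collections import Counter

def featurize(prompt: str, generation: str) -> Counter:
    """Hypothetical features: words in the generation, plus words echoed from the prompt."""
    prompt_words = set(prompt.lower().split())
    feats = Counter()
    for word in generation.lower().split():
        feats[f"gen:{word}"] += 1
        if word in prompt_words:
            feats[f"echo:{word}"] += 1
    return feats

def score(weights: dict, prompt: str, generation: str) -> float:
    """Scalar reward for a <prompt, generation> pair (a linear model here)."""
    feats = featurize(prompt, generation)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def train_reward_model(comparisons, epochs=50, lr=0.1) -> dict:
    weights = {}
    for _ in range(epochs):
        for prompt, preferred, rejected in comparisons:
            margin = score(weights, prompt, preferred) - score(weights, prompt, rejected)
            # Pairwise logistic loss: -log(sigmoid(margin)).
            # Gradient descent moves weights by lr * (1 - sigmoid(margin))
            # toward the preferred features and away from the rejected ones.
            step = lr * (1.0 - 1.0 / (1.0 + math.exp(-margin)))
            for name, value in featurize(prompt, preferred).items():
                weights[name] = weights.get(name, 0.0) + step * value
            for name, value in featurize(prompt, rejected).items():
                weights[name] = weights.get(name, 0.0) - step * value
    return weights

# (prompt, response ranked higher by a human, response ranked lower)
comparisons = [
    ("What is 2 + 2?", "4", "What is 5 + 1?"),
    ("Write an email to my mother wishing her Merry Christmas",
     "Hi Mom, Merry Christmas!", "Finish my Christmas shopping"),
]
weights = train_reward_model(comparisons)
print(score(weights, "What is 2 + 2?", "4"))
print(score(weights, "What is 2 + 2?", "What is 5 + 1?"))
```

After training, the preferred response in each comparison scores higher than the rejected one. A real reward model has to generalize that ordering to prompts and generations it has never seen, which is why the quality and diversity of the human rankings matter so much.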
Step #4: Train a Reinforcement Learning policy that optimizes based on the reward model
Finally, train a Reinforcement Learning policy (a policy, in this case, is essentially an algorithm that outputs the next word or token) that optimizes based on the reward model (i.e., tries to generate text that the reward model thinks humans prefer). This is our new RLHF algorithm!
In other words, in reinforcement learning, a policy takes actions from a given state and receives a reward that it tries to maximize. In this case, our policy is essentially just a language model, its actions are generating text, and its reward is the score that the reward model gives the generated text.
So to train this RLHF model:
- First, initialize the RL policy to be the finetuned LLM from Step 2. Then repeat the following:
- Take a prompt and use the RL policy to generate an output.
- Use the reward model to calculate a reward for this generation. (This is essentially simulating how a human would score the generation.)
- Update the RL policy based on the reward (i.e., the policy is now learning whether it is generating good or bad responses).
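The loop above can be sketched in a few lines of Python. Treat this as a toy under heavy assumptions: the "policy" is a softmax over three fixed candidate responses instead of a language model generating tokens, `reward_model` is a hard-coded stand-in for the model from Step 3, and the update rule is plain REINFORCE with a running-average baseline. Production systems typically use more elaborate algorithms (PPO with a penalty for drifting too far from the finetuned model is a common choice), which this sketch does not attempt to reproduce.

```python
import math
import random

# Hypothetical candidate responses the toy policy can "generate".
CANDIDATES = [
    "Finish my Christmas shopping",          # todo-list continuation
    "What is 5 + 1?",                        # story-dialogue continuation
    "Hi Mom, Merry Christmas! How is Dad?",  # actually follows the instruction
]

def reward_model(prompt: str, generation: str) -> float:
    """Hard-coded stand-in for the trained reward model from Step 3."""
    return 1.0 if generation.startswith("Hi Mom") else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_policy(prompt, steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Initialize the policy (here: uniform logits standing in for the Step 2 model).
    logits = [0.0] * len(CANDIDATES)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        # Take a prompt and use the policy to generate an output.
        action = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        # Use the reward model to calculate a reward for this generation.
        reward = reward_model(prompt, CANDIDATES[action])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # Update the policy. REINFORCE: the gradient of log pi(action)
        # with respect to the logits is one_hot(action) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

prompt = "Write an email to my mother wishing her Merry Christmas, and ask how dad is doing."
for candidate, prob in zip(CANDIDATES, train_policy(prompt)):
    print(f"{prob:.3f}  {candidate}")
```

The policy starts out indifferent between the three responses and ends up putting most of its probability on the one the reward model scores highly. Note that it only ever learns what the reward model rewards, so mistakes in the reward model get optimized for too.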
Et voilà! You now have a state-of-the-art large language model like ChatGPT.
Of course, there are many nuances to getting these steps right. For example:
- When performing the supervised finetuning and reward modeling steps, what set of prompts do you want to use? How do you create them in a diverse way? For example, if you want to create a good summarizer, what kinds of texts are you summarizing, how long are the summaries, and how do you phrase the summarization requests to capture real-world needs?
- When training a reward model, what set of generations do you give human evaluators, and how do they score them?
These questions are exactly what we build RLHF tooling to help our customers with. If you’re training your own RLHF models and want help, reach out to email@example.com!
In short, RLHF is the new state-of-the-art for training language models that align to human preferences and give us what we want. OpenAI even found that RLHF models can be vastly more parameter-efficient: in human evaluations, outputs from their 1.3B-parameter RLHF model (InstructGPT) were preferred over outputs from their 175B-parameter non-RLHF model (GPT-3), despite the RLHF model having more than 100x fewer parameters!
We’re lucky to be at the forefront of this movement, helping LLM companies around the world use human feedback to:
- Train LLMs to use tools – like search engines, IDEs, and spreadsheets – via human demonstrations
- Train them to code
- Train them to solve advanced math and science problems
- …and more.
As Nathan Lambert, a research scientist at Hugging Face, writes:
Need RLHF datasets to train the next generation of AI? We work with top AI companies to provide the rich, high-quality human feedback they need to build state-of-the-art language models. Reach out to firstname.lastname@example.org to learn more!