Interested in training your own state-of-the-art large language models on human feedback? Learn why the world’s top AI companies turn to Surge AI for RLHF, and reach out to email@example.com to schedule a call!
How have the current state-of-the-art LLMs – like OpenAI’s ChatGPT, Anthropic’s Claude, and DeepMind’s Sparrow – grown in intelligence so quickly? The key advance that separates them from the previous generation of LLMs is reinforcement learning from human feedback (RLHF). Instead of merely training language models to predict the next word, Surge AI trains them by feeding them vast amounts of rich human data.
Here’s a short, illustrated guide on how RLHF works.
Step #1: Unsupervised pre-training
Start with a pre-trained language model, like the original GPT-3 – a model trained on huge amounts of raw text to predict the next token. You now have a next-word predictor!
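The pre-training objective can be sketched in a few lines. This is a toy illustration, not a real model: the vocabulary and probabilities below are made up, but the loss – the negative log-probability the model assigns to the token that actually came next – is the standard next-token cross-entropy that pre-training minimizes.

```python
import math

# Made-up model output: P(next token | "the cat sat on the")
probs = {"mat": 0.6, "dog": 0.3, "car": 0.1}
actual_next = "mat"

# Cross-entropy loss for this position: penalize low probability
# on the token that actually followed in the training text.
loss = -math.log(probs[actual_next])
print(round(loss, 3))  # → 0.511
```

Pre-training simply repeats this over trillions of tokens, nudging the model to put more probability mass on whatever actually comes next.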
Step #2: Supervised finetuning
Form a set of commands (e.g., “generate a story about Harry Potter”, “show me step by step how to solve for x in 3x + 5 = 14”), and a human-written response to each command. In other words, form a training dataset of <prompt, ideal generation> pairs.
This data collection and data generation is what LLM companies use our Surge AI platform for!
Then finetune the pre-trained model to output these human responses.
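Concretely, the finetuning dataset is just a collection of &lt;prompt, ideal generation&gt; pairs. Here's a hypothetical example of what that data might look like serialized as JSONL (one JSON object per line, a common format for finetuning pipelines) – the field names and file name are illustrative assumptions, not a specific API's schema:

```python
import json

# Hypothetical <prompt, ideal generation> pairs, human-written responses
# to commands like the ones described above.
pairs = [
    {"prompt": "Generate a story about Harry Potter.",
     "completion": "Harry stared at the letter on the doormat..."},
    {"prompt": "Show me step by step how to solve for x in 3x + 5 = 14.",
     "completion": "Subtract 5 from both sides: 3x = 9. Divide by 3: x = 3."},
]

# Write one JSON object per line (JSONL).
with open("sft_dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

The pre-trained model is then finetuned with the same next-token loss as before, but only on these human-written completions.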
Step #3: Training a “human feedback” reward model
The third step is building a reward model that scores how good an LLM generation is for a given prompt.
So, once again, form a set of new commands.
Then form a set of machine-generated responses to these commands, and ask human labelers to score their quality. There are two typical scoring approaches:
- Absolute scoring: give Surge AI labelers a single &lt;prompt, generation&gt; pair and ask them to score the generation on a Likert scale.
- Ranking / comparisons: generate 2 or more responses to a given prompt, and ask Surgers to rank the responses by quality.
Then use this dataset to train a reward model that outputs a quality score for any <prompt, generation> pair.
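With ranking data, the reward model is typically trained with a pairwise (Bradley-Terry-style) loss: push the reward of the human-preferred generation above the reward of the rejected one. Here's a minimal sketch of that loss – the function name and reward values are illustrative:

```python
import math

def pairwise_loss(r_preferred: float, r_rejected: float) -> float:
    """Pairwise ranking loss for reward-model training:
    loss = -log(sigmoid(r_preferred - r_rejected)).
    Small when the preferred generation already scores higher."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model already agrees with the human ranking: low loss.
print(round(pairwise_loss(2.0, 0.0), 3))  # → 0.127
# Reward model can't tell them apart: loss = log(2).
print(round(pairwise_loss(1.0, 1.0), 3))  # → 0.693
```

Minimizing this loss over many labeled comparisons teaches the reward model to assign higher scores to generations humans prefer.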
Step #4: Train a Reinforcement Learning policy that optimizes based on the reward model
Finally, train a Reinforcement Learning policy (a policy, in this case, is essentially an algorithm that decides which word or token to output next) that optimizes against the reward model – i.e., tries to generate text that the reward model predicts humans would prefer. This is where the RL in RLHF comes in!
To do this:
- First, initialize the RL policy as the finetuned LLM from Step 2.
- In order to train it, take a prompt and use the RL policy to generate an output.
- Then use the reward model to calculate a reward for this generation. (This is essentially simulating how a human would score the generation.)
- Update the RL policy based on the reward (i.e., the policy is now learning whether it is generating good or bad responses).
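The per-generation training signal in this loop can be sketched as follows. One common recipe (e.g., the PPO-based setup popularized for models like ChatGPT) scores each generation with the reward model, then subtracts a KL penalty that keeps the policy from drifting too far from the finetuned model of Step 2. The coefficient and exact penalty form below are illustrative assumptions:

```python
def rlhf_reward(rm_score: float, logp_policy: float, logp_sft: float,
                kl_coef: float = 0.1) -> float:
    """Training signal for the RL policy: reward-model score minus a
    KL-style penalty for drifting away from the supervised-finetuned
    model. kl_coef is an illustrative hyperparameter."""
    kl_penalty = kl_coef * (logp_policy - logp_sft)  # per-token KL estimate
    return rm_score - kl_penalty

# A generation the reward model likes (score 1.5), where the policy
# assigns it higher log-probability than the SFT model did:
print(round(rlhf_reward(rm_score=1.5, logp_policy=-1.0, logp_sft=-2.0), 2))  # → 1.4
```

The RL algorithm then updates the policy's parameters to increase this quantity, so the model learns to produce generations the reward model – and, by proxy, humans – score highly, without collapsing into degenerate text.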
Et voilà! You now have a state-of-the-art large language model like ChatGPT.
Want to learn more about the process, tools, and quality control technology needed to build rich RLHF datasets? Want to train your own ChatGPT competitor? We work with large language model companies around the world, on applications like:
- Training LLMs to use tools – like search engines, IDEs, and spreadsheets – via human demonstrations
- Training them to code
- Training them to solve advanced math and science problems
Low-quality human datasets from body shops just don’t cut it anymore. Reach out to firstname.lastname@example.org and check out our LLM blog posts and research papers in the meantime:
Data Labeling 2.0 for Rich, Creative AI
Superintelligent AI, meet your human teachers. Our data labeling platform is designed from the ground up to train the next generation of AI — whether it’s systems that can code in Python, summarize poetry, or detect the subtleties of toxic speech. Use our powerful data labeling workforce and tools to build the rich, human-powered datasets you need today.