Interested in training your own state-of-the-art large language models on human feedback? Learn why the world’s top AI companies turn to Surge AI for RLHF, and reach out to email@example.com to schedule a call!
How have the current state-of-the-art LLMs – like OpenAI’s ChatGPT, Anthropic’s Claude, and DeepMind’s Sparrow – grown in intelligence so quickly? The key advance that separates them from the previous generation of LLMs is reinforcement learning from human feedback (RLHF). Instead of merely training language models to predict the next word, Surge AI trains them by feeding them vast amounts of rich human data.
Here’s a short, illustrated guide on how RLHF works.
Step #1: Unsupervised pre-training
Start with a pre-trained language model, like the original GPT-3 – a model trained on huge amounts of raw text to predict the next token. You now have a next-word predictor!
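The pre-training objective can be sketched in a few lines. This is a toy illustration, not a real model: the vocabulary and probabilities below are made up, but the loss – the negative log-probability the model assigns to the token that actually came next – is the standard next-token cross-entropy that pre-training minimizes.

```python
import math

# Made-up model output: P(next token | "the cat sat on the")
probs = {"mat": 0.6, "dog": 0.3, "car": 0.1}
actual_next = "mat"

# Cross-entropy loss for this position: penalize low probability
# on the token that actually followed in the training text.
loss = -math.log(probs[actual_next])
print(round(loss, 3))  # → 0.511
```

Pre-training simply repeats this over trillions of tokens, nudging the model to put more probability mass on whatever actually comes next.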
Step #2: Supervised finetuning
Form a set of commands (e.g., “generate a story about Harry Potter”, “show me step by step how to solve for x in 3x + 5 = 14”), and a human-written response to each command. In other words, form a training dataset of <prompt, ideal generation> pairs.
This data collection and data generation is what LLM companies use our Surge AI platform for!
Then finetune the pre-trained model to output these human responses.
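Concretely, the finetuning dataset is just a collection of &lt;prompt, ideal generation&gt; pairs. Here's a hypothetical example of what that data might look like serialized as JSONL (one JSON object per line, a common format for finetuning pipelines) – the field names and file name are illustrative assumptions, not a specific API's schema:

```python
import json

# Hypothetical <prompt, ideal generation> pairs, human-written responses
# to commands like the ones described above.
pairs = [
    {"prompt": "Generate a story about Harry Potter.",
     "completion": "Harry stared at the letter on the doormat..."},
    {"prompt": "Show me step by step how to solve for x in 3x + 5 = 14.",
     "completion": "Subtract 5 from both sides: 3x = 9. Divide by 3: x = 3."},
]

# Write one JSON object per line (JSONL).
with open("sft_dataset.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

The pre-trained model is then finetuned with the same next-token loss as before, but only on these human-written completions.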
Step #3: Training a “human feedback” reward model
The third step is building a reward model that scores how good an LLM generation is for a given prompt.
So, once again, form a set of new commands.
Then form a set of machine-generated responses to these commands, and ask human labelers to score their quality. There are two typical scoring approaches:
- Absolute scoring: give Surge AI labelers a single &lt;prompt, generation&gt; pair and ask them to score the generation on a Likert scale.
- Ranking / comparisons: generate 2 or more responses to a given prompt, and ask Surgers to rank the responses by quality.
Then use this dataset to train a reward model that outputs a quality score for any <prompt, generation> pair.
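With ranking data, the reward model is typically trained with a pairwise (Bradley-Terry-style) loss: push the reward of the human-preferred generation above the reward of the rejected one. Here's a minimal sketch of that loss – the function name and reward values are illustrative:

```python
import math

def pairwise_loss(r_preferred: float, r_rejected: float) -> float:
    """Pairwise ranking loss for reward-model training:
    loss = -log(sigmoid(r_preferred - r_rejected)).
    Small when the preferred generation already scores higher."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model already agrees with the human ranking: low loss.
print(round(pairwise_loss(2.0, 0.0), 3))  # → 0.127
# Reward model can't tell them apart: loss = log(2).
print(round(pairwise_loss(1.0, 1.0), 3))  # → 0.693
```

Minimizing this loss over many labeled comparisons teaches the reward model to assign higher scores to generations humans prefer.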
Step #4: Train a Reinforcement Learning policy that optimizes based on the reward model
Finally, train a Reinforcement Learning policy (a policy, in this case, is essentially an algorithm that decides which word or token to output next) that optimizes against the reward model – i.e., tries to generate text that the reward model predicts humans would prefer. This is where the RL in RLHF comes in!
To do this:
- First, initialize the RL policy as the finetuned LLM from Step 2.
- In order to train it, take a prompt and use the RL policy to generate an output.
- Then use the reward model to calculate a reward for this generation. (This is essentially simulating how a human would score the generation.)
- Update the RL policy based on the reward (i.e., the policy is now learning whether it is generating good or bad responses).
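The per-generation training signal in this loop can be sketched as follows. One common recipe (e.g., the PPO-based setup popularized for models like ChatGPT) scores each generation with the reward model, then subtracts a KL penalty that keeps the policy from drifting too far from the finetuned model of Step 2. The coefficient and exact penalty form below are illustrative assumptions:

```python
def rlhf_reward(rm_score: float, logp_policy: float, logp_sft: float,
                kl_coef: float = 0.1) -> float:
    """Training signal for the RL policy: reward-model score minus a
    KL-style penalty for drifting away from the supervised-finetuned
    model. kl_coef is an illustrative hyperparameter."""
    kl_penalty = kl_coef * (logp_policy - logp_sft)  # per-token KL estimate
    return rm_score - kl_penalty

# A generation the reward model likes (score 1.5), where the policy
# assigns it higher log-probability than the SFT model did:
print(round(rlhf_reward(rm_score=1.5, logp_policy=-1.0, logp_sft=-2.0), 2))  # → 1.4
```

The RL algorithm then updates the policy's parameters to increase this quantity, so the model learns to produce generations the reward model – and, by proxy, humans – score highly, without collapsing into degenerate text.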
Et voilà! You now have a state-of-the-art large language model like ChatGPT.
Want to learn more about the process, tools, and quality control technology needed to build rich RLHF datasets? Want to train your own ChatGPT competitor? We work with large language model companies around the world, on applications like:
- Training LLMs to use tools – like search engines, IDEs, and spreadsheets – via human demonstrations
- Training them to code
- Training them to solve advanced math and science problems
Low-quality human datasets from body shops just don’t cut it anymore. Reach out to firstname.lastname@example.org and check out our LLM blog posts and research papers in the meantime:
Data Labeling 2.0 for Rich, Creative AI
Superintelligent AI, meet your human teachers. Our data labeling platform is designed from the ground up to train the next generation of AI — whether it’s systems that can code in Python, summarize poetry, or detect the subtleties of toxic speech. Use our powerful data labeling workforce and tools to build the rich, human-powered datasets you need today.