In the previous article we explained how GPT itself (like GPT-3 and GPT-3.5) works, not ChatGPT (or GPT-4). So, what’s the difference?
Base models for generative text AI are always trained on tasks like “next word prediction” (or “masked language modelling”). This is because the data for this task is abundant: all the sentences we produce on a daily basis, available in Internet databases, documents and web pages. No labeling is necessary, and that is of vital importance: there is no feasible way to get a training set of trillions of manually labeled examples.
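To make “next word prediction” concrete, here is a minimal sketch of the training objective in PyTorch. The model itself is replaced by random placeholder logits; only the loss computation is the real thing.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the "next word prediction" objective.
# `logits` is what a language model produces: one score per vocabulary
# word, for every position in the input sequence. Here they are random
# placeholders; only the loss computation below is the actual objective.
vocab_size = 50_000
seq_len = 8
logits = torch.randn(1, seq_len, vocab_size)            # stand-in for model output
token_ids = torch.randint(0, vocab_size, (1, seq_len))  # the text, as token ids

# The target at position t is simply the token at position t + 1,
# so we shift by one and compute a standard cross-entropy loss.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())  # the number the base model is trained to minimise
```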
But a task like “next word prediction” creates a capability vs alignment problem. We literally train our model to predict the next word. Is this what we actually want or expect as a human? Most definitely not.
Let’s consider some examples.
What would be the next word for the prompt “What is the gender of a manager?” Since the base model was trained on all kinds of texts, many of them decades old, we know for a fact that the training data contained a lot of bias relating to this question. Because of that bias, the output “male” will statistically be much more probable than “female”.
Or let’s ask it for the next word for “The United States went to war with Liechtenstein in”. Statistically, the most likely outputs after this “in” are year numbers, not “an” (as you would need to reach “an alternate universe”) or any other word. But since the task it was given was to predict the statistically most likely next word given the data it was trained on, it’s doing an awesome job here if it outputs some year, no? 100% correct, 100% capable.
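You can see this for yourself by inspecting the next-word distribution of a base model. Here is a quick sketch using the publicly available GPT-2 as a stand-in (the GPT-3.5 base model is not publicly downloadable, so the exact probabilities will differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a stand-in for a base model (GPT-3's weights are not public).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The United States went to war with Liechtenstein in"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # one score per vocabulary token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the 10 most probable continuations -- mostly years and similar tokens.
top = torch.topk(next_token_probs, k=10)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}  {prob.item():.3f}")
```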
The problem is that we don’t really want it to predict the next most likely word. We want it to give a proper answer based on our human preferences. There is a clear divergence here between the way the model was trained and the way we want to use it. It’s inherently misaligned. Predicting the next word is not the same task as giving a truthful, harmless, helpful answer.
This is exactly what ChatGPT tries to fix, and it tries to do so by learning to mimic human preference.
In case you expect a big revelation now, like the model learning to reason, or at least learning to reason about itself, you’re in for a disappointment: it’s actually just more of the same ANN stuff, but a little different.
Very briefly, an approach using transfer learning (see before) was developed in which the GPT-3.5 model was finetuned to “learn which responses humans like based on human feedback”. As we said before, transfer learning means freezing the first layers of the neural network and only finetuning the later ones (in the case of GPT-3 this is almost the entire model, since even finetuning the last 10B parameters is prohibitively expensive).
The first step was to create a finetuned model using standard transfer learning based on labeled data. A dataset of about 15 000 <prompt, ideal human response> pairs was created for this. This finetuned model can already start outputting responses that are more favoured by humans (more truthful, more helpful, with less risk of being harmful) than the base model. Creating the dataset for this model was, however, already a huge task: for each prompt, a human needed to do some intellectual work to make sure the response could be considered an “ideal human response”, or at least come close enough. The problem with this approach is that it doesn’t scale.
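As an illustration of this first step, here is a minimal sketch, assuming GPT-2 as a stand-in for the real base model and a single made-up pair in place of the real <prompt, ideal response> dataset. The frozen-layer split and hyperparameters are illustrative guesses, not the actual training setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1 sketch: supervised finetuning on <prompt, ideal response> pairs.
# GPT-2 stands in for the real base model; the single pair below is a toy
# placeholder for the ~15 000 human-written pairs mentioned above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# "Transfer learning": freeze everything except the last two transformer blocks.
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

pairs = [
    ("What is the gender of a manager?",
     "A manager can be of any gender; the role says nothing about it."),
]

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5)

for prompt, ideal_response in pairs:
    text = prompt + "\n" + ideal_response + tokenizer.eos_token
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    # labels = input_ids: the objective is still next-word prediction, but now
    # on text that ends the way a human would want it to end.
    loss = model(input_ids, labels=input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```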
The second step then was to learn a reward model.
As we saw before, a trained generative text AI outputs a probability distribution over the next word. So instead of asking it for the most probable next word, you can also ask it for the second most probable next word, or for the 10 most probable next words, or for 100 words that are probable enough given some threshold. You can sample it for as many responses to a prompt as you want.
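Here is a sketch of that sampling step, again with GPT-2 as a publicly available stand-in: one prompt, several sampled candidate responses.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sampling several candidate responses for one prompt (GPT-2 as a stand-in).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")

# Instead of always taking the single most probable next word, sample:
# top_k keeps only the 50 most probable words at each step, and we draw
# 6 different continuations for the same prompt.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,
    max_new_tokens=40,
    num_return_sequences=6,
    pad_token_id=tokenizer.eos_token_id,
)
for i, out in enumerate(outputs):
    print(f"--- candidate {i} ---")
    print(tokenizer.decode(out, skip_special_tokens=True))
```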
To get a dataset to train this reward model, 4 to 9 possible responses to each prompt were sampled, and for each prompt humans ordered these responses from least favourable to most favourable. This approach scales much, much better than having humans write ideal responses manually. The fact that the labellers ordered the responses might make it a bit unclear what the reward model actually is, but it is exactly what you would expect: it outputs a score on a “human preference scale” for each text, and the higher the score, the more preferable the text. The reason people were asked to order the responses instead of assigning a score directly is that different humans always give different scores to such “free text”. Using an ordering plus something like an Elo system (i.e. a standardised ranking system) works much better for calculating a consistent score per response than manually assigning a number, especially when multiple humans are involved in the scoring.
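To make the reward model concrete, here is a rough sketch of the pairwise comparison loss used for this kind of model in OpenAI’s InstructGPT work: the model scores two responses to the same prompt, and training pushes the score of the human-preferred one above the other. The tiny network and the random “embeddings” are placeholders; in reality the reward model is a full language model with a single scalar output head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Placeholder reward model: in reality this is a full language model
    whose output layer is replaced by a single scalar 'preference score'."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.encoder = nn.Linear(embedding_dim, 64)   # stand-in for the transformer
        self.score_head = nn.Linear(64, 1)            # one number: the preference score

    def forward(self, text_embedding):
        return self.score_head(torch.tanh(self.encoder(text_embedding))).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# One training example: embeddings of two responses to the same prompt,
# where humans ranked `preferred` above `rejected`.
preferred = torch.randn(1, 128)   # placeholder for the better response
rejected = torch.randn(1, 128)    # placeholder for the worse response

# Pairwise ranking loss: push the preferred score above the rejected score.
loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```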
In a third step, another finetuned model is created, starting from the model of the first step and using the reward model as the training signal. This third model is even better at outputting human-preferred responses.
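In the research this procedure is based on (OpenAI’s InstructGPT work), this third step is done with reinforcement learning (PPO). A full PPO loop is too much for a short example, so here is a heavily simplified REINFORCE-style sketch of the core idea only: sample a response, score it with the reward model, and nudge the model toward higher-scoring responses. The `reward_model_score` function is a dummy placeholder for the step-2 reward model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 3, heavily simplified. The real system uses PPO; this sketch only shows
# the core loop: sample -> score with the reward model -> push the model
# toward responses the reward model likes.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the step-1 model
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def reward_model_score(text: str) -> float:
    """Placeholder: the reward model from step 2 would score `text` here."""
    return float(len(text.split()))  # dummy score, for illustration only

prompt = "What is the gender of a manager?"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# 1. Sample a response from the current model.
with torch.no_grad():
    full_ids = policy.generate(prompt_ids, do_sample=True, max_new_tokens=30,
                               pad_token_id=tokenizer.eos_token_id)
response_ids = full_ids[:, prompt_ids.shape[1]:]
response_text = tokenizer.decode(response_ids[0], skip_special_tokens=True)

# 2. Score it with the reward model.
reward = reward_model_score(response_text)

# 3. Increase the log-probability of this response in proportion to its reward.
logits = policy(full_ids).logits[:, prompt_ids.shape[1] - 1:-1, :]
log_probs = torch.log_softmax(logits, dim=-1)
chosen = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
loss = -(reward * chosen.sum())
loss.backward()
optimizer.step()
```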
Step 1 is only done once; steps 2 and 3 can be repeated iteratively to keep improving the model.
Using this transfer learning approach, we end up with a system – ChatGPT – that is actually much better than the base model at creating responses preferred by humans: responses that are indeed more truthful and less biased.
If you don’t get how this can work, the answer is in the embeddings again. This model has some kind of deep knowledge about concepts. So if we consistently rank responses that contain a male/female bias worse than ones without such bias, the model will actually pick up that pattern and apply it quite generally, and the bias will be removed quite successfully (except in cases where the output appears in an entirely different context, such as inside a computer program).