Damus
fiatjaf · 152w
So, these language models, when they are being trained, do they need someone telling them what they got wrong and what they got right? How do they know?
mark tyler
There are multiple steps. In the first training step they try to predict the next character in some text. Say they get the first 10 characters of that last sentence; they should reply with an “m”. If they do, reward. The RLHF step does a similar thing, but instead of one character they produce a whole output, and it gets scored on how close it is to the kind of output some subset of humans liked.
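A minimal sketch of those two training signals in plain Python, just to make the idea concrete. The names and the toy reward functions here are illustrative, not any real model's training code; actual training computes a differentiable loss over token probabilities rather than exact matches.

```python
# Step 1 (pretraining): next-character prediction. Every prefix of the
# text pairs with the character that follows it as the training target.
text = "There are multiple steps"
pairs = [(text[:i], text[i]) for i in range(1, len(text))]

print(pairs[9])  # ('There are ', 'm') -- the example from the post

def pretrain_reward(model_guess: str, target: str) -> float:
    """Reward the model when its predicted next character matches."""
    return 1.0 if model_guess == target else 0.0

# Step 2 (RLHF): score a whole generated output at once, using a reward
# model that was fit to human preference judgments. `reward_model` here
# is a hypothetical stand-in for that learned scorer.
def rlhf_reward(full_output: str, reward_model) -> float:
    """Returns a scalar: roughly, how much humans would like this output."""
    return reward_model(full_output)
```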