Damus
fiatjaf · 152w
So, these language models, when they are being trained, do they need someone telling them what they got wrong and what they got right? How do they know?
mark tyler
There are multiple steps. In the first training step they try to predict the next character in some text. Say they get the first 10 characters of that last sentence; they should reply with an “m”. If they do, reward. The RLHF step does a similar thing, but instead of one character they produce a whole output, and it gets scored on how close it is to the kind of output some subset of humans liked.
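A minimal sketch of those two training signals in plain Python, just to make the idea concrete. The names and the toy reward functions here are illustrative, not any real model's training code; actual training computes a differentiable loss over token probabilities rather than exact matches.

```python
# Step 1 (pretraining): next-character prediction. Every prefix of the
# text pairs with the character that follows it as the training target.
text = "There are multiple steps"
pairs = [(text[:i], text[i]) for i in range(1, len(text))]

print(pairs[9])  # ('There are ', 'm') -- the example from the post

def pretrain_reward(model_guess: str, target: str) -> float:
    """Reward the model when its predicted next character matches."""
    return 1.0 if model_guess == target else 0.0

# Step 2 (RLHF): score a whole generated output at once, using a reward
# model that was fit to human preference judgments. `reward_model` here
# is a hypothetical stand-in for that learned scorer.
def rlhf_reward(full_output: str, reward_model) -> float:
    """Returns a scalar: roughly, how much humans would like this output."""
    return reward_model(full_output)
```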