Self-Rewarding Language Models are a recent and extremely promising advance in artificial intelligence, introduced by Meta, the company behind Facebook, WhatsApp, and the Ray-Ban Meta smart glasses.
Despite being at least an order of magnitude smaller, their fine-tuned Llama 2 70B model has outperformed models such as Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard.
But that is hardly the real breakthrough: these new models also look like a plausible route to the first superhuman LLMs, even if it means humans are one step closer to losing total control over our finest AI models.
Most of the insights I post on Medium, like this one, were first published in my weekly newsletter, The Tech Oasis.
It's for you if you want to stay current with the fast-paced field of artificial intelligence (AI) and feel motivated to take action, or at the very least prepared for what lies ahead.
The Rise of a New Alignment Method
Humans are still essential to the design of any frontier model, such as ChatGPT or Claude, and alignment is the reason why.
As I described in my newsletter two weeks ago, human preference training is one of the final stages of the training process that our top language models go through.
In short, by teaching our models to respond the way a human expert would, we make them more useful and reduce the risk of harmful responses.
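To make that concrete, here is a minimal sketch of the standard pairwise (Bradley-Terry) objective commonly used to train a reward model on human comparisons. This is an illustration in plain PyTorch, not Meta's actual implementation; the function and variable names are my own.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # Push the model to score the human-preferred response higher
    # than the rejected one (Bradley-Terry pairwise objective).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to two answers per prompt.
chosen = torch.tensor([1.7, 0.9])    # responses the human expert preferred
rejected = torch.tensor([0.3, 1.1])  # responses the expert rejected
print(pairwise_preference_loss(chosen, rejected))  # lower loss = better ranking
```

The core training signal is simply that pairwise comparison: the model is rewarded for ranking the expert's choice above the alternative.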
There is far more detail behind the previous link, but the main idea is that we need to build an expensive human preference dataset: essentially a collection of pairs of responses to each prompt, with a human expert labeling which of the two is better.
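As a sketch of what such a dataset looks like, here is a single hypothetical record; the field names are illustrative only and not tied to any specific dataset or library.

```python
# One record in a (hypothetical) human preference dataset.
preference_record = {
    "prompt": "Explain model alignment in one sentence.",
    "response_a": "Alignment means training a model so its answers match "
                  "what helpful, honest humans would want.",
    "response_b": "Alignment is when you line up the GPUs in the rack.",
    "preferred": "response_a",  # the label provided by a human expert
}

# The full dataset is just thousands of comparisons like this one,
# which is why collecting it with human experts is so expensive.
preference_dataset = [preference_record]
```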