How LLMs Learn from the Internet: The Training Process
In my 2017 book Artificial Intelligence for .NET (co-authored with Nishith), I described how the language models of that era worked. A combination of gradient descent and supervised fine-tuning gave them predictive and classification superpowers within a given domain or context. Those models were next-token predictors in the truest sense.
The attention-based transformer architecture, together with GPUs optimized for machine-learning workloads, revolutionized language models and made today's LLMs possible. The fundamentals of pre-training and teaching models, however, have stayed the same.
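To make that concrete, here is a minimal sketch of the core loop both eras share: a model trained with gradient descent to predict the next token of text. Everything below (the tiny embedding-plus-linear model, the random "corpus", the hyperparameters) is an illustrative assumption of mine, not the method from the book or from Alex Xu's article.

```python
# Minimal next-token prediction training loop (illustrative only).
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 100, 32, 8

class TinyNextTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)   # logits over the vocabulary

    def forward(self, tokens):                  # tokens: (batch, context_len)
        return self.head(self.embed(tokens))    # logits: (batch, context_len, vocab_size)

model = TinyNextTokenModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # plain gradient descent
loss_fn = nn.CrossEntropyLoss()

# Stand-in for tokenized internet text: random token ids.
batch = torch.randint(0, vocab_size, (16, context_len + 1))
inputs, targets = batch[:, :-1], batch[:, 1:]   # predict token t+1 from tokens up to t

for step in range(3):
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()    # backpropagate the prediction error
    optimizer.step()   # one gradient-descent update
    print(f"step {step}: loss {loss.item():.3f}")
```

Modern LLM pre-training scales this same objective to trillions of tokens and billions of parameters; the transformer changes the model inside the loop, not the loop itself.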
It’s incredible how a single breakthrough in software architecture can lead to generational advancements.
A beautifully written article by Alex Xu, and so easy to digest. Highly recommended even if you aren't familiar with the technical details.