This video provides a comprehensive yet accessible introduction to large language models (LLMs) like ChatGPT. Karpathy aims to explain the inner workings of LLMs, demystifying their capabilities and limitations, and highlighting potential pitfalls. He covers the entire pipeline of LLM development, from data collection and preprocessing to model training and inference.
LLM Development Pipeline: The video details the three main stages of LLM development: pre-training (gathering and processing internet text data), supervised fine-tuning (training on human-generated conversations), and reinforcement learning (refining the model through trial and error).
Tokenization: Raw text is converted into tokens (numerical representations of text units) through a process called tokenization. The choice of vocabulary size and tokenization method significantly impacts model performance and efficiency.
Neural Network Internals: LLMs use transformer neural networks with billions of parameters. Training involves adjusting these parameters so that the model assigns higher probability to the actual next token in a sequence, given the preceding context.
Inference and Generation: Generating text involves starting with a prefix (input prompt) and iteratively sampling tokens based on the model's probability distribution. This process is stochastic, leading to variations in output even with the same input.
Hallucinations and Mitigations: LLMs can hallucinate (fabricate information). Mitigations include training the model to explicitly state "I don't know" when uncertain and incorporating tools like web search to access external information.
Reinforcement Learning: Reinforcement learning further refines model performance by rewarding successful problem-solving strategies and penalizing incorrect ones. This iterative process helps the model discover optimal approaches to different tasks.
RLHF (Reinforcement Learning from Human Feedback): This technique uses a reward model (a separate neural network) to simulate human preferences, making reinforcement learning scalable even in unverifiable domains (e.g., creative writing).
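To make the reward-model idea concrete, here is a minimal sketch in PyTorch (the tooling, the `backbone` interface, and the pairwise Bradley-Terry-style loss are assumptions of this sketch, not details from the video). The idea is that a scalar scoring head is trained so that responses humans preferred score higher than responses they rejected, and this learned scorer then stands in for human judgment during reinforcement learning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """A transformer backbone plus a scalar head that scores a whole response."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                     # assumed: returns hidden states (batch, seq, hidden)
        self.score_head = nn.Linear(hidden_size, 1)  # maps final hidden state to one scalar

    def forward(self, token_ids):
        hidden = self.backbone(token_ids)
        return self.score_head(hidden[:, -1, :]).squeeze(-1)  # one score per sequence

def preference_loss(score_chosen, score_rejected):
    # Pairwise objective: the response the human preferred should receive
    # a higher score than the one the human rejected.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```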
The FineWeb dataset occupies approximately 44 terabytes of disk space and contains about 15 trillion tokens. The video states that Common Crawl data is a major contributor to many LLM datasets and that it had indexed 2.7 billion web pages by 2024. However, the video does not directly compare the size of the raw Common Crawl data with FineWeb; it only indicates that Common Crawl is the starting point, which undergoes extensive filtering and processing before becoming the FineWeb dataset.
The byte pair encoding algorithm aims to reduce the length of a token sequence while increasing the vocabulary size. It works by identifying frequently occurring pairs of consecutive bytes (or symbols) in the text data. These pairs are then replaced with a new, single symbol (a new token with a unique ID). This process is iterated, creating new symbols for common consecutive symbol pairs, further compressing the sequence. The result is a shorter sequence of tokens, but with a larger vocabulary of possible symbols (tokens). In practice, this often leads to a vocabulary size of around 100,000 tokens, as exemplified by GPT-4's use of 100,277 tokens.
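A minimal sketch of that merge loop in plain Python (real tokenizers such as GPT-4's add regex pre-splitting and other details omitted here; the training text and the number of merges are arbitrary choices for illustration):

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single new token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (base vocabulary of 256 symbols) and merge iteratively.
text = "a toy example of an example about examples"
ids = list(text.encode("utf-8"))
next_id, num_merges = 256, 20
for _ in range(num_merges):
    pair_counts = Counter(zip(ids, ids[1:]))
    if not pair_counts or pair_counts.most_common(1)[0][1] < 2:
        break                                    # no pair repeats: nothing useful left to merge
    top_pair = pair_counts.most_common(1)[0][0]  # most frequent consecutive pair
    ids = merge(ids, top_pair, next_id)          # sequence gets shorter, vocabulary grows by one
    next_id += 1
```

Each iteration trades sequence length for vocabulary size, which is exactly the tradeoff described above.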
During each forward pass in LLM training, the input is a window of tokens whose length can range from zero up to a defined maximum context length (e.g., 8,000 tokens). The output of the forward pass is a probability distribution over every token in the model's vocabulary (roughly 100,000 tokens), representing the model's prediction of what the next token in the sequence is likely to be.
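The same next-token distribution drives generation at inference time. Below is a minimal sketch of that loop, assuming a PyTorch-style `model(ids)` that returns logits of shape (batch, seq_len, vocab_size); the softmax over the last position's logits gives the next-token distribution, and sampling from it is what makes generation stochastic.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=1.0):
    """Autoregressive generation: at each step the model maps the current token
    sequence to a distribution over the vocabulary, and the next token is
    sampled from it (which is why outputs vary from run to run)."""
    ids = prompt_ids                                        # shape (1, seq_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                                 # (1, seq_len, vocab_size); assumed interface
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # stochastic sampling
        ids = torch.cat([ids, next_id], dim=1)              # append and continue
    return ids
```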
The post-training step Karpathy describes, where human labelers create query-answer pairs to train the base model into an assistant, is called Supervised Fine-Tuning (SFT).
While it is a form of fine-tuning, it is not LoRA (Low-Rank Adaptation) or RLHF (Reinforcement Learning from Human Feedback). LoRA is a parameter-efficient fine-tuning technique, and RLHF is a separate, later stage that uses reinforcement learning based on human preferences. SFT specifically involves training the model on curated datasets of prompt-response pairs, directly teaching it desired behaviors through supervised learning.
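As a concrete sketch of what SFT optimizes, here is a PyTorch-style loss function (the `model(input_ids)` interface and the use of -100 as the masking label are assumptions of this sketch): it is ordinary next-token cross-entropy over a prompt-response pair, computed only on the assistant's response tokens so the model learns to produce the answer rather than to reproduce the prompt.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Supervised fine-tuning objective: next-token cross-entropy, with prompt
    (and padding) positions masked out via label -100 so only the assistant's
    response tokens contribute to the loss."""
    logits = model(input_ids)                            # (batch, seq_len, vocab_size); assumed interface
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),  # logits at position t ...
        labels[:, 1:].reshape(-1),                       # ... are trained to predict the token at t+1
        ignore_index=-100,                               # skip masked positions
    )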
The method Karpathy describes for mitigating hallucinations works in three steps: the model is prompted with a source document in context to generate question-answer pairs, then interrogated on those questions without the context to identify knowledge gaps, and finally fine-tuned with "I don't know" as the correct answer for the questions it could not reliably answer. This is a process of probing and fine-tuning for factuality.
While there is no single, universally standardized name for it (unlike "SFT" or "RLHF"), this approach falls under the umbrella of data curation and fine-tuning to improve model factuality and reduce hallucinations. The core idea is to empirically identify the model's knowledge boundaries and then explicitly train it to handle those boundaries by admitting ignorance.
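A sketch of that probing loop in Python; the three callables are hypothetical helpers standing in for the prompting and grading steps: `generate_qa_pairs(doc)` prompts the model *with* the document in context to produce (question, reference answer) pairs, `ask_without_context(question)` queries it with no supporting context, and `answers_match(a, b)` grades an attempt (often with another LLM as judge).

```python
def build_idk_examples(generate_qa_pairs, ask_without_context, answers_match,
                       documents, num_attempts=3):
    """Collect 'I don't know' training examples by probing for knowledge gaps."""
    idk_examples = []
    for doc in documents:
        for question, reference_answer in generate_qa_pairs(doc):
            # Interrogate the model several times without the source document.
            attempts = [ask_without_context(question) for _ in range(num_attempts)]
            # If it cannot reliably reproduce the answer, it does not actually
            # "know" this fact: add a training example that admits ignorance.
            if not all(answers_match(a, reference_answer) for a in attempts):
                idk_examples.append({"prompt": question, "response": "I don't know."})
    return idk_examples
```

The resulting examples are then mixed into the supervised fine-tuning data, so the model learns to decline rather than fabricate when a question falls outside what it reliably knows.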