Build a Large Language Model Reading Notes 1.3

1. Purpose and Differences between Pre-training and Fine-tuning

  • Pre-training is the foundation of the model-building process, utilizing large-scale unlabeled datasets such as CommonCrawl or Wikipedia. The model learns the structure and context of language through tasks like next-word prediction, enabling it to acquire broad language knowledge.

    • Self-supervised learning allows the model to learn using the natural structure of data, where the next word in a sentence serves as a label. This method efficiently leverages large amounts of unlabeled data, making it the core of the pre-training phase.
  • Fine-tuning is the process of adapting the pre-trained model for specific tasks. While the pre-trained model has foundational language skills, it may not perform well on tasks like translation or text classification. Fine-tuning involves training the model on smaller, labeled datasets using two primary methods:

    1. Instruction fine-tuning: Fine-tuning the model using question-answer pairs or instructions (e.g., "Translate this sentence").
    2. Classification fine-tuning: Used for classification tasks such as sentiment analysis or spam detection, training the model to output specific categories based on input text.
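The self-supervised idea above can be sketched in a few lines: from raw, unlabeled text, each "label" is simply the next word after a fixed-size context window. This toy example (made-up sentence and window size) shows how input-target pairs fall out of the data itself.

```python
# Sketch: turning raw text into self-supervised training pairs.
# Each target is simply the next token -- no human labeling needed.
tokens = "the model learns the structure of language".split()

context_size = 3
pairs = []
for i in range(len(tokens) - context_size):
    context = tokens[i : i + context_size]  # model input
    target = tokens[i + context_size]       # "label" = next word
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```

Every position in the corpus yields a training example for free, which is why pre-training can exploit huge unlabeled datasets.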

2. Data Preprocessing and Model Architecture Implementation

  • Data preprocessing is the first step in training an LLM. It involves tokenizing the text, breaking it into units (words or subwords called “tokens”) that the model can process. This step is crucial for representing the complexity of language in a way that the model can understand.
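A minimal word-level tokenizer illustrates the preprocessing step. Note this is only a sketch: production LLMs use subword schemes such as byte-pair encoding (BPE), not simple splitting.

```python
import re

def simple_tokenize(text):
    # Split on whitespace and punctuation, keeping punctuation as tokens.
    # Real LLM tokenizers use subword schemes (e.g. BPE); this word-level
    # split is only for illustration.
    parts = re.split(r'([,.:;?!"()\']|\s)', text)
    return [p for p in parts if p.strip()]

def build_vocab(tokens):
    # Map each unique token to an integer ID the model can process.
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

tokens = simple_tokenize("Hello, world. This is a test.")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(tokens)
print(ids)
```

The integer IDs are what the model actually consumes; an embedding layer then maps each ID to a vector.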

  • Attention mechanism is the core of the Transformer architecture. It allows the model to focus on different parts of the input text when generating a word. By computing self-attention weights, the model identifies which words in the input are most relevant for the current task.

  • A key feature of the self-attention mechanism is its ability to capture long-range dependencies, allowing the model to understand relationships between words that are far apart in a sentence. This helps in better understanding the context, such as when the first word in a sentence influences the last word.
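The attention computation above can be sketched without any framework: score every token pair with a dot product, normalize the scores with softmax, and take the weighted sum of token vectors. The trainable query/key/value projections of a real Transformer are omitted here for clarity, and the embeddings are made-up numbers.

```python
import math

def softmax(xs):
    # Numerically stable softmax: normalize scores into weights summing to 1.
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    # For each token, score every token (including itself) by dot product,
    # turn the scores into attention weights, and build a context vector
    # as the weighted sum of all token embeddings.
    contexts = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) for key in embeddings]
        weights = softmax(scores)
        context = [
            sum(w * vec[d] for w, vec in zip(weights, embeddings))
            for d in range(len(query))
        ]
        contexts.append(context)
    return contexts

# Toy 3-token sequence with 2-dimensional embeddings.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(emb):
    print([round(x, 3) for x in row])
```

Because every token attends to every other token in one step, distance in the sentence does not matter, which is how long-range dependencies are captured.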

  • Transformers consist of encoders and decoders, where the encoder processes the input and the decoder generates the output. In LLMs like GPT, only the decoder is used for text generation.

3. Resource Demands and Optimization for Pre-training

  • Pre-training an LLM requires immense computational resources; models like GPT-3 may cost millions of dollars to train. For educational purposes, the book demonstrates training on smaller datasets and shows how to load pre-trained model weights, reducing the need for extensive resources.

  • Autoregressive models, like GPT, generate one word at a time, where each generated word is fed back as input for the next prediction. This approach helps keep the generated text coherent and consistent. The model's training task, next-word prediction, is what teaches it the patterns of language.
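The autoregressive loop can be sketched as follows. The "model" here is a hypothetical bigram lookup table standing in for a trained next-word predictor; the point is the feedback loop, where each prediction is appended to the context and used for the next step.

```python
# Stand-in for a trained next-word predictor (hypothetical toy table).
next_word = {
    "the": "model",
    "model": "predicts",
    "predicts": "the",
}

def generate(prompt, max_new_words):
    words = prompt.split()
    for _ in range(max_new_words):
        prediction = next_word.get(words[-1])
        if prediction is None:  # no known continuation
            break
        words.append(prediction)  # feed the prediction back as input
    return " ".join(words)

print(generate("the", 4))  # -> "the model predicts the model"
```

A real GPT replaces the lookup table with a Transformer that outputs a probability distribution over the whole vocabulary at each step.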

4. Why Build Custom LLMs?

  • Building custom LLMs helps to understand how models work and enhances flexibility in real-world applications. Fine-tuning open-source pre-trained models also addresses data privacy concerns, since sensitive data need not be uploaded to third-party servers.

  • Custom models can also achieve better performance, especially in specialized domains. For instance, models like BloombergGPT (for finance) or medical-specific models outperform general-purpose LLMs by capturing the specific language patterns of those fields more accurately.

  • Local deployment of smaller LLMs on devices can reduce latency, lower server costs, and ensure data privacy. Companies like Apple are exploring the possibility of running LLMs directly on personal devices.

5. LLM Implementation from a Code Perspective

  • The chapter lays the groundwork for implementing LLMs in code. While pre-training GPT-like models typically requires massive computational resources, developers can bypass this step by loading pre-trained model weights and fine-tuning them using frameworks like PyTorch. This flexibility allows for building custom models with limited resources.
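One common fine-tuning pattern in PyTorch is to freeze the pre-trained weights and train only a small new output head. The sketch below uses a made-up stand-in backbone rather than a real GPT (in practice the weights would come from a checkpoint via `model.load_state_dict(torch.load(path))`); the freezing mechanics are the point.

```python
import torch.nn as nn

# Stand-in for a pre-trained GPT-like backbone (hypothetical sizes:
# vocab 1000, embedding dim 32, context length 8, hidden dim 64).
backbone = nn.Sequential(
    nn.Embedding(1000, 32),
    nn.Flatten(),
    nn.Linear(32 * 8, 64),
    nn.ReLU(),
)

# Freeze the "pre-trained" weights so gradients skip them entirely.
for p in backbone.parameters():
    p.requires_grad = False

# New classification head: 2 classes (e.g. spam / not spam).
head = nn.Linear(64, 2)
model = nn.Sequential(backbone, head)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only the head: 64*2 + 2 = 130
```

Because only the head's 130 parameters receive gradients, this kind of fine-tuning runs comfortably on modest hardware, which is exactly the flexibility the chapter describes.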