Data Selection for Pre-training and Instruction-tuning of LLMs

There is increasing evidence that choosing the right training data is essential for producing state-of-the-art large language models (LLMs). How can we decide what counts as high-quality training data? Can we select fewer examples and improve both performance and efficiency? In this talk, I will present two recent works on selecting high-quality data for pre-training and instruction tuning. I will first present QuRating, a simple framework for selecting pre-training data that captures the abstract qualities of text that humans intuitively perceive. We demonstrate that state-of-the-art LLMs (e.g., GPT-3.5-turbo) can discern these qualities in pairwise judgments, and we emphasize the importance of balancing quality and diversity. We have created QuRatedPajama, a dataset comprising 260 billion tokens with fine-grained quality ratings, and show that sampling according to these ratings improves perplexity and in-context learning. Second, I will present LESS, a method that effectively estimates data influences to identify relevant instruction-tuning data for specific applications (a setting we call “targeted instruction tuning”). LESS is efficient, transferable (a smaller model can be used for data selection), optimizer-aware (it works with Adam), and easy to interpret. We show that training on a LESS-selected 5% of the data often outperforms training on the full dataset across diverse downstream tasks.
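
To make the quality-versus-diversity trade-off concrete, here is a minimal sketch (not the QuRating implementation) of temperature-controlled sampling from per-document quality ratings; the function name, temperature value, and example ratings are all illustrative.

import numpy as np

def sample_by_quality(ratings, n_select, temperature=2.0, seed=0):
    # Sample document indices with probability proportional to
    # exp(rating / temperature): a lower temperature favors top-rated
    # documents (quality), a higher temperature flattens the
    # distribution (diversity).
    rng = np.random.default_rng(seed)
    logits = ratings / temperature
    logits = logits - logits.max()      # for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return rng.choice(len(ratings), size=n_select, replace=False, p=probs)

# Illustrative ratings for six documents; select three of them.
ratings = np.array([0.2, 1.5, -0.3, 2.1, 0.9, 1.1])
print(sample_by_quality(ratings, n_select=3))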
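
Similarly, the sketch below illustrates the general idea behind gradient-based influence scoring for targeted data selection: score each training example by how well its gradient aligns with gradients from a few target-task examples, then keep the top 5%. It assumes precomputed per-example gradient features and uses plain cosine similarity, omitting LESS-specific details such as LoRA gradients, random projection, and Adam-aware preconditioning; all names and numbers are illustrative.

import torch
import torch.nn.functional as F

def influence_scores(train_grads, target_grads):
    # Score each training example by the cosine similarity between its
    # gradient feature and the mean gradient feature of a few examples
    # from the target task.
    target_dir = target_grads.mean(dim=0, keepdim=True)   # shape (1, d)
    return F.cosine_similarity(train_grads, target_dir, dim=1)

def select_top_fraction(train_grads, target_grads, fraction=0.05):
    # Keep the top `fraction` of training examples by influence score.
    scores = influence_scores(train_grads, target_grads)
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k).indices

# Stand-in gradient features; in practice these would be derived from
# the fine-tuning model rather than sampled at random.
train_grads = torch.randn(1000, 128)
target_grads = torch.randn(8, 128)
selected = select_top_fraction(train_grads, target_grads, fraction=0.05)
print(selected.shape)   # indices of the 50 selected training examples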