Introduction
The idea that large language models (LLMs) are running out of training data is a common concern, but advancements in the use of synthetic data suggest that we are still far from this eventuality. (Yay!)
The Phi-4 [4] and Yi [3] technical reports highlight how synthetic data offers significant room for improvement in LLMs, both in terms of quantity and quality.
Let’s see what new things they explored!
Synthetic data enables more scaling!
Diversity and Control: Synthetic data is not merely a cheap substitute for organic data; it offers unique advantages. It allows for greater diversity and more precise control over content, making it possible to generate targeted data that improves specific model capabilities. The Phi-4 paper employs techniques such as multi-agent prompting, self-review, and instruction inversion, which enable the creation of datasets focused on reasoning and problem-solving, overcoming the limitations of unsupervised data (a sketch of instruction inversion follows below).
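To make the instruction-inversion idea concrete, here is a minimal Python sketch. The `complete` helper is a hypothetical stand-in for whatever LLM call you have available; this is my illustration, not the Phi-4 team's actual pipeline or prompts.

```python
def complete(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. any chat-completions API)."""
    raise NotImplementedError  # hypothetical helper; wire up your own client

def invert_instruction(solution_text: str) -> dict:
    """Instruction inversion: given an existing solution (e.g. a code snippet
    or a worked answer found in organic data), ask the model to write the
    instruction that would have produced it, yielding a synthetic
    (instruction, response) training pair."""
    prompt = (
        "Here is a solution:\n\n"
        f"{solution_text}\n\n"
        "Write a clear, self-contained instruction or question that this "
        "solution answers. Output only the instruction."
    )
    return {"instruction": complete(prompt), "response": solution_text}
```

The appeal is that plentiful "answers" already sitting in organic data (code, proofs, explanations) become supervised pairs without human labeling.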
Structured and Gradual Learning: In organic datasets, the relationship between tokens is often complex and indirect. Many reasoning steps may be required to connect the current token to the next, making it challenging for the model to learn effectively from next-token prediction. By contrast, each token generated by a language model is by definition predicted by the preceding tokens, making it easier for a model to follow the resulting reasoning patterns. In this way, synthetic data may act as a form of “spoonfeeding,” presenting challenges in a digestible and progression-oriented manner.
Alignment with Inference Contexts: The use of synthetic data ensures that model training is more aligned with the typical modes of interaction during inference. For example, a model trained with synthetic data in a chatbot style will respond better in chat contexts.
Concrete Examples and Advanced Techniques
Creating Synthetic Data in Yi [3]: The Yi model uses a pre-training data-cleaning pipeline built on sophisticated engineering, with cascaded filtering and deduplication. Additionally, the model is fine-tuned on a small (fewer than 10,000 items) but high-quality instruction dataset, demonstrating that quality and care in data preparation can outweigh sheer scale.
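To give a flavor of what a cascaded filter-then-deduplicate pass looks like, here is a deliberately minimal sketch; the real Yi system is far more elaborate, and the heuristics and thresholds below are my own illustrative assumptions.

```python
import hashlib

def passes_quality_filters(doc: str) -> bool:
    """Cheap heuristic filters run first, before more expensive stages."""
    if len(doc) < 200:                      # too short to be useful
        return False
    letters = sum(c.isalpha() for c in doc)
    if letters / len(doc) < 0.6:            # mostly markup or noise
        return False
    return True

def deduplicate(docs: list[str]) -> list[str]:
    """Exact deduplication via content hashing (production pipelines add
    fuzzy dedup, e.g. MinHash, on top of this)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def clean_corpus(corpus: list[str]) -> list[str]:
    """The cascade: filter first, then dedup the survivors."""
    return deduplicate([d for d in corpus if passes_quality_filters(d)])
```

The ordering matters: cheap filters shrink the corpus before the (comparatively) expensive deduplication pass runs.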
Innovative Techniques in Phi-4: The Phi-4 model uses a variety of advanced techniques to generate synthetic data, including creating question-answer pairs from diverse sources and self-review guided by specific rubrics. Another key innovation is "pivotal token search" (PTS), which identifies the tokens that most change the probability of reaching a correct answer and builds Direct Preference Optimization (DPO) preference pairs centered on them.
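Here is a rough sketch of the core PTS idea as I understand it. The `sample_completions` and `is_correct` helpers are assumptions for illustration; the actual Phi-4 algorithm locates pivots more efficiently and applies further filtering.

```python
def success_prob(prefix_tokens, sample_completions, is_correct, n=32):
    """Monte Carlo estimate of p(correct final answer | prefix)."""
    completions = sample_completions(prefix_tokens, n=n)
    return sum(is_correct(c) for c in completions) / n

def find_pivotal_tokens(solution_tokens, sample_completions, is_correct,
                        threshold=0.3):
    """Flag tokens whose addition swings the estimated success probability
    by more than `threshold`; DPO preference pairs are then constructed at
    exactly those positions instead of over whole responses."""
    pivots = []
    p_prev = success_prob([], sample_completions, is_correct)
    for i in range(1, len(solution_tokens) + 1):
        p = success_prob(solution_tokens[:i], sample_completions, is_correct)
        if abs(p - p_prev) >= threshold:
            pivots.append(i - 1)  # the token just appended is pivotal
        p_prev = p
    return pivots
```

The intuition: most tokens in a long solution barely matter, so concentrating the preference signal on the few tokens that actually flip the outcome gives a much cleaner training signal.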
Curation of Organic Data: Both papers underline the importance of filtering and curating organic data, such as web pages, scientific articles, and books, which serve as "seeds" for the generation of synthetic data and also for direct training. The quality and cleanliness of organic data are crucial for obtaining effective synthetic datasets.
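As one hypothetical illustration of the "seeds" idea (reusing the `complete` stand-in from the earlier sketch): a curated organic document seeds synthetic question-answer pairs.

```python
def qa_pairs_from_seed(seed_document: str, complete, n_questions: int = 3):
    """Generate questions grounded in a curated organic document, then
    answer each one, yielding synthetic (question, answer) pairs."""
    q_prompt = (
        f"Read the following text:\n\n{seed_document}\n\n"
        f"Write {n_questions} challenging questions it can answer, "
        "one per line."
    )
    questions = complete(q_prompt).splitlines()
    pairs = []
    for q in (s.strip() for s in questions if s.strip()):
        answer = complete(f"Text:\n{seed_document}\n\nQuestion: {q}\nAnswer:")
        pairs.append({"question": q, "answer": answer})
    return pairs
```

This is why the cleanliness of the seed corpus matters so much: garbage seeds produce garbage synthetic pairs at scale.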
Unlike previous models in the Phi family, which distilled the capabilities of a teacher model (GPT-4), Phi-4 substantially surpasses its teacher on STEM-focused QA, demonstrating that its data-generation and post-training techniques go beyond distillation.
Additionally, Phi-4 shows strong performance on reasoning-focused benchmarks, thanks to improvements in data, training curriculum, and innovations in post-training.
Side note: the internet keeps on growing
We also keep producing tons of data, especially images and videos, at larger scales than ever before. There's still room to grow, in my opinion. (See all those OpenAI / Anthropic job openings around multimodal data…)
Conclusions
Synthetic data is not a simple alternative to organic data, but represents a powerful tool for the growth and refinement of LLMs.
The emphasis on quality, diversity, and the ability to control content, along with the use of advanced generation techniques, offers a promising path past the perceived limits on training-data availability.
I really think we can keep scaling up! :)
What do you think?
It's also quite interesting to see how similar approaches to data collection proved very effective in the much-hyped DeepSeek models.
Ref: DeepSeekMath, https://arxiv.org/abs/2402.03300