Model distillation (also called knowledge distillation) is a technique for transferring knowledge from a large, powerful AI model (the "teacher") to a smaller, more efficient model (the "student"). The student is trained to mimic the teacher's outputs, such as its predicted probabilities or generated responses, instead of (or in addition to) learning only from the labels in the raw data. This lets it approach the teacher's performance at a fraction of the size and computational cost. Distillation is widely used to create fast, inexpensive models suitable for production deployment. Examples include distilling GPT-4-level capabilities into smaller models and creating specialized models from general-purpose ones. It is one of the key techniques behind the trend toward increasingly capable small models.
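To make the idea of a student mimicking a teacher more concrete, here is a minimal sketch of one common way to write a distillation loss for a classifier. It is an illustration under stated assumptions, not the method used by any particular model: the function names, the `temperature` and `alpha` parameters, and the example logits are all hypothetical. The loss mixes a "soft" term (how far the student's softened probabilities are from the teacher's) with a "hard" term (ordinary cross-entropy against the true label).

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher-matching) term and a hard (label) term.

    `temperature` and `alpha` are illustrative defaults, not recommended values.
    """
    # Soft term: match the teacher's temperature-softened distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Scaling by temperature**2 is a common convention that keeps the soft
    # term's gradients on a similar scale as the temperature changes.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    # Hard term: standard cross-entropy against the ground-truth label.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[true_label])

    return alpha * soft_loss + (1 - alpha) * hard_loss


# Hypothetical logits for a 3-class problem where class 0 is correct.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = distillation_loss(close_student, teacher, true_label=0)
loss_far = distillation_loss(far_student, teacher, true_label=0)
print(loss_close < loss_far)
```

A student whose logits track the teacher's gets a lower loss than one that disagrees with it, so minimizing this loss during training pulls the student toward the teacher's behavior. In practice the same idea is usually implemented with a deep learning framework and applied to batches of examples.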
Frequently Asked Questions
What is model distillation?
Model distillation transfers knowledge from a large AI model (teacher) to a smaller one (student), creating efficient models that approach the performance of much larger ones at lower cost.
Why use distillation instead of the original model?
Distilled models are smaller, faster, and cheaper to run while retaining most of the original model's capabilities. This makes them practical for production use, edge deployment, and cost-sensitive applications.