I distilled DeepSeek-R1's reasoning ability into Qwen2, and the results were astonishing!
Ⅰ. What is knowledge distillation?

Knowledge distillation is a model compression technique used to transfer knowledge from a large, complex model (the teacher) to a smaller model (the student). The core idea is that the teacher guides the student through its predicted outputs (such as probability distributions or reasoning traces), and the…
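To make the teacher → student idea concrete, here is a minimal sketch of the classic soft-label distillation loss in PyTorch. It is only an illustration of the general technique, not the exact recipe used later in this post; the temperature `T`, the mixing weight `alpha`, and the tensor shapes are assumptions chosen for the example.

```python
# A minimal sketch of soft-label knowledge distillation (assumed setup,
# not the exact training recipe from this post).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soften both distributions with temperature T so the student learns from
    # the teacher's full probability distribution, not just the top-1 answer.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between softened distributions; scaled by T^2 to keep
    # its gradient magnitude comparable to the hard-label term.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: a batch of 4 samples over a 10-way output space.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

The key design choice is blending two signals: the teacher's softened distribution (which carries "dark knowledge" about how classes relate) and the original hard labels, with `alpha` controlling the balance between them.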