Combining probability-level and logit-level knowledge distillation losses can hinder performance due to conflicting gradients; the proposed Dual-Head Knowledge Distillation (DHKD) method overcomes this by using separate classification heads for each loss, improving knowledge transfer and student model accuracy.
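A minimal PyTorch sketch of the dual-head idea: the shared backbone feeds two classification heads so the probability-level and logit-level losses never back-propagate into the same classifier. The layer sizes, the MSE stand-in for the logit-level loss, and the loss weights are illustrative assumptions, not DHKD's exact formulation.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    """Student backbone with two classification heads: one supervised by the
    probability-level KD loss, the other by a logit-level loss (sketch)."""
    def __init__(self, in_dim=784, feat_dim=128, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.head_prob = nn.Linear(feat_dim, num_classes)   # probability-level head
        self.head_logit = nn.Linear(feat_dim, num_classes)  # logit-level head

    def forward(self, x):
        f = self.backbone(x)
        return self.head_prob(f), self.head_logit(f)

def dual_head_kd_loss(z_prob, z_logit, teacher_logits, labels, T=4.0, alpha=1.0, beta=1.0):
    # Probability-level KD: softened KL divergence on the first head.
    kd = F.kl_div(F.log_softmax(z_prob / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    # Logit-level matching on the second head (MSE used as a simple stand-in).
    logit_match = F.mse_loss(z_logit, teacher_logits)
    # Ordinary cross-entropy on the probability-level head for the task labels.
    ce = F.cross_entropy(z_prob, labels)
    # Keeping the two KD terms on separate heads prevents their gradients
    # from colliding on a single classifier.
    return ce + alpha * kd + beta * logit_match
```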
This paper introduces an information-theoretic framework for quantifying and optimizing how much task-relevant knowledge is transferred during knowledge distillation.
Introduces Proxy-KD, a knowledge distillation technique that uses a proxy model to efficiently extract knowledge from closed, black-box large language models (LLMs). Proxy-KD first aligns the proxy model with the black-box LLM and then uses the aligned proxy to transfer knowledge to a small LLM. Experiments show that Proxy-KD outperforms existing black-box and white-box knowledge distillation methods, opening up new possibilities for leveraging closed LLMs.
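A schematic sketch of the two-stage pipeline, assuming the black-box teacher's responses have already been collected as token-id pairs; the actual alignment and distillation objectives used by Proxy-KD are not reproduced here.

```python
import torch
import torch.nn.functional as F

def align_proxy(proxy, teacher_pairs, optimizer):
    """Stage 1 (sketch): fine-tune an open proxy model on (prompt, response)
    pairs collected from the black-box teacher so the proxy tracks the teacher.
    `teacher_pairs` yielding (input_ids, target_ids) tensors is an assumed format."""
    proxy.train()
    for input_ids, target_ids in teacher_pairs:
        logits = proxy(input_ids)                              # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
        optimizer.zero_grad(); loss.backward(); optimizer.step()

def distill_from_proxy(proxy, student, batches, optimizer, T=2.0):
    """Stage 2 (sketch): white-box distillation from the aligned proxy into the
    small student by matching softened next-token distributions."""
    proxy.eval()
    for input_ids in batches:
        with torch.no_grad():
            t_logits = proxy(input_ids)
        s_logits = student(input_ids)
        loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                        F.softmax(t_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```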
Over-parameterizing student models with Matrix Product Operators (MPO) during knowledge distillation helps them absorb knowledge from larger teacher models more effectively, improving performance without increasing inference latency.
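A simplified sketch of training-time over-parameterization: a student linear layer is expanded into a product of factors during distillation and merged back into a single matrix for inference. A genuine MPO/tensor-train decomposition uses a chain of higher-order cores; the two-factor form and the inner width below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OverparamLinear(nn.Module):
    """Training-time over-parameterization of a linear layer as a product of
    two factors (a simplified stand-in for a full MPO decomposition)."""
    def __init__(self, in_features, out_features, inner=4096):
        super().__init__()
        self.a = nn.Linear(in_features, inner, bias=False)
        self.b = nn.Linear(inner, out_features, bias=True)

    def forward(self, x):
        return self.b(self.a(x))

    def merge(self) -> nn.Linear:
        """Collapse the factors into a single linear layer after training,
        so inference cost matches the original student architecture."""
        merged = nn.Linear(self.a.in_features, self.b.out_features)
        with torch.no_grad():
            merged.weight.copy_(self.b.weight @ self.a.weight)
            merged.bias.copy_(self.b.bias)
        return merged
```

Because the two factors contain no nonlinearity between them, merging changes nothing about the function the layer computes; only the training dynamics differ.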
Performance-Guided Knowledge Distillation (PGKD) uses large language models (LLMs) to improve the accuracy and efficiency of smaller models on multi-class text classification, particularly when labeled data is limited, while substantially reducing inference cost and latency.
This paper proposes a novel knowledge distillation framework called Block-wise Logit Distillation (Block-KD) that bridges the gap between logit-based and feature-based distillation methods, achieving superior performance by implicitly aligning features through a series of intermediate "stepping-stone" models.
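A rough sketch of one plausible reading of the stepping-stone idea: after each student block, the intermediate feature is projected and routed through the remaining teacher blocks, and the resulting hybrid logits are distilled against the teacher's output. Equal block counts, the projector design, and the loss weighting are assumptions, not the paper's exact construction.

```python
import torch.nn.functional as F

def blockwise_logit_kd(student_blocks, teacher_blocks, projectors,
                       student_head, teacher_head, x, teacher_logits, T=4.0):
    """Sketch of block-wise logit distillation via 'stepping-stone' hybrids:
    student prefix -> projector -> teacher suffix -> teacher head."""
    losses = []
    h = x
    for i, blk in enumerate(student_blocks):
        h = blk(h)
        z = projectors[i](h)                 # map into the teacher's feature space
        for tblk in teacher_blocks[i + 1:]:  # finish the forward pass with teacher blocks
            z = tblk(z)
        stepping_logits = teacher_head(z)
        losses.append(F.kl_div(F.log_softmax(stepping_logits / T, dim=1),
                               F.softmax(teacher_logits / T, dim=1),
                               reduction="batchmean") * T * T)
    # Plain logit distillation on the student's own output as well.
    student_logits = student_head(h)
    losses.append(F.kl_div(F.log_softmax(student_logits / T, dim=1),
                           F.softmax(teacher_logits / T, dim=1),
                           reduction="batchmean") * T * T)
    return sum(losses)
```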
Presents Multi-Level Feature Distillation (MLFD), a technique that combines the knowledge of multiple teacher models trained on different datasets and transfers it to a single student model, achieving performance gains over models trained on a single dataset.
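A minimal sketch of combining feature-level knowledge from several teachers trained on different datasets, assuming per-teacher projection layers and an averaged MSE feature loss; MLFD's actual multi-level aggregation is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_teacher_feature_loss(student_feat, teacher_feats, projections):
    """Average a feature-matching loss against several teachers. `projections`
    maps the student feature into each teacher's feature space (an assumption
    about how dimension mismatches are reconciled)."""
    losses = [F.mse_loss(proj(student_feat), t_feat.detach())
              for proj, t_feat in zip(projections, teacher_feats)]
    return torch.stack(losses).mean()

# Usage sketch: two teachers with 512- and 768-dim features, a 256-dim student.
projections = nn.ModuleList([nn.Linear(256, 512), nn.Linear(256, 768)])
student_feat = torch.randn(8, 256)
teacher_feats = [torch.randn(8, 512), torch.randn(8, 768)]
loss = multi_teacher_feature_loss(student_feat, teacher_feats, projections)
```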
In high-dimensional linear regression, strategically crafting a "weak teacher" model for knowledge distillation can outperform training with true labels, but it cannot fundamentally change the data scaling law.
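As a schematic illustration of the setting (not the paper's exact statement), consider a student that is ridge-regressed on a teacher's predictions rather than on the noisy labels; a well-crafted weak teacher can shrink the constant in the excess risk, but not the exponent that governs how the risk scales with the sample size n.

```latex
% Data model and distilled student (schematic, assumed forms):
y_i = \langle \theta_*, x_i \rangle + \varepsilon_i, \qquad
\hat{\theta}_S = \arg\min_{\theta}\;
  \frac{1}{n}\sum_{i=1}^{n}\bigl(\langle \theta, x_i\rangle - \langle \theta_T, x_i\rangle\bigr)^2
  + \lambda\,\lVert\theta\rVert_2^2 .

% The teacher \theta_T can lower the constant C(\theta_T), but the decay rate
% in n (the data scaling law) stays the same:
\mathbb{E}\,\lVert \hat{\theta}_S - \theta_* \rVert^2 \;\asymp\; C(\theta_T)\, n^{-\beta}.
```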
SIKeD, a novel iterative knowledge distillation technique, enhances the mathematical reasoning abilities of smaller language models by addressing the limitations of traditional distillation methods and enabling the models to effectively learn and select from multiple reasoning strategies.
The CAKD framework enhances knowledge distillation in neural networks by decoupling the Kullback-Leibler (KL) divergence loss function, allowing for targeted emphasis on critical elements and improving knowledge transfer efficiency from teacher to student models.
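One common way to decouple the softened KL term is to split it into a target-class component and a non-target-class component that can be weighted separately; the sketch below follows that split as an illustration, and the specific decomposition and weights CAKD uses are assumptions here.

```python
import torch
import torch.nn.functional as F

def decoupled_kl_loss(student_logits, teacher_logits, labels, T=4.0,
                      w_target=1.0, w_nontarget=8.0):
    """Sketch of a decoupled KD loss: the softened KL divergence is split into a
    target-class term and a non-target-class term with separate weights."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    p_s = F.softmax(student_logits / T, dim=1)
    mask = F.one_hot(labels, num_classes=student_logits.size(1)).bool()

    # Binary (target vs. rest) probabilities for teacher and student.
    pt_t, pt_s = p_t[mask], p_s[mask]
    bin_t = torch.stack([pt_t, 1 - pt_t], dim=1)
    bin_s = torch.stack([pt_s, 1 - pt_s], dim=1)
    target_term = F.kl_div(bin_s.clamp_min(1e-8).log(), bin_t, reduction="batchmean")

    # KL over the non-target classes, renormalized to sum to one per sample.
    nt_t = p_t.masked_fill(mask, 0)
    nt_s = p_s.masked_fill(mask, 0)
    nt_t = nt_t / nt_t.sum(dim=1, keepdim=True)
    nt_s = nt_s / nt_s.sum(dim=1, keepdim=True)
    nontarget_term = F.kl_div(nt_s.clamp_min(1e-8).log(), nt_t, reduction="batchmean")

    return (w_target * target_term + w_nontarget * nontarget_term) * T * T
```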