This research proposes a novel knowledge distillation technique using a specialized Mixture-of-Experts (MoE) model, called Routing-by-Memory (RbM), to improve the efficiency of node classification in Graph Neural Networks (GNNs) while maintaining accuracy.
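The RbM routing mechanism itself is not reproduced here; the sketch below only illustrates the general setup it builds on, as I read it: distilling a GNN teacher's precomputed node logits into a small mixture-of-experts student that runs on node features alone, with a learned router mixing experts per node. All module names, sizes, and loss weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEStudent(nn.Module):
    """Small MoE MLP that classifies nodes from raw features (no graph needed at inference)."""
    def __init__(self, in_dim, num_classes, num_experts=4, hidden=64):
        super().__init__()
        self.router = nn.Linear(in_dim, num_experts)   # soft routing scores per node
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))
            for _ in range(num_experts)
        )

    def forward(self, x):
        gate = F.softmax(self.router(x), dim=-1)                        # [N, E]
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # [N, E, C]
        return (gate.unsqueeze(-1) * expert_out).sum(dim=1)             # [N, C]

def distill_step(student, x, teacher_logits, labels, opt, T=2.0, alpha=0.5):
    """One step: cross-entropy on labels + KL to the (precomputed) GNN teacher logits."""
    s_logits = student(x)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    ce = F.cross_entropy(s_logits, labels)
    loss = alpha * kd + (1 - alpha) * ce
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```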
Successful knowledge distillation depends on sufficient sampling of the teacher model's output space and decision boundaries, and surprisingly, even unconventional datasets like unoptimized synthetic imagery can be effective when these criteria are met.
Multi-perspective Contrastive Logit Distillation (MCLD) applies contrastive learning to teacher and student logits compared from multiple perspectives, improving knowledge transfer between the networks and yielding better performance and transferability without relying heavily on the classification task loss.
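MCLD's exact "multi-perspective" construction is not reproduced here; as a rough illustration, the sketch below applies a single InfoNCE-style contrastive term directly to the logits, treating the teacher and student views of the same sample as a positive pair and other samples in the batch as negatives. The temperature and the cosine normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_logit_loss(student_logits, teacher_logits, tau=0.1):
    """InfoNCE over logit vectors: the student logits of sample i should be closest
    to the teacher logits of the same sample i within the batch."""
    s = F.normalize(student_logits, dim=-1)   # [B, C]
    t = F.normalize(teacher_logits, dim=-1)   # [B, C]
    sim = s @ t.t() / tau                     # [B, B] pairwise similarities
    targets = torch.arange(s.size(0), device=s.device)
    # Symmetric loss: match student -> teacher and teacher -> student.
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
```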
Combining probability-level and logit-level knowledge distillation losses can hinder performance due to conflicting gradients; the proposed Dual-Head Knowledge Distillation (DHKD) method overcomes this by using separate classification heads for each loss, improving knowledge transfer and student model accuracy.
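A minimal sketch of the dual-head idea as described: a shared student backbone feeds two separate classification heads, one trained with cross-entropy plus a probability-level (KL) distillation loss and the other with a logit-level (MSE) loss, so the conflicting gradients never meet in a single head. Layer shapes, temperatures, and loss weights are assumptions; which head (or combination) is used at test time is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone                             # any feature extractor
        self.head_prob = nn.Linear(feat_dim, num_classes)    # trained with CE + KL
        self.head_logit = nn.Linear(feat_dim, num_classes)   # trained with logit MSE

    def forward(self, x):
        h = self.backbone(x)
        return self.head_prob(h), self.head_logit(h)

def dhkd_loss(student, x, teacher_logits, labels, T=4.0, beta=1.0):
    logits_p, logits_l = student(x)
    ce = F.cross_entropy(logits_p, labels)
    kl = F.kl_div(F.log_softmax(logits_p / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    mse = F.mse_loss(logits_l, teacher_logits)   # logit-level loss isolated on the second head
    return ce + kl + beta * mse
```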
This research paper introduces a novel information-theoretic framework for quantifying and optimizing the transfer of task-relevant knowledge during knowledge distillation in machine learning.
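The paper's own framework is not reproduced here; as one concrete, widely used instance of information-theoretic distillation, the sketch below maximizes a variational lower bound on the mutual information between teacher and student features by modelling the teacher feature as a Gaussian whose mean is predicted from the student feature (in the spirit of variational information distillation). All architectural details are assumptions.

```python
import torch
import torch.nn as nn

class VariationalMI(nn.Module):
    """Lower-bounds I(teacher_feat; student_feat) via -E[log q(t | s)] with a Gaussian q."""
    def __init__(self, s_dim, t_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(s_dim, t_dim), nn.ReLU(), nn.Linear(t_dim, t_dim))
        self.log_var = nn.Parameter(torch.zeros(t_dim))   # per-dimension variance of q

    def forward(self, s_feat, t_feat):
        mu = self.mean(s_feat)
        var = self.log_var.exp()
        # Negative Gaussian log-likelihood (up to a constant);
        # minimizing it tightens the mutual-information lower bound.
        nll = 0.5 * (((t_feat - mu) ** 2) / var + self.log_var).sum(dim=-1)
        return nll.mean()
```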
To distill knowledge efficiently from closed large language models (LLMs), this work introduces Proxy-KD, a knowledge distillation technique that relies on a proxy model. Proxy-KD first aligns the proxy model with the black-box LLM and then uses it to transfer knowledge to a small LLM. Experiments show that Proxy-KD outperforms existing black-box and white-box knowledge distillation methods, opening new possibilities for exploiting closed LLMs.
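A minimal sketch of the two-stage flow described above, assuming Hugging Face-style causal-LM models and representing the closed LLM by a hypothetical `query_blackbox_llm` stub (closed APIs expose generated text, not logits): stage 1 aligns an open proxy model to the black-box teacher's generations, stage 2 distills the aligned proxy, which does expose logits, into the small student with a token-level KL. Function names, losses, and the overall loop are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def query_blackbox_llm(prompt: str) -> str:
    """Hypothetical stand-in for the closed LLM's text-only API."""
    raise NotImplementedError

# --- Stage 1: align the white-box proxy to the black-box teacher -----------------
def align_proxy_step(proxy, tokenizer, prompt, optimizer):
    """Fine-tune the proxy on the black-box teacher's generations (behavioral alignment)."""
    target = query_blackbox_llm(prompt)
    ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    out = proxy(input_ids=ids, labels=ids)        # standard causal-LM loss on teacher text
    optimizer.zero_grad()
    out.loss.backward()
    optimizer.step()
    return out.loss.item()

# --- Stage 2: distill the aligned proxy (logits now available) into the student ---
def distill_from_proxy_step(student, proxy, tokenizer, prompt, optimizer, T=1.0):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        t_logits = proxy(input_ids=ids).logits
    s_logits = student(input_ids=ids).logits
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1).flatten(0, 1),
                    F.softmax(t_logits / T, dim=-1).flatten(0, 1),
                    reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```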
Over-parameterizing student models during knowledge distillation using Matrix Product Operators (MPO) enhances their performance without increasing inference latency, effectively transferring knowledge from larger teacher models.
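A minimal sketch of the core trick as I read it: during training the student's linear layers are stored as MPO factors, which can hold more parameters than the dense matrix when the bond dimension is large, and at deployment the factors are contracted back into a single dense weight so the forward pass costs exactly as much as a plain `nn.Linear`. The two-core factorization and the shapes below are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPOLinear(nn.Module):
    """Linear layer stored as two MPO cores; contracted to a dense weight for inference."""
    def __init__(self, in_dims=(16, 16), out_dims=(16, 16), bond=64):
        super().__init__()
        i1, i2 = in_dims
        o1, o2 = out_dims
        # Over-parameterized during training when `bond` is large:
        # i1*o1*bond + bond*i2*o2 can exceed (i1*i2)*(o1*o2).
        self.core1 = nn.Parameter(torch.randn(i1, o1, bond) * 0.02)
        self.core2 = nn.Parameter(torch.randn(bond, i2, o2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(o1 * o2))
        self.register_buffer("dense_weight", None)

    def contract(self):
        # Contract the cores into a weight of shape (out, in) = (o1*o2, i1*i2).
        w = torch.einsum("aor,rbp->opab", self.core1, self.core2)
        return w.reshape(self.core1.shape[1] * self.core2.shape[2],
                         self.core1.shape[0] * self.core2.shape[1])

    def freeze_for_inference(self):
        """Collapse the MPO back to a dense matrix: inference latency equals a plain Linear."""
        self.dense_weight = self.contract().detach()

    def forward(self, x):                             # x: [batch, i1*i2]
        w = self.dense_weight if self.dense_weight is not None else self.contract()
        return F.linear(x, w, self.bias)
```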
Performance-Guided Knowledge Distillation (PGKD) uses large language models (LLMs) to improve the accuracy of smaller models on multi-class text classification, particularly when labeled data is scarce, while significantly reducing inference cost and latency.
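One way such a loop might look in practice, sketched with a hypothetical `llm_annotate` stub for the large model and scikit-learn for the small student; the performance-guided part is reduced here to "send the worst-performing classes back to the LLM for more labels." Everything below is an illustrative assumption, not the paper's exact protocol.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def llm_annotate(texts, classes):
    """Hypothetical stand-in for prompting the LLM teacher to label texts."""
    raise NotImplementedError

def pgkd_round(labeled_texts, labels, unlabeled_texts, val_texts, val_labels, classes):
    # 1) Train the cheap student on everything labeled so far.
    student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    student.fit(labeled_texts, labels)

    # 2) Measure per-class validation performance to guide the next annotation batch.
    preds = student.predict(val_texts)
    per_class_f1 = f1_score(val_labels, preds, labels=classes, average=None)
    hard_classes = [c for c, f in zip(classes, per_class_f1) if f < np.median(per_class_f1)]

    # 3) Ask the LLM teacher for labels, keeping examples it assigns to weak classes.
    new_labels = llm_annotate(unlabeled_texts, classes)
    keep = [i for i, y in enumerate(new_labels) if y in hard_classes]
    return ([*labeled_texts, *(unlabeled_texts[i] for i in keep)],
            [*labels, *(new_labels[i] for i in keep)],
            student)
```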
This paper proposes a novel knowledge distillation framework called Block-wise Logit Distillation (Block-KD) that bridges the gap between logit-based and feature-based distillation methods, achieving superior performance by implicitly aligning features through a series of intermediate "stepping-stone" models.
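A rough sketch of the stepping-stone idea as described: the student's features after block k are projected and pushed through the teacher's remaining blocks, giving hybrid logits that are distilled against the teacher's own logits alongside the usual logit KD and cross-entropy. Projector design, loss weights, and the assumption that each model's last block ends in its classifier are mine; teacher parameters are assumed frozen so only the student and projectors are updated.

```python
import torch
import torch.nn.functional as F

def kd_kl(s_logits, t_logits, T=4.0):
    return F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T

def block_kd_loss(student_blocks, teacher_blocks, projectors, x, labels, lam=1.0):
    """student_blocks / teacher_blocks: lists of nn.Module with matching block boundaries;
    projectors[k] maps student block-k features into the teacher's block-k feature space."""
    with torch.no_grad():
        t = x
        for blk in teacher_blocks:
            t = blk(t)
        t_logits = t                                  # the last teacher block ends in the classifier

    loss, s_feat = 0.0, x
    for k, blk in enumerate(student_blocks):
        s_feat = blk(s_feat)
        if k < len(student_blocks) - 1:
            # "Stepping stone": student prefix + teacher suffix -> hybrid logits.
            h = projectors[k](s_feat)
            for t_blk in teacher_blocks[k + 1:]:
                h = t_blk(h)
            loss = loss + lam * kd_kl(h, t_logits)
    s_logits = s_feat                                 # the last student block ends in the classifier
    return loss + kd_kl(s_logits, t_logits) + F.cross_entropy(s_logits, labels)
```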
This work presents Multi-Level Feature Distillation (MLFD), which combines the knowledge of multiple teacher models trained on different datasets and transfers it to a single student model, achieving performance gains over models trained on a single dataset.
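A minimal sketch of the multi-teacher setup described above: each teacher was trained on its own dataset, the student mimics every teacher's intermediate features through a dedicated projection head, and a task loss on the student's own labels is added on top. Reducing each teacher to a single feature level and the loss weights are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherFeatureKD(nn.Module):
    def __init__(self, student, student_feat_dim, teacher_feat_dims):
        super().__init__()
        self.student = student                        # assumed to return (features, logits)
        # One projection head per teacher so mismatched feature sizes can be aligned.
        self.projectors = nn.ModuleList(
            nn.Linear(student_feat_dim, d) for d in teacher_feat_dims
        )

    def forward(self, x, teacher_feats, labels, alpha=1.0):
        s_feat, s_logits = self.student(x)
        feat_loss = sum(
            F.mse_loss(proj(s_feat), t_feat)          # match each teacher's feature space
            for proj, t_feat in zip(self.projectors, teacher_feats)
        )
        return F.cross_entropy(s_logits, labels) + alpha * feat_loss
```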