Core Concepts
KIT-19 is an instruction-tuning dataset for Korean Large Language Models; models trained on it outperform existing Korean LLMs, and the dataset addresses the scarcity of native Korean instruction data.
Abstract
1. Introduction:
Instruction tuning is essential for Large Language Models (LLMs) to achieve optimal performance on specific tasks.
Many instruction datasets have been developed for English, but native-language instruction datasets for Korean remain scarce.
2. KIT-19 Dataset:
KIT-19 integrates 19 open-source NLP datasets in Korean, each with 5,000 examples.
The dataset follows the established methodology of converting existing NLP datasets into an instruction dataset, without relying on machine-translated text or LLM-generated outputs.
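The conversion step described above can be sketched as follows. This is a minimal illustration of wrapping a raw NLP dataset record in an instruction/input/output triple; the field names, template text, and sample record are hypothetical and not taken from KIT-19 itself.

```python
# A minimal sketch of converting an existing NLP dataset record into an
# instruction-tuning example, in the spirit of KIT-19's approach of
# reformatting native Korean datasets rather than using machine
# translation or LLM-generated data. Field names are illustrative.

def to_instruction_example(template: str, record: dict) -> dict:
    """Wrap a raw dataset record in an instruction/input/output triple."""
    return {
        "instruction": template,
        "input": record["text"],
        "output": record["label"],
    }

# Hypothetical sentiment-classification record (illustrative only).
record = {"text": "이 영화 정말 재미있어요!", "label": "positive"}
template = (
    "Classify the sentiment of the following Korean sentence "
    "as positive or negative."
)

example = to_instruction_example(template, record)
print(example["output"])  # -> positive
```

Applying one task-specific template per source dataset, with 5,000 examples each, yields the integrated instruction corpus.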
3. Experimental Results:
Performance evaluations show that models trained with KIT-19 significantly outperform other Korean LLMs across various benchmarks.
4. Importance of KIT-19:
KIT-19 demonstrates the limitations of translated or LLM-generated datasets and highlights the necessity of native language instruction datasets for model enhancement.
5. Future Research:
Plans to expand KIT-19 to include more domains to ensure stable performance beyond benchmarks.
Stats
The KIT-5.8b model outperformed the other models.
The KIT models achieved the best performance across all six benchmark sets.
The KIT-1.3b model outperformed the KoAlpaca-5.8b and Kullm-Polyglot-5.8b-v2 models.
Quotes
"KIT models exhibited superior performance in Unseen Benchmarks compared to other models."
"Our research team interprets this result as a consequence of indirectly learning unseen information during the training process."