Core Concepts
Development of WangchanLion for Machine Reading Comprehension (MRC) in the Thai language, focusing on contextual understanding and evaluation.
Summary
This technical report discusses the development of WangchanLion, an instruction-fine-tuned Thai model for Machine Reading Comprehension (MRC). It details the model's training data and evaluation methodology and compares it with other models. The report also introduces a new evaluation scheme that assesses correctness, helpfulness, conciseness, and contextuality.
Abstract:
Development of WangchanLion for MRC in the Thai language.
Public release of training data, code, and model weights under the Apache-2.0 license.
Experimental studies using XQuAD and Iapp_wiki_qa_squad datasets.
Proposal of a new evaluation scheme for MRC.
Introduction:
Significance of Large Language Models (LLMs) in AI.
Open-source research interest in LLMs.
Introduction to SEA-LION and other models supporting the Thai language.
Instruction Tuning:
Data sources used for instruction tuning.
Supervised Fine-tuning (SFT) strategy employed.
Hyperparameter settings for fine-tuning WangchanLion.
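The supervised fine-tuning (SFT) strategy above trains the model on instruction–response pairs rendered as single text sequences. The report does not specify WangchanLion's exact prompt template, so the field names and layout below are illustrative assumptions only, sketching how such data is typically formatted before tokenization:

```python
# Hypothetical sketch of SFT data formatting. The template below is an
# assumption for illustration; it is NOT WangchanLion's actual format.

def format_sft_example(instruction: str, context: str, response: str) -> str:
    """Render one (instruction, context, response) triple as a single
    training string; the model learns to generate the text after
    'Response:'. The loss is usually masked on everything before it."""
    parts = [f"Instruction: {instruction}"]
    if context:  # MRC examples carry a passage; open-ended ones may not
        parts.append(f"Context: {context}")
    parts.append(f"Response: {response}")
    return "\n".join(parts)

example = format_sft_example(
    instruction="Answer the question using the passage.",
    context="WangchanLion is a Thai instruction-tuned model.",
    response="It is a Thai instruction-tuned model.",
)
```

In practice each formatted string is tokenized and batched, and the loss is computed only over the response tokens so the model is not trained to reproduce the prompt.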
Machine Reading Comprehension (MRC) Evaluation:
Components of MRC evaluation: context, question, reference answer, response.
Traditional extractive QA evaluation using XQuAD dataset.
Human evaluation method design and results comparison.
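Traditional extractive QA evaluation, as used with XQuAD, scores a predicted answer span against the reference with Exact Match (EM) and token-level F1, following the original SQuAD metrics. A minimal sketch is below; note that whitespace splitting is a simplification, since Thai text has no word spaces and real evaluation would require a Thai word segmenter (not specified here):

```python
# Minimal sketch of SQuAD-style EM and token-level F1 scoring.
# Whitespace tokenization is an assumption; Thai evaluation needs
# a proper word segmenter instead of str.split().
from collections import Counter

def exact_match(prediction: str, reference: str) -> bool:
    """True if the prediction string equals the reference exactly
    (after trimming surrounding whitespace)."""
    return prediction.strip() == reference.strip()

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall,
    counting overlapping tokens with multiplicity."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only verbatim spans, while F1 gives partial credit for overlapping tokens; the report's proposed scheme goes further by also judging helpfulness, conciseness, and contextuality.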
Statistics
SEA-LION has a Thai vocabulary of 10,652 tokens.
LLaMA2's training dataset comprises roughly 2.0 trillion tokens, of which more than 89% is English.
Quotes
"Large Language Models have gained significant attention in recent years." - Wannaphong Phatthiyaphaibun