Core Concepts
vi-mistral-x is an innovative Large Language Model designed for the Vietnamese language, utilizing continual pre-training to enhance understanding and generation capabilities.
Abstract:
Introduction of vi-mistral-x, a Large Language Model for Vietnamese.
Utilizes continual pre-training on top of the Mistral architecture.
Outperforms existing models in text classification, question answering, and text generation.
Proposed Method:
Effective Corpus Preparation:
Random document selection reduces the corpus to a manageable size.
N-gram-based filtering removes duplicate and near-duplicate documents to keep the dataset unique.
BERT-based classifier filters out toxic content.
Perplexity-based filtering drops low-quality documents that a language model scores as unlikely (a minimal sketch of this pipeline follows the list).
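The corpus-preparation steps can be sketched as a small pipeline. The sampling fraction, n-gram size, overlap threshold, and perplexity cutoff below are illustrative assumptions, not the authors' values; the toxicity classifier and perplexity scorer are passed in as callables rather than implemented here.

```python
import hashlib
import random

def sample_corpus(docs, keep_fraction=0.5, seed=0):
    """Random selection: keep an approximately fixed fraction of documents."""
    rng = random.Random(seed)
    return [d for d in docs if rng.random() < keep_fraction]

def ngram_signature(text, n=5):
    """Set of hashed word n-grams used for near-duplicate detection."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(max(len(words) - n + 1, 1))
    }

def deduplicate(docs, n=5, overlap_threshold=0.8):
    """Drop documents whose n-gram signature overlaps an already kept one
    (quadratic pairwise check; fine for a sketch, not a full corpus)."""
    kept, signatures = [], []
    for doc in docs:
        sig = ngram_signature(doc, n)
        if all(
            len(sig & seen) / max(len(sig | seen), 1) < overlap_threshold
            for seen in signatures
        ):
            kept.append(doc)
            signatures.append(sig)
    return kept

def filter_corpus(docs, is_toxic, perplexity, max_perplexity=1000.0):
    """Chain the four steps: sampling, dedup, toxicity filter, perplexity filter."""
    docs = sample_corpus(docs)
    docs = deduplicate(docs)
    return [d for d in docs if not is_toxic(d) and perplexity(d) < max_perplexity]
```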
Effective Tokenizer Training:
Google's SentencePiece library is used to train a new SPM model on Vietnamese text.
Rule-based token filtering keeps only pieces composed of Vietnamese characters.
Hybrid tokenizer merges the original Mistral SPM model with the enhanced Vietnamese SPM (a minimal merge sketch follows).
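The tokenizer steps can be sketched with the sentencepiece library. The corpus and model file names, the vocabulary size, and the exact Vietnamese-character rule are assumptions for illustration; the merge follows the common pattern of appending new pieces to the original SPM proto, which may differ in detail from the authors' procedure.

```python
import re
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# 1. Train a new SPM model on a Vietnamese corpus (assumed file paths / settings).
spm.SentencePieceTrainer.train(
    input="vi_corpus.txt",
    model_prefix="vi_spm",
    vocab_size=20000,
    model_type="bpe",
)

# 2. Rule-based filter: keep pieces made only of Vietnamese letters
#    (plus SentencePiece's word-boundary marker "▁"); illustrative rule.
VI_CHARS = re.compile(
    r"^[▁a-zA-Zàáảãạăằắẳẵặâầấẩẫậèéẻẽẹêềếểễệ"
    r"ìíỉĩịòóỏõọôồốổỗộơờớởỡợùúủũụưừứửữựỳýỷỹỵđ"
    r"ÀÁẢÃẠĂẰẮẲẴẶÂẦẤẨẪẬÈÉẺẼẸÊỀẾỂỄỆ"
    r"ÌÍỈĨỊÒÓỎÕỌÔỒỐỔỖỘƠỜỚỞỠỢÙÚỦŨỤƯỪỨỬỮỰỲÝỶỸỴĐ]+$"
)

def is_vietnamese_piece(piece: str) -> bool:
    return bool(VI_CHARS.match(piece))

# 3. Merge the filtered Vietnamese pieces into Mistral's original SPM model.
base = sp_pb2.ModelProto()
base.ParseFromString(open("mistral_tokenizer.model", "rb").read())
vi = sp_pb2.ModelProto()
vi.ParseFromString(open("vi_spm.model", "rb").read())

existing = {p.piece for p in base.pieces}
for p in vi.pieces:
    if p.piece not in existing and is_vietnamese_piece(p.piece):
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        base.pieces.append(new_piece)

with open("mistral_vi_merged.model", "wb") as f:
    f.write(base.SerializeToString())
```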
Effective Model Initialization:
Adaptation of the Mistral architecture to accommodate Vietnamese token embeddings.
Expansion of the embedding layer and language model head to cover the Vietnamese-specific tokens (sketched below).
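A minimal sketch of the expansion step, assuming a Hugging Face Mistral checkpoint and the merged tokenizer from the previous step; initializing the new rows with the mean of the pretrained rows is an illustrative choice, not necessarily the authors' exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("./vi_mistral_tokenizer")  # merged SPM (assumed path)

old_vocab = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))  # grows both embeddings and LM head

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    head = model.get_output_embeddings().weight
    # Start the added Vietnamese tokens near the existing embedding manifold
    # by copying the mean of the pretrained rows into the new rows.
    emb[old_vocab:] = emb[:old_vocab].mean(dim=0, keepdim=True)
    head[old_vocab:] = head[:old_vocab].mean(dim=0, keepdim=True)
```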
Effective Model Training:
Focus on memory and computational efficiency when training LLMs.
Optimization through curated model architectures and parallelism techniques (an illustrative training setup follows).
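An illustrative continual pre-training configuration built around common memory-saving options (bf16, gradient checkpointing, gradient accumulation). The hyperparameters, file paths, and the Trainer-based setup are assumptions for this sketch; the actual training stack may use different parallelism techniques.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("./vi_mistral_tokenizer")  # merged SPM (assumed path)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained("./vi_mistral_init")    # resized model (assumed path)

raw = load_dataset("text", data_files="vi_corpus.txt")["train"]
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="vi-mistral-x-ckpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,  # large effective batch at low memory cost
    gradient_checkpointing=True,     # recompute activations instead of storing them
    bf16=True,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```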
Model Alignment:
Fine-tuning vi-mistral-x on task-specific Vietnamese datasets for optimal downstream performance (an example of formatting such data follows).
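A small sketch of how a task-specific Vietnamese dataset might be turned into supervised fine-tuning examples. The prompt template and the field names instruction/response are assumptions for illustration, not the authors' format.

```python
def format_sft_example(example: dict) -> str:
    """Render one instruction/response pair as a single training string."""
    return (
        "### Câu hỏi:\n"            # "Question:"
        f"{example['instruction']}\n\n"
        "### Trả lời:\n"            # "Answer:"
        f"{example['response']}"
    )

sample = {
    "instruction": "Thủ đô của Việt Nam là gì?",   # "What is the capital of Vietnam?"
    "response": "Thủ đô của Việt Nam là Hà Nội.",  # "The capital of Vietnam is Hanoi."
}
print(format_sft_example(sample))
```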
Experimental Results:
Pretrained Model Comparison:
Evaluation of vi-mistral-x against other pretrained models on tasks such as CLM (next-token prediction) and the VMLU benchmark.
Finetuned Model (results still being updated):
Detailed evaluation results of vi-mistral-x across the different categories of the VMLU benchmark suite.
Stats
On the CLM (next-token prediction) task, vi-mistral-x processed 2,068,480 tokens and achieved a loss of 2.1566 and an accuracy of 0.5622.
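Assuming the reported loss is the mean per-token cross-entropy in nats, the implied perplexity can be read off directly as exp(loss):

```python
import math

# Perplexity implied by the reported CLM loss, assuming it is the
# mean per-token cross-entropy in nats.
loss = 2.1566
print(math.exp(loss))  # ≈ 8.64
```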