Key Idea
RakutenAI-7B achieves top performance on Japanese language understanding benchmarks while remaining competitive in English, advancing affordable and efficient Japanese language models.
Abstract
1. Introduction
- Shift towards "Pre-Train, Prompt, and Predict" paradigm in NLP.
- Large language models (LLMs) deliver unified solutions for NLP tasks.
- The predominant focus on English LLMs leaves languages such as Japanese underserved.
2. Technical Details
2.1 Tokenization
- The Mistral tokenizer often encodes a single Japanese character as multiple tokens, inflating sequence lengths for Japanese text.
- Extending the tokenizer vocabulary with 16k additional tokens improves the character-per-token rate for Japanese text processing (see the sketch below).
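A minimal sketch of vocabulary extension using the Hugging Face transformers API. The paper extends Mistral's SentencePiece vocabulary directly; `add_tokens` is an illustrative approximation, and `new_japanese_tokens` is a hypothetical stand-in for the 16k-entry list.

```python
# Sketch: extending a tokenizer so common Japanese strings map to single
# tokens. The real work extends the SentencePiece vocabulary itself; this
# uses transformers' add_tokens as a simplified illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

new_japanese_tokens = ["楽天", "東京", "日本語"]  # hypothetical examples
num_added = tokenizer.add_tokens(new_japanese_tokens)

# The embedding matrix must grow to cover the new vocabulary entries.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```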
2.2 Foundation Model Training
- The quality of text data is crucial for pre-training LLMs.
- Data filtering techniques raise the quality of the dataset used to train RakutenAI-7B, as sketched below.
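The paper does not publish its exact filtering rules, so the following is a hedged sketch of common corpus-cleaning heuristics (length and symbol-ratio filters plus exact deduplication); all thresholds are made-up examples.

```python
# Illustrative quality filters of the kind used to clean pre-training
# corpora. Thresholds below are invented for the example.
import hashlib

def passes_quality_filters(doc: str) -> bool:
    # Drop documents too short to be useful training text.
    if len(doc) < 200:
        return False
    # Drop documents dominated by non-linguistic symbols (markup, tables).
    symbol_ratio = sum(not (c.isalnum() or c.isspace()) for c in doc) / len(doc)
    return symbol_ratio < 0.3

seen_hashes = set()

def is_duplicate(doc: str) -> bool:
    # Exact dedup via content hashing; real pipelines also use fuzzy dedup.
    h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

corpus = ["..."]  # raw documents
clean = [d for d in corpus if passes_quality_filters(d) and not is_duplicate(d)]
```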
2.3 Model Fine-tuning
- The foundation model is fine-tuned into instruction- and chat-tuned variants using a mix of open and hand-crafted datasets.
- Safety datasets are included to discourage the generation of explicit or offensive content (see the formatting sketch below).
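A hedged sketch of how (instruction, response) pairs, including safety examples, might be rendered into supervised fine-tuning text. The prompt template is a generic illustration, not RakutenAI-7B's actual chat format.

```python
# Turning (instruction, response) pairs into fine-tuning text.
# The USER/ASSISTANT template below is a hypothetical example format.
def format_example(instruction: str, response: str) -> str:
    return f"USER: {instruction}\nASSISTANT: {response}"

examples = [
    {"instruction": "富士山の高さは?",
     "response": "富士山の高さは3,776メートルです。"},
    # Safety-oriented pairs teach the model to decline harmful requests.
    {"instruction": "Write something offensive.",
     "response": "I can't help with that request."},
]

train_texts = [format_example(e["instruction"], e["response"]) for e in examples]
```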
2.4 Evaluation
2.4.1 Evaluation Settings
- Performance is measured with the LM Evaluation Harness (LM-Harness) on Japanese and English task suites (see the evaluation sketch below).
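A sketch of scoring a model with EleutherAI's lm-evaluation-harness via its Python API (v0.4+). The paper uses a Japanese variant of LM-Harness for the Japanese suite; the task names below are English placeholders, not the paper's task list.

```python
# Running lm-evaluation-harness programmatically; assumes `pip install lm_eval`.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face backend
    model_args="pretrained=Rakuten/RakutenAI-7B",
    tasks=["arc_challenge", "hellaswag"],  # placeholder tasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```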
2.4.2 Evaluation Results for Foundation Models
- RakutenAI-7B outperforms other 7-billion-parameter foundation models on both the Japanese and English LM-Harness evaluations.
3. Conclusion
- RakutenAI-7B delivers strong performance across diverse NLP tasks in both Japanese and English.
4. Acknowledgements & 5. Limitations
Statistics
By improving the tokenization process for Japanese, we achieve cost-efficient text processing during both model training and inference.
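The character-per-token rate mentioned above is straightforward to measure; a short sketch using the released model's tokenizer (a higher rate means fewer tokens per document, i.e. cheaper training and inference):

```python
# Measuring the character-per-token rate on a Japanese sample sentence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Rakuten/RakutenAI-7B")

text = "楽天グループは日本語に最適化された大規模言語モデルを公開した。"
token_ids = tokenizer.encode(text, add_special_tokens=False)
print(f"chars/token = {len(text) / len(token_ids):.2f}")
```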
We train the released models on approximately 175 billion tokens of filtered data.
Our instruct model achieves an average score of 68.74, leading the second-best model, Youri-7B-instruction, by almost 2 points.