Core Concepts
RakutenAI-7B is a suite of Japanese-oriented large language models that achieves top performance on Japanese language understanding benchmarks among similarly sized open models.
Abstract
1. Introduction:
Shift towards "Pre-Train, Prompt, and Predict" paradigm in NLP.
Focus on large language models (LLMs) for various NLP tasks.
2. Technical Details:
2.1 Tokenization:
The Mistral tokenizer often encodes a single Japanese character as multiple tokens, inflating sequence lengths for Japanese text.
Extending the tokenizer's vocabulary improves the character-per-token rate for Japanese text processing.
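A minimal sketch of measuring that rate, assuming the Hugging Face transformers library; the sample sentence is illustrative, and the model IDs are the public Hugging Face repositories:

    # Compare the character-per-token rate of two tokenizers on Japanese text.
    # Assumes the Hugging Face `transformers` library is installed.
    from transformers import AutoTokenizer

    def chars_per_token(tokenizer, text):
        # Average number of input characters covered by one token.
        tokens = tokenizer.encode(text, add_special_tokens=False)
        return len(text) / len(tokens)

    text = "楽天グループ株式会社は日本のインターネット企業です。"
    for model_id in ("mistralai/Mistral-7B-v0.1", "Rakuten/RakutenAI-7B"):
        tok = AutoTokenizer.from_pretrained(model_id)
        print(model_id, round(chars_per_token(tok, text), 2))

A higher value means fewer tokens for the same text, which lowers both training and inference cost.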
2.2 Foundation Model Training:
Importance of high-quality data for pre-training LLMs.
Data filtering techniques are applied to enhance the quality of the training corpus.
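The paper does not detail its filters; as a toy sketch, assuming simple length and Japanese-character-ratio heuristics (not the actual RakutenAI-7B pipeline):

    # Toy rule-based quality filter for pre-training documents. The
    # thresholds and heuristics are illustrative assumptions.
    import re

    JP_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # kana and common kanji

    def keep_document(doc, min_len=200, min_jp_ratio=0.5):
        # Drop documents that are too short or mostly non-Japanese noise.
        if len(doc) < min_len:
            return False
        jp_ratio = len(JP_CHARS.findall(doc)) / len(doc)
        return jp_ratio >= min_jp_ratio

    corpus = ["..."]  # placeholder for raw documents
    filtered = [d for d in corpus if keep_document(d)]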
2.3 Model Fine-tuning:
Fine-tuning the foundation model to create specialized instruction- and chat-tuned models.
Safety measures implemented to prevent generation of harmful content.
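A hedged sketch of that preparation step, assuming a generic prompt template and a toy keyword blocklist; neither is the paper's actual recipe:

    # Prepare supervised fine-tuning examples: a generic instruction template
    # plus a toy safety filter. Real safety pipelines use classifiers and
    # human review rather than keyword lists.
    BLOCKLIST = ("how to build a weapon",)  # illustrative stand-in

    def is_safe(text):
        return not any(phrase in text.lower() for phrase in BLOCKLIST)

    def format_example(instruction, user_input, output):
        return f"USER: {instruction}\n{user_input}\nASSISTANT: {output}"

    raw = [("Translate to English.", "猫が好きです。", "I like cats.")]
    sft_data = [format_example(*t) for t in raw if all(is_safe(x) for x in t)]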
2.4 Evaluation:
2.4.1 Evaluation Settings:
Performance evaluation using Japanese and English versions of LM-Harness across various NLP tasks.
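As a sketch, a single evaluation run could be scripted as follows, assuming the EleutherAI lm-evaluation-harness (v0.4+) Python API; the task name is a placeholder rather than the paper's exact Japanese suite:

    # Score a model with lm-evaluation-harness from Python.
    # Assumes `pip install lm-eval` (EleutherAI lm-evaluation-harness v0.4+).
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=Rakuten/RakutenAI-7B",
        tasks=["jcommonsenseqa"],  # placeholder Japanese task name
        num_fewshot=3,
    )
    print(results["results"])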
2.4.2 Evaluation Results for Foundation Models:
RakutenAI-7B outperforms other open Japanese models of similar size on both the Japanese and English test sets.
2.4.3 Evaluation Results for Instruction-Tuned Models:
The instruction-tuned models show improved performance over the foundation model.
3. Conclusion:
RakutenAI-7B demonstrates high performance across diverse NLP tasks in both Japanese and English languages.
4. Acknowledgements & Limitations: The paper closes with acknowledgements and a discussion of the models' limitations.
Stats
By improving the tokenization process for Japanese, we can achieve cost-efficient text processing during both model training and inference (see the toy calculation after this list).
We train the released models on approximately 175 billion tokens of filtered data.
Our instruct model achieves an average score of 68.74 on the Japanese LM-Harness, leading the second-best model, Youri-7B-instruction, by almost 2 points.
Our foundation model achieves an average score of 60.50 on the English LM-Harness, where Japanese-StableLM-Base-Gamma-7b lags RakutenAI-7B by more than 4 points.
Our instruct model achieves an average score of 61.32 on the English LM-Harness, leading the second-best model, Youri-7B-instruction, by almost 5 points.
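To make the tokenization-cost point above concrete, a toy calculation with assumed rates; fewer tokens for the same text means proportionally less compute:

    # Hypothetical numbers: if the character-per-token rate rises from 1.0 to
    # 1.5, the same text costs one third fewer tokens.
    chars = 1_000_000                   # corpus size in characters
    for rate in (1.0, 1.5):             # assumed characters per token
        print(rate, int(chars / rate))  # 1,000,000 -> 666,666 tokens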
Quotes
"We release our models to the public under the Apache 2.0 License."
"Our aim is to help the community create more affordable and efficient Japanese language models."