Core Concepts
Continual pre-training of large language models initially trained on English corpora can effectively enhance their Japanese language capabilities, outperforming Japanese language models trained from scratch.
Abstract
This study presents the development of Swallow, a large language model (LLM) with enhanced Japanese capabilities, built through continual pre-training of Llama 2 models. The key insights are:
Continual pre-training on Japanese corpora substantially improves the performance of Llama 2 models on Japanese tasks, with gains of up to 70% on question answering. Performance increases monotonically with the amount of Japanese training data (a minimal training-loop sketch follows this list).
Swallow, the continually pre-trained model, outperforms Japanese LLMs trained from scratch by 8.4 to 17.4 points on average, demonstrating the efficiency of the cross-lingual continual pre-training approach.
Vocabulary expansion, which adds Japanese characters and words to the tokenizer, improves computational efficiency without hurting performance, except on the summarization task (see the tokenizer-expansion sketch after this list).
Incorporating parallel Japanese-English corpora into the continual pre-training data improves translation performance, particularly from Japanese to English, without degrading performance on other tasks (see the parallel-data sketch after this list).
The study provides valuable insights into effective methodologies for cross-lingual adaptation of large language models, leveraging both English and Japanese language resources.
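A minimal sketch of the continual pre-training step, assuming the Hugging Face Transformers and Datasets libraries; the corpus path, context length, and hyperparameters are illustrative placeholders, not the paper's actual configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token      # Llama 2 defines no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical local corpus of plain-text Japanese (and some English) documents.
raw = load_dataset("text", data_files={"train": "japanese_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="llama2-ja-continual",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    learning_rate=1e-4,          # illustrative; not the paper's schedule
    num_train_epochs=1,
    bf16=True,
)

# Standard causal-LM objective: the model keeps training on next-token
# prediction, just over the new (mostly Japanese) corpus.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```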
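Vocabulary expansion can be sketched as follows. This uses a SentencePiece model trained separately on Japanese text plus the Transformers add_tokens/resize_token_embeddings APIs as a simplified stand-in for the paper's tokenizer-merging procedure; file names and the SentencePiece vocabulary size are assumptions.

```python
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)   # 32,000 entries
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical Japanese subword model trained beforehand, e.g. with:
# spm.SentencePieceTrainer.train(input="ja_corpus.txt", model_prefix="ja_sp", vocab_size=16000)
sp = spm.SentencePieceProcessor(model_file="ja_sp.model")
ja_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Add only the pieces the base tokenizer does not already contain.
new_pieces = [p for p in ja_pieces if p not in tokenizer.get_vocab()]
tokenizer.add_tokens(new_pieces)

# Grow the input and output embedding matrices to match the new vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(len(tokenizer))   # 32,000 plus the newly added Japanese subwords
```

Note that resize_token_embeddings initializes the new rows randomly; a common alternative is to initialize each new token's embedding from the embeddings of the subwords it decomposes into under the old vocabulary.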
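One straightforward way to fold parallel Japanese-English data into a plain-text pre-training mix is to join each sentence pair into a single document; the file names, field names, and pairing format below are assumptions for illustration, not the paper's exact formatting.

```python
import json

def pair_to_document(pair):
    # pair: {"ja": "...", "en": "..."}  (hypothetical field names)
    return pair["ja"] + "\n" + pair["en"]

# Convert a JSONL file of sentence pairs into plain-text documents that can be
# mixed into the continual pre-training corpus alongside monolingual text.
with open("parallel_ja_en.jsonl", encoding="utf-8") as src, \
     open("parallel_docs.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(pair_to_document(json.loads(line)) + "\n\n")
```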
Stats
The continual pre-training data totals approximately 100B tokens, roughly 90% Japanese text and 10% English text.
The Japanese Wikipedia corpus contains 1.6B tokens.
Vocabulary expansion added 11,176 Japanese subwords to the original Llama 2 tokenizer (32,000 entries), for a total vocabulary size of 43,176.
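A quick way to see the computational-efficiency benefit of the expanded vocabulary is to compare tokens per character on Japanese text between the base and expanded tokenizers; the Swallow model identifier and the sample sentence below are assumptions for illustration.

```python
from transformers import AutoTokenizer

text = "東京工業大学は日本の国立大学である。"   # sample Japanese sentence

base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
expanded = AutoTokenizer.from_pretrained("tokyotech-llm/Swallow-7b-hf")  # assumed model id

for name, tok in [("Llama 2", base), ("Swallow", expanded)]:
    n = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n} tokens, {n / len(text):.2f} tokens per character")
```

Fewer tokens per character means fewer decoding steps per unit of Japanese text, which is the efficiency gain referred to above.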
Quotes
"Continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost."
"Swallow achieved the highest performance in Japanese among all models developed in Japan (as of December 2023)."
"We show that continual pre-training is effective for improving Japanese abilities, especially question answering tasks that require Japanese knowledge."