
Pretraining and Continually Updating a Large Language Model for the Japanese Business Domain


Core Concepts
This study presents the first Japanese business domain-specific large language model (LLM) with 13 billion parameters, trained from scratch and continually updated with the latest business documents.
Abstract
This study explores the development of a Japanese business domain-specific large language model (LLM). The key highlights are:
- Pretraining from Scratch: The authors trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, in addition to general-domain data such as Wikipedia and Common Crawl.
- Continual Pretraining: To keep the model's knowledge current, the authors continually pretrained it on the latest business documents collected over the past two months. This helps the model adapt to new information without losing its general knowledge.
- Business Domain Benchmark: The authors created a new benchmark for Japanese business-domain question answering (QA), consisting of 50 questions across three settings: without context, with automatically retrieved context, and with manually retrieved context. The benchmark is used to evaluate their model and compare it against existing Japanese LLMs.
- Evaluation Results: The pretrained model outperforms existing Japanese LLMs on the business-domain QA tasks, particularly in the no-context setting, demonstrating strong domain-specific knowledge. The continually updated models also show improved performance on questions about the most recent business events.
- Publicly Available Resources: The pretrained model and the business-domain benchmark are made publicly available to facilitate future research in language- and domain-specific LLMs.
Stats
- The pretraining dataset consists of over 220 billion tokens: 19.8% from domain-specific sources (business web pages and patents) and 80.2% from general-domain sources (Wikipedia, CC100, mC4, Common Crawl).
- The authors used 16 AWS Trainium instances to train the 13-billion-parameter model in a distributed learning environment.
- The continual pretraining dataset comprises the latest business documents collected over the past two months.
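To make the reported dataset composition concrete, below is a minimal sketch (not the authors' released code) of turning the 19.8% / 80.2% mixture shares into per-source token budgets; the source names and helper function are illustrative assumptions.

```python
# A minimal sketch, not the authors' code, of converting target mixture shares
# into absolute token budgets per source. Source names are placeholders.

TOTAL_TOKENS = 220_000_000_000  # "over 220 billion tokens" reported in the paper

target_shares = {
    "business_web_and_patents": 0.198,  # domain-specific
    "general_domain": 0.802,            # Wikipedia, CC100, mC4, Common Crawl
}

def tokens_per_source(total_tokens: int, shares: dict[str, float]) -> dict[str, int]:
    """Convert target mixture shares into absolute token budgets per source."""
    assert abs(sum(shares.values()) - 1.0) < 1e-6, "shares must sum to 1"
    return {name: round(total_tokens * share) for name, share in shares.items()}

if __name__ == "__main__":
    for name, budget in tokens_per_source(TOTAL_TOKENS, target_shares).items():
        print(f"{name}: ~{budget / 1e9:.1f}B tokens")
```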
Quotes
"Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM." "To update our model with the latest business knowledge, we continually pretrain the model on the latest business documents collected in the past two months, ensuring that its knowledge remains current." "Our pretrained model and business domain benchmark are publicly available."

Deeper Inquiries

How can the authors' approach be extended to develop domain-specific LLMs for other languages and industries?

The authors' approach to developing a domain-specific LLM for the Japanese business domain can be extended to other languages and industries by following a similar methodology tailored to the target language and domain. The main steps are:
- Data Collection: Gather a diverse set of data sources in the target language and industry domain, such as business documents, patents, news articles, and other relevant texts.
- Dataset Preprocessing: Filter the collected data to remove noise, identify the language, and deduplicate entries to ensure data quality (a minimal preprocessing sketch follows this list).
- Pretraining: Train the LLM from scratch on the preprocessed dataset, focusing on the language- and domain-specific texts. Choose the model size and hyperparameters based on the complexity of the language and domain.
- Continual Updating: Implement a pipeline that continually updates the model with the latest data in the target language and industry, blending the latest data with existing knowledge to prevent catastrophic forgetting.
- Benchmark Creation: Develop a benchmark specific to the language and industry domain to evaluate the model's performance accurately.
- Instruction Tuning: Use language-specific instruction datasets to improve the model's performance on downstream tasks.
By following these steps and adapting them to the linguistic and domain characteristics of the target setting, researchers can develop domain-specific LLMs for a wide range of contexts.
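As a concrete illustration of the Dataset Preprocessing step, here is a minimal sketch, assuming a simple character-ratio heuristic for Japanese language identification and exact hash-based deduplication; it is not the authors' actual pipeline, and the threshold and helper names are illustrative.

```python
# A minimal sketch, not the authors' pipeline: heuristic Japanese language
# identification plus exact deduplication. Thresholds are illustrative assumptions.
import hashlib
import re

JAPANESE_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # Hiragana, Katakana, common Kanji

def looks_japanese(text: str, min_ratio: float = 0.3) -> bool:
    """Keep a document if a sufficient fraction of its characters are Japanese."""
    if not text:
        return False
    ratio = len(JAPANESE_CHARS.findall(text)) / len(text)
    return ratio >= min_ratio

def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized text (real pipelines often add fuzzy dedup)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

raw_docs = ["最新の決算情報を公開しました。", "最新の決算情報を公開しました。", "This page is in English only."]
clean_docs = deduplicate([d for d in raw_docs if looks_japanese(d)])
print(clean_docs)  # the English document and the duplicate are removed
```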

What are the potential challenges and limitations in continually updating a large language model without catastrophic forgetting?

Continually updating a large language model without catastrophic forgetting poses several challenges and limitations:
- Balancing New and Old Knowledge: The model must retain previously learned information while incorporating new data; choosing how much weight to give new versus old data is difficult (a data-blending sketch follows this list).
- Data Quality: The quality of the new data used for updating is essential, since noisy or irrelevant data can degrade the model's performance.
- Computational Resources: Continual updating requires significant computational resources to process and incorporate new data efficiently.
- Overfitting: The model can overfit to the new data, reducing performance on previously learned tasks.
- Evaluation: Evaluating the model after each update to verify that it maintains or improves its capabilities can be resource-intensive.
- Optimal Hyperparameters: Finding hyperparameters that prevent catastrophic forgetting while still adapting to new information is a complex task.
Addressing these challenges requires careful planning, monitoring, and fine-tuning of the updating process to maintain the model's performance over time.
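To make the "balancing new and old knowledge" point concrete, below is a minimal sketch of a simple replay strategy that mixes newly collected documents with a sample of the original pretraining data before a continual-update run. This is an assumption for illustration, not the authors' method; the function name and the 30% replay ratio are hypothetical.

```python
# A minimal sketch (an assumption, not the authors' method) of blending a replay
# sample of older pretraining data with newly collected documents so that each
# continual-update corpus mixes old and new knowledge. The replay ratio is illustrative.
import random

def build_update_corpus(new_docs: list[str], old_docs: list[str],
                        replay_ratio: float = 0.3, seed: int = 0) -> list[str]:
    """Return a shuffled corpus where roughly `replay_ratio` of examples are replayed old data."""
    rng = random.Random(seed)
    n_replay = int(len(new_docs) * replay_ratio / (1.0 - replay_ratio))
    replay = rng.sample(old_docs, k=min(n_replay, len(old_docs)))
    corpus = new_docs + replay
    rng.shuffle(corpus)
    return corpus

# Usage: combine the last two months of business documents with a sample of the original data.
latest = ["doc from the past two months ..."] * 7
original = ["doc from the original pretraining set ..."] * 100
mixed = build_update_corpus(latest, original)
print(f"{len(mixed)} documents, {len(mixed) - len(latest)} of them replayed")
```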

How can the business domain benchmark be further expanded or refined to better capture the nuances of the Japanese business landscape?

To enhance the business domain benchmark and better capture the nuances of the Japanese business landscape, the following strategies can be implemented:
- Diversification of Question Types: Include a wider range of question types that reflect various aspects of the Japanese business landscape, such as regulatory frameworks, cultural influences, market trends, and specific industry practices.
- Incorporation of Real-world Scenarios: Integrate real-world business scenarios and case studies into the benchmark to test the model's ability to apply knowledge in practical situations.
- Industry-specific Challenges: Introduce questions that address challenges unique to the Japanese business environment, such as supply chain disruptions, sustainability initiatives, or regulatory compliance issues.
- Expert Validation: Involve domain experts from the Japanese business sector to validate the benchmark questions and ensure they accurately reflect the complexities of the industry.
- Contextual Understanding: Develop questions that require a deep understanding of Japanese business culture, etiquette, and communication norms to test the model's contextual comprehension.
- Dynamic Updates: Regularly update the benchmark with the latest trends, events, and developments in the Japanese business landscape so the evaluation stays relevant to current industry dynamics.
By implementing these refinements and expansions, the business domain benchmark can provide a more comprehensive and nuanced evaluation of LLMs in the Japanese business domain (a sketch of one possible benchmark-item format follows this list).
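As one way to operationalize these strategies, here is a minimal, hypothetical sketch of a benchmark-item data structure that covers the paper's three context settings (no context, automatically retrieved, manually retrieved) plus metadata for question type and collection date. The field names and example values are assumptions, not the released benchmark format.

```python
# A hypothetical sketch, not the released benchmark format, of how an expanded
# benchmark item could be represented with question-type metadata and the three
# context settings described in the paper.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkItem:
    question: str                      # Japanese business-domain question
    reference_answer: str              # gold answer used for scoring
    question_type: str                 # e.g. "regulation", "market_trend", "industry_practice"
    retrieved_context: Optional[str]   # automatically retrieved context (None in the no-context setting)
    manual_context: Optional[str]      # manually retrieved context (None in the no-context setting)
    collected_date: str                # supports dynamic updates with recent events

item = BenchmarkItem(
    question="(illustrative Japanese business question)",
    reference_answer="(gold answer written by a domain expert)",
    question_type="regulation",
    retrieved_context=None,
    manual_context=None,
    collected_date="2024-04",
)
print(item.question_type)
```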