
Building a Large High-Quality Japanese Web Corpus for Training Large Language Models


Core Concepts
This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive. The resulting corpus is the largest of all training corpora currently available for Japanese LLMs.
Abstract
This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots of approximately 63.4 billion pages crawled between 2020 and 2023). The corpus consists of approximately 312.1 billion characters (approximately 173 million pages), which is the largest of all available training corpora for Japanese LLMs, surpassing CC-100 (approximately 25.8 billion characters), mC4 (approximately 239.7 billion characters), and OSCAR 23.10 (approximately 74 billion characters).

The key steps in building the corpus are:
- Rapid Japanese detection to reduce processing time for subsequent steps.
- Text extraction from HTML using Trafilatura.
- Precise Japanese detection based on a linear binary classifier.
- Quality filtering to remove low-quality Japanese text based on various rules.
- Deduplication using MinHash to remove duplicated text.
- Filtering by hostnames to remove irrelevant content.
- Normalizing punctuation and removing footers.

To confirm the quality of the corpus, the authors performed continual pre-training on Llama 2 7B, 13B, and 70B, Mistral 7B v0.1, and Mixtral 8x7B Instruct as base LLMs. The experiments demonstrated consistent improvements of 6.6–8.1 points on Japanese benchmark datasets and established state-of-the-art performance for each model size (7B, 13B, 8x7B, and 70B). The improvement on Llama 2 13B brought by the presented corpus was the largest among those from existing corpora.
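To make the pipeline steps concrete, here is a minimal sketch, not the authors' implementation: it assumes Trafilatura for HTML text extraction and the datasketch library for MinHash deduplication, and uses an illustrative hiragana/katakana ratio as the rapid Japanese check. The paper's actual heuristics, classifier, shingling scheme, and thresholds are not reproduced here.

```python
import re
import trafilatura
from datasketch import MinHash, MinHashLSH

# Illustrative rapid Japanese check: the presence of hiragana/katakana is a
# cheap signal; the threshold below is an assumption, not the paper's value.
KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]")

def looks_japanese(text: str, min_ratio: float = 0.05) -> bool:
    return bool(text) and len(KANA.findall(text)) / len(text) >= min_ratio

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    # Character 5-gram shingles (Japanese has no whitespace word boundaries);
    # the paper's exact shingling scheme is assumed here.
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 4, 1)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # index of near-duplicates seen so far

def process_page(doc_id: str, html: str) -> str | None:
    # 1) Rapid Japanese detection on raw HTML to skip most non-Japanese pages.
    if not looks_japanese(html, min_ratio=0.01):
        return None
    # 2) Extract main text (menus, ads, footers stripped) with Trafilatura.
    text = trafilatura.extract(html)
    if not text or not looks_japanese(text):
        return None
    # 3) MinHash deduplication: drop the page if a near-duplicate is already indexed.
    m = minhash_of(text)
    if lsh.query(m):
        return None
    lsh.insert(doc_id, m)
    return text  # quality filtering and hostname filtering would follow
```

In the actual pipeline these stages run at Common Crawl scale, so the deduplication index would be sharded rather than held in a single in-memory object as above.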
Stats
The presented corpus consists of approximately 312.1 billion characters (approximately 173 million pages). The CC-100 corpus contains approximately 25.8 billion characters, the mC4 corpus contains approximately 239.7 billion characters, and the OSCAR 23.10 corpus contains approximately 74 billion characters.
Quotes
"This corpus consists of approximately 312.1 billion characters (approximately 173 million pages), which is the largest of all available training corpora for Japanese LLMs, surpassing CC-100 (approximately 25.8 billion characters), mC4 (approximately 239.7 billion characters) and OSCAR 23.10 (approximately 74 billion characters)." "Experimental results demonstrate that continual pre-training consistently improves the base model's performance by 6.6–8.1 points on Japanese benchmark datasets." "We also demonstrate that the improvement on Llama 2 13B brought from the presented corpus was the largest among those from other existing corpora."

Key Insights Distilled From

by Naoaki Okaza... at arxiv.org 04-30-2024

https://arxiv.org/pdf/2404.17733.pdf
Building a Large Japanese Web Corpus for Large Language Models

Deeper Inquiries

How can the quality filtering process be further improved to remove more harmful or biased content from the corpus?

To further improve the quality filtering process and remove more harmful or biased content from the corpus, several strategies can be implemented:
- Enhanced NG expression list: Expand the list of harmful or biased expressions to cover a wider range of problematic content, for example by collaborating with experts in various fields to identify additional terms or phrases (a hedged filtering sketch follows this list).
- Machine learning models: Train classifiers on a diverse set of examples to automatically detect and filter out harmful or biased content.
- Community feedback: Establish a mechanism for users to report harmful or biased content they encounter in the corpus, and use these reports to continuously refine the filtering criteria.
- Ethical review: Review the corpus construction process against ethical guidelines and standards to surface biases or harmful content that may have been overlooked.
- Diverse perspectives: Involve people with diverse viewpoints and sensitivities in the filtering process so that biases present in the corpus are more likely to be identified and addressed.
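As a rough illustration of the rule-based side of such filtering, the following is a minimal sketch of an NG-expression filter. The expression list, the coverage-ratio threshold, and the reject-by-ratio policy are all assumptions made for illustration, not the authors' actual filter.

```python
import re

# Hypothetical NG-expression list; in practice it would be curated with
# domain experts and extended continuously.
NG_EXPRESSIONS = ["harmful phrase 1", "harmful phrase 2"]
NG_PATTERN = re.compile("|".join(map(re.escape, NG_EXPRESSIONS)))

def ng_ratio(text: str) -> float:
    """Fraction of characters covered by NG expressions."""
    return sum(len(m) for m in NG_PATTERN.findall(text)) / max(len(text), 1)

def passes_ng_filter(text: str, max_ratio: float = 0.001) -> bool:
    # Rejecting on overall NG coverage (rather than on any single match)
    # reduces false positives on pages that merely mention a term once.
    return ng_ratio(text) <= max_ratio
```

A trained classifier could then be layered on top of this rule-based pass, scoring the documents that the simple pattern check does not decisively reject.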

What are the potential limitations or biases in the presented corpus, and how can they be addressed in future work?

The potential limitations or biases in the presented corpus include:
- Language bias: The corpus may still contain inherent language biases that could affect the performance of models trained on it. Thorough analysis of the data and continuous monitoring for biases are essential.
- Cultural bias: The corpus may reflect particular cultural perspectives that influence the language models trained on it. Incorporating diverse cultural perspectives and ensuring representation from various communities helps mitigate this.
- Content relevance: Some content in the corpus may be irrelevant or unsuitable for training language models. Regular reviews and updates to remove irrelevant or outdated content help maintain corpus quality.
- Data privacy: Sensitive information within the corpus must be protected. Robust anonymization techniques and compliance with data protection regulations address this concern.

To address these limitations and biases, future work can focus on:
- Bias detection algorithms to identify and mitigate biases in the corpus.
- Diverse dataset sources, incorporating data from varied sources and domains to ensure a well-rounded, representative corpus.
- Transparency and documentation of the corpus construction process, including known biases or limitations, to facilitate further research and improvement.

How can the corpus building process be extended to other languages beyond Japanese to create high-quality training data for multilingual language models?

To extend the corpus building process to other languages beyond Japanese, the following steps can be taken:
- Language expertise: Collaborate with language experts and native speakers of the target languages to ensure accurate language detection and text extraction (a hedged language-identification sketch follows this list).
- Cultural sensitivity: Account for the cultural nuances and sensitivities of each language to avoid biases and ensure inclusivity in the corpus.
- Multilingual approach: Combine data from multiple languages to build a diverse, comprehensive training dataset for multilingual language models.
- Quality assurance: Apply language-specific quality assurance measures to maintain the integrity and accuracy of the corpus.
- Collaboration: Partner with researchers and organizations from different language-speaking regions to gather data and insights for building high-quality multilingual training data.
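For the language-detection step in particular, an off-the-shelf language identifier can play the role that the rapid Japanese check plays in the original pipeline. Below is a minimal sketch assuming fastText's public lid.176.bin language-identification model; the local model path, confidence threshold, and downstream routing are illustrative assumptions.

```python
import fasttext

# fastText's 176-language identification model; the local path is assumed.
lid = fasttext.load_model("lid.176.bin")

def detect_language(text: str, min_confidence: float = 0.9) -> str | None:
    # fastText expects a single line of input, so newlines are stripped.
    labels, probs = lid.predict(text.replace("\n", " "))
    if probs[0] >= min_confidence:
        return labels[0].removeprefix("__label__")  # e.g. "ja", "ko", "de"
    return None

# A page could then be routed to a language-specific pipeline, e.g.
# (process_korean is a hypothetical downstream function):
# if detect_language(page_text) == "ko":
#     process_korean(page_text)
```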