Core Concepts
Open-source 4B-parameter PhoGPT models enhance Vietnamese NLP.
Abstract:
Introduces PhoGPT-4B and its chat variant PhoGPT-4B-Chat for the Vietnamese language (a loading sketch follows this block).
Reports accuracy competitive with, and in places superior to, existing closed-source and open-source models.
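To make the release concrete, here is a minimal sketch of loading the chat model with Hugging Face transformers. The model ID and prompt template below follow the authors' public release, but both should be verified against the official repository before use.

```python
# Minimal loading sketch. The model ID "vinai/PhoGPT-4B-Chat" and the
# "### Câu hỏi: ... ### Trả lời:" prompt template follow the authors'
# public release; verify against the official repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vinai/PhoGPT-4B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# "How many provinces does Vietnam have?" in the expected chat format.
prompt = "### Câu hỏi: Việt Nam có bao nhiêu tỉnh thành?\n### Trả lời:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```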
Introduction:
Contrasts the success of large language models in English with the limited progress for other languages, including Vietnamese.
Release of PhoGPT models to advance Vietnamese NLP research.
PhoGPT Model Architecture and Pre-training:
A Transformer decoder-based model that incorporates flash attention and ALiBi positional biases (sketched after this block).
Trained on a diverse corpus of Vietnamese texts for two epochs.
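For readers unfamiliar with ALiBi, the following minimal PyTorch sketch shows the idea: instead of learned positional embeddings, each attention head adds a fixed linear penalty to its attention logits that grows with token distance. This is an illustration of the technique, not PhoGPT's actual code.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Build ALiBi biases of shape (num_heads, seq_len, seq_len).
    Illustrative sketch only; not PhoGPT's implementation."""
    # Head-specific slopes form a geometric sequence (the ALiBi recipe
    # for power-of-two head counts): 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    )
    positions = torch.arange(seq_len)
    # distances[i, j] = j - i: zero on the diagonal, negative for past tokens.
    distances = (positions[None, :] - positions[:, None]).float()
    # Zero out future positions here; the causal mask itself (-inf on
    # future logits) is applied separately in the attention layer.
    distances = distances.masked_fill(distances > 0, 0.0)
    # Each head scales the (non-positive) distances by its own slope,
    # so attention logits are penalized linearly with distance.
    return slopes[:, None, None] * distances

bias = alibi_bias(num_heads=4, seq_len=8)
print(bias.shape)  # torch.Size([4, 8, 8])
```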
PhoGPT-4B-Chat: Supervised Fine-tuning:
Fine-tuned on a dataset of instructional prompts with responses plus multi-turn conversations.
The fine-tuning data is built by concatenating several sources into one corpus (see the sketch below).
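As a rough illustration of "concatenating several sources", the sketch below merges instruction files into one formatted corpus. The file names and record fields are hypothetical; the prompt template mirrors the format published with PhoGPT-4B-Chat, but check the official repository for the exact format.

```python
import json
from pathlib import Path

# Template mirroring the released PhoGPT-4B-Chat format
# ("### Câu hỏi:" = question, "### Trả lời:" = answer).
PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời: {response}"

def load_source(path: Path) -> list[dict]:
    """Load one JSONL source of {'instruction': ..., 'response': ...} records."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def build_sft_corpus(source_paths: list[Path]) -> list[str]:
    """Concatenate several instruction sources into one list of
    formatted training examples."""
    examples: list[str] = []
    for path in source_paths:
        for record in load_source(path):
            examples.append(PROMPT_TEMPLATE.format(**record))
    return examples

# Hypothetical file names, for illustration only.
corpus = build_sft_corpus([Path("instructions.jsonl"), Path("conversations.jsonl")])
```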
Evaluation:
Compares against closed-source and open-source models, showing competitive accuracy.
Focuses on Vietnam-related questions, where PhoGPT excels (an evaluation sketch follows this block).
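A bare-bones exact-match accuracy loop of the kind such comparisons rely on might look like the sketch below; the `model_generate` callable and the Q&A pairs are hypothetical stand-ins, not the paper's evaluation harness.

```python
def evaluate_accuracy(model_generate, qa_pairs: list[tuple[str, str]]) -> float:
    """Return the fraction of questions whose generated answer matches
    the reference after simple normalization. Sketch only; the paper's
    actual evaluation protocol may differ."""
    correct = 0
    for question, reference in qa_pairs:
        prediction = model_generate(question)
        # Normalize whitespace and casing before exact-match comparison.
        if prediction.strip().lower() == reference.strip().lower():
            correct += 1
    return correct / len(qa_pairs)

# Usage with any callable mapping a prompt to a model response:
# accuracy = evaluate_accuracy(my_model_generate, vietnam_qa_pairs)
```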
Conclusion:
Introduces state-of-the-art open-source LLMs for the Vietnamese language.
Aims to foster future research and applications in Vietnamese NLP.
Limitations:
Not suitable for tasks involving reasoning, coding, or mathematics.
Caution advised due to potential generation of harmful or biased responses.
Acknowledgments:
Thanks individuals who helped crawl health-domain data and took part in initial discussions.
References:
Cites numerous works on language models, pre-training methods, and model architectures.
Stats
The PhoGPT models have 3.7B parameters (rounded up to 4B in the model names).
Vietnamese corpus consists of 102B tokens.
PhoGPT-4B-Chat was fine-tuned on a dataset of 70K instructional prompts; the base model was pre-trained on the 102B-token corpus above.
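For scale, pre-training on 102B tokens with 3.7B parameters works out to roughly 102 / 3.7 ≈ 27.6 tokens per parameter per epoch, or about 55 tokens per parameter across the two pre-training epochs noted above.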
Quotes
"We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese."
"Our goal is to provide comprehensive and powerful LLMs for Vietnamese."