Core Concepts
Turkish VBART models outperform multilingual models, setting new standards in Turkish NLP research.
Summary
1. Introduction
Evolution of word embedding methods: Word2Vec, GloVe, FastText, ELMo.
Democratization of deep learning frameworks: Keras, TensorFlow, PyTorch.
2. Related Work
BERTurk for Turkish text tasks.
Text summarization, title generation, question answering, and paraphrasing tasks.
3. Model
Tokenizer: SentencePiece Unigram model (see the training sketch after this list).
Network Architecture: Based on mBART, with sinusoidal positional embeddings (sketched after this list).
Pre-training Task: Sentence permutation combined with span masking (a noising sketch follows this list).
Training Corpus: The Turkish sections of OSCAR and mC4.
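As a sketch of the tokenizer bullet above, the snippet below trains a SentencePiece Unigram model on Turkish text. Only the model type reflects the paper; the corpus file name, vocabulary size, and character coverage are illustrative assumptions, not the paper's exact settings.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="turkish_corpus.txt",     # placeholder: plain-text corpus, one sentence per line
    model_prefix="tr_unigram",      # writes tr_unigram.model and tr_unigram.vocab
    model_type="unigram",           # Unigram LM segmentation, as in the paper
    vocab_size=32000,               # illustrative size, not the paper's reported vocabulary
    character_coverage=1.0,         # keep Turkish characters (ç, ğ, ı, ö, ş, ü) fully covered
)

# Load the trained model and segment a Turkish sentence into pieces.
sp = spm.SentencePieceProcessor(model_file="tr_unigram.model")
print(sp.encode("Türkçe metin özetleme için önceden eğitilmiş model.", out_type=str))
```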
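The sinusoidal positional embeddings noted in the architecture bullet are the fixed encodings from the original Transformer; a minimal NumPy sketch, with dimensions chosen only for inspection:

```python
import numpy as np

def sinusoidal_positions(max_len: int, d_model: int) -> np.ndarray:
    """Fixed (max_len, d_model) positional encodings: sine on even dims, cosine on odd dims."""
    positions = np.arange(max_len)[:, None]                          # (max_len, 1)
    dims = np.arange(d_model)[None, :]                               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                                 # (max_len, d_model)
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

pe = sinusoidal_positions(max_len=1024, d_model=8)  # tiny d_model just for illustration
print(pe.shape, pe[1, :4])
```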
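For the pre-training bullet, a minimal sketch of a BART-style noising function that permutes sentences and masks token spans. The Poisson span length, 30% masking ratio, and `<mask>` symbol are assumptions borrowed from common BART/mBART practice, not necessarily the paper's exact configuration.

```python
import numpy as np

MASK = "<mask>"

def permute_sentences(sentences, rng):
    """Sentence permutation: shuffle the order of sentences within a document."""
    order = rng.permutation(len(sentences))
    return [sentences[i] for i in order]

def mask_spans(tokens, rng, mask_ratio=0.3, span_lambda=3.5):
    """Span masking: collapse random token spans into a single MASK token
    until roughly mask_ratio of the tokens have been covered."""
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    while budget > 0 and len(tokens) > 1:
        span = max(1, min(int(rng.poisson(span_lambda)), budget, len(tokens) - 1))
        start = int(rng.integers(0, len(tokens) - span + 1))
        tokens[start:start + span] = [MASK]          # the whole span becomes one mask token
        budget -= span
    return tokens

rng = np.random.default_rng(0)
doc = ["Bu ilk cümledir .", "Bu ikinci cümledir .", "Bu da üçüncü cümledir ."]
noisy = mask_spans(" ".join(permute_sentences(doc, rng)).split(), rng)
print(noisy)
```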
4. Experiments
Text Summarization: VBART-Large and XLarge surpass previous models.
Title Generation: VBART models excel in generating titles.
Text Paraphrasing: VBART models outperform mT5-Base.
Question Generation & Answering: VBART models outperform mT5 models.
5. Discussion
Tokenizer efficiency: A dedicated Turkish tokenizer segments text more compactly than multilingual ones (see the comparison sketch after this list).
Model performance: Dedicated Turkish models outperform multilingual ones.
VBART-Large vs. VBART-XLarge: The XLarge model brings only a marginal improvement.
Chinchilla Scaling Law: Applicability to encoder-decoder models (a worked token-budget example follows this list).
Future Work: Model enlargement, different pre-training objectives.
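To make the tokenizer-efficiency point concrete, a hedged comparison sketch: it counts the pieces produced by a local Turkish SentencePiece model (hypothetical path, e.g. the tr_unigram.model trained above) against a multilingual baseline tokenizer; facebook/mbart-large-50 is used here only as an example baseline.

```python
import sentencepiece as spm
from transformers import AutoTokenizer

text = "Yapay zekâ araştırmaları Türkiye'de hızla gelişmektedir."

turkish_sp = spm.SentencePieceProcessor(model_file="tr_unigram.model")   # hypothetical local model
multilingual = AutoTokenizer.from_pretrained("facebook/mbart-large-50")  # example multilingual baseline

# A dedicated Turkish tokenizer is expected to need fewer pieces for the same text.
print("Turkish unigram pieces:", len(turkish_sp.encode(text)))
print("Multilingual pieces:   ", len(multilingual.tokenize(text)))
```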
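The Chinchilla point can be illustrated with its usual rule of thumb of roughly 20 training tokens per parameter; the model sizes below are illustrative, not the paper's reported parameter counts, and the discussion concerns how directly this heuristic carries over to encoder-decoder models.

```python
TOKENS_PER_PARAM = 20  # Hoffmann et al. (2022) compute-optimal rule of thumb

for params in (400e6, 750e6):  # illustrative model sizes, not the paper's figures
    optimal_tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e6:.0f}M params -> ~{optimal_tokens / 1e9:.0f}B compute-optimal tokens")
```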
Stats
The VBART-Large model outperforms the mBART25, mBART50, and mT5-Base models.
The VBART-Large and VBART-XLarge models show results comparable to the mT5-Large model.
The XLarge model's improvement is small, but it could grow substantially if the model were pre-trained for more steps.
Citations
"Our work shows that having a pre-trained LLM for Turkish outperforms up to 3x multilingual models."