Core Concepts
Language-independent representations improve zero-shot summarization by decoupling task-specific knowledge from language-specific generation abilities.
Abstract
The paper focuses on improving zero-shot summarization, both intralingual and crosslingual, by developing methods to decouple language-specific and task-specific knowledge in pretrained multilingual language models.
Key highlights:
Naive finetuning of pretrained models on summarization data leads to highly language-specific representations, resulting in poor zero-shot performance.
The authors propose "query-key (QK) finetuning" to selectively update only the query and key projections in the attention modules, while keeping the value projections fixed to retain the pretrained language generation capabilities.
For crosslingual zero-shot settings, the authors further introduce a "balanced adversarial language classifier" that encourages the model to learn language-agnostic representations by incentivizing the classifier to predict a uniform distribution over languages.
Combining QK finetuning and the balanced adversarial classifier, along with a two-step finetuning approach (first on translation, then on summarization), leads to significant improvements in zero-shot crosslingual summarization performance.
Qualitative analyses show that removing source language identity correlates with better zero-shot summarization, confirming the importance of language-agnostic representations.
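The selective update in QK finetuning amounts to freezing every parameter except the attention query and key projections. A minimal sketch, not the authors' code: the `q_proj`/`k_proj` naming and the mutable `requires_grad` flag follow common PyTorch-style Transformer implementations and are assumptions here.

```python
def apply_qk_finetuning(named_params):
    """Mark only attention query/key projections as trainable.

    named_params: iterable of (name, param) pairs, where each param has a
    mutable `requires_grad` flag (as in PyTorch). Value projections and
    all other weights stay frozen, preserving the pretrained generation
    capabilities during task finetuning.
    """
    trainable = []
    for name, param in named_params:
        is_qk = ".q_proj." in name or ".k_proj." in name
        param.requires_grad = is_qk  # freeze everything else
        if is_qk:
            trainable.append(name)
    return trainable
```

In a real training loop, only the returned parameters would be handed to the optimizer; the value projections keep their pretrained weights.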
Stats
"Finetuning on English only leads the model to forget its generation ability for other languages, resulting in off-target generation."
"Even after explicitly encouraging language-agnostic representations with an adversarial language classifier, recovering the source language identity remains easy."
Quotes
"Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions."
"We hypothesize that the value projections should be kept unchanged to prevent losing pretrained generation capabilities during finetuning. In contrast, query and key are updated as adaptation to specific tasks."
"A problem with the cross-entropy-based formulation is that it operates on single classes and does not incentivize language-agnostic representations on the output distribution level."
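The distribution-level fix the last quote motivates can be sketched as a classifier loss that pushes the predicted language distribution toward uniform, rather than penalizing only the single gold class. This is a minimal sketch under the assumption that "balanced" means cross-entropy against a uniform target; the paper's exact formulation may differ.

```python
import math

def balanced_adversarial_loss(logits):
    """Cross-entropy between the language classifier's softmax output
    and the uniform distribution over the K languages.

    Minimizing this through the encoder (e.g. via gradient reversal)
    incentivizes language-agnostic representations at the output
    distribution level: the minimum, log(K), is reached exactly when
    every language receives equal probability.
    """
    m = max(logits)  # shift for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]  # log-softmax
    k = len(logits)
    # Uniform target puts mass 1/k on every language.
    return -sum(log_probs) / k
```

Unlike single-class cross-entropy, this objective keeps penalizing the encoder whenever any language is favored, even if the gold language is not the one predicted.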