
Extending Text Detoxification to New Languages with Parallel Data: MultiParaDetox


Core Concepts
This work presents MultiParaDetox, a pipeline for extending the ParaDetox text detoxification corpus collection procedure to new languages. The authors showcase the pipeline by collecting new parallel datasets for Spanish, Russian, and Ukrainian, and present an evaluation study of unsupervised baselines, large language models, and fine-tuned supervised models for these languages, affirming the advantages of parallel corpora for text detoxification.
Abstract
The paper introduces MultiParaDetox, an extension of the ParaDetox pipeline for collecting parallel text detoxification corpora in new languages. The pipeline has three key steps:

1. Toxic Corpus Preparation: obtain toxic samples, either from an existing binary toxicity classification dataset or by filtering a general corpus with toxic keywords.
2. Task Language Adaptation: translate the ParaDetox crowdsourcing tasks into the target language and have native speakers proofread the translations.
3. Task Settings Adjustment: configure the crowdsourcing tasks for the target language, including annotator language requirements and quality control.

The authors applied this pipeline to collect new parallel datasets for Spanish, Russian, and Ukrainian. The quality of the collected data was verified manually by native speakers. To validate the usefulness of the new datasets, the authors conducted text detoxification experiments comparing unsupervised baselines, large language models, and fine-tuned models. The results show that models fine-tuned on the parallel corpora outperform the unsupervised approaches and zero-shot prompted language models, confirming the benefits of parallel data for text detoxification in these languages. The authors also discuss the limitations of the study, such as the focus on only explicit toxicity types and the uneven distribution of sample ratios in the datasets. They suggest future research directions, including exploring implicit toxicity, minimal data requirements for fine-tuning, and cross-lingual knowledge transfer.
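The first pipeline step, selecting toxic candidates by keyword filtering, can be sketched as follows. This is a minimal illustration, not the authors' code: the keyword list and function names are placeholders, and a real setup would use a curated toxic lexicon for the target language.

```python
import re

# Placeholder lexicon; in practice this would be a curated list of
# toxic keywords for the target language (Spanish, Russian, Ukrainian, ...).
TOXIC_KEYWORDS = {"idiot", "stupid", "trash"}

def has_toxic_keyword(text, keywords=TOXIC_KEYWORDS):
    """Return True if the text contains any keyword as a whole word."""
    tokens = re.findall(r"\w+", text.lower())
    return any(tok in keywords for tok in tokens)

def filter_toxic_candidates(corpus):
    """Select candidate toxic sentences from a general corpus."""
    return [sentence for sentence in corpus if has_toxic_keyword(sentence)]
```

The filtered sentences would then be sent to the crowdsourcing tasks for paraphrasing; sentences the filter misses (implicit toxicity, misspellings) are a known limitation of keyword-based selection.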
Stats
The original toxic texts contain rude words, obscenities, and offensive language. The detoxified paraphrases aim to preserve the original meaning while removing the toxic elements. The datasets contain 8,500 unique Russian inputs, 2,122 unique Ukrainian inputs, and 337 unique Spanish inputs, with 1.67-2.19 paraphrases per input on average. The total costs for data collection were $880 for Russian, $849 for Ukrainian, and $278 for Spanish.
Quotes
"While the parallel detoxification corpora are already available together with their collection pipelines, they were only presented for English language. However, we strongly support the idea of such corpus availability for any language would lead to fair and safe LMs development equally for all languages."

"The models fine-tuned on the presented data never fail in any of the evaluation parameters and outperform unsupervised baselines based on J score with a high gap. This attests to the reliability of our data and necessity of parallel text detoxification corpora in acquiring state-of-the-art text detoxification models."
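The J score mentioned in the second quote is, in the usual ParaDetox-style evaluation setup, a joint metric: the per-sample product of style transfer accuracy (STA), content similarity (SIM), and fluency (FL), averaged over the test set. A minimal sketch, assuming per-sample scores in [0, 1] are already available from the respective classifiers:

```python
def joint_score(sta, sim, fl):
    """Joint metric J: average over samples of STA * SIM * FL.

    sta, sim, fl: parallel lists of per-sample scores in [0, 1] for
    style transfer accuracy, content similarity, and fluency.
    """
    assert len(sta) == len(sim) == len(fl), "score lists must align"
    per_sample = [a * s * f for a, s, f in zip(sta, sim, fl)]
    return sum(per_sample) / len(per_sample)
```

Because the three factors are multiplied per sample, a model that breaks any one property (e.g. fluent but off-topic output) scores near zero on that sample, which is why the quote stresses that the fine-tuned models "never fail in any of the evaluation parameters".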

Key Insights Distilled From

by Daryna Demen... at arxiv.org 04-03-2024

https://arxiv.org/pdf/2404.02037.pdf
MultiParaDetox

Deeper Inquiries

How can the MultiParaDetox pipeline be extended to collect parallel corpora for implicit toxicity types, such as sarcasm or racism, which may require more sophisticated task definitions?

To extend the MultiParaDetox pipeline for collecting parallel corpora for implicit toxicity types like sarcasm or racism, a more nuanced task definition is necessary. For sarcasm, annotators could be asked to identify sarcastic elements in the text and paraphrase them in a way that retains the sarcastic tone while removing any offensive content. This would require a deeper understanding of linguistic nuances and context. Similarly, for racism, annotators could be tasked with identifying racially insensitive language and transforming it into neutral or inclusive language. This would involve not only paraphrasing but also addressing underlying biases and stereotypes present in the text. Crowdsourcing tasks would need to be carefully designed to capture the subtleties of implicit toxicity types, ensuring that the resulting parallel corpora are accurate and effective for training models to detect and mitigate such forms of toxicity.

What is the minimal amount of parallel data required to fine-tune a robust text detoxification model, and how does this vary across different languages?

The minimal amount of parallel data required to fine-tune a robust text detoxification model can vary depending on the complexity of the language, the diversity of toxic expressions, and the specific toxicity types being addressed. Generally, a few hundred parallel pairs per language may be sufficient to train a basic model, but for more nuanced tasks like implicit toxicity detection, a larger dataset would be needed to capture the intricacies of language use. In low-resource languages, the availability of parallel data may be limited, making it challenging to fine-tune models effectively. In such cases, techniques like multilingual transfer learning or leveraging cross-lingual knowledge transfer can help bridge the gap by transferring knowledge from high-resource languages to low-resource languages. This approach can reduce the data requirements for fine-tuning and improve the performance of text detoxification models in languages with limited resources.
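One practical way to probe the minimal-data question empirically is a learning-curve experiment: fine-tune on increasing subsets of the parallel corpus and report the smallest size that reaches a target score. A model-agnostic sketch under stated assumptions: `train_fn`, `eval_fn`, and the threshold are placeholders for a real fine-tuning routine and an evaluation metric such as J.

```python
import random

def minimal_data_estimate(pairs, train_fn, eval_fn, sizes, threshold, seed=0):
    """Return the smallest subset size whose model reaches `threshold`,
    or None if no tested size does.

    pairs:     list of (toxic, detoxified) parallel pairs.
    train_fn:  callable taking a list of pairs, returning a trained model.
    eval_fn:   callable taking a model, returning a score (e.g. J).
    sizes:     candidate training-set sizes to try, smallest first.
    """
    rng = random.Random(seed)
    shuffled = pairs[:]          # shuffle a copy so subsets are random
    rng.shuffle(shuffled)
    for n in sorted(sizes):
        model = train_fn(shuffled[:n])
        if eval_fn(model) >= threshold:
            return n
    return None
```

Running this per language would make the cross-language variation discussed above measurable rather than anecdotal, at the cost of repeated fine-tuning runs.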

Can leveraging cross-lingual knowledge transfer between languages from neighboring language families help improve the performance of text detoxification models in low-resource languages?

Cross-lingual knowledge transfer between languages from neighboring language families can indeed help improve the performance of text detoxification models in low-resource languages. Languages from the same language family or with similar linguistic structures may share commonalities in terms of vocabulary, syntax, and cultural nuances. By leveraging this shared knowledge, models trained on data from one language can be adapted to perform well in related languages, even with limited parallel data. For example, languages like Spanish and Italian, which belong to the Romance language family, share similarities that can be exploited for cross-lingual transfer learning. By fine-tuning models on data from one language and transferring the knowledge to a related language, the performance of text detoxification models in low-resource languages can be significantly enhanced. This approach maximizes the utility of available data and resources, making text detoxification more accessible and effective across diverse linguistic contexts.