Extending Text Detoxification to New Languages with Parallel Data: MultiParaDetox
This work presents MultiParaDetox, a pipeline that extends the ParaDetox procedure for collecting text detoxification corpora to new languages. The authors demonstrate the pipeline by collecting new parallel datasets for Spanish, Russian, and Ukrainian. They then evaluate unsupervised baselines, large language models, and fine-tuned supervised models on these languages; the results affirm the advantages of parallel corpora for text detoxification.