Core Concepts
This paper presents MultiLS-SP/CA, a novel dataset for lexical simplification in Spanish and Catalan that includes scalar ratings of how difficult each lexical item is to understand. It is the first dataset of its kind for Catalan and a substantial addition to the sparse data available for automatic lexical simplification in Spanish.
Abstract
The paper presents two new datasets, MultiLS-SP and MultiLS-CA, for lexical simplification in Spanish and Catalan, respectively.
The key highlights are:
MultiLS-SP/CA is the first dataset for lexical simplification in Catalan and a significant addition to the limited data available for Spanish.
The datasets include not only complex words that can be simplified but also non-substitutable words, which gives a more comprehensive scenario for system development.
Each target word is annotated with a scalar lexical complexity rating on a 5-point Likert scale and with up to 3 lexical substitutes.
The paper describes the data compilation process and provides baseline performance for lexical simplification and lexical complexity prediction tasks using the new datasets.
The baseline results indicate substantial room for improvement, highlighting the need for more advanced models and the value of these new resources for the Iberian Romance languages.
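The annotation scheme described in the highlights above can be pictured as one record per target word. The following is a minimal sketch, not the dataset's actual file format: the class name `AnnotatedTarget`, its field names, the 1 to 5 numeric coding of the Likert scale, and the example sentence and rating are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

MAX_SUBSTITUTES = 3             # "up to 3 lexical substitutes" per the summary
LIKERT_MIN, LIKERT_MAX = 1, 5   # assumed numeric coding of the 5-point scale


@dataclass
class AnnotatedTarget:
    """Hypothetical record for one target word in its context sentence."""
    language: str                # e.g. "es" or "ca" (assumed codes)
    sentence: str                # context sentence containing the target
    target: str                  # the word to be rated and, if possible, simplified
    complexity: float            # scalar complexity rating on the Likert scale
    substitutes: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Basic consistency checks that follow from the description above.
        if self.target not in self.sentence:
            raise ValueError("target must occur in the context sentence")
        if not LIKERT_MIN <= self.complexity <= LIKERT_MAX:
            raise ValueError("complexity must lie on the 5-point scale")
        if len(self.substitutes) > MAX_SUBSTITUTES:
            raise ValueError("at most 3 substitutes are allowed")

    @property
    def is_substitutable(self) -> bool:
        # Non-substitutable words are modelled here as having no substitutes.
        return bool(self.substitutes)


# Invented example, not taken from the dataset.
example = AnnotatedTarget(
    language="es",
    sentence="El médico recetó un analgésico.",
    target="analgésico",
    complexity=3.4,
    substitutes=["calmante"],
)
print(example.is_substitutable)
```

Representing non-substitutable words as records with an empty substitute list keeps both kinds of target in one structure, so a system can be evaluated on deciding whether to simplify as well as on how to simplify.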
Stats
The datasets contain a total of 1,105 target words (625 in Spanish, 480 in Catalan) embedded in 370 context sentences (210 in Spanish, 160 in Catalan).
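As a quick consistency check on these figures, the short sketch below recomputes the totals and derives the average number of target words per context sentence. The averages are derived values, not numbers reported in the summary, and say nothing about how targets are actually distributed across individual sentences.

```python
# Counts as stated in the Stats section.
counts = {
    "Spanish": {"targets": 625, "sentences": 210},
    "Catalan": {"targets": 480, "sentences": 160},
}

total_targets = sum(c["targets"] for c in counts.values())
total_sentences = sum(c["sentences"] for c in counts.values())

# Derived quantity: mean number of target words per context sentence.
targets_per_sentence = {
    lang: c["targets"] / c["sentences"] for lang, c in counts.items()
}
overall_ratio = total_targets / total_sentences

print(total_targets, total_sentences)
for lang, ratio in targets_per_sentence.items():
    print(f"{lang}: {ratio:.2f} targets per sentence")
print(f"Overall: {overall_ratio:.2f} targets per sentence")
```

Both languages come out at roughly three target words per sentence on average.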