Key Concept
FastSpell is a language identifier that combines fastText and Hunspell to provide a refined second opinion on language predictions, with a focus on accurately distinguishing between similar, closely related languages.
Abstract
The paper introduces FastSpell, a language identification tool that aims to improve upon existing language identifiers, particularly in cases where they struggle to differentiate between similar or closely-related languages.
FastSpell works in two steps:
- It first uses the fastText language identification model to make an initial prediction.
- If the predicted language is the targeted language or one of its configured similar languages, FastSpell then uses the Hunspell spell-checker to refine the decision by checking the spelling of the text against the dictionaries for the targeted language and its similar languages. Depending on the ratio of spelling errors for each of these languages, FastSpell either confirms the targeted language or replaces it with the similar language that has the fewest spelling errors.
This two-step approach allows FastSpell to better handle cases where existing language identifiers tend to confuse similar languages, such as Spanish and Galician, or the Bokmål and Nynorsk variants of Norwegian. FastSpell also helps identify new languages or language varieties that may be ignored or misclassified by other tools.
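The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not FastSpell's actual code or API: `refine_language`, `error_ratio`, the `predict` callable (standing in for fastText), and the `spellers` mapping (standing in for Hunspell dictionaries) are all hypothetical names, the toy word lists are invented for the example, and breaking ties in favour of the targeted language is a choice made for this sketch only.

```python
from typing import Callable, Dict, Iterable, List


def error_ratio(tokens: Iterable[str], is_correct: Callable[[str], bool]) -> float:
    """Return the fraction of tokens the spell-checker rejects (0.0 if empty)."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    errors = sum(1 for token in tokens if not is_correct(token))
    return errors / len(tokens)


def refine_language(
    text: str,
    target: str,
    similar: Dict[str, List[str]],
    predict: Callable[[str], str],
    spellers: Dict[str, Callable[[str], bool]],
) -> str:
    """Two-step identification: initial prediction, then spell-check refinement."""
    # Step 1: initial prediction (fastText in the real tool; a stand-in here).
    predicted = predict(text)

    # Only refine when the prediction is the target or one of its similar languages.
    candidates = [target] + list(similar.get(target, []))
    if predicted not in candidates:
        return predicted

    # Step 2: spell-check the text in every candidate language.
    tokens = text.lower().split()
    ratios = {
        lang: error_ratio(tokens, spellers[lang])
        for lang in candidates
        if lang in spellers
    }
    if not ratios:
        return predicted

    # Pick the lowest error ratio. min() returns the first minimum it sees,
    # and the target is listed first, so ties go to the target (an assumption).
    return min(ratios, key=ratios.get)


# Toy word lists standing in for Hunspell dictionaries (invented for illustration).
TOY_WORDS = {
    "es": {"el", "perro", "come", "en", "la", "calle"},
    "gl": {"o", "can", "come", "na", "rúa"},
}
spellers = {
    lang: (lambda word, words=words: word in words)
    for lang, words in TOY_WORDS.items()
}
similar = {"es": ["gl"], "gl": ["es"]}


def always_spanish(text: str) -> str:
    """Stand-in predictor that confuses Galician with Spanish."""
    return "es"


print(refine_language("o can come na rúa", "es", similar, always_spanish, spellers))  # gl
print(refine_language("el perro come en la calle", "es", similar, always_spanish, spellers))  # es
```

In the first call the stand-in predictor says Spanish, but four of the five tokens fail the Spanish word list while none fail the Galician one, so the sketch replaces the prediction with Galician. Skipping the spell-check when the prediction falls outside the candidate list mirrors the efficiency motivation described in the paper.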
The paper describes the motivation for developing FastSpell, the benchmarking process used to evaluate and select the language identification tools, the details of the FastSpell algorithm, and how to use and configure the tool. The authors also discuss potential future enhancements to further improve FastSpell's performance and language coverage.
Statistics
FastSpell is up to 100 times faster than the HeLI-OTS language identifier on average.
FastSpell achieves F1 scores of 0.983 for Serbo-Croatian, 0.914 for Maltese, and 0.810 for Norwegian Nynorsk, outperforming other benchmarked tools.
Quotes
"FastSpell was initially developed as part of the code of the ParaCrawl series of projects aiming at deriving parallel data from web-crawled content."
"FastSpell was developed to be able to cope with these issues and refine the decisions made by CLD2 at the beginning of the pipeline."
"FastSpell focuses on a given language (the targeted language), that is provided as a parameter. Given a text (usually, a sentence), FastSpell will first predict its language by using fastText. For efficiency, only if the predicted language is the targeted language or a similar language according to a configurable list, FastSpell will try to refine the fastText prediction by checking the sentence spelling with Hunspell for the targeted language and its similar languages."