The authors present a reverse dictionary system that leverages information retrieval techniques, pre-trained language models, and approximate nearest neighbor search. The system encodes word definitions from an existing Estonian language lexicon, Sõnaveeb, using various pre-trained transformer-based models, and then performs semantic search over these encoded definitions to find the most relevant words when a user provides a description or definition.
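The pipeline described above (encode every lexicon definition with a pre-trained model, then retrieve the nearest definitions to a user's query) can be sketched as follows. This is a minimal, self-contained illustration: the toy character-trigram `embed` function is a stand-in for a real encoder such as LaBSE or E5, and the brute-force cosine search stands in for an approximate nearest neighbor index; the class and function names are hypothetical, not from the paper.

```python
import numpy as np

# Hypothetical stand-in for a pre-trained sentence encoder (e.g. LaBSE, E5);
# a real system would call the model's encode() method instead.
def embed(text: str, dim: int = 256) -> np.ndarray:
    """Deterministic toy embedding: hash character trigrams into a unit vector."""
    vec = np.zeros(dim)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        vec[hash(padded[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

class ReverseDictionaryIndex:
    """Brute-force cosine-similarity index over encoded definitions.
    A production system would use an approximate nearest neighbor library."""
    def __init__(self, lexicon: dict):
        self.words = list(lexicon)
        # Encode every definition once, up front.
        self.matrix = np.stack([embed(lexicon[w]) for w in self.words])

    def search(self, query: str, k: int = 3) -> list:
        # Vectors are unit-norm, so the dot product is cosine similarity.
        scores = self.matrix @ embed(query)
        top = np.argsort(-scores)[:k]
        return [self.words[i] for i in top]

# Tiny made-up lexicon standing in for Sõnaveeb entries.
lexicon = {
    "koer": "a domesticated animal that barks and is kept as a pet",
    "kass": "a small furry animal that purrs and catches mice",
    "raamat": "a set of printed pages bound together for reading",
}
index = ReverseDictionaryIndex(lexicon)
print(index.search("a pet animal that barks", k=1))
```

Swapping the toy `embed` for a multilingual encoder is what makes the same index work for cross-lingual queries, since such models map Estonian and, say, English descriptions into a shared vector space.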
The authors evaluate the system in two settings: an unlabeled evaluation that uses the lexicon's own structure and synonymy relations to define the ground truth, and a labeled evaluation that extends an existing English reverse dictionary dataset by translating the words and definitions to Estonian and Russian.
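In both settings the evaluation reduces to ranking metrics: given a definition, how highly does the system rank the ground-truth word (or one of its synonyms)? A minimal sketch of one such metric, accuracy@k, with hypothetical names and made-up example data:

```python
def accuracy_at_k(rankings: list, gold: list, k: int) -> float:
    """Fraction of queries whose gold word appears in the top-k results.

    rankings: one ranked candidate list per query definition.
    gold: the ground-truth word for each query.
    """
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

# Made-up example: three queries, each with gold word "koer".
rankings = [["koer", "kass"], ["raamat", "koer"], ["kass", "koer"]]
gold = ["koer", "koer", "koer"]
print(accuracy_at_k(rankings, gold, 1))  # gold word ranked first in 1 of 3 queries
```

In the unlabeled setting, the gold set for a definition would instead be derived from the lexicon itself, e.g. the headword and its synonyms, which is what lets the evaluation run without manual annotation.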
The results show that models trained for cross-lingual retrieval whose training data includes Estonian, such as E5 and LaBSE, perform best on both the monolingual and the cross-lingual reverse dictionary tasks. The authors also propose a novel unlabeled evaluation approach that requires no annotated data, making it suitable for non-English and multilingual dictionaries.
The authors aim to integrate the best-performing model into the existing Sõnaveeb language portal to enrich the resource and support language learning, documentation, and preservation.
Key insights extracted from the paper by Aleksei Dork... (arxiv.org, 05-01-2024): https://arxiv.org/pdf/2404.19430.pdf