The authors present a reverse dictionary system that leverages information retrieval techniques, pre-trained language models, and approximate nearest neighbor search. The system encodes word definitions from an existing Estonian language lexicon, Sõnaveeb, using various pre-trained transformer-based models, and then performs semantic search over these encoded definitions to find the most relevant words when a user provides a description or definition.
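The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the words, the three-dimensional "embeddings", and the `top_k_words` helper are all hypothetical stand-ins for definition vectors that would in practice come from a pre-trained encoder such as LaBSE or E5, with approximate nearest neighbor search replacing the exact cosine ranking shown here.

```python
import numpy as np

# Hypothetical pre-computed definition embeddings; in the real system these
# would be produced by a transformer encoder over Sõnaveeb definitions.
words = ["koer", "kass", "maja"]
definition_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.1, 0.9],
], dtype=float)

def top_k_words(query_embedding, k=2):
    """Rank lexicon entries by cosine similarity of their definition
    embeddings to the embedding of the user's description."""
    q = query_embedding / np.linalg.norm(query_embedding)
    defs = definition_embeddings / np.linalg.norm(
        definition_embeddings, axis=1, keepdims=True)
    scores = defs @ q          # cosine similarities
    order = np.argsort(-scores)[:k]
    return [words[i] for i in order]

# A query embedding close to the first definition retrieves that word first.
print(top_k_words(np.array([1.0, 0.2, 0.0])))
```

At lexicon scale, the exact dot-product ranking would be swapped for an approximate nearest neighbor index, trading a small amount of recall for much faster lookup.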
The authors evaluate the system in two settings: an unlabeled evaluation that uses the lexicon's own structure and synonymy relations to define the ground truth, and a labeled evaluation that extends an existing English reverse dictionary dataset by translating the words and definitions to Estonian and Russian.
The results show that models trained for cross-lingual retrieval whose training data includes Estonian, such as E5 and LaBSE, perform best on both the monolingual and the cross-lingual reverse dictionary tasks. The authors also propose a novel unlabeled evaluation approach that requires no annotated data, making it suitable for non-English and multilingual dictionaries.
The authors aim to integrate the best-performing model into the existing Sõnaveeb language portal to enrich the resource and support language learning, documentation, and preservation.
Key takeaways from arxiv.org, by Aleksei Dork..., 05-01-2024
https://arxiv.org/pdf/2404.19430.pdf