Core Concepts
A reverse dictionary system that uses modern pre-trained language models and approximate nearest neighbor search to enable semantic search over word definitions, allowing users to find words by describing their meaning.
Abstract
The authors present a reverse dictionary system that leverages information retrieval techniques, pre-trained language models, and approximate nearest neighbor search. The system encodes word definitions from an existing Estonian language lexicon, Sõnaveeb, using various pre-trained transformer-based models, and then performs semantic search over these encoded definitions to find the most relevant words when a user provides a description or definition.
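The core retrieval loop described above can be sketched in a few lines. The toy bag-of-words encoder below stands in for a pre-trained transformer such as E5 or LaBSE, and the exhaustive similarity scan stands in for an approximate nearest neighbor index; the lexicon entries are invented for illustration and are not from Sõnaveeb.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a pre-trained sentence encoder (e.g. E5 or LaBSE):
    # a bag-of-words count vector. A real system would encode with a transformer.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_index(lexicon):
    # Encode every definition once; this collection is the search index.
    return [(word, embed(definition)) for word, definition in lexicon]

def reverse_lookup(index, description, k=3):
    # Encode the user's description and rank words by similarity.
    # A production system would use approximate nearest neighbor search
    # instead of this exhaustive scan.
    q = embed(description)
    ranked = sorted(index, key=lambda e: cosine(q, e[1]), reverse=True)
    return [word for word, _ in ranked[:k]]

lexicon = [
    ("telescope", "an instrument for viewing distant objects"),
    ("microscope", "an instrument for viewing very small objects"),
    ("compass", "a device that shows the direction of magnetic north"),
]
index = build_index(lexicon)
print(reverse_lookup(index, "instrument used to look at faraway objects", k=1))
```

The index is built once offline; at query time only the user's input needs encoding, which is what makes the approach practical at lexicon scale.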
The authors evaluate the system in two settings: an unlabeled evaluation that uses the lexicon's own structure and synonymy relations to define the ground truth, and a labeled evaluation that extends an existing English reverse dictionary dataset by translating the words and definitions to Estonian and Russian.
The results show that models trained for cross-lingual retrieval whose training data includes Estonian, such as E5 and LaBSE, perform best in both monolingual and cross-lingual reverse dictionary tasks. The authors also propose a novel unlabeled evaluation approach that does not require annotated data, making it suitable for non-English and multilingual dictionaries.
The authors aim to integrate the best-performing model into the existing Sõnaveeb language portal to enrich the resource and support language learning, documentation, and preservation.
Stats
The Sõnaveeb lexicon contains 124K words, 213K Estonian definitions, and 16K definitions in other languages.
There are 295K synonymy relations, which were mirrored to 590K relations.
On average, each word has 3.85 synonyms.
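The mirroring step (295K relations doubled to 590K) amounts to making directed synonymy pairs symmetric. A toy sketch, with an invented pair representation that may differ from Sõnaveeb's actual data model:

```python
from collections import defaultdict

# Hypothetical directed synonymy pairs (word -> listed synonym),
# as a lexicon might store them one-way.
pairs = [("big", "large"), ("big", "huge"), ("fast", "quick")]

# Mirror each relation so synonymy becomes symmetric: 3 pairs -> 6.
mirrored = set(pairs) | {(b, a) for a, b in pairs}

# Average synonyms per word, computed over the mirrored set.
syns = defaultdict(set)
for a, b in mirrored:
    syns[a].add(b)
avg = sum(len(s) for s in syns.values()) / len(syns)
print(len(mirrored), avg)
```

Mirroring ensures that a word is retrievable through a synonym regardless of which direction the lexicon happened to record the relation in.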
Quotes
"A reverse dictionary (see examples in Table 1) is a system that takes user descriptions or definitions as input and returns words or expressions corresponding to the provided input."
"Early approaches to building reverse dictionary systems were based on information retrieval (IR) techniques reliant on exact term matching: both user inputs and candidate collections were represented using sets of keywords or sparse term-based vector representations."
"When applied to lexicographical data, semantic search may be leveraged to create a reverse dictionary system. Word definitions encoded by a pre-trained language model represent the search index, which is then queried with the encoded representation of the user's input (definition or description of a concept)."