The paper presents a cross-lingual automatic term recognition (ATR) framework to expand the English consumer health vocabulary (CHV) into other languages. The key steps are:
Data Collection: The framework collects healthcare Q&A corpora in English and Chinese as the input.
Pre-processing: It applies NLP techniques to clean the raw texts and extract medical entities from the corpora.
Monolingual Word Vector Space Determination: The framework uses the skip-gram algorithm to determine the word vector spaces for each language independently.
Space Alignment: It aligns the monolingual word spaces into a bilingual word vector space using a small set of medical entity translations as anchors.
Word Expansion: Given a set of seed words, the framework retrieves their synonym candidates across languages using the bilingual word space and a dynamic threshold mechanism.
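The alignment and expansion steps can be illustrated with a small sketch. It is not the paper's implementation: the summary does not say which alignment method or threshold rule the authors use, so the code assumes a least-squares rotation fitted on anchor pairs and a cutoff set relative to the best match (`ratio` is a hypothetical parameter). The 2-D vectors, the word lists, and the function names are all made up for illustration.

```python
import math

def fit_rotation(src_anchors, tgt_anchors):
    """Angle of the 2-D rotation that best maps source anchors onto target anchors
    (least squares). Assumed stand-in for the paper's alignment step."""
    dot = cross = 0.0
    for (sx, sy), (tx, ty) in zip(src_anchors, tgt_anchors):
        dot += sx * tx + sy * ty
        cross += sx * ty - sy * tx
    return math.atan2(cross, dot)

def rotate(vec, theta):
    """Apply the fitted rotation to one source-language vector."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def expand(seed_vec, candidates, ratio=0.9):
    """Return candidate words scoring at least ratio * best score.
    The relative cutoff is an assumption, not the paper's threshold rule."""
    scored = sorted(((cosine(seed_vec, v), w) for w, v in candidates.items()),
                    reverse=True)
    cutoff = ratio * scored[0][0]
    return [w for score, w in scored if score >= cutoff]

# Toy monolingual spaces (hypothetical values; real spaces come from skip-gram).
en = {"flu": (1.0, 0.0), "fever": (0.0, 1.0), "cough": (0.6, 0.8)}
zh = {"流感": (0.0, 1.0), "发烧": (-1.0, 0.0), "咳嗽": (-0.8, 0.6)}

# Anchor translations: flu <-> 流感, fever <-> 发烧.
theta = fit_rotation([en["flu"], en["fever"]], [zh["流感"], zh["发烧"]])

# Map the English seed into the Chinese space, then retrieve candidates.
seed_in_zh = rotate(en["cough"], theta)
matches = expand(seed_in_zh, zh)
print(matches)  # ['咳嗽']
```

In this toy setup the Chinese space is the English space rotated by 90 degrees, so two anchor pairs are enough to recover the mapping. Real embeddings have hundreds of dimensions, where the same idea is usually solved with a matrix factorization instead of a single angle.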
The experimental results show that the bilingual word space induced by the proposed framework outperforms two state-of-the-art large language models (GPT-3.5-Turbo and Cohere Rerank) in identifying cross-lingual consumer health vocabulary. The framework only requires raw user-generated health corpora and a limited set of medical translations, reducing the human effort in compiling cross-lingual CHV.
Key Insights Distilled From
by Chia-Hsuan C... at arxiv.org 04-03-2024
https://arxiv.org/pdf/2206.11612.pdf