The article introduces a novel tagging scheme and decoding algorithm for discontinuous named-entity recognition (NER). The key insights are:
Discontinuous mentions are decomposed into a two-layer representation, where the upper layer identifies sets of mentions and the lower layer identifies typed components (e.g. body parts and events).
A new tagging scheme is proposed that can encode this two-layer structure, with 10 tags in total. The tagging scheme ensures that any valid sequence of tags can be unambiguously mapped to a set of discontinuous mentions.
An efficient decoding algorithm based on weighted finite-state automata is presented, which ensures that only well-formed tag sequences are predicted. The time complexity of both marginal and maximum a posteriori inference is linear in the length of the input.
Because existing datasets do not annotate component types, the authors propose weakly supervised learning techniques that infer the component types during training.
Experiments on three English biomedical datasets show that the proposed approach achieves results competitive with state-of-the-art methods while being significantly faster at inference time.
The authors argue that their tagging scheme and decoding algorithm can be used as a drop-in replacement for any BIO tagger, enabling future research on BIO tagging to be easily evaluated on discontinuous NER as well.
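The idea of restricting decoding to well-formed tag sequences can be illustrated on plain BIO tags. The sketch below is not the paper's 10-tag scheme or its automaton: it assumes a hypothetical three-tag set (O, B, I) and a single constraint (I may only follow B or I), and the names `allowed` and `constrained_viterbi` are made up for illustration. It runs Viterbi over the allowed transitions, so the cost grows linearly with the number of tokens for a fixed tag set.

```python
import math

# Hypothetical tag set for illustration only (plain BIO, not the paper's 10 tags).
TAGS = ["O", "B", "I"]


def allowed(prev, cur):
    """Transition constraint: I may only continue a mention opened by B or I."""
    if cur == "I":
        return prev in ("B", "I")
    return True


def constrained_viterbi(scores):
    """Return the highest-scoring well-formed tag sequence.

    scores: list with one dict per token, mapping each tag to a float score.
    """
    if not scores:
        return []
    # A sequence cannot start with I, so its initial score is -inf.
    best = {t: (scores[0][t] if t != "I" else -math.inf) for t in TAGS}
    back = []
    for token_scores in scores[1:]:
        new_best, pointers = {}, {}
        for cur in TAGS:
            # Only consider predecessors permitted by the constraint.
            prev_score, prev_tag = max(
                (best[p], p) for p in TAGS if allowed(p, cur)
            )
            new_best[cur] = prev_score + token_scores[cur]
            pointers[cur] = prev_tag
        back.append(pointers)
        best = new_best
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Per-token argmax would give O, I, O, which is not a valid BIO sequence.
example = [
    {"O": 1.0, "B": 0.0, "I": 0.5},
    {"O": 0.0, "B": 0.5, "I": 2.0},
    {"O": 1.0, "B": 0.0, "I": 0.0},
]
print(constrained_viterbi(example))  # ['B', 'I', 'O']
```

In this toy example the constraint changes the first tag from O to B so that the high-scoring I on the second token remains reachable. A richer tag set would only change `TAGS` and `allowed`; the decoding loop stays the same.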
Key insights distilled from: Caio Corro, arxiv.org, 09-25-2024
https://arxiv.org/pdf/2409.16243.pdf