The article introduces a novel tagging scheme and decoding algorithm for discontinuous named-entity recognition (NER). The key insights are:
Discontinuous mentions are decomposed into a two-layer representation, where the upper layer identifies sets of mentions and the lower layer identifies typed components (e.g. body parts and events).
A new tagging scheme is proposed that can encode this two-layer structure, with 10 tags in total. The tagging scheme ensures that any valid sequence of tags can be unambiguously mapped to a set of discontinuous mentions.
An efficient decoding algorithm based on weighted finite-state automata is presented, which ensures that only well-formed tag sequences are predicted. The time complexity of both marginal and maximum a posteriori inference is linear in the length of the input.
Because the datasets lack component-type annotations, the authors propose weakly supervised learning techniques that infer the component types during training.
Experiments on three English biomedical datasets show that the proposed approach achieves competitive results compared to state-of-the-art methods, while being significantly faster at inference time.
The authors argue that their tagging scheme and decoding algorithm can be used as a drop-in replacement for any BIO tagger, enabling future research on BIO tagging to be easily evaluated on discontinuous NER as well.
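To make the idea of automaton-constrained decoding concrete, here is a minimal sketch of Viterbi decoding restricted to well-formed tag sequences. It is an illustration only, not the authors' implementation: the summary does not list the paper's 10 tags, so the sketch assumes a plain three-tag BIO set as a stand-in, and the names `TAGS`, `allowed`, `allowed_start`, and `constrained_viterbi` are hypothetical.

```python
# Assumption: a plain BIO tag set stands in for the paper's 10-tag scheme,
# which is not spelled out in this summary.
TAGS = ["O", "B", "I"]
NEG_INF = float("-inf")


def allowed(prev, cur):
    """Transition rule of a tiny tag automaton: I may only follow B or I."""
    return not (cur == "I" and prev == "O")


def allowed_start(tag):
    """A sequence may not begin with I."""
    return tag != "I"


def constrained_viterbi(scores):
    """Return (best_score, best_tags) over well-formed tag sequences only.

    scores[t][k] is the model's score for TAGS[k] at position t.
    The loop over positions does a constant amount of work per token
    (the tag set is fixed), so the run time is linear in the input length.
    """
    n = len(scores)
    if n == 0:
        return 0.0, []
    num_tags = len(TAGS)

    # Initialise with the first position, masking tags that cannot start.
    best = [
        scores[0][k] if allowed_start(TAGS[k]) else NEG_INF
        for k in range(num_tags)
    ]
    back = []  # back[t - 1][k] = best predecessor of tag k at position t

    for t in range(1, n):
        new_best, pointers = [], []
        for k in range(num_tags):
            # Only consider predecessors the automaton permits.
            candidates = [
                (best[j], j)
                for j in range(num_tags)
                if allowed(TAGS[j], TAGS[k])
            ]
            prev_score, prev_j = max(candidates)
            new_best.append(prev_score + scores[t][k])
            pointers.append(prev_j)
        best = new_best
        back.append(pointers)

    # Follow back-pointers from the best final tag.
    k = max(range(num_tags), key=lambda i: best[i])
    total = best[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return total, [TAGS[i] for i in path]


if __name__ == "__main__":
    # The unconstrained per-token argmax would be O then I, which is ill-formed.
    print(constrained_viterbi([[1.0, 0.0, 0.0], [0.0, 0.0, 5.0]]))
```

In this toy input the highest-scoring tag at each position taken independently would give O followed by I, which the automaton forbids, so the decoder returns B followed by I instead. Replacing the `max` over predecessors with a log-sum-exp turns the same recursion into the forward algorithm, which is the standard way to obtain the normaliser needed for marginal inference at the same linear cost.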
Source: arxiv.org