Core Concepts
L3X is a novel methodology for extracting long lists of object entities from long texts, such as entire books. It combines recall-oriented generation using large language models with precision-oriented scrutinization.
Abstract
The paper introduces a new task of extracting long lists of object entities that stand in a specific relation to a given subject, from long texts such as books or websites. The authors present the L3X (LM-based Long List eXtraction) methodology, which works in two stages:
Stage 1 - Recall-oriented Generation:
- An LLM (large language model) is prompted with the subject and relation to generate a full list of object candidates.
- Information retrieval techniques are used to find relevant passages from the long text and feed them into the LLM prompts to improve recall.
- Passage re-ranking and batching techniques are employed to further enhance the LLM's ability to extract long lists.
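The Stage 1 steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the term-overlap ranking stands in for a real retriever and re-ranker, the prompt wording is invented, and `llm` is a hypothetical callable that takes a prompt string and returns one candidate per line. `toy_llm` is a stand-in used only so the sketch runs without a model.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_passages(passages, query):
    """Order passages by query-term overlap (a stand-in for a real retriever)."""
    query_terms = tokenize(query)
    return sorted(
        passages,
        key=lambda p: len(query_terms & tokenize(p)),
        reverse=True,
    )


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_candidates(subject, relation, passages, llm, top_k=4, batch_size=2):
    """Prompt the LLM once per passage batch and merge the candidate lists."""
    ranked = rank_passages(passages, f"{subject} {relation}")[:top_k]
    candidates, seen = [], set()
    for batch in batched(ranked, batch_size):
        # Hypothetical prompt layout: passages first, then the query.
        prompt = (
            "Passages:\n" + "\n".join(batch)
            + f"\nSubject: {subject}\nRelation: {relation}\n"
            + "List all objects, one per line."
        )
        for line in llm(prompt).splitlines():
            name = line.strip(" -\t")
            key = name.lower()
            if name and key not in seen:  # de-duplicate across batches
                seen.add(key)
                candidates.append(name)
    return candidates


def toy_llm(prompt):
    """Stand-in for a real model: echoes known names that occur in the prompt."""
    known = ["Ron", "Hermione", "Hagrid"]
    return "\n".join(name for name in known if name in prompt)
```

Splitting the retrieved passages into several small prompts, rather than one large prompt, lets each call focus on a few passages; the union of the per-batch lists is what favors recall over precision at this stage.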
Stage 2 - Precision-oriented Scrutinization:
- The high-recall list of object candidates from stage 1 is scrutinized using various techniques to validate or prune the candidates.
- Methods include score-based thresholding, confidence elicitation from the LLM, predicate-specific classifiers, and discriminative classifiers that leverage the support passages for each candidate.
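As a rough illustration of score-based thresholding in Stage 2, the sketch below scores each candidate by the number of passages that mention it and keeps only candidates at or above a threshold. The support-count score, the `min_support` parameter, and the function names are assumptions made for this example; the paper's actual scoring functions and classifiers are not reproduced here.

```python
import re


def support_passages(candidate, passages):
    """Return the passages that mention the candidate as a whole word."""
    pattern = re.compile(rf"\b{re.escape(candidate)}\b", re.IGNORECASE)
    return [p for p in passages if pattern.search(p)]


def scrutinize(candidates, passages, min_support=2):
    """Keep candidates mentioned in at least `min_support` passages.

    Returns (candidate, support_count) pairs, highest support first.
    """
    kept = []
    for candidate in candidates:
        count = len(support_passages(candidate, passages))
        if count >= min_support:
            kept.append((candidate, count))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Raising `min_support` prunes more aggressively and trades recall for precision, which is the same trade-off the recall@precision numbers measure.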
The authors construct a new dataset of 10 books/book series and 8 relations, and evaluate L3X using GPT-3.5 as the underlying LLM. The results show that L3X substantially outperforms LLM-only baselines, reaching nearly 80% recall after passage re-ranking and batching and, after scrutinization, about 48% recall at 50% precision (R@P50) and 30% recall at 80% precision (R@P80).
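For readers unfamiliar with recall@precision (R@P), the sketch below uses one common definition: walk down a score-ranked candidate list and report the highest recall reached at any cutoff where precision still meets the target. This definition and the function name are assumptions; the paper's exact evaluation procedure may differ.

```python
def recall_at_precision(ranked, gold, target_precision):
    """Highest recall over list prefixes whose precision meets the target.

    ranked: candidates ordered from most to least confident.
    gold: set of correct objects.
    """
    if not gold:
        raise ValueError("gold set must not be empty")
    best_recall = 0.0
    hits = 0
    for k, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
        precision = hits / k
        if precision >= target_precision:
            best_recall = max(best_recall, hits / len(gold))
    return best_recall
```

With a ranked list `["a", "x", "b", "y", "z", "c"]` and gold objects `{"a", "b", "c", "d"}`, the full list has precision 3/6 and recall 3/4, so R@P50 is 0.75, while only the first item satisfies 80% precision, giving R@P80 of 0.25.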
The key contributions are: (1) defining the new task of extracting long object lists from long documents, (2) the L3X methodology that combines retrieval-augmented LLM generation and scrutinization, and (3) experiments on a new benchmark dataset demonstrating the effectiveness of the approach.
Stats
"We reach nearly 80% recall using our passage re-ranking and batching technique and ca. 48% R@P50 and 30% R@P80 through our scrutinizing technique."
Quotes
"Methods for relation extraction from text mostly focus on high precision, at the cost of limited recall."
"High recall is crucial, though, to populate long lists of object entities that stand in a specific relation with a given subject."
"Cues for relevant objects can be spread across many passages in long texts."
"Our L3X method outperforms LLM-only generations by a substantial margin."