Core Concepts
The Terrorizer algorithm harmonizes company names in patents using natural language processing (NLP) and network theory, reducing the number of unique names by over 42%.
Abstract
The disambiguation of company names in patents is crucial for accurate analysis.
Labor-intensive methods like dictionaries or string matching are insufficient for large datasets.
The Terrorizer algorithm combines NLP, network theory, and rule-based techniques to harmonize company names.
It runs in three main phases: parsing with knowledge augmentation, matching with cosine similarity, and filtering using community detection.
Validation on four datasets shows that it outperforms existing algorithms.
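The three phases named above can be illustrated with a minimal sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the LEGAL_SUFFIXES list stands in for knowledge-augmented parsing, character trigram counts stand in for whatever vector representation the paper uses, the 0.8 threshold is arbitrary, and connected components (via union-find) are a simplified substitute for community detection.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical list of legal-form tokens; a crude stand-in for the
# knowledge-augmented parsing described in the summary.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc",
                  "gmbh", "company", "limited"}


def parse(name):
    """Phase 1 (simplified): lowercase, strip punctuation, drop legal forms."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    kept = [t for t in tokens if t not in LEGAL_SUFFIXES]
    return " ".join(kept or tokens)  # keep original tokens if all were dropped


def trigrams(text):
    """Character trigram counts of a padded string (assumed representation)."""
    padded = f"  {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Phase 2: cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def harmonize(names, threshold=0.8):
    """Phase 3 (simplified): link similar names, return connected groups."""
    names = list(names)
    vectors = {n: trigrams(parse(n)) for n in names}
    parent = {n: n for n in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Add an edge for every pair at or above the similarity threshold.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["Acme Corp.", "ACME Corporation", "Acme, Inc.", "Globex LLC"]
    for group in harmonize(sample):
        print(group)
```

The pairwise comparison is quadratic in the number of names, so a sketch like this only suits small samples; a real community detection step would also split groups that are chained together by a few weak links, which connected components cannot do.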
Stats
"Our final result is a reduction in the initial set of names of over 42%."
"The performance of Terrorizer is stable across different datasets."
"It achieves a higher F1 score compared to the algorithm currently used in PatentsView."
Quotes
"The problem biases research outcomes as it mostly underestimates the number of patents attributed to companies."
"An algorithm as such could provide significant benefit to the community of scholars working on patent data."