Datasets for Patent Classification: Introducing CinPatent in English and Japanese


Core Concepts
The paper introduces two new datasets for patent classification, CinPatent-EN and CinPatent-JA, on which AttentionXML outperforms the other baselines.
Summary

The article introduces two new datasets, one in English and the other in Japanese, for patent classification. These datasets contain a significant number of patent documents with multiple labels. The study compares various multi-label text classification methods on these datasets, with AttentionXML consistently outperforming other strong baselines. The ablation study highlights the importance of combining title, abstract, description, and claim1 for optimal results. Additionally, the article discusses the behavior of baselines with different data segmentation percentages.
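The section combination described in the ablation study can be illustrated with a minimal sketch. It assumes each patent is a dictionary keyed by section name; the field names, the `build_input` helper, and the sample record are hypothetical and not taken from the paper or the dataset schema.

```python
from typing import Mapping, Sequence

# Hypothetical field names; the actual CinPatent schema may differ.
DEFAULT_FIELDS = ("title", "abstract", "description", "claim1")


def build_input(
    patent: Mapping[str, str],
    fields: Sequence[str] = DEFAULT_FIELDS,
    sep: str = " ",
) -> str:
    """Concatenate the selected patent sections, skipping missing or empty ones."""
    parts = []
    for name in fields:
        text = (patent.get(name) or "").strip()
        if text:
            parts.append(text)
    return sep.join(parts)


# Illustrative record, not real data.
patent = {
    "title": "Battery pack",
    "abstract": "A pack with cooling fins.",
    "description": "",
    "claim1": "A battery pack comprising cells.",
}

full_text = build_input(patent)                      # all four sections
title_only = build_input(patent, fields=("title",))  # one ablation setting
```

Passing a different `fields` tuple reproduces one ablation setting at a time, so the same classifier can be trained on each combination of sections and the results compared.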

Stats
The English dataset includes 45,131 patent documents with 425 labels. The Japanese dataset contains 54,657 documents with 523 labels. The USPTO-3M dataset has 3M patents collected from Google Patents Public Datasets. The CLEF-IP dataset includes patents from 1978 to 2009 with IPC codes.
Quotes
"Experimental results show that AttentionXML is consistently better than other strong baselines."
"We make a systematic comparison of strong multi-label classification methods on the two datasets."

Key insights distilled from

by Minh-Tien Ng... at arxiv.org 03-18-2024

https://arxiv.org/pdf/2212.12192.pdf
CinPatent

Deeper Inquiries

How can the findings from this study impact future research in patent classification?

The study compares several multi-label text classification methods on new English and Japanese patent datasets, and its findings can shape future research in two ways. First, AttentionXML outperforms other strong baselines such as FastXML, Parabel, and PatentBERT, which suggests that hybrid models combining tree structures with deep neural networks are worth exploring further. Second, the ablation study shows that combining title, abstract, description, and claim1 gives the best results, which points future work towards using multiple parts of a patent document rather than a single section.

What are potential limitations or biases in using multi-label text classification methods for patent analysis?

Multi-label text classification methods handle complex categorization tasks such as patent analysis well, but they have potential limitations and biases. One limitation is class imbalance: some classification codes are far more prevalent than others, which skews the training data distribution. Bias can also arise from incomplete or missing fields, such as absent abstracts or claims, which may degrade model performance if not addressed during preprocessing. Finally, language-specific nuances can limit how well a model generalizes across languages in multilingual datasets.
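Class imbalance of this kind can be inspected before training. The following is a minimal sketch, assuming each document carries a list of label codes; the helper names and the toy label sets are illustrative and do not come from the paper.

```python
from collections import Counter


def label_frequencies(label_sets):
    """Count how many documents carry each label (a label counts once per document)."""
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))
    return counts


def imbalance_ratio(counts):
    """Ratio between the most and the least frequent label counts."""
    values = list(counts.values())
    return max(values) / min(values)


# Toy label sets using IPC-style codes purely for illustration.
docs = [["H01M", "B60L"], ["H01M"], ["H01M", "G06F"], ["G06F"]]

counts = label_frequencies(docs)
ratio = imbalance_ratio(counts)
```

A large ratio indicates that a few labels dominate the training data, in which case per-label metrics are more informative than a single aggregate score.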

How might advancements in deep learning models further enhance the accuracy and efficiency of patent classification systems?

Advances in deep learning could improve both the accuracy and the efficiency of patent classification systems. Pre-trained language models such as BERT, fine-tuned specifically on patents (as in PatentBERT), capture semantic relationships within documents and can lead to more precise classifications. Attention mechanisms, like those used by AttentionXML, let a model focus on the relevant parts of a document while accounting for label dependencies, which improves overall performance. New architectures that blend hierarchical label structures with neural networks could yield more robust classifiers for long and complex patent documents.
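The idea of attention focusing on relevant parts of a document can be shown with a simplified sketch of dot-product attention for a single label. This is an illustration under stated assumptions, not AttentionXML's actual implementation: the token vectors and the label query are made-up toy values, and `label_attention` is a hypothetical helper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(token_vectors, label_query):
    """Return attention weights over tokens and the weighted-sum context vector."""
    # Score each token by its dot product with the label's query vector.
    scores = [sum(h * q for h, q in zip(vec, label_query)) for vec in token_vectors]
    weights = softmax(scores)
    dim = len(label_query)
    # The context vector is the attention-weighted average of the token vectors.
    context = [
        sum(w * vec[i] for w, vec in zip(weights, token_vectors))
        for i in range(dim)
    ]
    return weights, context


# Toy 2-dimensional token representations and one label query (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = label_attention(tokens, [2.0, 0.0])
```

Tokens that align with the label query receive larger weights, so each label can build its own summary of the document from the passages most relevant to it.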