Core Concepts
FUNDUS is a user-friendly news scraper that provides high-quality extractions through bespoke content extractors tailored to each online newspaper, outperforming generic methods.
Abstract
Introduction:
FUNDUS is a user-friendly news scraper optimized for high-quality extractions.
The tool uses bespoke content extractors tailored to the formatting guidelines of each supported online newspaper.
Evaluation Against Other Scrapers:
In the evaluation, FUNDUS yields significantly higher-quality extractions than other popular news scrapers.
Existing libraries rely on generic extraction methods, which limits their accuracy.
Usage Example:
Users can easily scrape news articles from supported publishers using FUNDUS.
The library combines crawling and content extraction in a single pipeline for ease of use.
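The single-pipeline idea can be illustrated with a minimal sketch. It does not reproduce the FUNDUS API: the names `Article`, `PAGES`, `crawl_urls`, `extract`, and `scrape` are hypothetical, and an in-memory dictionary stands in for real HTTP requests.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Tuple


@dataclass
class Article:
    url: str
    title: str
    body: str


# Hypothetical in-memory "web": URL -> raw page text, standing in for HTTP fetches.
PAGES: Dict[str, str] = {
    "https://example.com/a": "Title A\nBody of article A.",
    "https://example.com/b": "Title B\nBody of article B.",
    "https://example.com/c": "Title C\nBody of article C.",
}


def crawl_urls(pages: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Crawling step: yield (url, raw_page) pairs one at a time."""
    yield from pages.items()


def extract(url: str, raw: str) -> Article:
    """Extraction step: the first line is treated as the title, the rest as the body."""
    title, _, body = raw.partition("\n")
    return Article(url=url, title=title.strip(), body=body.strip())


def scrape(pages: Dict[str, str], max_articles: Optional[int] = None) -> Iterator[Article]:
    """Pipeline: crawl and extract lazily, stopping after max_articles."""
    for count, (url, raw) in enumerate(crawl_urls(pages)):
        if max_articles is not None and count >= max_articles:
            break
        yield extract(url, raw)


articles = list(scrape(PAGES, max_articles=2))
for article in articles:
    print(article.title, "|", article.body)
```

Because `scrape` is a generator, the caller receives finished articles one by one and never has to handle the crawling and extraction stages separately.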
Content Extraction:
FUNDUS uses bespoke extractors for each newspaper, optimizing accuracy and attribute coverage.
Extraction rules are manually crafted for each publisher, ensuring high-quality text extraction.
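A bespoke extractor can be sketched as a per-publisher rule that names the exact element holding the article text. The sketch below uses only the Python standard library; the publisher names, tag and class rules in `RULES`, and the sample page are invented for illustration and are not taken from FUNDUS.

```python
from html.parser import HTMLParser
from typing import List


class ClassTextExtractor(HTMLParser):
    """Collect text inside elements matching a publisher-specific tag and class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 and tag == self.tag:
            self.depth += 1  # nested element with the same tag name
        elif self.depth == 0 and tag == self.tag and self.css_class in classes:
            self.depth = 1  # entering the article body

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


# Hypothetical hand-written rules: publisher -> (tag, class) of the body container.
RULES = {
    "example-times": ("div", "story-text"),
    "sample-post": ("section", "article-body"),
}


def extract_body(publisher: str, html: str) -> str:
    tag, css_class = RULES[publisher]
    parser = ClassTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


PAGE = """
<html><body>
<nav>Home | Politics | Sport</nav>
<div class="story-text"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="ad">Subscribe now!</div>
<footer>Copyright notice</footer>
</body></html>
"""

body = extract_body("example-times", PAGE)
print(body)
```

A generic extractor has to guess which parts of the page are article text. The hand-written rule skips navigation, advertising, and footer text by construction, at the cost of maintaining one rule per publisher.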
Scalability and Performance:
FUNDUS supports access to the CC-NEWS web archive, enabling users to create large news corpora.
The tool demonstrates efficient crawling performance and scalability across different publishers.
Stats
FUNDUS achieves an F1 score of 97.69, yielding higher-quality extractions than other libraries.
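This summary does not state how the F1 score is computed. As a rough illustration only, the sketch below assumes a bag-of-tokens overlap between the extracted text and a reference text; the function name `token_f1` and the sample strings are hypothetical, and the metric used in the actual evaluation may differ.

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> float:
    """F1 over whitespace tokens (assumed metric, not necessarily the paper's)."""
    ext_tokens = extracted.lower().split()
    ref_tokens = reference.lower().split()
    if not ext_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(ext_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ext_tokens)  # share of extracted tokens that are correct
    recall = overlap / len(ref_tokens)     # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


reference = "the quick brown fox"
noisy = "the quick brown fox subscribe now"  # extraction that kept boilerplate

score = token_f1(noisy, reference)
print(round(score, 2))
```

In this example the extraction recovers every reference token (recall 1.0) but includes two boilerplate tokens (precision 4/6), so the F1 is 0.8. Under this assumed metric, extra boilerplate lowers precision and missing article text lowers recall.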