Core Concepts
Using HTML structure and semantic information in Retrieval-Augmented Generation (RAG) systems significantly improves performance over traditional plain-text approaches.
Stats
A real HTML document from the Web contains over 80K tokens on average.
Over 90% of the tokens in a typical HTML document are CSS styles, JavaScript, comments, or other tokens that carry no semantic content.
The HTML cleaning process reduces the length of the HTML to 6% of its original size.
The authors' HTML pruning method reduces a 60K-token document to between 2K and 32K tokens.
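The cleaning step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation. It only shows the general idea of dropping CSS, JavaScript, comments, and tag attributes while keeping the tag structure and visible text. The names HtmlCleaner and clean_html are hypothetical, and the code uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class HtmlCleaner(HTMLParser):
    """Illustrative cleaner: keeps bare tags and text, drops everything else.

    Assumption: "cleaning" here means removing <script>/<style> blocks,
    comments, and all tag attributes. The real pipeline may do more.
    """

    SKIP = {"script", "style"}  # elements whose content is discarded

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # cleaned fragments, in document order
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace around text nodes is trimmed, which is lossy for
        # inline markup but keeps the sketch short.
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.out.append(text)

    # handle_comment is not overridden, so comments are silently dropped.


def clean_html(html: str) -> str:
    """Return a reduced HTML string containing only tags and visible text."""
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


raw = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    '<body><!-- note --><p class="a" style="x">Hello</p></body></html>'
)
cleaned = clean_html(raw)
print(cleaned)
print(f"kept {len(cleaned) / len(raw):.0%} of the original characters")
```

The ratio printed for this toy input is measured in characters, not tokens, and says nothing about the 6% figure reported for real Web pages.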