Core Concepts
Advancing Chinese text analysis through paragraph-level topic representation and corpus construction.
Abstract
This summary covers the role of paragraph-level topic structure in Chinese texts and the challenges of constructing a corpus for topic segmentation and outline generation. It describes the hierarchical representation model, the two-stage annotation method, and the use of the corpus in downstream tasks such as discourse parsing.
Abstract
Topic segmentation and outline generation aim to divide documents into coherent sections and generate subheadings.
Paragraph-level topic structure provides a higher-level context for document understanding.
Lack of large-scale Chinese paragraph-level corpora hinders research and applications.
Introduction
Well-written documents consist of semantically coherent segments revolving around specific topics.
Paragraph-level topic structure aids in understanding the overall context of a document.
Topic segmentation has been studied far more for English than for Chinese.
Chinese Paragraph-level Topic Structure Representation
Proposed a hierarchical representation model with three layers for comprehensive topic structure.
Subheadings and titles are used to represent richer information at the paragraph level.
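The hierarchy described above can be sketched as a small data structure. This is only an illustration: it assumes the three layers are title, subheadings, and paragraphs, and the class and method names (`Document`, `TopicSection`, `boundary_labels`, `outline`) are hypothetical, not taken from the CPTS release.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicSection:
    """One topic section: a subheading plus the paragraphs it covers."""
    subheading: str
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Document:
    """A document: a title over an ordered list of topic sections."""
    title: str
    sections: List[TopicSection] = field(default_factory=list)

    def boundary_labels(self) -> List[int]:
        """One label per paragraph: 1 if it closes a topic section, else 0."""
        labels: List[int] = []
        for section in self.sections:
            count = len(section.paragraphs)
            if count:
                labels.extend([0] * (count - 1))
                labels.append(1)
        return labels

    def outline(self) -> List[str]:
        """The outline is the ordered list of subheadings."""
        return [section.subheading for section in self.sections]


# Hypothetical toy document with two topic sections.
doc = Document(
    title="Example title",
    sections=[
        TopicSection("Background", ["p1", "p2"]),
        TopicSection("Outcome", ["p3", "p4", "p5"]),
    ],
)
labels = doc.boundary_labels()
headings = doc.outline()
```

Under this view, topic segmentation amounts to predicting `boundary_labels()` from the paragraphs alone, and outline generation amounts to producing `outline()` for the resulting sections.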
Chinese Paragraph-level Topic Structure Corpus Construction
Data source: News documents from Xinhua News Agency in the Gigaword corpus.
Two-stage annotation method: Automatic extraction followed by manual verification to ensure a high-quality corpus.
Statistical details of CPTS: Average words per document, paragraphs per document, words per subheading, etc.
Experiments on Corpus Evaluation
Topic Segmentation
Baselines: Segbot, PN-XLNet, TM-BERT, BERT+Bi-LSTM, Hier. BERT, ChatGPT.
Evaluation metrics: Pk, WindowDiff, Segmentation Similarity, Boundary Similarity, and macro-F1.
Results: BERT+Bi-LSTM and Hier. BERT outperform other models in topic segmentation.
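Two of the metrics listed above, Pk and WindowDiff, can be sketched in a few lines. This is a minimal sketch following the standard definitions, not the evaluation script used for CPTS. It assumes each segmentation is a list with one 0/1 label per paragraph, where 1 marks the last paragraph of a section. The function names are illustrative. Lower scores are better for both metrics.

```python
from typing import List, Optional


def segment_ids(boundaries: List[int]) -> List[int]:
    """Map 0/1 boundary labels to a segment index for every paragraph."""
    ids, current = [], 0
    for label in boundaries:
        ids.append(current)
        if label:
            current += 1
    return ids


def default_window(reference: List[int]) -> int:
    """Half the average reference segment length, at least 1."""
    num_segments = segment_ids(reference)[-1] + 1
    return max(1, round(len(reference) / (2 * num_segments)))


def _window_errors(reference, hypothesis, k, strict_count):
    ref, hyp = segment_ids(reference), segment_ids(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same paragraphs")
    if k is None:
        k = default_window(reference)
    total = len(ref) - k
    if total <= 0:
        raise ValueError("document is too short for this window size")
    errors = 0
    for i in range(total):
        # Difference of segment indices = number of boundaries in the window.
        ref_count = ref[i + k] - ref[i]
        hyp_count = hyp[i + k] - hyp[i]
        if strict_count:
            errors += ref_count != hyp_count              # WindowDiff
        else:
            errors += (ref_count == 0) != (hyp_count == 0)  # Pk
    return errors / total


def pk(reference: List[int], hypothesis: List[int],
       k: Optional[int] = None) -> float:
    """Pk: do both segmentations agree on 'same segment' at distance k?"""
    return _window_errors(reference, hypothesis, k, strict_count=False)


def window_diff(reference: List[int], hypothesis: List[int],
                k: Optional[int] = None) -> float:
    """WindowDiff: do both segmentations place the same number of boundaries?"""
    return _window_errors(reference, hypothesis, k, strict_count=True)


gold = [0, 0, 1, 0, 0, 1]   # two sections of three paragraphs each
pred = [0, 1, 0, 0, 0, 1]   # boundary predicted one paragraph too early
pk_score = pk(gold, pred, k=2)
wd_score = window_diff(gold, pred, k=2)
```

WindowDiff compares boundary counts rather than only their presence, so it also penalizes cases where the prediction puts several boundaries inside a window that holds one in the reference.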
Outline Generation
Baselines: BART, T5, ChatGPT.
Evaluation metrics: ROUGE, BLEU, BERTScore, and manual evaluation.
Results: T5 (24) performs best among baselines in outline generation.
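To make the overlap-based metrics concrete, the following is a simplified sketch of ROUGE-L, scoring a generated subheading against a reference by their longest common subsequence. It is not the official ROUGE toolkit and not necessarily the configuration used in the CPTS experiments. Character-level tokenization is an assumption made here because Chinese text has no whitespace word boundaries, and the function names are illustrative.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over characters (assumed tokenization, see lead-in)."""
    ref, cand = list(reference), list(candidate)
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy strings standing in for a reference and a generated subheading.
score_same = rouge_l_f1("abcd", "abcd")
score_near = rouge_l_f1("abcd", "abxd")
score_none = rouge_l_f1("abcd", "xyz")
```

Overlap metrics of this kind reward surface matches only, which is one reason the evaluation also includes BERTScore and manual judgments.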
Title Generation
Baselines: BART, T5, ChatGPT.
Results: T5 (24) outperforms other models in title generation.
Application in Discourse Parsing
Used CPTS to enhance paragraph-level discourse parsing performance.
Real topic structure in CPTS improves parser performance.
Discussion and Future Work
Applicability of the annotation method to other genres.
Potential challenges in extending the joint learning framework and exploring hierarchical topic structures.
Conclusion
Proposed a comprehensive approach for constructing a Chinese paragraph-level topic structure corpus.
Validated the corpus through experiments on topic segmentation, outline generation, and discourse parsing.
Bibliographical References
Includes references to relevant works in text segmentation, discourse parsing, and text summarization.
Stats
The CPTS corpus contains about 14,393 high-quality documents.
The average number of words per document is 1727.96.
The corpus construction method involves a two-stage annotation process.
Quotes
"Topic segmentation and outline generation strive to divide a document into coherent topic sections and generate corresponding subheadings."
"The lack of large-scale, high-quality Chinese paragraph-level topic structure corpora restrained relative research and applications."