The authors apply natural language processing techniques, namely BERT and GPT-2 embeddings combined with the dimensionality reduction methods UMAP, TriMAP, and PaCMAP, to the State of the Union (SOTU) address dataset from Kaggle. Their analysis reveals a surprising finding: the language and style of SOTU addresses break sharply around 1927-1932, suggesting a major discontinuity in American history.
The authors first observe that addresses delivered by the same president are closely clustered, and those written in chronological proximity are also similar. However, the most striking result is the clear separation between addresses written before 1927 and those written after 1932, as shown in the UMAP and TriMAP visualizations.
The authors hypothesize that two factors may explain this shift: (1) the increased use of speechwriters by presidents, starting with Franklin Roosevelt, and (2) the transformation of the United States from a remote, provincial country into a global superpower after World War II, which changed the focus and emphasis of presidential addresses.
The authors also experiment with authorship attribution and year prediction using fine-tuned DistilBERT models. They achieve high accuracy (93-95%) in identifying the president who delivered a particular address, and reasonably good performance (a root-mean-square error, RMSE, of around 4.5 years) in predicting the year of an address, despite the relatively small amount of training data available for each president.
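To make the two reported metrics concrete, here is a minimal sketch of how classification accuracy and RMSE are typically computed. It is not the authors' evaluation code: the function names and the toy labels and years are hypothetical and chosen only for illustration.

```python
import math


def accuracy(predicted_labels, true_labels):
    """Fraction of addresses whose predicted president matches the true one."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("inputs must have the same length")
    if not true_labels:
        raise ValueError("inputs must be non-empty")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


def rmse(predicted_years, true_years):
    """Root-mean-square error, in years, between predicted and true years."""
    if len(predicted_years) != len(true_years):
        raise ValueError("inputs must have the same length")
    if not true_years:
        raise ValueError("inputs must be non-empty")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted_years, true_years)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy data, NOT results from the paper.
true_presidents = ["Lincoln", "Wilson", "Reagan", "Obama"]
predicted_presidents = ["Lincoln", "Wilson", "Carter", "Obama"]

true_years = [1901, 1935, 1962, 1988]
predicted_years = [1904, 1931, 1962, 1993]

toy_accuracy = accuracy(predicted_presidents, true_presidents)
toy_rmse = rmse(predicted_years, true_years)
print(f"accuracy = {toy_accuracy:.2f}, RMSE = {toy_rmse:.2f} years")
```

Because RMSE squares each error before averaging, a few badly misdated addresses raise it more than many small misses do, so an RMSE of about 4.5 years does not mean that every prediction falls within 4.5 years of the true date.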
The authors conclude by acknowledging that they do not have a definitive explanation for the observed discontinuity, but they believe that there must be an underlying reason or reasons for this significant shift in the language and style of SOTU addresses over time.
Key insights distilled from the source by Alexander Ko... at arxiv.org, 05-07-2024
https://arxiv.org/pdf/2312.01185.pdf