Core Concepts
The authors developed an ARIMA-based prediction model for the number of reported Wordle results, a regression model based on XGBoost for predicting the distribution of reported results, and a classification model using K-Means clustering and decision trees to categorize solution words by difficulty.
Abstract
The authors first preprocessed the Wordle data by removing or replacing abnormal values. They then established an ARIMA-based prediction model to forecast the number of reported results on March 1, 2023, obtaining a prediction interval of 20,337 to 21,673.
Next, the authors selected three word attributes - frequency of word usage (FREQ), information entropy of the word (WIE), and number of repeated letters (NRE) - and performed correlation analysis. They found that FREQ was positively correlated with the number of tries, while WIE and NRE were negatively correlated with it.
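To make two of these attributes concrete, here is a minimal Python sketch. The summary does not give the authors' exact definitions, so both functions are assumptions: repeated_letter_count treats NRE as the total number of occurrences of letters that appear more than once, and letter_entropy is an ordinary Shannon entropy over the letters inside the word. The second is only a stand-in for WIE and is not expected to reproduce the authors' reported values.

```python
from collections import Counter
from math import log2


def repeated_letter_count(word: str) -> int:
    """Assumed NRE: total occurrences of letters that appear more than once."""
    counts = Counter(word.upper())
    return sum(n for n in counts.values() if n > 1)


def letter_entropy(word: str) -> float:
    """Stand-in for WIE: Shannon entropy (bits) of the word's own letter distribution."""
    counts = Counter(word.upper())
    total = len(word)
    return -sum((n / total) * log2(n / total) for n in counts.values())


print(repeated_letter_count("EERIE"))  # 3 (E occurs three times)
print(repeated_letter_count("CRANE"))  # 0 (no letter repeats)
print(letter_entropy("EERIE") < letter_entropy("CRANE"))  # True
```

Under this stand-in definition, a word with repeated letters has lower entropy than a word with five distinct letters, which is the intuition behind pairing an entropy attribute with a repeated-letter count.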
The authors then built a regression model using XGBoost to predict the distribution of reported results for each number of tries. They achieved an overall accuracy of 82.1% in predicting the percentage distribution, and were able to accurately predict the distribution for the word "EERIE".
Finally, the authors used K-Means clustering to classify the solution words into three difficulty categories - easy, medium, and difficult. They then built a decision tree model to explore the relationship between the three word attributes and the difficulty classification, achieving an accuracy of 77.6%.
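The clustering step can be sketched as follows. The summary does not say which features the authors clustered on, so this example assumes a single hypothetical feature, the mean number of tries per word, and runs a small one-dimensional K-Means (Lloyd's algorithm) with k = 3. The function name kmeans_1d and all data values are illustrative and not taken from the paper.

```python
from statistics import mean


def kmeans_1d(values, k=3, iterations=50):
    """Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread starting centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Hypothetical mean number of tries for nine words (not from the dataset).
mean_tries = [3.6, 3.7, 3.8, 4.1, 4.2, 4.3, 4.9, 5.0, 5.1]
centroids, labels = kmeans_1d(mean_tries, k=3)

# Name the clusters by centroid rank: lowest mean tries = easy.
order = sorted(range(3), key=lambda c: centroids[c])
names = {order[0]: "easy", order[1]: "medium", order[2]: "difficult"}
difficulty = [names[lab] for lab in labels]
print(difficulty)
```

The resulting labels could then serve as the target of a decision tree trained on FREQ, WIE, and NRE, which is the role the second model plays in the authors' pipeline.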
The authors also found that for 83.9% of the words in the dataset, more than 90% of players needed 3 or more guesses to solve the word, indicating the overall difficulty of the Wordle game.
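A statistic of this kind can be computed directly from the per-word percentage distributions. The sketch below assumes each row holds the percentages for 1, 2, 3, 4, 5, and 6 tries plus failed attempts (X), and that failed attempts count as "3 or more"; the summary does not state either point. The rows are hypothetical and do not come from the dataset.

```python
def share_needing_three_plus(rows, threshold=0.90):
    """Fraction of words where more than `threshold` of players needed 3+ tries or failed."""
    hits = 0
    for row in rows:
        total = sum(row)     # normalise, since rounded percentages may not sum to 100
        late = sum(row[2:])  # 3, 4, 5, 6 tries and failed (X)
        if late / total > threshold:
            hits += 1
    return hits / len(rows)


# Hypothetical rows: percentages for 1, 2, 3, 4, 5, 6 tries and X.
rows = [
    [0, 4, 21, 38, 26, 9, 2],   # 96% needed 3+ tries: counted
    [1, 12, 30, 32, 18, 6, 1],  # 87% needed 3+ tries: not counted
    [0, 2, 11, 29, 33, 20, 5],  # 98% needed 3+ tries: counted
    [1, 8, 26, 35, 20, 8, 2],   # 91% needed 3+ tries: counted
]

print(share_needing_three_plus(rows))  # 0.75
```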
Stats
The frequency of word usage (FREQ) for the word "EERIE" is 0.000002437871.
The information entropy (WIE) of the word "EERIE" is 1.4797732853992995.
The number of repeated letters (NRE) in the word "EERIE" is 3.