Core Concepts
An ensemble method is proposed for detecting social media bots across multiple platforms, including Twitter, Reddit, and Instagram, by training specialized classifiers for different user data fields and aggregating their outputs.
Abstract
The authors propose an ensemble method, called BotBuster For Everyone, for detecting social media bots across multiple platforms, including Twitter, Reddit, and Instagram. The main features of the approach are:
Handling incomplete data: The ensemble method uses specialized classifiers for different user data fields (username, screen name, description, user metadata, post metadata), allowing it to make predictions even when some data fields are missing.
Multi-platform generalizability: The ensemble model is trained on aggregated datasets from the three platforms, enabling it to detect bots across Twitter, Reddit, and Instagram.
Interpretable classifiers: The use of tree-based classifiers (decision tree, random forest, gradient boosting) provides interpretability, allowing analysis of important bot features such as username entropy and the presence of identity terms in descriptions.
Eliminating threshold selection: The ensemble outputs both a bot probability and a human probability, so no classification threshold has to be chosen.
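The handling of missing fields described above can be illustrated with a minimal sketch. It is not the authors' implementation: the averaging rule, the `ensemble_predict` function, and the toy scorers are all assumptions made for illustration, and the paper may aggregate the specialized classifiers differently.

```python
from typing import Callable, Dict, Optional

# Hypothetical per-field scorers: each maps a raw field value to P(bot) in [0, 1].
FieldScorer = Callable[[object], float]


def ensemble_predict(
    account: Dict[str, Optional[object]],
    scorers: Dict[str, FieldScorer],
) -> Optional[Dict[str, float]]:
    """Average P(bot) over the fields that are present; skip missing ones.

    Assumption: a plain mean stands in for whatever aggregation the paper uses.
    """
    probs = [
        scorer(account[field])
        for field, scorer in scorers.items()
        if account.get(field) is not None
    ]
    if not probs:
        return None  # no usable field, so no prediction can be made
    p_bot = sum(probs) / len(probs)
    # Report both class probabilities instead of applying a fixed threshold.
    return {"bot": p_bot, "human": 1.0 - p_bot}


# Toy stand-ins for trained classifiers (illustrative rules, not learned models).
toy_scorers: Dict[str, FieldScorer] = {
    "username": lambda name: 0.8 if any(ch.isdigit() for ch in name) else 0.3,
    "description": lambda text: 0.2 if "writer" in text.lower() else 0.6,
}

# The description is missing, so only the username scorer contributes.
partial = ensemble_predict({"username": "user12345", "description": None}, toy_scorers)
full = ensemble_predict({"username": "alice", "description": "Writer and mom"}, toy_scorers)
```

Because each field has its own scorer, an account from a platform that exposes fewer fields still receives a prediction from whatever is available.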
The authors apply the bot detector to discourse around the 2020 US presidential election, finding a higher proportion of bots on Reddit than on Twitter, as well as differences between the narratives pushed by bots and those pushed by human users on the two platforms.
Stats
The entropy of usernames is an important factor in bot determination.
The number of retweets/shares a post receives is the most indicative feature for bot classification, followed by the number of likes and replies.
Words representing a person's identity (e.g. writer, mom, host) are extremely indicative of human accounts.
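Two of the features named above, username entropy and identity terms in descriptions, can be computed in a few lines. This is a sketch under assumptions: character-level Shannon entropy is one common definition and may differ from the paper's exact feature, and `IDENTITY_TERMS` contains only the example words quoted in this summary, not the authors' full vocabulary.

```python
import math
import re
from collections import Counter

# Example identity words taken from the summary; the real term list is unknown.
IDENTITY_TERMS = {"writer", "mom", "host", "author", "reporter", "editor"}


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    n = len(text)
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def has_identity_term(description: str) -> bool:
    """True if any whole word in the description is a known identity term."""
    words = re.findall(r"[a-z]+", description.lower())
    return any(word in IDENTITY_TERMS for word in words)


# A repetitive name has low entropy; a name with all-distinct characters has more.
low = shannon_entropy("aaaa")
high = shannon_entropy("abcd")
flag = has_identity_term("Mom, writer, coffee lover")
```

A username made of seemingly random characters scores higher than a short, repetitive one, which is the kind of signal a tree-based classifier can split on.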
Quotes
"The entropy of names and number of interactions (retweets/shares) are important factors in bot determination."
"Words representing a person's identity (i.e. writer, mom, host, author, reporter, editor etc.) are extremely indicative words, suggesting connections between the expression of identities and bot likeliness of an account."