
Comprehensive Dataset of Bluesky Social Platform: Insights into User Behavior, Content Dynamics, and Algorithmic Curation


Core Concepts
This work introduces a comprehensive dataset of the Bluesky Social platform, covering over 4 million user accounts, 235 million posts, and various interaction data. The dataset enables unprecedented analysis of online behavior, content diffusion, and the effects of algorithmic curation on user engagement.
Abstract
The authors present a large, high-coverage dataset of Bluesky Social, a new decentralized online social network. The dataset includes:

- User information: 4.1 million user accounts, covering around 81% of all registered users
- Post data: 235 million posts made by these users
- Interaction data: follower/followee relationships, replies, reposts, and quotes
- Feed data: posts from 11 popular feed generators, along with bookmarking and liking information

The authors provide a detailed technical validation of the dataset, analyzing the social structure, posting activity, content, and sentiment on the platform. Key findings include:

- The follower-followee network exhibits a power-law degree distribution, with a few highly influential accounts.
- Users show moderate engagement, with 58% posting at least once and an average of 99 posts per active user.
- Sentiment analysis of the annotated English posts finds 32% positive, 27% negative, and 41% neutral, so positive posts modestly outnumber negative ones.
- Topic modeling on negative posts during a period of increased activity uncovers discussions around racism and content moderation issues on the platform.

The dataset also includes information on user-curated feed generators, allowing for the study of the effects of algorithmic curation on user engagement and exposure to content. This dataset gives researchers a resource for studying online social dynamics, content diffusion, and the interplay between human and algorithmic curation on a new decentralized social media platform.
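The degree-distribution finding can be illustrated with a minimal sketch that counts followers per account from an edge list. The toy `edges` list and the helper names `in_degrees` and `degree_histogram` are assumptions made for illustration; they are not the dataset's schema or the authors' code.

```python
from collections import Counter

# Hypothetical toy edge list of (follower, followee) pairs.
edges = [
    ("a", "hub"), ("b", "hub"), ("c", "hub"), ("d", "hub"),
    ("a", "b"), ("c", "b"),
    ("hub", "a"),
]

def in_degrees(edge_list):
    """Count followers per account (accounts with no followers are absent)."""
    return Counter(followee for _, followee in edge_list)

def degree_histogram(degrees):
    """Map each follower count to the number of accounts that have it."""
    return Counter(degrees.values())

followers = in_degrees(edges)
hist = degree_histogram(followers)

for degree in sorted(hist):
    print(f"{hist[degree]} account(s) with {degree} follower(s)")
```

On real data, a power-law distribution shows up as a roughly straight line when this histogram is plotted on log-log axes, with a handful of accounts holding very large follower counts.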
Stats
"Bluesky reached more than three million users in November."
"Bluesky reported an unprecedented increase in new user activity, totalling 5 million users in February 2024."
"Our sample covers ∼81% of Bluesky accounts."
"The dataset contains 235,567,116 posts."
"63M (27%) of the posts are reposts, and 12M (5%) are quotes."
"Out of the annotated English posts, 39M (32%) are positive, 32M (27%) are negative, and 50M (41%) are neutral."
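As a quick sanity check, the percentages quoted above can be recomputed from the counts. This is a minimal sketch: the million-scale counts are the rounded figures from the quotes, and the `share` helper is a name introduced here for illustration.

```python
# Counts as quoted in the Stats section; all but the post total are rounded to millions.
TOTAL_POSTS = 235_567_116
REPOSTS = 63_000_000
QUOTES = 12_000_000
SENTIMENT = {"positive": 39_000_000, "negative": 32_000_000, "neutral": 50_000_000}

def share(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

repost_share = share(REPOSTS, TOTAL_POSTS)
quote_share = share(QUOTES, TOTAL_POSTS)

# Sentiment shares are relative to the annotated English posts, not to all posts.
annotated = sum(SENTIMENT.values())
sentiment_share = {label: share(count, annotated) for label, count in SENTIMENT.items()}

print(repost_share, quote_share, sentiment_share)
```

Because the sentiment counts are rounded to the nearest million, the recomputed shares agree with the quoted ones only to within about one percentage point.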
Quotes
"Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern."
"Contrary to the general trend, however, the year-old decentralized OSP Bluesky Social (hereafter, Bluesky) has recently opened its APIs to developers, offering a potential solution to the widespread data shortage."
"Bluesky allows users to choose the algorithm(s) that power their home feed, allowing them to view, e.g., only posts by whom they follow, posts containing specific words or entities, and more complex filters."
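The last quote describes feeds built from simple rules such as "only accounts I follow" or "posts containing specific words". A minimal sketch of how such rules might compose is shown below. The `Post` class and every function name are hypothetical, chosen for illustration only; this is not Bluesky's feed generator API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

@dataclass(frozen=True)
class Post:
    author: str
    text: str

# A feed rule is any function that decides whether a post is kept.
FeedFilter = Callable[[Post], bool]

def following_only(followed: Set[str]) -> FeedFilter:
    """Keep posts written by accounts in the followed set."""
    return lambda post: post.author in followed

def contains_any(keywords: Iterable[str]) -> FeedFilter:
    """Keep posts whose text contains at least one keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return lambda post: any(keyword in post.text.lower() for keyword in lowered)

def all_of(*filters: FeedFilter) -> FeedFilter:
    """Combine rules so that a post must satisfy every one of them."""
    return lambda post: all(rule(post) for rule in filters)

def build_feed(posts: Iterable[Post], keep: FeedFilter) -> List[Post]:
    """Return the posts that pass the rule, in their original order."""
    return [post for post in posts if keep(post)]

posts = [
    Post("alice", "New dataset of Bluesky posts released"),
    Post("bob", "Lunch was great today"),
    Post("carol", "Bluesky feed generators are fun"),
]
rule = all_of(following_only({"alice", "bob"}), contains_any(["bluesky"]))
feed = build_feed(posts, rule)
print([post.author for post in feed])
```

Treating each rule as a plain predicate keeps "more complex filters" a matter of composition: new rules can be added without changing `build_feed`.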

Deeper Inquiries

How might the federated structure of Bluesky impact the spread of information and the formation of communities on the platform?

The federated structure of Bluesky can have significant implications for how information spreads and communities form on the platform. By allowing users to register on different servers, Bluesky creates a decentralized network where each server can have its own rules, moderation policies, and community norms. This decentralization can lead to the formation of diverse and specialized communities, each with its unique culture and interests. Users can choose to join servers that align with their preferences, values, and beliefs, fostering the creation of niche communities where like-minded individuals can engage in discussions, share content, and build relationships.

Moreover, the federated model promotes user autonomy and control over their online experience. Users can select servers that prioritize certain types of content, moderation strategies, or community guidelines, allowing them to tailor their social media experience to their liking. This flexibility can lead to a more personalized and curated online environment, where users feel empowered to shape their digital interactions according to their preferences.

However, the federated structure of Bluesky also poses challenges in terms of content moderation, coordination across servers, and ensuring a cohesive user experience. With different servers operating independently, issues such as inconsistent moderation practices, varying levels of content quality, and potential echo chambers within specialized communities may arise. It becomes crucial for Bluesky to establish mechanisms for cross-server communication, standardize moderation policies, and promote diversity and inclusivity across the platform to mitigate these challenges and foster a healthy and vibrant online ecosystem.

What are the potential biases and limitations introduced by user-curated feed generators, and how can they be mitigated to ensure diverse and balanced content exposure?

User-curated feed generators on Bluesky introduce several potential biases and limitations that can impact the diversity and balance of content exposure on the platform. One significant bias is the risk of creating echo chambers, where users are primarily exposed to content that aligns with their existing beliefs and preferences. This can lead to reinforcement of existing viewpoints, limited exposure to diverse perspectives, and the amplification of misinformation within closed information bubbles.

Another limitation is the potential for algorithmic bias, where feed generators may prioritize certain types of content or users over others based on factors like popularity, engagement metrics, or user preferences. This can result in the amplification of mainstream voices, marginalization of minority perspectives, and homogenization of content, leading to a lack of diversity and inclusivity in the feed.

To mitigate these biases and limitations and ensure diverse and balanced content exposure, Bluesky can implement several strategies. Firstly, the platform can introduce transparency measures that disclose how feed algorithms work, what factors influence content selection, and how users can customize their feed preferences. This transparency can empower users to make informed choices about their content consumption and reduce the risk of algorithmic manipulation.

Additionally, Bluesky can promote content diversity by diversifying the sources of content in feed generators, incorporating mechanisms for content recommendation from a wide range of users, and encouraging cross-pollination of content across different communities. By fostering a culture of inclusivity, openness, and information sharing, Bluesky can create a more vibrant and diverse online environment that values varied perspectives and promotes healthy discourse.

Given the unique features of Bluesky, such as the ability to create custom feed generators, how might this platform be leveraged to study the interplay between human agency and algorithmic curation in shaping online discourse and information ecosystems?

Bluesky's unique features, particularly the ability to create custom feed generators, offer a valuable opportunity to study the interplay between human agency and algorithmic curation in shaping online discourse and information ecosystems. Researchers can leverage Bluesky's data to investigate how users' choices in selecting feed algorithms influence the content they are exposed to, the diversity of viewpoints they encounter, and the formation of online communities based on shared interests and preferences.

By analyzing user interactions with feed generators, researchers can explore how human behavior, such as content consumption patterns, engagement with specific topics, and customization of feed preferences, interacts with algorithmic recommendations to shape the information landscape on Bluesky. This analysis can provide insights into the role of user agency in content curation, the impact of personalized algorithms on information diversity, and the dynamics of content dissemination within specialized communities.

Moreover, Bluesky's platform architecture allows for the study of content virality, diffusion patterns, and information cascades within and across different feed generators. Researchers can investigate how content spreads through user-curated feeds, the role of influential users in amplifying information, and the mechanisms that drive the circulation of content within the platform. By examining the interplay between human actions and algorithmic processes, researchers can gain a deeper understanding of how online discourse evolves, information flows, and communities form in decentralized social media environments like Bluesky.