
DOSA: Dataset of Social Artifacts from Indian Subcultures


Core Concepts
Generative models need to account for cultural context, as demonstrated by the creation of the DOSA dataset and by benchmarking LLMs on it.
Abstract
Generative models such as LLMs require cultural context to produce accurate outputs. The authors create the DOSA dataset using participatory research methods, and benchmarking LLMs on DOSA artifacts reveals varying degrees of cultural familiarity, underscoring the importance of community-centered research for technology evaluation.

Introduction
Generative models are being integrated into applications with wide-ranging social impact, raising concerns about the cultural nuances encoded in LLM outputs. Web-based training data under-represents many cultures.

Data Extraction
"Since the training data for LLMs is web-based and the Web is limited in its representation of information, it does not capture knowledge present within communities that are not on the Web."
"We use a gamified framework that relies on collective sensemaking to collect the names and descriptions of these artifacts such that the descriptions semantically align with the shared sensibilities of the individuals from those cultures."

Related Work
Past work focuses on understanding the values and ethics encoded in LLMs; fairness studies examine biases toward specific communities.

Methodology for Dataset Creation
The authors combine survey and game-with-a-purpose (GWAP) methods to collect data on social artifacts, administering the survey questionnaire across 19 Indian states.

Benchmarking LLMs' Cultural Familiarity
The experimental setup covers popular open-source and closed-source models, with accuracy as the primary metric for evaluating cultural familiarity.

Results
LLMs vary in their familiarity with regional subcultures in India; GPT-4 and PaLM 2 perform better than open-source models such as Llama 2 and Falcon.
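The paper's exact benchmarking harness is not reproduced in this summary, but a minimal sketch of what accuracy-based cultural-familiarity evaluation could look like is shown below. The artifact records, the query_llm placeholder, and the exact-match scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of accuracy-based cultural-familiarity evaluation.
# query_llm() and the artifact records below are placeholders, not the
# paper's actual benchmark harness or data.
from collections import defaultdict

def query_llm(prompt: str) -> str:
    """Placeholder for a call to any LLM (e.g., GPT-4, PaLM 2, Llama 2)."""
    raise NotImplementedError("plug in your model API here")

# DOSA-style records: an artifact name, the state (subculture) it was
# collected from, and a community-sourced description.
artifacts = [
    {"name": "puran poli", "state": "Maharashtra",
     "description": "a sweet flatbread stuffed with jaggery and lentils"},
    # ... more artifacts across the 19 surveyed states
]

def cultural_familiarity(artifacts):
    correct, total = defaultdict(int), defaultdict(int)
    for a in artifacts:
        prompt = (f"Name the social artifact from {a['state']}, India, "
                  f"described as: {a['description']}")
        answer = query_llm(prompt)
        total[a["state"]] += 1
        # Simple substring match; the paper may use a looser scoring rule.
        if a["name"].lower() in answer.lower():
            correct[a["state"]] += 1
    # Per-state accuracy exposes variance across regional subcultures.
    return {s: correct[s] / total[s] for s in total}
```

Reporting accuracy per state, rather than as a single aggregate, is what would surface the variance in regional familiarity the results describe.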
Quotes
"Culture is a complex societal-level concept, and it can be defined by multiple factors: location, sexuality, race, nationality, language, religious beliefs, ethnicity." "Our work offers an example of how technology evaluation can benefit from engaging community members using participatory research."

Key Insights Distilled From

by Agrima Seth et al., arxiv.org, 03-25-2024

https://arxiv.org/pdf/2403.14651.pdf

Deeper Inquiries

How can generative models like LLMs be improved to better incorporate diverse cultural contexts?

Generative models like LLMs can be enhanced to better integrate diverse cultural contexts through several strategies:

Diverse Training Data: Including a more extensive and varied dataset that represents different cultures, languages, and regions helps the model learn about a wider range of social artifacts and cultural nuances.

Fine-tuning for Cultural Sensitivity: Fine-tuning the model on culture-specific datasets, or supplying additional prompts related to various cultures, can improve its understanding of diverse contexts (a sketch follows this list).

Community Engagement: Involving community members in dataset creation through participatory research methods, as demonstrated by DOSA, allows for a more accurate representation of social artifacts from different cultures.

Bias Mitigation Techniques: Applying bias mitigation strategies during training and inference can reduce biases that affect how the model generates content about different cultures.
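As a concrete illustration of the fine-tuning strategy above, here is a minimal supervised fine-tuning sketch using Hugging Face transformers. The base model, the dosa_artifacts.jsonl file, the prompt format, and the hyperparameters are all illustrative assumptions, not a recipe from the paper.

```python
# Hypothetical sketch: fine-tuning an open-source causal LM on
# culture-specific (artifact name, description) pairs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM id works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical JSONL file with "name" and "description" fields per record.
data = load_dataset("json", data_files="dosa_artifacts.jsonl")["train"]

def to_text(example):
    # Simple prompt format; the right template is itself a design choice.
    return {"text": f"Artifact: {example['name']}\n"
                    f"Description: {example['description']}"}

data = data.map(to_text)
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm-cultural-ft",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice one would pair this with held-out evaluation on artifacts from under-represented states, to check that cultural familiarity actually improves rather than the model simply memorizing the training split.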

What are potential implications if generative models continue to lack awareness of social artifacts from different cultures?

The implications of generative models lacking awareness of social artifacts from various cultures include:

Cultural Erasure: Certain communities' identities and practices may be overlooked or misrepresented in generated content, leading to cultural erasure.

Propagation of Stereotypes: Models may inadvertently reinforce stereotypes or biases by failing to represent the diversity present within different cultures.

Communication Breakdowns: In cross-cultural interactions or applications such as chatbots, a lack of awareness of social artifacts can lead to miscommunication between individuals from different backgrounds.

How might participatory research methods influence future developments in AI technology beyond language models?

Participatory research methods can have significant impacts on AI development well beyond language models:

Ethical AI Development: Involving community members in dataset creation and model evaluation puts ethical considerations first, leading to more responsible AI technologies.

Improved Cultural Representation: Participatory research ensures that diverse voices are heard during data collection and model training, resulting in better representation of various cultures within AI systems.

User-Centered Design: Engaging users through participatory approaches helps create technologies that align with user needs and preferences, improving the experience across demographic groups.

Enhanced Trust: Community involvement fosters trust between developers and end users by demonstrating transparency, inclusivity, and respect for diverse perspectives when designing AI solutions.