Core Concepts
Spacerini simplifies the creation and deployment of interactive search engines, enabling qualitative analysis of large text datasets.
1. Introduction
The commoditization of data has transformed computer science.
Large language models rely on vast text corpora.
Models are often trained before their underlying data is well understood.
Data availability must be distinguished from data appropriateness.
2. Background and Related Work
Large-scale text datasets have proliferated in NLP in recent years.
Data understanding and governance are gaining traction as research areas.
Ongoing efforts aim to standardize reproducible metrics for characterizing datasets.
3. Spacerini
Modular framework integrating the Pyserini IR toolkit with the Hugging Face ecosystem.
Streamlines the full path from raw dataset to a deployed search interface.
Pipeline stages: loading data, pre-processing, indexing, template-based interfaces, deployment to Hugging Face Spaces.
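The pipeline stages above can be illustrated with a minimal, self-contained sketch. Note that this is not Spacerini's actual API: Spacerini delegates indexing and retrieval to Pyserini (Lucene), while the toy inverted index and function names below are illustrative assumptions that only mirror the load → pre-process → index → search flow conceptually.

```python
from collections import defaultdict

# Illustrative sketch only: a toy in-memory inverted index standing in
# for the Lucene indexes that Pyserini builds under the hood.

def preprocess(text):
    """Lowercase and whitespace-tokenize (stands in for a real analyzer)."""
    return text.lower().split()

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in preprocess(query)]
    return sorted(set.intersection(*postings)) if postings else []

# Toy "dataset", loaded in place of e.g. a Hugging Face dataset split.
corpus = {
    "d1": "Large language models rely on vast text datasets",
    "d2": "Search engines enable qualitative analysis of text",
    "d3": "Auditing large datasets before training",
}
index = build_index(corpus)
print(search(index, "large datasets"))  # -> ['d1', 'd3']
```

In the real toolkit, the corpus would come from a Hugging Face dataset, the index would be a Lucene index built via Pyserini, and the resulting search interface would be deployed as a Hugging Face Space.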
4. Use Cases and Demonstrations
Benefits a range of audiences: NLP researchers, IR researchers, linguists, digital humanists, IR students, shared-task organizers, and tech journalists.
Examples include a searchable index of the XSUM summarization dataset.
5. Limitations and Future Plans
The disk-space limit on Hugging Face Spaces constrains index size.
Planned improvements include better documentation and expanded tokenization support.
6. Conclusion
Spacerini enables rapid deployment of search indexes, supporting qualitative analysis of text datasets.
Stats
"We demonstrate a portfolio of 13 search engines created with Spacerini for different use cases."
"The disk space limit imposed by Hugging Face Spaces is currently set to 50 GB for the free tier."
"Planned improvements include automating the creation of dataset cards when pushing an index to the Hugging Face Hub."
Quotes
"Being unable to easily audit large datasets incentivizes researchers to release models trained on data they do not truly understand."
"Understanding the training data is a critical step in releasing and auditing large language models."
"Spacerini leverages features from both the Pyserini toolkit and the Hugging Face ecosystem."