Key concepts
Callico is an open-source, web-based platform designed to simplify the annotation process in document recognition projects, enabling efficient creation and refinement of high-quality training data for machine learning and deep learning algorithms.
Summary
Callico is a versatile open-source platform for collaborative document annotation. It offers the following key features:
- Dual display annotation: Callico enables simultaneous visualization and annotation of scanned images and text, which is particularly useful for training OCR and HTR models, analyzing document layout, and recognizing named entities.
- Collaborative annotation: The platform supports collaborative efforts by allowing team members or volunteers to join and contribute to open or closed annotation campaigns.
- Versatile annotation capabilities: Callico supports a wide range of tasks, including text classification, manual transcription, layout annotation, and information extraction, so a single tool can cover most annotation needs in a document recognition project.
- Open-source availability: Callico is released under the GNU AGPLv3 license, ensuring accessibility and the ability for the community to freely use, modify, and distribute the platform.
- High-quality, maintainable, and evolvable code: The project emphasizes software quality and relies on Continuous Integration/Continuous Deployment (CI/CD) practices to keep the codebase maintainable as it evolves.
- Easy on-premises deployment with Docker: Callico can be deployed on-premises using Docker, which simplifies installation and lets organizations start annotating documents quickly.
The platform's design principles focus on flexibility, user experience, and support for machine learning workflows. Callico offers a range of annotation modes, including image classification, document structuring, text transcription, named entity recognition, key-value information extraction, and element grouping. The platform also provides features for project management, campaign management, and task management, ensuring efficient and effective data annotation processes.
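The relationship between projects, campaigns, tasks, and annotation modes described above can be pictured with a small data model. This is only an illustrative sketch: the class names, fields, and the `progress` method are hypothetical and are not taken from Callico's actual code or database schema. Only the list of annotation modes comes from the summary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AnnotationMode(Enum):
    # The six modes named in the summary; the identifiers are illustrative.
    CLASSIFICATION = "image classification"
    STRUCTURING = "document structuring"
    TRANSCRIPTION = "text transcription"
    ENTITY = "named entity recognition"
    KEY_VALUE = "key-value information extraction"
    GROUPING = "element grouping"


@dataclass
class Task:
    # Hypothetical: one unit of work on one document element.
    element_id: str
    assignee: Optional[str] = None
    done: bool = False


@dataclass
class Campaign:
    # Hypothetical: a campaign applies a single annotation mode to many tasks.
    name: str
    mode: AnnotationMode
    tasks: List[Task] = field(default_factory=list)

    def progress(self) -> float:
        """Return the fraction of completed tasks (0.0 for an empty campaign)."""
        if not self.tasks:
            return 0.0
        return sum(task.done for task in self.tasks) / len(self.tasks)


@dataclass
class Project:
    # Hypothetical: a project groups several campaigns over the same documents.
    name: str
    campaigns: List[Campaign] = field(default_factory=list)


campaign = Campaign(
    name="transcription-phase-1",
    mode=AnnotationMode.TRANSCRIPTION,
    tasks=[Task("page-1", done=True), Task("page-2"), Task("page-3"), Task("page-4")],
)
project = Project(name="example-project", campaigns=[campaign])
print(f"{campaign.name}: {campaign.progress():.0%} complete")
```

Keeping the mode on the campaign rather than on each task mirrors the summary's description of campaigns as the level at which annotation work is organized, but the real platform may structure this differently.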
The use cases presented in the paper demonstrate Callico's versatility in supporting various document recognition projects, including the transcription of historical registers, the indexing of French World War II prisoner-of-war records, and the extraction of personal information from French census documents. These case studies highlight Callico's ability to streamline the data annotation process and contribute to the development of high-quality training data for machine learning and deep learning algorithms.
Statistics
"More data beats a cleverer algorithm": this principle highlights the importance of expanding the volume of annotated data through efficient annotation processes.
Callico's initial campaign for annotating 500 pages of French prisoner-of-war records was completed in about 60 hours by 30 annotators, averaging about 30 seconds per line.
The second phase of validating or correcting 38,851 lines of individual records was completed by 273 contributors, with the median validation time reduced to 13 seconds per line.
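The figures above can be cross-checked with some back-of-the-envelope arithmetic. The sketch below rests on two assumptions that the summary does not confirm: that the 60 hours is cumulative annotation time across all 30 annotators, and that the 13-second median can stand in for a typical per-line time (a median is not a mean, so the phase-two total is only a rough estimate). The function names are illustrative.

```python
SECONDS_PER_HOUR = 3600


def lines_from_effort(total_hours: float, seconds_per_line: float) -> float:
    """Estimate how many lines fit into a given amount of annotation time."""
    return total_hours * SECONDS_PER_HOUR / seconds_per_line


def hours_for_lines(n_lines: int, seconds_per_line: float) -> float:
    """Estimate the annotation time needed for a given number of lines."""
    return n_lines * seconds_per_line / SECONDS_PER_HOUR


# Phase 1: about 60 hours at about 30 s per line (assumed cumulative time).
phase1_lines = lines_from_effort(60, 30)      # 7200.0 lines
lines_per_page = phase1_lines / 500           # 14.4 lines per page

# Phase 2: 38,851 lines at the 13 s median (assumed typical per-line time).
phase2_hours = hours_for_lines(38_851, 13)    # roughly 140.3 hours

# Validation versus initial transcription, per line.
speedup = 30 / 13                             # roughly 2.3x faster

print(f"Phase 1: ~{phase1_lines:.0f} lines, ~{lines_per_page:.1f} per page")
print(f"Phase 2: ~{phase2_hours:.1f} hours of validation")
print(f"Per-line speed-up: ~{speedup:.1f}x")
```

Under these assumptions, validating or correcting an existing line takes less than half the time of transcribing one from scratch, which is consistent with the summary's emphasis on efficient annotation workflows.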
Quotes
"The move towards data-centric AI in machine learning and deep learning underscores the importance of high-quality data, and the need for specialised tools that increase the efficiency and effectiveness of generating such data."
"The principle that 'more data beats a cleverer algorithm' carries significant weight in the field of machine and deep learning."