Harris, L. (2024). A Simple and Effective Temporal Grounding Pipeline for Basketball Broadcast Footage. arXiv preprint arXiv:2411.00862v1.
This paper aims to develop a reliable and efficient pipeline for temporally aligning basketball broadcast footage with corresponding play-by-play annotations. This alignment is crucial for creating large, multi-modal datasets that can be used to train and evaluate video models for sports action recognition.
The proposed pipeline employs a two-stage approach: text detection followed by text recognition. First, a fine-tuned YOLOv8l object detection model, trained on a custom dataset of basketball broadcast frames, locates the semantic text regions (quarter and time remaining) within each frame. The detected regions are then cropped and passed to the PaddleOCR library for text recognition. To ensure temporal consistency and handle potential occlusions, a denoising algorithm and interpolation techniques are applied to the extracted timestamps. The pipeline is further optimized for large-scale deployment through parallelization across multiple CPU threads.
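The timestamp clean-up step can be illustrated with a minimal sketch. This summary does not describe the paper's actual denoising algorithm, so the median-based outlier filter below is a stand-in chosen for illustration, not the author's method. The function names (parse_clock, denoise, interpolate), the window size, and the 3-second tolerance are all assumptions. Input is one OCR string per frame for the "time remaining" region.

```python
from statistics import median
from typing import List, Optional


def parse_clock(text: str) -> Optional[float]:
    """Convert an OCR string such as '11:42' or '9.8' to seconds remaining."""
    text = text.strip()
    try:
        if ":" in text:
            minutes, seconds = text.split(":")
            return int(minutes) * 60 + float(seconds)
        return float(text)
    except ValueError:
        return None  # unreadable or empty OCR output


def denoise(readings: List[Optional[float]], window: int = 5,
            tol: float = 3.0) -> List[Optional[float]]:
    """Blank out readings more than `tol` seconds from the local median.

    Assumption: a stand-in filter, not the algorithm used in the paper.
    """
    half = window // 2
    cleaned: List[Optional[float]] = []
    for i, value in enumerate(readings):
        nearby = [r for r in readings[max(0, i - half): i + half + 1]
                  if r is not None]
        if value is None or abs(value - median(nearby)) > tol:
            cleaned.append(None)
        else:
            cleaned.append(value)
    return cleaned


def interpolate(readings: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly fill gaps that have a valid reading on both sides."""
    filled = list(readings)
    known = [i for i, r in enumerate(filled) if r is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# One misread frame ('1:41') and one occluded frame ('') in a short run.
raw = ["11:42", "11:41", "1:41", "", "11:38", "11:37"]
clock = interpolate(denoise([parse_clock(t) for t in raw]))
```

Gaps at the very start or end of a clip are left as None here, since linear interpolation needs a valid reading on both sides.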
The paper demonstrates the effectiveness of the proposed pipeline in accurately aligning basketball broadcast footage with play-by-play annotations. The use of end-to-end text localization and readily available OCR tools simplifies the process and ensures reproducibility. The author highlights the pipeline's potential to expedite the development of large, multi-modal datasets for action recognition models.
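Once every frame carries a quarter and a time remaining, aligning a play-by-play event reduces to a lookup. The sketch below is one simple way to do that lookup and is not taken from the paper. The FrameClock type, the locate_event function, and the sample values are hypothetical.

```python
from typing import List, Optional, Tuple

FrameClock = Tuple[int, float]  # (quarter, seconds remaining)


def locate_event(frames: List[Optional[FrameClock]], quarter: int,
                 seconds: float) -> Optional[int]:
    """Return the index of the first frame in `quarter` whose clock
    has counted down to `seconds` or below, or None if there is none."""
    for index, clock in enumerate(frames):
        if clock is None:
            continue  # no usable reading for this frame
        frame_quarter, remaining = clock
        if frame_quarter == quarter and remaining <= seconds:
            return index
    return None


# Hypothetical per-frame clock readings; None marks a frame with no reading.
frames = [(1, 702.0), (1, 701.0), None, (1, 699.0), (2, 720.0)]
frame_index = locate_event(frames, quarter=1, seconds=700.0)
```

Dividing the returned frame index by the video's frame rate gives the event's offset in seconds from the start of the clip.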
The paper concludes that the proposed temporal grounding pipeline offers a practical and efficient solution for aligning basketball broadcast footage with play-by-play annotations. The simplicity of the approach, combined with the use of open-source libraries and deep learning methods, makes it easily adaptable for various use cases and sub-domains within sports analytics.
This research contributes to the field of sports video analysis by providing a streamlined method for temporal grounding, a crucial step in creating annotated datasets for training and evaluating video understanding models. The proposed pipeline has the potential to accelerate research in sports action recognition and facilitate the development of automated sports analysis tools.
The author acknowledges the need for further testing and quantitative benchmarks to evaluate the robustness of the pipeline across diverse broadcast conditions. Future work could explore custom text recognition models trained in-domain to improve accuracy, and investigate methods for handling cases where on-screen graphical indicators are absent.
Source: Levi Harris, arxiv.org, 11-05-2024, https://arxiv.org/pdf/2411.00862.pdf