Core Concepts
Improving the quality of a sign language translation dataset significantly improves the performance of machine translation models trained on it.
Abstract
The paper introduces SignBank+, a cleaned and expanded version of the SignBank dataset optimized for machine translation between spoken language text and SignWriting. It describes the data cleaning, dataset expansion, and evaluation steps used to improve translation quality, and compares several frameworks and models trained on the original, cleaned, and expanded datasets to show the benefit of better data quality.
Directory
Introduction
Importance of sign language in communication.
Objective: Enhance sign language machine translation.
Background
Techniques for translating between signed and spoken languages.
Data Cleaning Process (§3.1)
Rule-based corrections and manual cleaning methods.
Dataset Expansion (§3.2)
Introducing variations to enrich the dataset.
Data Quality Experiments
Evaluation of original, cleaned, and expanded datasets on machine translation models.
Results
Performance comparison across different frameworks/models.
Conclusions
Improved data quality leads to significant performance gains in machine translation.
Future Work & Limitations
Suggestions for future research and study limitations.
Stats
Our best results came from GPT-4 with an IoU of 0.80.
The projected cost of using GPT-4 is approximately $4,000, compared with about $200 for GPT-3.5-turbo, because of GPT-4's higher pricing.
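To make the IoU figure above concrete, here is a minimal sketch of intersection over union computed on two sets of terms. This summary does not say exactly how the score was computed, so the set-based definition, the function name `iou`, and the toy term lists are all assumptions made for illustration, not the authors' evaluation code.

```python
def iou(predicted, gold):
    """Intersection over union of two collections of terms, compared as sets."""
    pred_set, gold_set = set(predicted), set(gold)
    union = pred_set | gold_set
    if not union:
        # Assumption: two empty sets count as perfect agreement.
        return 1.0
    return len(pred_set & gold_set) / len(union)


# Toy data (hypothetical): terms a model kept for one entry vs. a reference annotation.
predicted_terms = ["hello", "hi", "greetings", "hey"]
gold_terms = ["hello", "hi", "greetings", "hey", "howdy"]

score = iou(predicted_terms, gold_terms)
print(score)  # 4 shared terms out of 5 distinct terms
```

Under this definition, a score of 0.80 means that four out of every five distinct terms across the two sets are shared, so missed terms and spurious terms are penalized alike.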
Quotes
"Models trained on SignBank+ surpass those on the original dataset."
"Our experimental results confirm the efficiency of complex modeling approaches on raw datasets."