The paper proposes System-Independent WER Estimation (SIWE), a method for estimating the word error rate (WER) of automatic speech recognition (ASR) transcripts. Previous WER estimators were trained on the output of a specific ASR system, which tied them to that system and limited their flexibility and their performance on out-of-domain data.
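For context, WER is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. A WER estimator tries to predict this value without access to the reference. The following is a minimal sketch of the standard reference-based calculation, not code from the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing "the cat sat on the mat" with "the cat sit on mat" involves one substitution and one deletion over six reference words, so the function returns 2/6.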
The key aspects of the proposed SIWE method are:
Data Augmentation: Instead of using ASR system outputs for training, the authors generate plausible hypotheses by simulating common ASR errors (insertions, deletions, substitutions) based on phonetic similarity and linguistic probability. This allows the WER estimator to be trained in a system-independent manner.
Hypothesis Generation Strategies: Three main strategies are used to generate the training hypotheses: random selection, phonetic similarity, and linguistic probability. The authors experiment with different combinations of these strategies and find that combining phonetic similarity with linguistic probability gives the best performance.
Evaluation: The SIWE model is evaluated on both in-domain and out-of-domain datasets. On in-domain data, it reaches the same level of performance as the ASR system-dependent WER estimators. On out-of-domain data, the SIWE model outperforms the system-dependent estimators, with relative improvements of 17.58% in RMSE and 18.21% in Pearson correlation coefficient.
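The two evaluation metrics named above compare predicted WER values against true WER values. The sketch below shows how RMSE, the Pearson correlation coefficient, and a relative improvement over a baseline could be computed. The helper names are hypothetical, and the relative-improvement formula is an assumption about how such percentages are usually derived, since the summary does not state the exact formula.

```python
import math


def rmse(predicted, target):
    """Root mean square error between predicted and true WER values."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def relative_improvement(baseline, proposed, lower_is_better):
    """Percentage gain of `proposed` over `baseline` (assumed formula).

    For error metrics such as RMSE a lower value is better; for
    correlation metrics such as Pearson a higher value is better.
    """
    gain = baseline - proposed if lower_is_better else proposed - baseline
    return 100.0 * gain / baseline
```

With these definitions, a baseline RMSE of 0.20 and a proposed RMSE of 0.15 (illustrative numbers, not results from the paper) would count as a 25% relative improvement.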
The authors also find that the performance of the SIWE model is further improved when the training data's WER distribution is close to the evaluation dataset's WER.
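To make the data augmentation idea concrete, here is a minimal sketch of the simplest strategy mentioned above, random selection, with an `error_rate` parameter that controls roughly how much of the reference is corrupted and therefore the WER of the generated training data. This is an illustration under stated assumptions, not the authors' implementation: the function name, the `vocabulary` argument, and the uniform choice among the three error types are all hypothetical. The phonetic-similarity and linguistic-probability strategies would replace the random word choice with a ranked candidate list, which is not shown here.

```python
import random


def simulate_hypothesis(reference, vocabulary, error_rate, rng):
    """Corrupt a reference transcript with random word-level errors.

    Assumptions: each reference word is corrupted independently with
    probability `error_rate`, and the error type is chosen uniformly.
    """
    hypothesis = []
    for word in reference.split():
        if rng.random() >= error_rate:
            hypothesis.append(word)  # keep the word unchanged
            continue
        operation = rng.choice(("substitute", "delete", "insert"))
        if operation == "substitute":
            # Prefer a replacement that differs from the original word.
            candidates = [w for w in vocabulary if w != word] or vocabulary
            hypothesis.append(rng.choice(candidates))
        elif operation == "insert":
            # Keep the word and add a spurious word after it.
            hypothesis.extend([word, rng.choice(vocabulary)])
        # "delete": the word is dropped from the hypothesis.
    return " ".join(hypothesis)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    toy_vocabulary = ["cat", "hat", "sat", "mat", "the", "a"]
    print(simulate_hypothesis("the cat sat on the mat", toy_vocabulary, 0.3, rng))
```

Raising or lowering `error_rate` shifts the WER of the simulated hypotheses, which is one way the training data could be matched to the WER expected at evaluation time.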
Key insights distilled from source content by Chanho Park,... at arxiv.org, 04-26-2024: https://arxiv.org/pdf/2404.16743.pdf