This paper presents a two-stage audio topic retrieval system that leverages the strengths of pre-trained language models and transfer learning.
In the first stage, audio files are represented by vector embeddings to enable rapid retrieval from large archives. Three different text sources are examined for extracting these embeddings: automatic speech recognition (ASR) transcriptions, automatic summaries (AutoSum) of the ASR transcriptions, and human-written synopses.
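The first-stage retrieval described above can be sketched as embedding-based nearest-neighbour search. This is a minimal illustration, not the paper's implementation: the embedding model is unspecified here, and the vectors and helper names are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_emb, doc_embs, k=3):
    # Rank archive documents by similarity to the query embedding
    # and return the indices of the top-k matches.
    order = sorted(range(len(doc_embs)),
                   key=lambda i: -cosine(query_emb, doc_embs[i]))
    return order[:k]

# Toy 3-dimensional embeddings for four archive items.
docs = [[1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
print(retrieve(query, docs, k=2))
```

In practice the document vectors (from ASR transcriptions, AutoSum output, or synopses) would be precomputed and indexed, so retrieval cost is dominated by the similarity search rather than by encoding.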
The results show that the retrieval system built on human-written synopses outperforms the one built on ASR transcriptions, with an nDCG@3 of 0.61 versus 0.47. This performance gap is analyzed using a fact-checking approach, which reveals that more than 50% of the atomic facts in the synopses are not present in the ASR transcriptions.
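The nDCG@3 metric used to report these results can be computed as follows. This is a standard definition sketch (DCG with a log2 discount, normalized by the ideal ordering), not code from the paper; the relevance grades in the example are made up.

```python
import math

def dcg(rels):
    # Discounted cumulative gain: relevance discounted by log2 of rank.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=3):
    # Normalize DCG of the system ranking by DCG of the ideal ranking.
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0

# Graded relevance of the top-3 retrieved items, in retrieved order.
print(round(ndcg_at_k([1, 0, 2], k=3), 3))
```

A score of 1.0 means the top-3 results are already in the ideal relevance order, so the 0.61 vs 0.47 gap reflects how often relevant items are ranked near the top.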
To address this, the authors propose using AutoSum, which narrows the performance gap with the Synopsis system, improving the nDCG@3 from 0.47 to 0.52.
In the second stage, zero-shot reranking methods using large language models (LLMs) are investigated to further boost retrieval performance. Two reranking approaches are examined: listwise reranking (LRL) and pairwise reranking (PRL). The results show that the PRL method with the Flan-T5-3B model matches the performance of its LRL counterpart while being more computationally efficient. Pairwise reranking with AutoSum improves the nDCG@3 to 0.54, a 14.9% relative improvement over the baseline retrieval with ASR transcriptions.
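The pairwise reranking stage can be sketched as repeated LLM-judged comparisons over the first-stage candidates. This is an illustrative sketch only: `llm_prefers` is a hypothetical stand-in for prompting Flan-T5 with a query and two candidates, replaced here by a token-overlap heuristic so the example runs self-contained.

```python
def llm_prefers(query, doc_a, doc_b):
    # Stand-in for an LLM pairwise judgment: a real PRL system would
    # prompt the model to pick the more relevant of doc_a and doc_b.
    qa = len(set(query.split()) & set(doc_a.split()))
    qb = len(set(query.split()) & set(doc_b.split()))
    return qa > qb

def pairwise_rerank(query, docs, top_k=3):
    # Selection-style pass: for each of the top_k slots, promote the
    # candidate the judge prefers over the current occupant.
    docs = list(docs)
    for i in range(min(top_k, len(docs))):
        best = i
        for j in range(i + 1, len(docs)):
            if llm_prefers(query, docs[j], docs[best]):
                best = j
        docs[i], docs[best] = docs[best], docs[i]
    return docs

query = "audio topic retrieval"
candidates = ["about cooking",
              "audio retrieval methods",
              "topic models for audio retrieval"]
print(pairwise_rerank(query, candidates))
```

Because only the top ranks matter for nDCG@3, filling just the first few slots keeps the number of LLM calls well below a full sort, which is one reason pairwise reranking can be cheaper than listwise prompting over the whole candidate set.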