A novel framework named LaIAR that leverages knowledge from language models to enhance both the recognition capabilities and the interpretability of video models.