Core Concepts
Musketeer achieves competitive multi-task performance through joint training with Task Explanation Prompts.
Abstract
Musketeer is a vision-language model trained jointly on multiple tasks, using Task Explanation Prompts (TEPs) to reduce interference among heterogeneous tasks.
The model's architecture includes stacked Transformer layers for encoding and decoding, with shared parameters across tasks.
Musketeer outperforms specialist models in visual grounding, visual entailment, and image captioning without task-specific fine-tuning.
TEP enhances zero-shot learning performance on unseen tasks and datasets.
Ablation studies show that training Musketeer on additional tasks improves its accuracy on each individual task.
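To make the TEP idea concrete, here is a minimal sketch of how a structured task-explanation prompt could be assembled and prepended to each example. The field names, template layout, and example text are illustrative assumptions, not the exact sub-prompts Musketeer uses.

```python
# Hypothetical sketch of assembling a Task Explanation Prompt (TEP).
# Field names and wording are assumptions for illustration only;
# they are not Musketeer's actual template.

def build_tep(task_name: str, data_desc: str, input_fmt: str,
              output_fmt: str, instance: str) -> str:
    """Concatenate structured sub-prompts into one text prompt that
    would be prepended to every training/inference example."""
    parts = [
        f"Task: {task_name}",
        f"Data description: {data_desc}",
        f"Input format: {input_fmt}",
        f"Output format: {output_fmt}",
        f"Instance: {instance}",
    ]
    return " ; ".join(parts)

tep = build_tep(
    task_name="image captioning",
    data_desc="images paired with natural-language descriptions",
    input_fmt="image tokens",
    output_fmt="a free-form caption",
    instance="what does the image describe?",
)
print(tep)
```

Because every task shares one set of model parameters, a structured prompt like this is the only signal that tells the model which task an example belongs to.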
Stats
A single model is jointly trained on multiple different tasks (Musketeer).
The model's architecture includes stacked Transformer layers for encoding and decoding, with parameters shared across tasks.
Quotes
"TEPs are structured text explanations that guide the training and inference processes."
"Musketeer outperforms specialist models in various visual language tasks."