Core Concepts
Consistency over rephrasings of a visual question can be used to identify unreliable predictions from a black-box vision-language model, even when the rephrasing model is substantially smaller than the black-box model.
Abstract
The paper explores the problem of selective visual question answering with black-box vision-language models, in which the model may abstain from answering a question when it is not confident in its prediction.
The key insights are:
Existing approaches to selective prediction typically require access to the internal representations of the model or retraining the model, which is not feasible in a black-box setting.
The authors propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model. The intuition is that a reliable response should be consistent across semantically equivalent rephrasings of the original question.
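A minimal sketch of this consistency check: given the black-box model's answers to the original question and to its rephrasings, score the fraction of rephrasing answers that agree with the original. (The function name and answer format here are illustrative assumptions, not the paper's exact implementation.)

```python
def consistency(answers):
    """Agreement of rephrasing answers with the original answer.

    answers[0] is the model's answer to the original question;
    answers[1:] are its answers to the rephrasings.
    Returns a score in [0, 1]; higher means more consistent.
    """
    original, rest = answers[0], answers[1:]
    if not rest:  # no rephrasings available: trivially consistent
        return 1.0
    return sum(a == original for a in rest) / len(rest)

# A fully consistent prediction vs. an inconsistent one.
print(consistency(["red", "red", "red", "red"]))    # 1.0
print(consistency(["red", "blue", "red", "green"]))
```

A high score suggests the response sits in a stable region of the model's behavior; a low score flags it as potentially unreliable.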
Since it is not possible to directly sample neighbors in feature space in a black-box setting, the authors use a smaller proxy model to approximately sample rephrasings of the original question.
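To illustrate the proxy-sampling step, here is a toy sketch in which template-based paraphrasing stands in for the smaller proxy model (a real system would sample from a learned rephrasing model; the templates and function interface are assumptions for illustration only).

```python
import random

# Toy stand-in for the proxy rephrasing model.
TEMPLATES = [
    "What color is the {}?",
    "Which color does the {} have?",
    "Can you tell me the color of the {}?",
    "What is the color of the {}?",
]

def sample_rephrasings(subject, n, seed=0, max_tries=100):
    """Approximately sample n distinct rephrasings of a question
    about `subject`, deduplicating repeated draws."""
    rng = random.Random(seed)
    seen, out, tries = set(), [], 0
    while len(out) < n and tries < max_tries:
        tries += 1
        cand = rng.choice(TEMPLATES).format(subject)
        if cand not in seen:
            seen.add(cand)
            out.append(cand)
    return out

rephrasings = sample_rephrasings("car", n=3)
```

Each sampled rephrasing is then sent to the black-box model; only its input-output interface is needed, which is what makes the approach viable in the black-box setting.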
The authors find that the consistency of the black-box model's responses across the rephrasings can be used to identify responses that are likely to be unreliable, even in adversarial settings or in settings that are out of distribution for the proxy model.
Experiments on in-distribution, out-of-distribution, and adversarial visual questions show that consistency over rephrasings is correlated with model accuracy, and predictions that are highly consistent over rephrasings are more likely to be correct.
The approach works even when the rephrasing model is substantially smaller than the black-box model, making it a practical solution for using large, black-box vision-language models in safety-critical applications.
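Putting the pieces together, selective prediction then amounts to abstaining whenever consistency falls below a threshold. The sketch below uses toy (consistency, correctness) records standing in for real model outputs; the record format and threshold value are assumptions for illustration.

```python
def selective_predict(records, threshold):
    """Answer only when consistency >= threshold; abstain otherwise.

    records: list of (answer, consistency, is_correct) tuples.
    Returns (coverage, selective_accuracy); accuracy is None if
    the model abstains on everything.
    """
    answered = [r for r in records if r[1] >= threshold]
    coverage = len(answered) / len(records)
    if not answered:
        return coverage, None
    selective_acc = sum(r[2] for r in answered) / len(answered)
    return coverage, selective_acc

# Toy data: high-consistency answers tend to be correct.
records = [
    ("red", 1.0, True), ("two", 0.9, True), ("cat", 0.3, False),
    ("yes", 0.8, True), ("blue", 0.2, False), ("dog", 0.5, False),
]
cov, acc = selective_predict(records, threshold=0.7)
```

Raising the threshold trades coverage for accuracy, which is the knob a safety-critical deployment would tune.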
Stats
"Consistency over rephrasings of a visual question can be used to identify unreliable predictions from a black-box vision-language model."
"The approach works even when the rephrasing model is substantially smaller than the black-box model."
Quotes
"Consistency over rephrasings of a visual question can be used to identify unreliable predictions from a black-box vision-language model, even when the rephrasing model is substantially smaller than the black-box model."
"Consistency over the rephrasings of a question is correlated with model accuracy on the original question, and predictions that are highly consistent over rephrasings are more likely to be correct."