Core Concepts
The Merlin-Arthur classifier provides provable interpretability guarantees, using soundness and completeness as measurable criteria for feature-based explanations.
Abstract
The article introduces a novel interactive multi-agent classifier, the Merlin-Arthur classifier, which offers provable interpretability guarantees even for complex agents such as neural networks. It establishes lower bounds on the mutual information between the selected features and the classification decision. The results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and come with measurable metrics, namely soundness and completeness. The article discusses the shortcomings of existing Explainable AI (XAI) methods and demonstrates how a formal approach can avoid the complexity barriers that arise when formal verification is applied to neural networks. It also introduces Asymmetric Feature Correlation, a concept that captures the feature correlations which can weaken interpretability guarantees. The theoretical framework for the Merlin-Arthur classifier is developed in terms of mutual information, entropy, precision, and relative success rates. Numerical implementations evaluate the theoretical bounds on the MNIST and UCI Census Income datasets. The article concludes by discussing limitations, future directions, and implications for real-world applications.
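The protocol behind these guarantees is a cooperative/adversarial game: Merlin selects a feature subset to convince the verifier Arthur of the true class, while Morgana selects features trying to mislead him. A minimal sketch of the two provers in Python follows; the helper names (`arthur`, `masked`, `merlin`, `morgana`) and the exhaustive search over size-k feature masks are illustrative assumptions, not necessarily how the article implements the provers.

```python
import numpy as np
from itertools import combinations

def masked(x, idx):
    """Keep only the features in idx; zero out (mask) everything else."""
    z = np.zeros_like(x)
    z[list(idx)] = x[list(idx)]
    return z

def merlin(arthur, x, y, k):
    """Cooperative prover: choose the k features that maximize Arthur's
    confidence in the true class y. arthur(z) returns class probabilities."""
    return max(combinations(range(len(x)), k),
               key=lambda idx: arthur(masked(x, idx))[y])

def morgana(arthur, x, y, k):
    """Adversarial prover: choose the k features that minimize Arthur's
    confidence in the true class y, i.e., try to fool him."""
    return min(combinations(range(len(x)), k),
               key=lambda idx: arthur(masked(x, idx))[y])
```

The exhaustive search is exponential in the input size and is only meant to make the roles of the two provers concrete; in practice the provers would be trained or solver-based feature selectors.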
Structure:
Introduction to Interpretability Guarantees with Merlin-Arthur Classifiers
Challenges in Explainable AI (XAI) Methods
Theoretical Framework for Merlin-Arthur Classifier
Numerical Implementation and Evaluation of Theoretical Bounds
Discussion on Adversarial Robustness and Causal Mechanisms
Conclusion and Future Directions
Stats
Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems.
Completeness is the probability that Arthur classifies correctly when given features selected by the cooperative prover Merlin.
Soundness is the probability that Arthur is not fooled into a wrong class by the adversarial prover Morgana (see the estimation sketch after this list).
Entropy measures how uncertain we are about the class a priori; mutual information measures how much the exchanged features reduce that uncertainty.
Precision is defined as $P_{x \sim \mathcal{D}}[c(x) = c(y) \mid z \subseteq x]$, i.e., the probability that a data point $x$ containing the feature set $z$ has the same class as the original data point $y$.
Asymmetric Feature Correlation captures class-asymmetric correlations between features that complicate the information-theoretic bounds and must themselves be controlled.
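For concreteness, the quantities in this list can be written out as follows; the symbols (Arthur's decision $A$, Merlin's selector $M$, Morgana's selector $\hat{M}$, error rates $\varepsilon_c, \varepsilon_s$, and $\perp$ for "don't know") are our own notation, not necessarily the article's:

```latex
% Completeness: Arthur returns the true class on Merlin's features.
\[ \mathbb{P}_{x \sim \mathcal{D}}\bigl[\, A(M(x)) = c(x) \,\bigr] \;\ge\; 1 - \varepsilon_c \]

% Soundness: Morgana cannot force a wrong class; answering the true
% class or abstaining (\perp, "don't know") both count as not fooled.
\[ \mathbb{P}_{x \sim \mathcal{D}}\bigl[\, A(\hat{M}(x)) \in \{ c(x), \perp \} \,\bigr] \;\ge\; 1 - \varepsilon_s \]

% Mutual information between class C and exchanged features Z:
% H(C) is the a priori uncertainty, I(C;Z) the reduction achieved by Z.
\[ I(C; Z) \;=\; H(C) - H(C \mid Z) \]
```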
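And a short continuation of the earlier Python sketch that estimates completeness and soundness empirically on a labeled sample; the confidence threshold and the `DONT_KNOW` abstention label are illustrative assumptions:

```python
# Builds on masked/merlin/morgana from the sketch above.
import numpy as np

DONT_KNOW = -1  # hypothetical abstention label for Arthur

def decide(arthur, z, threshold=0.9):
    """Arthur commits to a class only when confident; otherwise abstains."""
    probs = arthur(z)
    return int(np.argmax(probs)) if np.max(probs) >= threshold else DONT_KNOW

def empirical_rates(arthur, data, labels, k=5):
    """Estimate completeness and soundness over a labeled sample."""
    complete, sound = 0, 0
    for x, y in zip(data, labels):
        # Completeness: Arthur classifies correctly on Merlin's features.
        complete += decide(arthur, masked(x, merlin(arthur, x, y, k))) == y
        # Soundness: Morgana never pushes Arthur to a wrong class;
        # the true class or an abstention both count as "not fooled".
        sound += decide(arthur, masked(x, morgana(arthur, x, y, k))) in (y, DONT_KNOW)
    n = len(labels)
    return complete / n, sound / n
```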
Quotes
"The field of Explainable AI (XAI) has put forth a number of interpretability approaches."
"Formal approaches run into complexity barriers when applied to NNs."
"Our results provide quantitative lower bounds on feature informativeness without modeling data distribution."