Essential concepts
Neural networks learn representations that encode complex concepts, but these representations are often hard to interpret. This survey reviews recent methods for explaining the concepts learned by neural networks, ranging from analyzing individual neurons to training classifiers on entire layers, with the goal of making neural networks more transparent and easier to control.
Summary
This survey provides a comprehensive overview of recent approaches for explaining concepts in neural networks. It categorizes the methods into two main groups:
Neuron-Level Explanations:
- Similarity-based approaches compare the activations of individual neurons to predefined concepts; network dissection, for example, measures the intersection over union (IoU) between a neuron's thresholded activation map and pixel-level concept segmentations (a sketch of this score follows the list).
- Causality-based approaches analyze the causal relationship between neuron activations and concepts, either by intervening on the input to measure the influence on neuron activations, or by intervening on the activations themselves to measure the impact on concept prediction (see the intervention sketch below).
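A minimal sketch of the IoU score at the heart of network dissection, assuming the neuron's activation map has already been upsampled to the resolution of the concept segmentation mask. In the original method the threshold is a high quantile of the unit's activation distribution over the whole dataset; this per-image version only approximates that, and all names are illustrative.

```python
import numpy as np

def neuron_concept_iou(activation_map, concept_mask, quantile=0.995):
    """Score how well one neuron's activations align with a concept.

    activation_map: (H, W) float array, the neuron's activation upsampled
                    to the resolution of the segmentation mask.
    concept_mask:   (H, W) bool array, pixel-level concept annotation.
    """
    # Binarize the activation map at a high quantile, following the
    # thresholding idea in network dissection.
    threshold = np.quantile(activation_map, quantile)
    neuron_mask = activation_map >= threshold

    intersection = np.logical_and(neuron_mask, concept_mask).sum()
    union = np.logical_or(neuron_mask, concept_mask).sum()
    return intersection / union if union > 0 else 0.0
```

A neuron is then labeled with the concept whose IoU is highest, provided it exceeds some minimum threshold.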
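A sketch of an activation-level intervention in PyTorch, under the assumption of a classification model and a chosen layer module; the hook zeroes out one unit and the drop in the target-class probability serves as a rough causal-relevance score. Function and variable names here are hypothetical.

```python
import torch

def ablation_effect(model, layer, x, unit, target_class):
    """Measure how zeroing one neuron changes the target-class probability."""
    def zero_unit(module, inputs, output):
        output = output.clone()
        output[:, unit] = 0.0  # intervene: silence the chosen neuron/channel
        return output

    model.eval()
    with torch.no_grad():
        p_before = model(x).softmax(dim=-1)[:, target_class]
        handle = layer.register_forward_hook(zero_unit)
        p_after = model(x).softmax(dim=-1)[:, target_class]
        handle.remove()
    # A large drop suggests the neuron is causally relevant to the class
    # (and, by extension, to the concepts that class depends on).
    return (p_before - p_after).mean().item()
```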
Layer-Level Explanations:
- Concept Activation Vectors (CAVs) train a linear classifier for each concept on the activations of a specific layer; the CAV is the weight vector of that classifier, pointing in the direction of the concept (a sketch follows this list).
- Probing uses a multi-class classifier to evaluate how well the layer activations capture linguistic features, which can then be combined with a knowledge base to provide richer explanations (see the probing sketch below).
- Concept Bottleneck Models explicitly represent each concept as a dedicated neuron in a bottleneck layer, allowing the model to explain its predictions in terms of the activated concepts (a minimal model sketch closes this list).
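A minimal CAV sketch, assuming we have collected layer activations for examples that contain the concept (`acts_pos`) and random counterexamples (`acts_neg`), both hypothetical names; the CAV is taken to be the weight vector of a linear classifier separating the two sets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_cav(acts_pos, acts_neg):
    """Learn a concept activation vector from layer activations.

    acts_pos: (n_pos, d) activations for examples showing the concept.
    acts_neg: (n_neg, d) activations for random/negative examples.
    """
    X = np.vstack([acts_pos, acts_neg])
    y = np.concatenate([np.ones(len(acts_pos)), np.zeros(len(acts_neg))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)  # unit vector pointing toward the concept
```

In TCAV-style analyses, this vector is then used to take directional derivatives of a class score with respect to the layer activations, quantifying how sensitive the prediction is to the concept.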
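Probing can be sketched in the same style, assuming frozen activations paired with linguistic labels such as part-of-speech tags; held-out probe accuracy is the usual measure of how linearly decodable the feature is from the layer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(activations, labels):
    """Train a probe on frozen activations and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)  # high accuracy => feature is decodable
```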
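A minimal concept bottleneck model in PyTorch, assuming a generic feature extractor and concept-annotated training data; every name and dimension here is illustrative. Because the label is predicted only from the bottleneck, each prediction can be traced back to concept activations.

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    def __init__(self, backbone, feat_dim, n_concepts, n_classes):
        super().__init__()
        self.backbone = backbone                         # any feature extractor
        self.concepts = nn.Linear(feat_dim, n_concepts)  # one unit per concept
        self.head = nn.Linear(n_concepts, n_classes)     # predicts from concepts only

    def forward(self, x):
        c_logits = self.concepts(self.backbone(x))
        c = torch.sigmoid(c_logits)   # predicted concept activations in [0, 1]
        return self.head(c), c        # class logits + inspectable concepts

# Training typically combines a concept loss and a task loss, e.g.
# loss = bce(c, concept_labels) + ce(class_logits, class_labels)
```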
The survey highlights progress in this active research area and discusses opportunities for tighter integration between neural models and symbolic representations, known as neuro-symbolic integration, to make neural networks more transparent and controllable.