Engels, J., Michaud, E.J., Liao, I., Gurnee, W., & Tegmark, M. (2024). Not All Language Model Features Are Linear. arXiv preprint arXiv:2405.14860v2.
This paper challenges the linear representation hypothesis, the prevailing view that language models represent features as one-dimensional directions in activation space, by investigating whether multi-dimensional features exist in these models and what form they take. The authors aim to provide a theoretical framework for multi-dimensional features and to develop empirical methods for discovering and analyzing them.
The authors propose a rigorous definition of irreducible multi-dimensional features: a feature is irreducible if it cannot be decomposed into independent or non-co-occurring lower-dimensional features. They introduce the separability index and the ϵ-mixture index as quantitative measures of feature reducibility. Building on these definitions, they design a scalable method that uses sparse autoencoders (SAEs) to automatically identify multi-dimensional features in two pre-trained language models, GPT-2 and Mistral 7B. The method clusters SAE dictionary elements by cosine similarity, then analyzes the activation vectors reconstructed from each cluster for irreducible multi-dimensional structure. The authors also use intervention experiments, regression analysis, and synthetic datasets to investigate the causal role and continuity of the discovered features.
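The clustering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`cluster_by_cosine`, `reconstruct_from_cluster`), the similarity threshold, and the choice of thresholded connected components as the clustering rule are all assumptions made for the example, and the paper's actual clustering procedure may differ. The sketch assumes an SAE decoder matrix with one nonzero dictionary element per row.

```python
import numpy as np


def cluster_by_cosine(decoder, threshold=0.8):
    """Group dictionary elements whose cosine similarity is >= threshold.

    decoder: array of shape (n_features, d_model), one dictionary element per row.
    Returns a list of clusters, each a sorted list of row indices.
    Clustering rule (an assumption for this sketch): connected components
    of the thresholded cosine-similarity graph.
    """
    # Normalize rows so that dot products equal cosine similarities.
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    adjacent = (unit @ unit.T) >= threshold

    n = decoder.shape[0]
    assigned = np.zeros(n, dtype=bool)
    clusters = []
    for start in range(n):
        if assigned[start]:
            continue
        # Depth-first search over the similarity graph from this element.
        assigned[start] = True
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(int(i))
            for j in np.flatnonzero(adjacent[i]):
                if not assigned[j]:
                    assigned[j] = True
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


def reconstruct_from_cluster(codes, decoder, cluster):
    """Reconstruct activations using only the dictionary elements in one cluster.

    codes: array of shape (n_tokens, n_features) of SAE hidden activations.
    Returns an array of shape (n_tokens, d_model).
    """
    return codes[:, cluster] @ decoder[cluster]


# Toy dictionary: elements 0 and 1 are nearly parallel, as are 3 and 4.
toy_decoder = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.05, 1.0],
])
toy_clusters = cluster_by_cosine(toy_decoder, threshold=0.8)
```

The reconstructions returned by `reconstruct_from_cluster` for a given cluster are the points one would then inspect (for example, after projecting to a few principal components) for structure that cannot be reduced to lower-dimensional features.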
The authors conclude that multi-dimensional features exist in language models and play a significant role. They argue that understanding these representations is crucial for uncovering the underlying algorithms that language models employ, and that their findings open the way to a more nuanced account of interpretability that moves beyond simple linear representations.
This research significantly contributes to the field of mechanistic interpretability by providing a theoretical and empirical framework for understanding multi-dimensional representations in language models. It challenges existing assumptions about feature linearity and highlights the importance of considering more complex feature structures.
The authors acknowledge that they discovered only a limited number of interpretable multi-dimensional features, and they suggest further research into interpreting high-scoring features and improving clustering techniques. The study also encourages further investigation of the algorithms that language models build on these multi-dimensional representations, particularly for algorithmic tasks. Finally, the authors propose studying how feature dimensionality changes as model size increases as a promising avenue for future research.