Language models (LMs) complete facts through a mix of mechanisms, including exact recall, heuristics, and guesswork, and distinguishing between these mechanisms is essential for interpreting LM behavior correctly.
Language models like BERT and RoBERTa develop internal subnetworks that correspond to theoretical linguistic categories, demonstrating a degree of learned grammatical understanding that can be analyzed using Shapley Head Values and pruning techniques.
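The head-attribution idea can be sketched as a Monte Carlo Shapley estimate over attention heads. Below is an illustrative sketch, not the paper's implementation: `evaluate_with_heads` is a hypothetical stand-in for evaluating the model with only the given heads active on a linguistic probe task.

```python
# Monte Carlo estimate of per-head Shapley values (illustrative sketch).
import random

def evaluate_with_heads(active_heads: frozenset) -> float:
    # Toy value function: pretend every third head matters most. In practice
    # this would be, e.g., subject-verb agreement accuracy of the pruned model.
    return sum(1.0 if h % 3 == 0 else 0.1 for h in active_heads)

def shapley_head_values(n_heads: int, n_samples: int = 200) -> list[float]:
    values = [0.0] * n_heads
    for _ in range(n_samples):
        order = list(range(n_heads))
        random.shuffle(order)
        active: set[int] = set()
        prev_score = evaluate_with_heads(frozenset(active))
        for head in order:
            active.add(head)
            score = evaluate_with_heads(frozenset(active))
            values[head] += score - prev_score  # marginal contribution of this head
            prev_score = score
    return [v / n_samples for v in values]

if __name__ == "__main__":
    vals = shapley_head_values(n_heads=12, n_samples=100)
    top = sorted(range(len(vals)), key=lambda h: -vals[h])[:3]
    print("highest-value heads:", top)
```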
This work improves the sensitive-direction analysis techniques used to understand the inner workings of language models and, in particular, clarifies the effectiveness and limitations of Sparse Autoencoder (SAE)-based feature analysis methods.
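As a rough illustration of the SAE setup such feature analyses build on, the sketch below (with random placeholder weights and activations, not the paper's code) shows the standard encode, ReLU, decode pass with a reconstruction loss and an L1 sparsity penalty.

```python
# Minimal sparse autoencoder (SAE) forward pass; all tensors are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = F.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
acts = torch.randn(8, 768)                            # stand-in residual-stream activations
features, recon = sae(acts)
recon_loss = F.mse_loss(recon, acts)
sparsity_penalty = features.abs().mean()              # L1 term encouraging sparse features
print(recon_loss.item(), sparsity_penalty.item())
```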
Contrary to the linear representation hypothesis, language models can and do learn inherently multi-dimensional features, as evidenced by the discovery of circular representations for concepts like days of the week and months of the year in GPT-2 and Mistral 7B using sparse autoencoders and novel irreducibility metrics.
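A minimal way to see what a "circular" feature looks like is to project a set of concept vectors onto their top two principal components and check whether they sit at a roughly constant radius. The day-of-week vectors below are synthetic placeholders, not activations from GPT-2 or Mistral 7B, and the radius check is a simple proxy rather than the paper's irreducibility metric.

```python
# Check whether seven concept vectors lie near a circle in a 2-D subspace.
import numpy as np

rng = np.random.default_rng(0)
angles = 2 * np.pi * np.arange(7) / 7
circle_2d = np.stack([np.cos(angles), np.sin(angles)], axis=1)
basis = rng.normal(size=(2, 768))                                    # random 2-D plane in model space
day_vectors = circle_2d @ basis + 0.05 * rng.normal(size=(7, 768))   # noisy circular layout (synthetic)

centered = day_vectors - day_vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:2].T                                           # top-2 principal components
radii = np.linalg.norm(proj, axis=1)
print("radius spread (std/mean):", radii.std() / radii.mean())       # near 0 => circular arrangement
```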
This paper identifies the components inside a language model that play an important role in specific tasks and presents a method for leveraging those components to effectively steer the model's predictions.
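The general recipe of steering predictions through an identified component can be illustrated with a forward hook that adds a fixed direction to an intermediate activation. The tiny model and steering vector below are illustrative placeholders, not the paper's method.

```python
# Steer a model's output by adding a direction to a hidden activation via a hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(1, 16)

steer_direction = torch.randn(32)          # stand-in for an identified "important component" direction
steer_direction /= steer_direction.norm()

def add_steering(module, inputs, output, alpha=3.0):
    # Shift the hidden activation along the chosen direction; returning a value
    # from a forward hook replaces the module's output.
    return output + alpha * steer_direction

handle = model[1].register_forward_hook(add_steering)
print("steered logits: ", model(x))
handle.remove()
print("original logits:", model(x))
```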
Retrieve to Explain (R2E) introduces a retrieval-based language model that prioritizes evidence for predictions, improving explainability and performance in complex tasks.
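A hedged sketch of the general retrieve-then-score pattern, not R2E's actual architecture: embed a query and a set of evidence passages, rank the evidence by similarity, and keep the per-passage scores so a prediction can be traced back to the evidence that supported it.

```python
# Retrieve-then-score sketch with placeholder embeddings for evidence attribution.
import numpy as np

rng = np.random.default_rng(0)
d = 64
evidence = ["passage A", "passage B", "passage C"]
evidence_emb = rng.normal(size=(len(evidence), d))       # stand-in passage embeddings
query_emb = rng.normal(size=d)                           # stand-in query embedding

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(query_emb, e) for e in evidence_emb])
for idx in np.argsort(-scores):
    print(f"{evidence[idx]}: score={scores[idx]:.3f}")   # per-passage score doubles as an explanation
prediction_score = scores.max()                          # aggregate however the downstream task requires
```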