Chalnev, S., Siu, M., & Conmy, A. (2024). Improving Steering Vectors by Targeting Sparse Autoencoder Features. arXiv preprint arXiv:2411.02193v1.
This research paper addresses the unpredictability of current LLM steering methods by developing a technique that uses sparse autoencoders (SAEs) to quantify and interpret the effects of steering interventions, ultimately enabling the construction of more effective steering vectors.
The researchers utilize JumpReLU SAEs to measure the changes in feature activations caused by steering interventions. They train a linear effect approximator function to predict feature effects for given steering vectors. This function is then used to identify targeted steering vectors that maximize the activation of desired features while minimizing unintended side effects. The effectiveness of this SAE-Targeted Steering (SAE-TS) method is evaluated against existing methods like Contrastive Activation Addition (CAA) and direct SAE feature steering.
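The core idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a learned linear effect approximator `M` that maps a steering vector (in the model's residual-stream space) to predicted changes in SAE feature activations, and finds a targeted vector whose predicted effect is concentrated on one desired feature; all names, shapes, and the least-squares formulation are assumptions for illustration.

```python
import numpy as np

# Stand-in for the trained linear effect approximator: maps a steering
# vector v (dim d_model) to predicted SAE feature activation changes
# (dim n_features) via v @ M. Shapes are hypothetical.
rng = np.random.default_rng(0)
d_model, n_features = 16, 8
M = rng.normal(size=(d_model, n_features))

def targeted_steering_vector(M, target_idx):
    """Find v whose predicted feature effect is one-hot on the target:
    v @ M ~= e_target, i.e. maximize the desired feature's activation
    while keeping predicted side effects on other features near zero."""
    e = np.zeros(M.shape[1])
    e[target_idx] = 1.0
    # Minimum-norm least-squares solution of M.T v = e
    v, *_ = np.linalg.lstsq(M.T, e, rcond=None)
    return v

v = targeted_steering_vector(M, target_idx=3)
pred = v @ M  # predicted feature effects of steering with v
```

In this sketch, the least-squares solve plays the role of "targeting": it trades off activating the desired feature against predicted off-target effects, which is the property SAE-TS is evaluated on relative to CAA and direct SAE feature steering.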
The research highlights the potential of SAEs in understanding and controlling LLM behavior. SAE-TS offers a promising approach for constructing steering vectors that produce more predictable and controlled changes in model output, paving the way for more reliable and interpretable LLM steering.
This research contributes to the field of LLM steering by introducing a method that leverages the interpretability of SAEs to improve the controllability and predictability of LLMs, with implications for developing safer and more reliable language models.
The study primarily focuses on the Gemma-2 model and a limited set of steering tasks. Future research should explore the generalizability of SAE-TS across different LLM architectures and a wider range of tasks, including safety-critical applications. Additionally, investigating the effectiveness of SAE-TS with chat models and in mitigating social biases is crucial.