Stable and Steerable Sparse Autoencoders with Weight Regularization
Piotr Jedryszek, Oliver M. Crook
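AI summary: L1/L2 weight regularization improves the cross-seed feature consistency of sparse autoencoders and raises steering success rates on language models, while preserving interpretability scores.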
Details
Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we study weight regularization, adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increases the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean of automated interpretability scores essentially unchanged. Finally, in the regularized setting, activation steering success becomes better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability.
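To make the regularized objective concrete, the following is a minimal NumPy sketch of a TopK SAE loss with an added L2 penalty on the encoder and decoder weights. It is an illustration under stated assumptions, not the authors' implementation: the function name `topk_sae_loss`, the parameter names, the ReLU applied to the retained activations, the use of squared-error reconstruction, and the single shared coefficient `l2_coeff` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def topk_sae_loss(x, W_enc, b_enc, W_dec, b_dec, k, l2_coeff):
    """Reconstruction error plus an L2 penalty on encoder and decoder weights.

    Shapes (assumed): x (batch, d), W_enc (d, m), W_dec (m, d).
    """
    # Encoder pre-activations, one row per input.
    pre = x @ W_enc + b_enc
    # TopK sparsity: keep the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    mask = np.zeros(pre.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    # Assumption: retained activations are also passed through a ReLU.
    z = np.where(mask, np.maximum(pre, 0.0), 0.0)
    # Linear decoder and mean squared reconstruction error per example.
    x_hat = z @ W_dec + b_dec
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # L2 weight penalty on both weight matrices (biases left unpenalized).
    penalty = l2_coeff * (np.sum(W_enc ** 2) + np.sum(W_dec ** 2))
    return recon + penalty, recon, penalty, z

# Usage with random data; W_dec starts as the transpose of W_enc,
# which is one way to realize the tied initialization mentioned above.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_enc = 0.1 * rng.normal(size=(8, 16))
W_dec = W_enc.T.copy()
b_enc, b_dec = np.zeros(16), np.zeros(8)
total, recon, penalty, z = topk_sae_loss(x, W_enc, b_enc, W_dec, b_dec, k=3, l2_coeff=1e-3)
```

An L1 variant would replace the squared terms in `penalty` with absolute values. In a real training loop the penalty would be differentiated together with the reconstruction term; for plain SGD this is equivalent to weight decay, though that equivalence does not hold for adaptive optimizers such as Adam.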