# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

## Abstract

> Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic [interpretability](https://wiki.g15e.com/pages/Interpretability%20(machine%20learning).txt) team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model.
>
> We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
>
> Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to security vulnerabilities and backdoors in code; bias (including both overt slurs, and more subtle biases); lying, deception, and power-seeking (including treacherous turns); sycophancy; and dangerous / criminal content (e.g., producing bioweapons). However, we caution not to read too much into the mere existence of such features: there's a difference (for example) between knowing about lies, being capable of lying, and actually lying in the real world. This research is also very preliminary. Further work will be needed to understand the implications of these potentially safety-relevant features.

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

## Key results

- Sparse autoencoders produce interpretable features for [large models](https://wiki.g15e.com/pages/Large%20language%20model.txt); a minimal sketch of such an autoencoder follows this list.
- Scaling laws can be used to guide the training of sparse autoencoders.
- The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
- There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
- Features can be used to steer large models (see e.g. Influence on Behavior); a steering sketch also follows below. This extends prior work on steering models using other methods (see Related Work).
- We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.
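
At a technical level, the method is dictionary learning: a sparse autoencoder is trained to reconstruct the model's internal activations through a wide, sparsely activating feature layer. The following is a minimal PyTorch sketch of that idea only, assuming a plain linear encoder/decoder and an L1 sparsity penalty; the class name, dimensions, and `l1_coeff` value are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy dictionary-learning autoencoder over model activations (sketch only)."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # d_features is typically much larger than d_model, so the
        # dictionary is overcomplete.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # Non-negative feature activations; the sparsity penalty below
        # pushes most of them to zero for any given input.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 5.0):
    # Reconstruction error plus an L1 penalty on feature activations,
    # which encourages sparse (and hopefully monosemantic) features.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Hypothetical sizes for illustration; the paper trains dictionaries of
# several different widths on Claude 3 Sonnet activations.
sae = SparseAutoencoder(d_model=4096, d_features=1_000_000)
```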
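The steering result ("Features can be used to steer large models") amounts to adding a multiple of a feature's decoder direction back into the model's residual stream during a forward pass. Here is a hedged sketch reusing the toy `SparseAutoencoder` above; `steer`, `feature_idx`, and `scale` are hypothetical names and values, not the paper's API.

```python
def steer(residual: torch.Tensor, sae: SparseAutoencoder,
          feature_idx: int, scale: float = 10.0) -> torch.Tensor:
    # One feature's decoder direction, shape (d_model,).
    direction = sae.decoder.weight[:, feature_idx]
    # Nudge the residual stream toward the concept that feature encodes.
    return residual + scale * direction
```

In practice this kind of intervention would be applied via a forward hook at the layer the autoencoder was trained on; the sketch only shows the vector arithmetic.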