Transformer Circuits Thread
https://transformer-circuits.pub
Interpretability research on transformer circuits
Last updated: Sun, 05 Apr 2026 03:17:54 +0000

Emotion Concepts and their Function in a Large Language Model
Sofroniew et al., 2026. We find representations of emotion concepts in Claude Sonnet 4.5 and show that they causally influence its outputs.
https://transformer-circuits.pub/2026/emotions/index.html
Wed, 01 Apr 2026

Circuits Cross-Post — Activation Oracles
We train language models to answer questions about their own activations in natural language.
https://alignment.anthropic.com/2025/activation-oracles/
Mon, 01 Dec 2025

Circuits Updates — November 2025
A short update on harm pressure.
https://transformer-circuits.pub/2025/november-update/index.html
Sat, 01 Nov 2025

Circuits Updates — October 2025
Small updates on visual features and dictionary initialization.
https://transformer-circuits.pub/2025/october-update/index.html
Wed, 01 Oct 2025

When Models Manipulate Manifolds: The Geometry of a Counting Task
Gurnee et al., 2025. We find geometric structure underlying the mechanisms of a fundamental language model behavior.
https://transformer-circuits.pub/2025/linebreaks/index.html
Wed, 01 Oct 2025

Emergent Introspective Awareness in Large Language Models
Lindsey, 2025. We find evidence that language models can introspect on their internal states.
https://transformer-circuits.pub/2025/introspection/index.html
Wed, 01 Oct 2025

Circuits Updates — September 2025
A small update on features and in-context learning.
https://transformer-circuits.pub/2025/september-update/index.html
Mon, 01 Sep 2025

Circuits Updates — August 2025
A small update: How does a persona modify the assistant's response?
https://transformer-circuits.pub/2025/august-update/index.html
Fri, 01 Aug 2025

Circuits Updates — July 2025
A collection of small updates: revisiting A Mathematical Framework and applications of interpretability to biology.
https://transformer-circuits.pub/2025/july-update/index.html
Tue, 01 Jul 2025

A Toy Model of Mechanistic (Un)Faithfulness
When transcoders go awry.
https://transformer-circuits.pub/2025/faithfulness-toy-model/index.html
Tue, 01 Jul 2025
Tracing Attention Computation Through Feature Interactions
Kamath et al., 2025. We describe and apply a method to explain attention patterns in terms of feature interactions, and integrate this information into attribution graphs.
https://transformer-circuits.pub/2025/attention-qk/index.html
Tue, 01 Jul 2025

Circuits Updates — April 2025
A collection of small updates: jailbreaks, dense features, and spinning up on interpretability.
https://transformer-circuits.pub/2025/april-update/index.html
Tue, 01 Apr 2025

Circuit Tracing: Revealing Computational Graphs in Language Models
Ameisen et al., 2025. We describe an approach to tracing the "step-by-step" computation involved when a model responds to a single prompt.
https://transformer-circuits.pub/2025/attribution-graphs/methods.html
Sat, 01 Mar 2025

On the Biology of a Large Language Model
Lindsey et al., 2025. We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts.
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Sat, 01 Mar 2025
Circuits Updates — January 2025
A collection of small updates: dictionary learning optimization techniques.
https://transformer-circuits.pub/2025/january-update/index.html
Wed, 01 Jan 2025

A Toy Model of Interference Weights
Unpacking "interference weights" in some more depth.
https://transformer-circuits.pub/2025/interference-weights/index.html
Wed, 01 Jan 2025

Insights on Crosscoder Model Diffing
A preliminary note on using crosscoders to diff models.
https://transformer-circuits.pub/2025/crosscoder-diffing-update/index.html
Wed, 01 Jan 2025

Sparse mixtures of linear transforms
We investigate sparse mixture of linear transforms (MOLT), a new approach to transcoders.
https://transformer-circuits.pub/2025/bulk-update/index.html
Wed, 01 Jan 2025

Progress on Attention
An update on our progress studying attention.
https://transformer-circuits.pub/2025/attention-update/index.html
Wed, 01 Jan 2025

Automated Auditing
A note on using agents to perform automated alignment audits, including using interpretability tools.
https://alignment.anthropic.com/2025/automated-auditing/
Wed, 01 Jan 2025

Using Dictionary Learning Features as Classifiers
A preliminary note comparing feature-based and raw-activation-based harmfulness classifiers.
https://transformer-circuits.pub/2024/features-as-classifiers/index.html
Tue, 01 Oct 2024

Sparse Crosscoders for Cross-Layer Features and Model Diffing
A preliminary note on a way to get consistent features across layers, and even models.
https://transformer-circuits.pub/2024/crosscoders/index.html
Tue, 01 Oct 2024

Circuits Updates — September 2024
A collection of small updates: investigating successor heads, oversampling data in SAEs.
https://transformer-circuits.pub/2024/september-update/index.html
Sun, 01 Sep 2024

Circuits Updates — August 2024
A collection of small updates: interpretability evals, reproducing self-explanation.
https://transformer-circuits.pub/2024/august-update/index.html
Thu, 01 Aug 2024

Circuits Updates — July 2024
A collection of small updates: five hurdles, linear representations, dark matter, pivot tables, feature sensitivity.
https://transformer-circuits.pub/2024/july-update/index.html
Mon, 01 Jul 2024

Circuits Updates — June 2024
A collection of small updates: topk and gated SAE investigation.
https://transformer-circuits.pub/2024/june-update/index.html
Sat, 01 Jun 2024

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton et al., 2024. Using a sparse autoencoder, we extract a large number of interpretable features from Claude 3 Sonnet. Some appear to be safety-relevant.
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Wed, 01 May 2024

Circuits Updates — April 2024
A collection of small updates from the Anthropic Interpretability Team.
https://transformer-circuits.pub/2024/april-update/index.html
Mon, 01 Apr 2024

Circuits Updates — March 2024
A collection of small updates from the Anthropic Interpretability Team.
https://transformer-circuits.pub/2024/march-update/index.html
Fri, 01 Mar 2024

Reflections on Qualitative Research
Some opinionated thoughts on why qualitative aspects may be more central to interpretability research than we're used to in other fields.
https://transformer-circuits.pub/2024/qualitative-essay/index.html
Mon, 01 Jan 2024

Stage-Wise Model Diffing
A preliminary note on model diffing through dictionary fine-tuning.
https://transformer-circuits.pub/2024/model-diffing/index.html
Mon, 01 Jan 2024

Circuits Updates — January 2024
A collection of small updates from the Anthropic Interpretability Team.
https://transformer-circuits.pub/2024/jan-update/index.html
Mon, 01 Jan 2024

Circuits Updates — February 2024
A collection of small updates from the Anthropic Interpretability Team.
https://transformer-circuits.pub/2024/feb-update/index.html
Mon, 01 Jan 2024
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Bricken et al., 2023. Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Sun, 01 Oct 2023

Circuits Updates — July 2023
A collection of small updates from the Anthropic Interpretability Team.
https://transformer-circuits.pub/2023/july-update/index.html
Sat, 01 Jul 2023

Distributed Representations: Composition & Superposition
An informal note on how "distributed representations" might be understood as two different, competing strategies — "composition" and "superposition" — with quite different properties.
https://transformer-circuits.pub/2023/superposition-composition/index.html
Mon, 01 May 2023

Circuits Updates — May 2023
A collection of small updates from the Anthropic Interpretability Team.
https://transformer-circuits.pub/2023/may-update/index.html
Mon, 01 May 2023

Privileged Bases in the Transformer Residual Stream
Our mathematical theories of the Transformer architecture suggest that individual coordinates in the residual stream should have no special significance, but recent work has shown that this prediction is false in practice.
We investigate this phenomenon and provisionally conclude that the per-dimension normalizers in the Adam optimizer are to blame for the effect.
https://transformer-circuits.pub/2023/privileged-basis/index.html
Wed, 01 Mar 2023

Superposition, Memorization, and Double Descent
Henighan et al., 2023. We have little mechanistic understanding of how deep learning models overfit to their training data, despite it being a central problem. Here we extend our previous work on toy models to shed light on how models generalize beyond their training data.
https://transformer-circuits.pub/2023/toy-double-descent/index.html
Sun, 01 Jan 2023

Interpretability Dreams
Our present research aims to create a foundation for mechanistic interpretability research. In doing so, it's important to keep sight of what we're trying to lay the foundations for.
https://transformer-circuits.pub/2023/interpretability-dreams/index.html
Sun, 01 Jan 2023

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases
An informal note on intuitions related to mechanistic interpretability.
https://transformer-circuits.pub/2022/mech-interp-essay/index.html
Wed, 01 Jun 2022

Toy Models of Superposition
Elhage et al., 2022. Neural networks often seem to pack many unrelated concepts into a single neuron — a puzzling phenomenon known as "polysemanticity".
In our latest interpretability work, we build toy models where the origins and dynamics of polysemanticity can be fully understood.
https://transformer-circuits.pub/2022/toy_model/index.html
Sat, 01 Jan 2022

Softmax Linear Units
An alternative activation function increases the fraction of neurons which appear to correspond to human-understandable concepts.
https://transformer-circuits.pub/2022/solu/index.html
Sat, 01 Jan 2022

In-Context Learning and Induction Heads
Olsson et al., 2022. An exploration of the hypothesis that induction heads are the primary mechanism behind in-context learning. We also report the existence of a previously unknown phase change in transformer language models.
https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
Sat, 01 Jan 2022
A Mathematical Framework for Transformer Circuits
Elhage et al., 2021. Our early mathematical framework for reverse engineering models, demonstrated by reverse engineering small toy models.
https://transformer-circuits.pub/2021/framework/index.html
Wed, 01 Dec 2021

Videos
Very rough informal talks as we search for a way to reverse engineer transformers. (links, videos)
https://transformer-circuits.pub/2021/videos/index.html
Fri, 01 Jan 2021

Garcon
A description of our tooling for doing interpretability on large models. (note, infrastructure)
https://transformer-circuits.pub/2021/garcon/index.html
Fri, 01 Jan 2021

PySvelte
One approach to bridging Python and web-based interactive diagrams for interpretability research. (github link, infrastructure)

Exercises
Some exercises we've developed to improve our understanding of how neural networks implement algorithms at the parameter level. (note, exercises)
https://transformer-circuits.pub/2021/exercises/index.html
Fri, 01 Jan 2021
Distill on Hiatus
With Distill on hiatus, we're taking a page from David Ha's approach of simply creating websites (e.g. World Models) for research projects.
https://distill.pub/2021/distill-hiatus/
Fri, 01 Jan 2021

Original Distill Circuits Thread
Our exploration of Transformers builds heavily on the original Circuits thread on Distill.
https://distill.pub/2020/circuits/
Wed, 01 Jan 2020

Activation Atlases
Previously we would have submitted to Distill, but with Distill on hiatus, we're taking a page from David Ha's approach of simply creating websites (e.g. World Models) for research projects.
https://distill.pub/2019/activation-atlas/
Tue, 01 Jan 2019