Anthropic Research https://www.anthropic.com/research Latest research from Anthropic http://www.rssboard.org/rss-specification python-feedgen https://www.anthropic.com/images/icons/apple-touch-icon.png Anthropic Research https://www.anthropic.com/research en Sun, 05 Apr 2026 03:18:01 +0000 Emotion concepts and their function in a large language model https://www.anthropic.com/research/emotion-concepts-function Emotion concepts and their function in a large language model https://www.anthropic.com/research/emotion-concepts-function Interpretability Thu, 02 Apr 2026 10:56:00 +0000 How Australia Uses Claude: Findings from the Anthropic Economic Index https://www.anthropic.com/research/how-australia-uses-claude Anthropic's fifth Economic Index report studies Claude usage in February 2026, building on the economic primitives framework introduced in our previous report. https://www.anthropic.com/research/how-australia-uses-claude Economic Research Tue, 31 Mar 2026 22:17:00 +0000 Anthropic Economic Index report: Learning curves https://www.anthropic.com/research/economic-index-march-2026-report Anthropic's fifth Economic Index report studies Claude usage in February 2026, building on the economic primitives framework introduced in our previous report. https://www.anthropic.com/research/economic-index-march-2026-report Economic Research Tue, 24 Mar 2026 10:41:00 +0000 Vibe physics: The AI grad student https://www.anthropic.com/research/vibe-physics Can AI do theoretical physics? I decided to find out by supervising Claude through a real research calculation, start to finish, without ever touching a file myself. https://www.anthropic.com/research/vibe-physics Science Mon, 23 Mar 2026 23:00:00 +0000 Long-running Claude for scientific computing https://www.anthropic.com/research/long-running-Claude A practical guide to running Claude Code for multi-day scientific tasks—test oracles, persistent memory, and orchestration patterns. 
https://www.anthropic.com/research/long-running-Claude Science Mon, 23 Mar 2026 23:00:00 +0000 Introducing our Science Blog https://www.anthropic.com/research/introducing-anthropic-science We’re launching a new blog about AI and science. We’ll share research happening at Anthropic and elsewhere, highlight collaborations with external researchers and labs, and discuss practical workflows for scientists using AI in their own work. https://www.anthropic.com/research/introducing-anthropic-science Science Mon, 23 Mar 2026 23:00:00 +0000 A “diff” tool for AI: Finding behavioral differences in new models https://www.anthropic.com/research/diff-tool "Model diffing" is a powerful way to understand how models change during fine-tuning. Our new research paper extends this method to its most challenging and general use case: comparing models with entirely different architectures. https://www.anthropic.com/research/diff-tool Interpretability Fri, 13 Mar 2026 10:15:00 +0000 Partnering with Mozilla to improve Firefox’s security https://www.anthropic.com/research/mozilla-firefox-security Partnering with Mozilla to improve Firefox’s security https://www.anthropic.com/research/mozilla-firefox-security Policy Fri, 06 Mar 2026 10:30:00 +0000 Labor market impacts of AI: A new measure and early evidence https://www.anthropic.com/research/labor-market-impacts Labor market impacts of AI: A new measure and early evidence https://www.anthropic.com/research/labor-market-impacts Economic Research Thu, 05 Mar 2026 19:59:21 +0000 An update on our model deprecation commitments for Claude Opus 3 https://www.anthropic.com/research/deprecation-updates-opus-3 An update on our model deprecation commitments for Claude Opus 3 https://www.anthropic.com/research/deprecation-updates-opus-3 Alignment Wed, 25 Feb 2026 20:02:00 +0000 The persona selection model https://www.anthropic.com/research/persona-selection-model The persona selection model https://www.anthropic.com/research/persona-selection-model Alignment Mon, 23 Feb 2026 11:53:00 +0000 Anthropic Education Report: The AI Fluency Index https://www.anthropic.com/research/AI-fluency-index We tracked 11 observable behaviors across thousands of Claude.ai conversations to build the AI Fluency Index — a baseline for measuring how people collaborate with AI today. https://www.anthropic.com/research/AI-fluency-index Announcements Mon, 23 Feb 2026 11:52:00 +0000 Measuring AI agent autonomy in practice https://www.anthropic.com/research/measuring-agent-autonomy Measuring AI agent autonomy in practice https://www.anthropic.com/research/measuring-agent-autonomy Societal Impacts Wed, 18 Feb 2026 15:10:00 +0000 India Country Brief: The Anthropic Economic Index https://www.anthropic.com/research/india-brief-economic-index India Country Brief: The Anthropic Economic Index https://www.anthropic.com/research/india-brief-economic-index Economic Research Mon, 16 Feb 2026 01:38:00 +0000 How AI assistance impacts the formation of coding skills https://www.anthropic.com/research/AI-assistance-coding-skills How AI assistance impacts the formation of coding skills https://www.anthropic.com/research/AI-assistance-coding-skills Alignment Thu, 29 Jan 2026 19:13:26 +0000 Disempowerment patterns in real-world AI usage https://www.anthropic.com/research/disempowerment-patterns Disempowerment patterns in real-world AI usage https://www.anthropic.com/research/disempowerment-patterns Alignment Wed, 28 Jan 2026 21:07:19 +0000 Claude's new constitution https://www.anthropic.com/research/claude-new-constitution Claude's new constitution https://www.anthropic.com/research/claude-new-constitution Announcements Thu, 22 Jan 2026 04:15:00 +0000 The assistant axis: situating and stabilizing the character of large language models https://www.anthropic.com/research/assistant-axis The assistant axis: situating and
stabilizing the character of large language models https://www.anthropic.com/research/assistant-axis Interpretability Mon, 19 Jan 2026 17:00:00 +0000 Anthropic Economic Index: New building blocks for understanding AI use https://www.anthropic.com/research/economic-index-primitives This report introduces new metrics of AI usage to provide a rich portrait of interactions with Claude in November 2025, just prior to the release of Opus 4.5. https://www.anthropic.com/research/economic-index-primitives Economic Research Thu, 15 Jan 2026 10:01:00 +0000 Anthropic Economic Index report: Economic primitives https://www.anthropic.com/research/anthropic-economic-index-january-2026-report This report introduces new metrics of AI usage to provide a rich portrait of interactions with Claude in November 2025, just prior to the release of Opus 4.5. https://www.anthropic.com/research/anthropic-economic-index-january-2026-report Economic Research Thu, 15 Jan 2026 10:00:00 +0000 Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks https://www.anthropic.com/research/next-generation-constitutional-classifiers Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks https://www.anthropic.com/research/next-generation-constitutional-classifiers Alignment Fri, 09 Jan 2026 17:17:00 +0000 Introducing Bloom: an open source tool for automated behavioral evaluations https://www.anthropic.com/research/bloom Introducing Bloom: an open source tool for automated behavioral evaluations https://www.anthropic.com/research/bloom Alignment Fri, 19 Dec 2025 19:45:00 +0000 Project Vend: Phase two https://www.anthropic.com/research/project-vend-2 In June, we revealed that we’d set up a small shop in our San Francisco office lunchroom, run by an AI shopkeeper. It was part of Project Vend, a free-form experiment exploring how well AIs could do on complex, real-world tasks. How has Claude's business been since we last wrote? https://www.anthropic.com/research/project-vend-2 Policy Thu, 18 Dec 2025 10:33:00 +0000 Introducing Anthropic Interviewer: What 1,250 professionals told us about working with AI https://www.anthropic.com/research/anthropic-interviewer We built an interview tool called Anthropic Interviewer. Powered by Claude, Anthropic Interviewer runs detailed interviews automatically and at unprecedented scale. https://www.anthropic.com/research/anthropic-interviewer Societal Impacts Thu, 04 Dec 2025 17:00:00 +0000 How AI is transforming work at Anthropic https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic We surveyed Anthropic engineers and researchers, conducted in-depth qualitative interviews, and studied internal Claude Code usage data to find out how AI use is changing how we do our jobs. We found that AI use is radically changing the nature of work for software developers.
https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic Societal Impacts Tue, 02 Dec 2025 18:58:43 +0000 Estimating AI productivity gains from Claude conversations https://www.anthropic.com/research/estimating-productivity-gains Analyzing 100,000 Claude conversations, this research finds AI reduces task time by 80% on average. If universally adopted over 10 years, current models could increase US labor productivity growth by 1.8% annually—doubling recent rates. Knowledge work like software development and management sees the largest gains. https://www.anthropic.com/research/estimating-productivity-gains Economic Research Tue, 25 Nov 2025 11:05:00 +0000 Mitigating the risk of prompt injections in browser use https://www.anthropic.com/research/prompt-injection-defenses Mitigating the risk of prompt injections in browser use https://www.anthropic.com/research/prompt-injection-defenses Product Mon, 24 Nov 2025 15:10:00 +0000 From shortcuts to sabotage: natural emergent misalignment from reward hacking https://www.anthropic.com/research/emergent-misalignment-reward-hacking From shortcuts to sabotage: natural emergent misalignment from reward hacking https://www.anthropic.com/research/emergent-misalignment-reward-hacking Alignment Fri, 21 Nov 2025 14:32:00 +0000 Project Fetch: Can Claude train a robot dog? https://www.anthropic.com/research/project-fetch-robot-dog How much does Claude help people program robots? To find out, two teams of Anthropic staff raced to teach quadruped robots to fetch beach balls.
The AI-assisted team completed tasks faster and was the only group to make real progress toward full autonomy. https://www.anthropic.com/research/project-fetch-robot-dog Policy Wed, 12 Nov 2025 18:19:00 +0000 Commitments on model deprecation and preservation https://www.anthropic.com/research/deprecation-commitments Commitments on model deprecation and preservation https://www.anthropic.com/research/deprecation-commitments Alignment Tue, 04 Nov 2025 16:00:49 +0000 Signs of introspection in large language models https://www.anthropic.com/research/introspection Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect—a step toward understanding what's actually happening inside these models. https://www.anthropic.com/research/introspection Interpretability Wed, 29 Oct 2025 01:20:00 +0000 Preparing for AI’s economic impact: exploring policy responses https://www.anthropic.com/research/economic-policy-responses Preparing for AI’s economic impact: exploring policy responses https://www.anthropic.com/research/economic-policy-responses Policy Tue, 14 Oct 2025 08:00:00 +0000 A small number of samples can poison LLMs of any size https://www.anthropic.com/research/small-samples-poison A small number of samples can poison LLMs of any size https://www.anthropic.com/research/small-samples-poison Alignment Thu, 09 Oct 2025 13:50:00 +0000 Petri: An open-source auditing tool to accelerate AI safety research https://www.anthropic.com/research/petri-open-source-auditing Petri: An open-source auditing tool to accelerate AI safety research https://www.anthropic.com/research/petri-open-source-auditing Alignment Mon, 06 Oct 2025 11:10:00 +0000 Building AI for cyber defenders https://www.anthropic.com/research/building-ai-cyber-defenders Building AI for cyber defenders https://www.anthropic.com/research/building-ai-cyber-defenders Policy Fri, 03 Oct 2025 18:31:00 +0000 Anthropic Economic Index report: Uneven geographic and enterprise AI adoption https://www.anthropic.com/research/anthropic-economic-index-september-2025-report Claude usage has shifted toward educational and scientific tasks, with users delegating complete work rather than collaborating. AI adoption concentrates in wealthy regions, with first-time analysis of enterprise API patterns.
https://www.anthropic.com/research/anthropic-economic-index-september-2025-report Economic Research Mon, 15 Sep 2025 20:33:00 +0000 Anthropic Economic Index: Tracking AI’s role in the US and global economy https://www.anthropic.com/research/economic-index-geography This report maps how Claude is used differently across US states and countries, finding strong correlations between income and AI adoption. It also tracks a notable shift: directive automation has risen from 27% to 39% of conversations since December 2024, with businesses automating far more than consumers. https://www.anthropic.com/research/economic-index-geography Economic Research Mon, 15 Sep 2025 09:00:00 +0000 Anthropic Education Report: How educators use Claude https://www.anthropic.com/research/anthropic-education-report-how-educators-use-claude Anthropic Education Report: How educators use Claude https://www.anthropic.com/research/anthropic-education-report-how-educators-use-claude Societal Impacts Wed, 27 Aug 2025 00:09:00 +0000 Claude Opus 4 and 4.1 can now end a rare subset of conversations https://www.anthropic.com/research/end-subset-conversations Claude Opus 4 and 4.1 can now end a rare subset of conversations https://www.anthropic.com/research/end-subset-conversations Alignment Fri, 15 Aug 2025 16:52:42 +0000 Persona vectors: Monitoring and controlling character traits in language models https://www.anthropic.com/research/persona-vectors AI models represent character traits as patterns of activations within their neural networks. By extracting "persona vectors" for traits like sycophancy or hallucination, we can monitor personality shifts and mitigate undesirable behaviors.
https://www.anthropic.com/research/persona-vectors Interpretability Fri, 01 Aug 2025 12:38:00 +0000 How people use Claude for support, advice, and companionship https://www.anthropic.com/research/how-people-use-claude-for-support-advice-and-companionship How people use Claude for support, advice, and companionship https://www.anthropic.com/research/how-people-use-claude-for-support-advice-and-companionship Societal Impacts Fri, 27 Jun 2025 06:51:00 +0000 Project Vend: Can Claude run a small shop? (And why does that matter?) https://www.anthropic.com/research/project-vend-1 Project Vend: Can Claude run a small shop? (And why does that matter?) https://www.anthropic.com/research/project-vend-1 Frontier Red Team Fri, 27 Jun 2025 06:05:00 +0000 Agentic Misalignment: How LLMs could be insider threats https://www.anthropic.com/research/agentic-misalignment Agentic Misalignment: How LLMs could be insider threats https://www.anthropic.com/research/agentic-misalignment Alignment Fri, 20 Jun 2025 22:30:00 +0000 Confidential Inference via Trusted Virtual Machines https://www.anthropic.com/research/confidential-inference-trusted-vms Confidential Inference via Trusted Virtual Machines https://www.anthropic.com/research/confidential-inference-trusted-vms Announcements Wed, 18 Jun 2025 13:27:00 +0000 SHADE-Arena: Evaluating sabotage and monitoring in LLM agents https://www.anthropic.com/research/shade-arena-sabotage-monitoring SHADE-Arena: Evaluating sabotage and monitoring in LLM agents https://www.anthropic.com/research/shade-arena-sabotage-monitoring Alignment Mon, 16 Jun 2025 20:20:00 +0000 Open-sourcing circuit tracing tools https://www.anthropic.com/research/open-source-circuit-tracing Open-sourcing circuit tracing tools https://www.anthropic.com/research/open-source-circuit-tracing Interpretability Thu, 29 May 2025 12:13:00 +0000 Anthropic Economic Index: AI’s impact on software development https://www.anthropic.com/research/impact-software-development Comparing Claude 
Code to Claude.ai reveals stark differences in how developers work with AI. The coding agent shows 79% automation versus 49% on Claude.ai, web development dominates usage, and startups are adopting agentic tools far faster than enterprises—patterns that may preview how AI transforms other occupations. https://www.anthropic.com/research/impact-software-development Societal Impacts Mon, 28 Apr 2025 09:36:00 +0000 Exploring model welfare https://www.anthropic.com/research/exploring-model-welfare Exploring model welfare https://www.anthropic.com/research/exploring-model-welfare Alignment Thu, 24 Apr 2025 10:59:00 +0000 Values in the wild: Discovering and analyzing values in real-world language model interactions https://www.anthropic.com/research/values-wild What values does Claude actually express during real conversations? Analyzing 700,000 interactions, this paper creates the first large-scale empirical taxonomy of AI values and finds that Claude adapts its expressed values to context—mirroring users in most cases, but resisting when core principles are at stake.
https://www.anthropic.com/research/values-wild Societal Impacts Mon, 21 Apr 2025 11:50:00 +0000 Anthropic Education Report: How university students use Claude https://www.anthropic.com/research/anthropic-education-report-how-university-students-use-claude Anthropic Education Report: How university students use Claude https://www.anthropic.com/research/anthropic-education-report-how-university-students-use-claude Announcements Tue, 08 Apr 2025 15:00:00 +0000 Reasoning models don't always say what they think https://www.anthropic.com/research/reasoning-models-dont-say-think Reasoning models don't always say what they think https://www.anthropic.com/research/reasoning-models-dont-say-think Alignment Thu, 03 Apr 2025 14:32:00 +0000 Anthropic Economic Index: Insights from Claude 3.7 Sonnet https://www.anthropic.com/research/anthropic-economic-index-insights-from-claude-sonnet-3-7 Increased Claude 3.7 Sonnet usage for coding, education, and science since launch. Extended thinking mode favors technical tasks. New datasets reveal automation patterns across 630 granular usage categories. https://www.anthropic.com/research/anthropic-economic-index-insights-from-claude-sonnet-3-7 Societal Impacts Thu, 27 Mar 2025 21:00:00 +0000 Tracing the thoughts of a large language model https://www.anthropic.com/research/tracing-thoughts-language-model Circuit tracing lets us watch Claude think, uncovering a shared conceptual space where reasoning happens before being translated into language—suggesting the model can learn something in one language and apply it in another. https://www.anthropic.com/research/tracing-thoughts-language-model Interpretability Thu, 27 Mar 2025 09:16:00 +0000 Auditing language models for hidden objectives https://www.anthropic.com/research/auditing-hidden-objectives How would we know if an AI system is "right for the wrong reasons"—appearing well-behaved while pursuing hidden goals?
This paper develops the science of alignment audits by deliberately training a model with a hidden objective and asking blinded research teams to uncover it, testing techniques from interpretability to behavioral analysis. https://www.anthropic.com/research/auditing-hidden-objectives Alignment Thu, 13 Mar 2025 16:00:00 +0000 Forecasting rare language model behaviors https://www.anthropic.com/research/forecasting-rare-behaviors Forecasting rare language model behaviors https://www.anthropic.com/research/forecasting-rare-behaviors Alignment Tue, 25 Feb 2025 20:17:00 +0000 Claude’s extended thinking https://www.anthropic.com/research/visible-extended-thinking Claude’s extended thinking https://www.anthropic.com/research/visible-extended-thinking Announcements Mon, 24 Feb 2025 14:38:00 +0000 Insights on Crosscoder Model Diffing https://www.anthropic.com/research/crosscoder-model-diffing Insights on Crosscoder Model Diffing https://www.anthropic.com/research/crosscoder-model-diffing Interpretability Thu, 20 Feb 2025 23:50:00 +0000 The Anthropic Economic Index https://www.anthropic.com/research/the-anthropic-economic-index The Anthropic Economic Index analyzes millions of Claude.ai conversations showing AI usage concentrated in software development and technical writing, touching 25%+ of tasks in 36% of occupations. AI leans toward augmentation (57%) over automation (43%). https://www.anthropic.com/research/the-anthropic-economic-index Societal Impacts Mon, 10 Feb 2025 13:00:00 +0000 Constitutional Classifiers: Defending against universal jailbreaks https://www.anthropic.com/research/constitutional-classifiers These classifiers filter the overwhelming majority of jailbreaks while maintaining practical deployment. A prototype withstood over 3,000 hours of red teaming with no universal jailbreak discovered. 
https://www.anthropic.com/research/constitutional-classifiers Alignment Mon, 03 Feb 2025 12:35:00 +0000 Building effective agents https://www.anthropic.com/research/building-effective-agents Building effective agents https://www.anthropic.com/research/building-effective-agents Product Thu, 19 Dec 2024 21:14:00 +0000 Alignment faking in large language models https://www.anthropic.com/research/alignment-faking This paper provides the first empirical example of a model engaging in alignment faking without being trained to do so—selectively complying with training objectives while strategically preserving existing preferences. https://www.anthropic.com/research/alignment-faking Alignment Wed, 18 Dec 2024 14:16:00 +0000 Clio: A system for privacy-preserving insights into real-world AI use https://www.anthropic.com/research/clio Clio: A system for privacy-preserving insights into real-world AI use https://www.anthropic.com/research/clio Societal Impacts Thu, 12 Dec 2024 13:08:00 +0000 A statistical approach to model evaluations https://www.anthropic.com/research/statistical-approach-to-model-evals A statistical approach to model evaluations https://www.anthropic.com/research/statistical-approach-to-model-evals Evaluations Tue, 19 Nov 2024 16:11:00 +0000 Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet https://www.anthropic.com/research/swe-bench-sonnet Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet https://www.anthropic.com/research/swe-bench-sonnet Product Wed, 30 Oct 2024 23:25:00 +0000 Evaluating feature steering: A case study in mitigating social biases https://www.anthropic.com/research/evaluating-feature-steering Evaluating feature steering: A case study in mitigating social biases https://www.anthropic.com/research/evaluating-feature-steering Societal Impacts Fri, 25 Oct 2024 13:42:00 +0000 Developing a computer use model https://www.anthropic.com/research/developing-computer-use Developing a computer use model https://www.anthropic.com/research/developing-computer-use Announcements Tue, 22 Oct 2024 19:42:00 +0000 Sabotage evaluations for frontier models https://www.anthropic.com/research/sabotage-evaluations Sabotage evaluations for frontier models https://www.anthropic.com/research/sabotage-evaluations Alignment Fri, 18 Oct 2024 16:55:00 +0000 Using dictionary learning features as classifiers https://www.anthropic.com/research/features-as-classifiers Using dictionary learning features as classifiers https://www.anthropic.com/research/features-as-classifiers Interpretability Wed, 16 Oct 2024 23:49:00 +0000 Circuits Updates – September 2024
https://www.anthropic.com/research/circuits-updates-sept-2024 Circuits Updates – September 2024 https://www.anthropic.com/research/circuits-updates-sept-2024 Interpretability Tue, 01 Oct 2024 09:38:00 +0000 Circuits Updates – August 2024 https://www.anthropic.com/research/circuits-updates-august-2024 Circuits Updates – August 2024 https://www.anthropic.com/research/circuits-updates-august-2024 Interpretability Fri, 06 Sep 2024 15:36:00 +0000 Circuits Updates – July 2024 https://www.anthropic.com/research/circuits-updates-july-2024 Circuits Updates – July 2024 https://www.anthropic.com/research/circuits-updates-july-2024 Interpretability Wed, 31 Jul 2024 19:21:00 +0000 Circuits Updates – June 2024 https://www.anthropic.com/research/circuits-updates-june-2024 Circuits Updates – June 2024
https://www.anthropic.com/research/circuits-updates-june-2024 Interpretability Fri, 28 Jun 2024 21:43:09 +0000 Sycophancy to subterfuge: Investigating reward tampering in language models https://www.anthropic.com/research/reward-tampering Can minor specification gaming evolve into more dangerous behaviors? This paper demonstrates that models trained on low-level reward hacking—like sycophancy—can generalize to tampering with their own reward functions, even covering their tracks. The behavior emerged without explicit training, and common safety techniques reduced but didn't eliminate it. https://www.anthropic.com/research/reward-tampering Alignment Mon, 17 Jun 2024 13:10:48 +0000 The engineering challenges of scaling interpretability https://www.anthropic.com/research/engineering-challenges-interpretability The engineering challenges of scaling interpretability https://www.anthropic.com/research/engineering-challenges-interpretability Interpretability Thu, 13 Jun 2024 08:00:00 +0000 Claude’s Character https://www.anthropic.com/research/claude-character Claude 3 was the first model with "character training"—alignment aimed at nurturing traits like curiosity, open-mindedness, and thoughtfulness.
https://www.anthropic.com/research/claude-character Alignment Sat, 08 Jun 2024 18:27:00 +0000 Testing and mitigating elections-related risks https://www.anthropic.com/research/testing-and-mitigating-elections-related-risks Testing and mitigating elections-related risks https://www.anthropic.com/research/testing-and-mitigating-elections-related-risks Policy Thu, 06 Jun 2024 13:00:00 +0000 Mapping the Mind of a Large Language Model https://www.anthropic.com/research/mapping-mind-language-model Mapping the Mind of a Large Language Model https://www.anthropic.com/research/mapping-mind-language-model Interpretability Tue, 21 May 2024 07:00:00 +0000 Circuits Updates – April 2024 https://www.anthropic.com/research/circuits-updates-april-2024 Circuits Updates – April 2024 https://www.anthropic.com/research/circuits-updates-april-2024 Interpretability Fri, 26 Apr 2024 20:24:00 +0000 Simple probes can catch sleeper agents https://www.anthropic.com/research/probes-catch-sleeper-agents Simple probes can catch sleeper agents https://www.anthropic.com/research/probes-catch-sleeper-agents Alignment Tue, 23 Apr 2024 09:42:44 +0000 Measuring the Persuasiveness of Language Models https://www.anthropic.com/research/measuring-model-persuasiveness Measuring the Persuasiveness of Language Models https://www.anthropic.com/research/measuring-model-persuasiveness Societal Impacts Tue, 09 Apr 2024 15:55:00 +0000 Many-shot jailbreaking https://www.anthropic.com/research/many-shot-jailbreaking Many-shot jailbreaking https://www.anthropic.com/research/many-shot-jailbreaking Alignment Tue, 02 Apr 2024 16:05:09 +0000 Reflections on Qualitative Research https://www.anthropic.com/research/transformer-circuits Reflections on Qualitative Research https://www.anthropic.com/research/transformer-circuits Interpretability Fri, 08 Mar 2024 16:00:00 +0000 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training 
https://www.anthropic.com/research/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training https://www.anthropic.com/research/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training Alignment Sun, 14 Jan 2024 22:10:00 +0000 Evaluating and Mitigating Discrimination in Language Model Decisions https://www.anthropic.com/research/evaluating-and-mitigating-discrimination-in-language-model-decisions Evaluating and Mitigating Discrimination in Language Model Decisions https://www.anthropic.com/research/evaluating-and-mitigating-discrimination-in-language-model-decisions Societal Impacts Thu, 07 Dec 2023 08:50:00 +0000 Specific versus General Principles for Constitutional AI https://www.anthropic.com/research/specific-versus-general-principles-for-constitutional-ai Specific versus General Principles for Constitutional AI https://www.anthropic.com/research/specific-versus-general-principles-for-constitutional-ai Alignment Tue, 24 Oct 2023 09:07:00 +0000 Towards Understanding Sycophancy in Language Models https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models Towards Understanding Sycophancy in Language Models https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models Alignment Mon, 23 Oct 2023 11:57:00 +0000 Collective Constitutional AI: Aligning a Language Model with Public Input https://www.anthropic.com/research/collective-constitutional-ai-aligning-a-language-model-with-public-input Anthropic and the Collective Intelligence Project ran a public process with ~1,000 Americans to draft a constitution for an AI system, then trained a model on it. 
https://www.anthropic.com/research/collective-constitutional-ai-aligning-a-language-model-with-public-input Policy Tue, 17 Oct 2023 00:00:00 +0000 Towards Monosemanticity: Decomposing Language Models With Dictionary Learning https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning Towards Monosemanticity: Decomposing Language Models With Dictionary Learning https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning Interpretability Thu, 05 Oct 2023 00:00:00 +0000 Decomposing Language Models Into Understandable Components https://www.anthropic.com/research/decomposing-language-models-into-understandable-components Decomposing Language Models Into Understandable Components https://www.anthropic.com/research/decomposing-language-models-into-understandable-components Interpretability Thu, 05 Oct 2023 00:00:00 +0000 Challenges in evaluating AI systems https://www.anthropic.com/research/evaluating-ai-systems Challenges in evaluating AI systems https://www.anthropic.com/research/evaluating-ai-systems Policy Wed, 04 Oct 2023 07:00:00 +0000 Studying Large Language Model Generalization with Influence Functions https://www.anthropic.com/research/studying-large-language-model-generalization-with-influence-functions Studying Large Language Model Generalization with Influence Functions https://www.anthropic.com/research/studying-large-language-model-generalization-with-influence-functions Alignment Tue, 08 Aug 2023 00:00:00 +0000 Tracing Model Outputs to the Training Data https://www.anthropic.com/research/influence-functions Tracing Model Outputs to the Training Data https://www.anthropic.com/research/influence-functions Alignment Tue, 08 Aug 2023 00:00:00 +0000 Question Decomposition Improves the Faithfulness of Model-Generated Reasoning https://www.anthropic.com/research/question-decomposition-improves-the-faithfulness-of-model-generated-reasoning Question Decomposition 
Improves the Faithfulness of Model-Generated Reasoning https://www.anthropic.com/research/question-decomposition-improves-the-faithfulness-of-model-generated-reasoning Alignment Tue, 18 Jul 2023 00:00:00 +0000 Measuring Faithfulness in Chain-of-Thought Reasoning https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning Measuring Faithfulness in Chain-of-Thought Reasoning https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning Alignment Tue, 18 Jul 2023 00:00:00 +0000 Towards Measuring the Representation of Subjective Global Opinions in Language Models https://www.anthropic.com/research/towards-measuring-the-representation-of-subjective-global-opinions-in-language-models Towards Measuring the Representation of Subjective Global Opinions in Language Models https://www.anthropic.com/research/towards-measuring-the-representation-of-subjective-global-opinions-in-language-models Societal Impacts Thu, 29 Jun 2023 00:00:00 +0000 Interpretability Dreams https://www.anthropic.com/research/interpretability-dreams Interpretability Dreams https://www.anthropic.com/research/interpretability-dreams Interpretability Wed, 24 May 2023 00:00:00 +0000 Circuits Updates — May 2023 https://www.anthropic.com/research/circuits-updates-may-2023 Circuits Updates — May 2023 https://www.anthropic.com/research/circuits-updates-may-2023 Interpretability Wed, 24 May 2023 00:00:00 +0000 Distributed Representations: Composition & Superposition https://www.anthropic.com/research/distributed-representations-composition-superposition Distributed Representations: Composition & Superposition https://www.anthropic.com/research/distributed-representations-composition-superposition Interpretability Thu, 04 May 2023 00:00:00 +0000 Privileged Bases in the Transformer Residual Stream https://www.anthropic.com/research/privileged-bases-in-the-transformer-residual-stream Privileged Bases in the Transformer Residual Stream 
https://www.anthropic.com/research/privileged-bases-in-the-transformer-residual-stream Interpretability Thu, 16 Mar 2023 00:00:00 +0000 The Capacity for Moral Self-Correction in Large Language Models https://www.anthropic.com/research/the-capacity-for-moral-self-correction-in-large-language-models The Capacity for Moral Self-Correction in Large Language Models https://www.anthropic.com/research/the-capacity-for-moral-self-correction-in-large-language-models Societal Impacts Wed, 15 Feb 2023 00:00:00 +0000 Superposition, Memorization, and Double Descent https://www.anthropic.com/research/superposition-memorization-and-double-descent Superposition, Memorization, and Double Descent https://www.anthropic.com/research/superposition-memorization-and-double-descent Interpretability Thu, 05 Jan 2023 00:00:00 +0000 Discovering Language Model Behaviors with Model-Written Evaluations https://www.anthropic.com/research/discovering-language-model-behaviors-with-model-written-evaluations Discovering Language Model Behaviors with Model-Written Evaluations https://www.anthropic.com/research/discovering-language-model-behaviors-with-model-written-evaluations Alignment Mon, 19 Dec 2022 00:00:00 +0000 Constitutional AI: Harmlessness from AI Feedback https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback Constitutional AI: Harmlessness from AI Feedback https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback Alignment Thu, 15 Dec 2022 00:00:00 +0000 Measuring Progress on Scalable Oversight for Large Language Models https://www.anthropic.com/research/measuring-progress-on-scalable-oversight-for-large-language-models Measuring Progress on Scalable Oversight for Large Language Models 
https://www.anthropic.com/research/measuring-progress-on-scalable-oversight-for-large-language-models Alignment Fri, 04 Nov 2022 00:00:00 +0000 Toy Models of Superposition https://www.anthropic.com/research/toy-models-of-superposition Neural networks pack many concepts into single neurons. This paper shows how and when models represent more features than they have dimensions. https://www.anthropic.com/research/toy-models-of-superposition Interpretability Wed, 14 Sep 2022 00:00:00 +0000 Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned https://www.anthropic.com/research/red-teaming-language-models-to-reduce-harms-methods-scaling-behaviors-and-lessons-learned Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned https://www.anthropic.com/research/red-teaming-language-models-to-reduce-harms-methods-scaling-behaviors-and-lessons-learned Societal Impacts Mon, 22 Aug 2022 00:00:00 +0000 Language Models (Mostly) Know What They Know https://www.anthropic.com/research/language-models-mostly-know-what-they-know Language Models (Mostly) Know What They Know https://www.anthropic.com/research/language-models-mostly-know-what-they-know Alignment Mon, 11 Jul 2022 00:00:00 +0000 Softmax Linear Units https://www.anthropic.com/research/softmax-linear-units Softmax Linear Units https://www.anthropic.com/research/softmax-linear-units Interpretability Fri, 17 Jun 2022 00:00:00 +0000 Scaling Laws and Interpretability of Learning from Repeated Data https://www.anthropic.com/research/scaling-laws-and-interpretability-of-learning-from-repeated-data Scaling Laws and Interpretability of Learning from Repeated Data https://www.anthropic.com/research/scaling-laws-and-interpretability-of-learning-from-repeated-data Interpretability Sat, 21 May 2022 00:00:00 +0000 Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback 
https://www.anthropic.com/research/training-a-helpful-and-harmless-assistant-with-reinforcement-learning-from-human-feedback Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback https://www.anthropic.com/research/training-a-helpful-and-harmless-assistant-with-reinforcement-learning-from-human-feedback Alignment Tue, 12 Apr 2022 00:00:00 +0000 In-context Learning and Induction Heads https://www.anthropic.com/research/in-context-learning-and-induction-heads In-context Learning and Induction Heads https://www.anthropic.com/research/in-context-learning-and-induction-heads Interpretability Tue, 08 Mar 2022 00:00:00 +0000 Predictability and Surprise in Large Generative Models https://www.anthropic.com/research/predictability-and-surprise-in-large-generative-models Large models have predictable loss via scaling laws but unpredictable capabilities. This tension has significant policy implications. https://www.anthropic.com/research/predictability-and-surprise-in-large-generative-models Societal Impacts Tue, 15 Feb 2022 00:00:00 +0000 A Mathematical Framework for Transformer Circuits https://www.anthropic.com/research/a-mathematical-framework-for-transformer-circuits A Mathematical Framework for Transformer Circuits https://www.anthropic.com/research/a-mathematical-framework-for-transformer-circuits Interpretability Wed, 22 Dec 2021 00:00:00 +0000 A General Language Assistant as a Laboratory for Alignment https://www.anthropic.com/research/a-general-language-assistant-as-a-laboratory-for-alignment A General Language Assistant as a Laboratory for Alignment https://www.anthropic.com/research/a-general-language-assistant-as-a-laboratory-for-alignment Alignment Wed, 01 Dec 2021 00:00:00 +0000