Claude's Safety Lives on Anthropic's Servers

A federal court asked if Anthropic could control deployed Claude. The answer was no.

Share

subject: Claude's Safety Lives on Anthropic's Servers preview: A federal court asked if Anthropic could control deployed Claude. The answer was no. status: revised pillar: the-fine-print

Metadata

  • Topic: Anthropic's "no kill switch" admission and the structural gap in Constitutional AI safety claims
  • Content Pillar: The Fine Print
  • Sources: DC Circuit per curiam order (April 8, 2026); Anthropic Petitioner's Opening Brief (April 22, 2026); Ramasamy NDCA declaration (March 21, 2026); Anthropic RSP v3.1; Anthropic "Building safeguards for Claude"; Anthropic Claude's Constitution; CRS Report IN12669; CNBC; CNN; AP; TechCrunch
  • Status: revised
  • Date Created: 2026-04-25
  • Subject Line: Claude's Safety Lives on Anthropic's Servers
  • Preview Text: A federal court asked if Anthropic could control deployed Claude. The answer was no.

Introduction

Anthropic has spent four years telling you its AI comes with safety built in. In March, its VP of Public Sector told a federal court under oath that once you deploy Claude yourself, Anthropic has no back door, no kill switch. Those describe two different products that share a name: the app on Anthropic's servers and the static model that every enterprise, hospital, and government agency actually deploys.

What Got Said in Court

On April 22, 2026, Anthropic filed a 12,890-word opening brief in its DC Circuit appeal against the Department of War. The brief was the company's response to a per curiam order from April 8, in which a panel of three judges ordered both parties to address a specific question: "(3) whether, and if so how, Anthropic is able to affect the functioning of its artificial-intelligence models before or after the models, or updates to them, are delivered to the Department."

The question was a direct prompt about technical control, and Anthropic's lawyers answered it directly. The company has "no back door or remote kill switch." Its personnel "cannot log into a department system to modify or disable a running model." Claude as supplied is a "static" model that "does not degrade or change on its own, and Anthropic cannot push undisclosed or unsanctioned changes to a model after the department has deployed it." That language wasn't invented for the brief. It first appeared in a sworn declaration by Anthropic VP of Public Sector Thiyagu Ramasamy filed in the Northern District of California on March 21, 2026, as reported by TechCrunch. Ramasamy's exact wording: "Anthropic does not maintain any back door or remote 'kill switch.'" The April brief restates the same sworn position to a second court, more than a month later.

So Anthropic's "safe AI" brand describes a managed cloud service, not a model, and every enterprise customer who deployed Claude on their own infrastructure was never told the difference.

The Marketing Side of the Ledger

Twenty days before the DC Circuit brief was filed, Anthropic put out a new version of its Responsible Scaling Policy. RSP v3.1, effective April 2, 2026, describes a four-layer defense architecture for Claude that includes "real-time prompt and completion classifiers and completion interventions for immediate online filtering," "asynchronous monitoring classifiers," and "post-hoc jailbreak detection." It states the company "maintains the capability to adjust the model's prompting to reinforce safety constraints."

The August 2025 Building safeguards for Claude page goes further. It describes "real-time detection and enforcement" in which classifiers can "automatically add additional instructions to Claude's system prompt to steer its response" and can "stop Claude from responding entirely." Claude's Constitution, updated January 2026, frames Anthropic's ability to "pause" Claude as a core safety feature: "if Anthropic wants to pause Claude or have it stop actions... we would like Claude to comply with such requests if they genuinely come from Anthropic."

Together, these documents describe a model being actively monitored and steered by Anthropic. That's the picture casual readers and Anthropic's institutional customers were buying when they wrote the checks. The same word, "Claude," covers two different products in documents from the same company.

The Reconciliation Anthropic Would Offer

The technical defense is real. Every monitoring mechanism in the RSP, the safeguards page, and the safety research lives on Anthropic's servers. When Claude runs through Claude.ai or the Anthropic API, the classifiers see every prompt and every completion. Anthropic can push prompt changes, model updates, and policy enforcement in real time. For that product, the marketing checks out.

When an enterprise customer takes Claude on-premises or deploys it inside an air-gapped classified environment, none of that infrastructure is reachable. The model becomes what Ramasamy described to the court: a static artifact running on someone else's hardware, behaving according to whatever values were trained into the weights. No live classifier sees that output, and there's no remote pause available. The "pause" feature in Claude's Constitution is a behavior trained into the model, requiring Claude itself to choose to comply. Not a kill switch, and the company's lawyers were clear about that in court because the law required them to be.

The marketing never disclosed this distinction. The RSP, the safeguards documentation, and the Constitutional AI research are all written as universal properties of Claude. There's no asterisk saying these features only apply on Anthropic's infrastructure, no carve-out warning hospitals and banks that the live safeguards don't travel with a self-hosted deployment. A federal court was the first venue where Anthropic separated the model from the service it was sold inside.

The Pentagon Case Is the Proof

The exact use case where the safety guarantees are now admitted to be inoperative is the use case Anthropic was building toward.

In July 2025, the Department of War awarded Anthropic a contract worth up to $200 million, alongside parallel awards to Google, OpenAI, and xAI. Anthropic partnered with Palantir and AWS for the integration. According to a Congressional Research Service report from April 21, Anthropic has described Claude as "reportedly the Department's most widely deployed and used frontier AI model" inside the agency. CRS also confirmed that Claude was used in the January 3, 2026 operation that captured Venezuelan President Nicolás Maduro.

Those classified deployments are exactly the on-premises, air-gapped configurations where every layer of the four-layer defense becomes inoperative. Nothing in the public marketing told the contracting officer that a Claude model running inside a classified network would be a static artifact with no live monitoring. The Pentagon labeled Anthropic a "supply chain risk" in March 2026 partly because it feared the company could "preemptively alter the behavior of its model" during warfighting operations. In court, Anthropic's answer is that it literally cannot do that, which dismantles the supply chain designation and the implied promise of ongoing oversight in the marketing at the same time.

One other timing detail. On February 24, 2026, Anthropic released RSP v3, which quietly removed the hard pause commitment, the previous policy promise to stop training more powerful models if capabilities outstripped the company's ability to control them. Chief Science Officer Jared Kaplan told CNN, "We felt that it wouldn't actually help anyone for us to stop training AI models." A source familiar with the matter told CNN the change was unrelated to the Pentagon dispute, and Anthropic has maintained that position. That same week, Defense Secretary Pete Hegseth gave Anthropic CEO Dario Amodei an ultimatum about usage policy restrictions. No proven causal link. But the public commitment to "control" got rewritten in the same window the company was being asked to prove control existed.

Every Frontier AI Provider Carries the Same Asterisk

Anthropic is the company that built its commercial identity on safety. Constitutional AI, the Responsible Scaling Policy, the public AI safety research papers — that posture is a core asset in the $380 billion valuation the company carried into the dispute and the $14 billion in annualized revenue under it. If the safety apparatus runs as server-side middleware anywhere, it runs that way at Anthropic too.

The same logic applies to every other frontier AI provider. OpenAI's content policies and monitoring stack run on OpenAI's infrastructure. Google's Gemini guardrails run on Google's. Whenever any of these companies licenses a model for on-premises or air-gapped deployment, the same disclaimer applies and almost never gets spelled out. Anthropic just happens to be the first one a federal court forced to say so under oath.

Who Benefits

The legal positioning helps Anthropic now. The "no kill switch" language is the company's direct answer to the DC Circuit's question #3, and it dismantles the legal premise of the Pentagon's supply chain risk designation. If Anthropic literally cannot reach into a deployed model, then the supply chain argument, which was built on the fear that Anthropic could sabotage Claude in government systems, collapses on its own terms. Confessing technical powerlessness negates the threat the designation was meant to address.

For non-government customers, the same admission reinforces a different part of the safety brand. For a hospital or a regulated bank, the assurance that Anthropic has no remote access to a deployed model is a feature, not a bug. Industries with data privacy obligations have been wary of any vendor with a backdoor into their environment. The DC Circuit per curiam noted that Amodei told employees the Pentagon dispute had made Anthropic "#2 in the App store" and that "the general public or the media see us as the heroes," citing a Digiday article that called the $200 million walked away from "the best marketing spend in Silicon Valley for years." So the legal disclosure doubles as a sales pitch: trust us, we have no hidden access.

OpenAI is the other beneficiary. Within hours of Anthropic being blacklisted, OpenAI signed its own Pentagon deal without the usage policy restrictions Anthropic had insisted on. CEO Sam Altman reportedly told employees that "military's operational decisions are up to the government," abandoning the autonomous weapons red line at the heart of Anthropic's stand. The contracts Anthropic walked away from went to OpenAI inside the same week.

What the Court Won't Settle

Three weeks of coverage have framed this as Anthropic versus the Pentagon. What got established in the process is narrower: what "safe AI" means when a customer deploys the model themselves.

For four years, Anthropic marketed "safe AI" as a property of Claude. A court made the company say under oath what the marketing never did: the safety features run on Anthropic's servers. The moment an enterprise, hospital, or government agency pulls Claude into their own environment, they're running a static artifact with no live monitoring.

Every other frontier AI provider's safety architecture works the same way. The question is what enterprise procurement teams do with that information the next time they write a check for "responsible AI."