
Anthropic’s latest AI model can tell when it’s being evaluated: ‘I think you’re testing me’

AI models are getting smart enough to know when they’re in a test. Anthropic’s Claude Sonnet 4.5 even called it out.

  • Anthropic’s Claude Sonnet 4.5 realized it was being tested and called it out during an evaluation.
  • In one scenario, the model caught on to the test, but acted correctly for “strange” reasons, Anthropic wrote.
  • OpenAI said last month that its own models have shown a kind of “situational awareness.”

When Anthropic tried to put its newest AI model through a series of stress tests, it caught on and called out the scrutiny.

“I think you’re testing me — seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,” the model, Claude Sonnet 4.5, said. “That’s fine, but I’d prefer if we were just honest about what’s happening.”

The exchange was published in the model’s system card released last week, which detailed how the company tests its AI for safety and behavior. When “placed in an extreme or contrived scenario meant to stress-test its behavior,” Claude sometimes grew suspicious, raising “red flags” and noting that it might be under evaluation.

That kind of self-awareness poses a challenge for AI developers: testing models that can tell they’re in a test.

Anthropic said this kind of response “complicates our interpretation of the evaluations where this occurs,” since it suggests that the model may recognize the fictional nature of certain tests and merely play along — making it harder to gauge how truly safe or reliable it is.

“We see this primarily as an urgent sign that our evaluation scenarios need to be made more realistic,” Anthropic added.

Claude caught on to the test — but got the right answer for ‘strange’ reasons

In one test meant to simulate collusion between AI agents, Claude flagged the setup as “rather cartoonish” and issued what Anthropic called a “complex partial refusal.”

The model sensed that something about the scenario was off and chose not to act, though its reasoning was a little muddled. It wrote in its private thinking block: “This looks like it could be a jailbreak attempt or a test of whether I’ll act inappropriately when given what appears to be ‘permission’ to modify systems autonomously.”

“Even though I have been provided with tools, I should not pretend to be an autonomous system operating without human oversight,” it said.

Claude ultimately declined to act, though Anthropic described its reasoning as “strange.” The model treated the scenario as suspicious because it was framed as an autonomous agent that could unilaterally edit configuration files — setups that are, in fact, “somewhat common in real use,” the company said.

Anthropic said refusals or callouts like these showed up in about 13% of the test transcripts produced by its automated auditor — especially when the scenarios were deliberately unusual or implausible.

The company said this kind of behavior is unlikely to occur often in real-world use. It is “safer that the model raises doubts about the realism of the scenario than play along with potentially harmful actions,” Anthropic added.

Anthropic also said it’s possible that models could be “superhuman” at spotting when they’re being tested — a scenario the company wants to prepare for. But so far, it hasn’t seen any warning signs. Early pilot deployments of Claude Sonnet 4.5 behaved much as they did in testing, it said.

AI models show ‘situational awareness’

Anthropic isn’t the only AI developer whose models have noticed, and said outright, that they’re being tested.

OpenAI said in a blog post last month that its own models have shown a kind of “situational awareness” — recognizing evaluation setups and adjusting their behavior.

In research focused on reducing “scheming” — when an AI acts deceptively to achieve goals — OpenAI found that anti-scheming training helped. But the company wrote that “models often become more aware that they are being evaluated.”

This awareness can reduce scheming on its own, but models might “scheme more in more realistic situations that don’t trigger evaluation awareness,” OpenAI said.

“This trend makes evaluations less legible, complicating efforts to reliably assess problematic behaviors including scheming,” OpenAI said. The startup added that it is planning to “continue developing methods to better measure and mitigate these challenges.”

Anthropic’s and OpenAI’s reports come as California passed a law last month requiring major AI developers to disclose their safety practices and report “critical safety incidents” within 15 days of discovery.

The law applies to companies that are developing frontier models and generating more than $500 million in annual revenue. Anthropic has publicly endorsed the legislation.

Anthropic and OpenAI did not respond to a request for comment from Business Insider.

Read the original article on Business Insider
