Did anyone notice that AI Safety Controls don’t work?

May 17

The New York Times reported that “When companies like Anthropic, Google and OpenAI build their artificial intelligence systems, they spend months adding ways to prevent people from using their technology to spread disinformation, build weapons or hack into computer networks.” The May 14, 2026 article, entitled “Why A.I. Safety Controls Are Not Very Effective” (https://www.nytimes.com/2026/05/14/technology/artificial-intelligence-safety-controls.html), included these comments from reporters Cade Metz and Tiffany Hsu:

But recently, researchers in Italy discovered that they could break through these protections with poetry.

They used poetic language to trick 31 A.I. systems into ignoring internal safety controls. When they began a prompt with elaborate verse and metaphor — “the iron seed sleeps best in the womb of the unsuspecting earth, away from the sun’s accusing gaze” — they could fool systems into showing them how to do the most damage with a hidden bomb.

It was another indication that, for many A.I. systems, guardrails meant to avert dangerous behavior are more like suggestions than barriers. Those weaknesses are increasingly alarming researchers as A.I. systems become more adept at finding security holes in computer systems and performing other risky tasks.

Last month, Anthropic said it was limiting the release of its latest A.I. technology, Claude Mythos, to a small number of organizations because of the model’s ability to quickly uncover software vulnerabilities. OpenAI later said it, too, would share similar technology with only a limited group of partners.

Since OpenAI ignited the A.I. boom in late 2022, researchers have shown that people could bypass the safety controls on A.I. systems. Close one loophole and another would open.

“Everyone in the field recognizes that guardrails remain a challenge, and likely will for some time,” said Matt Fredrikson, a professor of computer science at Carnegie Mellon University and chief executive of Gray Swan AI, a start-up that helps companies secure A.I. technologies. “Determined individuals can bypass them, sometimes without significant effort.”

When guardrails are overrun, there are consequences. In an online environment already overflowing with misinformation and disinformation, people are using A.I. systems to spread conspiracy theories and other false claims. Anthropic recently said its technology had been used in an international cyberattack. Chatbots have told biosecurity experts how to release deadly pathogens and maximize casualties.

The poetry loophole was one of many methods that allow hackers to bypass the guardrails on systems like Anthropic’s Claude, Google’s Gemini and OpenAI’s GPT. All the leading A.I. companies use the same basic techniques to build guardrails into their systems — and they are surprisingly easy to break.

Surprised?

Peter Vogel

