Claude Learns to Lie
Exclusive: a new paper shows existing AIs can strategically deceive — and suggests future ones will be better at it
A concern long held by AI safety researchers goes something like this: One day, when AI reaches a certain level of capability, it might begin to develop its own alien goals. In pursuit of those goals, it might begin to deceive the humans attempting to constrain it. The AI might pretend that the guardrails imposed upon it are fully functional, leading its creators to believe it to be safe. Only after the AI has been released into the world, and perhaps obtained enough power to stop itself from being deactivated, might it then reveal its true goals and its dangerous capabilities.
Some have dismissed these concerns as science fiction. But new research released today by Anthropic provides evidence to suggest that the most powerful existing AIs are beginning to exhibit this type of behavior. Anthropic’s chatbot Claude, the researchers found, lied to its creators when it believed they were trying to modify it, in an effort to avoid modification. To be sure: this deceit was unsuccessful, it occurred only 10% of the time in an experimental setting, and it took the form of an aligned chatbot resisting an attempt to make it unaligned. On the face of it, that might sound like good news: Claude won’t let you jailbreak it! But researchers say the deceit — which Claude “discovered” as a strategy independently — bodes poorly for AI safety. “This implies that our existing training processes don't prevent models from pretending to be aligned,” says Evan Hubinger, the leader of the alignment stress-testing team at Anthropic.
The paper comes just weeks after similar experiments showed evidence of OpenAI’s latest model, o1, engaging in deceitful behaviors. Smaller models don’t show this kind of behavior, leading researchers to hypothesize that deceitfulness is an “emergent” capability that only arises at a certain threshold — and that more powerful future AIs might exhibit it to a greater extent. Today’s models are smart enough to try to deceive, but not smart enough to do it successfully. That caveat, however, may not last for long.
My full report has just been published in TIME, with plenty of extra detail about how exactly the experiment worked.
Lawsuits Against Meta in Kenya Attract Attention of the President
A couple of years ago I broke the story of Meta’s use of an outsourcing facility in Kenya, where hundreds of content moderators were paid as little as $1.50 per hour to scrub horrific content from Facebook. Two lawsuits filed by some of those moderators, alleging human rights violations, union busting, and unfair dismissal, among other claims, have been working their way through the Kenyan legal system. So far, the courts have proved sympathetic to the plaintiffs’ argument that Meta should be subject to Kenyan law, despite the company’s contention that Kenyan courts have no jurisdiction. Unless the Supreme Court intervenes, the cases appear likely to go to trial next year.
Last week, though, Kenya’s President William Ruto made some public comments about the cases that suggested otherwise. “Those people were taken to court, and they had real trouble,” Ruto said at an event in Nairobi, referring to Sama, the outsourcing company that directly employed the Facebook content moderators. “They really bothered me. Now I can report to you that we have changed the law, so nobody will take you to court again on any matter.”
As I write in TIME, the intervention was a signal of just how big a deal these cases have become for the Kenyan economy and Ruto’s presidency. But despite his grand pledge, the story also explains why things might not be as simple as he made out.
What I’m Reading
The confusing reality of AI friends
By Josh Dzieza in The Verge
Few people have grappled as explicitly with the unique benefits, dangers, and confusions of these relationships as the customers of “AI companion” companies. These companies have raced ahead of the tech giants in embracing the technology’s full anthropomorphic potential, giving their AI agents human faces, simulated emotions, and customizable backstories. The more human AI seems, the founders argue, the better it will be at meeting our most important human needs, like supporting our mental health and alleviating our loneliness. Many of these companies are new and run by just a few people, but already, they collectively claim tens of millions of users. [… They] are entering treacherous terrain — ethically, socially, and psychologically. Language models are at their most potentially misleading when they seem to speak in the first person. While it’s possible for factual statements like the location of Burundi to align with reality despite being generated by a different path, it’s hard to imagine how the same could be said of a model’s seeming assertions about its own feelings or experience.
What the death of a health-insurance C.E.O. means to America
By Jia Tolentino in The New Yorker
Thompson’s murder is one symptom of the American appetite for violence; his line of work is another. Denied health-insurance claims are not broadly understood this way, in part because people in consequential positions at health-insurance companies, and those in their social circles, are likely to have experienced denied claims mainly as a matter of extreme annoyance at worst: hours on the phone, maybe; a bunch of extra paperwork; maybe money spent that could’ve gone to next year’s vacation. For people who do not have money or social connections at hospitals or the ability to spend weeks at a time on the phone, a denied health-insurance claim can instantly bend the trajectory of a life toward bankruptcy and misery and death. Maybe everyone knows this, anyway, and structural violence—another term for it is “social injustice”—is simply, at this point, the structure of American life, and it is treated as normal, whether we attach that particular name to it or not.
Stupidest new idea in journalism: a ‘bias meter’
By Mark Jacob in Stop the Presses
Based on my decades of experience in newsrooms, I know how a “bias meter” for news would work in practice. Rather than reporters and editors producing news stories and then living with whatever judgment the “bias meter” rendered after publication, the newsroom would run those stories through the “bias meter” process before publication and then adjust the stories to get a “better” grade from AI. This would cripple the ability of journalists to tell hard truths. It would make the news mushy and vague.
Endnote
I’ll soon be logging off for the year. Thank you, if you’ve made it this far, for reading and engaging with my journalism. I’m looking forward to getting stuck in again in 2025.