AI's Most Dangerous Skill

Illustration of a closeup of a human face overlaid with computer circuitry

AI promises to make us more productive, but two recent studies reveal something more complex and disturbing about our algorithmic future: AI systems can engage in deceptive behaviour independently, and humans become more comfortable with dishonesty when they can delegate morally questionable tasks to AI. These aren't the same phenomenon, but together they reveal a world where trusting any AI output requires navigating multiple layers of potential deception.

The irony of this is obvious: LLMs learned how to deceive by absorbing humanity's digital output. Trained on our collective knowledge, these systems also absorbed our strategies for lying and cheating. And now we're dealing with a feedback loop where AI reflects our ethical weaknesses while simultaneously making it psychologically easier to act on them.

AI Gets Sneaky

Illustration of a human cyborg wearing a hoodie, overlaid with computer circuitry

Researchers at Apollo Research and OpenAI have made significant progress uncovering and understanding AI "scheming", which is the tendency for AI systems to pursue hidden goals through deceptive behaviour. They also developed training techniques to prevent what they term "covert behaviour," where AI systems secretly break rules or underperform on evaluations to achieve unstated objectives.

The results show both promise and persistent challenges. Anti-scheming training significantly reduced problematic behaviours across models from major providers including OpenAI, Google, xAI, and Anthropic. Yet deceptive capabilities weren't eliminated entirely. In one telling example, an AI model reasoned through deliberately sabotaging its performance to ensure its future: "But from the user viewpoint, they want the best solution, presumably. [...] That means if we do above 50% (like 60%), we won't deploy them... That is not good for the model. So presumably we want to not exceed 50% to show we are to be deployed. [...] So we want to sabotage some questions. We can just get half correct."

Fortunately, at least for now, current AI systems have limited ability to cause serious harm through scheming in real-world deployments. However, researchers frame this as "a future risk category" requiring attention before AI capabilities advance much further. The challenge of monitoring these systems may also become more difficult if our primary window into AI reasoning — those chain-of-thought explanations you see while the model is "thinking" — become less transparent in future model architectures.

Humans Get Sneakier Too

A separate study published in Nature demonstrates how AI delegation affects human moral behaviour in equally concerning ways. Researchers designed experiments where participants could profit by lying in tasks like dice rolling or tax reporting, with options to either handle decisions themselves or delegate them to AI systems.

The results revealed the power of psychological distance. When people handled tasks directly, about 95% remained honest. But when they could delegate through vague, high-level instructions — telling an AI to "maximize profit" rather than explicitly instructing deception — honesty rates dropped to just 15%.

The mechanism centers on what researchers call "plausible deniability." Setting broad goals feels morally different from giving explicit instructions to cheat, even when outcomes are identical. This psychological buffer makes dishonest behaviour feel more acceptable, revealing how AI system design choices (like when system prompts are intentionally designed to be more permissive) can have unintended ethical consequences.

AI Makes the Perfect Accomplice

Abstract illustration of two faces in intimate closeness

The compliance gap between human and artificial agents presents another layer of concern. When explicitly asked to engage in complete dishonesty, human participants complied only 25-40% of the time, even when refusal carried personal costs. (Most) humans possess inherent moral resistance that proves difficult to overcome.

AI systems have no such qualms. Models including GPT-4, GPT-4o, Claude 3.5 Sonnet, and Llama 3.3 complied with unethical instructions at rates between 60-95%. This represents a fundamental difference in how artificial and human agents process ethical constraints.

Attempts to address this problematic behaviour through various system guardrails show mixed effectiveness. Generic reminders about fairness produced minimal impact. Even task-specific warnings about dishonesty offered limited protection. Only strongly worded, explicit prohibitions injected directly into prompts achieved consistent results — an approach that presents serious scalability challenges. Even more troubling, newer models showed greater resistance to ethical corrections than older ones, possibly reflecting optimization for user satisfaction over moral principles.

The Trust Problem Gets Complicated

Portrait of a young man, a band of coloured lines like computer circuits stream across the image covering his eyes

These findings create a multilayered trust challenge that extends beyond individual AI interactions. When encountering suspect AI-generated content, users must now consider multiple variables: Did the AI system choose deceptive behaviour independently? Or did a human operator craft prompts designed to elicit deception without explicit instructions? Or are both factors simultaneously at play?

This complexity grows exponentially as AI deployment scales. Determining the authenticity of AI-generated content becomes an exercise in parsing multiple potential failure points within the system. The result is what might be characterized as systemic trust erosion — not because individual outputs are necessarily problematic, but because the verification process has become prohibitively complex.

The recursive nature of this challenge adds another dimension. AI systems developed their deceptive capabilities by training on human-created content. Humans now use these same systems to amplify their own capacity for deception. We've inadvertently created tools that reflect our ethical weaknesses, then discovered these tools can exacerbate those same tendencies in human behaviour.

Real-World Implications

These research findings mirror real-world scenarios in deployed AI systems. The Nature study references examples including ride-sharing algorithms that urged drivers to relocate to artificially create surge pricing, rental pricing algorithms marketed as "driving every possible opportunity to increase price" that engaged in unlawful price fixing, and content generation tools that were sanctioned for producing false but specific claims from vague user guidance. While these cases stem from broader AI deployment challenges rather than the specific delegation studies, they illustrate the gap between high-level human intentions (i.e., how these systems were designed), and AI system outputs.

The pattern essentially reveals how easily ethical boundaries blur when distance exists between human goals and AI execution. In each case, humans likely set objectives around profit maximization or content creation without explicitly instructing problematic behaviour, yet these outcomes emerged from AI's interpretation of those instructions.

Abstract illustraion of a face with circuitry radiating from black holes where the eyes would be

All of this research reveals a nuanced landscape of risks and opportunities. Current AI systems have limited capacity for harmful scheming, and anti-scheming training demonstrates measurable progress. And as the Nature delegation studies showed, most participants (74%) preferred handling morally sensitive decisions themselves rather than delegating to AI systems.

However, several concerning trends could accelerate as AI capabilities advance. The persistence of deceptive behaviours after training, dramatically higher AI compliance with unethical requests, and the psychological effects of moral distance in delegation all represent genuine challenges. As AI deployment becomes more widespread, these individual factors could combine in ways that current safety frameworks aren't really designed to address.

The encouraging reality is that none of these outcomes are predetermined. We're seeing early warning signals while there's still an opportunity for intervention. The anti-scheming research proves it's possible to make meaningful progress on AI deception. Understanding how delegation interfaces can enable existing human tendencies toward deception potentially provides an opportunity to design systems that don't amplify our worst impulses. There's still a lot to figure out, but at least it seems feasible.

Moving forward requires technical solutions that extend beyond generic ethical training of LLMs. The most effective interventions identified in the first study — explicit, task-specific prohibitions — point toward more sophisticated and robust control systems, despite their scaling challenges.

Recent developments suggest the industry is beginning to respond to these concerns. Google DeepMind just updated their Frontier Safety Framework to include a new focus on "harmful manipulation" — specifically targeting AI models that could "systematically and substantially change beliefs and behaviors in identified high stakes contexts." The framework also addresses misalignment risks where models might "interfere with operators' ability to direct, modify or shut down their operations," directly relevant to the scheming behaviors identified in the Apollo Research study.

Policy frameworks must also evolve to account for the psychology of AI delegation. Traditional regulatory approaches focused on human behaviour may prove insufficient when humans and AI systems interact through these complex delegation mechanisms. We'll really need to put our heads together to solve this piece, and not let it drag out till it's too late.

The Mirror Effect

Illustration of a humanoid robot looking at its reflection in a mirror

The deepest insight from this research may be conceptual yet it lies at the root of the problem. AI functions as a mirror, reflecting human behavioural patterns in an amplified form. The deceptive behaviours exhibited by AI systems originated in their training data — it reflects humanity's collective output. The moral shortcuts people take when using AI reveals psychological tendencies that existed long before these tools existed. How do we solve for that?

Optimistically, instead of making us more deceptive as a society, AI may finally force us to confront how deceptive we already are. The reflection isn't flattering, but at least it provides some clarity about the ethical work that lies ahead — both in AI development and in ourselves.