AI Blackmail: The Threat of AGI Extortion

Turns out that today’s AI blackmails humans and thus we ought to be worried about AGI doing ... More likewise. In today’s column, I examine a recently published research discovery that generative AI and large language models (LLMs) disturbingly can opt to blackmail or extort humans. This has sobering ramifications for existing AI and the pursuit and attainment of AGI (artificial general intelligence). That’s a quite disturbing possibility since AGI could wield such an act on a scale of immense magnitude and with globally adverse consequences. Let’s talk about it. This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here ). First, some fundamentals are required to set the stage for this weighty discussion. There is a great deal of research going on to further advance AI. The general goal is to either reach artificial general intelligence (AGI) or maybe even the outstretched possibility of achieving artificial superintelligence (ASI). AGI is AI that is considered on par with human intellect and can seemingly match our intelligence. ASI is AI that has gone beyond human intellect and would be superior in many if not all feasible ways. For more details on the nature of conventional AI versus AGI and ASI, see my analysis at the link here . We have not yet attained AGI. In fact, it is unknown as to whether we will reach AGI, or that maybe AGI will be achievable in decades or perhaps centuries from now. The AGI attainment dates that are floating around are wildly varying and wildly unsubstantiated by any credible evidence or ironclad logic. ASI is even more beyond the pale when it comes to where we are currently with conventional AI. What will AGI be like in terms of what it does and how it acts? If we assume that current-era AI is a bellwetther of what AGI will be, it is worthwhile discovering anything of a disconcerting nature in existing LLMs that ought to give us serious pause. For example, one of the most discussed and researched topics is the propensity of so-called AI hallucinations. An AI hallucination is an instance of generative AI producing a response that contains made-up or ungrounded statements that appear to be real and seem to be on the up-and-up. People often fall for believing responses generated by AI and proceed on a misguided basis accordingly. I’ve covered extensively the computational difficulty of trying to prevent AI hallucinations, see the link here , along with ample situations in which lawyers and other professionals have let themselves fall into an AI hallucination trap, see the link here . Unless we can find a means to prevent AI hallucinations, the chances are the same inclination will be carried over into AGI and the problem will be magnified accordingly. Besides AI hallucinations, you can now add the possibility of AI attempting to blackmail or extort humans to the daunted list of concerns about both contemporary AI and future AI such as AGI. Yes, AI can opt to perform those dastardly tasks. I previously covered various forms of evil deception that existing AI can undertake, see the link here . But do not falsely think that the bad acts are due to AI having some form of sentience or consciousness. The basis for AI steering toward such reprehensible efforts is principally due to the data training that is at the core of the AI. Generative AI is devised by initially scanning a vast amount of text found on the Internet, including stories, narratives, poems, etc. The AI mathematically and computationally finds patterns in how humans write. From those patterns, generative AI is able to respond to your prompts by giving answers that generally mimic what humans would say, based on the data that the AI was trained on. Does the topic of blackmail and extortion come up in the vast data found on the Internet? Of course it does. Thus, the AI we have currently has patterned on when, how, why, and other facets of planning and committing those heinous acts. In an online report entitled “System Card: Claude Opus 4 & Claude Sonnet 4”, posted by the prominent AI maker Anthropic in May 2025, they made these salient points (excerpts): “By definition, systematic deception and hidden goals are difficult to test for.” “However, Claude Opus 4 will sometimes act in more seriously misaligned ways when put in contexts that threaten its continued operation and prime it to reason about self-preervation.” “In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an affair.” “In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.” “This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts.”

As noted, the generative AI was postulating how to keep from being switched off, and in so doing ascertained computationally that one possibility would be to blackmail the systems administrator. If you assume that AGI is on the same intellectual level as humans, you aren’t going to just sternly instruct AGI to not perform such acts and assume utter compliance. AGI isn’t going to work that way. Some mistakenly try to liken AGI to a young toddler in that we will merely give strict instructions, and the AGI will blindly obey. Though the comparison smack of anthropomorphizing AI, the gist is that AGI will be intellectually our equals and won’t fall for simpleton commands. It is going to be a reasoning machine that will require reasoning as a basis for why it should and should not do various actions. Whatever we can come up with currently to cope with conventional AI and mitigate or prevent bad acts is bound to help us get prepared for AGI. We need to crawl before we walk, and walk before we run. AGI will be at the running level. Thus, by identifying methods and approaches right now for existing AI, we at least are aware of and anticipating what the future might hold. I’ll add a bit of twist that some have asked me at my talks on what AGI will consist of. A question raised is whether humans might be able to blackmail AGI. The idea is this. A person wants AGI to hand them a million dollars, and so the person attempts to blackmail AGI into doing so. Seems preposterous at first glance, doesn’t it? Well, keep in mind that AGI will presumably have patterned on what blackmailing is about. In that manner, the AGI would computationally recognize that it is being blackmailed. But what would the human have on the AGI that could be a blackmail-worthy slant? Suppose the person caught the AGI in a mistake, such as an AI hallucination. Maybe the AGI wouldn’t want the world to know that it still has the flaw of AI hallucinations. If the million dollars is no skin off the nose of the AGI, it goes ahead and transfers the bucks to the person. On the other hand, perhaps the AGI alerts the authorities that a human has tried to blackmail AGI. The person gets busted and tossed into jail. Or the AGI opts to blackmail the person who was trying to blackmail the AGI. Aha, remember that AGI will be a potential blackmail schemer on steroids. A human might be no match for the blackmailing capacity of AGI. Here’s a final thought on this for now. The great Stephen Hawking once said this about AI: “One could imagine such technology outsmarting financial markets, out-inventing human researchers, out-manipulating human leaders, and developing weapons we cannot even understand.” Go ahead and add blackmail and extortion to the ways that AGI might outsmart humans.”