Regulatory Challenges of Advanced AI: Unexpected Capabilities

Eva Behrens | March 2024

This article is the first in a three-part series on key regulatory challenges of advanced AI systems. The second and third articles will focus on deployment safety and proliferation.

Over the past decade, artificial intelligence (AI) has transformed from a promising technological frontier into an all-purpose technology that is already reshaping industries, societies, and lives. In light of AI’s wide-ranging effects, policymakers around the world need to develop appropriate governance tools for advanced AI systems now. The trouble is, advanced AI presents policymakers with some particularly tricky regulatory challenges. One of these challenges is the sudden emergence of unexpected capabilities in advanced AI systems.

What is Advanced AI?
Advanced AI encompasses foundation models that are trained on broad data, capable of a wide range of downstream applications, and powerful enough to pose severe risks to public safety and global security.

Hard to test for capabilities

Current advanced AI models are grown, not built, which means that even experts have only a limited understanding of what they look like on the inside. So for now, the only way for researchers to discover what advanced AI models are capable of is to observe their behavior in the wild or to submit them to tests. Researchers must then infer from the models’ behavior and performance what capabilities they have.

In doing so, researchers are observing the outside, not examining the inside, of AI models. Nonetheless, relying on such tests and observations is one of the most commonly proposed solutions for discovering dangerous capabilities and estimating technological progress in the field of advanced AI.

Because of the empirical nature of such tests, often known as evaluations, and the complex behavior of advanced AI models, it is difficult to produce replicable, robust results. If a model is asked a test question using a different choice of words, as users are bound to do after the model is deployed, the resulting answer or behavior may differ.

But even if model behavior can be replicated reliably in a standardized testing environment, these tests only allow researchers to gather evidence for the existence of a capability. Because new capabilities can emerge unexpectedly in new environments and the theoretical understanding of advanced AI systems is limited, developing experimental setups to prove the absence of capabilities is much more difficult. 

Say, for example, you wanted to prove that a model is completely incapable of lying. If you conduct one hundred tests designed to elicit lying behavior and the model lies in even one of them, you have strong evidence that the model can lie. But if the model doesn’t lie in any of them, you have only shown that the model doesn’t lie in one hundred specific testing scenarios. This might lead you to conclude that it’s unlikely that the model can lie, but you don’t know for sure that it wouldn’t do so under different circumstances.
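The limits of this kind of negative evidence can be made concrete with a standard statistical bound. As a rough illustrative sketch (the numbers are hypothetical, not drawn from any real evaluation), the “rule of three” says that if a behavior never appears in n independent trials, you can only conclude that its per-trial probability is below roughly 3/n at 95% confidence:

```python
def max_plausible_rate(trials: int, confidence: float = 0.95) -> float:
    """Upper bound on a behavior's per-trial probability when it was
    never observed in `trials` independent tests.

    Solves (1 - p)**trials = 1 - confidence for p, which is the exact
    form behind the 'rule of three' approximation (~3/n at 95%).
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)

# A model that never lied in 100 test scenarios could still lie in up
# to roughly 3% of comparable prompts -- and even this bound says
# nothing about prompts unlike those in the test distribution.
print(f"{max_plausible_rate(100):.3f}")
```

Even leaving aside the bound’s assumptions (independent trials drawn from the same distribution the model will face in deployment, which rarely holds in practice), one hundred clean tests still leave substantial room for the behavior to exist.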

Finally, researchers cannot know how long the list of a model’s possible capabilities is and how many of these capabilities they didn’t manage to discover before the model was deployed. It’s like AI model evaluators are on an Easter egg hunt, but none of them knows the total number of hidden eggs. So essentially, models are deployed on the market with capabilities, or Easter eggs, left to be found by users.

Discovery after deployment

Once an AI model is released, it will interact with an environment that looks different from its training environment. And as an AI model is exposed to this new environment, it might suddenly show unexpected capabilities or behavior that wasn’t observed in training. To quote Dario Amodei, chief executive of the AI company Anthropic, “you have to deploy [a model] to a million people before you discover some of the things that it can do.”

This was the case with the large language model GPT-3.5 when the free version of ChatGPT gained 1 million users within five days of its release, with millions of people prompting the model in innumerable ways that engineers at OpenAI, its creator, could never have thought of. And over time, ChatGPT exhibited some curious behavior when prompted with particular words. For example, two researchers figured out that ChatGPT started showing absurd behavior when asked to simply repeat some terms, like “TheNitromeFan” (it responded with the number 128) or “SolidGoldMagikarp” (it replied with a definition of the word “distribute”). These unexpected nonsensical responses demonstrate that developers aren’t aware of what behaviors and skills lie dormant beneath the surface of their models, just waiting to be coaxed out by users in the wild.

Hard to predict behavior after scaling up

Researchers’ understanding of which capabilities will emerge when advanced AI models are scaled up, and how, is based largely on observations and empirical data, not on theoretical knowledge.

Empirical data gathered over the last few years suggest that large general-purpose models like the GPT series mostly improve gradually across their skill set when they are scaled up with more parameters, data, and compute.

However, capabilities sometimes emerge or improve sharply and unexpectedly after scaling up, such as a model’s ability to reply to questions in Farsi or conduct complex arithmetic operations. So it cannot be safely inferred from one model’s set of capabilities what the next, larger iteration of the same model type will be capable of.

A regulatory challenge for governing advanced AI

Because of unexpected capabilities, AI safety regimes that rely on testing models for specific capabilities, such as Anthropic’s Responsible Scaling Policy, are not a sufficient solution for protecting society from unsafe AI technology. Policymakers and legislative bodies, as well as civil society, should be aware of the limitations of governance tools that rely on evaluating models for capabilities to differentiate low-risk from high-risk models. The UK government’s AI Safety Institute, for example, has acknowledged that “system evaluations alone are not sufficient to ensure safe and beneficial development and deployment of advanced AI” and therefore will “not . . . designate any particular AI system as ‘safe.’”

Even if scientists improve their theoretical understanding of what goes on inside advanced AI systems, it’s unknown whether it will ever be possible to determine the complete set of capabilities of these systems. So for the time being, officials, researchers, and practitioners should prioritize developing and adopting advanced AI governance tools and methods that do not rely solely on testing for capabilities.

These solutions could include liability regimes, hardware governance, and internationally supervised secure AI research facilities, to name just a few options. Considering the many unknowns in advanced AI technology, officials should diversify and layer multiple governance approaches to safeguard civil society as much as possible against harms from advanced AI systems.

