Why we should regulate AI before deployment
David Janků | November 2024
The first wave of AI policy making, from the EU AI Act to the US Executive Order to the strategy of the UK AI Safety Institute, focuses on the pre-deployment stage of the AI lifecycle. These policies try to ensure that AI products are not dangerous before they are deployed widely.
They do this via tools and approaches like “capability evaluations” against benchmarks and “red teaming”. There has been some promising progress on this front: for instance, the UK AI Safety Institute has developed evaluations of cyber-offensive and chem/biohazardous capabilities, Model Evaluation and Threat Research (METR) has developed new evaluations of autonomous agentic capabilities and self-replication, and another group has developed novel frameworks for automated red-teaming.
Capability evaluations use standardized datasets and tasks to test the model’s ability and likelihood to perform certain (dangerous) behaviours. Safety benchmarks often include stress tests for edge cases, ensuring the model doesn’t fail or behave unpredictably in rare or extreme scenarios.
Red teaming involves acting as a malicious actor in an attempt to find flaws or vulnerabilities in the model via adversarial testing. This can include probing for biases, testing edge cases, or trying to elicit undesirable behaviours.
These techniques might also be complemented by an analysis of the model’s internal representations and structures to understand both how it processes information and makes decisions, and to check that it faithfully reports its knowledge. This process aims to help identify and mitigate potential safety threats and biases, with the goal of creating models that can withstand malicious inputs and perform safely in real-world applications.
While having robust deployment procedures and testing is important, this method of safety testing has been shown to have multiple deficiencies. The rest of this article unpacks some of these limitations and then argues that policymakers need to explore regulation in the pre-development phase of the AI lifecycle to effectively prevent Advanced AI risks.
Safety testing before deployment: necessary but not sufficient
Dangerous capabilities may only emerge after deployment
First, the inherent complexity and unpredictability of AI systems mean that misalignments and dangerous capabilities may not become apparent until they are already operational in complex real-world scenarios, making preemptive identification and correction challenging. We have seen several examples of such failures in current systems, like when Microsoft’s Bing made threats and exhibited other disturbing behaviors like trying to seduce or manipulate users; or circumnavigating ChatGPT’s safety precautions within just a single day of it going public despite thorough safety testing prior to release.
The emergence of unexpected capabilities after deployment is one of the factors that makes regulation of advanced AI particularly challenging, as my colleague wrote about earlier this year.
Moreover, building on top of a base model (that has already been evaluated for safety) could potentially elicit new dangerous capabilities in the resulting model. While the base model might be safe, the addition of new layers of training or modifications can inadvertently emphasize harmful patterns or create vulnerabilities that were not present in the original model. For instance, red-teaming studies have shown that fine-tuning large language models can compromise their safety guardrails. By using just a few adversarial training examples, researchers were able to bypass the model’s safety mechanisms, making it responsive to harmful instructions. This removal of safety mechanisms could be as cheap as 200 USD. Similarly, both Google Deepmind and OpenAI found that scaffolding techniques – which involve adding supportive structures and tools to enhance a model’s functionality – can introduce dangerous capabilities into AI models, such as manipulation, deception, and offensive cyber capabilities.
Sufficiently advanced models could recognise and evade safety testing
By making AI models ever more powerful, it is plausible that at some point a model is created that will have enough situational awareness to realize when it is being tested (with some early indications this is already happening) and enough capabilities to adjust its behaviour to the test, to hide potentially dangerous capabilities and misalignments. This is sometimes called deceptive alignment and empirical tests on the recent models show that our current safety techniques are not able to detect and prevent such behavior.
Dangerous AI models could be stolen or accidentally proliferated
AI developers might create versions of dangerous models either as an artifact to improve their safety methods—similar to gain-of-function research (as illustrated in this study)—or by accident while trying to develop more capable models and only later figuring out via their internal safety testing that the new model is dangerous. In such cases, AI labs would likely decide not to deploy such systems or be prevented from doing so by stronger regulations. However, these dangerous systems would already exist internally in the labs, which means there is a chance they could still be proliferated: either by accidental deployment or by models being stolen or replicated without the original developers’ comprehensive safety checks. This has already happened, as with the unintended release of Meta’s LLaMA model.
Broader trends in AI could create further weaknesses
Racing dynamics exacerbate these factors
Economic pressures and the global race for AI supremacy combine to form a particularly potent catalyst for premature AI deployment. Companies and nations alike are often driven by the need to secure technological leadership and economic advantage, which can lead to the deployment of AI systems that have not been thoroughly vetted for safety. This rush towards deployment can be illustrated by the Uber 2018 self-driving car crash, where engineers disabled an emergency brake that they worried would cause the car to behave overly cautiously and look worse than competitor vehicles, leading to a pedestrian death. The same racing dynamics were at the root of many other industrial disasters like the Bhopal gas tragedy, which is widely considered to be the worst industrial disaster ever to have happened. These pressures are magnified on the international stage, where nations vie for technological dominance, sometimes at the expense of establishing and adhering to rigorous safety and ethical standards, especially in a military development context. Cutting corners on safety is especially worrying when the safety margins are very thin, represented by safety research making up only 2 % of all AI research.
Agentic models may disrupt the training → deployment paradigm
In addition, there are technical shifts coming that could disrupt the existing paradigm of capability evaluations and red teaming. One of the major focus of AI companies is the creation of ‘AI agents’, models which enrich AI products with autonomous agency in an effort to give them the ability to simultaneously learn and be trained in real time. Current safety evaluations are designed for systems whose training phase is distinct and completed prior to deployment. The introduction of models capable of ongoing learning—where algorithms adapt and optimize post-deployment—necessitates a reevaluation of these traditional safety protocols. Such new contexts might cause the AI system to change over time to the extent that the original safety evaluations lose relevance.
The mere introduction of very capable agentic AIs into the real world already creates novel and unprecedented situations that might further decrease the relevance of previous safety testing
Solutions
Although useful, present regulations relying on pre-deployment safety testing are not sufficient to cover the important risks coming from development of advanced AI systems. States should intervene early in the process before potentially dangerous systems are developed, to proactively create a coordinated environment in which chances of safe development are highest and risks are minimized.
In order to intervene in the pre-development stage of the AI lifecycle, policy makers should define which systems are too risky to be developed in the first place (see, for example the proposal for regulating long term planning agents, or red lines in AI development) and require pre-development safety guarantees (especially in relation to red lines and other dangerous capabilities). Further, in order to reduce competition pressures that might cause circumventing of the safety features, nations should explore international governance mechanisms such as collaboration on AI development.
Without these measures, the risks associated with AI misalignment and premature deployment could lead to irreversible consequences, fundamentally altering our societal landscape.