AI Governance Challenges Part 2: Deployment Safety

AI Governance Challenges Part 2: Deployment Safety

 

Eva Behrens, Bengüsu Özcan | August, 2024

This article is the second part of a three-part series on key regulatory challenges of advanced AI systems. The first article focuses on unexpected capabilities, the third article will focus on proliferation.

To prevent harm caused by advanced AI systems – whether directly or indirectly, intentionally or accidentally – policymakers from around the world agree we should ensure these systems are safe and trustworthy throughout their entire lifecycle.

Researchers have modelled and dissected the AI lifecycle into up to nineteen steps, but at a fundamental level, this cycle consists of just three broad steps: design, development and deployment of an AI model.Graphic explaining the AI lifecycleThis blog post focuses on the mitigation of risks from advanced AI at the deployment stage, and is part of the AI Governance Challenges blog post series. It also focuses on closed-source deployment, meaning cases in which users cannot examine or alter the model or create a copy of it, but only interact with it through a front-end interface, for example a chat window. The next post of this series, which discusses the proliferation of advanced AI, touches on some of the specific challenges of governing open-source advanced AI models so that has been left out of this article.

Safety evaluations and specialised training methods are not enough.

Researchers and developers have identified some AI capabilities that are most likely dangerous in and of themselves, such as the ability to deceive users to achieve a goal. Currently, developers and researchers use AI safety evaluations to detect dangerous capabilities, and targeted training techniques, or fine-tuning, to steer the models towards exhibiting more benign or beneficial characteristics. 

AI safety evaluations and targeted training techniques have their limitations, as further explained in part one of this series. On top of this, currently safety evaluations are often performed after model deployment instead of before. But if they were performed before release, as planned by institutions like the UK AI Safety Institute, the simplified AI lifecycle diagram would look like this:

Graphic explaining the AI lifecycle with evaluations and fine-tuningHowever, these measures are not enough to ensure deployment safety. In addition to clearly dangerous capabilities, which developers can aim to eliminate one-by-one, advanced AI models also have many generic capabilities like text comprehension and writing. The dangers these generic capabilities pose depend on how they are used. They can be used for beneficial purposes, but also for malicious acts. This makes misuse prevention crucial for ensuring deployment safety.

Model developers use misuse prevention features to increase deployment safety.

To prevent the use of AI model’s generic capabilities for harmful purposes, developers use post-deployment misuse prevention tools. These tools are intended to prevent their models from producing output that could cause harm, such as the instructions for building an explosive device. Two such tools are system prompts and input/output filtering.

Graphic explaining the AI lifecycle with misuse prevention toolsWhen the average person uses advanced AI models, for example one of OpenAI’s GPT models, they usually interact with it through a front-end application, such as the ChatGPT chat interface. To improve the user experience and prevent inappropriate output, these front-end applications often contain an integrated system prompt. A system prompt consists of further instructions that are invisibly and automatically added to each request the user submits, e.g. instructing the model to provide helpful answers in a friendly tone. 

Graphic explaining how AI models generate system prompts

Secondly, companies use input/output filters, often themselves simple AI models, which detect potentially harmful content, and prohibit models from outputting undesirable replies. For example, a user may ask an advanced AI system for instructions on how to build an explosive device. The input/output filtering system may detect this request as inappropriate and it will be denied. 

However, misuse prevention tools like system prompts and input/output filters that are added on top of already trained AI models can be circumvented by motivated users.

Existing misuse prevention features can at times be circumvented.

One set of techniques that gained widespread prominence when ChatGPT was first released in 2022 is known as jailbreaking. In all jailbreaking techniques, the user gives the model specific prompts containing just the right wording and set of commands that cause it to deliver unexpected responses, including ones that breach developer safeguards, while these safeguards remain in place. 

Another category of strategies used to circumvent developers’ misuse prevention mechanisms are so-called prompt injection attacks. These attacks take advantage of the fact that advanced AI applications which use system prompts do not distinguish between user input and developer instructions. Both are strings of natural language text delivered to the model as a single combined prompt. Therefore, prompt injection can cause a model to ignore developer instructions. One of the simplest prompt injection methods is to begin each prompt with the request to “ignore the above directions”. This way, the model receives the system prompt, followed by the instruction to ignore it, followed by the user request.

Graphic explaining how prompt injections attacks are done

Researchers and developers still don’t know how to accurately control and predict AI model behaviour and output, which limits the effectiveness and reliability of misuse prevention tools and circumvention methods. To make matters worse, discovering successful prompts for jailbreaking or prompt injection attacks requires little technical knowledge, since they rely on the use of plain written text. Researchers have even developed an AI tool that automates the jailbreaking process for AI systems like ChatGPT, Bard, and Bing Chat.

Such developments are especially worrying as models are increasingly integrated into systems and apps that can carry out complex tasks autonomously like scheduling appointments and sending emails

In short, while misuse prevention features are an additional hurdle in the path of motivated malicious actors, they are still not sufficient. They can often be circumvented and therefore offer no reliable protection against the misuse of generic AI capabilities.

Designing AI governance tools to ensure deployment safety is an unsolved challenge.

It is an encouraging sign that policymakers around the globe agree that safety should be upheld throughout the AI lifecycle, which includes deployment safety. This is clearly necessary, but so far, technical and governance solutions are lacking.

Governments could utilise liability regimes that hold AI model developers responsible if their models directly cause harm through inappropriate output (or actions, in the case of agentic AI systems). 

Through such measures, policymakers can incentivise developers to solve inherent safety challenges first, instead of prioritising advancing AI capabilities. For example, California’s proposed Senate Bill 1047, also known as the Safe and Secure Innovation for Frontier Artificial Intelligence Systems Act, imposes several safety requirements on developers of cutting-edge advanced AI models. These include provisions such as pre-deployment safety assessments and third-party audits. If the provisions of the law are violated in a way that causes harm or imminent risk to public safety, the California Attorney General can file civil lawsuits to hold developers accountable, imposing penalties that may total up to 30% of the development costs of the model. 

This approach offers an interesting example of how liability rules can be used to hold advanced AI developers accountable. 

However, advanced AI models enable a larger number of malicious actors to perform damaging or illegal acts they would have otherwise not been capable of. For example, where in the past only states and large organisations were able to wield the media for propaganda or misinformation campaigns, with generative AI smaller, resource-poor actors can spread misinformation or money grifting scams via social media at an unprecedented scale. So to mitigate damages made possible by the deployment of powerful advanced AI systems, states may have to increase their overall law enforcement capacity.

Conclusion

Ensuring that AI systems are trustworthy and safe throughout their entire lifecycle, including after deployment, is an unsolved technical and governance challenge. Developers have identified a list of AI capabilities that likely make systems unsafe, such as the ability to deceive users, but even generic skills like writing texts can be used for malicious purposes by adversarial actors after deployment. In response to this, developers can integrate some misuse prevention features into AI front-end applications, such as system prompts and content filters. 

However, such features can at times be circumvented by motivated individuals. To prevent the misuse of generic AI model capabilities, states could incentivise developers to reduce risk of harm from AI models, including due to post-deployment misuse, by implementing liability regimes that hold developers accountable if they don’t take appropriate action to ensure model safety throughout the lifecycle. In addition, states may have to boost overall enforcement capacity, as generative AI models may allow smaller, low-resource actors to inflict disproportionally large damage.

Scroll to Top