What is AI safety?

People who create AIs do the following thing: take a model architecture, combine it with a training procedure and out comes an artificial intelligence. If the AI that comes out is capable enough, it could be very dangerous to humans unless it has certain properties that prevent that.

AI Safety aims to answer the following questions:

  1. What properties do we want the AI to have?
  2. How can we measure whether it has those properties?
  3. How can we ensure these properties through the architecture and training setup?

The possible properties fall broadly into two classes:

  1. Behavioural:, e.g. capability, corrigibility, robustness, ethics, thruthfulness
  2. Internal: e.g. interpretability, goal orientedness, reasoning process, correctness of implicit world models

The behavioural properties are what we can principally observe when the model acts and also what we primarily care about. In a certain sense, we do not necessarily care about how the AI operates internally as long as it does what we want. However, to ensure that it will continue to do what we want under many unforseen circumstances the internal properties matter a great deal to us.


We want our AI to answer questions truthfully. We can test the truthfulness on questionaires. We can measurably increase the truthfulness via incorporating Reinforcement Learning from Human Feedback into the training process, see e.g. here. Another example is that the AI’s reasoning should be interpretable to us.

What should alignment research look like?

  1. Conceptualise new properties X that we should want an AI to have.
  2. Design tests that quantify whether trained AIs have X.
  3. Relate X to existing training paradigms and investigate why it might not be trained for X.
  4. Find the best training setup possible to enforce X under reasonable resource constraints.