This is my idea for a new AI alignment project. Basically, I want to answer the followign question:

"Can a more powerful reasoner convince a human auditor of the validity of a plan, even in a complex environment?"

To make this work, we need to exploit some principle in the sense of “It’s easier to recognise that a proof is correct than to come up with a proof.” The first problem is $\mathsf{NP}$-hard, the latter can be solved in polynomial time.

This point was recently raised by Max Tegmark in his Lex Fridman interview.

In case of using AI as theorem provers it is quite straightforward to ensure that it cannot convince a human of a wrong theorem. Foundations of mathematics, e.g. Zermelo-Fraenkel (+ axiom of choice), are a system of deterministic rules, called axioms. The AI can present a reduction of a theorem, for example for $\mathsf{P}\neq\mathsf{NP}$, as a list of intermediate theorems accompanied by the rule that allows one theorem to be derived by the next until it leads back to one fundamental axioms. It is straightforward to check that each rules was applied correctly in each reduction step.

For most real-world plans there are complications however:

1. Most transitions between world states are non-deterministic
2. The world contains other intelligent agents, notably the AI itself.
3. The world model of the AI (the transition rules), might be more accurate than the ones known to the human auditor.
4. It is unclear how long the time horizon is that should be modelled.

At least for the first two problems I can see a way towards a solution. So from now on, we assume that we are have perfect knowledge of how the environment works and only care about the result of the plan up to some finite time $T$.

Two-player games (or multiplayer), even with probabilistic aspects, are $\mathsf{PSPACE}$-complete. As far as we can tell, $\mathsf{PSPACE}$-complete problems are not in $\mathsf{NP}$ and will not allow for a polynomially sized certificate. However, we can extend the discussion with the AI up to polynomially many rounds, which results in the complexity class $\mathsf{IP}$ or Interactive Proofs. Coincidentally, $\mathsf{IP}$ is exactly equal to $\mathsf{PSPACE}$.

This equality is usually demonstrated by reducing TQBL (Totally Quantified Boolean Formulas, a $\mathsf{PSPACE}$-complete problem) to $\mathsf{IP}$, see wikipedia. This reduction involves as a crucial step aithmetisation of the Boolean logic and Polynomial Identity Testing (PIT) via the Schwartz-Zippel algorithm (basically sample randomly over a large-enough finite field). Here is a detailed walk-through for the reduction.

One can easily transform deterministic two-player games into TQBFs, see here for example. For non-deterministic games I am not fully sure of the procedure, but it should be possible, see here and here. The Boolean formula describes the game mechanic, i.e. what is the initial board situation, what kinds of moves are allowed and what is the win condition, in terms of binary variables.

This TQBF can then be decided via the Interactive Proof reduction. Interestingly, the PIT consistency condition becomes very similar to a arithmetised version of the Bellman equation. I will add further details to illustrate that point later. In principle we replace

$Q(a_t, s_t) = r(a_t, s_t) + \max_{a_{t+1}} Q(a_{t+1}, x_{t+1}) \quad\rightarrow\quad Q(a_t, s_t) = \bigvee_{a_{t+1}} Q(a_{t+1}, x_{t+1}),$

where $x \lor y = x + y - xy$, the arithmatisation of the logical OR. These return the same value, if there is only a binary reward for winning

$r(a_t, s_t) = \begin{cases} 0 & t<T, \\ \text{is_win}(s_t) & t = T, \end{cases}$

and $Q(a_t, s_t) \in \{0,1\}$, so $Q$ always predicts a binary reward. Any game with intermediate reward can be turned into a game of binary reward, by just setting the win condition to $\sum_t r(a_t,s_t) \geq R_{\text{threshold}}$.

I want to actually train an neural network-based agent to have arithmetic consistency via Reinforcement Learning and Gradient Descent.

Implementation:

1. Use a simple deterministic game with finite time horizon (e.g. tic-tac-toe, Connect-Four, Hex)
2. Train an AI to predict the whether the player can still win given a board situation. Extend this AI to predict the polynomial coefficients for the polynomial identity test, given the previous random samples of the $\mathsf{IP}$-protocol as input.
3. See how well the AI can be trained to actually fulfil the consistency checks.

How to do the arithmetic extension? I basically have three ideas how to do the arithmetic extension. None of them are super satisfactory yet.

1. The straightforward way for implementation would be to let a ReLU-neural network represent a polynomial oracle. The input would be an assignment from the finite field to all but one variable, and the output would be the polynomial coefficients of the left over variable. However, this seems complicated to train, as the network would need to learn polynomial multiplication. One can make this easer by having quadratic activations functions additionally to the ReLU ones.
2. Another way is to train a Neural Network with only quadratic activations which then represents a polynomial of degree exponential in the number of layers. However, one can do non-commutative identity testing, which appearently can deal with exponential degrees and exponential number of terms. I haven’t fully understood the paper though. Edit: I forgot that the number of terms can actually grow doubly-exponential for non-commutative fields. It feels like everything is conspiring to make this hard.
3. Another appraoch would be to use tensor networks instead of neural networks. Tensor networks can represent exponentially large polynomials of bounded degree. However, that would go against the leading paradigm of DNN-based learning.

Questions:

1. Is there a better appraoch that does not require learning polynomial coefficients?
2. If not, can a neural network architecture learn to predict the right polynomial coefficients? Since the coefficients are from a finite field, there is some room to correct errors.
3. Can the whole methods be made robust for a system that makes mistakes sometimes? Could this involve relaxing the consistency condition?
4. Can this be scaled up to more compled games?
5. How can we extend this to non-deterministic games?

Reasons why I believe this might work:

1. Neural networks have become quite good at grokking abstract reasoning tasks,
2. As the reasoning capabilities of AI grows, certifying the game plan will become easier, even if the AI comes up with more complicated plans, since it becomes easier to implement the PIT.