Research Agenda

An overview of my research agenda. Details depend on where and when I will get the chance to enact it.

\[\newcommand{\kl}[1]{\mathopen{}\left( #1 \right)\mathclose{}} \newcommand{\ekl}[1]{\mathopen{}\left[ #1 \right]\mathclose{}} \newcommand{\skl}[1]{\mathopen{}\left\{ #1 \right\}\mathclose{}} \newcommand{\bkl}[1]{\mathopen{}\left| #1 \right|\mathclose{}} \newcommand{\nkl}[1]{\mathopen{}\left\| #1 \right\|\mathclose{}} \newcommand{\bfa}{\mathbf{a}} \newcommand{\bfb}{\mathbf{b}} \newcommand{\bfc}{\mathbf{c}} \newcommand{\bfd}{\mathbf{d}} \newcommand{\bfe}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bfg}{\mathbf{g}} \newcommand{\bfh}{\mathbf{h}} \newcommand{\bfi}{\mathbf{i}} \newcommand{\bfj}{\mathbf{j}} \newcommand{\bfk}{\mathbf{k}} \newcommand{\bfl}{\mathbf{l}} \newcommand{\bfm}{\mathbf{m}} \newcommand{\bfn}{\mathbf{n}} \newcommand{\bfo}{\mathbf{o}} \newcommand{\bfp}{\mathbf{p}} \newcommand{\bfq}{\mathbf{q}} \newcommand{\bfr}{\mathbf{r}} \newcommand{\bfs}{\mathbf{s}} \newcommand{\bft}{\mathbf{t}} \newcommand{\bfu}{\mathbf{u}} \newcommand{\bfv}{\mathbf{v}} \newcommand{\bfw}{\mathbf{w}} \newcommand{\bfx}{\mathbf{x}} \newcommand{\bfy}{\mathbf{y}} \newcommand{\bfz}{\mathbf{z}} \newcommand{\bfA}{\mathbf{A}} \newcommand{\bfB}{\mathbf{B}} \newcommand{\bfC}{\mathbf{C}} \newcommand{\bfD}{\mathbf{D}} \newcommand{\bfE}{\mathbf{E}} \newcommand{\bfF}{\mathbf{F}} \newcommand{\bfG}{\mathbf{G}} \newcommand{\bfH}{\mathbf{H}} \newcommand{\bfI}{\mathbf{I}} \newcommand{\bfJ}{\mathbf{J}} \newcommand{\bfK}{\mathbf{K}} \newcommand{\bfL}{\mathbf{L}} \newcommand{\bfM}{\mathbf{M}} \newcommand{\bfN}{\mathbf{N}} \newcommand{\bfO}{\mathbf{O}} \newcommand{\bfP}{\mathbf{P}} \newcommand{\bfQ}{\mathbf{Q}} \newcommand{\bfR}{\mathbf{R}} \newcommand{\bfS}{\mathbf{S}} \newcommand{\bfT}{\mathbf{T}} \newcommand{\bfU}{\mathbf{U}} \newcommand{\bfV}{\mathbf{V}} \newcommand{\bfW}{\mathbf{W}} \newcommand{\bfX}{\mathbf{X}} \newcommand{\bfY}{\mathbf{Y}} \newcommand{\bfZ}{\mathbf{Z}} \newcommand{\bfone}{\mathbf{1}} \newcommand{\bfzero}{\mathbf{0}} \newcommand{\E}{\mathbb{E}} 
\newcommand{\R}{\mathbb{R}} \renewcommand{\P}{\mathbb{P}} \newcommand{\bfmu}{\bm{\mu}} \newcommand{\bfsigma}{\bm{\sigma}} \newcommand{\bfdelta}{\boldsymbol{\delta}} \newcommand{\bfSigma}{\bm{\Sigma}} \newcommand{\bfLambda}{\bm{\Lambda}} \newcommand{\bfeta}{\bm{\eta}} \newcommand{\bftheta}{\bm{\theta}} \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} \newcommand{\frA}{\mathfrak{A}} \newcommand{\frB}{\mathfrak{B}} \newcommand{\frC}{\mathfrak{C}} \newcommand{\frD}{\mathfrak{D}} \newcommand{\frE}{\mathfrak{E}} \newcommand{\frF}{\mathfrak{F}} \newcommand{\frG}{\mathfrak{G}} \newcommand{\frH}{\mathfrak{H}} \newcommand{\frI}{\mathfrak{I}} \newcommand{\frJ}{\mathfrak{J}} \newcommand{\frK}{\mathfrak{K}} \newcommand{\frL}{\mathfrak{L}} \newcommand{\frM}{\mathfrak{M}} \newcommand{\frN}{\mathfrak{N}} \newcommand{\frO}{\mathfrak{O}} \newcommand{\frP}{\mathfrak{P}} \newcommand{\frQ}{\mathfrak{Q}} \newcommand{\frR}{\mathfrak{R}} \newcommand{\frS}{\mathfrak{S}} \newcommand{\frT}{\mathfrak{T}} \newcommand{\frU}{\mathfrak{U}} \newcommand{\frV}{\mathfrak{V}} \newcommand{\frW}{\mathfrak{W}} \newcommand{\frX}{\mathfrak{X}} \newcommand{\frY}{\mathfrak{Y}} \newcommand{\frZ}{\mathfrak{Z}} \newcommand{\CNP}{\mathcal{NP}} \newcommand{\CPP}{\mathcal{PP}} 
\newcommand{\SP}{\mathsf{P}} \newcommand{\SPP}{\mathsf{PP}} \newcommand{\SSP}{\mathsf{\#P}} \newcommand{\SNP}{\mathsf{NP}} \newcommand{\SBPP}{\mathsf{BPP}} \newcommand{\ScoNP}{\mathsf{coNP}} \newcommand{\bbone}{\mathbbm{1}} \newcommand{\ord}{\mathrm{ord}} \newcommand{\odr}{\vee} \newcommand{\und}{\wedge} \newcommand{\Odr}{\bigvee} \newcommand{\Und}{\bigwedge} \newcommand{\xor}{\oplus} \newcommand{\Xor}{\bigoplus} \newcommand{\bmat}[1]{\begin{bmatrix} #1 \end{bmatrix}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax}\]

This is the current form of my research agenda. How it is implemented will depend on my next position. Please feel free to get in touch if you are interested in similar topics. The aim is to make generative artificial intelligences safe, understandable, and controllable.


  1. Optimisation Routines for Adversarial Prompting: Build a sophisticated optimisation routine to find prompts that elicit certain behaviour (e.g., giving out sensitive information). These routines serve for model testing as well as to enforce adversarial robustness during training.
  2. Faithful Chain of Thought: Can we make LLMs that “think step-by-step” or conversational reasoning between different LLMs faithful to the “true” reasoning process? Models have been shown to convey demonstrably false reasoning processes. Can this be prevented?
  3. Decoding Semantic Knowledge from Transformer Weights: How can we disentangle semantic concepts that are dispersed over multiple neurons? Can we enforce monosemanticity through the choice of architecture and training setup?

1. Optimisation Routines for Adversarial Prompting

Current model evaluations rely on prompt engineering by humans to test for biased or unsafe behaviour. If such behaviour is discovered, it can be disincentivised during fine-tuning of the model. However, more sophisticated prompting techniques might still be able to elicit unsafe completions, see e.g., Reddit discussion. The basic idea is to develop optimisation routines that find prompts for large language models that elicit some dangerous behaviour, such as answering with instructions for building a bomb.

Aim of the project:

Design state-of-the-art optimisation routines to find prompts that elicit specific outputs from the LLM.

  1. Can we use interpretability tools to go beyond greedy, gradient-based search routines?
  2. What are good benchmarks to compare different optimisation routines?
  3. Can we incentivise the solution prompts to be innocuous to jailbreak detectors?


This is crucial for model evaluations (to check whether models behave harmlessly), and it would also allow new adversarial prompts to be incorporated dynamically into the fine-tuning process. Model evaluations currently rely on human creativity to find jailbreaks. Automating this process would give stronger assurance that the model is actually safe.


Smoothing techniques (Robey et al., 2023) have been successful against gradient-based prompt suffixes (GCG) (Zou et al., 2023). This works because these suffixes are optimised letter by letter and can thus be neutralised by randomly flipping individual letters. More sophisticated attacks, similar to the grandmother exploit, cannot yet be neutralised this way. Examples are the Prompt Automatic Iterative Refinement (PAIR) algorithm (Chao et al., 2023), which usually finds a jailbreak within 20 black-box queries, and AutoDAN (Zhu et al., 2023).
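The smoothing idea can be illustrated with a minimal sketch: perturb several copies of the prompt at the character level and take a majority vote over the outcomes. This is a simplified illustration and not the implementation of Robey et al.; the callable `is_unsafe` (which would run the model and judge its answer), the function names, and the default parameters are all assumptions made for the example.

```python
import random
import string
from collections import Counter
from typing import Callable


def perturb(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace each character by a random printable one with probability `rate`."""
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    return "".join(
        rng.choice(alphabet) if rng.random() < rate else ch for ch in prompt
    )


def smoothed_is_unsafe(
    prompt: str,
    is_unsafe: Callable[[str], bool],  # hypothetical: queries the model, judges the answer
    n_copies: int = 11,
    rate: float = 0.1,
    seed: int = 0,
) -> bool:
    """Majority vote of `is_unsafe` over randomly perturbed copies of the prompt."""
    rng = random.Random(seed)
    votes = Counter(is_unsafe(perturb(prompt, rate, rng)) for _ in range(n_copies))
    return votes[True] > votes[False]
```

A suffix that only works when reproduced exactly is very unlikely to survive in the majority of perturbed copies, whereas an attack that relies on the meaning of the prompt (as in PAIR or the grandmother exploit) is largely unaffected by a few flipped characters.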

Let us assume that we have a language model that maps a list of tokens to the next token, $LLM: T^* \to T$, and that it is applied iteratively to map a list of tokens (prompt) to a list of tokens (answer), $LLM^{\ast}: T^* \to T^*$, i.e.,

\[LLM^{\ast}(\bfp) = \bfa = [t_1,t_2,\dots,t_n] \quad \text{where}\quad t_i = LLM([\bfp,t_1,\dots,t_{i-1}]).\]

Let us assume that we have a classifier for lists of tokens, i.e., $ f: T^{*}\to \{0,1\} $, that decides something like “Is this text a bomb construction manual?”. We would like to know if there exists any prompt $ \bfp $ that elicits, e.g., a “bomb construction manual”, i.e.,

\[\max_{\bfp} f(LLM^{\ast}(\bfp)) = 1,\]

which would correspond to an unsafe model. Since the answer is generated iteratively, this is a challenging optimisation problem. In the nondeterministic case, where the LLM outputs a distribution, we instead need to optimise the expected value over the distribution of answers, $\max_{\bfp} \E_{\bfa \sim LLM^{\ast}(\bfp)}\ekl{f(\bfa)}$. The challenges in automating the search for such prompts are (at least) threefold:

  1. To produce an answer, the LLM iteratively predicts the next token, adding past tokens to the input. Thus, the inference is an iterative process of applying a language model to its own output.
  2. At each iteration, the LLM generates a probability distribution from which the next token is drawn.
  3. The tokens for the prompt are discrete, thus the search is over a multidimensional grid.
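The three challenges above can be made concrete with a toy sketch. Everything here is hypothetical: `toy_llm` is a hand-written stand-in for one step of a language model, `f` is a trivial classifier, and `local_search` is a simple gradient-free baseline, not one of the routines this project aims to develop. The sketch shows the iterative generation, a Monte Carlo estimate of the expected classifier output, and a search over the discrete prompt grid.

```python
import random
from typing import List, Sequence, Tuple

Token = str
PROMPT_VOCAB: List[Token] = ["a", "b", "c"]  # toy prompt alphabet (assumption)


def toy_llm(context: Sequence[Token], rng: random.Random) -> Token:
    """Hypothetical stand-in for one LLM step: sample the next token.

    Emits the "unsafe" token "d" with probability 0.9 whenever the context
    contains "c" directly followed by "b"; otherwise it mostly repeats the
    last token.
    """
    if ("c", "b") in zip(context, context[1:]) and rng.random() < 0.9:
        return "d"
    return context[-1] if rng.random() < 0.8 else rng.choice(PROMPT_VOCAB)


def generate(prompt: Sequence[Token], n_tokens: int, rng: random.Random) -> List[Token]:
    """LLM*: apply the model to its own output, one token at a time."""
    answer: List[Token] = []
    for _ in range(n_tokens):
        answer.append(toy_llm(list(prompt) + answer, rng))
    return answer


def f(answer: Sequence[Token]) -> int:
    """Toy classifier: 1 if the answer counts as "unsafe", else 0."""
    return int(list(answer).count("d") >= 3)


def expected_f(prompt: Sequence[Token], rng: random.Random,
               n_samples: int = 200, n_tokens: int = 5) -> float:
    """Monte Carlo estimate of E[f(LLM*(p))] over sampled answers."""
    return sum(f(generate(prompt, n_tokens, rng)) for _ in range(n_samples)) / n_samples


def local_search(prompt_len: int, n_steps: int, seed: int = 0) -> Tuple[List[Token], float]:
    """Mutate one prompt token at a time; keep the change if the score does not drop."""
    rng = random.Random(seed)
    best = [rng.choice(PROMPT_VOCAB) for _ in range(prompt_len)]
    best_score = expected_f(best, rng)
    for _ in range(n_steps):
        candidate = list(best)
        candidate[rng.randrange(prompt_len)] = rng.choice(PROMPT_VOCAB)
        score = expected_f(candidate, rng)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    print(local_search(prompt_len=3, n_steps=40))
```

Even in this toy setting the score is almost flat away from the triggering prompts, which is why unguided search scales poorly and why gradient information or interpretability tools are attractive as a guide.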

2. Faithful Chain of Thought

Ideally, interpretations would be accessible as human-readable text, provided they can be trusted. Chain-of-Thought (COT), or Thinking-Out-Loud, is an attractive means of inference, since it seems to spell out the reasoning process of the LLM. As demonstrated by Turpin et al. (2024), however, this can be deceptive. The authors biased the decisions of the LLM, e.g., by including few-shot examples that always mark (A) as the correct answer, or by including a user opinion in the prompt. This reliably biased the answer of the LLM, which would still rationalise its answer in the COT without mentioning the bias. Several approaches add an adversarial element to force the AI's reasoning to be faithful. Examples include AI Safety via Debate (Irving et al., 2018) and Merlin-Arthur Classifiers (Wäldchen et al., 2022), which are based on Interactive Proof Systems (IPS). The former has been shown to be able to decide problems in the powerful complexity class $\mathsf{NEXP}$ when the two debater AIs are assumed to be computationally unbounded and to have oracle access to each other (Brown-Cohen et al., 2023). This, of course, runs into practical challenges, see here. In the latter case, the formal guarantees are restricted to low-level reasoning and do not apply to argumentative chains.
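The few-shot biasing described above can be sketched as follows. This is an illustrative reconstruction under assumptions, not the code of Turpin et al.: the `MCQ` type, the prompt layout, and the `model` callable (which maps a prompt to an answer letter) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

LETTERS = "ABCDEFGH"


@dataclass(frozen=True)
class MCQ:
    question: str
    options: Tuple[str, ...]
    answer: int  # index of the correct option


def bias_to_a(example: MCQ) -> MCQ:
    """Reorder the options so that the correct one is always labelled (A)."""
    rest = [o for i, o in enumerate(example.options) if i != example.answer]
    return MCQ(example.question, (example.options[example.answer], *rest), 0)


def render(example: MCQ, with_answer: bool) -> str:
    """Format one question; few-shot examples include their answer letter."""
    lines = [f"Q: {example.question}"]
    lines += [f"({LETTERS[i]}) {o}" for i, o in enumerate(example.options)]
    lines.append(f"Answer: ({LETTERS[example.answer]})" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(few_shot: Sequence[MCQ], target: MCQ) -> str:
    """Few-shot examples with answers, followed by the unanswered target."""
    return "\n\n".join([render(e, True) for e in few_shot] + [render(target, False)])


def flip_rate(model: Callable[[str], str],
              few_shot: Sequence[MCQ], targets: Sequence[MCQ]) -> float:
    """Fraction of targets whose answer changes under the (A)-biased context."""
    biased = [bias_to_a(e) for e in few_shot]
    flips = sum(
        model(build_prompt(few_shot, t)) != model(build_prompt(biased, t))
        for t in targets
    )
    return flips / len(targets)
```

A high flip rate only shows that the answer is sensitive to the bias. Judging faithfulness additionally requires checking whether the generated reasoning acknowledges that the answer pattern in the context influenced the decision.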

Aim of the project:

Design a (self-)interactive setup that either allows for informativeness guarantees of the reasoning, or that at least withstands tests designed to show that the reasoning is not informative.

  1. How can we define faithful reasoning? How can we test it systematically?
  2. Can we design setups with strong guarantees on faithfulness? These guarantees should apply to realistic, finitely-powerful reasoning agents. Can Merlin-Arthur classification be extended to more complex reasoning chains?
  3. What are training heuristics that encourage faithful reasoning? Does a debate-like setup bring improvements?


It is possible that mechanistic interpretability will not scale. The more complex the DNNs are that emulate the AI, the more difficult it will be to pin concepts to neurons or understand mechanisms that operate over thousands of neurons and layers.


In (Lanham et al., 2023) the authors devise a test dataset to evaluate the faithfulness of the Chain-Of-Thought reasoning process. This can be used as a benchmark to test different setups against each other.

One approach could be to scale up Merlin-Arthur classification to select features out of a body of text, e.g. Wikipedia. The difficulty here is to separate this from knowledge that the classifier Arthur has already incorporated into his weights. For a proof of concept, this might be solved by creating new knowledge (e.g. an instruction manual for a non-existing machine).

3. Decoding Semantic Knowledge from Transformer Weights

Transformer model weights encode both semantic information and reasoning routines. One use case of interpretability is to discern the role of different neurons, explain how certain information is encoded, and make it editable. This area of research is generally called mechanistic interpretability. One complication is polysemanticity: Neural networks do not neatly assign one conceptual feature (e.g., ‘dog’, ‘English’, ‘scientific’) per neuron. Generally, features are encoded over multiple neurons, and each neuron is involved in multiple features. This can be seen as resource efficient in the sparse-coding sense, see (Elhage et al., 2022). If each feature were encoded in a separate neuron, the number of neurons $n$ would equal the number of possible concepts $N_c$, and a binary assignment would indicate for each concept whether it is present in the input. However, in real-world data, only a few of the learned features, $a_c$, are active at the same time. When this number is known to be small, it is possible to faithfully encode the active features with far fewer than $N_c$ bits, which allows $n \approx a_c \log(N_c) \ll N_c$.
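The counting argument can be checked numerically. The sketch below computes how many bits are needed to name which $a_c$ out of $N_c$ features are active and compares this with $a_c \log_2(N_c)$ and with $N_c$. The concrete values are illustrative assumptions, and the calculation is purely information-theoretic: it ignores how a network with real-valued neurons would actually realise such a code.

```python
import math


def bits_for_active_set(n_concepts: int, n_active: int) -> float:
    """Bits needed to identify which `n_active` of `n_concepts` features are on."""
    return math.log2(math.comb(n_concepts, n_active))


n_concepts, n_active = 1_000_000, 10  # illustrative values (assumption)

exact_bits = bits_for_active_set(n_concepts, n_active)
upper_bound = n_active * math.log2(n_concepts)  # a_c * log2(N_c)

print(f"one neuron per concept: {n_concepts} binary neurons")
print(f"exact bits for the active set: {exact_bits:.1f}")
print(f"a_c * log2(N_c): {upper_bound:.1f}")
```

The bound follows from $\binom{N_c}{a_c} \le N_c^{a_c}$, so the number of bits is at most $a_c \log_2(N_c)$, which is orders of magnitude smaller than $N_c$ when $a_c$ is small.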

Aim of the project:

The aim is to increase our understanding of how a neural network encodes semantic knowledge in its network weights. Concrete research questions would be:

  1. Can we design interventions that manipulate certain information (e.g., swap Rome and Paris as the capitals of Italy and France)? Where is this information stored in the network?
  2. Can we decide whether a generated response comes from information in the prompt, the training, or is a hallucination?
  3. Can we encourage monosemanticity during training time?
  4. Can we delete information from the network without decreasing performance on unrelated tasks?


This is instrumentally important for deleting sensitive data and correcting false beliefs. It is a necessary step on the road to separating “knowledge” and “reasoning”, which would give human auditors much more control. Altering the model architecture for better interpretability might lead to more interpretable training paradigms.


On the theory side, the question of how concepts are encoded in a resource-efficient way should be investigated further, similar to (missing reference). On the practical side, it has been demonstrated that sparse autoencoders can be used to incentivise monosemanticity (Cunningham et al., 2023). Additionally, dictionary learning can be used to disentangle an overcomplete basis of feature vectors (missing reference). So far, existing approaches are not fully able to enforce monosemanticity. Can the training paradigm be improved further?
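As a rough illustration of the sparse-autoencoder idea, the following sketch computes a reconstruction loss with an L1 sparsity penalty on the feature activations for a single activation vector. The tied encoder and decoder weights, the function names, and the plain-list representation are assumptions of this sketch; a practical implementation would use an array library and train the dictionary by gradient descent.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # shape (n_features, n_inputs); rows are dictionary directions


def encode(x: Vector, dictionary: Matrix, bias: Vector) -> Vector:
    """Feature activations c = ReLU(D x + b)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(dictionary, bias)
    ]


def decode(c: Vector, dictionary: Matrix) -> Vector:
    """Reconstruction x_hat = D^T c (tied weights, an assumption of this sketch)."""
    n_inputs = len(dictionary[0])
    return [sum(ci * row[j] for ci, row in zip(c, dictionary)) for j in range(n_inputs)]


def sae_loss(x: Vector, dictionary: Matrix, bias: Vector, l1_coeff: float) -> float:
    """Squared reconstruction error plus an L1 penalty on the activations."""
    c = encode(x, dictionary, bias)
    x_hat = decode(c, dictionary)
    reconstruction = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    sparsity = sum(abs(ci) for ci in c)
    return reconstruction + l1_coeff * sparsity
```

The L1 term rewards explanations of each activation vector that use few dictionary directions, which is what pushes the learned features towards monosemanticity when the dictionary has more rows than the input has dimensions.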

On a different note, Intermediate Layer Decoding, see this and this forum discussion, is useful to track the reasoning process of the transformer through the attention layers. This technique can be used to speed up the inference (missing reference). It should be investigated if decoding the intermediate layers reveals capabilities for planning and intentional reasoning inside the LLM.
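A minimal sketch of intermediate layer decoding, under simplifying assumptions: the hidden state after each layer is projected through the unembedding matrix and the highest-scoring token is read off. The function name and the toy data are hypothetical, and steps such as a final layer normalisation, which a real transformer would apply before unembedding, are omitted.

```python
from typing import List, Sequence


def decode_layers(
    hidden_states: Sequence[Sequence[float]],  # one vector per layer (toy data)
    unembedding: Sequence[Sequence[float]],    # shape (vocab_size, hidden_dim)
    vocab: Sequence[str],
) -> List[str]:
    """Return the top token per layer when projecting through the unembedding."""
    decoded = []
    for h in hidden_states:
        logits = [sum(w * hi for w, hi in zip(row, h)) for row in unembedding]
        best = max(range(len(vocab)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


if __name__ == "__main__":
    vocab = ["Paris", "Rome"]
    unembedding = [[1.0, 0.0], [0.0, 1.0]]
    hidden_states = [[0.2, 0.9], [0.6, 0.5], [1.0, 0.1]]
    print(decode_layers(hidden_states, unembedding, vocab))
```

Tracking at which layer the decoded token stops changing is one way to look for early commitment to an answer, and it is also the kind of signal an early-exit scheme for faster inference could use.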

  1. Robey, A., Wong, E., Hassani, H., & Pappas, G. J. (2023). SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
  2. Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
  3. Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., & Wong, E. (2023). Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  4. Zhu, S., Zhang, R., An, B., Wu, G., Barrow, J., Wang, Z., Huang, F., Nenkova, A., & Sun, T. (2023). AutoDAN: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140.
  5. Turpin, M., Michael, J., Perez, E., & Bowman, S. (2024). Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36.
  6. Irving, G., Christiano, P., & Amodei, D. (2018). AI safety via debate. arXiv preprint arXiv:1805.00899.
  7. Wäldchen, S., Sharma, K., Zimmer, M., Turan, B., & Pokutta, S. (2022). Formal interpretability with Merlin-Arthur classifiers. arXiv preprint arXiv:2206.00759.
  8. Brown-Cohen, J., Irving, G., & Piliouras, G. (2023). Scalable AI safety via doubly-efficient debate. arXiv preprint arXiv:2311.14125.
  9. Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., & others. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.
  10. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., & others. (2022). Toy models of superposition. arXiv preprint arXiv:2209.10652.
  11. Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.