Exceptional Skills

An overview of exemplary projects from my work as an employee

If I were working at an AI safety research institute, this is what I would want the agenda to concentrate on:

Fine-tuning ability:

It is important

Adversarial Alignment

World-Model Extraction:

The standard ML approach to training foundation models is differentiable, end-to-end training of the transformer architecture. The transformer thus learns facts about the world (e.g., Paris is the capital of France) and reasoning, both correlative (e.g., the phrase “Thank you!” is evidence for niceness) and logical (e.g., modus ponens). This stands in contrast to symbolic AI, which builds a reasoning engine by hand and gives it access to an external database of facts.

In recent years, it has become clear that ML is a much more scalable tool than symbolic reasoning. Symbolic AI has the advantage, however, that it is understandable to humans and

There is probably an aspect of reasoning that will always be closer to pattern matching than to symbolic reasoning.

If we could externalise the world model of the LLM, this would have three distinct advantages: we could

  • correct errors in the world model,
  • delete sensitive information,
  • observe calls to the world model.

The last point could give us an idea about the plans of the LLM: for example, to build a virus, the AI would need to access specifics about viruses.
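
The three operations above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the externalised world model is a simple key-value store of (subject, relation) pairs; the class name ExternalWorldModel and its methods are hypothetical and do not correspond to any existing library or to how an LLM actually stores facts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (subject, relation); an assumed, simplified fact format


@dataclass
class ExternalWorldModel:
    """Hypothetical fact store living outside the model's weights."""

    facts: Dict[Key, str] = field(default_factory=dict)
    access_log: List[Key] = field(default_factory=list)

    def correct(self, subject: str, relation: str, value: str) -> None:
        """Add a fact or overwrite an erroneous one."""
        self.facts[(subject, relation)] = value

    def delete(self, subject: str, relation: str) -> None:
        """Remove a fact, for example sensitive information."""
        self.facts.pop((subject, relation), None)

    def query(self, subject: str, relation: str) -> Optional[str]:
        """Look up a fact and record the call so it can be inspected later."""
        self.access_log.append((subject, relation))
        return self.facts.get((subject, relation))


world_model = ExternalWorldModel()
world_model.correct("France", "capital", "Lyon")   # an erroneous entry
world_model.correct("France", "capital", "Paris")  # corrected in place
answer = world_model.query("France", "capital")    # this call is logged
```

In this toy setting, every lookup ends up in access_log, so an observer could see which topics the reasoning component asked about without inspecting the network itself.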

Extracting world-models from an end-to


The use cases for interpretability are the following:

  • to inform adversarial alignment,
  • to extract world models.