Chapter 1 - Transformers & Mechanistic Interpretability

The transformer is the neural network architecture behind modern language models, and it has made headlines with the introduction of models like ChatGPT.

In this chapter, you will learn all about transformers, and build and train your own. You’ll also learn about Mechanistic Interpretability of transformers, a field advanced by Anthropic’s Transformer Circuits sequence and by the work of Neel Nanda.

You can find the actual content at this page.

Transformers:
Building, Training, Sampling

The first day of this chapter involves:

  • Learning about how transformers work (e.g. core concepts like the attention mechanism and residual stream)

  • Building your own GPT-2 model

  • Learning how to generate autoregressive text samples from a transformer’s probabilistic output

  • Understanding how transformers can use key and value caching to speed up computation
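
The sampling step above can be sketched in a few lines: the model’s final logits are turned into a probability distribution with a softmax (optionally scaled by a temperature), and the next token is drawn from that distribution. The `model` callable below is a hypothetical stand-in for a real transformer’s forward pass, not an actual model:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()              # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

def sample_token(logits, temperature=1.0, rng=None):
    # Draw one token id from the model's next-token distribution.
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = softmax(np.asarray(logits, dtype=float), temperature)
    return int(rng.choice(len(probs), p=probs))

def generate(model, tokens, n_new, temperature=1.0):
    # Autoregressive loop: feed the sequence in, sample one token,
    # append it, and repeat. (A key/value cache would let `model` reuse
    # the keys and values it already computed for the old tokens.)
    tokens = list(tokens)
    for _ in range(n_new):
        logits = model(tokens)   # hypothetical forward pass
        tokens.append(sample_token(logits, temperature))
    return tokens
```

Low temperatures concentrate probability on the highest-logit token (approaching greedy decoding), while high temperatures flatten the distribution.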

Intro to Mechanistic Interpretability

The next day covers:

  • Mechanistic Interpretability - what it is, and what its path to impact looks like

  • Anthropic’s Transformer Circuits sequence (starting with A Mathematical Framework for Transformer Circuits)

  • The open-source library TransformerLens, and how it can assist with MechInt investigations and experiments

  • Induction Heads - what they are, and why they matter
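
The algorithm an induction head implements can be stated without any neural network at all: to predict the next token, look for the most recent earlier occurrence of the current token and copy whatever followed it. This toy restatement (plain Python, not an actual attention head) shows why such heads help on repeated text:

```python
def induction_predict(tokens):
    """Toy version of the induction-head algorithm: find the most recent
    earlier occurrence of the current token and copy its successor."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None                  # no earlier occurrence to copy from
```

On a sequence like `[5, 7, 9, 5]` this predicts `7`, because the previous `5` was followed by `7` — exactly the pattern induction heads exploit on repeated subsequences.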

Algorithmic Tasks (balanced brackets)

This is the first option in the set of paths you can take, after having covered the first 2 days of material.

Here, you’ll perform interpretability on a transformer trained to classify bracket strings as balanced or unbalanced. You’ll also have a chance to interpret models trained on simple LeetCode-style problems of your choice!
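
The classification target itself is simple to compute directly. A minimal reference implementation for single-type bracket strings (a sketch of the ground-truth algorithm, not the exercises’ own code) is the usual depth counter:

```python
def is_balanced(s):
    # Track nesting depth; a string is balanced iff the depth never
    # goes negative and ends at zero.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ')' closed more than was opened
                return False
    return depth == 0
```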

Indirect Object Identification

This is the second option in the set of paths you can take, after having covered the first 2 days of material.

Here, you’ll explore circuits in real-life models (GPT-2 small), and replicate the results of the Interpretability in the Wild paper (whose authors found a circuit for performing indirect object identification).
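
The task format is easy to reproduce: in a sentence like “When Mary and John went to the store, John gave a drink to ___”, the correct completion is the name that appears only once (the indirect object). The template and helper below are hypothetical illustrations, not the paper’s exact prompt set:

```python
TEMPLATE = "When {A} and {B} went to the store, {C} gave a drink to"

def make_ioi_prompt(subject, indirect_object):
    # The subject appears twice (S1 and S2); the indirect object once.
    prompt = TEMPLATE.format(A=indirect_object, B=subject, C=subject)
    return prompt, indirect_object   # prompt and its expected completion
```

Circuit quality on this task is typically measured by the logit difference between the indirect object’s name and the subject’s name at the final position.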

Grokking & Modular Arithmetic

This is the third option in the set of paths you can take, after having covered the first 2 days of material.

Here, you’ll investigate the phenomenon of grokking in transformers by studying a transformer trained to perform modular addition.
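
The full dataset for this task is small enough to enumerate: every pair (a, b) with label (a + b) mod p. Grokking setups train on a random fraction of these pairs and watch test accuracy jump long after train accuracy saturates. A sketch (the default modulus here is illustrative; any prime works):

```python
import itertools

def modadd_dataset(p=113):
    # Every input pair (a, b) with its label (a + b) mod p.
    return [((a, b), (a + b) % p)
            for a, b in itertools.product(range(p), repeat=2)]
```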

Superposition

This is the fourth option in the set of paths you can take, after having covered the first 2 days of material.

Here, you’ll investigate the phenomenon of superposition, which allows models to represent more features than they have neurons. You’ll also work with sparse autoencoders (SAEs), which aim to recover interpretable features from models despite superposition.
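
A sparse autoencoder’s forward pass is short: a ReLU encoder produces a non-negative (hopefully sparse) feature vector, a linear decoder reconstructs the input, and the loss trades reconstruction error against an L1 sparsity penalty. A minimal numpy sketch (parameter names are illustrative, not any specific library’s API):

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    # Encoder: ReLU keeps features non-negative and encourages sparsity.
    feats = np.maximum(0.0, x @ W_enc + b_enc)
    # Decoder: linear reconstruction of the original activation vector.
    x_hat = feats @ W_dec + b_dec
    recon_loss = np.mean((x - x_hat) ** 2)
    sparsity_loss = l1_coeff * np.abs(feats).sum()
    return x_hat, feats, recon_loss + sparsity_loss
```

Training adjusts the weights to minimize the returned loss; the learned feature directions are then candidates for interpretable features.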

OthelloGPT

This is the fifth option in the set of paths you can take, after having covered the first 2 days of material.

Here, you’ll conduct an investigation into how a model trained to play a game operates. You’ll use linear probes, logit attribution, and activation patching to gain an understanding of how the model represents and plays Othello.
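
A linear probe is just a linear map fitted from activations to some property of the board. The sketch below fits one by least squares on synthetic data where a known direction encodes a binary label; the “activations” are random stand-ins, not real OthelloGPT activations:

```python
import numpy as np

def probe_accuracy(seed=0, d=16, n=200):
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=d)       # hypothetical direction encoding the label
    acts = rng.normal(size=(n, d))      # stand-ins for residual-stream activations
    labels = (acts @ true_dir > 0).astype(float)

    # Linear probe: least-squares fit (with a bias column) from acts to labels.
    X = np.hstack([acts, np.ones((n, 1))])
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    preds = (X @ w > 0.5).astype(float)
    return float((preds == labels).mean())
```

Because the label really is linearly encoded in these synthetic activations, the probe recovers it with high accuracy — the same logic used to argue that OthelloGPT linearly represents board state.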