
Chapter 1 - Transformers & Mechanistic Interpretability
The transformer is the neural network architecture behind modern language models, and it has made headlines with the introduction of models like ChatGPT.
In this chapter, you will learn all about transformers, and build and train your own. You’ll also learn about Mechanistic Interpretability of transformers, a field advanced by Anthropic’s Transformer Circuits sequence and by the work of Neel Nanda.
You can find the actual content at this page.
Transformers: Building, Training, Sampling
The first day of this chapter involves:
Learning about how transformers work (e.g. core concepts like the attention mechanism and residual stream)
Building your own GPT-2 model
Learning how to generate autoregressive text samples from a transformer’s probabilistic output
Understanding how transformers can use key-value (KV) caching to speed up inference
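The sampling step above can be sketched in a few lines. This is a minimal illustration (the function name and greedy-at-zero-temperature convention are my own choices, not the exercise code): scale the logits by a temperature, convert them to probabilities with a softmax, and draw a token.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Draw one token id from a vector of logits (toy sketch)."""
    rng = rng or np.random.default_rng(0)
    if temperature == 0:
        return int(np.argmax(logits))   # convention: zero temperature = greedy decoding
    scaled = logits / temperature
    scaled = scaled - scaled.max()      # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([1.0, 3.0, 0.5])
print(sample_next_token(logits, temperature=0.0))  # greedy -> index 1
```

Lower temperatures concentrate probability on the highest logits; autoregressive generation simply appends each sampled token and repeats.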
Intro to Mechanistic Interpretability
The next day covers:
Mechanistic Interpretability - what it is, and what its path to impact looks like
Anthropic’s Transformer Circuits sequence (starting with A Mathematical Framework for Transformer Circuits)
The open-source library TransformerLens, and how it can assist with mech interp investigations and experiments
Induction Heads - what they are, and why they matter
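The algorithm an induction head implements can be written down directly. The sketch below is a toy illustration of the behaviour (not a learned attention head): find the most recent earlier occurrence of the current token, and predict the token that followed it ("prefix matching" plus "copying").

```python
def induction_predict(tokens):
    """Toy version of the induction-head algorithm: look back for the
    previous occurrence of the current token, then copy its successor."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards
        if tokens[i] == current:
            return tokens[i + 1]              # copy the token that followed
    return None  # no earlier occurrence to copy from

seq = list("ABCDAB")  # on the second "AB", predict "C" again
print(induction_predict(seq))  # -> "C"
```

This is why induction heads are detected with repeated random sequences: on the second repeat, this strategy predicts every token correctly.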
Algorithmic Tasks (balanced brackets)
This is the first of five optional paths you can take after covering the first two days of material.
Here, you’ll perform interpretability on a transformer trained to classify bracket strings as balanced or unbalanced. You’ll also have a chance to interpret models trained on simple LeetCode-style problems of your choice!
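The ground-truth labels for this task come from a standard depth-counting check; a minimal version (my own sketch, not the exercise code) looks like this:

```python
def is_balanced(s):
    """Ground-truth label for the bracket-classification task."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False  # closed a bracket that was never opened
    return depth == 0     # every open bracket must be closed

print(is_balanced("(())()"))  # True
print(is_balanced("())("))    # False
```

Part of the fun of the exercises is discovering how closely the transformer's internal solution resembles this running-depth algorithm.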
Indirect Object Identification
This is the second of the optional paths you can take after covering the first two days of material.
Here, you’ll explore circuits in a real model (GPT-2 Small), and replicate the results of the Interpretability in the Wild paper, whose authors found a circuit responsible for indirect object identification.
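The task itself is easy to state with the canonical example from the paper: given a sentence with two names, the model should complete it with the indirect object (the name that appears once), not the repeated subject. The template variable names below are my own labels for illustration.

```python
# IOI prompts pair two names; the correct completion is the indirect
# object ("Mary"), not the repeated subject ("John").
template = "When {A} and {B} went to the store, {B} gave a drink to"
prompt = template.format(A="Mary", B="John")
answer = " Mary"  # the indirect object

print(prompt)  # When Mary and John went to the store, John gave a drink to
```

The exercises measure how strongly the model prefers the indirect-object token over the subject token, then trace that preference back through attention heads.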
Grokking & Modular Arithmetic
This is the third of the optional paths you can take after covering the first two days of material.
Here, you’ll investigate the phenomenon of grokking in transformers, by studying a transformer trained to perform modular addition.
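The training task is small enough to enumerate exhaustively. A sketch of the dataset (the choice p = 113 follows the original grokking-on-modular-addition work; the function name is mine):

```python
import itertools

def modular_addition_dataset(p=113):
    """All (a, b, (a + b) mod p) triples - the full task the
    transformer is trained on in the grokking setup."""
    return [(a, b, (a + b) % p)
            for a, b in itertools.product(range(p), repeat=2)]

data = modular_addition_dataset(p=113)
print(len(data))  # 113 * 113 = 12769 examples in total
```

Because the full input space is tiny, you can train on a fraction of it and watch test accuracy jump from chance to perfect long after training accuracy saturates, which is the grokking phenomenon.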
Superposition
This is the fourth of the optional paths you can take after covering the first two days of material.
Here, you’ll investigate the phenomenon of superposition, which allows models to represent more features than they have neurons. You’ll also work with sparse autoencoders (SAEs), which aim to recover interpretable features from models despite superposition.
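The core setup here is a small autoencoder-style model in the spirit of Anthropic's Toy Models of Superposition: more features than hidden dimensions, a linear map down, and a reconstruction through the transposed map plus a ReLU. The sketch below shows the forward pass only (random weights, no training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 5, 2   # more features than neurons: superposition forced

W = rng.normal(size=(n_hidden, n_features))  # compress features -> hidden
b = np.zeros(n_features)

def toy_model(x):
    """Forward pass of the toy model: x_hat = ReLU(W^T W x + b)."""
    h = W @ x                          # bottleneck activation, shape (2,)
    return np.maximum(0, W.T @ h + b)  # reconstruct all 5 features

x = np.zeros(n_features)
x[3] = 1.0                  # a single active (sparse) feature
print(toy_model(x).shape)   # (5,)
```

Training this model on sparse feature vectors shows how it packs 5 features into 2 dimensions; SAEs then attempt the inverse move, unpacking a dense activation back into sparse, interpretable features.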
OthelloGPT
This is the fifth of the optional paths you can take after covering the first two days of material.
Here, you’ll conduct an investigation into how a model trained to play a game operates. You’ll use linear probes, logit attribution, and activation patching to gain an understanding of how the model represents and plays Othello.
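Activation patching, the last technique mentioned above, can be illustrated with a toy two-layer model (this is a schematic sketch of the idea, not the OthelloGPT code): cache an intermediate activation from a "clean" run, then splice it into a run on a "corrupted" input and see how much of the clean output is restored.

```python
import numpy as np

# A toy two-layer "model", so there is an intermediate activation to patch.
W1 = np.array([[1.0, 0.0], [0.0, 1.0]])
W2 = np.array([[2.0, 1.0]])

def run(x, patch_hidden=None):
    """Run the toy model, optionally overwriting the hidden activation
    with one cached from a different input (activation patching)."""
    h = W1 @ x if patch_hidden is None else patch_hidden
    return W2 @ h

clean, corrupted = np.array([1.0, 0.0]), np.array([0.0, 1.0])
h_clean = W1 @ clean                        # cache the clean activation
patched = run(corrupted, patch_hidden=h_clean)
print(run(clean), run(corrupted), patched)  # patching restores the clean output
```

In a real model the same move is done per layer and per position (e.g. with TransformerLens hooks), and the components whose patched activations restore the clean behaviour are the ones implicated in the circuit.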