Chapter 3 - Model Evaluation

Evaluating models is a growing field of AI safety. It allows us to test the outputs of LLMs under different circumstances to see whether they are safe, reliable and secure.

In this chapter, you will be learning about some of the fundamentals of designing safety evaluations, and go on to design your own. You’ll then learn how to run these evaluations on current frontier models. You’ll finsih by learning how to build and scaffold LM agents, a rapidly developing facet of current frontier model evaluation.

The full content is available here.

Eval Design & Threat Modelling

Threat-modelling focuses on teaching the fundamental skills necessary to write good evaluation questions. These include: defining the properties you want to measure, understanding why they are relevant for safety, and how to measure them effectively. You’ll then take a look at an evaluations case-study of “alignment faking”. Finally, you’ll write a few evaluation questions for the property of your choice.

Model-written evals

Writing evaluations for LLMs can be time-consuming! On day two, you will build a pipeline in order to use language models (along with the few questions you hand-wrote) to generate hundreds of questions. You’ll also build a pipeline to audit these generated questions so that you can minimise biases in questions, and remove any mistakes made in the generation process.

Running Evals with Inspect

Now that you have an evaluation dataset, you can run your evaluation on a variety of current models to see how they respond, and collect data on their behaviour under different circumstances.

You will be using the UK AISI’s Inspect-AI library to test and run different evaluation methods, and to help to visualise the results of our evaluations across models and contexts.

Building and evaluating LLM Agents

Description to fill in soon.

Chapter 3 - Model Evaluation

Eval Design & Threat Modelling

Model-written evals

Running Evals with Inspect

Building and evaluating LLM Agents

Get in touch: info@arena.education