Since the release of ChatGPT in 2022, large language models (LLMs) built on the transformer architecture, such as ChatGPT, Gemini, and Claude, have transformed the world with their ability to produce high-quality, human-like text and, more recently, images and videos. Yet behind this remarkable capability lies a profound mystery: we don't understand how these models work.
The reason is that LLMs aren't built like traditional software. A traditional program is designed by human programmers and written in explicit, human-readable code. LLMs are different. Instead of being programmed, they are trained to predict the next word on vast amounts of internet text, growing a complex network of trillions of connections that enables them to understand language and perform tasks. This training process gives rise to emergent knowledge and abilities, but the resulting model is messy, complex, and incomprehensible, because training optimizes the model for performance, not for interpretability or ease of understanding.
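To make the next-word-prediction objective concrete, here is a minimal sketch of one training step for a toy transformer language model. It assumes PyTorch; the model, its sizes, and the random stand-in data are illustrative only and bear no relation to any production LLM.

```python
# Toy illustration of next-token prediction training (hypothetical model and sizes).
import torch
import torch.nn as nn

vocab_size, d_model, context = 50_000, 512, 128  # illustrative sizes only

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)          # token -> vector
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.unembed = nn.Linear(d_model, vocab_size)            # vector -> logits

    def forward(self, tokens):                                   # tokens: (batch, seq)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.block(self.embed(tokens), src_mask=mask)
        return self.unembed(h)                                   # logits over the next token

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Random token IDs stand in for a batch of real internet text.
tokens = torch.randint(0, vocab_size, (4, context + 1))
logits = model(tokens[:, :-1])                                   # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()                                                  # adjust the connections
opt.step()
```

Repeating this step over enormous text corpora is the entire "programming" process: no human ever writes the rules the model ends up using, which is why the learned connections are so hard to interpret.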
The field of mechanistic interpretability aims to study LLMs and reverse-engineer the knowledge and algorithms they use to perform tasks, a process that resembles biology or neuroscience more than computer science.
The goal of this post is to provide insights into how LLMs work using findings from the field of mechanistic interpretability.