[[[[Large Language Models (LLM)]] context length]] (0:00 – 16:12)
The importance of context length is underhyped. Throwing a bunch of tokens into context can produce improvements on [[evals]] similar to big increases in model scale.
[[sample efficiency ([[reinforcement learning]])]]: the ability to get the most out of every sample. E.g. when playing Pong, humans understand the game almost immediately, while modern reinforcement learning algorithms need something like 100,000 times more data, so they are relatively sample inefficient.
Because of large [[[[Large Language Models (LLM)]] context length]], LLMs may have more sample efficiency than we give them credit for – [[Sholto Douglas]] mentioned [[evals]] where the model learned an esoteric human language that wasn’t in its training data.
[[Sholto Douglas]] mentions a line of research suggesting [[in-context learning]] might be effectively performing [[gradient descent]] on the in-context data (link to paper). It’s basically performing a kind of meta-learning – learning how to learn.
Large [[[[Large Language Models (LLM)]] context length]] creates risks since it can effectively create a whole new model if it’s really doing [[gradient descent]] on the fly.
[[Sholto Douglas]] suggests that figuring out how to better induce this meta-learning in pre-training will be important for flexible / adaptive intelligence.
[[Sholto Douglas]] suggests that current difficulties with [[[[AI]] Agents]]' long-term planning are not due to a lack of long [[[[Large Language Models (LLM)]] context length]] – it's more about the reliability of the model. These agents chain many tasks together, and even a small per-task failure rate compounds into a large overall failure rate when you sample many times, which is why they need more [[nines of reliability]].
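To make the compounding concrete, here is a tiny back-of-the-envelope sketch (my own illustrative numbers, not from the conversation):

```python
# Back-of-the-envelope: per-step reliability compounds when an agent chains
# many sub-tasks, so even a small failure rate per call adds up.
for per_step_success in (0.90, 0.99, 0.999):
    for n_steps in (10, 50, 200):
        overall = per_step_success ** n_steps
        print(f"per-step success {per_step_success:.3f}, {n_steps:>3} steps -> overall {overall:.3f}")
```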
The idea behind the NeurIPS paper about emergence being a mirage (link) is related to this idea of [[nines of reliability]] – there's a threshold where, once you get enough nines of reliability, it looks like a sudden new capability on certain metrics, but the capability was always there to begin with; the model just wasn't reliable enough for the metric to register it. There are apparently better evals now, like [[HumanEval]] (link), with "smoother" metrics that get around this issue.
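A quick sketch of the thresholding effect (my own toy numbers, not from the paper): a smoothly improving per-step accuracy looks like a sudden "emergent" jump when the metric only rewards getting an entire multi-step answer exactly right.

```python
import numpy as np

per_step_accuracy = np.linspace(0.80, 0.999, 10)   # smooth underlying improvement
k = 40                                             # answer requires 40 correct steps
exact_match = per_step_accuracy ** k               # the "emergent-looking" metric

for p, em in zip(per_step_accuracy, exact_match):
    print(f"per-step accuracy {p:.3f} -> exact-match score {em:.3f}")
```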
In my mind this raises the question – if you have a big enough [[[[Large Language Models (LLM)]] context length]], wouldn’t you just leverage that rather than decomposing into a bunch of tasks? #[[Personal Ideas]]
No – the nature of longer tasks is that you need to break them down, and subsequent tasks depend on previous ones, so longer tasks performed by [[[[AI]] Agents]] will always need to be broken into multiple calls.
However, larger context for any given call would improve reliability. For example, with each task call to the model, you could build up a large in-context history of what the model has done to give it more context, and you could of course push in more information specific to what that task is trying to solve.
Developing [[evals]] for long-horizon tasks will be important to understand the impact and capabilities of [[[[AI]] Agents]]. [[SWE-Bench]] (link) is a small step in this direction, but resolving a GitHub issue is still a sub-hour task.
Many people point to [[quadratic attention costs]] as a reason we can't have long context windows – but there are ways around it. See this [[Gwern Branwen]] article.
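A rough sketch of why vanilla attention is quadratic in sequence length (my own illustration): every query attends to every key, so the score matrix alone has n × n entries; the workarounds avoid paying this cost naively.

```python
import numpy as np

# Count the entries in the n x n attention score matrix (per head, per layer).
for n in (1_000, 10_000, 100_000, 1_000_000):
    entries = n * n
    gib = entries * 4 / 2**30                      # float32 bytes -> GiB
    print(f"{n:>9} tokens -> {entries:.2e} scores (~{gib:,.1f} GiB)")

# A tiny concrete example that actually materializes the n x n matrix:
q = np.random.randn(8, 16)                         # 8 tokens, d_model = 16
k = np.random.randn(8, 16)
scores = q @ k.T                                   # shape (8, 8)
print(scores.shape)
```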
[[Dwarkesh Patel]] raises an interesting hypothesis: learning in context (i.e. the "forward pass", where the pre-trained model makes predictions based on input data) may be more efficient because it resembles how humans actively think about and process information as they acquire it, rather than passively absorbing it.
Not sure how far you can push the analogies to the brain – as [[Sholto Douglas]] says, birds and airplanes both achieve the same end but use very different means. However, [[Dwarkesh Patel]]’s point sounds like [[in-context learning]] may be analogous to the frontal cortex region of the brain (responsible for complex cognitive behavior, personality expression, decision-making, moderating social behavior, working memory, speech production), while the pre-trained weights (calculated in the “backward pass” that trains the model via [[backpropagation]]) are analogous to the other regions (responsible for emotional regulation and processing, sensory processing, memory storage and retrieval). See this [[ChatGPT]] conversation.
The key to these models becoming smarter is [[meta learning]], which you start to achieve once you pass a certain scale threshold of [[pre-training]] and [[[[Large Language Models (LLM)]] context length]]. This is the key difference between [[GPT-2]] and [[GPT-3]].
[[ChatGPT]] conversations related to this section of the conversation: here and here
[[Intelligence]] is just associations (16:12 – 32:35)
[[Anthropic AI]]’s way of thinking about [[transformer model]]
Think of the [[residual ([[neural net]])]] stream, as it passes through the neural network to predict the next token, like a boat floating down a river that takes in information from streams feeding into the river. That incoming information comes from the [[attention heads ([[neural net]])]] and [[multi-layer perceptron (MLP) ([[neural net]])]] parts of the model.
Maybe what's happening is that early in the stream the model processes basic, fundamental things, in the middle it adds information on 'how to solve this', and in the later stages it does the work of converting everything back into an output token.
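A minimal sketch of the "boat on a river" picture (my own toy code, not Anthropic's): the residual stream flows through the network, and each attention / MLP block reads from it and adds its contribution back in.

```python
import numpy as np

def transformer_block(residual, attn_fn, mlp_fn):
    residual = residual + attn_fn(residual)   # attention stream merges in
    residual = residual + mlp_fn(residual)    # MLP stream merges in
    return residual

# Toy stand-ins for the real sub-layers, just to make the flow runnable.
attn_fn = lambda x: 0.1 * x
mlp_fn = lambda x: np.tanh(x)

residual = np.random.randn(4, 16)             # 4 tokens, 16-dim residual stream
for _ in range(3):                            # three toy layers
    residual = transformer_block(residual, attn_fn, mlp_fn)
print(residual.shape)                         # (4, 16): same stream, enriched
```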
The [[cerebellum]] behaves kind of like this – inputs route through it but they can also go directly to the end point the cerebellum “module” contributes to – so there are indirect and direct paths where it can pick up information it wants and add it in.
The [[cerebellum]] is associated with fine motor control, but the truth is it lights up for almost any task in an [[fMRI]] scan, and 70% of your neurons are there.
[[Pentti Kanerva]] developed an associative memory algorithm ([[Sparse Distributed Memory (SDM)]]) where you store memories and later retrieve the best match, even when the query is noisy or corrupted. It turns out that if you implement this as an electrical circuit, it looks identical to the core [[cerebellum]] circuit. (Wikipedia link)
[[Trenton Bricken]] believes most intelligence is [[pattern matching]] and you can do a lot of great pattern matching with a hierarchy of [[associative memory]]
The model can go from basic low-level associations and group them together to develop higher level associations and map patterns to each other. It’s like a form of [[meta-learning]]
He doesn’t really state this explicitly, but the [[attention ([[neural net]])]] mechanism is a kind of associative memory learned by the model.
[[associative memory]] can help you denoise (e.g. recognize your friend’s face in a heavy rainstorm) but also pick up related data in a completely different space (e.g. the alphabet – seeing A points to B, which points to C, etc.)
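A minimal sketch of attention viewed as a soft associative memory lookup (my own toy, not from the conversation): a query is matched against stored keys, and the retrieval is a similarity-weighted blend of stored values. Denoising falls out naturally, since a corrupted query still lands closest to the right key.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
keys = rng.standard_normal((5, 32))            # 5 stored "memories"
values = rng.standard_normal((5, 8))           # what each memory points to

clean_query = keys[2]
noisy_query = clean_query + 0.3 * rng.standard_normal(32)  # "friend's face in the rain"

weights = softmax(noisy_query @ keys.T)        # similarity to each stored key
retrieved = weights @ values                   # blended retrieval
print(weights.round(2))                        # mass concentrates on memory #2
```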
It should be “association is all you need”, not “attention is all you need”
[[[[Intelligence]] explosion]] and great researchers (32:35 – 1:06:52)
This part of the discussion explores whether automating AI researchers can lead to an intelligence explosion in a way that economists overlook (and they are apparently the ones with the formal models on [[[[Intelligence]] explosion]])
[[compute]] is the main bounding constraint on an [[[[Intelligence]] explosion]]
To me, it's interesting that most people seem to think their job won't be fully automated. People often agree that AI could make them much more productive, but they are skeptical of the idea that it could completely automate their job. It's always other people's jobs that are supposedly fully automatable (i.e. the jobs you have much less information and context about). People tend to overlook physical constraints, social constraints, and the importance of "taste" (which I would define as a human touch or high-level human guidance to align things so they're useful to us). This is probably why economists have been able to contribute the most here – all day long they're thinking about resource constraints and the implications of those constraints. I mean, a common definition of economics is "studying the allocation of resources under constraints" #[[Personal Ideas]]
[[Sholto Douglas]] suggests the hardest part of an AI researcher's job is not writing code or coming up with ideas, but paring down the ideas and shot-calling under imperfect information. Complicating matters is the fact that things that work at small scale don't necessarily work at large scale, and that you have limited compute to test everything you dream of. Also, working in a collaborative environment where a lot of people are doing research can slow you down (lower iteration speed compared to when you can do everything yourself).
“ruthless prioritization is something which I think separates a lot of quality research from research that doesn’t necessarily succeed as much…They don’t necessarily get too attached to using a given sort of solution that they are familiar with, but rather they attack the problem directly.” – [[Sholto Douglas]]
Good researchers have good engineering skills, which enable them to try experiments really fast – their cycle time is faster, which is key to success.
[[Sholto Douglas]] suggests that really good data for [[Large Language Models (LLM)]] is data that involved a lot of reasoning to create. The key trick is somehow verifying the reasoning was correct – this is one challenge with generating [[synthetic data]] from LLMs.
[[Dwarkesh Patel]] makes an interesting comparison of human language being [[synthetic data]] that humans create, and [[Sholto Douglas]] adds that the real world is like a built-in verifier of that data. It doesn't seem like a perfect analogy, but it does occur to me that some of the best "systems" in the world have some kind of built-in verification: capitalism, the scientific method, democracy, evolution, traditions (which I think of as an evolution of memes – the good ones stick).
A question I have is to what extent [[compute]] will always be a constraint. It seems like it will obviously always be required to some extent, but I wonder what these guys think of the likelihood of some model architecture or training method that improves [[statistical efficiency]] and [[hardware efficiency]] so much that, say, you could train a GPT-4-class model on your laptop in a day.
[[superposition]] and secret communication (1:06:52 – 1:22:34)
When your data is high-dimensional and sparse (i.e. any given data point doesn’t appear very often), then your model will learn a compression strategy called [[superposition]] so it can pack more features of the world into it than it has parameters. Relevant paper from [[Anthropic AI]] here.
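An illustrative sketch of why superposition is possible (my own toy, not the paper's setup): in d dimensions you can fit many more than d nearly-orthogonal directions, so sparse features can share neurons with only mild interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 100, 1000                      # 10x more features than dimensions
features = rng.standard_normal((n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

overlaps = features @ features.T               # pairwise dot products
np.fill_diagonal(overlaps, 0.0)
print("max interference between feature directions:", np.abs(overlaps).max().round(2))
```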
This makes interpretability more difficult, since when you see a [[neuron ([[neural net]])]] firing and try to figure out what it fires for, it's confusing – it seems to fire on something like 10% of all possible inputs.
This is related to the paper that [[Trenton Bricken]] and team at [[Anthropic AI]] put out called Towards Monosemanticity, which found that if you project the activations into a higher-dimensional space and provide a sparsity penalty, you get very clean features and everything starts to make more sense.
They suggest that [[superposition]] means [[Large Language Models (LLM)]] are under-parametrized given the complexity of the task they’re being asked to perform. I don’t understand why this follows.
[[knowledge distillation]]: the process of transferring knowledge from a large model to a smaller one
Puzzle proposed by [[Gwern Branwen]] (link): [[knowledge distillation]] gives smaller models better performance – why can’t you just train these small models directly and get the same performance?
[[Sholto Douglas]] suggests it's because distilled models get to see the teacher's entire vector of probabilities for the predicted next token. In contrast, ordinary training just gives you a one-hot encoded vector of what the next token should have been, so the distilled model gets more information or "signal" per example.
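A minimal sketch of the extra "signal" in distillation (my own toy example): the student is trained against the teacher's full next-token distribution rather than just a one-hot correct answer.

```python
import numpy as np

def cross_entropy(target_probs, student_log_probs):
    return -(target_probs * student_log_probs).sum()

teacher_probs = np.array([0.70, 0.15, 0.10, 0.04, 0.01])  # soft targets over a 5-token vocab
one_hot = np.array([1.0, 0.0, 0.0, 0.0, 0.0])             # ordinary training target

student_logits = np.random.randn(5)
student_log_probs = student_logits - np.log(np.exp(student_logits).sum())

print("loss vs one-hot label:   ", round(cross_entropy(one_hot, student_log_probs), 3))
print("loss vs teacher's probs: ", round(cross_entropy(teacher_probs, student_log_probs), 3))
```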
In my mind, it’s not surprising that [[knowledge distillation]] would be more efficient given a certain amount of training resources. But do researchers find it’s better given any amount of training for the smaller model? That seems much less intuitive, and if that’s the case, what is the information being “sent” to the smaller model that can’t be found through longer training?
[[adaptive compute]]: spending more cycles thinking about a problem if it is harder. How is it possible to do this with [[Large Language Models (LLM)]]? The forward pass always does the same compute, but perhaps [[chain-of-thought (CoT)]] or similar methods are kind of like adaptive compute since they effectively produce more forward passes.
[[chain-of-thought (CoT)]] has been shown to have some strange behaviour, such as giving the right answer even when the chain-of-thought reasoning is patently wrong, or giving the wrong answer it was trained to give and then providing a plausible-sounding but wrong explanation. E.g. this paper and this paper
[[[[AI]] Agents]] and true reasoning (1:22:34 – 1:34:40)
[[Dwarkesh Patel]] raised question of whether agents communicating via text is the most efficient method – perhaps they should share [[residual ([[neural net]])]] streams.
[[Trenton Bricken]] suggests a good half-way measure would be using features you learn from [[Sparse Dictionary Learning (SDL)]] – more internal access but also more human interpretable.
Will the future of [[[[AI]] Agents]] be really long [[[[Large Language Models (LLM)]] context length]] with “[[adaptive compute]]” or instead will it be multiple copies of agents taking on specialized tasks and talking to one another? Big context or [[division of labour]]?
[[Sholto Douglas]] leans towards more agents talking to each other, at least in the near term. He emphasizes that it would help with interpretability and trust. [[Trenton Bricken]] mentions cost benefits as well, since individual agents could be smaller and [[fine-tuning]] them on their specialized tasks keeps them accurate.
Maybe in the long run the dream of [[reinforcement learning]] will be fulfilled – provide a very sparse signal and over enough iterations [[[[AI]] Agents]] learn from it. But in the shorter run, these will require a lot of work from humans around the machines to make sure they’re doing what we want.
[[Dwarkesh Patel]] wonders whether language is actually a very good representation of ideas as it has evolved to optimize human learning. [[Sholto Douglas]] adds that compared to “next token prediction”, which is a simple representation, representations in [[machine vision]] are more difficult to get right.
Some evidence suggests [[fine-tuning]] a model on generalized tasks like math, instruction following, or code generation enhances language models’ performance on a range of other tasks.
This raises the question in my mind – are there other tasks we want [[Large Language Models (LLM)]] to do where we might achieve better results by [[fine-tuning]] on a seemingly unrelated area? Like if we want a model to get better at engineering, should we fine-tune on constructing a Lego set, since so many engineers seem to have played with Lego as kids? What do real-world empirics about learning and performance tell us about where we should be fine-tuning? #[[Personal Ideas]]
How [[Sholto Douglas]] and [[Trenton Bricken]] got into [[AI]] research (1:34:40 – 2:07:16) #[[Career]]
[[Trenton Bricken]] has had significant success in interpretability, contributing to very important research despite having been at [[Anthropic AI]] for only 1.5 years. He attributes this success to [[luck]], the ability to quickly pull together and test existing research ideas already lying around, headstrongness, willingness to push through when blocked where others would give up, and willingness to change direction.
[[Sholto Douglas]] agrees with those qualities for success (hard work, agency, pushing), but also adds that he’s benefited from being good at picking extremely high-leverage problems.
In organizations you need people who care and take direct responsibility to get things done. This is often why projects fail – nobody quite cares enough. This is one purpose of consulting firms like [[McKinsey]] ([[Sholto Douglas]] started there) – they allow you to "hire" people you wouldn't otherwise be able to for a short window during which they can push through problems. Consultants are also given direct responsibility, which speaks to his first point.
[[Sholto Douglas]] also hustled – he worked from 10pm–2am on weeknights and 6–8 hours a day on weekends on research and coding projects. [[James Bradbury]] (who was at [[Google]] but is now at [[Anthropic AI]]) saw [[Sholto Douglas]] asking questions online that he thought only he was interested in, saw some robotics stuff on his blog, and then reached out to see if he wanted to work there. "Manufacture luck"
Another advantage of the fairly broad reading / studying he was doing was that it gave him the ability to see patterns across different subfields that you wouldn't get by specializing in just, say, [[Natural Language Processing]].
One lesson here, emphasized by [[Dwarkesh Patel]], is that the world is not legible and efficient. You shouldn't just go to jobs.google.com or whatever and assume you'll be evaluated well. There are other, better ways to put yourself in front of people, and you should leverage them. This seems particularly valuable if you don't have a "standard" background or don't look great on paper with degrees from Stanford or whatever. Put yourself out there and demonstrate you can do something at a world-class level.
This is what [[Andy Jones]] from [[Anthropic AI]] did with a paper on scaling laws and board games – when he published this, both Anthropic and [[OpenAi]] desperately wanted to hire him.
“The system is not your friend. It’s not necessarily actively against you or your sworn enemy. It’s just not looking out for you. So that’s where a lot of proactiveness comes in. There are no adults in the room and you have to come to some decision for what you want your life to look like and execute on it.” -[[Trenton Bricken]]
“it’s amazing how quickly you can become world-class at something. Most people aren’t trying that hard and are only working the actual 20 hours or something that they’re spending on this thing. So if you just go ham, then you can get really far, pretty fast” – [[Trenton Bricken]]
Are [[features]] the wrong way to think about [[intelligence]] (2:07:16 – 2:21:12)
[[Dwarkesh Patel]] and [[Trenton Bricken]] explore what a feature is in these large neural networks. A “feature” in a standard logistic regression model is quite clear and explicit – it’s just one of the terms in the regression.
[[ChatGPT]] provides a good answer here that helps resolve the confusion in my mind. It still makes sense to think of the model in terms of features, except in a [[neural net]], the features are learned rather than being explicitly specified. Each layer in a neural net can learn an increasingly complex and abstract set of features.
What would be the standard where we can say we “understand” a model’s output and the reasons it did what it did, ensuring it was not doing anything duplicitous?
You need to find features for the model at each level (including the attention heads, the residual stream, and the MLP layers), and hopefully identify broader general reasoning circuits. To avoid deceptive behaviour, you could flag features that correspond to that kind of behaviour.
Relevant [[ChatGPT]] conversations: here, here, and here
Will [[[[AI]] interpretability]] actually work on superhuman models (2:21:12 – 2:45:05)
One great benefit of these [[Large Language Models (LLM)]] in terms of interpretability is they are deterministic, or you can at least make them deterministic. It’s like this alien brain you can operate on by ablating any part of it you want. If it does something “superhuman”, you should be able to decompose it into smaller spaces that are understandable, kind of like how you can understand superhuman chess moves.
Essentially, [[Trenton Bricken]] is hopeful that we can identify “bad” or “deceptive” circuits in [[Large Language Models (LLM)]] and essentially lobotomize them in those areas.
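A minimal sketch (my own toy, not Anthropic's actual method) of the "find the bad circuit and ablate it" idea: project a suspect feature direction out of the activations and see how the model's behaviour changes.

```python
import numpy as np

def ablate_direction(activations, direction):
    # Remove the component of each activation vector along the suspect direction.
    direction = direction / np.linalg.norm(direction)
    return activations - (activations @ direction)[:, None] * direction

activations = np.random.randn(10, 64)          # 10 tokens, 64-dim hidden state
suspect_feature = np.random.randn(64)          # direction flagged as "deceptive" (hypothetical)

ablated = ablate_direction(activations, suspect_feature)
print(np.abs(ablated @ suspect_feature).max()) # ~0: the feature has been removed
```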
One interesting way he suggests of doing this is fine-tuning a model to have bad behaviour, and then use this bad model to identify the parts of the feature space that have changed.
Similar features have been found across different models. E.g. there are [[Base64]]-related features that are very common and that fire on and model Base64-encoded text (common in URLs).
Similarity is measured using [[cosine similarity]] – a measure of similarity between two non-zero vectors defined in an inner product space. It takes a value in [-1, 1], where -1 represents vectors pointing in opposite directions, 0 represents orthogonal vectors, and 1 represents vectors pointing in the same direction. Formula: cos(θ) = (A · B) / (||A|| ||B||)
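The same formula in code, applied to two example feature vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))    # 1.0: same direction
print(cosine_similarity(a, -b))   # -1.0: opposite direction
```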
[[curriculum learning]]: training a model in a meaningful order from easy examples to hard examples, mimicking how human beings learn. This paper is a survey on this method – it seems to come with challenges and it’s unclear whether it’s currently used much to train models, but it’s a plausible avenue for future models to use to improve training.
[[feature splitting]]: a model tends to learn however many features it has capacity for that still span the space of representations. E.g. basic models will learn a "bird" feature, while bigger models learn features for different types of birds. "Oftentimes, there's the bird vector that points in one direction and all the other specific types of birds point in a similar region of the space but are obviously more specific than the coarse label."
The models seem to learn [[hierarchy]] – which is a powerful model for understanding reality and organizing a bunch of information so it is sensible and easily accessible. #[[Personal Ideas]]
[[Trenton Bricken]] makes the distinction between the [[weights ([[neural net]])]] that represent the trained, fixed parameters of the model and the [[activations ([[neural net]])]] which represents the actual results from making a specific call. [[Sholto Douglas]] makes the analogy that the weights are like the actual connection scheme between neurons, and the activations are the current neurons lighting up on a given call to the model. [[Trenton Bricken]] says “The dream is that we can kind of bootstrap towards actually making sense of the weights of the model that are independent of the activations of the data”.
[[Trenton Bricken]]’s work on [[[[AI]] interpretability]] uses a sparse autoencoding method which is unsupervised and projects the data into a wider space of features with more detail to see what is happening in the model. You first feed the trained model a bunch of inputs and get [[activations ([[neural net]])]], then you project into a higher dimensional space.
The amount of detail you can resolve in the features is governed by the [[expansion factor ([[neural net]])]], which is how many times larger the dimensionality of the space you're projecting to is compared to the original space. E.g. if you have 1,000 neurons and project to a 2,000-dimensional space, the expansion factor is 2. The number of features you "see" in the projected space depends on the size of this expansion factor.
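A minimal sketch of the sparse-autoencoder idea described above (shapes, initialization, and penalty are illustrative, not the exact setup from the paper): project activations into a wider feature space, encourage sparsity, and reconstruct the original activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, expansion_factor = 1000, 2
n_features = n_neurons * expansion_factor       # 2000-dim feature space

W_enc = rng.standard_normal((n_neurons, n_features)) * 0.01
W_dec = rng.standard_normal((n_features, n_neurons)) * 0.01

def sae_forward(activations, l1_coeff=1e-3):
    features = np.maximum(activations @ W_enc, 0.0)         # ReLU -> sparse features
    reconstruction = features @ W_dec
    mse = ((reconstruction - activations) ** 2).mean()
    sparsity_penalty = l1_coeff * np.abs(features).sum()
    return reconstruction, features, mse + sparsity_penalty

acts = rng.standard_normal((8, n_neurons))      # activations from 8 example tokens
_, feats, loss = sae_forward(acts)
print(feats.shape, round(float(loss), 3))       # (8, 2000) feature activations
```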
A [[neuron ([[neural net]])]] can be polysemantic, meaning it can represent multiple meanings or functions simultaneously. This polysemanticity arises because of "[[superposition]]", where multiple informational contents are superimposed within the same neuron or set of neurons. [[Trenton Bricken]] mentions that if you only look at individual neurons without considering their polysemantic nature, you can miss how they might code for multiple features due to this superposition. Disentangling this might be the key to understanding the "role" of the experts in the [[mixture of experts (MoE)]] architecture used in the recent [[Mistral ([[AI]] company)]] model – the Mistral team could not determine the "role" of the experts themselves, so it's an open question.
[[Sholto Douglas]] challenge for the audience (2:45:05 – 3:03:57)
A good research project [[Sholto Douglas]] challenges the audience with is to disentangle the neurons and determine the roles of the [[mixture of experts (MoE)]] model by [[Mistral ([[AI]] company)]], which is open source. There is a good chance there is something to discover here, since image models such as [[AlexNet]] have been found to have specialization that you can clearly identify.
Rapid Fire (3:03:57 – 3:11:51)
One rather disappointing point they make is that a lot of the cutting-edge research on issues like multimodality, long context, agents, and reliability is probably not being published if it works well. So published papers are not necessarily a great way to get to the cutting edge. This raises the question – how do you get to the cutting edge without working inside one of these tech companies?
[[Sholto Douglas]] mentions that academia and others outside the inner circle should work more on [[[[AI]] interpretability]], which is legible from the outside, and places like [[Anthropic AI]] publish all their research in this area. It also typically doesn't require a ridiculous amount of resources.