What is a GPT? Visual intro to transformers

The initials GPT stand for Generative Pre-trained Transformer. The first word is clear enough: these are bots that generate new text. Pre-trained refers to how the model goes through a learning process on a huge amount of data, and the prefix hints that there is room to fine-tune it on specific tasks with additional training.

But that final word is the real centerpiece. A transformer is a specific kind of neural network, a machine learning model, and it is the core invention underlying the current boom in artificial intelligence. What I want to do with this video and the following chapters is give a visual explanation of what actually happens inside a transformer.

We will follow the data flowing through it step by step. There are many different kinds of models you can build using transformers. Some take in audio and produce a transcript. This sentence comes from a model going the other way around, producing synthetic speech from text alone. All those tools that took the world by storm in 2022, like DALL·E and Midjourney, which take in a text description and produce an image, are based on transformers.
Even if I don't quite understand what a pi creature is supposed to be, I'm still amazed that this kind of thing is even remotely possible. The original transformer, introduced by Google in 2017, was invented for the specific use case of translating text from one language to another. But the variant you and I will focus on, the kind underlying tools like ChatGPT, is a model trained to take in a piece of text, maybe even with some surrounding images or sound accompanying it, and produce a prediction for what comes next in the passage.
That prediction takes the form of a probability distribution over the many different chunks of text that might follow. At first glance, you might think that predicting the next word is a very different goal from generating new text. But once you have a prediction model like this, a simple way to generate a longer piece of text is to give it an initial snippet to work with, have it take a random sample from the distribution it just generated, append that sample to the text, and then run the whole process again
to make a new prediction based on all the new text, including what was just added. I don't know about you, but it really doesn't feel like this should work. In this animation, for example, I'm running GPT-2 on my laptop and having it repeatedly predict and sample the next chunk of text to generate a story based on a seed text.
The story doesn't really make that much sense. But if you swap that out for API calls to GPT-3 instead, which is the same basic model, just much larger, then suddenly, almost magically, we get a sensible story, one that even seems to infer that a pi creature would live in a land of math and computation.
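If you want to play with this predict-sample-append loop yourself, here is a minimal sketch. It assumes the Hugging Face transformers and torch packages and the small GPT-2 checkpoint; this is my choice for illustration, not necessarily the exact setup behind the animation.

```python
# A minimal sketch of the predict-sample-append loop using the small GPT-2
# model from the Hugging Face `transformers` library (an assumption; the
# animation's exact setup isn't shown here).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Once upon a time there was a"
for _ in range(50):                              # generate 50 tokens, one at a time
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the next token only
    probs = torch.softmax(logits, dim=-1)        # normalize into a distribution
    next_id = torch.multinomial(probs, 1)        # sample from that distribution
    text += tokenizer.decode(next_id)            # append the sample and repeat
print(text)
```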
This process of iterative prediction and sampling is essentially what is happening when you interact with ChatGPT or any of these other large language models and see them produce one word at a time. In fact, one feature I would very much enjoy is the ability to see the underlying distribution for each new word it chooses.
Let's kick things off with a high-level preview of how data flows through a transformer. We will spend much more time motivating, explaining, and expanding on the details of each step, but broadly speaking, when one of these chatbots generates a given word, here's what's going on under the hood. First, the input is broken up into a bunch of little pieces.
These pieces are called tokens, and in the case of text they tend to be words, little pieces of words, or other common character combinations. If images or sound are involved, tokens can be little patches of that image or little chunks of that sound. Each of these tokens is then associated with a vector, meaning some list of numbers, which is meant to somehow encode the meaning of that piece.
If you think of these vectors as giving coordinates in some very high-dimensional space, then words with similar meanings tend to land on vectors close to each other in that space. This sequence of vectors then passes through an operation known as an attention block, which allows the vectors to talk to each other and pass information back and forth to update their values.
For example, the meaning of the word model in the phrase "machine learning model" is different from its meaning in the phrase "fashion model". The attention block is responsible for figuring out which words in the context are relevant to updating the meanings of which other words, and exactly how those meanings should be updated.

And again, whenever I say meaning, that is somehow entirely encoded in the entries of those vectors. These vectors then pass through a different kind of operation, and depending on which source you read, this will be referred to as a multilayer perceptron or perhaps a feed-forward layer.
Here the vectors don't talk to each other; they all go through the same operation in parallel. While this block is a bit harder to interpret, we'll talk later about how the step is a little like asking a long list of questions about each vector and then updating it based on the answers to those questions.
All of the operations in both blocks look like a giant pile of matrix multiplications, and our primary job will be to understand how to read the underlying matrices. I'm glossing over some details about the normalization steps that happen in between, but this is, after all, a high-level preview. After that, the process essentially repeats, going back and forth between attention blocks and multilayer perceptron blocks, until, at the very end, the hope is that all of the essential meaning of the passage has somehow been baked into the very last vector in the sequence.
We then perform a certain operation on that last vector that produces a probability distribution over all possible tokens, all the possible little chunks of text that might come next. And as I said, once you have a tool that predicts what comes next given a snippet of text, you can feed it a bit of seed text and have it repeatedly play this game of predicting what comes next, sampling from the distribution, appending it, and repeating over and over.
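To summarize that preview as pseudocode, here is a schematic sketch with toy sizes (the real GPT-3 values are noted in the comments). The helper names attention_block and mlp_block are hypothetical placeholders for the operations covered in later chapters, not real implementations.

```python
# Schematic sketch of the data flow just described; not a real implementation.
# `attention_block` and `mlp_block` are hypothetical placeholders.
import numpy as np

d_model, n_vocab = 48, 1000        # toy sizes; GPT-3 uses 12,288 and 50,257

W_E = np.random.randn(d_model, n_vocab) * 0.02   # embedding matrix (learned)
W_U = np.random.randn(n_vocab, d_model) * 0.02   # unembedding matrix (learned)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(token_ids, attention_block, mlp_block, n_layers=4):
    X = W_E[:, token_ids]              # one column per token: (d_model, n_tokens)
    for _ in range(n_layers):          # alternate attention and MLP blocks
        X = attention_block(X)         # vectors exchange information
        X = mlp_block(X)               # each vector updated in parallel
    logits = W_U @ X[:, -1]            # map the last vector to one score per token
    return softmax(logits)             # probability distribution over the next token
```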
Some of you may remember how, long before ChatGPT came on the scene, this is what early demos of GPT-3 looked like: you could have it auto-complete stories and essays based on a seed snippet. To turn a tool like this into a chatbot, the easiest starting point is to lay down a bit of text that establishes the setting of a user interacting with a helpful AI assistant, which is what you might call a system prompt, then use the user's initial question or prompt as the first bit of
the dialogue, and from there have it start predicting what such a helpful AI assistant would say in response. There's more to be said about the training steps required to make this work well, but at a high level, that's the idea. In this chapter, you and I are going to dig into what happens at the very beginning of the network and at the very end, and I also want to spend a lot of time reviewing some important pieces of background knowledge, things that would have been second nature to any machine learning engineer by the time transformers came around.
If that background knowledge is comfortable to you and you're a little impatient, feel free to skip to the next chapter, which focuses on the attention blocks, generally considered the heart of the transformer. After that, I want to talk more about the multilayer perceptron blocks, how training works, and a number of other details that will have been skipped up to that point.
For broader context, these videos are additions to a mini-series about deep learning, and it's okay if you haven't watched the previous ones, I think you can do it out of order, but before diving into transformers specifically, I think it's worth making sure we're on the same page about the basic premise and structure of deep learning.
At the risk of stating the obvious, this is one approach to machine learning, which describes any model where you use data to somehow determine how the model behaves. What I mean by that is, say you want a function that takes in an image and produces a label describing it, or our example of predicting the next word given a passage of text, or any other task that seems to require some element of intuition and pattern recognition.
We take this for granted these days, but the idea with machine learning is that rather than trying to explicitly define a procedure for how to do that task in code, which is what people would have done in the earliest days of AI, you instead set up a very flexible structure with tunable parameters, like a bunch of knobs and dials, and then somehow use many examples of what the output should look like for a given input to tweak and tune the values of those parameters to mimic that behavior.
For example, maybe the simplest form of machine learning is linear regression, where the inputs and outputs are each single numbers, something like the square footage of a house and its price, and what you want is to find a line of best fit through that data, you know, to predict future house prices.
That line is described by two continuous parameters, say the slope and the y-intercept, and the goal of linear regression is to determine those parameters to closely match the data. Needless to say, deep learning models get much more complicated. GPT-3, for example, has not two, but 175 billion parameters.
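As a concrete illustration of that two-parameter setup, here is a minimal sketch with made-up square-footage and price numbers.

```python
# Linear regression as the simplest "tunable parameters" example: two parameters,
# a slope and an intercept, fit to (square footage, price) data. Numbers are made up.
import numpy as np

sqft  = np.array([ 850, 1200, 1500, 1900, 2300], dtype=float)
price = np.array([150., 200.,  240.,  300.,  360.])     # in thousands, hypothetical

slope, intercept = np.polyfit(sqft, price, deg=1)        # least-squares best fit
predict = lambda x: slope * x + intercept
print(predict(1700))                                     # predicted price for a 1700 sq ft house
```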

But here's the thing: it's not a given that you can create some giant model with a huge number of parameters without it either grossly overfitting the training data or being completely intractable to train. Deep learning describes a class of models that, over the last couple of decades, have proven to scale remarkably well.
What unifies them is that they all use the same training algorithm, called backpropagation, and the context I want you to have as we go is that for this training algorithm to work well at scale, the models have to follow a certain specific format. If you know this format going in, it helps explain many of the choices for how a transformer processes language, which otherwise run the risk of feeling arbitrary.
First, whatever model you're making, the input has to be formatted as an array of real numbers. This could simply mean a list of numbers, or it could be a two-dimensional array, or very often you deal with higher-dimensional arrays, where the general term used is tensor. You often think of the input data as being progressively transformed into many distinct layers, where each layer is always structured as some kind of array of real numbers, until you get to a final layer, which you consider the output.
For example, the final layer in our text-processing model is a list of numbers representing the probability distribution over all possible next tokens. In deep learning, these model parameters are almost always referred to as weights, because a key feature of these models is that the only way the parameters interact with the data being processed is through weighted sums.
You can also sprinkle some nonlinear functions throughout, but they won't depend on the parameters. Typically, though, instead of seeing the weighted sums all naked and written out explicitly like this, you'll find them packaged together as various components in a matrix-vector product. It amounts to saying the same thing: if you think back to how matrix-vector multiplication works, each component in the output looks like a weighted sum.
It's often just conceptually cleaner for you and me to think about matrices filled with tunable parameters that transform vectors drawn from the data being processed. For example, those 175 billion weights in GPT-3 are organized into just under 28,000 distinct matrices. Those matrices in turn fall into eight different categories, and what you and I are going to do is step through each one of those categories to understand what that type does.
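To make the point about weighted sums packaged as a matrix-vector product concrete, here is a tiny example with made-up numbers.

```python
# Each component of a matrix-vector product is a weighted sum of the input:
# the rows of W hold the tunable weights, v holds the data being processed.
import numpy as np

W = np.array([[0.2, -1.0, 0.5],
              [1.5,  0.0, 2.0]])      # weights (learned during training)
v = np.array([3.0, 1.0, -2.0])        # data vector being processed

out = W @ v                           # matrix-vector product
# The same thing written out as explicit weighted sums:
out_manual = np.array([sum(W[i, j] * v[j] for j in range(3)) for i in range(2)])
assert np.allclose(out, out_manual)
```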
As we go along, I think it's fun to reference the specific numbers from GPT-3 to count up exactly where those 175 billion come from. Even if nowadays there are bigger and better models, this one has a certain charm as the first large language model to really capture the world's attention outside of machine learning communities.
Also, practically speaking, companies tend to keep the specific numbers for more modern networks closer to the chest. I just want to set the scene: as you peek under the hood of a tool like ChatGPT, almost all of the actual computation looks like matrix-vector multiplication. There's a bit of a risk of getting lost in the sea of billions of numbers, but you should draw a very sharp distinction in your mind between the weights of the model, which I'll always color in blue or red, and the data being processed, which I'll always color in gray.
The weights are the actual brains, they are the things learned during training, and they determine how the model behaves. The data being processed simply encodes whatever specific input is fed into the model for a given run, such as a snippet of text. With all of that as foundation, let's dig into the first step of this text-processing example, which is to break up the input into little chunks and turn those chunks into vectors.
I mentioned how these chunks are called tokens, which might be pieces of words or punctuation marks, but every now and then in this chapter, and especially in the next one, I'd like to just pretend that they are cleanly divided into words. Because we humans think in words, this will make it much easier to reference little examples and clarify each step.
The model has a predefined vocabulary, some list of all possible words, say 50,000 of them, and the first matrix we will encounter, known as the embedding matrix, has a single column for each one of these words. These columns are what determine the vector that each word turns into in that first step. We label it W_E, and like all the matrices we'll see, its values begin random, but they are going to be learned based on data.
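Here is a toy sketch of that first step, with a made-up vocabulary and a small embedding dimension standing in for the real sizes.

```python
# Sketch of the embedding step: W_E has one learned column per vocabulary word,
# and "embedding a token" just means reading off its column. The vocabulary and
# sizes here are made up for illustration; GPT-3 uses 50,257 tokens and 12,288 dims.
import numpy as np

vocab = ["the", "king", "queen", "lived", "in", "scotland"]   # tiny toy vocabulary
d_embed = 4
W_E = np.random.randn(d_embed, len(vocab))                    # starts random, then learned

token_ids = [vocab.index(w) for w in ["the", "king", "lived", "in", "scotland"]]
embeddings = W_E[:, token_ids]        # one column (vector) per token in the input
print(embeddings.shape)               # (4, 5): d_embed rows, one column per token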
Turning words into vectors was common practice in machine learning long before transformers, but it's a little strange if you've never seen it before, and it sets the foundation for everything that follows, so let's take a moment to get familiar with it. We often call this a word embedding, which invites you to think of these vectors very geometrically, as points in some high-dimensional space.
Visualizing a list of three numbers as coordinates for a point in 3D space would be no problem, but word embeddings tend to be much, much higher dimensional. In GPT-3 they have 12,288 dimensions, and as you'll see, it matters to work in a space that has a lot of distinct directions. In the same way that you could take a two-dimensional slice through 3D space and project all the points onto that slice, for the sake of animating the word embeddings that a simple model gives me, I'll do something analogous by choosing a three-dimensional slice
through this very high-dimensional space, projecting the word vectors down onto it, and displaying the results. The big idea here is that as a model tweaks and tunes its weights to determine how exactly words get embedded as vectors during training, it tends to settle on a set of embeddings where directions in the space carry some kind of semantic meaning.
For the simple word-to-vector model I'm running here, if you search for all the words whose embeddings are closest to that of tower, you'll notice how they all seem to give very similar tower-ish vibes. And if you want to pull up some Python and play along at home, this is the specific model I'm using to make the animations.
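The post doesn't spell out that model here, so as a stand-in, here is a sketch of the nearest-neighbor game using pretrained vectors from gensim's downloader; any word2vec-style embedding shows the same effect.

```python
# The "find nearest neighbors" game with pretrained word vectors from gensim's
# downloader. This is an illustrative stand-in, not necessarily the exact model
# used for the animations.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")    # downloads a set of pretrained vectors
print(wv.most_similar("tower", topn=5))     # words whose vectors sit closest to "tower"
```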


It's not a transformer, but it's enough to illustrate the idea that directions in the space can carry semantic meaning. A classic example of this is that if you take the difference between the vectors for woman and man, something you can picture as a little arrow connecting the tip of one to the tip of the other, it's very similar to the difference between king and queen.
So say you didn't know the word for a female monarch; you could find it by taking king, adding this woman-minus-man direction, and searching for the embedding closest to that point. At least, sort of. Despite being a classic example, for the model I'm playing with, the true embedding of queen is actually a little farther off than that would suggest, presumably because the way queen is used in the training data is not merely a feminine version of king.
When I poked around, family relations seemed to illustrate the idea much better. The point is, it looks like during training the model found it advantageous to choose embeddings such that one direction in this space encodes gender information. Another example is that if you take the embedding of Italy, subtract the embedding of Germany, and add that to the embedding of Hitler, you get something very close to the embedding of Mussolini.
It's as if the model learned to associate some directions with Italian-ness, and others with the WWII Axis leaders. Maybe my favorite example in this vein is how, in some models, if you take the difference between Germany and Japan and add it to sushi, you end up very close to a hot dog.
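To try the direction arithmetic yourself, here is a sketch using the same kind of pretrained gensim vectors as above; which words come out on top depends on the particular embedding model, so treat the exact results as illustrative.

```python
# Direction arithmetic with pretrained word vectors (an illustrative stand-in,
# not the exact model from the animations; results vary by embedding model).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")
# king + (woman - man) should land near queen:
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```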
Also, while playing this find-the-nearest-neighbor game, I was pleased to see how close cat was to both beast and monster. One bit of mathematical intuition that's useful to have in mind, especially for the next chapter, is how the dot product of two vectors can be thought of as a way to measure how well they align.
Computationally, dot products involve multiplying all the corresponding components and then adding the results, which is good, since so much of our computation has to look like weighted sums. Geometrically, the dot product is positive when vectors point in similar directions, zero if they are perpendicular, and negative when they point in opposite directions.
For example, say you were playing with this model, and you hypothesized that the embedding of cats minus the embedding of cat might represent a sort of plurality direction in this space. To test this, I'll take that vector and compute its dot product against the embeddings of certain singular nouns, and compare that to the dot products with the corresponding plural nouns.
If you play around with this, you'll notice that the plural nouns consistently give higher values than the singular ones, indicating they align more with this direction. It's also fun that if you take this dot product with the embeddings of one, two, three, and so on, they give increasing values, so it's as if we can quantitatively measure how plural the model finds a given word.
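Here is what that dot-product test might look like in code, again with pretrained gensim vectors standing in for the model behind the animations.

```python
# Testing the hypothetical "plurality direction" with dot products, using
# pretrained gensim vectors as an illustrative stand-in.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")
plurality = wv["cats"] - wv["cat"]            # candidate plurality direction

for singular, plural in [("dog", "dogs"), ("house", "houses"), ("student", "students")]:
    print(singular, np.dot(plurality, wv[singular]),
          plural,   np.dot(plurality, wv[plural]))   # plural should score higher
```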
Again, the specifics of how words are embedded are learned from data. This embedding matrix, whose columns tell us what happens to each word, is the first pile of weights in our model. Using the GPT-3 numbers, the vocabulary size is specifically 50,257, and again, technically this consists not of words per se, but of tokens.
The embedding dimension is 12,288, and multiplying those tells us this consists of about 617 million weights. Let's go ahead and add this to a running tally, remembering that by the end we should count up to 175 billion. In the case of transformers, you really want to think of the vectors in this embedding space as not merely representing individual words.
For one thing, they also encode information about the position of that word, which we'll talk about later, but more importantly, you should think of them as having the capacity to soak in context. A vector that started its life as the embedding of the word "king", for example, might progressively get tugged and pulled by the various blocks in this network, so that by the end it points in a much more specific and nuanced direction that somehow encodes that
it was a king who lived in Scotland, who came to power after murdering the previous king, and who is being described in Shakespearean language. Think about your own understanding of a given word. The meaning of that word is clearly informed by the surroundings, and sometimes that includes context from a long distance away, so in putting together a model that has the ability to predict what word comes next, the goal is to somehow empower it to incorporate context efficiently.
To be clear, in that very first step, when you create the array of vectors based on the input text, each one is simply plucked out of the embedding matrix, so initially each one can only encode the meaning of a single word without any input from its surroundings. But you should think of the primary goal of the network it flows through as being to enable each one of those vectors to soak up a meaning that is much richer and more specific than what mere individual words could represent.
The network can only process a fixed number of vectors at a time, known as its context size. GPT-3 was trained with a context size of 2048, so the data flowing through the network always looks like this array of 2048 columns, each of which has 12,288 dimensions. This context size limits how much text the transformer can incorporate when it's predicting the next word.
This is why long conversations with certain chatbots, like the early versions of ChatGPT, often gave the feeling of the bot losing the thread of the conversation as you continued too long. We'll go into the details of attention in due time, but skipping ahead, I want to talk for a minute about what happens at the very end.
Remember, the desired output is a probability distribution over all tokens that might come next. For example, if the very last word is Professor, and the context includes words like Harry Potter, and immediately preceding we see least favorite teacher, and also if you give me some leeway by letting me pretend that tokens simply look like full words, then a well-trained network that had built up knowledge of Harry Potter would presumably assign a high number to the word Snape.
This involves two different steps. The first is to use another matrix that maps the very last vector in that context to a list of 50,000 values, one for each token in the vocabulary. Then there's a function that normalizes this into a probability distribution; it's called Softmax and we'll talk more about it in just a second. But before that, it might seem a little bit weird to only use this last embedding to make a prediction, when after all, in that last step, there are thousands of other vectors in the layer just sitting there with their own context-rich meanings.
This has to do with the fact that in the training process it turns out to be much more efficient if you use each one of those vectors in the final layer to simultaneously predict what would come immediately after it. There's a lot more to be said about training later on, but I just want to call that out now. This matrix is called the Unembedding matrix, and we give it the label W_U.
Again, like all the weight matrices we see, its entries begin random, but they are learned during the training process. Keeping score on our total parameter count, this unembedding matrix has one row for each word in the vocabulary, and each row has the same number of elements as the embedding dimension.
It's very similar to the embedding matrix, just with the order swapped, so it adds another 617 million parameters to the network, meaning our count so far is a little over a billion, a small but not wholly insignificant fraction of the 175 billion we'll end up with in total.
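For the running tally, the arithmetic is simple enough to check directly:

```python
# Keeping the running tally from this chapter: the embedding and unembedding
# matrices together account for a bit over a billion of GPT-3's 175 billion weights.
n_vocab, d_embed = 50_257, 12_288

embedding_params   = n_vocab * d_embed     # 617,558,016
unembedding_params = n_vocab * d_embed     # same shape, just with the order swapped
print(embedding_params + unembedding_params)   # 1,235,116,032, about 1.2 billion
```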
As the final mini-lesson for this chapter, I want to talk more about this softmax function, since it makes another appearance once we dive into the attention blocks. The idea is that if you want a sequence of numbers to act as a probability distribution, say a distribution over all possible next words, then each value has to be between 0 and 1, and you also need all of them to add up to 1.
However, if you're playing the deep learning game where everything you do looks like matrix-vector multiplication, the outputs you get by default don't abide by this at all. The values are often negative, or much bigger than 1, and they almost certainly don't add up to 1. Softmax is the standard way to turn an arbitrary list of numbers into a valid distribution in such a way that the largest values end up closest to 1, and the smaller values end up very close to 0.
That's all you really need to know. But if you're curious, the way it works is to first raise e to the power of each of the numbers, which means you now have a list of positive values, and then you take the sum of all those positive values and divide each term by that sum, which normalizes it into a list that adds up to 1.
You'll notice that if one of the numbers in the input is meaningfully bigger than the rest, then in the output the corresponding term dominates the distribution, so if you were sampling from it you'd almost certainly just be picking the maximizing input. But it's softer than just picking the max, in the sense that when other values are similarly large, they also get meaningful weight in the distribution, and everything changes continuously as you continuously vary the inputs.
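Here is that recipe written out directly; subtracting the max first is a standard numerical-stability trick that doesn't change the result.

```python
# A direct implementation of softmax as just described: exponentiate, then divide
# by the sum so the entries form a valid probability distribution.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))      # e^x for each entry (shifted for stability)
    return e / e.sum()             # normalize so the entries add up to 1

scores = np.array([2.0, -1.0, 4.5, 0.3])        # raw outputs: negative, bigger than 1, etc.
print(softmax(scores), softmax(scores).sum())   # a valid distribution, summing to 1
```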
In some situations, like when ChatGPT is using this distribution to generate a next word, there's room for a little extra fun by adding a little extra spice to this function, with a constant T thrown into the denominator of those exponents. We call it the temperature, since it vaguely resembles the role of temperature in certain thermodynamics equations, and the effect is that when T is larger, you give more weight to the lower values, meaning the distribution is a little more uniform, and if T is smaller, then the bigger values will dominate
more aggressively, where in the extreme, setting T equal to zero means all of the weight goes to the maximum value. For example, I'll have GPT-3 generate a story with the seed text "once upon a time there was a", but I'll use different temperatures in each case. Temperature zero means it always goes with the most predictable word, and what you get ends up being a trite derivative of Goldilocks.
A higher temperature gives it a chance to choose less likely words, but it comes with a risk. In this case the story starts out more original, about a young web artist from South Korea, but it quickly degenerates into nonsense. Technically speaking, the API doesn't actually let you pick a temperature bigger than 2.
There's no mathematical reason for this, it's just an arbitrary constraint imposed to keep their tool from being seen generating things that are too nonsensical. So, if you're curious, the way this animation actually works is that I take the 20 most probable next tokens that GPT-3 gives me, which seems to be the maximum they'll give, and then I tweak the probabilities based on the chosen temperature.
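And here is the temperature variant sketched in the same style, where T divides the exponents before normalizing, with T = 0 treated as the "always pick the max" limit.

```python
# The temperature variant of softmax: divide the exponents by T before normalizing.
# Larger T flattens the distribution; T close to 0 concentrates it on the max.
import numpy as np

def softmax_with_temperature(x, T=1.0):
    if T == 0:                                   # limiting case: all weight on the max
        out = np.zeros_like(x, dtype=float)
        out[np.argmax(x)] = 1.0
        return out
    e = np.exp((x - np.max(x)) / T)
    return e / e.sum()

scores = np.array([2.0, -1.0, 4.5, 0.3])
for T in (0, 0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(scores, T).round(3))
```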
As one last bit of jargon, in the same way you might call the components of the output of this function probabilities, people often refer to the inputs as logits; some people say low-gits, some say lah-gits, I'll say logits. So for instance, when you feed in some text, all the word embeddings flow through the network, and you do that final multiplication with the unembedding matrix, machine learning people would refer to the components of that raw, unnormalized output as the logits for the next word prediction.
A lot of the goal in this chapter was to lay the foundations for understanding the attention mechanism, Karate Kid wax-on-wax-off style. You see, if you have a strong intuition for word embeddings, for softmax, for how dot products measure similarity, and also for the underlying premise that most of the calculations have to look like matrix multiplication with matrices full of tunable parameters, then understanding the attention mechanism, this cornerstone of the modern AI boom, should go relatively smoothly.
So, come join me in the next chapter. As I'm publishing this, a draft of that next chapter is available for review by Patreon supporters. The final version should be out to the public within a week or two, depending on how much I end up changing based on that review.
