In our previous post, we went through the tokenization process used by GPT-2. Referring to the high-level architecture diagram below, that means we have covered the first couple of boxes, up to the point of having our vector of integer input token IDs.
The next step is embedding. This process transforms each of our token IDs into a vector of floating point numbers of length equal to the dimensionality of the model (d_model), which, in the case of the small variant of GPT-2 that we are recreating, is 768. So if our input is a sequence of 10 token IDs, after embedding we will have a (10 x 768) matrix, and this matrix becomes the input to the transformer layers of the model. For now we are going to skip past those layers so that we can also cover the unembedding process that occurs at the bottom of the diagram, which converts the normalised output of the transformer blocks into 'logits' - raw scores for each token in the vocabulary. Given it is relatively straightforward, we'll then finish the process by converting these logits into probabilities with softmax, so that we can finally get the next token in the sequence.
By the end of this post 'all' that will remain is the content of the transformer blocks themselves (and the final layer norm, but layer norms also feature in the transformer blocks), which we'll save for the final entry in this series.
As with the previous post we'll begin by going through the idea of embeddings and what they are trying to do, before actually starting on the implementation in C++.
Embeddings are a key component in virtually all large language models (LLMs), including GPT-2. They are the means by which we turn an integer token ID into a vector of floating point numbers that the model can actually do something with. Early word embedding techniques like Word2Vec and GloVe were pre-trained separately and then used as fixed inputs to neural networks. In most modern LLMs, however, including GPT-2, BERT, and their successors, embeddings are learned during the training process. This means the model is able to adjust the vector representation of each token ID to better predict what the next token will be. While some studies suggest that these embeddings capture syntactic information about the tokens, even that level of interpretation is open to debate for models of this size. If you are interested in interpretability, though, I can recommend pretty much anything by the likes of Chris Olah.
The size of the embedding vectors is a hyperparameter and will vary from model to model, ranging from a few hundred to several thousand parameters per token. In the case of GPT-2 there are a few different sizes, but we will be dealing with the smallest model, which has an embedding dimension (d_model) of 768.
GPT-2 uses two types of embeddings: token embeddings and position embeddings. There are others, such as segment embeddings, which are used by models like BERT, but we'll stick to the two used by GPT-2 here.
At a high level, token embeddings can be thought of as the floating point representation that encodes the 'meaning' of each token in our vocabulary. This means that the token embedding matrix that needs to be learnt is (vocab size X model size), which for our GPT-2 implementation is (50257 X 768). That's almost 40 million parameters just for the token embedding! Fortunately we only care about the forward pass in this post, so we don't need to worry about how to train those parameters; we just need to know how to use the matrix to actually perform the embedding.
In the previous post in this series we encoded the string "GPT2 was created by OpenAI", which resulted in the set of tokens [38, 11571, 17, 373, 2726, 416, 4946, 20815]. Below is a diagram that illustrates the token embedding process for these tokens:
Here, each token ID corresponds to a specific row in the token embedding matrix. We can think of the embedding process as selecting the relevant row for each token (represented by the blue boxes in the middle), maintaining the order of our input sequence, and then combining (stacking) these rows to form the particular embedded token matrix for our input. In our case this matrix has 8 rows (one for each input token) and 768 columns (the embedding dimension).
To summarise, this process transforms our discrete token IDs into vectors of floating point numbers. The resulting 8 x 768 embedded tokens matrix preserves the sequence of our input while capturing the information that has been learnt about each token, encoded in the embedding matrix.
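To make the row-lookup idea concrete, here is a toy-scale, self-contained sketch (not part of our actual implementation) using a made-up vocabulary of 10 tokens, an embedding dimension of 4, and random values standing in for the learned weights; the real model does exactly the same thing, just with a (50257 x 768) matrix.
cpp
#include <Eigen/Dense>
#include <iostream>
#include <vector>

int main()
{
    // toy sizes: 10-token vocabulary, embedding dimension of 4
    const int vocab_size = 10;
    const int d_model = 4;

    // in the real model this matrix is learned; here we just fill it with random values
    Eigen::MatrixXf token_embedding = Eigen::MatrixXf::Random(vocab_size, d_model);

    // a toy input sequence of 3 token IDs
    std::vector<int> tokens = {7, 2, 7};

    // build the embedded tokens matrix by stacking the relevant rows, in input order
    Eigen::MatrixXf embedded = Eigen::MatrixXf::Zero(tokens.size(), d_model);
    for (size_t i = 0; i < tokens.size(); ++i) {
        embedded.row(i) = token_embedding.row(tokens[i]);
    }

    // note that repeated tokens (ID 7 here) get identical rows
    std::cout << embedded << std::endl;
    return 0;
}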
The second type of embedding used by GPT-2 is position embedding. As the name suggests, rather than trying to encode semantic information about the tokens, this instead tries to encode generic, useful information about the position of a token in the input sequence. The reason for adding some kind of position embedding is that without it the transformer would treat the same set of input tokens identically regardless of their ordering, i.e. "I walk faster than I run" and "I run faster than I walk" would effectively be the same input and so generate the same next token. There has been a lot of work done on different ways to perform position embedding, and this paper provides a good overview, covering different examples of absolute and relative embedding methods.
GPT-2 uses absolute position embedding, meaning that each vector of floating point numbers in the position embedding matrix corresponds to a particular absolute position in the input sequence. As with the token embedding, the position embedding matrix is also learnt during training; however, rather than being of size (vocab size X model size), it is of size (max sequence length X model size), which in the case of GPT-2 is (1024 X 768).
The image below illustrates the position embedding process for the same input tokens as above. It is pretty similar; however, rather than taking the row that corresponds to the token ID, we take the row that corresponds to the position of the token in the sequence. Just to be explicit, this means we grab the first row for the first token, the second row for the second token, and so on.
If you are thinking "that seems like a weird way to encode position", I agree with you. It isn't obvious to me that the absolute position of a token in an input is particularly useful, and analysis has shown that simply replacing the learned position embeddings with random vectors leads to a surprisingly small drop in performance. It is interesting that fixed position embedding matrices, such as the sinusoidal approach, have been found to perform comparably to learned absolute position embeddings, and I would like to know whether simply having some ordered set of vectors, where the position embedding for the i'th position is just a vector filled with the integer i, would be just as good. A related approach simply adds a linearly decreasing bias to the model's attention scores and performs just as well as learned absolute position embeddings, so it seems plausible that the GPT-2 approach is overparameterised for what it is trying to do.
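For reference, here is a sketch of the fixed sinusoidal scheme from the original Transformer paper, mentioned above as a learning-free alternative: each position gets a pattern of sines and cosines at different frequencies. The function name sinusoidal_position_embedding is just illustrative; this is not what GPT-2 uses, and it appears nowhere in our codebase.
cpp
#include <Eigen/Dense>
#include <cmath>

// build a (max_seq_len x d_model) matrix of sinusoidal position embeddings:
// PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
// PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Eigen::MatrixXf sinusoidal_position_embedding(int max_seq_len, int d_model)
{
    Eigen::MatrixXf pe(max_seq_len, d_model);
    for (int pos = 0; pos < max_seq_len; ++pos) {
        for (int i = 0; i < d_model; i += 2) {
            // frequency decreases as the column index increases
            float freq = std::pow(10000.0f, -static_cast<float>(i) / d_model);
            pe(pos, i) = std::sin(pos * freq);
            if (i + 1 < d_model) {
                pe(pos, i + 1) = std::cos(pos * freq);
            }
        }
    }
    return pe;
}

Because nothing here is learned, the whole matrix can be generated up front for any maximum sequence length.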
Before moving on, I'll mention one other, more recent approach to position embedding: RoPE (rotary position embedding), which allows both absolute and relative position information to be encoded and is used in models like LLaMA. I intend to implement that from scratch in a future post.
Implementing all of the above in code is actually pretty trivial. To see how we've done it, we just need to look at the forward method of our gpt2_t class:
cpp
// In src/gpt2.cpp
Eigen::MatrixXf gpt2_t::forward(string_t input_string)
{
    // get the token ids for this string from the tokenizer
    std::vector<int> tokens = tokenizer.tokenize(input_string);

    // check this doesn't exceed the maximum sequence length (1024 for GPT2)
    if (tokens.size() > max_seq_len) {
        die("Input token sequence is too long");
    }

    // Initialise the embedded tokens matrix with the right size for this set of tokens
    Eigen::MatrixXf embedding_matrix = Eigen::MatrixXf::Zero(tokens.size(), d_model);

    for (size_t i = 0; i < tokens.size(); ++i) {
        // Check if the token ID is within the valid range
        if (tokens[i] >= 0 && tokens[i] < weights.token_embedding.rows()) {
            // for token embedding, take the row corresponding to the token ID
            embedding_matrix.row(i) = weights.token_embedding.row(tokens[i]);
            // for the position embedding, take the row corresponding to the position
            // and add that to the token embedding
            embedding_matrix.row(i) += weights.position_embedding.row(i);
        } else {
            die("Invalid token ID: " + std::to_string(tokens[i]));
        }
    }

    // the token embedding matrix is now ready to be passed to the transformer
    Eigen::MatrixXf transformer_output = transformer.forward(embedding_matrix);

    // pass the transformer output through the final layer normalization
    MatrixXf norm_final_output = final_norm_layer.forward(transformer_output);

    // get the logits by multiplying the final output by the token embedding matrix
    MatrixXf logits = norm_final_output * weights.token_embedding.transpose();

    return logits;
}
Note that we are making use of the fabulous Eigen library for handling all our matrix operations; the MatrixXf type just means a dynamically-sized matrix of single-precision floating point numbers. This method performs the entire forward pass for our GPT-2 model: it tokenizes the input string, builds the combined token and position embeddings, passes them through the transformer blocks and the final layer normalisation, and then unembeds the result to produce the logits.
We can go through the embedding process in a bit more detail:
To perform the token embedding, for the i'th token we take the row corresponding to that token's ID from the token embedding matrix and assign it to the i'th row of our embedding_matrix:
cpp
// for token embedding, take the row corresponding to the token ID
embedding_matrix.row(i) = weights.token_embedding.row(tokens[i]);
The position embedding is even more straightforward: for the i'th token we just take the i'th row of the position embedding matrix and add it to the i'th row of embedding_matrix:
cpp
// for the position embedding, take the row corresponding to the position
// and add that to the token embedding
embedding_matrix.row(i) += weights.position_embedding.row(i);
That's all there is to it! We have now performed the embedding process and have our input ready for the transformer layers of our network (which we'll cover in the next post).
After all the transformer layers and the final layer normalisation, we need to convert the output back into the vocabulary space. This process is often called "unembedding". In GPT-2, the unembedding step is performed by multiplying the output by the transpose of the token embedding matrix:
cpp
// get the logits by multiplying the final output by the token embedding matrix
MatrixXf logits = norm_final_output * weights.token_embedding.transpose();
This projects our output back onto a space with one dimension per token in the model's vocabulary, with the values representing unnormalised scores for each token, referred to as logits. This design choice - using the same matrix for both embedding and unembedding - is common amongst language models and is referred to as 'weight tying'. Although it is not strictly necessary - one could have a separate matrix for unembedding - it helps to reduce the number of parameters in the model without having a demonstrably negative impact on performance.
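To make the shape bookkeeping of weight tying explicit, here is a small standalone sketch using the real GPT-2 small dimensions but random values standing in for the learned weights (the transformer and layer norm are skipped entirely, so this is illustrative only):
cpp
#include <Eigen/Dense>
#include <iostream>

int main()
{
    // GPT-2 small dimensions, with random values standing in for the learned weights
    const int vocab_size = 50257;
    const int d_model = 768;
    const int n_tokens = 8;

    // the same matrix whose rows we looked up during token embedding...
    Eigen::MatrixXf token_embedding = Eigen::MatrixXf::Random(vocab_size, d_model);

    // ...is reused, transposed, to unembed the normalised transformer output
    Eigen::MatrixXf norm_final_output = Eigen::MatrixXf::Random(n_tokens, d_model);

    // (8 x 768) * (768 x 50257) = (8 x 50257): one score per vocabulary token, per position
    Eigen::MatrixXf logits = norm_final_output * token_embedding.transpose();

    std::cout << logits.rows() << " x " << logits.cols() << std::endl; // prints "8 x 50257"
    return 0;
}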
After we've obtained the logits, the final step is to convert these into an actual token prediction. Here's how we do that in our implementation:
cpp
// In src/gpt2.cpp
string_t gpt2_t::get_next_max_like_token(MatrixXf& logits)
{
    // we only want to predict the next token after the input sequence
    // so we take the last row of the logits matrix
    Eigen::VectorXf last_token_logits = logits.row(logits.rows() - 1);

    // convert these logits to probabilities using softmax
    Eigen::VectorXf probabilities = softmax(last_token_logits);

    // in this function we just want to return the token with the highest probability
    // so we find the index of the maximum probability and return the token corresponding to that index
    Eigen::Index max_index;
    probabilities.maxCoeff(&max_index);

    // cast the Index type to an integer
    int max_prob_token_id = static_cast<int>(max_index);

    // finally we detokenize the token ID to get the actual token
    string_t token = tokenizer.detokenize(max_prob_token_id);

    return token;
}
The first thing we do here is get the last row from the logits matrix. This is because the network doesn't just generate scores for the token immediately following our input. Instead, it produces scores for every possible next token at each step of the input sequence. In other words, for an input of length N, the network calculates N sets of scores: the first set predicts the second token given the first, the second set predicts the third token given the first two, and so on, culminating in the Nth set that predicts the (N+1)th token given all N input tokens. Although this is very useful in training, it isn't so useful for inference, so we just grab the last row, which represents the scores for the next token given our complete input sequence.
Once we have our vector of logits (one per token in the model's vocabulary), a softmax function is used to convert them into probabilities. The softmax function is a common way to normalise a set of values into a probability distribution. It exponentiates each value and then divides by the sum of all the exponentiated values to ensure that the outputs sum to 1. Our implementation of the softmax function is shown below:
cpp
// In src/utils.cpp
VectorXf softmax(const VectorXf& x)
{
    // use the standard trick of subtracting the maximum value to avoid overflow
    VectorXf exp_x = (x.array() - x.maxCoeff()).exp();

    // return the normalized values
    // the sum of the values will be 1 so they can be interpreted as probabilities
    return exp_x.array() / exp_x.sum();
}
Note that for numerical stability we perform the common trick of subtracting the maximum value from each element before exponentiating. This prevents overflow issues that can occur when exponentiating large numbers.
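A quick way to convince yourself that this subtraction doesn't change the answer: softmax is invariant to adding a constant to every logit, since the factor exp(c) appears in both the numerator and the denominator and cancels. The little standalone sketch below (using the same softmax logic as above, just fully qualified so it compiles on its own) checks this numerically.
cpp
#include <Eigen/Dense>
#include <iostream>

// same logic as the softmax in src/utils.cpp, repeated here so the sketch is self-contained
Eigen::VectorXf softmax(const Eigen::VectorXf& x)
{
    // subtract the maximum value before exponentiating to avoid overflow
    Eigen::ArrayXf exp_x = (x.array() - x.maxCoeff()).exp();
    // normalise so the values sum to 1 and can be read as probabilities
    return (exp_x / exp_x.sum()).matrix();
}

int main()
{
    Eigen::VectorXf logits(3);
    logits << 2.0f, 1.0f, 0.1f;

    // shifting every logit by the same constant leaves the probabilities unchanged
    Eigen::VectorXf shifted = (logits.array() + 100.0f).matrix();

    std::cout << softmax(logits).transpose() << std::endl;
    std::cout << softmax(shifted).transpose() << std::endl; // same values as the line above
    return 0;
}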
Softmax is a nice way of computing probabilities from the output logits because it yields the maximum entropy distribution subject to a constraint on the expected value of the logits. Said differently, of all the distributions consistent with that constraint, the one softmax gives us assumes the least additional information (it maximises entropy).
That said, there are other methods of converting logits to probabilities, a couple of examples being the SigSoftmax function which combines a sigmoid and softmax, or the Spherical Loss family of functions which projects the network's output onto a hypersphere before applying normalization.
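For the curious, below is a rough, standalone sketch of the SigSoftmax idea as I understand it: each exponentiated logit is weighted by its sigmoid before normalising. The function name and exact formulation here are my own assumption rather than anything from our codebase, so treat it as illustrative only and refer to the original paper for the details.
cpp
#include <Eigen/Dense>

// rough sketch of SigSoftmax: weight each exponentiated logit by its sigmoid, then normalise
// (formulation assumed from the SigSoftmax paper; not used anywhere in our implementation)
Eigen::VectorXf sigsoftmax(const Eigen::VectorXf& x)
{
    // subtract the maximum for numerical stability, as in our softmax
    Eigen::ArrayXf exp_x = (x.array() - x.maxCoeff()).exp();
    // note: the sigmoid is applied to the original (unshifted) logits
    Eigen::ArrayXf sig_x = (1.0f + (-x.array()).exp()).inverse();
    Eigen::ArrayXf weighted = exp_x * sig_x;
    return (weighted / weighted.sum()).matrix();
}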
Finally, we find the token with the highest probability using Eigen's maxCoeff function. This uses Eigen's built-in Index type, so we need to cast that to a standard int before we can use it to detokenize the index and get our actual token:
cpp
// in this function we just want to return the token with the highest probability
// so we find the index of the maximum probability and return the token corresponding to that index
Eigen::Index max_index;
probabilities.maxCoeff(&max_index);

// cast the Index type to an integer
int max_prob_token_id = static_cast<int>(max_index);
This integer can then be passed to the decoder map from the previous post:
cpp
// convert a single token back to text
string_t tokenizer_t::detokenize(const int token)
{
    return decoder[token];
}
We now have the token that the model predicts as the most likely next token in the sequence! This approach always selects the most likely next token, which is known as "greedy" decoding. While simple and often effective, it can lead to repetitive or deterministic outputs. There are many alternative strategies for selecting the next token, which can introduce more variety and potentially improve the quality of generated text. We'll just summarise a few for interest's sake, as we didn't actually implement any of these in the code: temperature sampling (sample from the softmax distribution, with a temperature parameter controlling how peaked or flat it is), top-k sampling (restrict sampling to the k most likely tokens), top-p or 'nucleus' sampling (restrict sampling to the smallest set of tokens whose cumulative probability exceeds a threshold p), and beam search (keep several candidate sequences and expand the most promising ones).
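None of these strategies made it into our code, but to give a flavour of what one might look like, below is a rough sketch of temperature scaling combined with top-k sampling, operating on the same kind of logits vector we extracted above. The function name sample_top_k and its structure are purely illustrative.
cpp
#include <Eigen/Dense>
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// illustrative sketch only: sample a token ID from the logits using
// temperature scaling followed by top-k filtering (not part of our implementation)
int sample_top_k(const Eigen::VectorXf& logits, float temperature, int k, std::mt19937& rng)
{
    // temperature scaling: lower values make the distribution more peaked
    Eigen::VectorXf scaled = logits / temperature;

    // make sure k doesn't exceed the vocabulary size
    k = std::min<int>(k, static_cast<int>(scaled.size()));

    // find the indices of the k largest scaled logits
    std::vector<int> indices(scaled.size());
    std::iota(indices.begin(), indices.end(), 0);
    std::partial_sort(indices.begin(), indices.begin() + k, indices.end(),
                      [&](int a, int b) { return scaled(a) > scaled(b); });

    // softmax over just the top k logits
    Eigen::VectorXf top_logits(k);
    for (int i = 0; i < k; ++i) {
        top_logits(i) = scaled(indices[i]);
    }
    Eigen::ArrayXf exp_x = (top_logits.array() - top_logits.maxCoeff()).exp();
    Eigen::ArrayXf probs = exp_x / exp_x.sum();

    // sample one of the top-k tokens according to these probabilities
    std::discrete_distribution<int> dist(probs.data(), probs.data() + probs.size());
    return indices[dist(rng)];
}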
Each of these methods has its own trade-offs between diversity and quality of the generated text. The choice often depends on the specific application and desired characteristics of the output.
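To tie the pieces together, here is a sketch of what a simple greedy generation loop could look like using the forward and get_next_max_like_token methods shown above. It assumes a gpt2_t instance has already been constructed with its weights loaded (not shown here), and it re-runs the full forward pass over the whole text for every new token, which is simple but wasteful (real inference code caches intermediate results to avoid exactly this).
cpp
// illustrative sketch: repeatedly append the most likely next token to the input
// (assumes the gpt2_t and tokenizer headers from this repo are included,
// and that `model` has already been constructed with its weights loaded)
string_t generate_greedy(gpt2_t& model, string_t prompt, int n_new_tokens)
{
    string_t text = prompt;
    for (int i = 0; i < n_new_tokens; ++i) {
        // full forward pass over the current text to get the logits
        Eigen::MatrixXf logits = model.forward(text);
        // pick the most likely next token and append it to the text
        text += model.get_next_max_like_token(logits);
    }
    return text;
}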
We've now covered the entire process from input text to output token in our GPT-2 implementation, albeit with a big gap in the middle where the transformer layers do their thing.
In this post we looked at how GPT-2 uses both token embeddings and position embeddings to transform discrete token IDs into the vectors of floating point numbers that serve as inputs to the transformer layers. We also went through the unembedding process, which converts the transformer output back into the vocabulary space as raw scores (logits) for each token, reusing the same matrix that was used for the token embedding (weight tying). Finally, we converted these logits into probabilities using the softmax function to get the next token in the sequence.
In the next and final post in this series, we'll go through these transformer layers. This will mean implementing layer normalisation, self-attention and feed-forward networks, so it may be a long one!
As ever, if you found this post useful, have any questions, or think you might be interested in some of those upcoming posts, please feel free to reach out, or just follow me on X.