For the last couple of months I've been using Claude a lot, both for work and to discuss things like whether we are in a simulation, AI ethics, or whatever else really. Given how much I use it, I realised I didn't have a very complete picture of what was going on under the hood, or of how the transformer architecture really works. With that in mind I decided to take on a few projects to help me learn more about current ML models. The first of these projects is to recreate the GPT2 model from scratch in C++. Note that I don't just mean creating something with basically the same architecture: I want to be able to load in the GPT2 weights and, for any given input text, get the same output as the model you can load from the Hugging Face transformers library.
Although there are already many excellent writeups of implementing GPT2 from scratch out there (see e.g. this one), part of my process for learning something new is seeing whether I can explain it to someone else; only if I can do that do I feel like I actually understand the thing properly. With that in mind, I'm going to write a series of posts going through my implementation, and hopefully in the process help both myself and maybe someone else as well. I tend to approach things from a fairly technical point of view, and to be honest I found that a lot of the explanations of things like attention (keys and queries etc.) did little to help me understand what was actually going on, so I will try to present things in a way that makes more sense to me.
Below is a high level diagram of the GPT2 architecture (thank you Claude), going from an input string all the way down to the probability distribution over the next token:
Trying to explain this whole process in a single blog post is probably a bit ambitious, so I'm breaking it down into a few parts. This first part covers only the tokenization process: the first step in the chain, which takes us from the input string (raw text) to the input tokens (a vector of integers). I'm aiming to get the remaining parts out over the next couple of weeks or so, but we'll see how that goes.
All the code for the C++ implementation, including tests that verify it matches the Python version, is already available at my Git here. I'll be tidying it up and adding comments as I write these posts, so although it all works already, some bits may not be very well documented yet.
Before we get onto any code it is probably useful to go through a concrete example of exactly what the tokenization process is doing. In principle there are many different ways that one could do this, for example:

i) Word based tokenization, where splits occur on whitespace and punctuation. This is simple but struggles with words that aren't in the tokenizer's vocabulary.

ii) Character based tokenization, where splits occur on every character. This can handle any text sequence but is inefficient and leads to lots of tokens.
The GPT2 tokenization process tries to strike a balance between these two, enabling processing of arbitrary text while handling it more efficiently than simply tokenizing every character. So, let's break it down using the example sentence: "GPT2 was created by OpenAI".
The first thing that happens is that the tokenizer uses a thoroughly horrific regex (regular expression) to split the input into smaller, more manageable chunks while preserving important linguistic features:
's|'t|'re|'ve|'m|'ll|'d| ?[a-zA-Z]+| ?[0-9]+| ?[^\s\w]+|\s+(?!\S)|\s+
Applying this regex to our example string yields the following set of sub-strings:
["GPT2", " was", " created", " by", " OpenAI"]
Note: the regex splits runs of letters and runs of digits into separate chunks, which is why "GPT" and "2" end up as separate sub-strings. Also, in GPT-2's tokenizer the space character is represented by 'Ġ' (Unicode U+0120); we'll use that convention in the rest of this example.
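If you want to see this split in action before we get to the full implementation, the short standalone sketch below applies the regex to the example sentence using std::regex. This is just an illustration, not code from the repository; note the doubled backslashes needed inside a C++ string literal.

```cpp
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string text = "GPT2 was created by OpenAI";

    // The GPT2 splitting pattern from above, escaped for a C++ string literal
    std::regex splitter("'s|'t|'re|'ve|'m|'ll|'d| ?[a-zA-Z]+| ?[0-9]+| ?[^\\s\\w]+|\\s+(?!\\S)|\\s+");

    // Prints: [GPT][2][ was][ created][ by][ OpenAI]
    for (std::sregex_iterator it(text.begin(), text.end(), splitter), end; it != end; ++it) {
        std::cout << "[" << it->str() << "]";
    }
    std::cout << "\n";
    return 0;
}
```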
Each of these chunks then undergoes Byte-Pair Encoding (BPE). Before we run through the example though, we need to take a look at two concepts that will be important - the vocabulary, and the merge list. Both of these are the result of the tokenizer's training process and are used during the tokenization of input text.
The vocabulary for GPT2 is stored in a JSON file that maps 50,257 tokens to their corresponding unique integer IDs. It includes individual characters, common sub-words, and frequent words. A few concrete examples are shown below:
{
"!": 0,
""": 1,
...
"ĠThe": 383,
...
"The": 464,
...
"the": 1169,
...
"<|endoftext|>": 50256
}
Note here that both capitalisation and whether there was a preceding space matter! "The", "the", and " The" are all considered separate tokens within the GPT2 vocabulary.
The merge list contains 50,000 merges that can be performed during byte pair encoding. Each entry in the list represents a merge operation, showing which pair of tokens should be merged. The order of the merges is crucial - it represents the priority of the merges. The top few merges from the list are shown below:
Ġ t
Ġ a
h e
i n
r e
o n
Ġt he
...
These two concepts, the vocab and the merge list, are closely related, and are the result of the same generation process. If you are wondering why the vocab has slightly more entries than the merge list (257 more, to be exact), it is because the vocabulary includes 256 entries for all possible bytes (0-255), and also the special "end of text" symbol with ID 50256. These are not derived from merges but are included to ensure any byte sequence can be tokenized. The merge list determines how characters and sub-words are combined during tokenization, and the tokens that result from those merges are found in the vocabulary.
A quick note on the special "<|endoftext|>" token (ID 50256). It's used to mark the end of a document or a segment of text during training and inference, but it isn't produced by the standard tokenization of input text; it's typically added manually to the end of tokenized sequences when preparing training data or generating text.
Ok, with that diversion over, let's look at how this works for "GPT" and " OpenAI", including the ranks of the different pair combinations (the chunk "2" is a single character, so it is already a complete token and no merges apply):
Step | Current State | Pair Ranks | Chosen Merge |
---|---|---|---|
1 | ["G", "P", "T"] | (G,P): 16705 (P,T): 11315 | (P,T) → "PT" |
2 | ["G", "PT"] | (G,PT): -1 | No more merges |
Step | Current State | Pair Ranks | Chosen Merge |
---|---|---|---|
1 | ["Ġ", "O", "p", "e", "n", "A", "I"] | (Ġ,O): 185 (O,p): 18002 (p,e): 176 (e,n): 13 (n,A): -1 (A,I): 19930 | (e,n) → "en" |
2 | ["Ġ", "O", "p", "en", "A", "I"] | (Ġ,O): 185 (O,p): 18002 (p,en): 3362 (en,A): -1 (A,I): 19930 | (Ġ,O) → "ĠO" |
3 | ["ĠO", "p", "en", "A", "I"] | (ĠO,p): 8415 (p,en): 3362 (en,A): -1 (A,I): 19930 | (p,en) → "pen" |
4 | ["ĠO", "pen", "A", "I"] | (ĠO,pen): 4692 (pen,A): -1 (A,I): 19930 | (ĠO,pen) → "ĠOpen" |
5 | ["ĠOpen", "A", "I"] | (ĠOpen, A): -1 (A,I): 19930 | (A, I) → "AI" |
6 | ["ĠOpen", "AI"] | (ĠOpen,AI): -1 | No more merges |
At each step the merge with the lowest rank (i.e. the pair that was most frequent during training) is chosen, and the BPE process stops when no more merges can be applied based on the learned merge rules.
Finally, each resulting subword is looked up in the vocabulary to convert it to an integer token ID:
"G" → 38
"PT" → 11571
"2" → 17
"Ġwas" → 373
"Ġcreated" → 2726
"Ġby" → 416
"ĠOpen" → 4946
"AI" → 20185
[38, 11571, 17, 373, 2726, 416, 4946, 20185]
This sequence of integer token IDs is what would be fed into the GPT-2 model for processing. This is the process that we now want to actually code up in our tokenizer class.
Let's start by looking at the main structure of our tokenizer class:
```cpp
class tokenizer_t {
private:
    // Encoder: maps tokens to IDs using the vocabulary
    std::map<string_t, int> encoder;

    // Decoder: maps IDs back to tokens
    std::map<int, string_t> decoder;

    // merge_ranks: stores the priority of merge operations
    std::vector<std::pair<string_t, string_t>> merge_ranks;

    // Regex pattern for tokenization - performs initial splitting of input string
    std::regex regex_splitter;

    // Byte-to-unicode mapping
    std::map<uint8_t, char32_t> byte_encoder;

    // Private methods...

public:
    tokenizer_t(const string_t& vocab_file, const string_t& merges_file);

    // Tokenize input text
    std::vector<int> tokenize(const string_t& text);

    // Get strings back from the tokens
    std::vector<string_t> detokenize(const std::vector<int>& tokens);

    // Other public methods...
};
```
There isn't much to say here; the class holds everything we are going to need for tokenization: the encoder and decoder maps between token strings and IDs, the ordered list of merges, the regex used for the initial split, and the byte-to-unicode mapping.
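Before digging into the individual methods, here is a rough sketch of how the class is intended to be used. The file names are placeholders for the vocab.json and merges.txt files that ship with the Hugging Face GPT2 model, and the expected IDs are just the ones from the worked example above.

```cpp
// Hypothetical usage sketch - file paths are placeholders
tokenizer_t tokenizer("vocab.json", "merges.txt");

// Should reproduce the worked example from earlier in the post:
// [38, 11571, 17, 373, 2726, 416, 4946, 20185]
std::vector<int> ids = tokenizer.tokenize("GPT2 was created by OpenAI");

// And we can map the token IDs back to their string pieces
std::vector<string_t> pieces = tokenizer.detokenize(ids);
```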
The constructor for the tokenizer is passed two files, one for the vocabulary and one for the merge list:
```cpp
tokenizer_t::tokenizer_t(const string_t& vocab_file, const string_t& merges_file)
{
    // Load vocabulary from JSON file
    std::ifstream vocab_stream(vocab_file);
    json vocab_json;
    vocab_stream >> vocab_json;
    for (auto it = vocab_json.begin(); it != vocab_json.end(); ++it) {
        int id = it.value();
        // create the encoder entry mapping string -> int
        encoder[it.key()] = id;
        // create the corresponding decoder entry int -> string
        decoder[id] = it.key();
    }

    // Load BPE merges from text file
    std::ifstream merges_stream(merges_file);
    string_t line;
    std::getline(merges_stream, line); // Skip first line (header)
    while (std::getline(merges_stream, line)) {
        if (line.empty()) {
            break;
        }
        // the merges file is just space separated
        size_t split_pos = line.find(' ');
        if (split_pos == string_t::npos) {
            die("Invalid line in merges file: " + line);
        }
        string_t first = line.substr(0, split_pos);
        string_t second = line.substr(split_pos + 1);
        // Add this merge rule to merge_ranks
        merge_ranks.emplace_back(first, second);
    }

    // Compile regex pattern for tokenization
    // (note the escaped backslashes needed in the C++ string literal)
    regex_splitter = std::regex("'s|'t|'re|'ve|'m|'ll|'d| ?[a-zA-Z]+| ?[0-9]+| ?[^\\s\\w]+|\\s+(?!\\S)|\\s+");

    // Initialize byte encoder/decoder
    byte_encoder = bytes_to_unicode();
}
```
The constructor then does a few basic things: it loads the vocabulary JSON into the encoder and decoder maps, reads the merges file line by line into merge_ranks (so a merge's rank is simply its position in that vector), compiles the regex used for the initial split, and builds the byte-to-unicode mapping.
There isn't really anything else to say here, so we can move on to the byte-to-unicode mapping.
The bytes_to_unicode() function creates a reversible mapping between byte values (0-255) and Unicode characters:
```cpp
// Function to create byte-to-unicode mapping for GPT-2 tokenization
std::map<uint8_t, char32_t> tokenizer_t::bytes_to_unicode()
{
    // Purpose: Create a specific bijective mapping between byte values (0-255) and Unicode code points.
    // This mapping is designed to be consistent with GPT-2's original tokenization scheme.
    std::vector<uint8_t> bs;

    // Step 1: Add printable ASCII characters (33 to 126, i.e., '!' to '~')
    // Note: We will handle 0-32 (and the other missing values) later
    for (int i = 33; i <= 126; ++i)
        bs.push_back(i);

    // Step 2: Add extended ASCII characters (161 - '¡' to 172 - '¬' and 174 - '®' to 255 - 'ÿ')
    for (int i = 161; i <= 172; ++i)
        bs.push_back(i);
    for (int i = 174; i <= 255; ++i)
        bs.push_back(i);

    // Create a copy of bs to store the Unicode mappings
    std::vector<char32_t> cs(bs.begin(), bs.end());
    int n = 0;

    // Step 3: Map remaining byte values (0-32, 127-160, 173) to Unicode points starting at 256
    // This includes control characters, space, delete, and some extended ASCII characters
    // Mapping these to 256+ ensures:
    // 1. Consistency with GPT-2's original tokenization scheme
    // 2. Clear visual distinction of special characters during debugging
    // 3. Avoidance of potential issues with the way text editors handle control characters
    for (int b = 0; b < 256; ++b) {
        // if we have already added this byte, skip it
        if (std::find(bs.begin(), bs.end(), b) != bs.end())
            continue;
        bs.push_back(b);
        // Map to Unicode characters starting from 256
        // Note: we add 256 to avoid conflicts with the ASCII range
        cs.push_back(256 + n);
        ++n;
    }

    // Create the final mapping
    // Note: We need to use char32_t rather than char to handle Unicode code points over 255
    std::map<uint8_t, char32_t> result;
    for (size_t i = 0; i < bs.size(); ++i) {
        result[bs[i]] = cs[i];
    }
    return result;
}
```
This function creates a one-to-one mapping between byte values (the uint8_t type) and Unicode code points (stored as char32_t), and ensures that all 256 possible byte values are represented as printable characters.
At first it wasn't really clear to me why this function needed to be so arbitrary looking - why skip certain byte ranges and remap them to code points above 255?
The reason is that it ensures that all bytes can be represented as printable Unicode characters, which is useful for debugging and visualizing the tokenization process. It also avoids mapping bytes to Unicode control characters (0-31, 127-159) and other special characters like the soft hyphen (173). These control characters are non-printable characters that control or modify how text and space are processed and displayed; common examples include the null byte (0x00), tab (0x09), line feed (0x0A), and carriage return (0x0D).
These characters can cause inconsistent or unexpected text display, data truncation, processing errors, security risks, and debugging challenges. By avoiding direct mapping to control character code points, this slightly more convoluted process mitigates potential issues in text processing and improves the robustness of the tokenization process.
This allows for consistent handling of all possible byte values while maintaining a clear distinction between original printable ASCII characters and other byte values.
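As a quick sanity check, we can confirm that this reproduces the 'Ġ' convention we used earlier: the space byte (0x20) is the 33rd byte to be remapped (after bytes 0-31), so it lands on code point 256 + 32 = 288, which is U+0120, i.e. 'Ġ'. The sketch below shows such a check; it assumes the mapping is somehow accessible from test code, which the actual class may not allow directly.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Sketch of a sanity check on the byte-to-unicode mapping. `enc` is assumed to
// be the map produced by bytes_to_unicode().
void check_byte_mapping(const std::map<uint8_t, char32_t>& enc)
{
    assert(enc.size() == 256);                                   // every byte value is covered
    assert(enc.at(0x20) == char32_t{0x0120});                    // space -> 'Ġ' (U+0120)
    assert(enc.at(static_cast<uint8_t>('!')) == char32_t{'!'});  // printable ASCII maps to itself
}
```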
The main tokenization process is handled by the tokenize() method:
```cpp
// Tokenize input text
std::vector<int> tokenizer_t::tokenize(const string_t& text)
{
    std::vector<int> tokens;

    // Use regex to try and split text into smaller chunks
    std::sregex_iterator iter(text.begin(), text.end(), regex_splitter);
    // A default-constructed std::sregex_iterator represents the past-the-end iterator
    std::sregex_iterator end_iter;

    // while there are chunks left, tokenize them
    while (iter != end_iter) {
        string_t utf8_token = iter->str();

        // Apply byte-level encoding
        std::u32string utf32_token;
        for (uint8_t b : utf8_token) {
            utf32_token += byte_encoder.at(b);
        }

        // Apply BPE encoding
        std::vector<string_t> bpe_encoded = bpe(utf32_token);
        for (const string_t& bpe_token : bpe_encoded) {
            int token_id = encoder.at(bpe_token);
            tokens.push_back(token_id);
        }

        // Move to the next regex match
        ++iter;
    }
    return tokens;
}
```
At this point this function is pretty straightforward, and at a high level implements the tokenization process we described at the start of the post: split the text into chunks with the regex, map each chunk's bytes to their Unicode stand-ins, run BPE on the result, and look up each final sub-word in the vocabulary to get its token ID.
There is one slightly funny thing going on here which I think warrants an explanation, and it has to do with the regex evaluation (surprise...). With most C++ containers you get an end() iterator up front and just keep iterating until you reach it. The regex iterator is a bit different: you have to compare it against a default-constructed iterator to see if you have reached the end, because it doesn't know how long the iteration is going to be until it is over. This is why we create the end_iter object and use it in our while loop condition.
The main thing that is happening here though is the BPE encoding. This is implemented in the bpe() method, which we will go through next.
This is where most of the work is going on; the code for the bpe() method is shown below:
```cpp
// performs byte pair encoding on a UTF-32 encoded input
std::vector<string_t> tokenizer_t::bpe(const std::u32string& input)
{
    // Initialize a vector of UTF-32 tokens.
    // Right now each entry is just a single character from the input,
    // however these will potentially get merged through the BPE process
    std::vector<std::u32string> tokens;
    tokens.reserve(input.size());
    for (char32_t c : input) {
        tokens.push_back(std::u32string(1, c));
    }

    // Main BPE loop
    while (true) {
        std::pair<std::u32string, std::u32string> best_pair;
        int best_rank = -1;

        // Find the best pair to merge based on rank
        for (size_t i = 0; i < tokens.size() - 1; ++i) {
            // Get the rank of the current pair
            int rank = get_pair_rank(utf32_to_utf8(tokens[i]), utf32_to_utf8(tokens[i + 1]));
            // Update best_pair and best_rank if this pair is better
            if (rank != -1 && (best_rank == -1 || rank < best_rank)) {
                best_pair = {tokens[i], tokens[i + 1]};
                best_rank = rank;
            }
        }

        // If no mergeable pair found, exit the loop
        if (best_rank == -1) {
            break;
        }

        // Merge the best pair of tokens
        std::vector<std::u32string> merged_tokens;
        for (size_t i = 0; i < tokens.size(); ++i) {
            if (i < tokens.size() - 1 && tokens[i] == best_pair.first && tokens[i + 1] == best_pair.second) {
                // Merge the pair
                merged_tokens.push_back(best_pair.first + best_pair.second);
                ++i; // Skip the next token as it's now merged
            } else {
                // Keep the token as is
                merged_tokens.push_back(tokens[i]);
            }
        }

        // Update tokens with the new merged version
        tokens = std::move(merged_tokens);
    }

    // Convert the final vector from UTF-32 to UTF-8
    std::vector<string_t> result;
    for (const std::u32string& token : tokens) {
        result.push_back(utf32_to_utf8(token));
    }
    return result;
}
```
Starting from the top, the first thing we do is to split the input (one of the chunks of text produced by the regex expression) into individual characters:
```cpp
// Initialize a vector of UTF-32 tokens.
// Right now each entry is just a single character from the input,
// however these will potentially get merged through the BPE process
std::vector<std::u32string> tokens;
tokens.reserve(input.size());
for (char32_t c : input) {
    tokens.push_back(std::u32string(1, c));
}
```
Right now these are already valid tokens - the vocabulary for GPT2 includes all the single byte characters as tokens, so in principle we could stop right now. The main loop in this function is going to keep merging the highest-priority valid pairings (i.e. those that exist in the merge list) until there are no valid pairings left. The content of the tokens vector at that point will be the final set of tokens extracted from this input. The first step of this loop is to check all the adjacent pairs of tokens within the vector, get their ranks from the merge list, and find the pair that has the lowest rank (i.e. the highest-priority merge).
```cpp
std::pair<std::u32string, std::u32string> best_pair;
int best_rank = -1;

// Find the best pair to merge based on rank
for (size_t i = 0; i < tokens.size() - 1; ++i) {
    // Get the rank of the current pair
    int rank = get_pair_rank(utf32_to_utf8(tokens[i]), utf32_to_utf8(tokens[i + 1]));
    // Update best_pair and best_rank if this pair is better
    if (rank != -1 && (best_rank == -1 || rank < best_rank)) {
        best_pair = {tokens[i], tokens[i + 1]};
        best_rank = rank;
    }
}
```
Here the get_pair_rank function is just a simple lookup to see whether a pair exists in our merge list: if it does, it returns the pair's rank, and otherwise it returns -1. The actual implementation isn't shown in this post, but a minimal sketch of what such a lookup could look like is given below.
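This sketch assumes merge_ranks is the vector of pairs we loaded in the constructor, so a pair's rank is simply its position in that vector; the real method in the repository may differ (for example, using a hash map for faster lookups).

```cpp
// Hypothetical sketch of the pair-rank lookup: scan merge_ranks and return the
// index (rank) of the pair, or -1 if the pair is not a known merge.
int tokenizer_t::get_pair_rank(const string_t& first, const string_t& second)
{
    for (size_t i = 0; i < merge_ranks.size(); ++i) {
        if (merge_ranks[i].first == first && merge_ranks[i].second == second) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

If, after checking every adjacent pair, the best rank is still -1, there are no valid pairs left to merge and we exit the loop: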
```cpp
// If no mergeable pair found, exit the loop
if (best_rank == -1) {
    break;
}
```
Otherwise we create an updated vector that merges the best pair of tokens, and keeps all the others as they were. This updated vector replaces our original tokens vector, and we move on to the next iteration of the loop:
```cpp
// Merge the best pair of tokens
std::vector<std::u32string> merged_tokens;
for (size_t i = 0; i < tokens.size(); ++i) {
    if (i < tokens.size() - 1 && tokens[i] == best_pair.first && tokens[i + 1] == best_pair.second) {
        // Merge the pair
        merged_tokens.push_back(best_pair.first + best_pair.second);
        ++i; // Skip the next token as it's now merged
    } else {
        // Keep the token as is
        merged_tokens.push_back(tokens[i]);
    }
}

// Update tokens with the new merged version
tokens = std::move(merged_tokens);
```
That's all there is to it. Once the loop exits we have our vector of tokens, and we just need to convert them back to UTF-8 before returning them, as that is what our vocabulary encoder is expecting. We can then look up each entry in the vocabulary to get the final token IDs that we will feed into the model.
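The utf32_to_utf8() helper used throughout bpe() isn't shown in this post; it is just a standard UTF-32 to UTF-8 conversion. A minimal sketch of what it could look like is below (the version in the repository may well differ, and string_t is assumed to be an alias for std::string), applying the usual UTF-8 encoding rules to each code point.

```cpp
// Sketch of a UTF-32 -> UTF-8 conversion (hypothetical version of the helper
// used above). Each code point is encoded into 1-4 bytes following the
// standard UTF-8 scheme.
string_t utf32_to_utf8(const std::u32string& input)
{
    string_t out;
    for (char32_t cp : input) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```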
Referring back to the architecture diagram at the very start of this post, we have only just covered the first couple of boxes! That said, I think this is actually one of the more fiddly bits of implementing GPT2 from scratch; the rest is mostly just matrix stuff, which may end up being quicker to rattle through.
The broad plan is to break the rest of the architecture into another couple of blog posts, the next covering the embedding/unembedding process and getting the next token, leaving the transformer block till last. After those are done I'll carry on with some more machine-learning themed posts; for example, I'd like to repeat this exercise with Llama1, which was released about 4 years after GPT2, to get a better understanding of what moved forward in that time, and to also look at fine tuning these models on custom datasets, and using RLHF.
If you found this post useful, have any questions, or think you might be interested in some of those upcoming posts, please feel free to reach out, or just follow me on X.