Resolved: When pre-training a transformer model, how can I add words to the vocabulary?

Question:

Given a trained DistilBERT language model for a given language, taken from the Hugging Face Hub, I want to pre-train the model on a specific domain, and I want to add new words that are:
  • definitely not present in the original training set,
  • and impossible to handle via WordPiece tokenization – basically, you can think of these words as “codes” that are a normalized form of a named entity.

Consider that:
  • I would like to avoid learning a new tokenizer: I am fine with adding the new words and then letting the model learn their embeddings via pre-training,
  • the number of new “words” is way larger than the number of “unused” tokens in the “stock” vocabulary.

The only advice that I have found is the one reported here:

Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.


Do you think this is the only way of achieving my goal?
If yes, I do not have any idea how to write this “script”: does someone have some hints on how to proceed (sample code, documentation, etc.)?

Answer:

As per my comment, I’m assuming that you stick with a pre-trained checkpoint, if only to “avoid [learning] a new tokenizer.” Also, the solution below works with PyTorch, which might be more suitable for such changes. I haven’t checked TensorFlow (which is mentioned in the quote), so no guarantees that this works across frameworks.
To solve your problem, let us divide this into two sub-problems:
  • Adding the new tokens to the tokenizer, and
  • Re-sizing the token embedding matrix of the model accordingly.

The first can actually be achieved quite simply by using .add_tokens(). I’m referencing the slow tokenizer’s implementation of it (because it’s in Python), but from what I can see, this also exists for the faster Rust-based tokenizers.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Will return an integer corresponding to the number of added tokens
# The input could also be a list of strings instead of a single string
num_new_tokens = tokenizer.add_tokens("dennlinger")  
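As the comment above notes, .add_tokens() also accepts a list of strings, which is how you would add the large set of domain “codes” in one go; it only counts tokens that were actually added (tokens already in the vocabulary are skipped). A minimal sketch, where the code strings are made-up placeholders:
# Placeholder strings standing in for the normalized entity codes from the question
domain_codes = ["ENT_0001", "ENT_0002", "ENT_0003"]
num_new_tokens += tokenizer.add_tokens(domain_codes)  # keep a running total for the resize below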
You can quickly verify that the new token made it into the vocabulary by looking at the encoded input ids:
print(tokenizer("This is dennlinger."))
# 'input_ids': [101, 2023, 2003, 30522, 1012, 102]
The index 30522 now corresponds to the new token with my username, so we can check the first part. However, if we look at the function docstring of .add_tokens(), it also says:

Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer. In order to do that, please use the PreTrainedModel.resize_token_embeddings method.


Looking at this particular function, the description is a bit confusing, but we can get a correctly resized embedding matrix (with randomly initialized weights for the new tokens) by simply passing the previous vocabulary size plus the number of new tokens:
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
model.resize_token_embeddings(model.config.vocab_size + num_new_tokens)

# Test that everything worked correctly
model(**tokenizer("This is dennlinger", return_tensors="pt"))
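Once tokenizer and model agree again, it is probably worth saving both, so that the domain pre-training later loads a consistent pair; a minimal sketch, where the output directory name is arbitrary:
# Persist the extended tokenizer and the resized model together
tokenizer.save_pretrained("distilbert-with-domain-codes")
model.save_pretrained("distilbert-with-domain-codes")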
EDIT: Notably, .resize_token_embeddings() also takes care of any associated weights; this means that, if you are pre-training, it will also adjust the size of the language modeling head (which should have the same number of tokens) and fix any tied weights that are affected by the increased number of tokens.
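If you want to convince yourself of that, a quick check on the masked-LM variant (a sketch, assuming the same tokenizer extension as above) is to compare the input and output embedding sizes after the resize:
from transformers import AutoModelForMaskedLM

mlm_model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
mlm_model.resize_token_embeddings(mlm_model.config.vocab_size + num_new_tokens)

# Both the input embeddings and the LM head should now cover the extended vocabulary
print(mlm_model.get_input_embeddings().weight.shape[0])   # == len(tokenizer)
print(mlm_model.get_output_embeddings().weight.shape[0])  # == len(tokenizer)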

If you have a better answer, please add a comment about it, thank you!