Data Preprocessing#

Over the course of experimenting with fine-tuning TinyLlama, I realized that a lot of the complexity comes from the data pre-processing step. This notebook breaks down the pre-processing pipeline and calls attention to the different considerations and approaches involved.

In pre-processing the data, it is important to understand (at least) the following:

  • How is the raw data structured?

  • What does a single example from the raw data look like?

  • How do we need to transform the data for training?

  • What does a single example in the transformed data look like?

  • How do we need to tokenize the data for training?

  • How do we batch the tokenized data for training?

Setup#

We’re going to load the tokenizer and the dataset, but not the model. We won’t need it in this notebook.

%pip install --upgrade -r ./tinyllama_requirements.txt
CACHE_DIR = "../cache/TinyLlama/"  # the path to the directory where cache files will be saved

Load the Tokenizer#

from transformers import AutoTokenizer

# Load the tokenizer corresponding to the model checkpoint
model_ckpt = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt,
)

The model does not specify a pad token. If we want to use padding (more on this later), we need to set the pad token. We can do this with:

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token
'</s>'
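As an aside, a slightly more defensive pattern (just a sketch; not strictly necessary here, since we already know this tokenizer lacks a pad token) is to set the pad token only if one isn’t already defined:

# Only fall back to the eos token if no pad token is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token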

Load the Dataset#

from datasets import load_dataset
from pathlib import Path

slimorca = load_dataset("Open-Orca/SlimOrca", cache_dir=str(Path(CACHE_DIR) / "data"))

How is the raw data structured?#

What have we actually loaded here? We used the Hugging Face datasets library to load the dataset, and the resulting object is a DatasetDict. A DatasetDict is a dictionary-like object for managing datasets, where the keys are the names of the datasets and the values are the actual datasets. Usually the datasets are splits such as “train”, “valid”, and “test”. In this case, we only have a “train” split, with only one feature, “conversations”.

DatasetDicts enable us to map various pre-processing operations to the datasets they contain concisely and efficiently.

Let’s take a look.

slimorca
DatasetDict({
    train: Dataset({
        features: ['conversations'],
        num_rows: 517982
    })
})

The train split within the DatasetDict is a Dataset:

slimorca["train"]
Dataset({
    features: ['conversations'],
    num_rows: 517982
})

How do we need to transform the data for training?#

Hugging Face Datasets provide various methods for querying, subsetting, and processing data, generally quite efficiently because they use the Apache Arrow format. For example, if we want a validation set, we can split our training set into separate train/valid sets. The data are shuffled by default, so we don’t need to shuffle as a separate step. Note that we apply this operation to the Dataset, not the DatasetDict.

The resulting object is a new DatasetDict with two keys: “train” and “test”.

slimorca_split = slimorca["train"].train_test_split(test_size=0.1, seed=42)
slimorca_split
DatasetDict({
    train: Dataset({
        features: ['conversations'],
        num_rows: 466183
    })
    test: Dataset({
        features: ['conversations'],
        num_rows: 51799
    })
})

What does a single example from the raw data look like?#

Now let’s look at some of the actual data examples. What do the conversations look like?

slimorca_split["train"][42]
{'conversations': [{'from': 'system',
   'value': 'You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.',
   'weight': None},
  {'from': 'human',
   'value': 'Q:Read the article and select the best answer. Article: Tina was not like many of her classmates. She didn\'t listen to popular music or watch many movies, and she wasn\'t interested in nice clothes. When she got together with her friends, they wanted to listen to rock and pop music. When Tina asked  if  they would  like  to  try classical    music, they all looked at her strangely."Classical music  is  for old people, " one of  her friends said. Tina was worried that something was wrong with her. She decided to talk to her father. As she entered his study  , her father could feel something was wrong. "Dad, am I strange?" she asked her father."Of course not, " he answered. "Why do you ask that?" "Because I don\'t like the same things as my classmates do. They want to listen to Mariah Carey\'s music. I like Yo Yo Ma\'s." "I can understand, Tina,  it\'s all  right _ You don\'t have to copy   what other people do. Everybody has different tastes. Some of them are popular, and others aren\'t. "After talking with her father, Tina felt better. She realized    that being different made her special.  It was an important lesson for her to learn. Question: Tina\'s father made Tina feel  _  . Options: A: angry B: worried C: excited D: better\nA:',
   'weight': 0.0},
  {'from': 'gpt', 'value': 'D: better', 'weight': 1.0}]}

Each conversation is a list of dictionaries recording the role (“system”, “human”, or “gpt”) and the text of each message in the exchange. Our task is to map this to a format we can train the model on.

The Hugging Face Transformers library provides convenient chat model templates. The Hugging Face docs recommend applying the chat templates as a preprocessing step.

We won’t go into too much detail about the concept of chat templates—you can read more here. For now, just know that they provide a means of clearly indicating which part of a string came from the user and which part is the LLM’s expected response. Models are trained on specific chat formats, and using different formats will generally result in bad responses at inference time.

Here’s an example of a chat dictionary translated to a string with the chat template.

chat = [
    {
        "role": "system",
        "content": "You are a helpful assistant and an expert at making coffee.",
    },
    {"role": "user", "content": "How do I make coffee with a Chemex coffee maker?"},
    {
        "role": "assistant",
        "content": "To make coffee with a Chemex:\n1. Boil water to about 200°F (93°C).\n2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.\n3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.\n4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.\n5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.\n6. Once brewing is complete, remove the filter and enjoy.",
    },
]

print(tokenizer.apply_chat_template(chat, tokenize=False))
No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
<s>[INST] <<SYS>>
You are a helpful assistant and an expert at making coffee.
<</SYS>>

How do I make coffee with a Chemex coffee maker? [/INST] To make coffee with a Chemex:
1. Boil water to about 200°F (93°C).
2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.
3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.
4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.
5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.
6. Once brewing is complete, remove the filter and enjoy. </s>

Notice that using the chat template adds some special tokens indicating the beginning/end of the chat (<s> and </s>) and the beginning/end of the instruction ([INST] and [/INST]). We will need to add these to our tokenizer as special tokens later.

Let’s map the conversations to the expected input for the chat template. The Hugging Face chat templates expect a list of dictionaries similar to conversations, but with some notable differences: we need to replace from with role, value with content, gpt with assistant, and human with user. We’ll hold off on applying the chat template itself until later, as there are other considerations we still need to address.

def format_chat(ex):
    role_mapping = {"gpt": "assistant", "system": "system", "human": "user"}
    chat = [
        {"role": role_mapping[message["from"]], "content": message["value"]}
        for message in ex["conversations"]
    ]

    return {"chat": chat}


slimorca_split_formatted_chat = slimorca_split.map(format_chat, num_proc=32)
slimorca_split_formatted_chat
DatasetDict({
    train: Dataset({
        features: ['conversations', 'chat'],
        num_rows: 466183
    })
    test: Dataset({
        features: ['conversations', 'chat'],
        num_rows: 51799
    })
})

What does a single example of the transformed data look like?#

Now we have added a ‘chat’ key to each example in slimorca_split_formatted_chat. Note how we applied this transformation. We wrote a function that processed a single example and then used the map method to apply it to all examples in the DatasetDict. The num_proc parameter specifies how many processes to use for parallel processing, dramatically increasing the speed of the process.

Let’s compare the original and formatted data.

print(slimorca_split_formatted_chat["train"][42]["conversations"])
print(slimorca_split_formatted_chat["train"][42]["chat"])
[{'from': 'system', 'value': 'You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.', 'weight': None}, {'from': 'human', 'value': 'Q:Read the article and select the best answer. Article: Tina was not like many of her classmates. She didn\'t listen to popular music or watch many movies, and she wasn\'t interested in nice clothes. When she got together with her friends, they wanted to listen to rock and pop music. When Tina asked  if  they would  like  to  try classical    music, they all looked at her strangely."Classical music  is  for old people, " one of  her friends said. Tina was worried that something was wrong with her. She decided to talk to her father. As she entered his study  , her father could feel something was wrong. "Dad, am I strange?" she asked her father."Of course not, " he answered. "Why do you ask that?" "Because I don\'t like the same things as my classmates do. They want to listen to Mariah Carey\'s music. I like Yo Yo Ma\'s." "I can understand, Tina,  it\'s all  right _ You don\'t have to copy   what other people do. Everybody has different tastes. Some of them are popular, and others aren\'t. "After talking with her father, Tina felt better. She realized    that being different made her special.  It was an important lesson for her to learn. Question: Tina\'s father made Tina feel  _  . Options: A: angry B: worried C: excited D: better\nA:', 'weight': 0.0}, {'from': 'gpt', 'value': 'D: better', 'weight': 1.0}]
[{'content': 'You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.', 'role': 'system'}, {'content': 'Q:Read the article and select the best answer. Article: Tina was not like many of her classmates. She didn\'t listen to popular music or watch many movies, and she wasn\'t interested in nice clothes. When she got together with her friends, they wanted to listen to rock and pop music. When Tina asked  if  they would  like  to  try classical    music, they all looked at her strangely."Classical music  is  for old people, " one of  her friends said. Tina was worried that something was wrong with her. She decided to talk to her father. As she entered his study  , her father could feel something was wrong. "Dad, am I strange?" she asked her father."Of course not, " he answered. "Why do you ask that?" "Because I don\'t like the same things as my classmates do. They want to listen to Mariah Carey\'s music. I like Yo Yo Ma\'s." "I can understand, Tina,  it\'s all  right _ You don\'t have to copy   what other people do. Everybody has different tastes. Some of them are popular, and others aren\'t. "After talking with her father, Tina felt better. She realized    that being different made her special.  It was an important lesson for her to learn. Question: Tina\'s father made Tina feel  _  . Options: A: angry B: worried C: excited D: better\nA:', 'role': 'user'}, {'content': 'D: better', 'role': 'assistant'}]

Tokenize the Data#

As mentioned above, our instruction formatting includes some special tokens we would like to add to the tokenizer’s vocabulary. We can do that as follows:

# Add the instruction tokens to the tokenizer
special_tokens = ["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"]
# Adding special tokens to the tokenizer
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
4

A few questions might occur to you at this point:

Why don’t we add the <s> and </s> tokens to the tokenizer? Those were used in the chat formatting too!

We don’t need to add those tokens because they are already the tokenizer’s bos (beginning of sequence) and eos (end of sequence) tokens, as shown below.

What exactly does it mean to “add tokens to the tokenizer”?

Adding tokens to the tokenizer expands the tokenizer’s vocabulary, the list of possible tokens the model can use. Tokens in the tokenizer’s vocabulary won’t be split into smaller units. Before adding [/INST] to the vocabulary, for example, it could only be formed from a combination of smaller pieces such as ['[', '/', 'INST', ']']. Adding it allows the model to treat [/INST] as a single, distinct semantic unit: it only needs to learn one token to recognize the end of an instruction, not a sequence of four tokens that might also appear in other contexts. Furthermore, since all of the sequences we want to train on include the special tokens above, there are some efficiency gains: each of these tokens takes up only one unit of context length rather than several.

# the <s> and </s> tokens are already part of the vocabulary
tokenizer.bos_token, tokenizer.eos_token
('<s>', '</s>')
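As a quick sanity check (a sketch; the exact ids will depend on the tokenizer), we can confirm that the added markers are no longer split into smaller pieces:

# Each added special token should now come back as a single piece
print(tokenizer.tokenize("[INST]"))              # expected: ['[INST]']
print(tokenizer.convert_tokens_to_ids("[INST]")) # its id in the expanded vocabulary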

Structure the Tokenized Data#

Before we actually tokenize the chat data in our DatasetDict, we need to make some decisions about how to structure the tokenized data. So far, we’ve been thinking in terms of individual examples from the training data. This isn’t necessarily the most relevant perspective for the model. We should think in terms of sequences of tokens and batches of sequences.

  • A sequence is a single list of tokens from the training data, usually limited to some uniform length. Sequences of the desired length can be formed by taking a single example and padding it with the tokenizer’s padding token if it is too short, or truncating it if it is too long. Another option is sequence packing, wherein multiple shorter sequences are packed into a single longer sequence (see the sketch after this list).

  • A batch is a list of sequences. During training, the model is fed a batch of sequences at a time.
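We won’t use sequence packing in this notebook, but here is a minimal sketch of the idea: concatenate many tokenized examples into one long stream and slice it into fixed-length blocks (the function name and block size are placeholders, not part of the original code).

def pack_sequences(batch, block_size=1024):
    # Concatenate all tokenized examples in the batch into one long stream
    concatenated = [tok for ids in batch["input_ids"] for tok in ids]
    # Drop the remainder so every block has exactly block_size tokens
    total_length = (len(concatenated) // block_size) * block_size
    # Slice the stream into fixed-length blocks
    input_ids = [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * block_size for _ in input_ids],  # no padding needed
    }

# Typically applied to an already-tokenized dataset with a batched map, e.g.:
# packed = tokenized_dataset.map(pack_sequences, batched=True, remove_columns=["input_ids", "attention_mask"])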

We also need to think about our training objective, since it affects how we have to structure the data. Do the data have an input/output structure? In our case, we have instruction/response pairs: how do we structure them? This relates to the behavior of the model during training and is neither obvious nor easy to figure out.

In the Hugging Face training ecosystem, the collator is responsible for taking inputs, generating labels, and assembling the inputs into batches. Let’s see what happens if we tokenize our data and supply it to the DataCollatorForLanguageModeling collator.

Note that, while the apply_chat_template function provides the option to tokenize the chat, we’re going to apply the chat template and then tokenize in separate steps. This makes the process more transparent; we don’t have to wonder how the arguments from the helper function are passed to the tokenizer. This also corresponds to the approach shown in the docs. The training example does not include direct tokenization as part of the apply_chat_template call.

def tokenize_function(ex):
    # Apply chat template for formatting
    formatted_chat = tokenizer.apply_chat_template(
        ex["chat"],
        tokenize=False,  # Apply formatting but do not tokenize
        add_generation_prompt=False
    )

    # Tokenize using the standard tokenizer method
    tokenized_output = tokenizer(
        formatted_chat,
        add_special_tokens=False,  # apply_chat_template already added special tokens
    )
    
    return tokenized_output

slimorca_split_formatted_chat_tokenized = slimorca_split_formatted_chat.map(
    tokenize_function, num_proc=32
).remove_columns(["conversations", "chat"])
slimorca_split_formatted_chat_tokenized
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 466183
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 51799
    })
})

What happened here? We ended up with two features, input_ids and attention_mask.

  • The input_ids are our tokenized chats. We converted the chat dictionaries into formatted strings with the apply_chat_template function and then tokenized them using the standard tokenizer.

  • The attention_mask is a binary mask indicating whether each token is a padding token or not. During training, the model ignores the positions where the mask is 0. The attention mask is particularly useful for batching: we need inputs of the same length, but the actual texts are different lengths, so we pad (and/or truncate) them to the desired length, and the attention mask tells the model to ignore the padding tokens.

A quick note: tokenizer() and tokenizer.encode() are different! The former generates an attention mask and the latter does not. When preparing the data for training, we should use the former. The latter is useful for inference or for other cases where the attention mask is not needed.
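For example (a quick illustration):

# tokenizer(...) returns input_ids and an attention_mask
print(tokenizer("Hello, world!"))

# tokenizer.encode(...) returns only a list of token ids
print(tokenizer.encode("Hello, world!"))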

Let’s see what these actually look like. We’ll examine the beginnings and ends of the first example.

print("Input IDs: ", slimorca_split_formatted_chat_tokenized["train"][0]["input_ids"][0:10], "...", slimorca_split_formatted_chat_tokenized["train"][0]["input_ids"][-10:])
print("Attention Mask: ", slimorca_split_formatted_chat_tokenized["train"][0]["attention_mask"][0:10], "...", slimorca_split_formatted_chat_tokenized["train"][0]["attention_mask"][-10:])
Input IDs:  [1, 32000, 259, 32002, 29871, 13, 3492, 526, 385, 319] ... [13, 13, 22550, 29901, 11837, 2897, 7681, 275, 29871, 2]
Attention Mask:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ... [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

The attention mask is just a list of 1s! This is because we did not instruct the tokenizer to apply any padding or truncation.
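We can check the length of the untruncated, unpadded example directly:

# Number of tokens in the first training example
len(slimorca_split_formatted_chat_tokenized["train"][0]["input_ids"])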

The first example is 370 tokens long. Let’s see what happens if we specify that the output should be 375 tokens long and pad to that length.

def tokenize_function_with_padding(ex):
    # Apply chat template for formatting
    formatted_chat = tokenizer.apply_chat_template(
        ex["chat"],
        tokenize=False,  # Apply formatting but do not tokenize
        add_generation_prompt=False
    )

    # Tokenize using the standard tokenizer method
    tokenized_output = tokenizer(
        formatted_chat,
        add_special_tokens=False,  # apply_chat_template already added special tokens
        padding="max_length",
        max_length=375,
    )
    
    return tokenized_output
x = tokenize_function_with_padding(slimorca_split_formatted_chat["train"][0])

print("Input IDs: ", x["input_ids"][0:10], "...", x["input_ids"][-10:])
print("Attention Mask: ", x["attention_mask"][0:10], "...", x["attention_mask"][-10:])
Input IDs:  [1, 32000, 259, 32002, 29871, 13, 3492, 526, 385, 319] ... [2897, 7681, 275, 29871, 2, 2, 2, 2, 2, 2]
Attention Mask:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ... [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

The last five tokens in the input_ids are the pad token (id 2; the sixth 2 from the end is the example’s own eos token), and the last five attention_mask values are 0, indicating that those padding tokens should be ignored.

There are quite a few ways we can handle sequence lengths, mostly involving different padding and/or truncation schemes:

  • Padding means adding padding tokens (often the same as the eos token) to a sequence to make it the desired length

    • We can choose left or right padding: padding the beginning of the sequence or the end of the sequence (see the short sketch after this list).

    • We typically pad to the maximum length of the batch or to a pre-defined maximum length.

  • Truncation means removing tokens from a sequence to make it the desired length.

    • As with padding, we can truncate from either side of the sequence, though truncating the end is more common.

  • Padding and Truncation are usually used together. Short sequences are padded to reach a desired length, and long sequences are truncated to reach the desired length.
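As an aside, the side on which padding is applied is controlled by the tokenizer’s padding_side attribute (the outputs above show this tokenizer padding on the right). A minimal sketch:

# Temporarily switch to left padding
original_side = tokenizer.padding_side
tokenizer.padding_side = "left"

left_padded = tokenizer("a short example", padding="max_length", max_length=10)
print(left_padded["input_ids"])       # pad tokens now appear at the start
print(left_padded["attention_mask"])  # leading zeros mark the padding

# Restore the original setting for the rest of the notebook
tokenizer.padding_side = original_side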

Let’s see what changes if we use truncation and padding. We’re going to use a sequence length of 1024 tokens, which should give us a mix of truncated and padded results.

def tokenize_function_max_length(ex):
    # Apply chat template for formatting
    formatted_chat = tokenizer.apply_chat_template(
        ex["chat"],
        tokenize=False,  # Apply formatting but do not tokenize
        add_generation_prompt=False
    )

    # Tokenize using the standard tokenizer method
    tokenized_output = tokenizer(
        formatted_chat,
        add_special_tokens=False,  # apply_chat_template already added special tokens
        padding="max_length", # pad to the specified length
        max_length=1024, # max length at which to truncate or to which to pad
        truncation=True # truncate to the specified length
    )
    
    return tokenized_output

slimorca_split_formatted_chat_tokenized_max_length = slimorca_split_formatted_chat.map(
    tokenize_function_max_length, num_proc=32
).remove_columns(["conversations", "chat"])
print("Input IDs: ", slimorca_split_formatted_chat_tokenized_max_length["train"][0]["input_ids"][0:10], "...", slimorca_split_formatted_chat_tokenized_max_length["train"][0]["input_ids"][-10:])
print("Attention Mask: ", slimorca_split_formatted_chat_tokenized_max_length["train"][0]["attention_mask"][0:10], "...", slimorca_split_formatted_chat_tokenized_max_length["train"][0]["attention_mask"][-10:])
Input IDs:  [1, 32000, 259, 32002, 29871, 13, 3492, 526, 385, 319] ... [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
Attention Mask:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The first example has been padded out to 1024 tokens, as the trailing pad tokens and attention-mask zeros show. The 12th example, on the other hand, is longer than 1024 tokens, so we won’t see any padding. Instead, the input has been truncated, as we can see if we decode the final tokens.

print("Input IDs: ", slimorca_split_formatted_chat_tokenized_max_length["train"][11]["input_ids"][0:10], "...", slimorca_split_formatted_chat_tokenized_max_length["train"][11]["input_ids"][-10:])
print("Attention Mask: ", slimorca_split_formatted_chat_tokenized_max_length["train"][11]["attention_mask"][0:10], "...", slimorca_split_formatted_chat_tokenized_max_length["train"][11]["attention_mask"][-10:])
print("\n", "End of Decoded Output: ", tokenizer.decode(slimorca_split_formatted_chat_tokenized_max_length["train"][11]["input_ids"][-40:]), sep="")
Input IDs:  [1, 32000, 259, 32002, 29871, 13, 3492, 526, 385, 319] ... [29889, 29903, 29889, 18148, 29889, 29871, 13, 259, 13, 17302]
Attention Mask:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ... [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

End of Decoded Output: job approval was 45 percent with 42 percent of those surveyed saying they disapprove of what he is doing in the U.S. Senate. 
  
 Among

Collation#

A collator is responsible for forming batches from a collection of sequences. The simplest collators just form batches of sequences from the inputs, where a “batch” is a tensor of shape [batch_size, sequence_length].

from transformers import DefaultDataCollator

# Build a small batch by hand to see what the collator produces
batch_size = 5  # Define your batch size

# Create an instance of the data collator
collator = DefaultDataCollator()

# Retrieve examples from the dataset
examples = [slimorca_split_formatted_chat_tokenized_max_length['train'][i] for i in range(batch_size)]

# Run the collator on these examples to create a batch
batch = collator(examples)

We’ve fed the already-tokenized (padded and truncated) data to the default collator as-is; let’s explore what it produced.

batch, batch['input_ids'].shape
({'input_ids': tensor([[    1, 32000,   259,  ...,     2,     2,     2],
          [    1, 32000,   259,  ...,     2,     2,     2],
          [    1, 32000,   259,  ...,     2,     2,     2],
          [    1, 32000,   259,  ...,     2,     2,     2],
          [    1, 32000,   259,  ...,     2,     2,     2]]),
  'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0]])},
 torch.Size([5, 1024]))

We end up with a dictionary of tensors, where each tensor has a shape of [batch_size, sequence_length].

We don’t usually use the DefaultDataCollator. Instead, we use the DataCollatorForLanguageModeling collator. Let’s see what it does with our data.

from transformers import DataCollatorForLanguageModeling

# Now do the same with the language-modeling collator
batch_size = 5  # Define your batch size

# Create an instance of the data collator
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Retrieve examples from the dataset
examples = [slimorca_split_formatted_chat_tokenized_max_length['train'][i] for i in range(batch_size)]

# Run the collator on these examples to create a batch
batch = collator(examples)
tokenizer.decode(examples[2]['input_ids'])
'<s>[INST]  <<SYS>> \nYou are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.\n<</SYS>> \n\nPossible review types:\npick from the following.\n (1). negative;\n (2). positive;.\nGenerate a (1). review for a place [/INST]  Alright sweetie, imagine we went to a place and we didn\'t like it. So, a negative review is when we share our experience with others to let them know what went wrong. \n\nFor example: "I went to this place with my family, and we didn\'t have a good time. The people working there were not very friendly, and we waited a long time for our food. When the food came, it didn\'t taste good either. We were sad about our visit and wouldn\'t want to go back." </s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><
/s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s>'
batch['input_ids'].shape
torch.Size([5, 1024])
print("batch contents:" , batch.keys(), "\n", "batch shape: ", batch['input_ids'].shape)
print("Input IDs: ", batch['input_ids'][:, 0:10], "...", batch['input_ids'][:, -10:])

print("Attention Mask: ", batch['attention_mask'][:, 0:10], "...", batch['attention_mask'][:, -10:])

print("Labels: ", batch['labels'][:, 0:10], "...", batch['labels'][:, -10:])
batch contents: dict_keys(['input_ids', 'attention_mask', 'labels']) 
 batch shape:  torch.Size([5, 1024])
Input IDs:  tensor([[    1, 32000,   259, 32002, 29871,    13,  3492,   526,   385,   319],
        [    1, 32000,   259, 32002, 29871,    13,  3492,   526,   385,   319],
        [    1, 32000,   259, 32002, 29871,    13,  3492,   526,   263,  8444],
        [    1, 32000,   259, 32002, 29871,    13,  3492,   526,   385,   319],
        [    1, 32000,   259, 32002, 29871,    13,  3492,   526,   385,   319]]) ... tensor([[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]])
Attention Mask:  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]) ... tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Labels:  tensor([[    1, 32000,   259, 32002, 29871,    13,  3492,   526,   385,   319],
        [    1, 32000,   259, 32002, 29871,    13,  3492,   526,   385,   319],
        [    1, 32000,   259, 32002, 29871,    13,  3492,   526,   263,  8444],
        [    1, 32000,   259, 32002, 29871,    13,  3492,   526,   385,   319],
        [    1, 32000,   259, 32002, 29871,    13,  3492,   526,   385,   319]]) ... tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
        [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
        [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
        [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
        [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

A few observations:

  • The collator generates labels that are the same length as the input; the labels are simply a copy of the input_ids.

  • Pad tokens (id 2, since we set the pad token to the eos token) are replaced with -100 in the labels, signifying that the model should not include them when calculating the loss. (Because pad and eos share an id here, genuine eos tokens get masked as well.)

  • The collator pads each batch to the length of its longest sequence. Our inputs were already padded to a uniform 1024 tokens, so there was nothing left to pad here, but we can see the behavior with unpadded inputs below.
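To see that dynamic padding in action, we could hand the collator the unpadded tokenized examples from earlier (a sketch; output not shown):

# Use the unpadded, untruncated tokenized data from earlier
examples_unpadded = [
    slimorca_split_formatted_chat_tokenized["train"][i] for i in range(batch_size)
]
dynamic_batch = collator(examples_unpadded)

# The batch is padded to the length of its longest sequence, not to a fixed 1024
print(dynamic_batch["input_ids"].shape)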

An aside on shifting labels#

What do the labels actually do? In the most basic form of causal language modeling, the label at each position is simply the next token in the sequence, i.e. the labels are the input ids offset by one position. Why the offset? The objective is to predict the next token given all of the preceding tokens: if the model has seen tokens 1 through i, it should predict token i+1. Making the label at position i the token at position i+1 establishes exactly that task.

Crucially, the DataCollatorForLanguageModeling does not perform this shift; it simply copies the input_ids into the labels. The shift is left to the model code: for Hugging Face causal language models, it happens inside the model’s forward pass when the loss is computed.
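Roughly, the shift inside the loss computation looks like this (a simplified sketch of the pattern used in Hugging Face causal language model implementations, not the exact library code):

import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # logits: [batch, seq_len, vocab_size]; labels: [batch, seq_len]
    # Drop the last position's logits and the first position's labels so that
    # the prediction at position i is scored against the token at position i + 1
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # cross_entropy ignores positions labeled -100 (the padding) by default
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )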

Labels and Instruction Tuning#

It is natural to think that, in the case of instruction tuning, we can simply treat the tokenized instructions as our input_ids and the tokenized responses as our labels. This is not the case, though. With causal language modeling, the model generates one token at a time based on the preceding tokens, and that “next token” is the label. Supplying labels that are a fundamentally different sequence from the inputs, rather than the inputs offset by one token, will not train the model effectively.
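A common alternative, not used in this notebook, is to keep the labels identical to the input_ids but mask out the prompt portion with -100 so that only the response tokens contribute to the loss. A hedged sketch (response_start is a hypothetical index marking where the response begins):

def mask_prompt_labels(input_ids, response_start):
    # Labels start as a copy of the inputs...
    labels = list(input_ids)
    # ...with every token before the response excluded from the loss
    for i in range(response_start):
        labels[i] = -100
    return labels

Libraries such as TRL provide collators that implement this “completion-only” style of loss masking.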

Further Reading#

  • This series of posts on Weights & Biases provides a great overview of preparing data for instruction tuning.