Fine-Tuning GPT-2 on a Single GPU#

The [t5-small on a single GPU](1. T5-Small on Single GPU) notebook walked through a straightforward example of fine-tuning a language model. However, you might have noticed that the training problem was still essentially structured as a supervised learning problem: we had an input text (a code snippet) and a desired completion. When training LLMs like the GPT models, labels are not provided manually. Instead, we use an approach called self-supervised learning, in which the training objective is computed automatically from the inputs themselves. One example of self-supervised learning is causal language modeling, where the task is to predict the next word based on the previous words. For example, the sentence “The boy hid behind the tree” would be decomposed into the following training tasks:

  • Input: The, Target: boy

  • Input: The boy, Target: hid

  • Input: The boy hid, Target: behind

  • Input: The boy hid behind, Target: the

  • Input: The boy hid behind the, Target: tree.
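
To make this concrete, here is a minimal illustrative sketch (using whole words as tokens purely for readability; in practice the tokenizer produces integer ids, and Hugging Face causal language models construct these shifted targets internally) of how a single sentence expands into next-token prediction tasks:

# Illustrative only: decompose a sentence into (context, next-token) training pairs
tokens = ["The", "boy", "hid", "behind", "the", "tree"]

for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(f"Input: {' '.join(context):<24} Target: {target}")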

This requires us to preprocess our data and pass it to the model somewhat differently, which will be the subject of this notebook. We will still limit this example to training on a single GPU (an A10 with 24 GB of VRAM). We will use the gpt2 model with 124M parameters. Later, we will work through Eleuther’s Transformer Math blog post to understand the memory costs associated with training this model under different conditions and verify that it matches our experience. Hugging Face also provides a guide to model memory anatomy.

According to the Hugging Face post, a good heuristic is that mixed-precision training requires around 18 bytes of VRAM per model parameter, plus additional memory for activations (dependent on sequence length, batch size, and various model architecture details). For the 124M-parameter gpt2 model, that translates to around 2GB VRAM + activations.
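
As a quick sanity check on that figure, the back-of-the-envelope arithmetic (following the 18-bytes-per-parameter heuristic; the numbers are approximate) looks like this:

# Rough estimate of memory for weights, gradients, and optimizer states under
# mixed-precision training with AdamW (activation memory not included).
n_params = 124_000_000   # gpt2 (small)
bytes_per_param = 18     # 6 (fp16 + fp32 weights) + 4 (gradients) + 8 (AdamW states)
print(f"~{n_params * bytes_per_param / 1e9:.1f} GB before activations")  # ~2.2 GB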

Topics Covered in this Notebook#

The major difference between this notebook and the t5-small example is the focus on self-supervised learning. Additionally, this notebook will go a little deeper into:

  • monitoring training metrics with MLflow

  • measuring memory usage

Before progressing to multi-GPU and multi-node training, we will also explore ways to improve training efficiency on a single GPU with techniques such as mixed-precision training.
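
As a preview of the memory measurement we will do, one simple way to check peak GPU memory from PyTorch (a minimal sketch, assuming a single CUDA device) is:

import torch

# Reset the peak-memory counter, run some work, then read the high-water mark
torch.cuda.reset_peak_memory_stats()
# ... run a forward/backward pass or a training step here ...
print(f"Peak GPU memory allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")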

Choosing a Fine-Tuning Task#

We will fine-tune GPT-2 on the TinyStories dataset. TinyStories is:

a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4.

and can be used to train small models (actually quite a bit smaller than GPT-2) that

still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

(Source)

We can evaluate the model by passing prompts such as this example from the TinyStories paper:

Once upon a time there was a pumpkin. It was a very special pumpkin, it could speak. It was sad because it couldn’t move. Every day, it would say

and evaluating the grammar, consistency, and creativity of the output. We hope to see improvements in these areas after training.

1. Load the model and try some examples#

We’ll begin by loading the model and trying out some examples.

%pip install --upgrade -r ./gpt2_requirements.txt
# Some Environment Setup
OUTPUT_DIR = ""  # TODO: the path to the output directory; where model checkpoints will be saved
LOG_DIR = ""  # TODO: the path to the log directory; where logs will be saved
CACHE_DIR = ""  # TODO: the path to the cache directory; where cache files will be saved
from pathlib import Path

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="auto",
    cache_dir=Path(CACHE_DIR) / "model",
)
examples = [
    "There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and",
    "There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to",
    "Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and",
]

# Tokenize the examples
inputs = tokenizer(examples, return_tensors="pt", padding=True, add_special_tokens=True, truncation=True)

# Move tensors to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate text with the model
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
    do_sample=True,
    top_p=0.95,
)
# Decode and print the outputs
for i, output in enumerate(outputs):
    print(f"Completion for example {i + 1}:")
    print(tokenizer.decode(output, skip_special_tokens=True))
    print("\n")

Not the most coherent results. Hopefully our fine-tuning will improve this. Let’s get the dataset and take a look at it.

2. Get the dataset#

from datasets import load_dataset
tinystories = load_dataset('roneneldan/TinyStories',
                           cache_dir=str(Path(CACHE_DIR) / "data"))

Inspect the Dataset#

tinystories

There are > 2 million training samples and > 20,000 validation samples.

import pandas as pd

# Convert the train dataset to a pandas dataframe and preview the first few rows
df = pd.DataFrame(tinystories['train'][:10])
print(df)

3. Fine-Tune the Model#

This time around, we’re going to train the model with a little more care. In particular, we will:

  • keep a close eye on training metrics using MLflow

  • do a few test runs to choose a set of reasonable hyperparameters for our final fine-tuning run

  • use mixed-precision training for faster training

As in the t5-small example, we are not going to fine-tune on the entire dataset. Instead, we will sample 100,000 examples and fine-tune on those.

from torch.utils.data import DataLoader
import os

# Shuffle and select a subset of the train data
sample_size = 100000
shuffled_train_data = tinystories["train"].shuffle(seed=42)
subset_train_data = shuffled_train_data.select(range(sample_size))


def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

# Ensure the cache directory exists before writing the tokenized datasets to it
os.makedirs(CACHE_DIR, exist_ok=True)

# Tokenize and cache the train data
tokenized_train_data = subset_train_data.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    cache_file_name=str(Path(CACHE_DIR) / "train_cache.arrow")  
)

# Tokenize and cache the validation data
tokenized_validation_data = tinystories["validation"].map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    cache_file_name=str(Path(CACHE_DIR) / "validation_cache.arrow")
)
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import mlflow

# Define the training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=4, 
    per_device_eval_batch_size=4, 
    warmup_steps=1,
    weight_decay=0.01,
    logging_dir=LOG_DIR,
    logging_steps=10,  ## Log every 10 steps
    evaluation_strategy="steps",  ## Evaluate every 'eval_steps'
    eval_steps=10000,
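    # Enable mixed-precision (fp16) training: most compute runs in half precision
    # while fp32 master weights are kept, reducing memory use and speeding up training.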
    fp16=True,
)

# Initialize the data collator
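# With mlm=False, the collator prepares batches for causal language modeling:
# labels are copied from input_ids (padding positions are masked out of the loss),
# and the model shifts them internally to predict each next token.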
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_validation_data.select(range(5000)),  ## Use only the first 5k rows for eval data
    data_collator=data_collator,
)

# Start training and track with MLflow
with mlflow.start_run(log_system_metrics=True):
    trainer.train()
    mlflow.log_params(training_args.to_dict())
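
Because we passed log_system_metrics=True to mlflow.start_run, MLflow records system metrics (such as GPU and CPU utilization and memory, when available) alongside the training metrics reported by the Trainer. As a minimal sketch (assuming a local tracking store; the exact metric column names depend on what was logged), you can pull the results back afterwards:

import mlflow

# Fetch the most recent runs from the active experiment as a pandas DataFrame
# and show their logged metric columns
runs = mlflow.search_runs(max_results=5)
print(runs.filter(like="metrics.").T)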

4. Load the Fine-Tuned Model Checkpoint and Run some Examples#

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

examples = [
    "There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and",
    "There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to",
    "Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and",
]

# Specify the path to your checkpoint
checkpoint_path = Path(OUTPUT_DIR) / "checkpoint-5000"

# Load the tokenizer and model from the checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(checkpoint_path)

# Create a pipeline for text generation (adjust task as needed)
gpt2_pipeline = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, device_map="auto"
)

# Use the pipeline for inference
for output in gpt2_pipeline(examples, max_new_tokens=250):
    print(output[0]["generated_text"])
    print("\n")

Comparisons#

Here’s how the results from the base model compare to the results from the fine-tuned model:

  1. Prompt: “There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and”

  • base:

There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and she was summoned to protect it.

She was summoned to save the little guy from a giant cat, one of those monstrous monsters of the west who used its claws and arms to slay the little man, or so it said, from the same fate. It couldn’t be any better than the evil, big cat known as the Great Monster of the Abyss. The cat had so many of them that they were not even aware of their existence.

No one knew who it was, or who she was.

It would be her magic that made the cat disappear as it flew past. The cat that looked like it was afraid could not see. No one could touch it. It was as though the world had been given an entirely false sense of morality.

But the cat was too young. She would probably die of thirst. Even if the cat had survived the battle, the cat would never take another breath. She would die too. And she would be reborn as a creature of the evil, monster, and woman world, and be reborn as a pet, to be eaten by monsters to feed the monsters.

And if they got her, she would be happy, or at least healthy, for a while longer […]

  • fine-tuned:

There was a cat with magic powers. It could turn invisible. But one day, the cat lost its magic and flew away. It was a sad cat. The cat wanted to find its magic back.

A little girl saw the cat. She wanted to play with the cat. The little girl knew what was wrong. She called the girl’s house. She came to her house. “Look, what is your magic?” the girl’s mom said. “No, you must not do that. A cat is not magical. It can only be scary. You cannot fly, or make noise, or see things. You must learn how to use magic.”

The girl’s mom showed her the cat. She used the cat’s power. She made a big fire in front of the fire. The fire was hot and fierce. The fire aquired magic and light. The cat flew far and low and said, “Hello there, little girl. I am sorry that I cannot use magic. I must not be scared. I am just curious, like you. Did you learn anything while you were flying? What are you doing?”

The girl’s mom was surprised. She was not scared of the dark, or of the power, or of the noise. She was proud of the cat’s friend, Tom. Tom told the girl

  2. Prompt: “There was a cloud that could laugh. It laughed every day. But one day, the cloud didn’t laugh. The animals in the forest decided to”

  • base:

There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to attack. When the fox said something about the sky and sky, the animals said something very interesting. If you were looking up from the top of the trees, you would have seen this thing flying across the sky. But no, the fox just flew toward the clouds, and then his breath was blocked, because it had no breath left."

The moment the fox appeared, the two of them were stunned, as if they had seen a very special event.

At that moment, that forest fire also ignited completely, and the fire that had been so powerful was extinguished.

What was happening was really the same in the same way.

[Previous Chapter] [Table of Contents] [Next Chapter]

  • fine-tuned:

There was a cloud that could laugh. It laughed every day. But one day, the cloud didn't laugh. The animals in the forest decided to laugh. They made funny noise and laughed.

The animals lived in a big forest. They saw the big cloud. They asked the animals to go and play with her. But they were afraid of the cloud. The clou had bad things inside. The cloud was happy and made funny noises.

One day, the animals heard a loud noise. It was the wind. The wind blew hard, and the cloud came down. The animals saw the cloud in the air. It was flying above them. It was happy to see the cloud.

The animals were curious. They saw a big tree with a hole. They wanted to climb the tree. But the tree was not big or strong. It was weak. It fell down and broke into pieces.

The animals thought they were magic. They pulled on the pieces of the tree and watched the cloud disappear. The cloud started to laugh. It thought that it was going to be the best day ever.

But then, something happened. The sun came out. It turned red and golden. The animals were happy. They had seen the cloud. They had made the cloud come out. It was not a monster. It was a baby. The baby thought the

  3. Prompt: “Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and”

  • base:

Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and she could tell that the moment it happened she couldn't wait to experience it again. It didn't matter if it was like a lightning bolt hitting her heart or a thousand bolts, she kept on following it to the next day and day. It could never be her, she couldn't let it stop her from giving you the chance.

On this day of night, her eyes flashed with a starlike beauty that seemed to be burning. She felt like she was looking at one of the sunspots. At night, she saw the star at her feet with its glowing red hue.

“What do you think, your father?” said the beautiful girl in a blue cloak.

“No, I think you should not be alone, just that you can only see so much. I understand you don’t understand all that much, but you’re in a very bad mood and that may get you in trouble, you know.”

“My father is a very important man. There were some things that didn’t go well, but now you’re safe, so you don’t have any problems. I love you all, but we need to have a quiet time, as always.”

Mia looked down and smiled

  • fine-tuned:

Every night, Mia looked at the stars. But one night, one star twinkled differently. It seemed to be sending a message. Mia thought hard about what it could mean and knew it must be something bad.

On the third night, Mia remembered what she heard in the morning. A strange light was flapping around in the night sky. She ran to her window and opened it. She saw an orange glow that shone brightly in the dark. Mia wondered what it was.

Mia looked out the window and saw a big, beautiful, orange sun. She saw how it looked like something glowing in the night sky. It was a happy and bright thing. Mia smiled and ran back to her house.

Mia had a plan. She had a little box with something to wrap this sun. She filled it with sand and then started to wrap it in the box. She found an old pillow, a ball and a blanket. She said to herself, “Look, I don’t have any more sun left. I just have this rock and this blanket. I hope that the sun will come to me in the morning.”

Mia thought about it. She had a special idea. She said to herself, “Just like last night, I can come to the moon with the sun in my box.”

The next morning, Mia woke up bright. A big blue moon was

The fine-tuned versions are not necessarily more coherent as stories. They meander, they introduce new characters that serve no purpose, and, at least in the 250-token snippets generated, they don’t appear to move toward resolution.

On the other hand, the vocabulary is simpler and the sentences are shorter and (possibly as a consequence of being shorter) more internally consistent. For example, compare this snippet from the base model:

When the fox said something about the sky and sky, the animals said something very interesting. If you were looking up from the top of the trees, you would have seen this thing flying across the sky. But no, the fox just flew toward the clouds, and then his breath was blocked, because it had no breath left.

To this from the fine-tuned model:

One day, the animals heard a loud noise. It was the wind. The wind blew hard, and the cloud came down. The animals saw the cloud in the air. It was flying above them. It was happy to see the cloud.

There is a clear difference in sentence length and structure, even if the whole doesn’t necessarily make more sense (“It was happy to see the cloud”?).