Daniel Liden

Blog / About Me / Photos / LLM Fine Tuning / Notes /

PyTorch Review

This is a quick run through the appendix on PyTorch from Sebastian Raschka's Build a Large Language Model (From Scratch) book, currently available via Manning's MEAP. I haven't spent an enormous amount of time with PyTorch in the last year or so, so it seemed worth the effort to work through it.

A.1 PyTorch

There are three broad components to PyTorch:

  • A tensor library extending array-oriented programming from NumPy with additional features for accelerated computation on GPUs.
  • An automatic differentiation engine (autograd), which ehables automatic computation of gradients for tensor operations for backpropagation/model optimization.
  • A deep learning library, offering modular, flexible, and extensible building blocks for designing and training deep learning models.

Let's make sure we have it installed correctly…

import torch


Let's make sure we can use mps (on mac).



A.2 Understanding tensors

Tensors generalize vectors and matrices to arbitrary dimensions. PyTorch tensors are similar to NumPy arrays but have have several additional features:

  • an automatic differentiation engine
  • gpu computation

Still, it has a numpy-like API.

Creating tensors

# 0d tensor (scalar)

# 1d tensor (vector)
print(torch.tensor([1, 2, 3]))

# 2d tensor
print(torch.tensor([[1, 2], [3, 4]]))

# 3d tensor
print(torch.tensor([[[1, 2], [3,4]], [[5,6], [7,8]]]))
tensor([1, 2, 3])
tensor([[1, 2],
        [3, 4]])
tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])

Tensor data types

These are important to pay attention to! So let's pay attention to them. The default (from above) is the 64-bit integer.

tensor1d = torch.tensor([1,2,3])

For floats, PyTorch uses 32-bit precision by default.

floatvec = torch.tensor([1., 2., 3.])

Why this default?

  • GPU architectures are optimized for 32-bit computations
  • 32-bit precision is sufficient for most deep learning tasks but uses less memory and computational resources than 64-bit.

it is easy to change dtype (and precision) with a tensor's .to method.


Common tensor operations

  • brief survey of the most common tensor operations prior to getting into the computational graphs concept.
tensor2d = torch.tensor([[1, 2, 3], [4, 5, 6]])


print(tensor2d.reshape(3, 2))
tensor([[1, 2],
        [3, 4],
        [5, 6]])

It is more common to use view than reshape.

print(tensor2d.view(3, 2))
tensor([[1, 2],
        [3, 4],
        [5, 6]])


tensor([[1, 4],
        [2, 5],
        [3, 6]])

Matrix multiplication is usually handled with matmul.

tensor([[14, 32],
        [32, 77]])
print(tensor2d @ tensor2d.T)
tensor([[14, 32],
        [32, 77]])

A.3 Models as Computational Graphs

The previous section covered PyTorch's tensor library. This section gets into its automatic differentiation engine (autograd). Autograd provides functions for automatically computing gradients in dynamic computational graphs.

So what's a computational graph? It lays out the sequence of calculations needed to compute the gradients for backprop. We'll go through an example showing the forward pass of a logstic regression classifier.

import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2])
b = torch.tensor([0.0])

z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a,y)

This results in a computational graph which PyTorch builds in the background.

Input and weight -> (u = w1 * x1) -> +b -> (z = u + b) -> (a = σ(z)) -> loss = L(a,y) <- y

A.4 Automatic Differentiation

PyTorch will automatically build such a graph if one of its terminal nodes has the requires_grad attribute set to True. This enables us to train neural nets via backpropagation. Working backward from the above:

\begin{align*} \frac{\partial L}{\partial w_1} &= \frac{\partial u}{\partial w_1} \times \frac{\partial z}{\partial u} \times \frac{\partial a}{\partial z} \times \frac{\partial L}{\partial a} \\ \frac{\partial L}{\partial b} &= \frac{\partial z}{\partial b} \times \frac{\partial a}{\partial z} \times \frac{\partial L}{\partial a} \end{align*}

Basically–apply the chain rule right to left.

Quick reminder of some definitions:

  • a partial derivative measures the rate at which a function changes w/r/t one of its variables
  • a gradient is a vector of all the partial derivatives of a multivariate function

So what exactly does this have to do with torch as an autograd engine? PyTorch tracks every operation performed on tensors and can, therefore, construct a computational graph in the background. Then it cal cann on the grad function to compute the gradient of the loss w/r/t the model parameter as follows:

import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)
grad_L_w1 = grad(loss, w1, retain_graph=True) #A
grad_L_b = grad(loss, b, retain_graph=True)

We seldom manually call the grad function. We usually call .backward on the loss, which computes the gradients of all the leaf nodes in the graph, which will be stored via the .grad attributes of the tensors.


A.5 Implementing multilayer neural networks

Now we get to the third major component of Pytorch: its library for implementing deep neural networks.

We will focus on a fully-connected MLP. To implement an NN in PyTorch, we:

  • subclass the torch.nn.Module class to define a custom architecture
  • define layers within the __init__ constructor of the module subclass, specifying how they interact in the forward method.
  • defined the forward method, which describes how data passes through the network and relates as a computational graph.

We generally do not need to implement the backward method ourselves.

Here is code illustrating a basic NN with two hidden layers.

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):

        self.layers = torch.nn.Sequential(
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            # output layer
            torch.nn.Linear(20, num_outputs),

    def forward(self, x):
        logits = self.layers(x)
        return logits

We can instantiate this with 50 inputs and 3 outputs.

model = NeuralNetwork(50, 3)
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)

We can count the total number of trainable parameters as follows:

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)
Total number of trainable model parameters: 2213

A parameter is trainable if its requires_grad attribute is True. We can investigate specific layers. Let's look at the first linear layer.

Parameter containing:
tensor([[-0.0844,  0.0863,  0.1168,  ...,  0.0203, -0.0814, -0.0504],
        [ 0.0288,  0.0004, -0.1411,  ..., -0.0322, -0.1085,  0.0682],
        [-0.1075, -0.0173, -0.0476,  ..., -0.0684, -0.0522, -0.1316],
        [ 0.1129, -0.0639, -0.0662,  ...,  0.1284, -0.0707,  0.1090],
        [ 0.0790, -0.1206, -0.1156,  ...,  0.1393, -0.0233,  0.1035],
        [-0.0078, -0.0789,  0.0931,  ...,  0.0220, -0.0572,  0.1112]],

This is truncated, so let's look at the shape instead to make sure it matches with our expectations.

from rich import print

torch.Size([30, 50])

We can call on the model like this:

X = torch.rand((1,50))
out = model(X)
tensor([[ 0.0623, -0.0063, -0.1485]], grad_fn=<AddmmBackward0>)

We generated a single random example (50 dimensions) and passed it to the model. This was the forward pass. The forward pass simply means calculating the output tensors from the input tensors.

As we can see from the grad_fn, this forward pass computes a computational graph for backprop. This can be wasteful and unnecessary if we're just interested in inference. We use the torch.no_grad context manager to get around this.

with torch.no_grad():
    out = model(X)
tensor([[ 0.0623, -0.0063, -0.1485]])

And this approach just computes the output tensors.

Usually in PyTorch we don't pass the final layer to a nonlinear activation function, because the loss function usually combines softmax with negativel og-likelihood loss in a single class. We have to call softmax explicitly if we want class-membership probabilities.

with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
tensor([[0.3645, 0.3403, 0.2952]])

A.6 Setting up efficient data loaders

A DataSet is a class that defines how individual records are loaded. A DataLoader class handles dataset shuffling and assembling data records into batches.

This example shows a dataset of five training examples with two features each, along with a tensor of class labels. We also have a test dataset of two entries.

X_train = torch.tensor(
    [[-1.2, 3.1], [-0.9, 2.9], [-0.5, 2.6], [2.3, -1.1], [2.7, -1.5]]
y_train = torch.tensor([0, 0, 0, 1, 1])
X_test = torch.tensor(
        [-0.8, 2.8],
        [2.6, -1.6],
y_test = torch.tensor([0, 1])

Let's first make these into a DataSet.

from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

Note the three main components of the above Dataset definition:

  1. __init__, to set up attributes we can access in the other methods. This might be file paths, file objects, database connectors, etc. Here we just use X and y, which we point toward the correct tensor objects in memory.
  2. __getitem__ is for defining instructions for retrieving exactly one record via index.
  3. __len__ is for retrieving the length of the dataset.


Now we can use the DataLoader class to define how to sample from the Dataset we defined.

from torch.utils.data import DataLoader


train_loader = DataLoader(

test_loader = DataLoader(

Now we can iterate over the train_loader as follows:

for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)
Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])

Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])

Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])

Note that we can set drop_last=True to drop the last uneven batch, as significantly uneven batch sizes can harm convergence.

The num_workers argument relates to parallelizing data loading/processing. 0 indicates that it will all be done in the main process, not in separate worker processes. This can slow things down a lot.

A.7 A typical training loop

In this section, we combine many of the techniques from above to show a complete training loop.

import torch
import torch.nn.functional as F
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
num_epochs = 3
for epoch in range(num_epochs): 
    for batch_idx, (features, labels) in enumerate(train_loader):
        logits = model(features)
        loss = F.cross_entropy(logits, labels)
        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train Loss: {loss:.2f}")
    # Optional model evaluation
Epoch: 001/003 | Batch 000/003 | Train Loss: 0.75
Epoch: 001/003 | Batch 001/003 | Train Loss: 0.65
Epoch: 001/003 | Batch 002/003 | Train Loss: 0.42
Epoch: 002/003 | Batch 000/003 | Train Loss: 0.05
Epoch: 002/003 | Batch 001/003 | Train Loss: 0.13
Epoch: 002/003 | Batch 002/003 | Train Loss: 0.00
Epoch: 003/003 | Batch 000/003 | Train Loss: 0.01
Epoch: 003/003 | Batch 001/003 | Train Loss: 0.00
Epoch: 003/003 | Batch 002/003 | Train Loss: 0.02

Note the use of model.train and model.eval. These set the model into training and evaluation mode, respectively. Some components behave differently during training or inference, such as ddropout or batch normalization. We don't have these or anything like them, so this is redundant in our code, but still good practice.

We pass the logits directly to cross_entropy to compute the loss and call loss.backward() to compute gradients. optimizer.step uses the gradients to update the model parameters.

It is important that we include an optimizer.zero_grad call in each update to reset the gradients and ensure they do not accumulate.

Now we can make predictions with the model.

with torch.no_grad():
    outputs = model(X_train)
tensor([[ 2.9320, -4.2563],
        [ 2.6045, -3.8389],
        [ 2.1484, -3.2514],
        [-2.1461,  2.1496],
        [-2.5004,  2.5210]])

If we want the class membership, we can obtain it with:

probas = torch.softmax(outputs, dim=1)
tensor([[    0.9992,     0.0008],
        [    0.9984,     0.0016],
        [    0.9955,     0.0045],
        [    0.0134,     0.9866],
        [    0.0066,     0.9934]])

There are two classes, so the above represents the probabilities of belonging to class 1 or class 2. The first three have high probability of class 1; the last two of class 2.

We can convery into class labels as follows:

predictions = torch.argmax(probas, dim=1)
tensor([0, 0, 0, 1, 1])

We don't need to compute softmax probabilities to accomplish this.

print(torch.argmax(outputs, dim=1))
tensor([0, 0, 0, 1, 1])

Is it correct?

predictions == y_train
tensor([True, True, True, True, True])

and to get the proportion correct:

torch.sum(predictions == y_train) / len(y_train)

A.8 Saving and Loading Models

We can save a model as follows:

torch.save(model.state_dict(), "model.pth")

.pt and .pth are the most common extensions, by convention, but we can use whatever we want.

We restore a model with:

model = NeuralNetwork(2,2)

It is necessary to have an instance of the model in memory in order to load the model weights.

A.9 Optimizing training performance with GPUs

Computations on GPUs

  • Modifying training runs to use GPU in PyTorch is easy
  • In PyTorch, a device is where computations occur and data resides. A pytorch tensor lives on a device and its operations are executed on that device.

Because I am running this locally, I am going to try to follow these examples with mps.

print("MPS is available." if torch.backends.mps.is_available() else "MPS is not available.")
MPS is available.

By default, operations are done on CPU.

tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])
tensor_1 + tensor_2
tensor([5., 7., 9.])

Now we can transfer the tensors to GPU and perform the addition there.

tensor_1 = tensor_1.to("mps")
tensor_2 = tensor_2.to("mps")
tensor_1 + tensor_2
tensor([5., 7., 9.], device='mps:0')

All tensors have to be on the same device or the computation will fail.

tensor_1 = tensor_1.to("mps")
tensor_2 = tensor_2.to("cpu")
tensor_1 + tensor_2
Traceback (most recent call last):
  File "<string>", line 17, in __PYTHON_EL_eval
  File "<string>", line 3, in <module>
  File "/var/folders/vq/mfrl6bsd37jglvmz0vyxf3000000gn/T/babel-YaG8HR/python-l9RkUi", line 3, in <module>
    tensor_1 + tensor_2
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, mps:0 and cpu!

Single-GPU Training

All we need to do to train on a single GPU is:

  • set device = torch.device("cuda")
  • set model = model.to(device)
  • set features, labels = features.to(device), labels.to(device)

This is usually considered the best practice:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Multi-GPU Training

This section introduces the idea of distributed training.

The most basic approach uses PyTorch's DistributedDataParallel strategy. DDP splits inputs across available devices and processes the subsets simultaneously. How does this work?

  • PyTorch launches a separate process on each GPU
  • Each process keeps a copy of the model
  • The copies are synchronized during training. The computed gradients are averaged and synchronized during training to update the model copies.

DDP offers enhanced training speed.

DDP does not function properly in interactive environments like Jupyter notebooks. DDP code must be run as a script, not within a notebook interface.

First we load the utilities:

import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
  • multiprocessing includes various functions for spawning multiple processes and applying functions in parallel.
  • DistributedSampler is for dividing the training data among processes.
  • The init/destroy process group functions are for starting and ending the distributed training modules.

Here is an example script for distributed training.

def ddp_setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
def prepare_dataset():
    train_loader = DataLoader(
        # this ensures each GPU receives different data subsample
    return train_loader, test_loader
def main(rank, world_size, num_epochs):      
    ddp_setup(rank, world_size)
    train_loader, test_loader = prepare_dataset()
    model = NeuralNetwork(num_inputs=2, num_outputs=2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
    # Wrap the model in DDP to enable gradient synchronization
    model = DDP(model, device_ids=[rank])
    for epoch in range(num_epochs):
    for features, labels in train_loader:
            features, labels = features.to(rank), labels.to(rank) 
            print(f"[GPU{rank}] Epoch: {epoch+1:03d}/{num_epochs:03d}"
                  f" | Batchsize {labels.shape[0]:03d}"
                  f" | Train/Val Loss: {loss:.2f}")
    train_acc = compute_accuracy(model, train_loader, device=rank)
    print(f"[GPU{rank}] Training accuracy", train_acc)
    test_acc = compute_accuracy(model, test_loader, device=rank)
    print(f"[GPU{rank}] Test accuracy", test_acc)
    # exit distributed training, free up resources

if __name__ == "__main__":
    print("Number of GPUs available:", torch.cuda.device_count())
    num_epochs = 3
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)

If you only want to use some GPUs, set the CUDA_VISIBLE_DEVICES environment variable.

CUDA_VISIBLE_DIVICES=0,2 python training_script.py

Date: 2024-03-31 Sun 00:00

Emacs 29.3 (Org mode 9.6.15)