# PyTorch Review

This is a quick run through the appendix on PyTorch from Sebastian Raschka's Build a Large Language Model (From Scratch) book, currently available via Manning's MEAP. I haven't spent an enormous amount of time with PyTorch in the last year or so, so it seemed worth the effort to work through it.

## A.1 PyTorch

There are three broad components to PyTorch:

- A tensor library extending array-oriented programming from NumPy with additional features for accelerated computation on GPUs.
- An automatic differentiation engine (autograd), which ehables automatic computation of gradients for tensor operations for backpropagation/model optimization.
- A deep learning library, offering modular, flexible, and extensible building blocks for designing and training deep learning models.

Let's make sure we have it installed correctly…

```
import torch
torch.__version__
```

2.2.2

Let's make sure we can use `mps`

(on mac).

torch.backends.mps.is_available()

True

Great.

## A.2 Understanding tensors

Tensors generalize vectors and matrices to arbitrary dimensions. PyTorch tensors are similar to NumPy arrays but have have several additional features:

- an automatic differentiation engine
- gpu computation

Still, it has a numpy-like API.

### Creating tensors

# 0d tensor (scalar) print(torch.tensor(1)) # 1d tensor (vector) print(torch.tensor([1, 2, 3])) # 2d tensor print(torch.tensor([[1, 2], [3, 4]])) # 3d tensor print(torch.tensor([[[1, 2], [3,4]], [[5,6], [7,8]]]))

tensor(1) tensor([1, 2, 3]) tensor([[1, 2], [3, 4]]) tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

### Tensor data types

These are important to pay attention to! So let's pay attention to them. The default (from above) is the 64-bit integer.

tensor1d = torch.tensor([1,2,3]) print(tensor1d.dtype)

torch.int64

For floats, PyTorch uses 32-bit precision by default.

floatvec = torch.tensor([1., 2., 3.]) print(floatvec.dtype)

torch.float32

Why this default?

- GPU architectures are optimized for 32-bit computations
- 32-bit precision is sufficient for most deep learning tasks but uses less memory and computational resources than 64-bit.

it is easy to change `dtype`

(and precision) with a tensor's `.to`

method.

```
print(torch.tensor([1,2,3]).to(torch.float32).dtype)
```

torch.float32

### Common tensor operations

- brief survey of the most common tensor operations prior to getting into the computational graphs concept.

tensor2d = torch.tensor([[1, 2, 3], [4, 5, 6]])

Reshape:

```
print(tensor2d.reshape(3, 2))
```

tensor([[1, 2], [3, 4], [5, 6]])

It is more common to use `view`

than `reshape`

.

```
print(tensor2d.view(3, 2))
```

tensor([[1, 2], [3, 4], [5, 6]])

Transpose

```
print(tensor2d.T)
```

tensor([[1, 4], [2, 5], [3, 6]])

Matrix multiplication is usually handled with `matmul`

.

```
print(tensor2d.matmul(tensor2d.T))
```

tensor([[14, 32], [32, 77]])

```
print(tensor2d @ tensor2d.T)
```

tensor([[14, 32], [32, 77]])

## A.3 Models as Computational Graphs

The previous section covered PyTorch's tensor library. This section gets into its automatic differentiation engine (autograd). Autograd provides functions for automatically computing gradients in dynamic computational graphs.

So what's a computational graph? It lays out the sequence of calculations needed to compute the gradients for backprop. We'll go through an example showing the forward pass of a logstic regression classifier.

import torch.nn.functional as F y = torch.tensor([1.0]) x1 = torch.tensor([1.1]) w1 = torch.tensor([2.2]) b = torch.tensor([0.0]) z = x1 * w1 + b a = torch.sigmoid(z) loss = F.binary_cross_entropy(a,y)

This results in a computational graph which PyTorch builds in the background.

Input and weight -> (u = w_{1} * x_{1}) -> +b -> (z = u + b) -> (a = σ(z)) -> loss = L(a,y) <- y

## A.4 Automatic Differentiation

PyTorch will automatically build such a graph if one of its terminal nodes has the `requires_grad`

attribute set to True. This enables us to train neural nets via backpropagation. Working backward from the above:

Basically–apply the chain rule right to left.

Quick reminder of some definitions:

- a partial derivative measures the rate at which a function changes w/r/t one of its variables
- a gradient is a vector of all the partial derivatives of a multivariate function

So what exactly does this have to do with torch as an autograd engine? PyTorch tracks every operation performed on tensors and can, therefore, construct a computational graph in the background. Then it cal cann on the `grad`

function to compute the gradient of the loss w/r/t the model parameter as follows:

import torch.nn.functional as F from torch.autograd import grad y = torch.tensor([1.0]) x1 = torch.tensor([1.1]) w1 = torch.tensor([2.2], requires_grad=True) b = torch.tensor([0.0], requires_grad=True) z = x1 * w1 + b a = torch.sigmoid(z) loss = F.binary_cross_entropy(a, y) grad_L_w1 = grad(loss, w1, retain_graph=True) #A grad_L_b = grad(loss, b, retain_graph=True)

print(grad_L_w1) print(grad_L_b)

(tensor([-0.0898]),) (tensor([-0.0817]),)

We seldom manually call the grad function. We usually call `.backward`

on the loss, which computes the gradients of all the leaf nodes in the graph, which will be stored via the `.grad`

attributes of the tensors.

print(loss.backward()) print(w1.grad) print(b.grad)

None tensor([-0.0898]) tensor([-0.0817])

## A.5 Implementing multilayer neural networks

Now we get to the third major component of Pytorch: its library for implementing deep neural networks.

We will focus on a fully-connected MLP. To implement an NN in PyTorch, we:

- subclass the
`torch.nn.Module`

class to define a custom architecture - define layers within the
`__init__`

constructor of the module subclass, specifying how they interact in the forward method. - defined the forward method, which describes how data passes through the network and relates as a computational graph.

We generally do not need to implement the `backward`

method ourselves.

Here is code illustrating a basic NN with two hidden layers.

class NeuralNetwork(torch.nn.Module): def __init__(self, num_inputs, num_outputs): super().__init__() self.layers = torch.nn.Sequential( # 1st hidden layer torch.nn.Linear(num_inputs, 30), torch.nn.ReLU(), # 2nd hidden layer torch.nn.Linear(30, 20), torch.nn.ReLU(), # output layer torch.nn.Linear(20, num_outputs), ) def forward(self, x): logits = self.layers(x) return logits

We can instantiate this with 50 inputs and 3 outputs.

model = NeuralNetwork(50, 3) print(model)

NeuralNetwork( (layers): Sequential( (0): Linear(in_features=50, out_features=30, bias=True) (1): ReLU() (2): Linear(in_features=30, out_features=20, bias=True) (3): ReLU() (4): Linear(in_features=20, out_features=3, bias=True) ) )

We can count the total number of trainable parameters as follows:

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad) print("Total number of trainable model parameters:", num_params)

Total number of trainable model parameters: 2213

A parameter is *trainable* if its `requires_grad`

attribute is `True`

. We can investigate specific layers. Let's look at the first linear layer.

```
print(model.layers[0].weight)
```

Parameter containing: tensor([[-0.0844, 0.0863, 0.1168, ..., 0.0203, -0.0814, -0.0504], [ 0.0288, 0.0004, -0.1411, ..., -0.0322, -0.1085, 0.0682], [-0.1075, -0.0173, -0.0476, ..., -0.0684, -0.0522, -0.1316], ..., [ 0.1129, -0.0639, -0.0662, ..., 0.1284, -0.0707, 0.1090], [ 0.0790, -0.1206, -0.1156, ..., 0.1393, -0.0233, 0.1035], [-0.0078, -0.0789, 0.0931, ..., 0.0220, -0.0572, 0.1112]], requires_grad=True)

This is truncated, so let's look at the shape instead to make sure it matches with our expectations.

from rich import print print(model.layers[0].weight.shape)

torch.Size([30, 50])

We can call on the model like this:

X = torch.rand((1,50)) out = model(X) print(out)

tensor([[ 0.0623, -0.0063, -0.1485]], grad_fn=<AddmmBackward0>)

We generated a single random example (50 dimensions) and passed it to the model. This was the *forward pass*. The forward pass simply means calculating the output tensors from the input tensors.

As we can see from the `grad_fn`

, this forward pass computes a computational graph for backprop. This can be wasteful and unnecessary if we're just interested in inference. We use the `torch.no_grad`

context manager to get around this.

with torch.no_grad(): out = model(X) print(out)

tensor([[ 0.0623, -0.0063, -0.1485]])

And this approach just computes the output tensors.

Usually in PyTorch we don't pass the final layer to a nonlinear activation function, because the loss function usually combines softmax with negativel og-likelihood loss in a single class. We have to call softmax explicitly if we want class-membership probabilities.

with torch.no_grad(): out = torch.softmax(model(X), dim=1) print(out)

tensor([[0.3645, 0.3403, 0.2952]])

## A.6 Setting up efficient data loaders

A `DataSet`

is a class that defines how individual records are loaded. A `DataLoader`

class handles dataset shuffling and assembling data records into batches.

This example shows a dataset of five training examples with two features each, along with a tensor of class labels. We also have a test dataset of two entries.

X_train = torch.tensor( [[-1.2, 3.1], [-0.9, 2.9], [-0.5, 2.6], [2.3, -1.1], [2.7, -1.5]] ) y_train = torch.tensor([0, 0, 0, 1, 1]) X_test = torch.tensor( [ [-0.8, 2.8], [2.6, -1.6], ] ) y_test = torch.tensor([0, 1])

Let's first make these into a `DataSet`

.

from torch.utils.data import Dataset class ToyDataset(Dataset): def __init__(self, X, y): self.features = X self.labels = y def __getitem__(self, index): one_x = self.features[index] one_y = self.labels[index] return one_x, one_y def __len__(self): return self.labels.shape[0] train_ds = ToyDataset(X_train, y_train) test_ds = ToyDataset(X_test, y_test)

Note the three main components of the above Dataset definition:

`__init__`

, to set up attributes we can access in the other methods. This might be file paths, file objects, database connectors, etc. Here we just use X and y, which we point toward the correct tensor objects in memory.`__getitem__`

is for defining instructions for retrieving exactly one record via`index`

.`__len__`

is for retrieving the length of the dataset.print(len(train_ds))

5

Now we can use the `DataLoader`

class to define how to sample from the Dataset we defined.

from torch.utils.data import DataLoader torch.manual_seed(123) train_loader = DataLoader( dataset=train_ds, batch_size=2, shuffle=True, num_workers=0 ) test_loader = DataLoader( dataset=test_ds, batch_size=2, shuffle=False, num_workers=0 )

Now we can iterate over the `train_loader`

as follows:

for idx, (x, y) in enumerate(train_loader): print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[ 2.3000, -1.1000], [-0.9000, 2.9000]]) tensor([1, 0]) Batch 2: tensor([[-1.2000, 3.1000], [-0.5000, 2.6000]]) tensor([0, 0]) Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])

Note that we can set `drop_last=True`

to drop the last uneven batch, as significantly uneven batch sizes can harm convergence.

The `num_workers`

argument relates to parallelizing data loading/processing. 0 indicates that it will all be done in the main process, not in separate worker processes. This can slow things down a lot.

## A.7 A typical training loop

In this section, we combine many of the techniques from above to show a complete training loop.

import torch import torch.nn.functional as F torch.manual_seed(123) model = NeuralNetwork(num_inputs=2, num_outputs=2) optimizer = torch.optim.SGD(model.parameters(), lr=0.5) num_epochs = 3 for epoch in range(num_epochs): model.train() for batch_idx, (features, labels) in enumerate(train_loader): logits = model(features) loss = F.cross_entropy(logits, labels) optimizer.zero_grad() loss.backward() optimizer.step() ### LOGGING print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}" f" | Batch {batch_idx:03d}/{len(train_loader):03d}" f" | Train Loss: {loss:.2f}") model.eval() # Optional model evaluation

Epoch: 001/003 | Batch 000/003 | Train Loss: 0.75 Epoch: 001/003 | Batch 001/003 | Train Loss: 0.65 Epoch: 001/003 | Batch 002/003 | Train Loss: 0.42 Epoch: 002/003 | Batch 000/003 | Train Loss: 0.05 Epoch: 002/003 | Batch 001/003 | Train Loss: 0.13 Epoch: 002/003 | Batch 002/003 | Train Loss: 0.00 Epoch: 003/003 | Batch 000/003 | Train Loss: 0.01 Epoch: 003/003 | Batch 001/003 | Train Loss: 0.00 Epoch: 003/003 | Batch 002/003 | Train Loss: 0.02

Note the use of `model.train`

and `model.eval`

. These set the model into training and evaluation mode, respectively. Some components behave differently during training or inference, such as ddropout or batch normalization. We don't have these or anything like them, so this is redundant in our code, but still good practice.

We pass the logits directly to `cross_entropy`

to compute the loss and call `loss.backward()`

to compute gradients. `optimizer.step`

uses the gradients to update the model parameters.

It is important that we include an `optimizer.zero_grad`

call in each update to reset the gradients and ensure they do not accumulate.

Now we can make predictions with the model.

model.eval() with torch.no_grad(): outputs = model(X_train) print(outputs)

tensor([[ 2.9320, -4.2563], [ 2.6045, -3.8389], [ 2.1484, -3.2514], [-2.1461, 2.1496], [-2.5004, 2.5210]])

If we want the class membership, we can obtain it with:

torch.set_printoptions(sci_mode=False) probas = torch.softmax(outputs, dim=1) print(probas)

tensor([[ 0.9992, 0.0008], [ 0.9984, 0.0016], [ 0.9955, 0.0045], [ 0.0134, 0.9866], [ 0.0066, 0.9934]])

There are two classes, so the above represents the probabilities of belonging to class 1 or class 2. The first three have high probability of class 1; the last two of class 2.

We can convery into class labels as follows:

predictions = torch.argmax(probas, dim=1) print(predictions)

tensor([0, 0, 0, 1, 1])

We don't need to compute softmax probabilities to accomplish this.

print(torch.argmax(outputs, dim=1))

tensor([0, 0, 0, 1, 1])

Is it correct?

```
predictions == y_train
```

tensor([True, True, True, True, True])

and to get the proportion correct:

torch.sum(predictions == y_train) / len(y_train)

tensor(1.)

## A.8 Saving and Loading Models

We can save a model as follows:

```
torch.save(model.state_dict(), "model.pth")
```

`.pt`

and `.pth`

are the most common extensions, by convention, but we can use whatever we want.

We restore a model with:

model = NeuralNetwork(2,2) model.load_state_dict(torch.load("model.pth"))

It is necessary to have an instance of the model in memory in order to load the model weights.

## A.9 Optimizing training performance with GPUs

### Computations on GPUs

- Modifying training runs to use GPU in PyTorch is easy
- In PyTorch, a
`device`

is where computations occur and data resides. A pytorch tensor lives on a device and its operations are executed on that device.

Because I am running this locally, I am going to try to follow these examples with mps.

print("MPS is available." if torch.backends.mps.is_available() else "MPS is not available.")

MPS is available.

By default, operations are done on CPU.

tensor_1 = torch.tensor([1., 2., 3.]) tensor_2 = torch.tensor([4., 5., 6.]) tensor_1 + tensor_2

tensor([5., 7., 9.])

Now we can transfer the tensors to GPU and perform the addition there.

tensor_1 = tensor_1.to("mps") tensor_2 = tensor_2.to("mps") tensor_1 + tensor_2

tensor([5., 7., 9.], device='mps:0')

All tensors have to be on the same device or the computation will fail.

tensor_1 = tensor_1.to("mps") tensor_2 = tensor_2.to("cpu") tensor_1 + tensor_2

Traceback (most recent call last): File "<string>", line 17, in __PYTHON_EL_eval File "<string>", line 3, in <module> File "/var/folders/vq/mfrl6bsd37jglvmz0vyxf3000000gn/T/babel-YaG8HR/python-l9RkUi", line 3, in <module> tensor_1 + tensor_2 ~~~~~~~~~^~~~~~~~~~ RuntimeError: Expected all tensors to be on the same device, but found at least two devices, mps:0 and cpu!

### Single-GPU Training

All we need to do to train on a single GPU is:

- set
`device = torch.device("cuda")`

- set
`model = model.to(device)`

- set
`features, labels = features.to(device), labels.to(device)`

This is usually considered the best practice:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Multi-GPU Training

This section introduces the idea of *distributed training*.

The most basic approach uses PyTorch's `DistributedDataParallel`

strategy. DDP splits inputs across available devices and processes the subsets simultaneously. How does this work?

- PyTorch launches a separate process on each GPU
- Each process keeps a copy of the model
- The copies are synchronized during training. The computed gradients are averaged and synchronized during training to update the model copies.

DDP offers enhanced training speed.

DDP does not function properly in interactive environments like Jupyter notebooks. DDP code must be run as a script, not within a notebook interface.

First we load the utilities:

import torch.multiprocessing as mp from torch.utils.data.distributed import DistributedSampler from torch.nn.parallel import DistributedDataParallel as DDP from torch.distributed import init_process_group, destroy_process_group

`multiprocessing`

includes various functions for spawning multiple processes and applying functions in parallel.`DistributedSampler`

is for dividing the training data among processes.- The init/destroy process group functions are for starting and ending the distributed training modules.

Here is an example script for distributed training.

def ddp_setup(rank, world_size): os.environ["MASTER_ADDR"] = "localhost" os.environ["MASTER_PORT"] = "12345" init_process_group( backend="nccl", rank=rank, world_size=world_size ) torch.cuda.set_device(rank) def prepare_dataset(): ... train_loader = DataLoader( dataset=train_ds, batch_size=2, shuffle=False, pin_memory=True, drop_last=True, # this ensures each GPU receives different data subsample sampler=DistributedSampler(train_ds) ) return train_loader, test_loader def main(rank, world_size, num_epochs): ddp_setup(rank, world_size) train_loader, test_loader = prepare_dataset() model = NeuralNetwork(num_inputs=2, num_outputs=2) model.to(rank) optimizer = torch.optim.SGD(model.parameters(), lr=0.5) # Wrap the model in DDP to enable gradient synchronization model = DDP(model, device_ids=[rank]) for epoch in range(num_epochs): for features, labels in train_loader: features, labels = features.to(rank), labels.to(rank) ... print(f"[GPU{rank}] Epoch: {epoch+1:03d}/{num_epochs:03d}" f" | Batchsize {labels.shape[0]:03d}" f" | Train/Val Loss: {loss:.2f}") model.eval() train_acc = compute_accuracy(model, train_loader, device=rank) print(f"[GPU{rank}] Training accuracy", train_acc) test_acc = compute_accuracy(model, test_loader, device=rank) print(f"[GPU{rank}] Test accuracy", test_acc) # exit distributed training, free up resources destroy_process_group() if __name__ == "__main__": print("Number of GPUs available:", torch.cuda.device_count()) torch.manual_seed(123) num_epochs = 3 world_size = torch.cuda.device_count() mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)

If you only want to use some GPUs, set the `CUDA_VISIBLE_DEVICES`

environment variable.

CUDA_VISIBLE_DIVICES=0,2 python training_script.py