
Cheatsheet - PyTorch

PyTorch is an open-source machine learning library primarily used for deep learning applications. It's known for its flexibility, dynamic computation graph, and Pythonic interface.

1. Tensors: The Building Blocks

Tensors are the fundamental data structure in PyTorch, similar to NumPy arrays but with GPU acceleration and automatic differentiation capabilities.

1.1 Creating Tensors

Operation | Syntax | Example
From data (list, tuple, NumPy array) | torch.tensor(data, dtype=None, device=None) | x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
Uninitialized tensor | torch.empty(shape) | x = torch.empty(2, 3)
Random values, uniform on [0, 1) | torch.rand(shape) | x = torch.rand(2, 2)
Random values, standard normal | torch.randn(shape) | x = torch.randn(2, 2)
Tensor of zeros | torch.zeros(shape, dtype=None) | x = torch.zeros(3, 3)
Tensor of ones | torch.ones(shape, dtype=None) | x = torch.ones(1, 5)
Tensor filled with a specific value | torch.full(size, fill_value, dtype=None) | x = torch.full((2, 2), 7)
Tensor from a range (given step) | torch.arange(start, end, step, dtype=None) | x = torch.arange(0, 10, 2)
Tensor of evenly spaced values (given count) | torch.linspace(start, end, steps, dtype=None) | x = torch.linspace(0, 1, 5)
Tensor with same shape/dtype/device as another | torch.ones_like(input), torch.zeros_like(input), torch.rand_like(input) | x = torch.ones_like(existing_tensor)
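
A minimal sketch combining a few of the constructors above (variable names are illustrative):

import torch

# Explicit dtype at creation time
a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
b = torch.zeros(2, 2, dtype=torch.int64)

# Evenly spaced values
c = torch.arange(0, 10, 2)    # tensor([0, 2, 4, 6, 8])
d = torch.linspace(0, 1, 5)   # tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])

# *_like constructors copy shape, dtype, and device from an existing tensor
e = torch.ones_like(a)        # 2x2 float32 tensor of ones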

1.2 Tensor Properties & Conversion

Property/Conversion | Syntax | Example
Shape | tensor.shape or tensor.size() | x.shape  # torch.Size([2, 2])
Data type | tensor.dtype | x.dtype  # torch.float32
Device (CPU/GPU) | tensor.device | x.device  # cpu (or cuda:0)
To NumPy array | tensor.numpy() | np_array = x.numpy()
From NumPy array | torch.from_numpy(np_array) | x = torch.from_numpy(np_array)
To CPU | tensor.cpu() | x_cpu = x_gpu.cpu()
To GPU | tensor.cuda(), tensor.to('cuda'), tensor.to(device) | x_gpu = x_cpu.cuda() or x_gpu = x_cpu.to('cuda')
Change data type | tensor.to(dtype) or tensor.type(dtype) | x = x.to(torch.int64) or x = x.type(torch.float64)
Python scalar (single-element tensors) | tensor.item() | value = single_element_tensor.item()
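
A short sketch of the conversions above; note that torch.from_numpy and tensor.numpy() share memory with the underlying array for CPU tensors, so in-place changes are visible on both sides:

import numpy as np
import torch

x = torch.rand(2, 2)
print(x.shape, x.dtype, x.device)    # torch.Size([2, 2]) torch.float32 cpu

np_array = x.numpy()                 # shares memory with x (CPU tensors only)
y = torch.from_numpy(np_array)       # also shares memory with np_array

x = x.to(torch.float64)              # change dtype (returns a new tensor)
if torch.cuda.is_available():
    x = x.to('cuda')                 # move to GPU only if one is present

value = torch.tensor([3.14]).item()  # Python float from a single-element tensor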

1.3 Tensor Operations

  • Arithmetic: +, -, *, /, %, **, torch.add(), torch.sub(), torch.mul(), torch.div(), torch.pow(), etc.
    • y = x + y or torch.add(x, y, out=result)
    • y.add_(x) (in-place addition)
  • Indexing/Slicing: Same as NumPy.
    • x[0, :], x[:, 1], x[1, 1].item()
  • Reshaping:
    • x.view(new_shape): Returns a new tensor with the same data but different shape. Requires contiguous memory.
    • x.reshape(new_shape): Similar to view, but can handle non-contiguous memory by making a copy if necessary.
    • x.T or x.transpose(dim0, dim1): Transpose.
    • x.permute(dim_order): Rearrange dimensions.
    • x.unsqueeze(dim): Add a dimension.
    • x.squeeze(dim): Remove a dimension (if size is 1).
  • Concatenation:
    • torch.cat((t1, t2), dim=0)
  • Stacking:
    • torch.stack((t1, t2), dim=0)
  • Aggregation:
    • torch.sum(x), x.sum(), x.sum(dim=0)
    • torch.mean(x), x.mean()
    • torch.max(x), x.min(), x.argmax(), x.argmin()
  • Matrix Multiplication:
    • torch.matmul(tensor1, tensor2) or tensor1 @ tensor2
    • torch.mm(tensor1, tensor2) (for 2D matrices)
    • tensor1.mm(tensor2)
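
An illustrative sketch exercising the reshaping, concatenation, aggregation, and matrix-multiplication operations above:

import torch

x = torch.arange(6, dtype=torch.float32)   # tensor([0., 1., 2., 3., 4., 5.])

a = x.view(2, 3)       # reshape to 2x3 (shares data; requires contiguous memory)
b = x.reshape(3, 2)    # reshape, copying if the layout is non-contiguous
c = a.unsqueeze(0)     # shape [1, 2, 3]
d = c.squeeze(0)       # back to shape [2, 3]

cat = torch.cat((a, a), dim=0)     # shape [4, 3]
stk = torch.stack((a, a), dim=0)   # shape [2, 2, 3] (new leading dimension)

print(a.sum(), a.mean(), a.sum(dim=0))  # all elements / all elements / per column
print(a @ a.T)                          # 2x2 result, same as torch.matmul(a, a.T)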

2. Autograd: Automatic Differentiation

torch.autograd is PyTorch's automatic differentiation engine: it records operations on tensors that require gradients and computes those gradients during the backward pass.

  • requires_grad=True: Tells PyTorch to track operations on a tensor for gradient computation.
    • x = torch.tensor([1., 2.], requires_grad=True)
  • tensor.grad: Stores gradients of a scalar loss with respect to the tensor.
  • loss.backward(): Computes gradients. Gradients accumulate, so you often need optimizer.zero_grad().
  • with torch.no_grad():: Temporarily disable gradient tracking. Useful during evaluation or when updating model weights.
    • with torch.no_grad(): pred = model(x)
  • tensor.detach(): Creates a new tensor that shares the same data as tensor but does not require gradients. It's "detached" from the computation graph.

import torch

x = torch.tensor([1., 2.], requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()  # scalar output: mean of 3 * (x + 2)**2

out.backward()  # compute gradients and accumulate them into x.grad
print(x.grad)   # d(out)/dx = 3 * (x + 2) = tensor([9., 12.])

3. Neural Network Modules (torch.nn)

The torch.nn module provides classes for building neural networks.

3.1 Basic Layers

  • Linear (Fully Connected): nn.Linear(in_features, out_features)
  • Convolutional:
    • nn.Conv1d(in_channels, out_channels, kernel_size, ...)
    • nn.Conv2d(in_channels, out_channels, kernel_size, ...)
    • nn.Conv3d(in_channels, out_channels, kernel_size, ...)
  • Pooling:
    • nn.MaxPool2d(kernel_size, stride=None, ...)
    • nn.AvgPool2d(kernel_size, stride=None, ...)
  • Activation Functions:
    • nn.ReLU(), nn.Sigmoid(), nn.Tanh(), nn.LeakyReLU(), nn.Softmax(dim=...)
  • Normalization:
    • nn.BatchNorm1d(num_features), nn.BatchNorm2d(num_features)
  • Dropout:
    • nn.Dropout(p=0.5)
  • Recurrent:
    • nn.RNN(), nn.LSTM(), nn.GRU()
  • Embedding:
    • nn.Embedding(num_embeddings, embedding_dim) (for word embeddings)
  • Containers:
    • nn.Sequential(*layers): A linear stack of modules.
    • nn.ModuleList([module1, module2, ...]): Holds a list of submodules.
    • nn.ParameterList([param1, param2, ...]): Holds a list of parameters.
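
A small illustrative stack built from the layers above using nn.Sequential; the layer sizes are arbitrary, and nn.Flatten (not listed above) bridges the convolutional and linear parts:

import torch
import torch.nn as nn

# Tiny CNN for 1-channel 28x28 inputs (e.g. MNIST-sized images)
net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # -> [N, 16, 28, 28]
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                 # -> [N, 16, 14, 14]
    nn.Flatten(),                                # -> [N, 16 * 14 * 14]
    nn.Dropout(p=0.5),
    nn.Linear(16 * 14 * 14, 10),                 # -> [N, 10]
)

out = net(torch.randn(8, 1, 28, 28))  # batch of 8 random "images"
print(out.shape)                      # torch.Size([8, 10])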

3.2 Defining a Custom Neural Network

import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

model = SimpleNet(input_size=10, hidden_size=5, num_classes=2)
# print(model)
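
A quick forward pass through the model with a random batch (the batch size of 4 is arbitrary):

import torch

x = torch.randn(4, 10)   # batch of 4 samples with input_size=10 features
out = model(x)           # calls SimpleNet.forward under the hood
print(out.shape)         # torch.Size([4, 2]), one raw score per class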

4. Loss Functions (torch.nn and torch.nn.functional)

Calculate how far an output is from a target.

Loss Function | Class (nn) | Functional (F) | Use Case
Mean Squared Error | nn.MSELoss() | F.mse_loss(input, target) | Regression tasks
Cross Entropy | nn.CrossEntropyLoss() | F.cross_entropy(input, target) | Multi-class classification (input is raw scores/logits)
Binary Cross Entropy with Logits | nn.BCEWithLogitsLoss() | F.binary_cross_entropy_with_logits(input, target) | Binary classification (input is raw scores/logits)
Binary Cross Entropy | nn.BCELoss() | F.binary_cross_entropy(input, target) | Binary classification (input is probabilities in [0, 1])
L1 Loss (Mean Absolute Error) | nn.L1Loss() | F.l1_loss(input, target) | Robust regression
Negative Log Likelihood | nn.NLLLoss() | F.nll_loss(input, target) | Multi-class classification (input is log-probabilities)
Kullback-Leibler Divergence | nn.KLDivLoss() | F.kl_div(input, target) | Measuring the difference between two probability distributions
Margin Ranking Loss | nn.MarginRankingLoss() | F.margin_ranking_loss(input1, input2, target) | Ranking tasks
Multi Margin Loss (SVM-like) | nn.MultiMarginLoss() | F.multi_margin_loss(input, target) | Multi-class classification (SVM-style)
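
A minimal sketch of the two most common cases. nn.CrossEntropyLoss expects raw logits and integer class indices; nn.BCEWithLogitsLoss expects raw logits and float targets of the same shape:

import torch
import torch.nn as nn

# Multi-class: logits of shape [batch, num_classes], targets are class indices
logits = torch.randn(4, 3)                # 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])      # one class index per sample
loss = nn.CrossEntropyLoss()(logits, targets)

# Binary: logits and float targets of the same shape
bin_logits = torch.randn(4)
bin_targets = torch.tensor([0., 1., 1., 0.])
bin_loss = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)

print(loss.item(), bin_loss.item())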

5. Optimizers (torch.optim)

Update model weights to minimize the loss.

Optimizer | Class (optim) | Description
Stochastic Gradient Descent | optim.SGD(model.parameters(), lr=0.01, momentum=0.9) | Basic gradient descent; supports momentum.
Adam | optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08) | Adaptive moment estimation; a popular default with generally good performance.
RMSprop | optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99) | Adaptive learning rate optimizer.
Adagrad | optim.Adagrad(model.parameters(), lr=0.01) | Adaptive learning rates; well suited to sparse data.
Adadelta | optim.Adadelta(model.parameters(), lr=1.0, rho=0.9) | Adaptive learning rate optimizer; less sensitive to the learning rate hyperparameter.

Optimization Step

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Inside training loop:
optimizer.zero_grad() # Clear previous gradients
loss.backward()       # Compute gradients of loss w.r.t. model parameters
optimizer.step()      # Update model parameters

6. Data Loading (torch.utils.data)

Efficiently load data in batches.

6.1 Dataset

Abstract class representing a dataset. Your custom dataset should inherit from it and implement __len__ and __getitem__.

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data # a tensor or list of tensors
        self.labels = labels # a tensor or list of tensors

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.labels[idx]
        return sample, label

# Example usage:
# dataset = CustomDataset(some_tensor_data, some_tensor_labels)

6.2 DataLoader

Wraps a Dataset to provide iterators for easy batching, shuffling, and multiprocessing.

from torch.utils.data import DataLoader

# Create dummy data and labels
dummy_data = torch.randn(100, 10) # 100 samples, 10 features
dummy_labels = torch.randint(0, 2, (100,)) # 100 binary labels

dataset = CustomDataset(dummy_data, dummy_labels)

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4 # Number of worker subprocesses; use 0 to load data in the main process
)

# Iterate through data
for batch_idx, (inputs, targets) in enumerate(dataloader):
    # inputs.shape will be [32, 10] (or less for last batch)
    # targets.shape will be [32]
    pass

7. GPU Usage (CUDA)

Move models and tensors to GPU for accelerated computation.

# 1. Check for CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 2. Move tensor to device
x = torch.randn(3, 3).to(device)

# 3. Move model to device
model = SimpleNet(10, 5, 2).to(device)

# Ensure all inputs to the model are also on the same device
# inputs = inputs.to(device)
# targets = targets.to(device)

8. Saving and Loading Models

8.1 Saving

Recommended: Save state_dict (parameters only).

# Save model parameters
torch.save(model.state_dict(), 'model_weights.pth')

# Save the entire model (not recommended: pickled models are fragile across code and PyTorch version changes)
# torch.save(model, 'entire_model.pth')

8.2 Loading

# 1. Instantiate the model architecture
model = SimpleNet(input_size=10, hidden_size=5, num_classes=2)

# 2. Load the state_dict
model.load_state_dict(torch.load('model_weights.pth'))

# 3. Set model to evaluation mode (important for BatchNorm, Dropout)
model.eval()

# For inference:
# with torch.no_grad():
#     output = model(input_tensor)

# To load entire model (if saved that way):
# model = torch.load('entire_model.pth')
# model.eval()

9. Training Loop Structure

# 0. Imports (SimpleNet and DataLoader are defined in the sections above)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# 1. Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 2. Hyperparameters
input_size = 784 # For MNIST
hidden_size = 500
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001

# 3. Dataset and DataLoader (example for MNIST; requires torchvision and
#    `import torchvision.transforms as transforms`). The loop below assumes
#    train_loader and test_loader are defined.
# train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
# test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transforms.ToTensor())
# train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
# test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# 4. Model instantiation
model = SimpleNet(input_size, hidden_size, num_classes).to(device)

# 5. Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# 6. Training loop
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Move tensors to the configured device
        images = images.reshape(-1, input_size).to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

# 7. Evaluation (on test set)
model.eval() # Set model to evaluation mode
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, input_size).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print(f'Accuracy of the network on the 10000 test images: {100 * correct / total} %')