Cheatsheet - PyTorch
PyTorch is an open-source machine learning library primarily used for deep learning applications. It's known for its flexibility, dynamic computation graph, and Pythonic interface.
1. Tensors: The Building Blocks
Tensors are the fundamental data structure in PyTorch, similar to NumPy arrays but with GPU acceleration and automatic differentiation capabilities.
1.1 Creating Tensors
| Operation | Syntax | Example |
|---|---|---|
| From data (list, tuple, NumPy array) | torch.tensor(data, dtype=None, device=None) | x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32) |
| Uninitialized tensor | torch.empty(shape) | x = torch.empty(2, 3) |
| Tensor with random values (uniform) | torch.rand(shape) | x = torch.rand(2, 2) |
| Tensor with random values (normal) | torch.randn(shape) | x = torch.randn(2, 2) |
| Tensor of zeros | torch.zeros(shape, dtype=None) | x = torch.zeros(3, 3) |
| Tensor of ones | torch.ones(shape, dtype=None) | x = torch.ones(1, 5) |
| Tensor of a specific value | torch.full(shape, value, dtype=None) | x = torch.full((2, 2), 7) |
| Tensor from a range | torch.arange(start, end, step, dtype=None) | x = torch.arange(0, 10, 2) |
| Evenly spaced values in a range | torch.linspace(start, end, steps, dtype=None) | x = torch.linspace(0, 1, 5) |
| Tensor with same properties as another | torch.ones_like(input), torch.zeros_like(input), torch.rand_like(input) | x = torch.ones_like(existing_tensor) |
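A minimal sketch tying a few of these constructors together (shapes and values are arbitrary, chosen only for illustration):
import torch

a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)  # from a Python list
b = torch.zeros(3, 3)                                     # 3x3 tensor of zeros
c = torch.rand(2, 2)                                      # uniform values in [0, 1)
d = torch.arange(0, 10, 2)                                # tensor([0, 2, 4, 6, 8])
e = torch.ones_like(a)                                    # same shape/dtype as a, filled with ones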
1.2 Tensor Properties & Conversion
| Property/Conversion | Syntax | Example |
|---|---|---|
| Shape | tensor.shape or tensor.size() | x.shape → torch.Size([2, 2]) |
| Data type | tensor.dtype | x.dtype → torch.float32 |
| Device (CPU/GPU) | tensor.device | x.device → cpu (or cuda:0) |
| To NumPy array | tensor.numpy() | np_array = x.numpy() |
| From NumPy array | torch.from_numpy(np_array) | x = torch.from_numpy(np_array) |
| To CPU | tensor.cpu() | x_cpu = x_gpu.cpu() |
| To GPU | tensor.cuda(), tensor.to('cuda'), tensor.to(device) | x_gpu = x_cpu.cuda() or x_gpu = x_cpu.to('cuda') |
| Change data type | tensor.to(dtype) or tensor.type(dtype) | x = x.to(torch.int64) or x = x.type(torch.float64) |
| Item (for single-element tensors) | tensor.item() | single_element_tensor.item() |
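A quick round trip through the conversions above (variable names are arbitrary). Note that tensor.numpy() and torch.from_numpy() share memory with the underlying NumPy array (CPU tensors only), so in-place changes on one side are visible on the other:
import torch
import numpy as np

x = torch.rand(2, 2)
np_array = x.numpy()            # shares memory with x (CPU tensor)
y = torch.from_numpy(np_array)  # also shares memory with np_array
y = y.to(torch.float64)         # dtype conversion returns a new tensor
device = "cuda" if torch.cuda.is_available() else "cpu"
y = y.to(device)                # move to GPU only if one is available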
1.3 Tensor Operations
- Arithmetic:
+, -, *, /, %, **, torch.add(), torch.sub(), torch.mul(), torch.div(), torch.pow(), etc. For example: y = x + y or torch.add(x, y, out=result); y.add_(x) (in-place addition).
- Indexing/Slicing: Same as NumPy.
x[0, :], x[:, 1], x[1, 1].item()
- Reshaping:
x.view(new_shape): Returns a new tensor with the same data but a different shape. Requires contiguous memory.
x.reshape(new_shape): Similar to view, but can handle non-contiguous memory by making a copy if necessary.
x.T or x.transpose(dim0, dim1): Transpose.
x.permute(dim_order): Rearrange dimensions.
x.unsqueeze(dim): Add a dimension of size 1.
x.squeeze(dim): Remove a dimension (if its size is 1).
- Concatenation:
torch.cat((t1, t2), dim=0)
- Stacking:
torch.stack((t1, t2), dim=0)
- Aggregation:
torch.sum(x), x.sum(), x.sum(dim=0); torch.mean(x), x.mean(); torch.max(x), x.min(), x.argmax(), x.argmin()
- Matrix Multiplication:
torch.matmul(tensor1, tensor2) or tensor1 @ tensor2; torch.mm(tensor1, tensor2) (for 2D matrices); tensor1.mm(tensor2)
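A small sketch combining several of the operations above (values are arbitrary, chosen only for illustration):
import torch

x = torch.arange(6, dtype=torch.float32)   # tensor([0., 1., 2., 3., 4., 5.])
m = x.reshape(2, 3)                        # reshape to 2x3
t = m.T                                    # transpose -> 3x2
prod = m @ t                               # matrix multiplication -> 2x2
col_sum = m.sum(dim=0)                     # aggregate over rows -> tensor([3., 5., 7.])
stacked = torch.stack((m, m), dim=0)       # new leading dimension -> shape [2, 2, 3]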
2. Autograd: Automatic Differentiation
PyTorch's automatic differentiation engine.
- requires_grad=True: Tells PyTorch to track operations on a tensor for gradient computation. x = torch.tensor([1., 2.], requires_grad=True)
- tensor.grad: Stores gradients of a scalar loss with respect to the tensor.
- loss.backward(): Computes gradients. Gradients accumulate, so you usually call optimizer.zero_grad() first.
- with torch.no_grad():: Temporarily disables gradient tracking. Useful during evaluation or when updating model weights. with torch.no_grad(): pred = model(x)
- tensor.detach(): Creates a new tensor that shares the same data as the original but does not require gradients. It is "detached" from the computation graph.
x = torch.tensor([1., 2.], requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean() # scalar output
out.backward() # compute gradients
print(x.grad) # d(out)/dx
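A short sketch of no_grad(), detach(), and gradient clearing (the tensors here are made up for illustration):
x = torch.tensor([1., 2.], requires_grad=True)

with torch.no_grad():
    y = x * 2                  # y.requires_grad is False inside this block

z = (x * 3).detach()           # same data, but cut off from the computation graph
print(y.requires_grad, z.requires_grad)  # False False

# Because gradients accumulate, clear them between backward passes:
loss = (x ** 2).sum()
loss.backward()
print(x.grad)                  # tensor([2., 4.])
x.grad.zero_()                 # reset before the next backward pass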
3. Neural Network Modules (torch.nn)
The torch.nn module provides classes for building neural networks.
3.1 Basic Layers
- Linear (Fully Connected):
nn.Linear(in_features, out_features)
- Convolutional:
nn.Conv1d(in_channels, out_channels, kernel_size, ...), nn.Conv2d(in_channels, out_channels, kernel_size, ...), nn.Conv3d(in_channels, out_channels, kernel_size, ...)
- Pooling:
nn.MaxPool2d(kernel_size, stride=None, ...), nn.AvgPool2d(kernel_size, stride=None, ...)
- Activation Functions:
nn.ReLU(), nn.Sigmoid(), nn.Tanh(), nn.LeakyReLU(), nn.Softmax(dim=...)
- Normalization:
nn.BatchNorm1d(num_features), nn.BatchNorm2d(num_features)
- Dropout:
nn.Dropout(p=0.5)
- Recurrent:
nn.RNN(), nn.LSTM(), nn.GRU()
- Embedding:
nn.Embedding(num_embeddings, embedding_dim) (for word embeddings)
- Containers:
nn.Sequential(*layers): A linear stack of modules (see the sketch below).
nn.ModuleList([module1, module2, ...]): Holds a list of submodules.
nn.ParameterList([param1, param2, ...]): Holds a list of parameters.
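As a sketch of the container idea, the two-layer network defined in 3.2 below could also be written as a plain nn.Sequential stack (layer sizes here are arbitrary):
import torch.nn as nn

# Equivalent stack of layers built with nn.Sequential (sizes chosen for illustration)
model = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 2),
)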
3.2 Defining a Custom Neural Network
import torch.nn as nn
import torch.nn.functional as F
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out
model = SimpleNet(input_size=10, hidden_size=5, num_classes=2)
# print(model)
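A possible forward pass through the model above, using random input (batch size and values are arbitrary):
import torch

x = torch.randn(4, 10)   # batch of 4 samples, 10 features each
logits = model(x)        # calls model.forward(x)
print(logits.shape)      # torch.Size([4, 2])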
4. Loss Functions (torch.nn and torch.nn.functional)
Calculate how far an output is from a target.
| Loss Function | Class (nn) | Functional (F) | Use Case |
|---|---|---|---|
| Mean Squared Error | nn.MSELoss() | F.mse_loss(input, target) | Regression tasks |
| Cross Entropy | nn.CrossEntropyLoss() | F.cross_entropy(input, target) | Multi-class classification (input is raw scores/logits) |
| Binary Cross Entropy with Logits | nn.BCEWithLogitsLoss() | F.binary_cross_entropy_with_logits(input, target) | Binary classification (input is raw scores/logits) |
| Binary Cross Entropy | nn.BCELoss() | F.binary_cross_entropy(input, target) | Binary classification (input is probabilities 0-1) |
| L1 Loss (Mean Absolute Error) | nn.L1Loss() | F.l1_loss(input, target) | Robust regression |
| Negative Log Likelihood Loss | nn.NLLLoss() | F.nll_loss(input, target) | Multi-class classification (input is log-probabilities) |
| Kullback-Leibler Divergence | nn.KLDivLoss() | F.kl_div(input, target) | Measuring difference between two probability distributions |
| Margin Ranking Loss | nn.MarginRankingLoss() | F.margin_ranking_loss(input1, input2, target) | Ranking tasks |
| MultiMarginLoss (SVM-like) | nn.MultiMarginLoss() | F.multi_margin_loss(input, target) | Multi-class classification (SVM-style) |
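A brief sketch with nn.CrossEntropyLoss to illustrate the expected shapes: raw logits of shape [batch, num_classes] and integer class indices of shape [batch] (values here are random):
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 3)             # 4 samples, 3 classes (raw scores, no softmax)
targets = torch.tensor([0, 2, 1, 2])   # ground-truth class indices
loss = criterion(logits, targets)      # scalar tensor
print(loss.item())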
5. Optimizers (torch.optim)
Update model weights to minimize the loss.
| Optimizer | Class (optim) | Description |
|---|---|---|
| Stochastic GD | optim.SGD(model.parameters(), lr=0.01, momentum=0.9) | Basic gradient descent. Supports momentum. |
| Adam | optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08) | Adaptive moment estimation. Popular, generally good performance. |
| RMSprop | optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99) | Adaptive learning rate optimizer. |
| Adagrad | optim.Adagrad(model.parameters(), lr=0.01) | Adaptive learning rate for sparse data. |
| Adadelta | optim.Adadelta(model.parameters(), lr=1.0, rho=0.9) | Adaptive learning rate optimizer, less sensitive to the learning rate hyperparameter. |
Optimization Step
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Inside training loop:
optimizer.zero_grad() # Clear previous gradients
loss.backward() # Compute gradients of loss w.r.t. model parameters
optimizer.step() # Update model parameters
6. Data Loading (torch.utils.data)
Efficiently load data in batches.
6.1 Dataset
Abstract class representing a dataset. Your custom dataset should inherit from it and implement __len__ and __getitem__.
from torch.utils.data import Dataset
class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data      # a tensor or list of tensors
        self.labels = labels  # a tensor or list of tensors

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.labels[idx]
        return sample, label
# Example usage:
# dataset = CustomDataset(some_tensor_data, some_tensor_labels)
6.2 DataLoader
Wraps a Dataset to provide iterators for easy batching, shuffling, and multiprocessing.
from torch.utils.data import DataLoader
# Create dummy data and labels
dummy_data = torch.randn(100, 10) # 100 samples, 10 features
dummy_labels = torch.randint(0, 2, (100,)) # 100 binary labels
dataset = CustomDataset(dummy_data, dummy_labels)
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,  # number of worker processes; 0 loads data in the main process
)
# Iterate through data
for batch_idx, (inputs, targets) in enumerate(dataloader):
    # inputs.shape will be [32, 10] (or fewer samples in the last batch)
    # targets.shape will be [32]
    pass
7. GPU Usage (CUDA)
Move models and tensors to GPU for accelerated computation.
# 1. Check for CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# 2. Move tensor to device
x = torch.randn(3, 3).to(device)
# 3. Move model to device
model = SimpleNet(10, 5, 2).to(device)
# Ensure all inputs to the model are also on the same device
# inputs = inputs.to(device)
# targets = targets.to(device)
8. Saving and Loading Models
8.1 Saving
Recommended: Save state_dict (parameters only).
# Save model parameters
torch.save(model.state_dict(), 'model_weights.pth')
# Save entire model (not recommended for cross-version compatibility)
# torch.save(model, 'entire_model.pth')
8.2 Loading
# 1. Instantiate the model architecture
model = SimpleNet(input_size=10, hidden_size=5, num_classes=2)
# 2. Load the state_dict
model.load_state_dict(torch.load('model_weights.pth'))
# 3. Set model to evaluation mode (important for BatchNorm, Dropout)
model.eval()
# For inference:
# with torch.no_grad():
# output = model(input_tensor)
# To load entire model (if saved that way):
# model = torch.load('entire_model.pth')
# model.eval()
9. Training Loop Structure
# 1. Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 2. Hyperparameters
input_size = 784 # For MNIST
hidden_size = 500
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001
# 3. Dataset and DataLoader (example for MNIST)
# train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
# train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
# 4. Model instantiation
model = SimpleNet(input_size, hidden_size, num_classes).to(device)
# 5. Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# 6. Training loop
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Move tensors to the configured device
        images = images.reshape(-1, input_size).to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')
# 7. Evaluation (on test set)
model.eval()  # Set model to evaluation mode
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, input_size).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct / total} %')