Cheatsheet - PyTorch
PyTorch is an open-source machine learning library primarily used for deep learning applications. It's known for its flexibility, dynamic computation graph, and Pythonic interface.
1. Tensors: The Building Blocks
Tensors are the fundamental data structure in PyTorch, similar to NumPy arrays but with GPU acceleration and automatic differentiation capabilities.
1.1 Creating Tensors
Operation | Syntax | Example |
---|---|---|
From data (list, tuple, NumPy array) | `torch.tensor(data, dtype=None, device=None)` | `x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)` |
Uninitialized tensor | `torch.empty(shape)` | `x = torch.empty(2, 3)` |
Random values (uniform on [0, 1)) | `torch.rand(shape)` | `x = torch.rand(2, 2)` |
Random values (standard normal) | `torch.randn(shape)` | `x = torch.randn(2, 2)` |
Tensor of zeros | `torch.zeros(shape, dtype=None)` | `x = torch.zeros(3, 3)` |
Tensor of ones | `torch.ones(shape, dtype=None)` | `x = torch.ones(1, 5)` |
Tensor filled with a specific value | `torch.full(size, fill_value, dtype=None)` | `x = torch.full((2, 2), 7)` |
Tensor from a range | `torch.arange(start, end, step, dtype=None)` | `x = torch.arange(0, 10, 2)` |
Tensor of evenly spaced values in a range | `torch.linspace(start, end, steps, dtype=None)` | `x = torch.linspace(0, 1, 5)` |
Tensor with the same properties as another | `torch.ones_like(input)`, `torch.zeros_like(input)`, `torch.rand_like(input)` | `x = torch.ones_like(existing_tensor)` |
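A few of the constructors above combined into one runnable sketch (the shapes and values are arbitrary):

```python
import torch

x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)  # from a nested list
z = torch.zeros(3, 3)                                     # 3x3 tensor of zeros
r = torch.arange(0, 10, 2)                                # tensor([0, 2, 4, 6, 8])
s = torch.linspace(0, 1, 5)                               # 5 evenly spaced values from 0 to 1
o = torch.ones_like(x)                                    # same shape and dtype as x, filled with ones
```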
1.2 Tensor Properties & Conversion
Property/Conversion | Syntax | Example |
---|---|---|
Shape | `tensor.shape` or `tensor.size()` | `x.shape` Output: `torch.Size([2, 2])` |
Data type | `tensor.dtype` | `x.dtype` Output: `torch.float32` |
Device (CPU/GPU) | `tensor.device` | `x.device` Output: `cpu` (or `cuda:0`) |
To NumPy array | `tensor.numpy()` | `np_array = x.numpy()` |
From NumPy array | `torch.from_numpy(np_array)` | `x = torch.from_numpy(np_array)` |
To CPU | `tensor.cpu()` | `x_cpu = x_gpu.cpu()` |
To GPU | `tensor.cuda()`, `tensor.to('cuda')`, `tensor.to(device)` | `x_gpu = x_cpu.cuda()` or `x_gpu = x_cpu.to('cuda')` |
Change data type | `tensor.to(dtype)` or `tensor.type(dtype)` | `x = x.to(torch.int64)` or `x = x.type(torch.float64)` |
Item (for single-element tensors) | `tensor.item()` | `single_element_tensor.item()` |
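The property checks and conversions above, strung together in a short sketch (the CUDA move is guarded since a GPU may not be available):

```python
import torch

x = torch.rand(2, 2)
print(x.shape, x.dtype, x.device)   # torch.Size([2, 2]) torch.float32 cpu

np_array = x.numpy()                # CPU tensors only; shares memory with x
y = torch.from_numpy(np_array)      # back to a tensor, still sharing memory

if torch.cuda.is_available():       # move to GPU only if one is present
    x = x.to('cuda')
x = x.to(torch.float64)             # change dtype
```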
1.3 Tensor Operations
- Arithmetic: `+`, `-`, `*`, `/`, `%`, `**`, as well as `torch.add()`, `torch.sub()`, `torch.mul()`, `torch.div()`, `torch.pow()`, etc. Example: `y = x + y` or `torch.add(x, y, out=result)`; `y.add_(x)` adds in place (the trailing underscore marks in-place operations).
- Indexing/Slicing: Same as NumPy, e.g. `x[0, :]`, `x[:, 1]`, `x[1, 1].item()`.
- Reshaping:
  - `x.view(new_shape)`: Returns a new tensor with the same data but a different shape. Requires contiguous memory.
  - `x.reshape(new_shape)`: Similar to `view`, but handles non-contiguous memory by making a copy if necessary.
  - `x.T` or `x.transpose(dim0, dim1)`: Transpose.
  - `x.permute(dim_order)`: Rearrange dimensions.
  - `x.unsqueeze(dim)`: Add a dimension of size 1.
  - `x.squeeze(dim)`: Remove a dimension (if its size is 1).
- Concatenation: `torch.cat((t1, t2), dim=0)` joins tensors along an existing dimension.
- Stacking: `torch.stack((t1, t2), dim=0)` joins tensors along a new dimension.
- Aggregation: `torch.sum(x)`, `x.sum()`, `x.sum(dim=0)`; `torch.mean(x)`, `x.mean()`; `torch.max(x)`, `x.min()`, `x.argmax()`, `x.argmin()`
- Matrix Multiplication: `torch.matmul(tensor1, tensor2)` or `tensor1 @ tensor2`; `torch.mm(tensor1, tensor2)` / `tensor1.mm(tensor2)` (2-D matrices only)
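A short sketch tying several of the operations above together (shapes chosen arbitrarily):

```python
import torch

x = torch.arange(6)                     # tensor([0, 1, 2, 3, 4, 5])
m = x.view(2, 3)                        # reshape to 2x3 (needs contiguous memory)
mt = m.T                                # transpose to 3x2

a = torch.randn(2, 3)
b = torch.randn(3, 4)
c = a @ b                               # matrix multiplication, shape [2, 4]

t1 = torch.zeros(2, 2)
t2 = torch.ones(2, 2)
cat = torch.cat((t1, t2), dim=0)        # shape [4, 2]: joined along an existing dim
stacked = torch.stack((t1, t2), dim=0)  # shape [2, 2, 2]: a new leading dim
print(m.sum(dim=0), a.argmax())         # column sums and index of the largest value in a
```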
2. Autograd: Automatic Differentiation
`torch.autograd` is PyTorch's automatic differentiation engine.
- `requires_grad=True`: Tells PyTorch to track operations on a tensor for gradient computation. Example: `x = torch.tensor([1., 2.], requires_grad=True)`
- `tensor.grad`: Stores the gradients of a scalar loss with respect to the tensor after `backward()` has been called.
- `loss.backward()`: Computes gradients. Gradients accumulate across calls, so you usually call `optimizer.zero_grad()` before each backward pass.
- `with torch.no_grad():`: Temporarily disables gradient tracking. Useful during evaluation or when updating model weights manually. Example: `with torch.no_grad(): pred = model(x)`
- `tensor.detach()`: Creates a new tensor that shares the same data as `tensor` but does not require gradients; it is "detached" from the computation graph.
x = torch.tensor([1., 2.], requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean() # scalar output
out.backward() # compute gradients
print(x.grad) # d(out)/dx
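A minimal sketch of the `torch.no_grad()` and `detach()` behaviors described above (tensor values are arbitrary):

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)

# Inside no_grad(), operations are not tracked, so y is not part of the graph
with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # False

# detach() shares x's data but is cut off from the computation graph
z = x.detach()
print(z.requires_grad)  # False
```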
3. Neural Network Modules (torch.nn)
The `torch.nn` module provides classes for building neural networks.
3.1 Basic Layers
- Linear (Fully Connected): `nn.Linear(in_features, out_features)`
- Convolutional: `nn.Conv1d(in_channels, out_channels, kernel_size, ...)`, `nn.Conv2d(in_channels, out_channels, kernel_size, ...)`, `nn.Conv3d(in_channels, out_channels, kernel_size, ...)`
- Pooling: `nn.MaxPool2d(kernel_size, stride=None, ...)`, `nn.AvgPool2d(kernel_size, stride=None, ...)`
- Activation Functions: `nn.ReLU()`, `nn.Sigmoid()`, `nn.Tanh()`, `nn.LeakyReLU()`, `nn.Softmax(dim=...)`
- Normalization: `nn.BatchNorm1d(num_features)`, `nn.BatchNorm2d(num_features)`
- Dropout: `nn.Dropout(p=0.5)`
- Recurrent: `nn.RNN()`, `nn.LSTM()`, `nn.GRU()`
- Embedding: `nn.Embedding(num_embeddings, embedding_dim)` (for word embeddings)
- Containers:
  - `nn.Sequential(*layers)`: A linear stack of modules.
  - `nn.ModuleList([module1, module2, ...])`: Holds a list of submodules.
  - `nn.ParameterList([param1, param2, ...])`: Holds a list of parameters.
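As a rough sketch of how these layers compose inside an `nn.Sequential` container (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A small feed-forward block built from the layers listed above
block = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(32, 4),
)

x = torch.randn(8, 16)   # batch of 8 samples, 16 features each
out = block(x)           # shape: [8, 4]
```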
3.2 Defining a Custom Neural Network
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

model = SimpleNet(input_size=10, hidden_size=5, num_classes=2)
# print(model)
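A quick forward-pass check on the model defined above (the batch size of 4 is arbitrary):

```python
import torch

x = torch.randn(4, 10)   # batch of 4 samples, 10 features each
logits = model(x)        # shape: [4, 2], raw scores for the 2 classes
```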
4. Loss Functions (torch.nn and torch.nn.functional)
Loss functions measure how far the model's output is from the target.
Loss Function | Class (`nn`) | Functional (`F`) | Use Case |
---|---|---|---|
Mean Squared Error | `nn.MSELoss()` | `F.mse_loss(input, target)` | Regression tasks |
Cross Entropy | `nn.CrossEntropyLoss()` | `F.cross_entropy(input, target)` | Multi-class classification (input is raw scores/logits) |
Binary Cross Entropy with Logits | `nn.BCEWithLogitsLoss()` | `F.binary_cross_entropy_with_logits(input, target)` | Binary classification (input is raw scores/logits) |
Binary Cross Entropy | `nn.BCELoss()` | `F.binary_cross_entropy(input, target)` | Binary classification (input is probabilities in [0, 1]) |
L1 Loss (Mean Absolute Error) | `nn.L1Loss()` | `F.l1_loss(input, target)` | Robust regression |
Negative Log Likelihood | `nn.NLLLoss()` | `F.nll_loss(input, target)` | Multi-class classification (input is log-probabilities) |
Kullback-Leibler Divergence | `nn.KLDivLoss()` | `F.kl_div(input, target)` | Measuring the difference between two probability distributions |
Margin Ranking Loss | `nn.MarginRankingLoss()` | `F.margin_ranking_loss(input1, input2, target)` | Ranking tasks |
Multi-Margin Loss (SVM-like) | `nn.MultiMarginLoss()` | `F.multi_margin_loss(input, target)` | Multi-class classification (SVM-style) |
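A sketch of the most common case above, `nn.CrossEntropyLoss` applied to raw logits (batch size and class count chosen arbitrarily):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.randn(4, 3)            # 4 samples, 3 classes (raw scores, no softmax needed)
targets = torch.tensor([0, 2, 1, 2])  # ground-truth class index for each sample
loss = criterion(logits, targets)     # scalar tensor
print(loss.item())
```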
5. Optimizers (torch.optim)
Optimizers update model weights to minimize the loss.

Optimizer | Class (`optim`) | Description |
---|---|---|
Stochastic GD | `optim.SGD(model.parameters(), lr=0.01, momentum=0.9)` | Basic gradient descent. Supports momentum. |
Adam | `optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08)` | Adaptive moment estimation. Popular, generally a strong default. |
RMSprop | `optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)` | Adaptive learning rate optimizer. |
Adagrad | `optim.Adagrad(model.parameters(), lr=0.01)` | Adaptive learning rates; well suited to sparse data. |
Adadelta | `optim.Adadelta(model.parameters(), lr=1.0, rho=0.9)` | Adaptive learning rate optimizer, less sensitive to the learning rate hyperparameter. |
Optimization Step
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Inside training loop:
optimizer.zero_grad() # Clear previous gradients
loss.backward() # Compute gradients of loss w.r.t. model parameters
optimizer.step() # Update model parameters
6. Data Loading (torch.utils.data)
Efficiently load data in batches.
6.1 Dataset
Abstract class representing a dataset. Your custom dataset should inherit from it and implement `__len__` and `__getitem__`.
from torch.utils.data import Dataset
class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data        # a tensor or list of tensors
        self.labels = labels    # a tensor or list of tensors

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.labels[idx]
        return sample, label

# Example usage:
# dataset = CustomDataset(some_tensor_data, some_tensor_labels)
6.2 DataLoader
Wraps a `Dataset` to provide iterators for easy batching, shuffling, and multiprocess loading.
from torch.utils.data import DataLoader
# Create dummy data and labels
dummy_data = torch.randn(100, 10) # 100 samples, 10 features
dummy_labels = torch.randint(0, 2, (100,)) # 100 binary labels
dataset = CustomDataset(dummy_data, dummy_labels)
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4  # for multiprocess loading; use 0 to load in the main process
)

# Iterate through data
for batch_idx, (inputs, targets) in enumerate(dataloader):
    # inputs.shape will be [32, 10] (or fewer samples in the last batch)
    # targets.shape will be [32]
    pass
7. GPU Usage (CUDA)
Move models and tensors to GPU for accelerated computation.
# 1. Check for CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# 2. Move tensor to device
x = torch.randn(3, 3).to(device)
# 3. Move model to device
model = SimpleNet(10, 5, 2).to(device)
# Ensure all inputs to the model are also on the same device
# inputs = inputs.to(device)
# targets = targets.to(device)
8. Saving and Loading Models
8.1 Saving
Recommended: save the `state_dict` (parameters only).
# Save model parameters
torch.save(model.state_dict(), 'model_weights.pth')
# Save entire model (not recommended for cross-version compatibility)
# torch.save(model, 'entire_model.pth')
8.2 Loading
# 1. Instantiate the model architecture
model = SimpleNet(input_size=10, hidden_size=5, num_classes=2)
# 2. Load the state_dict
model.load_state_dict(torch.load('model_weights.pth'))
# 3. Set model to evaluation mode (important for BatchNorm, Dropout)
model.eval()
# For inference:
# with torch.no_grad():
# output = model(input_tensor)
# To load entire model (if saved that way):
# model = torch.load('entire_model.pth')
# model.eval()
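Beyond the weights-only pattern above, a common sketch is to checkpoint the optimizer state and epoch counter as well; this assumes `model`, `optimizer`, and `epoch` exist as in the training loop of Section 9, and the file name `checkpoint.pth` is an arbitrary choice:

```python
# Save a training checkpoint (model weights + optimizer state + epoch counter)
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pth')

# Restore it later to resume training
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1
```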
9. Training Loop Structure
# 1. Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 2. Hyperparameters
input_size = 784 # For MNIST
hidden_size = 500
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001
# 3. Dataset and DataLoader (example for MNIST)
# train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
# train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
# 4. Model instantiation
model = SimpleNet(input_size, hidden_size, num_classes).to(device)
# 5. Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# 6. Training loop
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Move tensors to the configured device
        images = images.reshape(-1, input_size).to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

# 7. Evaluation (on test set)
model.eval()  # Set model to evaluation mode
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, input_size).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct / total} %')