The Cornerstone of Data Loading: Dataset and DataLoader in PyTorch
In machine learning and deep learning, data loading is a crucial step of the training process. If the data loading pipeline is disorganized or inefficient, it directly impacts the efficiency and stability of model training. PyTorch provides two core components, Dataset and DataLoader, to help you manage and load data efficiently. This article walks beginners through the basic data loading workflow with accessible language and practical code.
I. Why Are Dataset and DataLoader Needed?
During model training, data must be fed to the model in batches. Dataset is responsible for “storing and reading individual data samples,” while DataLoader handles “packing data from Dataset into batches, supporting operations like shuffling and multi-process loading.” Without them, you would need to manually handle data indexing, shuffling, and batch reading, all of which is error-prone and inefficient.
II. Dataset: The “Warehouse” for Data
Dataset is an abstract base class for data processing in PyTorch. To use it, you need to inherit this class and implement two core methods:
- __getitem__(self, index): Returns a single data sample (usually containing features and labels) with the given index.
- __len__(self): Returns the total number of samples in the dataset.
1. Custom SimpleDataset
Suppose we have a set of 2D data where each sample has 20 features and 1 label (an integer from 0 to 4). We can define a custom Dataset class:
import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, features, labels):
        # features: Tensor of all samples, shape (num_samples, num_features)
        # labels: Tensor of all labels, shape (num_samples,)
        self.features = features
        self.labels = labels

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return len(self.features)

# Generate random data
features = torch.randn(1000, 20)        # 1000 samples, 20 features each
labels = torch.randint(0, 5, (1000,))   # 1000 labels, range 0-4

# Create Dataset instance
dataset = SimpleDataset(features, labels)

# Test: get the 0th sample
sample_features, sample_label = dataset[0]
print(f"0th sample features shape: {sample_features.shape}, label: {sample_label}")
2. Using TensorDataset (Simpler)
If your data is already in tensor form, TensorDataset can directly wrap features and labels without defining a custom class:
from torch.utils.data import TensorDataset
# Generate random data (same as above)
features = torch.randn(1000, 20)
labels = torch.randint(0, 5, (1000,))
# Wrap with TensorDataset
dataset = TensorDataset(features, labels) # Automatically handles __getitem__ and __len__
# Test: get the 1st sample
sample_features, sample_label = dataset[1]
print(f"1st sample features shape: {sample_features.shape}, label: {sample_label}")
III. DataLoader: The “Courier” for Data
DataLoader is a “wrapper” around Dataset. It reads data from a Dataset in batches and exposes options such as:
- batch_size: Number of samples per batch (defaults to 1).
- shuffle: Whether to shuffle the data each epoch (set to True for training sets, False for validation/test sets).
- num_workers: Number of worker subprocesses for parallel data loading (defaults to 0, which loads in the main process; keep it at 0 on Windows to avoid multiprocessing errors).
- pin_memory: Whether to copy batches into pinned (page-locked) memory, which speeds up host-to-GPU transfers (optional).
1. Basic DataLoader Usage
from torch.utils.data import DataLoader

# Create DataLoader
dataloader = DataLoader(
    dataset,           # Input Dataset
    batch_size=32,     # 32 samples per batch
    shuffle=True,      # Shuffle during training
    num_workers=0,     # Load in the main process (recommended on Windows)
    pin_memory=True    # Pin batches in memory for faster GPU transfer (optional)
)

# Iterate through the DataLoader to get batches
for batch_idx, (batch_features, batch_labels) in enumerate(dataloader):
    print(f"Batch {batch_idx} | Features shape: {batch_features.shape}, Labels shape: {batch_labels.shape}")
    break  # Print only the first batch
2. Key Parameter Explanations
- batch_size: Size of each batch (e.g., 64, 128). If the total number of samples is not a multiple of batch_size, the last batch is simply smaller than the rest; pass drop_last=True to discard that incomplete batch (see the sketch below).
- shuffle: Set to True during training to prevent the model from “learning the data order” (e.g., the sequence of digits in MNIST). Set to False during validation/testing for reproducible results.
- num_workers: Number of worker subprocesses for parallel loading (larger values speed up loading but increase memory usage). The default is 0, meaning loading happens in the main process, which is the safe choice on Windows; on Linux/macOS you can raise it, e.g., num_workers=4 on a 4-core machine (os.cpu_count() reports the core count).
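A quick way to see how batch_size and drop_last interact is to count the batches directly. This is a minimal sketch reusing the 1000-sample dataset from above: 1000 = 15 × 64 + 40, so the 16th batch holds only 40 samples unless drop_last=True.
from torch.utils.data import DataLoader

# 1000 samples with batch_size=64: 15 full batches plus 1 batch of 40
loader = DataLoader(dataset, batch_size=64, shuffle=False)
print(len(loader))  # 16 batches
last_features, last_labels = list(loader)[-1]
print(last_features.shape[0])  # 40 samples in the final batch

# drop_last=True discards the incomplete final batch
loader = DataLoader(dataset, batch_size=64, shuffle=False, drop_last=True)
print(len(loader))  # 15 batches, each with exactly 64 samples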
IV. Practical Example: Loading Image Data (MNIST)
Beyond tensor data, we often need to load image data (e.g., MNIST handwritten digits). PyTorch’s torchvision library provides built-in datasets and data transformation tools, which work seamlessly with Dataset and DataLoader.
1. Install Dependencies (if not installed)
pip install torch torchvision
2. Load MNIST Dataset
import torchvision
import torchvision.transforms as transforms
# Define data transformations: convert image to tensor and normalize
transform = transforms.Compose([
    transforms.ToTensor(),                      # Convert image to tensor (0-1 range)
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])

# Load training and test datasets
train_dataset = torchvision.datasets.MNIST(
    root='./data',       # Data storage path
    train=True,          # Load training set
    download=True,       # Download data if not present
    transform=transform  # Apply transformations
)
test_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=False,         # Load test set
    download=True,
    transform=transform
)

# Create DataLoaders
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2        # 2 worker processes (Linux/macOS example)
)
test_loader = DataLoader(
    test_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=2
)

# Test: iterate through the first batch
for images, labels in train_loader:
    print(f"Image shape: {images.shape} (batch_size=64, channels=1, height=28, width=28)")
    print(f"Labels shape: {labels.shape}")
    break
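To show where these batches actually go, here is a minimal, hypothetical training-loop skeleton built around train_loader. The model and optimizer are placeholders chosen for illustration, not part of the original example.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model: flatten 1x28x28 images and classify into 10 digit classes
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.train()
for images, labels in train_loader:
    # Move the batch to the training device
    images, labels = images.to(device), labels.to(device)

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    break  # One batch is enough for this sketch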
V. Common Issues and Solutions
- Error with num_workers > 0 on Windows
  Solution: Set num_workers=0, or keep num_workers > 0 but run the DataLoader loop behind an if __name__ == '__main__': guard, which Windows multiprocessing requires (see the sketch below).
- Unexpected data shapes
  Solution: Print batch_features.shape to check dimensions. Image data typically uses (batch_size, channels, height, width) (e.g., MNIST: (64, 1, 28, 28)).
- High memory usage
  Solution: Reduce batch_size (e.g., from 128 to 64) or lower num_workers.
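The Windows fix from the first item looks like this in practice. Windows starts worker processes with the spawn method and re-imports your script, so the iteration must live under the standard main-module guard; this is a minimal sketch reusing train_dataset from the MNIST example above.
from torch.utils.data import DataLoader

def main():
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)
    for images, labels in loader:
        print(images.shape)
        break

if __name__ == '__main__':
    main()  # Required on Windows so worker processes can re-import this script safely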
VI. Summary
Dataset and DataLoader are the foundation of PyTorch data processing. Dataset defines how data is stored, while DataLoader efficiently reads data in batches.
Key Steps Recap:
1. Define a Dataset (or use TensorDataset for tensor data).
2. Create a DataLoader with parameters like batch_size and shuffle.
3. Iterate through the DataLoader to get batches for model training.
With this knowledge, you can handle various data types (tensors, images, etc.) and optimize data loading for your projects. Next, explore more complex data sources such as CSV files or custom image folders to further master Dataset extension and customization; a small CSV-backed example follows as a starting point.
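As a starting point for that exploration, here is a hypothetical sketch of a Dataset backed by a CSV file. It assumes a file named data.csv with no header row, where every column is numeric and the last column holds an integer class label; the file name and column layout are assumptions for illustration.
import csv
import torch
from torch.utils.data import Dataset, DataLoader

class CsvDataset(Dataset):
    """Loads a numeric, header-less CSV whose last column is an integer label."""

    def __init__(self, csv_path):
        rows = []
        with open(csv_path, newline='') as f:
            for row in csv.reader(f):
                rows.append([float(value) for value in row])
        data = torch.tensor(rows)
        self.features = data[:, :-1]       # All columns except the last
        self.labels = data[:, -1].long()   # Last column as integer labels

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return len(self.features)

# Usage (assumes data.csv exists with the layout described above):
# dataset = CsvDataset('data.csv')
# loader = DataLoader(dataset, batch_size=32, shuffle=True)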