What is an Optimizer?¶
In deep learning, an optimizer is the “intelligent guide” that adjusts a model’s many parameters to minimize the loss function (a measure of the gap between the model’s predictions and the true labels). Imagine hiking down a mountain: you want to get from the peak (high loss) to the valley (low loss), and the optimizer is the “navigation system” that decides, at every step, which direction to move and how far. Its core job is to update the model parameters so that the model performs better on the training data.
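To make this concrete, here is a minimal, hand-rolled gradient-descent step on a toy one-parameter “model”; the toy loss (w - 1)² and the learning rate are made up for illustration, and later sections use torch.optim to do the same thing for real models:
import torch

w = torch.tensor(5.0, requires_grad=True)   # start high on the "mountain"
lr = 0.1                                    # step size: how far to move each time

for _ in range(20):
    loss = (w - 1.0) ** 2                   # toy loss, minimized at w = 1
    loss.backward()                         # the gradient points "uphill"
    with torch.no_grad():
        w -= lr * w.grad                    # step in the opposite (downhill) direction
    w.grad.zero_()                          # reset the gradient before the next step

print(w.item())                             # ends up close to 1.0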
Why Different Optimizers?¶
Different optimizers are designed for different problems, each with its own advantages and disadvantages (a minimal sketch of their update rules follows this list):
- SGD (Stochastic Gradient Descent): The most basic optimizer; it updates parameters using the gradient of a single sample (or a small mini-batch) at each step.
  - Pros: Simple, low memory usage.
  - Cons: Slow to converge and sensitive to the learning rate (too large causes oscillation or divergence, too small leads to slow convergence).
  - Improvement: Adding momentum accelerates convergence, similar to physical inertia, making parameter updates smoother.
- Adam (Adaptive Moment Estimation): Currently the most popular optimizer, combining momentum with adaptive learning rates.
  - Pros: Performs well with default parameters, converges quickly, and is relatively insensitive to the learning rate, making it suitable for most scenarios.
  - Key feature: Automatically adjusts each parameter’s effective learning rate based on its gradient history, so different parameters can update at different speeds.
- AdamW: A variant of Adam with decoupled weight decay (applied directly to the weights rather than folded into the gradient as L2 regularization), which helps reduce overfitting.
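For intuition, here is a minimal sketch of the three update rules above, applied by hand to a toy one-dimensional problem f(w) = (w - 3)². The constants (learning rate, momentum, betas) are illustrative choices, and the Adam loop omits bias correction for brevity, so this is a simplification rather than the exact algorithm:
lr, momentum = 0.1, 0.9
beta1, beta2, eps = 0.9, 0.999, 1e-8

def grad(w):
    return 2 * (w - 3)          # gradient of (w - 3)**2

# Plain SGD: step straight along the negative gradient
w_sgd = 0.0
for _ in range(100):
    w_sgd -= lr * grad(w_sgd)

# SGD with momentum: accumulate a "velocity" so steps keep some inertia
w_mom, v = 0.0, 0.0
for _ in range(100):
    v = momentum * v + grad(w_mom)
    w_mom -= lr * v

# Adam (simplified): scale each step by running estimates of the gradient
# mean (m) and squared gradient (s), giving an adaptive per-parameter step
w_adam, m, s = 0.0, 0.0, 0.0
for _ in range(100):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g * g
    w_adam -= lr * m / (s ** 0.5 + eps)

print(w_sgd, w_mom, w_adam)     # all three end up near 3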
Overview of Optimizers in PyTorch¶
PyTorch’s torch.optim module provides various optimizers. Here are the most commonly used ones for beginners:
| Optimizer | Core Features | Applicable Scenarios |
|---|---|---|
| SGD | Basic stochastic gradient descent; requires manual tuning of learning rate and momentum | Simple models or when strict parameter control is needed |
| SGD+Momentum | Adds momentum to accelerate convergence and reduce oscillations | Models with high training volatility (e.g., RNNs) |
| Adam | Adaptive learning rate + momentum; excellent performance with default parameters | Most deep learning tasks (CNNs, Transformers, etc.) |
| AdamW | Adam + decoupled weight decay; helps reduce overfitting | Small datasets or complex models prone to overfitting |
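The table rows map directly onto constructors in torch.optim. A brief sketch (the model here is a throwaway nn.Linear just to have parameters to optimize, and the hyperparameter values are common starting points rather than recommendations from this article):
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1)  # placeholder model, only used to supply parameters

opt_sgd      = optim.SGD(model.parameters(), lr=0.01)                       # plain SGD
opt_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)         # SGD + momentum
opt_adam     = optim.Adam(model.parameters(), lr=1e-3)                      # Adam (default lr)
opt_adamw    = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # Adam + decoupled weight decay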
Practical: Comparing Optimizers in PyTorch¶
We use a simple linear regression model to compare the performance of SGD and Adam. The goal is to learn the linear relationship y = 2x + 3 (with noise to simulate real data).
Step 1: Prepare Data¶
Generate noisy linear data:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
# Set random seed for reproducibility
torch.manual_seed(42)
# Generate 100 samples with 1 feature each; labels follow y = 2x + 3 + noise
x = torch.randn(100, 1) * 10 # Input features (standard normal scaled by 10)
y = 2 * x + 3 + torch.randn(100, 1) * 1.5 # True relationship + noise
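Optionally, you can eyeball the data before training; the points should scatter around the line y = 2x + 3 (this snippet just reuses the matplotlib import from above):
plt.scatter(x.numpy(), y.numpy(), s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Generated training data')
plt.show()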
Step 2: Define the Model¶
Define a simple linear model using PyTorch’s nn.Module:
class LinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=1, out_features=1)  # 1D input → 1D output

    def forward(self, x):
        return self.linear(x)  # prediction: weight * x + bias

# Initialize two models: one trained with SGD, one with Adam
model_sgd = LinearRegression()
model_adam = LinearRegression()
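As a quick optional sanity check, an untrained model already maps an (N, 1) input batch to an (N, 1) output; the predictions are simply not meaningful yet:
with torch.no_grad():
    print(model_sgd(x[:5]).shape)  # torch.Size([5, 1]): untrained predictions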
Step 3: Define Loss Function and Optimizers¶
Use Mean Squared Error (MSE) as the loss function, with SGD and Adam as optimizers:
# Loss function: Mean Squared Error
criterion = nn.MSELoss()
# SGD optimizer: learning rate = 0.01 (manual tuning required; with inputs scaled
# by 10 this is close to the stability limit, so lower it if the loss blows up)
optimizer_sgd = optim.SGD(model_sgd.parameters(), lr=0.01)
# Adam optimizer: lr = 0.01 here; the default is 0.001, and Adam is fairly
# insensitive to this choice
optimizer_adam = optim.Adam(model_adam.parameters(), lr=0.01)
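If you want to confirm what each optimizer was configured with, the hyperparameters can be read back from param_groups (this just echoes the values set above):
print(optimizer_sgd.param_groups[0]['lr'])      # 0.01
print(optimizer_adam.param_groups[0]['lr'])     # 0.01
print(optimizer_adam.param_groups[0]['betas'])  # (0.9, 0.999), Adam's momentum coefficients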
Step 4: Train the Model¶
Training loop: Forward pass → compute loss → backward pass → update parameters
def train(optimizer, model, x, y, epochs=1000):
    losses = []
    for epoch in range(epochs):
        # Forward pass: model prediction
        pred = model(x)
        loss = criterion(pred, y)
        losses.append(loss.item())

        # Backward pass + parameter update
        optimizer.zero_grad()  # Clear gradients (to avoid accumulation across steps)
        loss.backward()        # Compute gradients
        optimizer.step()       # Update parameters

        # Print loss every 100 epochs
        if (epoch + 1) % 100 == 0:
            print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')
    return losses
# Train both models
losses_sgd = train(optimizer_sgd, model_sgd, x, y)
losses_adam = train(optimizer_adam, model_adam, x, y)
Step 5: Compare Results¶
After training, observe results through loss curves and final parameters:
- Loss Curve:
plt.figure(figsize=(10, 5))
plt.plot(losses_sgd, label='SGD')
plt.plot(losses_adam, label='Adam')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curve Comparison')
plt.legend()
plt.show()
Intuition: Adam typically converges faster with a smoother loss curve; SGD may oscillate and converge more slowly, and can even diverge if its learning rate is too large for the data scale.
- Final Parameters:
print("Parameters after SGD training:")
print(f"Weight: {model_sgd.linear.weight.item():.2f}, Bias: {model_sgd.linear.bias.item():.2f}")
print("\nParameters after Adam training:")
print(f"Weight: {model_adam.linear.weight.item():.2f}, Bias: {model_adam.linear.bias.item():.2f}")
Expected output: both models should end up close to the true values (Weight ≈ 2, Bias ≈ 3), with Adam usually getting there more reliably; SGD’s result depends more heavily on its learning rate.
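As an extra cross-check (not part of the original comparison), you can compute the closed-form least-squares fit for the same data and see how close each trained model came to it:
# Design matrix [x, 1] so that A @ [weight, bias]^T ≈ y
A = torch.cat([x, torch.ones_like(x)], dim=1)
solution = torch.linalg.lstsq(A, y).solution    # shape (2, 1): [[weight], [bias]]
print(f"Least-squares fit -> Weight: {solution[0].item():.2f}, Bias: {solution[1].item():.2f}")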
Summary and Recommendations¶
- Beginners: start with Adam. Its default parameters work well in most scenarios, giving fast, stable convergence without manual learning-rate tuning.
- When to use SGD: when you want strict control over the optimization process (e.g., on smaller problems), or when experimenting with momentum and learning-rate adjustments.
- Key tips: if the loss barely decreases, try adjusting the learning rate (e.g., lr=0.1 if it is too small, or a smaller value if the loss oscillates or blows up); if the model overfits, switch to AdamW with weight decay (see the sketch after this list).
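A brief sketch of those tips in code; the specific values (lr, weight_decay=0.01, momentum=0.9, step_size=300) are illustrative starting points, not tuned settings:
# AdamW with decoupled weight decay, for when the model overfits
optimizer_adamw = optim.AdamW(model_adam.parameters(), lr=1e-3, weight_decay=0.01)

# SGD with momentum plus a step-wise learning-rate schedule, the kind of manual
# tuning the SGD bullet refers to (call scheduler.step() once per epoch)
optimizer_sgd_m = optim.SGD(model_sgd.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer_sgd_m, step_size=300, gamma=0.5)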
In practice, there is no “best” optimizer—only the “most suitable” one. Start with Adam and experiment with others based on your task needs!