1. Activation Functions

Why Activation Functions?
The basic unit of a neural network is the neuron, whose output is a weighted sum (linear transformation) of its inputs. Without an activation function, no matter how many layers of neurons are stacked, the output remains a linear combination of the inputs, making it impossible to fit complex non-linear relationships (e.g., edges and textures in images). The role of an activation function is to introduce non-linear transformations, enabling the network to “learn complex patterns.”
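
For intuition, here is a minimal sketch (an illustrative check, not part of the original text) showing that two stacked linear layers with no activation in between collapse into a single linear transformation:

import torch
import torch.nn as nn

# Two linear layers stacked without an activation (bias disabled for clarity)
lin1 = nn.Linear(4, 8, bias=False)
lin2 = nn.Linear(8, 2, bias=False)

x = torch.randn(3, 4)
y_stacked = lin2(lin1(x))

# The same mapping expressed as one linear transformation with weight W2 @ W1
y_single = x @ (lin2.weight @ lin1.weight).T

print(torch.allclose(y_stacked, y_single))  # True: stacking adds no expressive power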

1. Common Activation Functions and PyTorch Implementations

(1) ReLU (Rectified Linear Unit)
  • Formula: y = max(0, x)
  • When the input x > 0, the output equals the input; when x ≤ 0, the output is 0.
  • Features: Simple to compute, alleviates the vanishing-gradient problem of the traditional Sigmoid/Tanh functions, and converges quickly during training. It is currently the most commonly used activation function.
  • PyTorch Implementation:
  import torch
  import torch.nn as nn

  # Create ReLU activation function instance
  relu = nn.ReLU()

  # Test with a tensor
  x = torch.tensor([-1.0, 0.0, 1.5, -0.3])
  y = relu(x)
  print("ReLU Output:", y)  # Output: tensor([0.0000, 0.0000, 1.5000, 0.0000])

(2) Sigmoid Function
  • Formula: y = 1 / (1 + exp(-x))
  • Output range is (0, 1), commonly used in the output layer of binary classification problems (converts results to probabilities).
  • Features: Outputs can be interpreted as probabilities, but when x is very large or very small the gradient approaches 0, leading to vanishing gradients and making deep networks hard to train (a quick gradient check follows the code below).
  • PyTorch Implementation:
  sigmoid = nn.Sigmoid()
  x = torch.tensor([-2.0, 0.0, 2.0])
  y = sigmoid(x)
  print("Sigmoid Output:", y)  # Output: tensor([0.1192, 0.5000, 0.8808])
(3) Tanh Function
  • Formula: y = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
  • Output range is (-1, 1) and the output is zero-centered, which in theory makes it easier to train than Sigmoid.
  • Features: Still suffers from vanishing gradients; it is now used less often, mainly in the hidden layers of RNNs and other sequence models.
  • PyTorch Implementation:
  tanh = nn.Tanh()
  x = torch.tensor([-1.0, 0.0, 1.0])
  y = tanh(x)
  print("Tanh Output:", y)  # Output: tensor([-0.7616, 0.0000, 0.7616])

2. Convolutional Layers

What is a Convolutional Layer?
Convolutional layers are the core of CNNs (Convolutional Neural Networks). They extract local features (e.g., edges, textures) by sliding a “filter” (convolution kernel) over input data. For example, different kernels can identify patterns like “edges” and “stripes” in images.

1. Basic Concepts of Convolution Operations
  • Input: Assume the input is a grayscale image (single-channel, 2D matrix) with shape (H, W).
  • Convolution Kernel (Filter/Kernel): A small matrix (e.g., 3x3) used for feature extraction.
  • Stride: Number of pixels the kernel slides each time (default: 1).
  • Padding: Adding zeros around the input edges to control the output size and preserve information at the borders.

Simple Example:
Input: 3x3 matrix (pixel values)

[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]

Convolution Kernel: 2x2 matrix

[[a, b],
 [c, d]]

Output Calculation: at each position, the kernel's values are multiplied element-wise with the pixels it covers and the products are summed (verified in code after this example); with stride 1 this yields a 2x2 output:
- Top-left: 1*a + 2*b + 4*c + 5*d
- Top-right: 2*a + 3*b + 5*c + 6*d
- Bottom-left: 4*a + 5*b + 7*c + 8*d
- Bottom-right: 5*a + 6*b + 8*c + 9*d
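
As a sanity check, the sketch below runs the same calculation with torch.nn.functional.conv2d, using the concrete values a=1, b=0, c=0, d=1 chosen for illustration. (Note that deep-learning “convolution” is implemented as cross-correlation, i.e., exactly the multiply-and-sum above with no kernel flip.)

import torch
import torch.nn.functional as F

# Reshape input and kernel to (batch, channels, height, width)
x = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]]).reshape(1, 1, 3, 3)
w = torch.tensor([[1., 0.],
                  [0., 1.]]).reshape(1, 1, 2, 2)  # a=1, b=0, c=0, d=1

y = F.conv2d(x, w)  # stride=1, padding=0
print(y)  # tensor([[[[ 6.,  8.],
          #           [12., 14.]]]])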

2. Convolutional Layers in PyTorch (nn.Conv2d)

In PyTorch, convolutional layers are implemented via nn.Conv2d for processing 2D image data (e.g., RGB images). The input shape is (batch_size, in_channels, height, width), where:
- batch_size: Number of samples in a batch (e.g., 16 images processed at once).
- in_channels: Number of input channels (3 for RGB, 1 for grayscale).
- height/width: Image height and width.

Key Parameters:
- in_channels: Number of input channels.
- out_channels: Number of output channels (equal to the number of convolution kernels; each kernel produces one feature map).
- kernel_size: Convolution kernel size (e.g., 3 for 3x3).
- stride: Stride (default: 1).
- padding: Padding (default: 0).

Output Shape Calculation:
Output height/width = (input_height/width + 2*padding - kernel_size) // stride + 1
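
For example, with the 28x28 input, kernel_size=3, stride=1, and padding=1 used below: (28 + 2*1 - 3) // 1 + 1 = 28, so the spatial size is preserved.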

Code Example:
Assume input is an RGB image (batch=1, channels=3, size=28x28) with a 3x3 kernel and 16 output channels:

# 1. Import libraries
import torch
import torch.nn as nn

# 2. Create input tensor (batch=1, channels=3, height=28, width=28)
x = torch.randn(1, 3, 28, 28)  # Randomly generated data simulating one RGB image

# 3. Define the convolutional layer
conv = nn.Conv2d(
    in_channels=3,     # Input channels (RGB)
    out_channels=16,   # Output channels (16 kernels)
    kernel_size=3,     # 3x3 kernel
    stride=1,          # Stride=1
    padding=1          # Padding=1 to keep output size unchanged
)

# 4. Forward pass (apply convolution)
y = conv(x)

# 5. Print input and output shapes
print("Input Shape:", x.shape)    # Output: torch.Size([1, 3, 28, 28])
print("Output Shape:", y.shape)   # Output: torch.Size([1, 16, 28, 28])

3. Convolutional Layer + Activation Function Practical Example

Convolutional layers are typically followed by activation functions (e.g., ReLU) to form the basic unit of “Convolution → Activation”:

# Create Conv2d + ReLU
conv_relu = nn.Sequential(
    nn.Conv2d(3, 16, 3, 1, 1),  # Convolutional layer
    nn.ReLU()                   # Activation function
)

y = conv_relu(x)
print("Convolution + ReLU Output:", y.shape)  # Still torch.Size([1, 16, 28, 28])

Summary: Activation functions introduce non-linearity to neural networks, while convolutional layers extract image features via sliding windows. Their combination forms the core structure of CNN models. Subsequent learning of pooling layers (e.g., MaxPooling) and fully connected layers will enable the construction of complete image recognition models.
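
As a preview of that complete structure, here is a minimal sketch (continuing from x above; the 10 output classes are an assumption chosen for illustration) that chains convolution, activation, pooling, flattening, and a fully connected layer:

# Preview: Conv -> ReLU -> MaxPool -> Flatten -> Linear
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, 1, 1),    # (1, 3, 28, 28) -> (1, 16, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),              # (1, 16, 28, 28) -> (1, 16, 14, 14)
    nn.Flatten(),                 # (1, 16, 14, 14) -> (1, 16*14*14)
    nn.Linear(16 * 14 * 14, 10)   # (1, 3136) -> (1, 10) class scores
)
print(model(x).shape)  # torch.Size([1, 10])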

Practice Exercise: Try modifying the kernel size (e.g., kernel_size=5) or stride (stride=2) to observe changes in output shape and understand how parameters affect feature extraction.
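
As a starting point, the sketch below (reusing x and nn from above) applies the output-shape formula to both suggested changes:

conv_k5 = nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=1)
print(conv_k5(x).shape)  # (28 + 2*1 - 5) // 1 + 1 = 26 -> torch.Size([1, 16, 26, 26])

conv_s2 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
print(conv_s2(x).shape)  # (28 + 2*1 - 3) // 2 + 1 = 14 -> torch.Size([1, 16, 14, 14])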

Xiaoye