Introduction to PyTorch Neural Networks: Fully Connected Layers and Backpropagation Principles

1. Fundamentals of Neural Networks: From Neurons to Fully Connected Layers

A neural network can be pictured as a collection of many “mini-brains,” where each “brain” handles a portion of the information. The fully connected layer is the most fundamental information-transmission unit in a neural network. Essentially, a fully connected layer connects every neuron in the previous layer to every neuron in the current layer, much like a social network in which everyone is connected to everyone else, so information flows freely.

For example, if the previous layer has 3 neurons and the current layer has 5 neurons, each neuron in the current layer receives a weighted sum of the outputs from the 3 neurons in the previous layer. Mathematically:

\[\text{Output} = \text{Weight Matrix} \times \text{Input} + \text{Bias Vector}\]
  • Weight Matrix: Each element represents the connection strength between a neuron in the previous layer and a neuron in the current layer (denoted as W).
  • Bias Vector: Provides an independent “starting point” for each neuron in the current layer (denoted as b).
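
To make the 3-to-5 example above concrete, here is a minimal sketch showing that nn.Linear stores exactly such a weight matrix and bias vector (the layer sizes are just the illustrative numbers from the example):

import torch
import torch.nn as nn

# A fully connected layer mapping 3 input neurons to 5 output neurons
fc = nn.Linear(in_features=3, out_features=5)

print(fc.weight.shape)  # torch.Size([5, 3]): one row of weights per output neuron
print(fc.bias.shape)    # torch.Size([5]): one bias per output neuron

x = torch.randn(3)  # Output of the previous layer (3 neurons)
out = fc(x)         # Computes weight @ x + bias
print(out.shape)    # torch.Size([5])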

2. Forward Propagation: Flow of Information from Input to Output

After constructing the fully connected layers, we need to propagate data from the input layer through each subsequent layer to the output layer—a process called forward propagation. Consider a simple two-layer neural network:

  1. Input Layer: Assume the input is a vector x (e.g., 784 pixel values of an MNIST handwritten digit).
  2. First Fully Connected Layer: Pass through nn.Linear(in_features=784, out_features=128) (PyTorch’s fully connected layer) to get x1 = W1 @ x + b1.
  3. Activation Function: Apply the ReLU activation function y1 = ReLU(x1) (introduces non-linearity to enable the network to fit complex relationships).
  4. Second Fully Connected Layer: Pass through nn.Linear(in_features=128, out_features=10) to get x2 = W2 @ y1 + b2 (output 10 class scores).

The PyTorch code for this is:

import torch
import torch.nn as nn

# Define a two-layer fully connected network (simplified version)
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)  # First fully connected layer: 784->128
        self.fc2 = nn.Linear(128, 10)   # Second fully connected layer: 128->10
        self.relu = nn.ReLU()  # Activation function

    def forward(self, x):
        x = self.fc1(x)   # First fully connected layer (linear transformation)
        x = self.relu(x)  # ReLU activation
        x = self.fc2(x)   # Second fully connected layer (outputs 10 class scores)
        return x

# Dummy input: 1 sample with 784 features
x = torch.randn(1, 784)
model = SimpleNet()
output = model(x)  # Forward propagation to get output
print(output.shape)  # Output shape: torch.Size([1, 10])
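
Continuing from the code above, note that nn.Linear applies the same transformation to every row of a batch, so a whole batch of flattened images can be pushed through at once (the batch size of 64 below is an arbitrary choice for illustration):

batch = torch.randn(64, 784)  # 64 samples, 784 features each
print(model(batch).shape)     # torch.Size([64, 10]): one row of 10 class scores per sample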

3. Backpropagation: The Core of Neural Network “Self-Correction”

3.1 Why Is Backpropagation Needed?

Forward propagation only calculates the output but cannot enable the model to “learn.” To find the appropriate weights (W) and biases (b) that minimize the model’s prediction error, backpropagation computes the gradient of the loss function with respect to the weights, allowing the model to automatically adjust parameters.

3.2 Gradient Descent: The Core Idea of Parameter Updates

Imagine standing on a hillside, trying to reach the bottom (minimize the loss) as quickly as possible. The gradient points in the direction of steepest ascent, so gradient descent steps in the opposite direction, downhill. The update rule is:

\[W = W - \eta \times \frac{\partial \text{Loss}}{\partial W}\]
  • Learning Rate (η): Controls the step size (too large and you might overshoot the minimum; too small and convergence is slow).
  • Gradient: The derivative of the loss function with respect to the weight, indicating how much the loss changes when adjusting the weight.
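
To see the update rule in action on a single parameter, here is a minimal sketch (the toy loss L(w) = (w - 3)^2 and the starting point are made up for illustration) that repeatedly steps against the gradient:

import torch

w = torch.tensor(0.0, requires_grad=True)  # Arbitrary starting point
lr = 0.1                                   # Learning rate (eta)

for step in range(50):
    loss = (w - 3) ** 2     # Toy loss with its minimum at w = 3
    loss.backward()         # dLoss/dw = 2 * (w - 3)
    with torch.no_grad():
        w -= lr * w.grad    # W = W - eta * dLoss/dW
    w.grad.zero_()          # Reset the gradient for the next step

print(w.item())  # Close to 3.0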

3.3 Chain Rule: The Mathematical Foundation of Backpropagation

Backpropagation is essentially the application of the chain rule in neural networks. For the two-layer network above (input→fc1→relu→fc2→output), the gradient of the loss function L with respect to W1 is calculated via:

\[\frac{\partial L}{\partial W1} = \frac{\partial L}{\partial x2} \times \frac{\partial x2}{\partial y1} \times \frac{\partial y1}{\partial x1} \times \frac{\partial x1}{\partial W1}\]

Starting from the output layer, the gradient for each parameter is computed and propagated backward, layer by layer, toward the input layer. PyTorch’s autograd automatically records the computation graph and computes these gradients for you!
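
As a small sanity check of the chain rule (the numbers below are arbitrary), you can compare a hand-applied chain rule with what autograd reports for a simple composed function:

import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

u = w * x   # Inner function
y = u ** 2  # Outer function, so dy/dw = dy/du * du/dw = 2u * x
y.backward()

print(w.grad)                   # tensor(36.) computed by autograd
print(2 * u.item() * x.item())  # 36.0 from the chain rule by hand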

3.4 Detailed Steps of Backpropagation (With a Simple Example)

Consider training a two-layer network (input→fc1→relu→fc2→output) with Mean Squared Error (MSE) loss:

\[L = \frac{1}{2}(y_{pred} - y_{true})^2\]

Step 1: Compute forward pass outputs
Using PyTorch to automatically record the computation graph (with requires_grad=True for parameters):

x = torch.tensor([1.0, 2.0])  # Input
y_true = torch.tensor(3.0)    # True value

# Define parameters for a tiny 2->2->1 network (requires_grad=True so autograd tracks them)
W1 = torch.tensor([[0.5, 0.3], [0.2, 0.7]], requires_grad=True)
b1 = torch.tensor([0.1, 0.2], requires_grad=True)
W2 = torch.tensor([[0.4, 0.6]], requires_grad=True)
b2 = torch.tensor([0.3], requires_grad=True)

# Forward propagation
x1 = W1 @ x + b1     # x1 = [0.5*1 + 0.3*2 + 0.1, 0.2*1 + 0.7*2 + 0.2] = [1.2, 1.8]
y1 = torch.relu(x1)  # y1 = [1.2, 1.8] (ReLU leaves positive values unchanged)
x2 = W2 @ y1 + b2    # x2 = 0.4*1.2 + 0.6*1.8 + 0.3 = 1.86
y_pred = x2[0]       # Simplified output to a single value

Step 2: Compute gradients of loss with respect to parameters
Using loss.backward() to automatically compute gradients via the chain rule:

loss = (y_pred - y_true) ** 2 / 2  # Squared error (the 1/2 cancels when differentiating)
loss.backward()  # Backpropagation computes gradients via the chain rule

# The .grad attribute of each parameter is now populated
print("Gradient of W2:", W2.grad)  # dL/dW2 = (y_pred - y_true) * y1 = tensor([[-1.3680, -2.0520]])
print("Gradient of W1:", W1.grad)  # dL/dW1 via the chain rule = tensor([[-0.4560, -0.9120], [-0.6840, -1.3680]])

Step 3: Parameter Update
Use an optimizer (e.g., SGD) to update weights based on gradients:

optimizer = torch.optim.SGD([W1, b1, W2, b2], lr=0.01)  # Pass the tensors directly (no nn.Module here)
optimizer.step()  # Update W1, b1, W2, b2 using the stored gradients
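
In place of calling optimizer.step(), the same update can be written by hand, which makes the gradient descent rule explicit. A minimal sketch using the toy parameters from Step 1:

lr = 0.01
with torch.no_grad():                # Disable autograd tracking while updating
    for param in (W1, b1, W2, b2):
        param -= lr * param.grad     # W = W - eta * dLoss/dW
        param.grad.zero_()           # Clear the gradient for the next iteration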

4. Complete Backpropagation Process in PyTorch

4.1 Summary of Core Steps:

  1. Define the Network: Use nn.Module and nn.Linear to define fully connected layers and activation functions.
  2. Forward Propagation: Input data is passed through the network to compute the output.
  3. Compute Loss: Use a loss function (e.g., MSE, CrossEntropyLoss) to compare predictions with true values.
  4. Backward Propagation: Call loss.backward() to automatically compute gradients for all parameters.
  5. Parameter Update: Use an optimizer (e.g., torch.optim.SGD) to update weights based on gradients.

4.2 Complete Code Example (Training a Simple Regression Task):

# 1. Import libraries
import torch
import torch.nn as nn
import torch.optim as optim

# 2. Generate simulated data (y = 2x1 + 3x2 + 5 + noise)
x = torch.randn(100, 2)  # 100 samples with 2 features
true_w = torch.tensor([2.0, 3.0])
true_b = torch.tensor(5.0)
y_true = (x @ true_w) + true_b + 0.1 * torch.randn(100)  # Add noise

# 3. Define the model
class LinearNet(nn.Module):
    def __init__(self):
        super(LinearNet, self).__init__()
        self.fc = nn.Linear(2, 1)  # Input: 2 dimensions, Output: 1 dimension (simplified, no activation)
    def forward(self, x):
        return self.fc(x)

model = LinearNet()

# 4. Define loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error
optimizer = optim.SGD(model.parameters(), lr=0.1)  # Learning rate 0.1

# 5. Training loop
for epoch in range(100):
    # Forward pass
    y_pred = model(x)
    loss = criterion(y_pred, y_true.unsqueeze(1))  # Adjust dimensions for matching

    # Backward pass
    optimizer.zero_grad()  # Clear gradients (PyTorch accumulates gradients by default)
    loss.backward()        # Compute gradients via backpropagation
    optimizer.step()       # Update parameters using gradients

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

# 6. Output final parameters (should approximate true_w and true_b)
print("Learned weights:", model.fc.weight)  # tensor([[2.0005, 3.0002]])
print("Learned bias:", model.fc.bias)      # tensor([5.0001])

5. Key Concept Review

  • Fully Connected Layer: Connects every neuron in the previous layer to every neuron in the current layer, enabling feature weighted combinations.
  • Forward Propagation: The forward flow of data from input to output.
  • Backpropagation: Backward calculation of the loss gradient with respect to parameters, implemented via the chain rule.
  • Gradient Descent: Updates parameters by moving in the direction opposite to the gradient, minimizing the loss function.
  • Automatic Differentiation: PyTorch’s autograd automatically records the computation graph and calculates gradients, simplifying parameter updates.

By understanding these principles, you can effectively debug and optimize your models, even as you leverage PyTorch’s high-level abstractions for implementation.

Xiaoye