Foreword

In the previous chapter, we got started with deep learning through a linear regression example and became familiar with the basics of PaddlePaddle. In this chapter, we move on to convolutional neural networks (CNNs). Much of deep learning's popularity stems from its excellent performance in computer vision, where CNNs are almost universally used to extract image features. Here we will learn how to define a CNN in PaddlePaddle and use it to complete an image recognition task: by working through the MNIST handwritten-digit classification example, we will master the basic use of CNNs.

Training the Model

Create a mnist_classification.py file and first import the required packages. This time we use the MNIST dataset interface along with the image-processing packages PIL and Matplotlib.

import numpy as np
import paddle
import paddle.dataset.mnist as mnist
import paddle.fluid as fluid
from PIL import Image
import matplotlib.pyplot as plt

Image recognition algorithms have gone through several generations; one earlier approach is the Multi-Layer Perceptron (MLP). Before CNNs became widespread, MLPs were a popular choice for image recognition. Let's first learn how to define a simple MLP in PaddlePaddle. The following code defines an MLP with three layers: two hidden layers of 100 neurons each and an output layer of 10 neurons (MNIST consists of grayscale handwritten digits 0 through 9, so there are 10 categories). The output layer uses the Softmax activation function and acts as the classifier. Including the input layer, the MLP structure is: Input Layer -> Hidden Layer -> Hidden Layer -> Output Layer.

# Define a Multi-Layer Perceptron
def multilayer_perceptron(input):
    # First fully connected layer with ReLU activation
    hidden1 = fluid.layers.fc(input=input, size=100, act='relu')
    # Second fully connected layer with ReLU activation
    hidden2 = fluid.layers.fc(input=hidden1, size=100, act='relu')
    # Fully connected output layer with Softmax activation (size matches the number of labels)
    fc = fluid.layers.fc(input=hidden2, size=10, act='softmax')
    return fc
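
For a sense of scale, here is a rough parameter count for this MLP (a back-of-the-envelope sketch; it assumes fluid.layers.fc flattens the [1, 28, 28] input to a 784-dimensional vector, which is its default behavior):

# Rough parameter count for the MLP above (weights + biases per fc layer)
layer1 = 784 * 100 + 100   # input (28*28=784) -> hidden1: 78,500
layer2 = 100 * 100 + 100   # hidden1 -> hidden2: 10,100
layer3 = 100 * 10 + 10     # hidden2 -> output: 1,010
print(layer1 + layer2 + layer3)  # 89,610 trainable parameters in total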

CNNs are widely used for image feature extraction and appear in tasks such as image classification, object detection, and text recognition. A typical CNN consists of convolutional layers, pooling layers, and fully connected layers, sometimes supplemented with layers like Batch Normalization and Dropout. Let's create a simple CNN with five layers (not counting the input layer). Its structure is: Input Layer -> Convolutional Layer -> Pooling Layer -> Convolutional Layer -> Pooling Layer -> Output Layer. We use fluid.layers.conv2d() for convolution, where num_filters sets the number of kernels, filter_size sets the kernel size, and stride sets the step size. For pooling, we use fluid.layers.pool2d(), with pool_size for the pooling window size, pool_stride for the step size, and pool_type for the pooling type ('max' for max pooling, used here; 'avg' for average pooling).

# Convolutional Neural Network
def convolutional_neural_network(input):
    # First convolutional layer: 32 kernels of size 3x3
    conv1 = fluid.layers.conv2d(input=input,
                                num_filters=32,
                                filter_size=3,
                                stride=1)

    # First pooling layer: 2x2 max pooling with stride 1
    pool1 = fluid.layers.pool2d(input=conv1,
                                pool_size=2,
                                pool_stride=1,
                                pool_type='max')

    # Second convolutional layer: 64 kernels of size 3x3
    conv2 = fluid.layers.conv2d(input=pool1,
                                num_filters=64,
                                filter_size=3,
                                stride=1)

    # Second pooling layer: 2x2 max pooling with stride 1
    pool2 = fluid.layers.pool2d(input=conv2,
                                pool_size=2,
                                pool_stride=1,
                                pool_type='max')

    # Fully connected output layer with Softmax activation
    fc = fluid.layers.fc(input=pool2, size=10, act='softmax')
    return fc
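
It helps to trace the feature-map sizes through this network. With Fluid's defaults (zero padding), a convolution or pooling layer maps an input of width W to an output of width (W - F) / S + 1, where F is the kernel or window size and S is the stride. The sketch below walks a 28x28 input through the layers defined above, assuming the default zero padding:

# Feature-map width after each layer: output = (input - kernel) // stride + 1
def out_size(w, f, s):
    return (w - f) // s + 1

w = 28                 # input: 1 x 28 x 28
w = out_size(w, 3, 1)  # conv1 (3x3, stride 1): 32 x 26 x 26
w = out_size(w, 2, 1)  # pool1 (2x2, stride 1): 32 x 25 x 25
w = out_size(w, 3, 1)  # conv2 (3x3, stride 1): 64 x 23 x 23
w = out_size(w, 2, 1)  # pool2 (2x2, stride 1): 64 x 22 x 22
print(w)               # 22; the final fc layer flattens 64*22*22 values per image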

Define the input layers for the image data and the labels. The images are 28x28 grayscale images, so the input shape is [1, 28, 28] (1 channel for grayscale, 28x28 pixels). For a 32x32 RGB image the shape would be [3, 32, 32] (3 color channels); a sketch of that case follows the code below. The batch dimension is prepended automatically by PaddlePaddle, so we do not specify it here.

# Define the input layer
image = fluid.layers.data(name='image', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
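
For reference, the RGB case mentioned above would be declared like this (a hypothetical example, not used in this chapter):

# Hypothetical input layer for 32x32 RGB images (3 color channels)
# image = fluid.layers.data(name='image', shape=[3, 32, 32], dtype='float32')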

We can now use the defined networks to get the classifier. Try both MLP and CNN to compare their accuracy.

# Get the classifier
# model = multilayer_perceptron(image)  # Uncomment to test MLP
model = convolutional_neural_network(image)  # Use CNN for this chapter

Next, define the loss function (cross-entropy, commonly used in classification) and the accuracy function.

# Get loss function and accuracy function
cost = fluid.layers.cross_entropy(input=model, label=label)
avg_cost = fluid.layers.mean(cost)
acc = fluid.layers.accuracy(input=model, label=label)
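
To make the loss concrete: for a single sample, cross-entropy is the negative log of the probability the model assigns to the true class. A plain NumPy sketch, independent of the training graph and with made-up probabilities:

# Cross-entropy for one sample: -log(probability of the true class)
probs = np.array([0.7, 0.2, 0.1])  # hypothetical softmax output for 3 classes
label_idx = 0                      # suppose class 0 is the true label
print(-np.log(probs[label_idx]))   # ~0.357; higher confidence means lower loss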

Clone the main program as the test program, which we will use to evaluate test accuracy and to predict custom images later. Passing for_test=True puts layers such as Dropout and Batch Normalization into inference mode and keeps training-only operators out of the clone, which is why the clone is made before optimizer.minimize() is called below.

# Get the test program
test_program = fluid.default_main_program().clone(for_test=True)

Define the optimization method. We use the Adam optimizer with a learning rate of 0.001.

# Define optimization method
optimizer = fluid.optimizer.AdamOptimizer(learning_rate=0.001)
opts = optimizer.minimize(avg_cost)
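
Other built-in optimizers can be swapped in the same way if you want to experiment; for example, plain SGD (shown here as a commented-out sketch, since this chapter sticks with Adam):

# optimizer = fluid.optimizer.SGD(learning_rate=0.001)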

Read the MNIST dataset with a batch size of 128 (128 images per training iteration).

# Get MNIST data
train_reader = paddle.batch(mnist.train(), batch_size=128)
test_reader = paddle.batch(mnist.test(), batch_size=128)
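
If you are curious what a batch looks like, you can pull one from the reader. The sketch below assumes, as paddle.dataset.mnist does, that each sample is an (image, label) tuple whose image is a flattened 784-element float array already normalized to [-1, 1]:

# Peek at one batch: a list of 128 (image, label) tuples
batch = next(train_reader())
print(len(batch))         # 128
print(batch[0][0].shape)  # (784,): a flattened 28x28 image
print(batch[0][1])        # an integer label, 0-9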

Create the executor and initialize the parameters. For Fluid programs this is the standard setup: define a place (CPU here), create an executor on it, and run the startup program once.

# Define an executor using CPU
place = fluid.CPUPlace()
exe = fluid.Executor(place)
# Initialize parameters
exe.run(fluid.default_startup_program())
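
If you have the GPU build of PaddlePaddle installed, you can point the executor at a GPU instead (a sketch; everything else stays the same):

# Use the first GPU instead of the CPU (requires the GPU build of PaddlePaddle)
# place = fluid.CUDAPlace(0)
# exe = fluid.Executor(place)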

Define the data feeder, which converts each batch of raw samples into the input variables defined above (image and label).

# Define data feeder
feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
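
The feeder turns a list of raw (image, label) samples into the {'image': ..., 'label': ...} dictionary the executor expects; we will build the same structure by hand when predicting a custom image later. A quick commented sketch to verify:

# feeder.feed() converts raw samples into a name -> tensor feed dictionary
# print(feeder.feed(next(train_reader())).keys())  # dict_keys(['image', 'label'])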

Now start training for 5 passes (epochs). We print the training loss and accuracy every 100 batches, then evaluate the test set after each pass.

# Start training and testing
for pass_id in range(5):
    # Training phase
    for batch_id, data in enumerate(train_reader()):
        train_cost, train_acc = exe.run(program=fluid.default_main_program(),
                                        feed=feeder.feed(data),
                                        fetch_list=[avg_cost, acc])
        # Print info every 100 batches
        if batch_id % 100 == 0:
            print('Pass:%d, Batch:%d, Cost:%0.5f, Accuracy:%0.5f' %
                  (pass_id, batch_id, train_cost[0], train_acc[0]))

    # Testing phase
    test_accs = []
    test_costs = []
    for batch_id, data in enumerate(test_reader()):
        test_cost, test_acc = exe.run(program=test_program,
                                      feed=feeder.feed(data),
                                      fetch_list=[avg_cost, acc])
        test_accs.append(test_acc[0])
        test_costs.append(test_cost[0])
    # Calculate average test results
    test_cost = (sum(test_costs) / len(test_costs))
    test_acc = (sum(test_accs) / len(test_accs))
    print('Test:%d, Cost:%0.5f, Accuracy:%0.5f' % (pass_id, test_cost, test_acc))

Output:

Pass:0, Batch:0, Cost:3.50138, Accuracy:0.07812
Pass:0, Batch:100, Cost:0.14832, Accuracy:0.96875
Pass:0, Batch:200, Cost:0.13408, Accuracy:0.96875
Pass:0, Batch:300, Cost:0.11601, Accuracy:0.97656
Pass:0, Batch:400, Cost:0.27977, Accuracy:0.92969
Test:0, Cost:0.08879, Accuracy:0.97379
Pass:1, Batch:0, Cost:0.11175, Accuracy:0.96875
Pass:1, Batch:100, Cost:0.07854, Accuracy:0.97656
Pass:1, Batch:200, Cost:0.04025, Accuracy:0.99219
Pass:1, Batch:300, Cost:0.09936, Accuracy:0.98438
Pass:1, Batch:400, Cost:0.19245, Accuracy:0.95312
Test:1, Cost:0.10123, Accuracy:0.97241
Pass:2, Batch:0, Cost:0.13749, Accuracy:0.96094
Pass:2, Batch:100, Cost:0.06074, Accuracy:0.98438
Pass:2, Batch:200, Cost:0.01982, Accuracy:0.99219
Pass:2, Batch:300, Cost:0.06725, Accuracy:0.97656
Pass:2, Batch:400, Cost:0.10043, Accuracy:0.96875
Test:2, Cost:0.13354, Accuracy:0.96776
Pass:3, Batch:0, Cost:0.08895, Accuracy:0.98438
Pass:3, Batch:100, Cost:0.06339, Accuracy:0.96875
Pass:3, Batch:200, Cost:0.05107, Accuracy:0.98438
Pass:3, Batch:300, Cost:0.08062, Accuracy:0.97656
Pass:3, Batch:400, Cost:0.07631, Accuracy:0.96875
Test:3, Cost:0.11465, Accuracy:0.97449
Pass:4, Batch:0, Cost:0.01259, Accuracy:1.00000
Pass:4, Batch:100, Cost:0.01203, Accuracy:1.00000
Pass:4, Batch:200, Cost:0.08451, Accuracy:0.97656
Pass:4, Batch:300, Cost:0.16532, Accuracy:0.98438
Pass:4, Batch:400, Cost:0.09657, Accuracy:0.98438
Test:4, Cost:0.14624, Accuracy:0.97211

Predicting Custom Images

After training, we use the cloned test_program to predict custom images. The preprocessing must match the training data: convert the image to grayscale, resize it to 28x28, and turn it into an array normalized to the same [-1, 1] range as the training images.

# Preprocess the image for prediction
def load_image(file):
    im = Image.open(file).convert('L')  # Grayscale conversion
    im = im.resize((28, 28), Image.ANTIALIAS)  # Resize to 28x28
    im = np.array(im).reshape(1, 1, 28, 28).astype(np.float32)  # Reshape to [1,1,28,28]
    im = im / 255.0 * 2.0 - 1.0  # Normalize to [-1, 1] (matching training preprocessing)
    return im

Download a test image (e.g., infer_3.png):

!wget https://github.com/yeyupiaoling/LearnPaddle2/blob/master/note4/infer_3.png?raw=true -O 'infer_3.png'

Display the image using Matplotlib:

img = Image.open('infer_3.png')
plt.imshow(img)
plt.show()

Output Image:
Image of a handwritten digit 3, grayscale, 28x28

Load the preprocessed image and run prediction:

# Load the image and start prediction (the label fed here is a dummy value;
# it is required by the feed targets but does not affect the fetched output)
img = load_image('./infer_3.png')
results = exe.run(program=test_program,
                  feed={'image': img, 'label': np.array([[1]]).astype('int64')},
                  fetch_list=[model])

Determine the predicted label with the highest probability:

# Get the predicted label with the highest probability
lab = np.argsort(results)
print("The predicted label for this image is: %d" % lab[0][0][-1])

Output:

The predicted label for this image is: 3
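
Since we only need the single most likely class, np.argmax is a more direct equivalent of the argsort-and-take-last pattern above (an alternative sketch, not the code this chapter uses):

# Equivalent, more direct way to pick the most probable class
# print("The predicted label for this image is: %d" % np.argmax(results[0][0]))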

Conclusion

This chapter concludes here. By now you should find PaddlePaddle straightforward to use: we defined a CNN and completed image classification training and prediction with little code. CNNs excel at image recognition, while recurrent neural networks (RNNs) play a similar role in natural language processing. We will learn about RNNs in the next chapter.

Synchronization Links:
- Baidu AI Studio: http://aistudio.baidu.com/aistudio/projectdetail/29346
- Kesci K-Lab: https://www.kesci.com/home/project/5bf8c998954d6e001066d780
- GitHub Repository: https://github.com/yeyupiaoling/LearnPaddle2/tree/master/note4

Note: The latest code is available on GitHub.


Previous Chapter: 《PaddlePaddle from Entry to ML Experimentation》3——Linear Regression

Next Chapter: 《PaddlePaddle from Entry to ML Experimentation》5——Recurrent Neural Networks
