Preface

In the previous article, “Obtaining Common Public Face Datasets and Creating Custom Face Datasets”, we introduced common face datasets and how to create your own. In this article, we move to the second step of face recognition: face detection. A face recognition pipeline typically works as follows: first check whether a face is present in the image, then crop the face region, align it using facial landmarks, and finally complete recognition through face comparison.

MTCNN (Multi-task Cascaded Convolutional Networks) integrates face region detection and facial landmark detection into a single model. Proposed in 2016 by the Shenzhen Research Institute of the Chinese Academy of Sciences, it achieves fast and accurate face detection with a candidate-box-plus-classifier approach built from three cascaded networks: P-Net rapidly generates candidate windows, R-Net filters those candidates down to high-precision ones, and O-Net produces the final bounding boxes and facial landmarks. Like many convolutional neural networks for image tasks, MTCNN also employs image pyramids, bounding box regression, and non-maximum suppression (NMS).
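Of these techniques, NMS recurs at every stage of the cascade: each network's overlapping candidate boxes are pruned so that only the highest-scoring box in each cluster survives. The following is a minimal NumPy sketch of the idea, for illustration only; see the project source for the implementation actually used.

```python
# Minimal NMS sketch in NumPy (illustrative, not the project's exact code).
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # process boxes from highest to lowest score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Overlap of the current best box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1 + 1) * np.maximum(0.0, yy2 - yy1 + 1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much; keep the rest
        order = order[1:][iou <= iou_threshold]
    return keep
```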

Project Source Code: https://github.com/yeyupiaoling/PaddlePaddle-MTCNN

Environment

  • PaddlePaddle 2.0.1
  • Python 3.7

File Introduction

  • models/Loss.py: Loss functions used by MTCNN, including classification loss, bounding box loss, and landmark loss (a simplified sketch of how these combine follows this list)
  • models/PNet.py: PNet network structure
  • models/RNet.py: RNet network structure
  • models/ONet.py: ONet network structure
  • utils/data_format_converter.py: Merges multiple images into a single file
  • utils/data.py: Training data reader
  • utils/utils.py: Various utility functions
  • train_PNet/generate_PNet_data.py: Generates training data for PNet
  • train_PNet/train_PNet.py: Trains the PNet model
  • train_RNet/generate_RNet_data.py: Generates training data for RNet
  • train_RNet/train_RNet.py: Trains the RNet model
  • train_ONet/generate_ONet_data.py: Generates training data for ONet
  • train_ONet/train_ONet.py: Trains the ONet model
  • infer_path.py: Predicts images using a file path, detects face positions and key points, and displays results
  • infer_camera.py: Predicts images from the camera, detects face positions and key points, and displays results in real-time
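Since models/Loss.py combines three objectives, it is worth sketching how such a multi-task loss is typically assembled in MTCNN implementations. The label convention (1 = positive, 0 = negative, -1 = part face) and the 70% online-hard-example-mining ratio below are common choices and are assumptions here, not necessarily this repository's exact code; the landmark loss follows the same masked squared-error pattern as the box loss.

```python
# Simplified MTCNN-style multi-task loss in PaddlePaddle (illustrative).
import paddle
import paddle.nn.functional as F

def class_loss(logits, labels, keep_ratio=0.7):
    """logits: (N, 2); labels: (N,) with 1 = positive, 0 = negative, -1 = part face."""
    valid = (labels >= 0).astype('float32')  # part faces don't train the classifier
    safe_labels = labels * (labels >= 0).astype(labels.dtype)  # map -1 -> 0 for indexing
    per_sample = F.cross_entropy(logits, safe_labels, reduction='none').flatten() * valid
    # Online hard example mining: average only the hardest 70% of valid samples.
    keep = max(int(float(valid.sum()) * keep_ratio), 1)
    hardest, _ = paddle.topk(per_sample, k=keep)
    return hardest.mean()

def bbox_loss(pred, target, labels):
    """Squared error over samples that carry box targets (positives and part faces)."""
    weight = paddle.logical_or(labels == 1, labels == -1).astype('float32')
    per_sample = paddle.sum(paddle.square(pred - target), axis=1) * weight
    return per_sample.sum() / (weight.sum() + 1e-8)
```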

Dataset Download

  • WIDER Face: Download the training data “WIDER Face Training Images”, extract it, and place the WIDER_train folder under the dataset directory. Also download “Face annotations”, extract it, and place the wider_face_train_bbx_gt.txt file in the dataset directory.
  • Deep Convolutional Network Cascade for Facial Point Detection: Download the “Training set”, extract it, and place the lfw_5590 and net_7876 folders under the dataset directory.

After extracting the datasets, the dataset directory should contain:
- Folders: lfw_5590, net_7876, WIDER_train
- Annotation files: testImageList.txt, trainImageList.txt, wider_face_train.txt, wider_face_train_bbx_gt.txt
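To make the annotation format concrete: in wider_face_train_bbx_gt.txt each record is an image path, a face count, then one line per face starting with x, y, w, h followed by six attribute flags (blur, expression, illumination, invalid, occlusion, pose). A minimal parser might look like the sketch below; this is illustrative, not the project's generate_PNet_data.py logic.

```python
# Sketch of a parser for wider_face_train_bbx_gt.txt (illustrative only).
def load_wider_annotations(path):
    annotations = {}
    with open(path) as f:
        lines = iter(f.read().splitlines())
        for image_path in lines:
            if not image_path.strip():
                continue
            num_faces = int(next(lines))
            boxes = []
            # Images with zero faces still carry one placeholder line of zeros.
            for _ in range(max(num_faces, 1)):
                x, y, w, h = map(int, next(lines).split()[:4])
                boxes.append((x, y, x + w, y + h))  # convert to corner format
            annotations[image_path] = boxes if num_faces > 0 else []
    return annotations
```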

Model Training

Model training consists of three steps: training the PNet model, training the RNet model, and training the ONet model. Each step depends on the result of the previous step.

Step 1: Train PNet Model

PNet (Proposal Network) is a fully convolutional network. Its basic structure is three convolutional layers followed by a face classifier that judges whether a region contains a face, plus a bounding box regression branch. (Note: the facial landmark regression branch from the original paper was removed from PNet in this implementation.)

PNet Architecture
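As a reference for the structure just described, here is a sketch of the standard PNet topology in PaddlePaddle, with the landmark head omitted as in this project. Layer sizes follow the original paper, so check models/PNet.py for the repository's exact definition.

```python
# Sketch of the standard PNet topology (illustrative; see models/PNet.py).
import paddle.nn as nn

class PNet(nn.Layer):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2D(3, 10, kernel_size=3), nn.PReLU(10),
            nn.MaxPool2D(kernel_size=2, stride=2, ceil_mode=True),
            nn.Conv2D(10, 16, kernel_size=3), nn.PReLU(16),
            nn.Conv2D(16, 32, kernel_size=3), nn.PReLU(32),
        )
        # Two 1x1 conv heads: face/non-face scores and box regression offsets.
        self.cls = nn.Conv2D(32, 2, kernel_size=1)  # softmax applied at inference
        self.box = nn.Conv2D(32, 4, kernel_size=1)

    def forward(self, x):
        x = self.features(x)
        return self.cls(x), self.box(x)
```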

  • cd train_PNet (Navigate to the train_PNet folder)
  • python3 generate_PNet_data.py (Generate training data for PNet)
  • python3 train_PNet.py (Start training the PNet model)

Step 2: Train RNet Model

RNet (Refine Network) is a convolutional neural network that adds a fully connected layer compared to PNet, enabling stricter filtering of its input. After PNet generates candidate windows, RNet discards low-quality candidates and further refines the survivors using bounding box regression and NMS. (The landmark regression branch was likewise removed from RNet here.)

RNet Architecture
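The hand-off between PNet and RNet can be summarized as: crop each PNet candidate, resize it to RNet's 24×24 input, keep only crops the classifier scores highly, then apply bounding box regression and NMS. The sketch below illustrates this flow; the thresholds, the assumption that rnet returns softmaxed (scores, offsets), and the reuse of the nms() helper from earlier are all illustrative choices, not the repo's exact pipeline.

```python
# Illustrative RNet filtering stage (boundary clipping and application of the
# regression offsets are omitted for brevity).
import cv2
import numpy as np
import paddle

def refine_with_rnet(image, candidate_boxes, rnet, score_threshold=0.7):
    crops = []
    for x1, y1, x2, y2 in candidate_boxes.astype(int):
        crops.append(cv2.resize(image[y1:y2, x1:x2], (24, 24)))  # RNet takes 24x24 input
    batch = np.stack(crops).transpose((0, 3, 1, 2)).astype('float32')
    scores, offsets = rnet(paddle.to_tensor(batch))
    scores = scores.numpy()[:, 1]    # assumed softmax probability of "face"
    keep = scores > score_threshold  # discard low-quality candidates
    boxes, scores = candidate_boxes[keep], scores[keep]
    return boxes[nms(boxes, scores, iou_threshold=0.6)]
```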

  • cd train_RNet (Navigate to the train_RNet folder)
  • python3 generate_RNet_data.py (Generate training data for RNet using the trained PNet model)
  • python3 train_RNet.py (Start training the RNet model)

Step 3: Train ONet Model

ONet (Output Network) is a more complex convolutional neural network that adds one more convolutional layer compared to RNet. It applies stronger supervision to identify face regions and also regresses facial landmarks, ultimately outputting five facial key points: the two eye centers, the nose tip, and the two mouth corners.

ONet Architecture
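Since ONet predicts landmarks as offsets normalized to the candidate box (the common MTCNN convention, assumed here along with the (x1..x5, y1..y5) output layout), a final step maps them back to absolute image coordinates, roughly like this:

```python
# Map normalized ONet landmark output back to image coordinates (illustrative).
import numpy as np

def landmarks_to_image_coords(boxes, landmarks):
    """boxes: (N, 4) as [x1, y1, x2, y2]; landmarks: (N, 10) as (x1..x5, y1..y5)."""
    w = boxes[:, 2] - boxes[:, 0] + 1
    h = boxes[:, 3] - boxes[:, 1] + 1
    points = np.empty_like(landmarks)
    points[:, 0:5] = boxes[:, 0:1] + landmarks[:, 0:5] * w[:, None]    # x coordinates
    points[:, 5:10] = boxes[:, 1:2] + landmarks[:, 5:10] * h[:, None]  # y coordinates
    return points
```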

  • cd train_ONet (Navigate to the train_ONet folder)
  • python3 generate_ONet_data.py (Generate training data for ONet using the trained PNet and RNet models)
  • python3 train_ONet.py (Start training the ONet model)

Inference

  • python3 infer_path.py (Uses an image path to detect face boxes and landmarks, then displays the results)
    Inference Example

  • python3 infer_camera.py (Captures images from the camera, detects face boxes and landmarks, and displays real-time results)
