Preface
In the previous article “Obtaining Common Public Face Datasets and Creating Custom Face Datasets”, we introduced common face datasets and how to create your own dataset. In this article, we move to the second step of face recognition: face detection. A face recognition pipeline typically works as follows: first check whether a face is present in the image, then crop the face, align it using facial landmarks, and finally complete recognition through face comparison.
MTCNN (Multi-task Cascaded Convolutional Networks) integrates face region detection and facial landmark detection into a single model. Proposed in 2016 by researchers at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, it achieves fast and efficient face detection through a coarse-to-fine cascade of three networks combined with a candidate-box-plus-classifier approach: P-Net rapidly generates candidate windows, R-Net filters those candidates down to high-precision ones, and O-Net produces the final bounding boxes and facial landmarks. Like many convolutional neural networks for image tasks, MTCNN also employs techniques such as image pyramids, bounding-box regression, and non-maximum suppression (NMS).
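Of these techniques, NMS is the one every stage of the cascade relies on to merge heavily overlapping candidate boxes. Below is a minimal NumPy sketch of greedy NMS; the function name, signature, and threshold are illustrative, not the project's actual API, and the repository's implementation may differ in details such as using IoM instead of IoU at some stages.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes.

    Returns the indices of the boxes that survive.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the winning box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the winner more than the threshold
        order = order[1:][iou <= iou_threshold]
    return keep
```

P-Net produces thousands of candidates per image, so running NMS after each stage keeps the number of windows passed to the next, more expensive network small.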
Project Source Code: https://github.com/yeyupiaoling/PaddlePaddle-MTCNN
Environment
- PaddlePaddle 2.0.1
- Python 3.7
File Introduction

- `models/Loss.py`: loss functions used by MTCNN, including the classification loss, bounding-box loss, and landmark loss
- `models/PNet.py`: PNet network structure
- `models/RNet.py`: RNet network structure
- `models/ONet.py`: ONet network structure
- `utils/data_format_converter.py`: merges multiple images into a single file
- `utils/data.py`: training data reader
- `utils/utils.py`: various utility functions
- `train_PNet/generate_PNet_data.py`: generates training data for PNet
- `train_PNet/train_PNet.py`: trains the PNet model
- `train_RNet/generate_RNet_data.py`: generates training data for RNet
- `train_RNet/train_RNet.py`: trains the RNet model
- `train_ONet/generate_ONet_data.py`: generates training data for ONet
- `train_ONet/train_ONet.py`: trains the ONet model
- `infer_path.py`: takes an image path, detects face positions and key points, and displays the results
- `infer_camera.py`: reads frames from the camera, detects face positions and key points, and displays the results in real time
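As a sketch of what the classification part of such a loss looks like, here is a minimal NumPy version of face/non-face cross-entropy with online hard example mining, a technique from the MTCNN paper in which only the hardest fraction of samples in a batch contribute to the loss. The function name and the choice of keep ratio here are illustrative; whether this repository's `models/Loss.py` uses this exact scheme is an assumption.

```python
import numpy as np

def ohem_cls_loss(log_probs, labels, keep_ratio=0.7):
    """Face/non-face cross-entropy with online hard example mining.

    log_probs: (N, 2) log-softmax outputs.
    labels:    (N,) ground-truth classes in {0, 1}.
    Only the hardest `keep_ratio` fraction of samples are averaged.
    """
    # Negative log-likelihood of the correct class for each sample
    losses = -log_probs[np.arange(len(labels)), labels]
    k = max(1, int(len(losses) * keep_ratio))
    # Largest losses correspond to the hardest samples
    hardest = np.sort(losses)[::-1][:k]
    return hardest.mean()
```

In the full multi-task loss, this term is combined with the bounding-box and landmark regression losses via per-task weights, and each sample only contributes to the tasks its label type supports (e.g. negatives contribute only to classification).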
Dataset Download

- WIDER Face: download the training data “WIDER Face Training Images”, extract it, and place the `WIDER_train` folder under the `dataset` directory. Also download “Face annotations”, extract it, and place the `wider_face_train_bbx_gt.txt` file in the `dataset` directory.
- Deep Convolutional Network Cascade for Facial Point Detection: download the “Training set”, extract it, and place the `lfw_5590` and `net_7876` folders under the `dataset` directory.
After extracting the datasets, the dataset directory should contain:
- Folders: lfw_5590, net_7876, WIDER_train
- Annotation files: testImageList.txt, trainImageList.txt, wider_face_train.txt, wider_face_train_bbx_gt.txt
Model Training
Model training consists of three steps: training the PNet model, training the RNet model, and training the ONet model. Each step depends on the result of the previous step.
Step 1: Train PNet Model
PNet (Proposal Network) is a fully convolutional network. Its basic structure is three convolutional layers followed by a face/non-face classifier and a bounding-box regressor. (Note: the original paper’s landmark regression branch was removed from PNet in this implementation.)
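Because PNet is fully convolutional, it is applied to an image pyramid so that faces of different sizes all map onto its small fixed receptive field (12×12 in the original paper). The sketch below shows how the pyramid's scale factors are typically computed; the `min_face_size` of 20 and scale factor of 0.709 (≈1/√2, halving the image area per level) are common defaults, and whether this repository uses these exact values is an assumption.

```python
def pyramid_scales(width, height, min_face_size=20, factor=0.709, net_input=12):
    """Scale factors at which to run PNet over an image pyramid.

    The first scale maps a face of `min_face_size` pixels onto the
    `net_input`-sized receptive field; each subsequent level shrinks
    the image by `factor` until it is smaller than the network input.
    """
    scale = net_input / min_face_size  # largest scale needed
    min_side = min(width, height) * scale
    scales = []
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```

Each detection found at scale `s` is then mapped back to the original image by dividing its coordinates by `s`.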

- `cd train_PNet` (navigate to the `train_PNet` folder)
- `python3 generate_PNet_data.py` (generate training data for PNet)
- `python3 train_PNet.py` (start training the PNet model)
Step 2: Train RNet Model

RNet (Refine Network) is a convolutional neural network with an additional fully connected layer compared to PNet, which enables stricter filtering of its inputs. After PNet generates candidate windows, RNet rejects the low-quality candidates and further refines the survivors using bounding-box regression and NMS. (The landmark regression branch was removed here as well.)
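Bounding-box regression refines a candidate window by shifting its corners with network-predicted offsets expressed relative to the box size. A minimal NumPy sketch, assuming the common MTCNN convention of offsets as fractions of the box width/height added to the corner coordinates (the repository's exact encoding may differ, and `calibrate_boxes` is an illustrative name):

```python
import numpy as np

def calibrate_boxes(boxes, offsets):
    """Apply predicted regression offsets to candidate boxes.

    boxes:   (N, 4) candidates as [x1, y1, x2, y2].
    offsets: (N, 4) predicted (dx1, dy1, dx2, dy2), each expressed as a
             fraction of the box width (for x) or height (for y).
    """
    w = (boxes[:, 2] - boxes[:, 0] + 1).reshape(-1, 1)
    h = (boxes[:, 3] - boxes[:, 1] + 1).reshape(-1, 1)
    sizes = np.hstack([w, h, w, h])
    return boxes + offsets * sizes
```

Calibrating boxes before the next stage means R-Net and O-Net receive crops that are already better centered on the face, which is part of why the cascade converges on tight boxes.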

- `cd train_RNet` (navigate to the `train_RNet` folder)
- `python3 generate_RNet_data.py` (generate training data for RNet using the trained PNet model)
- `python3 train_RNet.py` (start training the RNet model)
Step 3: Train ONet Model

ONet (Output Network) is a more complex convolutional neural network, with one more convolutional layer than RNet. It applies stronger supervision to the face region and, in addition to classification and bounding-box regression, regresses the facial landmarks, ultimately outputting five facial key points.
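ONet predicts the five landmarks as coordinates relative to the face box, which then have to be mapped back into absolute image coordinates. A small illustrative helper, assuming the common convention that landmarks are normalized to the box's width and height (whether this repository uses exactly this encoding is an assumption):

```python
def landmarks_to_image(box, norm_points):
    """Map landmarks normalized to a face box back to image coordinates.

    box:         (x1, y1, x2, y2) face box in image coordinates.
    norm_points: iterable of (nx, ny) pairs in [0, 1] relative to the box.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1 + 1, y2 - y1 + 1
    return [(x1 + nx * w, y1 + ny * h) for nx, ny in norm_points]
```

These absolute landmark coordinates are what the next stage of a recognition pipeline uses to align the face before comparison.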

- `cd train_ONet` (navigate to the `train_ONet` folder)
- `python3 generate_ONet_data.py` (generate training data for ONet using the trained PNet and RNet models)
- `python3 train_ONet.py` (start training the ONet model)
Inference

- `python3 infer_path.py` (takes an image path, detects face boxes and landmarks, then displays the results)
- `python3 infer_camera.py` (captures frames from the camera, detects face boxes and landmarks, and displays the results in real time)
References
- https://github.com/AITTSMD/MTCNN-Tensorflow
- https://blog.csdn.net/qq_36782182/article/details/83624357