Preface

A face dataset is essential for developing a face recognition system, so the preparation work begins with obtaining one. This chapter introduces publicly available face datasets as well as how to make your own, laying the groundwork for the face recognition system developed in later chapters.

Public Face Datasets

There are many publicly available face datasets. Here, we introduce several commonly used ones.

CelebA Face Dataset

Official download link: https://pan.baidu.com/s/1zw0KA1iYW41Oo1xZRuHkKQ
Password: zu3w

After downloading and extracting the dataset, you will find three folders:
- The Anno folder stores annotation files.
- The Eval folder stores evaluation list files.
- The Img folder stores image files.

The Img folder contains three types of image files:
- img_align_celeba.zip: Face images that have been aligned, cropped, and resized to a uniform 178×218 pixels, in JPG format.
- img_align_celeba_png.7z: Contains the same images as img_align_celeba.zip, but in PNG format, making the files larger.
- img_celeba.7z: Raw face images without centering or cropping.

The Anno folder contains 5 annotation files (a short parsing sketch follows the list):

  1. identity_CelebA.txt: Specifies the person ID for each image, with the format image_name person_id.
000001.jpg 2880
000002.jpg 2937
000003.jpg 8692
000004.jpg 5805
  2. list_attr_celeba.txt: Annotates 40 binary facial attributes (e.g., hair color, glasses); each line is an image name followed by 40 values, where 1 means the attribute is present and -1 means it is absent.
5_o_Clock_Shadow Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Big_Lips Big_Nose Black_Hair Blond_Hair Blurry Brown_Hair Bushy_Eyebrows Chubby Double_Chin Eyeglasses Goatee Gray_Hair Heavy_Makeup High_Cheekbones Male Mouth_Slightly_Open Mustache Narrow_Eyes No_Beard Oval_Face Pale_Skin Pointy_Nose Receding_Hairline Rosy_Cheeks Sideburns Smiling Straight_Hair Wavy_Hair Wearing_Earrings Wearing_Hat Wearing_Lipstick Wearing_Necklace Wearing_Necktie Young 
000001.jpg -1  1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1  1  1 -1  1 -1 -1  1 -1 -1  1 -1 -1 -1  1  1 -1  1 -1  1 -1 -1  1
000002.jpg -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1  1 -1  1 -1 -1  1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1  1
000003.jpg -1 -1 -1 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1  1  1 -1 -1  1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1  1
  3. list_bbox_celeba.txt: Annotates the position of faces in images, with the format image_id x_1 y_1 width height.
image_id x_1 y_1 width height
000001.jpg    95  71 226 313
000002.jpg    72  94 221 306
000003.jpg   216  59  91 126
000004.jpg   622 257 564 781
  4. list_landmarks_align_celeba.txt: Annotates 5 facial landmarks (eye centers, nose tip, and mouth corners) in the aligned images.
lefteye_x lefteye_y righteye_x righteye_y nose_x nose_y leftmouth_x leftmouth_y rightmouth_x rightmouth_y
000001.jpg 69  109  106  113   77  142   73  152  108  154
000002.jpg 69  110  107  112   81  135   70  151  108  153
000003.jpg 76  112  104  106  108  128   74  156   98  158
000004.jpg 72  113  108  108  101  138   71  155  101  151
  5. list_landmarks_celeba.txt: Annotates the same 5 facial landmarks in the original images.
lefteye_x lefteye_y righteye_x righteye_y nose_x nose_y leftmouth_x leftmouth_y rightmouth_x rightmouth_y
000001.jpg 165  184  244  176  196  249  194  271  266  260
000002.jpg 140  204  220  204  168  254  146  289  226  289
000003.jpg 244  104  264  105  263  121  235  134  251  140
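
As a concrete illustration of these formats, the sketch below loads identity_CelebA.txt and list_bbox_celeba.txt into dictionaries keyed by image name. The Anno directory path is an assumption based on the folder layout described above.

from pathlib import Path

anno_dir = Path('CelebA/Anno')  # assumed location of the Anno folder

# image_name -> person_id
identities = {}
with open(anno_dir / 'identity_CelebA.txt', encoding='utf-8') as f:
    for line in f:
        name, person_id = line.split()
        identities[name] = int(person_id)

# image_name -> (x_1, y_1, width, height); the first two lines of the file
# are the image count and the column header, so they are skipped
bboxes = {}
with open(anno_dir / 'list_bbox_celeba.txt', encoding='utf-8') as f:
    for line in f.read().splitlines()[2:]:
        parts = line.split()
        bboxes[parts[0]] = tuple(int(v) for v in parts[1:5])

print(identities['000001.jpg'], bboxes['000001.jpg'])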

LFW Dataset

Dataset download link: http://mmlab.ie.cuhk.edu.hk/archive/CNN/data/train.zip

After unzipping, the LFW dataset contains two folders and two text files:
- lfw_5590 and net_7876: Folders storing face images.
- testImageList.txt and trainImageList.txt: Text files annotating, for each image, its path, a face bounding box, and 5 facial landmarks. Judging from the examples below, the bounding box is given as four integers (left, right, top, bottom), followed by the (x, y) coordinates of the five landmarks.

lfw_5590\Aaron_Eckhart_0001.jpg 84 161 92 169 106.250000 107.750000 146.750000 112.250000 125.250000 142.750000 105.250000 157.750000 139.750000 161.750000
lfw_5590\Aaron_Guiel_0001.jpg 85 172 93 181 100.250000 111.250000 145.750000 116.750000 124.250000 136.750000 92.750000 159.750000 138.750000 163.750000
lfw_5590\Aaron_Peirsol_0001.jpg 88 173 94 179 106.750000 113.250000 146.750000 113.250000 129.250000 139.750000 108.250000 153.250000 146.750000 152.750000
lfw_5590\Aaron_Pena_0001.jpg 67 176 83 192 101.750000 116.750000 145.250000 103.750000 125.250000 136.750000 119.750000 163.750000 146.250000 155.750000
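
A minimal sketch for parsing these annotation lines, assuming the field order inferred above (image path, bounding box as left/right/top/bottom, then five x/y landmark pairs):

def parse_annotation(line):
    parts = line.split()
    # Annotation paths use Windows-style backslashes
    path = parts[0].replace('\\', '/')
    left, right, top, bottom = (int(v) for v in parts[1:5])
    coords = [float(v) for v in parts[5:15]]
    # Pair up the coordinates: [(x, y), ...] for the 5 landmarks
    landmarks = list(zip(coords[0::2], coords[1::2]))
    return path, (left, right, top, bottom), landmarks


with open('trainImageList.txt', encoding='utf-8') as f:
    for line in f:
        print(parse_annotation(line.strip()))
        break  # print only the first record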

WIDER Face Dataset

Official image download link: http://pan.baidu.com/s/1c0DfSmW
Annotation file download link: http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/bbx_annotation/wider_face_split.zip

WIDER_train.zip contains the training images; a single image may contain multiple faces.

The wider_face_train_bbx_gt.txt file annotates face positions in the format xmin ymin width height. For each image, the first line is the image path, the second line is the number of faces, and each subsequent line gives one face's coordinates followed by attribute flags (blur, expression, illumination, invalid, occlusion, pose).

0--Parade/0_Parade_marchingband_1_849.jpg
1
449 330 122 149 0 0 0 0 0 0 
0--Parade/0_Parade_Parade_0_904.jpg
1
361 98 263 339 0 0 0 0 0 0 

From the annotation file, a wider_face_train.txt file can be generated in the format xmin ymin xmax ymax, with one line per image. Some lines carry multiple boxes (one per face), unlike the previous datasets where each image contains only one face. A conversion sketch follows the examples below.

1--Handshaking/1_Handshaking_Handshaking_1_288 336.00 82.00 448.00 244.00 
1--Handshaking/1_Handshaking_Handshaking_1_924 271.42 425.64 508.92 681.64 
1--Handshaking/1_Handshaking_Handshaking_1_866 364.42 894.49 451.76 993.88 545.13 771.01 623.44 879.44 186.73 825.22 256.00 945.69 
1--Handshaking/1_Handshaking_Handshaking_1_164 223.46 100.68 351.16 275.03 532.87 100.68 665.48 279.94 
1--Handshaking/1_Handshaking_Handshaking_1_243 393.00 198.00 426.00 245.00 
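
A minimal conversion sketch, assuming wider_face_train_bbx_gt.txt has been extracted into the current directory; it reads each image's face count and boxes, converts xmin ymin width height to xmin ymin xmax ymax, and writes one line per image in the format shown above:

def convert_gt(gt_path, out_path):
    with open(gt_path, encoding='utf-8') as f:
        lines = f.read().splitlines()
    i = 0
    with open(out_path, 'w', encoding='utf-8') as out:
        while i < len(lines):
            image_path = lines[i]
            num_faces = int(lines[i + 1])
            boxes = []
            for j in range(num_faces):
                x, y, w, h = (float(v) for v in lines[i + 2 + j].split()[:4])
                boxes.extend([x, y, x + w, y + h])
            if boxes:
                name = image_path.rsplit('.', 1)[0]  # the generated list drops the .jpg extension
                out.write(name + ' ' + ' '.join('%.2f' % v for v in boxes) + '\n')
            # Images with 0 faces still carry one placeholder coordinate line
            i += 2 + max(num_faces, 1)


convert_gt('wider_face_train_bbx_gt.txt', 'wider_face_train.txt')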

EMore Dataset

Download link: https://pan.baidu.com/s/1eXohwNBHbbKXh5KHyItVhQ

train.rec stores the training data in MXNet RecordIO format. The following code extracts the images and saves them locally, grouping images of the same person into the same folder:

import cv2
import mxnet as mx
from pathlib import Path
from tqdm import tqdm


def load_mx_rec(rec_path):
    save_path = rec_path / 'images'
    if not save_path.exists():
        save_path.mkdir()
    # Open the RecordIO pair (train.idx / train.rec) for reading
    imgrec = mx.recordio.MXIndexedRecordIO(str(rec_path / 'train.idx'), str(rec_path / 'train.rec'), 'r')
    # Record 0 is a header record whose label holds the index of the last image record
    img_info = imgrec.read_idx(0)
    header, _ = mx.recordio.unpack(img_info)
    max_idx = int(header.label[0])
    for idx in tqdm(range(1, max_idx)):
        img_info = imgrec.read_idx(idx)
        # unpack_img decodes the record into a header and a BGR image array
        header, img = mx.recordio.unpack_img(img_info)
        label = int(header.label)  # person ID of this image
        label_path = save_path / str(label)
        if not label_path.exists():
            label_path.mkdir()
        # Save each image into the folder named after its person ID
        path = str(label_path / '{}.jpg'.format(idx))
        cv2.imwrite(path, img)


if __name__ == '__main__':
    load_mx_rec(Path('faces_emore'))
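
Once extraction finishes, each subfolder of faces_emore/images holds one person's images, so a training list of (path, label) pairs can be built by walking the directory tree. A minimal sketch, assuming the layout produced by the code above:

from pathlib import Path


def build_image_list(images_dir):
    samples = []
    for label_dir in sorted(Path(images_dir).iterdir()):
        if label_dir.is_dir():
            # The folder name is the person ID written by load_mx_rec
            for img_path in label_dir.glob('*.jpg'):
                samples.append((str(img_path), int(label_dir.name)))
    return samples


samples = build_image_list('faces_emore/images')
print('%d images, %d identities' % (len(samples), len({s[1] for s in samples})))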

CASIA-WebFace Dataset

Download link: https://pan.baidu.com/s/1OjyZRhZhl__tOvhLnXeapQ
Extraction code: nf6i

Facial landmark annotation file download link: https://download.csdn.net/download/qq_33200967/18929804

Making a Face Dataset

The following describes how to create your own face dataset. The project is open-sourced at https://github.com/yeyupiaoling/FaceDataset and proceeds in two stages:
1. First Stage: Image collection and basic cleaning.
2. Second Stage: Advanced cleaning and facial information annotation using Baidu’s face recognition service.

First Stage

The core idea is to collect Chinese celebrity names, search for images using those names, and then delete corrupted images, images without faces, and images with multiple faces, keeping only images that contain a single face.
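
The filtering rule itself is easy to sketch. The snippet below uses OpenCV's bundled Haar cascade as a stand-in face detector (the project may use a different detector); it deletes unreadable images and any image that does not contain exactly one detected face:

import os

import cv2

# OpenCV ships this cascade file with the opencv-python package
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')


def clean_folder(folder):
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        img = cv2.imread(path)
        if img is None:  # corrupted or unreadable image
            os.remove(path)
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) != 1:  # keep only images with exactly one face
            os.remove(path)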

Step 1: Collect Celebrity Names

Implemented in get_star_name.py. The core code queries a Baidu search API page by page and extracts celebrity names from the JSON response. The version below adds the missing imports and an output file handle; the file name star_name.txt is illustrative.

import json

import requests


def get_page(pages, star_name):
    params = []
    # The API returns 12 results per page; build one parameter set per page
    for i in range(0, 12 * pages + 12, 12):
        params.append({
            'resource_id': 28266,
            'from_mid': 1,
            'format': 'json',
            'ie': 'utf-8',
            'oe': 'utf-8',
            'query': '明星',  # "celebrity"
            'sort_key': '',
            'sort_type': 1,
            'stat0': '',
            'stat1': star_name,
            'stat2': '',
            'stat3': '',
            'pn': i,  # result offset
            'rn': 12})

    url = 'https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php'

    # Output file name is illustrative; append each extracted name on its own line
    with open('star_name.txt', 'a+', encoding='utf-8') as f:
        for param in params:
            try:
                res = requests.get(url, params=param, timeout=50)
                js = json.loads(res.text)
                results = js.get('data')[0].get('result')
            except Exception as e:
                print('[Error] %s' % e)
                continue

            for result in results:
                # The 'ename' field carries the celebrity's name
                f.write(result['ename'] + '\n')

Step 2: Download Images

Implemented in download_image.py. The core code searches Baidu Images with each celebrity name as the keyword and downloads the result images. The snippet below restores the imports and indentation; the final URL-logging line and the counter update are completed from context.
import os
import re
import uuid

import requests


def download_image(key_word, download_max):
    download_sum = 0
    str_gsm = '80'
    save_path = 'star_image' + '/' + key_word
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    while download_sum < download_max:
        str_pn = str(download_sum)
        # Baidu Images result page; the objURL fields hold the original image URLs
        url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&' \
              'word=' + key_word + '&pn=' + str_pn + '&gsm=' + str_gsm + '&ct=&ic=0&lm=-1&width=0&height=0'
        try:
            result = requests.get(url, timeout=30).text
            img_urls = re.findall('"objURL":"(.*?)",', result, re.S)
            if len(img_urls) < 1:
                break
            for img_url in img_urls:
                img = requests.get(img_url, timeout=30)
                # Save under a random, collision-free file name
                img_name = save_path + '/' + str(uuid.uuid1()) + '.jpg'
                with open(img_name, 'wb') as f:
                    f.write(img.content)
                # Log every downloaded URL (completion assumed from the file name)
                with open('image_url_list.txt', 'a+', encoding='utf-8') as f:
                    f.write(img_url + '\n')
                download_sum += 1
                if download_sum >= download_max:
                    break
        except Exception as e:
            print('[Error] %s' % e)
            continue

Xiaoye