Preface¶
A face dataset is essential for developing a face recognition system, so the first preparatory step is to obtain one. This chapter introduces publicly available datasets and how to build your own, laying the groundwork for the face recognition system developed in later chapters.
Public Face Datasets¶
There are many publicly available face datasets. Here, we introduce several commonly used ones.
CelebA Face Dataset¶
Official download link: https://pan.baidu.com/s/1zw0KA1iYW41Oo1xZRuHkKQ
Extraction code: zu3w
After downloading and extracting the dataset, you get three folders:
- The Anno folder stores annotation files.
- The Eval folder stores evaluation list files.
- The Img folder stores image files.
The Img folder contains three types of image files:
- img_align_celeba.zip: Face images that have been centered, cropped, and resized to a uniform 178×218 in JPG format.
- img_align_celeba_png.7z: Contains the same images as img_align_celeba.zip, but in PNG format, making the files larger.
- img_celeba.7z: Raw face images without centering or cropping.
The Anno folder contains 5 annotation files:
identity_CelebA.txt: Specifies the person ID for each image, with the format image_name person_id.
000001.jpg 2880
000002.jpg 2937
000003.jpg 8692
000004.jpg 5805
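For later steps it helps to turn this file into a mapping from person ID to image names, so that all images of one person can be grouped together. A minimal sketch (the Anno/identity_CelebA.txt path is assumed):

```python
from collections import defaultdict

# Map each person ID to the list of images of that person.
id_to_images = defaultdict(list)
with open('Anno/identity_CelebA.txt', encoding='utf-8') as f:
    for line in f:
        image_name, person_id = line.split()
        id_to_images[int(person_id)].append(image_name)

print(len(id_to_images))       # number of distinct identities
print(id_to_images[2880][:5])  # first few images of person 2880
```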
list_attr_celeba.txt: Annotates 40 facial attributes per image (e.g., hair color, glasses); 1 means the attribute is present and -1 means it is absent.
5_o_Clock_Shadow Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Big_Lips Big_Nose Black_Hair Blond_Hair Blurry Brown_Hair Bushy_Eyebrows Chubby Double_Chin Eyeglasses Goatee Gray_Hair Heavy_Makeup High_Cheekbones Male Mouth_Slightly_Open Mustache Narrow_Eyes No_Beard Oval_Face Pale_Skin Pointy_Nose Receding_Hairline Rosy_Cheeks Sideburns Smiling Straight_Hair Wavy_Hair Wearing_Earrings Wearing_Hat Wearing_Lipstick Wearing_Necklace Wearing_Necktie Young
000001.jpg -1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 1 -1 -1 1 -1 -1 -1 1 1 -1 1 -1 1 -1 -1 1
000002.jpg -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1
000003.jpg -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 1 1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1
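The attribute file can be filtered in the same way; the sketch below collects all images marked as Smiling (the file path is assumed, and the distributed file may begin with an image-count line, which the sketch skips):

```python
with open('Anno/list_attr_celeba.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

# Skip a leading image-count line if present, then read the header row.
if lines[0].strip().isdigit():
    lines = lines[1:]
attr_names = lines[0].split()
smiling_idx = attr_names.index('Smiling')

# Each data row is an image name followed by 40 values of 1 or -1.
smiling_images = [line.split()[0] for line in lines[1:]
                  if int(line.split()[1 + smiling_idx]) == 1]
print(len(smiling_images))
```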
list_bbox_celeba.txt: Annotates the position of faces in the original images, with the format image_id x_1 y_1 width height, where (x_1, y_1) is the top-left corner of the box.
image_id x_1 y_1 width height
000001.jpg 95 71 226 313
000002.jpg 72 94 221 306
000003.jpg 216 59 91 126
000004.jpg 622 257 564 781
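Because these boxes index into the original images from img_celeba.7z, they can be used to cut the faces out directly. A minimal sketch with OpenCV (directory names are illustrative; the distributed file begins with an image-count line followed by the header row shown above):

```python
import os
import cv2

os.makedirs('faces', exist_ok=True)
with open('Anno/list_bbox_celeba.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

# lines[0] is the image count, lines[1] the header row; data rows follow.
for line in lines[2:12]:
    image_id, x, y, w, h = line.split()
    img = cv2.imread('img_celeba/' + image_id)  # original, uncropped image
    if img is None:
        continue
    x, y, w, h = int(x), int(y), int(w), int(h)
    cv2.imwrite(os.path.join('faces', image_id), img[y:y + h, x:x + w])
```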
list_landmarks_align_celeba.txt: Annotates 5 facial landmarks (eye centers, nose tip, mouth corners) in the aligned images.
lefteye_x lefteye_y righteye_x righteye_y nose_x nose_y leftmouth_x leftmouth_y rightmouth_x rightmouth_y
000001.jpg 69 109 106 113 77 142 73 152 108 154
000002.jpg 69 110 107 112 81 135 70 151 108 153
000003.jpg 76 112 104 106 108 128 74 156 98 158
000004.jpg 72 113 108 108 101 138 71 155 101 151
list_landmarks_celeba.txt: Annotates facial landmarks in the original images.
lefteye_x lefteye_y righteye_x righteye_y nose_x nose_y leftmouth_x leftmouth_y rightmouth_x rightmouth_y
000001.jpg 165 184 244 176 196 249 194 271 266 260
000002.jpg 140 204 220 204 168 254 146 289 226 289
000003.jpg 244 104 264 105 263 121 235 134 251 140
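A quick way to sanity-check these annotations is to draw the five points onto an image. A minimal sketch for the original-image landmarks (paths are illustrative; like the other annotation files, this one starts with a count line and a header row):

```python
import cv2

with open('Anno/list_landmarks_celeba.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

# Skip the count line and the header row; take the first data row.
parts = lines[2].split()
image_id, coords = parts[0], list(map(int, parts[1:]))

img = cv2.imread('img_celeba/' + image_id)
# The ten values are five (x, y) pairs: eyes, nose tip, mouth corners.
for px, py in zip(coords[0::2], coords[1::2]):
    cv2.circle(img, (px, py), 3, (0, 255, 0), -1)
cv2.imwrite('landmarks_check.jpg', img)
```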
LFW Dataset¶
Dataset download link: http://mmlab.ie.cuhk.edu.hk/archive/CNN/data/train.zip
After unzipping, the LFW dataset contains 2 folders and 2 text files:
- lfw_5590 and net_7876: Folders storing face images.
- testImageList.txt and trainImageList.txt: Annotation files in which each line gives an image path, four bounding-box values, and the coordinates of 5 facial landmarks.
lfw_5590\Aaron_Eckhart_0001.jpg 84 161 92 169 106.250000 107.750000 146.750000 112.250000 125.250000 142.750000 105.250000 157.750000 139.750000 161.750000
lfw_5590\Aaron_Guiel_0001.jpg 85 172 93 181 100.250000 111.250000 145.750000 116.750000 124.250000 136.750000 92.750000 159.750000 138.750000 163.750000
lfw_5590\Aaron_Peirsol_0001.jpg 88 173 94 179 106.750000 113.250000 146.750000 113.250000 129.250000 139.750000 108.250000 153.250000 146.750000 152.750000
lfw_5590\Aaron_Pena_0001.jpg 67 176 83 192 101.750000 116.750000 145.250000 103.750000 125.250000 136.750000 119.750000 163.750000 146.250000 155.750000
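Each line splits into the path, four box values, and five (x, y) landmark pairs. In the samples above the second box value is always larger than the first and the fourth larger than the third, which suggests a (left, right, top, bottom) ordering; verify this against the dataset's own readme. A minimal parsing sketch under that assumption:

```python
def parse_annotation(line):
    """Parse one line of trainImageList.txt / testImageList.txt."""
    parts = line.split()
    path = parts[0].replace('\\', '/')  # annotations use Windows separators
    # Box values read as (left, right, top, bottom), per the samples above.
    left, right, top, bottom = map(int, parts[1:5])
    landmarks = [(float(x), float(y))
                 for x, y in zip(parts[5::2], parts[6::2])]
    return path, (left, top, right, bottom), landmarks

with open('trainImageList.txt', encoding='utf-8') as f:
    for line in f:
        path, box, points = parse_annotation(line)
```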
WIDER Face Dataset¶
Official image download link: http://pan.baidu.com/s/1c0DfSmW
Annotation file download link: http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/bbx_annotation/wider_face_split.zip
WIDER_train.zip contains face images, with each image possibly containing multiple faces.
The wider_face_train_bbx_gt.txt file annotates face positions. For each image, the first line is the image path, the second line is the number of faces, and each following line describes one face as xmin ymin width height followed by several attribute flags (the trailing zeros in the samples below).
0--Parade/0_Parade_marchingband_1_849.jpg
1
449 330 122 149 0 0 0 0 0 0
0--Parade/0_Parade_Parade_0_904.jpg
1
361 98 263 339 0 0 0 0 0 0
From this annotation file, a wider_face_train.txt with the format xmin ymin xmax ymax can be generated. Some images have multiple box annotations (since they contain multiple faces), unlike the previous datasets where each image has only one face; a conversion sketch follows the samples below.
1--Handshaking/1_Handshaking_Handshaking_1_288 336.00 82.00 448.00 244.00
1--Handshaking/1_Handshaking_Handshaking_1_924 271.42 425.64 508.92 681.64
1--Handshaking/1_Handshaking_Handshaking_1_866 364.42 894.49 451.76 993.88 545.13 771.01 623.44 879.44 186.73 825.22 256.00 945.69
1--Handshaking/1_Handshaking_Handshaking_1_164 223.46 100.68 351.16 275.03 532.87 100.68 665.48 279.94
1--Handshaking/1_Handshaking_Handshaking_1_243 393.00 198.00 426.00 245.00
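A conversion along these lines reads each record (path, face count, then one box line per face) and rewrites x y w h as xmin ymin xmax ymax. A minimal sketch, with file paths assumed; the exact path formatting in the project's own file may differ (the samples above, for instance, omit the .jpg extension):

```python
def convert_bbx_gt(gt_path, out_path):
    with open(gt_path, encoding='utf-8') as f, \
         open(out_path, 'w', encoding='utf-8') as out:
        lines = f.read().splitlines()
        i = 0
        while i < len(lines):
            image_path = lines[i]
            num_faces = int(lines[i + 1])
            boxes = []
            for j in range(num_faces):
                # Each face line: x y w h followed by attribute flags.
                x, y, w, h = map(float, lines[i + 2 + j].split()[:4])
                boxes.append('%.2f %.2f %.2f %.2f' % (x, y, x + w, y + h))
            out.write(image_path + ' ' + ' '.join(boxes) + '\n')
            # Entries with 0 faces still carry one placeholder box line.
            i += 2 + max(num_faces, 1)

convert_bbx_gt('wider_face_train_bbx_gt.txt', 'wider_face_train.txt')
```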
EMore Dataset¶
Download link: https://pan.baidu.com/s/1eXohwNBHbbKXh5KHyItVhQ
train.rec stores the training images in MXNet RecordIO format. The following code extracts the images and saves them locally, placing images of the same person in the same folder:
```python
import cv2
from PIL import ImageFile
from pathlib import Path
import mxnet as mx
from tqdm import tqdm

# Allow Pillow to load truncated images without raising an error.
ImageFile.LOAD_TRUNCATED_IMAGES = True


def load_mx_rec(rec_path):
    save_path = rec_path / 'images'
    if not save_path.exists():
        save_path.mkdir()
    # Open the RecordIO pair (train.idx + train.rec) for random access.
    imgrec = mx.recordio.MXIndexedRecordIO(str(rec_path / 'train.idx'), str(rec_path / 'train.rec'), 'r')
    # Record 0 is a header whose label holds the total number of records.
    img_info = imgrec.read_idx(0)
    header, _ = mx.recordio.unpack(img_info)
    max_idx = int(header.label[0])
    for idx in tqdm(range(1, max_idx)):
        img_info = imgrec.read_idx(idx)
        header, img = mx.recordio.unpack_img(img_info)
        label = int(header.label)  # person ID of this image
        # One folder per person, named by the person ID.
        label_path = save_path / str(label)
        if not label_path.exists():
            label_path.mkdir()
        cv2.imwrite(str(label_path / '{}.jpg'.format(idx)), img)


if __name__ == '__main__':
    load_mx_rec(Path('faces_emore'))
```
CASIA-WebFace Dataset¶
Download link: https://pan.baidu.com/s/1OjyZRhZhl__tOvhLnXeapQ
Extraction code: nf6i
Facial landmark annotation file download link: https://download.csdn.net/download/qq_33200967/18929804
Making a Face Dataset¶
The following sections explain how to create your own face dataset. The project is open-sourced at https://github.com/yeyupiaoling/FaceDataset and has two stages:
1. First Stage: Image collection and basic cleaning.
2. Second Stage: Advanced cleaning and facial information annotation using Baidu’s face recognition service.
First Stage¶
The core idea is to collect Chinese celebrity names, use them as keywords to search for and download images, and then delete corrupted images and any image that does not contain exactly one face (a cleaning sketch appears after the download code below).
Step 1: Collect Celebrity Names¶
Implemented in get_star_name.py. The core code queries a Baidu search API and extracts celebrity names from the JSON response.
```python
import json
import requests

# Output file for the collected names (opened here so the snippet is
# self-contained; the file name is illustrative).
f = open('star_name.txt', 'a+', encoding='utf-8')


def get_page(pages, star_name):
    params = []
    # Build one query per result page, 12 results per page.
    for i in range(0, 12 * pages + 12, 12):
        params.append({
            'resource_id': 28266,
            'from_mid': 1,
            'format': 'json',
            'ie': 'utf-8',
            'oe': 'utf-8',
            'query': '明星',  # "celebrity"
            'sort_key': '',
            'sort_type': 1,
            'stat0': '',
            'stat1': star_name,
            'stat2': '',
            'stat3': '',
            'pn': i,    # result offset
            'rn': 12})  # results per page
    url = 'https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php'
    for param in params:
        try:
            res = requests.get(url, params=param, timeout=50)
            js = json.loads(res.text)
            results = js.get('data')[0].get('result')
        except AttributeError as e:
            print('[Error] %s' % e)
            continue
        # Each result's 'ename' field is a celebrity name.
        for result in results:
            img_name = result['ename']
            f.write(img_name + '\n')
```
Step 2: Download Images¶
Implemented in download_image.py. The core code downloads images from Baidu Images using celebrity names.
```python
import os
import re
import uuid
import requests


def download_image(key_word, download_max):
    download_sum = 0
    str_gsm = '80'
    save_path = 'star_image' + '/' + key_word
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    while download_sum < download_max:
        str_pn = str(download_sum)
        # Baidu image-search results page for the keyword at this offset.
        url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&' \
              'word=' + key_word + '&pn=' + str_pn + '&gsm=' + str_gsm + '&ct=&ic=0&lm=-1&width=0&height=0'
        try:
            result = requests.get(url, timeout=30).text
            # Extract the original image URLs from the page source.
            img_urls = re.findall('"objURL":"(.*?)",', result, re.S)
            if len(img_urls) < 1:
                break
            for img_url in img_urls:
                img = requests.get(img_url, timeout=30)
                img_name = save_path + '/' + str(uuid.uuid1()) + '.jpg'
                with open(img_name, 'wb') as f:
                    f.write(img.content)
                # Keep a record of every downloaded URL.
                with open('image_url_list.txt', 'a+', encoding='utf-8') as f:
                    f.write(img_url + '\n')
                download_sum += 1
                if download_sum >= download_max:
                    break
        except Exception as e:
            # Skip pages or images that fail to download.
            print('[Error] %s' % e)
            continue
```
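After downloading, the first stage also calls for removing corrupted images and any image that does not contain exactly one face. The project's own cleaning script is not reproduced here; the sketch below illustrates the idea with OpenCV's bundled Haar cascade detector (the detector choice and its parameters are assumptions, not the project's actual code):

```python
import os
import cv2

def clean_images(image_dir):
    # Haar cascade face detector shipped with OpenCV.
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    for root, _, files in os.walk(image_dir):
        for name in files:
            path = os.path.join(root, name)
            img = cv2.imread(path)
            if img is None:          # corrupted or unreadable image
                os.remove(path)
                continue
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, 1.3, 5)
            if len(faces) != 1:      # keep only single-face images
                os.remove(path)

clean_images('star_image')
```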