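Preface¶

Building a face recognition system requires a face dataset, so the first step of preparation is obtaining one. This chapter covers both public face datasets and how to build your own, laying the groundwork for the face recognition system developed later.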

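Public Face Datasets¶

There are many public face datasets. This chapter introduces a few of the most commonly used ones.

CelebA Face Dataset¶

Officially provided download link: https://pan.baidu.com/s/1zw0KA1iYW41Oo1xZRuHkKQ (password: zu3w)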

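After downloading, the dataset contains three folders: Anno holds the annotation files, Eval holds the evaluation list file, and Img holds the image files.
The Img folder contains three kinds of image archives:
- img_align_celeba.zip contains JPG images in which the face has been centered and cropped, all at a uniform size of 178*218;
- img_align_celeba_png.7z contains the same images as img_align_celeba.zip, except that they are stored as PNG, so the files are much larger;
- img_celeba.7z contains the original face images, without centering, cropping, or any other processing.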

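The Anno folder contains five annotation files:
1. identity_CelebA.txt gives the identity label of each image. Each line has the format "image_name face_id", separated by a space: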

000001.jpg 2880
000002.jpg 2937
000003.jpg 8692
000004.jpg 5805
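
Since face recognition training usually needs the images grouped by person, the identity file can be turned into a mapping from face ID to image names. The following is a minimal sketch that assumes each line holds exactly two whitespace-separated fields, as in the sample above; the function name group_by_identity is made up for illustration.

```python
from collections import defaultdict


def group_by_identity(lines):
    """Map each face ID to the list of image names labeled with it."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        image_name, face_id = parts
        groups[int(face_id)].append(image_name)
    return dict(groups)


# The four sample lines shown above
sample = """000001.jpg 2880
000002.jpg 2937
000003.jpg 8692
000004.jpg 5805""".splitlines()

groups = group_by_identity(sample)
print(groups[2880])
```

In practice the lines would come from iterating over the opened identity_CelebA.txt file instead of the inline sample.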

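2. list_attr_celeba.txt annotates face attributes, for example whether the person has black hair or wears eyeglasses. The header row lists the 40 attribute names; in each following row, 1 means the attribute is present and -1 means it is absent.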

5_o_Clock_Shadow Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Big_Lips Big_Nose Black_Hair Blond_Hair Blurry Brown_Hair Bushy_Eyebrows Chubby Double_Chin Eyeglasses Goatee Gray_Hair Heavy_Makeup High_Cheekbones Male Mouth_Slightly_Open Mustache Narrow_Eyes No_Beard Oval_Face Pale_Skin Pointy_Nose Receding_Hairline Rosy_Cheeks Sideburns Smiling Straight_Hair Wavy_Hair Wearing_Earrings Wearing_Hat Wearing_Lipstick Wearing_Necklace Wearing_Necktie Young 
000001.jpg -1  1  1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1  1  1 -1  1 -1 -1  1 -1 -1  1 -1 -1 -1  1  1 -1  1 -1  1 -1 -1  1
000002.jpg -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1  1 -1  1 -1 -1  1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1  1
000003.jpg -1 -1 -1 -1 -1 -1  1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1  1  1 -1 -1  1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1  1
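
To make the 1 / -1 encoding concrete, here is a rough sketch that turns the attribute rows into dictionaries of booleans. The sample uses a shortened, made-up header of five attributes to keep it readable, and the function name parse_attributes is hypothetical. It also assumes that the file may start with a line holding only the image count, which the sketch skips if present.

```python
def parse_attributes(lines):
    """Return {image_name: {attribute: bool}} from attribute-file lines."""
    rows = [ln.split() for ln in lines if ln.strip()]
    # Assumption: an optional first line holding only the image count is skipped.
    if len(rows[0]) == 1 and rows[0][0].isdigit():
        rows = rows[1:]
    header, records = rows[0], rows[1:]
    result = {}
    for record in records:
        name, values = record[0], record[1:]
        if len(values) != len(header):
            raise ValueError(
                "%s: expected %d values, got %d" % (name, len(header), len(values))
            )
        # 1 means the attribute is present, -1 means it is absent
        result[name] = {attr: int(v) == 1 for attr, v in zip(header, values)}
    return result


# Shortened, made-up sample (the real file has 40 attribute columns)
sample = [
    "Bald Eyeglasses Male Smiling Young",
    "000001.jpg -1 -1 -1  1  1",
    "000002.jpg -1  1  1 -1  1",
]
attrs = parse_attributes(sample)
smiling = [name for name, a in attrs.items() if a["Smiling"]]
```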

3. list_bbox_celeba.txt annotates the position of the face in each image. Each row has the fields image_id x_1 y_1 width height:

image_id x_1 y_1 width height
000001.jpg    95  71 226 313
000002.jpg    72  94 221 306
000003.jpg   216  59  91 126
000004.jpg   622 257 564 781
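
Many detection pipelines expect corner coordinates instead of a top-left point plus size. As a small sketch, the rows above can be converted as follows; the helper names are made up, and treating the bottom-right corner as x_1 + width, y_1 + height (without subtracting 1) is an assumption.

```python
def xywh_to_corners(x_1, y_1, width, height):
    """Convert (x_1, y_1, width, height) to (xmin, ymin, xmax, ymax)."""
    return x_1, y_1, x_1 + width, y_1 + height


def parse_bbox_line(line):
    """Parse one data row of list_bbox_celeba.txt into (image_id, corner box)."""
    image_id, *numbers = line.split()
    x_1, y_1, width, height = (int(v) for v in numbers)
    return image_id, xywh_to_corners(x_1, y_1, width, height)


image_id, box = parse_bbox_line("000001.jpg    95  71 226 313")
print(image_id, box)
```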

4. list_landmarks_align_celeba.txt annotates the facial landmarks of the aligned images. There are five landmarks: the two eyes, the nose, and the two mouth corners.

lefteye_x lefteye_y righteye_x righteye_y nose_x nose_y leftmouth_x leftmouth_y rightmouth_x rightmouth_y
000001.jpg 69  109  106  113   77  142   73  152  108  154
000002.jpg 69  110  107  112   81  135   70  151  108  153
000003.jpg 76  112  104  106  108  128   74  156   98  158
000004.jpg 72  113  108  108  101  138   71  155  101  151

5. list_landmarks_celeba.txt gives the positions of the same facial landmarks in the original images.

lefteye_x lefteye_y righteye_x righteye_y nose_x nose_y leftmouth_x leftmouth_y rightmouth_x rightmouth_y
000001.jpg 165  184  244  176  196  249  194  271  266  260
000002.jpg 140  204  220  204  168  254  146  289  226  289
000003.jpg 244  104  264  105  263  121  235  134  251  140
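
Both landmark files share the same column layout, so one small parser can serve for either. This is only a sketch: the names LANDMARK_NAMES and parse_landmark_line are made up, and the order of the points is taken from the header row shown above.

```python
# Point order taken from the header row of the landmark files
LANDMARK_NAMES = ("lefteye", "righteye", "nose", "leftmouth", "rightmouth")


def parse_landmark_line(line):
    """Parse one data row into (image_name, {landmark_name: (x, y)})."""
    image_name, *numbers = line.split()
    coords = [int(v) for v in numbers]
    if len(coords) != 2 * len(LANDMARK_NAMES):
        raise ValueError("expected 10 coordinates, got %d" % len(coords))
    points = {
        name: (coords[2 * i], coords[2 * i + 1])
        for i, name in enumerate(LANDMARK_NAMES)
    }
    return image_name, points


name, points = parse_landmark_line(
    "000001.jpg 165  184  244  176  196  249  194  271  266  260"
)
print(points["nose"])
```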

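LFW Dataset¶

Dataset download link: http://mmlab.ie.cuhk.edu.hk/archive/CNN/data/train.zip

Extracting the LFW dataset yields two folders and two text files.
- The lfw_5590 and net_7876 folders both hold face images.
- testImageList.txt and trainImageList.txt are the annotation files. Each line contains the image file path, the coordinates of the face bounding box, and the coordinates of five facial landmarks.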

lfw_5590\Aaron_Eckhart_0001.jpg 84 161 92 169 106.250000 107.750000 146.750000 112.250000 125.250000 142.750000 105.250000 157.750000 139.750000 161.750000
lfw_5590\Aaron_Guiel_0001.jpg 85 172 93 181 100.250000 111.250000 145.750000 116.750000 124.250000 136.750000 92.750000 159.750000 138.750000 163.750000
lfw_5590\Aaron_Peirsol_0001.jpg 88 173 94 179 106.750000 113.250000 146.750000 113.250000 129.250000 139.750000 108.250000 153.250000 146.750000 152.750000
lfw_5590\Aaron_Pena_0001.jpg 67 176 83 192 101.750000 116.750000 145.250000 103.750000 125.250000 136.750000 119.750000 163.750000 146.250000 155.750000
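
The lines above can be split into a path, a box, and five (x, y) points with a few lines of Python. Note two assumptions in this sketch: the four integers are read as left, right, top, bottom (which is how the sample values appear to be ordered), and the Windows-style backslash in the path is replaced so the path also works on other systems. The function name is made up.

```python
def parse_face_point_line(line):
    """Parse one line of trainImageList.txt / testImageList.txt."""
    parts = line.split()
    # The list uses Windows-style separators; normalize them
    image_path = parts[0].replace("\\", "/")
    # Assumption: the four integers are ordered left, right, top, bottom
    left, right, top, bottom = (int(v) for v in parts[1:5])
    values = [float(v) for v in parts[5:15]]
    # Pair up the ten values as five (x, y) landmarks
    landmarks = list(zip(values[0::2], values[1::2]))
    return image_path, (left, top, right, bottom), landmarks


sample_line = (
    r"lfw_5590\Aaron_Eckhart_0001.jpg 84 161 92 169 "
    "106.250000 107.750000 146.750000 112.250000 125.250000 142.750000 "
    "105.250000 157.750000 139.750000 161.750000"
)
path, box, landmarks = parse_face_point_line(sample_line)
print(path, box)
```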

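WIDER Face Dataset¶

Officially provided image download link: http://pan.baidu.com/s/1c0DfSmW
Annotation file download link: http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/bbx_annotation/wider_face_split.zip

Extracting WIDER_train.zip yields the face images; each image may contain multiple faces.

wider_face_train_bbx_gt.txt also annotates where the faces are in each image, but with a different layout. For each image, the first line is the image path, the second line is the number of annotated faces (an image may contain several), and each line from the third onward is one face annotation that begins with xmin ymin width height. The remaining six values on each annotation line are per-face attributes (blur, expression, illumination, invalid, occlusion, pose).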

0--Parade/0_Parade_marchingband_1_849.jpg
1
449 330 122 149 0 0 0 0 0 0 
0--Parade/0_Parade_Parade_0_904.jpg
1
361 98 263 339 0 0 0 0 0 0
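
Because each record spans a variable number of lines, reading this file takes a small state machine instead of a per-line split. The sketch below keeps only the first four numbers of each annotation line. It assumes that an image with a face count of 0 is still followed by one placeholder row; the function name parse_wider_gt is made up.

```python
def parse_wider_gt(lines):
    """Return {image_path: [(xmin, ymin, width, height), ...]}."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    annotations = {}
    i = 0
    while i < len(lines):
        image_path = lines[i]
        count = int(lines[i + 1])
        # Assumption: a record with 0 faces still has one placeholder row
        n_rows = max(count, 1)
        rows = lines[i + 2:i + 2 + n_rows]
        boxes = []
        for row in rows[:count]:
            xmin, ymin, width, height = (int(v) for v in row.split()[:4])
            boxes.append((xmin, ymin, width, height))
        annotations[image_path] = boxes
        i += 2 + n_rows
    return annotations


sample = """0--Parade/0_Parade_marchingband_1_849.jpg
1
449 330 122 149 0 0 0 0 0 0
0--Parade/0_Parade_Parade_0_904.jpg
1
361 98 263 339 0 0 0 0 0 0""".splitlines()

annotations = parse_wider_gt(sample)
```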

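From the annotation file above you can generate wider_face_train.txt, in which each box is written as xmin ymin xmax ymax. Some lines carry several boxes because images in this dataset often contain multiple faces, unlike the datasets above, where each image contains a single face.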

1--Handshaking/1_Handshaking_Handshaking_1_288 336.00 82.00 448.00 244.00 
1--Handshaking/1_Handshaking_Handshaking_1_924 271.42 425.64 508.92 681.64 
1--Handshaking/1_Handshaking_Handshaking_1_866 364.42 894.49 451.76 993.88 545.13 771.01 623.44 879.44 186.73 825.22 256.00 945.69 
1--Handshaking/1_Handshaking_Handshaking_1_164 223.46 100.68 351.16 275.03 532.87 100.68 665.48 279.94 
1--Handshaking/1_Handshaking_Handshaking_1_243 393.00 198.00 426.00 245.00
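
As a rough illustration of that conversion, the function below formats one image's boxes as a single line in the style shown above. This is a sketch, not the script that produced wider_face_train.txt: dropping the file extension and printing two decimals are inferred from the sample lines, computing xmax as xmin + width is an assumption, and the name to_corner_line is made up.

```python
import os


def to_corner_line(image_path, boxes):
    """Format (xmin, ymin, width, height) boxes as one 'xmin ymin xmax ymax' line."""
    # The sample lines list the image path without its extension
    stem = os.path.splitext(image_path)[0]
    fields = [stem]
    for xmin, ymin, width, height in boxes:
        corners = (xmin, ymin, xmin + width, ymin + height)
        fields.extend("%.2f" % v for v in corners)
    return " ".join(fields)


line = to_corner_line(
    "0--Parade/0_Parade_marchingband_1_849.jpg", [(449, 330, 122, 149)]
)
print(line)
```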

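emore Dataset¶

Download link: https://pan.baidu.com/s/1eXohwNBHbbKXh5KHyItVhQ

train.rec holds the training data. The code below extracts the images and saves them locally, placing all images of the same person in the same folder.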

from pathlib import Path

import cv2
import mxnet as mx
from tqdm import tqdm


def load_mx_rec(rec_path):
    # Extracted images are written to <rec_path>/images/<label>/<idx>.jpg
    save_path = rec_path / 'images'
    save_path.mkdir(exist_ok=True)
    imgrec = mx.recordio.MXIndexedRecordIO(str(rec_path / 'train.idx'), str(rec_path / 'train.rec'), 'r')
    # Record 0 is a header record; its first label value is the upper bound of the image indices
    img_info = imgrec.read_idx(0)
    header, _ = mx.recordio.unpack(img_info)
    max_idx = int(header.label[0])
    for idx in tqdm(range(1, max_idx)):
        img_info = imgrec.read_idx(idx)
        # Decode the record into its header (with the label) and the image array
        header, img = mx.recordio.unpack_img(img_info)
        label = int(header.label)
        # One folder per identity
        label_path = save_path / str(label)
        label_path.mkdir(exist_ok=True)
        path = str(label_path / '{}.jpg'.format(idx))
        cv2.imwrite(path, img)


if __name__ == '__main__':
    load_mx_rec(Path('faces_emore'))

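CASIA-WebFace Dataset¶

Download link: https://pan.baidu.com/s/1OjyZRhZhl__tOvhLnXeapQ (extraction code: nf6i)

Download link for the facial landmark annotation file: https://download.csdn.net/download/qq_33200967/18929804

Building Your Own Face Dataset¶

Next we show how to build your own face dataset. The project is open source at https://github.com/yeyupiaoling/FaceDataset . It is divided into two stages: stage one acquires face images and does simple cleaning, and stage two does advanced cleaning and annotates the face information. Annotation and cleaning use Baidu's face recognition service.

Stage One¶

The core idea of crawling face images is to obtain the names of Chinese celebrities, use each name as an image-search keyword to download images, and then delete images that were damaged during download as well as images that contain no face or too many faces (we keep only images that contain exactly one face).

First, get the names of Chinese celebrities. This is implemented mainly in get_star_name.py, and the core code is shown below. The names obtained are not guaranteed to be 100% correct, so a manual check may be needed.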

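import json

import requests


# Fetch celebrity names and save them to a file.
# `f` is a writable text file handle opened outside this excerpt.
def get_page(pages, star_name):
    params = []
    # Build the query parameters for each page, including the paging offset and the celebrity's region
    for i in range(0, 12 * pages + 12, 12):
        params.append({
            'resource_id': 28266,
            'from_mid': 1,
            'format': 'json',
            'ie': 'utf-8',
            'oe': 'utf-8',
            'query': '明星',
            'sort_key': '',
            'sort_type': 1,
            'stat0': '',
            'stat1': star_name,
            'stat2': '',
            'stat3': '',
            'pn': i,
            'rn': 12})

    # Baidu endpoint that returns celebrity names
    url = 'https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php'

    # Request each page and extract the names
    for param in params:
        try:
            # Send the request
            res = requests.get(url, params=param, timeout=50)
            # Parse the response body as JSON
            js = json.loads(res.text)
            # Pull the celebrity records out of the JSON
            results = js.get('data')[0].get('result') or []
        except (AttributeError, IndexError, TypeError, ValueError, requests.RequestException) as e:
            print('[Error] An error occurred: %s' % e)
            continue

        # Write one celebrity name per line
        for result in results:
            img_name = result['ename']
            f.write(img_name + '\n')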

Next, download images from the web using the celebrity names. This is implemented mainly in download_image.py; the core snippet is shown below.

import os
import re
import uuid

import requests


# Download images from Baidu image search
def download_image(key_word, download_max):
    download_sum = 0
    str_gsm = '80'
    # Store each celebrity's images in a separate folder
    save_path = 'star_image' + '/' + key_word
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    # Stop once the requested number of downloads has been reached
    while download_sum < download_max:
        str_pn = str(download_sum)
        # Build the Baidu image search URL
        url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&' \
              'word=' + key_word + '&pn=' + str_pn + '&gsm=' + str_gsm + '&ct=&ic=0&lm=-1&width=0&height=0'
        try:
            # Fetch the source of the current results page
            result = requests.get(url, timeout=30).text
            # Extract the image URLs on this page
            img_urls = re.findall('"objURL":"(.*?)",', result, re.S)
            if len(img_urls) < 1:
                break
            # Download the images one by one
            for img_url in img_urls:
                # Fetch the image content
                img = requests.get(img_url, timeout=30)
                img_name = save_path + '/' + str(uuid.uuid1()) + '.jpg'
                # Save the image
                with open(img_name, 'wb') as f:
                    f.write(img.content)
                # Record which URL each saved file came from
                with open('image_url_list.txt', 'a+', encoding='utf-8') as f:
                    f.write(img_name + '\t' + img_url + '\n')
                download_sum += 1
                if download_sum >= download_max:
                    break
        except Exception:
            # Count a failed attempt so the loop always makes progress
            download_sum += 1
            continue
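
The URL extraction step is the part of this snippet that is easiest to try in isolation. Here is a small sketch that applies the same regular expression to a made-up page fragment (the fragment and the example.com URLs are invented for illustration, and real result pages may differ). It also drops duplicate URLs while keeping their order, which the snippet above does not do.

```python
import re

# Same pattern as in the snippet above
OBJ_URL_PATTERN = re.compile(r'"objURL":"(.*?)",', re.S)


def extract_image_urls(page_source):
    """Return the image URLs found in a results page, without duplicates."""
    matches = OBJ_URL_PATTERN.findall(page_source)
    # dict.fromkeys keeps the first occurrence of each URL, in order
    return list(dict.fromkeys(matches))


# Made-up fragment for illustration only
fragment = (
    '{"objURL":"http://example.com/a.jpg","fromURL":"x"},'
    '{"objURL":"http://example.com/b.png","fromURL":"y"},'
    '{"objURL":"http://example.com/a.jpg","fromURL":"z"}'
)
urls = extract_image_urls(fragment)
```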

After the download finishes, many of the images are corrupted and need to be deleted. This is implemented mainly in delete_error_image.py; the core snippet is shown below.

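import imghdr  # note: removed from the standard library in Python 3.13
import os

import numpy
from PIL import Image
from tqdm import tqdm


# Delete images that are not in JPEG or PNG format
def delete_error_image(father_path):
    # Collect every file under the parent directory
    image_paths = []
    for root, dirs, files in os.walk(father_path):
        for file in files:
            image_paths.append(os.path.join(root, file))
    for image in tqdm(image_paths):
        try:
            # Detect the image type from the file contents
            image_type = imghdr.what(image)
            # Delete the file if it is neither JPEG nor PNG
            if image_type not in ('jpeg', 'png'):
                os.remove(image)
                continue
            # Delete grayscale images (a 2-D array has no channel axis)
            with Image.open(image) as im:
                img = numpy.array(im)
            if len(img.shape) == 2:
                os.remove(image)
        except Exception:
            # Anything that cannot be read is treated as corrupted
            os.remove(image)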
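
The imghdr module used above is no longer part of the standard library from Python 3.13 onward. As one possible stand-in, the type check can be done by looking at the first bytes of the file. This sketch is not equivalent to imghdr (its JPEG test is looser, for example), and the function names are made up.

```python
# Standard file signatures ("magic bytes")
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_type(header):
    """Return 'png', 'jpeg', or None based on the leading bytes of a file."""
    if header.startswith(PNG_SIGNATURE):
        return "png"
    if header.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


def sniff_image_file(path):
    """Read just enough of the file to identify its type."""
    with open(path, "rb") as fh:
        return sniff_image_type(fh.read(8))


print(sniff_image_type(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```

A file that passes this check can still be truncated or damaged, so opening it with an image library, as the snippet above does, remains useful.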

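A downloaded image may contain no face, or it may contain several faces, so such images have to be deleted. This is implemented mainly in delete_more_than_one.py; the key snippet is shown below.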

import os


# Delete images that contain no face or more than one face
def delete_image(result, image_path):
    try:
        face_num = int(result['result']['face_num'])
        if face_num != 1:
            os.remove(image_path)
        else:
            # Also delete images that are confidently classified as cartoon faces
            face_type = result['result']['face_list'][0]['face_type']['type']
            probability = result['result']['face_list'][0]['face_type']['probability']
            if face_type == 'cartoon' and probability > 0.8:
                os.remove(image_path)
    except Exception:
        # A missing or malformed result means no usable face was detected
        os.remove(image_path)
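
To experiment with this rule without deleting anything, the decision can be separated from the file removal. In the sketch below, the shape of the result dictionary is inferred only from the keys the snippet above reads, not from the full service response, and the name should_delete and its cartoon_threshold parameter are made up.

```python
def should_delete(result, cartoon_threshold=0.8):
    """Return True if the image should be removed according to the rule above."""
    try:
        data = result["result"]
        # Keep only images with exactly one detected face
        if int(data["face_num"]) != 1:
            return True
        face_type = data["face_list"][0]["face_type"]
        # Reject faces confidently classified as cartoons
        return (
            face_type["type"] == "cartoon"
            and face_type["probability"] > cartoon_threshold
        )
    except (KeyError, IndexError, TypeError, ValueError):
        # A missing or malformed result is treated as "no usable face"
        return True


# Hypothetical result dictionaries with only the fields used here
human = {"result": {"face_num": 1, "face_list": [
    {"face_type": {"type": "human", "probability": 0.99}}]}}
cartoon = {"result": {"face_num": 1, "face_list": [
    {"face_type": {"type": "cartoon", "probability": 0.95}}]}}
crowd = {"result": {"face_num": 3, "face_list": []}}
```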

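Stage Two¶

Stage two covers advanced cleaning and the annotation of face information. First, within each folder, the person who appears in the most images is identified, and one of those images is chosen as the main face image. The main image is then compared with every other image to decide whether they show the same person; images that do not match are deleted. Next, the URLs that belong to deleted files are removed from the URL file. Finally, Baidu's face detection service annotates the cleaned images, producing the final face dataset.

First, a main image is selected from the many images. This is implemented mainly in find_same_person.py; the core snippet is shown below. The program takes a long time to run. You could instead pick the main face image by hand, although that is of course a great deal of work.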

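import os
import shutil
import time


# Find an image of the right person and save it as 0.jpg, the main reference image.
# get_file_content, match_image and if_same_person are helpers defined outside this excerpt.
def find_same_person(person_image_path):
    # List all images of this person
    image_paths = os.listdir(person_image_path)
    if '0.jpg' in image_paths:
        image_paths.remove('0.jpg')
    # Temporarily use the first image as the main image
    temp_image = os.path.join(person_image_path, image_paths[0])
    main_path = os.path.join(person_image_path, '0.jpg')
    if os.path.exists(main_path):
        os.remove(main_path)
    shutil.copyfile(temp_image, main_path)
    for main_image in image_paths:
        # Full path of the candidate main image
        main_image = os.path.join(person_image_path, main_image)
        # Base64 content of the candidate main image
        main_img = get_file_content(main_image)
        # Count how many images show the same face
        same_sum = 0
        for other_image in image_paths:
            # Full path of the image to compare against
            other_image = os.path.join(person_image_path, other_image)
            # Base64 content of the image to compare against
            other_img = get_file_content(other_image)
            # Get the comparison result
            result = match_image(main_img, other_img)
            # Sleep to stay within the API rate limit
            time.sleep(0.5)
            # Check whether both images show the same person
            if if_same_person(result):
                same_sum += 1
            # Once 6 images match, promote the candidate to main image
            if same_sum >= 6:
                if os.path.exists(main_path):
                    os.remove(main_path)
                shutil.copyfile(main_image, main_path)
                break
        # A main image has been found, so stop trying further candidates
        if same_sum >= 6:
            break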
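
The selection logic is easier to follow, and to test, when the face comparison is passed in as a function instead of calling the remote service. The following is a minimal sketch of the same voting idea under that assumption; pick_main_image, is_same_person and min_matches are names invented here. As in the snippet above, each candidate is also compared with itself.

```python
def pick_main_image(images, is_same_person, min_matches=6):
    """Return the first image that matches at least min_matches images, else None."""
    for candidate in images:
        matches = 0
        for other in images:
            if is_same_person(candidate, other):
                matches += 1
                # Enough agreement: this candidate becomes the main image
                if matches >= min_matches:
                    return candidate
    return None


# Toy data: image "a" shows person 1, the others show person 2
labels = {"a": 1, "b": 2, "c": 2, "d": 2}


def same(x, y):
    return labels[x] == labels[y]


main = pick_main_image(["a", "b", "c", "d"], same, min_matches=3)
print(main)
```

With a real comparison function every call costs one API request, so the worst case grows with the square of the number of images in the folder.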

Then delete the images that do not show the same person as the main image. This is implemented mainly in delete_not_same_person.py; the core snippet is shown below.

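        # get_file_content, match_image and delete_image are helpers defined outside this excerpt
        for name_path in tqdm(name_paths):
            image_paths = os.listdir(os.path.join(father_path, name_path))
            # Path of the main (known-correct) image, read once per person
            main_image = os.path.join(father_path, name_path, '0.jpg')
            main_img = get_file_content(main_image)
            for image_path in image_paths:
                # Image to compare against the main image
                img_path = os.path.join(father_path, name_path, image_path)
                # Base64 content of the image
                img = get_file_content(img_path)
                # Sleep to stay within the API rate limit
                time.sleep(0.5)
                # Compare the two images and delete the file if it is not the same person
                result = match_image(main_img, img)
                delete_image(result, img_path)
            # Move the cleaned folder into star_image
            shutil.move(src=os.path.join(father_path, name_path), dst=os.path.join('star_image', name_path))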

Then run delete_surplus_url.py to remove from image_url_list.txt the URLs whose images no longer exist locally.

    # Delete folders that contain too few images
    delete_too_few()
    list_path = 'image_url_list.txt'
    lines = get_txt_list(list_path)
    # Rewrite the file from scratch
    with open(list_path, 'w', encoding='utf-8') as f:
        for line in lines:
            exist = file_if_exist(line)
            # Keep only the entries whose image file still exists
            if exist:
                f.write(line)
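
The helpers get_txt_list and file_if_exist are not shown in this excerpt. As a rough idea of what the filtering step involves, the sketch below assumes each line has the form "image path, tab, image URL", which is how download_image writes image_url_list.txt. The function keep_existing is made up, and the existence check is injectable so it can be tried without touching the disk.

```python
import os


def keep_existing(lines, exists=os.path.exists):
    """Keep only the lines whose image path (the part before the tab) exists."""
    kept = []
    for line in lines:
        image_path = line.split("\t", 1)[0].strip()
        if image_path and exists(image_path):
            kept.append(line)
    return kept


# Example with a fake "file system" and made-up paths and URLs
present = {"star_image/a/1.jpg"}
lines = [
    "star_image/a/1.jpg\thttp://example.com/1.jpg\n",
    "star_image/a/2.jpg\thttp://example.com/2.jpg\n",
]
kept = keep_existing(lines, exists=present.__contains__)
```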

Finally, run annotate_image.py, which uses Baidu's face detection API to annotate the face images. The core annotation snippet is shown below.

import json
import os


# Write the detection result and the image URL to an annotation file.
# `names` (a set) and `dict_names_list` (a list) are module-level variables defined outside this excerpt.
def annotate_image(result, image_path, image_url):
    # Get the folder name, and from it how many people have been recorded so far
    father_path = os.path.dirname(image_path)
    image_name = os.path.basename(image_path).split('.')[0]
    # The folder name is the celebrity's name
    name = os.path.basename(father_path)
    # Map the name to a numeric label
    names.add(name)
    num_name = str(len(names) - 1)
    annotation_path = os.path.join('annotations', num_name)
    dict_names_list.append((name, num_name))
    annotation_file_path = os.path.join(annotation_path, str(image_name) + '.json')
    # Create the folder that holds the annotation files
    if not os.path.exists(annotation_path):
        os.makedirs(annotation_path)

    try:
        face = result['result']['face_list'][0]
        annotation = {
            'name': name,
            'image_url': image_url,
            # Age
            'age': float(face['age']),
            # Gender, male or female
            'gender': face['gender']['type'],
            # Glasses, none: no glasses, common: ordinary glasses, sun: sunglasses
            'glasses': face['glasses']['type'],
            # Expression, none: not smiling, smile: smiling, laugh: laughing
            'expression': face['expression']['type'],
            # Attractiveness score, in the range 0-100
            'beauty': float(face['beauty']),
            # Face shape: square, triangle, oval, heart, or round
            'face_shape': face['face_shape']['type'],
            # Position of the face in the image
            'location': face['location'],
            # Rotation angles of the face
            'angle': face['angle'],
            # Positions of the 72 feature points
            'landmark72': face['landmark72'],
            # Positions of 4 key points: left eye center, right eye center, nose tip, mouth center
            'landmark': face['landmark'],
        }
        # Serialize as formatted JSON
        json_format = json.dumps(annotation, sort_keys=True, indent=4, separators=(',', ':'))
        # Write the annotation file
        with open(annotation_file_path, 'w', encoding='utf-8') as f_a:
            f_a.write(json_format)
    except Exception:
        # Drop images whose result cannot be annotated
        os.remove(image_path)

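The whole project takes a long time to run, especially the programs that call the Baidu AI service. To avoid exceeding 2 requests per second (the free tier allows at most 2 requests per second), the programs sleep between calls, which costs a fair amount of time.

Project GitHub repository: https://github.com/yeyupiaoling/FaceDataset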
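
The fixed time.sleep(0.5) after every call always waits the full half second, even when the request itself already took that long. One alternative, shown here only as a sketch, is a small limiter that sleeps just for the time still missing since the previous call. The class RateLimiter is made up for illustration; its clock and sleep parameters exist so the behavior can be checked without real waiting.

```python
import time


class RateLimiter:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch: call limiter.wait() right before each API request
limiter = RateLimiter(max_per_second=2)
limiter.wait()
```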

References¶

http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm
http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/
http://ai.baidu.com/docs#/Face-Detect-V3/top
