Link: https://blog.doiduoyi.com/authors/1584446358138
Mission: recording the learning journey of the Doi technical team

*This article is based on PaddlePaddle 0.10.0 and Python 2.7.

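Preface¶

The previous two articles, on end-to-end CAPTCHA recognition and end-to-end license plate recognition, already relied on scene text recognition. This article looks at the scene text recognition problem itself in more detail.
What is scene text recognition good for? Thinking big, take autonomous driving: roads are full of signs and markers, most of which carry text, and we have to recognize that text to understand what they mean. Or take the notes a teacher writes on the blackboard: with scene text recognition we can simply take a photo and recognize the text on the board, saving a lot of time otherwise spent copying notes.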

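About the dataset¶

What does scene text look like? Take a look at this image:

The image contains a lot of text, and our job is to recognize it. It comes from the SynthText in the Wild Dataset, which is very large at 41 GB. To keep things easy to learn, this project does not use that dataset. Instead it uses the much smaller Task 2.3: Word Recognition (2013 edition), whose training and test data together come to only about 160 MB, which makes it well suited to learning. Images from this dataset look like this: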

Reading the data¶

The official release provides two list files: gt.txt lists the training images, and Challenge2_Test_Task3_GT.txt lists the test images. Both use the following format:

word_1.png, "Tiredness"
word_2.png, "kills"
word_3.png, "A"
word_4.png, "short"
word_5.png, "break"
word_6.png, "could"

The leading word_1.png is the path of the image, and the trailing Tiredness is the text the image contains.
Based on this format, we write a helper to read the data.

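import os


def get_file_list(image_file_list):
    '''
    Build the list of (image path, label) pairs used for training and testing.
    :param image_file_list: path of the list file, stored in the same directory as the images
    :type image_file_list: str
    '''
    dirname = os.path.dirname(image_file_list)
    path_list = []
    with open(image_file_list) as f:
        for line in f:
            # Split on the first comma only, so a label may contain commas
            line_split = line.strip().split(',', 1)
            filename = line_split[0].strip()
            path = os.path.join(dirname, filename)
            # Drop the leading space and the surrounding double quotes
            label = line_split[1][2:-1].strip()
            # Skip entries whose label is empty
            if label:
                path_list.append((path, label))

    return path_list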
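To make the slicing in get_file_list concrete, here is a minimal sketch that parses a single line of the list file. The helper name parse_gt_line and the second sample line are made up for illustration; the parsing steps mirror the function above.

```python
def parse_gt_line(line):
    """Split one ground-truth line into (filename, label)."""
    # Split on the first comma only, so a comma inside the label survives.
    filename, quoted = line.strip().split(',', 1)
    # `quoted` looks like ' "Tiredness"': skip the space and opening quote,
    # then drop the closing quote.
    label = quoted[2:-1].strip()
    return filename.strip(), label


print(parse_gt_line('word_1.png, "Tiredness"\n'))
# A made-up line whose label contains a comma
print(parse_gt_line('word_9.png, "Hello, world"\n'))
```

Splitting with a limit of 1 is what keeps a label such as "Hello, world" in one piece.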

Calling this function gives us the data information, which we can then use to build the readers for training and testing.

# coding=utf-8
import os
import cv2
from paddle.v2.image import load_image

class DataGenerator(object):
    def __init__(self, char_dict, image_shape):
        '''
        :param char_dict: label dictionary mapping each character to its ID
        :type char_dict: dict
        :param image_shape: fixed shape that every image is resized to
        :type image_shape: tuple
        '''
        self.image_shape = image_shape
        self.char_dict = char_dict

    def train_reader(self, file_list):
        '''
        Reader for the training data.
        :param file_list: list of training images, holding (image path, label) pairs
        :type file_list: list
        '''
        def reader():
            UNK_ID = self.char_dict['<unk>']
            for image_path, label in file_list:
                # Map each character to its ID; unknown characters map to <unk>
                label = [self.char_dict.get(c, UNK_ID) for c in label]
                yield self.load_image(image_path), label
        return reader

    def load_image(self, path):
        '''
        Load an image and convert it into a one-dimensional vector.
        :param path: path of the image file
        :type path: str
        '''
        image = load_image(path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

        # Resize every image to the same fixed shape
        if self.image_shape:
            image = cv2.resize(
                image, self.image_shape, interpolation=cv2.INTER_CUBIC)

        # Flatten to 1-D and scale the pixel values to [0, 1]
        image = image.flatten() / 255.
        return image
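train_reader does not return data directly. It returns a function that produces a fresh generator every time it is called, which is what lets the trainer iterate over the data once per pass. A minimal sketch of that pattern follows, with a stub in place of image loading; make_reader, fake_load and the tiny dictionary are hypothetical names used only for illustration.

```python
def make_reader(samples, char_dict, load_fn):
    """Return a reader: a zero-argument function that yields samples."""
    unk_id = char_dict['<unk>']

    def reader():
        for path, text in samples:
            # Map each character to its ID, falling back to <unk>.
            ids = [char_dict.get(c, unk_id) for c in text]
            yield load_fn(path), ids

    return reader


def fake_load(path):
    # Stand-in for real image loading; returns a dummy pixel vector.
    return [0.0]


char_dict = {'a': 0, 'b': 1, '<unk>': 2}
reader = make_reader([('x.png', 'ab?')], char_dict, fake_load)

first_pass = list(reader())
second_pass = list(reader())  # calling reader() again restarts the data
print(first_pass)
```

The character '?' is not in the dictionary, so it is encoded with the ID of <unk>.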

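As you may have noticed in the code above, the label passed to the model is the value looked up in a label dictionary, so we need to build that dictionary from the characters that appear in the training data. Each line of the dictionary file holds one character and the number of times it occurs, separated by a tab.

The code that generates the label dictionary is shown below. Its input is the list of (path, label) pairs obtained above.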

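from collections import defaultdict


def build_label_dict(file_list, save_path):
    """
    Build the label dictionary from the training data.
    :param file_list: training data list that contains the labels
    :type file_list: list
    :param save_path: path where the label dictionary is saved
    :type save_path: str
    """
    # Count how often each character occurs in the labels
    values = defaultdict(int)
    for path, label in file_list:
        for c in label:
            if c:
                values[c] += 1

    # Reserve an entry for characters not seen during training
    values['<unk>'] = 0
    with open(save_path, "w") as f:
        # Write one "character<TAB>count" line per entry, most frequent first
        for v, count in sorted(
                values.iteritems(), key=lambda x: x[1], reverse=True):
            f.write("%s\t%d\n" % (v, count))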
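As a rough illustration of what ends up in the dictionary file, the sketch below counts characters for two labels taken from the sample list and formats the lines in memory. It is a simplified stand-in, not the article's function: label_dict_lines is a hypothetical name, and ties are broken alphabetically here only to make the output deterministic, whereas the code above leaves the order of ties unspecified.

```python
from collections import Counter


def label_dict_lines(labels):
    """Return 'char<TAB>count' lines, most frequent character first."""
    counts = Counter(c for label in labels for c in label)
    # Reserve an entry for characters unseen during training.
    counts['<unk>'] = 0
    # Sort by descending count, then by character to break ties.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ['%s\t%d' % (ch, n) for ch, n in ordered]


for line in label_dict_lines(['kills', 'A']):
    print(line)
```

Because <unk> is given a count of 0, it always lands at the end of the file.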

Once the label dictionary has been generated, we load it and hand it to DataGenerator to build the reader needed for training:

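def load_dict(dict_path):
    """
    Load the label dictionary from a dictionary file.
    :param dict_path: path of the label dictionary
    :type dict_path: str
    """
    # The line number of each character becomes its label ID
    return dict((line.strip().split("\t")[0], idx)
                for idx, line in enumerate(open(dict_path, "r").readlines()))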
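The important detail in load_dict is that a character's ID is simply its line number in the dictionary file. The sketch below applies the same expression to an in-memory list of lines; the three lines and the name dict_from_lines are made up for illustration. It also shows why the network later uses num_classes + 1 outputs: the index right after the last character is the one the article's model passes to the CTC layer as the blank.

```python
def dict_from_lines(lines):
    """Same mapping as load_dict, but from a list of lines."""
    return dict((line.strip().split('\t')[0], idx)
                for idx, line in enumerate(lines))


lines = ['l\t2\n', 'k\t1\n', '<unk>\t0\n']  # made-up dictionary content
char_dict = dict_from_lines(lines)
num_classes = len(char_dict)
blank_index = num_classes  # one past the last character ID

# Encode a label; 'z' is not in the dictionary, so it maps to <unk>.
unk_id = char_dict['<unk>']
encoded = [char_dict.get(c, unk_id) for c in 'klz']
print(char_dict, blank_index, encoded)
```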

Finally, calling the PaddlePaddle API produces the reader used by the trainer:

reader=paddle.batch(
            paddle.reader.shuffle(
                data_generator.train_reader(train_file_list),
                buf_size=conf.buf_size),
            batch_size=conf.batch_size)
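Conceptually, the two wrappers do the following: shuffle fills a buffer of buf_size samples and emits them in random order, and batch groups the stream into lists of batch_size samples. The pure-Python sketch below imitates that behaviour under stated assumptions. It is not PaddlePaddle's source code, the seed argument is added only for reproducibility, and emitting a final smaller batch is an assumption of this sketch.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Shuffle samples within a sliding buffer of buf_size items."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item

    return shuffled_reader


def batch(reader, batch_size):
    """Group consecutive samples into lists of batch_size items."""
    def batch_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) == batch_size:
                yield buf
                buf = []
        if buf:
            yield buf  # final, smaller batch (assumption of this sketch)

    return batch_reader


def source():
    return iter(range(10))


batches = list(batch(shuffle(source, buf_size=4, seed=0), batch_size=3)())
print([len(b) for b in batches])
```

A larger buf_size mixes the data more thoroughly at the cost of memory.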

The resulting reader is passed to the trainer when calling trainer.train:

# Start training
trainer.train(
    reader=reader,
    feeding=feeding,
    event_handler=event_handler,
    num_passes=conf.num_passes)

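That is the code that starts training, but we cannot run it yet because the trainer has not been defined. The next section covers how to define it.

Defining the trainer¶

Calling the PaddlePaddle API paddle.trainer.SGD creates a trainer: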

trainer = paddle.trainer.SGD(cost=model.cost,
                             parameters=params,
                             update_equation=optimizer,
                             extra_layers=model.eval)

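Defining the neural network model¶

The cost and extra_layers arguments of the trainer are both produced by the neural network model, so the model has to be defined first.
We start with the data layer and the label layer. Because the image is a rectangle, the data layer needs its width and height in addition to its size.

# The image input is a dense float vector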
self.image = layer.data(
    name='image',
    type=paddle.data_type.dense_vector(self.image_vector_size),
    height=self.shape[1],
    width=self.shape[0])

# The label input is a sequence of IDs
if not self.is_infer:
    self.label = layer.data(
        name='label',
        type=paddle.data_type.integer_value_sequence(self.num_classes))
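The data layer receives each image as a flat vector plus its height and width, which is enough to recover the 2-D layout. The sketch below shows the row-major relationship for the (173, 46) shape used later in the article. The article does not show how image_vector_size is computed, so taking it as width times height is an assumption.

```python
def flatten_row_major(rows):
    """Flatten a 2-D grayscale image and scale pixels to [0, 1]."""
    return [value / 255.0 for row in rows for value in row]


width, height = 173, 46             # (width, height) of the resized image
image_vector_size = width * height  # assumed definition: 7958 values

# A black image with one white pixel at row 2, column 5.
image = [[0] * width for _ in range(height)]
image[2][5] = 255

flat = flatten_row_major(image)
print(len(flat), flat[2 * width + 5])
```

Pixel (row y, column x) ends up at index y * width + x in the flat vector.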

Next, a convolutional neural network extracts the image features:

    def conv_groups(self, input, num, with_bn):
        '''
        Extract image features with groups of convolution layers.
        :param input: input layer
        :type input: LayerOutput
        :param num: total number of convolution layers; each group gets num // 4 of them
        :type num: int
        :param with_bn: whether to use batch normalization layers
        :type with_bn: bool
        '''
        assert num % 4 == 0

        filter_num_list = conf.filter_num_list
        is_input_image = True
        tmp = input

        for num_filter in filter_num_list:
            # The input is a grayscale image, so num_channels is 1 for the first group
            if is_input_image:
                num_channels = 1
                is_input_image = False
            else:
                num_channels = None

            tmp = img_conv_group(
                input=tmp,
                num_channels=num_channels,
                conv_padding=conf.conv_padding,
                conv_num_filter=[num_filter] * (num // 4),
                conv_filter_size=conf.conv_filter_size,
                conv_act=Relu(),
                conv_with_batchnorm=with_bn,
                pool_size=conf.pool_size,
                pool_stride=conf.pool_stride, )

        return tmp

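The image features are then expanded into a sequence of feature vectors:

# Extract image features with the CNN
conv_features = self.conv_groups(self.image, conf.filter_num,
                                 conf.with_bn)

# Expand the CNN output into a sequence of feature vectors.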
sliced_feature = layer.block_expand(
    input=conv_features,
    num_channels=conf.num_channels,
    stride_x=conf.stride_x,
    stride_y=conf.stride_y,
    block_x=conf.block_x,
    block_y=conf.block_y)
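The idea behind this expansion is to turn a 2-D feature map into a left-to-right sequence that an RNN can read. The sketch below is a conceptual illustration only, not PaddlePaddle's implementation. It assumes a block that is one column wide and as tall as the feature map, moved one column at a time; the actual values in conf are not shown in this article, and the element order inside each vector may differ in the real layer.

```python
def expand_columns(feature_map):
    """Turn feature_map[c][y][x] into W vectors, each of length C * H."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    sequence = []
    for x in range(width):
        # One time step: every channel and row of column x.
        step = [feature_map[c][y][x]
                for c in range(channels)
                for y in range(height)]
        sequence.append(step)
    return sequence


# Toy feature map with 2 channels, 2 rows and 3 columns.
# Each value encodes its own position as c * 100 + y * 10 + x.
fmap = [[[c * 100 + y * 10 + x for x in range(3)]
         for y in range(2)]
        for c in range(2)]

seq = expand_columns(fmap)
print(len(seq), seq[1])
```

Each column of the feature map becomes one time step, so a wider image yields a longer sequence.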

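A bidirectional RNN then reads the sequence, and its output is mapped to a distribution over characters:

# Use a forward and a backward RNN to capture sequence information.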
gru_forward = simple_gru(
    input=sliced_feature, size=conf.hidden_size, act=Relu())
gru_backward = simple_gru(
    input=sliced_feature,
    size=conf.hidden_size,
    act=Relu(),
    reverse=True)

# Map the RNN output to a distribution over characters.
self.output = layer.fc(input=[gru_forward, gru_backward],
                       size=self.num_classes + 1,
                       act=Linear())

self.log_probs = paddle.layer.mixed(
    input=paddle.layer.identity_projection(input=self.output),
    act=paddle.activation.Softmax())
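The fully connected layer produces num_classes + 1 raw scores per time step, and the softmax activation turns them into probabilities that sum to 1. The sketch below is a plain-Python illustration of softmax for a single time step; the scores are made up, and this is not how PaddlePaddle implements the layer internally.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    # Subtracting the maximum avoids overflow in exp().
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for 2 characters plus the CTC blank (last entry).
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
print(probs)
```

Softmax preserves the ranking of the scores, so the most likely class is the same before and after.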

Finally we can obtain cost and extra_layers:

if not self.is_infer:
    self.cost = layer.warp_ctc(
        input=self.output,
        label=self.label,
        size=self.num_classes + 1,
        norm_by_times=conf.norm_by_times,
        blank=self.num_classes)

    self.eval = evaluator.ctc_error(input=self.output, label=self.label)
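The evaluator reports how far the decoded sequences are from the labels. As far as I know it is based on sequence edit distance; treat that as an assumption and check the PaddlePaddle documentation for the exact definition. The sketch below is a standard Levenshtein distance in plain Python, shown only to illustrate the metric.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # delete item_a
                cur[j - 1] + 1,                     # insert item_b
                prev[j - 1] + (item_a != item_b)))  # substitute or match
        prev = cur
    return prev[-1]


# 'FD' needs two insertions to become 'FOOD'.
print(edit_distance('FD', 'FOOD'))
```

Dividing the distance by the label length gives a per-character error rate.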

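Creating the trainer¶

The cost is also used to create the training parameters:

# Create the training parameters
params = paddle.parameters.create(model.cost)

The last missing piece is an optimization method:

# Create the optimizer
optimizer = paddle.optimizer.Momentum(momentum=conf.momentum)

We now have all four arguments, cost, parameters, update_equation and extra_layers, and can create the trainer.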

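Starting training¶

Training takes four arguments. So far we only have reader; the other three are feeding, event_handler and num_passes.
First, define how the reader output maps to the data layers:

# Map each data layer name to its position in a reader sample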
feeding = {'image': 0, 'label': 1}
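In other words, feeding says which element of each sample tuple goes to which data layer. The sketch below is a simplified picture of that lookup with a made-up sample, not PaddlePaddle's actual feeding code.

```python
feeding = {'image': 0, 'label': 1}

# One sample as yielded by the reader: (flattened image, label IDs).
sample = ([0.0, 0.5, 1.0], [7, 2, 9])

# Route each element of the sample to the data layer with that name.
inputs = dict((name, sample[pos]) for name, pos in feeding.items())
print(inputs)
```

The names must match the name arguments given to the data layers, here 'image' and 'label'.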

Next, define the training event handler so that it prints log messages during training and we can watch how the model converges.

# Training event handler
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        # Print a log line every conf.log_period batches
        if event.batch_id % conf.log_period == 0:
            print("Pass %d, batch %d, Samples %d, Cost %f, Eval %s" %
                  (event.pass_id, event.batch_id, event.batch_id *
                   conf.batch_size, event.cost, event.metrics))

    if isinstance(event, paddle.event.EndPass):
        # The training and test data share the same format, so we
        # still use data_generator.train_reader to read the test data
        result = trainer.test(
            reader=paddle.batch(
                data_generator.train_reader(test_file_list),
                batch_size=conf.batch_size),
            feeding=feeding)
        print("Test %d, Cost %f, Eval %s" %
              (event.pass_id, result.cost, result.metrics))
        # Save the parameters; every pass overwrites the same file
        with gzip.open(
                os.path.join(model_save_dir, "params_pass.tar.gz"), "w") as f:
            trainer.save_parameter_to_tar(f)

Specify the number of training passes:

num_passes=conf.num_passes

PaddlePaddle also has to be initialized before training:

# Initialize PaddlePaddle
paddle.init(use_gpu=conf.use_gpu, trainer_count=conf.trainer_count)

During training, log messages like the following are printed:

Pass 0, batch 0, Samples 0, Cost 39.119792, Eval {}
Test 0, Cost 35.374924, Eval {}
Pass 1, batch 0, Samples 0, Cost 30.138696, Eval {}
Test 1, Cost 21.629668, Eval {}
Pass 2, batch 0, Samples 0, Cost 21.412227, Eval {}
Test 2, Cost 22.698648, Eval {}
Pass 3, batch 0, Samples 0, Cost 22.565864, Eval {}
Test 3, Cost 21.634227, Eval {}

Running prediction¶

The training above gave us trained parameters, which we can now use for prediction.

import gzip

import paddle.v2 as paddle


def infer(model_path, image_shape, label_dict_path, infer_file_list_path):
    # Load the list of images to predict
    infer_file_list = get_file_list(infer_file_list_path)
    # Load the label dictionary
    char_dict = load_dict(label_dict_path)
    # Load the reversed label dictionary
    reversed_char_dict = load_reverse_dict(label_dict_path)
    # Get the dictionary size
    dict_size = len(char_dict)
    # Create the data generator that provides the reader
    data_generator = DataGenerator(char_dict=char_dict, image_shape=image_shape)
    # Initialize PaddlePaddle
    paddle.init(use_gpu=True, trainer_count=2)
    # Load the trained parameters
    parameters = paddle.parameters.Parameters.from_tar(gzip.open(model_path))
    # Build the network model
    model = Model(dict_size, image_shape, is_infer=True)
    # Create the inferer
    inferer = paddle.inference.Inference(output_layer=model.log_probs, parameters=parameters)
    # Collect the images and labels, then run prediction
    test_batch = []
    labels = []
    for i, (image, label) in enumerate(data_generator.infer_reader(infer_file_list)()):
        test_batch.append([image])
        labels.append(label)
    infer_batch(inferer, test_batch, labels, reversed_char_dict)
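The function above calls data_generator.infer_reader, which is not shown in the DataGenerator listing in this article. Judging from how it is used, it yields the flattened image together with the raw label text rather than label IDs. The sketch below is a hypothetical stand-alone version under that assumption; make_infer_reader and fake_load are illustrative names, and the project on GitHub holds the real method.

```python
def make_infer_reader(file_list, load_image):
    """Return a reader that yields (image vector, raw label text)."""
    def reader():
        for image_path, label in file_list:
            # The label stays a string so it can be printed next to
            # the prediction.
            yield load_image(image_path), label

    return reader


def fake_load(path):
    # Stand-in for DataGenerator.load_image; returns a dummy vector.
    return [0.0, 1.0]


file_list = [('word_1.png', 'Tiredness'), ('word_2.png', 'kills')]

test_batch, labels = [], []
for image, label in make_infer_reader(file_list, fake_load)():
    test_batch.append([image])  # each sample is a one-element list
    labels.append(label)

print(labels)
```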

The reversed label dictionary used above is defined as follows. It is built from the same label dictionary file:

def load_reverse_dict(dict_path):
    """
    Load the reversed label dictionary (ID to character) from a dictionary file.
    :param dict_path: path of the label dictionary
    :type dict_path: str
    """
    return dict((idx, line.strip().split("\t")[0])
                for idx, line in enumerate(open(dict_path, "r").readlines()))

With the inferer obtained above, the flattened image vectors and the reversed label dictionary, we can run prediction:

def infer_batch(inferer, test_batch, labels, reversed_char_dict):
    # Get the raw prediction results
    infer_results = inferer.infer(input=test_batch)
    # The results of all samples are stacked, so split them per sample
    num_steps = len(infer_results) // len(test_batch)
    probs_split = [
        infer_results[i * num_steps:(i + 1) * num_steps]
        for i in xrange(0, len(test_batch))
    ]
    results = []
    # Best path decoding
    for i, probs in enumerate(probs_split):
        output_transcription = ctc_greedy_decoder(
            probs_seq=probs, vocabulary=reversed_char_dict)
        results.append(output_transcription)
    # Print the predictions next to the actual text
    for result, label in zip(results, labels):
        print("\nPrediction: %s\nGround truth: %s" % (result, label))
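The slicing in infer_batch relies on every sample producing the same number of time steps, which holds here because all images are resized to one fixed shape. The sketch below replays that split on made-up rows of probabilities; split_by_sample is a hypothetical helper name.

```python
def split_by_sample(rows, num_samples):
    """Split stacked per-time-step rows into one chunk per sample."""
    num_steps = len(rows) // num_samples
    return [rows[i * num_steps:(i + 1) * num_steps]
            for i in range(num_samples)]


# Six made-up rows: two samples with three time steps each.
rows = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4],
        [0.3, 0.7], [0.5, 0.5], [0.1, 0.9]]

chunks = split_by_sample(rows, num_samples=2)
print(len(chunks), [len(chunk) for chunk in chunks])
```

If the images had different widths, the sequences would differ in length and this even split would no longer be valid.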

This step relies on best path decoding. The decoder is shown below:

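import numpy as np
from itertools import groupby


def ctc_greedy_decoder(probs_seq, vocabulary):
    """CTC greedy (best path) decoder.
    The path made of the most probable tokens is post-processed by
    removing consecutive duplicates and all blanks.
    :param probs_seq: 2-D list of probabilities over the vocabulary.
                      Each element is the list of float probabilities for one time step.
    :type probs_seq: list
    :param vocabulary: vocabulary mapping each index to its character
    :type vocabulary: list or dict
    :return: decoded string
    :rtype: str
    """
    # Dimension check: one extra entry per time step for the blank
    for probs in probs_seq:
        if not len(probs) == len(vocabulary) + 1:
            raise ValueError("probs_seq dimension mismatched with vocabulary")
    # Take the argmax to get the best index at each time step
    max_index_list = list(np.array(probs_seq).argmax(axis=1))
    # Remove consecutive duplicate indexes
    index_list = [index_group[0] for index_group in groupby(max_index_list)]
    # Remove the blank index
    blank_index = len(vocabulary)
    index_list = [index for index in index_list if index != blank_index]
    # Convert the index list to a string
    return ''.join([vocabulary[index] for index in index_list])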
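A small worked example makes the two post-processing steps easier to follow. The sketch below re-implements the same greedy decoding in plain Python, without numpy, and runs it on made-up probabilities for a two-character vocabulary; greedy_decode, vocab and probs are illustrative names and values.

```python
from itertools import groupby


def greedy_decode(probs_seq, vocabulary):
    """Best path decoding: argmax, collapse repeats, drop blanks."""
    blank = len(vocabulary)
    # Most probable index at each time step.
    best = [max(range(len(p)), key=p.__getitem__) for p in probs_seq]
    # Collapse runs of the same index into a single index.
    collapsed = [key for key, _ in groupby(best)]
    return ''.join(vocabulary[i] for i in collapsed if i != blank)


vocab = {0: 'a', 1: 'b'}  # index 2 is the blank
probs = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a again: merged with the previous step
    [0.1, 0.1, 0.8],  # blank
    [0.6, 0.3, 0.1],  # a: kept, because a blank separates it
    [0.2, 0.7, 0.1],  # b
]
print(greedy_decode(probs, vocab))  # aab
```

The blank is what allows a doubled letter, such as the two O's in FOOD, to survive the removal of consecutive duplicates.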

Finally, run the prediction program from the main block:

if __name__ == "__main__":
    # List file of the images to predict
    infer_file_list_path = '../data/test_data/Challenge2_Test_Task3_GT.txt'
    # Path of the trained model
    model_path = '../models/params_pass.tar.gz'
    # Image size as (width, height)
    image_shape = (173, 46)
    # Path of the label dictionary
    label_dict_path = '../data/label_dict.txt'
    # Run prediction
    infer(model_path, image_shape, label_dict_path, infer_file_list_path)

Prediction results:

Prediction: FFt
Ground truth: PROPER

Prediction: FD
Ground truth: FOOD

Prediction: F:
Ground truth: PRONTO

Prediction: 6vdt:tdnd
Ground truth: professional

Prediction: La
Ground truth: Java

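Judging from these predictions, the model does not perform well and its error rate is very high. The dataset is not large, so the model does not converge well and overfits easily. Adding regularization did not help much in my experiments. Readers can change the network and trainer settings in config.py to try to make the model converge better, or use a larger dataset to address the problem.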

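Previous chapter: My PaddlePaddle Learning Path, Note 7: End-to-End License Plate Recognition ¶

Next chapter: My PaddlePaddle Learning Path, Note 9: Object Detection with the VOC Dataset ¶

Project code¶

GitHub: https://github.com/yeyupiaoling/LearnPaddle

References¶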

http://paddlepaddle.org/
http://www.robots.ox.ac.uk/~vgg/data/scenetext/
http://rrc.cvc.uab.es/?ch=2&com=introduction
