# Chinese Punctuation Model

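This project is a Chinese punctuation model built on PaddleSpeech. It uses ernie-3.0-medium-zh as the default pretrained model and includes some optimizations that improve recognition quality. The model can be used to add punctuation to speech recognition output; see PPASR for a usage example.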

Source code: https://github.com/yeyupiaoling/PunctuationModel

Environment setup

  1. Install the GPU build of PaddlePaddle with the command below. Skip this step if it is already installed.
conda install paddlepaddle-gpu==2.3.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
  2. Install PaddleNLP with the command below.
python -m pip install paddlenlp -i https://mirrors.aliyun.com/pypi/simple/ -U
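
To confirm that both packages ended up in the active environment, a small check like the one below may help. It is only a sketch: it assumes the distributions are registered under the names `paddlepaddle-gpu` and `paddlenlp`, which may not hold for every conda build, and the helper name `installed_versions` is made up for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed, or installed under a different distribution name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust them if your environment differs.
    found = installed_versions(["paddlepaddle-gpu", "paddlenlp"])
    for name, version in found.items():
        print(f"{name}: {version or 'not found'}")
```

A `not found` result does not necessarily mean the import will fail, only that no metadata was found under that name.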

Preparing the data

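A small dataset, iwslt2012, is provided here. Download it, extract it, and copy all of the resulting text files into the dataset directory. The quality of this dataset is not great: Chinese and English punctuation are mixed, and it contains a fair amount of unsuitable text, such as HTML code from web pages. A simple first cleanup is to replace the English punctuation marks , . ? with their Chinese counterparts , 。 ?. For better data, clean the dataset further or build a custom one.

To build a custom dataset, follow the format of this dataset. If the text is purely Chinese, the characters do not need to be separated by spaces. Note that the punctuation list punc_vocab should not include the space entry, because the project adds it by default.

After copying, the dataset directory is structured as follows: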

├── dataset
│   ├── dev.txt
│   ├── punc_vocab
│   ├── test.txt
│   └── train.txt
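
The simple cleanup mentioned above, replacing English , . ? with Chinese , 。 ?, could be sketched as follows. This is not the project's own script; the function names are hypothetical. Note that a blind replacement also changes periods inside decimals or URLs, so noisy text may need extra filtering first.

```python
# Map ASCII punctuation to its full-width Chinese counterpart.
EN_TO_ZH = str.maketrans({",": ",", ".": "。", "?": "?"})


def normalize_punctuation(text):
    """Replace English , . ? with Chinese , 。 ? in a string."""
    return text.translate(EN_TO_ZH)


def normalize_file(src_path, dst_path):
    """Write a normalized copy of src_path to dst_path, line by line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(normalize_punctuation(line))
```

Writing to a separate output file keeps the original data intact, for example `normalize_file("dataset/train.txt", "dataset/train_clean.txt")`, where the second path is an illustrative name.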

Training

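Once the dataset is ready, run train.py to start training. For a custom dataset, remember to update the num_classes parameter (the number of classes) before training. The first run downloads the ernie pretrained model, so an internet connection is required. Run the following command: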

python train.py
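
For a custom dataset, the value of num_classes can be derived from punc_vocab. The sketch below assumes that punc_vocab lists one punctuation mark per line and that the project adds exactly one extra class for the space (no punctuation); both are assumptions based on the description above, and `count_classes` is a hypothetical helper.

```python
def count_classes(lines):
    """Count non-empty punctuation entries plus one 'no punctuation' class."""
    marks = [line.strip() for line in lines if line.strip()]
    # Assumption: the project adds one space class on top of the listed marks.
    return len(marks) + 1


if __name__ == "__main__":
    with open("dataset/punc_vocab", encoding="utf-8") as f:
        print("num_classes =", count_classes(f))
```

With three marks in punc_vocab this yields 4, which is consistent with the num_classes value shown in the configuration printed by the scripts.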

Training log output:

-----------  Configuration Arguments -----------
batch_size: 32
dev_data_path: dataset/dev.txt
learning_rate: 1e-05
model_path: models/checkpoint
num_classes: 4
num_epoch: 20
num_workers: 8
pretrained_token: ernie-3.0-medium-zh
punc_path: dataset/punc_vocab
train_data_path: dataset/train.txt
------------------------------------------------
[2022-09-14 17:11:48.482046 INFO   ] train:train:39 - 正在預處理數據集,時間比較長,請耐心等待...
[2022-09-14 17:11:48,482] [    INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-09-14 17:11:48,507] [    INFO] - tokenizer config file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-09-14 17:11:48,508] [    INFO] - Special tokens file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
100%|█████████████████████████████████████████████████████████████| 4328594/4328594 [05:42<00:00, 12645.37it/s]
[2022-09-14 17:17:31,589] [    INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-09-14 17:17:31,610] [    INFO] - tokenizer config file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-09-14 17:17:31,610] [    INFO] - Special tokens file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
100%|██████████████████████████████████████████████████████████████| 33741/33741 [00:02<00:00, 12532.24it/s]
[2022-09-14 17:17:34.309391 INFO   ] train:train:58 - 預處理數據集完成!
[2022-09-14 17:17:34,309] [    INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh.pdparams
W0914 17:17:34.310540 10320 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0914 17:17:34.313140 10320 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022-09-14 17:17:37.758967 INFO   ] train:train:90 - Train epoch: [1/20], batch: [0/1283], loss: 2.05675, f1_score: 0.02082, learning rate: 0.00001000, eta: 2:18:40
[2022-09-14 17:17:54.295418 INFO   ] train:train:90 - Train epoch: [1/20], batch: [100/1283], loss: 0.12979, f1_score: 0.33040, learning rate: 0.00000990, eta: 1:11:06
[2022-09-14 17:18:10.936073 INFO   ] train:train:90 - Train epoch: [1/20], batch: [200/1283], loss: 0.13771, f1_score: 0.37442, learning rate: 0.00000980, eta: 1:10:43
[2022-09-14 17:18:27.706051 INFO   ] train:train:90 - Train epoch: [1/20], batch: [300/1283], loss: 0.10602, f1_score: 0.47096, learning rate: 0.00000970, eta: 1:10:35
[2022-09-14 17:18:44.545404 INFO   ] train:train:90 - Train epoch: [1/20], batch: [400/1283], loss: 0.12836, f1_score: 0.55652, learning rate: 0.00000961, eta: 1:10:27
[2022-09-14 17:19:01.434206 INFO   ] train:train:90 - Train epoch: [1/20], batch: [500/1283], loss: 0.11024, f1_score: 0.51312, learning rate: 0.00000951, eta: 1:10:18
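
To follow convergence over a long run, the progress lines can be pulled out of a saved log. The following is a minimal sketch that relies only on the line format shown above; the regular expression and the function name `parse_train_line` are illustrative and not part of the project.

```python
import re

# Matches the fields of a training progress line in the format shown above.
TRAIN_LINE = re.compile(
    r"Train epoch: \[(?P<epoch>\d+)/\d+\], "
    r"batch: \[(?P<batch>\d+)/\d+\], "
    r"loss: (?P<loss>[\d.]+), "
    r"f1_score: (?P<f1>[\d.]+)"
)


def parse_train_line(line):
    """Return (epoch, batch, loss, f1) for a progress line, or None otherwise."""
    match = TRAIN_LINE.search(line)
    if match is None:
        return None
    return (int(match["epoch"]), int(match["batch"]),
            float(match["loss"]), float(match["f1"]))
```

Lines that are not progress lines, such as the device warnings, return None and can simply be skipped.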

Evaluation

After training finishes, evaluate the model to see how well it has converged. Run the following command:

python eval.py

Log output:

-----------  Configuration Arguments -----------
batch_size: 32
model_path: models/checkpoint
num_classes: 4
num_workers: 8
pretrained_token: ernie-3.0-medium-zh
punc_path: dataset/punc_vocab
test_data_path: dataset/test.txt
------------------------------------------------
[2022-09-14 19:17:54.851788 INFO   ] eval:evaluate:32 - 正在預處理數據集,時間比較長,請耐心等待...
[2022-09-14 19:17:54,851] [    INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-09-14 19:17:54,877] [    INFO] - tokenizer config file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-09-14 19:17:54,877] [    INFO] - Special tokens file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
100%|████████████████████████████████████████████████████████████████████████████████████| 43468/43468 [00:03<00:00, 12605.40it/s]
[2022-09-14 19:17:58.336113 INFO   ] eval:evaluate:43 - 預處理數據集完成!
[2022-09-14 19:17:58,336] [    INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh.pdparams
W0914 19:17:58.337256 11985 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0914 19:17:58.339792 11985 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022-09-14 19:18:02.054659 INFO   ] eval:evaluate:63 - Batch: [0/13], loss: 0.08727, f1_score: 0.78612
[2022-09-14 19:18:02.775505 INFO   ] eval:evaluate:65 - Avg eval, loss: 0.12825, f1_score: 0.70011

Exporting the inference model

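Before the model can be used for inference, it must be exported as an inference model. Run the command below to export it. By default the exported model files are saved in models/pun_models. To use the model with PPASR, copy this entire folder into its models directory.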

python export_model.py

Log output:

-----------  Configuration Arguments -----------
infer_model_path: models/pun_models
model_path: models/checkpoint
num_classes: 4
pretrained_token: ernie-3.0-medium-zh
punc_path: dataset/punc_vocab
------------------------------------------------
[2022-09-14 19:20:42,188] [    INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh.pdparams
W0914 19:20:42.189301 12045 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0914 19:20:42.192952 12045 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022-09-14 19:20:49.433919 INFO   ] export_model:main:43 - 模型導出成功,保存在:models/pun_models

Adding punctuation to text

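Use the exported inference model to add punctuation to text. Alternatively, download the model provided by the author and extract it into the dataset directory. Pass the Chinese text through the text parameter to have punctuation added. This can be applied to speech recognition results; see the PPASR speech recognition project for details.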

python infer.py --text=近幾年不但我用書給女兒兒壓歲也勸說親朋不要給女兒壓歲錢而改送壓歲書

Log output:

-----------  Configuration Arguments -----------
infer_model_path: models/pun_models
text: 近幾年不但我用書給女兒兒壓歲也勸說親朋不要給女兒壓歲錢而改送壓歲書
------------------------------------------------
[2022-09-14 19:23:48,566] [    INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-09-14 19:23:48,590] [    INFO] - tokenizer config file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-09-14 19:23:48,591] [    INFO] - Special tokens file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
[2022-09-14 19:23:49.960468 INFO   ] predictor:__init__:60 - 標點符號模型加載成功。
近幾年,不但我用書給女兒兒壓歲,也勸說親朋不要給女兒壓歲錢而改送壓歲書。
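
Conceptually, a result like the one above can be produced by predicting one class per character and inserting the matching mark after it. The sketch below illustrates that idea only. It assumes per-character labels where class 0 means no punctuation, which is inferred from num_classes and punc_vocab rather than taken from the project's code, and `restore_punctuation` is a hypothetical name.

```python
def restore_punctuation(chars, label_ids, marks):
    """Append marks[label] after each character; label 0 means no punctuation."""
    if len(chars) != len(label_ids):
        raise ValueError("chars and label_ids must have the same length")
    pieces = []
    for ch, label in zip(chars, label_ids):
        pieces.append(ch)
        if label != 0:
            pieces.append(marks[label])
    return "".join(pieces)


if __name__ == "__main__":
    # Hypothetical label table: index 0 is the 'no punctuation' class.
    marks = ["", ",", "。", "?"]
    print(restore_punctuation("近幾年不但", [0, 0, 1, 0, 0], marks))
```

The project's actual decoding may differ, for example in how it handles subword tokens or English words.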