Chinese Punctuation Model
This is a Chinese punctuation model built on PaddleSpeech. It uses ernie-3.0-medium-zh as the default pretrained model, with some optimizations for better recognition results. The model can be used to add punctuation marks to speech recognition results; PPASR contains a usage example.
Source code: https://github.com/yeyupiaoling/PunctuationModel
Installation Environment
- Install the GPU version of PaddlePaddle using the following command. If you have already installed it, you can skip this step.
conda install paddlepaddle-gpu==2.3.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
- Install the PaddleNLP tool using the following command.
python -m pip install paddlenlp -i https://mirrors.aliyun.com/pypi/simple/ -U
Data Preparation
A small dataset, iwslt2012, is provided here. Download it, extract it, and copy all of the resulting text files into the dataset directory. The quality of this dataset is poor: it mixes Chinese and English punctuation marks and contains many unsuitable texts (e.g., HTML code from web pages). As a simple fix, you can replace the English punctuation marks , . ? with the Chinese ones ,。?. For better results, clean the data further or build a custom dataset.
To build a custom dataset, follow the format of this dataset. Pure Chinese text does not need to be separated with spaces. Note that when creating the punctuation mark list punc_vocab, you do not need to include a space; the project adds it automatically. The dataset directory is structured as follows.
├── dataset
│ ├── dev.txt
│ ├── punc_vocab
│ ├── test.txt
│ └── train.txt
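The punctuation replacement suggested above can be sketched in a few lines of Python. This is only a minimal illustration, not part of the project: the function names normalize_punctuation and normalize_file are hypothetical, and only the three marks named in the text are handled.

```python
# Map the three English marks named in the text to their Chinese counterparts.
# Assumption: no other punctuation needs converting.
EN_TO_ZH = str.maketrans({",": ",", ".": "。", "?": "?"})


def normalize_punctuation(line: str) -> str:
    """Replace English , . ? with Chinese ,。? in one line of text."""
    return line.translate(EN_TO_ZH)


def normalize_file(src_path: str, dst_path: str) -> None:
    """Copy a dataset file line by line, normalizing punctuation on the way."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(normalize_punctuation(line))


if __name__ == "__main__":
    print(normalize_punctuation("你好,世界.你好吗?"))  # 你好,世界。你好吗?
```

Note that a blanket replacement like this also converts periods inside numbers (such as 3.14) and inside English sentences, so it is probably best applied after removing lines that are not Chinese text.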
Training
After preparing the dataset, run train.py to start training. If you use a custom dataset, remember to update the num_classes parameter before training. The first run downloads the ERNIE pretrained model, so an internet connection is required. Execute the following command.
python train.py
Training log output:
----------- Configuration Arguments -----------
batch_size: 32
dev_data_path: dataset/dev.txt
learning_rate: 1e-05
model_path: models/checkpoint
num_classes: 4
num_epoch: 20
num_workers: 8
pretrained_token: ernie-3.0-medium-zh
punc_path: dataset/punc_vocab
train_data_path: dataset/train.txt
------------------------------------------------
[2022-09-14 17:11:48.482046 INFO ] train:train:39 - 正在预处理数据集,时间比较长,请耐心等待...
[2022-09-14 17:11:48,482] [ INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-09-14 17:11:48,507] [ INFO] - tokenizer config file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-09-14 17:11:48,508] [ INFO] - Special tokens file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
100%|█████████████████████████████████████████████████████████████| 4328594/4328594 [05:42<00:00, 12645.37it/s]
[2022-09-14 17:17:31,589] [ INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-09-14 17:17:31,610] [ INFO] - tokenizer config file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-09-14 17:17:31,610] [ INFO] - Special tokens file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
100%|██████████████████████████████████████████████████████████████| 33741/33741 [00:02<00:00, 12532.24it/s]
[2022-09-14 17:17:34.309391 INFO ] train:train:58 - 预处理数据集完成!
[2022-09-14 17:17:34,309] [ INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh.pdparams
W0914 17:17:34.310540 10320 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0914 17:17:34.313140 10320 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022-09-14 17:17:37.758967 INFO ] train:train:90 - Train epoch: [1/20], batch: [0/1283], loss: 2.05675, f1_score: 0.02082, learning rate: 0.00001000, eta: 2:18:40
[2022-09-14 17:17:54.295418 INFO ] train:train:90 - Train epoch: [1/20], batch: [100/1283], loss: 0.12979, f1_score: 0.33040, learning rate: 0.00000990, eta: 1:11:06
[2022-09-14 17:18:10.936073 INFO ] train:train:90 - Train epoch: [1/20], batch: [200/1283], loss: 0.13771, f1_score: 0.37442, learning rate: 0.00000980, eta: 1:10:43
[2022-09-14 17:18:27.706051 INFO ] train:train:90 - Train epoch: [1/20], batch: [300/1283], loss: 0.10602, f1_score: 0.47096, learning rate: 0.00000970, eta: 1:10:35
[2022-09-14 17:18:44.545404 INFO ] train:train:90 - Train epoch: [1/20], batch: [400/1283], loss: 0.12836, f1_score: 0.55652, learning rate: 0.00000961, eta: 1:10:27
[2022-09-14 17:19:01.434206 INFO ] train:train:90 - Train epoch: [1/20], batch: [500/1283], loss: 0.11024, f1_score: 0.51312, learning rate: 0.00000951, eta: 1:10:18
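For a custom dataset, the num_classes value can be sanity-checked against punc_vocab before training. The sketch below rests on two assumptions that the text does not confirm: that punc_vocab holds one punctuation mark per line, and that the project adds exactly one extra class for "no punctuation" (the automatically added space). These assumptions are consistent with num_classes: 4 in the log above if the vocabulary contains three marks, but the helper names here are hypothetical and not part of the project.

```python
def count_classes(punc_vocab_lines):
    """Return the label count: one per punctuation mark plus one blank class.

    Assumption: one mark per line, and a single extra "no punctuation" class.
    """
    marks = [line.strip() for line in punc_vocab_lines if line.strip()]
    return len(marks) + 1


def count_classes_from_file(punc_path="dataset/punc_vocab"):
    """Read punc_vocab from disk and count the classes it implies."""
    with open(punc_path, encoding="utf-8") as f:
        return count_classes(f)


if __name__ == "__main__":
    # Three marks plus the blank class.
    print(count_classes([",\n", "。\n", "?\n"]))  # 4
```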
Evaluation
After training, evaluate the model on the test set to check how well it has converged. Execute the following command.
python eval.py
Output log information:
----------- Configuration Arguments -----------
batch_size: 32
model_path: models/checkpoint
num_classes: 4
num_workers: 8
pretrained_token: ernie-3.0-medium-zh
punc_path: dataset/punc_vocab
test_data_path: dataset/test.txt
------------------------------------------------
[2022-09-14 19:17:54.851788 INFO ] eval:evaluate:32 - 正在预处理数据集,时间比较长,请耐心等待...
[2022-09-14 19:17:54,851] [ INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-09-14 19:17:54,877] [ INFO] - tokenizer config file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-09-14 19:17:54,877] [ INFO] - Special tokens file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
100%|████████████████████████████████████████████████████████████████████████████████████| 43468/43468 [00:03<00:00, 12605.40it/s]
[2022-09-14 19:17:58.336113 INFO ] eval:evaluate:43 - 预处理数据集完成!
[2022-09-14 19:17:58,336] [ INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh.pdparams
W0914 19:17:58.337256 11985 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0914 19:17:58.339792 11985 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022-09-14 19:18:02.054659 INFO ] eval:evaluate:63 - Batch: [0/13], loss: 0.08727, f1_score: 0.78612
[2022-09-14 19:18:02.775505 INFO ] eval:evaluate:65 - Avg eval, loss: 0.12825, f1_score: 0.70011
Exporting the Prediction Model
Before using the model, you need to export it as a prediction model. The exported model files are saved in models/pun_models by default. To use the model with PPASR, copy this entire folder into PPASR's models directory. Execute the following command to export the model.
python export_model.py
Output log information:
----------- Configuration Arguments -----------
infer_model_path: models/pun_models
model_path: models/checkpoint
num_classes: 4
pretrained_token: ernie-3.0-medium-zh
punc_path: dataset/punc_vocab
------------------------------------------------
[2022-09-14 19:20:42,188] [ INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh.pdparams
W0914 19:20:42.189301 12045 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0914 19:20:42.192952 12045 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022-09-14 19:20:49.433919 INFO ] export_model:main:43 - 模型导出成功,保存在:models/pun_models
Adding Punctuation to Text
Use the exported prediction model to add punctuation marks to text. You can also download the model provided by the author here and extract it to the dataset directory. Specify the Chinese text through the text parameter, and the model adds the punctuation marks. This can be applied to speech recognition results; refer to the PPASR speech recognition project for details.
python infer.py --text=近几年不但我用书给女儿儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书
Output log information:
----------- Configuration Arguments -----------
infer_model_path: models/pun_models
text: 近几年不但我用书给女儿儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书
------------------------------------------------
[2022-09-14 19:23:48,566] [ INFO] - Already cached /home/test/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-09-14 19:23:48,590] [ INFO] - tokenizer config file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-09-14 19:23:48,591] [ INFO] - Special tokens file saved in /home/test/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json
[2022-09-14 19:23:49.960468 INFO ] predictor:__init__:60 - 标点符号模型加载成功。
近几年,不但我用书给女儿儿压岁,也劝说亲朋不要给女儿压岁钱而改送压岁书。
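Conceptually, a token-classification punctuation model predicts one label per character, and the punctuated sentence is rebuilt by appending the predicted mark after each character. The sketch below illustrates only that merging step. It is not the project's predictor: the function name restore_punctuation is hypothetical, and the label layout (index 0 meaning "no punctuation", index k meaning the k-th mark in the list) is an assumption made for this example.

```python
def restore_punctuation(text, labels, punc_list):
    """Append the predicted punctuation mark after each character.

    Assumed layout: labels[i] belongs to text[i]; 0 means "no punctuation",
    and k > 0 means punc_list[k - 1] follows the character.
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    out = []
    for ch, label in zip(text, labels):
        out.append(ch)
        if label > 0:
            out.append(punc_list[label - 1])
    return "".join(out)


if __name__ == "__main__":
    puncs = [",", "。", "?"]
    # Hand-written labels standing in for model predictions.
    print(restore_punctuation("你好吗我很好", [0, 0, 3, 0, 0, 2], puncs))  # 你好吗?我很好。
```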