DeepSpeech2 Chinese Speech Recognition

This project is based on PaddlePaddle's DeepSpeech project, with significant modifications that make it convenient to train on custom Chinese datasets, as well as to test and deploy models. DeepSpeech2 is an end-to-end automatic speech recognition (ASR) engine implemented on PaddlePaddle; its architecture follows the paper Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. The project also supports various data augmentation methods to suit different usage scenarios. It supports training and prediction under Windows and Linux, and inference on development boards such as the Nvidia Jetson.

The environment used in this project:
- Python 3.7
- PaddlePaddle 2.1.2
- Windows or Ubuntu

Source code of this tutorial: https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech

Model Download

| Dataset | Number of Convolution Layers | Number of RNN Layers | RNN Layer Size | Test Character Error Rate | Download Link |
|:---|:---:|:---:|:---:|:---:|:---:|
| aishell (179 hours) | 2 | 3 | 1024 | 0.084532 | Download |
| free_st_chinese_mandarin_corpus (109 hours) | 2 | 3 | 1024 | 0.170260 | Download |
| thchs_30 (34 hours) | 2 | 3 | 1024 | 0.026838 | Download |

Note: The downloads above are training checkpoints (parameter files). To use them for prediction, you must first export the model (see Model Export). The reported character error rates were obtained with beam search decoding.

Feel free to raise an issue for any questions.

Environment Setup

I used a local Anaconda environment with a Python 3.7 virtual environment. Readers are encouraged to use the same setup so that problems are easier to reproduce and discuss. If you run into installation issues, please feel free to raise an issue.

  • First, install the GPU version of PaddlePaddle 2.1.2. Skip this step if it is already installed.
    ```shell script
    conda install paddlepaddle-gpu==2.1.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
    ```
  • Install the remaining dependencies.
    ```shell script
    python -m pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
    ```

Data Preparation

  1. The download_data directory contains scripts for downloading public datasets and creating training data lists and vocabularies. This project provides downloads for three Chinese Mandarin speech datasets: Aishell, Free ST-Chinese-Mandarin-Corpus, and THCHS-30, with a total size exceeding 28G. Downloading these three datasets can be done with the following code. If you want to train quickly, you can download only one of them. Note: noise.py can be downloaded or not; it is used for data augmentation during training. If you don’t want to use noise data augmentation, you can skip downloading it.
    ```shell script
    cd download_data/
    python aishell.py
    python free_st_chinese_mandarin_corpus.py
    python thchs_30.py
    python noise.py
    ```

    **Note:** The above scripts only run under Linux. On Windows, take the `DATA_URL` from each script and download the file manually; a download tool such as Thunder is recommended for faster speeds. Then modify the `download()` function so that it returns the absolute path of the downloaded file. For example:

    ```python
    # Original line
    filepath = download(url, md5sum, target_dir)
    # Modified line
    filepath = "D:\\Download\\data_aishell.tgz"
    ```
  2. If developers have their own datasets, they can use them for training, either alongside or instead of the downloaded datasets. Custom speech data must follow the format below. The default audio sampling rate in this project is 16000 Hz; create_data.py can uniformly convert audio to 16000 Hz when is_change_frame_rate=True is set.
     - Speech files go in the PaddlePaddle-DeepSpeech/dataset/audio/ directory. For example, if you have a wav folder containing speech files, place it there.
     - Data list files go in PaddlePaddle-DeepSpeech/dataset/annotation/. Each line of a data list contains the relative path to a speech file and its corresponding Chinese transcript. Note: the transcript must contain only Chinese characters, with no punctuation, Arabic numerals, or English letters.
    ```shell script
    dataset/audio/wav/0175/H0175A0171.wav 我需要把空调温度调到二十度
    dataset/audio/wav/0175/H0175A0377.wav 出彩中国人
    dataset/audio/wav/0175/H0175A0470.wav 据克而瑞研究中心监测
    dataset/audio/wav/0175/H0175A0180.wav 把温度加大到十八
    ```
  3. Finally, run the following script to generate the data lists and vocabulary:
    ```shell script
    python create_data.py
    ```
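Before running create_data.py, annotation lines can be sanity-checked against the format rules above (relative .wav path, whitespace, then a transcript of pure Chinese characters). A minimal sketch; the helper function is hypothetical and not part of the project:

```python
import re

# Transcripts must contain only Chinese characters: no punctuation,
# Arabic numerals, or English letters (see the data format rules above).
CHINESE_ONLY = re.compile(r'^[\u4e00-\u9fa5]+$')

def check_annotation_line(line):
    """Return (path, transcript) if the line is valid, else raise ValueError."""
    parts = line.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError('expected "<wav path> <transcript>": %r' % line)
    path, text = parts
    if not path.endswith('.wav'):
        raise ValueError('not a .wav path: %r' % path)
    if not CHINESE_ONLY.match(text):
        raise ValueError('transcript is not pure Chinese: %r' % text)
    return path, text

print(check_annotation_line('dataset/audio/wav/0175/H0175A0377.wav 出彩中国人'))
```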

Model Training

  • Execute the training script to start training the speech recognition model. A checkpoint is saved every epoch and every 2000 batches to the PaddlePaddle-DeepSpeech/models/param/ directory. Data augmentation is enabled by default; to disable it, set augment_conf_path=None. For details on data augmentation, see the FAQ section.
    ```shell script
    CUDA_VISIBLE_DEVICES=0,1 python train.py
    ```
  • During training, the program uses VisualDL to record training results. Start VisualDL with:
    ```shell
    visualdl --logdir=log --host=0.0.0.0
    ```
  • Access http://localhost:8040 in your browser to view the training results.

Evaluation

Run the following script to evaluate the model using character error rate (CER):

```shell script
python eval.py --resume_model=./models/param/50.pdparams
```

Example output:

-----------  Configuration Arguments -----------
alpha: 1.2
batch_size: 64
beam_size: 10
beta: 0.35
cutoff_prob: 1.0
cutoff_top_n: 40
decoding_method: ctc_greedy
error_rate_type: cer
lang_model_path: ./lm/zh_giga.no_cna_cmn.prune01244.klm
mean_std_path: ./dataset/mean_std.npz
resume_model: ./models/param/50.pdparams
num_conv_layers: 2
num_proc_bsearch: 8
num_rnn_layers: 3
rnn_layer_size: 1024
test_manifest: ./dataset/manifest.test
use_gpu: True
vocab_path: ./dataset/zh_vocab.txt
------------------------------------------------
[INFO 2021-03-18 16:38:53,689 eval.py:83] 开始评估 ...
Character Error Rate: [cer] (64/284) = 0.077040
...
Total Character Error Rate: [cer] (284/284) = 0.055882
[INFO 2021-03-18 16:39:38,215 eval.py:117] 完成评估
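The character error rate reported above is the Levenshtein edit distance between the predicted and reference character sequences, divided by the reference length. A minimal sketch of the metric itself (not the project's implementation):

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit distance over characters.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / len(r)

print(cer('把温度加大到十八', '把温度加到十八'))  # one deletion over 8 chars → 0.125
```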

Model Export

Both trained checkpoints and the downloadable models above are parameter files. Before prediction, export them as inference models, which run faster with the Inference API; TensorRT acceleration is also supported on some devices.

```shell script
python export_model.py --resume_model=./models/param/50.pdparams
```

Example output:

Successfully loaded pretrained model: ./models/param/50.pdparams
-----------  Configuration Arguments -----------
mean_std_path: ./dataset/mean_std.npz
num_conv_layers: 2
num_rnn_layers: 3
rnn_layer_size: 1024
pretrained_model: ./models/param/50.pdparams
save_model_path: ./models/infer/
use_gpu: True
vocab_path: ./dataset/zh_vocab.txt
------------------------------------------------
Model exported successfully and saved to: ./models/infer/

Local Prediction

Use this script to run prediction with the model. Make sure the model has been exported first (see Model Export). Specify the audio file path with --wav_path. Conversion of Chinese numerals to Arabic numerals is supported and enabled by default (--to_an=True).
```shell script
python infer_path.py --wav_path=./dataset/test.wav
```

Example output:

----------- Configuration Arguments -----------
alpha: 1.2
beam_size: 10
beta: 0.35
cutoff_prob: 1.0
cutoff_top_n: 40
decoding_method: ctc_greedy
enable_mkldnn: False
is_long_audio: False
lang_model_path: ./lm/zh_giga.no_cna_cmn.prune01244.klm
mean_std_path: ./dataset/mean_std.npz
model_dir: ./models/infer/
to_an: True
use_gpu: True
vocab_path: ./dataset/zh_vocab.txt
wav_path: ./dataset/test.wav


Time taken: 132 ms, Recognition Result: 近几年不但我用书给女儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书, Score: 94
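The --to_an option replaces spoken Chinese numerals such as 二十 with Arabic digits in the recognition result. A rough sketch of the idea for numbers below 100; this is an illustrative stand-in, not the project's actual converter:

```python
# Mapping from Chinese digit characters to integer values.
DIGITS = {'零': 0, '一': 1, '二': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}

def cn_to_int(s):
    """Convert a Chinese numeral below 100 (e.g. 二十, 十八, 二十五) to an int."""
    if '十' in s:
        tens, _, units = s.partition('十')
        value = (DIGITS[tens] if tens else 1) * 10   # bare 十 means 10
        value += DIGITS[units] if units else 0
        return value
    return DIGITS[s]

print(cn_to_int('二十'))  # → 20
print(cn_to_int('十八'))  # → 18
```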

Long Audio Prediction

Use the `--is_long_audio` parameter for long audio recognition, which splits audio using VAD (Voice Activity Detection), recognizes short segments, and concatenates results.
```shell script
python infer_path.py --wav_path=./dataset/test_vad.wav --is_long_audio=True
```
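Long-audio mode splits the recording at silences, recognizes each piece, and concatenates the results. The project uses VAD for the splitting step; a simplified energy-threshold stand-in (not the project's actual VAD) illustrates the idea:

```python
import numpy as np

def split_on_silence(samples, sample_rate=16000, frame_ms=30, threshold=0.01):
    """Split a mono signal into (start, end) sample ranges of non-silent audio.

    A crude energy-based stand-in for real VAD: frames whose RMS energy
    exceeds `threshold` count as speech, and consecutive speech frames
    are merged into segments.
    """
    frame = int(sample_rate * frame_ms / 1000)
    segments, start = [], None
    for i in range(0, len(samples) - frame + 1, frame):
        rms = np.sqrt(np.mean(samples[i:i + frame] ** 2))
        if rms > threshold and start is None:
            start = i                       # speech begins
        elif rms <= threshold and start is not None:
            segments.append((start, i))     # speech ends
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Synthetic example: 1 s tone, 1 s silence, 1 s tone → two segments.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
audio = np.concatenate([tone, np.zeros(sr), tone])
print(len(split_on_silence(audio, sr)))  # → 2
```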

Web Deployment

Start a web service with HTTP interface for speech recognition. After starting, access http://localhost:5000 in the browser.
```shell script
python infer_server.py
```

The web interface allows uploading audio files (short/long) or recording directly, with playback of recorded audio. Chinese number conversion is enabled by default (`--to_an=True`).


GUI Interface Deployment

Open the GUI interface to select audio files for recognition (short/long), with audio recording support and playback of recognition results. Defaults to greedy decoding; specify beam search via command line parameters.
```shell script
python infer_gui.py
```

Related Projects

Xiaoye