# PPASR Speech Recognition (Advanced Level)

This project is developed in three branches: Beginner Level, Advanced Level, and Final Level; this is the Advanced Level branch. As the level increases, recognition accuracy improves, making the models more suitable for use in real projects. Stay tuned!

PPASR (Advanced Level) is an end-to-end automatic speech recognition system built on PaddlePaddle 2. Compared to the beginner level, the advanced level improves model accuracy in three key areas:

  1. Model Replacement: Adopting the DeepSpeech2 model released by Baidu in 2015; see the paper *Deep Speech 2: End-to-End Speech Recognition in English and Mandarin*.

  2. Audio Preprocessing: Using a more effective preprocessing method for speech recognition, computing linear spectrograms from FFT energy (a sketch follows this list).

  3. Decoder Enhancement: Replacing the previous greedy decoder with a beam search decoder that can load language models to adjust decoding results, resulting in more reasonable predicted outputs and improved accuracy.
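
For the preprocessing step, the sketch below computes a linear spectrogram from FFT energy with NumPy. It is a minimal illustration of the feature type, not the project's exact implementation; the 20 ms window, 10 ms stride, Hann windowing, and log compression are assumptions:

```python
import numpy as np

def linear_spectrogram(samples, sample_rate=16000, window_ms=20.0, stride_ms=10.0, eps=1e-14):
    """Compute a linear spectrogram from FFT energy (illustrative only)."""
    window = int(sample_rate * window_ms / 1000)   # samples per frame
    stride = int(sample_rate * stride_ms / 1000)   # hop between frames
    num_frames = 1 + (len(samples) - window) // stride
    # Slice the waveform into overlapping frames and apply a Hann window.
    frames = np.stack([samples[i * stride:i * stride + window] for i in range(num_frames)])
    frames = frames * np.hanning(window)
    # FFT energy per frequency bin, then log-compress.
    fft = np.fft.rfft(frames, axis=1)
    energy = np.abs(fft) ** 2
    return np.log(energy + eps).T  # shape: (freq_bins, num_frames)

if __name__ == "__main__":
    audio = np.random.randn(16000).astype("float32")  # 1 second of dummy audio
    print(linear_spectrogram(audio).shape)  # (161, 99) for these settings
```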

## Environment Requirements

- Anaconda 3
- Python 3.7
- PaddlePaddle 2.1.3
- Windows 10 or Ubuntu 18.04

## Installation

### 1. Install PaddlePaddle 2.1.3 (GPU Version)

If not already installed, run:

```shell script
conda install paddlepaddle-gpu==2.1.3 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
```

### 2. Install Project Dependencies

Execute the following command to install all required packages:

```shell script
python -m pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
```

### Fix LLVM Version Error (If Encountered)

If you encounter an LLVM version error, run the following commands to build LLVM 9 from source and point `LLVM_CONFIG` at the new build:

```shell script
cd ~
wget https://releases.llvm.org/9.0.0/llvm-9.0.0.src.tar.xz
wget https://releases.llvm.org/9.0.0/cfe-9.0.0.src.tar.xz
wget https://releases.llvm.org/9.0.0/clang-tools-extra-9.0.0.src.tar.xz
tar xvf llvm-9.0.0.src.tar.xz
tar xvf cfe-9.0.0.src.tar.xz
tar xvf clang-tools-extra-9.0.0.src.tar.xz
mv llvm-9.0.0.src llvm-src
mv cfe-9.0.0.src llvm-src/tools/clang
mv clang-tools-extra-9.0.0.src llvm-src/tools/clang/tools/extra
sudo mkdir -p /usr/local/llvm
mkdir -p llvm-src/build
cd llvm-src/build
cmake -G "Unix Makefiles" -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/usr/local/llvm" ..
make -j8
sudo make install
export LLVM_CONFIG=/usr/local/llvm/bin/llvm-config
```

Then re-run the pip install command above.

### 3. Install Beam Search Decoder (Optional)

The beam search decoder is supported on Linux only. Install the prebuilt wheel:

```shell script
cd decoders
pip3 install swig_decoders-1.2-cp37-cp37m-linux_x86_64.whl
```

Note: If the installation fails, compile the `ctc_decoders` library manually (Ubuntu only):

```shell script
cd decoders
sh setup.sh
```

### 4. Download Language Model

The beam search decoder requires a language model. Download it and place it in the `lm` directory:
```shell script
cd PPASR/
mkdir lm
cd lm
wget https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm
```
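
Optionally, you can sanity-check the downloaded model from Python. This is a minimal sketch assuming the `kenlm` package (`pip install kenlm`, installed separately; it is not listed in this project's requirements) and a character-level model that scores space-separated characters:

```python
import kenlm

# Load the downloaded Chinese language model.
model = kenlm.Model("lm/zh_giga.no_cna_cmn.prune01244.klm")

# KenLM scores space-separated tokens; a sensible sentence should score
# higher (a less negative log10 probability) than a scrambled one.
print(model.score("我 需 要 把 空 调 温 度 调 到 二 十 度", bos=True, eos=True))
print(model.score("度 十 二 到 调 度 温 调 空 把 要 需 我", bos=True, eos=True))
```
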
## Data Preparation

### 1. Download Public Datasets
The project provides download scripts for three large Mandarin Chinese datasets: Aishell, Free ST-Chinese-Mandarin-Corpus, and THCHS-30 (over 28 GB in total). Run the following to download them:
```shell script
python3 download_data/aishell.py
python3 download_data/free_st_chinese_mandarin_corpus.py
python3 download_data/thchs_30.py
```

(Optional: Train with a single dataset for faster testing.)

### 2. Custom Dataset Format

If using a custom dataset, structure it as follows:
- Place audio files in `dataset/audio/`.
- Create a data list file in `dataset/annotation/` with lines formatted as:

```
dataset/audio/wav/0175/H0175A0171.wav 我需要把空调温度调到二十度
dataset/audio/wav/0175/H0175A0377.wav 出彩中国人
...
```

(Ensure each transcript contains only Chinese characters: no punctuation, digits, or Latin letters. A quick way to check this is sketched below.)
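
Below is a minimal sketch for validating an annotation file before training: it checks that each referenced audio file exists and that each transcript is pure Chinese. The file name `dataset/annotation/train.txt` and the single-space delimiter are assumptions; adjust them to match your list.

```python
import os
import re

# Matches one or more CJK unified ideographs and nothing else.
CHINESE_ONLY = re.compile(r"^[\u4e00-\u9fff]+$")

# Hypothetical file name; point this at your actual annotation file.
with open("dataset/annotation/train.txt", encoding="utf-8") as f:
    for line_num, line in enumerate(f, start=1):
        path, _, text = line.strip().partition(" ")
        if not os.path.exists(path):
            print(f"line {line_num}: missing audio file {path}")
        if not CHINESE_ONLY.match(text):
            print(f"line {line_num}: transcript is not pure Chinese: {text!r}")
```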

### 3. Generate Data Lists and Vocabulary

Run the following command to create the data lists, the vocabulary file, and the normalization statistics:
```shell script
python3 create_data.py
```

This generates:
- `manifest.train`/`manifest.test`: Data lists with audio paths, lengths, and labels.
- `vocabulary.txt`: Character dictionary (labels for each character).
- `mean_std.npz`: Mean and standard deviation for data normalization (usage sketched below).
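
As a rough illustration of how the normalization statistics are meant to be used, the sketch below loads `mean_std.npz` and standardizes a feature matrix. The array key names `mean` and `std` are assumptions; check `stats.files` to confirm what the file actually contains:

```python
import numpy as np

stats = np.load("dataset/mean_std.npz")
print(stats.files)  # confirm the actual array names first

# Assumed key names; adjust to whatever print(stats.files) shows.
mean, std = stats["mean"], stats["std"]

def normalize(features, eps=1e-20):
    """Standardize a spectrogram feature matrix to zero mean, unit variance."""
    return (features - mean) / (std + eps)
```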

**Sample Output**:

```
----------- Configuration Arguments -----------
annotation_path: dataset/annotation/
count_threshold: 0
is_change_frame_rate: True
manifest_path: dataset/manifest.train
manifest_prefix: dataset/
num_samples: -1
num_workers: 8
output_path: ./dataset/mean_std.npz
vocab_path: dataset/vocabulary.txt

Generating data list...
100%|███████████████████████| 13388/13388 [00:09<00:00, 1454.08it/s]
Finished generating data list; total dataset length is 34.16 hours!
```

## Model Training

### 1. Start Training
Run the training script. For multi-GPU training, use:
```shell script
# Single GPU
python3 train.py

# Multi GPU (e.g., use GPUs 0 and 1)
python -m paddle.distributed.launch --gpus '0,1' train.py
```

**Key Parameters**:
- `--gpus`: Specify GPU IDs (e.g., `--gpus='0,1'`).
- `--pretrained_model`: Load a pre-trained model for transfer learning.
- `--resume_model`: Resume training from a checkpoint.

**Sample Training Output**:

```
[2021-09-17 10:46:03.117764] Training data: 13254
............
[2021-09-17 08:41:16.135825] Train epoch: [24/50], batch: [5900/6349], loss: 3.84609, learning rate: 0.00000688, eta: 10:38:40
[2021-09-17 08:41:38.698795] Train epoch: [24/50], batch: [6000/6349], loss: 0.92967, learning rate: 0.00000688, eta: 8:42:11
...
```

### 2. Visualize Training with VisualDL

Monitor training progress using VisualDL:

```shell script
visualdl --logdir=log --host 0.0.0.0
```

Access http://localhost:8040 in your browser to view training metrics.

## Evaluation

Evaluate model performance using character error rate (CER):
```shell script
python3 eval.py --model_path=models/epoch_50/
```

**Key Parameters**:
- `--decoder`: Choose `ctc_greedy` (greedy, sketched below) or `ctc_beam_search` (beam search, requires the language model).
- `--beam_size`: Set beam width for beam search (default: 10).
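
To make the `ctc_greedy` option concrete, here is a minimal sketch of greedy CTC decoding: take the most probable token per frame, collapse repeats, and drop blanks. It illustrates the principle only, not the project's decoder; the blank index and toy vocabulary are assumptions:

```python
import numpy as np

def ctc_greedy_decode(probs, vocabulary, blank_index=0):
    """probs: (num_frames, vocab_size) per-frame token probabilities."""
    best = np.argmax(probs, axis=1)             # best token per frame
    collapsed = [t for i, t in enumerate(best)  # collapse repeated tokens
                 if i == 0 or t != best[i - 1]]
    tokens = [t for t in collapsed if t != blank_index]  # drop CTC blanks
    return "".join(vocabulary[t] for t in tokens)

# Toy example with a 3-entry vocabulary; index 0 is the CTC blank.
vocab = ["<blank>", "你", "好"]
frames = np.array([[0.1, 0.8, 0.1],    # 你
                   [0.1, 0.7, 0.2],    # 你 (repeat, collapsed)
                   [0.9, 0.05, 0.05],  # blank
                   [0.1, 0.1, 0.8]])   # 好
print(ctc_greedy_decode(frames, vocab))  # 你好
```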

## Model Export

Export the trained model for inference:
```shell script
python export_model.py --resume=models/epoch_50/
```

## Inference

Use the exported model to transcribe speech:

```shell script
python3 infer.py --audio_path=dataset/test.wav
```

**Key Parameters**:
- `--audio_path`: Path to the audio file to recognize.
- `--decoder`: Choose the decoding method (`ctc_greedy` or `ctc_beam_search`).
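
If you want to drive the exported model from your own code rather than through `infer.py`, the following is a minimal sketch using the `paddle.inference` API. The exported file paths are hypothetical; use whatever `export_model.py` actually writes:

```python
from paddle.inference import Config, create_predictor

# Hypothetical export paths; check what export_model.py actually writes.
config = Config("models/infer/inference.pdmodel", "models/infer/inference.pdiparams")
config.disable_gpu()  # or config.enable_use_gpu(500, 0) to run on GPU 0
predictor = create_predictor(config)

# Discover the model's input tensors instead of hard-coding their names.
print(predictor.get_input_names())

# Typical flow once the input names and feature shapes are known:
#   handle = predictor.get_input_handle(name)
#   handle.copy_from_cpu(features.astype("float32"))  # features: np.ndarray
#   predictor.run()
#   output = predictor.get_output_handle(predictor.get_output_names()[0]).copy_to_cpu()
```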

## Model Downloads

| Dataset | Conv Layers | RNN Layers | RNN Size | Test CER | Download Link |
|---|---|---|---|---|---|
| Aishell (179 hours) | 2 | 3 | 1024 | 0.083327 | Download |
| Free ST-Chinese-Mandarin-Corpus (109 hours) | 2 | 3 | 1024 | 0.143291 | Download |
| THCHS-30 (34 hours) | 2 | 3 | 1024 | 0.047665 | Download |

Note: The provided models are training checkpoints. For prediction, you must first export the model using `export_model.py` and use the beam search decoder.


For parameter details, run `python train.py --help` or `python eval.py --help`.
