# PPASR Speech Recognition (Beginner Level)
This project is divided into three stages: beginner, intermediate, and advanced. We are currently at the beginner level. As the level increases, recognition accuracy improves, making the models more suitable for real-world projects. Please stay tuned!
PPASR is an end-to-end automatic speech recognition system implemented based on PaddlePaddle 2. Its key feature is simplicity. While maintaining a reasonable accuracy, the project is designed to be accessible, allowing every developer interested in speech recognition to get started easily. PPASR uses only convolutional neural networks without complex special network structures. The model is straightforward and end-to-end, requiring no audio alignment because it uses CTC Loss as the loss function.
In traditional speech recognition models, audio alignment between text and speech is typically required before training, which is time-consuming. After alignment, the predicted labels are only partial results, requiring post-processing to obtain the final output. To address this, Connectionist Temporal Classification (CTC) was developed. CTC eliminates the need for audio alignment; it takes complete speech data as input and outputs the entire sequence result, similar to OCR.
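As a hedged illustration of how CTC training works (this is not the project's actual code; the shapes, class count, and names below are illustrative), PaddlePaddle 2 exposes CTC loss as `paddle.nn.CTCLoss`:

```python
import paddle

# Illustrative shapes: 50 time steps, batch of 2, 30 output classes
# (blank = index 0). Not taken from this project's configuration.
log_probs = paddle.nn.functional.log_softmax(paddle.randn([50, 2, 30]), axis=-1)
labels = paddle.randint(low=1, high=30, shape=[2, 10], dtype='int32')
input_lengths = paddle.to_tensor([50, 50], dtype='int64')
label_lengths = paddle.to_tensor([10, 8], dtype='int64')

# CTC aligns the 50-step network output to the shorter label sequence
# internally, so no manual audio-text alignment is needed.
ctc_loss = paddle.nn.CTCLoss(blank=0, reduction='mean')
loss = ctc_loss(log_probs, labels, input_lengths, label_lengths)
```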
## Data Preprocessing
The project primarily processes audio using Mel Frequency Cepstral Coefficients (MFCCs). Audio files are read with `librosa.load(wav_path, sr=16000)`, and features are then extracted with `librosa.feature.mfcc()`.
MFCCs are calculated through the following steps: pre-emphasis, framing, windowing, Fast Fourier Transform (FFT), Mel filter banks, and Discrete Cosine Transform (DCT). These steps extract speech features while reducing the computational dimensionality. All audio files in this project have a sampling rate of 16000 Hz; if your audio has a different sampling rate, the `create_manifest.py` script will convert it to 16000 Hz.
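A minimal sketch of this pipeline with librosa (the file path and the `n_mfcc` value are illustrative, not taken from the project's configuration):

```python
import librosa

# Load the waveform at the project's fixed 16 kHz sampling rate.
audio, sr = librosa.load('dataset/test.wav', sr=16000)

# Extract MFCC features; n_mfcc=40 is an assumed value here.
# Result shape: (n_mfcc, num_frames).
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)
print(mfccs.shape)
```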
GitHub Repository: https://github.com/yeyupiaoling/PPASR/tree/beginner
## Online Run
Project Address: https://aistudio.baidu.com/aistudio/projectdetail/1597936
## Installation
- This project can run on Windows or Ubuntu. Installation is simple with the following command:
```shell script
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
```
## Data Preparation
- The `data` directory contains scripts for downloading public datasets and creating training data lists and dictionaries. This project provides three public Chinese Mandarin speech datasets: Aishell, Free ST-Chinese-Mandarin-Corpus, and THCHS-30, totaling over 28 GB. To download all three:
```shell script
python3 data/aishell.py
python3 data/free_st_chinese_mandarin_corpus.py
python3 data/thchs_30.py
```
- If you have your own dataset, you can use it for training, either alone or combined with the datasets above. Custom audio data must follow this format:
1. Audio files should be placed in the `dataset/audio/` directory (e.g., a `wav` folder containing all audio files).
2. Data list files should be in the `dataset/annotation/` directory. Each line in the file should contain the relative path to the audio file and its corresponding Chinese text, separated by a tab (`\t`). The text must contain only pure Chinese characters, with no punctuation, numbers, or English letters (a validation sketch follows the example below).
```shell script
dataset/audio/wav/0175/H0175A0171.wav 我需要把空调温度调到二十度
dataset/audio/wav/0175/H0175A0377.wav 出彩中国人
dataset/audio/wav/0175/H0175A0470.wav 据克而瑞研究中心监测
dataset/audio/wav/0175/H0175A0180.wav 把温度加大到十八
```
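A quick way to sanity-check this format before training; the annotation filename and the helper below are hypothetical, not part of the project:

```python
import re

# Hypothetical helper: check that each annotation line is
# "relative/path.wav<TAB>pure Chinese text" as required above.
def check_annotation_line(line: str) -> bool:
    parts = line.rstrip('\n').split('\t')
    if len(parts) != 2:
        return False
    path, text = parts
    # Only CJK unified ideographs: no punctuation, digits, or letters.
    return path.endswith('.wav') and re.fullmatch(r'[\u4e00-\u9fff]+', text) is not None

with open('dataset/annotation/train.txt', encoding='utf-8') as f:  # hypothetical filename
    for i, line in enumerate(f, 1):
        if not check_annotation_line(line):
            print(f'Bad line {i}: {line!r}')
```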
- Run the following command to create the data lists, build a vocabulary (data dictionary), and save the results in the `dataset/` directory:
```shell script
python3 create_manifest.py
```
This command processes the data, calculates the mean and standard deviation for data normalization, and generates necessary files. Key parameters include:
- `is_change_frame_rate`: Whether to convert audio to 16000 Hz (default: `True`); see the resampling sketch after this list.
- `min_duration` and `max_duration`: Limit audio length to prevent GPU memory issues.
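A hedged sketch of what the frame-rate conversion amounts to (the file path is illustrative; `librosa` resamples on load and `soundfile` writes the result back):

```python
import librosa
import soundfile as sf

# Re-read the audio at 16000 Hz and overwrite the original file.
wav_path = 'dataset/audio/wav/0175/H0175A0171.wav'  # illustrative path
audio, sr = librosa.load(wav_path, sr=16000)  # librosa resamples on load
sf.write(wav_path, audio, 16000)
```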
**Output Example:**
```shell
----------- Configuration Arguments -----------
annotation_path: dataset/annotation/
count_threshold: 0
is_change_frame_rate: True
manifest_path: dataset/manifest.train
manifest_prefix: dataset/
max_duration: 20
min_duration: 0
vocab_path: dataset/zh_vocab.json
------------------------------------------------
Generating data list...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141600/141600 [00:17<00:00, 8321.22it/s]
Data list generated; the total dataset length is 178.97 hours!
Generating data dictionary...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 140184/140184 [00:01<00:00, 89476.12it/s]
Data dictionary generated!
Sampling 1% of the data to compute the mean and standard deviation...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 140184/140184 [01:33<00:00, 1507.15it/s]
[Very important]: mean: -3.146301, std: 52.998405. Update the training parameters with these two values!
```
For parameter details, run:
```shell
python create_manifest.py --help
```
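The printed mean and standard deviation are used for input normalization. A minimal sketch of how such statistics are typically applied (not the project's exact code; the function name is illustrative):

```python
import numpy as np

# Values printed by create_manifest.py above.
DATA_MEAN = -3.146301
DATA_STD = 52.998405

def normalize(features: np.ndarray) -> np.ndarray:
    # Standardize MFCC features with the dataset-wide statistics.
    return (features - DATA_MEAN) / DATA_STD
```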
## Model Training
- Execute the training script. The model is saved in the `models/` directory, and greedy decoding is used for testing. Multi-GPU training is supported; set `CUDA_VISIBLE_DEVICES` to select specific GPUs.
```shell script
CUDA_VISIBLE_DEVICES=0,1 python3 train.py
```
**Key Parameters:**
- `data_mean` and `data_std`: Must match the values calculated during data preparation; see the example launch after this list.
- `pretrained_model`: Optional path to a pre-trained model (requires matching vocabulary).
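For example, using the statistics printed during data preparation (the `--data_mean` and `--data_std` flag names are inferred from the configuration printout below; verify them with `python train.py --help`):

```shell script
CUDA_VISIBLE_DEVICES=0,1 python3 train.py --data_mean=-3.146301 --data_std=52.998405
```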
**Training Output Example:**
```shell
----------- Configuration Arguments -----------
batch_size: 32
data_mean: -3.146301
data_std: 52.998405
dataset_vocab: dataset/zh_vocab.json
learning_rate: 0.001
num_epoch: 200
num_workers: 8
pretrained_model: None
save_model: models/
test_manifest: dataset/manifest.test
train_manifest: dataset/manifest.train
I0303 16:55:39.645823 16572 nccl_context.cc:189] init nccl context nranks: 2 local rank: 0 gpu id: 0 ring id: 0
…
Epoch 0: ExponentialDecay set learning rate to 0.001.
[2021-03-03 16:56:01.754491] Train epoch 0, batch 0, loss: 269.343811
[2021-03-03 16:58:08.436214] Train epoch 0, batch 100, loss: 7.195621
…
```
For parameter details, run:
```shell
python train.py --help
```
## Evaluation and Prediction
- **Evaluation:** Use `eval.py` to evaluate model performance via Character Error Rate (CER). Greedy decoding is used.
```shell script
python3 eval.py --model_path=models/step_final/
```
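CER counts the character-level edits (insertions, deletions, substitutions) needed to turn the prediction into the reference, divided by the reference length. A self-contained sketch, not the project's own implementation:

```python
def cer(pred: str, ref: str) -> float:
    # Levenshtein distance between prediction and reference,
    # computed with a single rolling row of the DP table.
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(cer('出彩中国人', '出彩中国人'))  # 0.0
```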
- **Prediction:** Use `infer.py` to predict speech recognition results by providing an audio file path.
```shell script
python3 infer.py --audio_path=./dataset/test.wav
```
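Both evaluation and prediction use greedy (best-path) decoding: take the most likely class at each frame, collapse consecutive repeats, then drop blanks. A minimal sketch (the blank index and vocabulary layout are illustrative):

```python
import itertools
import numpy as np

def greedy_decode(probs: np.ndarray, vocabulary: list, blank: int = 0) -> str:
    """probs: (time_steps, num_classes) per-frame output probabilities."""
    best_path = np.argmax(probs, axis=1)                      # best class per frame
    collapsed = [k for k, _ in itertools.groupby(best_path)]  # merge repeats
    return ''.join(vocabulary[k] for k in collapsed if k != blank)
```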
## Model Download
| Dataset | CER | Download Link |
|---|---|---|
| AISHELL | 0.151082 | Download |
| free_st_chinese_mandarin_corpus | 0.214240 | Download |
| thchs30 | 0.081742 | Download |
## VisualDL for Training Monitoring
To visualize training results, run:
```shell
visualdl --logdir=log --host 0.0.0.0
```
Then access http://localhost:8040 in your browser.
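The `log` directory is populated during training via VisualDL's `LogWriter`; a minimal sketch of how scalars end up there (the tag name and values are illustrative):

```python
from visualdl import LogWriter

# Write a training-loss curve that the visualdl command above can display.
with LogWriter(logdir='log') as writer:
    for step, loss in enumerate([269.3, 7.2, 3.1]):  # illustrative values
        writer.add_scalar(tag='train/loss', step=step, value=loss)
```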