# PPASR Speech Recognition (Beginner Level)
This project is divided into three stages: beginner, intermediate, and advanced. We are currently at the beginner level. As the level increases, recognition accuracy improves, making the models more suitable for real-world projects. Please stay tuned!
PPASR is an end-to-end automatic speech recognition system implemented based on PaddlePaddle 2. Its key feature is simplicity. While maintaining a reasonable accuracy, the project is designed to be accessible, allowing every developer interested in speech recognition to get started easily. PPASR uses only convolutional neural networks without complex special network structures. The model is straightforward and end-to-end, requiring no audio alignment because it uses CTC Loss as the loss function.
In traditional speech recognition models, audio alignment between text and speech is typically required before training, which is time-consuming. After alignment, the predicted labels are only partial results, requiring post-processing to obtain the final output. To address this, Connectionist Temporal Classification (CTC) was developed. CTC eliminates the need for audio alignment; it takes complete speech data as input and outputs the entire sequence result, similar to OCR.
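As a hedged illustration of how CTC training works (this is not the project's actual code; the shapes, class count, and names below are illustrative), PaddlePaddle 2 exposes CTC loss as `paddle.nn.CTCLoss`:

```python
import paddle

# Illustrative shapes: 50 time steps, batch of 2, 30 output classes
# (blank = index 0). Not taken from this project's configuration.
log_probs = paddle.nn.functional.log_softmax(paddle.randn([50, 2, 30]), axis=-1)
labels = paddle.randint(low=1, high=30, shape=[2, 10], dtype='int32')
input_lengths = paddle.to_tensor([50, 50], dtype='int64')
label_lengths = paddle.to_tensor([10, 8], dtype='int64')

# CTC aligns the 50-step network output to the shorter label sequence
# internally, so no manual audio-text alignment is needed.
ctc_loss = paddle.nn.CTCLoss(blank=0, reduction='mean')
loss = ctc_loss(log_probs, labels, input_lengths, label_lengths)
```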
## Data Preprocessing
The project primarily processes audio using Mel Frequency Cepstral Coefficients (MFCCs). Audio files are read with `librosa.load(wav_path, sr=16000)`, and features are then extracted with `librosa.feature.mfcc()`.
MFCCs are calculated through the following steps: pre-emphasis, framing, windowing, Fast Fourier Transform (FFT), Mel filter banks, and Discrete Cosine Transform (DCT). These steps extract speech features while reducing the computational dimensionality. All audio files in this project have a sampling rate of 16000 Hz; if your audio has a different sampling rate, the `create_manifest.py` script will convert it to 16000 Hz.
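A minimal sketch of this pipeline with librosa (the file path and the `n_mfcc` value are illustrative, not taken from the project's configuration):

```python
import librosa

# Load the waveform at the project's fixed 16 kHz sampling rate.
audio, sr = librosa.load('dataset/test.wav', sr=16000)

# Extract MFCC features; n_mfcc=40 is an assumed value here.
# Result shape: (n_mfcc, num_frames).
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)
print(mfccs.shape)
```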
GitHub Repository: https://github.com/yeyupiaoling/PPASR/tree/beginner
## Online Run
Project Address: https://aistudio.baidu.com/aistudio/projectdetail/1597936
## Installation
- This project can run on Windows or Ubuntu. Installation is simple with the following command:
```shell script
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
```
## Data Preparation
- The `data` directory contains scripts for downloading public datasets and creating training data lists and dictionaries. This project provides three public Chinese Mandarin speech datasets: Aishell, Free ST-Chinese-Mandarin-Corpus, and THCHS-30, totaling over 28 GB. To download all three:
```shell script
python3 data/aishell.py
python3 data/free_st_chinese_mandarin_corpus.py
python3 data/thchs_30.py
```
- If you have your own dataset, you can use it for training, either alone or combined with the datasets above. Custom audio data must follow this format:
1. Audio files should be placed in the `dataset/audio/` directory (e.g., a `wav` folder containing all audio files).
2. Data list files should be in the `dataset/annotation/` directory. Each line in the file should contain the relative path to the audio file and its corresponding Chinese text, separated by a tab (`\t`). The text must contain only pure Chinese characters, with no punctuation, numbers, or English letters (a validation sketch follows the example below).
```shell script
dataset/audio/wav/0175/H0175A0171.wav 我需要把空调温度调到二十度
dataset/audio/wav/0175/H0175A0377.wav 出彩中国人
dataset/audio/wav/0175/H0175A0470.wav 据克而瑞研究中心监测
dataset/audio/wav/0175/H0175A0180.wav 把温度加大到十八
```
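A quick way to sanity-check this format before training; the annotation filename and the helper below are hypothetical, not part of the project:

```python
import re

# Hypothetical helper: check that each annotation line is
# "relative/path.wav<TAB>pure Chinese text" as required above.
def check_annotation_line(line: str) -> bool:
    parts = line.rstrip('\n').split('\t')
    if len(parts) != 2:
        return False
    path, text = parts
    # Only CJK unified ideographs: no punctuation, digits, or letters.
    return path.endswith('.wav') and re.fullmatch(r'[\u4e00-\u9fff]+', text) is not None

with open('dataset/annotation/train.txt', encoding='utf-8') as f:  # hypothetical filename
    for i, line in enumerate(f, 1):
        if not check_annotation_line(line):
            print(f'Bad line {i}: {line!r}')
```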
- Run the following command to create the data lists, build a vocabulary (data dictionary), and save the results in the `dataset/` directory:
```shell script
python3 create_manifest.py
```
This command processes the data, calculates the mean and standard deviation for data normalization, and generates necessary files. Key parameters include:
- `is_change_frame_rate`: Whether to convert audio to 16000 Hz (default: `True`); see the resampling sketch after this list.
- `min_duration` and `max_duration`: Limit audio length to prevent GPU memory issues.
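A hedged sketch of what the frame-rate conversion amounts to (the file path is illustrative; `librosa` resamples on load and `soundfile` writes the result back):

```python
import librosa
import soundfile as sf

# Re-read the audio at 16000 Hz and overwrite the original file.
wav_path = 'dataset/audio/wav/0175/H0175A0171.wav'  # illustrative path
audio, sr = librosa.load(wav_path, sr=16000)  # librosa resamples on load
sf.write(wav_path, audio, 16000)
```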
**Output Example:**
```shell
----------- Configuration Arguments -----------
annotation_path: dataset/annotation/
count_threshold: 0
is_change_frame_rate: True
manifest_path: dataset/manifest.train
manifest_prefix: dataset/
max_duration: 20
min_duration: 0
vocab_path: dataset/zh_vocab.json
------------------------------------------------
Generating data list...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141600/141600 [00:17<00:00, 8321.22it/s]
Data list generated; the total dataset length is 178.97 hours!
Generating data dictionary...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 140184/140184 [00:01<00:00, 89476.12it/s]
Data dictionary generated!
Sampling 1% of the data to compute the mean and standard deviation...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 140184/140184 [01:33<00:00, 1507.15it/s]
[Very important]: mean: -3.146301, std: 52.998405. Update the training parameters with these two values!
```
For parameter details, run:
```shell
python create_manifest.py --help
```
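The printed mean and standard deviation are used for input normalization. A minimal sketch of how such statistics are typically applied (not the project's exact code; the function name is illustrative):

```python
import numpy as np

# Values printed by create_manifest.py above.
DATA_MEAN = -3.146301
DATA_STD = 52.998405

def normalize(features: np.ndarray) -> np.ndarray:
    # Standardize MFCC features with the dataset-wide statistics.
    return (features - DATA_MEAN) / DATA_STD
```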
## Model Training
- Execute the training script. The model is saved in the `models/` directory, and greedy decoding is used for testing. Multi-GPU training is supported; set `CUDA_VISIBLE_DEVICES` to select specific GPUs.
```shell script
CUDA_VISIBLE_DEVICES=0,1 python3 train.py
```
**Key Parameters:**
- `data_mean` and `data_std`: Must match the values calculated during data preparation; see the example launch after this list.
- `pretrained_model`: Optional path to a pre-trained model (requires matching vocabulary).
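For example, using the statistics printed during data preparation (the `--data_mean` and `--data_std` flag names are inferred from the configuration printout below; verify them with `python train.py --help`):

```shell script
CUDA_VISIBLE_DEVICES=0,1 python3 train.py --data_mean=-3.146301 --data_std=52.998405
```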
**Training Output Example:**
```shell
----------- Configuration Arguments -----------
batch_size: 32
data_mean: -3.146301
data_std: 52.998405
dataset_vocab: dataset/zh_vocab.json
learning_rate: 0.001
num_epoch: 200
num_workers: 8
pretrained_model: None
save_model: models/
test_manifest: dataset/manifest.test
train_manifest: dataset/manifest.train
I0303 16:55:39.645823 16572 nccl_context.cc:189] init nccl context nranks: 2 local rank: 0 gpu id: 0 ring id: 0
…
Epoch 0: ExponentialDecay set learning rate to 0.001.
[2021-03-03 16:56:01.754491] Train epoch 0, batch 0, loss: 269.343811
[2021-03-03 16:58:08.436214] Train epoch 0, batch 100, loss: 7.195621
…
```
For parameter details, run:
```shell
python train.py --help
```
## Evaluation and Prediction
- **Evaluation:** Use `eval.py` to evaluate model performance via Character Error Rate (CER). Greedy decoding is used.
```shell script
python3 eval.py --model_path=models/step_final/
```
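CER counts the character-level edits (insertions, deletions, substitutions) needed to turn the prediction into the reference, divided by the reference length. A self-contained sketch, not the project's own implementation:

```python
def cer(pred: str, ref: str) -> float:
    # Levenshtein distance between prediction and reference,
    # computed with a single rolling row of the DP table.
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(cer('出彩中国人', '出彩中国人'))  # 0.0
```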
- **Prediction:** Use `infer.py` to predict speech recognition results by providing an audio file path.
```shell script
python3 infer.py --audio_path=./dataset/test.wav
```
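Both evaluation and prediction use greedy (best-path) decoding: take the most likely class at each frame, collapse consecutive repeats, then drop blanks. A minimal sketch (the blank index and vocabulary layout are illustrative):

```python
import itertools
import numpy as np

def greedy_decode(probs: np.ndarray, vocabulary: list, blank: int = 0) -> str:
    """probs: (time_steps, num_classes) per-frame output probabilities."""
    best_path = np.argmax(probs, axis=1)                      # best class per frame
    collapsed = [k for k, _ in itertools.groupby(best_path)]  # merge repeats
    return ''.join(vocabulary[k] for k in collapsed if k != blank)
```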
## Model Download
| Dataset | CER | Download Link |
|---|---|---|
| AISHELL | 0.151082 | Download |
| free_st_chinese_mandarin_corpus | 0.214240 | Download |
| thchs30 | 0.081742 | Download |
## VisualDL for Training Monitoring
To visualize training results, run:
```shell
visualdl --logdir=log --host 0.0.0.0
```
Then access http://localhost:8040 in your browser.
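The `log` directory is populated during training via VisualDL's `LogWriter`; a minimal sketch of how scalars end up there (the tag name and values are illustrative):

```python
from visualdl import LogWriter

# Write a training-loss curve that the visualdl command above can display.
with LogWriter(logdir='log') as writer:
    for step, loss in enumerate([269.3, 7.2, 3.1]):  # illustrative values
        writer.add_scalar(tag='train/loss', step=step, value=loss)
```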