# MASR: A Speech Recognition Framework for Simple and Practical ASR Projects

## Introduction

MASR is a speech recognition framework built on PyTorch, dedicated to simple and practical ASR projects. It can be deployed on servers and Nvidia Jetson devices, and support for mobile devices such as Android is planned.

Source Code: https://github.com/yeyupiaoling/MASR

Environment Used:
- Anaconda 3
- Python 3.7
- PyTorch 1.10.0
- Windows 10 or Ubuntu 18.04

## Model Download

| Dataset | Model | Preprocessing | Language | Word Error Rate (WER) | Download Link |
|---|---|---|---|---|---|
| aishell (179 hours) | deepspeech2 | linear | Chinese | 0.06346 | Download |
| free_st_chinese_mandarin_corpus (109 hours) | deepspeech2 | linear | Chinese | 0.13941 | Download |
| thchs_30 (34 hours) | deepspeech2 | linear | Chinese | 0.06751 | Download |
| Ultra-large dataset (1600+ hours real + 1300+ hours synthetic) | deepspeech2 | linear | Chinese | 0.06215 | Download |
| Ultra-large dataset (1600+ hours real + 1300+ hours synthetic) | deepspeech2_big | linear | Chinese | 0.05517 | Star the project first, then Download |
| Librispeech (960 hours) | deepspeech2 | linear | English | 0.12842 | Download |

Notes:
1. The word error rate was calculated using the eval.py program with the ctc_beam_search decoding method.
2. The downloaded compressed file contains mean_std.npz and vocabulary.txt. Extract all files and copy them to the project root directory.
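
If you want to sanity-check the extracted files, a short NumPy snippet can list what they contain; the array names stored in mean_std.npz differ between versions, so the sketch below simply prints whatever is there.

```python
import numpy as np

# List the arrays stored in the normalization file (names differ across versions).
stats = np.load("mean_std.npz")
for name in stats.files:
    print(name, stats[name].shape)

# Count the vocabulary entries used by the model.
with open("vocabulary.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]
print("vocabulary size:", len(vocab))
```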

Feel free to open an issue for questions or discussions.

## Quick Inference

- Download a provided model or train one yourself, then export it as described in Model Export. Use `infer_path.py` to transcribe an audio file by specifying its path with `--wav_path`. For details, see Model Deployment.

```shell script
python infer_path.py --wav_path=./dataset/test.wav
```

**Output:**

----------- Configuration Arguments -----------
alpha: 1.2
beam_size: 10
beta: 0.35
cutoff_prob: 1.0
cutoff_top_n: 40
decoding_method: ctc_greedy
enable_mkldnn: False
is_long_audio: False
lang_model_path: ./lm/zh_giga.no_cna_cmn.prune01244.klm
mean_std_path: ./dataset/mean_std.npz
model_dir: ./models/infer/
to_an: True
use_gpu: True
use_tensorrt: False
vocab_path: ./dataset/zh_vocab.txt
wav_path: ./dataset/test.wav


Time taken: 82, Recognition Result: 近几年不但我用书给女儿儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书, Score: 94
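
The released Mandarin models are trained on 16 kHz, 16-bit, mono WAV audio, so it is worth checking an unfamiliar recording before inference; a minimal check with Python's standard wave module (assuming an uncompressed PCM WAV) might look like this:

```python
import wave

# Verify the file matches the 16 kHz / 16-bit / mono format used by the training data.
with wave.open("./dataset/test.wav", "rb") as w:
    rate = w.getframerate()
    channels = w.getnchannels()
    width = w.getsampwidth()          # bytes per sample, 2 == 16-bit
    duration = w.getnframes() / rate

print(f"{rate} Hz, {channels} channel(s), {8 * width}-bit, {duration:.2f} s")
if (rate, channels, width) != (16000, 1, 2):
    print("Convert the file first, e.g. with ffmpeg or sox, before running inference.")
```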

## Data Preparation

1. The `download_data` directory contains scripts to download public datasets and generate training data lists and vocabularies. The project provides downloads for three Chinese Mandarin datasets: Aishell, Free ST-Chinese-Mandarin-Corpus, and THCHS-30 (total size >28G). Run the following to download:
```shell script
cd download_data/
python aishell.py
python free_st_chinese_mandarin_corpus.py
python thchs_30.py
python noise.py
```

Note: The above scripts only work on Linux. For Windows, manually download the files using a download manager and modify the download() function in the script to use the local file path.

2. For custom datasets:
    - Place the audio files in `dataset/audio/`.
    - Create a data list file in `dataset/annotation/`, with each line in the format `relative_audio_path\ttext` (a helper sketch for generating such a list appears at the end of this section).
    - Example data list:
      ```
      dataset/audio/wav/0175/H0175A0171.wav   我需要把空调温度调到二十度
      dataset/audio/wav/0175/H0175A0377.wav   出彩中国人
      ```
3. Run the data processing script:
    ```shell script
    python create_data.py
    ```

This generates manifest.test, manifest.train, mean_std.npz, and vocabulary.txt in the dataset/ directory.
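
If your own corpus keeps one transcript per utterance, a small helper can generate the tab-separated annotation list described above. The directory layout, transcript format, and file names below are assumptions for illustration; adapt the pairing logic to your data.

```python
import os

# Build dataset/annotation/custom.txt from a folder of WAVs plus a transcripts file.
# Assumed transcript format, one utterance per line: "<wav_name_without_extension> <text>".
AUDIO_DIR = "dataset/audio/custom"
TRANS_FILE = "dataset/audio/custom/transcripts.txt"
OUT_FILE = "dataset/annotation/custom.txt"

texts = {}
with open(TRANS_FILE, encoding="utf-8") as f:
    for line in f:
        name, text = line.strip().split(maxsplit=1)
        texts[name] = text

os.makedirs(os.path.dirname(OUT_FILE), exist_ok=True)
with open(OUT_FILE, "w", encoding="utf-8") as out:
    for root, _, files in os.walk(AUDIO_DIR):
        for fname in sorted(files):
            if not fname.endswith(".wav"):
                continue
            key = os.path.splitext(fname)[0]
            if key in texts:
                # Each line: relative audio path, a tab, then the transcript.
                out.write(f"{os.path.join(root, fname)}\t{texts[key]}\n")
```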

## Model Training

- Data preparation: make sure `create_data.py` has run successfully and the required files have been generated in `dataset/`.

- Training commands:

```shell script
# Single GPU Training
CUDA_VISIBLE_DEVICES=0 python train.py

# Multi-GPU Training
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py
```

Training Output:

-----------  Configuration Arguments -----------
alpha: 2.2
augment_conf_path: conf/augmentation.json
batch_size: 32
beam_size: 300
beta: 4.3
cutoff_prob: 0.99
cutoff_top_n: 40
dataset_vocab: dataset/vocabulary.txt
decoder: ctc_greedy
lang_model_path: lm/zh_giga.no_cna_cmn.prune01244.klm
learning_rate: 5e-05
max_duration: 20
mean_std_path: dataset/mean_std.npz
min_duration: 0.5
num_epoch: 65
num_proc_bsearch: 10
num_workers: 8
pretrained_model: None
resume_model: None
save_model_path: models/
test_manifest: dataset/manifest.test
train_manifest: dataset/manifest.train
use_model: deepspeech2
------------------------------------------------
[2021-09-17 08:41:16.135825] Train epoch: [24/50], batch: [5900/6349], loss: 3.84609, learning rate: 0.00000688, eta: 10:38:40
...
[2021-09-17 08:43:07.817434] Test epoch: 24, time/epoch: 0:24:30.756875, loss: 6.90274, cer: 0.15213
- VisualDL: track training progress with VisualDL:

```shell script
visualdl --logdir=log --host=0.0.0.0
```

Access http://localhost:8040 in your browser to view the dashboard.
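
train.py already writes these logs for you; purely to illustrate what ends up in the log directory, this is roughly how scalars are recorded with VisualDL's Python API (the tag names and values below are made up):

```python
from visualdl import LogWriter

# Write a few example metrics so `visualdl --logdir=log` has something to plot.
with LogWriter(logdir="log") as writer:
    for step, (loss, cer) in enumerate([(6.9, 0.152), (6.1, 0.139), (5.4, 0.128)]):
        writer.add_scalar(tag="train/loss", step=step, value=loss)
        writer.add_scalar(tag="eval/cer", step=step, value=cer)
```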

## Model Evaluation

Evaluate model performance using the word error rate (WER):

```shell script
python eval.py --resume_model=models/deepspeech2/best_model
```

Evaluation Output:

-----------  Configuration Arguments -----------
alpha: 2.2
batch_size: 32
beam_size: 300
beta: 4.3
cutoff_prob: 0.99
cutoff_top_n: 40
dataset_vocab: dataset/vocabulary.txt
decoder: ctc_beam_search
lang_model_path: lm/zh_giga.no_cna_cmn.prune01244.klm
mean_std_path: dataset/mean_std.npz
num_proc_bsearch: 10
num_workers: 8
resume_model: models/deepspeech2/best_model/
test_manifest: dataset/manifest.test
use_model: deepspeech2
------------------------------------------------
100%|██████████████████████████████| 45/45 [00:09<00:00,  4.50it/s]
Time taken: 10s, WER: 0.095808
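
The error rate reported above is the edit distance between the decoded text and the reference transcript, normalized by the reference length; for the Chinese models the comparison is done per character. A self-contained sketch of that calculation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance computed with a single rolling row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(reference, hypothesis):
    # Error rate: edit distance normalized by reference length.
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

print(cer("近几年不但我用书给女儿压岁", "近几年不但我用输给女儿压岁"))  # one substitution -> ~0.077
```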

## Model Export

Export trained model parameters to an inference model:

```shell script
python export_model.py --resume_model=models/deepspeech2/epoch_50/
```

Export Output:

-----------  Configuration Arguments -----------
dataset_vocab: dataset/vocabulary.txt
mean_std_path: dataset/mean_std.npz
resume_model: models/deepspeech2/epoch_50
save_model: models/deepspeech2/
use_model: deepspeech2
------------------------------------------------
[2021-09-18 10:23:47.022243] Successfully loaded model parameters: models/deepspeech2/epoch_50/model.pdparams
Inference model saved to: models/deepspeech2/infer
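
export_model.py performs this step for MASR. Purely to illustrate the general idea of freezing a trained PyTorch model into a self-contained inference artifact (this is not MASR's actual export code, and the layer sizes below are arbitrary):

```python
import torch

# Generic example: script a trained nn.Module and save it for inference-only use.
model = torch.nn.Sequential(
    torch.nn.Linear(161, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 4000),
)
model.eval()

scripted = torch.jit.script(model)            # freeze architecture and weights together
scripted.save("models/example_infer.pt")

restored = torch.jit.load("models/example_infer.pt")
with torch.no_grad():
    print(restored(torch.randn(1, 161)).shape)
```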

## Local Inference

Predict audio using the exported model:
```shell script
python infer_path.py --wav_path=./dataset/test.wav
```

**Output:**

----------- Configuration Arguments -----------
alpha: 2.2
beam_size: 300
beta: 4.3
cutoff_prob: 0.99
cutoff_top_n: 40
decoding_method: ctc_beam_search
is_long_audio: False
lang_model_path: ./lm/zh_giga.no_cna_cmn.prune01244.klm
model_dir: models/deepspeech2/infer/
to_an: True
use_gpu: True
use_model: deepspeech2
vocab_path: dataset/vocabulary.txt
wav_path: ./dataset/test.wav


Time taken: 82, Recognition Result: 近几年不但我用书给女儿儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书, Score: 94

## Long Audio Inference

For long audio files, use VAD to split and recognize:
```shell script
python infer_path.py --wav_path=./dataset/test_vad.wav --is_long_audio=True
```

Output:

-----------  Configuration Arguments -----------
alpha: 2.2
beam_size: 300
beta: 4.3
cutoff_prob: 0.99
cutoff_top_n: 40
decoding_method: ctc_greedy
is_long_audio: 1
lang_model_path: ./lm/zh_giga.no_cna_cmn.prune01244.klm
model_dir: ./models/deepspeech2/infer/
to_an: True
use_gpu: True
vocab_path: ./dataset/zh_vocab.txt
wav_path: dataset/test_vad.wav
------------------------------------------------
[Audio segment 1] Score: 70, Result: 记的12铺地补买上过了矛乱钻吃出满你都着们现上就只有1良解太穷了了臭力量紧不着还绑在大理达高的铁股上
...
[Final result] Time: 1587, Score: 79, Result: 近几年不但我用书给女儿儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书
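
For reference, splitting a long recording on silence can be done with the webrtcvad package. The sketch below cuts a 16 kHz, mono, 16-bit WAV into voiced spans that could then be recognized one by one; the frame size and aggressiveness are reasonable defaults, not MASR's exact settings.

```python
import wave
import webrtcvad

FRAME_MS = 30  # webrtcvad accepts 10, 20 or 30 ms frames

def voiced_segments(path, aggressiveness=2, max_silence_frames=10):
    """Yield (start_sec, end_sec) spans that contain speech."""
    vad = webrtcvad.Vad(aggressiveness)
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000 and w.getnchannels() == 1 and w.getsampwidth() == 2
        rate = w.getframerate()
        frame_bytes = int(rate * FRAME_MS / 1000) * 2   # 16-bit samples -> 2 bytes each
        pcm = w.readframes(w.getnframes())

    start, silence = None, 0
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        t = i / 2 / rate
        if vad.is_speech(pcm[i:i + frame_bytes], rate):
            silence = 0
            if start is None:
                start = t
        elif start is not None:
            silence += 1
            if silence > max_silence_frames:             # enough silence: close the segment
                yield start, t
                start, silence = None, 0
    if start is not None:
        yield start, len(pcm) / 2 / rate

for begin, end in voiced_segments("./dataset/test_vad.wav"):
    print(f"speech from {begin:.2f}s to {end:.2f}s")
```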

## Real-time Inference

Run real-time inference with the --real_time_demo flag:
```shell script
python infer_path.py --wav_path=./dataset/test.wav --real_time_demo=True
```

**Output:**

----------- Configuration Arguments -----------
alpha: 2.2
beam_size: 300
beta: 4.3
cutoff_prob: 0.99
cutoff_top_n: 40
decoding_method: ctc_beam_search
feature_method: linear
is_long_audio: False
lang_model_path: ./lm/zh_giga.no_cna_cmn.prune01244.klm
model_dir: models/deepspeech2/infer/
real_time_demo: True
to_an: False
use_gpu: True
use_model: deepspeech2
vocab_path: dataset/vocabulary.txt
wav_path: ./dataset/test.wav


[Real-time result]: Time: 19ms, Result: , Score: -15
[Real-time result]: Time: 22ms, Result: 近几年, Score: -8

[Final result]: Time: 163ms, Result: 近几年不但我用输给女儿压岁也劝说亲朋不要给女儿压岁钱而改送压岁书, Score: -2
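
The --real_time_demo mode feeds the recording to the recognizer in growing chunks to mimic a live audio stream. The sketch below shows that chunking pattern with a placeholder recognize() function; the function name and chunk size are illustrative, not MASR's API.

```python
import wave

CHUNK_SECONDS = 1.0  # how much new audio to append per step (illustrative value)

def recognize(pcm_bytes):
    # Placeholder for sending the accumulated buffer to the acoustic model.
    return f"<partial result over {len(pcm_bytes) // 2} samples>"

with wave.open("./dataset/test.wav", "rb") as w:
    chunk_frames = int(w.getframerate() * CHUNK_SECONDS)
    buffer = b""
    while True:
        chunk = w.readframes(chunk_frames)
        if not chunk:
            break
        buffer += chunk                  # accumulate everything heard "so far"
        print("partial:", recognize(buffer))

print("final:", recognize(buffer))
```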

## Web Deployment

Start a web service for HTTP-based speech recognition:
```shell script
python infer_server.py
```

Access http://localhost:5000 in your browser to use the web interface.
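
Once infer_server.py is running, other programs can call it over HTTP. The client sketch below uses the requests library and assumes the service exposes an upload route named /recognition that takes the audio file in a field called audio; check infer_server.py for the actual route and field name.

```python
import requests

# Hypothetical endpoint and form field; verify them against infer_server.py.
url = "http://localhost:5000/recognition"
with open("./dataset/test.wav", "rb") as f:
    response = requests.post(url, files={"audio": f})

print(response.status_code)
print(response.text)  # recognition result returned by the service
```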

## GUI Deployment

Run the GUI interface for audio selection and recognition:
```shell script
python infer_gui.py
```

A graphical interface will appear for audio upload and real-time recognition.
