WenetSpeech Dataset

A Mandarin speech dataset with over 10,000 hours of audio, used in this tutorial with the PPASR project.

The WenetSpeech dataset contains over 10,000 hours of Mandarin speech data, all sourced from YouTube and podcasts. OCR (optical character recognition) is used to transcribe the embedded subtitles of the YouTube videos, and ASR (automatic speech recognition) is used to transcribe the podcast recordings. To improve corpus quality, WenetSpeech employs a novel end-to-end label error detection method to further validate and filter the data.

  • All data is classified into three categories, as shown in the table below; a minimal sketch of this confidence rule follows these tables.

| Data Category    | Duration (hours) | Confidence  | Applicable Systems                   |
|------------------|------------------|-------------|--------------------------------------|
| Strongly Labeled | 10005            | >= 0.95     | Supervised training                  |
| Weakly Labeled   | 2478             | [0.6, 0.95] | Semi-supervised or noisy training    |
| Unlabeled        | 9952             | /           | Unsupervised training or pretraining |
| Total            | 22435            | /           | /                                    |
  • Domains, speaking styles, and scenarios of the strongly labeled data are divided into 10 groups, as shown below:

| Domain          | YouTube (hours) | Podcast (hours) | Total (hours) |
|-----------------|-----------------|-----------------|---------------|
| Audiobooks      | 0               | 250.9           | 250.9         |
| Live Commentary | 112.6           | 135.7           | 248.3         |
| Documentary     | 386.7           | 90.5            | 477.2         |
| Drama           | 4338.2          | 0               | 4338.2        |
| Interviews      | 324.2           | 614             | 938.2         |
| News            | 0               | 868             | 868           |
| Reading         | 0               | 1110.2          | 1110.2        |
| Discussion      | 204             | 90.7            | 294.7         |
| Variety Shows   | 603.3           | 224.5           | 827.8         |
| Others          | 144             | 507.5           | 651.5         |
| Total           | 6113            | 3892            | 10005         |
  • Three subsets (S, M, L) are provided for building ASR systems at different data scales:

| Training Subset | Confidence  | Duration (hours) |
|-----------------|-------------|------------------|
| L               | [0.95, 1.0] | 10005            |
| M               | 1.0         | 1000             |
| S               | 1.0         | 100              |
  • Evaluation and test data:

| Evaluation Set | Duration (hours) | Source   | Description                                            |
|----------------|------------------|----------|--------------------------------------------------------|
| DEV            | 20               | Internet | Development set for cross-validation during training   |
| TEST_NET       | 23               | Internet | Competition test set                                   |
| TEST_MEETING   | 15               | Meetings | Far-field, conversational, spontaneous meeting speech  |
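
The confidence thresholds above reduce to a simple filtering rule. The sketch below is hypothetical code, not part of the official WenetSpeech tooling; it maps a segment's label confidence to one of the three categories in the first table:

   # Hypothetical sketch of the category rule from the first table above.
   def categorize(confidence):
       """Map a segment's label confidence to its WenetSpeech category."""
       if confidence is None:
           return "unlabeled"           # no OCR/ASR label available
       if confidence >= 0.95:
           return "strongly_labeled"    # supervised training
       if confidence >= 0.6:
           return "weakly_labeled"      # semi-supervised or noisy training
       return "rejected"                # below 0.6; assumed filtered out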

1. Tutorial: Training an ASR Model with WenetSpeech

This tutorial shows how to train a speech recognition model using only the strongly labeled data, in three main steps:

  1. Download and Extract the Dataset
    Fill out the application form on the official website; you will then receive an email with download instructions. Run the three commands from the email to download and extract the dataset. Note: this step requires about 500 GB of disk space.

  2. Prepare the Dataset
    The downloaded data is unprocessed. Use the create_wenetspeech_data.py script in the tools directory to segment the long recordings and attach their transcripts; a simplified sketch of what this segmentation does appears after the steps. This step requires about 3 TB of disk space.

   cd tools/
   python create_wenetspeech_data.py --wenetspeech_json=/media/wenetspeech/WenetSpeech.json
  3. Create Training Data
    As in standard data preparation, run create_data.py in the project root directory to generate the data lists, vocabulary file, and mean/std file used for training; a rough sketch of these artifacts also follows the steps. After this step, you can start training the model (see Training Model for details).
   python create_data.py
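
With the steps complete, here is the promised sketch of step 2's segmentation. It assumes the published WenetSpeech.json schema (an "audios" list whose entries carry "path" and "segments" with "sid", "begin_time", "end_time", "text", and "confidence" fields); it is an illustration only, not the actual logic of create_wenetspeech_data.py:

   # Hypothetical illustration of step 2, not the real create_wenetspeech_data.py.
   import json
   import os

   import soundfile as sf

   def segment_strong(json_path, audio_root, out_dir, min_conf=0.95):
       """Cut long recordings into strongly labeled utterances."""
       with open(json_path, "r", encoding="utf-8") as f:
           meta = json.load(f)
       os.makedirs(out_dir, exist_ok=True)
       for audio in meta["audios"]:
           # Assumes the recording was converted to a format libsndfile can
           # read (the released audio is Opus-encoded).
           data, sr = sf.read(os.path.join(audio_root, audio["path"]))
           for seg in audio["segments"]:
               if seg.get("confidence", 0) < min_conf:
                   continue  # keep only strongly labeled segments
               clip = data[int(seg["begin_time"] * sr):int(seg["end_time"] * sr)]
               wav_path = os.path.join(out_dir, seg["sid"] + ".wav")
               sf.write(wav_path, clip, sr)
               with open(wav_path + ".txt", "w", encoding="utf-8") as t:
                   t.write(seg["text"])  # transcript paired with the clip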
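
And a rough picture of the artifacts step 3 produces: a character-level vocabulary built from the transcripts, plus per-dimension mean and standard deviation of the acoustic features. The exact file formats written by create_data.py may differ:

   # Hypothetical illustration of step 3's outputs; real formats may differ.
   import numpy as np

   def build_vocab(transcripts):
       """Character-level vocabulary: every character seen in training text."""
       chars = sorted({ch for text in transcripts for ch in text})
       return {ch: idx for idx, ch in enumerate(chars)}

   def feature_stats(feature_arrays):
       """Per-dimension mean/std over all frames, used to normalize features."""
       frames = np.concatenate(feature_arrays, axis=0)  # shape: (frames, dim)
       return frames.mean(axis=0), frames.std(axis=0)

   vocab = build_vocab(["你好世界", "语音识别"])            # toy transcripts
   mean, std = feature_stats([np.random.randn(100, 80),
                              np.random.randn(50, 80)])    # toy 80-dim features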

Project address: https://github.com/yeyupiaoling/PPASR

Xiaoye