Preface¶
This project is a sound classification project based on PaddlePaddle, aiming to recognize various environmental sounds, animal calls, and languages. The project provides multiple sound classification models, such as EcapaTdnn, PANNS, ResNetSE, CAMPPlus, and ERes2Net, to support different application scenarios. It also offers test reports on the commonly used UrbanSound8K dataset and examples of downloading and using some dialect datasets. Users can select suitable models and datasets according to their needs to achieve more accurate sound classification. The project has a wide range of application scenarios, including outdoor environmental monitoring, wildlife protection, and speech recognition, and users are encouraged to explore more scenarios to promote the development and application of sound classification technology.
Source code address: AudioClassification-PaddlePaddle
Preparation¶
- Anaconda 3
- Python 3.8
- PaddlePaddle 2.4.0
- Windows 10 or Ubuntu 18.04
Project Features¶
- Supported models: EcapaTdnn, PANNS, TDNN, Res2Net, ResNetSE, CAMPPlus, ERes2Net
- Supported pooling layers: AttentiveStatisticsPooling(ASP), SelfAttentivePooling(SAP), TemporalStatisticsPooling(TSP), TemporalAveragePooling(TAP)
- Supported preprocessing methods: MelSpectrogram, LogMelSpectrogram, Spectrogram, MFCC, Fbank
Model papers:
- EcapaTdnn: ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
- PANNS: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
- TDNN: Prediction of speech intelligibility with DNN-based performance measures
- Res2Net: Res2Net: A New Multi-scale Backbone Architecture
- ResNetSE: Squeeze-and-Excitation Networks
- CAMPPlus: CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking
- ERes2Net: An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification
Model Test Table¶
| Model | Params(M) | Preprocessing Method | Dataset | Number of Classes | Accuracy |
|---|---|---|---|---|---|
| CAMPPlus | 7.2 | Fbank | UrbanSound8K | 10 | 0.96590 |
| PANNS(CNN10) | 4.9 | Fbank | UrbanSound8K | 10 | 0.95454 |
| ResNetSE | 9.1 | Fbank | UrbanSound8K | 10 | 0.92219 |
| TDNN | 2.7 | Fbank | UrbanSound8K | 10 | 0.92045 |
| ERes2Net | 6.6 | Fbank | UrbanSound8K | 10 | 0.90909 |
| EcapaTdnn | 6.2 | Fbank | UrbanSound8K | 10 | 0.90503 |
| Res2Net | 5.6 | Fbank | UrbanSound8K | 10 | 0.85812 |
Installation Environment¶
- First, install the GPU version of PaddlePaddle. If you have already installed it, skip this step.
conda install paddlepaddle-gpu==2.4.0 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
- Install the ppacls library.
Install using pip with the following command:
python -m pip install ppacls -U -i https://pypi.tuna.tsinghua.edu.cn/simple
Source code installation is recommended as it ensures the latest code is used.
git clone https://github.com/yeyupiaoling/AudioClassification_PaddlePaddle.git
cd AudioClassification_PaddlePaddle
python setup.py install
Data Preparation¶
Generate data lists for subsequent reading; in each list line, audio_path is the path to an audio file. Users need to place the audio dataset in the dataset/audio directory in advance, with each folder containing the audio data of one category, and each clip longer than 3 seconds, e.g., dataset/audio/鸟鸣/…… (birdsong). The generated lists (train_list.txt and test_list.txt) are stored in the dataset directory in the format audio_path\tcategory_label, with the audio path and label separated by a tab. Users can modify the list-generation function in create_data.py according to how their data is stored.
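As a rough reference, here is a minimal sketch of such a list-generation function. The directory layout follows the description above, but the function name and the train/test split are illustrative assumptions rather than the project's exact code:

```python
import os

def create_custom_data_list(audio_root='dataset/audio',
                            train_list='dataset/train_list.txt',
                            test_list='dataset/test_list.txt'):
    """Walk audio_root (one subfolder per category) and write
    tab-separated audio_path/label lines."""
    classes = sorted(d for d in os.listdir(audio_root)
                     if os.path.isdir(os.path.join(audio_root, d)))
    with open(train_list, 'w', encoding='utf-8') as f_train, \
         open(test_list, 'w', encoding='utf-8') as f_test:
        for label, cls in enumerate(classes):
            cls_dir = os.path.join(audio_root, cls)
            for i, name in enumerate(sorted(os.listdir(cls_dir))):
                line = f'{os.path.join(cls_dir, name)}\t{label}\n'
                # Hold out every 10th clip for testing (arbitrary split).
                (f_test if i % 10 == 0 else f_train).write(line)
```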
Taking UrbanSound8K as an example: it is a widely used public dataset for urban environmental sound classification research, containing 10 categories: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. Dataset download address: UrbanSound8K.tar.gz. To use this dataset, download and extract it to the dataset directory, then adapt the data-list generation code accordingly.
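For UrbanSound8K the label can be taken straight from the file name: its second hyphen-separated field is the class ID, consistent with the example list further below (e.g., 104817-4-0-2.wav is class 4). A minimal sketch, with the function name and fold-based split being illustrative assumptions:

```python
import os

def create_urbansound8k_list(audio_root='dataset/UrbanSound8K/audio',
                             train_list='dataset/train_list.txt',
                             test_list='dataset/test_list.txt'):
    """UrbanSound8K files are named fsID-classID-occurrenceID-sliceID.wav
    and grouped into fold1..fold10; the classID field is the label."""
    with open(train_list, 'w', encoding='utf-8') as f_train, \
         open(test_list, 'w', encoding='utf-8') as f_test:
        for fold in sorted(os.listdir(audio_root)):
            fold_dir = os.path.join(audio_root, fold)
            if not os.path.isdir(fold_dir):
                continue
            for name in sorted(os.listdir(fold_dir)):
                if not name.endswith('.wav'):
                    continue
                label = name.split('-')[1]
                line = f'{os.path.join(fold_dir, name)}\t{label}\n'
                # Keep fold10 as the test split (arbitrary choice).
                (f_test if fold == 'fold10' else f_train).write(line)
```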
Execute create_data.py to generate the data lists. The code provides two generation paths, one for custom data and one for UrbanSound8K; enable whichever matches your dataset.
python create_data.py
The generated list will look like this, with the audio path followed by its corresponding label (starting from 0), separated by a tab:
dataset/UrbanSound8K/audio/fold2/104817-4-0-2.wav 4
dataset/UrbanSound8K/audio/fold9/105029-7-2-5.wav 7
dataset/UrbanSound8K/audio/fold3/107228-5-0-0.wav 5
dataset/UrbanSound8K/audio/fold4/109711-3-2-4.wav 3
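Any code that consumes these lists only needs to split each line on the tab character; a minimal sketch:

```python
def load_data_list(list_path='dataset/train_list.txt'):
    """Read a data list back into (audio_path, label) pairs."""
    samples = []
    with open(list_path, encoding='utf-8') as f:
        for line in f:
            path, label = line.rstrip('\n').split('\t')
            samples.append((path, int(label)))
    return samples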
Modify Preprocessing Method¶
By default, the configuration file uses the MelSpectrogram preprocessing method. To use another preprocessing method, modify the configuration file as follows; the specific values can be adjusted to your data. If you are unsure how to set the parameters, you can delete this part entirely and use the default values.
```
preprocess_conf:
  # Audio preprocessing method, supported: MelSpectrogram, LogMelSpectrogram, Spectrogram, MFCC, Fbank
  feature_method: 'MelSpectrogram'
  # Set API parameters. For more parameters, check the corresponding API. If unsure, you can delete this part and use default values.
  method_args:
    sample_rate: 16000
    n_fft: 1024
    hop_length: 320
    win_length: 1024
    f_min: 50.0
    f_max: 14000.0
    n_mels: 64
```
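To see what these parameters produce, the snippet below builds an extractor with the same values using PaddlePaddle's built-in paddle.audio.features.MelSpectrogram (available in Paddle 2.4) and runs it on dummy audio. Note that the Paddle API names the sample-rate argument sr; whether ppacls maps the config keys onto exactly this class is an assumption here.

```python
import paddle
from paddle.audio.features import MelSpectrogram

# Same settings as the config above; Paddle calls the sample rate `sr`.
feature_extractor = MelSpectrogram(
    sr=16000, n_fft=1024, hop_length=320, win_length=1024,
    f_min=50.0, f_max=14000.0, n_mels=64)

waveform = paddle.randn([1, 16000])  # 1 second of dummy audio, (batch, samples)
feats = feature_extractor(waveform)  # shape: (1, n_mels, num_frames)
print(feats.shape)
```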
Training¶
Then you can start training the model by executing train.py. Generally, the parameters in the configuration file need no changes, but the following should be adjusted to your actual dataset:
1. The number of categories dataset_conf.num_class, which varies by dataset. Set it according to your actual situation.
2. dataset_conf.batch_size: if GPU memory is insufficient, reduce this parameter (a sketch for adjusting both values programmatically follows this list).
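If you prefer to adjust these values in code rather than by hand, here is a minimal sketch using PyYAML; the config path is illustrative, and the key locations (dataset_conf.dataLoader.batch_size, model_conf.num_class) follow the configuration dump in the training log below.

```python
import yaml

# Load the training config, adjust the dataset-dependent values, write it back.
config_path = 'configs/ecapa_tdnn.yml'  # the config passed to train.py
with open(config_path, encoding='utf-8') as f:
    config = yaml.safe_load(f)

config['model_conf']['num_class'] = 10                   # match your dataset
config['dataset_conf']['dataLoader']['batch_size'] = 32  # shrink if memory is tight

with open(config_path, 'w', encoding='utf-8') as f:
    yaml.safe_dump(config, f, allow_unicode=True)
```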
# Single-card training
CUDA_VISIBLE_DEVICES=0 python train.py
# Multi-card training
python -m paddle.distributed.launch --gpus '0,1' train.py
Training output log:
```
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:14 - ----------- Additional Configuration Parameters -----------
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - configs: configs/ecapa_tdnn.yml
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - pretrained_model: None
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - resume_model: None
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - save_model_path: models/
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - use_gpu: True
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:17 - ------------------------------------------------
[2023-08-07 23:02:08.811036 INFO ] utils:print_arguments:19 - ----------- Configuration File Parameters -----------
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:22 - dataset_conf:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:25 - aug_conf:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - noise_aug_prob: 0.2
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - noise_dir: dataset/noise
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - speed_perturb: True
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - volume_aug_prob: 0.2
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - volume_perturb: False
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:25 - dataLoader:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - batch_size: 64
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - num_workers: 4
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - do_vad: False
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:25 - eval_conf:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - batch_size: 1
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - max_duration: 20
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - label_list_path: dataset/label_list.txt
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - max_duration: 3
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - min_duration: 0.5
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - sample_rate: 16000
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:25 - spec_aug_args:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - freq_mask_width: [0, 8]
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - time_mask_width: [0, 10]
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - target_dB: -20
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - test_list: dataset/test_list.txt
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - train_list: dataset/train_list.txt
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - use_dB_normalization: True
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - use_spec_aug: True
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:22 - model_conf:
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - num_class: 10
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - pooling_type: ASP
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:22 - optimizer_conf:
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - optimizer: Adam
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - scheduler: WarmupCosineSchedulerLR
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:25 - scheduler_args:
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:27 - learning_rate: 0.001
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:27 - min_lr: 1e-05
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:27 - warmup_epoch: 5
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - weight_decay: 1e-06
```