Preface¶
This project is a sound classification project based on PaddlePaddle, aiming to recognize various environmental sounds, animal calls, and languages. The project provides multiple sound classification models, such as EcapaTdnn, PANNS, ResNetSE, CAMPPlus, and ERes2Net, to support different application scenarios. It also offers test reports on the commonly used UrbanSound8K dataset and examples of downloading and using some dialect datasets. Users can select suitable models and datasets according to their needs to achieve more accurate sound classification. The project has a wide range of application scenarios, including outdoor environmental monitoring, wildlife protection, and speech recognition, and users are encouraged to explore more scenarios to promote the development and application of sound classification technology.
Source code address: AudioClassification-PaddlePaddle
Preparation¶
- Anaconda 3
- Python 3.8
- PaddlePaddle 2.4.0
- Windows 10 or Ubuntu 18.04
Project Features¶
- Supported models: EcapaTdnn, PANNS, TDNN, Res2Net, ResNetSE, CAMPPlus, ERes2Net
- Supported pooling layers: AttentiveStatisticsPooling(ASP), SelfAttentivePooling(SAP), TemporalStatisticsPooling(TSP), TemporalAveragePooling(TAP)
- Supported preprocessing methods: MelSpectrogram, LogMelSpectrogram, Spectrogram, MFCC, Fbank
Model papers:
- EcapaTdnn: ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
- PANNS: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
- TDNN: Prediction of speech intelligibility with DNN-based performance measures
- Res2Net: Res2Net: A New Multi-scale Backbone Architecture
- ResNetSE: Squeeze-and-Excitation Networks
- CAMPPlus: CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking
- ERes2Net: An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification
Model Test Table¶
| Model | Params(M) | Preprocessing Method | Dataset | Number of Classes | Accuracy |
|---|---|---|---|---|---|
| CAMPPlus | 7.2 | Fbank | UrbanSound8K | 10 | 0.96590 |
| PANNS(CNN10) | 4.9 | Fbank | UrbanSound8K | 10 | 0.95454 |
| ResNetSE | 9.1 | Fbank | UrbanSound8K | 10 | 0.92219 |
| TDNN | 2.7 | Fbank | UrbanSound8K | 10 | 0.92045 |
| ERes2Net | 6.6 | Fbank | UrbanSound8K | 10 | 0.90909 |
| EcapaTdnn | 6.2 | Fbank | UrbanSound8K | 10 | 0.90503 |
| Res2Net | 5.6 | Fbank | UrbanSound8K | 10 | 0.85812 |
Installation Environment¶
- First, install the GPU version of PaddlePaddle. If you have already installed it, skip this step.
conda install paddlepaddle-gpu==2.4.0 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
- Install the ppacls library.
Install using pip with the following command:
python -m pip install ppacls -U -i https://pypi.tuna.tsinghua.edu.cn/simple
Source code installation is recommended as it ensures the latest code is used.
git clone https://github.com/yeyupiaoling/AudioClassification_PaddlePaddle.git
cd AudioClassification_PaddlePaddle
python setup.py install
Data Preparation¶
Generate data lists for subsequent reading; in each list line, audio_path is the path to an audio file. Users need to place the audio dataset in the dataset/audio directory in advance, with each folder containing the audio data of one category, and each clip longer than 3 seconds, e.g., dataset/audio/鸟鸣/…… (birdsong). The generated lists (train_list.txt and test_list.txt) are stored in the dataset directory in the format audio_path\tcategory_label, with the audio path and label separated by a tab. Users can modify the list-generation function in create_data.py according to how their data is stored.
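As a rough reference, here is a minimal sketch of such a list-generation function. The directory layout follows the description above, but the function name and the train/test split are illustrative assumptions rather than the project's exact code:

```python
import os

def create_custom_data_list(audio_root='dataset/audio',
                            train_list='dataset/train_list.txt',
                            test_list='dataset/test_list.txt'):
    """Walk audio_root (one subfolder per category) and write
    tab-separated audio_path/label lines."""
    classes = sorted(d for d in os.listdir(audio_root)
                     if os.path.isdir(os.path.join(audio_root, d)))
    with open(train_list, 'w', encoding='utf-8') as f_train, \
         open(test_list, 'w', encoding='utf-8') as f_test:
        for label, cls in enumerate(classes):
            cls_dir = os.path.join(audio_root, cls)
            for i, name in enumerate(sorted(os.listdir(cls_dir))):
                line = f'{os.path.join(cls_dir, name)}\t{label}\n'
                # Hold out every 10th clip for testing (arbitrary split).
                (f_test if i % 10 == 0 else f_train).write(line)
```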
Taking UrbanSound8K as an example: it is a widely used public dataset for urban environmental sound classification research, containing 10 categories: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. Dataset download address: UrbanSound8K.tar.gz. To use this dataset, download and extract it to the dataset directory, then adapt the data-list generation code accordingly.
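For UrbanSound8K the label can be taken straight from the file name: its second hyphen-separated field is the class ID, consistent with the example list further below (e.g., 104817-4-0-2.wav is class 4). A minimal sketch, with the function name and fold-based split being illustrative assumptions:

```python
import os

def create_urbansound8k_list(audio_root='dataset/UrbanSound8K/audio',
                             train_list='dataset/train_list.txt',
                             test_list='dataset/test_list.txt'):
    """UrbanSound8K files are named fsID-classID-occurrenceID-sliceID.wav
    and grouped into fold1..fold10; the classID field is the label."""
    with open(train_list, 'w', encoding='utf-8') as f_train, \
         open(test_list, 'w', encoding='utf-8') as f_test:
        for fold in sorted(os.listdir(audio_root)):
            fold_dir = os.path.join(audio_root, fold)
            if not os.path.isdir(fold_dir):
                continue
            for name in sorted(os.listdir(fold_dir)):
                if not name.endswith('.wav'):
                    continue
                label = name.split('-')[1]
                line = f'{os.path.join(fold_dir, name)}\t{label}\n'
                # Keep fold10 as the test split (arbitrary choice).
                (f_test if fold == 'fold10' else f_train).write(line)
```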
Execute create_data.py to generate the data lists. The code provides two generation paths, one for custom data and one for UrbanSound8K; enable whichever matches your dataset.
python create_data.py
The generated list will look like this, with the audio path followed by its corresponding label (starting from 0), separated by a tab:
dataset/UrbanSound8K/audio/fold2/104817-4-0-2.wav 4
dataset/UrbanSound8K/audio/fold9/105029-7-2-5.wav 7
dataset/UrbanSound8K/audio/fold3/107228-5-0-0.wav 5
dataset/UrbanSound8K/audio/fold4/109711-3-2-4.wav 3
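Any code that consumes these lists only needs to split each line on the tab character; a minimal sketch:

```python
def load_data_list(list_path='dataset/train_list.txt'):
    """Read a data list back into (audio_path, label) pairs."""
    samples = []
    with open(list_path, encoding='utf-8') as f:
        for line in f:
            path, label = line.rstrip('\n').split('\t')
            samples.append((path, int(label)))
    return samples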
Modify Preprocessing Method¶
By default, the configuration file uses the MelSpectrogram preprocessing method. To use another preprocessing method, modify the configuration file as follows; the specific values can be adjusted to your data. If you are unsure how to set the parameters, you can delete this part entirely and use the default values.
```
preprocess_conf:
  # Audio preprocessing method, supported: MelSpectrogram, LogMelSpectrogram, Spectrogram, MFCC, Fbank
  feature_method: 'MelSpectrogram'
  # Set API parameters. For more parameters, check the corresponding API. If unsure, you can delete this part and use default values.
  method_args:
    sample_rate: 16000
    n_fft: 1024
    hop_length: 320
    win_length: 1024
    f_min: 50.0
    f_max: 14000.0
    n_mels: 64
```
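To see what these parameters produce, the snippet below builds an extractor with the same values using PaddlePaddle's built-in paddle.audio.features.MelSpectrogram (available in Paddle 2.4) and runs it on dummy audio. Note that the Paddle API names the sample-rate argument sr; whether ppacls maps the config keys onto exactly this class is an assumption here.

```python
import paddle
from paddle.audio.features import MelSpectrogram

# Same settings as the config above; Paddle calls the sample rate `sr`.
feature_extractor = MelSpectrogram(
    sr=16000, n_fft=1024, hop_length=320, win_length=1024,
    f_min=50.0, f_max=14000.0, n_mels=64)

waveform = paddle.randn([1, 16000])  # 1 second of dummy audio, (batch, samples)
feats = feature_extractor(waveform)  # shape: (1, n_mels, num_frames)
print(feats.shape)
```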
Training¶
Then you can start training the model by executing train.py. Generally, the parameters in the configuration file need no changes, but the following should be adjusted to your actual dataset:
1. The number of categories dataset_conf.num_class, which varies by dataset. Set it according to your actual situation.
2. dataset_conf.batch_size: if GPU memory is insufficient, reduce this parameter (a sketch for adjusting both values programmatically follows this list).
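If you prefer to adjust these values in code rather than by hand, here is a minimal sketch using PyYAML; the config path is illustrative, and the key locations (dataset_conf.dataLoader.batch_size, model_conf.num_class) follow the configuration dump in the training log below.

```python
import yaml

# Load the training config, adjust the dataset-dependent values, write it back.
config_path = 'configs/ecapa_tdnn.yml'  # the config passed to train.py
with open(config_path, encoding='utf-8') as f:
    config = yaml.safe_load(f)

config['model_conf']['num_class'] = 10                   # match your dataset
config['dataset_conf']['dataLoader']['batch_size'] = 32  # shrink if memory is tight

with open(config_path, 'w', encoding='utf-8') as f:
    yaml.safe_dump(config, f, allow_unicode=True)
```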
# Single-card training
CUDA_VISIBLE_DEVICES=0 python train.py
# Multi-card training
python -m paddle.distributed.launch --gpus '0,1' train.py
Training output log:
```
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:14 - ----------- Additional Configuration Parameters -----------
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - configs: configs/ecapa_tdnn.yml
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - pretrained_model: None
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - resume_model: None
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - save_model_path: models/
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:16 - use_gpu: True
[2023-08-07 23:02:08.807036 INFO ] utils:print_arguments:17 - ------------------------------------------------
[2023-08-07 23:02:08.811036 INFO ] utils:print_arguments:19 - ----------- Configuration File Parameters -----------
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:22 - dataset_conf:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:25 - aug_conf:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - noise_aug_prob: 0.2
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - noise_dir: dataset/noise
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - speed_perturb: True
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - volume_aug_prob: 0.2
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - volume_perturb: False
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:25 - dataLoader:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - batch_size: 64
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - num_workers: 4
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - do_vad: False
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:25 - eval_conf:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - batch_size: 1
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - max_duration: 20
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - label_list_path: dataset/label_list.txt
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - max_duration: 3
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - min_duration: 0.5
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - sample_rate: 16000
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:25 - spec_aug_args:
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - freq_mask_width: [0, 8]
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:27 - time_mask_width: [0, 10]
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - target_dB: -20
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - test_list: dataset/test_list.txt
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - train_list: dataset/train_list.txt
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - use_dB_normalization: True
[2023-08-07 23:02:08.812035 INFO ] utils:print_arguments:29 - use_spec_aug: True
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:22 - model_conf:
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - num_class: 10
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - pooling_type: ASP
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:22 - optimizer_conf:
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - optimizer: Adam
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - scheduler: WarmupCosineSchedulerLR
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:25 - scheduler_args:
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:27 - learning_rate: 0.001
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:27 - min_lr: 1e-05
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:27 - warmup_epoch: 5
[2023-08-07 23:02:08.816062 INFO ] utils:print_arguments:29 - weight_decay: 1e-06
```