Project Introduction¶
This is a speech emotion recognition project. Its accuracy is currently only average, and it is provided mainly for learning purposes. It will continue to be optimized to improve accuracy. If you have good suggestions, feel free to discuss them.
Source code: SpeechEmotionRecognition-PaddlePaddle (https://github.com/yeyupiaoling/SpeechEmotionRecognition-PaddlePaddle)
Preparation for Use¶
- Anaconda 3
- Python 3.8
- PaddlePaddle 2.4.0
- Windows 10 or Ubuntu 18.04
Model Test Table¶
| Model | Params(M) | Preprocessing Method | Dataset | Number of Classes | Accuracy |
|---|---|---|---|---|---|
| BidirectionalLSTM | 1.8 | Flank | RAVDESS | 8 | 0.95193 |
Note: Only the Audio_Speech_Actors_01-24.zip archive of the RAVDESS dataset is used.
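The 1.8 M parameter figure in the table can be cross-checked against the layer sizes printed in the model summary of the training log further down (a 312-dimensional input, a bidirectional LSTM with 256 hidden units per direction, and a 6-way output layer in that particular run). The sketch below is only an arithmetic check, not project code. It assumes the LSTM keeps two bias vectors per direction, which is the assumption that reproduces the logged count.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out


def bilstm_params(n_in, hidden):
    # Four gates per direction. Each direction has input weights, recurrent
    # weights, and (assumed here) two bias vectors of size 4 * hidden.
    per_direction = 4 * (hidden * n_in + hidden * hidden + 2 * hidden)
    return 2 * per_direction


layers = {
    "Linear-1": linear_params(312, 512),
    "LSTM-1": bilstm_params(512, 256),
    "Linear-2": linear_params(512, 256),
    "Linear-3": linear_params(256, 6),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"Total: {total:,}")  # Total: 1,870,086
```

The total of 1,870,086 parameters is roughly 1.87 M, which matches the 1.8 in the Params(M) column.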
Environment Installation¶
- First, install the GPU version of PaddlePaddle. If you have already installed it, skip this step.
conda install paddlepaddle-gpu==2.4.0 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
- Install the ppser library.
Install using pip with the following command:
python -m pip install ppser -U -i https://pypi.tuna.tsinghua.edu.cn/simple
Installing from source is recommended, because it ensures you use the latest code:
git clone https://github.com/yeyupiaoling/SpeechEmotionRecognition-PaddlePaddle.git
cd SpeechEmotionRecognition-PaddlePaddle/
python setup.py install
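After installation, a quick check can confirm that both packages are visible to the interpreter. The snippet below is a minimal sketch, not part of the project. It assumes the library is importable under the name ppser, and it only calls paddle.utils.run_check() when PaddlePaddle is actually installed.

```python
import importlib.util
import sys


def check_environment():
    """Return a small report about the local Python / PaddlePaddle setup."""
    return {
        "python": "%d.%d" % sys.version_info[:2],
        # find_spec returns None when a top-level package cannot be found.
        "paddle_installed": importlib.util.find_spec("paddle") is not None,
        # Assumption: the ppser package is imported as `ppser`.
        "ppser_installed": importlib.util.find_spec("ppser") is not None,
    }


if __name__ == "__main__":
    report = check_environment()
    print(report)
    if report["paddle_installed"]:
        import paddle

        print("paddle version:", paddle.__version__)
        # Built-in PaddlePaddle self-test; reports whether the GPU is usable.
        paddle.utils.run_check()
```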
Data Preparation¶
The project uses the RAVDESS dataset by default. According to the dataset's introduction page, it contains eight emotions: neutral, calm, happy, sad, angry, fear, disgust, and surprise. This project only uses the Audio_Speech_Actors_01-24.zip file. The recordings contain only two sentences, “Kids are talking by the door” and “Dogs are sitting by the door”, which makes the training set very simple. Download the dataset, extract it to the dataset directory, and then generate the data lists that are read in later steps:
python create_data.py
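RAVDESS encodes its metadata in the file name as seven two-digit fields, and the third field is the emotion code (01 to 08, in the order of the eight emotions listed above). The sketch below shows how a zero-based label can be read from a file name. It is an illustration only: the helper name ravdess_label is hypothetical, and whether create_data.py derives labels this way is an assumption, although the result agrees with the sample data list in this section.

```python
from pathlib import Path

# Emotion names in the order of RAVDESS emotion codes 01..08.
EMOTIONS = ["neutral", "calm", "happy", "sad", "angry", "fear", "disgust", "surprise"]


def ravdess_label(path):
    """Return (label_index, emotion_name) for a RAVDESS audio file path."""
    fields = Path(path).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected RAVDESS file name: {path}")
    code = int(fields[2])  # third field: emotion code, 1..8
    if not 1 <= code <= len(EMOTIONS):
        raise ValueError(f"unknown emotion code {code} in {path}")
    return code - 1, EMOTIONS[code - 1]


print(ravdess_label("dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-03-02-01-01-01.wav"))
# (2, 'happy')
```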
If you use a custom dataset, place the audio files in the dataset/audio directory in advance, with one folder per category, e.g., dataset/audio/angry/······. Each audio clip should be approximately 3 seconds long. Each line of the generated data list has the format audio_path\tlabel, where audio_path is the path of the audio file and label is its category label, separated by a tab \t. The data list is written to the dataset directory. You can modify the following function to match how your data is stored.
Execute the get_data_list('dataset/audios', 'dataset') function in create_data.py to generate the data list, and it will also generate a normalization file. Check the code for details.
python create_data.py
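For a folder-per-category layout like the one described above, building the list amounts to walking the class folders and writing one audio_path\tlabel line per file. The following is a minimal sketch under that assumption. build_data_list_sketch is a hypothetical helper, not the project's get_data_list, and it does not split the data into training and test lists or create the normalization file.

```python
import os
import tempfile


def build_data_list_sketch(audio_dir):
    """Return (lines, labels) for a directory with one sub-folder per class."""
    # Sorted folder names give a stable label index for each class.
    labels = sorted(
        d for d in os.listdir(audio_dir)
        if os.path.isdir(os.path.join(audio_dir, d))
    )
    lines = []
    for index, name in enumerate(labels):
        class_dir = os.path.join(audio_dir, name)
        for file_name in sorted(os.listdir(class_dir)):
            if file_name.lower().endswith(".wav"):
                # Use forward slashes so the list looks the same on Windows.
                path = os.path.join(class_dir, file_name).replace("\\", "/")
                lines.append(f"{path}\t{index}")
    return lines, labels


# Tiny demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    for label, name in [("angry", "a.wav"), ("happy", "b.wav")]:
        os.makedirs(os.path.join(root, label))
        open(os.path.join(root, label, name), "wb").close()
    demo_lines, demo_labels = build_data_list_sketch(root)
    print(demo_labels)
    for line in demo_lines:
        print(line)
```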
The generated list looks like this, with the audio path first, followed by its corresponding label (starting from 0), separated by a tab \t:
dataset/Audio_Speech_Actors_01-24/Actor_13/03-01-01-01-02-01-13.wav 0
dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-02-01-01-01-01.wav 1
dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-03-02-01-01-01.wav 2
Note: The create_standard('configs/bi_lstm.yml') function in create_data.py must be executed to generate the normalization file.
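The normalization file stores statistics that are used to scale features before they reach the model. The exact contents and format of the file that create_standard writes are not shown here, so the following is only a generic sketch of feature standardization (subtract the per-column mean, divide by the per-column standard deviation). The function names are hypothetical.

```python
import math


def fit_standardizer(rows):
    """Compute per-column mean and standard deviation from feature rows."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    stds = []
    for j in range(dim):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        # A constant column has zero deviation; fall back to 1.0 to avoid
        # dividing by zero later.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def standardize(row, means, stds):
    """Scale one feature row with previously fitted statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


features = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
means, stds = fit_standardizer(features)
print(means)
print(standardize(features[0], means, stds))
```

The statistics are fitted once on the training data and then reused unchanged for evaluation and prediction, which is why the file must exist before training starts.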
Training¶
Next, start training the model by running train.py. The parameters in the configuration file generally do not need to be changed, but the following should be adjusted to match your dataset:
The most important one is the number of classes, dataset_conf.num_class, which depends on the dataset, so set it to match your data. (In the configuration printed in the training log, num_class appears under model_conf.) If you run out of GPU memory, reduce dataset_conf.batch_size (printed under dataset_conf.dataLoader in the same log).
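When working with a custom dataset, the number of classes can be read off the generated data list instead of being counted by hand. The sketch below assumes the audio_path\tlabel line format described in the data preparation section and labels that run from 0 upward. count_classes is a hypothetical helper, not part of the project.

```python
def count_classes(list_lines):
    """Return the number of classes implied by tab-separated list lines."""
    labels = set()
    for line in list_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last tab so paths containing tabs do not break parsing.
        _, label = line.rsplit("\t", 1)
        labels.add(int(label))
    # Sanity check: labels are expected to be 0..N-1 without gaps.
    if labels != set(range(len(labels))):
        raise ValueError("labels should run from 0 to N-1 without gaps")
    return len(labels)


example = [
    "dataset/audio/angry/a.wav\t0",
    "dataset/audio/happy/b.wav\t1",
    "dataset/audio/happy/c.wav\t1",
]
print(count_classes(example))  # 2
```

In practice the lines would come from the training list, for example by reading dataset/train_list.txt line by line.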
# Single-card training
CUDA_VISIBLE_DEVICES=0 python train.py
# Multi-card training
python -m paddle.distributed.launch --gpus '0,1' train.py
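The default configuration printed in the training log uses a scheduler named WarmupCosineSchedulerLR with max_lr 0.001, min_lr 1e-05, warmup_epoch 5, and max_epoch 60. The project's exact formula is not shown here, so the following is only a rough illustration of the general idea: a linear warmup followed by cosine decay. Starting the warmup at min_lr and stepping per epoch are assumptions made for this sketch.

```python
import math


def warmup_cosine_lr(epoch, max_epoch=60, warmup_epoch=5, max_lr=1e-3, min_lr=1e-5):
    """Learning rate for a given (possibly fractional) epoch."""
    if epoch < warmup_epoch:
        # Linear ramp from min_lr up to max_lr (assumed starting point).
        return min_lr + (max_lr - min_lr) * epoch / warmup_epoch
    # Cosine decay from max_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epoch) / max(1, max_epoch - warmup_epoch)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for epoch in (0, 5, 30, 60):
    print(epoch, f"{warmup_cosine_lr(epoch):.6f}")
```

The warmup keeps early updates small while the optimizer statistics settle, and the cosine phase lowers the rate smoothly toward min_lr by the final epoch.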
Training Output Logs:¶
```
[2023-08-18 18:48:49.662963 INFO ] utils:print_arguments:14 - ----------- Additional Configuration Parameters -----------
[2023-08-18 18:48:49.662963 INFO ] utils:print_arguments:16 - configs: configs/bi_lstm.yml
[2023-08-18 18:48:49.662963 INFO ] utils:print_arguments:16 - local_rank: 0
[2023-08-18 18:48:49.662963 INFO ] utils:print_arguments:16 - pretrained_model: None
[2023-08-18 18:48:49.662963 INFO ] utils:print_arguments:16 - resume_model: None
[2023-08-18 18:48:49.662963 INFO ] utils:print_arguments:16 - save_model_path: models/
[2023-08-18 18:48:49.662963 INFO ] utils:print_arguments:16 - use_gpu: True
[2023-08-18 18:48:49.662963 INFO ] utils:print_arguments:17 - ------------------------------------------------
[2023-08-18 18:48:49.680176 INFO ] utils:print_arguments:19 - ----------- Configuration File Parameters -----------
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:22 - dataset_conf:
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:25 - aug_conf:
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - noise_aug_prob: 0.2
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - noise_dir: dataset/noise
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - speed_perturb: True
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - volume_aug_prob: 0.2
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - volume_perturb: False
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:25 - dataLoader:
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - batch_size: 32
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - num_workers: 4
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:29 - do_vad: False
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:25 - eval_conf:
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - batch_size: 1
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:27 - max_duration: 3
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:29 - label_list_path: dataset/label_list.txt
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:29 - max_duration: 3
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:29 - min_duration: 0.5
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:29 - sample_rate: 16000
[2023-08-18 18:48:49.681177 INFO ] utils:print_arguments:29 - scaler_path: dataset/standard.m
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:29 - target_dB: -20
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:29 - test_list: dataset/test_list.txt
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:29 - train_list: dataset/train_list.txt
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:29 - use_dB_normalization: True
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:22 - model_conf:
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:29 - num_class: None
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:22 - optimizer_conf:
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:29 - learning_rate: 0.001
[2023-08-18 18:48:49.682177 INFO ] utils:print_arguments:29 - optimizer: Adam
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:29 - scheduler: WarmupCosineSchedulerLR
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:25 - scheduler_args:
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:27 - max_lr: 0.001
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:27 - min_lr: 1e-05
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:27 - warmup_epoch: 5
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:29 - weight_decay: 1e-06
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:22 - preprocess_conf:
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:29 - feature_method: CustomFeatures
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:22 - train_conf:
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:29 - enable_amp: False
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:29 - log_interval: 10
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:29 - max_epoch: 60
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:31 - use_model: BidirectionalLSTM
[2023-08-18 18:48:49.683184 INFO ] utils:print_arguments:32 - ------------------------------------------------
[2023-08-18 18:48:49.683184 WARNING] trainer:init:66 - Multi-threaded data reading is not supported on Windows systems, which has been automatically disabled!
Layer (type)    Input Shape      Output Shape                                   Param #
Linear-1        [[1, 312]]       [1, 512]                                       160,256
LSTM-1          [[1, 1, 512]]    [[1, 1, 512], [[2, 1, 256], [2, 1, 256]]]      1,576,960
Tanh-1          [[1, 512]]       [1, 512]                                       0
Dropout-1       [[1, 512]]       [1, 512]                                       0
Linear-2        [[1, 512]]       [1, 256]                                       131,328
ReLU-1          [[1, 256]]       [1, 256]                                       0
Linear-3        [[1, 256]]       [1, 6]                                         1,542
================================================================================================
Total params: 1,870,086
Trainable params: 1,870,086
Non-trainable params: 0
Input size (MB): 0.00
Forward/backward pass size (MB): 0.03
Params size (MB): 7.13
Estimated Total Size (MB): 7.16
[2023-08-18 18:48:51.425936 INFO ] trainer