Project Introduction

This project implements speech emotion recognition in PyTorch, with support for multiple preprocessing methods and models.

Source code address: SpeechEmotionRecognition-Pytorch (https://github.com/yeyupiaoling/SpeechEmotionRecognition-Pytorch)

Preparation

  • Anaconda 3
  • Python 3.11
  • PyTorch 2.2.1
  • Windows 11 or Ubuntu 22.04

Model Test Table

Model      Params (M)  Preprocessing Method  Dataset         Classes  Accuracy
BiLSTM     2.10        Emotion2Vec           RAVDESS         8        0.85333
BiLSTM     1.87        CustomFeature         RAVDESS         8        0.68666
BaseModel  0.19        Emotion2Vec           RAVDESS         8        0.85333
BaseModel  0.08        CustomFeature         RAVDESS         8        0.68000
BiLSTM     2.10        Emotion2Vec           Larger Dataset  9        0.91826
BiLSTM     1.87        CustomFeature         Larger Dataset  9        0.90817
BaseModel  0.19        Emotion2Vec           Larger Dataset  9        0.92870
BaseModel  0.08        CustomFeature         Larger Dataset  9        0.91026

Notes:
1. The RAVDESS results use only Audio_Speech_Actors_01-24.zip.
2. The Larger Dataset contains nearly 25,000 audio samples with roughly balanced classes; its extracted features are also provided on the Knowledge Planet.

Environment Installation

  • First, install the GPU version of PyTorch. If already installed, skip this step.
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
  • Install the mser library.

Install using pip:

python -m pip install mser -U -i https://pypi.tuna.tsinghua.edu.cn/simple

For the latest code, installing from source is recommended:

git clone https://github.com/yeyupiaoling/SpeechEmotionRecognition-Pytorch.git
cd SpeechEmotionRecognition-Pytorch/
python setup.py install
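
After installation, a quick Python check confirms that the CUDA build of PyTorch and the mser package both import correctly (a minimal sketch; mser may not expose __version__, hence the getattr fallback):

# Sanity check: confirm the CUDA build of PyTorch and the mser package.
import torch
import mser

print(torch.__version__)                        # expect 2.2.1
print(torch.cuda.is_available())                # True on a working CUDA setup
print(getattr(mser, "__version__", "unknown"))  # version attribute may vary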

Quick Usage

To use it, simply pass the --use_ms_model=iic/emotion2vec_plus_base argument together with the audio path:

python infer.py --audio_path=dataset/test.wav --use_ms_model=iic/emotion2vec_plus_base

Output:

[2024-07-02 19:45:36.154355 INFO   ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_plus_base
Prediction result for audio: dataset/test.wav: angry, score: 1.0
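
To score many clips in one go, the documented CLI can be driven from a small Python loop; a minimal sketch, assuming the clips sit in a flat directory (the path here is illustrative):

# Batch inference: invoke the documented infer.py CLI for every WAV file
# in a folder. The folder path is illustrative.
import subprocess
from pathlib import Path

for wav in sorted(Path("dataset").glob("*.wav")):
    subprocess.run(
        ["python", "infer.py",
         f"--audio_path={wav}",
         "--use_ms_model=iic/emotion2vec_plus_base"],
        check=True,
    )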

Data Preparation

Generate a data list for the training code to read. The project uses the RAVDESS dataset by default. The dataset covers eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised. Only Audio_Speech_Actors_01-24.zip is used, which contains two spoken sentences: “Kids are talking by the door” and “Dogs are sitting by the door”.

Download the dataset and extract it to the dataset directory.

Then execute the create_ravdess_list('dataset/Audio_Speech_Actors_01-24', 'dataset') function in create_data.py to generate the data list and normalization file.

python create_data.py
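
The emotion label for each RAVDESS clip is encoded in the filename itself: the third hyphen-separated field is the emotion code under the official RAVDESS naming convention. A minimal illustration of that mapping (in the generated list shown further below, the class index appears to be the emotion code minus one):

# The third field of a RAVDESS filename is the emotion code,
# e.g. "03-01-05-01-02-01-13.wav" -> code "05" -> angry.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(name: str) -> str:
    code = name.split("-")[2]
    return EMOTIONS[code]

print(emotion_from_filename("03-01-05-01-02-01-13.wav"))  # angry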

For a custom dataset, use the same list format, where audio_path is the path to the audio file. Place the dataset under the dataset/audio directory, with each subfolder holding one category of audio data (clips of roughly 3 seconds). For example: dataset/audio/angry/······

Execute the get_data_list('dataset/audios', 'dataset') function in create_data.py to generate the data list.

python create_data.py

The generated list format:

dataset/Audio_Speech_Actors_01-24/Actor_13/03-01-01-01-02-01-13.wav 0
dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-02-01-01-01-01.wav 1
dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-03-02-01-01-01.wav 2
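
For a custom dataset, a list of the same shape can be built by walking the per-category folders. This is only a sketch of what get_data_list does; the separator and output file name in create_data.py may differ:

# Illustrative sketch of building a "path label" list from per-category
# folders; the real implementation is get_data_list() in create_data.py.
import os

def write_data_list(audio_root: str, list_path: str) -> None:
    classes = sorted(d for d in os.listdir(audio_root)
                     if os.path.isdir(os.path.join(audio_root, d)))
    with open(list_path, "w", encoding="utf-8") as f:
        for label, cls in enumerate(classes):
            cls_dir = os.path.join(audio_root, cls)
            for name in sorted(os.listdir(cls_dir)):
                if name.endswith(".wav"):
                    f.write(f"{os.path.join(cls_dir, name)}\t{label}\n")

write_data_list("dataset/audios", "dataset/train_list.txt")  # hypothetical output name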

Note: The create_standard('configs/bi_lstm.yml') function in create_data.py must be executed to generate the normalization file.
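
Conceptually, the normalization file stores per-dimension statistics of the training features so inputs can be standardized consistently at training and inference time. A minimal sketch of the idea; the array shapes and output path are illustrative, not the project's actual format:

# Sketch of feature normalization statistics: per-dimension mean and
# standard deviation, saved for reuse when standardizing features.
import numpy as np

features = np.random.randn(1000, 768).astype(np.float32)  # stand-in for real features
np.savez("dataset/standard.npz",                          # hypothetical file name
         mean=features.mean(axis=0),
         std=features.std(axis=0))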

Feature Extraction (Optional)

Feature extraction is optional, but pre-extracting features avoids recomputing them on every epoch and speeds up training noticeably (see the caching sketch after the steps below).

  1. Run extract_features.py to generate features and data lists:
python extract_features.py --configs=configs/bi_lstm.yml --save_dir=dataset/features
  2. Modify the configuration file to point dataset_conf.train_list and dataset_conf.test_list to the new feature lists.
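
The speedup comes from computing each feature once and reloading it from disk in every later epoch. A minimal caching pattern that illustrates the idea (the extractor here is a stub, not the project's real one):

# Caching pattern behind pre-extraction: compute a feature once, then
# reload the saved copy instead of recomputing it each epoch.
import numpy as np
from pathlib import Path

def extract(wav_path: str) -> np.ndarray:
    # Stub standing in for the real Emotion2Vec/CustomFeature extractor.
    return np.zeros(768, dtype=np.float32)

def cached_feature(wav_path: str, cache_dir: str = "dataset/features") -> np.ndarray:
    cache = Path(cache_dir) / (Path(wav_path).stem + ".npy")
    if cache.exists():
        return np.load(cache)            # fast path: reuse the saved feature
    feat = extract(wav_path)
    cache.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache, feat)
    return feat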

Training

Before starting training, adjust the parameters in the configuration file as needed, especially dataset_conf.num_class (the number of emotion classes).

# Single GPU training
CUDA_VISIBLE_DEVICES=0 python train.py

# Multi-GPU training
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py

Training log example:

[2024-02-03 15:09:26.178537 INFO   ] utils:print_arguments:29 - Preprocess method: Emotion2Vec
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
BiLSTM                                   [1, 8]                    --
├─Linear: 1-1                            [1, 512]                  393,728
├─LSTM: 1-2                              [1, 1, 512]               1,576,960
├─Tanh: 1-3                              [1, 512]                  --
├─Dropout: 1-4                           [1, 512]                  --
├─Linear: 1-5                            [1, 256]                  131,328
├─ReLU: 1-6                              [1, 256]                  --
├─Linear: 1-7                            [1, 8]                    2,056
==========================================================================================
Total params: 2,104,072
Trainable params: 2,104,072
Non-trainable params: 0
...

Evaluation

Evaluation is executed at the end of every training epoch; you can also run it standalone to compute accuracy and generate a confusion matrix plot.

python eval.py --configs=configs/bi_lstm.yml

Evaluation output:

[2024-02-03 15:13:25.469242 INFO   ] trainer:evaluate:461 - Successfully loaded model: models/BiLSTM_Emotion2Vec/best_model/model.pth
100%|██████████████████████████████| 150/150 [00:00<00:00, 1281.96it/s]
Evaluation time: 1s, loss: 0.61840, accuracy: 0.87333
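
The confusion matrix itself is just a tally of true versus predicted labels; a minimal scikit-learn illustration with made-up predictions:

# Confusion matrix in brief: rows are true labels, columns are predictions.
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 1, 0]  # illustrative ground-truth indices
y_pred = [0, 1, 2, 1, 1, 0]  # illustrative model outputs
print(confusion_matrix(y_true, y_pred))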

If the class labels are in Chinese, install a Chinese font so the plots render correctly:

# Ubuntu font installation
git clone https://github.com/tracyone/program_font && cd program_font && ./install.sh
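
matplotlib must also be pointed at a Chinese-capable font for the plot labels; a common configuration sketch (the font name is an assumption, substitute whichever font the script installed):

# Tell matplotlib to use a Chinese-capable font for plot labels.
import matplotlib.pyplot as plt

plt.rcParams["font.sans-serif"] = ["SimHei"]  # assumed font; adjust to the installed one
plt.rcParams["axes.unicode_minus"] = False    # keep minus signs rendering correctly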

Prediction

Predict with the trained model:

python infer.py --audio_path=dataset/test.wav

Output:

Successfully loaded model parameters: models/BiLSTM_Emotion2Vec/best_model/model.pth
[2024-07-02 19:48:42.864262 INFO   ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_base
Prediction result for audio: dataset/test.wav: angry, score: 0.99995

Emotion2vec Model Prediction

Use the provided Emotion2vec model from ModelScope:

python infer.py --audio_path=dataset/test.wav --use_ms_model=iic/emotion2vec_plus_base

Output:

[2024-07-02 19:45:36.154355 INFO   ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_plus_base
Prediction result for audio: dataset/test.wav: angry, score: 1.0

References

  1. SpeechEmotionRecognition-Pytorch
  2. AudioClassification-Pytorch
  3. FunASR