Project Introduction

This project implements speech emotion recognition in PyTorch, with support for multiple preprocessing methods and models.

Source code address: SpeechEmotionRecognition-Pytorch (https://github.com/yeyupiaoling/SpeechEmotionRecognition-Pytorch)

Preparation

  • Anaconda 3
  • Python 3.11
  • PyTorch 2.2.1
  • Windows 11 or Ubuntu 22.04

Model Test Table

Model      Params (M)  Preprocessing Method  Dataset         Classes  Accuracy
BiLSTM     2.10        Emotion2Vec           RAVDESS         8        0.85333
BiLSTM     1.87        CustomFeature         RAVDESS         8        0.68666
BaseModel  0.19        Emotion2Vec           RAVDESS         8        0.85333
BaseModel  0.08        CustomFeature         RAVDESS         8        0.68000
BiLSTM     2.10        Emotion2Vec           Larger Dataset  9        0.91826
BiLSTM     1.87        CustomFeature         Larger Dataset  9        0.90817
BaseModel  0.19        Emotion2Vec           Larger Dataset  9        0.92870
BaseModel  0.08        CustomFeature         Larger Dataset  9        0.91026

Notes:
1. The RAVDESS results use only Audio_Speech_Actors_01-24.zip.
2. The Larger Dataset contains nearly 25,000 audio samples with roughly balanced classes; its extracted features are also provided on the Knowledge Planet.

Environment Installation

  • First, install the GPU version of PyTorch. If already installed, skip this step.
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
  • Install the mser library.

Install using pip:

python -m pip install mser -U -i https://pypi.tuna.tsinghua.edu.cn/simple

For the latest code, installing from source is recommended:

git clone https://github.com/yeyupiaoling/SpeechEmotionRecognition-Pytorch.git
cd SpeechEmotionRecognition-Pytorch/
python setup.py install
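
After installation, a quick Python check confirms that the CUDA build of PyTorch and the mser package both import correctly (a minimal sketch; mser may not expose __version__, hence the getattr fallback):

# Sanity check: confirm the CUDA build of PyTorch and the mser package.
import torch
import mser

print(torch.__version__)                        # expect 2.2.1
print(torch.cuda.is_available())                # True on a working CUDA setup
print(getattr(mser, "__version__", "unknown"))  # version attribute may vary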

Quick Usage

To use it, simply pass the --use_ms_model=iic/emotion2vec_plus_base argument together with the audio path:

python infer.py --audio_path=dataset/test.wav --use_ms_model=iic/emotion2vec_plus_base

Output:

[2024-07-02 19:45:36.154355 INFO   ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_plus_base
Prediction result for audio: dataset/test.wav: angry, score: 1.0
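
To score many clips in one go, the documented CLI can be driven from a small Python loop; a minimal sketch, assuming the clips sit in a flat directory (the path here is illustrative):

# Batch inference: invoke the documented infer.py CLI for every WAV file
# in a folder. The folder path is illustrative.
import subprocess
from pathlib import Path

for wav in sorted(Path("dataset").glob("*.wav")):
    subprocess.run(
        ["python", "infer.py",
         f"--audio_path={wav}",
         "--use_ms_model=iic/emotion2vec_plus_base"],
        check=True,
    )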

Data Preparation

Generate a data list for the training code to read. The project uses the RAVDESS dataset by default. The dataset covers eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised. Only Audio_Speech_Actors_01-24.zip is used, which contains two spoken sentences: “Kids are talking by the door” and “Dogs are sitting by the door”.

Download the dataset and extract it to the dataset directory.

Then execute the create_ravdess_list('dataset/Audio_Speech_Actors_01-24', 'dataset') function in create_data.py to generate the data list and normalization file.

python create_data.py
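
The emotion label for each RAVDESS clip is encoded in the filename itself: the third hyphen-separated field is the emotion code under the official RAVDESS naming convention. A minimal illustration of that mapping (in the generated list shown further below, the class index appears to be the emotion code minus one):

# The third field of a RAVDESS filename is the emotion code,
# e.g. "03-01-05-01-02-01-13.wav" -> code "05" -> angry.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(name: str) -> str:
    code = name.split("-")[2]
    return EMOTIONS[code]

print(emotion_from_filename("03-01-05-01-02-01-13.wav"))  # angry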

For a custom dataset, use the same list format, where audio_path is the path to the audio file. Place the dataset under the dataset/audio directory, with each subfolder holding one category of audio data (clips of roughly 3 seconds). For example: dataset/audio/angry/······

Execute the get_data_list('dataset/audios', 'dataset') function in create_data.py to generate the data list.

python create_data.py

The generated list format:

dataset/Audio_Speech_Actors_01-24/Actor_13/03-01-01-01-02-01-13.wav 0
dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-02-01-01-01-01.wav 1
dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-03-02-01-01-01.wav 2
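
For a custom dataset, a list of the same shape can be built by walking the per-category folders. This is only a sketch of what get_data_list does; the separator and output file name in create_data.py may differ:

# Illustrative sketch of building a "path label" list from per-category
# folders; the real implementation is get_data_list() in create_data.py.
import os

def write_data_list(audio_root: str, list_path: str) -> None:
    classes = sorted(d for d in os.listdir(audio_root)
                     if os.path.isdir(os.path.join(audio_root, d)))
    with open(list_path, "w", encoding="utf-8") as f:
        for label, cls in enumerate(classes):
            cls_dir = os.path.join(audio_root, cls)
            for name in sorted(os.listdir(cls_dir)):
                if name.endswith(".wav"):
                    f.write(f"{os.path.join(cls_dir, name)}\t{label}\n")

write_data_list("dataset/audios", "dataset/train_list.txt")  # hypothetical output name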

Note: The create_standard('configs/bi_lstm.yml') function in create_data.py must be executed to generate the normalization file.
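
Conceptually, the normalization file stores per-dimension statistics of the training features so inputs can be standardized consistently at training and inference time. A minimal sketch of the idea; the array shapes and output path are illustrative, not the project's actual format:

# Sketch of feature normalization statistics: per-dimension mean and
# standard deviation, saved for reuse when standardizing features.
import numpy as np

features = np.random.randn(1000, 768).astype(np.float32)  # stand-in for real features
np.savez("dataset/standard.npz",                          # hypothetical file name
         mean=features.mean(axis=0),
         std=features.std(axis=0))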

Feature Extraction (Optional)

Feature extraction is optional, but pre-extracting features avoids recomputing them on every epoch and speeds up training noticeably (see the caching sketch after the steps below).

  1. Run extract_features.py to generate features and data lists:
python extract_features.py --configs=configs/bi_lstm.yml --save_dir=dataset/features
  2. Modify the configuration file to point dataset_conf.train_list and dataset_conf.test_list to the new feature lists.
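
The speedup comes from computing each feature once and reloading it from disk in every later epoch. A minimal caching pattern that illustrates the idea (the extractor here is a stub, not the project's real one):

# Caching pattern behind pre-extraction: compute a feature once, then
# reload the saved copy instead of recomputing it each epoch.
import numpy as np
from pathlib import Path

def extract(wav_path: str) -> np.ndarray:
    # Stub standing in for the real Emotion2Vec/CustomFeature extractor.
    return np.zeros(768, dtype=np.float32)

def cached_feature(wav_path: str, cache_dir: str = "dataset/features") -> np.ndarray:
    cache = Path(cache_dir) / (Path(wav_path).stem + ".npy")
    if cache.exists():
        return np.load(cache)            # fast path: reuse the saved feature
    feat = extract(wav_path)
    cache.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache, feat)
    return feat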

Training

Before starting training, adjust the parameters in the configuration file as needed, especially dataset_conf.num_class (the number of emotion classes).

# Single GPU training
CUDA_VISIBLE_DEVICES=0 python train.py

# Multi-GPU training
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py

Training log example:

[2024-02-03 15:09:26.178537 INFO   ] utils:print_arguments:29 - Preprocess method: Emotion2Vec
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
BiLSTM                                   [1, 8]                    --
├─Linear: 1-1                            [1, 512]                  393,728
├─LSTM: 1-2                              [1, 1, 512]               1,576,960
├─Tanh: 1-3                              [1, 512]                  --
├─Dropout: 1-4                           [1, 512]                  --
├─Linear: 1-5                            [1, 256]                  131,328
├─ReLU: 1-6                              [1, 256]                  --
├─Linear: 1-7                            [1, 8]                    2,056
==========================================================================================
Total params: 2,104,072
Trainable params: 2,104,072
Non-trainable params: 0
...

Evaluation

Evaluation is executed at the end of every training epoch; you can also run it standalone to compute accuracy and generate a confusion matrix plot.

python eval.py --configs=configs/bi_lstm.yml

Evaluation output:

[2024-02-03 15:13:25.469242 INFO   ] trainer:evaluate:461 - Successfully loaded model: models/BiLSTM_Emotion2Vec/best_model/model.pth
100%|██████████████████████████████| 150/150 [00:00<00:00, 1281.96it/s]
Evaluation time: 1s, loss: 0.61840, accuracy: 0.87333
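
The confusion matrix itself is just a tally of true versus predicted labels; a minimal scikit-learn illustration with made-up predictions:

# Confusion matrix in brief: rows are true labels, columns are predictions.
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 1, 0]  # illustrative ground-truth indices
y_pred = [0, 1, 2, 1, 1, 0]  # illustrative model outputs
print(confusion_matrix(y_true, y_pred))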

If the class labels are in Chinese, install a Chinese font so the plots render correctly:

# Ubuntu font installation
git clone https://github.com/tracyone/program_font && cd program_font && ./install.sh
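
matplotlib must also be pointed at a Chinese-capable font for the plot labels; a common configuration sketch (the font name is an assumption, substitute whichever font the script installed):

# Tell matplotlib to use a Chinese-capable font for plot labels.
import matplotlib.pyplot as plt

plt.rcParams["font.sans-serif"] = ["SimHei"]  # assumed font; adjust to the installed one
plt.rcParams["axes.unicode_minus"] = False    # keep minus signs rendering correctly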

Prediction

Predict with the trained model:

python infer.py --audio_path=dataset/test.wav

Output:

Successfully loaded model parameters: models/BiLSTM_Emotion2Vec/best_model/model.pth
[2024-07-02 19:48:42.864262 INFO   ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_base
Prediction result for audio: dataset/test.wav: angry, score: 0.99995

Emotion2vec Model Prediction

Use the provided Emotion2vec model from ModelScope:

python infer.py --audio_path=dataset/test.wav --use_ms_model=iic/emotion2vec_plus_base

Output:

[2024-07-02 19:45:36.154355 INFO   ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_plus_base
Prediction result for audio: dataset/test.wav: angry, score: 1.0

References

  1. SpeechEmotionRecognition-Pytorch
  2. AudioClassification-Pytorch
  3. FunASR