Project Introduction¶
This is a speech emotion recognition project built with PyTorch that supports multiple preprocessing methods and models.
Source Code Address: SpeechEmotionRecognition-Pytorch
Preparation¶
- Anaconda 3
- Python 3.11
- PyTorch 2.2.1
- Windows 11 or Ubuntu 22.04
Model Test Table¶
| Model | Params(M) | Preprocessing Method | Dataset | Number of Classes | Accuracy |
|---|---|---|---|---|---|
| BiLSTM | 2.10 | Emotion2Vec | RAVDESS | 8 | 0.85333 |
| BiLSTM | 1.87 | CustomFeature | RAVDESS | 8 | 0.68666 |
| BaseModel | 0.19 | Emotion2Vec | RAVDESS | 8 | 0.85333 |
| BaseModel | 0.08 | CustomFeature | RAVDESS | 8 | 0.68000 |
| BiLSTM | 2.10 | Emotion2Vec | Larger Dataset | 9 | 0.91826 |
| BiLSTM | 1.87 | CustomFeature | Larger Dataset | 9 | 0.90817 |
| BaseModel | 0.19 | Emotion2Vec | Larger Dataset | 9 | 0.92870 |
| BaseModel | 0.08 | CustomFeature | Larger Dataset | 9 | 0.91026 |
Notes:
1. The RAVDESS dataset only uses Audio_Speech_Actors_01-24.zip
2. The Larger Dataset contains nearly 25,000 audio samples with a roughly balanced class distribution. The data features are also provided on the Knowledge Planet.
Environment Installation¶
- First, install the GPU version of PyTorch. If already installed, skip this step.
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
- Install the mser library.
Install using pip:
python -m pip install mser -U -i https://pypi.tuna.tsinghua.edu.cn/simple
Alternatively, install from source, which is recommended for getting the latest code:
git clone https://github.com/yeyupiaoling/SpeechEmotionRecognition-Pytorch.git
cd SpeechEmotionRecognition-Pytorch/
python setup.py install
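After installation, a quick sanity check (a minimal sketch; the script name is arbitrary) confirms that PyTorch can see the GPU and that mser imports correctly:

```python
# check_env.py - minimal post-install sanity check (hypothetical file name)
import torch
import mser  # installed via pip or from source above

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("mser version:", getattr(mser, "__version__", "unknown"))
```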
Quick Usage¶
To use the pretrained model directly, set the --use_ms_model=iic/emotion2vec_plus_base parameter and specify the audio path.
python infer.py --audio_path=dataset/test.wav --use_ms_model=iic/emotion2vec_plus_base
Output:
[2024-07-02 19:45:36.154355 INFO ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_plus_base
Prediction result for audio: dataset/test.wav: angry, score: 1.0
Data Preparation¶
Generate a data list for subsequent reading. The project uses the RAVDESS dataset by default, which contains eight emotions: neutral, calm, happy, sad, angry, fear, disgust, surprise. Only Audio_Speech_Actors_01-24.zip is used, containing two spoken sentences: “Kids are talking by the door” and “Dogs are sitting by the door”.
Download the dataset and extract it to the dataset directory.
Then execute the create_ravdess_list('dataset/Audio_Speech_Actors_01-24', 'dataset') function in create_data.py to generate the data list and normalization file.
python create_data.py
For custom datasets, the list follows the same format, where audio_path is the path to the audio file. Place the audio dataset in the dataset/audio directory, with each subfolder containing one category of audio data (each clip roughly 3 seconds long). For example: dataset/audio/angry/······
Execute the get_data_list('dataset/audios', 'dataset') function in create_data.py to generate the data list.
python create_data.py
The generated list format:
dataset/Audio_Speech_Actors_01-24/Actor_13/03-01-01-01-02-01-13.wav 0
dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-02-01-01-01-01.wav 1
dataset/Audio_Speech_Actors_01-24/Actor_01/03-01-03-02-01-01-01.wav 2
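Each line pairs an audio path with an integer class label separated by whitespace. A minimal sketch of reading such a list (assuming paths contain no spaces; the list file name is illustrative):

```python
# Sketch: load (audio_path, label) pairs from a generated data list
def load_data_list(list_path):
    samples = []
    with open(list_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # the label is the last whitespace-separated field
            audio_path, label = line.rsplit(maxsplit=1)
            samples.append((audio_path, int(label)))
    return samples

# Example (hypothetical list file name):
# samples = load_data_list('dataset/train_list.txt')
```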
Note: The create_standard('configs/bi_lstm.yml') function in create_data.py must be executed to generate the normalization file.
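The functions above all live in create_data.py, so in practice you enable the calls you need in its entry point. A sketch of what that might look like, using only the function names and arguments given in this section (the actual entry point of create_data.py may differ):

```python
# Sketch of the create_data.py entry point; enable the calls you need.
if __name__ == '__main__':
    # RAVDESS: build the data list from the extracted archive
    create_ravdess_list('dataset/Audio_Speech_Actors_01-24', 'dataset')
    # Custom dataset: build the data list from the per-category folders
    # get_data_list('dataset/audios', 'dataset')
    # Generate the normalization file used during training
    create_standard('configs/bi_lstm.yml')
```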
Feature Extraction (Optional)¶
Feature extraction is optional but speeds up training.
- Run extract_features.py to generate features and data lists:
python extract_features.py --configs=configs/bi_lstm.yml --save_dir=dataset/features
- Modify the configuration file to point dataset_conf.train_list and dataset_conf.test_list to the new feature lists, as in the excerpt below.
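For example, the relevant part of the configuration might then look like this (a sketch; the key names come from this section, while the list file names are assumptions based on the --save_dir used above):

```yaml
# configs/bi_lstm.yml (excerpt, sketch)
dataset_conf:
  train_list: dataset/features/train_list.txt   # assumed name of the generated feature list
  test_list: dataset/features/test_list.txt     # assumed name of the generated feature list
```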
Training¶
Before training, adjust the parameters in the configuration file as needed, especially dataset_conf.num_class (see the excerpt below), then start training:
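For example, for RAVDESS with its eight emotion classes the setting would be (a sketch; only the key named above is shown):

```yaml
# configs/bi_lstm.yml (excerpt, sketch)
dataset_conf:
  num_class: 8   # 8 classes for RAVDESS, 9 for the larger dataset in the table above
```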
# Single GPU training
CUDA_VISIBLE_DEVICES=0 python train.py
# Multi-GPU training
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py
Training log example:
[2024-02-03 15:09:26.178537 INFO ] utils:print_arguments:29 - Preprocess method: Emotion2Vec
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
BiLSTM [1, 8] --
├─Linear: 1-1 [1, 512] 393,728
├─LSTM: 1-2 [1, 1, 512] 1,576,960
├─Tanh: 1-3 [1, 512] --
├─Dropout: 1-4 [1, 512] --
├─Linear: 1-5 [1, 256] 131,328
├─ReLU: 1-6 [1, 256] --
├─Linear: 1-7 [1, 8] 2,056
==========================================================================================
Total params: 2,104,072
Trainable params: 2,104,072
Non-trainable params: 0
...
Evaluation¶
Run evaluation after each training epoch to generate accuracy and confusion matrix plots.
python eval.py --configs=configs/bi_lstm.yml
Evaluation output:
[2024-02-03 15:13:25.469242 INFO ] trainer:evaluate:461 - Successfully loaded model: models/BiLSTM_Emotion2Vec/best_model/model.pth
100%|██████████████████████████████| 150/150 [00:00<00:00, 1281.96it/s]
Evaluation time: 1s, loss: 0.61840, accuracy: 0.87333
If the labels are in Chinese, install a Chinese font so they display correctly in the plots:
# Ubuntu font installation
git clone https://github.com/tracyone/program_font && cd program_font && ./install.sh
Prediction¶
Run prediction with the trained model:
python infer.py --audio_path=dataset/test.wav
Output:
Successfully loaded model parameters: models/BiLSTM_Emotion2Vec/best_model/model.pth
[2024-07-02 19:48:42.864262 INFO ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_base
Prediction result for audio: dataset/test.wav: angry, score: 0.99995
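The same prediction can also be made from Python instead of the command line. The class and argument names below are assumptions modeled on how infer.py appears to work, not a documented API; check the project source for the actual interface:

```python
# Hypothetical sketch - MSERPredictor and its arguments are assumptions, not a documented API.
from mser.predict import MSERPredictor  # assumed module path

predictor = MSERPredictor(configs='configs/bi_lstm.yml',
                          model_path='models/BiLSTM_Emotion2Vec/best_model/',  # path taken from the log above
                          use_gpu=True)

label, score = predictor.predict(audio_data='dataset/test.wav')
print(f'Prediction result for audio dataset/test.wav: {label}, score: {score}')
```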
Emotion2vec Model Prediction¶
Use the provided Emotion2vec model from ModelScope:
python infer.py --audio_path=dataset/test.wav --use_ms_model=iic/emotion2vec_plus_base
Output:
[2024-07-02 19:45:36.154355 INFO ] emotion2vec_predict:__init__:27 - Successfully loaded model: models/iic/emotion2vec_plus_base
Prediction result for audio: dataset/test.wav: angry, score: 1.0