Introduction¶
This project implements voiceprint recognition with the EcapaTdnn model; more models may be supported in the future. The project also supports multiple data preprocessing methods. The loss function follows the approach of the face recognition project PaddlePaddle-MobileFaceNets and uses ArcFace Loss. ArcFace Loss (Additive Angular Margin Loss) normalizes both the feature vectors and the weights, then adds an angular margin \( m \) to the angle \( \theta \) between them; compared with a cosine margin, an angular margin penalizes the angle directly.
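For reference, the standard formulation of ArcFace (from the original ArcFace paper) is

\[
L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}
\]

where \( s \) is the feature scale, \( m \) is the additive angular margin, and \( \theta_{y_i} \) is the angle between the \( i \)-th normalized feature and the weight vector of its target class \( y_i \).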
Source Code: VoiceprintRecognition-Pytorch (V1)
Environment:
- Python 3.7
- PyTorch 1.10.2
Model Download¶
| Model | Preprocessing Method | Dataset | Number of Classes | Classification Accuracy | Pairwise Comparison Accuracy | Model Download Link |
|---|---|---|---|---|---|---|
| EcapaTdnn | melspectrogram | Chinese Speech Dataset | 3242 | 0.9682 | 0.99982 | Download |
| EcapaTdnn | spectrogram | Chinese Speech Dataset | 3242 | 0.9690 | 0.99982 | Download |
| EcapaTdnn | melspectrogram | Larger Dataset | 6355 | 0.9166 | 0.99991 | Download |
| EcapaTdnn | spectrogram | Larger Dataset | 6355 | 0.9154 | 0.99990 | Download |
| EcapaTdnn | melspectrogram | Ultra-Large Dataset | 13718 | 0.9179 | 0.99995 | Download |
| EcapaTdnn | spectrogram | Ultra-Large Dataset | 13718 | 0.9344 | 0.99995 | Download |
Installation¶
- Install the GPU version of PyTorch (skip if already installed):
pip install torch==1.10.2
- Install other dependencies (note: librosa version 0.9.1 is required for correct mel-spectrogram computation):
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
Note: Fix for librosa and pyaudio installation errors
Data Preparation¶
The tutorial uses the Chinese Speech Dataset, which contains 3242 speakers and more than 1,130,000 audio clips. Make sure all dataset files are fully extracted before use. For other datasets, you can preprocess them with the Python library aukit for noise reduction and silence removal.
Step 1: Create the Data List
Each line of the data list has the format <audio_path>\t<speaker_id>; generating this list up front makes the dataset easy to read during training. The Chinese dataset is distributed as mp3, so convert it to wav first to speed up reading.
In create_data.py:
# Code for creating data list
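The script itself is elided here. Purely as an illustrative sketch (not the project's actual create_data.py), and assuming the dataset/zhvoice/<subset>/<speaker>/ layout shown in the output below plus pydub for the mp3-to-wav conversion, list creation can look like this:

```python
# Illustrative sketch only -- not the project's create_data.py.
# Assumes dataset/zhvoice/<subset>/<speaker_dir>/*.mp3 and that pydub is installed.
import os
from pydub import AudioSegment

def create_data_list(data_root, train_list, test_list):
    speakers, counts = {}, {}  # speaker dir -> label; speaker dir -> clip count
    with open(train_list, 'w', encoding='utf-8') as train_f, \
         open(test_list, 'w', encoding='utf-8') as test_f:
        for root, _, files in os.walk(data_root):
            for name in sorted(files):
                if not name.endswith('.mp3'):
                    continue
                mp3_path = os.path.join(root, name)
                wav_path = mp3_path[:-4] + '.wav'
                if not os.path.exists(wav_path):
                    # One-time mp3 -> wav conversion to speed up later reads.
                    AudioSegment.from_mp3(mp3_path).export(wav_path, format='wav')
                speaker = os.path.basename(root)
                label = speakers.setdefault(speaker, len(speakers))
                idx = counts.get(speaker, 0)
                counts[speaker] = idx + 1
                line = f'{wav_path}\t{label}\n'
                # Hold out one clip in fifty per speaker for the test list
                # (an arbitrary split chosen for this sketch).
                (test_f if idx % 50 == 0 else train_f).write(line)

if __name__ == '__main__':
    create_data_list('dataset/zhvoice',
                     'dataset/train_list.txt', 'dataset/test_list.txt')
```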
Run:
python create_data.py
Output Format:
dataset/zhvoice/zhmagicdata/5_895/5_895_20170614203758.wav 3238
dataset/zhvoice/zhmagicdata/5_895/5_895_20170614214007.wav 3238
...
Model Training¶
Use train.py with parameters to specify preprocessing and data augmentation:
- feature_method: melspectrogram or spectrogram
- augment_conf_path: Path to data augmentation config
VisualDL logs training progress. Start VisualDL with:
visualdl --logdir=log --host 0.0.0.0
Training Commands:
# Single GPU
python train.py
# Multi GPU
python train.py --gpus=0,1
Training Log Example:
----------- Configuration Arguments -----------
augment_conf_path: configs/augment.yml
batch_size: 64
feature_method: melspectrogram
gpus: 0
learning_rate: 0.001
num_epoch: 30
num_speakers: 3242
num_workers: 4
pretrained_model: None
resume: None
save_model_dir: models/
test_list_path: dataset/test_list.txt
train_list_path: dataset/train_list.txt
use_model: ecapa_tdnn
------------------------------------------------
...
[2022-04-24 09:25:10.481272] Train epoch [0/30], batch: [7500/8290], loss: 9.03724, accuracy: 0.33252, lr: 0.00100000, eta: 14:58:26
...
[2022-04-24 09:28:12.084404] Test 0, accuracy: 0.76057 time: 0:00:04
...
Data Augmentation¶
Supported operations: random crop, background noise addition, speed adjustment, volume adjustment, and SpecAugment. Modify parameters in configs/augment.yml:
noise:
  min_snr_dB: 10
  max_snr_dB: 30
  noise_path: "dataset/noise"
  prob: 0.5
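To make these parameters concrete, here is a minimal sketch of the noise-mixing technique itself: an illustration of how a random SNR drawn from [min_snr_dB, max_snr_dB] can be applied with probability prob, not the project's implementation.

```python
# Illustration of SNR-controlled noise mixing -- not the project's augmentation code.
import random
import numpy as np

def add_noise(audio, noise, min_snr_dB=10, max_snr_dB=30, prob=0.5):
    """Mix `noise` into `audio` (both float numpy arrays) at a random SNR."""
    if random.random() > prob:
        return audio
    snr_dB = random.uniform(min_snr_dB, max_snr_dB)
    # Tile or trim the noise clip to match the audio length.
    if len(noise) < len(audio):
        noise = np.tile(noise, len(audio) // len(noise) + 1)
    noise = noise[:len(audio)]
    audio_rms = np.sqrt(np.mean(audio ** 2) + 1e-12)
    noise_rms = np.sqrt(np.mean(noise ** 2) + 1e-12)
    # Choose gain k so that 20*log10(audio_rms / (k * noise_rms)) == snr_dB.
    k = audio_rms / (noise_rms * 10 ** (snr_dB / 20))
    return audio + k * noise
```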
Model Evaluation¶
After training, evaluate with eval.py:
python eval.py
Output Example:
----------- Configuration Arguments -----------
feature_method: melspectrogram
list_path: dataset/test_list.txt
num_speakers: 3242
resume: models/
use_model: ecapa_tdnn
------------------------------------------------
...
Classification Accuracy: 0.9608
Pairwise Comparison: 0.99980 (Best threshold: 0.58)
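The pairwise metric reported above is the accuracy of a simple threshold on embedding similarity. As an illustration (not eval.py itself), a best threshold can be found by sweeping candidate values:

```python
# Illustrative threshold sweep for the pairwise-comparison metric -- not eval.py.
import numpy as np

def best_threshold(similarities, same_speaker):
    """similarities: cosine similarity per audio pair;
    same_speaker: 1 if the pair is the same speaker, else 0."""
    similarities = np.asarray(similarities)
    same_speaker = np.asarray(same_speaker)
    best_acc, best_thr = 0.0, 0.0
    for thr in np.arange(0.0, 1.0, 0.01):
        acc = float(np.mean((similarities >= thr).astype(int) == same_speaker))
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_acc, best_thr
```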
Voiceprint Comparison¶
Implement in infer_contrast.py:
# Code for audio feature extraction and similarity calculation
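The extraction code is elided above. Assuming the trained model yields one 1-D embedding per utterance (how that embedding is obtained is project-specific), the comparison itself reduces to cosine similarity checked against the threshold found during evaluation. A minimal sketch:

```python
# Illustrative comparison logic -- assumes embeddings are 1-D numpy arrays
# produced by the trained model; not the project's infer_contrast.py.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def is_same_speaker(emb1, emb2, threshold=0.58):
    # 0.58 is the best threshold reported by eval.py above.
    return cosine_similarity(emb1, emb2) >= threshold
```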
Run:
python infer_contrast.py --audio_path1=audio/a_1.wav --audio_path2=audio/b_2.wav
Output Example:
...
audio/a_1.wav and audio/b_2.wav are different speakers. Similarity: -0.095655
Voiceprint Recognition¶
Implement in infer_recognition.py:
# Code for loading registered voices and recognition logic
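Again the code is elided; conceptually, registration stores one embedding per speaker and recognition picks the registered speaker with the highest cosine similarity above the threshold. A hypothetical sketch (names and structure are illustrative only):

```python
# Hypothetical register/recognize logic -- not the project's infer_recognition.py.
import numpy as np

registry = {}  # speaker name -> L2-normalized enrolled embedding

def register(name, embedding):
    registry[name] = embedding / np.linalg.norm(embedding)

def recognize(embedding, threshold=0.58):
    """Return (best matching name, similarity); name is None if no match."""
    embedding = embedding / np.linalg.norm(embedding)
    best_name, best_sim = None, threshold
    for name, enrolled in registry.items():
        sim = float(np.dot(embedding, enrolled))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim
```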
Run:
python infer_recognition.py
Output Example:
...
Loaded 沙瑞金 audio.
Loaded 李达康 audio.
Please choose function: 0 (Register), 1 (Recognize)
Recording for 3 seconds...
Recognized speaker: 夜雨飘零, Similarity: 0.920434
Other Versions¶
- TensorFlow: VoiceprintRecognition-Tensorflow
- PaddlePaddle: VoiceprintRecognition-PaddlePaddle
- Keras: VoiceprintRecognition-Keras
References¶
- https://github.com/PaddlePaddle/PaddleSpeech
- https://github.com/yeyupiaoling/PaddlePaddle-MobileFaceNets
- https://github.com/yeyupiaoling/PPASR