Introduction

This project implements voiceprint recognition with the EcapaTdnn model; more models may be supported in the future. It also supports multiple data preprocessing methods. The loss function follows the approach of the face recognition project PaddlePaddle-MobileFaceNets and uses ArcFace Loss (Additive Angular Margin Loss), which normalizes the feature vectors and classifier weights and adds an angular margin \( m \) to the angle \( \theta \) between them. Because the margin is applied directly in angle space, it affects the angle more directly than a cosine margin does.
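For reference, the standard ArcFace formulation (as given in the original ArcFace paper; not reproduced from this project's code): with \( N \) samples, scale \( s \), margin \( m \), and \( \theta_{y_i} \) the angle between the \( i \)-th normalized feature and the normalized weight of its true class \( y_i \),

\[ L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}} \]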

Source Code: VoiceprintRecognition-Pytorch (V1)

Environment:
- Python 3.7
- PyTorch 1.10.2

Model Download

| Model | Preprocessing Method | Dataset | Number of Classes | Classification Accuracy | Pairwise Comparison Accuracy | Model Download Link |
|---|---|---|---|---|---|---|
| EcapaTdnn | melspectrogram | Chinese Speech Dataset | 3242 | 0.9682 | 0.99982 | Download |
| EcapaTdnn | spectrogram | Chinese Speech Dataset | 3242 | 0.9690 | 0.99982 | Download |
| EcapaTdnn | melspectrogram | Larger Dataset | 6355 | 0.9166 | 0.99991 | Download |
| EcapaTdnn | spectrogram | Larger Dataset | 6355 | 0.9154 | 0.99990 | Download |
| EcapaTdnn | melspectrogram | Ultra-Large Dataset | 13718 | 0.9179 | 0.99995 | Download |
| EcapaTdnn | spectrogram | Ultra-Large Dataset | 13718 | 0.9344 | 0.99995 | Download |

Installation

  1. Install PyTorch GPU version (if already installed, skip):

pip install torch==1.10.2

  2. Install other dependencies (note: librosa version 0.9.1 is required for correct mel-spectrogram computation):

pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/

Note: Fix for librosa and pyaudio installation errors

Data Preparation

The tutorial uses the Chinese Speech Dataset, which contains 3242 speakers with over 1,130,000 audio clips. Ensure all dataset files are extracted before use. For other datasets, preprocess with Python’s aukit for noise reduction and silence removal.

Step 1: Create Data List
The data list format is <audio_path>\t<speaker_id>; generating the list up front makes the data easy to read during training. The Chinese dataset ships as mp3, so convert the files to wav first to speed up reading.

In create_data.py:

# Code for creating data list
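The script itself is not reproduced here. As a rough illustration, a minimal sketch of building such a list might look like the following, assuming a dataset/<speaker>/<clip>.wav layout and an every-100th-clip test split (both are illustrative assumptions, not the project's actual logic):

import os

def create_data_list(audio_root, train_list_path, test_list_path):
    # Walk speaker directories and write "<audio_path>\t<speaker_id>" lines.
    speakers = sorted(os.listdir(audio_root))
    with open(train_list_path, 'w', encoding='utf-8') as f_train, \
         open(test_list_path, 'w', encoding='utf-8') as f_test:
        for speaker_id, speaker in enumerate(speakers):
            clips = sorted(os.listdir(os.path.join(audio_root, speaker)))
            for i, clip in enumerate(clips):
                line = f'{os.path.join(audio_root, speaker, clip)}\t{speaker_id}\n'
                # Hold out every 100th clip for testing (illustrative split).
                (f_test if i % 100 == 0 else f_train).write(line)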

Run:

python create_data.py

Output Format:

dataset/zhvoice/zhmagicdata/5_895/5_895_20170614203758.wav  3238
dataset/zhvoice/zhmagicdata/5_895/5_895_20170614214007.wav  3238
...

Model Training

Use train.py with parameters to specify preprocessing and data augmentation (an example invocation follows this list):
- feature_method: melspectrogram or spectrogram
- augment_conf_path: path to the data augmentation config
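For example (the --feature_method and --augment_conf_path flag syntax is inferred from the configuration log below and mirrors the --gpus flag):

python train.py --feature_method=spectrogram --augment_conf_path=configs/augment.yml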

VisualDL logs training progress. Start VisualDL with:

visualdl --logdir=log --host 0.0.0.0

Training Commands:

# Single GPU
python train.py
# Multi GPU
python train.py --gpus=0,1

Training Log Example:

----------- Configuration Arguments -----------
augment_conf_path: configs/augment.yml
batch_size: 64
feature_method: melspectrogram
gpus: 0
learning_rate: 0.001
num_epoch: 30
num_speakers: 3242
num_workers: 4
pretrained_model: None
resume: None
save_model_dir: models/
test_list_path: dataset/test_list.txt
train_list_path: dataset/train_list.txt
use_model: ecapa_tdnn
------------------------------------------------
...
[2022-04-24 09:25:10.481272] Train epoch [0/30], batch: [7500/8290], loss: 9.03724, accuracy: 0.33252, lr: 0.00100000, eta: 14:58:26
...
[2022-04-24 09:28:12.084404] Test 0, accuracy: 0.76057 time: 0:00:04
...

Data Augmentation

Supported operations: random crop, background noise addition, speed adjustment, volume adjustment, and SpecAugment. Modify parameters in configs/augment.yml:

noise:
  min_snr_dB: 10
  max_snr_dB: 30
  noise_path: "dataset/noise"
  prob: 0.5
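To make the noise entry above concrete, here is a minimal sketch of SNR-based noise mixing (not the project's implementation; the function name and array conventions are assumptions). With probability prob, a noise clip is scaled so that the resulting signal-to-noise ratio is uniform in [min_snr_dB, max_snr_dB] and then added to the audio:

import random
import numpy as np

def add_noise(audio, noise, min_snr_dB=10, max_snr_dB=30, prob=0.5):
    # audio, noise: float32 numpy arrays at the same sample rate.
    if random.random() > prob:
        return audio
    # Tile or trim the noise to the audio length.
    if len(noise) < len(audio):
        noise = np.tile(noise, len(audio) // len(noise) + 1)
    noise = noise[:len(audio)]
    snr_dB = random.uniform(min_snr_dB, max_snr_dB)
    audio_power = np.mean(audio ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale so that 10*log10(audio_power / scaled_noise_power) == snr_dB.
    scale = np.sqrt(audio_power / (noise_power * 10 ** (snr_dB / 10)))
    return audio + scale * noise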

Model Evaluation

After training, evaluate with eval.py:

python eval.py

Output Example:

----------- Configuration Arguments -----------
feature_method: melspectrogram
list_path: dataset/test_list.txt
num_speakers: 3242
resume: models/
use_model: ecapa_tdnn
------------------------------------------------
...
Classification Accuracy: 0.9608
Pairwise Comparison: 0.99980 (Best threshold: 0.58)
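The pairwise comparison metric scores same-speaker and different-speaker pairs by embedding similarity and reports accuracy at the best decision threshold. A minimal sketch of that threshold sweep (not the project's eval.py; names are illustrative):

import numpy as np

def best_threshold(similarities, labels, steps=100):
    # similarities: cosine similarity per pair; labels: 1 = same speaker, 0 = different.
    similarities, labels = np.asarray(similarities), np.asarray(labels)
    best_acc, best_t = 0.0, 0.0
    for t in np.linspace(-1.0, 1.0, steps):
        acc = np.mean((similarities > t) == labels)
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc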

Voiceprint Comparison

Voiceprint comparison is implemented in infer_contrast.py:

# Code for audio feature extraction and similarity calculation
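The full script is not reproduced here. Its core is extracting one embedding per clip and computing their cosine similarity; a minimal sketch, assuming a model whose forward pass returns an embedding for a preprocessed spectrogram tensor (shapes and names are assumptions):

import numpy as np
import torch

def extract_feature(model, spec):
    # spec: preprocessed spectrogram tensor, e.g. [n_mels, T] (assumed shape).
    model.eval()
    with torch.no_grad():
        feature = model(spec.unsqueeze(0))  # assumed to return [1, embedding_dim]
    return feature.squeeze(0).cpu().numpy()

def cosine_similarity(f1, f2):
    # Cosine similarity in [-1, 1]; values below the decision threshold
    # (e.g. the best threshold found during evaluation) mean different speakers.
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))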

Run:

python infer_contrast.py --audio_path1=audio/a_1.wav --audio_path2=audio/b_2.wav

Output Example:

...
audio/a_1.wav and audio/b_2.wav are different speakers. Similarity: -0.095655

Voiceprint Recognition

Voiceprint recognition is implemented in infer_recognition.py:

# Code for loading registered voices and recognition logic
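A minimal sketch of the register/recognize logic (not the project's infer_recognition.py; the in-memory registry and the 0.7 threshold are illustrative assumptions):

import numpy as np

registry = {}  # speaker name -> embedding vector

def register(name, feature):
    registry[name] = feature

def recognize(feature, threshold=0.7):
    # Return the registered speaker with the highest cosine similarity,
    # or None if no score reaches the threshold.
    best_name, best_score = None, -1.0
    for name, reg in registry.items():
        score = float(np.dot(feature, reg) /
                      (np.linalg.norm(feature) * np.linalg.norm(reg)))
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else (None, best_score)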

Run:

python infer_recognition.py

Output Example:

...
Loaded 沙瑞金 audio.
Loaded 李达康 audio.
Please choose function: 0 (Register), 1 (Recognize)
Recording for 3 seconds...
Recognized speaker: 夜雨飘零, Similarity: 0.920434

Other Versions

References

  1. https://github.com/PaddlePaddle/PaddleSpeech
  2. https://github.com/yeyupiaoling/PaddlePaddle-MobileFaceNets
  3. https://github.com/yeyupiaoling/PPASR