Foreword

This chapter introduces how to implement a simple voiceprint recognition model with TensorFlow. You should first be familiar with audio classification; if not, see this article: Implementing Sound Classification with TensorFlow. Building on that knowledge, we will train a voiceprint recognition model that can identify who is speaking, which is useful in projects that require voice-based verification. What sets this project apart is the use of ArcFace loss (Additive Angular Margin Loss): it normalizes both the feature vectors and the classifier weights, then adds an angular margin \( m \) to the angle \( \theta \) between them. Compared with a cosine margin, an additive angular margin acts on the angle directly.
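To make the idea concrete, here is a minimal sketch of an additive angular margin head in tf.keras. It is not the project's exact implementation: the class name ArcMarginLayer and the defaults \( s = 30 \) and \( m = 0.5 \) are illustrative assumptions.

import tensorflow as tf

class ArcMarginLayer(tf.keras.layers.Layer):
    # Produces s * cos(theta + m) for the true class and s * cos(theta) for the others,
    # so it can be trained with ordinary softmax cross-entropy (from_logits=True).
    def __init__(self, num_classes, s=30.0, m=0.50, **kwargs):
        super().__init__(**kwargs)
        self.num_classes, self.s, self.m = num_classes, s, m

    def build(self, input_shape):
        self.w = self.add_weight(name="w", shape=(int(input_shape[-1]), self.num_classes),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, embeddings, labels):
        x = tf.math.l2_normalize(embeddings, axis=1)   # normalize feature vectors
        w = tf.math.l2_normalize(self.w, axis=0)       # normalize class weights
        cos_t = tf.matmul(x, w)                        # cos(theta) for every class
        theta = tf.acos(tf.clip_by_value(cos_t, -1.0 + 1e-7, 1.0 - 1e-7))
        target = tf.cos(theta + self.m)                # add the angular margin m to the true class
        onehot = tf.one_hot(labels, depth=self.num_classes)
        return self.s * (onehot * target + (1.0 - onehot) * cos_t)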

Source Code Address: VoiceprintRecognition-Tensorflow

Environment Used:
- Python 3.7
- TensorFlow 2.3.0

Model Download

| Dataset | Number of Classes | Accuracy | Download Link |
|:---|:---:|:---:|:---:|
| Chinese Voice Corpus Dataset | 3242 | 99.9693% | Download |

Environment Setup

  1. Install TensorFlow. If you have already installed TensorFlow, you can skip this step.
pip install tensorflow==2.3.0 -i https://mirrors.aliyun.com/pypi/simple/
  2. Install the other dependencies with the following command.
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/

Note: Installation Error Solutions for librosa and pyaudio

Data Preparation

This tutorial uses the Chinese Voice Corpus Dataset, which contains more than 1,130,000 recordings from 3242 speakers. If you have other, better datasets, you can mix them in. The Python module aukit is used to preprocess the audio files, including noise reduction and silence removal.

First, create a data list in the format <voice file path\tclassification label>, where the label is the unique ID of the speaker. This list makes subsequent reading easier and allows other voice datasets to be merged in: to combine datasets, simply write a corresponding function that appends entries to the data list.
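As an illustration, the snippet below shows one way such a list could be generated. It is a hypothetical helper, not the project's create_data.py, and it assumes one sub-folder of WAV files per speaker.

import os

def create_data_list(data_dir, list_path):
    # Each speaker's recordings are assumed to sit in their own sub-folder;
    # the folder's index becomes the speaker's classification label.
    speakers = sorted(d for d in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, d)))
    with open(list_path, 'w', encoding='utf-8') as f:
        for label, speaker in enumerate(speakers):
            speaker_dir = os.path.join(data_dir, speaker)
            for name in sorted(os.listdir(speaker_dir)):
                if name.endswith('.wav'):
                    f.write(f"{os.path.join(speaker_dir, name)}\t{label}\n")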

The data preparation logic lives in create_data.py. Because the Chinese Voice Corpus Dataset ships as MP3, which the author found slow to read, the script first converts all MP3 files to WAV. After the data list is created, it is checked for errors and invalid entries are removed. Run the program to complete data preparation.

python create_data.py

After the program finishes, a data list in the following format is generated. To use your own data, follow the same format: the first field is the relative path of the audio file, and the second field is the speaker ID, which serves as the classification label.

dataset/zhvoice/zhmagicdata/5_895/5_895_20170614203758.wav  3238
dataset/zhvoice/zhmagicdata/5_895/5_895_20170614214007.wav  3238
dataset/zhvoice/zhmagicdata/5_941/5_941_20170613151344.wav  3239
dataset/zhvoice/zhmagicdata/5_941/5_941_20170614221329.wav  3239
dataset/zhvoice/zhmagicdata/5_941/5_941_20170616153308.wav  3239
dataset/zhvoice/zhmagicdata/5_968/5_968_20170614162657.wav  3240
dataset/zhvoice/zhmagicdata/5_968/5_968_20170622194003.wav  3240
dataset/zhvoice/zhmagicdata/5_968/5_968_20170707200554.wav  3240
dataset/zhvoice/zhmagicdata/5_970/5_970_20170616000122.wav  3241

Data Loading

With the data list created above, we can start training. The key step is converting the raw audio into a short-time Fourier transform (STFT) magnitude spectrum. The librosa library provides the audio feature APIs: librosa.feature.melspectrogram() computes mel-spectrograms, and librosa.feature.mfcc() computes MFCCs (mel-frequency cepstral coefficients), which are more common in speech recognition; both return numpy arrays that can be fed directly to training or prediction. This project instead uses librosa.stft() and librosa.magphase(). During training, data augmentation such as random flipping, splicing, and random cropping is applied. After processing, each utterance becomes a 257×257 magnitude spectrum.

import librosa
import numpy as np

# Parameters are assumed defaults: n_fft=512 -> 257 frequency bins, spec_len=257 frames -> 257x257 spectrum.
def load_audio(audio_path, sr=16000, n_fft=512, win_length=400, hop_length=160, spec_len=257):
    wav, sr_ret = librosa.load(audio_path, sr=sr)
    extended_wav = np.tile(wav, 2)  # repeat the clip so short recordings still cover spec_len frames
    linear = librosa.stft(extended_wav, n_fft=n_fft, win_length=win_length, hop_length=hop_length)
    mag, _ = librosa.magphase(linear)
    freq, freq_time = mag.shape
    spec_mag = mag[:, :spec_len]
    # Normalize each time frame to zero mean and unit variance.
    mean = np.mean(spec_mag, 0, keepdims=True)
    std = np.std(spec_mag, 0, keepdims=True)
    spec_mag = (spec_mag - mean) / (std + 1e-5)
    return spec_mag
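Below is a simple sketch of the random cropping mentioned above; the function name random_crop is an illustrative assumption, not the project's code.

import numpy as np

def random_crop(mag, spec_len=257):
    # Pick a random window of spec_len frames from the magnitude spectrum during training.
    freq, frames = mag.shape
    start = np.random.randint(0, max(1, frames - spec_len + 1))
    return mag[:, start:start + spec_len]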

Model Training

Training is implemented in train.py. The model is based on ResNet50V2 (see the model summaries below). The input layer is set to [None, 1, 257, 257], matching the shape of the STFT magnitude spectrum; if you use a different audio length, adjust this value. After every training epoch the model is evaluated to compute accuracy and check convergence, and a checkpoint is saved containing the parameters needed both for resuming training and for prediction.
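For orientation, here is a minimal sketch of how such a model could be assembled in tf.keras. It assumes the channels-last (257, 257, 1) input and ResNet50V2 backbone shown in the printed model summaries below; the function name build_model and the plain Dense classification head are illustrative assumptions (the ArcFace head sketched in the Foreword would replace the Dense layer for margin training).

import tensorflow as tf

def build_model(num_classes, input_shape=(257, 257, 1)):
    # ResNet50V2 backbone with global average pooling -> 2048-d vector,
    # followed by batch normalization to form the speaker embedding.
    backbone = tf.keras.applications.ResNet50V2(include_top=False, weights=None,
                                                input_shape=input_shape, pooling='avg')
    feature = tf.keras.layers.BatchNormalization()(backbone.output)
    logits = tf.keras.layers.Dense(num_classes)(feature)  # classification output
    # Two outputs: class logits for training, feature vector for voiceprint comparison.
    return tf.keras.Model(inputs=backbone.input, outputs=[logits, feature])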

python train.py

During training, TensorBoard logs are saved. Start TensorBoard to monitor training results with:

tensorboard --logdir=log --host 0.0.0.0

Model Evaluation

Once training is complete, the exported prediction model is used to extract features for every audio clip in the test set. The features are then compared pairwise, and the decision threshold is swept from 0 to 1 in steps of 0.01 to find the threshold that gives the highest accuracy.
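A minimal sketch of that threshold search is shown below, assuming features is a 2-D array of test-set embeddings and labels holds the corresponding speaker IDs; the function name best_threshold is an illustrative assumption.

import numpy as np

def best_threshold(features, labels):
    # Cosine similarity of every pair of L2-normalized embeddings.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = feats @ feats.T
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)          # unique pairs only
    scores = sims[iu]
    same = (labels[:, None] == labels[None, :])[iu]
    best_t, best_acc = 0.0, 0.0
    for t in np.arange(0.0, 1.0, 0.01):             # sweep the threshold in steps of 0.01
        acc = np.mean((scores > t) == same)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc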

python eval.py

Sample output:

-----------  Configuration Arguments -----------
input_shape: (1, 257, 257)
list_path: dataset/test_list.txt
model_path: models/infer/model
------------------------------------------------

Starting to extract all audio features...
100%|█████████████████████████████████████████████████████| 5332/5332 [01:09<00:00, 77.06it/s]
Starting pairwise feature comparison...
100%|█████████████████████████████████████████████████████| 5332/5332 [01:43<00:00, 51.62it/s]
100%|█████████████████████████████████████████████████████| 100/100 [00:03<00:00, 28.04it/s]
When the threshold is 0.990000, the accuracy is maximum at: 0.999693

Voiceprint Comparison

Voiceprint comparison is implemented in infer_contrast.py. The infer() function returns the audio feature vector (the model has two outputs: the classification result and the feature vector). Given two voice recordings, extract both feature vectors, compute their cosine similarity, and use it as the similarity score. Adjust the decision threshold to match your accuracy requirements.
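A minimal sketch of the comparison step follows, assuming infer() returns a 1-D feature vector for a single audio file; the helper name cosine_similarity is an illustrative assumption.

import numpy as np

def cosine_similarity(f1, f2):
    # Similarity score between two speaker embeddings, in [-1, 1].
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

# Hypothetical usage:
# feature1, feature2 = infer('audio/a_1.wav'), infer('audio/b_1.wav')
# same_person = cosine_similarity(feature1, feature2) > threshold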

python infer_contrast.py --audio_path1=audio/a_1.wav --audio_path2=audio/b_1.wav

Sample output:

-----------  Configuration Arguments -----------
audio_path1: audio/a_1.wav
audio_path2: audio/b_1.wav
input_shape: (257, 257, 1)
model_path: models/infer_model.h5
threshold: 0.7
------------------------------------------------
Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
resnet50v2_input (InputLayer [(None, 257, 257, 1)]     0         
_________________________________________________________________
resnet50v2 (Functional)      (None, 2048)              23558528  
_________________________________________________________________
batch_normalization (BatchNo (None, 2048)              8192      
=================================================================
Total params: 23,566,720
Trainable params: 23,517,184
Non-trainable params: 49,536
_________________________________________________________________

The voices audio/a_1.wav and audio/b_1.wav are not from the same person. Similarity: 0.503458

Voiceprint Recognition

Voiceprint recognition is implemented in infer_recognition.py. It reuses the infer() function from the voiceprint comparison step and adds three functions:
- load_audio_db(): Load registered voice samples from the audio_db folder.
- register(): Save new voice samples to the database and store their features.
- recognition(): Compare input voice with database samples to identify the speaker.

Run python infer_recognition.py to record 3 seconds of audio, then match it against the database.
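The sketch below illustrates the matching logic, assuming person_db is a dict mapping registered names to their stored feature vectors (as loaded by load_audio_db()) and infer() returns the embedding of the recorded clip; the names and signature are illustrative assumptions, not the project's exact code.

import numpy as np

def recognition(recorded_path, person_db, threshold=0.7):
    # Compare the recorded clip's embedding against every registered speaker
    # and return the best match only if it clears the similarity threshold.
    feature = infer(recorded_path)
    best_name, best_sim = None, 0.0
    for name, db_feature in person_db.items():
        sim = float(np.dot(feature, db_feature) /
                    (np.linalg.norm(feature) * np.linalg.norm(db_feature)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return (best_name, best_sim) if best_sim > threshold else (None, best_sim)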

Sample output:

-----------  Configuration Arguments -----------
audio_db: audio_db
input_shape: (257, 257, 1)
model_path: models/infer_model.h5
threshold: 0.7
------------------------------------------------
Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
resnet50v2_input (InputLayer [(None, 257, 257, 1)]     0         
_________________________________________________________________
resnet50v2 (Functional)      (None, 2048)              23558528  
_________________________________________________________________
batch_normalization (BatchNo (None, 2048)              8192      
=================================================================
Total params: 23,566,720
Trainable params: 23,517,184
Non-trainable params: 49,536
_________________________________________________________________

Loaded Li Dakang's audio.
Loaded Sha Ruijin's audio.
Please select a function: 0 for registration, 1 for recognition: 1
Press Enter to start recording for 3 seconds...
Recording ended!
Recognized speaker: Li Dakang, similarity: 0.920434

Other Versions

Xiaoye