Preface¶
Although this project is described as a Keras project, it actually uses the Keras interface that ships with TensorFlow (tf.keras). The main purpose of this project is voiceprint recognition, also known as speaker recognition. It covers training on custom datasets, voiceprint comparison, and voiceprint recognition.
Source Code Address: VoiceprintRecognition-Keras
Environment Used:
- Python 3.7
- TensorFlow 2.3.0
Model Download¶
| Dataset | Number of Categories | Download Link |
|---|---|---|
| Chinese Speech Corpus Dataset | 3242 | Download |
| Larger Dataset | 6235 | Download |
Environment Installation¶
1. Install TensorFlow with GPU support:
```shell
pip install tensorflow==2.3.0 -i https://mirrors.aliyun.com/pypi/simple/
```
2. Install the other dependency libraries:
```shell
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
```
Note: For solutions to installation errors with librosa and pyaudio, refer to docs/faq.md.
Custom Data Training¶
This section explains how to train with a custom dataset. If you don’t want to train a model, you can skip to the next section and use the publicly available model for voiceprint recognition.
Data Creation¶
The author of this tutorial used the Chinese Speech Corpus Dataset, which contains voice data from 3242 speakers and more than 1,130,000 recordings. If you have other, better datasets, you can mix them in, but be sure to process the audio with the Python module aukit (denoising and silence removal).
First, create a data list in the format <voice file path \t voice classification label>. Such a list makes subsequent reading convenient, and other voice datasets can be incorporated by writing corresponding functions to generate their own lists and then merging everything into a single list.
The data preparation code lives in create_data.py. Because the Chinese Speech Corpus Dataset is distributed as MP3 files, which are slow to read, the author converted all of the audio to WAV format. After the data list is created, it is checked for errors and any bad entries are removed. Run the following program to prepare the data:
python create_data.py
After executing the above program, the following data format will be generated. For custom data, refer to the data list format: the first part is the relative path of the audio, and the second part is the corresponding speaker label (similar to classification):
dataset/zhvoice/zhmagicdata/5_895/5_895_20170614203758.wav 3238
dataset/zhvoice/zhmagicdata/5_895/5_895_20170614214007.wav 3238
dataset/zhvoice/zhmagicdata/5_941/5_941_20170613151344.wav 3239
dataset/zhvoice/zhmagicdata/5_941/5_941_20170614221329.wav 3239
dataset/zhvoice/zhmagicdata/5_941/5_941_20170616153308.wav 3239
dataset/zhvoice/zhmagicdata/5_968/5_968_20170614162657.wav 3240
dataset/zhvoice/zhmagicdata/5_968/5_968_20170622194003.wav 3240
dataset/zhvoice/zhmagicdata/5_968/5_968_20170707200554.wav 3240
dataset/zhvoice/zhmagicdata/5_970/5_970_20170616000122.wav 3241
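For illustration only, here is a minimal sketch of how such a list could be generated, assuming each speaker's WAV files sit in a separate subfolder of dataset/zhvoice/zhmagicdata and that the list is written to dataset/train_list.txt (both are assumptions; the actual logic, including the MP3-to-WAV conversion, lives in create_data.py):
```python
import os

data_root = 'dataset/zhvoice/zhmagicdata'
with open('dataset/train_list.txt', 'w', encoding='utf-8') as f:
    # One label per speaker folder, assigned in sorted order starting from 0.
    for label, speaker in enumerate(sorted(os.listdir(data_root))):
        speaker_dir = os.path.join(data_root, speaker)
        for wav_name in sorted(os.listdir(speaker_dir)):
            if wav_name.endswith('.wav'):
                f.write(f'{data_root}/{speaker}/{wav_name}\t{label}\n')
```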
Data Reading¶
With the data list and mean/std values created, you can start training. The main task during data reading is to convert each voice clip into the magnitude spectrum of its Short-Time Fourier Transform (STFT). librosa makes computing audio features easy: librosa.feature.melspectrogram() returns a Mel-spectrogram as a NumPy array that can be fed directly into training or prediction, and librosa.feature.mfcc() computes MFCCs (Mel-Frequency Cepstral Coefficients), which are used more often in speech recognition. This project uses librosa.stft() and librosa.magphase(). Data augmentation (random flipping, splicing, and random cropping; see the sketch after the snippet below) is applied during training, and the final result is a 257×257 STFT magnitude spectrogram.
```python
import librosa
import numpy as np

# audio_path, sr, n_fft, win_length, hop_length and spec_len come from the project's config.
wav, sr_ret = librosa.load(audio_path, sr=sr)
# Splice the waveform with itself so short clips are long enough (the "splicing" mentioned above).
extended_wav = np.append(wav, wav)
# Short-Time Fourier Transform and its magnitude spectrum.
linear = librosa.stft(extended_wav, n_fft=n_fft, win_length=win_length, hop_length=hop_length)
mag, _ = librosa.magphase(linear)
freq, freq_time = mag.shape
# Crop to a fixed number of frames (spec_len) and normalize per utterance.
spec_mag = mag[:, :spec_len]
mean = np.mean(spec_mag, 0, keepdims=True)
std = np.std(spec_mag, 0, keepdims=True)
spec_mag = (spec_mag - mean) / (std + 1e-5)
```
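The random flipping and random cropping mentioned above are applied only at training time. A minimal sketch of what they might look like, reusing the variable names from the snippet above (the 0.3 flip probability and the crop logic are assumptions, not necessarily the project's exact values):
```python
import numpy as np

# Randomly reverse the waveform (this would happen before the STFT step above).
if np.random.random() < 0.3:
    extended_wav = extended_wav[::-1]

# Instead of always taking the first spec_len frames, start the crop at a
# random frame so each epoch sees a slightly different slice.
if mag.shape[1] > spec_len:
    rand_start = np.random.randint(0, mag.shape[1] - spec_len)
    spec_mag = mag[:, rand_start:rand_start + spec_len]
```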
Training¶
Before training, modify a few parameters in train.py:
- gpu: Which GPU(s) to use; if the machine has multiple GPUs, all of them can be used.
- resume: Resume training from a previously saved model (specify the model path if resuming).
- batch_size: Adjust based on your GPU memory size.
- num_classes: Number of classifications. Check the last label in the data list from the previous step; remember to add 1 (since labels start from 0).
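To double-check num_classes, you can scan the labels in the generated data list. A quick sketch, assuming the list was written to dataset/train_list.txt (adjust the path to wherever create_data.py put it):
```python
# Labels start at 0, so the number of classes is the largest label plus one.
with open('dataset/train_list.txt', encoding='utf-8') as f:
    labels = [int(line.split('\t')[1]) for line in f if line.strip()]
print('num_classes =', max(labels) + 1)
```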
Finally, run train.py to start training:
python train.py
Model Evaluation¶
After training, the prediction model will be saved. Use this model to predict audio features from the test set, then compare the features pairwise (with thresholds from 0 to 1, step 0.01) to find the optimal threshold and calculate accuracy.
python eval.py
Sample output:
----------- Configuration Arguments -----------
list_path: dataset/test_list.txt
model_path: models/resnet34-51.h5
------------------------------------------------
==> successfully loading model models/resnet34-51.h5.
Starting to extract all audio features...
100%|█████████████████████████████████████████████████████| 5332/5332 [01:09<00:00, 77.06it/s]
Starting pairwise comparison of audio features...
100%|█████████████████████████████████████████████████████| 5332/5332 [01:43<00:00, 51.62it/s]
100%|█████████████████████████████████████████████████████| 100/100 [00:03<00:00, 28.04it/s]
When the threshold is 0.790000, the accuracy is maximum at 0.999787
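For reference, here is a rough sketch of the threshold search described above (not the project's exact code). scores is assumed to hold the pairwise similarity scores and labels to hold 1 for same-speaker pairs and 0 for different-speaker pairs:
```python
import numpy as np

def find_best_threshold(scores, labels):
    # Sweep thresholds from 0 to 1 in steps of 0.01 and keep the one with the
    # highest pairwise accuracy.
    best_threshold, best_accuracy = 0.0, 0.0
    for threshold in np.arange(0.0, 1.0, 0.01):
        predictions = (scores >= threshold).astype(int)
        accuracy = float(np.mean(predictions == labels))
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy
```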
Voiceprint Comparison¶
Next, implement voiceprint comparison by creating infer_contrast.py. The infer() function outputs two model results: classification output and audio feature output. Here, we use the audio feature output. With two voice inputs, obtain their feature data, compute the cosine similarity, and use this as the similarity score. Adjust the threshold based on accuracy requirements.
python infer_contrast.py --audio_path1=audio/a_1.wav --audio_path2=audio/b_2.wav
Sample output:
----------- Configuration Arguments -----------
audio_path1: audio/a_1.wav
audio_path2: audio/b_2.wav
model_path: models/resnet34-51.h5
threshold: 0.79
------------------------------------------------
==> successfully loading model models/resnet34-51.h5.
The two audio files (audio/a_1.wav and audio/b_2.wav) are not from the same person. Similarity: 0.020499
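The comparison itself boils down to the cosine similarity between the two feature vectors. A minimal sketch, where feature1 and feature2 stand for the model's audio-feature outputs for the two recordings (placeholder names, not the project's exact code):
```python
import numpy as np

def cosine_similarity(feature1, feature2):
    # Dot product of the two feature vectors divided by the product of their L2 norms.
    return float(np.dot(feature1, feature2) /
                 (np.linalg.norm(feature1) * np.linalg.norm(feature2)))

# If the similarity exceeds the threshold (0.79 in the sample output above),
# the two recordings are judged to be from the same speaker.
```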
Voiceprint Recognition¶
Build on the voiceprint comparison functions by creating infer_recognition.py. It adds load_audio_db(), register(), and recognition():
- load_audio_db(): Loads the registered users' voice data from the voiceprint library.
- register(): Saves the recorded audio to the voiceprint library and adds its features to the comparison data.
- recognition(): Compares the input voice with all registered voices in the library.
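A simplified sketch of how these three helpers might fit together; extract_feature() is a placeholder for the model's feature-extraction step, and the flat audio_db layout is an assumption (the real implementations live in infer_recognition.py):
```python
import os
import numpy as np

def extract_feature(wav_path):
    """Placeholder for the model's feature-extraction step (see infer_contrast.py)."""
    raise NotImplementedError

person_features = {}  # registered name -> feature vector

def load_audio_db(audio_db_path):
    # Load every registered user's audio from the voiceprint library.
    for file_name in os.listdir(audio_db_path):
        name = os.path.splitext(file_name)[0]
        person_features[name] = extract_feature(os.path.join(audio_db_path, file_name))
        print(f'Loaded {name} audio.')

def register(wav_path, user_name):
    # Add the new user's feature vector (the real version also copies the
    # recording into the audio_db folder).
    person_features[user_name] = extract_feature(wav_path)

def recognition(wav_path, threshold=0.79):
    # Compare the input recording against every registered feature vector.
    feature = extract_feature(wav_path)
    best_name, best_score = None, 0.0
    for name, db_feature in person_features.items():
        score = float(np.dot(feature, db_feature) /
                      (np.linalg.norm(feature) * np.linalg.norm(db_feature)))
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score > threshold else (None, best_score)
```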
With these functions, you can implement voiceprint recognition (e.g., via local recording):
1. Load voice data from the audio_db folder.
2. Record a 3-second clip when the user presses Enter (a minimal recording sketch follows these steps).
3. Use this audio for voiceprint recognition to match against the library and retrieve user information.
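The 3-second recording in step 2 can be captured with pyaudio (already listed in requirements.txt). A minimal sketch, assuming 16 kHz, mono, 16-bit audio saved to record.wav (the project's actual recording parameters may differ):
```python
import wave
import pyaudio

CHUNK, RATE, SECONDS = 1024, 16000, 3

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
print('Recording started...')
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
print('Recording finished!')
stream.stop_stream()
stream.close()

# Save the recording as a 16-bit mono WAV file.
with wave.open('record.wav', 'wb') as wf:
    wf.setnchannels(1)
    wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
p.terminate()
```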
Run the following command to execute:
python infer_recognition.py
Sample output:
----------- Configuration Arguments -----------
audio_db: audio_db
model_path: models/resnet34-51.h5
threshold: 0.79
------------------------------------------------
==> successfully loading model models/resnet34-51.h5.
Loaded Li Dakang audio.
Loaded Sha Ruijin audio.
Please select a function: 0 (register audio to voiceprint library) or 1 (perform voiceprint recognition): 0
Press Enter to start recording (3 seconds):
Recording started...
Recording finished!
Please enter the user name for this audio: Yeyu Piaoling
Please select a function: 0 (register audio to voiceprint library) or 1 (perform voiceprint recognition): 1
Press Enter to start recording (3 seconds):
Recording started...
Recording finished!
Recognized speaker: Yeyu Piaoling, similarity: 0.920434
Other Versions¶
- PaddlePaddle: VoiceprintRecognition-PaddlePaddle
- PyTorch: VoiceprintRecognition-Pytorch
- TensorFlow: VoiceprintRecognition-Tensorflow