Today, let’s look at a VAD tool. VAD (Voice Activity Detection) finds the regions of an audio signal that contain speech, which makes it possible to split a long recording into shorter segments at the silent positions. A common choice is WebRTC VAD, which is used in many projects. The tool introduced here, however, is a small feature of YeAudio that is implemented with deep learning.

Usage

First, install the YeAudio library.

python -m pip install yeaudio -i https://pypi.tuna.tsinghua.edu.cn/simple -U

Here’s how to use it: a few lines of code return the positions of active speech. Note two requirements: the input data must be float32, and the sampling rate must be 8000 or 16000. Other sampling rates, such as multiples of 16000, may also work, but accuracy is not guaranteed.

from yeaudio.audio import AudioSegment

# Load the audio file
audio_segment = AudioSegment.from_file('data/test.wav')

# Run VAD and print the detected regions of active speech
speech_timestamps = audio_segment.vad()
print(speech_timestamps)

The output is a list in which each element is a dictionary containing the start and end positions of one region of active speech.

[{'start': 11808, 'end': 24032}, {'start': 26144, 'end': 55264}, {'start': 57888, 'end': 125408}]
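The start and end values can then be used to cut the audio into short segments. The following is a minimal sketch, assuming the values are sample indices (the article does not state the unit) and that the audio was sampled at 16000 Hz. The helpers `to_seconds` and `split_segments` are hypothetical names for illustration and are not part of YeAudio.

```python
def to_seconds(timestamps, sample_rate):
    """Convert timestamps to seconds, ASSUMING they are sample indices."""
    return [
        {"start": ts["start"] / sample_rate, "end": ts["end"] / sample_rate}
        for ts in timestamps
    ]


def split_segments(samples, timestamps):
    """Slice a 1-D sample sequence into one chunk per speech region."""
    return [samples[ts["start"]:ts["end"]] for ts in timestamps]


# Values copied from the output above; 16000 Hz is an assumed sampling rate.
speech_timestamps = [
    {"start": 11808, "end": 24032},
    {"start": 26144, "end": 55264},
    {"start": 57888, "end": 125408},
]

seconds = to_seconds(speech_timestamps, sample_rate=16000)
for region in seconds:
    print(f"speech from {region['start']:.3f}s to {region['end']:.3f}s")

# Stand-in for real audio samples, used here only to show the slicing.
dummy_samples = [0.0] * 130000
segments = split_segments(dummy_samples, speech_timestamps)
print([len(segment) for segment in segments])
```

With real audio, `audio_segment.samples` would take the place of `dummy_samples`.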

Streaming Real-time Detection

The latest version supports real-time streaming VAD. While recording, you can detect whether the user has stopped speaking and then trigger follow-up actions such as stopping the recording and starting recognition.

from yeaudio.audio import AudioSegment
from yeaudio.streaming_vad import StreamingVAD

# Load a file and use its samples to simulate a live audio stream
audio_seg = AudioSegment.from_file('data/test.wav')
data = audio_seg.samples

streaming_vad = StreamingVAD(sample_rate=audio_seg.sample_rate)

# Feed the audio to the detector in chunks of streaming_vad.vad_frames samples
for ith_frame in range(0, len(data), streaming_vad.vad_frames):
    buffer = data[ith_frame:ith_frame + streaming_vad.vad_frames]
    state = streaming_vad(buffer)
    print("VAD state:", state)

The detection results are printed in real time, as follows:

VAD state: VADState.QUIET
VAD state: VADState.QUIET
VAD state: VADState.QUIET
VAD state: VADState.STARTING
VAD state: VADState.STARTING
VAD state: VADState.SPEAKING
VAD state: VADState.SPEAKING
VAD state: VADState.SPEAKING
······
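To act on these states, for example to stop recording once the user has finished speaking, the stream of states can be watched for a return to QUIET after speech was detected. The following is a rough sketch of that idea, not YeAudio's own logic. The `VADState` enum below is a local stand-in containing only the three states visible in the output above (the real enum may define more), and `first_end_index` is a hypothetical helper.

```python
from enum import Enum


class VADState(Enum):
    # Local stand-in for the library's enum (an assumption); it lists only
    # the states that appear in the sample output above.
    QUIET = 1
    STARTING = 2
    SPEAKING = 3


def first_end_index(states):
    """Return the index of the first QUIET state that follows SPEAKING.

    Returns None if no utterance has ended within the given states.
    """
    in_speech = False
    for index, state in enumerate(states):
        if state is VADState.SPEAKING:
            in_speech = True
        elif state is VADState.QUIET and in_speech:
            return index
    return None


# Simulated sequence of per-chunk states, in the order a stream would yield them.
states = [
    VADState.QUIET,
    VADState.STARTING,
    VADState.SPEAKING,
    VADState.SPEAKING,
    VADState.QUIET,
]
print(first_end_index(states))  # 4
```

In a real recording loop the same check would run on each state returned by `streaming_vad(buffer)`, comparing against the library's own `VADState` members. Tracking a flag instead of only the previous state keeps the check working even if intermediate states occur between SPEAKING and QUIET.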
