Introduction

This chapter introduces how to use TensorFlow to train an audio classification model that can distinguish different audio types, such as identifying bird species based on bird calls. Let’s get started without further ado.

Environment Preparation

This section mainly covers the installation of librosa, PyAudio, and pydub; other dependencies can be installed as needed. The environment used here is:
- Python 3.7
- TensorFlow 2.0
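
TensorFlow itself can also be installed with pip. The exact package depends on whether you want GPU support (the GPU build additionally requires a matching CUDA/cuDNN installation); for the versions listed above, something like the following should work:

pip install tensorflow==2.0.0       # CPU-only build
pip install tensorflow-gpu==2.0.0   # GPU build, requires matching CUDA/cuDNN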

Installing librosa

The easiest way is to use the pip command:

pip install pytest-runner
pip install librosa

If the pip installation fails, install from source code. Download the source code from: https://github.com/librosa/librosa/releases/ and unzip it (Windows users can download the zip package for easy extraction).

pip install pytest-runner
tar xzf librosa-<version>.tar.gz  # or unzip librosa-<version>.tar.gz
cd librosa-<version>/
python setup.py install

If you encounter the error "libsndfile64bit.dll: error 0x7e", install version 0.6.3 instead: pip install librosa==0.6.3

Installing PyAudio

Use pip to install:

pip install pyaudio

If you’re on Windows with Python 3.7, download the whl file from: https://github.com/intxcc/pyaudio_portaudio/releases for installation.

Installing pydub

Use pip to install:

pip install pydub
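
Note that pydub only handles WAV files natively; reading or writing compressed formats such as MP3 requires ffmpeg (or libav) to be installed and on the PATH. A quick way to verify the installation, using a placeholder file name, is:

from pydub import AudioSegment

# Placeholder file: converting MP3 to WAV requires ffmpeg/libav
sound = AudioSegment.from_file('example.mp3')
sound.export('example.wav', format='wav')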

Training a Classification Model

The key to converting audio into training data is the Mel Spectrogram, which librosa makes easy to obtain via librosa.feature.melspectrogram(). The result is a numpy array that can be used directly for TensorFlow training and prediction. For background on Mel Spectrograms, refer to relevant resources. MFCCs (Mel-Frequency Cepstral Coefficients), which are used more often in speech recognition, can be obtained via librosa.feature.mfcc(). The following code loads an audio clip of a specified duration and extracts its Mel Spectrogram:

y1, sr1 = librosa.load(data_path, duration=2.97)
ps = librosa.feature.melspectrogram(y=y1, sr=sr1)
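
With librosa's default settings (22,050 Hz sample rate, hop_length=512, 128 Mel bins), 2.97 seconds of audio produces a 128x128 spectrogram. A quick way to confirm the shape, together with the corresponding MFCC call, might look like this (the file path is only a placeholder):

import librosa

# Placeholder path; any audio clip of at least 2.97 s works
y1, sr1 = librosa.load('dataset/audio/bird_sound/example.wav', duration=2.97)
ps = librosa.feature.melspectrogram(y=y1, sr=sr1)
print(ps.shape)    # (128, 128) with the defaults: n_mels=128, hop_length=512

# MFCCs, more commonly used in speech recognition
mfcc = librosa.feature.mfcc(y=y1, sr=sr1)
print(mfcc.shape)  # (20, 128) with the default n_mfcc=20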

Creating Training Data

Because the audio files are small but numerous, reading them one by one during training is slow; converting them into TFRecord files speeds up training. Create create_data.py for this purpose.

First, generate a data list. audio_path is the audio file directory; each subfolder contains the audio of one category (e.g., dataset/audio/bird_sound/...). Each audio file should be at least 2.1 seconds long (adjustable). Each line of the data list has the format audio_path\tclass_label.
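
For example, a line in the generated train_list.txt might look like this (both the path and the class index are purely illustrative):

dataset/audio/bird_sound/0001.wav	3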

import os
import random

import librosa
import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm import tqdm


def get_data_list(audio_path, list_path):
    sound_sum = 0
    audios = os.listdir(audio_path)

    f_train = open(os.path.join(list_path, 'train_list.txt'), 'w')
    f_test = open(os.path.join(list_path, 'test_list.txt'), 'w')

    # Each subfolder holds one category; the folder's index is used as the class label
    for i in range(len(audios)):
        sounds = os.listdir(os.path.join(audio_path, audios[i]))
        for sound in sounds:
            sound_path = os.path.join(audio_path, audios[i], sound)
            t = librosa.get_duration(filename=sound_path)
            # [Adjustable] Filter out audio shorter than 2.1 seconds
            if t >= 2.1:
                if sound_sum % 100 == 0:
                    f_test.write('%s\t%d\n' % (sound_path, i))
                else:
                    f_train.write('%s\t%d\n' % (sound_path, i))
                sound_sum += 1
        print("Audio processed: %d/%d" % (i + 1, len(audios)))

    f_test.close()
    f_train.close()

if __name__ == '__main__':
    get_data_list('dataset/audio', 'dataset')

With the data list, generate the TFRecord files. The audio length is set to 2.04 seconds (adjustable): at the 16,000 Hz sample rate used below this is 32,640 samples, and with hop_length=256 and the default 128 Mel bins the resulting Mel Spectrogram is 128x128, which is where the 128 * 128 check below and the 16384-dimensional feature in reader.py come from.

# Get float feature
def _float_feature(value):
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

# Get integer feature
def _int64_feature(value):
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

# Create TFRecord example
def data_example(data, label):
    feature = {
        'data': _float_feature(data),
        'label': _int64_feature(label),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Generate TFRecord
def create_data_tfrecord(data_list_path, save_path):
    with open(data_list_path, 'r') as f:
        data = f.readlines()
    with tf.io.TFRecordWriter(save_path) as writer:
        for d in tqdm(data):
            try:
                path, label = d.replace('\n', '').split('\t')
                wav, sr = librosa.load(path, sr=16000)
                intervals = librosa.effects.split(wav, top_db=20)
                wav_output = []
                # [Adjustable] Audio length: 16000 * 2.04
                wav_len = int(16000 * 2.04)
                for sliced in intervals:
                    wav_output.extend(wav[sliced[0]:sliced[1]])
                for i in range(5):
                    # Crop long audio at a random position; pad short audio with zeros.
                    # Cropping into a separate variable keeps the original audio intact,
                    # so each of the five iterations can take a different random crop.
                    if len(wav_output) > wav_len:
                        l = len(wav_output) - wav_len
                        r = random.randint(0, l)
                        wav_crop = wav_output[r:wav_len + r]
                    else:
                        wav_crop = wav_output + list(np.zeros(wav_len - len(wav_output), dtype=np.float32))
                    wav_crop = np.array(wav_crop, dtype=np.float32)
                    # Convert to Mel Spectrogram
                    ps = librosa.feature.melspectrogram(y=wav_crop, sr=sr, hop_length=256).reshape(-1).tolist()
                    # [Adjustable] Check Mel Spectrogram shape: 128x128
                    if len(ps) != 128 * 128: continue
                    tf_example = data_example(ps, int(label))
                    writer.write(tf_example.SerializeToString())
                    # Short (padded) audio is written only once
                    if len(wav_output) <= wav_len:
                        break
            except Exception as e:
                print(e)

if __name__ == '__main__':
    create_data_tfrecord('dataset/train_list.txt', 'dataset/train.tfrecord')
    create_data_tfrecord('dataset/test_list.txt', 'dataset/test.tfrecord')
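
After create_data.py finishes, it is worth sanity-checking the generated files. A minimal check (assuming TensorFlow 2.x eager execution) simply counts the serialized examples:

import tensorflow as tf

# Count the examples written to the training TFRecord file
count = sum(1 for _ in tf.data.TFRecordDataset('dataset/train.tfrecord'))
print('train.tfrecord contains %d examples' % count)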

For the UrbanSound8K dataset (10 classes of urban sounds), use the following code to generate the data list:

# Create Urbansound8K data list
def get_urbansound8k_list(path, urbansound8k_csv_path):
    data_list = []
    data = pd.read_csv(urbansound8k_csv_path)
    # Filter audio with duration >= 3s
    valid_data = data[['slice_file_name', 'fold', 'classID', 'class']][data['end'] - data['start'] >= 3]
    valid_data['path'] = 'fold' + valid_data['fold'].astype('str') + '/' + valid_data['slice_file_name'].astype('str')
    for row in valid_data.itertuples():
        data_list.append([row.path, row.classID])

    f_train = open(os.path.join(path, 'train_list.txt'), 'w')
    f_test = open(os.path.join(path, 'test_list.txt'), 'w')

    for i, data in enumerate(data_list):
        sound_path = os.path.join('dataset/UrbanSound8K/audio/', data[0])
        if i % 100 == 0:
            f_test.write('%s\t%d\n' % (sound_path, data[1]))
        else:
            f_train.write('%s\t%d\n' % (sound_path, data[1]))

    f_test.close()
    f_train.close()

if __name__ == '__main__':
    get_urbansound8k_list('dataset', 'dataset/UrbanSound8K/metadata/UrbanSound8K.csv')

Create reader.py to read the TFRecord data. Adjust data_feature_description to match your Mel Spectrogram shape (128 * 128 = 16384 here):

import tensorflow as tf

def _parse_data_function(example):
    # [Adjustable] Mel Spectrogram shape: 128x128
    data_feature_description = {
        'data': tf.io.FixedLenFeature([16384], tf.float32),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(example, data_feature_description)

def train_reader_tfrecord(data_path, num_epochs, batch_size):
    raw_dataset = tf.data.TFRecordDataset(data_path)
    train_dataset = raw_dataset.map(_parse_data_function)
    train_dataset = train_dataset.shuffle(buffer_size=1000) \
        .repeat(count=num_epochs) \
        .batch(batch_size=batch_size) \
        .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return train_dataset

def test_reader_tfrecord(data_path, batch_size):
    raw_dataset = tf.data.TFRecordDataset(data_path)
    test_dataset = raw_dataset.map(_parse_data_function)
    test_dataset = test_dataset.batch(batch_size=batch_size)
    return test_dataset
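
As a quick sanity check of the reader (not part of the original scripts), you can pull one batch and confirm that each flattened spectrogram has 16384 values before train.py reshapes it to (batch, 128, 128, 1):

import reader

# Hypothetical smoke test: read a single small batch from the training TFRecord
train_dataset = reader.train_reader_tfrecord('dataset/train.tfrecord', num_epochs=1, batch_size=4)
for batch in train_dataset.take(1):
    print(batch['data'].shape)    # (4, 16384)
    print(batch['label'].numpy())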

Training

Create train.py and define the model. Here we use Keras's ResNet50V2 and feed the Mel Spectrogram in as a single-channel image (similar to a grayscale image):

import os

import numpy as np
import tensorflow as tf

import reader

class_dim = 10
EPOCHS = 100
BATCH_SIZE = 32

model = tf.keras.models.Sequential([
    tf.keras.applications.ResNet50V2(include_top=False, weights=None, input_shape=(128, None, 1)),
    tf.keras.layers.ActivityRegularization(l2=0.5),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.GlobalMaxPooling2D(),
    tf.keras.layers.Dense(units=class_dim, activation=tf.nn.softmax)
])

model.summary()

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

train_dataset = reader.train_reader_tfrecord('dataset/train.tfrecord', EPOCHS, BATCH_SIZE)
test_dataset = reader.test_reader_tfrecord('dataset/test.tfrecord', BATCH_SIZE)

Train with gradient tape and evaluate periodically:

for batch_id, data in enumerate(train_dataset):
    # Reshape Mel Spectrogram to (batch, 128, 128, 1)
    sounds = data['data'].numpy().reshape((-1, 128, 128, 1))
    labels = data['label']
    with tf.GradientTape() as tape:
        # training=True so that Dropout and BatchNorm run in training mode
        predictions = model(sounds, training=True)
        train_loss = tf.keras.losses.sparse_categorical_crossentropy(labels, predictions)
        train_loss = tf.reduce_mean(train_loss)
        train_accuracy = tf.keras.metrics.sparse_categorical_accuracy(labels, predictions)
        train_accuracy = np.sum(train_accuracy.numpy()) / len(train_accuracy.numpy())

    gradients = tape.gradient(train_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    if batch_id % 20 == 0:
        print("Batch %d, Loss: %f, Accuracy: %f" % (batch_id, train_loss.numpy(), train_accuracy))

    if batch_id % 200 == 0 and batch_id != 0:
        test_losses = []
        test_accuracies = []
        for d in test_dataset:
            test_sounds = d['data'].numpy().reshape((-1, 128, 128, 1))
            test_labels = d['label']
            test_result = model(test_sounds, training=False)
            test_loss = tf.keras.losses.sparse_categorical_crossentropy(test_labels, test_result)
            test_loss = tf.reduce_mean(test_loss)
            test_losses.append(test_loss)
            test_accuracy = tf.keras.metrics.sparse_categorical_accuracy(test_labels, test_result)
            test_accuracy = np.sum(test_accuracy.numpy()) / len(test_accuracy.numpy())
            test_accuracies.append(test_accuracy)
        print('=================================================')
        print("Test Loss: %f, Accuracy: %f" % (
            sum(test_losses)/len(test_losses), sum(test_accuracies)/len(test_accuracies)))
        print('=================================================')
        # Ensure the output directory exists before saving
        os.makedirs('models', exist_ok=True)
        model.save('models/resnet50.h5')

Prediction

After training, use predict.py to load the model and classify new audio clips:

import librosa
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('models/resnet50.h5')

def load_data(data_path):
    wav, sr = librosa.load(data_path, sr=16000)
    intervals = librosa.effects.split(wav, top_db=20)
    wav_output = []
    for sliced in intervals:
        wav_output.extend(wav[sliced[0]:sliced[1]])
    # At 16 kHz, 8000 samples correspond to 0.5 s of valid (non-silent) audio
    assert len(wav_output) >= 8000, "Valid audio is shorter than 0.5 s"
    wav_output = np.array(wav_output)
    ps = librosa.feature.melspectrogram(y=wav_output, sr=sr, hop_length=256).astype(np.float32)
    ps = ps[np.newaxis, ..., np.newaxis]
    return ps

def infer(audio_path):
    data = load_data(audio_path)
    result = model.predict(data)
    # Take the class with the highest probability and return it as a plain integer index
    lab = np.argmax(result, axis=1)[0]
    return lab

if __name__ == '__main__':
    path = 'test_audio.wav'
    label = infer(path)
    print('Predicted label: %d' % label)
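
The model only outputs a class index. If the data list was built with get_data_list(), the index corresponds to the position of the category folder in os.listdir(audio_path), so the same listing can map it back to a readable name. A small sketch extending predict.py under that assumption (note that os.listdir() order is filesystem-dependent, so sorting the folder names in both places would be more robust):

import os

# Rebuild the label -> class-name mapping in the same order used by get_data_list()
class_names = os.listdir('dataset/audio')

label = infer('test_audio.wav')
print('Predicted class: %s' % class_names[label])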

Additional Tools

  • record_audio.py: Record audio at 44100 Hz, 1 channel, 16-bit (a minimal recording sketch is shown below).
  • crop_audio.py: Crop long audio into segments for dataset creation.
  • infer_record.py: Real-time audio recording and prediction (3-second segments).
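
As a reference for record_audio.py, a minimal recording sketch might look like the following. This is not the repository's exact implementation; the output file name and recording length are placeholders. It records at 44100 Hz, mono, 16-bit with PyAudio and saves the result with the standard wave module:

import wave

import pyaudio

# Recording parameters: 44100 Hz, 1 channel, 16-bit
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RECORD_SECONDS = 3            # placeholder duration
OUTPUT_PATH = 'record.wav'    # placeholder output path

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

print('Recording...')
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * RECORD_SECONDS))]
print('Done.')

stream.stop_stream()
stream.close()

# Save the recorded frames as a 16-bit WAV file
wf = wave.open(OUTPUT_PATH, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()

p.terminate()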

GitHub Repository: https://github.com/yeyupiaoling/AudioClassification_Tensorflow

Xiaoye