Data Augmentation for Audio

Data Augmentation

Jun 1, 2019

Although tuning the model architecture and hyperparameters is an important part of building a good model, data scientists should also focus on the data. No matter how amazing a model you build, garbage in, garbage out (GIGO).

Lack of data is one of the most common issues in real-world data science problems. Data augmentation generates synthetic data from an existing data set so that the generalisation capability of the model can be improved.

In the previous story, we explained how to work with spectrograms. In this story, we will talk about some basic augmentation methods for audio. This story and its implementation are inspired by Kaggle’s Audio Data Augmentation Notebook.

Data Augmentation for Audio

To generate synthetic data for audio, we can apply noise injection, time shifting, pitch changing and speed changing. numpy provides an easy way to handle noise injection and time shifting, while librosa (library for Recognition and Organization of Speech and Audio) helps to manipulate pitch and speed with just one line of code.

Noise Injection

It simply adds random noise, drawn from a standard normal distribution and scaled by noise_factor, to the data by using numpy.

import numpy as np

def manipulate(data, noise_factor):
    # Draw one standard-normal noise value per sample
    noise = np.random.randn(len(data))
    augmented_data = data + noise_factor * noise
    # Cast back to the same data type as the input
    augmented_data = augmented_data.astype(data.dtype)
    return augmented_data

Comparison between original and noisy voice
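
As a usage sketch, the noise factor can be derived from a target signal-to-noise ratio (SNR) instead of being picked by hand. The helper names `add_noise` and `noise_factor_for_snr`, the synthetic 440 Hz tone and the 20 dB target are assumptions made for illustration; they are not part of the Kaggle notebook or of nlpaug.

```python
import numpy as np


def add_noise(data, noise_factor, rng=None):
    """Add Gaussian noise scaled by noise_factor (same idea as manipulate above)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(data))
    return (data + noise_factor * noise).astype(data.dtype)


def noise_factor_for_snr(data, snr_db):
    """Noise standard deviation that gives roughly snr_db of signal-to-noise ratio."""
    signal_power = np.mean(data.astype(np.float64) ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return float(np.sqrt(noise_power))


# One second of a synthetic 440 Hz tone stands in for a real recording
sampling_rate = 16000
t = np.arange(sampling_rate) / sampling_rate
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

factor = noise_factor_for_snr(tone, snr_db=20)
noisy = add_noise(tone, factor, rng=np.random.default_rng(0))

# Check the result: the measured SNR should be close to the 20 dB target
residual = noisy.astype(np.float64) - tone.astype(np.float64)
measured_snr_db = 10 * np.log10(np.mean(tone.astype(np.float64) ** 2) / np.mean(residual ** 2))
print(f"noise_factor={factor:.4f}, measured SNR={measured_snr_db:.1f} dB")
```

Thinking in terms of SNR makes the amount of noise comparable across recordings with different loudness, which a fixed noise_factor does not.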

Shifting Time

The idea of shifting time is very simple: shift the audio to the left or right by a random number of seconds. In the implementation below, shifting the audio to the left (fast forward) by x seconds sets the first x seconds to 0 (i.e. silence), and shifting the audio to the right (rewind) by x seconds sets the last x seconds to 0 (i.e. silence).

import numpy as np

def manipulate(data, sampling_rate, shift_max, shift_direction):
    # Random shift of up to shift_max seconds, expressed in samples
    shift = np.random.randint(int(sampling_rate * shift_max))
    if shift_direction == 'right':
        shift = -shift
    elif shift_direction == 'both':
        # Flip the direction with probability 0.5
        direction = np.random.randint(0, 2)
        if direction == 1:
            shift = -shift

    augmented_data = np.roll(data, shift)
    # Set the wrapped-around head/tail to silence
    if shift > 0:
        augmented_data[:shift] = 0
    elif shift < 0:
        augmented_data[shift:] = 0
    return augmented_data

Comparison between original and shifted voice
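
To make the roll-then-silence step concrete, here is a small sketch on a toy array with a fixed shift instead of a random one. The helper name `shift_with_silence` is hypothetical and only used for illustration.

```python
import numpy as np


def shift_with_silence(data, shift):
    """Roll data by `shift` samples and silence the part that wrapped around."""
    augmented = np.roll(data, shift)  # np.roll returns a copy, data is untouched
    if shift > 0:
        augmented[:shift] = 0   # head was filled from the tail
    elif shift < 0:
        augmented[shift:] = 0   # tail was filled from the head
    return augmented


samples = np.arange(1, 9)  # toy "audio": [1 2 3 4 5 6 7 8]

print(shift_with_silence(samples, 3))   # [0 0 0 1 2 3 4 5]
print(shift_with_silence(samples, -3))  # [4 5 6 7 8 0 0 0]
print(shift_with_silence(samples, 0))   # [1 2 3 4 5 6 7 8]
```

The zero-shift case is handled separately on purpose: slicing with `[0:]` would otherwise silence the whole signal.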

Changing Pitch

This augmentation is a wrapper around a librosa function. It shifts the pitch by pitch_factor steps; draw pitch_factor at random to change the pitch randomly.

import librosa

def manipulate(data, sampling_rate, pitch_factor):
    return librosa.effects.pitch_shift(data, sr=sampling_rate, n_steps=pitch_factor)

Comparison between original and changed pitch voice
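
One possible way to pick the pitch factor at random is sketched below, together with the frequency ratio a given number of steps corresponds to. It assumes librosa's default of 12 steps per octave, so one step is a semitone; the helper names and the range of ±2 steps are hypothetical choices for illustration.

```python
import numpy as np


def random_pitch_factor(max_steps, rng=None):
    """Draw a pitch shift uniformly from [-max_steps, max_steps] steps."""
    rng = np.random.default_rng() if rng is None else rng
    return float(rng.uniform(-max_steps, max_steps))


def steps_to_frequency_ratio(n_steps, bins_per_octave=12):
    """Frequency multiplier for a shift of n_steps (12 steps = one octave up)."""
    return 2.0 ** (n_steps / bins_per_octave)


pitch_factor = random_pitch_factor(max_steps=2, rng=np.random.default_rng(0))
ratio = steps_to_frequency_ratio(pitch_factor)
print(f"pitch_factor={pitch_factor:+.2f} steps, frequency ratio={ratio:.3f}")

# pitch_factor can then be passed to manipulate(data, sampling_rate, pitch_factor)
```

Keeping the range small is a design choice: large shifts can make speech sound unnatural, which may hurt more than it helps.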

Changing Speed

As with changing pitch, this augmentation is performed by a librosa function. It stretches the time series by a fixed rate: a rate above 1 speeds the audio up, and a rate below 1 slows it down.

import librosa

def manipulate(data, speed_factor):
    return librosa.effects.time_stretch(data, rate=speed_factor)

Comparison between original and changed speed voice
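
Because the speed factor changes the length of the clip, it helps to know the resulting duration before batching or padding. The sketch below computes the approximate duration after stretching; the helper name `stretched_duration` and the 3-second clip are assumptions for illustration, and the exact output length of librosa may differ by a few samples.

```python
def stretched_duration(n_samples, sampling_rate, speed_factor):
    """Approximate duration in seconds after time-stretching by speed_factor."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return n_samples / sampling_rate / speed_factor


n_samples, sampling_rate = 48000, 16000  # a hypothetical 3-second clip

for speed_factor in (0.5, 1.0, 1.5):
    seconds = stretched_duration(n_samples, sampling_rate, speed_factor)
    print(f"speed_factor={speed_factor}: about {seconds:.2f} s")
```

A speed factor of 0.5 doubles the clip length and 1.5 shortens it by a third, so a model with fixed-size input will need padding or cropping after this augmentation.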

Take Away

The above 4 methods are implemented in the nlpaug package (≥ 0.0.3). You can generate augmented data within a few lines of code.
Data augmentation cannot replace real training data. It just helps to generate synthetic data to make the model better.
Do not blindly generate synthetic data. You have to understand your data patterns and select an appropriate way to increase the training data volume.

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or Github.


Written by Edward Ma