Preprocessing the Audio Dataset

Audio preprocessing is an essential step in preparing audio data for machine learning models.

Improves audio quality by reducing noise and distortions
Extracts meaningful features from raw audio signals
Converts data into a format suitable for model input
Enhances overall model performance and accuracy

Importance of Audio Preprocessing

Preprocessing helps improve model performance and ensures consistency across datasets.

Reduces background noise and unwanted signals
Standardizes formats, sample rates and resolutions
Extracts important features like MFCCs and spectrograms
Normalizes signal amplitude for consistency
Handles variable-length audio using padding or trimming
Improves training efficiency and model accuracy

Implementation

Step 1: Install Required Libraries

pip install gdown librosa

Step 2: Import Required Libraries

librosa: Load and process audio signals
scipy.signal: Apply filters (noise removal)
numpy: Handle numerical operations on audio arrays
os: Work with file paths
matplotlib: Visualize audio features
librosa.display: Plot spectrograms

Python

import librosa
from scipy.signal import butter, filtfilt
import numpy as np
import os
import matplotlib.pyplot as plt
import librosa.display

Step 3: Load Dataset

Audio datasets are often large and stored externally
Extracting them ensures we can access individual audio files

Python

file_id = '1lNUGw8VMXvY2Yu6aITYlOCNaj8y-KbNB'

!gdown --id $file_id -O dataset.zip
!unzip -q dataset.zip -d /content/

Step 4: Resampling

Audio files may have different sample rates (e.g., 44.1kHz, 22kHz)
Models usually require a fixed sample rate (e.g., 16kHz)

Python

sample_audio_path = '/content/barbie_vs_puppy/barbie/barbie_4.wav'

def resample_audio(audio_path, target_sr=16000):
    y, sr = librosa.load(audio_path, sr=target_sr)
    return y, sr

resampled_audio, sr = resample_audio(sample_audio_path)
print(f"Sample rate after Resampling: {sr}")

Output:

Step 5: Filtering

Removes high-frequency noise using a low-pass filter

Python

def butter_lowpass_filter(data, cutoff_freq, sample_rate, order=4):
    nyquist = 0.5 * sample_rate
    normal_cutoff = cutoff_freq / nyquist
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    filtered_data = filtfilt(b, a, data)
    print(f"Filtered audio shape: {filtered_data.shape}")
    return filtered_data

filtered_audio = butter_lowpass_filter(resampled_audio, cutoff_freq=4000, sample_rate=sr)

Step 6: Convert to Model Input

Audio clips are adjusted to a fixed length
Ensures consistent input shape like (16000,)

Python

def convert_to_model_input(y, target_length):
    if len(y) < target_length:
        y = np.pad(y, (0, target_length - len(y)))
    else:
        y = y[:target_length]
    return y

model_input = convert_to_model_input(filtered_audio, target_length=16000)
print(f"Model input shape: {model_input.shape}")

Output:

Step 7: Audio Data Streaming (Batch Processing)

Processes audio files in batches instead of all at once
Saves memory
Works with large datasets
Enables real-time and scalable systems

Python

def stream_audio_dataset(dataset_path, batch_size=32, target_length=16000, target_sr=None):
    audio_files = [os.path.join(root, file) for root, dirs, files in os.walk(dataset_path) for file in files]
    np.random.shuffle(audio_files)

    for i in range(0, len(audio_files), batch_size):
        batch_paths = audio_files[i:i + batch_size]
        batch_data = []

        for file_path in batch_paths:
            y, sr = librosa.load(file_path, sr=target_sr)

            if target_sr is not None and sr != target_sr:
                y = librosa.resample(y, sr, target_sr)
                sr = target_sr

            filtered_audio = butter_lowpass_filter(y, cutoff_freq=4000, sample_rate=sr)
            model_input = convert_to_model_input(filtered_audio, target_length=target_length)
            batch_data.append(model_input)

        yield np.array(batch_data)
        
dataset_path = '/content/barbie_vs_puppy/barbie'

for batch_data in stream_audio_dataset(dataset_path, batch_size=2, target_sr=16000):
    print(f"Processing batch with {len(batch_data)} files")
    print(f"Shape of the first file: {batch_data[0].shape}")

Output:

Step 8: Log-Mel Spectrogram

Converts audio into a visual representation (frequency vs time)
Raw audio is hard for models to understand so Spectrograms capture Frequency patterns,Temporal changes

Python

def compute_logmel_spectrogram(y, sr, n_mels=128, hop_length=512):
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    logmel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
    return logmel_spectrogram

audio_file_path = '/content/barbie_vs_puppy/barbie/barbie_4.wav'
target_sr = 16000

y, sr = librosa.load(audio_file_path, sr=target_sr)
logmel_spectrogram = compute_logmel_spectrogram(y, sr=sr)

plt.figure(figsize=(8, 4))
librosa.display.specshow(logmel_spectrogram, sr=sr, hop_length=512, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-Mel Spectrogram')
plt.show()

Output:

Download full code from here

Applications

Speech Recognition: Improves accuracy in systems like voice assistants and transcription tools
Audio Classification: Used to classify sounds such as music genres, environmental sounds or speaker identity
Music Analysis: Helps in tasks like beat detection, genre classification and recommendation systems
Healthcare: Assists in analyzing speech patterns for detecting disorders or medical conditions
Security and Surveillance: Enables sound-based event detection like alarms, gunshots or anomalies
Voice Biometrics: Supports speaker verification and authentication systems

Preprocessing the Audio Dataset

Importance of Audio Preprocessing

Implementation

Step 1: Install Required Libraries

Step 2: Import Required Libraries

Step 3: Load Dataset

Step 4: Resampling

Step 5: Filtering

Step 6: Convert to Model Input

Step 7: Audio Data Streaming (Batch Processing)

Step 8: Log-Mel Spectrogram

Applications

Explore