Messy Mashup: AST Audio Classification Model

This model is part of the "Messy Mashup" music genre classification project, designed to handle complex, noisy audio environments using Audio Spectrogram Transformers (AST).

For the full implementation, dataset details, and training code, visit the main GitHub repository: github.com/RohanIITM/dl-genai-project-26-t1

Model Description

The model uses the MIT/ast-finetuned-audioset backbone, a purely attention-based model for audio classification. It was fine-tuned to classify 10 music genres under "messy mashup" conditions: samples with complex instrument recombinations, tempo variations, and environmental noise.

Architecture: Audio Spectrogram Transformer (AST)
Mechanism: Log-Mel spectrograms are divided into overlapping patches, linearly projected, and processed by a Transformer encoder.
Backbone: Pre-trained on AudioSet, providing strong audio-specific priors.
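
To make the "overlapping patches" step concrete, the sketch below counts how many patches a spectrogram yields. The numbers are assumptions taken from the usual AST AudioSet configuration (128 mel bins, 1024 frames, 16x16 patches, stride 10 in both directions), not values confirmed for this fine-tuned model.

```python
def num_patches(n_mels=128, n_frames=1024, patch=16, f_stride=10, t_stride=10):
    """Count patches cut from an (n_mels x n_frames) log-Mel spectrogram.

    All defaults are assumed AST AudioSet settings, not values from this card.
    A stride smaller than the patch size means neighbouring patches overlap
    (here by 16 - 10 = 6 bins or frames).
    """
    f_dim = (n_mels - patch) // f_stride + 1    # patches along the frequency axis
    t_dim = (n_frames - patch) // t_stride + 1  # patches along the time axis
    return f_dim, t_dim, f_dim * t_dim


print(num_patches())                           # (12, 101, 1212)
print(num_patches(f_stride=16, t_stride=16))   # (8, 64, 512), no overlap
```

Under these assumptions the encoder sees 1212 patch tokens per clip, compared with 512 if the patches did not overlap.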

Training Data

The model was trained on a custom dataset consisting of:

Training Stems: 4,000 instrument-separated stems across 10 genres.
Noise Clips: 2,000 environmental audio clips from the ESC-50 dataset, used for noise injection.
Augmentations: Time-stretching (0.9x–1.1x), tempo synchronization, and additive noise injection.
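
The augmentations above could be sketched roughly as follows. This is a minimal standard-library illustration, not the project's code: the function names and the idea of mixing at a target signal-to-noise ratio (SNR) are assumptions, since the card does not say how the noise level was chosen. A real pipeline would more likely use NumPy arrays and a library routine such as librosa's time_stretch for the stretching itself.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in x) / len(x)


def sample_stretch_rate(rng, low=0.9, high=1.1):
    """Draw a time-stretch factor from the 0.9x-1.1x range given in the card."""
    return rng.uniform(low, high)


def add_noise(signal, noise, snr_db):
    """Add noise to signal at a target SNR in dB (the SNR policy is assumed)."""
    # Loop or truncate the noise clip so it covers the whole signal.
    fitted = [noise[i % len(noise)] for i in range(len(signal))]
    p_noise = power(fitted)
    if p_noise == 0:
        return list(signal)  # silent noise clip: nothing to add
    # Choose the gain so that power(signal) / power(gain * noise) = 10^(snr_db / 10).
    gain = math.sqrt(power(signal) / (10 ** (snr_db / 10)) / p_noise)
    return [s + gain * n for s, n in zip(signal, fitted)]
```

A lower snr_db gives a noisier mixture; at 0 dB the noise carries as much power as the music.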

Training Procedure

Hyperparameters

Epochs: 10
Learning Rate: 5e-5
Batch Size: 12
Optimizer: AdamW
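
For reference, the hyperparameters above can be collected into a small config object. The sketch below is illustrative only: the class name and the total_steps helper are hypothetical, and the example count passed to it is a placeholder, because the card does not state how many mixed training examples were generated per epoch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in the card; the class itself is hypothetical."""
    epochs: int = 10
    learning_rate: float = 5e-5
    batch_size: int = 12
    optimizer: str = "AdamW"

    def total_steps(self, n_examples):
        """Optimizer steps over all epochs, assuming the last partial batch is kept."""
        return self.epochs * math.ceil(n_examples / self.batch_size)


cfg = TrainConfig()
# 4000 is a placeholder example count, not a figure confirmed by the card.
print(cfg.total_steps(4000))  # 3340
```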

Preprocessing

Stems were mixed, normalized to the range [-1, 1], and injected with noise. The resulting waveform was converted into a log-Mel spectrogram and split into overlapping patches for the Transformer encoder.
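
The mixing and normalization steps might look like the sketch below. It assumes that "mixed" means a sample-wise sum of the stems and that "normalized" means peak normalization. Both are plausible readings of the description rather than confirmed details, and the function names are hypothetical.

```python
def mix_stems(stems):
    """Sum stems sample by sample, truncating to the shortest stem (assumed policy)."""
    n = min(len(s) for s in stems)
    return [sum(s[i] for s in stems) for i in range(n)]


def peak_normalize(x, peak=1.0):
    """Scale samples so the largest absolute value equals `peak`."""
    largest = max(abs(s) for s in x)
    if largest == 0:
        return list(x)  # all-silent input: leave unchanged
    return [s * peak / largest for s in x]


drums = [0.5, -1.0, 0.25]
bass = [0.5, -1.0, 0.25, 0.9]   # one extra sample, dropped by mix_stems
mix = peak_normalize(mix_stems([drums, bass]))
print(mix)  # [0.5, -1.0, 0.25]
```

Normalizing after mixing keeps the summed signal inside [-1, 1] even when several loud stems overlap.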

Evaluation Results

The model achieved the following performance on the competition leaderboard:

Public Score: 0.8522
Private Score: 0.8650

Limitations

The model's performance may degrade on audio whose sampling rate differs significantly from the 16 kHz rate to which the training audio was down-sampled, or on genres outside the 10-class training set.
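
Given the 16 kHz expectation, an inference call would normally resample on load. The sketch below shows one way to do that with librosa and the Transformers AST classes. Several things here are assumptions: the repo id is the one shown on this model page, the feature extractor is built with its library defaults rather than settings confirmed for this model, and classify and top_k are hypothetical helper names.

```python
import math

MODEL_ID = "23f1000371/ast-messy-mashup"  # assumed repo id, taken from this page
TARGET_SR = 16_000                        # training sample rate stated in the card


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


def classify(path, k=3):
    """Load a clip, resample it to 16 kHz mono, and return the top-k genres."""
    # Heavy dependencies are imported lazily so the helpers above work without them.
    import librosa
    import torch
    from transformers import ASTFeatureExtractor, ASTForAudioClassification

    waveform, _ = librosa.load(path, sr=TARGET_SR, mono=True)  # resamples on load
    extractor = ASTFeatureExtractor()  # library defaults; assumed to match training
    model = ASTForAudioClassification.from_pretrained(MODEL_ID)
    model.eval()
    inputs = extractor(waveform, sampling_rate=TARGET_SR, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_k(logits, model.config.id2label, k)
```

Calling classify("clip.wav") with a local file would return up to three (genre, probability) pairs. Clips from genres outside the training set will still be assigned one of the 10 known labels.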

Model size: 86.2M params (Safetensors, tensor type F32)

Spaces using 23f1000371/ast-messy-mashup: 1

Private Leaderboard Accuracy on Messy Mashup (Custom), self-reported: 86.500