How to use 23f1000371/ast-messy-mashup with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("audio-classification", model="23f1000371/ast-messy-mashup")

# Load model directly
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

extractor = AutoFeatureExtractor.from_pretrained("23f1000371/ast-messy-mashup")
model = AutoModelForAudioClassification.from_pretrained("23f1000371/ast-messy-mashup")
```
Messy Mashup: AST Audio Classification Model
This model is part of the "Messy Mashup" music genre classification project, designed to handle complex, noisy audio environments using Audio Spectrogram Transformers (AST).
For the full implementation, dataset details, and training code, visit the main GitHub repository: github.com/RohanIITM/dl-genai-project-26-t1
Model Description
The model uses the MIT/ast-finetuned-audioset backbone, a purely attention-based model for audio classification. It was fine-tuned to classify 10 music genres under "messy mashup" conditions: samples with complex instrument recombinations, tempo variations, and environmental noise.
- Architecture: Audio Spectrogram Transformer (AST)
- Mechanism: Log-Mel spectrograms are divided into overlapping patches, linearly projected, and processed by a Transformer encoder.
- Backbone: Pre-trained on AudioSet, providing strong audio-specific priors.
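To make the patching step concrete, the sketch below counts how many overlapping patches a spectrogram yields. The default values (16x16 patches, a stride of 10 in both directions, 128 Mel bins, 1024 frames) are the commonly cited AST defaults and are assumptions here; this card does not state the exact settings used for this model.

```python
def count_patches(n_mels, n_frames, patch=16, f_stride=10, t_stride=10):
    """Count overlapping patches cut from an (n_mels x n_frames) spectrogram.

    All default values are assumed AST defaults, not values confirmed
    by this model card.
    Returns (patches along frequency, patches along time, total patches).
    """
    n_freq = (n_mels - patch) // f_stride + 1
    n_time = (n_frames - patch) // t_stride + 1
    return n_freq, n_time, n_freq * n_time


# With the assumed defaults, a 128 x 1024 spectrogram gives 12 x 101 patches.
print(count_patches(128, 1024))  # (12, 101, 1212)
```

Because the stride (10) is smaller than the patch size (16), neighbouring patches share 6 rows or columns, which is what "overlapping" means in the list above.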
Training Data
The model was trained on a custom dataset consisting of:
- Training Stems: 4,000 instrument-separated stems across 10 genres.
- Noise Injection: 2,000 environmental audio clips from the ESC-50 dataset.
- Augmentations: Time-stretching (0.9x–1.1x), tempo synchronization, and additive noise injection.
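As an illustration of the additive noise step, the following sketch mixes a noise clip into a signal at a chosen signal-to-noise ratio and draws a stretch rate from the stated 0.9x to 1.1x range. The SNR parameterisation, the looping of short noise clips, and the helper names `add_noise` and `sample_stretch_rate` are assumptions for this example; the card does not say how noise levels were chosen, and the time-stretch itself (which needs a phase vocoder or similar) is not implemented here.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def add_noise(signal, noise, snr_db):
    """Mix `noise` into `signal` at the requested SNR in dB (assumed scheme)."""
    # Loop the noise clip so it covers the whole signal.
    tiled = [noise[i % len(noise)] for i in range(len(signal))]
    noise_rms = rms(tiled)
    if noise_rms == 0:
        return list(signal)
    # Scale the noise so that rms(signal) / rms(scaled noise) matches snr_db.
    gain = rms(signal) / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(signal, tiled)]


def sample_stretch_rate(rng, low=0.9, high=1.1):
    """Draw a time-stretch rate from the range stated in the card."""
    return rng.uniform(low, high)


rng = random.Random(0)
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
noisy = add_noise(signal, noise, snr_db=10)
rate = sample_stretch_rate(rng)
```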
Training Procedure
Hyperparameters
- Epochs: 10
- Learning Rate: 5e-5
- Batch Size: 12
- Optimizer: AdamW
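The hyperparameters above can be collected into a small configuration object, as in the sketch below. `TrainConfig` and `total_steps` are hypothetical names for this example, not part of the project's code, and the example count of 1,200 is purely illustrative, since the card does not state how many mixed training examples were generated.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in the card (the container is hypothetical)."""
    epochs: int = 10
    learning_rate: float = 5e-5
    batch_size: int = 12
    optimizer: str = "AdamW"


def total_steps(cfg, n_examples):
    """Optimizer steps for a full run, assuming no gradient accumulation."""
    return math.ceil(n_examples / cfg.batch_size) * cfg.epochs


cfg = TrainConfig()
# 1,200 is an illustrative dataset size, not a figure from the card.
steps = total_steps(cfg, n_examples=1200)
```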
Preprocessing
Stems were mixed, normalized to the range [-1, 1], and injected with noise. The resulting waveform was converted into a log-Mel spectrogram and split into overlapping patches that serve as the input tokens for the Transformer encoder.
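A minimal sketch of the mixing and normalization steps is shown below. It assumes that stems are summed sample by sample, truncated to the shortest stem, and peak-normalized; the card only says the result lies in [-1, 1], so the exact method is an assumption, and `mix_and_normalize` is a hypothetical helper name.

```python
def mix_and_normalize(stems, eps=1e-9):
    """Sum stems sample-wise and peak-normalize the mix to [-1, 1].

    Summation, truncation to the shortest stem, and peak normalization
    are assumptions for this sketch, not confirmed project behaviour.
    """
    n = min(len(s) for s in stems)
    mix = [sum(s[i] for s in stems) for i in range(n)]
    peak = max(abs(v) for v in mix)
    if peak < eps:
        # Silent mix: nothing to scale.
        return mix
    return [v / peak for v in mix]


drums = [0.5, -1.0, 0.25]
bass = [0.5, -1.0, 0.25, 0.9]  # longer stem, truncated to match
mixed = mix_and_normalize([drums, bass])
```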
Evaluation Results
The model achieved the following performance on the competition leaderboard:
- Public Score: 0.8522
- Private Score: 0.8650
Limitations
The model's performance may degrade on audio whose sampling rate differs significantly from the 16 kHz rate to which the training audio was down-sampled, or on genres not represented in the 10-class training set.
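One way to reduce the sampling-rate mismatch is to resample inputs to 16 kHz before inference. The sketch below uses naive linear interpolation purely for illustration: it applies no anti-aliasing filter, so down-sampling with it can introduce aliasing, and a real pipeline would more likely use a dedicated resampler such as `librosa.resample` or `torchaudio.functional.resample`. `resample_linear` is a hypothetical helper name.

```python
def resample_linear(samples, orig_sr, target_sr=16_000):
    """Resample by linear interpolation (illustrative, no anti-aliasing)."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_sr == target_sr or len(samples) < 2:
        return list(samples)
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = min(i * step, last)  # clamp to the final input sample
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of a 32 kHz ramp becomes 16,000 samples at 16 kHz.
ramp = [float(i) for i in range(32_000)]
resampled = resample_linear(ramp, orig_sr=32_000)
```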
Evaluation results
- Private Leaderboard Accuracy on Messy Mashup (Custom), self-reported: 86.500