PANNs CNN14: LiteRT (on-device AudioSet tagging, GPU CNN + host log-mel)
PANNs CNN14 (Cnn14_mAP=0.431) general sound-event tagging, converted to LiteRT with the CNN body running fully on the CompiledModel GPU (ML Drift) on Android. Given ~10 s of audio, it predicts probabilities over the 527 AudioSet classes: speech, music, instruments, animals, vehicles, alarms, household sounds, and so on. AudioSet tagging is multi-label: several tags can be high at once.
waveform[320000] (32 kHz) --[host: log-mel]--> logmel[1,1,1001,64] --[GPU: CNN14]--> probs[1,527] (sigmoid)
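The tensor shapes in the pipeline above follow from the STFT settings used in the Python example in this card (1024-point FFT, hop of 320 samples, centered frames). The sketch below only reproduces that arithmetic; the function names are illustrative and are not part of the shipped files.

```python
# Assumed STFT settings, taken from the Python example in this card.
N_FFT = 1024      # FFT size
HOP = 320         # hop between frames, in samples (10 ms at 32 kHz)
N_MELS = 64       # mel bands
CLIP = 320000     # 10 s at 32 kHz


def num_frames(num_samples: int, hop: int = HOP) -> int:
    """Frame count of a centered STFT (input padded by n_fft // 2 on each side)."""
    return 1 + num_samples // hop


def rfft_bins(n_fft: int = N_FFT) -> int:
    """Number of non-negative frequency bins produced by a real FFT."""
    return n_fft // 2 + 1


def logmel_shape(num_samples: int = CLIP) -> tuple:
    """Shape of the tensor handed to the CNN body: [batch, channel, time, mel]."""
    return (1, 1, num_frames(num_samples), N_MELS)


print(logmel_shape())   # (1, 1, 1001, 64)
print(rfft_bins())      # 513
```

The 513 bins are why the mel filterbank has shape [64, 513]: it maps each 513-bin power frame to 64 mel bands.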
On-device (Pixel 8a, Tensor G3, verified)
| Metric | Value |
|---|---|
| nodes on GPU | 45 / 45 LITERT_CL (full residency, single graph, 1 partition) |
| inference | ~124 ms GPU + ~99 ms host log-mel ≈ 0.22 s per 10 s clip |
| size | 162 MB (fp16) |
| accuracy | fp16 tflite-vs-PyTorch corr 1.000000; self-test top tag "Speech" |
How it converts (litert-torch), and why the log-mel is host-side
PANNs builds its spectrogram with torchlibrosa, whose STFT is a DFT implemented as a Conv1d, so there is no FFT op and the whole raw-audio→tags graph is almost GPU-clean. The only blocker is the reflect-pad used for STFT centering (one GATHER_ND, removable via pad_mode='constant', corr 1.0). The converted spectral front-end is nevertheless unusable: litert-torch lowers the giant 1024-tap DFT-conv incorrectly (fp32 tflite corr ≈ 0.19), and the power spectrum |STFT|² (~1e6) overflows fp16 (largest finite value 65504) on Mali and yields NaN.
So the spectral front-end is computed on the CPU (the Whisper/Kokoro pattern), matched to torchlibrosa exactly, and only the CNN body rides the GPU:
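To make the fp16 overflow concrete, here is a minimal sketch of the range check. It assumes the ~1e6 power magnitude quoted above and does not model what the GPU actually does after the overflow; `fits_fp16` is a hypothetical helper, not something from the conversion script.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_fp16(value: float) -> bool:
    """True if value is finite and within the fp16 finite range."""
    return math.isfinite(value) and abs(value) <= FP16_MAX


power = 1e6                                            # magnitude quoted for |STFT|^2
log_power_db = 10.0 * math.log10(max(power, 1e-10))    # about 60 dB
floor_db = 10.0 * math.log10(1e-10)                    # about -100 dB (the clamp floor)

print(fits_fp16(power))          # False
print(fits_fp16(log_power_db))   # True
print(fits_fp16(floor_db))       # True
```

The raw power does not fit in fp16, while the log-scaled values, from the clamp floor near -100 dB up to roughly 60 dB, sit comfortably inside the range.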
- log-mel (host): reflect-pad center, periodic Hann, 1024-pt FFT, power, mel matmul (librosa.filters.mel, slaney), 10·log10(max(mel, 1e-10)). Validated host-vs-torch corr 1.000000 (max|d| 0.0017). The mel basis is shipped here as mel_basis.bin [64, 513].
- CNN14 body (GPU): bn0 + 6 conv blocks + mean/max time-pool + 2 FC + sigmoid. Pure CNN, converts at corr 1.000000 in fp32 and fp16, op-check banned NONE / >4D 0, one delegatable graph.
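The card reports agreement as "corr" and "max|d|" but does not say how they are computed. The sketch below assumes they are the Pearson correlation and the maximum absolute difference over flattened tensors; the function names are hypothetical, and it is written in plain Python so it runs without extra dependencies.

```python
import math


def pearson_corr(a, b) -> float:
    """Pearson correlation between two equal-length sequences of floats."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def max_abs_diff(a, b) -> float:
    """Largest element-wise absolute difference (the max|d| figure)."""
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up values standing in for a flattened reference and candidate log-mel.
reference = [-20.0, -5.5, 3.0, 12.25]
candidate = [v + 0.001 for v in reference]
print(round(pearson_corr(reference, candidate), 6), max_abs_diff(reference, candidate))
```

Reporting both numbers is useful because correlation alone ignores a constant offset or scale error, which max|d| would expose.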
Files
| File | What |
|---|---|
| cnn14_audioset_fp16.tflite | the CNN body, fp16, input logmel [1,1,1001,64] → probs [1,527] |
| mel_basis.bin | mel filterbank [64, 513] float32 for the host log-mel |
| audioset_labels.txt | the 527 AudioSet class display names (row index = class id) |
| build_panns.py | conversion + host-mel validation script |
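Because mel_basis.bin is a headerless dump, a size check before reshaping can catch a truncated or mismatched download. This is a minimal sketch that assumes the file holds row-major float32 values in the machine's native byte order, matching how the Python example in this card reads it; `load_mel_basis` is a hypothetical helper.

```python
import os
from array import array

N_MELS, N_BINS = 64, 513
EXPECTED_BYTES = N_MELS * N_BINS * 4   # 131328 bytes of float32


def load_mel_basis(path: str) -> list:
    """Read the raw float32 filterbank and return it as N_MELS rows of N_BINS values."""
    size = os.path.getsize(path)
    if size != EXPECTED_BYTES:
        raise ValueError(f"{path}: expected {EXPECTED_BYTES} bytes, found {size}")
    flat = array("f")
    if flat.itemsize != 4:
        raise RuntimeError("this platform's C float is not 4 bytes")
    with open(path, "rb") as f:
        flat.fromfile(f, N_MELS * N_BINS)
    # Slice the flat buffer into rows; row r holds the weights of mel band r.
    return [flat[r * N_BINS:(r + 1) * N_BINS] for r in range(N_MELS)]


# Usage (assumes the file sits in the working directory):
# mel_basis = load_mel_basis("mel_basis.bin")
```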
Preprocessing
Input is mono 32 kHz audio, zero-padded or truncated to 10 s (320000 samples), with values in [-1, 1]. Compute the log-mel as above → [1,1,1001,64]. The 527 output sigmoid probabilities are per-class (multi-label); take the top-K as tags.
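The two bookkeeping steps around the model, fitting a clip to 320000 samples and picking the top-K tags, can be sketched as follows. The helper names are hypothetical and the labels in the usage line are made up for illustration; real code would normally operate on NumPy arrays, plain lists are used here only to keep the sketch dependency-free.

```python
import heapq

CLIP_SAMPLES = 320000  # 10 s at 32 kHz


def fit_clip(samples, length: int = CLIP_SAMPLES) -> list:
    """Truncate to `length` samples, or zero-pad at the end if shorter."""
    head = list(samples[:length])
    return head + [0.0] * (length - len(head))


def top_k_tags(probs, labels, k: int = 5) -> list:
    """Return the k (label, probability) pairs with the highest probability."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    best = heapq.nlargest(k, range(len(probs)), key=probs.__getitem__)
    return [(labels[i], probs[i]) for i in best]


print(len(fit_clip([0.1] * 1000)))                                   # 320000
print(top_k_tags([0.02, 0.91, 0.40], ["Music", "Speech", "Dog"], k=2))
# [('Speech', 0.91), ('Dog', 0.4)]
```

Since the outputs are independent sigmoids, the top-K probabilities need not sum to 1, and a fixed threshold is an equally valid way to select tags.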
Minimal usage
Android (Kotlin, CompiledModel GPU)
// staged into filesDir by an install script (162 MB, too big for assets)
val model = CompiledModel.create(File(ctx.filesDir, "cnn14_audioset_fp16.tflite").absolutePath,
CompiledModel.Options(Accelerator.GPU), null)
val inputs = model.createInputBuffers(); val outputs = model.createOutputBuffers()
inputs[0].writeFloat(logmel) // [1,1,1001,64] host log-mel (see Python below)
model.run(inputs, outputs)
val probs = outputs[0].readFloat() // [527] sigmoid, multi-label -> top-K tags
Python (desktop verification)
import numpy as np, soundfile as sf
from ai_edge_litert.interpreter import Interpreter
SR, NFFT, HOP, NMEL, CLIP = 32000, 1024, 320, 64, 320000
wav, sr = sf.read("clip_32k.wav", dtype="float32") # expects mono 32 kHz
assert sr == SR and wav.ndim == 1, "resample to 32 kHz and downmix to mono first"
x = np.zeros(CLIP, np.float32); n = min(len(wav), CLIP); x[:n] = wav[:n]
# torchlibrosa-exact log-mel: center reflect-pad, periodic Hann, |rFFT|^2, mel, 10*log10
pad = np.pad(x, NFFT // 2, mode="reflect")
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(NFFT) / NFFT)
frames = 1 + CLIP // HOP # 1001
power = np.stack([np.abs(np.fft.rfft(pad[t*HOP:t*HOP+NFFT] * win))**2 for t in range(frames)])
fb = np.fromfile("mel_basis.bin", np.float32).reshape(NMEL, 513)
logmel = (10.0 * np.log10(np.maximum(power @ fb.T, 1e-10))).astype(np.float32)
it = Interpreter(model_path="cnn14_audioset_fp16.tflite"); it.allocate_tensors()
it.set_tensor(it.get_input_details()[0]["index"], logmel[None, None]); it.invoke()
probs = it.get_tensor(it.get_output_details()[0]["index"])[0] # [527]
with open("audioset_labels.txt") as f:
    labels = f.read().splitlines()
for i in probs.argsort()[::-1][:5]:  # five highest-probability classes
    print(f"{probs[i]:.3f} {labels[i]}")
License
Code Apache-2.0; weights Cnn14_mAP=0.431.pth CC-BY-4.0 (Zenodo). AudioSet ontology © Google, CC-BY-4.0. Upstream: qiuqiangkong/audioset_tagging_cnn.
Citation
@article{kong2020panns,
title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition},
author={Kong, Qiuqiang and Cao, Yin and Iqbal, Turab and Wang, Yuxuan and Wang, Wenwu and Plumbley, Mark D},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2020}
}
