PANNs CNN14 — LiteRT (on-device AudioSet tagging, GPU CNN + host log-mel)

PANNs CNN14 (Cnn14_mAP=0.431) is a general sound-event tagger, converted to LiteRT with the CNN body running fully on the CompiledModel GPU (ML Drift) on Android. Given ~10 s of audio, it predicts probabilities over the 527 AudioSet classes (speech, music, instruments, animals, vehicles, alarms, household sounds, and so on). AudioSet tagging is multi-label: several tags can be high at once.

waveform[320000] (32 kHz) →[host: log-mel]→ logmel[1,1,1001,64] →[GPU: CNN14]→ probs[1,527] (sigmoid)
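The shapes in the pipeline line above follow from the framing parameters. The sketch below shows the arithmetic, assuming the hop size of 320 samples and FFT size of 1024 used in the Python verification snippet further down; the variable names are illustrative only.

```python
# Shape bookkeeping for the pipeline; constants are taken from this README.
SR = 32000          # sample rate (Hz)
CLIP = 10 * SR      # 10 s clip -> 320000 samples
NFFT = 1024         # FFT size
HOP = 320           # hop size (10 ms at 32 kHz)
NMEL = 64           # mel bands

# With centered framing (NFFT // 2 samples of padding on each side),
# the number of frames is 1 + CLIP // HOP.
frames = 1 + CLIP // HOP
bins = NFFT // 2 + 1            # one-sided spectrum bins

logmel_shape = (1, 1, frames, NMEL)
print(frames, bins, logmel_shape)
```

The 513 bins match the second dimension of mel_basis.bin, and the 1001 frames match the model input.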

On-device (Pixel 8a, Tensor G3 — verified)


nodes on GPU	45 / 45 LITERT_CL (full residency, single graph, 1 partition)
inference	~124 ms GPU + ~99 ms host log-mel ≈ 0.22 s per 10 s clip
size	162 MB (fp16)
accuracy	fp16 tflite-vs-PyTorch corr 1.000000; self-test top tag "Speech"
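As a rough reading of the inference row, the two timings can be turned into a real-time factor. This is only arithmetic on the numbers quoted above; the variable names are illustrative.

```python
# Timings quoted in the table above (Pixel 8a); everything else is derived.
gpu_ms = 124.0        # CNN14 body on GPU
host_ms = 99.0        # host log-mel
clip_s = 10.0         # audio covered by one inference

total_s = (gpu_ms + host_ms) / 1000.0      # seconds of compute per clip
rtf = total_s / clip_s                     # real-time factor (lower is faster)
speedup = clip_s / total_s                 # times faster than real time

print(f"total {total_s:.3f} s, RTF {rtf:.4f}, ~{speedup:.0f}x real time")
```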

How it converts (litert-torch) — and why the log-mel is host-side

PANNs builds its spectrogram with torchlibrosa, whose STFT is a DFT expressed as a Conv1d. There is therefore no FFT op, and the whole raw-audio→tags graph is almost GPU-clean: the only blocker is the reflect-pad used for STFT centering (one GATHER_ND, removable via pad_mode='constant', corr 1.0). The converted spectral front-end is nevertheless unusable, for two reasons: litert-torch lowers the giant 1024-tap DFT-conv incorrectly (fp32 tflite corr ≈ 0.19), and the power spectrum |STFT|² (~1e6) exceeds the fp16 maximum of 65504 and overflows on Mali → NaN.
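The fp16 overflow can be reproduced on the desktop with NumPy. This is a minimal sketch with made-up magnitudes; how exactly the NaN arises on the Mali GPU is not documented here, and `inf - inf` is shown only as one common way an overflowed value turns into NaN.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)   # largest finite fp16 value

# Illustrative power-spectrum magnitudes (not measured data).
power = np.array([1.0e3, 6.0e4, 1.0e6], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    power_fp16 = power.astype(np.float16)    # 1e6 is out of range -> inf
    diff = power_fp16 - power_fp16           # inf - inf -> nan

# In the log domain the same values are small and fit fp16 comfortably.
log_power = 10.0 * np.log10(power)

print(FP16_MAX, power_fp16, diff, log_power)
```

This is why the model takes the log-mel as input: after the 10·log10 the values are tens of dB rather than ~1e6.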

So the spectral front-end is computed on the CPU (the Whisper/Kokoro pattern), matched to torchlibrosa exactly, and only the CNN body rides the GPU:

log-mel (host): centered reflect-pad, periodic Hann window, 1024-pt FFT, power, mel matmul (librosa.filters.mel, slaney), 10·log10(max(mel,1e-10)). Validated host-vs-torch at corr 1.000000 (max|d| 0.0017). The mel basis is shipped here as mel_basis.bin [64, 513].
CNN14 body (GPU): bn0 + 6 conv blocks + mean/max time-pool + 2 FC + sigmoid. A pure CNN that converts at corr 1.000000 in both fp32 and fp16; the op check reports no banned ops and no tensors above 4D, so the whole body is one delegatable graph.
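The corr and max|d| figures quoted above can be computed along the following lines. This is a sketch under assumptions: the README does not state how the metrics are defined, so Pearson correlation over flattened tensors and the maximum absolute elementwise difference are assumed here, and `compare` is a hypothetical helper, not a function from build_panns.py. The arrays are synthetic stand-ins.

```python
import numpy as np

def compare(ref, test):
    """Return (Pearson correlation, max absolute difference) over flattened arrays."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    corr = float(np.corrcoef(ref, test)[0, 1])
    max_abs_diff = float(np.max(np.abs(ref - test)))
    return corr, max_abs_diff

# Synthetic stand-ins for a reference log-mel and a host-computed one.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1001, 64)).astype(np.float32)
test = ref + rng.uniform(-1e-3, 1e-3, size=ref.shape).astype(np.float32)

corr, max_d = compare(ref, test)
print(f"corr {corr:.6f}  max|d| {max_d:.4f}")
```

In practice `ref` would be the torchlibrosa output and `test` the host log-mel (or the PyTorch and tflite probabilities for the CNN body).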

Files

File	What
`cnn14_audioset_fp16.tflite`	the CNN body, fp16, input logmel [1,1,1001,64] → probs [1,527]
`mel_basis.bin`	mel filterbank [64, 513] float32 for the host log-mel
`audioset_labels.txt`	the 527 AudioSet class display names (row index = class id)
`build_panns.py`	conversion + host-mel validation script
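Since mel_basis.bin is read with np.fromfile as float32 and reshaped to [64, 513], it is presumably a headerless row-major dump, which makes its expected size easy to check. The sketch below round-trips a stand-in array through a temporary file; it does not touch the shipped file, and the file name mel_basis_demo.bin is made up for the demo.

```python
import os
import tempfile

import numpy as np

NMEL, BINS = 64, 513
expected_bytes = NMEL * BINS * 4          # float32, no header

# Stand-in filterbank, NOT the shipped mel_basis.bin.
fb = np.arange(NMEL * BINS, dtype=np.float32).reshape(NMEL, BINS)
path = os.path.join(tempfile.mkdtemp(), "mel_basis_demo.bin")
fb.tofile(path)                           # raw row-major dump

size = os.path.getsize(path)
loaded = np.fromfile(path, dtype=np.float32).reshape(NMEL, BINS)
print(size, loaded.shape)
```

A size mismatch on the real file would point to a truncated download or a different dtype.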

Preprocessing

Mono 32 kHz, padded/truncated to 10 s (320000 samples), values in [-1, 1]. Compute the log-mel as above → [1,1,1001,64]. The output 527 sigmoid probabilities are per-class (multi-label); take the top-K as tags.
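The two bookkeeping steps around the model, fixing the clip length and picking the top-K tags, can be sketched as small helpers. `fit_clip` and `top_k_tags` are hypothetical names, and the probabilities and four-entry label list are made up for illustration; real use would pass the 527 model outputs and the lines of audioset_labels.txt.

```python
import numpy as np

CLIP = 320000  # 10 s at 32 kHz

def fit_clip(wav, length=CLIP):
    """Zero-pad or truncate a mono waveform to exactly `length` samples."""
    out = np.zeros(length, dtype=np.float32)
    n = min(len(wav), length)
    out[:n] = wav[:n]
    return out

def top_k_tags(probs, labels, k=5):
    """Return the k highest-probability (label, prob) pairs, best first."""
    order = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i])) for i in order]

# Illustrative inputs only.
short = fit_clip(np.ones(1000, dtype=np.float32))      # padded with zeros
long = fit_clip(np.ones(400000, dtype=np.float32))     # truncated
probs = np.array([0.05, 0.91, 0.40, 0.77], dtype=np.float32)
labels = ["Silence", "Speech", "Music", "Dog"]
tags = top_k_tags(probs, labels, k=2)
print(short.shape, long.shape, tags)
```

Because the outputs are independent sigmoids, the returned probabilities do not sum to 1, and several tags can score high for the same clip.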

Minimal usage

Android (Kotlin, CompiledModel GPU)

// Staged into filesDir by an install script (162 MB, too big for assets).
val model = CompiledModel.create(
    File(ctx.filesDir, "cnn14_audioset_fp16.tflite").absolutePath,
    CompiledModel.Options(Accelerator.GPU), null)
val inputs = model.createInputBuffers()
val outputs = model.createOutputBuffers()
inputs[0].writeFloat(logmel)          // [1,1,1001,64] host log-mel (see Python below)
model.run(inputs, outputs)
val probs = outputs[0].readFloat()    // [527] sigmoid, multi-label -> top-K tags

Python (desktop verification)

import numpy as np
import soundfile as sf
from ai_edge_litert.interpreter import Interpreter

SR, NFFT, HOP, NMEL, CLIP = 32000, 1024, 320, 64, 320000

# Load a mono 32 kHz clip and zero-pad / truncate it to exactly 10 s.
wav, sr = sf.read("clip_32k.wav", dtype="float32")
assert sr == SR and wav.ndim == 1, "expected mono 32 kHz audio"
x = np.zeros(CLIP, np.float32)
n = min(len(wav), CLIP)
x[:n] = wav[:n]

# torchlibrosa-exact log-mel: center reflect-pad, periodic Hann, |rFFT|^2, mel, 10*log10
pad = np.pad(x, NFFT // 2, mode="reflect")
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(NFFT) / NFFT)    # periodic Hann
frames = 1 + CLIP // HOP                                        # 1001
power = np.stack([np.abs(np.fft.rfft(pad[t*HOP:t*HOP+NFFT] * win))**2 for t in range(frames)])
fb = np.fromfile("mel_basis.bin", np.float32).reshape(NMEL, 513)
logmel = (10.0 * np.log10(np.maximum(power @ fb.T, 1e-10))).astype(np.float32)   # [1001, 64]

# Run the CNN body on the desktop interpreter.
it = Interpreter(model_path="cnn14_audioset_fp16.tflite")
it.allocate_tensors()
it.set_tensor(it.get_input_details()[0]["index"], logmel[None, None])   # [1,1,1001,64]
it.invoke()
probs = it.get_tensor(it.get_output_details()[0]["index"])[0]   # [527]

# Print the five most probable tags.
with open("audioset_labels.txt") as f:
    labels = f.read().splitlines()
for i in probs.argsort()[::-1][:5]:
    print(f"{probs[i]:.3f}  {labels[i]}")

License

Code Apache-2.0; weights Cnn14_mAP=0.431.pth CC-BY-4.0 (Zenodo). AudioSet ontology © Google, CC-BY-4.0. Upstream: qiuqiangkong/audioset_tagging_cnn.

Citation

@article{kong2020panns,
  title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition},
  author={Kong, Qiuqiang and Cao, Yin and Iqbal, Turab and Wang, Yuxuan and Wang, Wenwu and Plumbley, Mark D},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2020}
}
