larger-clap-general-coreml
Two artifacts derived from laion/larger_clap_general, kept in the same embedding space so they can be used as a pair:
- clap_audio_encoder.mlpackage: native Core ML build of the audio encoder + projection head. Runs accelerated on Apple GPU via MLComputeUnits.cpuAndGPU.
- text_model.onnx: ONNX build of the text encoder + projection head. Standard ORT-compatible, cross-platform.
Both take their respective inputs and return an L2-normalized 512-d embedding in the joint CLAP space (cosine similarity == dot product).
larger_clap_general is trained on general audio, music and speech; use the pair for zero-shot audio classification or open-vocabulary retrieval.
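Because both encoders return unit-length vectors, retrieval reduces to a matrix-vector product. The sketch below ranks candidate labels for one clip; rank_labels and l2_normalize are illustrative helper names, and the embeddings are random stand-ins rather than real model outputs.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def rank_labels(audio_emb: np.ndarray, text_embs: np.ndarray, labels: list[str]) -> list[tuple[str, float]]:
    """Rank candidate labels for one audio clip, best match first.

    audio_emb: (512,) or (1, 512) unit-norm audio embedding.
    text_embs: (N, 512) unit-norm text embeddings, one row per label.
    """
    scores = text_embs @ audio_emb.reshape(-1)  # dot product == cosine for unit vectors
    order = np.argsort(-scores)
    return [(labels[i], float(scores[i])) for i in order]

# Stand-in data (assumption): random unit vectors instead of real embeddings.
rng = np.random.default_rng(0)
text_embs = l2_normalize(rng.standard_normal((3, 512)).astype(np.float32))
# Fake "audio" embedding: the second text vector plus a little noise.
audio_emb = l2_normalize(text_embs[1] + 0.01 * rng.standard_normal(512).astype(np.float32))

ranking = rank_labels(audio_emb, text_embs, ["dog barking", "kick drum", "piano loop"])
print(ranking[0])
```

With real embeddings, text_embs would be the stacked text_embeds outputs for the candidate prompts and audio_emb the embedding output for the clip.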
Why this repo exists
- Audio side: ort's CoreML execution provider can't accelerate HTSAT. Reflect-pad, 5-D reshapes, relative-position-bias gather, and dynamic shapes shred the graph into CPU partitions, so the EP "registers" but every node runs on CPU. Loading the .mlpackage directly via Core ML (skipping ORT entirely) runs the full graph on the Apple GPU.
- Text side: this text_model.onnx is re-exported directly from LAION's PyTorch with no optimum graph fusion. Xenova's matching larger_clap_general ONNX export of the text encoder is in a slightly different numerical subspace than LAION's PyTorch (graph fusions + quantization add up), so pairing Xenova-text with our LAION-derived audio model collapses text-audio cosine to ~0.2. Re-exporting text from the same PyTorch source recovers ~0.7+ on good matches.
Inputs / Outputs
Audio (clap_audio_encoder.mlpackage)
| | name | shape | dtype | notes |
|---|---|---|---|---|
| Input | audio | [1, 480000] | float32 | 10 s mono @ 48 kHz, peak-normalized to [-1, 1] |
| Output | embedding | [1, 512] | float32 | L2-normalized; cosine == dot product |
The mel-spectrogram extraction (STFT, Slaney mel filterbank, log) is baked into the model graph, so you pass raw audio, not features.
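A minimal sketch of getting decoded audio into the input layout from the table above. prepare_clip is a hypothetical helper; it assumes the samples are already at 48 kHz (resampling is out of scope here) and that peak normalization is applied to the whole clip before padding.

```python
import numpy as np

TARGET_SAMPLES = 480_000  # 10 s at 48 kHz, the model's fixed input length

def prepare_clip(samples: np.ndarray) -> np.ndarray:
    """Return a [1, 480000] float32 array from 48 kHz audio.

    samples: (n,) mono or (n, channels) audio, already resampled to 48 kHz.
    """
    x = np.asarray(samples, dtype=np.float32)
    if x.ndim == 2:
        x = x.mean(axis=1)  # downmix to mono
    peak = float(np.max(np.abs(x))) if x.size else 0.0
    if peak > 0.0:
        x = x / peak  # peak-normalize to [-1, 1]
    x = x[:TARGET_SAMPLES]  # keep at most one 10 s window
    x = np.pad(x, (0, TARGET_SAMPLES - x.shape[0]))  # zero-pad short clips
    return x[np.newaxis, :]
```

The result can be passed directly as the audio input; silent input is returned as all zeros rather than divided by a zero peak.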
Text (text_model.onnx)
| | name | shape | dtype | notes |
|---|---|---|---|---|
| Input | input_ids | [B, T] | int64 | RoBERTa tokenizer output |
| Input | attention_mask | [B, T] | int64 | 1 for real tokens, 0 for padding |
| Output | text_embeds | [B, 512] | float32 | L2-normalized; cosine == dot product |
Both batch and sequence length are dynamic. Use the tokenizer from Xenova/larger_clap_general (or any larger_clap_general mirror with the standard RoBERTa tokenizer config); vocab + special tokens are identical across exports.
Variable-length audio
The graph has a fixed 10 s input shape. For arbitrary-length audio, recommended recipe:
| Duration | Strategy |
|---|---|
| ≤ 10 s | Zero-pad to 480_000 samples, single forward pass. |
| > 10 s | Sliding 10 s windows with 50 % overlap, embed each window, mean-pool the embeddings, re-L2-normalize. |
For very long files, cap the window count to bound runtime: uniformly spacing N windows across [0, T-10s] gives full-file coverage without per-window blow-up.
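The recipe above can be sketched as follows. window_starts and embed_long are illustrative names, embed_fn is a placeholder for whichever runtime call produces the [1, 512] embedding, and the default cap of 16 windows is an arbitrary choice, not a value from this repo.

```python
import numpy as np

SR = 48_000
WINDOW = 10 * SR  # 480_000 samples

def window_starts(n_samples: int, max_windows: int = 16) -> list[int]:
    """Start offsets of the 10 s windows used for a clip of n_samples."""
    if n_samples <= WINDOW:
        return [0]
    last = n_samples - WINDOW
    hop = WINDOW // 2  # 50 % overlap
    starts = list(range(0, last, hop)) + [last]  # final window sits flush with the end
    if len(starts) > max_windows:
        # Too many windows: space max_windows uniformly across [0, T - 10 s].
        starts = [int(round(s)) for s in np.linspace(0, last, max_windows)]
    return starts

def embed_long(waveform: np.ndarray, embed_fn, max_windows: int = 16) -> np.ndarray:
    """Embed mono 48 kHz audio of any length; returns a unit-norm (512,) vector.

    embed_fn (placeholder) takes a [1, 480000] float32 array and returns
    a [1, 512] embedding.
    """
    x = np.asarray(waveform, dtype=np.float32).reshape(-1)
    if x.shape[0] < WINDOW:
        x = np.pad(x, (0, WINDOW - x.shape[0]))  # zero-pad short clips
    embs = [
        np.asarray(embed_fn(x[s:s + WINDOW][np.newaxis, :])).reshape(-1)
        for s in window_starts(x.shape[0], max_windows)
    ]
    pooled = np.mean(embs, axis=0)          # mean-pool window embeddings
    return pooled / np.linalg.norm(pooled)  # re-L2-normalize
```

Placing the last window flush with the end of the file is a design choice of this sketch; it avoids a mostly zero-padded final window at the cost of slightly uneven overlap.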
Usage
Swift (Core ML)
import CoreML
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU
let model = try MLModel(contentsOf: compiledURL, configuration: config)
let audio = try MLMultiArray(shape: [1, 480_000], dataType: .float32)
// copy your normalized waveform into audio.dataPointer ...
let provider = try MLDictionaryFeatureProvider(dictionary: ["audio": audio])
let out = try model.prediction(from: provider)
let embedding = out.featureValue(for: "embedding")!.multiArrayValue!
Rust (objc2-core-ml)
The objc2/objc2-core-ml crates give direct Rust bindings to Core ML. Sketch:
use objc2_core_ml::{MLModel, MLModelConfiguration, MLMultiArray, MLMultiArrayDataType,
MLDictionaryFeatureProvider, MLFeatureValue, MLComputeUnits};
// Core ML wants a compiled .mlmodelc β compile the .mlpackage once,
// then load with cpuAndGPU compute units.
let compiled = unsafe { MLModel::compileModelAtURL_error(&mlpackage_url) }?;
let config = unsafe { MLModelConfiguration::new() };
unsafe { config.setComputeUnits(MLComputeUnits::CPUAndGPU) };
let model = unsafe { MLModel::modelWithContentsOfURL_configuration_error(&compiled, &config) }?;
// Build [1, 480000] float32 input, copy waveform via dataPointer,
// wrap in MLFeatureValue + MLDictionaryFeatureProvider, run prediction.
Python (audio via coremltools, text via onnxruntime)
import coremltools as ct
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
# --- audio: Core ML ---
audio_model = ct.models.MLModel("clap_audio_encoder.mlpackage")
waveform = np.zeros((1, 480_000), dtype=np.float32)
audio_emb = audio_model.predict({"audio": waveform})["embedding"] # (1, 512)
# --- text: ONNX ---
tok = AutoTokenizer.from_pretrained("Xenova/larger_clap_general")
text_sess = ort.InferenceSession("text_model.onnx", providers=["CPUExecutionProvider"])
encoded = tok("a dog barking", return_tensors="np", padding=True)
text_emb = text_sess.run(["text_embeds"], {
    "input_ids": encoded["input_ids"].astype(np.int64),
    "attention_mask": encoded["attention_mask"].astype(np.int64),
})[0] # (1, 512)
# Joint-embedding similarity:
similarity = float(np.dot(audio_emb.flatten(), text_emb.flatten()))
How it was built
Audio (clap_audio_encoder.mlpackage)
coremltools 8 + torch.export from laion/larger_clap_general's PyTorch weights, then convert_to="mlprogram" + int8 linear weight quantization. The conversion is non-trivial: ct.convert rejects the model out of the box. Patches applied:
- F.interpolate(mode='bicubic') → 'bilinear': CoreML's MIL backend lacks bicubic upsampling. Used by HTSAT's positional-embedding resize. Accuracy delta is negligible.
- torch.jit.is_tracing() → True: forces HF's CLAP code onto the static-shape path during conversion.
- ClapAudioLayer.set_shift_and_window_size → no-op: the dynamic window adjustment hits a "data-dependent guard" error in torch.export. For our fixed [1, 1, 1001, 64] input the __init__ values are already correct, so neutralizing is safe.
- Custom STFT: torch.stft's op signature drifts across torch versions and the coremltools handler unpacks the wrong arity; implemented as strided conv1d with pre-baked cos/sin Hann bases instead.
- Custom fmod MIL lowering: HTSAT's relative-position arithmetic uses float modulo; coremltools has no built-in handler. Registered as x - trunc(x/y) * y.
- slice_scatter override: HTSAT's attention-mask builder generates empty-slice slice_scatter calls at deeper Swin stages (e.g. slice(0, -window_size) evaluates to slice(0, 0)). The built-in handler's shape check rejects these; registered override that no-ops empty slices and reduces non-empty ones to slice_by_index + concat.
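Two of these patches rest on identities that are easy to check numerically. The NumPy sketch below is not the conversion script's code: fmod_via_trunc and stft_via_bases are illustrative names, the periodic Hann window is an assumption, and centering/reflect padding is left out. It shows that x - trunc(x/y) * y reproduces float modulo, and that an STFT can be written as a strided product against fixed windowed cos/sin bases, which is what a conv1d with stride hop computes.

```python
import numpy as np

def fmod_via_trunc(x, y):
    """Float modulo whose sign follows x, via the identity quoted above."""
    return x - np.trunc(x / y) * y

def stft_via_bases(x, n_fft, hop):
    """STFT of a 1-D signal as a strided dot product against fixed bases.

    Equivalent to a conv1d with stride `hop` whose kernels are the windowed
    cos/sin rows below. Returns (real, imag), each (frames, n_fft // 2 + 1).
    """
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft)  # periodic Hann (assumption)
    angle = 2 * np.pi * k * n / n_fft
    cos_basis = window * np.cos(angle)   # (bins, n_fft), real-part kernels
    sin_basis = -window * np.sin(angle)  # (bins, n_fft), imaginary-part kernels
    # One row per frame: windows of length n_fft taken every `hop` samples.
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    return frames @ cos_basis.T, frames @ sin_basis.T

# Toy check against NumPy's FFT; sizes are small for speed, not CLAP's settings.
rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
real, imag = stft_via_bases(signal, n_fft=64, hop=16)
print(real.shape, imag.shape)
```

Because the bases are constants, the whole transform becomes a pair of fixed-weight convolutions, which sidesteps any dependency on a converter handler for torch.stft.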
A full conversion script that applies all six patches is included in this repo: convert-clap-to-coreml.py. Run with pip install "coremltools>=8,<9" "torch>=2.6,<2.10" "transformers>=4.40" "numpy>=1.24,<2" (quoted so the shell does not treat < and > as redirections), then python convert-clap-to-coreml.py --output clap_audio_encoder.mlpackage. Validation (cosine vs PyTorch reference) runs automatically.
Text (text_model.onnx)
Plain torch.onnx.export from the same PyTorch source β no optimum, no graph fusion, no quantization. RoBERTa exports cleanly so no per-op patches are needed. Recent torch.onnx.export writes weights to a sidecar .onnx.data file by default; the conversion script consolidates them back into a single ~500 MB .onnx so distribution is one file. Opset 17.
Companion script: convert-clap-text-to-onnx.py. Same dependencies as the audio script plus pip install onnx onnxruntime.
Validation
Audio
Cosine similarity vs the PyTorch reference, on random [1, 480000] peak-normalized inputs:
| Trial | Cosine |
|---|---|
| 1 | 0.999393 |
| 2 | 0.998725 |
| 3 | 0.998992 |
Drift is dominated by int8 weight quantization. For full fp32 weights, re-run the audio conversion with --quantize none (~3× larger file, ~1.0 cosine).
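A check of this kind can be sketched as below. This is not the validation code from the conversion script: compare_encoders is a hypothetical helper, and reference_fn / converted_fn are placeholders for the PyTorch reference and the Core ML model, each mapping a [1, 480000] array to an embedding.

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity between two embeddings of any matching shape."""
    a = np.asarray(a, dtype=np.float64).reshape(-1)
    b = np.asarray(b, dtype=np.float64).reshape(-1)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare_encoders(reference_fn, converted_fn, trials=3, n_samples=480_000, seed=0):
    """Cosine between two encoders on random peak-normalized inputs."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(trials):
        x = rng.standard_normal((1, n_samples)).astype(np.float32)
        x /= np.max(np.abs(x))  # peak-normalize to [-1, 1]
        results.append(cosine(reference_fn(x), converted_fn(x)))
    return results
```

Random noise exercises the numerics but not realistic content, so a few real clips are worth adding to any such check.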
Text
Cosine similarity vs the PyTorch reference, on five sample queries:
| Query | Cosine |
|---|---|
| "a dog barking" | 1.000000 |
| "808 kick drum" | 1.000000 |
| "lo-fi piano loop with vinyl crackle" | 1.000000 |
| "ambient pad with reverb" | 1.000000 |
| "voice saying hello" | 1.000000 |
No quantization on the text side, so the ONNX output matches PyTorch to within fp32 noise.
Performance
Apple M-series, MLComputeUnits.cpuAndGPU:
| | Latency per 10 s window |
|---|---|
| Cold start (first forward pass) | ~5 s (Core ML graph compile + GPU upload) |
| Steady state | ~30 ms |
Compared to running the original .onnx via ort on Apple Silicon CPU, that's a roughly 10× speedup for the steady state. ANE was not attempted (MLComputeUnits.all); CPUAndGPU was the sweet spot during testing, and the strictest backend often rejects whole-graph compilation for transformer audio models.
Limitations
- No logit_scale. The original CLAP model's learnable temperature isn't included here, projection heads only. For zero-shot classification you can either ignore it (cosine alone usually ranks correctly) or pull it from the original laion/larger_clap_general checkpoint.
- Fixed audio input shape. Audio shorter than 10 s must be zero-padded; longer requires the sliding-window recipe above.
- int8 audio quantization. ~99.9 % cosine is sufficient for retrieval / search use cases; if you're using these embeddings as inputs to downstream training, re-run audio conversion with --quantize none.
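To illustrate the logit_scale point: the temperature only rescales similarities before the softmax, so it sharpens the probabilities without changing which label ranks first. The sketch below uses a hypothetical zero_shot_probs helper and toy 4-d vectors; logit_scale is assumed to be the already-applied multiplier, so check whether the checkpoint stores the scale itself or its logarithm before plugging a value in.

```python
import numpy as np

def zero_shot_probs(audio_emb: np.ndarray, text_embs: np.ndarray, logit_scale: float = 1.0) -> np.ndarray:
    """Softmax over scaled audio-text similarities, one probability per label."""
    logits = logit_scale * (text_embs @ audio_emb.reshape(-1))
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy unit vectors (assumption): three "labels" in a 4-d space.
text_embs = np.eye(3, 4, dtype=np.float32)
audio_emb = np.array([0.6, 0.8, 0.0, 0.0], dtype=np.float32)

flat = zero_shot_probs(audio_emb, text_embs, logit_scale=1.0)
sharp = zero_shot_probs(audio_emb, text_embs, logit_scale=100.0)
print(int(flat.argmax()), int(sharp.argmax()))
```

If only the top label or a ranked list is needed, the raw dot products are enough; the scale matters when the probabilities themselves are consumed, for example for thresholding.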
Credits
- LAION for larger_clap_general.
- gridshiftstudio/clap-music-coreml for the first public proof that this conversion is viable + the two key patches.
Citation
If you use this model in your work, please cite the original CLAP paper (arXiv:2211.06687):
@misc{https://doi.org/10.48550/arxiv.2211.06687,
doi = {10.48550/ARXIV.2211.06687},
url = {https://arxiv.org/abs/2211.06687},
author = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
License
This artifact inherits the source model's license: Apache 2.0.