larger-clap-general-coreml

Two artifacts derived from laion/larger_clap_general, kept in the same embedding space so they can be used as a pair:

  • clap_audio_encoder.mlpackage: native Core ML build of the audio encoder + projection head. Runs accelerated on the Apple GPU via MLComputeUnits.cpuAndGPU.
  • text_model.onnx: ONNX build of the text encoder + projection head. Standard ORT-compatible, cross-platform.

Both take their respective inputs and return an L2-normalized 512-d embedding in the joint CLAP space (cosine similarity == dot product).
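
Because both outputs are unit-length, a plain dot product is all that is needed to compare them. The sketch below checks that identity with NumPy on random stand-in vectors (not real model outputs); the helper names are illustrative.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random 512-d stand-ins for an audio and a text embedding.
rng = np.random.default_rng(0)
a = l2_normalize(rng.standard_normal((1, 512)))[0]
b = l2_normalize(rng.standard_normal((1, 512)))[0]

# For unit vectors the denominator of the cosine is 1, so both values agree.
print(cosine(a, b), float(np.dot(a, b)))
```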

larger_clap_general is trained on general audio, music, and speech, so the pair suits zero-shot audio classification and open-vocabulary retrieval.

Why this repo exists

  • Audio side: ort's CoreML execution provider can't accelerate HTSAT. Reflect-pad, 5-D reshapes, relative-position-bias gather, and dynamic shapes shred the graph into CPU partitions, so the EP "registers" but every node runs on CPU. Loading the .mlpackage directly via Core ML (skipping ORT entirely) runs the full graph on the Apple GPU.
  • Text side: this text_model.onnx is re-exported directly from LAION's PyTorch model with no optimum graph fusion. Xenova's larger_clap_general ONNX export of the text encoder sits in a slightly different numerical subspace than LAION's PyTorch model (graph fusions and quantization add up), so pairing Xenova text with our LAION-derived audio model collapses text-to-audio cosine to ~0.2. Re-exporting text from the same PyTorch source recovers ~0.7+ on good matches.

Inputs / Outputs

Audio (clap_audio_encoder.mlpackage)

kind    name       shape        dtype    notes
Input   audio      [1, 480000]  float32  10 s mono @ 48 kHz, peak-normalized to [-1, 1]
Output  embedding  [1, 512]     float32  L2-normalized; cosine == dot product

The mel-spectrogram extraction (STFT, Slaney mel filterbank, log) is baked into the model graph: you pass raw audio, not features.
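
A minimal sketch of preparing that raw input with NumPy. It assumes the samples are already at 48 kHz (resampling is out of scope here) and reads "peak-normalized" as dividing by the largest absolute sample. The name prepare_clip is illustrative, not part of this repo. Clips longer than 10 s are simply cropped here; see the variable-length recipe for a better approach.

```python
import numpy as np

SAMPLE_RATE = 48_000
WINDOW_SAMPLES = 10 * SAMPLE_RATE  # 480_000 samples = 10 s

def prepare_clip(samples: np.ndarray) -> np.ndarray:
    """Turn 48 kHz audio of shape (n,) or (n, channels) into a [1, 480000] float32 array."""
    x = np.asarray(samples, dtype=np.float32)
    if x.ndim == 2:
        x = x.mean(axis=1)  # downmix to mono by averaging channels
    peak = float(np.max(np.abs(x))) if x.size else 0.0
    if peak > 0.0:
        x = x / peak  # peak-normalize into [-1, 1]; silence is left untouched
    x = x[:WINDOW_SAMPLES]  # crop anything past 10 s
    out = np.zeros((1, WINDOW_SAMPLES), dtype=np.float32)
    out[0, : x.shape[0]] = x  # zero-pad the tail of short clips
    return out
```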

Text (text_model.onnx)

kind    name            shape     dtype    notes
Input   input_ids       [B, T]    int64    RoBERTa tokenizer output
Input   attention_mask  [B, T]    int64    1 for real tokens, 0 for padding
Output  text_embeds     [B, 512]  float32  L2-normalized; cosine == dot product

Both batch and sequence length are dynamic. Use the tokenizer from Xenova/larger_clap_general (or any larger_clap_general mirror with the standard RoBERTa tokenizer config); the vocab and special tokens are identical across exports.
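
The tokenizer normally builds both arrays for you (padding=True). If you assemble batches by hand, the layout looks roughly like the sketch below. pad_batch is a hypothetical helper, the token ids are made up, and the default pad id of 1 is an assumption based on the usual RoBERTa vocabulary, so check it against your tokenizer config.

```python
import numpy as np

def pad_batch(token_ids, pad_id=1):
    """Right-pad variable-length id lists into int64 input_ids and attention_mask arrays."""
    longest = max(len(ids) for ids in token_ids)
    input_ids = np.full((len(token_ids), longest), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(token_ids), longest), dtype=np.int64)
    for row, ids in enumerate(token_ids):
        input_ids[row, : len(ids)] = ids
        attention_mask[row, : len(ids)] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask

# Made-up ids for two queries of different length; real ids come from the tokenizer.
batch = [[0, 11, 22, 2], [0, 33, 2]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids.shape, attention_mask.tolist())
```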

Variable-length audio

The graph has a fixed 10 s input shape. For arbitrary-length audio, the recommended recipe is:

Duration  Strategy
≤ 10 s    Zero-pad to 480_000 samples, single forward pass.
> 10 s    Sliding 10 s windows with 50 % overlap; embed each window, mean-pool the embeddings, re-L2-normalize.

For very long files, cap the window count to bound runtime: uniformly spacing N windows across [0, T-10s] gives full-file coverage without per-window blow-up.
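
The recipe above could be sketched as follows. embed_fn is a stand-in for whatever runs the audio model on one [1, 480000] window and returns a [1, D] embedding. The function names, the max_windows parameter, and the extra final window that covers the tail of the file are assumptions of this sketch, not something shipped in the repo.

```python
import numpy as np

WINDOW = 480_000  # 10 s at 48 kHz

def window_starts(num_samples, window=WINDOW, overlap=0.5, max_windows=None):
    """Start offsets for sliding windows; uniformly spaced over the file when capped."""
    if num_samples <= window:
        return [0]
    hop = int(window * (1.0 - overlap))
    last = num_samples - window  # latest start that still yields a full window
    starts = list(range(0, last + 1, hop))
    if starts[-1] != last:
        starts.append(last)  # assumption: add one window so the file tail is covered
    if max_windows is not None and len(starts) > max_windows:
        starts = [int(s) for s in np.linspace(0, last, max_windows)]
    return starts

def embed_long(waveform, embed_fn, **kwargs):
    """Embed each window, mean-pool the embeddings, then re-L2-normalize."""
    x = np.asarray(waveform, dtype=np.float32)
    embs = []
    for start in window_starts(x.shape[0], **kwargs):
        chunk = x[start : start + WINDOW]
        padded = np.zeros((1, WINDOW), dtype=np.float32)
        padded[0, : chunk.shape[0]] = chunk  # zero-pad clips shorter than 10 s
        embs.append(np.asarray(embed_fn(padded))[0])
    pooled = np.mean(embs, axis=0)
    return pooled / max(float(np.linalg.norm(pooled)), 1e-12)
```

Calling embed_long(waveform, embed_fn, max_windows=32) would bound the cost at 32 forward passes regardless of file length.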

Usage

Swift (Core ML)

import CoreML

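// MLModel(contentsOf:) expects a compiled .mlmodelc, so compile the .mlpackage first.
// packageURL is a placeholder for wherever you stored the download.
let packageURL = URL(fileURLWithPath: "clap_audio_encoder.mlpackage")
let compiledURL = try MLModel.compileModel(at: packageURL)

let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU
let model = try MLModel(contentsOf: compiledURL, configuration: config)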

let audio = try MLMultiArray(shape: [1, 480_000], dataType: .float32)
// copy your normalized waveform into audio.dataPointer ...

let provider = try MLDictionaryFeatureProvider(dictionary: ["audio": audio])
let out = try model.prediction(from: provider)
let embedding = out.featureValue(for: "embedding")!.multiArrayValue!

Rust (objc2-core-ml)

The objc2/objc2-core-ml crates give direct Rust bindings to Core ML. Sketch:

use objc2_core_ml::{MLModel, MLModelConfiguration, MLMultiArray, MLMultiArrayDataType,
                    MLDictionaryFeatureProvider, MLFeatureValue, MLComputeUnits};

// Core ML wants a compiled .mlmodelc: compile the .mlpackage once,
// then load with cpuAndGPU compute units.
let compiled = unsafe { MLModel::compileModelAtURL_error(&mlpackage_url) }?;
let config = unsafe { MLModelConfiguration::new() };
unsafe { config.setComputeUnits(MLComputeUnits::CPUAndGPU) };
let model = unsafe { MLModel::modelWithContentsOfURL_configuration_error(&compiled, &config) }?;

// Build [1, 480000] float32 input, copy waveform via dataPointer,
// wrap in MLFeatureValue + MLDictionaryFeatureProvider, run prediction.

Python (audio via coremltools, text via onnxruntime)

import coremltools as ct
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# --- audio: Core ML ---
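audio_model = ct.models.MLModel(
    "clap_audio_encoder.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_GPU,  # same compute units as the Swift example
)
waveform = np.zeros((1, 480_000), dtype=np.float32)  # placeholder (silence); substitute real 48 kHz audio
audio_emb = audio_model.predict({"audio": waveform})["embedding"]  # (1, 512)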

# --- text: ONNX ---
tok = AutoTokenizer.from_pretrained("Xenova/larger_clap_general")
text_sess = ort.InferenceSession("text_model.onnx", providers=["CPUExecutionProvider"])
encoded = tok("a dog barking", return_tensors="np", padding=True)
text_emb = text_sess.run(["text_embeds"], {
    "input_ids": encoded["input_ids"].astype(np.int64),
    "attention_mask": encoded["attention_mask"].astype(np.int64),
})[0]  # (1, 512)

# Joint-embedding similarity:
similarity = float(np.dot(audio_emb.flatten(), text_emb.flatten()))

How it was built

Audio (clap_audio_encoder.mlpackage)

coremltools 8 + torch.export from laion/larger_clap_general's PyTorch weights, then convert_to="mlprogram" + int8 linear weight quantization. The conversion is non-trivial: ct.convert rejects the model out of the box. Patches applied:

  1. F.interpolate(mode='bicubic') → 'bilinear': CoreML's MIL backend lacks bicubic upsampling, which HTSAT's positional-embedding resize uses. Accuracy delta is negligible.
  2. torch.jit.is_tracing() → True: forces HF's CLAP code onto the static-shape path during conversion.
  3. ClapAudioLayer.set_shift_and_window_size → no-op: the dynamic window adjustment hits a "data-dependent guard" error in torch.export. For our fixed [1, 1, 1001, 64] input the __init__ values are already correct, so neutralizing it is safe.
  4. Custom STFT: torch.stft's op signature drifts across torch versions and the coremltools handler unpacks the wrong arity, so it is implemented as a strided conv1d with pre-baked cos/sin Hann bases instead.
  5. Custom fmod MIL lowering: HTSAT's relative-position arithmetic uses float modulo, for which coremltools has no built-in handler. Registered as x - trunc(x/y) * y.
  6. slice_scatter override: HTSAT's attention-mask builder generates empty-slice slice_scatter calls at deeper Swin stages (e.g. slice(0, -window_size) evaluates to slice(0, 0)). The built-in handler's shape check rejects these, so a registered override no-ops empty slices and reduces non-empty ones to slice_by_index + concat.
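
The arithmetic behind patch 5 can be checked outside Core ML. The sketch below is plain NumPy rather than the MIL registration used in the conversion script. It shows that x - trunc(x/y) * y reproduces C-style fmod, where the result takes the sign of x, and that this differs from floor-based modulo on negative inputs.

```python
import numpy as np

def fmod_lowered(x, y):
    """Float modulo expressed as x - trunc(x / y) * y (result has the sign of x)."""
    x = np.asarray(x, dtype=np.float32)
    y = np.asarray(y, dtype=np.float32)
    return x - np.trunc(x / y) * y

x = np.array([7.5, -7.5, 3.0, -0.5], dtype=np.float32)
y = np.array([2.0, 2.0, 3.0, 2.0], dtype=np.float32)

print(fmod_lowered(x, y))  # truncation-based remainder
print(np.fmod(x, y))       # NumPy's C-style fmod, for comparison
print(np.mod(x, y))        # floor-based modulo, which differs for negative x
```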

A full conversion script that applies all six patches is included in this repo: convert-clap-to-coreml.py. Install the dependencies with pip install "coremltools>=8,<9" "torch>=2.6,<2.10" "transformers>=4.40" "numpy>=1.24,<2" (the quotes stop the shell from treating < and > as redirections), then run python convert-clap-to-coreml.py --output clap_audio_encoder.mlpackage. Validation (cosine vs PyTorch reference) runs automatically.
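
To illustrate the idea behind patch 4, here is a NumPy sketch of an STFT computed as a strided correlation with fixed Hann-windowed cos/sin bases, which is what a conv1d with stride equal to the hop size does. It is not the code from convert-clap-to-coreml.py. The small n_fft and hop values, the periodic Hann window, and the lack of centering or reflect padding are all assumptions made to keep the example short.

```python
import numpy as np

def stft_bases(n_fft):
    """Hann-windowed cosine and (negated) sine bases, each of shape [n_fft // 2 + 1, n_fft]."""
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_fft)  # periodic Hann
    angle = 2.0 * np.pi * k * n / n_fft
    return np.cos(angle) * window, -np.sin(angle) * window

def stft_as_conv(signal, n_fft, hop):
    """Real and imaginary STFT parts via a strided correlation with the fixed bases."""
    cos_b, sin_b = stft_bases(n_fft)
    # Each row is one frame; taking every hop-th frame mimics the conv stride.
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    return frames @ cos_b.T, frames @ sin_b.T

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)
real, imag = stft_as_conv(signal, n_fft=64, hop=16)
print(real.shape, imag.shape)  # (frames, n_fft // 2 + 1) each
```

Because the bases are constants, the whole transform reduces to a convolution with frozen weights, which avoids any dependence on a dedicated STFT op in the converter.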

Text (text_model.onnx)

Plain torch.onnx.export from the same PyTorch source: no optimum, no graph fusion, no quantization. RoBERTa exports cleanly, so no per-op patches are needed. Recent torch.onnx.export writes weights to a sidecar .onnx.data file by default; the conversion script consolidates them back into a single ~500 MB .onnx so distribution is one file. Opset 17.

Companion script: convert-clap-text-to-onnx.py. Same dependencies as the audio script plus pip install onnx onnxruntime.

Validation

Audio

Cosine similarity vs the PyTorch reference, on random [1, 480000] peak-normalized inputs:

Trial Cosine
1 0.999393
2 0.998725
3 0.998992

Drift is dominated by int8 weight quantization. For full fp32 weights, re-run the audio conversion with --quantize none (~3× larger file, ~1.0 cosine).
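
To get a feel for why int8 weights cause only a small cosine drift, the toy sketch below quantizes a single random linear layer and compares its outputs before and after. It uses symmetric per-row scaling as an assumed scheme and synthetic weights. It does not reproduce the coremltools quantizer, and its result says nothing about the figures in the table above.

```python
import numpy as np

def quantize_int8_per_channel(w):
    """Symmetric linear int8 quantization with one scale per output row."""
    scale = np.max(np.abs(w), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid dividing by zero on all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic layer and input; shapes are arbitrary.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 768)).astype(np.float32)
x = rng.standard_normal(768).astype(np.float32)

q, scale = quantize_int8_per_channel(w)
drift_cosine = cosine(w @ x, dequantize(q, scale) @ x)
print(drift_cosine)
```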

Text

Cosine similarity vs the PyTorch reference, on five sample queries:

Query Cosine
"a dog barking" 1.000000
"808 kick drum" 1.000000
"lo-fi piano loop with vinyl crackle" 1.000000
"ambient pad with reverb" 1.000000
"voice saying hello" 1.000000

With no quantization on the text side, the ONNX output matches PyTorch to within fp32 rounding noise.

Performance

Apple M-series, MLComputeUnits.cpuAndGPU:

Phase                            Latency per 10 s window
Cold start (first forward pass)  ~5 s (Core ML graph compile + GPU upload)
Steady state                     ~30 ms

Compared to running the original .onnx via ort on Apple Silicon CPU, that's roughly a 10× speedup in steady state. Running on the ANE (MLComputeUnits.all) was not attempted: cpuAndGPU was the sweet spot during testing, and the ANE, being the strictest backend, often rejects whole-graph compilation for transformer audio models.
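
If you want to reproduce the cold-start versus steady-state split on your own machine, a timing harness along these lines is enough. measure_latency is a hypothetical helper; run_once would be a zero-argument callable that wraps one forward pass, for example a lambda around the predict call from the Python usage example.

```python
import time

def measure_latency(run_once, warm_runs=10):
    """Return (cold_seconds, steady_seconds) for a zero-argument callable.

    The first call is timed separately because it includes one-off setup cost;
    the steady-state figure is the median of the following warm_runs calls.
    """
    start = time.perf_counter()
    run_once()
    cold = time.perf_counter() - start

    timings = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return cold, timings[len(timings) // 2]
```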

Limitations

  • No logit_scale. The original CLAP model's learnable temperature isn't included here, only the projection heads. For zero-shot classification you can either ignore it (cosine alone usually ranks correctly) or pull it from the original laion/larger_clap_general checkpoint.
  • Fixed audio input shape. Audio shorter than 10 s must be zero-padded; longer requires the sliding-window recipe above.
  • int8 audio quantization. ~99.9 % cosine is sufficient for retrieval / search use cases; if you're using these embeddings as inputs to downstream training, re-run audio conversion with --quantize none.
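
The first limitation matters less than it may seem: a positive temperature rescales the similarities before the softmax, which changes how peaked the probabilities are but not their ranking. A minimal sketch with toy 2-d vectors follows. zero_shot_probs is an illustrative helper, and the scale of 20.0 is an arbitrary placeholder, not the checkpoint's value.

```python
import numpy as np

def zero_shot_probs(audio_emb, text_embs, logit_scale=1.0):
    """Softmax over scaled audio-text similarities; inputs are assumed L2-normalized."""
    sims = np.asarray(text_embs) @ np.asarray(audio_emb).reshape(-1)
    logits = logit_scale * sims
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy unit vectors standing in for one audio embedding and three label embeddings.
audio = np.array([1.0, 0.0])
labels = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

plain = zero_shot_probs(audio, labels)                     # cosine only
scaled = zero_shot_probs(audio, labels, logit_scale=20.0)  # placeholder temperature
print(plain, scaled)
```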

Credits

Citation

If you use this model in your work, please cite the original CLAP paper (arXiv:2211.06687):

@misc{https://doi.org/10.48550/arxiv.2211.06687,
  doi = {10.48550/ARXIV.2211.06687},
  url = {https://arxiv.org/abs/2211.06687},
  author = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

License

This artifact inherits the source model's license: Apache 2.0.
