AcuLa

AcuLa (Audio–Clinical Understanding via Language Alignment) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.
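
For intuition, here is a minimal sketch of a contrastive audio-text alignment objective of the kind this framework builds on. The symmetric InfoNCE form, the temperature value, and the variable names are illustrative assumptions, not the exact loss from the paper:

import torch
import torch.nn.functional as F

def alignment_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (batch, dim) projections into a shared space.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature  # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: match each clip to its paired report and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2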

This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:

GitHub: https://github.com/janine714/AcuLA

This work is described in the paper “Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”


Intended Use

AcuLa is intended for research on clinically informed medical audio representation learning.

| Use case | Description |
| --- | --- |
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings (see the sketch below) |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |

AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.
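
As a concrete example of the linear-probing use case, the sketch below fits a logistic-regression probe on frozen embeddings with scikit-learn. The embedding dimension (768) and the random stand-in features are assumptions for illustration; in practice the features come from the feature-extraction call shown later in this card.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Stand-ins for frozen AcuLa embeddings (n_samples, dim) and binary labels.
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, size=200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 2, size=50)

probe = LogisticRegression(max_iter=1000)  # lightweight head on frozen features
probe.fit(X_train, y_train)
print("AUROC:", roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1]))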


Installation

Clone the GitHub repository:

git clone https://github.com/janine714/AcuLA
cd AcuLA

Install dependencies:

pip install -r requirements.txt

If using OPERA-family encoders, please make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.


Training

With the repository cloned and dependencies installed (see Installation above), run training from the repository root:

python main.py \
  --csv_path /path/to/combined_dataset.csv \
  --audio_ckpt /path/to/encoder-operaGT.ckpt \
  --output_dir ./checkpoints \
  --audio_backbone operaGT \
  --llm_type google/medgemma-4b-pt \
  --epochs 50 \
  --batch_size 24 \
  --grad_accum_steps 2 \
  --warmup_steps 400 \
  --lr 1e-5 \
  --lambda_align 1.0 \
  --lambda_mam 1.0 \
  --use_wandb

Expected CSV format:

| Column | Description |
| --- | --- |
| audio_path | Path to the audio recording |
| Gen_Report | Clinical text report paired with the audio recording |

Example:

| audio_path | Gen_Report |
| --- | --- |
| /path/to/audio.wav | The recording is consistent with normal pulmonary findings... |
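
A minimal way to assemble such a CSV with pandas (the path and report text here are placeholders):

import pandas as pd

rows = [
    {"audio_path": "/path/to/audio.wav",
     "Gen_Report": "The recording is consistent with normal pulmonary findings..."},
]
pd.DataFrame(rows).to_csv("/path/to/combined_dataset.csv", index=False)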

Checkpoint Loading

The checkpoint can be loaded together with the AcuLa codebase.

import torch
from audio_encoder import initialize_pretrained_model

checkpoint_path = "path/to/acula_checkpoint.pt"

# Instantiate the backbone, then overwrite its weights with the aligned checkpoint.
audio_model = initialize_pretrained_model(pretrain="operaGT")
ckpt = torch.load(checkpoint_path, map_location="cpu")

# Checkpoints may nest the encoder weights under different keys.
if "audio_model_state_dict" in ckpt:
    state_dict = ckpt["audio_model_state_dict"]
elif "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

# strict=False tolerates extra keys (e.g., projection heads) not in the encoder.
audio_model.load_state_dict(state_dict, strict=False)
audio_model.eval()

Extract audio features:

import torch

with torch.no_grad():
    features = audio_model.forward_feature(audio_input)

The variable audio_input should follow the preprocessing format expected by the selected audio encoder.
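
For illustration only, the snippet below builds a dummy batch shaped like an 8-second, 64-bin log-mel spectrogram (see Input Format). The (batch, frames, mel_bins) layout and the frame count are assumptions; adjust both to whatever your chosen backbone actually expects:

import torch

# Hypothetical input: one log-mel spectrogram of ~800 frames x 64 mel bins
# (8 s at 16 kHz with a 160-sample hop). Replace with real preprocessed audio.
audio_input = torch.randn(1, 800, 64)

with torch.no_grad():
    features = audio_model.forward_feature(audio_input)
print(features.shape)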


Input Format

AcuLa expects medical audio recordings that are preprocessed into the format required by the selected audio encoder.

A typical preprocessing setup is:

| Step | Setting |
| --- | --- |
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |

During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
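
A minimal preprocessing sketch along these lines, using torchaudio; the exact parameters (hop length, normalization) used by each encoder may differ, so treat this as a starting point rather than the official pipeline:

import torch
import torchaudio

def preprocess(path, sr=16000, seconds=8, n_mels=64):
    wav, orig_sr = torchaudio.load(path)              # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)               # mix down to mono
    if orig_sr != sr:
        wav = torchaudio.functional.resample(wav, orig_sr, sr)
    target = sr * seconds                             # pad or truncate to 8 s
    if wav.shape[1] < target:
        wav = torch.nn.functional.pad(wav, (0, target - wav.shape[1]))
    else:
        wav = wav[:, :target]
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=n_mels)(wav)
    return torch.log(mel + 1e-6)                      # log-mel spectrogram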


Training Data

AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.

| Dataset | Modality |
| --- | --- |
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |

The paper reports more than 100,000 paired audio-report samples for alignment.


Downstream Evaluation

The paper evaluates AcuLa on 18 cardio-respiratory tasks.

| Task group | Example tasks | Metric |
| --- | --- | --- |
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |

The main evaluation protocol uses frozen embeddings and lightweight supervised prediction heads, allowing performance differences to reflect representation quality.
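
For the regression tasks, a similarly lightweight head can be fit on frozen embeddings; the ridge-regression choice and the synthetic data below are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Stand-ins for frozen embeddings and a continuous target such as FEV1.
X_train, y_train = rng.normal(size=(200, 768)), rng.normal(loc=3.0, size=200)
X_test, y_test = rng.normal(size=(50, 768)), rng.normal(loc=3.0, size=50)

head = Ridge(alpha=1.0)  # lightweight regression head on frozen features
head.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, head.predict(X_test)))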


Reported Findings

The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.

| Finding | Summary |
| --- | --- |
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains on challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks (illustrated below) |

Please refer to the paper for full task-by-task results and experimental details.
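
To illustrate the retrieval-style evaluation, the sketch below scores audio embeddings against report embeddings by cosine similarity and retrieves the best-matching report per clip; the shared embedding space and the dimensions are assumptions for illustration:

import torch
import torch.nn.functional as F

# Stand-ins for projected audio and text embeddings in the shared space.
audio_emb = F.normalize(torch.randn(4, 256), dim=-1)
text_emb = F.normalize(torch.randn(4, 256), dim=-1)

similarity = audio_emb @ text_emb.T     # (n_audio, n_text) cosine similarities
best_report = similarity.argmax(dim=1)  # top-1 report index for each clip
print(best_report)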


Checkpoint Contents

Depending on the variant, the checkpoint may contain one or more of the following components:

| Component | Description |
| --- | --- |
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |

Users can inspect the checkpoint keys with:

import torch

ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
print(ckpt.keys())
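
To go one level deeper, the following sketch prints a few tensor shapes, assuming the encoder weights sit under the audio_model_state_dict key described above:

# Fall back to the raw checkpoint if the expected key is absent.
sd = ckpt.get("audio_model_state_dict", ckpt)
for name, tensor in list(sd.items())[:5]:
    print(name, tuple(tensor.shape))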

Limitations

| Limitation | Description |
| --- | --- |
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |

Citation

Please cite the paper if you use this checkpoint:

@misc{wang2026languagemodelssemanticteachers,
  title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding}, 
  author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
  year={2026},
  eprint={2512.04847},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.04847}, 
}

Acknowledgment

This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.
