AcuLa
AcuLa (Audio–Clinical Understanding via Language Alignment) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.
This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:
GitHub: https://github.com/janine714/AcuLA
This work is described in the paper “Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”
Intended Use
AcuLa is intended for research on clinically informed medical audio representation learning.
| Use case | Description |
|---|---|
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |
AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.
Installation
Clone the GitHub repository:
```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```
Install dependencies:
```bash
pip install -r requirements.txt
```
If using OPERA-family encoders, please make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.
Training
To train AcuLa, first clone the repository:
```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```
Then run training with:
```bash
python main.py \
  --csv_path /path/to/combined_dataset.csv \
  --audio_ckpt /path/to/encoder-operaGT.ckpt \
  --output_dir ./checkpoints \
  --audio_backbone operaGT \
  --llm_type google/medgemma-4b-pt \
  --epochs 50 \
  --batch_size 24 \
  --grad_accum_steps 2 \
  --warmup_steps 400 \
  --lr 1e-5 \
  --lambda_align 1.0 \
  --lambda_mam 1.0 \
  --use_wandb
```
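In this configuration, --batch_size 24 combined with --grad_accum_steps 2 gives an effective batch size of 48 per optimizer update; --lambda_align and --lambda_mam weight the corresponding loss terms.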
Expected CSV format:
| Column | Description |
|---|---|
| audio_path | Path to the audio recording |
| Gen_Report | Clinical text report paired with the audio recording |
Example:
| audio_path | Gen_Report |
|---|---|
| /path/to/audio.wav | The recording is consistent with normal pulmonary findings... |
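As a reference only, here is a minimal sketch of how such a CSV might be assembled with pandas; the file paths and report text below are placeholders, not part of the released data:

```python
import pandas as pd

# Placeholder rows: replace with your own audio paths and paired clinical reports.
rows = [
    {
        "audio_path": "/data/lung/patient_001.wav",
        "Gen_Report": "The recording is consistent with normal pulmonary findings...",
    },
    {
        "audio_path": "/data/heart/patient_002.wav",
        "Gen_Report": "Heart sounds include a murmur of moderate intensity...",
    },
]

pd.DataFrame(rows).to_csv("combined_dataset.csv", index=False)
```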
Checkpoint Loading
The checkpoint can be loaded together with the AcuLa codebase.
```python
import torch
from audio_encoder import initialize_pretrained_model

checkpoint_path = "path/to/acula_checkpoint.pt"

# Initialize the backbone encoder and load the aligned AcuLa weights.
audio_model = initialize_pretrained_model(pretrain="operaGT")
ckpt = torch.load(checkpoint_path, map_location="cpu")

if "audio_model_state_dict" in ckpt:
    state_dict = ckpt["audio_model_state_dict"]
elif "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

audio_model.load_state_dict(state_dict, strict=False)
audio_model.eval()
```
Extract audio features:
```python
import torch

with torch.no_grad():
    features = audio_model.forward_feature(audio_input)
```
The variable audio_input should follow the preprocessing format expected by the selected audio encoder; see the Input Format section below.
Input Format
AcuLa expects medical audio recordings that are preprocessed into the format required by the selected audio encoder.
A typical preprocessing setup is:
| Step | Setting |
|---|---|
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |
During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
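For illustration, a preprocessing sketch following the settings above (16 kHz, roughly 8-second segments, 64 mel bins) could look like the following; the exact parameters and tensor shapes expected by each encoder may differ, so the repository's own preprocessing code remains the reference:

```python
import numpy as np
import librosa

def preprocess_audio(path, sr=16000, segment_seconds=8, n_mels=64):
    """Load a recording, pad/truncate to a fixed length, and compute a log-mel spectrogram."""
    waveform, _ = librosa.load(path, sr=sr, mono=True)

    # Pad or truncate to a fixed-length segment.
    target_len = sr * segment_seconds
    if len(waveform) < target_len:
        waveform = np.pad(waveform, (0, target_len - len(waveform)))
    else:
        waveform = waveform[:target_len]

    # 64-bin log-mel spectrogram.
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)
```

The resulting array would then be converted to a tensor and batched in whatever shape the selected encoder's forward_feature expects.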
Training Data
AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.
| Dataset | Modality |
|---|---|
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |
The paper reports more than 100,000 paired audio-report samples for alignment.
Downstream Evaluation
The paper evaluates AcuLa on 18 cardio-respiratory tasks.
| Task group | Example tasks | Metric |
|---|---|---|
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |
The main evaluation protocol uses frozen embeddings and lightweight supervised prediction heads, allowing performance differences to reflect representation quality.
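A minimal linear-probing sketch in this spirit, assuming clip-level embeddings have already been extracted with the frozen encoder (the arrays and embedding dimension below are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Placeholder frozen embeddings (n_clips x embedding_dim) and binary labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 2, 50)

# Lightweight supervised head on top of the frozen embeddings.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
auroc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"Linear-probe AUROC: {auroc:.3f}")
```

For the lung-function estimation tasks, the analogous probe would be a small regressor (for example, ridge regression) scored with MAE.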
Reported Findings
The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.
| Finding | Summary |
|---|---|
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains on challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks |
Please refer to the paper for full task-by-task results and experimental details.
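As an illustration of the retrieval-style usage behind the zero-shot finding above, the sketch below assumes audio and text embeddings have already been projected into the shared space by the corresponding projection heads; it simply ranks candidate reports by cosine similarity to an audio query (all tensors and dimensions are placeholders):

```python
import torch
import torch.nn.functional as F

# Placeholder shared-space embeddings: one audio query and a pool of candidate reports.
audio_emb = torch.randn(1, 512)    # projected audio embedding (dimension assumed)
text_embs = torch.randn(100, 512)  # projected report embeddings

# Cosine similarity after L2 normalization, then top-k retrieval.
sims = F.normalize(audio_emb, dim=-1) @ F.normalize(text_embs, dim=-1).T
top5 = sims.topk(k=5, dim=-1).indices
print("Top-5 candidate report indices:", top5.squeeze(0).tolist())
```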
Checkpoint Contents
Depending on the uploaded checkpoint variant, the checkpoint may contain one or more of the following components:
| Component | Description |
|---|---|
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |
Users can inspect the checkpoint keys with:
```python
import torch

ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
print(ckpt.keys())
```
Limitations
| Limitation | Description |
|---|---|
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |
Citation
Please cite the paper if you use this checkpoint:
```bibtex
@misc{wang2026languagemodelssemanticteachers,
  title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding},
  author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
  year={2026},
  eprint={2512.04847},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.04847},
}
```
Acknowledgment
This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.
