AcuLa
AcuLa (Audio–Clinical Understanding via Language Alignment) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.
This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:
GitHub: https://github.com/janine714/AcuLA
This work is described in the paper “Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”
Intended Use
AcuLa is intended for research on clinically informed medical audio representation learning.
| Use case | Description |
|---|---|
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |
AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.
Installation
Clone the GitHub repository:
```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```
Install dependencies:
```bash
pip install -r requirements.txt
```
If using OPERA-family encoders, please make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.
Training
To train AcuLa, first clone the repository:
```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```
Then run training with:
```bash
python main.py \
  --csv_path /path/to/combined_dataset.csv \
  --audio_ckpt /path/to/encoder-operaGT.ckpt \
  --output_dir ./checkpoints \
  --audio_backbone operaGT \
  --llm_type google/medgemma-4b-pt \
  --epochs 50 \
  --batch_size 24 \
  --grad_accum_steps 2 \
  --warmup_steps 400 \
  --lr 1e-5 \
  --lambda_align 1.0 \
  --lambda_mam 1.0 \
  --use_wandb
```
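In this configuration, --batch_size 24 combined with --grad_accum_steps 2 gives an effective batch size of 48 per optimizer update; --lambda_align and --lambda_mam weight the corresponding loss terms.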
Expected CSV format:
| Column | Description |
|---|---|
| audio_path | Path to the audio recording |
| Gen_Report | Clinical text report paired with the audio recording |
Example:
| audio_path | Gen_Report |
|---|---|
| /path/to/audio.wav | The recording is consistent with normal pulmonary findings... |
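As a reference only, here is a minimal sketch of how such a CSV might be assembled with pandas; the file paths and report text below are placeholders, not part of the released data:

```python
import pandas as pd

# Placeholder rows: replace with your own audio paths and paired clinical reports.
rows = [
    {
        "audio_path": "/data/lung/patient_001.wav",
        "Gen_Report": "The recording is consistent with normal pulmonary findings...",
    },
    {
        "audio_path": "/data/heart/patient_002.wav",
        "Gen_Report": "Heart sounds include a murmur of moderate intensity...",
    },
]

pd.DataFrame(rows).to_csv("combined_dataset.csv", index=False)
```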
Checkpoint Loading
The checkpoint can be loaded together with the AcuLa codebase.
```python
import torch
from audio_encoder import initialize_pretrained_model

checkpoint_path = "path/to/acula_checkpoint.pt"

# Initialize the backbone encoder and load the aligned AcuLa weights.
audio_model = initialize_pretrained_model(pretrain="operaGT")
ckpt = torch.load(checkpoint_path, map_location="cpu")

if "audio_model_state_dict" in ckpt:
    state_dict = ckpt["audio_model_state_dict"]
elif "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

audio_model.load_state_dict(state_dict, strict=False)
audio_model.eval()
```
Extract audio features:
```python
import torch

with torch.no_grad():
    features = audio_model.forward_feature(audio_input)
```
The variable audio_input should follow the preprocessing format expected by the selected audio encoder; see the Input Format section below.
Input Format
AcuLa expects medical audio recordings that are preprocessed into the format required by the selected audio encoder.
A typical preprocessing setup is:
| Step | Setting |
|---|---|
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |
During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
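For illustration, a preprocessing sketch following the settings above (16 kHz, roughly 8-second segments, 64 mel bins) could look like the following; the exact parameters and tensor shapes expected by each encoder may differ, so the repository's own preprocessing code remains the reference:

```python
import numpy as np
import librosa

def preprocess_audio(path, sr=16000, segment_seconds=8, n_mels=64):
    """Load a recording, pad/truncate to a fixed length, and compute a log-mel spectrogram."""
    waveform, _ = librosa.load(path, sr=sr, mono=True)

    # Pad or truncate to a fixed-length segment.
    target_len = sr * segment_seconds
    if len(waveform) < target_len:
        waveform = np.pad(waveform, (0, target_len - len(waveform)))
    else:
        waveform = waveform[:target_len]

    # 64-bin log-mel spectrogram.
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)
```

The resulting array would then be converted to a tensor and batched in whatever shape the selected encoder's forward_feature expects.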
Training Data
AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.
| Dataset | Modality |
|---|---|
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |
The paper reports more than 100,000 paired audio-report samples for alignment.
Downstream Evaluation
The paper evaluates AcuLa on 18 cardio-respiratory tasks.
| Task group | Example tasks | Metric |
|---|---|---|
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |
The main evaluation protocol uses frozen embeddings and lightweight supervised prediction heads, allowing performance differences to reflect representation quality.
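A minimal linear-probing sketch in this spirit, assuming clip-level embeddings have already been extracted with the frozen encoder (the arrays and embedding dimension below are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Placeholder frozen embeddings (n_clips x embedding_dim) and binary labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 2, 50)

# Lightweight supervised head on top of the frozen embeddings.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
auroc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"Linear-probe AUROC: {auroc:.3f}")
```

For the lung-function estimation tasks, the analogous probe would be a small regressor (for example, ridge regression) scored with MAE.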
Reported Findings
The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.
| Finding | Summary |
|---|---|
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains on challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks |
Please refer to the paper for full task-by-task results and experimental details.
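As an illustration of the retrieval-style usage behind the zero-shot finding above, the sketch below assumes audio and text embeddings have already been projected into the shared space by the corresponding projection heads; it simply ranks candidate reports by cosine similarity to an audio query (all tensors and dimensions are placeholders):

```python
import torch
import torch.nn.functional as F

# Placeholder shared-space embeddings: one audio query and a pool of candidate reports.
audio_emb = torch.randn(1, 512)    # projected audio embedding (dimension assumed)
text_embs = torch.randn(100, 512)  # projected report embeddings

# Cosine similarity after L2 normalization, then top-k retrieval.
sims = F.normalize(audio_emb, dim=-1) @ F.normalize(text_embs, dim=-1).T
top5 = sims.topk(k=5, dim=-1).indices
print("Top-5 candidate report indices:", top5.squeeze(0).tolist())
```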
Checkpoint Contents
Depending on the uploaded checkpoint variant, the checkpoint may contain one or more of the following components:
| Component | Description |
|---|---|
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |
Users can inspect the checkpoint keys with:
```python
import torch

ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
print(ckpt.keys())
```
Limitations
| Limitation | Description |
|---|---|
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |
Citation
Please cite the paper if you use this checkpoint:
```bibtex
@misc{wang2026languagemodelssemanticteachers,
  title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding},
  author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
  year={2026},
  eprint={2512.04847},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.04847},
}
```
Acknowledgment
This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.
