SoundSense AI - EEND-OLA Lite Diarizer
This is a custom diarization model trained from scratch for SoundSense AI.
Base model / source:
- Architecture inspired by EEND-OLA (End-to-End Neural Diarization with Online Attractors),
https://arxiv.org/abs/2006.02616
- Custom implementation: 4-block Transformer encoder + dual LSTM heads (OD + PSE)
Training data:
- Simulated diarization mixtures generated from LibriSpeech train-clean-360
(921 unique speakers, 3000 training samples)
Use:
- Part of SoundSense AI hackathon submission (Stage 3: Diarization).
- Determines who is speaking in each 32 ms audio frame using Power Set Encoding
(PSE): 8 classes, one for each of the 2^3 combinations of up to 3 speakers,
including silence and overlapping speech.
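As an illustration of Power Set Encoding, here is a minimal sketch of how per-frame class ids could map to speaker sets and timed segments. The bit layout (bit i set means speaker i is active) and the helper names `pse_encode`, `pse_decode`, and `classes_to_segments` are assumptions made for this example; the model's actual class-to-speaker mapping is not documented in this card. Only the 3-speaker limit and the 32 ms frame length come from the card.

```python
from itertools import groupby

MAX_SPEAKERS = 3      # from the model card: up to 3 simultaneous speakers
FRAME_SEC = 0.032     # from the model card: 32 ms per frame


def pse_encode(active_speakers):
    """Map a set of active speaker indices (0..2) to a class id in 0..7.

    Assumed layout: bit i of the class id is 1 when speaker i is active.
    """
    class_id = 0
    for spk in active_speakers:
        if not 0 <= spk < MAX_SPEAKERS:
            raise ValueError(f"speaker index out of range: {spk}")
        class_id |= 1 << spk
    return class_id


def pse_decode(class_id):
    """Inverse of pse_encode: class id in 0..7 -> set of active speakers."""
    if not 0 <= class_id < 2 ** MAX_SPEAKERS:
        raise ValueError(f"class id out of range: {class_id}")
    return {spk for spk in range(MAX_SPEAKERS) if (class_id >> spk) & 1}


def classes_to_segments(frame_classes):
    """Turn per-frame class ids into (speaker, start_sec, end_sec) tuples."""
    segments = []
    for spk in range(MAX_SPEAKERS):
        frame = 0
        # Group consecutive frames by whether this speaker is active.
        for is_active, run in groupby((c >> spk) & 1 for c in frame_classes):
            length = len(list(run))
            if is_active:
                start = round(frame * FRAME_SEC, 3)
                end = round((frame + length) * FRAME_SEC, 3)
                segments.append((spk, start, end))
            frame += length
    return segments


if __name__ == "__main__":
    # Speaker 0 alone, then speakers 0 and 1 overlapping, then speaker 1, then silence.
    print(classes_to_segments([1, 1, 3, 2, 0]))
```

Under this assumed layout, class 0 is silence and class 7 is all three speakers talking at once, which accounts for the 8 classes.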
Limitations:
- Built for prototype/demo use.
- Trained on simulated mixtures, not on real conversational or overlapping speech
(e.g. CALLHOME, AMI). Real-world diarization error rate (DER) has not yet been benchmarked.
- PSE accuracy: 46.9% (vs. 12.5% uniform random chance over 8 classes); OD accuracy: 77.5% (vs. 50% random chance).
- Performance should be verified on the target environment before deployment.
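The chance levels quoted above follow from uniform guessing over the number of classes: 1/8 for the 8 PSE classes, and 1/2 for a two-class output (the 50% figure suggests the OD head is binary). A minimal sketch of that arithmetic and of a frame-level accuracy check is below. The function names are hypothetical and this is not the evaluation code behind the reported numbers.

```python
def chance_accuracy(num_classes):
    """Expected accuracy of uniform random guessing over num_classes classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 1.0 / num_classes


def frame_accuracy(predicted, reference):
    """Fraction of frames whose predicted class id equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one frame is required")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


if __name__ == "__main__":
    print(chance_accuracy(8))   # PSE: 8 classes
    print(chance_accuracy(2))   # two-class output
    print(frame_accuracy([1, 1, 3, 2], [1, 0, 3, 2]))
```

Uniform chance assumes all classes are equally likely. If the labels are imbalanced, for example mostly single-speaker frames, always predicting the most frequent class can score higher than this baseline, so it is worth checking both when verifying performance on a target environment.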