DeepFormants - ONNX
ONNX exports of the MLSpeech/DeepFormants formant
estimation and tracking models, with fp32, fp16, and int8 dynamic-quantization variants.
The repository ports all four usable upstream artifacts:
- The PyTorch MLP estimator (LPC_NN_scaledLoss.pt).
- The PyTorch LSTM tracker (LPC_RNN.pt).
- The original Torch7 paper estimator (estimation_model.dat), ported to PyTorch via torchfile and verified against a float64 numpy reconstruction of the Torch7 forward pass (maximum drift 0.003 Hz).
- The original Torch7 paper tracker (tracking_model.dat), ported the same way and checked against a float64 numpy FastLSTM forward pass.
Status: private redistribution only. The DeepFormants source is MIT-licensed; the upstream weights are hosted on external Google Drive and the authors have not explicitly stated terms for redistribution of derived artifacts. Treat these files as local-use only unless you have confirmed otherwise.
Models
| Directory | Source | Architecture | Input shape | Output shape |
|---|---|---|---|---|
| lpc_estimator/ | pytorchFormants/Estimator/LPC_NN_scaledLoss.pt | MLP 350 → 1024 → 512 → 256 → 4 (sigmoid hidden, linear out) | (B, 350) float32 | (B, 4) float32 |
| lpc_tracker/ | pytorchFormants/Tracker/LPC_RNN.pt | LSTM(350, 512) → LSTM(512, 256) → Linear(256, 4) | (B, T, 350) float32 | (B, T, 4) float32 |
| lpc_estimator_torch7/ | estimation_model.dat (Torch7 nn.Sequential, ported via torchfile) | Same MLP arch as lpc_estimator/; different weights (original paper model) | (B, 350) float32 | (B, 4) float32 |
| lpc_tracker_torch7/ | tracking_model.dat (Torch7 nn.Sequencer + nn.FastLSTM, ported via torchfile) | Same LSTM arch as lpc_tracker/; different weights (original paper model) | (B, T, 350) float32 | (B, T, 4) float32 |
Output convention. Raw model output ≈ formant_Hz / 1000. Multiply by 1000 to obtain
Hz, exactly as load_pytorch_lpc_estimator.py and load_estimation_model.lua do upstream.
Input features. All models consume the same 350-dimensional per-frame feature vector
produced by extract_features.py in the upstream repo: a periodogram-derived cepstrum (50
coefficients) concatenated with LPC coefficients for orders 8..17 (10 × 30 = 300). The frame
hop is 10 ms and the frame length is 30 ms, at 16 kHz. The estimators predict formants for a
single frame; the trackers take a sequence of frames and emit a per-frame trajectory.
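As a rough illustration of the numbers above, the sketch below works out the frame geometry and one possible layout of the 350-dimensional vector. The slice order (cepstrum first, then the LPC blocks in ascending order) and the frame-count formula are assumptions made for illustration; extract_features.py remains the authority on both.

```python
SAMPLE_RATE = 16_000      # Hz, per the text above
FRAME_LEN_S = 0.030       # 30 ms frame length
FRAME_HOP_S = 0.010       # 10 ms frame hop

frame_len = round(SAMPLE_RATE * FRAME_LEN_S)   # samples per frame
frame_hop = round(SAMPLE_RATE * FRAME_HOP_S)   # samples between frame starts

N_CEPSTRUM = 50
LPC_ORDERS = range(8, 18)   # orders 8..17 inclusive
LPC_BLOCK = 30              # slots reserved per LPC order
FEATURE_DIM = N_CEPSTRUM + len(LPC_ORDERS) * LPC_BLOCK


def n_frames(n_samples: int) -> int:
    """Count full frames, assuming no padding at the edges (an assumption)."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // frame_hop


def feature_slices() -> dict:
    """Hypothetical index layout of the feature vector (order is assumed)."""
    slices = {"cepstrum": slice(0, N_CEPSTRUM)}
    start = N_CEPSTRUM
    for order in LPC_ORDERS:
        slices[f"lpc_order_{order}"] = slice(start, start + LPC_BLOCK)
        start += LPC_BLOCK
    return slices
```

Under these assumptions, one second of 16 kHz audio yields n_frames(16000) frames, each mapped to one row of the (B, 350) estimator input or one time step of the (B, T, 350) tracker input.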
Variants
Each model directory contains three ONNX files:
| File | Variant | Method |
|---|---|---|
| model.onnx | fp32 | torch.onnx.export (opset 17, legacy TorchScript exporter) |
| model_fp16.onnx | fp16 | onnxconverter_common.float16.convert_float_to_float16 (also casts I/O) |
| model_int8.onnx | int8 dynamic | onnxruntime.quantization.quantize_dynamic (QuantType.QInt8) |
int8 dynamic quantizes MatMul weights only; for these MLP/LSTM models that covers the entire parameter mass. There are no Conv ops, so no static quantization with calibration data is required.
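The fp16 and int8 files can be derived from the fp32 export along the following lines. This is a sketch built on the two library calls named in the table, not a copy of the repository's conversion scripts: the helper names variant_paths and make_variants are hypothetical, and the imports are deferred into the function so the block can be loaded without the packages installed.

```python
from pathlib import Path


def variant_paths(model_dir: str) -> dict:
    """File names used for the three variants inside one model directory."""
    d = Path(model_dir)
    return {
        "fp32": d / "model.onnx",
        "fp16": d / "model_fp16.onnx",
        "int8": d / "model_int8.onnx",
    }


def make_variants(model_dir: str) -> dict:
    """Write the fp16 and int8 variants next to an existing fp32 model.onnx."""
    # Deferred imports: only needed when the conversion is actually run.
    import onnx
    from onnxconverter_common import float16
    from onnxruntime.quantization import QuantType, quantize_dynamic

    paths = variant_paths(model_dir)

    # fp16: keep_io_types=False also casts the graph inputs and outputs.
    fp32_model = onnx.load(str(paths["fp32"]))
    fp16_model = float16.convert_float_to_float16(fp32_model, keep_io_types=False)
    onnx.save(fp16_model, str(paths["fp16"]))

    # int8 dynamic: weights are quantized offline, activations at run time.
    quantize_dynamic(str(paths["fp32"]), str(paths["int8"]),
                     weight_type=QuantType.QInt8)
    return paths
```

Calling make_variants("lpc_estimator") would then be expected to leave all three files in that directory, assuming model.onnx already exists there.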
Size and parity (N = 20 random inputs, vs PyTorch reference)
| Model | Variant | Size | max abs diff | max rel diff | mean abs diff |
|---|---|---|---|---|---|
| lpc_estimator | fp32 | 4.07 MB | 9.5 × 10⁻⁷ | 8.0 × 10⁻⁷ | 1.9 × 10⁻⁷ |
| lpc_estimator | fp16 | 2.03 MB | 1.4 × 10⁻³ | 5.4 × 10⁻⁴ | 3.2 × 10⁻⁴ |
| lpc_estimator | int8 | 1.03 MB | 1.4 × 10⁻² | 1.5 × 10⁻² | 2.9 × 10⁻³ |
| lpc_tracker | fp32 | 10.24 MB | 1.2 × 10⁻⁶ | 4.9 × 10⁻⁶ | 1.4 × 10⁻⁷ |
| lpc_tracker | fp16 | 5.12 MB | 2.1 × 10⁻³ | 9.0 × 10⁻¹ | 3.5 × 10⁻⁴ |
| lpc_tracker | int8 | 2.58 MB | 5.0 × 10⁻² | 1.8 × 10⁻¹ | 5.8 × 10⁻³ |
| lpc_estimator_torch7 | fp32 | 4.07 MB | 1.4 × 10⁻⁶ | 2.6 × 10⁻⁵ | 1.6 × 10⁻⁷ |
| lpc_estimator_torch7 | fp16 | 2.03 MB | 2.0 × 10⁻³ | 5.0 × 10⁻² | 2.5 × 10⁻⁴ |
| lpc_estimator_torch7 | int8 | 1.03 MB | 4.2 × 10⁻² | 4.8 × 10⁰ | 5.3 × 10⁻³ |
| lpc_tracker_torch7 | fp32 | 10.24 MB | 1.1 × 10⁻⁶ | 9.5 × 10⁻⁴ | 5.4 × 10⁻⁸ |
| lpc_tracker_torch7 | fp16 | 5.12 MB | 1.8 × 10⁻³ | 1.2 × 10⁰ | 1.1 × 10⁻⁴ |
| lpc_tracker_torch7 | int8 | 2.58 MB | 1.2 × 10⁻¹ | 7.3 × 10¹ | 5.1 × 10⁻³ |
Diffs are on the raw model output (≈ Hz / 1000). To read as Hz-scale absolute error on the predicted formants, multiply by 1000.
The tracker's fp16 max rel diff of 0.9 is inflated by outputs that pass near zero on
synthetic random input; the mean absolute drift is ≈ 0.3 Hz, well within phonetic tolerance.
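To make the three columns concrete, here is a minimal sketch of how such metrics can be computed and why the relative figure inflates near zero. The exact definitions used by validate_parity.py are not shown in this card, so the relative-difference formula (absolute difference divided by the reference magnitude plus a small eps) is an assumption, and the numbers are toy values rather than model outputs.

```python
def parity_metrics(reference, candidate, eps=1e-12):
    """Max abs, max rel and mean abs difference between two flat sequences."""
    abs_diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    # Assumed definition: relative to the reference magnitude.
    rel_diffs = [d / (abs(r) + eps) for d, r in zip(abs_diffs, reference)]
    return {
        "max_abs": max(abs_diffs),
        "max_rel": max(rel_diffs),
        "mean_abs": sum(abs_diffs) / len(abs_diffs),
    }


# Toy raw outputs in units of Hz / 1000; the last reference value is near zero.
reference = [0.500, 1.500, 2.500, 0.002]
candidate = [0.501, 1.501, 2.501, 0.003]   # every element is off by 0.001

m = parity_metrics(reference, candidate)
max_abs_hz = m["max_abs"] * 1000.0     # absolute error on the Hz scale
mean_abs_hz = m["mean_abs"] * 1000.0
```

Every element here is off by the same 1 Hz, yet the near-zero entry alone pushes the maximum relative difference to about 0.5, which is the effect described for the tracker's fp16 row.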
Choosing a variant:
- fp32: reference accuracy, ~10 ms inference for the tracker on CPU.
- fp16: half the size, sub-Hz mean drift and roughly 2 Hz worst-case drift in the parity table. Strongly recommended for production.
- int8: quarter the size, about 3 to 6 Hz mean drift and 14 to 120 Hz worst-case drift in the parity table. Acceptable for indicative analyses; verify on your own data before using for fine-grained phonetic measurement.
Usage
import numpy as np
import onnxruntime as ort
# Single-frame estimator
sess = ort.InferenceSession("lpc_estimator/model.onnx",
providers=["CPUExecutionProvider"])
x = features.astype(np.float32) # (B, 350) from extract_features.py
y = sess.run(None, {"input": x})[0] # (B, 4)
F1, F2, F3, F4 = (y * 1000.0).T # Hz
# Sequence tracker
sess_t = ort.InferenceSession("lpc_tracker/model.onnx",
providers=["CPUExecutionProvider"])
x_seq = features_seq.astype(np.float32) # (B, T, 350)
y_seq = sess_t.run(None, {"input": x_seq})[0] # (B, T, 4)
formant_tracks_hz = y_seq * 1000.0
# Torch7-origin paper estimator (drop-in replacement for lpc_estimator)
sess_t7 = ort.InferenceSession("lpc_estimator_torch7/model.onnx",
providers=["CPUExecutionProvider"])
y_t7 = sess_t7.run(None, {"input": x})[0]
For the fp16 variants, cast the input as well: x.astype(np.float16). The int8 variants
accept float32 input.
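One way to avoid hard-coding the cast is to read the expected input type from the session itself. The sketch below assumes ONNX Runtime reports the input type as the string tensor(float) or tensor(float16); the helper names are hypothetical, and the onnxruntime and numpy imports are deferred into run_model.

```python
def input_dtype_name(onnx_type: str) -> str:
    """Map an ONNX Runtime input type string to a numpy dtype name."""
    mapping = {"tensor(float)": "float32", "tensor(float16)": "float16"}
    try:
        return mapping[onnx_type]
    except KeyError:
        raise ValueError(f"unexpected input type: {onnx_type}") from None


def run_model(model_path: str, features):
    """Run any variant and return formants in Hz as float32."""
    # Deferred imports: only needed when a model is actually run.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    # Cast to whatever the graph expects (float16 for the fp16 variants).
    x = np.asarray(features).astype(input_dtype_name(inp.type))
    y = sess.run(None, {inp.name: x})[0]
    return y.astype(np.float32) * 1000.0
```

With this helper the same call, for example run_model("lpc_tracker/model_fp16.onnx", features_seq), would work unchanged across the fp32, fp16 and int8 files.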
To prepare the input features from a WAV file, use the upstream extract_features.py. The
script targets Python 2.7 / SciPy 1.x; on Python 3.11 / SciPy ≥ 1.13 two one-line patches are
required (scipy.signal.hamming → scipy.signal.windows.hamming, np.fromstring →
np.frombuffer).
Torch7-origin estimator (lpc_estimator_torch7/)
The original estimation_model.dat accompanying the paper is a Torch7 binary, not a PyTorch
checkpoint. The port pipeline:
1. Read the Torch7 module tree using the pure-Python torchfile library; no Lua/Torch7 runtime is required. The dump (see lpc_estimator_torch7/PORT_NOTES.md) confirms:

       nn.Sequential
         nn.Linear(350, 1024) → nn.Sigmoid
         nn.Linear(1024, 512) → nn.Sigmoid
         nn.Linear(512, 256) → nn.Sigmoid
         nn.Linear(256, 4)

   Identical layer shapes to LPC_NN_scaledLoss.pt; only the trained weights differ. Tensors are stored as DoubleTensor (float64).
2. Reconstruct in PyTorch: weight (out, in) and bias (out,) copy verbatim. Torch7 and PyTorch agree on nn.Linear storage layout, so no transpose is needed. Cast float64 → float32.
3. Verify by feeding real LPC features from data/Example.wav through both (a) the ported PyTorch float32 model and (b) a float64 numpy reconstruction of the Torch7 forward pass using the originally-stored weights. Maximum drift: 0.003 Hz on the Hz scale. The port is numerically equivalent to running the upstream .dat under Lua.
The PyTorch state dict produced by the port is included as
lpc_estimator_torch7/reconstructed.pt (4.07 MB).
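The forward pass used in the verification above is small enough to write out. The sketch below uses plain Python lists and toy weights (not the shipped ones) to show the (out, in) weight layout applied without a transpose, with sigmoid on the hidden layers and a linear output; all function names are illustrative.

```python
import math


def linear(weight, bias, x):
    """y = W x + b with W stored as (out, in): one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-a)) for a in values]


def mlp_forward(layers, x):
    """layers is a list of (weight, bias); sigmoid after all but the last."""
    for index, (weight, bias) in enumerate(layers):
        x = linear(weight, bias, x)
        if index < len(layers) - 1:
            x = sigmoid(x)
    return x


# Toy 2 -> 2 -> 1 network standing in for 350 -> 1024 -> 512 -> 256 -> 4.
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # hidden: weight is (out=2, in=2)
    ([[2.0, -2.0]], [0.5]),                   # output: weight is (out=1, in=2)
]
raw = mlp_forward(toy_layers, [0.0, 0.0])     # both hidden units equal sigmoid(0)
hz = [v * 1000.0 for v in raw]                # same x1000 output convention
```

Python floats are float64, so this mirrors the reference side of the comparison; the ported model runs the same arithmetic in float32.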
Torch7-origin tracker (lpc_tracker_torch7/)
The original Torch7 binary tracking_model.dat (3.86 GB on disk, ≈10 MB of unique
parameters; the bulk is BPTT-unrolled state buffers) ports cleanly into PyTorch
nn.LSTM:
nn.Sequential
├── nn.Sequencer( nn.FastLSTM )   hidden = 512   i2g (2048, 350)   o2g (2048, 512)
├── nn.Sequencer( nn.FastLSTM )   hidden = 256   i2g (1024, 512)   o2g (1024, 256)
└── nn.Sequencer( nn.Recursor( nn.Linear(256, 4) ) )
Architecturally identical to lpc_tracker/ (the PyTorch retrain); only the trained
weights differ.
Gate ordering. Torch7 nn.FastLSTM packs the four LSTM gates as [i, g, f, o]
(determined by the order of activations in the inner nn.ParallelTable:
Sigmoid / Tanh / Sigmoid / Sigmoid, where the Tanh marks the cell-candidate g). PyTorch's
nn.LSTM uses [i, f, g, o]. The port applies a block-wise permutation [0, 2, 1, 3]
to the 4 H-sized blocks of weight_ih, weight_hh, and bias_ih.
Bias convention. FastLSTM has bias on the input-to-gates linear only (the recurrent
linear is LinearNoBias). PyTorch's nn.LSTM exposes bias_ih + bias_hh and sums
them at forward time. The port places the (permuted) Torch7 i2g.bias into bias_ih_l0
and zeros bias_hh_l0.
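The block-wise permutation and the bias placement can be sketched with labelled rows. This is an illustration using plain lists and a toy hidden size, not the code in port_torch7_tracker.py; permute_gate_blocks is a hypothetical helper name.

```python
def permute_gate_blocks(rows, hidden_size, order=(0, 2, 1, 3)):
    """Reorder the four hidden_size-row gate blocks of a stacked parameter."""
    if len(rows) != 4 * hidden_size:
        raise ValueError("expected 4 * hidden_size rows")
    blocks = [rows[k * hidden_size:(k + 1) * hidden_size] for k in range(4)]
    return [row for k in order for row in blocks[k]]


H = 2  # toy hidden size; the real layers use 512 and 256

# Rows labelled by gate, in the Torch7 FastLSTM order [i, g, f, o].
torch7_i2g_bias = ["i0", "i1", "g0", "g1", "f0", "f1", "o0", "o1"]

# PyTorch order [i, f, g, o]: swap the second and third blocks.
bias_ih_l0 = permute_gate_blocks(torch7_i2g_bias, H)

# FastLSTM has no recurrent bias, so the PyTorch counterpart is all zeros.
bias_hh_l0 = [0.0] * (4 * H)
```

The same function applies to the rows of the input and recurrent weight matrices, since each row belongs to exactly one gate. Applying it twice restores the original order because [0, 2, 1, 3] is its own inverse.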
Bit-fidelity. A pure-numpy float64 FastLSTM forward (using the original
non-permuted weights) was compared to PyTorch nn.LSTM (using the permuted float32
weights) on random (1, 20, 350) input. Maximum drift: 7 × 10⁻⁸ on raw output ≈
0.0001 Hz on the Hz scale. To within float32 rounding, the port is equivalent to
running the original .dat under Torch7.
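The idea behind this check can be reproduced at toy scale: run the same LSTM recurrence once with weights in the [i, g, f, o] layout and once with permuted weights in the [i, f, g, o] layout, then compare hidden states. The sketch below is pure Python with random toy weights and both sides in float64, so it demonstrates only that the permutation preserves the computation; it does not reproduce the float32-versus-float64 drift figure quoted above.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(rows, x):
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def lstm_step(w_ih, w_hh, bias, x, h, c, gate_order):
    """One LSTM step; gate_order names the stacked blocks, e.g. 'igfo'."""
    hidden = len(h)
    pre = [a + b + d for a, b, d in zip(matvec(w_ih, x), matvec(w_hh, h), bias)]
    gate = {name: pre[k * hidden:(k + 1) * hidden]
            for k, name in enumerate(gate_order)}
    i = [sigmoid(a) for a in gate["i"]]
    f = [sigmoid(a) for a in gate["f"]]
    o = [sigmoid(a) for a in gate["o"]]
    g = [math.tanh(a) for a in gate["g"]]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def permute_blocks(rows, hidden, order=(0, 2, 1, 3)):
    blocks = [rows[k * hidden:(k + 1) * hidden] for k in range(4)]
    return [row for k in order for row in blocks[k]]


rng = random.Random(0)
D, H, T = 3, 2, 5   # toy sizes standing in for 350, 512 and 20
w_ih = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(4 * H)]
w_hh = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(4 * H)]
bias = [rng.uniform(-1, 1) for _ in range(4 * H)]
xs = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]

h_a, c_a = [0.0] * H, [0.0] * H   # original layout [i, g, f, o]
h_b, c_b = [0.0] * H, [0.0] * H   # permuted layout [i, f, g, o]
max_drift = 0.0
for x in xs:
    h_a, c_a = lstm_step(w_ih, w_hh, bias, x, h_a, c_a, "igfo")
    h_b, c_b = lstm_step(permute_blocks(w_ih, H), permute_blocks(w_hh, H),
                         permute_blocks(bias, H), x, h_b, c_b, "ifgo")
    max_drift = max(max_drift, max(abs(p - q) for p, q in zip(h_a, h_b)))
```

A non-negligible max_drift in a check of this kind would point to a wrong gate assignment rather than to rounding, which is why the comparison is a useful guard for the permutation.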
Cross-check vs lpc_tracker/ (PyTorch retrain) on real LPC features from
data/Example.wav (231 frames):
| | F1 | F2 | F3 | F4 |
|---|---|---|---|---|
| PyTorch retrain mean (Hz) | 559 | 1819 | 2646 | 3910 |
| Torch7 port mean (Hz) | 467 | 1823 | 2588 | 3866 |
| Mean diff (Torch7 − PT) | −92 | +4 | −58 | −44 |
Both produce 100 % correctly ordered F1 < F2 < F3 < F4 on these frames; ranges are 230 to 4400 Hz (Torch7 port) and 280 to 4600 Hz (PyTorch retrain). The mean offsets, up to about 90 Hz in magnitude, are consistent with the divergence expected between two independent training runs of identical architecture on the same task. Prefer the Torch7-origin port when matching published-paper behaviour matters.
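The two summary statistics used in this cross-check are simple to compute from per-frame tracks. The sketch below recomputes the mean-difference row from the means in the table and shows an ordering check on toy frames; the helper names and the toy frames are illustrative and are not output from cross_check_trackers.py.

```python
def mean_per_formant(frames):
    """Column means of a list of [F1, F2, F3, F4] frames (Hz)."""
    n = len(frames)
    return [sum(frame[k] for frame in frames) / n for k in range(4)]


def fraction_ordered(frames):
    """Share of frames satisfying F1 < F2 < F3 < F4."""
    ordered = sum(1 for f1, f2, f3, f4 in frames if f1 < f2 < f3 < f4)
    return ordered / len(frames)


# Means taken from the table above (Hz).
retrain_mean = [559, 1819, 2646, 3910]
torch7_mean = [467, 1823, 2588, 3866]
mean_diff = [t - p for t, p in zip(torch7_mean, retrain_mean)]

# Toy frames, only to exercise the helpers; the second one is mis-ordered.
toy_frames = [[460.0, 1810.0, 2580.0, 3860.0],
              [480.0, 2700.0, 2600.0, 3880.0]]
toy_means = mean_per_formant(toy_frames)
toy_ordered = fraction_ordered(toy_frames)
```

On real tracker output, frames would be the (T, 4) array for one utterance after multiplying by 1000.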
The raw float64 weight extraction is included as
lpc_tracker_torch7/torch7_raw_weights.npz for audit / independent re-port. See
lpc_tracker_torch7/PORT_NOTES.md for the full arch dump and conversion notes.
int8 caveat for LSTMs. Dynamic int8 quantization of LSTMs can drift outputs by ~100 Hz worst-case (≈5 Hz mean) on this model; the validate threshold for int8 abs is bumped to 1.5 × 10⁻¹ accordingly. Prefer the fp16 variant for production use.
About the published ExamplePred.csv golden values
The upstream repo ships data/Example.wav and data/ExamplePred.csv with reference outputs
F1 ≈ 508, F2 ≈ 1605, F3 ≈ 2671, F4 ≈ 3639 Hz. These do not reproduce with the modern
feature pipeline. The discrepancy is in feature extraction, not in the model port:
- extract_features.py depends on scikits.talkbox (no Python 3 release), scipy.fftpack (legacy), and specific numpy semantics around fromstring.
- The two ported PyTorch checkpoints (LPC_NN_scaledLoss.pt, the Torch7 port) produce similar but not identical outputs to each other on the same features, and both differ modestly from the golden values, which is the pattern expected when the upstream pipeline used a decade-old DSP stack.
If exact upstream parity is required for a downstream comparison, regenerate
ExamplePred.csv once on your local environment and use that as your reference instead.
Intended use
- Single-frame and sequence-level formant estimation for research in speech science, phonetics, sociolinguistics, clinical speech analysis, and ASR feature exploration.
- Use cases where the input audio is 16 kHz mono speech and feature extraction can be done upstream with the extract_features.py pipeline.
Limitations and biases
- Trained on the VTR-TIMIT corpus (read American English vowels, adult speakers, clean studio recordings, 16 kHz). Out-of-domain performance (children's speech, dysphonic or pathological voices, non-English vowel systems, low-SNR field recordings, or band-limited telephone audio) is unverified and likely degraded.
- The estimator predicts exactly four formants (F1..F4). Models do not return a confidence or voicing decision; unvoiced frames produce nonsensical predictions that the caller is expected to mask externally (e.g. by a voicing detector).
- The tracker was trained with a fixed input sequence length of 20 frames in the upstream training loop; longer sequences work because the ONNX export uses a dynamic time axis, but evaluation outside that regime has not been independently validated here.
- int8 dynamic quantization can drift the tracker's predictions by 50 to 120 Hz in the worst case on the parity inputs. For fine-grained phonetic measurement (e.g. small F1/F2 contrasts), prefer the fp16 or fp32 variants.
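Since the models return no voicing decision, masking has to happen on the caller's side. A minimal sketch, assuming a per-frame boolean voicing mask is available from some external detector (the mask, its source and the helper name are all hypothetical):

```python
def mask_unvoiced(tracks_hz, voiced):
    """Replace predictions on unvoiced frames with None.

    tracks_hz: list of [F1, F2, F3, F4] per frame, in Hz.
    voiced:    list of booleans, one per frame, from an external detector.
    """
    if len(tracks_hz) != len(voiced):
        raise ValueError("tracks and voicing mask must have the same length")
    return [frame if is_voiced else None
            for frame, is_voiced in zip(tracks_hz, voiced)]


# Toy values: the middle frame is flagged as unvoiced and dropped.
tracks = [[500.0, 1500.0, 2500.0, 3500.0],
          [90.0, 4000.0, 4100.0, 4200.0],
          [520.0, 1480.0, 2550.0, 3600.0]]
masked = mask_unvoiced(tracks, [True, False, True])
kept = [frame for frame in masked if frame is not None]
```

Keeping None placeholders rather than deleting frames preserves the 10 ms frame alignment, which matters if the tracks are later aligned with other annotations.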
Files
.
├── README.md                          (this file, model card)
├── metadata.json                      (machine-readable arch/shape/parity summary)
├── lpc_estimator/
│   ├── model.onnx                     fp32
│   ├── model_fp16.onnx                fp16
│   └── model_int8.onnx                int8 dynamic
├── lpc_tracker/
│   ├── model.onnx                     fp32
│   ├── model_fp16.onnx                fp16
│   └── model_int8.onnx                int8 dynamic
├── lpc_estimator_torch7/
│   ├── model.onnx                     fp32
│   ├── model_fp16.onnx                fp16
│   ├── model_int8.onnx                int8 dynamic
│   ├── reconstructed.pt               PyTorch state_dict ported from estimation_model.dat
│   └── PORT_NOTES.md                  layer-by-layer arch dump and conversion notes
├── lpc_tracker_torch7/
│   ├── model.onnx                     fp32
│   ├── model_fp16.onnx                fp16
│   ├── model_int8.onnx                int8 dynamic
│   ├── reconstructed.pt               PyTorch state_dict ported from tracking_model.dat
│   ├── torch7_raw_weights.npz         raw float64 weight extraction (audit trail)
│   └── PORT_NOTES.md                  arch dump + gate-order analysis + bit-fidelity result
└── scripts/
    ├── convert_lpc.py                 PyTorch MLP → ONNX
    ├── convert_tracker.py             PyTorch LSTM → ONNX
    ├── port_torch7_estimator.py       Torch7 estimator .dat → PyTorch state_dict
    ├── convert_lpc_torch7.py          ported MLP state_dict → ONNX
    ├── port_torch7_tracker.py         Torch7 tracker .dat → PyTorch state_dict (+ numpy bit-fidelity)
    ├── convert_lpc_tracker_torch7.py  ported LSTM state_dict → ONNX
    ├── cross_check_trackers.py        compare Torch7 tracker port vs PyTorch retrain on real features
    ├── validate_parity.py             parity check (writes metadata.json, exits non-zero on regression)
    └── upload_hub.py                  push to this HF repo
Reproduce locally
conda activate onnxconv # torch 2.11, onnx 1.21, onnxruntime 1.24
pip install torchfile # only needed for the Torch7 port step
python scripts/convert_lpc.py
python scripts/convert_tracker.py
python scripts/port_torch7_estimator.py
python scripts/convert_lpc_torch7.py
python scripts/port_torch7_tracker.py # loads 3.86 GB .dat; needs ~6 GB RAM
python scripts/convert_lpc_tracker_torch7.py
python scripts/validate_parity.py
python scripts/cross_check_trackers.py # informational, requires extract_features.py output
The scripts assume the upstream repo is checked out as a sibling directory:
git clone https://github.com/MLSpeech/DeepFormants ../DeepFormants
Skipped
- CNN_estimate.pt (spectrogram estimator): referenced by spectrogramEstimate_inference.py but not shipped in the public repo. Cannot convert without weights.
Citation
If you use these models, please cite the upstream papers:
@inproceedings{dissen2016formant,
title = {Formant Estimation and Tracking using Deep Learning},
author = {Dissen, Yossi and Keshet, Joseph},
booktitle = {Interspeech},
year = {2016}
}
@article{dissen2019formant,
title = {Formant estimation and tracking: A deep learning approach},
author = {Dissen, Yossi and Goldberger, Jacob and Keshet, Joseph},
journal = {The Journal of the Acoustical Society of America},
volume = {145},
number = {2},
pages = {642--653},
year = {2019}
}
License
MIT, inherited from the upstream MLSpeech/DeepFormants repository.
The ONNX exports and the Torch7-origin port are derivative artifacts of the upstream weights. The upstream README distributes the Torch7 weights via a Google Drive link without explicit redistribution terms, so this repo is kept private. Do not re-host the converted artifacts publicly without first confirming the authors' intent.