DeepFormants — ONNX

ONNX exports of the MLSpeech/DeepFormants formant estimation and tracking models, with fp32, fp16, and int8 dynamic-quantization variants.

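The repository ports all four usable upstream artifacts:

The PyTorch MLP estimator (LPC_NN_scaledLoss.pt).
The PyTorch LSTM tracker (LPC_RNN.pt).
The original Torch7 paper estimator (estimation_model.dat), ported to PyTorch via torchfile and verified against a float64 numpy reconstruction of the Torch7 forward pass (maximum drift 0.003 Hz).
The original Torch7 paper tracker (tracking_model.dat), ported the same way and verified against a float64 numpy FastLSTM forward pass (maximum drift below 0.0001 Hz).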

Status: private redistribution only. The DeepFormants source is MIT-licensed; the upstream weights are hosted on external Google Drive and the authors have not explicitly stated terms for redistribution of derived artifacts. Treat these files as local-use only unless you have confirmed otherwise.

Models

Directory	Source	Architecture	Input shape	Output shape
`lpc_estimator/`	`pytorchFormants/Estimator/LPC_NN_scaledLoss.pt`	MLP 350 → 1024 → 512 → 256 → 4 (sigmoid hidden, linear out)	`(B, 350)` float32	`(B, 4)` float32
`lpc_tracker/`	`pytorchFormants/Tracker/LPC_RNN.pt`	LSTM(350, 512) → LSTM(512, 256) → Linear(256, 4)	`(B, T, 350)` float32	`(B, T, 4)` float32
`lpc_estimator_torch7/`	`estimation_model.dat` (Torch7 `nn.Sequential`, ported via `torchfile`)	Same MLP arch as `lpc_estimator/`; different weights (original paper model)	`(B, 350)` float32	`(B, 4)` float32
`lpc_tracker_torch7/`	`tracking_model.dat` (Torch7 `nn.Sequencer` + `nn.FastLSTM`, ported via `torchfile`)	Same LSTM arch as `lpc_tracker/`; different weights (original paper model)	`(B, T, 350)` float32	`(B, T, 4)` float32

Output convention. Raw model output ≈ formant_Hz / 1000. Multiply by 1000 to obtain Hz, exactly as load_pytorch_lpc_estimator.py and load_estimation_model.lua do upstream.

Input features. All four models consume the same 350-dimensional per-frame feature vector produced by extract_features.py in the upstream repo: a periodogram-derived cepstrum (50 coefficients) concatenated with LPC-derived features for orders 8..17 (10 orders × 30 values = 300). The frame hop is 10 ms and the frame length 30 ms, at 16 kHz. The estimators predict formants for a single frame; the trackers take a sequence of frames and emit a per-frame trajectory.
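The layout arithmetic above can be sketched as follows. This is a minimal illustration, not upstream code: the helper names are hypothetical, the cepstrum-first ordering is inferred from the word "concatenated", and the frame-count formula assumes no padding at the signal edges.

```python
import numpy as np

SAMPLE_RATE = 16_000
FRAME_LEN = SAMPLE_RATE * 30 // 1000    # 30 ms -> 480 samples
HOP = SAMPLE_RATE * 10 // 1000          # 10 ms -> 160 samples
N_CEPSTRUM = 50
LPC_ORDERS = range(8, 18)               # orders 8..17 inclusive -> 10 orders
VALUES_PER_ORDER = 30
FEATURE_DIM = N_CEPSTRUM + len(LPC_ORDERS) * VALUES_PER_ORDER   # 50 + 300


def split_features(frame_features):
    """Split a (..., 350) array into cepstrum (..., 50) and LPC (..., 10, 30).

    ASSUMPTION: the cepstrum block comes first, followed by one 30-value
    block per LPC order in ascending order. Check extract_features.py.
    """
    x = np.asarray(frame_features)
    if x.shape[-1] != FEATURE_DIM:
        raise ValueError(f"expected last dim {FEATURE_DIM}, got {x.shape[-1]}")
    cepstrum = x[..., :N_CEPSTRUM]
    lpc = x[..., N_CEPSTRUM:].reshape(
        *x.shape[:-1], len(LPC_ORDERS), VALUES_PER_ORDER
    )
    return cepstrum, lpc


def n_frames(n_samples):
    """Number of full frames in a signal. ASSUMPTION: no edge padding."""
    if n_samples < FRAME_LEN:
        return 0
    return 1 + (n_samples - FRAME_LEN) // HOP
```

Under these assumptions, one second of 16 kHz audio yields `n_frames(16000) == 98` frames; the upstream script may pad or trim differently.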

Variants

Each model directory contains three ONNX files:

File	Variant	Method
`model.onnx`	fp32	`torch.onnx.export` (opset 17, legacy TorchScript exporter)
`model_fp16.onnx`	fp16	`onnxconverter_common.float16.convert_float_to_float16` (also casts I/O)
`model_int8.onnx`	int8 dynamic	`onnxruntime.quantization.quantize_dynamic` (`QuantType.QInt8`)

int8 dynamic quantizes MatMul weights only — for these MLP/LSTM models that covers the entire parameter mass. There are no Conv ops, so no static-quantization with calibration data is required.
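As a rough sketch, the fp16 and int8 files could be derived from the fp32 export with the two library calls named in the table. The function names `variant_paths` and `make_variants` are hypothetical, and the actual scripts in scripts/ may pass different options.

```python
from pathlib import Path


def variant_paths(model_dir):
    """File names of the three variants inside one model directory."""
    d = Path(model_dir)
    return {
        "fp32": d / "model.onnx",
        "fp16": d / "model_fp16.onnx",
        "int8": d / "model_int8.onnx",
    }


def make_variants(model_dir):
    """Write model_fp16.onnx and model_int8.onnx next to an existing model.onnx."""
    # Imported lazily so the path helper above works without these packages.
    import onnx
    from onnxconverter_common import float16
    from onnxruntime.quantization import QuantType, quantize_dynamic

    paths = variant_paths(model_dir)

    # fp16: keep_io_types=False also converts the graph inputs and outputs.
    fp32_model = onnx.load(str(paths["fp32"]))
    fp16_model = float16.convert_float_to_float16(fp32_model, keep_io_types=False)
    onnx.save(fp16_model, str(paths["fp16"]))

    # int8: dynamic quantization needs no calibration data.
    quantize_dynamic(
        str(paths["fp32"]), str(paths["int8"]), weight_type=QuantType.QInt8
    )
    return paths
```

Calling `make_variants("lpc_estimator")` would then regenerate both derived files for that directory.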

Size and parity (N = 20 random inputs, vs PyTorch reference)

Model	Variant	Size	max abs diff	max rel diff	mean abs diff
`lpc_estimator`	fp32	4.07 MB	9.5 × 10⁻⁷	8.0 × 10⁻⁷	1.9 × 10⁻⁷
`lpc_estimator`	fp16	2.03 MB	1.4 × 10⁻³	5.4 × 10⁻⁴	3.2 × 10⁻⁴
`lpc_estimator`	int8	1.03 MB	1.4 × 10⁻²	1.5 × 10⁻²	2.9 × 10⁻³
`lpc_tracker`	fp32	10.24 MB	1.2 × 10⁻⁶	4.9 × 10⁻⁶	1.4 × 10⁻⁷
`lpc_tracker`	fp16	5.12 MB	2.1 × 10⁻³	9.0 × 10⁻¹	3.5 × 10⁻⁴
`lpc_tracker`	int8	2.58 MB	5.0 × 10⁻²	1.8 × 10⁻¹	5.8 × 10⁻³
`lpc_estimator_torch7`	fp32	4.07 MB	1.4 × 10⁻⁶	2.6 × 10⁻⁵	1.6 × 10⁻⁷
`lpc_estimator_torch7`	fp16	2.03 MB	2.0 × 10⁻³	5.0 × 10⁻²	2.5 × 10⁻⁴
`lpc_estimator_torch7`	int8	1.03 MB	4.2 × 10⁻²	4.8 × 10⁰	5.3 × 10⁻³
`lpc_tracker_torch7`	fp32	10.24 MB	1.1 × 10⁻⁶	9.5 × 10⁻⁴	5.4 × 10⁻⁸
`lpc_tracker_torch7`	fp16	5.12 MB	1.8 × 10⁻³	1.2 × 10⁰	1.1 × 10⁻⁴
`lpc_tracker_torch7`	int8	2.58 MB	1.2 × 10⁻¹	7.3 × 10¹	5.1 × 10⁻³

Diffs are on the raw model output (≈ Hz / 1000). To read as Hz-scale absolute error on the predicted formants, multiply by 1000.

The tracker's fp16 max rel diff of 0.9 is inflated by outputs that pass near zero on synthetic random input, where a small absolute error is divided by a tiny reference value; the mean absolute drift is ≈ 0.35 Hz (3.5 × 10⁻⁴ × 1000), well within phonetic tolerance.
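The three table columns can be computed along the following lines. This is a sketch under assumptions: the exact relative-difference definition used by validate_parity.py is not stated here, so the `eps` guard and the choice of the reference magnitude as denominator are guesses.

```python
import numpy as np


def parity_metrics(reference, candidate, eps=1e-12):
    """Max abs, max rel, and mean abs difference between two output arrays.

    ASSUMPTION: relative diff = |cand - ref| / max(|ref|, eps).
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    abs_diff = np.abs(cand - ref)
    rel_diff = abs_diff / np.maximum(np.abs(ref), eps)
    return {
        "max_abs": float(abs_diff.max()),
        "max_rel": float(rel_diff.max()),
        "mean_abs": float(abs_diff.mean()),
    }


# Toy raw outputs (Hz / 1000); the last value sits close to zero.
ref = np.array([[0.5, 1.8, 2.6, 0.002]])
cand = ref + 0.001                      # uniform 1 Hz error on the Hz scale
metrics = parity_metrics(ref, cand)
max_abs_hz = metrics["max_abs"] * 1000.0
```

In this toy case every element is off by the same 1 Hz, yet the relative difference reaches 0.5 because one reference value is only 0.002. That is the mechanism behind the large max rel diff entries.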

Choosing a variant:

fp32: reference accuracy, ~10 ms inference for the tracker on CPU.
fp16: half the size, sub-Hz mean drift on every model (≈ 0.1–0.35 Hz), ≈ 1.4–2.1 Hz worst case. Strongly recommended for production.
int8: quarter the size, ≈ 3–6 Hz mean drift, ≈ 14–120 Hz worst case (largest on the Torch7-origin tracker). Acceptable for indicative analyses; verify on your own data before using for fine-grained phonetic measurement.

Usage

import numpy as np
import onnxruntime as ort

# Single-frame estimator
sess = ort.InferenceSession("lpc_estimator/model.onnx",
                            providers=["CPUExecutionProvider"])
x = features.astype(np.float32)              # (B, 350) from extract_features.py
y = sess.run(None, {"input": x})[0]          # (B, 4)
F1, F2, F3, F4 = (y * 1000.0).T              # Hz

# Sequence tracker
sess_t = ort.InferenceSession("lpc_tracker/model.onnx",
                              providers=["CPUExecutionProvider"])
x_seq = features_seq.astype(np.float32)      # (B, T, 350)
y_seq = sess_t.run(None, {"input": x_seq})[0]  # (B, T, 4)
formant_tracks_hz = y_seq * 1000.0

# Torch7-origin paper estimator (drop-in replacement for lpc_estimator)
sess_t7 = ort.InferenceSession("lpc_estimator_torch7/model.onnx",
                               providers=["CPUExecutionProvider"])
y_t7 = sess_t7.run(None, {"input": x})[0]

For the fp16 variants, cast the input as well: x.astype(np.float16). The int8 variants accept float32 input.
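One way to avoid hard-coding the dtype per variant is to read it from the session's input metadata. The helper names below are hypothetical; `get_inputs()` and the `name` / `type` fields are standard onnxruntime session metadata.

```python
import numpy as np

# onnxruntime reports input element types as strings such as these.
_ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
}


def input_dtype(ort_type):
    """Map an onnxruntime type string to the matching numpy dtype."""
    try:
        return _ORT_TO_NUMPY[ort_type]
    except KeyError:
        raise ValueError(f"unsupported input type: {ort_type}") from None


def prepare_input(sess, features):
    """Build the feed dict for sess.run, casting to the dtype the model expects.

    `sess` is an onnxruntime.InferenceSession (or anything with get_inputs()).
    """
    meta = sess.get_inputs()[0]
    return {meta.name: np.asarray(features).astype(input_dtype(meta.type))}
```

With this, `sess.run(None, prepare_input(sess, x))[0]` works unchanged for all three variants. The fp16 models also return float16, so casting the output to float32 before multiplying by 1000 avoids a second rounding step.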

To prepare the input features from a WAV file, use the upstream extract_features.py. The script targets Python 2.7 / SciPy 1.x; on Python 3.11 / SciPy ≥ 1.13 two one-line patches are required (scipy.signal.hamming → scipy.signal.windows.hamming, np.fromstring → np.frombuffer).
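The np.frombuffer patch can be illustrated in isolation. The helper name is hypothetical and the little-endian 16-bit PCM layout is an assumption about the WAV payload; the hamming patch is a pure import change and is shown only as a comment.

```python
import numpy as np

# Modern location of the window function (replaces scipy.signal.hamming):
#   from scipy.signal.windows import hamming


def pcm16_bytes_to_array(raw):
    """Decode raw little-endian 16-bit PCM bytes into an int16 array.

    np.frombuffer replaces the deprecated binary mode of np.fromstring.
    The result is a read-only view; call .copy() before modifying it in place.
    """
    return np.frombuffer(raw, dtype="<i2")


samples = pcm16_bytes_to_array(b"\x01\x00\xff\xff\x00\x80")
```

Here `samples` holds the three values 1, -1, and -32768.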

Torch7-origin estimator (`lpc_estimator_torch7/`)

The original estimation_model.dat accompanying the paper is a Torch7 binary, not a PyTorch checkpoint. The port pipeline:

Read the Torch7 module tree using the pure-Python torchfile library — no Lua/Torch7 runtime is required. The dump (see lpc_estimator_torch7/PORT_NOTES.md) confirms:
```
nn.Sequential
  nn.Linear(350, 1024) → nn.Sigmoid
  nn.Linear(1024, 512) → nn.Sigmoid
  nn.Linear(512, 256)  → nn.Sigmoid
  nn.Linear(256, 4)
```
Identical layer shapes to LPC_NN_scaledLoss.pt; only the trained weights differ. Tensors are stored as DoubleTensor (float64).
Reconstruct in PyTorch: weight (out, in) and bias (out,) copy verbatim — Torch7 and PyTorch agree on nn.Linear storage layout, so no transpose. Cast float64 → float32.
Verify by feeding real LPC features from data/Example.wav through both (a) the ported PyTorch float32 model and (b) a float64 numpy reconstruction of the Torch7 forward pass using the originally-stored weights. Maximum drift: 0.003 Hz on the Hz scale. The port is numerically equivalent to running the upstream .dat under Lua.
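The verification step can be sketched with numpy alone. Random stand-in weights replace the real Torch7 tensors here, so the drift printed by this sketch is illustrative and is not the 0.003 Hz figure reported above.

```python
import numpy as np

LAYER_SIZES = [350, 1024, 512, 256, 4]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mlp_forward(x, weights, biases):
    """Linear -> Sigmoid on hidden layers, plain Linear on the last layer.

    Weights use the (out, in) layout shared by Torch7 and PyTorch nn.Linear.
    """
    h = x
    last = len(weights) - 1
    for k, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w.T + b
        if k < last:
            h = sigmoid(h)
    return h


# Stand-in float64 "Torch7" weights (ASSUMPTION: random, not the real model).
rng = np.random.default_rng(0)
w64 = [rng.normal(0.0, 0.05, size=(o, i))
       for i, o in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]
b64 = [rng.normal(0.0, 0.05, size=o) for o in LAYER_SIZES[1:]]

# The port casts float64 -> float32 without transposing.
w32 = [w.astype(np.float32) for w in w64]
b32 = [b.astype(np.float32) for b in b64]

x64 = rng.normal(size=(8, 350))
y64 = mlp_forward(x64, w64, b64)                      # float64 reference
y32 = mlp_forward(x64.astype(np.float32), w32, b32)   # float32 "port"
max_drift_hz = float(np.abs(y64 - y32).max() * 1000.0)
```

The real check would substitute the tensors read by torchfile for `w64` / `b64` and features from data/Example.wav for `x64`.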

The PyTorch state dict produced by the port is included as lpc_estimator_torch7/reconstructed.pt (4.07 MB).

Torch7-origin tracker (`lpc_tracker_torch7/`)

The original Torch7 binary tracking_model.dat (3.86 GB on disk, ≈10 MB of unique parameters — the bulk is BPTT-unrolled state buffers) ports cleanly into PyTorch nn.LSTM:

nn.Sequential
├── nn.Sequencer( nn.FastLSTM )    hidden = 512   i2g (2048, 350)  o2g (2048, 512)
├── nn.Sequencer( nn.FastLSTM )    hidden = 256   i2g (1024, 512)  o2g (1024, 256)
└── nn.Sequencer( nn.Recursor( nn.Linear(256, 4) ) )

Architecturally identical to lpc_tracker/ (the PyTorch retrain) — only the trained weights differ.

Gate ordering. Torch7 nn.FastLSTM packs the four LSTM gates as [i, g, f, o] (determined by the order of activations in the inner nn.ParallelTable: Sigmoid / Tanh / Sigmoid / Sigmoid — the Tanh marks the cell-candidate g). PyTorch's nn.LSTM uses [i, f, g, o]. The port applies a block-wise permutation [0, 2, 1, 3] to the 4 H-sized blocks of weight_ih, weight_hh, and bias_ih.
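The block-wise permutation can be written in a few lines of numpy. The function name is hypothetical; the permutation `[0, 2, 1, 3]` is the one stated above.

```python
import numpy as np

# New block k is taken from old block TORCH7_TO_PYTORCH[k]:
# Torch7 [i, g, f, o] -> PyTorch [i, f, g, o].
TORCH7_TO_PYTORCH = [0, 2, 1, 3]


def permute_gates(param, hidden_size, order=TORCH7_TO_PYTORCH):
    """Reorder the four H-sized gate blocks along axis 0.

    Works for weights of shape (4H, in) and biases of shape (4H,).
    """
    param = np.asarray(param)
    if param.shape[0] != 4 * hidden_size:
        raise ValueError("first dimension must be 4 * hidden_size")
    blocks = param.reshape(4, hidden_size, *param.shape[1:])
    return blocks[order].reshape(param.shape)


# Toy bias with H = 2 where each entry holds its Torch7 block index.
H = 2
i2g_bias = np.array([0, 0, 1, 1, 2, 2, 3, 3])
permuted = permute_gates(i2g_bias, H)
```

Because the permutation only swaps the two middle blocks, applying it twice restores the original order, which makes a convenient sanity check.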

Bias convention. FastLSTM has bias on the input-to-gates linear only (the recurrent linear is LinearNoBias). PyTorch's nn.LSTM exposes bias_ih + bias_hh and sums them at forward time. The port places the (permuted) Torch7 i2g.bias into bias_ih_l0 and zeros bias_hh_l0.

Bit-fidelity. A pure-numpy float64 FastLSTM forward pass (using the original, non-permuted weights) was compared with PyTorch nn.LSTM (using the permuted float32 weights) on random (1, 20, 350) input. Maximum drift: 7 × 10⁻⁸ on raw output, i.e. below 0.0001 Hz on the Hz scale. To within float32 rounding, the port is numerically equivalent to running the original .dat under Torch7.
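A single-step version of this comparison can be sketched in numpy. It assumes a standard LSTM cell without peepholes and uses small random stand-in weights, so it illustrates the method rather than reproducing the reported figure; both function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h, c, w_ih, w_hh, bias, gate_order):
    """One LSTM step; gate_order names the four H-sized blocks, e.g. 'igfo'."""
    H = h.shape[-1]
    z = x @ w_ih.T + h @ w_hh.T + bias
    gates = {name: z[..., k * H:(k + 1) * H] for k, name in enumerate(gate_order)}
    i, f, o = sigmoid(gates["i"]), sigmoid(gates["f"]), sigmoid(gates["o"])
    g = np.tanh(gates["g"])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new


def reorder(param, H, src="igfo", dst="ifgo"):
    """Rearrange gate blocks along axis 0 from layout src to layout dst."""
    return np.concatenate(
        [param[src.index(n) * H:(src.index(n) + 1) * H] for n in dst], axis=0
    )


rng = np.random.default_rng(0)
D, H = 6, 4                                   # toy sizes, not 350 / 512
w_ih = rng.normal(size=(4 * H, D))
w_hh = rng.normal(size=(4 * H, H))
i2g_bias = rng.normal(size=4 * H)             # the recurrent linear has no bias
x, h0, c0 = rng.normal(size=(1, D)), rng.normal(size=(1, H)), rng.normal(size=(1, H))

# Reference: float64, Torch7 layout, original weights.
h_ref, c_ref = lstm_step(x, h0, c0, w_ih, w_hh, i2g_bias, "igfo")

# Port: float32, PyTorch layout; bias_ih carries the bias, bias_hh is zero.
f32 = np.float32
bias_pt = reorder(i2g_bias, H).astype(f32) + np.zeros(4 * H, dtype=f32)
h_pt, c_pt = lstm_step(
    x.astype(f32), h0.astype(f32), c0.astype(f32),
    reorder(w_ih, H).astype(f32), reorder(w_hh, H).astype(f32), bias_pt, "ifgo",
)
max_drift = float(np.abs(h_ref - h_pt).max())
```

Skipping the reordering while still reading the blocks as `"ifgo"` produces a visibly different output, which is a quick way to confirm the gate order matters.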

Cross-check vs lpc_tracker/ (PyTorch retrain) on real LPC features from data/Example.wav (231 frames):

	F1	F2	F3	F4
PyTorch retrain mean (Hz)	559	1819	2646	3910
Torch7 port mean (Hz)	467	1823	2588	3866
Mean diff (Torch7 − PT)	−92	+4	−58	−44

Both produce 100 % correctly ordered F1 < F2 < F3 < F4 on these frames; ranges are 230–4400 Hz (Torch7 port) and 280–4600 Hz (PyTorch retrain). The ~45–90 Hz mean offset on F1, F3, and F4 (F2 agrees to within 4 Hz) is consistent with the divergence expected between two independent training runs of identical architecture on the same task. Prefer the Torch7-origin port when matching published-paper behaviour matters.
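The two statistics in this cross-check, the per-formant mean offset and the share of correctly ordered frames, can be computed as below. The function names are hypothetical, and the two-frame arrays are made-up values chosen only so that their means equal the table's means; they are not real model outputs.

```python
import numpy as np


def ordered_fraction(tracks_hz):
    """Fraction of frames with strictly increasing F1 < F2 < F3 < F4."""
    t = np.asarray(tracks_hz, dtype=np.float64).reshape(-1, 4)
    return float(np.all(np.diff(t, axis=-1) > 0, axis=-1).mean())


def mean_offset(tracks_a_hz, tracks_b_hz):
    """Per-formant mean of A minus per-formant mean of B, in Hz."""
    a = np.asarray(tracks_a_hz, dtype=np.float64).reshape(-1, 4)
    b = np.asarray(tracks_b_hz, dtype=np.float64).reshape(-1, 4)
    return a.mean(axis=0) - b.mean(axis=0)


# ILLUSTRATIVE frames only (not real outputs).
torch7_port = np.array([[460.0, 1820.0, 2590.0, 3870.0],
                        [474.0, 1826.0, 2586.0, 3862.0]])
pt_retrain = np.array([[550.0, 1815.0, 2650.0, 3905.0],
                       [568.0, 1823.0, 2642.0, 3915.0]])

offset = mean_offset(torch7_port, pt_retrain)
share_ordered = ordered_fraction(torch7_port)
```

With real outputs, both arguments would be the `(T, 4)` Hz-scale tracks from the two tracker sessions on the same features.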

The raw float64 weight extraction is included as lpc_tracker_torch7/torch7_raw_weights.npz for audit / independent re-port. See lpc_tracker_torch7/PORT_NOTES.md for the full arch dump and conversion notes.

int8 caveat for LSTMs. Dynamic int8 quantization drifts this model's outputs by up to ≈ 120 Hz worst-case (≈ 5 Hz mean) on random inputs, so the int8 max-abs threshold in validate_parity.py is raised to 1.5 × 10⁻¹ accordingly. Prefer the fp16 variant for production use.

About the published `ExamplePred.csv` golden values

The upstream repo ships data/Example.wav and data/ExamplePred.csv with reference outputs F1 ≈ 508, F2 ≈ 1605, F3 ≈ 2671, F4 ≈ 3639 Hz. These do not reproduce with the modern feature pipeline. The discrepancy is in feature extraction, not in the model port:

extract_features.py depends on scikits.talkbox (no Python 3 release), scipy.fftpack (legacy, superseded by scipy.fft), and specific numpy semantics around fromstring.
The two estimator checkpoints (LPC_NN_scaledLoss.pt and the Torch7 port) produce similar but not identical outputs on the same features, and both differ modestly from the golden values. This is the pattern one would expect if the golden values were generated with the original, decade-old DSP stack.

If exact upstream parity is required for a downstream comparison, regenerate ExamplePred.csv once on your local environment and use that as your reference instead.

Intended use

Single-frame and sequence-level formant estimation for research in speech science, phonetics, sociolinguistics, clinical speech analysis, and ASR feature exploration.
Use cases where the input audio is 16 kHz mono speech and feature extraction can be done upstream with the extract_features.py pipeline.

Limitations and biases

Trained on the VTR-TIMIT corpus (read American English vowels, adult speakers, clean studio recordings, 16 kHz). Out-of-domain performance — children's speech, dysphonic or pathological voices, non-English vowel systems, low-SNR field recordings, or band-limited telephone audio — is unverified and likely degraded.
The estimator predicts exactly four formants (F1..F4). Models do not return a confidence or voicing decision; unvoiced frames produce nonsensical predictions that the caller is expected to mask externally (e.g. by a voicing detector).
The tracker was trained with a fixed input sequence length of 20 frames in the upstream training loop; longer sequences work because the ONNX export uses a dynamic time axis, but evaluation outside that regime has not been independently validated here.
int8 dynamic quantization can drift the trackers' predictions by 50–120 Hz in the worst case on random inputs (see the parity table). For fine-grained phonetic measurement (e.g. small F1/F2 contrasts), prefer the fp16 or fp32 variants.
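Two of the limitations above, external masking of unvoiced frames and the 20-frame training length, lend themselves to small caller-side helpers. Both functions are hypothetical sketches: the voicing decision must come from a separate detector, and whether windowed inference is better than one long pass has not been evaluated here.

```python
import numpy as np

TRAIN_SEQ_LEN = 20   # sequence length used in the upstream training loop


def mask_unvoiced(tracks_hz, voiced):
    """Return a copy of (T, 4) tracks with unvoiced frames set to NaN.

    `voiced` is a length-T boolean sequence from an external voicing detector.
    """
    tracks = np.array(tracks_hz, dtype=np.float64)   # np.array makes a copy
    tracks[~np.asarray(voiced, dtype=bool)] = np.nan
    return tracks


def split_into_windows(features, seq_len=TRAIN_SEQ_LEN):
    """Split (T, 350) features into consecutive, non-overlapping windows.

    The last window may be shorter than seq_len. Note that running the
    tracker per window resets the LSTM state at every window boundary.
    """
    x = np.asarray(features)
    return [x[start:start + seq_len] for start in range(0, x.shape[0], seq_len)]
```

For a 231-frame utterance, `split_into_windows` returns 12 windows, the last holding 11 frames; each window would be fed to the tracker with a leading batch axis, e.g. `window[None]`.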

Files

.
├── README.md                       (this file — model card)
├── metadata.json                   (machine-readable arch/shape/parity summary)
├── lpc_estimator/
│   ├── model.onnx                  fp32
│   ├── model_fp16.onnx             fp16
│   └── model_int8.onnx             int8 dynamic
├── lpc_tracker/
│   ├── model.onnx                  fp32
│   ├── model_fp16.onnx             fp16
│   └── model_int8.onnx             int8 dynamic
├── lpc_estimator_torch7/
│   ├── model.onnx                  fp32
│   ├── model_fp16.onnx             fp16
│   ├── model_int8.onnx             int8 dynamic
│   ├── reconstructed.pt            PyTorch state_dict ported from estimation_model.dat
│   └── PORT_NOTES.md               layer-by-layer arch dump and conversion notes
├── lpc_tracker_torch7/
│   ├── model.onnx                  fp32
│   ├── model_fp16.onnx             fp16
│   ├── model_int8.onnx             int8 dynamic
│   ├── reconstructed.pt            PyTorch state_dict ported from tracking_model.dat
│   ├── torch7_raw_weights.npz      raw float64 weight extraction (audit trail)
│   └── PORT_NOTES.md               arch dump + gate-order analysis + bit-fidelity result
└── scripts/
    ├── convert_lpc.py                  PyTorch MLP → ONNX
    ├── convert_tracker.py              PyTorch LSTM → ONNX
    ├── port_torch7_estimator.py        Torch7 estimator .dat → PyTorch state_dict
    ├── convert_lpc_torch7.py           ported MLP state_dict → ONNX
    ├── port_torch7_tracker.py          Torch7 tracker .dat → PyTorch state_dict (+ numpy bit-fidelity)
    ├── convert_lpc_tracker_torch7.py   ported LSTM state_dict → ONNX
    ├── cross_check_trackers.py         compare Torch7 tracker port vs PyTorch retrain on real features
    ├── validate_parity.py              parity check (writes metadata.json, exits non-zero on regression)
    └── upload_hub.py                   push to this HF repo

Reproduce locally

conda activate onnxconv          # torch 2.11, onnx 1.21, onnxruntime 1.24
pip install torchfile             # only needed for the Torch7 port step

python scripts/convert_lpc.py
python scripts/convert_tracker.py
python scripts/port_torch7_estimator.py
python scripts/convert_lpc_torch7.py
python scripts/port_torch7_tracker.py            # loads 3.86 GB .dat — needs ~6 GB RAM
python scripts/convert_lpc_tracker_torch7.py
python scripts/validate_parity.py
python scripts/cross_check_trackers.py           # informational, requires extract_features.py output

The scripts assume the upstream repo is checked out as a sibling directory:

git clone https://github.com/MLSpeech/DeepFormants ../DeepFormants

Skipped

CNN_estimate.pt (spectrogram estimator) — referenced by spectrogramEstimate_inference.py but not shipped in the public repo. Cannot convert without weights.

Citation

If you use these models, please cite the upstream paper:

@inproceedings{dissen2016formant,
  title     = {Formant Estimation and Tracking using Deep Learning},
  author    = {Dissen, Yossi and Keshet, Joseph},
  booktitle = {Interspeech},
  year      = {2016}
}

@article{dissen2019formant,
  title   = {Formant estimation and tracking: A deep learning approach},
  author  = {Dissen, Yossi and Goldberger, Jacob and Keshet, Joseph},
  journal = {The Journal of the Acoustical Society of America},
  volume  = {145},
  number  = {2},
  pages   = {642--653},
  year    = {2019}
}

License

MIT, inherited from the upstream MLSpeech/DeepFormants repository.

The ONNX exports and the Torch7-origin port are derivative artifacts of the upstream weights. The upstream README distributes the Torch7 weights via a Google Drive link without explicit redistribution terms — this repo is therefore kept private. Do not re-host the converted artifacts publicly without first confirming the authors' intent.
