GPT-NL Energy Prediction Models

Trained energy prediction models for the GPT-NL data curation pipeline. These models estimate the energy consumption of each pipeline stage (data splitting → string normalization → heuristic filtering → toxic language detection → deduplication) running on the Snellius supercomputer.

Usage

pip install git+https://github.com/kruuusher13/gptnl-energy-estimation-ekf.git huggingface_hub

# Download and use a model
from huggingface_hub import hf_hub_download
from gptnl_energy.models import get_model

model_path = hf_hub_download('GPT-NL/gptnl-energy-models', 'linear_energy.joblib')
model = get_model('linear')
model.load_fits(model_path)

# Predict energy for 400k documents
result = model.predict(n=400000, corpus='american_stories')
print(f'{result["total_j"]/1e6:.2f} MJ')
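
The snippet above reports the total in megajoules. For comparison against electricity metering, the sketch below converts a joule value to MJ and kWh. The helper names are hypothetical and not part of gptnl_energy, and the example value is made up for illustration rather than taken from a model.

```python
JOULES_PER_MJ = 1e6
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def joules_to_mj(energy_j: float) -> float:
    """Convert joules to megajoules."""
    return energy_j / JOULES_PER_MJ


def joules_to_kwh(energy_j: float) -> float:
    """Convert joules to kilowatt-hours."""
    return energy_j / JOULES_PER_KWH


if __name__ == "__main__":
    example_total_j = 7.2e6  # illustrative value, not a model output
    print(f"{joules_to_mj(example_total_j):.2f} MJ")
    print(f"{joules_to_kwh(example_total_j):.2f} kWh")
```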

Available Models

File	Type	Target	Description
`ols_fits_with_dedup.json`	sklearn	energy_j (Joules)	Calibrated per-stage OLS coefficients (production calibration, including deduplication); the default fits used by `gptnl-energy forecast` and `monitor`
`linear_energy.joblib`	sklearn	energy_j (Joules)	Per-stage OLS linear model (physics baseline): E = c0 + c1*n per pipeline stage
`linear_time.joblib`	sklearn	—	—
`ridge_energy.joblib`	sklearn	—	—
`ridge_time.joblib`	sklearn	—	—
`gbm_energy.joblib`	sklearn	energy_j (Joules)	Histogram Gradient Boosting — best ML model for cross-corpus energy transfer
`gbm_time.joblib`	sklearn	—	—
`mlp_energy.joblib`	sklearn	—	—
`mlp_time.joblib`	sklearn	—	—
`ftt_energy.pt`	PyTorch	energy_j (Joules)	FT-Transformer — neural tabular model with feature tokenization + stage embedding
`ftt_time.pt`	PyTorch	—	—
`kalman_transformer_energy.pt`	PyTorch	total_energy_j (whole pipeline, cold start)	FT-Transformer trained on Kalman filter trajectories for cold whole-run prediction

Methodology

Three prediction layers:

Physics: Per-stage linear model E = c0 + c1·n, where n is the number of documents (OLS, calibrated from sample runs)
Learned g: Coefficient predictor for unseen corpora (GBM, MLP, FT-Transformer)
Extended Kalman filter (EKF): Online estimator that blends the model prediction with live telemetry
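
To make the physics layer concrete, here is a minimal sketch of fitting E = c0 + c1·n per stage by ordinary least squares and summing the stage predictions. It uses NumPy directly rather than the package's own code. The function names, stage keys, and sample numbers are assumptions made for illustration, not measurements from Snellius.

```python
import numpy as np


def fit_stage(n_docs, energy_j):
    """Fit E = c0 + c1 * n by ordinary least squares for one stage."""
    n = np.asarray(n_docs, dtype=float)
    e = np.asarray(energy_j, dtype=float)
    # Design matrix with an intercept column and the document count
    design = np.column_stack([np.ones_like(n), n])
    (c0, c1), *_ = np.linalg.lstsq(design, e, rcond=None)
    return float(c0), float(c1)


def predict_total(fits, n):
    """Sum the per-stage predictions c0 + c1 * n over all stages."""
    return sum(c0 + c1 * n for c0, c1 in fits.values())


# Synthetic, noise-free sample runs (made-up numbers, in joules)
sample_sizes = [10_000, 50_000, 100_000]
samples = {
    "heuristic_filtering": [500.0 + 0.02 * n for n in sample_sizes],
    "deduplication": [1200.0 + 0.05 * n for n in sample_sizes],
}

fits = {stage: fit_stage(sample_sizes, e) for stage, e in samples.items()}
total_j = predict_total(fits, 400_000)
print(f"{total_j / 1e6:.3f} MJ")
```

Because each stage is fitted independently, a stage can be recalibrated from new sample runs without touching the others.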

See the GPT-NL Energy repo for the full pipeline.
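
The blending idea behind the online layer can be illustrated with a plain scalar Kalman update. This is a simplified stand-in, not the extended filter from the thesis: it tracks only the slope c1 of one stage, treats c0 as known, and assumes the telemetry reports cumulative energy after n_done documents. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SlopeEstimate:
    c1: float        # current estimate of energy per document (J)
    variance: float  # uncertainty of that estimate


def kalman_update(est, c0, n_done, measured_j, meas_var):
    """One scalar Kalman update of c1 from a cumulative energy reading.

    Measurement model (assumed): measured_j = c0 + c1 * n_done + noise.
    """
    h = float(n_done)                        # sensitivity of the reading to c1
    innovation = measured_j - (c0 + est.c1 * h)
    s = h * est.variance * h + meas_var      # innovation variance
    gain = est.variance * h / s              # Kalman gain
    return SlopeEstimate(
        c1=est.c1 + gain * innovation,
        variance=(1.0 - gain * h) * est.variance,
    )


# Prior from the offline model, then three noise-free telemetry readings
c0, true_c1 = 1200.0, 0.05                   # made-up stage coefficients
est = SlopeEstimate(c1=0.03, variance=1e-3)
for n_done in (10_000, 20_000, 30_000):
    reading_j = c0 + true_c1 * n_done
    est = kalman_update(est, c0, n_done, reading_j, meas_var=100.0 ** 2)
print(f"c1 estimate: {est.c1:.4f} J/doc, variance: {est.variance:.2e}")
```

The gain grows with the prior variance and shrinks with the measurement variance, so a confident offline prediction is corrected slowly while a noisy one defers quickly to telemetry.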

Source and thesis

Code and data: kruuusher13/gptnl-energy-estimation-ekf
Thesis: GPT-NL: Complexity-Aware Energy Estimation of the Data Curation Pipeline via Extended Kalman Filtering, R. Malik, MSc Applied Data Science, Utrecht University, 2026 (in the repository under paper/thesis.pdf)
Pipeline this predicts: GPT-NL data-curation-pipeline

All coefficients and evaluation numbers are regenerable from the measurement data with the scripts in paper/code/.
