GPT-NL Energy Prediction Models
Trained energy prediction models for the GPT-NL data curation pipeline. These models estimate the energy consumption of each pipeline stage (data splitting → string normalization → heuristic filtering → toxic language detection → deduplication) running on the Snellius supercomputer.
Usage
pip install git+https://github.com/kruuusher13/gptnl-energy-estimation-ekf.git huggingface_hub
# Download and use a model
from huggingface_hub import hf_hub_download
from gptnl_energy.models import get_model
model_path = hf_hub_download('GPT-NL/gptnl-energy-models', 'linear_energy.joblib')
model = get_model('linear')
model.load_fits(model_path)
# Predict energy for 400k documents
result = model.predict(n=400000, corpus='american_stories')
print(f'{result["total_j"]/1e6:.2f} MJ')
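Energy budgets are often quoted in kilowatt-hours rather than megajoules. The helper below is a small sketch of that conversion; `joules_to_kwh` is a hypothetical name that is not part of `gptnl_energy`, and the `result` dictionary is a made-up stand-in for the value returned by `model.predict(...)`.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ by definition


def joules_to_kwh(energy_j: float) -> float:
    """Convert an energy value in joules to kilowatt-hours."""
    return energy_j / JOULES_PER_KWH


# Stand-in for the dictionary returned by model.predict(...).
# The value is invented for illustration only.
result = {"total_j": 7.2e6}
print(f"{joules_to_kwh(result['total_j']):.2f} kWh")
```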
Available Models
| File | Type | Target | Description |
|---|---|---|---|
| `ols_fits_with_dedup.json` | sklearn | `energy_j` (Joules) | Calibrated per-stage OLS coefficients (production calibration incl. deduplication); the default fits used by gptnl-energy forecast and monitor |
| `linear_energy.joblib` | sklearn | `energy_j` (Joules) | Per-stage OLS linear model (physics baseline): `E = c0 + c1*n` per pipeline stage |
| `linear_time.joblib` | sklearn | - | - |
| `ridge_energy.joblib` | sklearn | - | - |
| `ridge_time.joblib` | sklearn | - | - |
| `gbm_energy.joblib` | sklearn | `energy_j` (Joules) | Histogram Gradient Boosting; best ML model for cross-corpus energy transfer |
| `gbm_time.joblib` | sklearn | - | - |
| `mlp_energy.joblib` | sklearn | - | - |
| `mlp_time.joblib` | sklearn | - | - |
| `ftt_energy.pt` | PyTorch | `energy_j` (Joules) | FT-Transformer; neural tabular model with feature tokenization + stage embedding |
| `ftt_time.pt` | PyTorch | - | - |
| `kalman_transformer_energy.pt` | PyTorch | `total_energy_j` (whole pipeline, cold start) | FT-Transformer trained on Kalman filter trajectories for cold whole-run prediction |
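To make the physics baseline concrete, the following is a minimal sketch of a per-stage fit of `E = c0 + c1*n` with closed-form ordinary least squares, followed by a whole-pipeline prediction obtained by summing the stages. It is not the code shipped in `gptnl_energy`: the function names, the two stages shown, and all calibration numbers are invented for illustration.

```python
from statistics import mean


def fit_line(ns, energies):
    """Closed-form OLS for E = c0 + c1 * n; returns (c0, c1)."""
    n_bar, e_bar = mean(ns), mean(energies)
    sxx = sum((n - n_bar) ** 2 for n in ns)
    sxy = sum((n - n_bar) * (e - e_bar) for n, e in zip(ns, energies))
    c1 = sxy / sxx
    c0 = e_bar - c1 * n_bar
    return c0, c1


# Hypothetical calibration runs: (documents processed, measured energy in J).
# These values are made up and do not come from the measurement data.
sample_runs = {
    "string_normalization": ([10_000, 20_000, 40_000], [1_500.0, 2_500.0, 4_500.0]),
    "heuristic_filtering": ([10_000, 20_000, 40_000], [3_000.0, 5_000.0, 9_000.0]),
}

# One (c0, c1) pair per pipeline stage.
fits = {stage: fit_line(ns, es) for stage, (ns, es) in sample_runs.items()}


def predict_total_j(n, stage_fits):
    """Sum the per-stage linear predictions for a corpus of n documents."""
    return sum(c0 + c1 * n for c0, c1 in stage_fits.values())


total_j = predict_total_j(400_000, fits)
print(f"{total_j / 1e6:.2f} MJ")
```

Fitting each stage separately keeps the coefficients interpretable: `c0` acts as a fixed per-run overhead and `c1` as the marginal energy per document for that stage.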
Methodology
Three prediction layers:
- Physics: Per-stage linear model E = c0 + c1·n (OLS, calibrated from sample runs)
- Learned g: Coefficient predictor for unseen corpora (GBM, MLP, FT-Transformer)
- Kalman filter (EKF): Online estimator blending model prediction with live telemetry
See the GPT-NL Energy repo for the full pipeline.
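The idea of blending a model prediction with live telemetry can be illustrated with a scalar, linear Kalman filter. This is a deliberate simplification and not the extended Kalman filter from the thesis: the state is a single per-document energy rate assumed to be constant, and the class name, noise variances, and telemetry readings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    x: float  # current estimate, e.g. joules per document
    p: float  # variance of the estimate
    q: float  # process noise variance added at each step
    r: float  # measurement noise variance of the telemetry

    def step(self, z: float) -> float:
        """Run one predict/update cycle with measurement z."""
        # Predict: the state is modelled as constant, so only uncertainty grows.
        self.p += self.q
        # Update: the Kalman gain weights the measurement against the prior.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= 1.0 - k
        return self.x


# Prior from an offline model (0.30 J/doc), then three made-up telemetry
# readings that consistently suggest a lower rate (0.20 J/doc).
kf = ScalarKalman(x=0.30, p=0.01, q=0.0, r=0.01)
for z in [0.20, 0.20, 0.20]:
    kf.step(z)

print(f"estimate: {kf.x:.3f} J/doc, variance: {kf.p:.4f}")
```

With each reading the estimate moves from the prior toward the telemetry and its variance shrinks, which is the behaviour an online estimator needs when the offline model is miscalibrated for a new corpus.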
Source and thesis
- Code and data: kruuusher13/gptnl-energy-estimation-ekf
- Thesis: GPT-NL: Complexity-Aware Energy Estimation of the Data Curation Pipeline via Extended Kalman Filtering, R. Malik, MSc Applied Data Science, Utrecht University, 2026 (in the repository under paper/thesis.pdf)
- Pipeline this predicts: GPT-NL data-curation-pipeline
All coefficients and evaluation numbers can be regenerated from the measurement data with the scripts in paper/code/.