Pulpie Orange Small

Pareto-optimal main-content extraction from HTML.
210M-parameter encoder · 0.862 ROUGE-5 F1 on WebMainBench · the recommended, default Pulpie model.

GitHub · Blog · PyPI


Pulpie Orange Small extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It is an encoder that labels every HTML block as content or boilerplate in a single forward pass, so it approaches state-of-the-art extraction quality while running far faster and cheaper than autoregressive extractors.

At 210M parameters it comes within 0.002 ROUGE-5 F1 of Dripper (0.6B) on WebMainBench (0.862 vs. 0.864) while running about 20x faster on an L4 GPU (13.7 vs. 0.68 pages/sec). It has the best size-to-quality ratio in the Pulpie family and is the recommended model for production.

Usage

The easiest way to use this model is through the pulpie package, which handles HTML simplification, chunking, classification, and reconstruction:

pip install pulpie

from pulpie import Extractor

extractor = Extractor()               # defaults to pulpie-orange-small
result = extractor.extract(html)

print(result.markdown)                # clean Markdown
print(result.html)                    # clean HTML
print(result.n_main, result.n_other)  # blocks kept vs dropped

Extractor selects the device automatically, preferring CUDA, then Apple MPS, then CPU. See the GitHub README for batch and multi-GPU usage.

How it works

Pulpie runs a four-stage pipeline:

  1. Simplify — remove scripts, styles, and formatting noise; tag each block with a unique ID.
  2. Chunk — pack blocks into sequences of up to 8,192 tokens separated by <|sep|> markers (~80% of pages fit in one chunk).
  3. Classify — a single encoder forward pass labels every block (at its <|sep|> position) as content or boilerplate.
  4. Reconstruct — return the kept blocks as HTML, or convert to Markdown.
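
The Chunk stage can be pictured as greedy packing. The sketch below is illustrative only and is not taken from the pulpie source: `pack_blocks` and `SEP_COST` are hypothetical names, the per-block token counts are assumed to be precomputed, and each <|sep|> marker is assumed to cost one token. A block that exceeds the budget on its own is simply placed in its own chunk here; the real library may split or truncate it instead.

```python
SEP_COST = 1  # assumption: each <|sep|> marker occupies one token


def pack_blocks(block_token_counts, max_tokens=8192):
    """Greedily pack blocks, in document order, into chunks of block indices.

    A new chunk starts whenever adding the next block (plus its separator)
    would exceed max_tokens.
    """
    chunks, current, used = [], [], 0
    for block_id, n_tokens in enumerate(block_token_counts):
        cost = n_tokens + SEP_COST
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(block_id)
        used += cost
    if current:
        chunks.append(current)
    return chunks


print(pack_blocks([4000, 4000, 500, 100]))  # [[0, 1], [2, 3]]
```

Because each block keeps its index, the per-block labels produced by the Classify stage can be mapped back to the original blocks during reconstruction.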

This model is a token-classification head over EuroBERT-210m, distilled from the 2.1B Pulpie Orange Large teacher with a loss of 0.7 × KL divergence + 0.3 × hard-label cross-entropy at temperature 2.0.
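
As a rough illustration of that distillation objective, the following pure-Python sketch computes the loss for a single block with two classes (content vs. boilerplate). It is one plausible reading of the stated weights and temperature, not the training code: the function names are hypothetical, the KL direction (teacher to student) is assumed, and the common T² scaling of the soft term is left out because the source does not say whether it is used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.7, temperature=2.0):
    """alpha * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    # Soft-target term: both distributions are softened by the temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q)
             for p, q in zip(p_teacher, p_student) if p > 0)
    # Hard-label term: ordinary cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * kl + (1 - alpha) * ce


# A student that already agrees with the teacher only pays the hard-label term.
print(distillation_loss([0.0, 0.0], [0.0, 0.0], hard_label=0))
print(distillation_loss([0.0, 0.0], [2.0, -2.0], hard_label=0))
```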

Benchmarks

WebMainBench, English subset (6,647 pages), ROUGE-5 F1:

| Model | Params | ROUGE-5 F1 | Throughput (L4) |
|---|---|---|---|
| Pulpie Orange Large | 2.1B | 0.873 | 1.3 pages/sec |
| Dripper | 0.6B | 0.864 | 0.68 pages/sec |
| Pulpie Orange Base | 610M | 0.863 | 3.9 pages/sec |
| Pulpie Orange Small (this model) | 210M | 0.862 | 13.7 pages/sec |
| magic-html | - | 0.700 | - |
| Trafilatura | - | 0.619 | - |

Cleaning 1 billion pages on an L4 costs ~$7,900 with Pulpie Orange Small versus ~$159,000 with Dripper. Full analysis in the blog post.
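
The arithmetic behind those cost figures can be reproduced from the throughput column. The sketch below assumes an L4 price of about $0.39 per GPU-hour; that rate is not stated in this card and is back-derived from the quoted totals, so treat it as an assumption. `gpu_cost_usd` is a hypothetical helper.

```python
def gpu_cost_usd(n_pages, pages_per_sec, usd_per_gpu_hour):
    """Cost of processing n_pages on one GPU at a given throughput."""
    gpu_hours = n_pages / pages_per_sec / 3600
    return gpu_hours * usd_per_gpu_hour


L4_USD_PER_HOUR = 0.39  # assumption: inferred from the quoted totals, not from the source
N_PAGES = 1_000_000_000

small = gpu_cost_usd(N_PAGES, 13.7, L4_USD_PER_HOUR)    # Pulpie Orange Small
dripper = gpu_cost_usd(N_PAGES, 0.68, L4_USD_PER_HOUR)  # Dripper

print(f"Pulpie Orange Small: ${small:,.0f}")
print(f"Dripper:             ${dripper:,.0f}")
print(f"Cost ratio:          {dripper / small:.1f}x")
```

Under that assumed rate the totals land near $7,900 and $159,000. The cost ratio equals the throughput ratio (13.7 / 0.68, about 20x) regardless of the hourly price.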

Model family

| Model | Params | ROUGE-5 F1 | Use case |
|---|---|---|---|
| pulpie-orange-small | 210M | 0.862 | Recommended: best value, fastest |
| pulpie-orange-base | 610M | 0.863 | Balanced |
| pulpie-orange-large | 2.1B | 0.873 | Highest quality (teacher) |

Acknowledgements

Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. Built on EuroBERT (Boizard et al., 2025).

Citation

@note{pulpie2026,
  title  = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
  author = {Minhas, Bhavnick and Nigam, Shreyash and Feyn Research},
  year   = {2026},
  venue  = {Feyn Field Notes}
}

Built by Feyn. Model weights licensed under CC BY-NC 4.0 (non-commercial); contact team@usefeyn.com for commercial licensing. The pulpie library is Apache 2.0.

Weights: Safetensors, 0.2B parameters, BF16.