How to use feyninc/pulpie-orange-large with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="feyninc/pulpie-orange-large", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("feyninc/pulpie-orange-large", trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained("feyninc/pulpie-orange-large", trust_remote_code=True)
```
Pulpie Orange Large
Pareto-optimal main-content extraction from HTML.
2.1B-parameter encoder · 0.873 ROUGE-5 F1 on WebMainBench · the highest-quality Pulpie model (teacher).
Pulpie Orange Large extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It is an encoder that labels every HTML block as content or boilerplate in a single forward pass, so it approaches state-of-the-art extraction quality while running far faster and cheaper than autoregressive extractors.
At 2.1B parameters it is the strongest single model in the family, scoring 0.873 ROUGE-5 F1, ahead of Dripper (0.864) at a third of the memory-bandwidth cost. It is the teacher from which Orange Base and Orange Small are distilled. Use it when you want maximum quality; for most production workloads, Orange Small (0.862) comes within about 1 F1 point at a fraction of the cost.
Usage
The easiest way to use this model is through the pulpie package:
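```shell
pip install pulpie
```

```python
from pulpie import Extractor

# Any raw HTML string works here; this tiny page is only a placeholder.
html = "<html><body><nav>Menu</nav><article><p>Hello, world.</p></article></body></html>"

extractor = Extractor(model="orange-large")
result = extractor.extract(html)

print(result.markdown)                # clean Markdown
print(result.html)                    # clean HTML
print(result.n_main, result.n_other)  # blocks kept vs dropped
```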
Extractor auto-detects CUDA, Apple MPS, then CPU. See the GitHub README for batch and multi-GPU usage.
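The device preference order described above can be sketched roughly as follows. This is not taken from the pulpie source; `pick_device` is a hypothetical helper, and it assumes the library relies on PyTorch's standard availability checks.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", in that order of preference."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so fall back to CPU
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # missing on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Checking CUDA first means a machine with an NVIDIA GPU never falls through to the slower backends.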
How it works
Pulpie runs a four-stage pipeline:
- Simplify: remove scripts, styles, and formatting noise; tag each block with a unique ID.
- Chunk: pack blocks into sequences of up to 8,192 tokens separated by <|sep|> markers (~80% of pages fit in one chunk).
- Classify: a single encoder forward pass labels every block (at its <|sep|> position) as content or boilerplate.
- Reconstruct: return the kept blocks as HTML, or convert to Markdown.
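The Chunk and Reconstruct stages can be illustrated with a minimal sketch. It assumes a greedy packing strategy and a whitespace word count standing in for the real tokenizer; `count_tokens`, `pack_blocks`, `render_chunk`, and `keep_main` are hypothetical names, not part of the pulpie API, and the actual packing logic may differ.

```python
from typing import List

SEP = "<|sep|>"  # separator marker named in the pipeline description


def count_tokens(text: str) -> int:
    # Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def pack_blocks(blocks: List[str], max_tokens: int = 8192) -> List[List[str]]:
    """Greedily pack blocks into chunks of at most max_tokens tokens,
    counting one separator token per block."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for block in blocks:
        cost = count_tokens(block) + 1  # +1 for the separator after the block
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(block)  # an oversized block still gets a chunk of its own
        used += cost
    if current:
        chunks.append(current)
    return chunks


def render_chunk(chunk: List[str]) -> str:
    # Each block is followed by a separator; the classifier reads its label there.
    return " ".join(f"{block} {SEP}" for block in chunk)


def keep_main(chunk: List[str], labels: List[int]) -> List[str]:
    # labels[i] is the predicted class at block i's separator (1 = content, 0 = boilerplate).
    return [block for block, label in zip(chunk, labels) if label == 1]


chunks = pack_blocks(["a b c", "d e", "f g h i"], max_tokens=7)
print(chunks)
print(render_chunk(chunks[0]))
print(keep_main(["nav", "article", "footer"], [0, 1, 0]))
```

Because every block carries exactly one separator, the number of labels per chunk equals the number of blocks, which keeps reconstruction a simple filter.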
This model is a token-classification head over EuroBERT-2.1B, fine-tuned with class-weighted cross-entropy on 14,959 Common Crawl pages with block-level labels (labeled by DeepSeek V3.2 and cross-validated against Dripper).
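As a rough illustration of class-weighted cross-entropy, the sketch below scales each block's loss by the weight of its true class and divides by the total weight, which is the normalization PyTorch's `CrossEntropyLoss` applies when given `weight=` with mean reduction. The weights and logits are made-up values; the actual training weights are not stated here.

```python
import math
from typing import List, Sequence


def weighted_cross_entropy(
    logits: List[Sequence[float]],
    targets: List[int],
    class_weights: Sequence[float],
) -> float:
    """Weighted mean of per-block negative log-likelihoods."""
    total_loss = 0.0
    total_weight = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max before exponentiating, for stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_norm - row[target]
        total_loss += class_weights[target] * nll
        total_weight += class_weights[target]
    return total_loss / total_weight


# Two blocks: one boilerplate (class 0) predicted well, one content (class 1) predicted poorly.
logits = [[2.0, 0.0], [0.0, 0.0]]
targets = [0, 1]
print(weighted_cross_entropy(logits, targets, [1.0, 1.0]))  # unweighted baseline
print(weighted_cross_entropy(logits, targets, [1.0, 3.0]))  # hypothetical weights favouring class 1
```

Raising the weight of one class makes mistakes on that class count for more, which is useful when content and boilerplate blocks are imbalanced.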
Benchmarks
WebMainBench, English subset (6,647 pages), ROUGE-5 F1:
| Model | Params | ROUGE-5 F1 | Throughput on L4 (pages/sec) |
|---|---|---|---|
| Pulpie Orange Large (this model) | 2.1B | 0.873 | 1.3 |
| Dripper | 0.6B | 0.864 | 0.68 |
| Pulpie Orange Base | 610M | 0.863 | 3.9 |
| Pulpie Orange Small | 210M | 0.862 | 13.7 |
| magic-html | - | 0.700 | - |
| Trafilatura | - | 0.619 | - |
Full analysis in the blog post.
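For readers unfamiliar with the metric, here is a minimal sketch of an n-gram overlap F1. It assumes ROUGE-5 means ROUGE-N with 5-grams and uses plain whitespace tokenization; the benchmark's own tokenizer and text normalization may differ, so scores from this sketch are not comparable to the table.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    # Multiset of all contiguous n-token windows; empty if the text is shorter than n.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction: str, reference: str, n: int = 5) -> float:
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "a b c d e f g"
print(rouge_n_f1("a b c d e f g", reference))  # identical extraction
print(rouge_n_f1("a b c d e f x", reference))  # last token differs
```

With 5-grams, a single wrong token invalidates every window that contains it, so the metric penalizes missing or spurious blocks more sharply than unigram overlap would.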
Model family
| Model | Params | ROUGE-5 F1 | Use case |
|---|---|---|---|
| pulpie-orange-small | 210M | 0.862 | Recommended — best value, fastest |
| pulpie-orange-base | 610M | 0.863 | Balanced |
| pulpie-orange-large | 2.1B | 0.873 | Highest quality (teacher) |
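To make the quality versus speed trade-off concrete, the sketch below computes which models are Pareto-optimal using the ROUGE-5 F1 and L4 throughput figures from the benchmark table. magic-html and Trafilatura are left out because no throughput is reported for them; `pareto_front` is an illustrative helper, not part of pulpie.

```python
from typing import Dict, List, Tuple

# (ROUGE-5 F1, pages/sec on L4), copied from the benchmark table
MODELS: Dict[str, Tuple[float, float]] = {
    "Pulpie Orange Large": (0.873, 1.3),
    "Dripper": (0.864, 0.68),
    "Pulpie Orange Base": (0.863, 3.9),
    "Pulpie Orange Small": (0.862, 13.7),
}


def pareto_front(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return models that no other model beats or ties on both axes
    while being strictly better on at least one."""
    front = []
    for name, (f1, speed) in models.items():
        dominated = any(
            o_f1 >= f1 and o_speed >= speed and (o_f1 > f1 or o_speed > speed)
            for other, (o_f1, o_speed) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(MODELS))
# ['Pulpie Orange Large', 'Pulpie Orange Base', 'Pulpie Orange Small']
```

On these numbers, Dripper is dominated by Orange Large, which is both more accurate and faster, while each Pulpie model trades some F1 for throughput.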
Acknowledgements
Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. Built on EuroBERT (Boizard et al., 2025).
Citation
@misc{pulpie2026,
  title = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
  author = {Minhas, Bhavnick and Nigam, Shreyash and {Feyn Research}},
  year = {2026},
  howpublished = {Feyn Field Notes}
}
Built by Feyn. Model weights and the pulpie library are licensed under Apache 2.0.