---
library_name: transformers
license: apache-2.0
datasets:
- SIRIS-Lab/erc-classification-dataset
base_model:
- allenai/specter2_base
pipeline_tag: text-classification
---
# ERC Panels Classifier
This model is a fine-tuned version of **allenai/specter2_base** for multilabel scientific domain classification aligned with the **ERC (European Research Council) panel taxonomy**.
It achieves the following results on the held-out evaluation set (final training epoch):
- **Best validation loss:** 0.0361
- **Micro F1:** 0.9386
- **Micro ROC-AUC:** 0.9718
- **Subset accuracy:** 0.7943
---
## Model description
This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels.
The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**.
Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels.
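To make the objective concrete, the following minimal sketch computes a per-label binary cross-entropy from raw logits in plain Python. The three-label logits and targets are made-up values for illustration, not outputs of this model, and the helper names are hypothetical; actual training relies on PyTorch's `BCEWithLogitsLoss` over all 28 labels.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits: list[float], targets: list[int]) -> float:
    """Mean binary cross-entropy over labels, each label scored independently."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Hypothetical logits for three labels; the document belongs to labels 0 and 2.
logits = [2.0, -3.0, 0.5]
targets = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
loss = bce_with_logits(logits, targets)
print([round(p, 3) for p in probs], round(loss, 3))
```

Because each probability comes from its own sigmoid, the scores are not forced to sum to one (as they would be with softmax), so several labels can be confidently positive for the same document.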
### Key characteristics
- **Base model:** allenai/specter2_base
- **Task:** multilabel document classification
- **Labels:** 28 ERC scientific panels
- **Activation:** sigmoid (independent scores per label)
- **Loss:** BCEWithLogitsLoss
- **Output:** list of predicted panels with associated probabilities
- **Decision threshold:** 0.5 (tunable)
This model enables automatic research-domain tagging aligned with the ERC panel structure.
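Since the decision threshold is tunable, one common way to choose it is to sweep a few candidate values on validation data and keep the one with the highest micro F1. The sketch below illustrates that idea on made-up probabilities; the function names and the candidate grid are assumptions for illustration, not the procedure used to build this model.

```python
def micro_f1_at(y_true, y_prob, threshold):
    """Micro-averaged F1 after binarising probabilities at `threshold`."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            pred = int(p >= threshold)
            tp += int(pred == 1 and t == 1)
            fp += int(pred == 1 and t == 0)
            fn += int(pred == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate with the highest micro F1 (the first one on ties)."""
    return max(candidates, key=lambda th: micro_f1_at(y_true, y_prob, th))


# Toy validation data: 4 documents x 3 labels (made-up values).
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_prob = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.45], [0.7, 0.4, 0.2], [0.1, 0.35, 0.6]]

candidates = [0.3, 0.4, 0.5, 0.6]
chosen = best_threshold(y_true, y_prob, candidates)
print(chosen, round(micro_f1_at(y_true, y_prob, chosen), 3))
```

A lower threshold trades precision for recall, which may help underrepresented panels; a threshold chosen this way should be checked on data that was not used to select it.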
---
## Intended uses & limitations
### Intended uses
This model is designed for:
- Automatic assignment of ERC research panels
- Metadata enrichment for:
- research project databases
- institutional repositories
- funding and grant analysis pipelines
- Large-scale analytics such as:
- portfolio mapping
- thematic analysis of research outputs
- monitoring disciplinary coverage of funded projects
- Predicting subject areas for documents lacking structured domain metadata
The model accepts any of the following inputs:
- title only
- abstract only
- **title + abstract (recommended)**
### Limitations
- ERC panels are **high-level categories** and do not represent fine-grained subdisciplines
- Labels are derived from curated datasets and semi-automatically annotated data
- Class imbalance may affect recall for underrepresented panels
- The model does not encode explicit hierarchical relationships between panels
Not suited for:
- fine-grained subfield classification
- journal recommendation
- evaluation of research quality or impact
- clinical, legal, or regulatory decision-making
Predictions should be treated as **supportive metadata**, not authoritative classifications.
---
## How to use
```
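from transformers import pipeline

# Hub repository for this model card; point it at your own copy if needed
MODEL_NAME = "SIRIS-Lab/erc-classifiers"

# top_k=None returns a score for every panel; sigmoid keeps the scores independent (multilabel)
classifier = pipeline(task="text-classification", model=MODEL_NAME, tokenizer=MODEL_NAME, top_k=None, function_to_apply="sigmoid")

text = ["Climate change impacts on Arctic ecosystems."]
# Truncate long inputs to the 512-token limit used during training
predictions = classifier(text, truncation=True, max_length=512)
print(predictions)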
```
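The pipeline returns scores rather than a final list of panels, so a small post-processing step is usually needed. The sketch below assumes the pipeline was asked for all label scores (for example with `top_k=None`), in which case each document yields a list of `{"label", "score"}` dictionaries. The helper name `select_panels` and the example scores are made up for illustration.

```python
def select_panels(scores, threshold=0.5):
    """Keep the labels whose score reaches the threshold, highest score first."""
    kept = [s for s in scores if s["score"] >= threshold]
    return sorted(kept, key=lambda s: s["score"], reverse=True)


# Made-up scores for one document, in the list-of-dicts shape described above.
example_scores = [
    {"label": "Environmental Biology, Ecology and Evolution", "score": 0.64},
    {"label": "Earth System Science", "score": 0.91},
    {"label": "Mathematics", "score": 0.02},
]

for item in select_panels(example_scores):
    print(f'{item["label"]}: {item["score"]:.2f}')
```

With the default threshold of 0.5 a document can end up with no panel at all; whether to leave it unlabelled or fall back to the top-scoring panel is an application-level choice.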
---
## Training and evaluation data
### Training data
- Scientific documents with ERC-style panel annotations
- Inputs:
- title
- abstract
- Task type: **multilabel classification**
### Dataset characteristics
| Property | Value |
|--------|------|
| Documents | ~40k |
| Labels | 28 panels |
| Input fields | Title, Abstract |
| Task type | Multilabel |
| License | Dataset-dependent |
---
## Training procedure
### Preprocessing
- Input text constructed as:
`title + ". " + abstract`
- Tokenization using the SPECTER2 tokenizer
- Maximum sequence length: **512 tokens**
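The input construction above can be sketched as a small helper. The concatenation rule follows the description in this section; the handling of a missing title or abstract (falling back to whichever field is present) is an assumption for illustration, and `build_input` is a hypothetical name.

```python
def build_input(title=None, abstract=None):
    """Join the fields as title + ". " + abstract, tolerating a missing field."""
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    # Assumption: with only one field available, use it on its own.
    return title or abstract


record = {
    "title": "Climate change impacts on Arctic ecosystems",
    "abstract": "We review observed shifts in species ranges.",
}
text = build_input(record["title"], record["abstract"])
print(text)
```

The 512-token limit is applied at tokenization time, typically by passing `truncation=True, max_length=512` to the tokenizer, so text beyond that length does not influence the prediction.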
### Model
- Base model: `allenai/specter2_base`
- Classification head: linear → sigmoid
- Loss function: BCEWithLogitsLoss
- Predictions: independent probability per label
### Training hyperparameters
| Hyperparameter | Value |
|--------------|------|
| Learning rate | 2e-5 |
| Train batch size | 16 |
| Eval batch size | 16 |
| Epochs | 6 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Metric for best model | Micro F1 |
---
## Training results
| Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset accuracy |
|------|---------------|-----------------|----------|---------|----------|
| 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
| 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
| 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
| 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
| 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
| 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** |
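The accuracy values in this table are consistently lower than micro F1 because they measure subset accuracy (the metric named in the summary at the top of this card): a document counts as correct only if its entire predicted label set matches exactly. The toy example below, with made-up labels and hypothetical helper names, shows how a single missed or extra label affects the two metrics differently.

```python
def subset_accuracy(y_true, y_pred):
    """Share of documents whose full predicted label set matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed over all (document, label) decisions pooled together."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up multi-hot labels: 4 documents x 3 labels.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
# Document 1 misses one label and document 3 has one extra label.
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]]

print(subset_accuracy(y_true, y_pred), round(micro_f1(y_true, y_pred), 3))
```

Two of the four documents contain a single label error, which halves subset accuracy while micro F1 stays above 0.8, since most individual label decisions are still correct.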
---
## Evaluation results (multilabel test set)
| Panel | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
| Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
| Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
| Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
| Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
| Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
| Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
| Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
| Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
| Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
| Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
| Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
| Mathematics | 1.00 | 1.00 | 1.00 | 36 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
| Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
| Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
| Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
| Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
| Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
| Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
| Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
| Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
| The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
| The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
| The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
| Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |
**Overall performance**
| Average | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** |
| **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** |
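The macro and weighted averages differ only in how panels are weighted: macro treats all 28 panels equally, while weighted scales each panel by its support. As a rough check, the sketch below recomputes both F1 averages from the rounded per-panel values in the table above; since the reported averages were presumably computed from unrounded scores, small discrepancies in the last digit are possible.

```python
# (F1-score, support) for the 28 panels, in the order of the per-panel table.
rows = [
    (0.78, 30), (0.96, 54), (0.97, 95), (0.98, 68), (0.96, 64), (0.94, 54), (0.95, 32),
    (0.81, 21), (0.99, 40), (0.96, 48), (0.91, 26), (0.94, 49), (0.87, 75), (1.00, 36),
    (0.96, 111), (1.00, 30), (0.91, 94), (0.97, 34), (0.96, 68), (0.93, 109), (0.88, 9),
    (0.79, 47), (0.95, 87), (0.90, 14), (0.97, 30), (0.96, 34), (0.91, 17), (1.00, 25),
]

total_support = sum(support for _, support in rows)

# Macro average: every panel counts the same, regardless of size.
macro_f1 = sum(f1 for f1, _ in rows) / len(rows)

# Weighted average: each panel contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows) / total_support

print(total_support, round(macro_f1, 4), round(weighted_f1, 4))
```

The total support matches the 1401 label assignments in the table, and both recomputed averages agree with the reported values to within rounding. Small panels such as Studies of Cultures and Arts (support 9) pull the macro average down more than the weighted one.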
---
## ERC-funded projects evaluation (multiclass recall)
This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**.
Only **recall** is reported.
| Panel | Recall |
|------|--------|
| Biotechnology and Biosystems Engineering | 0.26 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
| Computer Science and Informatics | 1.00 |
| Condensed Matter Physics | 0.77 |
| Earth System Science | 0.92 |
| Environmental Biology, Ecology and Evolution | 0.85 |
| Fundamental Constituents of Matter | 0.84 |
| Human Mobility, Environment, and Space | 0.61 |
| Immunity, Infection and Immunotherapy | 0.83 |
| Individuals, Markets and Organisations | 0.96 |
| Institutions, Governance and Legal Systems | 0.58 |
| Integrative Biology: from Genes and Genomes to Systems | 0.73 |
| Materials Engineering | 0.75 |
| Mathematics | 0.96 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
| Neuroscience and Disorders of the Nervous System | 0.92 |
| Physical and Analytical Chemical Sciences | 0.83 |
| Physiology in Health, Disease and Ageing | 0.60 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
| Products and Processes Engineering | 0.58 |
| Studies of Cultures and Arts | 0.27 |
| Synthetic Chemistry and Materials | 0.67 |
| Systems and Communication Engineering | 0.75 |
| Texts and Concepts | 0.62 |
| The Human Mind and Its Complexity | 0.85 |
| The Social World and Its Interactions | 0.73 |
| The Study of the Human Past | 0.83 |
| Universe Sciences | 1.00 |
**Overall recall**
- **Micro recall:** 0.77
- **Macro recall:** 0.76
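This card does not spell out how a multilabel prediction is scored against a single gold panel. One plausible reading, used in the sketch below purely as an assumption, is that a project counts as a hit when its gold panel appears among the predicted panels. The projects, predictions, and function name are made up for illustration.

```python
from collections import defaultdict


def per_panel_recall(gold_panels, predicted_sets):
    """Recall per gold panel: share of its projects whose gold panel was predicted."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gold, predicted in zip(gold_panels, predicted_sets):
        totals[gold] += 1
        hits[gold] += int(gold in predicted)
    return {panel: hits[panel] / totals[panel] for panel in totals}


# Made-up example: 5 projects, each with one gold panel and a predicted label set.
gold = ["Mathematics", "Mathematics", "Universe Sciences",
        "Texts and Concepts", "Texts and Concepts"]
pred = [{"Mathematics"}, {"Mathematics", "Computer Science and Informatics"},
        {"Universe Sciences"}, {"The Study of the Human Past"}, {"Texts and Concepts"}]

recall = per_panel_recall(gold, pred)
micro_recall = sum(g in p for g, p in zip(gold, pred)) / len(gold)  # pooled over projects
macro_recall = sum(recall.values()) / len(recall)                   # mean over panels
print(recall, micro_recall, round(macro_recall, 3))
```

Under this rule extra predicted panels are not penalised, which is consistent with reporting recall only; micro recall pools all projects, while macro recall gives each panel equal weight.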
## Citation
```
@inproceedings{bovenzi2022mapping,
title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
pages={495--499},
year={2022},
publisher={Springer International Publishing}
}
```
---
## Framework versions
- **Transformers:** 4.57.x
- **PyTorch:** 2.8.0
- **Datasets:** 3.x
- **Tokenizers:** 0.22.x