| --- |
| library_name: transformers |
| license: apache-2.0 |
| datasets: |
| - SIRIS-Lab/erc-classification-dataset |
| base_model: |
| - allenai/specter2_base |
| pipeline_tag: text-classification |
| --- |
| |
| # ERC Panels Classifier |
|
|
| This model is a fine-tuned version of **allenai/specter2_base** for multilabel scientific-domain classification aligned with the **ERC panel taxonomy**. |
| It reaches the following results at the end of training (epoch 6 in the training results table): |
| |
| - **Best validation loss:** 0.0361 |
| - **Micro F1:** 0.9386 |
| - **Micro ROC-AUC:** 0.9718 |
| - **Subset accuracy:** 0.7943 |
| |
| --- |
| |
| ## Model description |
| |
| This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels. |
|
|
| The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**. |
| Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels. |
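
As a minimal sketch of what sigmoid activation with binary cross-entropy means in practice, the pure-Python snippet below scores three hypothetical panels independently and computes the mean loss from raw logits. The logits and targets are invented for illustration, and the functions only mirror the arithmetic behind a sigmoid plus BCE-with-logits setup; they are not the training code of this model.

```python
import math


def sigmoid(logit):
    """Map a raw logit to an independent score in (0, 1), avoiding overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Hypothetical logits for three panels; the document belongs to the first two.
logits = [2.0, 1.0, -3.0]
targets = [1.0, 1.0, 0.0]

scores = [sigmoid(x) for x in logits]
loss = bce_with_logits(logits, targets)
print(scores, loss)
```

Because each score is computed on its own, the scores do not have to sum to 1 (here two of them exceed 0.5 at once), which is what lets a single document receive several panels.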
|
|
| ### Key characteristics |
|
|
| - **Base model:** allenai/specter2_base |
| - **Task:** multilabel document classification |
| - **Labels:** 28 ERC scientific panels |
| - **Activation:** sigmoid (independent scores per label) |
| - **Loss:** BCEWithLogitsLoss |
| - **Output:** list of predicted panels with associated probabilities |
| - **Decision threshold:** 0.5 (tunable) |
| |
| This model enables automatic research-domain tagging aligned with the ERC panel structure. |
| |
| --- |
| |
| ## Intended uses & limitations |
| |
| ### Intended uses |
| |
| This model is designed for: |
| |
| - Automatic assignment of ERC research panels |
| - Metadata enrichment for: |
| - research project databases |
| - institutional repositories |
| - funding and grant analysis pipelines |
| - Large-scale analytics such as: |
| - portfolio mapping |
| - thematic analysis of research outputs |
| - monitoring disciplinary coverage of funded projects |
| - Predicting subject areas for documents lacking structured domain metadata |
| |
| The model accepts the following inputs: |
| |
| - title only |
| - abstract only |
| - **title + abstract (recommended)** |
| |
| ### Limitations |
| |
| - ERC panels are **high-level categories** and do not represent fine-grained subdisciplines |
| - Labels are derived from curated datasets and semi-automatically annotated data |
| - Class imbalance may affect recall for underrepresented panels |
| - The model does not encode explicit hierarchical relationships between panels |
| |
| Not suited for: |
| |
| - fine-grained subfield classification |
| - journal recommendation |
| - evaluation of research quality or impact |
| - clinical, legal, or regulatory decision-making |
| |
| Predictions should be treated as **supportive metadata**, not authoritative classifications. |
| |
| --- |
| |
| ## How to use |
| |
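```python
from transformers import pipeline

# Replace with your own model repo name on the Hugging Face Hub
MODEL_NAME = "nicolauduran45/erc_classifier_demo"

# top_k=None returns a score for every panel instead of only the top one;
# function_to_apply="sigmoid" keeps the scores independent per label.
classifier = pipeline(
    task="text-classification",
    model=MODEL_NAME,
    tokenizer=MODEL_NAME,
    top_k=None,
    function_to_apply="sigmoid",
)

# Recommended input: title + ". " + abstract
text = ["Climate change impacts on Arctic ecosystems."]

# One list of {"label", "score"} dicts per input, truncated to 512 tokens
scores = classifier(text, truncation=True, max_length=512)

# Keep the panels whose score reaches the 0.5 decision threshold (tunable)
predicted = [[p for p in doc if p["score"] >= 0.5] for doc in scores]
print(predicted)
```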
| --- |
|
|
| ## Training and evaluation data |
|
|
| ### Training data |
|
|
| - Scientific documents with ERC-style panel annotations |
| - Inputs: |
| - title |
| - abstract |
| - Task type: **multilabel classification** |
|
|
| ### Dataset characteristics |
|
|
| | Property | Value | |
| |--------|------| |
| | Documents | ~40k | |
| | Labels | 28 panels | |
| | Input fields | Title, Abstract | |
| | Task type | Multilabel | |
| | License | Dataset-dependent | |
|
|
| --- |
|
|
| ## Training procedure |
|
|
| ### Preprocessing |
|
|
| - Input text constructed as: |
|
|
| `title + ". " + abstract` |
|
|
| - Tokenization using the SPECTER2 tokenizer |
| - Maximum sequence length: **512 tokens** |
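
The input construction above can be sketched as a small helper. The function name `build_input` and the fallback to a single field when the other is missing are assumptions made for illustration; the card only specifies the `title + ". " + abstract` form.

```python
from typing import Optional


def build_input(title: Optional[str] = None, abstract: Optional[str] = None) -> str:
    """Join title and abstract as `title + ". " + abstract`.

    Falling back to whichever field is present is an assumption of this
    sketch, chosen to cover the title-only and abstract-only cases.
    """
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    return title or abstract


example = build_input(
    "Climate change impacts on Arctic ecosystems",
    "We review observed shifts in species ranges.",
)
print(example)
```

The resulting string would then be passed to the tokenizer with truncation enabled and a maximum length of 512 tokens.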
|
|
| ### Model |
|
|
| - Base model: `allenai/specter2_base` |
| - Classification head: linear → sigmoid |
| - Loss function: BCEWithLogitsLoss |
| - Predictions: independent probability per label |
|
|
| ### Training hyperparameters |
|
|
| | Hyperparameter | Value | |
| |--------------|------| |
| | Learning rate | 2e-5 | |
| | Train batch size | 16 | |
| | Eval batch size | 16 | |
| | Epochs | 6 | |
| | Weight decay | 0.01 | |
| | Optimizer | AdamW | |
| | Metric for best model | Micro F1 | |
|
|
| --- |
|
|
| ## Training results |
|
|
| | Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset Accuracy | |
| |------|---------------|-----------------|----------|---------|----------| |
| | 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 | |
| | 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 | |
| | 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 | |
| | 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 | |
| | 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 | |
| | 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** | |
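
To make the gap between Micro F1 and (subset) accuracy in the table easier to read, here is a minimal pure-Python sketch of both metrics on toy multi-hot data. The three documents and four panels are invented; the functions follow the standard definitions and are not the evaluation code used for this model.

```python
def micro_f1(y_true, y_pred):
    """F1 from true/false positives pooled over all documents and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Share of documents whose predicted label set matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Toy multi-hot data: 3 documents x 4 hypothetical panels.
y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]

print(micro_f1(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

A single missed panel on the first document lowers micro F1 only slightly but counts the whole document as wrong for subset accuracy, which is why the accuracy column sits well below Micro F1.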
|
|
| --- |
|
|
| ## Evaluation results (multilabel test set) |
|
|
| | Panel | Precision | Recall | F1-score | Support | |
| |------|-----------|--------|----------|---------| |
| | Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 | |
| | Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 | |
| | Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 | |
| | Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 | |
| | Earth System Science | 0.94 | 0.98 | 0.96 | 64 | |
| | Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 | |
| | Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 | |
| | Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 | |
| | Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 | |
| | Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 | |
| | Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 | |
| | Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 | |
| | Materials Engineering | 0.81 | 0.93 | 0.87 | 75 | |
| | Mathematics | 1.00 | 1.00 | 1.00 | 36 | |
| | Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 | |
| | Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 | |
| | Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 | |
| | Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 | |
| | Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 | |
| | Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 | |
| | Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 | |
| | Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 | |
| | Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 | |
| | Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 | |
| | The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 | |
| | The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 | |
| | The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 | |
| | Universe Sciences | 1.00 | 1.00 | 1.00 | 25 | |
|
|
|
|
| **Overall performance** |
|
|
| | | Precision | Recall | F1-score | Support | |
| |------|-----------|--------|----------|---------| |
| | **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** | |
| | **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** | |
| | **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** | |
| | **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** | |
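
As a sanity check on how the macro and weighted averages relate to the per-panel rows, the sketch below recomputes them from the rounded F1-scores and supports in the table above. Since the inputs are already rounded to two decimals, the results are approximate.

```python
# (F1-score, support) per panel, copied in row order from the table above.
rows = [
    (0.78, 30), (0.96, 54), (0.97, 95), (0.98, 68), (0.96, 64), (0.94, 54), (0.95, 32),
    (0.81, 21), (0.99, 40), (0.96, 48), (0.91, 26), (0.94, 49), (0.87, 75), (1.00, 36),
    (0.96, 111), (1.00, 30), (0.91, 94), (0.97, 34), (0.96, 68), (0.93, 109), (0.88, 9),
    (0.79, 47), (0.95, 87), (0.90, 14), (0.97, 30), (0.96, 34), (0.91, 17), (1.00, 25),
]

total_support = sum(support for _, support in rows)

# Macro average: every panel counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in rows) / len(rows)

# Weighted average: each panel counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))  # prints: 1401 0.93 0.94
```

The macro average is pulled down more by small, harder panels such as Studies of Cultures and Arts (support 9), while the weighted average is dominated by the large panels.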
| |
| --- |
|
|
| ## ERC-funded projects evaluation (multiclass recall) |
|
|
| This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**. |
| Only **recall** is reported. |
|
|
| | Panel | Recall | |
| |------|--------| |
| | Biotechnology and Biosystems Engineering | 0.26 | |
| | Cell Biology, Development, Stem Cells and Regeneration | 0.81 | |
| | Computer Science and Informatics | 1.00 | |
| | Condensed Matter Physics | 0.77 | |
| | Earth System Science | 0.92 | |
| | Environmental Biology, Ecology and Evolution | 0.85 | |
| | Fundamental Constituents of Matter | 0.84 | |
| | Human Mobility, Environment, and Space | 0.61 | |
| | Immunity, Infection and Immunotherapy | 0.83 | |
| | Individuals, Markets and Organisations | 0.96 | |
| | Institutions, Governance and Legal Systems | 0.58 | |
| | Integrative Biology: from Genes and Genomes to Systems | 0.73 | |
| | Materials Engineering | 0.75 | |
| | Mathematics | 0.96 | |
| | Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 | |
| | Neuroscience and Disorders of the Nervous System | 0.92 | |
| | Physical and Analytical Chemical Sciences | 0.83 | |
| | Physiology in Health, Disease and Ageing | 0.60 | |
| | Prevention, Diagnosis and Treatment of Human Diseases | 0.94 | |
| | Products and Processes Engineering | 0.58 | |
| | Studies of Cultures and Arts | 0.27 | |
| | Synthetic Chemistry and Materials | 0.67 | |
| | Systems and Communication Engineering | 0.75 | |
| | Texts and Concepts | 0.62 | |
| | The Human Mind and Its Complexity | 0.85 | |
| | The Social World and Its Interactions | 0.73 | |
| | The Study of the Human Past | 0.83 | |
| | Universe Sciences | 1.00 | |
|
|
| **Overall recall** |
|
|
| - **Micro recall:** 0.77 |
| - **Macro recall:** 0.76 |
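
The card does not state how a single panel is derived from the multilabel scores for this evaluation. The sketch below assumes the highest-scoring panel is taken as the prediction and shows how per-panel, micro and macro recall would follow from that. All scores and gold labels are invented, and `top_panel` and `recall_report` are hypothetical helper names.

```python
from collections import defaultdict


def top_panel(scores):
    """Return the panel with the highest score (assumed decision rule)."""
    return max(scores, key=scores.get)


def recall_report(gold, predicted):
    """Per-panel recall plus micro and macro recall for single-label data."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        hits[g] += int(g == p)
    per_panel = {panel: hits[panel] / totals[panel] for panel in totals}
    micro = sum(hits.values()) / len(gold)
    macro = sum(per_panel.values()) / len(per_panel)
    return per_panel, micro, macro


# Invented sigmoid scores for four projects over two panels.
project_scores = [
    {"Mathematics": 0.91, "Universe Sciences": 0.10},
    {"Mathematics": 0.64, "Universe Sciences": 0.32},
    {"Mathematics": 0.20, "Universe Sciences": 0.55},
    {"Mathematics": 0.05, "Universe Sciences": 0.97},
]
gold = ["Mathematics", "Mathematics", "Mathematics", "Universe Sciences"]

predicted = [top_panel(s) for s in project_scores]
per_panel, micro, macro = recall_report(gold, predicted)
print(per_panel, micro, macro)
```

With exactly one gold panel per project, micro recall equals plain accuracy, whereas macro recall averages the per-panel values and so gives small panels the same weight as large ones.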
|
|
| ## Citation |
|
|
| ``` |
| @inproceedings{bovenzi2022mapping, |
| title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark}, |
| author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep}, |
| booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)}, |
| pages={495--499}, |
| year={2022}, |
| publisher={Springer International Publishing} |
| } |
| ``` |
|
|
| --- |
|
|
| ## Framework versions |
|
|
| - **Transformers:** 4.57.x |
| - **PyTorch:** 2.8.0 |
| - **Datasets:** 3.x |
| - **Tokenizers:** 0.22.x |