---
library_name: transformers
license: apache-2.0
datasets:
- SIRIS-Lab/erc-classification-dataset
base_model:
- allenai/specter2_base
pipeline_tag: text-classification
---

# ERC Panels Classifier

This model is a fine-tuned version of **allenai/specter2_base** for multilabel classification of scientific documents into research domains aligned with the **ERC panel taxonomy**.  
It achieves the following results on the held-out test set:

- **Best validation loss:** 0.0361  
- **Micro F1:** 0.9386  
- **Micro ROC-AUC:** 0.9718  
- **Subset accuracy:** 0.7943  

---

## Model description

This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels.

The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**.  
Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels.

### Key characteristics

- **Base model:** allenai/specter2_base  
- **Task:** multilabel document classification  
- **Labels:** 28 ERC scientific panels  
- **Activation:** sigmoid (independent scores per label)  
- **Loss:** BCEWithLogitsLoss  
- **Output:** list of predicted panels with associated probabilities  
- **Decision threshold:** 0.5 (tunable)

This model enables automatic research-domain tagging aligned with the ERC panel structure.
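
As a minimal sketch of what independent sigmoid scoring means in practice, the snippet below turns raw logits into per-label probabilities and keeps every label that reaches the threshold. The three panels and their logits are made-up illustrative values, not outputs of this model, and `predict_panels` is a hypothetical helper.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1) without overflowing."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def predict_panels(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(lab, p) for lab, p in zip(labels, probs) if p >= threshold]

# Hypothetical logits for three of the 28 panels (illustrative values only)
labels = ["Earth System Science", "Mathematics", "Universe Sciences"]
logits = [2.0, -3.0, 0.5]

selected = predict_panels(logits, labels)
for label, prob in selected:
    print(f"{label}: {prob:.2f}")
```

Because each label is scored on its own, the probabilities do not need to sum to 1, so two panels can both pass the threshold for the same document. Raising `threshold` trades recall for precision.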

---

## Intended uses & limitations

### Intended uses

This model is designed for:

- Automatic assignment of ERC research panels  
- Metadata enrichment for:
  - research project databases  
  - institutional repositories  
  - funding and grant analysis pipelines  
- Large-scale analytics such as:
  - portfolio mapping  
  - thematic analysis of research outputs  
  - monitoring disciplinary coverage of funded projects  
- Predicting subject areas for documents lacking structured domain metadata  

The model supports:

- title only  
- abstract only  
- **title + abstract (recommended)**  

### Limitations

- ERC panels are **high-level categories** and do not represent fine-grained subdisciplines  
- Labels are derived from curated datasets and semi-automatically annotated data  
- Class imbalance may affect recall for underrepresented panels  
- The model does not encode explicit hierarchical relationships between panels  

Not suited for:

- fine-grained subfield classification  
- journal recommendation  
- evaluation of research quality or impact  
- clinical, legal, or regulatory decision-making  

Predictions should be treated as **supportive metadata**, not authoritative classifications.

---

## How to use

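```python
from transformers import pipeline

# Model repository on the Hugging Face Hub; replace with your own repo name if needed
MODEL_NAME = "nicolauduran45/erc_classifier_demo"

classifier = pipeline(task="text-classification", model=MODEL_NAME, tokenizer=MODEL_NAME)

# Title + abstract is the recommended input; a single sentence is used here for brevity
text = ["Climate change impacts on Arctic ecosystems."]

# top_k=None returns a score for every panel instead of only the top one;
# function_to_apply="sigmoid" matches the multilabel activation described above
predictions = classifier(text, top_k=None, function_to_apply="sigmoid")
print(predictions)
```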
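
To turn pipeline scores into a list of predicted panels, the scores can be filtered against the decision threshold. The sketch below assumes the pipeline was asked to return a score for every panel (for example with `top_k=None`) and that each document yields a list of `{"label", "score"}` dictionaries. The helper `panels_above_threshold` and the example scores are hypothetical.

```python
def panels_above_threshold(scores, threshold=0.5):
    """Keep panels whose score reaches the threshold, highest score first."""
    kept = [s for s in scores if s["score"] >= threshold]
    return sorted(kept, key=lambda s: s["score"], reverse=True)

# Hypothetical pipeline output for one document (scores are made up)
example_output = [
    {"label": "Mathematics", "score": 0.02},
    {"label": "Environmental Biology, Ecology and Evolution", "score": 0.64},
    {"label": "Earth System Science", "score": 0.91},
]

for item in panels_above_threshold(example_output):
    print(f'{item["label"]}: {item["score"]:.2f}')
```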
---

## Training and evaluation data

### Training data

- Scientific documents with ERC-style panel annotations  
- Inputs:
  - title  
  - abstract  
- Task type: **multilabel classification**

### Dataset characteristics

| Property | Value |
|--------|------|
| Documents | ~40k |
| Labels | 28 panels |
| Input fields | Title, Abstract |
| Task type | Multilabel |
| License | Dataset-dependent |

---

## Training procedure

### Preprocessing

- Input text constructed as:

  `title + ". " + abstract`

- Tokenization using the SPECTER2 tokenizer  
- Maximum sequence length: **512 tokens**
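
A minimal sketch of the input construction above. The handling of a missing title or abstract is an assumption made for illustration; the card does not describe how title-only or abstract-only records were formatted, and `build_input` is a hypothetical helper.

```python
def build_input(title=None, abstract=None):
    """Join fields as title + ". " + abstract, tolerating a missing field."""
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    # Assumption: with only one field available, use that field on its own
    return title or abstract

text = build_input(
    "Climate change impacts on Arctic ecosystems",
    "We study how warming alters species distributions.",
)
print(text)
```

Texts longer than 512 tokens need to be truncated at tokenization time, so for long abstracts only the leading portion reaches the model.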

### Model

- Base model: `allenai/specter2_base`  
- Classification head: linear → sigmoid  
- Loss function: BCEWithLogitsLoss  
- Predictions: independent probability per label  
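
For readers unfamiliar with `BCEWithLogitsLoss`, the following pure-Python sketch computes the same quantity for a single document: binary cross-entropy applied to each label's logit and averaged over labels. It uses the standard numerically stable formulation; the logits and targets are illustrative values, and this is not the training code used for this model.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Hypothetical logits and multi-hot targets for three labels
logits = [2.0, -3.0, 0.5]
targets = [1.0, 0.0, 0.0]

loss = bce_with_logits(logits, targets)
print(f"loss = {loss:.4f}")
```

Each label contributes its own term, so a confident mistake on one panel is penalised regardless of how the other panels are scored.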

### Training hyperparameters

| Hyperparameter | Value |
|--------------|------|
| Learning rate | 2e-5 |
| Train batch size | 16 |
| Eval batch size | 16 |
| Epochs | 6 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Metric for best model | Micro F1 |

---

## Training results

| Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset accuracy |
|------|---------------|-----------------|----------|---------|----------|
| 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
| 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
| 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
| 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
| 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
| 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** |
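
The gap between micro F1 and accuracy in the table is expected if accuracy is the exact-match (subset) accuracy reported in the summary, since a document counts as correct only when every label matches. The sketch below uses the standard definitions of both metrics on a tiny made-up example; it is not the authors' evaluation script.

```python
def micro_f1(y_true, y_pred):
    """F1 computed from label decisions pooled over all documents and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """Fraction of documents whose full label set is predicted exactly."""
    exact = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)

# Four hypothetical documents, three labels each (multi-hot rows)
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]

print(micro_f1(y_true, y_pred))         # 0.8
print(subset_accuracy(y_true, y_pred))  # 0.5
```

Two of the four documents have one wrong label each, which halves subset accuracy while micro F1 stays at 0.8.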

---

## Evaluation results (multilabel test set)

| Panel | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
| Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
| Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
| Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
| Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
| Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
| Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
| Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
| Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
| Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
| Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
| Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
| Mathematics | 1.00 | 1.00 | 1.00 | 36 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
| Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
| Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
| Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
| Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
| Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
| Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
| Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
| Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
| The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
| The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
| The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
| Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |


**Overall performance**

|  | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** |
| **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** |
  
---

## ERC-funded projects evaluation (multiclass recall)

This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**.  
Only **recall** is reported.

| Panel | Recall |
|------|--------|
| Biotechnology and Biosystems Engineering | 0.26 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
| Computer Science and Informatics | 1.00 |
| Condensed Matter Physics | 0.77 |
| Earth System Science | 0.92 |
| Environmental Biology, Ecology and Evolution | 0.85 |
| Fundamental Constituents of Matter | 0.84 |
| Human Mobility, Environment, and Space | 0.61 |
| Immunity, Infection and Immunotherapy | 0.83 |
| Individuals, Markets and Organisations | 0.96 |
| Institutions, Governance and Legal Systems | 0.58 |
| Integrative Biology: from Genes and Genomes to Systems | 0.73 |
| Materials Engineering | 0.75 |
| Mathematics | 0.96 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
| Neuroscience and Disorders of the Nervous System | 0.92 |
| Physical and Analytical Chemical Sciences | 0.83 |
| Physiology in Health, Disease and Ageing | 0.60 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
| Products and Processes Engineering | 0.58 |
| Studies of Cultures and Arts | 0.27 |
| Synthetic Chemistry and Materials | 0.67 |
| Systems and Communication Engineering | 0.75 |
| Texts and Concepts | 0.62 |
| The Human Mind and Its Complexity | 0.85 |
| The Social World and Its Interactions | 0.73 |
| The Study of the Human Past | 0.83 |
| Universe Sciences | 1.00 |

**Overall recall**

- **Micro recall:** 0.77  
- **Macro recall:** 0.76
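
The difference between micro and macro recall can be illustrated with a short sketch. It assumes that one predicted panel per project has already been chosen; the card does not state how the multilabel scores were reduced to a single panel, so taking the top-scoring panel is only one plausible option. The helper `recall_report` and the example data are hypothetical.

```python
from collections import defaultdict

def recall_report(true_panels, predicted_panels):
    """Per-panel, micro and macro recall for single-label ground truth."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true, pred in zip(true_panels, predicted_panels):
        totals[true] += 1
        hits[true] += int(true == pred)
    per_panel = {panel: hits[panel] / totals[panel] for panel in totals}
    micro = sum(hits.values()) / sum(totals.values())  # pooled over projects
    macro = sum(per_panel.values()) / len(per_panel)   # mean over panels
    return per_panel, micro, macro

# Four hypothetical projects with one true panel each
true_panels = ["Mathematics", "Mathematics", "Mathematics", "Universe Sciences"]
predicted = ["Mathematics", "Mathematics", "Computer Science and Informatics", "Universe Sciences"]

per_panel, micro, macro = recall_report(true_panels, predicted)
print(per_panel, micro, macro)
```

Micro recall weights every project equally, while macro recall weights every panel equally, so panels with few projects and low recall pull the macro figure down more strongly.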

---

## Citation

```
@inproceedings{bovenzi2022mapping,
  title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
  author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
  booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
  pages={495--499},
  year={2022},
  publisher={Springer International Publishing}
}
```

---

## Framework versions

- **Transformers:** 4.57.x  
- **PyTorch:** 2.8.0  
- **Datasets:** 3.x  
- **Tokenizers:** 0.22.x