---
library_name: transformers
license: apache-2.0
datasets:
- SIRIS-Lab/erc-classification-dataset
base_model:
- allenai/specter2_base
pipeline_tag: text-classification
---

# ERC Panels Classifier

This model is a fine-tuned version of **allenai/specter2_base** for multilabel scientific domain classification aligned with the **ERC panel taxonomy**.  
After six epochs of training, it achieves the following results on the held-out evaluation set:

- **Best validation loss:** 0.0361  
- **Micro F1:** 0.9386  
- **Micro ROC-AUC:** 0.9718  
- **Subset accuracy:** 0.7943  

---

## Model description

This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels.

The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**.  
Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels.

### Key characteristics

- **Base model:** allenai/specter2_base  
- **Task:** multilabel document classification  
- **Labels:** 28 ERC scientific panels  
- **Activation:** sigmoid (independent scores per label)  
- **Loss:** BCEWithLogitsLoss  
- **Output:** list of predicted panels with associated probabilities  
- **Decision threshold:** 0.5 (tunable)
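
The difference between independent sigmoid scores and a softmax can be seen in a small sketch. The logits below are hypothetical and the helper names are illustrative only; the snippet just shows why two panels can both pass the 0.5 threshold under sigmoid, while a softmax would force them to compete.

```python
import math

def sigmoid_scores(logits):
    """Independent per-label probabilities; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax_scores(logits):
    """Mutually exclusive probabilities that sum to 1 (shown only for contrast)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three panels; the first two are both well supported.
logits = [2.0, 1.5, -3.0]
THRESHOLD = 0.5  # decision threshold from the list above

sig = sigmoid_scores(logits)
soft = softmax_scores(logits)

# Sigmoid: each label is judged on its own, so two labels are assigned.
multilabel = [p >= THRESHOLD for p in sig]
# Softmax: the labels share one probability mass, so at most one exceeds 0.5.
single = [p >= THRESHOLD for p in soft]

print(multilabel, single)
```

Raising or lowering `THRESHOLD` trades precision against recall for every panel at once.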

This model enables automatic research-domain tagging aligned with the ERC panel structure.

---

## Intended uses & limitations

### Intended uses

This model is designed for:

- Automatic assignment of ERC research panels  
- Metadata enrichment for:
  - research project databases  
  - institutional repositories  
  - funding and grant analysis pipelines  
- Large-scale analytics such as:
  - portfolio mapping  
  - thematic analysis of research outputs  
  - monitoring disciplinary coverage of funded projects  
- Predicting subject areas for documents lacking structured domain metadata  

The model supports:

- title only  
- abstract only  
- **title + abstract (recommended)**  

### Limitations

- ERC panels are **high-level categories** and do not represent fine-grained subdisciplines  
- Labels are derived from curated datasets and semi-automatically annotated data  
- Class imbalance may affect recall for underrepresented panels  
- The model does not encode explicit hierarchical relationships between panels  

Not suited for:

- fine-grained subfield classification  
- journal recommendation  
- evaluation of research quality or impact  
- clinical, legal, or regulatory decision-making  

Predictions should be treated as **supportive metadata**, not authoritative classifications.

---

## How to use

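```python
from transformers import pipeline

# Replace with your actual model repo name on the Hugging Face Hub
MODEL_NAME = "nicolauduran45/erc_classifier_demo"

# top_k=None returns a score for every panel instead of only the top one,
# which is what a multilabel model needs
classifier = pipeline(
    task="text-classification",
    model=MODEL_NAME,
    tokenizer=MODEL_NAME,
    top_k=None,
)

text = ["Climate change impacts on Arctic ecosystems."]

predictions = classifier(text)
print(predictions)
```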
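
To turn per-label scores into a list of predicted panels, the scores can be filtered by the decision threshold. The sketch below assumes the pipeline was asked for all scores (for example with `top_k=None`) and returns one list of `{"label", "score"}` dictionaries per input document; `select_panels` is a hypothetical helper and the scores in `mock_output` are made up for illustration.

```python
def select_panels(pipeline_output, threshold=0.5):
    """Keep, for each document, the labels whose score reaches the threshold."""
    selected = []
    for doc_scores in pipeline_output:
        kept = [(d["label"], d["score"]) for d in doc_scores if d["score"] >= threshold]
        # Highest-scoring panels first
        kept.sort(key=lambda pair: pair[1], reverse=True)
        selected.append(kept)
    return selected

# Hypothetical output for one document (not real model scores)
mock_output = [[
    {"label": "Earth System Science", "score": 0.91},
    {"label": "Mathematics", "score": 0.02},
    {"label": "Environmental Biology, Ecology and Evolution", "score": 0.64},
]]

panels = select_panels(mock_output)
print(panels)
```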
---

## Training and evaluation data

### Training data

- Scientific documents with ERC-style panel annotations  
- Inputs:
  - title  
  - abstract  
- Task type: **multilabel classification**

### Dataset characteristics

| Property | Value |
|--------|------|
| Documents | ~40k |
| Labels | 28 panels |
| Input fields | Title, Abstract |
| Task type | Multilabel |
| License | Dataset-dependent |

---

## Training procedure

### Preprocessing

- Input text constructed as:

  `title + ". " + abstract`

- Tokenization using the SPECTER2 tokenizer  
- Maximum sequence length: **512 tokens**
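
The input construction above can be sketched as a small helper. `build_input` is a hypothetical name, and the fallback to a single field when the other is missing is an assumption based on the title-only and abstract-only modes listed under intended uses.

```python
def build_input(title=None, abstract=None):
    """Join fields as `title + ". " + abstract`, tolerating a missing field."""
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    # Assumed fallback: use whichever field is available
    return title or abstract

text = build_input(
    "Climate change impacts on Arctic ecosystems",
    "We study how warming alters species distributions.",
)
print(text)
```

Passing the resulting string to the tokenizer with `truncation=True` and `max_length=512` would enforce the sequence limit stated above.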

### Model

- Base model: `allenai/specter2_base`  
- Classification head: linear → sigmoid  
- Loss function: BCEWithLogitsLoss  
- Predictions: independent probability per label  
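
For readers unfamiliar with the loss, the following is a minimal pure-Python sketch of what a binary cross-entropy on logits computes for one document: one independent binary term per label, averaged. It is an illustration with hypothetical inputs, not the training code; in practice `torch.nn.BCEWithLogitsLoss` does this on tensors.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Two labels: the first is a true panel, the second is not.
loss_uncertain = bce_with_logits([0.0, 0.0], [1.0, 0.0])   # model is undecided
loss_confident = bce_with_logits([4.0, -4.0], [1.0, 0.0])  # model is right and confident

print(loss_uncertain, loss_confident)
```

An undecided model (all logits at zero) pays `log(2)` per label, and the loss shrinks as the logits move toward the correct side for each label independently.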

### Training hyperparameters

| Hyperparameter | Value |
|--------------|------|
| Learning rate | 2e-5 |
| Train batch size | 16 |
| Eval batch size | 16 |
| Epochs | 6 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Metric for best model | Micro F1 |
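
As a rough sketch, the table could map onto Hugging Face `TrainingArguments` as shown below. Only the values from the table come from this card. The evaluation and saving strategies, the per-device interpretation of the batch sizes, and the metric key `micro_f1` (which would have to match a key returned by a user-supplied `compute_metrics` function) are assumptions.

```python
# Values copied from the hyperparameter table; everything marked "assumed" is not.
TRAINING_KWARGS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,    # assumed: table reports per-device sizes
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 6,
    "weight_decay": 0.01,
    "optim": "adamw_torch",               # AdamW
    "eval_strategy": "epoch",             # assumed: evaluate once per epoch
    "save_strategy": "epoch",             # assumed: must match eval_strategy
    "load_best_model_at_end": True,       # assumed from "metric for best model"
    "metric_for_best_model": "micro_f1",  # hypothetical compute_metrics key
    "greater_is_better": True,
}

def build_training_args(output_dir: str):
    """Create TrainingArguments; imported lazily so the dict can be inspected alone."""
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

For BERT-style sequence classifiers, loading the base model with `problem_type="multi_label_classification"` and `num_labels=28` is the usual way to get a BCE-with-logits loss from the `Trainer`; whether this card's training run was set up exactly this way is not stated.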

---

## Training results

| Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset accuracy |
|------|---------------|-----------------|----------|---------|----------|
| 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 |
| 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 |
| 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 |
| 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 |
| 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 |
| 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** |
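
The gap between micro F1 and subset accuracy in the table follows from how the two metrics are defined. The sketch below uses small hypothetical label matrices (rows are documents, columns are panels) to show that a single wrong label costs a document its full subset-accuracy credit, while micro F1 only loses one pooled count.

```python
def micro_f1(y_true, y_pred):
    """F1 from counts pooled over every (document, label) pair."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """Share of documents whose predicted label set matches the truth exactly."""
    return sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: 3 documents, 3 panels
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]  # one extra label, one missed label

f1 = micro_f1(y_true, y_pred)
acc = subset_accuracy(y_true, y_pred)
print(f1, acc)
```

Here two of three documents have exactly one label wrong, so subset accuracy drops to one third while micro F1 stays at 0.8.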

---

## Evaluation results (multilabel test set)

| Panel | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 |
| Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 |
| Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 |
| Earth System Science | 0.94 | 0.98 | 0.96 | 64 |
| Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 |
| Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 |
| Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 |
| Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 |
| Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 |
| Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 |
| Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 |
| Materials Engineering | 0.81 | 0.93 | 0.87 | 75 |
| Mathematics | 1.00 | 1.00 | 1.00 | 36 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 |
| Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 |
| Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 |
| Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 |
| Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 |
| Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 |
| Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 |
| Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 |
| Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 |
| The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 |
| The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 |
| The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 |
| Universe Sciences | 1.00 | 1.00 | 1.00 | 25 |


**Overall performance**

|  | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** |
| **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** |
| **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** |
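
The F1 column in the per-panel table is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it for three rows using the rounded precision and recall values from the table, so agreement is only expected to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs taken from the per-panel table
rows = {
    "Biotechnology and Biosystems Engineering": (0.88, 0.70),
    "Materials Engineering": (0.81, 0.93),
    "Studies of Cultures and Arts": (1.00, 0.78),
}

recomputed = {panel: round(f1_score(p, r), 2) for panel, (p, r) in rows.items()}
print(recomputed)
```

Because the harmonic mean is pulled toward the smaller value, a panel with perfect precision but lower recall, such as Studies of Cultures and Arts, still ends up with an F1 well below 1.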
  
---

## ERC-funded projects evaluation (multiclass recall)

This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**.  
Only **recall** is reported.

| Panel | Recall |
|------|--------|
| Biotechnology and Biosystems Engineering | 0.26 |
| Cell Biology, Development, Stem Cells and Regeneration | 0.81 |
| Computer Science and Informatics | 1.00 |
| Condensed Matter Physics | 0.77 |
| Earth System Science | 0.92 |
| Environmental Biology, Ecology and Evolution | 0.85 |
| Fundamental Constituents of Matter | 0.84 |
| Human Mobility, Environment, and Space | 0.61 |
| Immunity, Infection and Immunotherapy | 0.83 |
| Individuals, Markets and Organisations | 0.96 |
| Institutions, Governance and Legal Systems | 0.58 |
| Integrative Biology: from Genes and Genomes to Systems | 0.73 |
| Materials Engineering | 0.75 |
| Mathematics | 0.96 |
| Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 |
| Neuroscience and Disorders of the Nervous System | 0.92 |
| Physical and Analytical Chemical Sciences | 0.83 |
| Physiology in Health, Disease and Ageing | 0.60 |
| Prevention, Diagnosis and Treatment of Human Diseases | 0.94 |
| Products and Processes Engineering | 0.58 |
| Studies of Cultures and Arts | 0.27 |
| Synthetic Chemistry and Materials | 0.67 |
| Systems and Communication Engineering | 0.75 |
| Texts and Concepts | 0.62 |
| The Human Mind and Its Complexity | 0.85 |
| The Social World and Its Interactions | 0.73 |
| The Study of the Human Past | 0.83 |
| Universe Sciences | 1.00 |

**Overall recall**

- **Micro recall:** 0.77  
- **Macro recall:** 0.76
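
Micro and macro recall differ in how panels are weighted, which the sketch below illustrates on hypothetical data. It assumes each project receives a single predicted panel (for example the highest-scoring label); the card does not state how the multilabel scores were reduced to one panel for this evaluation.

```python
from collections import Counter

def recall_summary(true_panels, predicted_panels):
    """Per-panel recall plus micro (per project) and macro (per panel) averages."""
    support = Counter(true_panels)
    hits = Counter(t for t, p in zip(true_panels, predicted_panels) if t == p)
    per_panel = {panel: hits[panel] / n for panel, n in support.items()}
    micro = sum(hits.values()) / len(true_panels)
    macro = sum(per_panel.values()) / len(per_panel)
    return per_panel, micro, macro

# Hypothetical projects: a large panel predicted well, a small one predicted poorly
true_panels = ["Mathematics"] * 4 + ["Texts and Concepts"] * 2
predicted = ["Mathematics"] * 4 + ["Texts and Concepts", "Mathematics"]

per_panel, micro, macro = recall_summary(true_panels, predicted)
print(per_panel, micro, macro)
```

Micro recall counts every project equally, so large panels dominate it, whereas macro recall gives each panel the same weight regardless of size.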

## Citation

```bibtex
@inproceedings{bovenzi2022mapping,
  title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark},
  author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep},
  booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)},
  pages={495--499},
  year={2022},
  publisher={Springer International Publishing}
}
```

---

## Framework versions

- **Transformers:** 4.57.x  
- **PyTorch:** 2.8.0  
- **Datasets:** 3.x  
- **Tokenizers:** 0.22.x