---
license: cc-by-nc-nd-4.0
datasets:
- SaeedLab/BindScreen
tags:
- proteins
- molecules
- bioinformatics
- drug-discovery
- feature-extraction
- transformers
---

# BindScreen Frozen

This model corresponds to the BindScreen frozen configuration, in which both encoders are frozen and only the projection layers are trained on filtered ChEMBL.

\[[GitHub Repo](https://github.com/pcdslab/BindScreen)\] | \[[Dataset on HuggingFace](https://huggingface.co/datasets/SaeedLab/BindScreen)\] | \[[Model Collection](https://huggingface.co/collections/SaeedLab/bindscreen)\] | \[[Cite](#citation)\]

## Abstract

Virtual screening aims to identify candidate molecules that bind to a target protein, playing a central role in computational drug discovery. Sequence-based deep learning methods offer a more broadly applicable alternative to structure-based approaches, since they do not require 3D structural information. However, they typically require a separate forward pass per protein-molecule pair, limiting their scalability to large molecular libraries. Contrastive learning methods inspired by CLIP address this by encoding proteins and molecules independently, allowing similarity analysis via simple comparisons rather than a forward pass per pair. However, standard CLIP training was designed for symmetric tasks and does not account for the asymmetric and one-to-many nature of protein-molecule binding. In this paper, we introduce *BindScreen*, a sequence-based virtual screening method built on a dual-encoder contrastive architecture. BindScreen introduces a protein-centric batch construction strategy and an asymmetric multi-positive InfoNCE loss to cope with the protein-centric nature of virtual screening. We conducted a systematic evaluation of 8 protein language models and 3 molecular language model variants against BindScreen. The proposed protein-centric batch construction consistently outperforms standard CLIP training across all evaluated encoders while substantially improving computational efficiency, reducing training cost by up to 32 times. In addition, our experiments demonstrate that BindScreen requires 7 times fewer inference computations than pairwise virtual screening approaches. On the LIT-PCBA dataset, BindScreen outperforms all sequence-based baselines, achieving a relative improvement of up to 39% in EF at 0.5 over the best competing method, while remaining competitive with traditional docking approaches without requiring 3D structural information.

## Model Details

BindScreen uses a dual-encoder architecture that independently encodes proteins and molecules using specialized language models. The protein branch processes amino acid sequences with ESM2 T36, and the molecule branch processes SMILES strings with MolDeBERTa MLC. Both representations are passed through projection heads to produce normalized embeddings.

![Model](pipeline.png)

Two configurations are available in this collection:

- [BindScreen-Frozen](https://huggingface.co/SaeedLab/BindScreen-Frozen): only the projection layers are trained; both encoders are frozen.
- [BindScreen-Finetuning](https://huggingface.co/SaeedLab/BindScreen-Finetuning): the projection layers and ESM2 T36 are trained; MolDeBERTa MLC is frozen.
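
Because the projection heads produce normalized embeddings, the cosine similarity between a protein and a molecule reduces to a dot product, so scoring one protein against a whole library is a single matrix product. The sketch below illustrates this with plain Python and toy 3-dimensional vectors; the helper names `l2_normalize` and `similarity_matrix` and the embedding values are hypothetical and are not part of the BindScreen code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_matrix(prot_embs, mol_embs):
    """Cosine similarity of every protein against every molecule.

    Both sides are unit-normalized first, so each entry is a plain dot product.
    """
    prots = [l2_normalize(p) for p in prot_embs]
    mols = [l2_normalize(m) for m in mol_embs]
    return [[sum(a * b for a, b in zip(p, m)) for m in mols] for p in prots]

# Toy embeddings (hypothetical values, not model outputs).
prot_embs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
mol_embs = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.0], [-1.0, 0.0, 0.0]]

# sim[i][j] is the similarity between protein i and molecule j.
sim = similarity_matrix(prot_embs, mol_embs)
for row in sim:
    print([round(s, 3) for s in row])
```

Molecule embeddings do not depend on the protein, so they can be computed once and reused across targets.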

## Usage

BindScreen computes cosine similarities between protein and molecule embeddings, which can be used to rank candidate molecules for a given target protein.

### Similarity

```python
from transformers import AutoTokenizer, AutoModel
import torch

# proteins
tokenizer_prot = AutoTokenizer.from_pretrained(
  'facebook/esm2_t36_3B_UR50D'
)
encoder_prot = AutoModel.from_pretrained(
  'facebook/esm2_t36_3B_UR50D'
).eval()


proteins = ["MKTFFVLLL", "ABCDE"]
proteins = [" ".join(i) for i in proteins]
inputs_prot = tokenizer_prot(
  proteins, 
  return_tensors="pt", 
  padding=True
)

with torch.no_grad():
  outputs = encoder_prot(**inputs_prot)
  hidden = outputs.last_hidden_state
  # mean-pool over tokens, ignoring padding positions
  mask = inputs_prot['attention_mask'].unsqueeze(-1).float()
  prot_rep = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-8)


# molecules
tokenizer_mol = AutoTokenizer.from_pretrained(
  'SaeedLab/MolDeBERTa-base-123M-mlc'
)
encoder_mol = AutoModel.from_pretrained(
  'SaeedLab/MolDeBERTa-base-123M-mlc'
).eval()


molecules = ["NCCc1nc(-c2ccccc2)cs1", "CC(=O)OCC(C)C"]
inputs_mol = tokenizer_mol(
  molecules, 
  return_tensors="pt", 
  padding=True
)

with torch.no_grad():
  outputs = encoder_mol(**inputs_mol)
  hidden = outputs.last_hidden_state
  # mean-pool over tokens, ignoring padding positions
  mask = inputs_mol['attention_mask'].unsqueeze(-1).float()
  mol_rep = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-8)


# bindscreen
bindscreen = AutoModel.from_pretrained(
  'SaeedLab/BindScreen-Frozen', 
  trust_remote_code=True
).eval()

with torch.no_grad():
  outputs = bindscreen(prot=prot_rep, mol=mol_rep)

print('Protein embeddings projected:', outputs.prot_rep)
print('Molecule embeddings projected:', outputs.mol_rep)
print('Cosine similarity:', outputs.similarity)

```

The returned outputs are:
- `prot_rep`: projected protein embeddings (512 dimensions).
- `mol_rep`: projected molecule embeddings (512 dimensions).
- `similarity`: cosine similarity between proteins and molecules.
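
To rank candidate molecules for a target protein, the similarity scores for that protein can be sorted in descending order. The sketch below uses plain Python with hypothetical scores; `rank_molecules` is an illustrative helper, not part of the BindScreen API. It assumes that one row of `outputs.similarity` holds the scores of a single protein against all molecules, which you should verify against the tensor's shape before relying on it.

```python
def rank_molecules(similarity_row, molecules, top_k=None):
    """Return (molecule, score) pairs sorted from highest to lowest similarity."""
    if len(similarity_row) != len(molecules):
        raise ValueError("one score per molecule is required")
    ranked = sorted(
        zip(molecules, similarity_row),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]

# Hypothetical scores for one protein. With the model output, a row could be
# obtained as outputs.similarity[i].tolist() (assuming rows index proteins).
molecules = ["NCCc1nc(-c2ccccc2)cs1", "CC(=O)OCC(C)C", "CCO"]
scores = [0.12, 0.47, -0.05]

top = rank_molecules(scores, molecules, top_k=2)
for smiles, score in top:
    print(f"{score:+.2f}  {smiles}")
```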

## Citation

The paper is under review. As soon as it is accepted, we will update this section.

## License

This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.

## Contact

For any additional questions or comments, contact Fahad Saeed (fsaeed@fiu.edu).