---
tags:
- sentence-transformers
- feature-extraction
- code-search
- code-embedding
- retrieval
- modernbert
- dense
base_model: Shuu12121/NightOwl
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: apache-2.0
datasets:
- Shuu12121/coir_hard_negative_datasets_v3_kd
- Shuu12121/owl_code_search_hard_negative_datasets_V2_kd
- Shuu12121/codeedit_hard_negative_datasets_kd
---

# NightOwl-CodeEmbedding 🦉

`NightOwl-CodeEmbedding` is a compact, 768-dimensional dense embedding model built for
code retrieval, code-edit retrieval, and technical question answering.

It is fine-tuned from [`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl),
a ModernBERT-based code encoder. The model uses CLS pooling with cosine similarity and
does **not** require `query:` / `passage:` style prefixes.

## Highlights

* Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
* Ranks **18th out of 241 models overall** on the
  [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard) leaderboard, and is
  the **top-scoring dense single-vector model under 300M parameters** among scored
  entries — ahead of many models an order of magnitude larger
  (see [Leaderboard Standing](#leaderboard-standing))
* Standard single-vector retrieval: one 768-d vector per document and plain
  dot-product / cosine search, so it drops straight into any existing vector-database
  stack
* Covers **eight programming languages**: the six CodeSearchNet languages plus Rust
  and TypeScript
* Handles a broad range of retrieval scenarios: NL-to-code search, code-to-code
  retrieval, **code-edit retrieval**, and technical QA
* Trained with hard negatives mined by `Qwen/Qwen3-Embedding-0.6B`
  (15 hard negatives per anchor)
* Decontaminated against the CodeSearchNet test splits and the
  CodeEditSearchRetrieval benchmark (see [Data Decontamination](#data-decontamination))
* Drop-in compatible with `sentence-transformers`; Apache-2.0 licensed

## Supported Languages

* Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages)
* **Rust, TypeScript** (additional)

Performance on languages outside this set is untested and may vary.

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/NightOwl-CodeEmbedding")

queries = ["Python function that sorts a list in descending order"]
documents = [
    "def sort_desc(values): return sorted(values, reverse=True)",
    "def average(values): return sum(values) / len(values)",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

# Cosine similarity (embeddings are normalized internally by similarity())
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
```

## Model Details

| Property                | Value                |
| ----------------------- | -------------------- |
| Base model              | `Shuu12121/NightOwl` |
| Architecture            | ModernBERT           |
| Parameters              | 150,779,136          |
| Embedding dimension     | 768                  |
| Pooling                 | CLS pooling          |
| Maximum sequence length | 1,024 tokens         |
| Similarity              | Cosine similarity    |
| Query/document prefixes | Not required         |
| Weight dtype            | FP32                 |
| Weight memory           | 575 MiB              |
| License                 | Apache-2.0           |
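
The weight-memory row can be reproduced from the parameter count. A minimal arithmetic check, assuming every parameter is stored as a 4-byte FP32 value and ignoring non-parameter buffers and runtime activations:

```python
# Back-of-the-envelope check of the "Weight memory" row.
# Assumption: every parameter is stored as a 4-byte FP32 value.
NUM_PARAMETERS = 150_779_136
BYTES_PER_FP32 = 4

weight_bytes = NUM_PARAMETERS * BYTES_PER_FP32
weight_mib = weight_bytes / (1024 ** 2)

print(f"{weight_mib:.1f} MiB")  # 575.2 MiB
```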

## MTEB Results

The model was evaluated with MTEB on code-related retrieval and technical QA tasks.

Evaluation setup:

* Model revision: `c7c8a57b9539297e192d5cf39b9aecf1fb376edd`
* MTEB version: `2.15.1`
* Metric: `NDCG@10`
* Hardware: NVIDIA GeForce RTX 5090
* Batch size: 64

Multi-subset task scores are reported as macro averages.

| Task                             |   Split |     NDCG@10 |
| -------------------------------- | ------: | ----------: |
| AppsRetrieval                    |    test |     0.39177 |
| COIRCodeSearchNetRetrieval       |    test |     0.84264 |
| CodeEditSearchRetrieval          | train¹ |     0.74808 |
| CodeFeedbackMT                   |    test |     0.76690 |
| CodeFeedbackST                   |    test |     0.85207 |
| CodeSearchNetCCRetrieval         |    test |     0.91805 |
| CodeSearchNetRetrieval           |    test |     0.89239 |
| CodeTransOceanContest            |    test |     0.75953 |
| CodeTransOceanDL                 |    test |     0.36057 |
| CosQA                            |    test |     0.42810 |
| StackOverflowQA                  |    test |     0.86608 |
| SyntheticText2SQL                |    test |     0.68266 |
| **Macro average, all 12 tasks**  |         | **0.70907** |
| **CoIR macro average, 10 tasks** |         | **0.68684** |

¹ `CodeEditSearchRetrieval` does not provide a standard `test` split in MTEB, so the
official `train` split is used for evaluation. These examples were **not** used for
fine-tuning. See [Data Decontamination](#data-decontamination).
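
The two summary rows can be recomputed from the per-task scores. The sketch below copies the NDCG@10 values from the table; the `NON_COIR_TASKS` set is an assumption inferred from the 10-task CoIR benchmark rather than something the table states explicitly.

```python
# NDCG@10 values copied from the results table above.
scores = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: these two tasks are the ones outside the 10-task CoIR set.
NON_COIR_TASKS = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}


def macro_average(values):
    """Unweighted mean over tasks."""
    values = list(values)
    return sum(values) / len(values)


all_tasks_avg = macro_average(scores.values())
coir_avg = macro_average(
    score for task, score in scores.items() if task not in NON_COIR_TASKS
)

print(f"{all_tasks_avg:.5f}")  # 0.70907
print(f"{coir_avg:.5f}")       # 0.68684
```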

### Leaderboard Standing

On the public [MTEB(Code, v1)](https://huggingface.co/spaces/mteb/leaderboard)
leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average above
×100) places it as follows:

* **#18 of 241 models overall**, ahead of many models an order of magnitude larger
* **#6 of 155 among sub-1B-parameter dense single-vector models**, and the
  **smallest model in that top six**. The five models ranked above it
  (`F2LLM-v2-0.6B`, `F2LLM-v2-330M`, `pplx-embed-v1-0.6b`, `C2LLM-0.5B`,
  `Qwen3-Embedding-0.6B`) are all roughly 0.33–0.6B parameters, i.e. 2–4× larger.
* **#1 among ranked dense single-vector models under 300M parameters**
  (the leaderboard's small-model view)

> **Reading the numbers fairly.** MTEB(Code, v1) reports a *zero-shot %* for each
> model — the fraction of leaderboard tasks the model was *not* trained on.
> `NightOwl-CodeEmbedding` is **8%** zero-shot: its training data covers 11 of the 12
> task families, so its score reflects strong **in-domain** retrieval rather than zero-shot
> transfer. Models marked **100%** (e.g. `embeddinggemma-300m`, the
> `granite-embedding` r2 family, `Qwen3-Embedding`) are evaluated fully out-of-domain,
> so raw score comparisons across rows with different zero-shot % are not
> apples-to-apples. The fairest direct comparisons are to other code-specialized
> models at similar zero-shot levels (e.g. the `F2LLM` / `C2LLM` families at 8–58%).

### Comparison with similar-sized models

The table below compares `NightOwl-CodeEmbedding` with other compact **dense
single-vector (bi-encoder)** models on MTEB(Code, v1), with a size ladder of larger
single-vector models for reference. Score is the leaderboard task mean (higher is
better); the *Zero-shot* column is the share of tasks the model did not train on.

| Model                                                | Params  | Emb. dim | Max tokens | Zero-shot | MTEB(Code, v1) ↑ |
| ---------------------------------------------------- | ------: | -------- | ---------: | --------: | ---------------: |
| **`NightOwl-CodeEmbedding`** (this model)            |  150.8M | 768      |      1,024 |        8% |        **70.91** |
| `codefuse-ai/F2LLM-v2-160M`                          |    159M | 640      |     40,960 |       58% |            70.38 |
| `google/embeddinggemma-300m`                         |    308M | 768      |      2,048 |      100% |            68.76 |
| `codefuse-ai/F2LLM-v2-80M`                           |     80M | 320      |     40,960 |       58% |            67.97 |
| `ibm-granite/granite-embedding-311m-multilingual-r2` |    312M | 768      |      8,192 |      100% |            63.84 |
| _Larger single-vector reference (size ladder)_       |         |          |            |           |                  |
| `codefuse-ai/F2LLM-v2-0.6B` (#1 sub-1B)              |    596M | 1,024    |     40,960 |       58% |            77.41 |
| `Qwen/Qwen3-Embedding-0.6B`                          |    596M | 1,024    |     32,768 |      100% |            75.42 |
| `codefuse-ai/F2LLM-v2-14B` (#1 overall)              |  13.99B | 5,120    |     40,960 |       58% |            80.75 |

Takeaways:

* Among compact single-vector dense models, `NightOwl-CodeEmbedding` is the strongest
  entry in the leaderboard's small-model view while also being one of the smallest,
  edging out `F2LLM-v2-160M` and clearly ahead of `embeddinggemma-300m`.
* The sub-1B leaders (`F2LLM-v2-0.6B`, `Qwen3-Embedding-0.6B`) score ~4.5–6.5 points
  higher but have ~4× the parameter count and larger embedding dimensions, which
  directly increase index size and inference cost.
* The 14B model at the top of the overall board is ~10 points higher but ~93× larger —
  a different deployment cost regime entirely.
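
To make the index-size point concrete, here is a rough sketch of raw storage for a flat, uncompressed FP32 index. The corpus size of one million chunks is a hypothetical figure, and real vector databases add their own overhead or compression on top.

```python
def flat_index_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for a flat (uncompressed) vector index, in GiB."""
    return num_vectors * dim * bytes_per_value / (1024 ** 3)


NUM_CHUNKS = 1_000_000  # hypothetical corpus size

# Embedding dimensions taken from the comparison table above.
for label, dim in [("768-d", 768), ("1,024-d", 1024), ("5,120-d", 5120)]:
    print(f"{label}: {flat_index_gib(NUM_CHUNKS, dim):.2f} GiB")
# 768-d: 2.86 GiB
# 1,024-d: 3.81 GiB
# 5,120-d: 19.07 GiB
```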

## Base Model: the NightOwl Backbone

`NightOwl-CodeEmbedding` is fine-tuned from
[`Shuu12121/NightOwl`](https://huggingface.co/Shuu12121/NightOwl), a ModernBERT-style
code encoder that was **pre-trained from scratch** — tokenizer included — rather than
adapted from a general-purpose checkpoint. The whole stack, from tokenization to the
pre-training objective, is designed around code.

**Code-aware tokenizer.** NightOwl uses a custom 50,368-token BPE tokenizer in which
whitespace is tokenized **independently** of adjacent words, so indentation gets its
own tokens instead of being merged into "leading-whitespace + word" pieces. In code,
the same identifier recurs at many indentation depths; folding whitespace into those
pieces would waste large parts of the vocabulary on near-duplicate "indent + token"
variants. Keeping whitespace separate lets the fixed vocabulary budget cover more
genuinely distinct subwords while still representing indentation faithfully — which
matters for whitespace-significant languages such as Python.
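
The effect of treating whitespace independently can be illustrated with two toy pre-tokenization rules. Both regular expressions below are simplified assumptions written for this example; they are not NightOwl's actual tokenizer rules.

```python
import re

# Simplified "leading space glued to the word" convention (GPT-2-like).
MERGED_PATTERN = re.compile(r" ?\w+| ?[^\w\s]+|\s+(?!\S)|\s+")
# Whitespace-independent convention: whitespace runs are always separate pieces.
SEPARATE_PATTERN = re.compile(r"\w+|[^\w\s]+|\s+")


def word_pieces(text, pattern):
    """Pieces that contain at least one non-whitespace character."""
    return [piece for piece in pattern.findall(text) if piece.strip()]


top_level = "return total"
indented = "    return total"

# Merged: the same keyword becomes a different piece once it is indented.
print(word_pieces(top_level, MERGED_PATTERN))  # ['return', ' total']
print(word_pieces(indented, MERGED_PATTERN))   # [' return', ' total']

# Separate: word pieces are identical at any indentation depth.
print(word_pieces(top_level, SEPARATE_PATTERN))  # ['return', 'total']
print(word_pieces(indented, SEPARATE_PATTERN))   # ['return', 'total']
```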

**Two-phase pre-training with line-level masking.** NightOwl is trained with
masked-language modeling (`mlm_probability = 0.3`) in two phases:

* *Phase 1 — mixed pre-training:* standard random-token MLM over code, natural
  language, and technical documentation (producing `NightOwl-Pre`).
* *Phase 2 — code-only continuation:* **line-level MLM**, where entire source-code
  lines are masked instead of random tokens. This aligns the pre-training objective
  with code search and retrieval, where the unit of meaning is closer to a line or
  statement than an isolated token. The recommended `NightOwl` checkpoint is this
  Phase-2 result.
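
A minimal sketch of what line-level masking could look like. The card only states that whole lines are masked with `mlm_probability = 0.3`; the line-selection policy, the whitespace-split "tokens" (which also discard indentation), and the `line_level_mask` helper are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def line_level_mask(source: str, mlm_probability: float = 0.3, seed: int = 0):
    """Mask whole lines until roughly `mlm_probability` of tokens are masked.

    Tokens are whitespace-split words, a stand-in for real subword tokens.
    Returns (masked_lines, masked_line_indices).
    """
    rng = random.Random(seed)
    lines = [line.split() for line in source.splitlines()]
    budget = mlm_probability * sum(len(tokens) for tokens in lines)

    # Visit non-empty lines in random order and mask each one completely.
    candidates = [i for i, tokens in enumerate(lines) if tokens]
    rng.shuffle(candidates)

    masked, masked_count = set(), 0
    for i in candidates:
        if masked_count >= budget:
            break
        masked.add(i)
        masked_count += len(lines[i])

    masked_lines = [
        [MASK] * len(tokens) if i in masked else tokens
        for i, tokens in enumerate(lines)
    ]
    return masked_lines, sorted(masked)


example = "def add(a, b):\n    total = a + b\n    return total\n\nprint(add(1, 2))"
masked_lines, masked_indices = line_level_mask(example)
for tokens in masked_lines:
    print(" ".join(tokens))
```

Every line in the output is either fully masked or untouched, which is the property that distinguishes this objective from random-token MLM.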

Backbone architecture (base):

| Property                       | Value                                                 |
| ------------------------------ | ----------------------------------------------------- |
| Architecture                   | ModernBERT (alternating local/global attention, RoPE) |
| Parameters                     | ≈150M                                                 |
| `hidden_size` / layers / heads | 768 / 19 / 12                                         |
| Vocabulary                     | 50,368 (custom code BPE)                              |
| Max sequence length            | 1,024 (Phase 1) → 2,048 (Phase 2)                     |

Pre-training data mixes `bigcode/starcoder2data-extras` (Kaggle notebooks,
StackOverflow threads, GitHub issues, technical documentation, …) with whole-file
source from `Shuu12121/github-file-programs-dataset` across the eight supported
languages. Long examples are split into chunks so all tokens are used rather than
truncated.
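
The chunking idea can be sketched as follows. Non-overlapping chunks and the absence of special tokens are assumptions; the card does not describe the exact chunking scheme, and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(token_ids, max_length):
    """Split a token sequence into consecutive chunks of at most `max_length`."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


tokens = list(range(2500))  # stand-in for the token ids of one long file
chunks = split_into_chunks(tokens, max_length=1024)
print([len(chunk) for chunk in chunks])  # [1024, 1024, 452]
```

Truncating the same example at 1,024 tokens would discard the remaining 1,476 tokens; chunking keeps them as additional training sequences.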

As a backbone (that is, without the retrieval fine-tuning described in this card),
NightOwl reaches **0.8436 average MRR** on MTEB `CodeSearchNetRetrieval` when
fine-tuned under a fixed SentenceTransformer protocol, ahead of CodeBERT-base (0.7944),
GraphCodeBERT-base (0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182)
evaluated the same way. `NightOwl-CodeEmbedding` builds the retrieval model described
in this card on top of that backbone.

## Training

The model was trained with `CachedMultipleNegativesRankingLoss` using bidirectional
query-to-document and document-to-query objectives.

| Property                   | Value                                |
| -------------------------- | ------------------------------------ |
| Training samples           | 2,534,400                            |
| Positives per anchor       | 1                                    |
| Negatives per anchor       | 15                                   |
| Loss                       | `CachedMultipleNegativesRankingLoss` |
| Objective                  | Bidirectional retrieval training     |
| Hard-negative mining model | `Qwen/Qwen3-Embedding-0.6B`          |
| Epochs                     | 1                                    |
| Learning rate              | 6e-5                                 |
| Batch size                 | 1024                                 |
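
To illustrate the bidirectional objective, here is a small pure-Python sketch of a symmetric in-batch ranking loss over a query-document similarity matrix. It is not the `sentence-transformers` implementation: gradient caching and the fifteen hard negatives per anchor are omitted, and the `scale` value of 20 is an assumption.

```python
import math


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one list of logits against the target position."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def bidirectional_ranking_loss(similarity, scale=20.0):
    """Symmetric in-batch ranking loss.

    similarity[i][j] is the cosine similarity between query i and document j;
    matching pairs sit on the diagonal.
    """
    n = len(similarity)
    # Query-to-document: each row should rank its own document first.
    q2d = sum(
        cross_entropy_row([scale * s for s in similarity[i]], i) for i in range(n)
    ) / n
    # Document-to-query: each column should rank its own query first.
    d2q = sum(
        cross_entropy_row([scale * similarity[i][j] for i in range(n)], j)
        for j in range(n)
    ) / n
    return (q2d + d2q) / 2


well_separated = [[0.9, 0.1], [0.2, 0.8]]
confused = [[0.1, 0.9], [0.8, 0.2]]
print(bidirectional_ranking_loss(well_separated) < bidirectional_ranking_loss(confused))  # True
```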

### Training Data

The training data is a mixture of:

1. **Public code-retrieval datasets** covering the following CoIR task families:
   AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT, CodeFeedbackST,
   CodeSearchNetCCRetrieval, CodeSearchNetRetrieval, CodeTransOceanContest,
   CodeTransOceanDL, CosQA, StackOverflowQA, and SyntheticText2SQL.
2. **Custom code–comment pair data**: code snippets paired with natural-language
   description comments across the eight supported languages.
3. **Code-edit data** derived from `commitpackft`, pairing edit intents with code
   changes.

All datasets were constructed as hard-negative retrieval datasets: each anchor is
paired with one positive and fifteen hard negatives. Hard negatives were mined with
[`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), which
retrieves semantically similar but non-matching candidates that are considerably harder
than random negatives. The mining model is used only during dataset construction and is
not required at inference time.
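
A minimal sketch of top-k hard-negative selection, using toy 2-d vectors in place of real mining-model embeddings. The card does not describe score thresholds or false-negative filtering, so none is shown; `mine_hard_negatives` is a hypothetical helper, and all vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mine_hard_negatives(anchor_vec, candidate_vecs, positive_index, num_negatives=15):
    """Indices of the candidates most similar to the anchor, excluding the positive."""
    scored = [
        (cosine(anchor_vec, vec), idx)
        for idx, vec in enumerate(candidate_vecs)
        if idx != positive_index
    ]
    scored.sort(reverse=True)  # most similar first
    return [idx for _, idx in scored[:num_negatives]]


anchor = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # index 0: the labeled positive
    [0.9, 0.3],   # index 1: similar but non-matching, a hard negative
    [0.0, 1.0],   # index 2: unrelated
    [-1.0, 0.0],  # index 3: opposite direction
]
print(mine_hard_negatives(anchor, candidates, positive_index=0, num_negatives=2))  # [1, 2]
```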

This setup is intended to sharpen discrimination between code snippets, programming
questions, edit examples, and technically similar retrieval candidates.

### Data Decontamination

To reduce benchmark contamination, the following overlaps were removed from the
training data **before** training:

* Overlaps between the custom code–comment pair data and the **CodeSearchNet test
  split**
* Overlaps between the `commitpackft`-derived code-edit data and the
  **CodeEditSearchRetrieval** benchmark evaluation data

For `CodeEditSearchRetrieval`, note that MTEB labels the evaluation split `train`.
This is just the official split name for the task; the evaluated examples were not
included in this model's fine-tuning data. The reported score should therefore be read
as **in-domain generalization on held-out benchmark examples**: it is not training-set
performance, but because the training distribution is in-domain, it is not strictly
zero-shot performance either.
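
One simple way to remove such overlaps is exact matching on normalized text, sketched below. The card does not state which matching criterion was used, so the whitespace-and-case normalization and the `decontaminate` helper are assumptions for illustration only.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting changes still match."""
    return " ".join(text.split()).lower()


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def decontaminate(train_examples, benchmark_examples):
    """Drop training examples whose normalized text appears in the benchmark."""
    blocked = {fingerprint(text) for text in benchmark_examples}
    return [text for text in train_examples if fingerprint(text) not in blocked]


train = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
]
benchmark = ["def add(a, b):  return a + b"]

clean = decontaminate(train, benchmark)
print(len(clean))  # 1
```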

## Intended Use

* Natural language to code search
* Code-to-code retrieval and similar-function search
* Code-edit retrieval (matching edit intents to code changes)
* Retrieval over programming Q&A and technical questions
* Local semantic code search systems
* RAG systems over codebases and developer documentation

Example use cases include indexing functions, snippets, programming solutions,
StackOverflow-style answers, code-review examples, and edit-related code examples.

## Limitations

* Specialized for code-related retrieval; may underperform general-purpose text
  embedding models on unrelated natural-language tasks.
* Inputs longer than 1,024 tokens are truncated — a shorter context window than some
  competitors (e.g. the 8K+ token `F2LLM` and `granite` models), so very long files
  must be chunked.
* MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code domains,
  query styles, or languages far from the training distribution, expect lower
  performance than the leaderboard numbers suggest.
* Performance may vary by programming language, query style, and the granularity of
  indexed code chunks; languages outside the eight supported ones are untested.
* The model produces dense single-vector embeddings. For applications that need very
  fine-grained token-level matching, other retrieval approaches — such as
  late-interaction (multi-vector) models or cross-encoder rerankers — may be worth
  considering, with different trade-offs in index size and retrieval infrastructure.

## Recommended Indexing Settings

Encode both queries and documents with normalized embeddings:

```python
embeddings = model.encode(texts, normalize_embeddings=True)
```

With normalized embeddings, dot product is equivalent to cosine similarity.
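
A quick numeric check of that equivalence, using toy 2-d vectors in place of real 768-d embeddings:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]
document = [4.0, 3.0]

cosine_score = cosine(query, document)
dot_of_normalized = dot(l2_normalize(query), l2_normalize(document))

# The two values agree up to floating-point rounding.
print(math.isclose(cosine_score, dot_of_normalized))  # True
```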

For codebase search, indexing function-level or class-level chunks usually works best.
Very long files may exceed the 1,024-token context limit and should be split into
smaller semantic chunks.
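
For Python sources, function-level and class-level chunks can be extracted with the standard-library `ast` module. This is a sketch under the assumption that top-level definitions are the desired unit; `function_level_chunks` is a hypothetical helper, and other languages would need their own parser.

```python
import ast


def function_level_chunks(source: str):
    """Yield (name, code) for each function or class defined at module level."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield node.name, ast.get_source_segment(source, node)


example_source = (
    "import os\n"
    "\n"
    "def load(path):\n"
    "    return open(path).read()\n"
    "\n"
    "class Index:\n"
    "    def add(self, item):\n"
    "        self.items.append(item)\n"
)

chunks = list(function_level_chunks(example_source))
print([name for name, _ in chunks])  # ['load', 'Index']
```

Each extracted chunk can then be passed to `model.encode` as one document; chunks that still exceed 1,024 tokens would need further splitting.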

## Citation

If you use this model, please cite it together with the base model and Sentence
Transformers.

```bibtex
@misc{nightowl_codeembedding,
  title = {NightOwl-CodeEmbedding},
  author = {Shuu12121},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding}
}
```