tags:
  - sentence-transformers
  - feature-extraction
  - code-search
  - code-embedding
  - retrieval
  - modernbert
  - dense
base_model: Shuu12121/NightOwl
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: apache-2.0
datasets:
  - Shuu12121/coir_hard_negative_datasets_v3_kd
  - Shuu12121/owl_code_search_hard_negative_datasets_V2_kd
  - Shuu12121/codeedit_hard_negative_datasets_kd

NightOwl-CodeEmbedding 🦉

NightOwl-CodeEmbedding is a compact, 768-dimensional dense embedding model built for code retrieval, code-edit retrieval, and technical question answering.

It is fine-tuned from Shuu12121/NightOwl, a ModernBERT-based code encoder. The model uses CLS pooling with cosine similarity and does not require query: / passage: style prefixes.

Highlights

  • Compact (150.8M parameters) yet competitive on CoIR-style code retrieval benchmarks
  • Ranks 18th out of 241 models overall on the MTEB(Code, v1) leaderboard, and is the top-scoring dense single-vector model under 300M parameters among scored entries — ahead of many models an order of magnitude larger (see Leaderboard Standing)
  • Standard single-vector retrieval: one 768-d vector per document and plain dot-product / cosine search, so it drops straight into any existing vector-database stack
  • Covers eight programming languages: the six CodeSearchNet languages plus Rust and TypeScript
  • Handles a broad range of retrieval scenarios: NL-to-code search, code-to-code retrieval, code-edit retrieval, and technical QA
  • Trained with hard negatives mined by Qwen/Qwen3-Embedding-0.6B (15 hard negatives per anchor)
  • Decontaminated against the CodeSearchNet test splits and the CodeEditSearchRetrieval benchmark (see Data Decontamination)
  • Drop-in compatible with sentence-transformers; Apache-2.0 licensed

Supported Languages

  • Go, Java, JavaScript, PHP, Python, Ruby (CodeSearchNet languages)
  • Rust, TypeScript (additional)

Performance on languages outside this set is untested and may vary.

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/NightOwl-CodeEmbedding")

queries = ["Python function that sorts a list in descending order"]
documents = [
    "def sort_desc(values): return sorted(values, reverse=True)",
    "def average(values): return sum(values) / len(values)",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

# Cosine similarity (embeddings are normalized internally by similarity())
scores = model.similarity(query_embeddings, document_embeddings)
print(scores)

Model Details

| Property | Value |
| --- | --- |
| Base model | Shuu12121/NightOwl |
| Architecture | ModernBERT |
| Parameters | 150,779,136 |
| Embedding dimension | 768 |
| Pooling | CLS pooling |
| Maximum sequence length | 1,024 tokens |
| Similarity | Cosine similarity |
| Query/document prefixes | Not required |
| Weight dtype | FP32 |
| Weight memory | 575 MiB |
| License | Apache-2.0 |
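
The weight-memory figure follows directly from the parameter count and dtype in the table. A minimal arithmetic sketch (it counts raw weight storage only; activations and framework overhead are not included):

```python
# Parameter count and dtype are taken from the Model Details table.
NUM_PARAMETERS = 150_779_136
BYTES_PER_FP32 = 4


def weight_memory_mib(num_parameters: int, bytes_per_param: int) -> float:
    """Return raw weight storage in MiB (1 MiB = 2**20 bytes)."""
    return num_parameters * bytes_per_param / 2**20


fp32_mib = weight_memory_mib(NUM_PARAMETERS, BYTES_PER_FP32)
print(f"FP32 weights: {fp32_mib:.1f} MiB")
```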

MTEB Results

The model was evaluated with MTEB on code-related retrieval and technical QA tasks.

Evaluation setup:

  • Model revision: c7c8a57b9539297e192d5cf39b9aecf1fb376edd
  • MTEB version: 2.15.1
  • Metric: NDCG@10
  • Hardware: NVIDIA GeForce RTX 5090
  • Batch size: 64

Multi-subset task scores are reported as macro averages.

| Task | Split | NDCG@10 |
| --- | --- | --- |
| AppsRetrieval | test | 0.39177 |
| COIRCodeSearchNetRetrieval | test | 0.84264 |
| CodeEditSearchRetrieval | train¹ | 0.74808 |
| CodeFeedbackMT | test | 0.76690 |
| CodeFeedbackST | test | 0.85207 |
| CodeSearchNetCCRetrieval | test | 0.91805 |
| CodeSearchNetRetrieval | test | 0.89239 |
| CodeTransOceanContest | test | 0.75953 |
| CodeTransOceanDL | test | 0.36057 |
| CosQA | test | 0.42810 |
| StackOverflowQA | test | 0.86608 |
| SyntheticText2SQL | test | 0.68266 |
| Macro average, all 12 tasks | | 0.70907 |
| CoIR macro average, 10 tasks | | 0.68684 |

¹ CodeEditSearchRetrieval does not provide a standard test split in MTEB, so the official train split is used for evaluation. These examples were not used for fine-tuning. See Data Decontamination.
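
The two macro averages can be recomputed from the per-task scores. The sketch below assumes the 10-task CoIR subset is the full list minus CodeEditSearchRetrieval and CodeSearchNetRetrieval; that choice is an inference, made because it reproduces the reported value.

```python
# Per-task NDCG@10 values copied from the results table.
NDCG_AT_10 = {
    "AppsRetrieval": 0.39177,
    "COIRCodeSearchNetRetrieval": 0.84264,
    "CodeEditSearchRetrieval": 0.74808,
    "CodeFeedbackMT": 0.76690,
    "CodeFeedbackST": 0.85207,
    "CodeSearchNetCCRetrieval": 0.91805,
    "CodeSearchNetRetrieval": 0.89239,
    "CodeTransOceanContest": 0.75953,
    "CodeTransOceanDL": 0.36057,
    "CosQA": 0.42810,
    "StackOverflowQA": 0.86608,
    "SyntheticText2SQL": 0.68266,
}

# Assumption: these two tasks are the ones outside the 10-task CoIR subset.
NON_COIR_TASKS = {"CodeEditSearchRetrieval", "CodeSearchNetRetrieval"}


def macro_average(scores: dict) -> float:
    """Unweighted mean over tasks."""
    return sum(scores.values()) / len(scores)


macro_all = macro_average(NDCG_AT_10)
macro_coir = macro_average(
    {task: s for task, s in NDCG_AT_10.items() if task not in NON_COIR_TASKS}
)
print(f"all 12 tasks: {macro_all:.5f}")
print(f"CoIR, 10 tasks: {macro_coir:.5f}")
```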

Leaderboard Standing

On the public MTEB(Code, v1) leaderboard, the model's task-mean score (70.91, i.e. the 12-task macro average above ×100) places it as follows:

  • #18 of 241 models overall, ahead of many models an order of magnitude larger
  • #6 of 155 among sub-1B-parameter dense single-vector models, and the smallest model in that top six. The five models ranked above it (F2LLM-v2-0.6B, F2LLM-v2-330M, pplx-embed-v1-0.6b, C2LLM-0.5B, Qwen3-Embedding-0.6B) all have roughly 0.33–0.6B parameters, i.e. 2–4× as many.
  • #1 among ranked dense single-vector models under 300M parameters (the leaderboard's small-model view)

Reading the numbers fairly. MTEB(Code, v1) reports a zero-shot % for each model — the fraction of leaderboard tasks the model was not trained on. NightOwl-CodeEmbedding is 8% zero-shot: it was trained on most of these task families, so its score reflects strong in-domain retrieval rather than zero-shot transfer. Models marked 100% (e.g. embeddinggemma-300m, the granite-embedding r2 family, Qwen3-Embedding) are evaluated fully out-of-domain, so raw score comparisons across rows with different zero-shot % are not apples-to-apples. The fairest direct comparisons are to other code-specialized models at similar zero-shot levels (e.g. the F2LLM / C2LLM families at 8–58%).
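
The 8% figure is consistent with the task lists in this card. A rough sketch, assuming zero-shot % is simply the share of benchmark tasks that do not appear among the training task families listed under Training Data (the leaderboard's exact procedure may differ):

```python
# The 12 MTEB(Code, v1) tasks reported in this card.
BENCHMARK_TASKS = [
    "AppsRetrieval", "COIRCodeSearchNetRetrieval", "CodeEditSearchRetrieval",
    "CodeFeedbackMT", "CodeFeedbackST", "CodeSearchNetCCRetrieval",
    "CodeSearchNetRetrieval", "CodeTransOceanContest", "CodeTransOceanDL",
    "CosQA", "StackOverflowQA", "SyntheticText2SQL",
]

# Task families named in the Training Data section of this card.
TRAINING_TASK_FAMILIES = {
    "AppsRetrieval", "COIRCodeSearchNetRetrieval", "CodeFeedbackMT",
    "CodeFeedbackST", "CodeSearchNetCCRetrieval", "CodeSearchNetRetrieval",
    "CodeTransOceanContest", "CodeTransOceanDL", "CosQA",
    "StackOverflowQA", "SyntheticText2SQL",
}

zero_shot_tasks = [t for t in BENCHMARK_TASKS if t not in TRAINING_TASK_FAMILIES]
zero_shot_share = len(zero_shot_tasks) / len(BENCHMARK_TASKS)
print(zero_shot_tasks, f"{zero_shot_share:.0%}")
```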

Comparison with similar-sized models

The table below compares NightOwl-CodeEmbedding with other compact dense single-vector (bi-encoder) models on MTEB(Code, v1), with a size ladder of larger single-vector models for reference. Score is the leaderboard task mean (higher is better); the Zero-shot column is the share of tasks the model did not train on.

| Model | Params | Emb. dim | Max tokens | Zero-shot | MTEB(Code, v1) ↑ |
| --- | --- | --- | --- | --- | --- |
| NightOwl-CodeEmbedding (this model) | 150.8M | 768 | 1,024 | 8% | 70.91 |
| codefuse-ai/F2LLM-v2-160M | 159M | 640 | 40,960 | 58% | 70.38 |
| google/embeddinggemma-300m | 308M | 768 | 2,048 | 100% | 68.76 |
| codefuse-ai/F2LLM-v2-80M | 80M | 320 | 40,960 | 58% | 67.97 |
| ibm-granite/granite-embedding-311m-multilingual-r2 | 312M | 768 | 8,192 | 100% | 63.84 |
| Larger single-vector reference (size ladder) | | | | | |
| codefuse-ai/F2LLM-v2-0.6B (#1 sub-1B) | 596M | 1,024 | 40,960 | 58% | 77.41 |
| Qwen/Qwen3-Embedding-0.6B | 596M | 1,024 | 32,768 | 100% | 75.42 |
| codefuse-ai/F2LLM-v2-14B (#1 overall) | 13.99B | 5,120 | 40,960 | 58% | 80.75 |

Takeaways:

  • Among compact single-vector dense models, NightOwl-CodeEmbedding is the strongest entry in the leaderboard's small-model view while also being one of the smallest, edging out F2LLM-v2-160M and clearly ahead of embeddinggemma-300m.
  • The sub-1B leaders score higher (F2LLM-v2-0.6B by about 6.5 points, Qwen3-Embedding-0.6B by about 4.5) but have ~4× the parameter count and larger embedding dimensions, which directly increases index size and inference cost.
  • The 14B model at the top of the overall board is ~10 points higher but ~93× larger — a different deployment cost regime entirely.
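
The ratios in these takeaways can be checked from the table values. The sketch below also estimates raw index size; the float32 storage format and the one-million-vector corpus are illustrative assumptions, not figures from the leaderboard.

```python
BASELINE = "NightOwl-CodeEmbedding"

# Parameter counts (millions), embedding dims, and scores from the table.
MODELS = {
    "NightOwl-CodeEmbedding": {"params_m": 150.8, "dim": 768, "score": 70.91},
    "F2LLM-v2-0.6B": {"params_m": 596.0, "dim": 1024, "score": 77.41},
    "Qwen3-Embedding-0.6B": {"params_m": 596.0, "dim": 1024, "score": 75.42},
    "F2LLM-v2-14B": {"params_m": 13990.0, "dim": 5120, "score": 80.75},
}

BYTES_PER_FLOAT32 = 4      # assumption: vectors stored uncompressed as float32
NUM_VECTORS = 1_000_000    # hypothetical corpus size


def compare_to_baseline(name: str) -> dict:
    """Size ratio, score gap, and raw index size relative to the baseline."""
    base, other = MODELS[BASELINE], MODELS[name]
    return {
        "size_ratio": other["params_m"] / base["params_m"],
        "score_gap": other["score"] - base["score"],
        "index_gib": other["dim"] * BYTES_PER_FLOAT32 * NUM_VECTORS / 2**30,
    }


comparison = {name: compare_to_baseline(name) for name in MODELS}
for name, row in comparison.items():
    print(
        f"{name}: {row['size_ratio']:.1f}x params, "
        f"{row['score_gap']:+.2f} points, {row['index_gib']:.2f} GiB index"
    )
```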

Base Model: the NightOwl Backbone

NightOwl-CodeEmbedding is fine-tuned from Shuu12121/NightOwl, a ModernBERT-style code encoder that was pre-trained from scratch — tokenizer included — rather than adapted from a general-purpose checkpoint. The whole stack, from tokenization to the pre-training objective, is designed around code.

Code-aware tokenizer. NightOwl uses a custom 50,368-token BPE tokenizer in which whitespace is tokenized independently of adjacent words, so indentation gets its own tokens instead of being merged into "leading-whitespace + word" pieces. In code, the same identifier recurs at many indentation depths; folding whitespace into those pieces would waste large parts of the vocabulary on near-duplicate "indent + token" variants. Keeping whitespace separate lets the fixed vocabulary budget cover more genuinely distinct subwords while still representing indentation faithfully — which matters for whitespace-significant languages such as Python.
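
To illustrate the vocabulary argument, the sketch below contrasts two toy pre-tokenization rules. Both regular expressions are hypothetical stand-ins written for this example; they are not the actual NightOwl tokenizer, and BPE merging would still run on top of either.

```python
import re

# Hypothetical rule A: whitespace runs are separate pieces.
SEPARATE_WHITESPACE = re.compile(r"[ \t]+|\n|\w+|[^\w\s]")
# Hypothetical rule B: leading whitespace is glued onto the next piece.
MERGED_WHITESPACE = re.compile(r"[ \t]*\w+|[ \t]*[^\w\s]|\n")

# The same statement at three indentation depths.
CODE = "return x\n    return x\n        return x"


def pieces_containing(pattern: re.Pattern, code: str, word: str) -> set:
    """Distinct pre-token strings that contain the given word."""
    return {piece for piece in pattern.findall(code) if word in piece}


separate_variants = pieces_containing(SEPARATE_WHITESPACE, CODE, "return")
merged_variants = pieces_containing(MERGED_WHITESPACE, CODE, "return")

print(sorted(separate_variants))  # one shared piece for every depth
print(sorted(merged_variants))    # one piece per indentation depth
```

Under rule B each indentation depth produces its own "indent + return" piece, which is the near-duplicate growth the paragraph above describes.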

Two-phase pre-training with line-level masking. NightOwl is trained with masked-language modeling (mlm_probability = 0.3) in two phases:

  • Phase 1 — mixed pre-training: standard random-token MLM over code, natural language, and technical documentation (producing NightOwl-Pre).
  • Phase 2 — code-only continuation: line-level MLM, where entire source-code lines are masked instead of random tokens. This aligns the pre-training objective with code search and retrieval, where the unit of meaning is closer to a line or statement than an isolated token. The recommended NightOwl checkpoint is this Phase-2 result.
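
A minimal sketch of what line-level masking could look like. The selection rule here (each line is masked independently with probability `mlm_probability`) is an assumption; the card does not say how lines are chosen to meet the masking budget, and `mask_lines` is a hypothetical helper.

```python
import random

MASK = "[MASK]"


def mask_lines(lines_of_tokens, mlm_probability=0.3, rng=None):
    """Mask whole lines: every token of a selected line becomes MASK.

    Returns (masked_lines, labels). Labels hold the original tokens for
    masked positions and None elsewhere (positions ignored by the loss).
    """
    rng = rng or random.Random()
    masked_lines, labels = [], []
    for tokens in lines_of_tokens:
        if rng.random() < mlm_probability:
            masked_lines.append([MASK] * len(tokens))
            labels.append(list(tokens))
        else:
            masked_lines.append(list(tokens))
            labels.append([None] * len(tokens))
    return masked_lines, labels


example = [
    ["def", "area", "(", "r", ")", ":"],
    ["pi", "=", "3.14159"],
    ["return", "pi", "*", "r", "*", "r"],
]
masked, targets = mask_lines(example, rng=random.Random(0))
for line in masked:
    print(" ".join(line))
```

Random-token MLM would instead draw the mask decision per token, so a masked line could still leak most of its content through unmasked neighbours.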

Backbone architecture (base):

| Property | Value |
| --- | --- |
| Architecture | ModernBERT (alternating local/global attention, RoPE) |
| Parameters | ≈150M |
| hidden_size / layers / heads | 768 / 19 / 12 |
| Vocabulary | 50,368 (custom code BPE) |
| Max sequence length | 1,024 (Phase 1) → 2,048 (Phase 2) |

Pre-training data mixes bigcode/starcoder2data-extras (Kaggle notebooks, StackOverflow threads, GitHub issues, technical documentation, …) with whole-file source from Shuu12121/github-file-programs-dataset across the eight supported languages. Long examples are split into chunks so all tokens are used rather than truncated.
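
Chunking instead of truncating can be sketched as follows. Non-overlapping fixed-size windows are an assumption (the card does not specify overlap or special-token handling), and `chunk_token_ids` is a hypothetical helper.

```python
def chunk_token_ids(token_ids, chunk_size):
    """Split a token sequence into consecutive windows of at most chunk_size.

    Every token lands in exactly one chunk, so nothing is discarded.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        token_ids[start:start + chunk_size]
        for start in range(0, len(token_ids), chunk_size)
    ]


long_example = list(range(2500))           # stand-in for tokenized source
chunks = chunk_token_ids(long_example, 1024)
print([len(chunk) for chunk in chunks])
```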

Used as a backbone (that is, before the retrieval fine-tuning described in this card), NightOwl reaches 0.8436 average MRR on MTEB CodeSearchNetRetrieval when fine-tuned under a fixed SentenceTransformer protocol, ahead of CodeBERT-base (0.7944), GraphCodeBERT-base (0.8325), UniXCoder-base (0.8003), and ModernBERT-base (0.8182) fine-tuned and evaluated the same way. NightOwl-CodeEmbedding builds the retrieval model described in this card on top of that backbone.

Training

The model was trained with CachedMultipleNegativesRankingLoss using bidirectional query-to-document and document-to-query objectives.

| Property | Value |
| --- | --- |
| Training samples | 2,534,400 |
| Positives per anchor | 1 |
| Negatives per anchor | 15 |
| Loss | CachedMultipleNegativesRankingLoss |
| Objective | Bidirectional retrieval training |
| Hard-negative mining model | Qwen/Qwen3-Embedding-0.6B |
| Epochs | 1 |
| Learning rate | 6e-5 |
| Batch size | 1024 |
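
For intuition, a bidirectional in-batch ranking objective can be sketched in plain Python. This is a simplified illustration, not the library implementation: it omits the gradient caching that makes large batches feasible and the extra hard-negative columns, and the `scale` of 20 is an assumed value rather than a setting reported in this card.

```python
import math


def cross_entropy_diagonal(scores):
    """Mean cross-entropy where row i's correct column is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)
        log_partition = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_partition - row[i]
    return total / len(scores)


def bidirectional_ranking_loss(similarities, scale=20.0):
    """Average of query-to-document and document-to-query losses.

    `similarities[i][j]` is the cosine similarity of query i and document j;
    the matching pair for query i is assumed to be document i.
    """
    scaled = [[scale * s for s in row] for row in similarities]
    transposed = [list(column) for column in zip(*scaled)]
    return 0.5 * (cross_entropy_diagonal(scaled) + cross_entropy_diagonal(transposed))


well_separated = [[0.9, 0.1], [0.2, 0.8]]
confused = [[0.1, 0.9], [0.8, 0.2]]
print(bidirectional_ranking_loss(well_separated))
print(bidirectional_ranking_loss(confused))
```

The loss is small when each query scores its own document highest and each document scores its own query highest, and grows when either direction is confused.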

Training Data

The training data is a mixture of:

  1. Public code-retrieval datasets covering the following MTEB(Code, v1) task families (the ten CoIR tasks plus CodeSearchNetRetrieval): AppsRetrieval, COIRCodeSearchNetRetrieval, CodeFeedbackMT, CodeFeedbackST, CodeSearchNetCCRetrieval, CodeSearchNetRetrieval, CodeTransOceanContest, CodeTransOceanDL, CosQA, StackOverflowQA, and SyntheticText2SQL.
  2. Custom code–comment pair data: code snippets paired with natural-language description comments across the eight supported languages.
  3. Code-edit data derived from commitpackft, pairing edit intents with code changes.

All datasets were constructed as hard-negative retrieval datasets: for each anchor, one positive and fifteen hard negatives. Hard negatives were mined with Qwen/Qwen3-Embedding-0.6B, which retrieves semantically similar but non-matching candidates — considerably harder than random negatives. The mining model is used only during dataset construction and is not required at inference time.

This setup is intended to sharpen discrimination between code snippets, programming questions, edit examples, and technically similar retrieval candidates.
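
The selection step of hard-negative mining can be sketched as a top-k over similarity scores. The scores are assumed to come from the mining model, `mine_hard_negatives` is a hypothetical helper, and real pipelines often add false-negative filtering that this card does not describe.

```python
def mine_hard_negatives(scores, positive_index, num_negatives=15):
    """Return indices of the highest-scoring candidates that are not the positive.

    `scores[i]` is the mining model's similarity between the anchor and
    candidate i.
    """
    candidates = [i for i in range(len(scores)) if i != positive_index]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:num_negatives]


# Toy example: candidate 1 is the true match for the anchor.
anchor_scores = [0.90, 0.95, 0.20, 0.80, 0.50]
hard_negatives = mine_hard_negatives(anchor_scores, positive_index=1, num_negatives=2)
print(hard_negatives)
```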

Data Decontamination

To reduce benchmark contamination, the following overlaps were removed from the training data before training:

  • Overlaps between the custom code–comment pair data and the CodeSearchNet test split
  • Overlaps between the commitpackft-derived code-edit data and the CodeEditSearchRetrieval benchmark evaluation data
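
One way such overlap removal might be implemented is exact matching on a whitespace-normalized fingerprint. This criterion is an assumption made for illustration; the card does not state how overlaps were detected, and both helpers below are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text with all whitespace runs collapsed to single spaces."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_overlaps(training_texts, benchmark_texts):
    """Drop training items whose fingerprint also appears in the benchmark."""
    benchmark_fingerprints = {fingerprint(t) for t in benchmark_texts}
    return [t for t in training_texts if fingerprint(t) not in benchmark_fingerprints]


train = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
]
benchmark = ["def add(a, b):\n\treturn a + b"]  # same code, different indentation
clean_train = remove_overlaps(train, benchmark)
print(len(clean_train))
```

Normalization is used only for matching; the retained training items are left unchanged.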

For CodeEditSearchRetrieval, note that MTEB labels the evaluation split train. This is just the official split name for the task; the evaluated examples were not included in this model's fine-tuning data. The reported score should therefore be read as in-domain generalization on held-out benchmark examples: it is not training-set performance, but because the training distribution is in-domain, it is not strictly zero-shot performance either.

Intended Use

  • Natural language to code search
  • Code-to-code retrieval and similar-function search
  • Code-edit retrieval (matching edit intents to code changes)
  • Retrieval over programming Q&A and technical questions
  • Local semantic code search systems
  • RAG systems over codebases and developer documentation

Example use cases include indexing functions, snippets, programming solutions, StackOverflow-style answers, code-review examples, and edit-related code examples.

Limitations

  • Specialized for code-related retrieval; may underperform general-purpose text embedding models on unrelated natural-language tasks.
  • Inputs longer than 1,024 tokens are truncated — a shorter context window than some competitors (e.g. the 8K+ token F2LLM and granite models), so very long files must be chunked.
  • MTEB(Code, v1) is largely in-domain for this model (8% zero-shot). On code domains, query styles, or languages far from the training distribution, expect lower performance than the leaderboard numbers suggest.
  • Performance may vary by programming language, query style, and the granularity of indexed code chunks; languages outside the eight supported ones are untested.
  • The model produces dense single-vector embeddings. For applications that need very fine-grained token-level matching, other retrieval approaches — such as late-interaction (multi-vector) models or cross-encoder rerankers — may be worth considering, with different trade-offs in index size and retrieval infrastructure.

Recommended Indexing Settings

Encode both queries and documents with normalized embeddings:

embeddings = model.encode(texts, normalize_embeddings=True)

With normalized embeddings, dot product is equivalent to cosine similarity.
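
The equivalence can be checked on toy vectors, with no model required:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]
document = [1.0, 2.0, 2.0]

cosine_raw = cosine(query, document)
dot_normalized = dot(l2_normalize(query), l2_normalize(document))
print(cosine_raw, dot_normalized)
```

This is why a vector index configured for inner-product search gives cosine rankings once the stored vectors are normalized.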

For codebase search, indexing function-level or class-level chunks usually works best. Very long files may exceed the 1,024-token context limit and should be split into smaller semantic chunks.
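
For Python sources, function-level and class-level chunks can be extracted with the standard-library `ast` module. This is a sketch under narrow assumptions: it covers Python only, takes top-level definitions only, and does not include decorators in the extracted text. `function_level_chunks` is a hypothetical helper; other languages would need their own parser.

```python
import ast


def function_level_chunks(source: str):
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks


SAMPLE_SOURCE = '''
import os

def add(a, b):
    return a + b

class Greeter:
    def greet(self, name):
        return "hi " + name
'''

code_chunks = function_level_chunks(SAMPLE_SOURCE)
for name, text in code_chunks:
    print(name, len(text.splitlines()))
```

Each chunk's text can then be passed to `model.encode(...)` as one document; chunks that still exceed the 1,024-token limit would need further splitting.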

Citation

If you use this model, please cite it together with the base model and Sentence Transformers.

@misc{nightowl_codeembedding,
  title = {NightOwl-CodeEmbedding},
  author = {Shuu12121},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shuu12121/NightOwl-CodeEmbedding}
}