Hebrew Semantic Retrieval – 3rd Place Solution

Competition: Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the Israel National NLP Program

Result: 🥉 3rd place – NDCG@20 = 0.652538 (private test set) · 0.432286 (public test set)

Author: kdbrodt


Overview

This repository contains the complete inference code and fine-tuned models for the 3rd-place solution to the Hebrew Semantic Retrieval Challenge. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by NDCG@20.

The solution is a clean, end-to-end two-stage retrieve-then-rerank pipeline built entirely on the FlagEmbedding (BAAI/bge-m3) family. Both the dense embedder and the cross-encoder reranker were fine-tuned directly on the competition's annotated Hebrew data.


The Challenge

Property           Detail
Organizer          MAFAT DDR&D + Israel National NLP Program
Corpus size        127,731 Hebrew paragraphs
Data sources       Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols
Evaluation metric  NDCG@20
Phase I            Public leaderboard (Codabench)
Phase II           Private test set with additional human annotation of previously unseen retrievals
Relevance scale    0–4 (human annotated)

Solution Architecture

A straightforward two-stage pipeline: dense retrieval followed by cross-encoder reranking.

Query
  β”‚
  β–Ό
[BGE-M3 Dense Retriever]  (fine-tuned, CLS pooling, FP16)
  β”‚  cosine similarity over 127k passages
  β–Ό
Top-100 Candidates
  β”‚
  β–Ό
[BGE-Reranker-v2-M3]  (fine-tuned binary classifier, FP16)
  β”‚  query-passage pairs scored, max_length=512
  β–Ό
Final Top-20 Results

Stage 1 β€” Dense Retrieval

The fine-tuned bge-m3 encoder produces CLS-token embeddings (L2-normalized, FP16) for all corpus passages at preprocessing time. At query time, a single query embedding is computed and scored against all corpus embeddings via dot-product similarity (equivalent to cosine similarity on normalized vectors). The top-100 passages are selected for reranking.

Property             Value
Model                test_encoder_only_base_bge_m3_new1 (fine-tuned BAAI/bge-m3)
Pooling              CLS token
Normalization        L2
Precision            FP16
Max length           512 tokens
Batch size (corpus)  64
Retrieval pool       Top-100 candidates
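
The scoring step can be sketched independently of the encoder: on L2-normalized embeddings, a dot product is exactly cosine similarity, and top-k selection only needs a partial sort. A minimal NumPy sketch (the `l2_normalize` and `dense_retrieve` helpers and the random "embeddings" are illustrative, not the repository's code):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Normalize rows to unit length so dot product == cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def dense_retrieve(query_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 100):
    # query_emb: (d,), corpus_embs: (n, d); both assumed L2-normalized
    # (FP16 in the real pipeline, float64 here for simplicity).
    scores = corpus_embs @ query_emb           # (n,) cosine similarities
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k (partial sort)
    top = top[np.argsort(-scores[top])]        # order top-k by score, descending
    return top, scores[top]

# Toy example with random stand-in embeddings
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.normal(size=(10, 8)))
query = l2_normalize(rng.normal(size=8))
idx, sc = dense_retrieve(query, corpus, k=3)
```

In the real pipeline the same scoring runs over all 127,731 precomputed corpus embeddings with k = 100.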

Stage 2 β€” Cross-Encoder Reranking

The top-100 candidates are re-scored by the fine-tuned bge-reranker-v2-m3, a sequence classification model that takes concatenated [query, passage] pairs as input and outputs a relevance logit. Passages are sorted by length before scoring to minimize padding overhead. The top-20 by reranker score are returned.

Property    Value
Model       test_encoder_only_base_bge_reranker_v2_m3_new1 (fine-tuned BAAI/bge-reranker-v2-m3)
Max length  512 tokens
Batch size  16
Output      Top-20 by reranker logit
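
The reranking loop can be sketched with a pluggable scoring function standing in for the fine-tuned cross-encoder. `rerank` and `score_fn` below are hypothetical names; in the real pipeline the scores are the model's relevance logits over tokenized [query, passage] pairs:

```python
def rerank(query, candidates, score_fn, top_k=20, batch_size=16):
    # candidates: list of (doc_id, passage) tuples.
    # score_fn: maps a list of (query, passage) pairs to a list of floats
    # (the cross-encoder's relevance logits in the real pipeline).
    # Sort by passage length so each batch has similar lengths,
    # minimizing padding overhead.
    order = sorted(range(len(candidates)), key=lambda i: len(candidates[i][1]))
    scores = {}
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        pairs = [(query, candidates[i][1]) for i in batch]
        for i, s in zip(batch, score_fn(pairs)):
            scores[candidates[i][0]] = s
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
    return [{"paragraph_uuid": d, "score": s} for d, s in ranked]
```

The length-sorted batching changes only throughput, not the final ranking, since every candidate is scored exactly once.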

Fine-Tuning

Both models were fine-tuned on the competition's annotated Hebrew training set using the FlagEmbedding framework.

Training data construction:

  • Every query–document pair with a positive relevance score (> 0) was treated as a positive example.
  • Every pair with a score of 0 was treated as a negative example.
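
This thresholding of the graded 0–4 labels can be sketched as follows, assuming annotations arrive as (query, doc_id, relevance) triples (`build_pairs` is an illustrative helper, not part of the repository):

```python
def build_pairs(annotations):
    # annotations: iterable of (query, doc_id, relevance) triples,
    # with relevance on the competition's 0-4 human-annotated scale.
    positives, negatives = [], []
    for query, doc_id, rel in annotations:
        # Any positive grade (> 0) is a positive pair; grade 0 is a negative.
        (positives if rel > 0 else negatives).append((query, doc_id))
    return positives, negatives
```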

Embedder (bge-m3): Trained with KL-divergence loss to produce embeddings that better separate relevant from irrelevant documents.

Reranker (bge-reranker-v2-m3): Trained as a binary classifier on the same positive/negative pairs, learning to predict relevance probability directly.

Hyperparameter         Value
Epochs                 2
Batch size per device  2
Learning rate          5e-6
Hardware               2 × NVIDIA Tesla V100-SXM2-32GB
Training time          ~1 hour

Included Models (fine-tuned)

Path in repo                                            Base model               Fine-tuning
models/test_encoder_only_base_bge_m3_new1/              BAAI/bge-m3              KL-divergence loss on competition data ✨
models/test_encoder_only_base_bge_reranker_v2_m3_new1/  BAAI/bge-reranker-v2-m3  Binary classification on competition data ✨

Repository Structure

model.py      ← Full inference pipeline (preprocess + predict)
prepare.py    ← Data preparation script
train.sh      ← Training script
models/
  test_encoder_only_base_bge_m3_new1/                  ← Fine-tuned BGE-M3 embedder ✨
  test_encoder_only_base_bge_reranker_v2_m3_new1/      ← Fine-tuned BGE reranker ✨

Usage

The pipeline exposes two functions matching the competition API:

from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time; the example query asks "What are the rights of apartment tenants?"
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 1.23}, ...]  (top-20)

Requirements:

torch
transformers
numpy

Hardware: A CUDA-capable GPU is required. Inference takes less than 1.5 hours on a g5.xlarge instance.


Reproducing the Models

1. Prepare data:

# Download competition data and unzip into `hsrc/` folder
python prepare.py

2. Train:

sh ./train.sh

Training takes ~1 hour on 2 × V100-SXM2-32GB GPUs.


Technical Notes

  • Both models are loaded in FP16 via torch_dtype=torch.float16 and device_map for automatic GPU placement.
  • Corpus passages are sorted by length before embedding to reduce padding overhead during batch encoding.
  • The reranker also sorts candidates by passage length before scoring batches.
  • Fallback: if reranking fails, the pipeline falls back to returning the top-20 by dense retrieval score.
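
The fallback in the last bullet amounts to a try/except around the reranker. A minimal sketch, with `predict_top20` and `rerank_fn` as hypothetical names:

```python
def predict_top20(query, dense_results, rerank_fn, k=20):
    # dense_results: list of {"paragraph_uuid": ..., "score": ...} dicts,
    # already sorted by dense-retrieval score, descending.
    try:
        return rerank_fn(query, dense_results)[:k]
    except Exception:
        # Reranker failure: fall back to the dense-retrieval ordering.
        return dense_results[:k]
```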

Results

Phase               NDCG@20   Rank
Public (Phase I)    0.432286  🥉 3rd
Private (Phase II)  0.652538  🥉 3rd

The large gap between public and private scores reflects the private phase's additional human annotation of previously unjudged retrieved documents, which significantly boosted NDCG for systems that had retrieved relevant but unannotated paragraphs.


Citation

If you use this solution or the models in this repository, please acknowledge the Hebrew Semantic Retrieval Challenge by MAFAT DDR&D and the Israel National NLP Program, and credit kdbrodt as the solution author.


Acknowledgements

  • MAFAT DDR&D and the Israel National NLP Program for organizing the challenge and providing the annotated Hebrew corpus.
  • The authors of BAAI/bge-m3 and BAAI/bge-reranker-v2-m3.
  • The FlagEmbedding team for the training framework.