baa.ai · Merino-Pro-4bit
The premium unified retrieval model — bi-encoder embedding and cross-encoder reranking in one package, over a single shared word-embedding table. A 1024-dimensional multilingual model, by BAA AI (Black Sheep AI). 872M params, 4-bit quantized (0.5 GB).
Get the optimal model for your data
Merino-Pro-4bit is baa.ai's flagship default. But the best embedder + reranker is corpus-specific — the ideal choice depends on your documents and your notion of relevance. baa.ai offers exclusive tooling that identifies the optimal embedding and reranking models for your specific data, so you ship the smallest models that maximize document recovery on your corpus. For a tailored recommendation, reach out to baa.ai.
What it is
A two-role retrieval model over a shared input word-embedding matrix (~256M params, stored once). The bi-encoder embedder and the cross-encoder reranker are both built on the xlm-roberta-large backbone and co-trained to share a single word-embedding table at no quality loss, while each role keeps its own native layers and head. 4-bit weight quantization is lossless on this stack; the embedder is the limiting role, and the reranker tolerates lower bit widths.
- Embed role: bi-encoder, 1024-d, L2-normalized. Prepend "query: " to queries.
- Rerank role: cross-encoder, single relevance logit per (query, document) pair.
- Router: call .embed(...) or .rerank(...).
Usage
from modeling_baa import BaaEmbeddingReranker # included in this repo
m = BaaEmbeddingReranker("baa-ai/Merino-Pro-4bit")
qv = m.embed(["my query"], is_query=True)[0] # 1024-d normalized
dv = m.embed(["doc a", "doc b"])
ranked = m.rerank("my query", ["doc a", "doc b"], top_k=10) # [(doc, score), ...]
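The two roles are typically combined as a two-stage pipeline: shortlist by embedding similarity, then rerank the shortlist. The sketch below is a minimal illustration, not code shipped in this repo. The function name `retrieve_then_rerank`, its parameters, and the injected callables are hypothetical; it assumes embeddings can be iterated as plain floats and that the reranker returns (doc, score) pairs, as in the snippet above.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; equals the dot product when both vectors are L2-normalized."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    embed_query: Callable[[str], Vector],
    embed_docs: Callable[[List[str]], List[Vector]],
    rerank: Callable[[str, List[str]], List[Tuple[str, float]]],
    shortlist_size: int = 50,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Stage 1: bi-encoder shortlist. Stage 2: cross-encoder rerank of the shortlist."""
    query_vec = embed_query(query)
    doc_vecs = embed_docs(docs)
    # Rank every document by embedding similarity to the query.
    by_similarity = sorted(
        zip(docs, doc_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    shortlist = [doc for doc, _ in by_similarity[:shortlist_size]]
    # Only the shortlist is passed to the (more expensive) reranker.
    return rerank(query, shortlist)[:top_k]
```

With the model loaded as m, the callables could plausibly be wired as embed_query=lambda q: m.embed([q], is_query=True)[0], embed_docs=m.embed, and rerank=lambda q, d: m.rerank(q, d, top_k=10), assuming the return types noted above.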
License & attribution
- BAA Contributions (shared-embedding architecture, router/loader code, packaging, weights, docs) are proprietary to BAA AI (Black Sheep AI); see LICENSE.
- Incorporates the xlm-roberta-large backbone under the MIT License; see LICENSE-xlm-roberta-large.txt.
© 2026 BAA AI (Black Sheep AI) — baa.ai. Provided "as is" without warranty.
Certification & corpus fit (2026-07)
Position Balance (PB): 0.31. PB measures how findable a chunk is through its second fact when two facts share one embedding: the ratio of second-fact to first-fact top-1 retrieval on an adversarial 1,300-chunk audit (fleet range 0.16–0.69). Single-fact quality is best-in-fleet (1.00). Strongest choice for topically unified corpora (abstracts, product data); for heterogeneous multi-fact documents, prefer Merino-Large or enforce strictly atomic chunking.
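The PB ratio can be reproduced on your own audit data with a few lines. The sketch below assumes only what the definition above states, namely that PB is the second-fact top-1 hit rate divided by the first-fact top-1 hit rate. The function names and the boolean-per-chunk input format are hypothetical, and the audit protocol itself is not reproduced here.

```python
from typing import Sequence


def top1_rate(hits: Sequence[bool]) -> float:
    """Fraction of probes whose target chunk was ranked first."""
    if not hits:
        raise ValueError("need at least one probe")
    return sum(hits) / len(hits)


def position_balance(
    first_fact_hits: Sequence[bool],
    second_fact_hits: Sequence[bool],
) -> float:
    """Ratio of second-fact to first-fact top-1 retrieval (assumed PB definition)."""
    first = top1_rate(first_fact_hits)
    if first == 0:
        raise ValueError("first-fact top-1 rate is zero; PB is undefined")
    return top1_rate(second_fact_hits) / first
```

A value near 1.0 would mean a chunk is about as findable through its second fact as through its first; lower values indicate that the first fact dominates the embedding.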
4-bit quantization: certified lossless under distractor stress. Paired contested-region robustness (gold document injected into pools of up to 100 near-topical distractors, n=300 queries, bootstrap CIs) is statistically indistinguishable from fp16 — an axis standard hit@k benchmarks do not measure.
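For readers who want to run a similar paired comparison on their own queries, here is a minimal percentile-bootstrap sketch. It is an illustration under stated assumptions, not the certification procedure: the per-query 0/1 hit encoding, the function name `paired_bootstrap_ci`, and the resampling settings are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    baseline_hits: Sequence[int],
    quantized_hits: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for the mean per-query difference (quantized minus baseline)."""
    if len(baseline_hits) != len(quantized_hits) or not baseline_hits:
        raise ValueError("inputs must be non-empty and paired query by query")
    rng = random.Random(seed)
    # Pairing: each query contributes one difference, so query difficulty cancels out.
    diffs = [q - b for b, q in zip(baseline_hits, quantized_hits)]
    n = len(diffs)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample queries with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Under this setup, an interval that contains zero is consistent with the quantized model being indistinguishable from the baseline on the measured queries.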
Chunking prescription: embed one atomic claim per chunk and lead with its key entity; retrieve small, return the parent section for context. Basis: single-vector embeddings preserve ~one independent fact per chunk regardless of encoder family (measured across 12 encoders).
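The "retrieve small, return the parent" pattern can be sketched as a lookup from ranked atomic chunks to their parent sections. This is a hypothetical illustration: the `Chunk` type, the `parents_for_hits` function, and the identifiers used are assumptions, and the ranking itself is expected to come from the embed and rerank steps.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str  # section the chunk was cut from
    text: str       # one atomic claim, leading with its key entity


def parents_for_hits(
    ranked_chunks: List[Chunk],
    parents: Dict[str, str],
    max_parents: int = 3,
) -> List[str]:
    """Map ranked chunk hits to distinct parent sections, preserving rank order."""
    ordered_ids: List[str] = []
    for chunk in ranked_chunks:
        if len(ordered_ids) >= max_parents:
            break
        # Several chunks may share a parent; return each section only once.
        if chunk.parent_id not in ordered_ids:
            ordered_ids.append(chunk.parent_id)
    return [parents[parent_id] for parent_id in ordered_ids]
```

Matching happens on the small chunks, where a single-vector embedding is most reliable, while the caller receives the full section for context.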