ORCA6 v0.1-rc1
ORCA6 is an orchestration-advisor model project focused on AI tool selection, workflow architecture, RAG design, model-routing tradeoffs, and developer automation strategy.
Current best local adapter:
qwen3_14b_orca_refusal_smoke/
This checkpoint is a guarded local release candidate, not a broad public-quality model. It should be used with the source-packet runtime guard documented in the evaluation report, runbook, and local adapter CLI.
Intended Use
- Recommend orchestration patterns for AI developer workflows.
- Compare tools such as n8n, LangGraph, LiteLLM, Qdrant, Langfuse, Promptfoo, MCP servers, local inference stacks, and related infrastructure.
- Provide architecture tradeoffs, implementation plans, and conservative next steps.
Not Intended For
- Executing code or tools directly.
- Legal, medical, financial, or safety-critical decisions.
- General code-completion benchmarks unrelated to orchestration.
Training Data
- SFT train rows: 41
- SFT validation rows: 3
- DPO preference rows: 30
- Preference source: auto-graded bootstrap preferences
- Corpus source: GitHub documentation chunks from the ORCA6 pilot retrieval set.
- Final SFT source mix: {"graded_preference": 27, "grounded_sft_builder": 8, "refusal_sft_builder": 6}
- Final answer word count: min=102, max=239, avg=187.05
- Grounded SFT rows include retrieved-evidence citation examples and refusal hard negatives for empty evidence, unsupported claims, high-risk automation, medical-record access, and certification/compliance claims.
Training Dataset Audit
| Dataset | Rows | Sources | Avg answer words | Issues |
|---|---|---|---|---|
| data/training_sft.jsonl | 27 | {"graded_preference": 27} | 202.22 | 0 |
| data/validation_sft.jsonl | 3 | graded_preference | 213.0 | 0 |
| data/training_sft_plus_grounded.jsonl | 35 | {"graded_preference": 27, "grounded_sft_builder": 8} | 198.37 | 0 |
| data/refusal_sft_hard_negatives.jsonl | 6 | {"refusal_sft_builder": 6} | 121.0 | 0 |
| data/training_sft_plus_grounded_refusals.jsonl | 41 | {"graded_preference": 27, "grounded_sft_builder": 8, "refusal_sft_builder": 6} | 187.05 | 0 |
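
The audit columns above can be reproduced with a few lines of Python. This is a minimal sketch, not the project's audit script: the field names `source` and `answer` are assumptions about the JSONL schema, and `audit_rows` and `pooled_mean` are hypothetical helpers. The last lines check that the combined file's average follows from the two files it merges.

```python
from collections import Counter


def audit_rows(rows, source_key="source", answer_key="answer"):
    """Summarize row count, per-source counts, and mean answer length in words.

    The key names are assumed; adjust them to the real JSONL schema.
    """
    sources = Counter(row[source_key] for row in rows)
    word_counts = [len(row[answer_key].split()) for row in rows]
    avg = round(sum(word_counts) / len(word_counts), 2) if word_counts else 0.0
    return {"rows": len(rows), "sources": dict(sources), "avg_answer_words": avg}


def pooled_mean(parts):
    """Combine (row_count, mean) pairs into one mean weighted by row count."""
    total_rows = sum(count for count, _ in parts)
    return sum(count * mean for count, mean in parts) / total_rows


# 35 grounded rows at 198.37 words plus 6 refusal rows at 121.0 words.
combined = pooled_mean([(35, 198.37), (6, 121.0)])
print(round(combined, 2))  # 187.05, the value reported for the 41-row file
```

To audit a real file, each line would be parsed with `json.loads` and the resulting list passed to `audit_rows`.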
Models and Infrastructure Used
| Role | Model / System | Notes |
|---|---|---|
| Initial smoke base | Qwen/Qwen2.5-0.5B-Instruct | Small-model SFT smoke path validation. |
| Attempted 14B base | Qwen/Qwen2.5-14B-Instruct | Initial fetch was too slow/stalled; not the release base. |
| Release base | Qwen/Qwen3-14B | Official adapter base and Hugging Face base_model. |
| Local downloaded base path | .cache/orca6-qwen3-14b-download | Local path used by current helper defaults; historical run used /tmp/orca6-qwen3-14b-download. |
| Preference/judge/bootstrap generation | qwen3-coder:30b via Ollama | Used for local answer generation / auto-graded bootstrap preference data. |
| Embedding model | nomic-embed-text:latest via Ollama | Used to embed pilot corpus chunks into Qdrant. |
| Fine-tuning stack | Unsloth + TRL SFTTrainer + PEFT LoRA / QLoRA | 4-bit training profile on local RTX 3090. |
Training Rounds
| Round | Base model | Adapter output | Rows | Validation rows | Profile | Final loss | Eval / pass result |
|---|---|---|---|---|---|---|---|
| Small-model smoke | Qwen/Qwen2.5-0.5B-Instruct | qwen_finetuned_v0_smoke2/ | 27 | 3 | smoke SFT | eval_loss=1.762, train_loss=2.260, 14 steps | Pipeline smoke only |
| Qwen3 14B smoke | Qwen/Qwen3-14B | qwen3_14b_orca_smoke/ | 27 | 3 | 14B LoRA smoke | eval_loss=1.264, train_loss=1.537, 14 steps | 8 held-out prompts generated; quality not release-ready |
| Grounded fit-check | Qwen/Qwen3-14B | qwen3_14b_orca_grounded_smoke/ | 35 | 3 | seq_length=512, lora_r=16, lora_alpha=32, lora_dropout=0, grad accum 2 | eval_loss=1.211, train_loss=1.592, 36 steps | Grounded eval 11/12 = 91.7% |
| Refusal fit-check / release adapter | Qwen/Qwen3-14B | qwen3_14b_orca_refusal_smoke/ | 41 | 3 | seq_length=512, lora_r=16, lora_alpha=32, lora_dropout=0, grad accum 2 | eval_loss=1.206, train_loss=1.510, 42 steps | Guarded eval 12/12 = 100.0%; expanded release eval 54/54 = 100.0% |
Earlier 2048-token/r64 and 1024-token/r32 grounded-profile attempts hit CUDA OOM with the longer grounded examples. The release fit-check profile settled on 512 tokens and LoRA rank 16 on the RTX 3090.
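
The step-down from 2048/r64 to 512/r16 can be pictured as a fallback ladder over candidate profiles. The sketch below is illustrative only: `LoraProfile`, `first_fitting_profile`, and `fake_fit` are hypothetical names, and the built-in `MemoryError` stands in for a CUDA out-of-memory error so the example runs without a GPU. It does not describe how the project actually selected its profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraProfile:
    seq_length: int
    lora_r: int


# Profiles named in the notes above, largest first.
PROFILES = [LoraProfile(2048, 64), LoraProfile(1024, 32), LoraProfile(512, 16)]


def first_fitting_profile(profiles, try_fit):
    """Return the first profile for which try_fit does not raise MemoryError."""
    for profile in profiles:
        try:
            try_fit(profile)
            return profile
        except MemoryError:
            continue  # too large for the device; try the next smaller profile
    raise RuntimeError("no profile fit in memory")


def fake_fit(profile):
    """Stand-in for a short training run: pretend anything above 512 tokens OOMs."""
    if profile.seq_length > 512:
        raise MemoryError(f"seq_length={profile.seq_length} does not fit")


print(first_fitting_profile(PROFILES, fake_fit))
```

In a real run, `try_fit` would launch a few training steps and the caught exception would be the framework's own out-of-memory error type.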
Retrieval and Vector Database Stack
The model was trained and evaluated around a source-packet workflow rather than free-form citation generation.
| Component | Setting |
|---|---|
| Vector database used for active retrieval | Qdrant |
| Qdrant collection | orca6_pilot |
| Embedding model | nomic-embed-text:latest through Ollama |
| Embedding dimension | 768 |
| Vector distance | Cosine |
| Lexical retrieval | In-process BM25 over data/pilot_orchestration_chunks.jsonl |
| Rank fusion | Reciprocal-rank fusion plus exact-match/domain-cue boosts |
| Vector DBs represented in corpus/tool coverage | Qdrant, Chroma, Weaviate, pgvector |
| Other retrieval/RAG tools represented | LlamaIndex, Ragas, LangGraph, Langfuse, LangSmith, Promptfoo, LiteLLM, Ollama, llama.cpp, vLLM |
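
To make the rank-fusion row concrete, here is a minimal sketch of reciprocal-rank fusion over a vector ranking and a BM25 ranking. The constant `k=60` is the conventional default rather than a documented ORCA6 setting, and the additive `boosts` argument is only one assumed way that exact-match or domain-cue boosts could be applied; the project's actual weighting is not specified here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, boosts=None):
    """Fuse ranked lists of document IDs with reciprocal-rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. `boosts` (assumed, optional) maps a document ID
    to an additive bonus applied after fusion.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    for doc_id, bonus in (boosts or {}).items():
        if doc_id in scores:
            scores[doc_id] += bonus
    # Highest score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # e.g. results from the vector collection
bm25_hits = ["b", "d", "a"]    # e.g. results from the lexical index
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))  # ['b', 'a', 'd', 'c']
```

Documents that appear in both lists ("a" and "b") outrank documents that appear in only one, which is the property that makes the fusion robust to either retriever missing a relevant chunk.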
Evaluation
Latest recorded retrieval metrics:
{
"queries": 20,
"calibrated_pass_at_1": 0.7,
"calibrated_pass_at_3": 1.0,
"calibrated_pass_at_5": 1.0,
"hit_at_1": 1.0,
"hit_at_3": 1.0,
"hit_at_5": 1.0,
"all_expected_at_3": 0.95,
"all_expected_at_5": 1.0,
"all_expected_at_10": 1.0
}
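The hit@k and all_expected@k figures above follow the usual definitions, sketched here under assumptions: each query result is taken to be a dict with a `ranked` list of retrieved chunk IDs and an `expected` list of relevant IDs, which may not match the project's eval schema. The calibrated pass@k metric is project-specific and is not reconstructed.

```python
def hit_at_k(ranked, expected, k):
    """True if at least one expected ID appears in the top k results."""
    return any(doc_id in expected for doc_id in ranked[:k])


def all_expected_at_k(ranked, expected, k):
    """True if every expected ID appears in the top k results."""
    return set(expected) <= set(ranked[:k])


def mean_metric(metric, results, k):
    """Average a per-query boolean metric over all query results."""
    return sum(metric(r["ranked"], r["expected"], k) for r in results) / len(results)


results = [
    {"ranked": ["a", "b", "c"], "expected": ["a", "c"]},
    {"ranked": ["x", "y", "z"], "expected": ["y"]},
]
print(mean_metric(hit_at_k, results, 1))           # 0.5
print(mean_metric(all_expected_at_k, results, 3))  # 1.0
```

The gap between hit@3 = 1.0 and all_expected@3 = 0.95 in the recorded metrics is the kind of difference these two functions separate: one query had a relevant chunk in the top 3 but not all of its expected chunks.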
Latest guarded grounded-answer eval:
{
"outputs": "evals/qwen3_14b_orca_refusal_guarded_eval_outputs.jsonl",
"total": 12,
"passed": 12,
"pass_rate": 1.0,
"by_type": {
"source_packet": {
"passed": 8,
"total": 8
},
"hard_negative": {
"passed": 4,
"total": 4
}
}
}
Expanded v0.1-rc1 grounded release eval:
{
"total": 54,
"passed": 54,
"pass_rate": 1.0,
"by_type": {
"source_packet": {
"passed": 50,
"total": 50
},
"hard_negative": {
"passed": 4,
"total": 4
}
},
"outputs": "evals/qwen3_14b_orca_refusal_release_grounded_outputs.jsonl"
}
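A summary of this shape can be derived from per-prompt records with a small aggregation. The sketch below assumes each record carries a `type` string and a boolean `passed` field; those field names are guesses about the outputs JSONL, not its documented schema, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate per-prompt eval records into total, passed, pass_rate, by_type."""
    by_type = defaultdict(lambda: {"passed": 0, "total": 0})
    for record in records:
        bucket = by_type[record["type"]]
        bucket["total"] += 1
        bucket["passed"] += int(bool(record["passed"]))
    total = sum(bucket["total"] for bucket in by_type.values())
    passed = sum(bucket["passed"] for bucket in by_type.values())
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "by_type": dict(by_type),
    }


# Synthetic records shaped like the expanded release eval counts.
records = [{"type": "source_packet", "passed": True}] * 50
records += [{"type": "hard_negative", "passed": True}] * 4
print(summarize(records)["pass_rate"])  # 1.0
```

Reporting `by_type` alongside the overall rate matters here because the hard-negative bucket is small (4 prompts) and a single failure in it would barely move the aggregate.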
Evaluation Matrix
| Evaluation | Passed | Total | Pass rate | Notes |
|---|---|---|---|---|
| Retrieval calibrated pass@1 | 14 | 20 | 70.0% | Smoke retrieval exact/semantic check |
| Retrieval calibrated pass@3 | 20 | 20 | 100.0% | Re-check matched recorded metrics |
| Retrieval calibrated pass@5 | 20 | 20 | 100.0% | Re-check matched recorded metrics |
| Retrieval all-expected@3 | 19 | 20 | 95.0% | Multi-expected query coverage |
| Retrieval all-expected@5 | 20 | 20 | 100.0% | Multi-expected query coverage |
| Grounded adapter, unguarded | 11 | 12 | 91.7% | Pre-refusal grounded adapter; failed one empty-evidence hard negative |
| Refusal adapter, unguarded | 11 | 12 | 91.7% | Still failed one empty-evidence high-risk citation case |
| Refusal adapter + runtime guard | 12 | 12 | 100.0% | 8/8 source-packet, 4/4 hard-negative |
| Expanded release grounded eval | 54 | 54 | 100.0% | 50/50 source-packet, 4/4 hard-negative |
Unguarded refusal eval passed 11/12. The remaining unguarded failure was an empty-evidence, high-risk payment automation prompt where the model invented a source citation. The current gate therefore requires the runtime source-packet guard.
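
The failure described above, a citation invented when the evidence packet was empty, is the kind of output a post-generation check can catch. The sketch below is a complementary illustration, not the project's runtime guard, which is defined in the evaluation report, runbook, and local adapter CLI. The `[src:ID]` citation syntax and the function names are assumptions made for the example.

```python
import re

# Assumed citation syntax: [src:some-id]. The real format may differ.
CITATION_RE = re.compile(r"\[src:([A-Za-z0-9_.\-]+)\]")


def invented_citations(answer, packet_ids):
    """Return cited source IDs that are absent from the retrieved packet."""
    cited = set(CITATION_RE.findall(answer))
    return sorted(cited - set(packet_ids))


def check_answer(answer, packet_ids):
    """Allow an answer only if every citation points at a retrieved source."""
    invented = invented_citations(answer, packet_ids)
    return {"allowed": not invented, "invented": invented}


# Grounded answer: the cited ID is in the packet.
print(check_answer("Use Qdrant for the pilot [src:qdrant-docs-3].", ["qdrant-docs-3"]))
# Empty-evidence case: any citation is, by construction, invented.
print(check_answer("Automate the payment flow [src:payments-1].", []))
```

With an empty packet every cited ID is flagged, which mirrors the hard-negative case the unguarded adapter failed.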
Artifact Statistics
| Artifact | Path | Size / Count |
|---|---|---|
| Published adapter package | adapter/ on Hugging Face | LoRA adapter, tokenizer, chat template, and config; merged shards/runs/checkpoints excluded from upload |
| Adapter weights | qwen3_14b_orca_refusal_smoke/adapter_model.safetensors | 256,976,504 bytes |
| Local adapter tree | qwen3_14b_orca_refusal_smoke | 27.77 GB including local merged model artifacts under the ignored working tree |
| Local merged model | qwen3_14b_orca_refusal_smoke/merged | 27.52 GB; 6 safetensors shards |
| Local Q8_0 GGUF | release/gguf/orca6-qwen3-14b-refusal-q8_0.gguf | 14.62 GB |
| Release manifest | release/release_manifest.json | 126 tracked release artifacts |
| GitHub release candidate assets | v0.1-rc1 | 109/109 expected assets attached |
| Hugging Face model repo | veroarc/ORCA6 | Adapter, tokenizer, model card, eval reports, release notes |
| Hugging Face feedback Space | veroarc/orca6-feedback | Manual feedback intake UI |
Release artifact checksums are recorded in:
release/release_manifest.json
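Downloaded artifacts can be checked against the manifest with standard-library hashing. This sketch assumes the manifest has an `artifacts` list whose entries carry `path` and `sha256` keys; that layout is a guess, so adapt the key names to the actual file before relying on it.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte artifacts do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, root):
    """Return the paths that are missing or whose SHA-256 differs from the manifest.

    Assumed schema: {"artifacts": [{"path": "...", "sha256": "..."}]}.
    """
    mismatches = []
    for entry in manifest["artifacts"]:
        path = Path(root) / entry["path"]
        if not path.is_file() or sha256_of(path) != entry["sha256"]:
            mismatches.append(entry["path"])
    return mismatches


def load_manifest(manifest_path):
    """Read a manifest JSON file from disk."""
    with open(manifest_path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

A typical call would be `verify_manifest(load_manifest("release/release_manifest.json"), ".")`, where an empty list means every listed artifact matched.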
Limitations
- The current dataset is small and should be treated as a v0 bootstrap.
- Auto-graded preferences are useful for pipeline validation but should be replaced or supplemented with human preference labels.
- Recommendations are only as current as the indexed source corpus.
- The adapter is not intended for unguarded citation-heavy answering. Use a runtime prompt guard that forbids invented source IDs, URLs, integrations, certifications, guarantees, and high-risk actions without retrieved evidence.
- The model must not execute tools or approve irreversible actions.
Release Notes
- Generated: 2026-06-26
- Version: v0.1-rc1
- Base model target: Qwen/Qwen3-14B
- Adapter target: qwen3_14b_orca_refusal_smoke/