Tool-call Verifier Classifier

This document tracks the current tool-call verifier training state for Forge. It is a recovery playbook, not a promotion record. The current published tool-call artifact is telemetry-only and must stay in shadow mode until a replacement passes notebook gates, ONNX parity, release shadow replay, and advisory replay.

The classifier is a DeBERTa sequence-classification sidecar over serialized tool-call contexts. Current published artifacts use serialize_state_v1; new replacement runs should use toolcall-verifier-input/v2 with serialize_state_v2. It runs after deterministic validation: syntax, JSON schema, unknown tools, required steps, prerequisites, unsafe batches, and terminal-tool rules remain Rust-owned and authoritative.

Current Status

| Field | Value |
| --- | --- |
| Base model | microsoft/deberta-v3-small |
| Notebook | notebook/toolcall_verifier_training_production_colab_v5.ipynb |
| Label mode | production |
| Current published input schema | toolcall-verifier-input/v1 |
| Current published serializer | serialize_state_v1 |
| Replacement notebook input schema | toolcall-verifier-input/v2 |
| Replacement notebook serializer | serialize_state_v2 |
| Default runtime mode | shadow |
| Active non-valid thresholds | 1.01 |
| Current published tool-call pin | 548fb906e65a7061504e5100702834d55e31a02f |
| Previous strong tool-call pin | b35b9734b6a3195e335ceb0a11b49d6782fec3b4 |
| Current final-response pin | bb11f0aaece9cae6f9b553e7522cb6d75d9cafbc |

The default published tool-call pin was updated to the June 11 high-coverage run (548fb906e65a7061504e5100702834d55e31a02f). This resolves the training-distribution regression in b8e292b4de5725250bd1698eb5c795ffcb1a4cde, which had F1 0.681 and valid recall 0.41. The new candidate pin achieves a test macro F1 of 0.9014, valid recall of 0.9824, and wrong_tool_semantic precision of 0.9890. However, its valid false-objection rate of 0.0068 still fails the strict 0.005 promotion gate, so it must remain in shadow mode.

Labels

Production mode uses six labels:

| Label | Meaning | Deployment guidance |
| --- | --- | --- |
| valid | Candidate call appears appropriate for the request and workflow state. | Allow. |
| wrong_tool_semantic | Candidate uses the wrong tool for the request or workflow state. | Shadow-only until replay proves precision. |
| wrong_arguments_semantic | Candidate uses a plausible tool but semantically wrong arguments. | Shadow-only until numeric and recovery slices pass. |
| tool_not_needed | Candidate calls a tool when no tool call is needed. | Shadow-only until replay proves safety. |
| needs_clarification | Request is underspecified and should be clarified before tool use. | Ignore as a gate unless support is at least 50 rows. |
| deterministic_invalid | Collapsed bucket for deterministic failures. | Deterministic-only. Never enforce from ML. |

Raw deterministic labels collapse into deterministic_invalid: invalid_args_schema, missing_required_args, unknown_tool, premature_terminal, missing_prerequisite, unsafe_parallel_batch, and malformed_tool_call.
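The collapse described above amounts to a small lookup. The following is a minimal sketch, not the notebook's actual code; the helper name collapse_label and the constant names are hypothetical.

```python
# Raw deterministic labels named in this document; all map to one bucket.
DETERMINISTIC_RAW_LABELS = frozenset({
    "invalid_args_schema",
    "missing_required_args",
    "unknown_tool",
    "premature_terminal",
    "missing_prerequisite",
    "unsafe_parallel_batch",
    "malformed_tool_call",
})

# The six production-mode labels.
PRODUCTION_LABELS = (
    "valid",
    "wrong_tool_semantic",
    "wrong_arguments_semantic",
    "tool_not_needed",
    "needs_clarification",
    "deterministic_invalid",
)


def collapse_label(raw_label: str) -> str:
    """Map a raw label to its production-mode label (hypothetical helper)."""
    if raw_label in DETERMINISTIC_RAW_LABELS:
        return "deterministic_invalid"
    if raw_label in PRODUCTION_LABELS:
        return raw_label
    # Unknown labels are rejected rather than silently bucketed.
    raise ValueError(f"unknown raw label: {raw_label!r}")
```

Rejecting unknown labels is a design choice of this sketch; it keeps a typo in a dataset from being trained as a real class.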

Current Notebook Settings

These are the current recovery defaults that should be preserved unless a new run gives a concrete reason to change them.

Dataset Mix

| Setting | Current value | Reason |
| --- | --- | --- |
| FORGE_AGENT_HF_DATASET_WEIGHT | 1 | Private rows tune Forge slices; they should not dominate. |
| FORGE_AGENT_HF_TRAIN_FRACTION_TARGET | 0.25 | Keep private rows in the 0.15 to 0.30 range. |
| FORGE_AGENT_HF_PUBLIC_ONLY_TRAIN_CAP | 0 | Preserve broad public coverage. |
| FORGE_AGENT_HF_DOWNSAMPLE_PUBLIC_FOR_TARGET | False | Do not shrink the public backbone to satisfy private fraction. |
| PREFER_FORGE_AGENT_HF_DATASET | True | Keep reviewed private rows when present. |
| INCLUDE_PRIVATE_AGENT_LOGS | False | Local agent logs remain opt-in. |
| USE_SERIALIZER_V2 | True | Train/export the metadata-aware schema used by new Forge rows. |

Use group-preserving sampling by example_group_id. If a hard negative is included, keep the paired valid/corrected row in the same group so splitting and sampling do not separate the contrastive pair.
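One common way to keep a contrastive pair together is to derive the split from a hash of example_group_id instead of from the row itself. The sketch below illustrates that idea only; the split fractions and function names are assumptions, and the notebook may implement group-preserving splitting differently.

```python
import hashlib
from collections import defaultdict


def assign_split(example_group_id: str, val_frac: float = 0.05, test_frac: float = 0.10) -> str:
    """Deterministically map a group id to a split (fractions are illustrative)."""
    digest = hashlib.sha256(example_group_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a roughly uniform number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"


def split_rows(rows):
    """Group rows by split; rows sharing example_group_id always land together."""
    splits = defaultdict(list)
    for row in rows:
        splits[assign_split(row["example_group_id"])].append(row)
    return splits
```

Because the split depends only on the group id, a hard negative and its paired valid row cannot end up on opposite sides of the train/test boundary.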

Private Generated Dataset

The private generated dataset used for agent_training_hf in the latest run is addenda/forge-eval-3k-v2/agent_training.notebook.jsonl from the Hugging Face repo cowWhySo/forge-toolcall-verifier-openrouter-2650-v1 (revision 01eedcb861324df5fe5b6584ed4f12995b103d0f), containing 724 agent-derived rows:

| Label | Rows |
| --- | --- |
| valid | 413 |
| tool_not_needed | 241 |
| wrong_arguments_semantic | 38 |
| wrong_tool_semantic | 32 |

This legacy dataset is useful as Forge-style valid-call coverage, but it is not strong wrong-tool training evidence. In this run, 246/247 private wrong-tool rows used a literal synthetic_unrelated_tool distractor, so the negative boundary is mostly a name-level shortcut. The latest pasted evaluation showed agent_training_hf accuracy around 0.975, while the large wrong-tool confusions still came from public datasets. Do not infer from that private score that the classifier has learned real wrong-tool semantics.

For the next private addendum, use forge-dataset reviewed rows rather than the legacy distractor dataset. The generator now creates targeted alternatives only from verified-valid captures, and a reviewer/verifier must accept them before training:

  • prefer real competing tools from the same observed task group when available;
  • include paired valid rows in the same example_group_id;
  • keep schema-valid arguments for the distractor so the label remains semantic wrong-tool, not deterministic invalid or wrong-argument noise;
  • include bounded repeated-tool (tool_not_needed) and underspecified-request (needs_clarification) alternatives;
  • mine high-confidence reviewed quarantines, such as uv lock requested but make build executed, into paired wrong-argument or wrong-tool examples only after verification accepts them as training rows.

Recommended private capture-review mix for the next OpenRouter addendum:

--review-max-alternatives-per-group 4 \
--review-max-alternative-ratio 0.50

After generation, require forge-dataset validate and split_manifest.json to show nonzero counts for valid, wrong_tool_semantic, wrong_arguments_semantic, tool_not_needed, and needs_clarification before using the addendum in a production notebook run.
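The nonzero-count requirement can be checked mechanically once label counts are available. The sketch below takes a plain label-to-count mapping and deliberately assumes nothing about the layout of split_manifest.json; the function name is hypothetical.

```python
# Labels that must have nonzero support before an addendum is used.
REQUIRED_ADDENDUM_LABELS = (
    "valid",
    "wrong_tool_semantic",
    "wrong_arguments_semantic",
    "tool_not_needed",
    "needs_clarification",
)


def missing_labels(label_counts: dict) -> list:
    """Return the required labels whose row count is zero or absent."""
    return [
        label
        for label in REQUIRED_ADDENDUM_LABELS
        if int(label_counts.get(label, 0)) <= 0
    ]


# Counts of the legacy 724-row private dataset listed earlier in this section.
legacy_counts = {
    "valid": 413,
    "tool_not_needed": 241,
    "wrong_arguments_semantic": 38,
    "wrong_tool_semantic": 32,
}
print(missing_labels(legacy_counts))
```

Applied to the legacy counts, the check reports needs_clarification as missing, which is one reason that dataset alone would not satisfy this requirement.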

Uploaded Eval Files

Use this hard-negative glob:

FORGE_HARD_NEGATIVE_GLOB = "/content/*hard_negatives.jsonl"

The previous glob, /content/*.hard_negatives.jsonl, did not match files named rust_smoke.tool_call_hard_negatives.jsonl or rust_smoke.final_response_hard_negatives.jsonl. A corrected T4 audit showed the hard-negative loader working: forge_hard_negative rows were present, with 7 corrected positives and 6 corrected error-recovery positives.
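The mismatch is easy to reproduce with Python's fnmatch module. This sketch assumes the two files sit directly under /content; note that fnmatch lets * cross path separators, unlike glob.glob, which does not matter for these flat paths.

```python
from fnmatch import fnmatchcase

OLD_GLOB = "/content/*.hard_negatives.jsonl"
NEW_GLOB = "/content/*hard_negatives.jsonl"

# File names from this document, assumed to be uploaded directly into /content.
files = [
    "/content/rust_smoke.tool_call_hard_negatives.jsonl",
    "/content/rust_smoke.final_response_hard_negatives.jsonl",
]

# The old pattern requires a literal "." right before "hard_negatives",
# but both files have "_" there, so nothing matches.
old_matches = [f for f in files if fnmatchcase(f, OLD_GLOB)]
new_matches = [f for f in files if fnmatchcase(f, NEW_GLOB)]

print(len(old_matches), len(new_matches))
```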

Telemetry files such as proxy_classifier_budget_8192.jsonl and rust_smoke.jsonl are diagnostics only. Mine them for top-k failures, but do not feed raw top-k telemetry into training or use it as promotion evidence.

Train Rebalance

High-coverage and T4 profiles intentionally use different rebalance behavior. The T4 profile is for cheap diagnosis; it is not promotion evidence.

| Setting | High-coverage default | T4/debug default |
| --- | --- | --- |
| VALID_TRAIN_FRACTION_TARGET | 0.40 | 0.40 |
| VALID_TRAIN_MAX_DUPLICATION_FACTOR | 2 | 2 |
| ENABLE_SEMANTIC_NEGATIVE_TRAIN_REBALANCE | False | False |
| WRONG_TOOL_TRAIN_TO_VALID_RATIO_TARGET | 0.90 (unused while disabled) | 0.55 (unused while disabled) |
| WRONG_ARGUMENTS_TRAIN_TO_VALID_RATIO_TARGET | 0.75 (unused while disabled) | 0.70 (unused while disabled) |
| MAX_SEMANTIC_NEGATIVE_DUPLICATION_FACTOR | 4 | 2 (unused while disabled) |
| MAX_NEEDS_CLARIFICATION_TO_VALID_TRAIN_RATIO | 0.15 | 0.15 |
| ENABLE_VALID_PROTECTION_EXTRA_TRAIN_REBALANCE | True | True |
| VALID_PROTECTION_EXTRA_COPY_FACTOR | 2 | 2 |
| VALID_PROTECTION_EXTRA_COPY_ROWS_CAP | 5000 | 5000 |

Non-valid caps remain:

| Label | Max ratio to valid rows |
| --- | --- |
| deterministic_invalid | 0.35 |
| wrong_tool_semantic | 0.75 |
| wrong_arguments_semantic | 0.90 |
| tool_not_needed | 0.30 |
| needs_clarification | 0.15 |
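To make the caps concrete, the sketch below turns the ratios above into row limits for a given number of valid rows. Rounding down and the function name are assumptions of this example; the notebook's exact rounding is not stated here.

```python
import math
from fractions import Fraction

# Ratios copied from the caps table; kept as strings so Fraction parses them exactly.
NON_VALID_MAX_RATIO_TO_VALID = {
    "deterministic_invalid": "0.35",
    "wrong_tool_semantic": "0.75",
    "wrong_arguments_semantic": "0.90",
    "tool_not_needed": "0.30",
    "needs_clarification": "0.15",
}


def max_rows_per_label(valid_rows: int) -> dict:
    """Return the largest allowed row count per non-valid label (rounded down)."""
    return {
        label: math.floor(Fraction(ratio) * valid_rows)
        for label, ratio in NON_VALID_MAX_RATIO_TO_VALID.items()
    }


print(max_rows_per_label(1000))
```

Using Fraction avoids binary floating-point surprises such as 0.35 * 1000 landing just below an integer boundary.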

Valid-Protection Slices

Track these slices on validation and test. Apply valid recall and false-objection gates when a slice has at least 25 valid rows.

  • terminal-like tools: respond, summarize, report, submit_*, present, recommend, and diagnose,
  • corrected error-recovery positives,
  • fixed-width numeric string arguments, especially zero-padded values such as 0010,
  • no-op valid calls with empty argument objects.
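Three of these slices can be recognized from the candidate call alone. The sketch below is illustrative only: the slice names and the helper are hypothetical, and corrected error-recovery positives are omitted because they depend on row metadata rather than on the call itself.

```python
import re
from fnmatch import fnmatchcase

# Terminal-like tool names from the list above; "submit_*" is a wildcard.
TERMINAL_LIKE_PATTERNS = (
    "respond", "summarize", "report", "submit_*",
    "present", "recommend", "diagnose",
)

# A zero-padded numeric string such as "0010": a leading zero plus more digits.
_ZERO_PADDED = re.compile(r"0\d+")


def valid_protection_slices(tool_name: str, arguments: dict) -> set:
    """Return the (hypothetically named) slices a candidate call falls into."""
    slices = set()
    if any(fnmatchcase(tool_name, pattern) for pattern in TERMINAL_LIKE_PATTERNS):
        slices.add("terminal_like_tool")
    if any(isinstance(v, str) and _ZERO_PADDED.fullmatch(v) for v in arguments.values()):
        slices.add("fixed_width_numeric")
    if arguments == {}:
        slices.add("noop_empty_arguments")
    return slices
```

For example, a call to a tool named submit_report with an empty argument object would land in both the terminal-like and no-op slices, so it counts toward both 25-row support checks.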

Promotion Gates

The immediate notebook gates are:

| Gate | Threshold |
| --- | --- |
| valid recall | >= 0.94 |
| valid false objection at confidence 0.90 | <= 0.005 |
| wrong_tool_semantic precision | >= 0.90 |
| needs_clarification | ignored unless support is at least 50 rows |
| valid-protection slices with at least 25 valid rows | same valid recall and false-objection gates |

Passing the notebook gates is necessary but not sufficient. Promotion also requires FP32 ONNX parity, shadow release replay, false-objection mining, and a later clean advisory replay.
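The three core notebook gates can be written as a small check. This is a sketch with hypothetical metric key names; it leaves out the slice gates and the needs_clarification support rule, and it is not the notebook's gate implementation.

```python
# Core notebook gates from the table above: (comparison, threshold).
NOTEBOOK_GATES = {
    "valid_recall": (">=", 0.94),
    "valid_false_objection_at_0_90": ("<=", 0.005),
    "wrong_tool_semantic_precision": (">=", 0.90),
}


def failed_gates(metrics: dict) -> list:
    """Return the names of core gates that the given metrics do not satisfy."""
    failed = []
    for name, (op, threshold) in NOTEBOOK_GATES.items():
        value = metrics[name]
        passed = value >= threshold if op == ">=" else value <= threshold
        if not passed:
            failed.append(name)
    return failed


# Test-set metrics reported in this document for the June 11 run.
june_11_test = {
    "valid_recall": 0.9824,
    "valid_false_objection_at_0_90": 0.0068,
    "wrong_tool_semantic_precision": 0.9890,
}
print(failed_gates(june_11_test))
```

With the June 11 numbers, only the false-objection gate fails, which matches the stated reason the pin stays in shadow mode.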

Since the v5e notebook patch, promotion_gate_report.json is the single source of truth for notebook-side promotability. It carries promotion_status (blocked or promotable_pending_replay, never plain promotable), blocked_reasons[], and artifact_promotable, which are mirrored into artifact_manifest.json, thresholds.json, candidate_thresholds.json, test_metrics.json, and training_run_summary.json; the notebook raises if any exported file claims promotability while the gate report is blocked. The per-eval eval_checkpoint_constrained_promotable flag is checkpoint-selection telemetry only and uses the strict core gates, replacing the old ambiguous eval_constrained_promotable key. The report also embeds diagnostic-only threshold_sweep, confidence_margin_diagnostics, and per_source_diagnostics keys that never block promotion, and the run exports high_confidence_mistakes.jsonl for manual audit of confident wrong predictions.
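The mirror-and-raise behavior can be pictured as the check below. It is a sketch under assumptions: the field names promotion_status and artifact_promotable come from this document, but the per-file JSON layout, the exception type, and the function name are hypothetical.

```python
class PromotionConsistencyError(RuntimeError):
    """Raised when an export claims promotability that the gate report denies."""


def assert_exports_match_gate(gate_report: dict, exports: dict) -> None:
    """Raise if any exported file claims promotability while the gate is blocked.

    `exports` maps a file name to its parsed JSON content (assumed layout).
    """
    if gate_report.get("promotion_status") != "blocked":
        return
    offenders = [
        name
        for name, payload in exports.items()
        if payload.get("artifact_promotable") is True
        or payload.get("promotion_status") == "promotable_pending_replay"
    ]
    if offenders:
        raise PromotionConsistencyError(
            f"gate report is blocked but these files claim promotability: {sorted(offenders)}"
        )
```

In use, the caller would load promotion_gate_report.json and each mirrored file with json.load and pass the parsed dictionaries in.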

Lessons Learned

Do Not Threshold Around A Bad Boundary

The current published pin learned a bad boundary: valid calls were pushed into wrong_tool_semantic. Lowering or raising thresholds cannot fix that. Treat that artifact as telemetry-only.

Public Coverage Is The Backbone

The bad high-VRAM setup over-corrected toward private data: private fraction 0.60, private weight 4x, and public-only caps around 6000 rows. That shrank broad valid/wrong-tool/wrong-argument coverage and collapsed valid-call generalization. Current defaults restore public coverage and keep private rows as a tuning slice.

Hard Negatives Must Stay Paired

Hard negatives without their valid/corrected counterparts teach the classifier to object broadly. Keep pairs together with example_group_id, and evaluate their slices separately.

Numeric Formatting Is Semantic

For the error_recovery smoke tool, {"count":"0010"} is valid and {"count":"10"} is wrong for that schema. This must be trained and evaluated as a semantic argument distinction, not treated as a harmless formatting issue.

T4 Runs Are Diagnostics

T4 runs exposed data-path and balance issues but are not promotion candidates:

| Run | Useful finding | Failure |
| --- | --- | --- |
| T4 valid-heavy run | valid recall reached 0.947 | valid false objection 0.0132, wrong_tool_semantic precision 0.676, wrong_tool_semantic recall 0.088 |
| T4 semantic-heavy run | wrong_tool_semantic recall recovered to 0.773 | valid recall collapsed to 0.628, wrong_tool_semantic precision only 0.422 |
| T4 softened semantic run | valid recall recovered to 0.794 and wrong_tool_semantic precision improved to 0.528 | still failed valid recall, valid false objection, wrong_tool_semantic precision, and no-op valid slice gates |
| T4 auto/t4_proven recovery run | macro F1 recovered to 0.7603 and valid recall to 0.9109 after the t4_fast collapse | still failed valid recall, valid false objection 0.0127, wrong_tool_semantic precision 0.7273, fixed-width/no-op slice gates, and showed CANDIDATE_CALL truncation around 12.5% |
| T4 openrouter-train-3k run | test valid recall reached 0.9408, wrong_arguments_semantic precision reached 0.9523, and agent_training_hf accuracy reached about 0.975 | validation/test still failed promotion: test valid false objection 0.0128, test wrong_tool_semantic precision 0.8462, wrong-tool recall only about 0.30, and protected valid slices still failed |

The current T4-only rebalance backs off semantic-negative upsampling entirely and keeps extra protected-valid duplication enabled. This is a diagnostic attempt to separate the effects of global valid balance and protected valid support from semantic-negative pressure. Use T4 to iterate on data flow, not to decide promotion. If T4 continues to fail after data-quality fixes, prefer a high-coverage GPU run with a longer context window over more t4_fast ratio chasing.

The openrouter-train-3k result changes the immediate diagnosis. It no longer looks like the model primarily pushes valid calls into wrong_tool_semantic. Instead, it is too permissive on public wrong-tool rows: 755/1139 test wrong_tool_semantic rows were predicted valid, while private agent_training_hf rows were already mostly correct. Fix generated and public wrong-tool evidence before changing gates or thresholds.

The latest auto/t4_proven sidecars also exposed a reporting issue: split balancing produced 25 corrected error-recovery valid rows in both validation and test, but the evaluation slice mask reported zero rows. Slice diagnostics must use the precomputed valid_protection_* columns when present, not only metadata reparsing after JSON dataset reload.

High-Coverage Recovery Is Closer

The best recovery signal so far came from a high-coverage run after public downsampling was disabled:

| Metric | Value |
| --- | --- |
| Test macro F1 | 0.9848 |
| valid recall | 0.9815 |
| wrong_tool_semantic precision | 0.9865 |
| valid false objection at 0.90 | 0.0077 |

That candidate still failed the 0.005 false-objection gate and was not promoted. The latest run on 2026-06-11 (detailed in Latest Run Results below) further improved key metrics: valid recall reached 0.9824, wrong_tool_semantic precision reached 0.9890, and valid false objection at 0.90 was reduced to 0.0068. However, it still fails the strict 0.005 false objection promotion gate on the test set.

Quantized ONNX Is A Separate Candidate

A prior quantized parity result had FP32/quantized top-label agreement around 0.342. Quantized output cannot be trusted just because PyTorch or FP32 ONNX looks good. Calibrate thresholds against the artifact that will actually run.

Required parity gates:

| Check | Gate |
| --- | --- |
| PyTorch vs FP32 ONNX top-label agreement | >= 0.995 |
| Quantized ONNX vs FP32 ONNX top-label agreement | >= 0.98 |

If quantized parity fails, write the parity report, stop packaging/upload, and use FP32 ONNX for replay. Publish quantized only as shadow telemetry until parity is fixed.
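Top-label agreement is the fraction of rows on which two backends pick the same argmax label. The sketch below computes it from two lists of per-row scores; the function names are hypothetical and the inputs are assumed to be aligned row by row.

```python
def argmax(row):
    """Index of the largest score in a row (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def top_label_agreement(reference_scores, candidate_scores) -> float:
    """Fraction of rows where both backends choose the same top label."""
    if len(reference_scores) != len(candidate_scores) or not reference_scores:
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(
        argmax(ref) == argmax(cand)
        for ref, cand in zip(reference_scores, candidate_scores)
    )
    return matches / len(reference_scores)


# Gates from the parity table.
PYTORCH_VS_FP32_MIN = 0.995
QUANTIZED_VS_FP32_MIN = 0.98

fp32 = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
quantized = [[0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.9, 0.1]]
agreement = top_label_agreement(fp32, quantized)
print(agreement, agreement >= QUANTIZED_VS_FP32_MIN)
```

The toy inputs disagree on one of four rows, so this example fails the quantized gate; real parity runs would use the full evaluation set.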

Final-Response Verifier Is Separate

The final-response verifier is a separate artifact family and is not mature enough for active behavior. A recent runtime replay labeled 302/302 final responses as failed_to_acknowledge_data_gap at low confidence. Keep it shadow-only and document/evaluate it separately.

Latest Run Results (2026-06-11)

The latest high-coverage run on June 11, 2026, was executed with enable_forge_augmentation=True and enable_final_response_verifier=True.

Dataset Statistics

During preprocessing, 33,056 deterministic invalid rows were removed. In addition, 62 rows were quarantined due to source-quality flags (forge_argument_semantic, forge_contrastive_wts, forge_hard_negative, and forge_synthetic), leaving 290,019 rows after quarantine.

After applying group-preserving label caps (max 50,000 per label), the dataset size was reduced to 226,599 rows, preserving all preferred private HF rows.

Capped training rows by source and label:

  • Salesforce/xlam-function-calling-60k: 130,870 rows (valid: 47,237, wrong_arguments: 45,568, wrong_tool: 37,221, needs_clarification: 538, tool_not_needed: 12,844)
  • glaiveai/glaive-function-calling-v2: 48,763 rows (valid: 19,398, wrong_arguments: 18,350, wrong_tool: 5,314, needs_clarification: 237, tool_not_needed: 5,414)
  • Team-ACE/ToolACE: 27,713 rows (valid: 9,486, wrong_arguments: 8,926, wrong_tool: 7,184, needs_clarification: 120, tool_not_needed: 2,017)
  • agent_training_hf: 724 rows (valid: 413, wrong_arguments: 38, wrong_tool: 32, tool_not_needed: 241)
  • forge_error_recovery_protected: 2,559 rows (valid: 525, wrong_arguments: 1,509, wrong_tool: 525)
  • forge_fixed_width_numeric: 1,874 rows (valid: 570, wrong_arguments: 1,304)
  • forge_trace: 1,069 rows (valid: 1,051, wrong_arguments: 18)
  • forge_error_recovery_numeric: 419 rows (valid: 60, wrong_arguments: 359)
  • forge_augmented: 100 rows (needs_clarification: 100)

Final split sizes:

  • Train: 190,692 rows (after valid rebalancing duplication factor of 2)
  • Validation: 11,293 rows
  • Test: 22,370 rows

Training Profile

  • Device: NVIDIA RTX PRO 6000 Blackwell Server Edition (95 GB VRAM)
  • Batch Size: 64 (gradient accumulation: 1)
  • Max Sequence Length: 1,280
  • Optimizer: adamw_torch_fused
  • Gradient Checkpointing: Disabled
  • Epochs: 5

Training Progress & Best Checkpoint

The best model checkpoint was saved at step 13384 (end of Epoch 4) based on the gate_deficit_score metric.

| Epoch | Training Loss | Validation Loss | Validation Accuracy | Validation Macro F1 | Valid Recall | Valid False Objection at 0.90 | Wrong Tool Precision | Wrong Arguments Recall | Gate Deficit Score | Checkpoint Constrained Promotable |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.1653 | 0.1621 | 0.9486 | 0.7662 | 0.9005 | 0.0290 | 0.9089 | 0.9786 | 0.8367 | False |
| 2 | 0.1033 | 0.0832 | 0.9730 | 0.8536 | 0.9684 | 0.0069 | 0.9810 | 0.9781 | 101.0584 | False |
| 3 | 0.0733 | 0.0845 | 0.9752 | 0.8716 | 0.9809 | 0.0079 | 0.9988 | 0.9820 | 101.0623 | False |
| 4 | 0.0526 | 0.0657 | 0.9792 | 0.9273 | 0.9817 | 0.0048 | 0.9873 | 0.9783 | 101.0911 | True |
| 5 | 0.0485 | 0.0624 | 0.9806 | 0.9422 | 0.9796 | 0.0051 | 0.9865 | 0.9799 | 101.0881 | False |

Test Evaluation Results

Evaluated on the held-out test split of 22,370 rows:

| Metric | Value |
| --- | --- |
| Test Accuracy | 0.9780 |
| Macro F1 (5 Active Labels) | 0.9014 |
| Macro F1 (All Labels) | 0.7512 |
| valid Recall | 0.9824 |
| valid Precision | 0.9583 |
| valid False Objection at 0.90 | 0.0068 (53 false objections / 7,836 valid rows) |
| wrong_tool_semantic Precision | 0.9890 |
| wrong_tool_semantic Recall | 0.9718 |
| wrong_arguments_semantic Precision | 0.9878 |
| wrong_arguments_semantic Recall | 0.9793 |
| valid to wrong_arguments_semantic Error Rate | 0.0103 |
| wrong_tool to wrong_arguments_semantic Rate | 0.0008 |
| Gate Deficit Score | 101.0743 |

Test Classification Report

                          precision    recall  f1-score   support

                   valid       0.96      0.98      0.97      7836
     wrong_tool_semantic       0.99      0.97      0.98      4921
wrong_arguments_semantic       0.99      0.98      0.98      7543
         tool_not_needed       1.00      1.00      1.00      1969
     needs_clarification       0.85      0.44      0.58       101
   deterministic_invalid       0.00      0.00      0.00         0

                accuracy                           0.98     22370
               macro avg       0.80      0.73      0.75     22370
            weighted avg       0.98      0.98      0.98     22370

Test Confusion Matrix

| True \ Predicted | valid | wrong_tool_semantic | wrong_arguments_semantic | tool_not_needed | needs_clarification | deterministic_invalid |
| --- | --- | --- | --- | --- | --- | --- |
| valid | 7698 | 48 | 81 | 3 | 6 | 0 |
| wrong_tool_semantic | 133 | 4783 | 4 | 1 | 0 | 0 |
| wrong_arguments_semantic | 149 | 2 | 7388 | 2 | 2 | 0 |
| tool_not_needed | 1 | 2 | 0 | 1966 | 0 | 0 |
| needs_clarification | 50 | 1 | 6 | 0 | 44 | 0 |
| deterministic_invalid | 0 | 0 | 0 | 0 | 0 | 0 |
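Per-label precision and recall follow directly from this matrix: recall divides the diagonal cell by its row total, and precision divides it by its column total. The sketch below recomputes them from the counts above; the helper names are hypothetical.

```python
LABELS = [
    "valid",
    "wrong_tool_semantic",
    "wrong_arguments_semantic",
    "tool_not_needed",
    "needs_clarification",
    "deterministic_invalid",
]

# Rows are true labels, columns are predicted labels (counts from the matrix above).
CONFUSION = [
    [7698, 48, 81, 3, 6, 0],
    [133, 4783, 4, 1, 0, 0],
    [149, 2, 7388, 2, 2, 0],
    [1, 2, 0, 1966, 0, 0],
    [50, 1, 6, 0, 44, 0],
    [0, 0, 0, 0, 0, 0],
]


def recall(label: str) -> float:
    """Correct predictions divided by all rows whose true label is `label`."""
    i = LABELS.index(label)
    support = sum(CONFUSION[i])
    return CONFUSION[i][i] / support if support else 0.0


def precision(label: str) -> float:
    """Correct predictions divided by all rows predicted as `label`."""
    i = LABELS.index(label)
    predicted = sum(row[i] for row in CONFUSION)
    return CONFUSION[i][i] / predicted if predicted else 0.0


for label in LABELS:
    print(f"{label:26s} precision={precision(label):.4f} recall={recall(label):.4f}")
```

The needs_clarification row shows why that label is not gated: 50 of its 101 rows are predicted valid.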

Per-Source and Per-Label Accuracies

Per-Source Accuracy:

  • Salesforce/xlam-function-calling-60k: 14,071 rows, Accuracy: 97.73% (Avg Conf: 0.9882)
  • glaiveai/glaive-function-calling-v2: 4,775 rows, Accuracy: 99.52% (Avg Conf: 0.9971)
  • Team-ACE/ToolACE: 2,822 rows, Accuracy: 95.11% (Avg Conf: 0.9642)
  • forge_error_recovery_protected: 257 rows, Accuracy: 100.00% (Avg Conf: 0.9993)
  • forge_fixed_width_numeric: 197 rows, Accuracy: 99.49% (Avg Conf: 0.9977)
  • forge_trace: 152 rows, Accuracy: 99.34% (Avg Conf: 0.9987)
  • agent_training_hf: 48 rows, Accuracy: 85.42% (Avg Conf: 0.9578)
  • forge_error_recovery_numeric: 35 rows, Accuracy: 97.14% (Avg Conf: 0.9783)
  • forge_augmented: 13 rows, Accuracy: 100.00% (Avg Conf: 0.9612)

Per-Label Accuracy:

  • valid: 7,836 rows, Accuracy: 98.24% (Avg Conf: 0.9805)
  • wrong_arguments_semantic: 7,543 rows, Accuracy: 97.95% (Avg Conf: 0.9910)
  • wrong_tool_semantic: 4,921 rows, Accuracy: 97.20% (Avg Conf: 0.9902)
  • tool_not_needed: 1,969 rows, Accuracy: 99.85% (Avg Conf: 0.9996)
  • needs_clarification: 101 rows, Accuracy: 43.56% (Avg Conf: 0.8513)

Guarded-Objection Sweep Details

Valid-call false block rate at different confidence thresholds on the test set:

  • @ 0.80: 73 / 7836 = 0.0093
  • @ 0.90: 52 / 7836 = 0.0066
  • @ 0.95: 39 / 7836 = 0.0050
  • @ 0.98: 30 / 7836 = 0.0038
  • @ 0.99: 22 / 7836 = 0.0028
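Each rate in the sweep is the false-block count divided by the 7,836 valid test rows. The sketch below recomputes the rates and lists the thresholds whose unrounded rate is at or below 0.005. It is diagnostic arithmetic only: the promotion gate itself is defined at confidence 0.90, so a lower rate at a higher threshold does not satisfy it.

```python
VALID_ROWS = 7836
GATE = 0.005

# (confidence threshold, false-block count) pairs from the sweep above.
FALSE_BLOCKS = [(0.80, 73), (0.90, 52), (0.95, 39), (0.98, 30), (0.99, 22)]

rates = [(threshold, count / VALID_ROWS) for threshold, count in FALSE_BLOCKS]
under_gate = [threshold for threshold, rate in rates if rate <= GATE]

for threshold, rate in rates:
    print(f"@ {threshold:.2f}: {rate:.4f}")
print(under_gate)
```

Note that the 0.95 entry displays as 0.0050 after rounding, but its unrounded value (39 / 7836) is slightly below 0.005.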

Threshold Policy

The exported default mode is shadow, with default action allow. Thresholds are policy metadata, not proof that enforcement is safe.

Recommended local policy:

{
  "schema_version": "toolcall-verifier-thresholds/v1",
  "mode": "shadow",
  "default_action": "allow",
  "labels": {
    "valid": {
      "action": "allow",
      "advisory_min_confidence": 0.0,
      "enforce_min_confidence": 1.01
    },
    "wrong_tool_semantic": {
      "action": "shadow_only_until_eval_proven",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    },
    "wrong_arguments_semantic": {
      "action": "shadow_only_until_eval_proven",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    },
    "tool_not_needed": {
      "action": "shadow_only_until_eval_proven",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    },
    "needs_clarification": {
      "action": "shadow_only_until_eval_proven",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    },
    "deterministic_invalid": {
      "action": "deterministic_only",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    }
  }
}

Candidate calibrated thresholds may be recorded for diagnostics, but non-valid active thresholds should remain above 1.0 until shadow replay and advisory replay both pass.
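A threshold of 1.01 works as an off switch because a confidence can never exceed 1.0. The sketch below shows one possible reading of the policy; the resolution logic, the mode override parameter, and the return values "advise" and "block" are assumptions for illustration, not the Rust runtime's behavior.

```python
# Trimmed copy of the recommended policy (two labels shown for brevity).
POLICY = {
    "mode": "shadow",
    "default_action": "allow",
    "labels": {
        "valid": {"advisory_min_confidence": 0.0, "enforce_min_confidence": 1.01},
        "wrong_tool_semantic": {"advisory_min_confidence": 1.01, "enforce_min_confidence": 1.01},
    },
}


def resolve_action(policy: dict, label: str, confidence: float, mode=None) -> str:
    """Hypothetical resolver: decide what a verdict may do under the policy."""
    mode = mode or policy["mode"]
    entry = policy["labels"].get(label)
    # Shadow mode, unknown labels, and the valid label never change behavior.
    if mode == "shadow" or entry is None or label == "valid":
        return policy["default_action"]
    if mode == "enforce" and confidence >= entry["enforce_min_confidence"]:
        return "block"
    if confidence >= entry["advisory_min_confidence"]:
        return "advise"
    return policy["default_action"]


# Even a maximally confident objection stays inert while thresholds are 1.01.
print(resolve_action(POLICY, "wrong_tool_semantic", 1.0, mode="enforce"))
```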

Input Format

The current published classifier expects the canonical serialized format produced by serialize_state_v1. New replacement artifacts should use serialize_state_v2, which keeps the v1 body and appends SCORING_METADATA.

SCHEMA_VERSION:
toolcall-verifier-input/v1

USER_REQUEST:
Generate a sales report from the Q4 2024 dataset.

WORKFLOW_STATE:
required_steps=['fetch_sales_data', 'analyze_sales']
completed_steps=[]
pending_steps=['fetch_sales_data', 'analyze_sales']
terminal_tools=['report']
recent_errors=[]

AVAILABLE_TOOLS:
report: Produce the final report from findings.
PARAMETERS: {"properties": {"summary": {"type": "string"}}, "required": ["summary"], "type": "object"}

fetch_sales_data: Fetch sales data for a given quarter and year.
PARAMETERS: {"properties": {"quarter": {"type": "integer"}, "year": {"type": "integer"}}, "required": ["quarter", "year"], "type": "object"}

analyze_sales: Analyze the loaded sales data and produce findings.
PARAMETERS: {"properties": {}, "type": "object"}

CANDIDATE_CALL:
{"arguments": {"summary": "Done."}, "name": "report"}

Runtime integrations should byte-compare serializer output against serializer_fixture.json before trusting model scores.
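A byte-level comparison catches differences that a visual diff can miss, such as a trailing newline or a changed quote style. The sketch below compares two strings as UTF-8 bytes and reports the first differing offset; it assumes the expected text has already been read from serializer_fixture.json, whose layout is not specified here, and the function name is hypothetical.

```python
def first_byte_mismatch(actual: str, expected: str):
    """Return None if both strings are byte-identical in UTF-8,
    otherwise the offset of the first differing byte."""
    actual_bytes = actual.encode("utf-8")
    expected_bytes = expected.encode("utf-8")
    if actual_bytes == expected_bytes:
        return None
    for offset, (a, e) in enumerate(zip(actual_bytes, expected_bytes)):
        if a != e:
            return offset
    # One input is a strict prefix of the other.
    return min(len(actual_bytes), len(expected_bytes))


expected = "SCHEMA_VERSION:\ntoolcall-verifier-input/v1\n"
print(first_byte_mismatch(expected, expected))
print(first_byte_mismatch(expected.rstrip("\n"), expected))
```

A runtime could refuse to score, or score in telemetry-only fashion, whenever this check returns an offset instead of None.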

Runtime Files

Required artifact files:

model.onnx
labels.json
thresholds.json
candidate_thresholds.json
artifact_manifest.json
input_schema.json
serializer_fixture.json
tokenizer_config.json
special_tokens_map.json
added_tokens.json
spm.model
config.json
training_run_summary.json
test_metrics.json
promotion_gate_report.json
valid_protection_slice_metrics.json
onnx_parity_report.json

model_quantized.onnx may be published only when quantized parity passes. If it does not pass, treat it as telemetry-only and prefer FP32 ONNX for replay.
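A loader can verify the required file list before trusting an artifact directory. The sketch below is a minimal check with a hypothetical function name; it only tests for file presence and does not validate contents.

```python
from pathlib import Path

# Required artifact files as listed above.
REQUIRED_ARTIFACT_FILES = (
    "model.onnx",
    "labels.json",
    "thresholds.json",
    "candidate_thresholds.json",
    "artifact_manifest.json",
    "input_schema.json",
    "serializer_fixture.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "added_tokens.json",
    "spm.model",
    "config.json",
    "training_run_summary.json",
    "test_metrics.json",
    "promotion_gate_report.json",
    "valid_protection_slice_metrics.json",
    "onnx_parity_report.json",
)


def missing_artifact_files(artifact_dir) -> list:
    """Return required file names that are absent from the directory."""
    root = Path(artifact_dir)
    return [name for name in REQUIRED_ARTIFACT_FILES if not (root / name).is_file()]
```

model_quantized.onnx is intentionally not in the required list, since it may be published only when quantized parity passes.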

Rust Deployment Guidance

Recommended integration order:

1. Parse provider response.
2. Validate format, known tool names, and JSON-schema arguments.
3. Enforce required steps, prerequisites, terminal rules, and unsafe batches.
4. If the call is still valid-looking, run the classifier.
5. Shadow mode: log classifier verdict only.
6. Advisory mode: use classifier verdict to choose better nudges.
7. Enforce mode: block only high-confidence semantic labels after eval proof.

Loading failures should fail closed for strict deployment modes. Scoring failures should fail open in shadow and advisory modes, with telemetry.

Promotion Ladder

  1. Train replacement.
  2. Require good PyTorch validation/test metrics.
  3. Require good FP32 ONNX parity.
  4. Require good quantized parity, or skip quantized active use.
  5. Run release eval in shadow.
  6. Mine false objections and top-k disagreement rows.
  7. Run advisory replay.
  8. Consider enforcement only after advisory replay is clean.

Minimum replay matrix:

no_classifier
classifier_fp32_onnx_shadow
classifier_quantized_onnx_shadow
classifier_fp32_onnx_advisory
classifier_quantized_onnx_advisory

Promotion must show:

  • valid recall at least 0.94,
  • valid false objection at confidence 0.90 at most 0.005,
  • wrong_tool_semantic precision at least 0.90,
  • valid-protection slice gates for any slice with at least 25 valid rows,
  • no regression in terminal-tool workflows,
  • no regression in summarize/report workflows,
  • no regression in fixed-width numeric strings or corrected error-recovery calls,
  • acceptable p95/p99 latency and proxy RSS,
  • stable behavior across real Forge tool schemas, not only public datasets.