SkyJM (RubricRM) is a reward model for visual generation, covering both text-to-image generation and image editing. Given a prompt and two candidate images, it predicts which image better satisfies the instruction. RubricRM does the following in a single forward pass:

1. Dynamically produces an evaluation rubric conditioned on the prompt, including evaluation dimensions, per-dimension weights, and graded scoring descriptors;
2. Scores both candidate images at the dimension level under that rubric;
3. Aggregates the dimension scores via the rubric weights to derive the final preference.
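
The aggregation step can be illustrated with a minimal sketch. It assumes a normalized weighted average and a simple comparison of the two totals; the exact aggregation rule is not specified here, and the names `Dimension`, `aggregate`, and `prefer`, as well as the example dimensions and scores, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    """One rubric dimension (hypothetical structure, for illustration only)."""
    name: str
    weight: float


def aggregate(rubric, scores):
    """Weighted average of per-dimension scores, normalized by total weight."""
    total_weight = sum(d.weight for d in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(d.weight * scores[d.name] for d in rubric) / total_weight


def prefer(rubric, scores_a, scores_b):
    """Return 'A', 'B', or 'tie' by comparing the aggregated scores."""
    total_a = aggregate(rubric, scores_a)
    total_b = aggregate(rubric, scores_b)
    if total_a > total_b:
        return "A"
    if total_b > total_a:
        return "B"
    return "tie"


# Illustrative rubric and scores (assumed values, not model output).
rubric = [
    Dimension("prompt_alignment", 0.5),
    Dimension("visual_quality", 0.3),
    Dimension("composition", 0.2),
]
scores_a = {"prompt_alignment": 4, "visual_quality": 3, "composition": 5}
scores_b = {"prompt_alignment": 3, "visual_quality": 5, "composition": 4}
winner = prefer(rubric, scores_a, scores_b)
```

In this example image A wins on the most heavily weighted dimension, so it is preferred even though image B scores higher on visual quality.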

We release two model variants, each in two sizes (4B and 9B), built on the Qwen3.5 backbone:

SkyJM-Gen-4B / SkyJM-Gen-9B — for text-to-image generation
SkyJM-Edit-4B / SkyJM-Edit-9B — for image editing

Performance

Text-to-image generation

Model	MMRB2	GenAI-Bench	GenAI-Bench-Verified
Proprietary MLLMs
Claude Sonnet 4.6	70.8	65.8	75.3
GPT-5.4	67.5	64.2	74.2
Gemini 2.5 Pro	70.5	67.8	77.4
Gemini 3.1 Pro	74.4	73.9	84.8
Open-source MLLMs
Qwen3-VL-8B	61.2	63.3	72.5
Qwen3-VL-235B-A22B	66.6	61.5	69.7
Qwen3.5-9B	66.3	63.3	70.7
Qwen3.5-397B-A17B	72.7	66.2	77.0
Reward Models
HPSv2	55.0	68.8	78.1
PickScore	57.6	70.0	79.2
HPSv3	60.2	70.9	81.0
UnifiedReward-9B	57.9	69.2	72.8
UnifiedReward-Think-9B	65.5	72.8	81.7
UnifiedReward-Flex-8B	69.2	73.4	84.2
SkyJM-Gen-4B (Ours)	70.5	73.2	83.1
SkyJM-Gen-9B (Ours)	72.0	74.1	84.5

Image editing

Model	MMRB2	EditReward-ERB Avg	EditScore-ERB Avg
Proprietary MLLMs
Claude Sonnet 4.6	71.7	44.1	79.3
GPT-5.4	68.5	42.5	74.6
Gemini 2.5 Pro	71.3	42.2	75.2
Gemini 3.1 Pro	74.9	45.0	81.6
Open-source MLLMs
Qwen3-VL-8B	63.4	40.9	76.9
Qwen3-VL-235B-A22B	64.8	34.6	78.8
Qwen3.5-9B	64.4	37.4	72.0
Qwen3.5-397B-A17B	73.7	43.9	81.2
Reward Models
EditReward-7B	67.2	38.4	78.3
EditScore-7B	55.6	28.8	61.9
SkyJM-Edit-4B (Ours)	73.2	45.5	85.5
SkyJM-Edit-9B (Ours)	75.4	46.4	85.6

Training Strategy

Stage 1: Rubric-trajectory SFT. We use Gemini 3.1 Pro to synthesize rubric-based evaluation trajectories conditioned on human preference labels, then filter them with structural and label-consistency checks before using them for SFT.
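
As a rough sketch of what such filtering could look like, the snippet below keeps a trajectory only if its rubric is well formed and its final preference matches the human label. The trajectory schema (`rubric`, `scores_a`, `scores_b`, `preference`) and the specific checks are assumptions made for illustration; the actual checks are not detailed here.

```python
def is_structurally_valid(traj):
    """Check that the rubric is well formed and every dimension is scored for both images."""
    rubric = traj.get("rubric")
    if not isinstance(rubric, list) or not rubric:
        return False
    names = []
    for dim in rubric:
        if not isinstance(dim, dict):
            return False
        weight = dim.get("weight")
        if not isinstance(dim.get("name"), str):
            return False
        if not isinstance(weight, (int, float)) or weight <= 0:
            return False
        names.append(dim["name"])
    if len(set(names)) != len(names):  # duplicate dimension names
        return False
    for key in ("scores_a", "scores_b"):
        scores = traj.get(key)
        if not isinstance(scores, dict) or set(scores) != set(names):
            return False
    return True


def is_label_consistent(traj, human_label):
    """The trajectory's final preference must agree with the human preference label."""
    return traj.get("preference") == human_label


def filter_trajectories(pairs):
    """Keep trajectories that pass both checks; `pairs` holds (trajectory, human_label) tuples."""
    return [
        traj
        for traj, label in pairs
        if is_structurally_valid(traj) and is_label_consistent(traj, label)
    ]


# Illustrative data (assumed schema, not real training samples).
good = {
    "rubric": [{"name": "alignment", "weight": 0.6}, {"name": "quality", "weight": 0.4}],
    "scores_a": {"alignment": 4, "quality": 3},
    "scores_b": {"alignment": 2, "quality": 4},
    "preference": "A",
}
missing_score = {**good, "scores_b": {"alignment": 2}}
wrong_label = {**good, "preference": "B"}
kept = filter_trajectories([(good, "A"), (missing_score, "A"), (wrong_label, "A")])
```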

Stage 2: Dimension-level GRPO. During RL, we fix the rubric and optimize only the scoring process using rewards based on per-dimension score gaps, with saturated-group filtering to suppress noisy low-variance updates.
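
The sketch below illustrates the general idea under stated assumptions. The reward form (mean normalized score gap in favor of the human-preferred image), the 5-point score scale, and the standard-deviation threshold are all hypothetical; only the group-relative normalization follows the standard GRPO formulation. A group of rollouts whose rewards are nearly identical is treated as saturated and skipped, since it carries almost no learning signal.

```python
from statistics import mean, pstdev


def gap_reward(scores_preferred, scores_rejected, max_score=5.0):
    """Hypothetical reward: mean per-dimension gap favoring the human-preferred image.

    Both arguments map dimension name -> score under the same fixed rubric.
    """
    gaps = [
        (scores_preferred[d] - scores_rejected[d]) / max_score
        for d in scores_preferred
    ]
    return mean(gaps)


def group_advantages(rewards, min_std=1e-6):
    """Group-relative advantages for one prompt, or None if the group is saturated.

    A group is saturated when its rewards barely differ (std below `min_std`).
    """
    sigma = pstdev(rewards)
    if sigma < min_std:
        return None
    mu = mean(rewards)
    return [(r - mu) / sigma for r in rewards]


# Illustrative rollouts (assumed values): each rollout scores the same image pair.
reward = gap_reward({"alignment": 4, "quality": 3}, {"alignment": 2, "quality": 3})
saturated = group_advantages([1.0, 1.0, 1.0])  # skipped: no variance in the group
informative = group_advantages([0.0, 1.0])     # kept: rollouts disagree
```

Because the rubric is held fixed, rollouts within a group differ only in their scores, so the advantage reflects scoring quality rather than rubric choice.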

Quick Start

For detailed usage instructions, an installation guide, and inference examples (supporting both vLLM and Transformers backends), please refer to the official inference framework:

SKYLENAGE-JUDGER — Unified inference framework for SkyJM judge models.

Links

GitHub: SKYLENAGE-AI/SKYLENAGE-JUDGER
Hugging Face Models:
Hugging Face Dataset: skylenage-ai/RubricRM-Data
ModelScope Models:

Safetensors: model size 9B params, tensor type BF16
