Encountered an error while starting the model using vLLM.
#2
by wuht1 - opened
VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0 vllm serve Qwen3-Reranker-8B --gpu_memory_utilization 0.9 --port 10004 --enable-auto-tool-choice --tool-call-parser hermes --max-model-len 32768 --task score
INFO 06-07 13:37:40 [engine.py:316] Added request rerank-92c9e500eae14ed684fdc5f2e542999c-30.
INFO: 172.17.26.10:33191 - "POST /v1/rerank HTTP/1.1" 500 Internal Server Error
ERROR 06-07 13:37:41 [engine.py:164] AttributeError("'Qwen3ForCausalLM' object has no attribute 'pooler'")
ERROR 06-07 13:37:41 [engine.py:164] Traceback (most recent call last):
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 162, in start
ERROR 06-07 13:37:41 [engine.py:164] self.run_engine_loop()
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 225, in run_engine_loop
ERROR 06-07 13:37:41 [engine.py:164] request_outputs = self.engine_step()
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 251, in engine_step
ERROR 06-07 13:37:41 [engine.py:164] raise e
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 234, in engine_step
ERROR 06-07 13:37:41 [engine.py:164] return self.engine.step()
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 1393, in step
ERROR 06-07 13:37:41 [engine.py:164] outputs = self.model_executor.execute_model(
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/executor/executor_base.py", line 140, in execute_model
ERROR 06-07 13:37:41 [engine.py:164] output = self.collective_rpc("execute_model",
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 06-07 13:37:41 [engine.py:164] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/utils.py", line 2605, in run_method
ERROR 06-07 13:37:41 [engine.py:164] return func(*args, **kwargs)
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 06-07 13:37:41 [engine.py:164] output = self.model_runner.execute_model(
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-07 13:37:41 [engine.py:164] return func(*args, **kwargs)
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/vllm/worker/pooling_model_runner.py", line 159, in execute_model
ERROR 06-07 13:37:41 [engine.py:164] self.model.pooler(hidden_states=hidden_or_intermediate_states,
ERROR 06-07 13:37:41 [engine.py:164] File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
ERROR 06-07 13:37:41 [engine.py:164] raise AttributeError(
ERROR 06-07 13:37:41 [engine.py:164] AttributeError: 'Qwen3ForCausalLM' object has no attribute 'pooler'
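The traceback above shows the pooling model runner calling `self.model.pooler` on a model that was loaded as `Qwen3ForCausalLM`, a generation architecture with no pooler. As a hedged sketch only: newer vLLM releases document an `--hf_overrides` setting that loads Qwen3-Reranker checkpoints as a sequence-classification model instead. The override values below are assumed from that documentation, not from this thread, and should be checked against the docs of the installed vLLM version. `build_serve_command` is a hypothetical helper that only assembles the command line; it does not start a server.

```python
import json
import shlex

# Assumption: override values taken from vLLM's documentation for
# Qwen3-Reranker; they are not confirmed anywhere in this thread.
HF_OVERRIDES = {
    "architectures": ["Qwen3ForSequenceClassification"],
    "classifier_from_token": ["no", "yes"],
    "is_original_qwen3_reranker": True,
}


def build_serve_command(model, port):
    """Assemble (but do not run) a `vllm serve` command for a reranker."""
    return [
        "vllm", "serve", model,
        "--task", "score",
        "--port", str(port),
        # The overrides are passed to vLLM as a single JSON string.
        "--hf_overrides", json.dumps(HF_OVERRIDES),
    ]


if __name__ == "__main__":
    # Print a shell-quoted command that can be copied into a terminal.
    print(shlex.join(build_serve_command("Qwen3-Reranker-8B", 10004)))
```

If the installed vLLM version does not recognize these overrides, the command will not help, and upgrading vLLM would be the first thing to try.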
Which version of vLLM are you using?
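One way to answer this without starting the server is to read the installed package metadata. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and it returns `None` when a package is not installed in the current environment.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The packages that appear in the traceback paths.
    for name in ("vllm", "torch"):
        print(name, installed_version(name))
```

The server also logs its version at startup in a line of the form `vLLM API server version ...`.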
+1
vllm serve /mnt/data/llch/Models/Qwen3-Reranker-4B \
--served-model-name Qwen3-Reranker-4B \
--task score \
--host 0.0.0.0 \
--api-key torch-elskenrgvoiserngviopsejrmoief \
--max-model-len 1024 \
--trust-remote-code \
--port 16854
INFO 06-24 14:57:57 [__init__.py:244] Automatically detected platform cuda.
INFO 06-24 14:58:02 [api_server.py:1287] vLLM API server version 0.9.1
INFO 06-24 14:58:03 [cli_args.py:309] non-default args: {'host': '0.0.0.0', 'port': 16854, 'api_key': 'torch-elskenrgvoiserngviopsejrmoief', 'model': '/mnt/data/llch/Models/Qwen3-Reranker-4B', 'task': 'score', 'trust_remote_code': True, 'max_model_len': 1024, 'served_model_name': ['Qwen3-Reranker-4B']}
WARNING 06-24 14:58:12 [arg_utils.py:1642] --task score is not supported by the V1 Engine. Falling back to V0.
INFO 06-24 14:58:12 [api_server.py:265] Started engine process with PID 3896833
WARNING 06-24 14:58:14 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 06-24 14:58:17 [__init__.py:244] Automatically detected platform cuda.
INFO 06-24 14:58:21 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model='/mnt/data/llch/Models/Qwen3-Reranker-4B', speculative_config=None, tokenizer='/mnt/data/llch/Models/Qwen3-Reranker-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-Reranker-4B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type=None, normalize=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True,
INFO 06-24 14:58:21 [cuda.py:327] Using Flash Attention backend.
INFO 06-24 14:58:22 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-24 14:58:22 [model_runner.py:1171] Starting to load model /mnt/data/llch/Models/Qwen3-Reranker-4B...
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.95s/it]
INFO 06-24 14:58:26 [default_loader.py:272] Loading weights took 3.97 seconds
INFO 06-24 14:58:27 [launcher.py:37] Route: /rerank, Methods: POST
INFO 06-24 14:58:27 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 06-24 14:58:27 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 06-24 14:58:27 [launcher.py:37] Route: /invocations, Methods: POST
INFO 06-24 14:58:27 [launcher.py:37] Route: /metrics, Methods: GET
INFO: Started server process [3896313]
INFO: Waiting for application startup.
INFO: Application startup complete.
WARNING 06-24 14:59:17 [api_server.py:816] To indicate that the rerank API is not part of the standard OpenAI API, we have located it at `/rerank`. Please update your client accordingly. (Note: Conforms to JinaAI rerank API)
INFO 06-24 14:59:17 [logger.py:43] Received request rerank-c674f86510474da18e5e3d6c7cf3f50c-0: prompt: '如何优化大型语言模型的推理速度?', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 14:59:17 [logger.py:43] Received request rerank-c674f86510474da18e5e3d6c7cf3f50c-1: prompt: '大型语言模型的推理速度可以通过模型量化来提升,例如使用INT8或INT4精度。', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 14:59:17 [logger.py:43] Received request rerank-c674f86510474da18e5e3d6c7cf3f50c-2: prompt: '使用闪存注意力(Flash Attention)是加速Transformer模型推理的有效方法。', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 14:59:17 [logger.py:43] Received request rerank-c674f86510474da18e5e3d6c7cf3f50c-3: prompt: '对于大型模型,张量并行(Tensor Parallelism)是一种常见的分布式推理策略。', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 14:59:17 [logger.py:43] Received request rerank-c674f86510474da18e5e3d6c7cf3f50c-4: prompt: '进行知识蒸馏,将大模型的知识迁移到小模型上,可以获得更快的推理速度。', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 14:59:17 [logger.py:43] Received request rerank-c674f86510474da18e5e3d6c7cf3f50c-5: prompt: '如何给宠物猫洗澡?首先要安抚它的情绪,准备好温水和宠物专用香波。', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 14:59:17 [logger.py:43] Received request rerank-c674f86510474da18e5e3d6c7cf3f50c-6: prompt: '优化推理速度的关键在于减少计算量和内存访问,vLLM通过PagedAttention实现了这一点。', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 14:59:17 [logger.py:43] Received request rerank-c674f86510474da18e5e3d6c7cf3f50c-7: prompt: '剪枝(Pruning)技术可以移除模型中不重要的权重,从而减小模型大小并加速计算。', params: PoolingParams(dimensions=None, additional_metadata=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 14:59:17 [engine.py:317] Added request rerank-c674f86510474da18e5e3d6c7cf3f50c-7.
INFO: 127.0.0.1:54454 - "POST /v1/rerank HTTP/1.1" 500 Internal Server Error
ERROR 06-24 14:59:17 [engine.py:165] AttributeError("'Qwen3ForCausalLM' object has no attribute 'pooler'")
ERROR 06-24 14:59:17 [engine.py:165] Traceback (most recent call last):
ERROR 06-24 14:59:17 [engine.py:165] File "/mnt/data/llch/my_lm_log/.venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 163, in start
ERROR 06-24 14:59:17 [engine.py:165] self.run_engine_loop()
ERROR 06-24 14:59:17 [engine.py:165] File "/mnt/data/llch/my_lm_log/.venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 226, in run_engine_loop
ERROR 06-24 14:59:17 [engine.py:165] request_outputs = self.engine_step()
ERROR 06-24 14:59:17 [engine.py:165] ^^^^^^^^^^^^^^^^^^
...
ERROR 06-24 14:59:17 [engine.py:165] File "/mnt/data/llch/my_lm_log/.venv/lib/python3.12/site-packages/vllm/worker/pooling_model_runner.py", line 159, in execute_model
ERROR 06-24 14:59:17 [engine.py:165] self.model.pooler(hidden_states=hidden_or_intermediate_states,
ERROR 06-24 14:59:17 [engine.py:165] ^^^^^^^^^^^^^^^^^
ERROR 06-24 14:59:17 [engine.py:165] File "/mnt/data/llch/my_lm_log/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
ERROR 06-24 14:59:17 [engine.py:165] raise AttributeError(
ERROR 06-24 14:59:17 [engine.py:165] AttributeError: 'Qwen3ForCausalLM' object has no attribute 'pooler'
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [3896313]
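For anyone trying to reproduce the failing `POST /v1/rerank` call from the log above, here is a minimal client sketch using only the standard library. The payload fields (`model`, `query`, `documents`) follow the Jina-style rerank format that the server warning mentions. The base URL, the placeholder API key, and the helper names are assumptions for illustration, and the exact request the original poster sent is not shown in the thread.

```python
import json
import urllib.request


def build_rerank_payload(model, query, documents):
    """Build a Jina-style rerank request body."""
    return {"model": model, "query": query, "documents": list(documents)}


def post_rerank(base_url, api_key, payload):
    """POST the payload to /v1/rerank and return the decoded JSON response.

    urllib raises urllib.error.HTTPError for non-2xx responses, so the
    500 seen in the log would surface here as an exception.
    """
    request = urllib.request.Request(
        base_url + "/v1/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example payload; the documents are English stand-ins, not the logged prompts.
payload = build_rerank_payload(
    "Qwen3-Reranker-4B",
    "How can inference speed of large language models be optimized?",
    [
        "Quantization to INT8 or INT4 can speed up inference.",
        "To bathe a cat, first calm it and prepare warm water.",
    ],
)
```

With the server from the command above running, `post_rerank("http://127.0.0.1:16854", "YOUR_API_KEY", payload)` would send the request; against the setup shown in the log it would be expected to raise an `HTTPError` for the 500 response.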