Congratulations to the Zhipu Team

#24

by md-1415 - opened Nov 12, 2025

md-1415 • Nov 12, 2025 • edited Nov 13, 2025

I wanted to take a moment to congratulate and thank the Zhipu AI team for the fantastic work on GLM-4.6. I don't usually comment publicly, but the work that has gone into training this model and the decision to make it available free to the community deserves special thanks and gratitude.

In my case, I am running an AI pipeline that processes legal agreements on a local bank of RTX 5090 GPUs that I operate from inside my law firm. My pipeline relies on complex semantic LLM text processing, which requires subtle analysis of legal language in English combined with very precise instruction following. My prompts are usually 15K+ words in length, and I am doing 20+ passes over each legal agreement, with most subsequent prompts depending on the results of earlier ones. If anything fails in my sequence of AI passes, the final output is not usable for my clients. This is basically Legaltech 2.0, and the boundary between "OK" results and the top 5% of attorney work product is very fine.
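For readers who want a concrete picture of the dependent multi-pass structure described above, here is a minimal sketch. It is not the pipeline from this post: `run_chain`, `PassFailed`, and `call_model` are hypothetical names, and `call_model` stands in for whatever local LLM client is actually used. The sketch only shows each pass receiving the earlier results and the chain stopping at the first failed pass.

```python
from typing import Callable, Sequence


class PassFailed(Exception):
    """Raised when one pass in the chain produces no usable output."""


def run_chain(
    document: str,
    prompts: Sequence[str],
    call_model: Callable[[str], str],
) -> list[str]:
    """Run prompts in order, feeding each pass the results of earlier passes.

    call_model is a hypothetical stand-in for a local LLM client.
    """
    results: list[str] = []
    for index, prompt in enumerate(prompts):
        # Each pass sees the agreement, every earlier result, and its own prompt.
        context = "\n\n".join([document, *results, prompt])
        output = call_model(context).strip()
        if not output:
            # A failed pass invalidates everything downstream, so stop here.
            raise PassFailed(f"pass {index} returned no output")
        results.append(output)
    return results
```

A real deployment would likely validate each output more strictly than checking for an empty string, but the early stop reflects the point that one bad pass makes the final result unusable.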

The only model that has worked for me so far is GPT-OSS 120b with "high" reasoning. I have tried every large open-source LLM that has come out so far, and every new model that is rated as better than GPT-OSS 120b fails in my pipeline. The higher scores that recent flagship open-source models post on logic and instruction-following benchmarks do not hold up in my testing for my specialized English-language use case. GLM-4.6 is the first model that shows glimpses of being better than GPT-OSS 120b for my use case (and I found it exceptional in some cases). Unfortunately, GLM-4.6 is a bit too large for my local deployment, so I can't use it in production currently. I have to run my pipeline on my own local servers because my clients are highly regulated entities and do not want their legal agreements processed through the cloud (notwithstanding sandboxing and other assurances provided by cloud companies).

I am very hopeful for the upcoming GLM-4.6 Air model! Thank you again for an incredible job with 4.6 and for making it available to us as open source!
