Instructions to use inclusionAI/Ling-1T with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use inclusionAI/Ling-1T with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="inclusionAI/Ling-1T", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("inclusionAI/Ling-1T", trust_remote_code=True, dtype="auto")


vLLM

How to use inclusionAI/Ling-1T with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inclusionAI/Ling-1T"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inclusionAI/Ling-1T",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
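
The same request can also be sent from Python. This is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_chat_request` and `send` are made up for illustration and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", model="inclusionAI/Ling-1T"):
    """Build (but do not send) a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send(build_chat_request("What is the capital of France?"))` should return the reply text. Passing `base_url="http://localhost:30000"` points the same helper at a server on another port.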

Use Docker

docker model run hf.co/inclusionAI/Ling-1T

SGLang

How to use inclusionAI/Ling-1T with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inclusionAI/Ling-1T" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inclusionAI/Ling-1T",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inclusionAI/Ling-1T" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inclusionAI/Ling-1T",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use inclusionAI/Ling-1T with Docker Model Runner:
```
docker model run hf.co/inclusionAI/Ling-1T
```

Can I run this locally?

by nvriese - opened Oct 10, 2025

Discussion

nvriese

Oct 10, 2025 • edited Oct 10, 2025

Joking obviously, actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

Fernanda24

Oct 11, 2025

when GGUFs come we can run locally just like Kimi K2

owenqwenllmwine

Oct 12, 2025

> Joking obviously, actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

I was about to say... lol

ubergarm

Oct 14, 2025

Not sure why you're joking. Once GGUFs land for llama.cpp and possibly ik_llama.cpp (I may take a crack at quantizing this one), you can likely run it on similar hardware to Kimi-K2, just slower given the higher active parameter count.

@nvriese

> actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

I would not recommend running it at full bf16, as it is designed to operate at fp8. At fp8, 8bpw * 50B active parameters = 50GB of active weights read per token. Assuming you're running this at, say, a 4bpw quant on a ~512GB RAM + 48GB VRAM rig or something, you could do hybrid CPU+GPU inference.
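
The arithmetic above can be sketched in a few lines. This is a rough estimate that counts weights only and ignores KV cache and runtime overhead. The 50B active figure comes from this thread; the ~1T total parameter count is an assumption inferred from the model name, and `weight_size_gb` is a made-up helper.

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Size of the weights in GB (1 GB = 1e9 bytes): params * bits / 8."""
    return params_billions * bits_per_weight / 8


ACTIVE_B = 50    # active parameters per token, in billions (figure from this thread)
TOTAL_B = 1000   # ASSUMPTION: ~1T total parameters, inferred from the name "Ling-1T"

for label, bpw in [("bf16", 16), ("fp8", 8), ("4bpw", 4)]:
    active = weight_size_gb(ACTIVE_B, bpw)
    total = weight_size_gb(TOTAL_B, bpw)
    print(f"{label:>5}: active ~{active:.0f} GB per token, full model ~{total:.0f} GB")
```

Under that total-size assumption, a 4bpw quant lands at roughly 500 GB of weights, which is why a ~512GB RAM + 48GB VRAM rig is about the smallest setup where hybrid inference looks plausible.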

I didn't do the math yet, but if you quantize attn/shexp/first N dense layers (assuming it is a similar arch to DeepSeek / Kimi-K2) at 6-8bpw and smash the routed experts down to like 2-4bpw, I'm guessing the active weights size goes down to like 20GB maybe.
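
One way to sanity-check that guess is a two-bucket estimate: the attn/shexp/dense part at one bit width, the routed experts at another. This is only a sketch. The 30% split between the two buckets is hypothetical and not taken from the model config, and `mixed_active_gb` is a made-up helper.

```python
def mixed_active_gb(active_billions, dense_fraction, dense_bpw, expert_bpw):
    """Active weight size in GB when the dense part and routed experts use different bit widths."""
    dense_b = active_billions * dense_fraction
    expert_b = active_billions - dense_b
    return (dense_b * dense_bpw + expert_b * expert_bpw) / 8


# ASSUMPTION: 30% of the 50B active parameters are attn/shexp/dense layers.
# This split is hypothetical; the real ratio depends on the architecture.
low = mixed_active_gb(50, 0.30, dense_bpw=6, expert_bpw=2)
high = mixed_active_gb(50, 0.30, dense_bpw=8, expert_bpw=4)
print(f"active weights: ~{low:.1f} GB to ~{high:.1f} GB")
```

With that hypothetical split, the aggressive end of the range (6bpw dense, 2bpw experts) comes out around 20 GB, and the conservative end (8bpw dense, 4bpw experts) is noticeably larger.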

So it seems like it might be a bit out of reach of a single 24GB VRAM GPU rig, at least with much usable context length. Guessing that even with good DDR5-6400MT/s or faster RAM, if your rig can hit 512GB/s theoretical bandwidth in a single NUMA node, you might get like 100 tok/sec PP (prompt processing) and 10 tok/sec TG (token generation) on a good day lol.. just spitballing...
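
For a rough ceiling on token generation, a common back-of-the-envelope model is that decoding is memory-bound: every active weight is read once per token, so tok/sec is at most bandwidth divided by active weight size. The sketch below assumes the bandwidth figure means 512 GB/s, reuses the ~20GB guess from above, and assumes 8 bytes per transfer per memory channel. The channel count and both helper names are hypothetical.

```python
def ddr_bandwidth_gb_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s for 64-bit (8-byte) channels."""
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000


def tg_upper_bound_tok_s(bandwidth_gb_s, active_weights_gb):
    """Memory-bound ceiling on token generation: one full read of the active weights per token."""
    return bandwidth_gb_s / active_weights_gb


# ASSUMPTION: 10 channels of DDR5-6400 in one NUMA node (hypothetical configuration).
bandwidth = ddr_bandwidth_gb_s(6400, channels=10)
ceiling = tg_upper_bound_tok_s(bandwidth, active_weights_gb=20)
print(f"bandwidth ~{bandwidth:.0f} GB/s, TG ceiling ~{ceiling:.1f} tok/sec")
```

Real throughput sits well below this kind of ceiling once compute, KV cache reads, and CPU/GPU transfers are counted, so a guess of around 10 tok/sec is in a believable range.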

Fernanda24

Oct 16, 2025

Please take a crack at it, ubergarm. Your GGUFs are really solid!! Would love to test these if you get them going. The biggest I can fit is Q3, but that should be good enough to get a solid idea about this model. Even Q2 Kimi K2 is surprisingly good, and on the polyglot benchmark it is within 1 point of Q4, I believe.

ubergarm

Oct 17, 2025

@Fernanda24

No promises, but I'm making some progress on Ling-1T-GGUF with a new PR on ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp/pull/837#issuecomment-3413794264 . I hope Hugging Face lets me upload the files given the recent changes to public storage limits!

ubergarm

Oct 17, 2025 • edited Oct 17, 2025

@Fernanda24

Okay, you have your choice of ik_llama.cpp (mostly merged into main, though check for the CUDA -ger patch if you are using any CUDA) or a mainline llama.cpp quant, which still relies on an open PR:

https://huggingface.co/ubergarm2/Ling-1T-GGUF

Sorry about the namespace, will try to change it to ubergarm properly ASAP, but at least there are files for you to download and try now. I had luck out to 40k context and ~6 rounds of chatting about some papers and generating C++ diffs, and it never blew up on me...

Fernanda24

Oct 18, 2025

@ubergarm sweet!! thx downloading ks now!
