Can I run this locally?
Joking obviously, actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?
when GGUFs come we can run locally just like Kimi K2
> Joking obviously, actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?
I was about to say... lol
Not sure why you're joking, once GGUFs land for llama.cpp and possibly ik_llama.cpp (i may take a crack at quantizing this one) you can likely run it on similar hardware as Kimi-K2 but slower given more active parameters.
> actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?
I would not recommend running it at full bf16 as it is designed to operate at fp8. At fp8 that's 8bpw * 50B active parameters = 50GB of active weights read per token (bf16 would be double that, 100GB). Assuming you're running this at say a 4bpw quant on a ~512GB RAM + 48GB VRAM rig or something, you could do hybrid CPU+GPU inference.
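The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope helper, not anything from the model's tooling: the function name is made up for illustration, 1 GB is taken as 1e9 bytes, and it counts weights only (no KV cache, activations, or runtime overhead).

```python
def active_weight_gb(active_params: float, bits_per_weight: float) -> float:
    """Rough size in GB (1e9 bytes) of the weights read per token."""
    return active_params * bits_per_weight / 8 / 1e9


ACTIVE_PARAMS = 50e9  # ~50B active parameters, the figure quoted in this thread

# Compare the precisions mentioned in the discussion.
for label, bpw in [("bf16", 16), ("fp8", 8), ("4bpw quant", 4)]:
    print(f"{label}: ~{active_weight_gb(ACTIVE_PARAMS, bpw):.0f} GB active weights")
```

Note this is the per-token active size only; the full set of expert weights still has to live somewhere (RAM, VRAM, or disk), which is why the RAM figure matters for hybrid inference.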
I didn't do the math yet, but if you quantize attn/shexp/first N dense layers (assuming it is a similar arch to deepseek / kimi-k2) at 6-8bpw and smash the routed experts to like 2-4bpw, I'm guessing the active weights size goes down to like 20GB maybe.
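A rough way to play with that mixed-precision guess is to sum the size of each tensor group at its own bpw. The split below (10B active parameters in attn/shexp/dense layers, 40B in routed experts) is purely a placeholder assumption to show the calculation; the real split depends on the architecture and was not given in the thread.

```python
def mixed_quant_gb(groups):
    """Total GB (1e9 bytes) for an iterable of (param_count, bits_per_weight) pairs."""
    return sum(n * bpw / 8 for n, bpw in groups) / 1e9


# HYPOTHETICAL split of the ~50B active parameters, for illustration only.
groups = [
    (10e9, 8.0),  # attn / shexp / first N dense layers at ~8bpw (assumed 10B)
    (40e9, 2.5),  # routed experts at ~2.5bpw (assumed 40B)
]

print(f"~{mixed_quant_gb(groups):.1f} GB active weights")
```

With these placeholder numbers the result lands in the same ballpark as the ~20GB guess, but changing the assumed split or bpw moves it noticeably.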
So it seems like it might be a bit out of reach of a single 24GB VRAM GPU rig with much usable context length anyway. Guessing even with good DDR5-6400MT/s or faster RAM, maybe if your rig can hit 512GB/s theoretical bandwidth in a single NUMA node you might get like 100 tok/sec PP and 10 tok/sec TG on a good day lol.. just spitballing...
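For a sanity check on the TG guess: token generation is usually memory-bandwidth bound, so a crude ceiling is bandwidth divided by the active weights read per token. The sketch below assumes 8 bytes per transfer per DDR5 channel and a 10-channel setup (chosen only because it reproduces the 512GB/s figure), reads the bandwidth figure as GB/s, and reuses the ~20GB active-weights guess from above. It ignores KV cache reads, GPU offload, and any real-world inefficiency.

```python
def ddr_bandwidth_gbs(mts: float, channels: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s for a given transfer rate (MT/s)."""
    return mts * bytes_per_transfer * channels / 1000


def tg_upper_bound(bandwidth_gbs: float, active_gb: float) -> float:
    """Crude tokens/sec ceiling if every token must stream all active weights once."""
    return bandwidth_gbs / active_gb


bw = ddr_bandwidth_gbs(6400, channels=10)  # channel count is an assumption
print(f"theoretical bandwidth: {bw:.0f} GB/s")
print(f"TG ceiling at 20 GB active weights: {tg_upper_bound(bw, 20):.1f} tok/s")
```

Real throughput typically lands well below this kind of theoretical ceiling, so a guess of around 10 tok/sec is not unreasonable.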
Please take a crack at it, Ubergarm. Your GGUFs are really solid!! Would love to test these if you get them going. The biggest I can fit is Q3, but that should be good enough to get a solid idea about this model. Even Q2 Kimi K2 is surprisingly good, and on the polyglot benchmark it scores within 1 point of Q4, I believe.
No promises but making some progress on Ling-1T-GGUF with new PR on ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp/pull/837#issuecomment-3413794264 . I hope hugging face lets me upload the files given the recent changes on public storage limits!
Okay, you have your choice of ik_llama.cpp (mostly merged into main, though check for the CUDA -ger patch if you are using any CUDA) or a mainline llama.cpp quant that still relies on an open PR:
https://huggingface.co/ubergarm2/Ling-1T-GGUF
Sorry about the namespace, will try to change it to ubergarm properly ASAP, but at least there are files for you to download and try now. I had luck out to 40k context and ~6 rounds of chatting about some papers and generating C++ diffs, and it never blew up on me...