Deploying a production-ready service with Unsloth GGUF quants on your AWS account (4 x L40S)
Hi People
In the past few weeks we have been doing tons of PoCs with enterprises trying to deploy DeepSeek R1. The most popular combination was the Unsloth GGUF quants on 4xL40S.
We just dropped the guide to deploy it on serverless GPUs on your own cloud: https://tensorfuse.io/docs/guides/integrations/llama_cpp
Single-request throughput: 24 tok/sec
Context size: 5k
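For anyone who wants to reproduce a single-request number like the one above, here is a minimal sketch. It assumes the deployed server exposes an OpenAI-compatible `/v1/chat/completions` endpoint that reports `usage.completion_tokens`; the base URL, prompt, and `max_tokens` value are placeholders, not values from the guide.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Return generation throughput for one request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure_single_request(base_url: str, prompt: str, max_tokens: int = 256) -> float:
    """Send one chat request and return the observed tok/sec.

    Assumption: the server speaks the OpenAI-compatible chat API and
    includes usage.completion_tokens in its JSON response.
    """
    payload = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

The elapsed time here includes prompt processing, so the result slightly understates pure decoding speed. Call it as `measure_single_request("http://localhost:8080", "Who are you?")`, replacing the URL with your own endpoint.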
We also ran multiple experiments to find the right trade-off between context size and tokens per second. You can modify the "--n-gpu-layers" and "--ctx-size" parameters to measure tokens per second for each scenario. Here are the results:
- GPU layers 30, context 10k, speed 6.3 t/s
- GPU layers 40, context 10k, speed 8.5 t/s
- GPU layers 50, context 10k, speed 12 t/s
- At GPU layers > 50, the 10k context window will not fit.
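The sweep above can be sketched as a small script that builds one llama.cpp server command per configuration. This is an illustration only: the `llama-server` binary name, the model path, and reading "10k" as 10,000 tokens are assumptions, and the throughput numbers are simply the ones reported in this post.

```python
# Results reported in the post: (GPU layers, context size, tokens per second).
# Assumption: "10k" is taken as 10,000 tokens.
REPORTED_RESULTS = [
    (30, 10_000, 6.3),
    (40, 10_000, 8.5),
    (50, 10_000, 12.0),
]


def build_server_command(model_path: str, n_gpu_layers: int, ctx_size: int) -> list[str]:
    """Build an argv list for one server run (binary name is an assumption)."""
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(n_gpu_layers),
        "--ctx-size", str(ctx_size),
    ]


def fastest_config(results):
    """Return the (layers, ctx, tps) tuple with the highest tokens per second."""
    return max(results, key=lambda row: row[2])


if __name__ == "__main__":
    model = "/models/deepseek-r1.gguf"  # hypothetical path
    for layers, ctx, tps in REPORTED_RESULTS:
        command = " ".join(build_server_command(model, layers, ctx))
        print(f"{command}  # reported {tps} t/s")
    print("fastest reported config:", fastest_config(REPORTED_RESULTS))
```

Running each command and measuring throughput yourself is the only way to confirm these numbers on your own hardware.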
Is it FP8 based, or Q4 based?
If I had DeepSeek R1 running locally at 6.3 t/s with a 10k context, I'd be happy and would rarely, if ever, touch online models.
Unfortunately that's not possible on consumer PCs, but on the other hand it sounds too slow for servers... 🤷♂️
@ghostplant It is a 1.58-bit dynamic quant.
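As a rough back-of-the-envelope check of what a 1.58-bit quant means for memory, the sketch below estimates the weight size. It assumes DeepSeek R1's 671B parameters, a uniform 1.58 bits per weight (a dynamic quant mixes bit widths, so the real file size will differ), and 48 GB of VRAM per L40S. It covers weights only; the KV cache and compute buffers need additional memory on top.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) at a uniform bit width."""
    return n_params * bits_per_weight / 8 / 1e9


R1_PARAMS = 671e9   # DeepSeek R1 total parameter count
L40S_VRAM_GB = 48   # VRAM per NVIDIA L40S
N_GPUS = 4          # the 4 x L40S setup discussed in this thread

weights_gb = quantized_weight_gb(R1_PARAMS, 1.58)
total_vram_gb = L40S_VRAM_GB * N_GPUS

print(f"estimated weights: {weights_gb:.1f} GB")
print(f"total VRAM:        {total_vram_gb} GB")
print(f"left for KV cache and buffers: {total_vram_gb - weights_gb:.1f} GB")
```

This is only an estimate under the stated assumptions, not a measurement of the actual GGUF files.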
@MrDevolver You can increase the speed by increasing the number of GPUs. The maximum I have achieved is around 70 tok/sec on L40S.
I also tried running on CPU-only machines and was getting 5 tokens per second. If you have a decent Mac, you can run it at 6.3 tokens per second.
Does the 1-bit quant (Q1) still outperform o1-mini? If not, why not use the 32B distilled model, which needs only 1 GPU?
Does Q1 still outperform o1-mini? If not, why not use the 32B version?
IMHO, a low quant of a bigger model is still better than the highest quant of a smaller model.
Is there any formal comparison of their performance against Distilled-Qwen-32B?
Subjectively, the 2.5-bit quantization handily outperforms the Llama 70B distill in reasoning quality. The 70B distill is much faster, though...