GGML Version
Outstanding work! I just converted it to GGML. Check it out if you are interested: https://huggingface.co/s3nh/LLaMA-2-7B-32K-GGML
@deepakkaura26 I think so! By default you get 2 vCPUs on Colab with 13 GB of RAM, which should be enough to run the GGML versions.
Which quantization did you try? I tried the 4-bit version on Colab and could run it without problems:
from ctransformers import AutoModelForCausalLM
# Name of the 4-bit quantized file inside the GGML repo
model_file = "LLaMA-2-7B-32K.ggmlv3.q4_0.bin"
# Pull the model file directly from the Hub repo
model = AutoModelForCausalLM.from_pretrained("s3nh/LLaMA-2-7B-32K-GGML", model_type="llama", model_file=model_file)
prompt = "Whales have been living in the oceans for millions of years "
print(model(prompt, max_new_tokens=128, temperature=0.9, top_p=0.7))
EDIT: the snippet now loads the model directly from the Hub.
@mauriceweber I used the same example that is shown on this model's page:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16)
input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
@mauriceweber I tried to run the code you showed, and it gives me the following error:
HTTPError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
260 try:
--> 261 response.raise_for_status()
262 except HTTPError as e:
11 frames
HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main
The above exception was the direct cause of the following exception:
RepositoryNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
291 " make sure you are authenticated."
292 )
--> 293 raise RepositoryNotFoundError(message, response) from e
294
295 elif response.status_code == 400:
RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64caab34-5bd826d76686f26a76b02644;7f562443-2822-41e5-bcd0-37c62aef99f9)
Repository Not Found for url: https://huggingface.co/api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
@mauriceweber I used the same example that is shown on this model's page:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16)
input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
Here you are not using the quantized (ggml) models, which is why you are running out of memory (you need around 14GB RAM for the 7B model with float16).
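For anyone wondering where the 14 GB figure comes from, here is a rough back-of-the-envelope sketch. It assumes a nominal 7 billion parameters and counts only the weights; activations and the KV cache come on top. The 4-bit line is a lower bound, since real quantized files also store per-block scale data. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9        # nominal parameter count of the 7B model (assumption)
COLAB_RAM_GB = 13     # RAM on the default Colab instance mentioned in this thread

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # float16: 2 bytes per parameter
q4_gb = weight_memory_gb(N_PARAMS, 4)     # 4-bit: half a byte per parameter (lower bound)

print(f"float16 weights: {fp16_gb:.1f} GB, fits in Colab RAM: {fp16_gb < COLAB_RAM_GB}")
print(f"4-bit weights:   {q4_gb:.1f} GB, fits in Colab RAM: {q4_gb < COLAB_RAM_GB}")
```

So the float16 weights alone already exceed the default Colab RAM, while the 4-bit version leaves plenty of headroom.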
@mauriceweber I tried to run the code you showed, and it gives me the following error:
This error occurs because the model is not downloaded yet (I was assuming you had it downloaded to Colab). I adjusted the code snippet above so that the model file gets pulled directly from the repo. You can check the other model versions here.
Let us know how it goes! :)
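As a side note on the traceback: the URL in it shows that the file name `LLaMA-2-7B-32K.ggmlv3.q4_0.bin` was looked up as if it were a repo id. That is consistent with the loader treating a first argument that is not an existing local file as a Hub repo. Below is a minimal sketch of a guard for that case. The helper `resolve_model_source` is hypothetical and not part of ctransformers; it only decides which values to pass on.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_model_source(model_file: str, repo_id: str) -> Tuple[str, Optional[str]]:
    """Return (source, model_file) to pass to from_pretrained.

    If the file already exists locally, use its path and no extra file name.
    Otherwise fall back to the Hub repo id plus the file name inside that repo.
    """
    if Path(model_file).is_file():
        return model_file, None
    return repo_id, model_file


source, file_name = resolve_model_source(
    "LLaMA-2-7B-32K.ggmlv3.q4_0.bin",
    "s3nh/LLaMA-2-7B-32K-GGML",
)
print(source, file_name)

# The result can then be passed on in the same way as the snippet earlier in the thread:
# model = AutoModelForCausalLM.from_pretrained(source, model_type="llama", model_file=file_name)
```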
