GGML Version

by s3nh - opened Jul 29, 2023

Discussion

s3nh

Jul 29, 2023

Outstanding work! I just converted it to GGML; check it out if you are interested: https://huggingface.co/s3nh/LLaMA-2-7B-32K-GGML

deepakkaura26

Jul 29, 2023

@s3nh Can your converted model run easily on Colab's CPU?

mauriceweber

Aug 2, 2023

@deepakkaura26 I think so! By default you get 2 vCPUs on Colab with 13 GB of RAM, which should be enough to run the GGML versions.

deepakkaura26

Aug 2, 2023

@mauriceweber Actually, I tried it, but whether I chose CPU or GPU, my Colab session crashed 5 times.

mauriceweber

Aug 2, 2023

edited Aug 3, 2023

Which quantization did you try? I tried the 4-bit version on Colab and could run it without problems.

from ctransformers import AutoModelForCausalLM

# 4-bit quantized GGML file, pulled directly from the Hub repo below
model_file = "LLaMA-2-7B-32K.ggmlv3.q4_0.bin"
model = AutoModelForCausalLM.from_pretrained("s3nh/LLaMA-2-7B-32K-GGML", model_type="llama", model_file=model_file)

prompt = "Whales have been living in the oceans for millions of years "
# Print the completion so it also shows up outside a notebook cell
print(model(prompt, max_new_tokens=128, temperature=0.9, top_p=0.7))

EDIT: The snippet now loads the model directly from the Hub.

deepakkaura26

Aug 2, 2023

@mauriceweber I used the same example that is shown on the model page:

import torch  # needed for torch.float16 below
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
# Loads the full, unquantized weights in half precision
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16)

input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)

deepakkaura26

Aug 2, 2023

@mauriceweber I tried to run the code you showed, but it gives me the following error:

HTTPError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
260 try:
--> 261 response.raise_for_status()
262 except HTTPError as e:

11 frames
HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main

The above exception was the direct cause of the following exception:

RepositoryNotFoundError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
291 " make sure you are authenticated."
292 )
--> 293 raise RepositoryNotFoundError(message, response) from e
294
295 elif response.status_code == 400:

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64caab34-5bd826d76686f26a76b02644;7f562443-2822-41e5-bcd0-37c62aef99f9)

Repository Not Found for url: https://huggingface.co/api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

mauriceweber

Aug 3, 2023

> @mauriceweber I used the same example that is shown on the model page:
>
> import torch
> from transformers import AutoTokenizer, AutoModelForCausalLM
>
> tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
> model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16)
>
> input_context = "Your text here"
> input_ids = tokenizer.encode(input_context, return_tensors="pt")
> output = model.generate(input_ids, max_length=128, temperature=0.7)
> output_text = tokenizer.decode(output[0], skip_special_tokens=True)
> print(output_text)

Here you are not using the quantized (GGML) models, which is why you are running out of memory: you need around 14 GB of RAM for the 7B model in float16.
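As a rough back-of-the-envelope check of that figure, the memory for the weights is approximately the parameter count times the bytes per weight. The sketch below assumes a nominal 7 billion parameters and roughly 4.5 bits per weight for a 4-bit GGML quantization (4-bit values plus some per-block scale overhead). Both numbers are assumptions for illustration, not measurements, and the estimate ignores activations and the KV cache.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal "7B" parameter count, not the exact model size.
NOMINAL_PARAMS = 7e9

fp16_gb = estimate_weight_memory_gb(NOMINAL_PARAMS, 16)   # float16: 2 bytes per weight
q4_gb = estimate_weight_memory_gb(NOMINAL_PARAMS, 4.5)    # assumed ~4.5 bits per weight

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{q4_gb:.1f} GB")
```

Under these assumptions, the float16 weights alone exceed the RAM of the default Colab instance mentioned earlier in the thread, while the 4-bit file fits with room to spare.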

> @mauriceweber I tried to run the code you showed, but it gives me the following error

This error occurs because the model file had not been downloaded yet (I was assuming you had it downloaded to Colab). I adjusted the code snippet above so that the model file gets pulled directly from the repo. You can check the other model versions here.

Let us know how it goes!:)
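One way to guard against the repository lookup error discussed above is to check whether the model file exists locally before deciding what to hand to the loader. The helper below is a hypothetical sketch: the name `resolve_model_source` and its return shape are made up for illustration and are not part of ctransformers. It uses only the standard library and returns either a local path or the Hub repo id plus file name used in the snippet earlier in the thread.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_model_source(
    model_file: str, repo_id: str, local_dir: str = "."
) -> Tuple[str, Optional[str]]:
    """Return (source, hub_file).

    If model_file already exists in local_dir, source is the local path and
    hub_file is None. Otherwise source is the Hub repo id and hub_file is the
    file name to request from that repo.
    """
    local_path = Path(local_dir) / model_file
    if local_path.is_file():
        return str(local_path), None
    return repo_id, model_file


source, hub_file = resolve_model_source(
    "LLaMA-2-7B-32K.ggmlv3.q4_0.bin",  # file name from the thread
    "s3nh/LLaMA-2-7B-32K-GGML",        # repo id from the thread
)
print(source, hub_file)
```

With the ctransformers call shown earlier, `source` would go in as the first argument, and `hub_file`, when it is not `None`, would be passed as `model_file`.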

Sc0urge

Aug 25, 2023

Is this model already trained? Running the example code just gives me this:
