Optional dependencies on custom kernels don't work (Flash Depthwise, flash fft)

by Maykeye - opened Dec 9, 2023

Discussion

Maykeye

Dec 9, 2023 (edited Dec 9, 2023)

Flash depthwise:

There is no import for FlashDepthwiseConv1d. In both the GitHub repo and the HF repo the name appears only once, at the point where it is instantiated.
I'm not sure which package is intended, but flashfftconv, which the GitHub repo mentions, has FlashDepthWiseConv1d (upper-case W).
If I use from flashfftconv import FlashDepthWiseConv1d as FlashDepthwiseConv1d and enable flash_depthwise in the config, I get warnings about uninitialized parameters:

In [1]: model = AutoModelForCausalLM.from_pretrained(".", device="cuda", dtype=torch.float16, load_in_4bit=True, trust_remote_code=True)
bin /home/fella/src/sd/sd/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.21s/it]
Some weights of StripedHyenaModelForCausalLM were not initialized from the model checkpoint at . and are newly initialized: ['backbone.blocks.2.filter.fir_fn.bias', 'backbone.blocks.24.filter.fir_fn.bias', 'backbone.blocks.16.filter.fir_fn.weights', 'backbone.blocks.8.filter.fir_fn.bias', 'backbone.blocks.10.filter.fir_fn.weights', 'backbone.blocks.14.filter.fir_fn.bias', 'backbone.blocks.10.filter.fir_fn.bias', 'backbone.blocks.16.filter.fir_fn.bias', 'backbone.blocks.4.filter.fir_fn.weights', 'backbone.blocks.20.filter.fir_fn.bias', 'backbone.blocks.18.filter.fir_fn.weights', 'backbone.blocks.0.filter.fir_fn.bias', 'backbone.blocks.18.filter.fir_fn.bias', 'backbone.blocks.28.filter.fir_fn.bias', 'backbone.blocks.26.filter.fir_fn.bias', 'backbone.blocks.14.filter.fir_fn.weights', 'backbone.blocks.8.filter.fir_fn.weights', 'backbone.blocks.12.filter.fir_fn.bias', 'backbone.blocks.26.filter.fir_fn.weights', 'backbone.blocks.0.filter.fir_fn.weights', 'backbone.blocks.22.filter.fir_fn.bias', 'backbone.blocks.24.filter.fir_fn.weights', 'backbone.blocks.4.filter.fir_fn.bias', 'backbone.blocks.6.filter.fir_fn.bias', 'backbone.blocks.12.filter.fir_fn.weights', 'backbone.blocks.20.filter.fir_fn.weights', 'backbone.blocks.30.filter.fir_fn.weights', 'backbone.blocks.6.filter.fir_fn.weights', 'backbone.blocks.30.filter.fir_fn.bias', 'backbone.blocks.28.filter.fir_fn.weights', 'backbone.blocks.22.filter.fir_fn.weights', 'backbone.blocks.2.filter.fir_fn.weights']
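To make the warning easier to read, the keys can be grouped by parameter name. The following is a minimal sketch, not part of the model code: the helper name group_missing_keys is hypothetical, and the sample list holds only a few keys copied from the warning.

```python
import re
from collections import defaultdict


def group_missing_keys(keys):
    """Map each parameter suffix to the sorted block indices that report it."""
    pattern = re.compile(r"backbone\.blocks\.(\d+)\.(.+)")
    grouped = defaultdict(list)
    for key in keys:
        match = pattern.fullmatch(key)
        if match:
            # group(1) is the block index, group(2) the rest of the key
            grouped[match.group(2)].append(int(match.group(1)))
    return {suffix: sorted(blocks) for suffix, blocks in grouped.items()}


# A few keys copied from the warning (not the full list).
sample = [
    "backbone.blocks.2.filter.fir_fn.bias",
    "backbone.blocks.24.filter.fir_fn.bias",
    "backbone.blocks.16.filter.fir_fn.weights",
    "backbone.blocks.0.filter.fir_fn.weights",
]
summary = group_missing_keys(sample)
print(summary)
```

Applied to the full list in the warning, every key falls under filter.fir_fn.weights or filter.fir_fn.bias, on the even-numbered blocks 0 through 30. That pattern would be consistent with the swapped-in module registering parameter names that the checkpoint does not contain, though nothing in this thread confirms the cause.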

Flash FFT:
It's marked as compatible in the yml file, but if I try to use it, model.py raises an error:

        if config.get("use_flashfft", "False"):
            raise NotImplementedError("Please use standalone SH code for other custom kernels")
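A side note on the quoted check, sketched under the assumption that config behaves like a plain dict: the fallback passed to get is the string "False", not the boolean False. Any non-empty string is truthy in Python, so if the key were missing the branch would still be taken. The empty dict below is hypothetical and only illustrates that case.

```python
# Hypothetical config without the key; assumed to behave like a plain dict.
config = {}

# The default is the *string* "False", which is non-empty and therefore truthy.
string_default = bool(config.get("use_flashfft", "False"))

# With a boolean default, a missing key evaluates to False instead.
bool_default = bool(config.get("use_flashfft", False))

print(string_default, bool_default)
```

When the key is present, its stored value decides the outcome and the default is never used.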

(The GitHub version of this check is different, but once again there is a name mismatch: it uses flash_fft.conv, while a recently built flash-fft-conv uses flashfftconv. The GitHub code also uses config.seqlen, which is None in HF.)
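One way to see which of the two module names is importable in a given environment is to probe for them without importing anything. This is a minimal sketch: the helper name available_flash_modules is hypothetical, and the two candidate names are the ones mentioned in this post.

```python
import importlib.util


def available_flash_modules(candidates=("flashfftconv", "flash_fft")):
    """Return the candidate top-level module names that can be found."""
    # find_spec returns None for a top-level module that is not installed
    return [
        name for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


found = available_flash_modules()
print(found)
```

An empty list means neither name resolves, in which case the custom kernel path cannot be used regardless of the config.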

Zymrael

Dec 11, 2023

Additional optimizations will trickle in (including better support of custom kernels, quantization). Stay tuned :)
