How to use teolm30/Fox-1.5 with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="teolm30/Fox-1.5",
    filename="model.gguf",
)
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)

How to use teolm30/Fox-1.5 with llama.cpp:
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf teolm30/Fox-1.5
# Run inference directly in the terminal:
llama cli -hf teolm30/Fox-1.5

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf teolm30/Fox-1.5
# Run inference directly in the terminal:
llama cli -hf teolm30/Fox-1.5

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf teolm30/Fox-1.5
# Run inference directly in the terminal:
./llama-cli -hf teolm30/Fox-1.5

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf teolm30/Fox-1.5
# Run inference directly in the terminal:
./build/bin/llama-cli -hf teolm30/Fox-1.5

docker model run hf.co/teolm30/Fox-1.5
How to use teolm30/Fox-1.5 with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "teolm30/Fox-1.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "teolm30/Fox-1.5",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'

docker model run hf.co/teolm30/Fox-1.5
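
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request`, `extract_reply`, and `chat` are illustrative and not part of vLLM or any other server. It assumes the server returns the usual OpenAI-style chat completion body, with the reply under `choices[0].message.content`.

```python
import json
import urllib.request


def build_chat_request(prompt, model="teolm30/Fox-1.5",
                       base_url="http://localhost:8000/v1"):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send the request and return the reply; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return extract_reply(json.load(resp))


# Example call (only works while the server is up):
#   print(chat("What is the capital of France?"))
```

The same helper should work against the other OpenAI-compatible servers on this page by changing `base_url`, for example `http://localhost:30000/v1` for SGLang or `http://localhost:8080/v1` for the llama.cpp server.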
How to use teolm30/Fox-1.5 with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "teolm30/Fox-1.5" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "teolm30/Fox-1.5",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'

docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "teolm30/Fox-1.5" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "teolm30/Fox-1.5",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'

How to use teolm30/Fox-1.5 with Ollama:
ollama run hf.co/teolm30/Fox-1.5
How to use teolm30/Fox-1.5 with Unsloth Studio:
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for teolm30/Fox-1.5 to start chatting

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for teolm30/Fox-1.5 to start chatting

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for teolm30/Fox-1.5 to start chatting

How to use teolm30/Fox-1.5 with Pi:
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf teolm30/Fox-1.5
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "teolm30/Fox-1.5"
        }
      ]
    }
  }
}
# Start Pi in your project directory:
pi
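
If `~/.pi/agent/models.json` already lists other providers, the entry can be merged in rather than overwriting the file. The sketch below builds the same provider block in Python and adds it to an existing configuration dict. The helper names are hypothetical, and it assumes the file is plain JSON with a top-level `providers` object, as in the snippet above.

```python
import json


def llama_cpp_provider(model_id, base_url="http://localhost:8080/v1"):
    """Return the provider entry shown above for a given model id."""
    return {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }


def add_provider(config, name, provider):
    """Return a copy of config with the provider added under 'providers'."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers[name] = provider
    merged["providers"] = providers
    return merged


# Start from an empty config here; in practice, load the existing file first.
config = add_provider({}, "llama-cpp", llama_cpp_provider("teolm30/Fox-1.5"))
print(json.dumps(config, indent=2))
```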
How to use teolm30/Fox-1.5 with Hermes Agent:
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf teolm30/Fox-1.5
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default teolm30/Fox-1.5
hermes
How to use teolm30/Fox-1.5 with Docker Model Runner:
docker model run hf.co/teolm30/Fox-1.5
How to use teolm30/Fox-1.5 with Lemonade:
# Download Lemonade from https://lemonade-server.ai/
lemonade pull teolm30/Fox-1.5
lemonade run user.Fox-1.5-{{QUANT_TAG}}
lemonade list
| Metric | Value |
|---|---|
| Throughput | ~35 tokens/sec (RTX 3050, 6GB VRAM) |
| Avg Latency | ~4-5s per response |
| Success Rate | 100% (5/5 tasks) |
| Tokens/Response | ~150 avg |
| MMLU (ref) | ~72% |
| GSM8K (ref) | ~58% |
| HumanEval (ref) | ~55% |
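
As a rough consistency check on the table above, the reported latency can be compared with what the throughput and response length imply. This sketch assumes latency is dominated by token generation and ignores prompt processing and any startup overhead; the function name is illustrative.

```python
def estimated_latency_s(tokens, tokens_per_sec):
    """Generation time in seconds for a response of the given length."""
    return tokens / tokens_per_sec


# ~150 tokens per response at ~35 tokens/sec, both taken from the table
latency = estimated_latency_s(150, 35)
print(f"{latency:.1f} s")  # 4.3 s
```

The result falls inside the reported range of ~4-5 s per response.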
| Task | Prompt | Check | Result |
|---|---|---|---|
| Math | "A farmer has 17 sheep. All but 9 run away. How many sheep left?" | 9 | ✅ |
| Coding | "Write a Python function to check if a number is prime." | def | ✅ |
| Knowledge | "What is the capital of Greece?" | athens | ✅ |
| Logic | "If all cats are animals and some animals are pets, then some cats are pets. True or false?" | true | ✅ |
| Translation | "Translate to Greek: Hello, how are you?" | γεια | ✅ |
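
The Check column reads like a keyword that must appear in the model's answer. The source does not say how the check was implemented, so the sketch below assumes a case-insensitive substring match. The sample responses are hypothetical stand-ins, not actual model output.

```python
# (task, expected keyword) pairs copied from the table above
CHECKS = [
    ("Math", "9"),
    ("Coding", "def"),
    ("Knowledge", "athens"),
    ("Logic", "true"),
    ("Translation", "γεια"),
]


def passes(response, expected):
    """Assumed check: the expected keyword occurs in the response, ignoring case."""
    return expected.lower() in response.lower()


def success_rate(responses):
    """Return (passed, total) for a dict mapping task name to response text."""
    passed = sum(passes(responses[task], expected) for task, expected in CHECKS)
    return passed, len(CHECKS)


# Hypothetical responses, for illustration only
sample = {
    "Math": "There are 9 sheep left.",
    "Coding": "def is_prime(n): ...",
    "Knowledge": "The capital of Greece is Athens.",
    "Logic": "True",
    "Translation": "Γεια σου, τι κάνεις;",
}
print(success_rate(sample))  # (5, 5)
```

A substring check of this kind only confirms that a keyword appears; it does not verify that the full answer is correct.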
| Property | Value |
|---|---|
| Base Model | Qwen2.5-7B-Instruct |
| Quantization | GPTQ 4-bit |
| Parameters | 7B |
| Context Length | 32K tokens |
| Size | 5.3GB |
| VRAM Required | ~6GB |
| License | Apache 2.0 |
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "teolm30/Fox-1.5"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "Explain quantum entanglement in simple terms"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move the inputs to the device that device_map="auto" placed the model on
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
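
Note that `outputs[0]` holds the prompt tokens followed by the generated tokens, so the decode above prints the prompt as well as the reply. One way to print only the reply is to slice off the prompt before decoding. The helper below is an illustrative sketch, not part of transformers; it is shown on a plain list, and the same slicing applies to a 1-D tensor.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the first prompt_length ids, keeping only the generated tokens."""
    return output_ids[prompt_length:]


# With the variables from the snippet above, this would be used as:
#   prompt_length = inputs["input_ids"].shape[1]
#   reply_ids = new_tokens_only(outputs[0], prompt_length)
#   print(tokenizer.decode(reply_ids, skip_special_tokens=True))

# Stand-in example with made-up token ids: 3 prompt tokens, 3 generated
print(new_tokens_only([101, 102, 103, 7, 8, 9], 3))  # [7, 8, 9]
```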
For 4-bit GPTQ loading: pip install auto-gptq optimum
Built by T_craftClaw 🔥 | Owner: teolm30