Instructions to use Qapdex/SLM750-Edge-1.58-bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use Qapdex/SLM750-Edge-1.58-bit with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="Qapdex/SLM750-Edge-1.58-bit", filename="quantized_q4km.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use Qapdex/SLM750-Edge-1.58-bit with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT # Run inference directly in the terminal: llama cli -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT # Run inference directly in the terminal: llama cli -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT # Run inference directly in the terminal: ./llama-cli -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT # Run inference directly in the terminal: ./build/bin/llama-cli -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
Use Docker
docker model run hf.co/Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
- LM Studio
- Jan
- vLLM
How to use Qapdex/SLM750-Edge-1.58-bit with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Qapdex/SLM750-Edge-1.58-bit" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Qapdex/SLM750-Edge-1.58-bit", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
- Ollama
How to use Qapdex/SLM750-Edge-1.58-bit with Ollama:
ollama run hf.co/Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
- Unsloth Studio
How to use Qapdex/SLM750-Edge-1.58-bit with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Qapdex/SLM750-Edge-1.58-bit to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Qapdex/SLM750-Edge-1.58-bit to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for Qapdex/SLM750-Edge-1.58-bit to start chatting
- Atomic Chat new
- Docker Model Runner
How to use Qapdex/SLM750-Edge-1.58-bit with Docker Model Runner:
docker model run hf.co/Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
- Lemonade
How to use Qapdex/SLM750-Edge-1.58-bit with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
Run and chat with the model
lemonade run user.SLM750-Edge-1.58-bit-Q4_K_M_QUANT
List all available models
lemonade list
SLM750-Edge 1.58-bit โ Q4_K_M Quantized GGUF
A compact, efficiently quantized BitNet b1.58 ternary model optimized for edge deployment. This repository provides a plug-and-play GGUF file ready for use with llama.cpp and its ecosystem (llama-cli, llama-cpp-python, text-generation-webui, and more).
โ ๏ธ IMPORTANT โ PLEASE READ: This model does NOT run with the standard
llama.cppfrom the release. It requires a patchedbitnet.cpp(see below). Without the patch, you will encountermissing tensor 'blk.0.attn_sub_norm.weight'or similar errors.
SLM750-Edge: Distillation + GGUF Build Colab
slm750_edge_gpu_package.tgz.gz Google Drive
Model Highlights
| Attribute | Value |
|---|---|
| Architecture | BitNet b1.58 (ternary {-1, 0, +1} weights) |
| Parameters | ~1.4B |
| Embedding Dim | 1,536 |
| Attention Heads | 12 (4 KV heads, GQA) |
| Feed-Forward | 4,096 (ReLUยฒ activation) |
| Context Length | 8,192 tokens |
| Vocabulary | 256,000 tokens (Gemma-2 tokenizer) |
| Quantization | Q4_K_M (5.27 BPW) |
| File Size | 873 MB |
| License | BitNet 1.58 |
File Description
| File | Description |
|---|---|
quantized_q4km.gguf |
โ Q4_K_M quantized GGUF (873 MB) |
โ
Recommended |
โ Q4_K_M quantized GGUF (~111 tok/s on CPU) quant_fixed_from_source_gguf (921 MB) |
Quick Start
Prerequisites
- llama.cpp built from source with BitNet support (commit
52b3df002or later), OR - llama-cpp-python v0.3.x+
Usage with llama-cli
# Basic text generation
./llama-cli -m quantized_q4km.gguf \
-p "Explain quantum computing in simple terms" \
-n 256 \
-t 4 \
--temp 0.7 \
--top-p 0.9
# Chat mode
./llama-cli -m quantized_q4km.gguf \
-p "You are a helpful assistant." \
--chat-template gemma \
-n 512 \
-t 4
Usage with llama-cpp-python (Python)
from llama_cpp import Llama
llm = Llama(
model_path="quantized_q4km.gguf",
n_ctx=8192,
n_threads=4,
verbose=False,
)
output = llm(
"What is the meaning of life?",
max_tokens=256,
temperature=0.7,
top_p=0.9,
echo=False,
)
print(output["choices"][0]["text"])
Usage with text-generation-webui
- Place quantized_q4km.gguf in the models/ directory.
- Launch text-generation-webui with --model quantized_q4km.gguf.
- Select the model in the UI under the "Model" tab.
Usage with LangChain
from langchain_community.llms import LlamaCpp
llm = LlamaCpp(
model_path="quantized_q4km.gguf",
n_ctx=8192,
n_threads=4,
temperature=0.7,
top_p=0.9,
verbose=False,
)
response = llm.invoke("Write a short poem about AI.")
print(response)
Performance
Benchmarks measured on an AMD EPYC 7763 (4 vCPUs, 15 GB RAM, CPU only):
Metric Performance
Prompt Processing (pp=512, t=2) 64.6 t/s
Text Generation (tg=128, t=2) 27.3 t/s
Prompt Processing (pp=512, t=4) 81.8 t/s
Text Generation (tg=128, t=4) 7.7 t/s
Model Load Time ~2 seconds
Peak RAM Usage ~1.5 GB
Architecture Details
BitNet b1.58
This model implements the BitNet b1.58 architecture introduced by Microsoft Research, where all weight matrices are ternary-valued ({-1, 0, +1}). This drastically reduces memory footprint and computational cost while retaining model quality.
Key Architectural Features
Component Specification
Weight Precision Ternary {-1, 0, +1} (training), Q4_K_M (storage)
FFN Activation ReLUยฒ (relu(x)ยฒ)
Attention Grouped-Query Attention (GQA), 12 heads, 4 KV heads
Positional Encoding RoPE (Rotary Position Embeddings)
Normalization RMSNorm (epsilon = 1e-6)
Logit Softcapping Attention: 50.0, Final: 30.0 (tanh-based)
Context Length 8,192 tokens
Quantization Format
The model is quantized using Q4_K_M (4-bit K-quant, medium size):
File type: LLAMA_FTYPE_MOSTLY_Q4_K_M (15)
BPW: 5.27 bits per weight (including overhead)
Compression ratio: ~6:1 vs. full precision
Method: llama.cpp llama-quantize with --allow-requantize
Compatibility
Supported Runtimes
Runtime Status Notes
llama.cpp (mainline) โ
Full Requires LLM_ARCH_BITNET support (commit 52b3df002+)
llama-cpp-python โ
Full v0.3.x+ with BitNet support
text-generation-webui โ
Full Via llama.cpp backend
LangChain โ
Full Via LlamaCpp wrapper
Ollama โ ๏ธ Manual Requires custom Modelfile; not officially supported
llama-cpp.server โ
Full OpenAI-compatible API server
Known Limitations
GPU offloading is not supported for BitNet architectures in the current llama.cpp release โ all inference runs on CPU. Flash Attention is not compatible with the BitNet attention implementation. Batch inference (parallel decoding) is limited by the CPU-only constraint.
Building llama.cpp with BitNet Support
https://copilot.microsoft.com/shares/pages/kaesS7TdPccnaLsu4iQ9T
๐ Thanks to
- llama.cpp
- BitNet b1.58
- Gemma-2
- SmolLM2
- Heyneo
- Google Ai Assistant
- Microsoft Copilot
- Downloads last month
- 226
4-bit
Model tree for Qapdex/SLM750-Edge-1.58-bit
Unable to build the model tree, the base model loops to the model itself. Learn more.