Instructions to use Qapdex/SLM750-Edge-1.58-bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qapdex/SLM750-Edge-1.58-bit with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Qapdex/SLM750-Edge-1.58-bit",
	filename="quantized_q4km.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use Qapdex/SLM750-Edge-1.58-bit with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
# Run inference directly in the terminal:
llama cli -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
# Run inference directly in the terminal:
llama cli -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
# Run inference directly in the terminal:
./llama-cli -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT

Use Docker

docker model run hf.co/Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT

LM Studio
Jan

vLLM

How to use Qapdex/SLM750-Edge-1.58-bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qapdex/SLM750-Edge-1.58-bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qapdex/SLM750-Edge-1.58-bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT

Ollama

How to use Qapdex/SLM750-Edge-1.58-bit with Ollama:

ollama run hf.co/Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT

Unsloth Studio

How to use Qapdex/SLM750-Edge-1.58-bit with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Qapdex/SLM750-Edge-1.58-bit to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Qapdex/SLM750-Edge-1.58-bit to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Qapdex/SLM750-Edge-1.58-bit to start chatting

Atomic Chat new
Docker Model Runner
How to use Qapdex/SLM750-Edge-1.58-bit with Docker Model Runner:
```
docker model run hf.co/Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT
```

Lemonade

How to use Qapdex/SLM750-Edge-1.58-bit with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Qapdex/SLM750-Edge-1.58-bit:Q4_K_M_QUANT

Run and chat with the model

lemonade run user.SLM750-Edge-1.58-bit-Q4_K_M_QUANT

List all available models

lemonade list

SLM750-Edge 1.58-bit — Q4_K_M Quantized GGUF

A compact, efficiently quantized BitNet b1.58 ternary model optimized for edge deployment. This repository provides a plug-and-play GGUF file ready for use with llama.cpp and its ecosystem (llama-cli, llama-cpp-python, text-generation-webui, and more).

⚠️ IMPORTANT — PLEASE READ: This model does NOT run with the standard llama.cpp from the release. It requires a patched bitnet.cpp (see below). Without the patch, you will encounter missing tensor 'blk.0.attn_sub_norm.weight' or similar errors.

SLM750-Edge: Distillation + GGUF Build Colab

slm750_edge_gpu_package.tgz.gz Google Drive

Model Highlights

Attribute	Value
Architecture	BitNet b1.58 (ternary {-1, 0, +1} weights)
Parameters	~1.4B
Embedding Dim	1,536
Attention Heads	12 (4 KV heads, GQA)
Feed-Forward	4,096 (ReLU² activation)
Context Length	8,192 tokens
Vocabulary	256,000 tokens (Gemma-2 tokenizer)
Quantization	Q4_K_M (5.27 BPW)
File Size	873 MB
License	BitNet 1.58

File Description

File	Description
`quantized_q4km.gguf`	— Q4_K_M quantized GGUF (873 MB)
✅ `Recommended`	— Q4_K_M quantized GGUF (~111 tok/s on CPU) quant_fixed_from_source_gguf (921 MB)

Quick Start

Prerequisites

llama.cpp built from source with BitNet support (commit 52b3df002 or later), OR
llama-cpp-python v0.3.x+

Usage with llama-cli

# Basic text generation
./llama-cli -m quantized_q4km.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 256 \
  -t 4 \
  --temp 0.7 \
  --top-p 0.9

# Chat mode
./llama-cli -m quantized_q4km.gguf \
  -p "You are a helpful assistant." \
  --chat-template gemma \
  -n 512 \
  -t 4

Usage with llama-cpp-python (Python)

from llama_cpp import Llama

llm = Llama(
    model_path="quantized_q4km.gguf",
    n_ctx=8192,
    n_threads=4,
    verbose=False,
)

output = llm(
    "What is the meaning of life?",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    echo=False,
)

print(output["choices"][0]["text"])

Usage with text-generation-webui

Place quantized_q4km.gguf in the models/ directory.
Launch text-generation-webui with --model quantized_q4km.gguf.
Select the model in the UI under the "Model" tab.

Usage with LangChain

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="quantized_q4km.gguf",
    n_ctx=8192,
    n_threads=4,
    temperature=0.7,
    top_p=0.9,
    verbose=False,
)

response = llm.invoke("Write a short poem about AI.")
print(response)

Performance

Benchmarks measured on an AMD EPYC 7763 (4 vCPUs, 15 GB RAM, CPU only):

Metric	Performance
Prompt Processing (pp=512, t=2)	64.6 t/s
Text Generation (tg=128, t=2)	27.3 t/s
Prompt Processing (pp=512, t=4)	81.8 t/s
Text Generation (tg=128, t=4)	7.7 t/s
Model Load Time	~2 seconds
Peak RAM Usage	~1.5 GB
Architecture Details
BitNet b1.58
This model implements the BitNet b1.58 architecture introduced by Microsoft Research, where all weight matrices are ternary-valued ({-1, 0, +1}). This drastically reduces memory footprint and computational cost while retaining model quality.

Key Architectural Features

Component	Specification
Weight Precision	Ternary {-1, 0, +1} (training), Q4_K_M (storage)
FFN Activation	ReLU² (relu(x)²)
Attention	Grouped-Query Attention (GQA), 12 heads, 4 KV heads
Positional Encoding	RoPE (Rotary Position Embeddings)
Normalization	RMSNorm (epsilon = 1e-6)
Logit Softcapping	Attention: 50.0, Final: 30.0 (tanh-based)
Context Length	8,192 tokens
Quantization Format
The model is quantized using Q4_K_M (4-bit K-quant, medium size):
File type: LLAMA_FTYPE_MOSTLY_Q4_K_M (15)
BPW: 5.27 bits per weight (including overhead)
Compression ratio: ~6:1 vs. full precision
Method: llama.cpp llama-quantize with --allow-requantize

Compatibility

Supported Runtimes
Runtime	Status	Notes
llama.cpp (mainline)	✅ Full	Requires LLM_ARCH_BITNET support (commit 52b3df002+)
llama-cpp-python	✅ Full	v0.3.x+ with BitNet support
text-generation-webui	✅ Full	Via llama.cpp backend
LangChain	✅ Full	Via LlamaCpp wrapper
Ollama	⚠️ Manual	Requires custom Modelfile; not officially supported
llama-cpp.server	✅ Full	OpenAI-compatible API server

Known Limitations

GPU offloading is not supported for BitNet architectures in the current llama.cpp release — all inference runs on CPU. Flash Attention is not compatible with the BitNet attention implementation. Batch inference (parallel decoding) is limited by the CPU-only constraint.

Building llama.cpp with BitNet Support

https://copilot.microsoft.com/shares/pages/kaesS7TdPccnaLsu4iQ9T

🙏 Thanks to

llama.cpp
BitNet b1.58
Gemma-2
SmolLM2
Heyneo
Google Ai Assistant
Microsoft Copilot

Downloads last month: 226

GGUF

Model size

1B params

Architecture

bitnet

Hardware compatibility

4-bit

View +1 variant

Model tree for Qapdex/SLM750-Edge-1.58-bit

Unable to build the model tree, the base model loops to the model itself. Learn more.

Dataset used to train Qapdex/SLM750-Edge-1.58-bit

Paper for Qapdex/SLM750-Edge-1.58-bit

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Paper • 2402.17764 • Published Feb 27, 2024 • 630