SLM750-Edge 1.58-bit โ€” Q4_K_M Quantized GGUF

A compact, efficiently quantized BitNet b1.58 ternary model optimized for edge deployment. This repository provides a plug-and-play GGUF file ready for use with llama.cpp and its ecosystem (llama-cli, llama-cpp-python, text-generation-webui, and more).

โš ๏ธ IMPORTANT โ€” PLEASE READ: This model does NOT run with the standard llama.cpp from the release. It requires a patched bitnet.cpp (see below). Without the patch, you will encounter missing tensor 'blk.0.attn_sub_norm.weight' or similar errors.


SLM750-Edge: Distillation + GGUF Build Colab

slm750_edge_gpu_package.tgz.gz Google Drive


Model Highlights

Attribute Value
Architecture BitNet b1.58 (ternary {-1, 0, +1} weights)
Parameters ~1.4B
Embedding Dim 1,536
Attention Heads 12 (4 KV heads, GQA)
Feed-Forward 4,096 (ReLUยฒ activation)
Context Length 8,192 tokens
Vocabulary 256,000 tokens (Gemma-2 tokenizer)
Quantization Q4_K_M (5.27 BPW)
File Size 873 MB
License BitNet 1.58

File Description

File Description
quantized_q4km.gguf โ€” Q4_K_M quantized GGUF (873 MB)
โœ… Recommended โ€” Q4_K_M quantized GGUF (~111 tok/s on CPU) quant_fixed_from_source_gguf (921 MB)

Quick Start

Prerequisites

  • llama.cpp built from source with BitNet support (commit 52b3df002 or later), OR
  • llama-cpp-python v0.3.x+

Usage with llama-cli

# Basic text generation
./llama-cli -m quantized_q4km.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 256 \
  -t 4 \
  --temp 0.7 \
  --top-p 0.9

# Chat mode
./llama-cli -m quantized_q4km.gguf \
  -p "You are a helpful assistant." \
  --chat-template gemma \
  -n 512 \
  -t 4

Usage with llama-cpp-python (Python)

from llama_cpp import Llama

llm = Llama(
    model_path="quantized_q4km.gguf",
    n_ctx=8192,
    n_threads=4,
    verbose=False,
)

output = llm(
    "What is the meaning of life?",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    echo=False,
)

print(output["choices"][0]["text"])

Usage with text-generation-webui

  1. Place quantized_q4km.gguf in the models/ directory.
  2. Launch text-generation-webui with --model quantized_q4km.gguf.
  3. Select the model in the UI under the "Model" tab.

Usage with LangChain

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="quantized_q4km.gguf",
    n_ctx=8192,
    n_threads=4,
    temperature=0.7,
    top_p=0.9,
    verbose=False,
)

response = llm.invoke("Write a short poem about AI.")
print(response)

Performance

Benchmarks measured on an AMD EPYC 7763 (4 vCPUs, 15 GB RAM, CPU only):

Metric	Performance
Prompt Processing (pp=512, t=2)	64.6 t/s
Text Generation (tg=128, t=2)	27.3 t/s
Prompt Processing (pp=512, t=4)	81.8 t/s
Text Generation (tg=128, t=4)	7.7 t/s
Model Load Time	~2 seconds
Peak RAM Usage	~1.5 GB
Architecture Details
BitNet b1.58
This model implements the BitNet b1.58 architecture introduced by Microsoft Research, where all weight matrices are ternary-valued ({-1, 0, +1}). This drastically reduces memory footprint and computational cost while retaining model quality.

Key Architectural Features

Component	Specification
Weight Precision	Ternary {-1, 0, +1} (training), Q4_K_M (storage)
FFN Activation	ReLUยฒ (relu(x)ยฒ)
Attention	Grouped-Query Attention (GQA), 12 heads, 4 KV heads
Positional Encoding	RoPE (Rotary Position Embeddings)
Normalization	RMSNorm (epsilon = 1e-6)
Logit Softcapping	Attention: 50.0, Final: 30.0 (tanh-based)
Context Length	8,192 tokens
Quantization Format
The model is quantized using Q4_K_M (4-bit K-quant, medium size):
File type: LLAMA_FTYPE_MOSTLY_Q4_K_M (15)
BPW: 5.27 bits per weight (including overhead)
Compression ratio: ~6:1 vs. full precision
Method: llama.cpp llama-quantize with --allow-requantize

Compatibility

Supported Runtimes
Runtime	Status	Notes
llama.cpp (mainline)	โœ… Full	Requires LLM_ARCH_BITNET support (commit 52b3df002+)
llama-cpp-python	โœ… Full	v0.3.x+ with BitNet support
text-generation-webui	โœ… Full	Via llama.cpp backend
LangChain	โœ… Full	Via LlamaCpp wrapper
Ollama	โš ๏ธ Manual	Requires custom Modelfile; not officially supported
llama-cpp.server	โœ… Full	OpenAI-compatible API server

Known Limitations

GPU offloading is not supported for BitNet architectures in the current llama.cpp release โ€” all inference runs on CPU. Flash Attention is not compatible with the BitNet attention implementation. Batch inference (parallel decoding) is limited by the CPU-only constraint.

Building llama.cpp with BitNet Support

https://copilot.microsoft.com/shares/pages/kaesS7TdPccnaLsu4iQ9T

๐Ÿ™ Thanks to

Downloads last month
226
GGUF
Model size
1B params
Architecture
bitnet
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for Qapdex/SLM750-Edge-1.58-bit

Unable to build the model tree, the base model loops to the model itself. Learn more.

Dataset used to train Qapdex/SLM750-Edge-1.58-bit

Paper for Qapdex/SLM750-Edge-1.58-bit