KoHRM-Text-1.4B-GGUF

GGUF exports for LLM-OS-Models/KoHRM-Text-1.4B.

This is a custom hrm_text architecture. Standard upstream llama.cpp, Ollama, LM Studio, and other GGUF frontends will not load these files until hrm_text support lands upstream. Use the included runtime patch:

runtime/llama.cpp-hrm_text.patch

The patch is based on the HRM-Text GGUF work from sinimiini/HRM-Text-1B-GGUF, adapted for KoHRM-Text-1.4B. The KoHRM conversion infers the physical H/L stack depth from the safetensors file, because the public config reports num_hidden_layers=32 while the exported tensors are arranged as two stacks of 16 layers each (H: 16, L: 16).
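
As a rough illustration of that inference, the sketch below reads tensor names on stdin and reports a stack's depth as its highest layer index plus one. The naming pattern (H_level.layers.N, L_level.layers.N) and the helper name stack_depth are assumptions made for illustration; the actual KoHRM tensor names and conversion code are not shown here and may differ.

```shell
# stack_depth PREFIX: hypothetical helper, not part of the patch.
# Reads tensor names on stdin and prints (max layer index + 1) for PREFIX.
stack_depth() {
  grep -o "$1\.layers\.[0-9][0-9]*" |  # keep only "PREFIX.layers.N"
    sed 's/.*\.//' |                   # strip everything up to the index
    sort -n | tail -n 1 |              # highest index seen
    awk '{print $1 + 1}'               # depth = highest index + 1
}

# Example with made-up tensor names:
printf '%s\n' \
  'H_level.layers.0.attn.weight' \
  'H_level.layers.15.mlp.weight' \
  'L_level.layers.15.attn.weight' | stack_depth H_level
```

With the made-up names above, both stacks would report a depth of 16, which is the situation the conversion has to detect when the config only says num_hidden_layers=32.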

Files

File                         Type    Size  SHA-256
KoHRM-Text-1.4B-BF16.gguf    BF16    2.6G  d5c66f994327c1e2f05b33b0a2ff798a1d05f8b905b7f93943e101bca06c8b0a
KoHRM-Text-1.4B-Q8_0.gguf    Q8_0    1.4G  8dae86207987804c7e8fc34fcba0d78ae2e54cd8563e907e9e5aea8442f7300c
KoHRM-Text-1.4B-Q6_K.gguf    Q6_K    1.1G  dd54d24344e842c3cd0f261e4b740c42c0ec78ed0b3414cdb8b2ac5022b7fb8a
KoHRM-Text-1.4B-Q5_K_M.gguf  Q5_K_M  961M  90f47f54bd7cf545583a2be43a9d0c971cf6112ff16261e2e926cfabe2e9e35a
KoHRM-Text-1.4B-Q4_K_M.gguf  Q4_K_M  841M  e521243be6733796f221ec7de3ca3d1ff9014301f812138a173072d6def2f090
KoHRM-Text-1.4B-Q3_K_M.gguf  Q3_K_M  700M  29fe588fdc434980cdc484c6324af8ca0c92122b995b26a09b8fed5baceae4be
KoHRM-Text-1.4B-Q2_K.gguf    Q2_K    569M  6010878b117e639f9a1fb5332aa6c5a76bdf50ee08e6af4d7661a12d77cf7157
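
To confirm that a download matches the listed digest, the file's SHA-256 can be compared against the value in the table. A minimal sketch, assuming GNU sha256sum is available (verify_gguf is a hypothetical helper name, not something shipped in this repository):

```shell
# verify_gguf FILE EXPECTED_SHA256: hypothetical helper.
# Prints "OK: FILE" on a match; otherwise reports the mismatch and returns 1.
verify_gguf() {
  local file="$1" expected="$2" actual
  # sha256sum is GNU coreutils; on macOS, `shasum -a 256` prints the same format.
  actual="$(sha256sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}
```

For example, after downloading the Q8_0 file: verify_gguf KoHRM-Text-1.4B-Q8_0.gguf 8dae86207987804c7e8fc34fcba0d78ae2e54cd8563e907e9e5aea8442f7300c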

Build Patched llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 6a257d44633d4a752183ed778b88d2924d0a6b9d
git apply /path/to/runtime/llama.cpp-hrm_text.patch

cmake -S . -B build-hrm \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=OFF \
  -DGGML_NATIVE=OFF

cmake --build build-hrm --target llama-cli llama-quantize llama-completion llama-results -j 8

CPU Run

Download a quantized GGUF file:

huggingface-cli download LLM-OS-Models/KoHRM-Text-1.4B-GGUF \
  KoHRM-Text-1.4B-Q8_0.gguf \
  --local-dir .

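Run on CPU. The prompt is in Korean and asks for a brief explanation of how won/dollar exchange-rate moves affect won-denominated returns on overseas stock investments, and how to respond: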

./build-hrm/bin/llama-cli \
  -m ./KoHRM-Text-1.4B-Q8_0.gguf \
  -ngl 0 \
  -t 4 \
  -c 1024 \
  -n 260 \
  --seed 41 \
  --temp 0.45 \
  --top-p 0.9 \
  --repeat-penalty 1.08 \
  --single-turn \
  --simple-io \
  --no-warmup \
  --display-prompt \
  -p $'해외주식 투자에서 원/달러 환율 변동이 원화 수익률에 미치는 영향과 대응 방안을 간단히 설명해 주세요.'

H/L Cycle Override (modified run with explicit H/L cycles)

KoHRM-Text-GGUF stores recurrence settings as GGUF metadata:

hrm_text.h_cycles = 2
hrm_text.l_cycles = 3

With the current patched llama.cpp, these keys can be overridden at run time, during model loading, through a metadata override:

./build-hrm/bin/llama-cli \
  -m ./KoHRM-Text-1.4B-Q8_0.gguf \
  -ngl 0 \
  -t 4 \
  -c 1024 \
  -n 260 \
  --seed 41 \
  --temp 0.45 \
  --top-p 0.9 \
  --repeat-penalty 1.08 \
  --single-turn \
  --simple-io \
  --no-warmup \
  --display-prompt \
  --override-kv hrm_text.h_cycles=int:1 \
  --override-kv hrm_text.l_cycles=int:2 \
  -p $'해외주식 투자에서 원/달러 환율 변동이 원화 수익률에 미치는 영향과 대응 방안을 간단히 설명해 주세요.'

Usage notes:

  • Lowering h_cycles/l_cycles tends to speed up responses under otherwise identical conditions, but quality degradation often becomes more frequent.
  • 2/3 is the current default (the documented stable configuration).
  • 1/2 is recommended for speed-first testing; for actual precision-sensitive inference, 2/3 is more stable.
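
For a rough sense of how cycle counts scale compute, the sketch below counts layer evaluations per token. It assumes the nesting used in the original HRM design, where each H cycle runs l_cycles passes of the L stack followed by one pass of the H stack; this document does not confirm that the patch schedules the stacks this way, so treat the numbers as an estimate. stack_passes is a hypothetical helper name, and the default of 16 is the layers_per_stack value from the metadata.

```shell
# stack_passes H_CYCLES L_CYCLES [LAYERS_PER_STACK]: hypothetical estimate.
stack_passes() {
  local h="$1" l="$2" layers="${3:-16}"
  # Assumed schedule: per H cycle, l passes of the L stack plus 1 pass of the H stack.
  echo $(( layers * (h + h * l) ))
}

stack_passes 2 3   # default cycles
stack_passes 1 2   # speed-first setting
```

Under this assumption the default 2/3 gives 128 layer evaluations and 1/2 gives 48. The 128 figure coincides with the hrm_text.block_count value listed under GGUF Metadata, though the source does not say how that value is derived.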

--override-kv takes arguments of the form KEY=TYPE:VALUE, the same as in upstream llama.cpp:

  • hrm_text.h_cycles=int:1
  • hrm_text.l_cycles=int:2
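
When trying several cycle settings, it can help to generate the two flags from one place. A minimal sketch (hrm_cycle_args and is_pos_int are hypothetical helper names; the function only assembles the flags shown above and rejects values that are not positive integers):

```shell
# is_pos_int VALUE: succeeds only for 1, 2, 3, ... (no sign, no leading zero).
is_pos_int() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# hrm_cycle_args H L: prints the two --override-kv flags for the given cycles.
hrm_cycle_args() {
  if ! is_pos_int "$1" || ! is_pos_int "$2"; then
    echo "usage: hrm_cycle_args H_CYCLES L_CYCLES (positive integers)" >&2
    return 1
  fi
  printf '%s\n' "--override-kv hrm_text.h_cycles=int:$1 --override-kv hrm_text.l_cycles=int:$2"
}
```

The output is meant to be word-split by the shell, so leave the substitution unquoted, for example: ./build-hrm/bin/llama-cli -m ./KoHRM-Text-1.4B-Q8_0.gguf $(hrm_cycle_args 1 2) followed by the remaining options.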

If you need a persistent configuration (e.g., fixed 1/1 for a workload), change H_cycles / L_cycles in the source config and then export a new GGUF with convert_hf_to_gguf.py. That bakes one set of cycle counts into the artifact, so no --override-kv flags are needed at run time.

CPU Generation Tests

Tested locally on CPU with the patched llama.cpp build and the prompt shown above.

This prompt was chosen after checking the KoHRM training-data path. KoHRM uses the HRM V1Dataset instruction-response layout:

<|im_start|><condition_token>instruction<|im_end|>response<|box_end|>

The instruction/prefix span is excluded from the loss; only the response span is trained. Local decoded samples include short Korean finance QA rows, so the representative GGUF smoke prompt uses the same plain instruction style instead of a legal reasoning prompt.

File                       Prompt speed  Generation speed  Value check
KoHRM-Text-1.4B-Q8_0.gguf  25.3 t/s      5.0 t/s           Runtime OK; useful qualitative finance QA sample
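
Generation speed translates directly into wall-clock time for a given -n. A small sketch of that arithmetic, using the -n 260 setting from the command above and the measured 5.0 t/s (gen_seconds is a hypothetical helper name; prompt processing and model load time are ignored):

```shell
# gen_seconds TOKENS TOKENS_PER_SECOND: rounded generation time in seconds.
gen_seconds() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.0f\n", n / tps }'
}

gen_seconds 260 5.0
```

At this speed a response that uses the full 260-token budget takes roughly 52 seconds to generate.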

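Q8_0 output excerpt (in Korean; it explains that overseas stock returns include exchange-rate gains or losses, then lists currency hedging and split buying/selling, long-term investing, and diversification across countries as responses):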

1. **환율 변동의 영향:** 해외 주식 투자의 수익률은 주가 상승에 따른 수익뿐만 아니라 환율 변동에 따른 수익 또는 손실을 포함합니다.
...
2. **대응 방안:**
   - **환율 변동 위험 관리:** 환율 변동 위험을 줄이기 위해 환헤지 상품을 활용하거나, 분할 매수/매도 전략을 통해 환율 변동에 따른 영향을 완화할 수 있습니다.
   - **장기 투자:** 장기 투자를 통해 환율 변동의 단기적인 영향을 완화하고, 장기적인 주가 상승에 집중할 수 있습니다.
   - **분산 투자:** 다양한 국가의 주식에 분산 투자하여 특정 국가의 환율 변동 위험을 줄일 수 있습니다.

The full smoke-test log is in reports/generation_tests/finance_short_q8_0.txt. This is a qualitative CPU runtime sample, not a benchmark or financial advice.

Prompt Format

The source training/inference wrapper is:

<|im_start|><|object_ref_start|>PROMPT<|im_end|>

prepare_sft_data.py writes the generic HRM V1Dataset layout with direct=<|object_ref_start|> by default. In this patched GGUF runtime, llama-completion could load the model but returned an immediate end token for the tested prompts, while llama-cli --single-turn produced visible CPU token generation. The public checkpoint is a rolling pretraining-stage model, not a final chat/SFT model, so instruction following can still be unstable.
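
If you want to apply the wrapper by hand, for example when experimenting with raw completion, the template above can be filled in with a small helper. This is a sketch only: wrap_prompt is a hypothetical name, and this document does not state whether the patched llama-cli already applies the wrapper itself, so check the displayed prompt before relying on it.

```shell
# wrap_prompt TEXT: hypothetical helper that fills the wrapper shown above.
wrap_prompt() {
  # '%s' keeps the user text literal (no format or escape expansion).
  printf '<|im_start|><|object_ref_start|>%s<|im_end|>' "$1"
}

wrap_prompt 'PROMPT'
```

The helper adds no trailing newline, so the string can be passed directly as a -p argument.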

GGUF Metadata

Key converted metadata:

general.architecture = hrm_text
hrm_text.context_length = 4096
hrm_text.embedding_length = 1536
hrm_text.block_count = 128
hrm_text.layers_per_stack = 16
hrm_text.h_cycles = 2
hrm_text.l_cycles = 3
tokenizer.ggml.model = gpt2
tokenizer.ggml.pre = qwen2
tokenizer.ggml.bos_token_id = 2
tokenizer.ggml.eos_token_id = 35
tokenizer.ggml.padding_token_id = 0

Notes

  • Source model: LLM-OS-Models/KoHRM-Text-1.4B
  • Source revision converted: c413eee318b28e4f970f1be83698b161e60b3adb
  • llama.cpp base commit used for the patch: 6a257d44633d4a752183ed778b88d2924d0a6b9d
  • BF16 conversion wrote 259 tensors.
  • llama-completion can load the model non-interactively, but in local probes it immediately returned an end token for the tested prompts. llama-cli --single-turn produced visible CPU token generation and is the command shown above.