How to use with llama.cpp
Install from Homebrew (macOS/Linux)
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI
# (the tag after the colon selects a quant; Q3_K_M is this card's recommended research quant):
llama-server -hf Acnoryx/Airy:Q3_K_M
# Run inference directly in the terminal:
llama-cli -hf Acnoryx/Airy:Q3_K_M
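Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch with curl, assuming the default host and port of 127.0.0.1:8080 (override with --host/--port):
# Chat completion against the local server (the "model" field is
# accepted but not used to pick a model when only one is loaded):
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}'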
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Acnoryx/Airy:Q3_K_M
# Run inference directly in the terminal:
llama-cli -hf Acnoryx/Airy:Q3_K_M
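For scripted, non-interactive use, llama-cli can take the prompt on the command line; a minimal sketch using the standard -p (prompt) and -n (max new tokens) flags:
# One-shot generation, then exit:
llama-cli -hf Acnoryx/Airy:Q3_K_M -p "Summarize GGUF in one sentence." -n 128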
Use a pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Acnoryx/Airy:Q3_K_M
# Run inference directly in the terminal:
./llama-cli -hf Acnoryx/Airy:Q3_K_M
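The same binaries accept the usual llama.cpp runtime flags; a sketch of a common configuration (the values are illustrative, not tuned for this model):
# --port: listen port, -c: context size in tokens,
# -ngl: layers to offload to the GPU (needs a GPU-enabled build):
./llama-server -hf Acnoryx/Airy:Q3_K_M --port 8080 -c 4096 -ngl 99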
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Acnoryx/Airy:Q3_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Acnoryx/Airy:Q3_K_M
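For GPU offload, enable a backend at configure time; a sketch for NVIDIA GPUs with the CUDA toolkit installed (other backends use analogous GGML_* options):
# Configure with CUDA support, then rebuild the same targets:
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --target llama-server llama-cli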
Use Docker
docker model run hf.co/Acnoryx/Airy:Q3_K_M
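Alternatively, the upstream llama.cpp server image can host the model; a sketch assuming the ghcr.io image tag and that arguments after the image name are passed through to llama-server:
# Expose the server on the host and bind inside the container:
docker run -p 8080:8080 ghcr.io/ggml-org/llama.cpp:server \
  -hf Acnoryx/Airy:Q3_K_M --host 0.0.0.0 --port 8080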
Quick Links

Acnoryx AI Research Bundle

Overview

  • Base model: Qwen/Qwen3.5-0.8B
  • Model size: 0.8B parameters
  • Architecture: qwen35
  • Research quantizations: Q3_K_M, IQ3_M, Q2_K, IQ2_M, IQ2_XS, IQ2_XXS, IQ1_M, IQ1_S (each tag is a separate GGUF file; see the download sketch after this list)
  • Purpose: evaluate quality vs. size trade-offs below the production threshold
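
To fetch a single research quant without pulling the whole repository, a filename filter with huggingface-cli works; the *Q3_K_M* pattern below is an assumption, so check the repo's file list for the exact GGUF names:
# Download only the Q3_K_M file into ./models:
huggingface-cli download Acnoryx/Airy --include "*Q3_K_M*.gguf" --local-dir ./models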

Notes

  • IQ1/IQ2 formats require an importance matrix (imatrix); see the quantization sketch after this list.
  • These files are more experimental than the release bundle.
  • Production-facing use should prefer the release bundle.
  • If prompting in Vietnamese, write with full accents for best consistency.
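
For context on the imatrix note above, this is roughly how imatrix-dependent quants are produced with llama.cpp's own tools; the file names are placeholders:
# 1) Collect importance statistics from a calibration text file:
llama-imatrix -m Airy-f16.gguf -f calibration.txt -o imatrix.dat
# 2) Quantize to an IQ format using those statistics:
llama-quantize --imatrix imatrix.dat Airy-f16.gguf Airy-IQ2_M.gguf IQ2_M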

Evaluation Snapshot

Results for the research GGUFs combine the earlier runs with the latest rerun on the same curated 58-question bilingual benchmark.

Quant     Think    No-Think   Avg      Status
Q3_K_M    74.1%    72.4%      73.2%    Best current research quant
IQ3_M     60.3%    60.3%      60.3%    Heavy quality loss
IQ2_M     20.7%    19.0%      19.8%    Below usable threshold
IQ2_XS     5.2%     3.4%       4.3%    Triggered early-stop for lower bits

Research Guidance

  • Public research recommendation: Q3_K_M only
  • IQ3_M remains available for experiments, but its quality is clearly degraded
  • The rerun auto-stopped below IQ2_XS because the average pass rate fell under 50%, so lower-bit quants should be treated as archival artifacts rather than viable deployments
  • For any user-facing scenario, prefer the release bundle over this research branch

For cross-family ranking and release-vs-research comparison, see results/COMPARISON.md in the workspace.
