Instructions to use MachadoDeCastro/krull-micro with libraries and local apps.

Libraries

How to use MachadoDeCastro/krull-micro with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MachadoDeCastro/krull-micro")

# Load model directly (causal LM head, matching the text-generation pipeline above)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MachadoDeCastro/krull-micro")
model = AutoModelForCausalLM.from_pretrained("MachadoDeCastro/krull-micro")

Local Apps

vLLM

How to use MachadoDeCastro/krull-micro with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MachadoDeCastro/krull-micro"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MachadoDeCastro/krull-micro",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
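
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on localhost:8000, and the helper names `build_completion_request` and `complete` are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the first generated text (needs a running server)."""
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("http://localhost:8000", "MachadoDeCastro/krull-micro", "Once upon a time,")` should return the generated continuation. The SGLang server exposes the same route, so only the base URL would need to change.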

Use Docker

docker model run hf.co/MachadoDeCastro/krull-micro

SGLang

How to use MachadoDeCastro/krull-micro with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MachadoDeCastro/krull-micro" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MachadoDeCastro/krull-micro",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MachadoDeCastro/krull-micro" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MachadoDeCastro/krull-micro",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use MachadoDeCastro/krull-micro with Docker Model Runner:
```
docker model run hf.co/MachadoDeCastro/krull-micro
```

Krull-Micro

Krull is an acronym for Knowledge Running Under Lightweight Language.

GitHub

https://github.com/machadodecastro/krull-micro.git

Language: en

License: mit

Tags:

krull
tiny-transformer
distillation
edge-ai
low-memory
onnx
quantization

Pipeline tag: text-generation

Library name: pytorch

Krull-Micro is a distilled, ultra-lightweight GPT-style language model designed for edge devices with limited RAM and compute.

It is trained with a multi-level distillation strategy that transfers the following from the teacher model:

Embedding representations
Transformer hidden states
Attention matrices
Output distributions (soft targets)

The aim is to keep Krull-Micro very small while preserving as much of the teacher's language modeling performance as possible.
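
A combined objective over the four transfer targets listed above could look like the sketch below. It is not taken from the Krull-Micro training code: the function name, the loss weights, the temperature, and the assumption that student and teacher tensors already have matching shapes (for example after a learned projection) are all illustrative.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student, teacher, temperature=2.0,
                      w_emb=1.0, w_hid=1.0, w_attn=1.0, w_out=1.0):
    """Hypothetical multi-level distillation loss.

    `student` and `teacher` are dicts with keys:
      "embeddings": (batch, seq, dim)
      "hidden":     list of (batch, seq, dim), already layer-matched
      "attentions": list of (batch, heads, seq, seq) attention probabilities
      "logits":     (batch, seq, vocab)
    """
    # Feature-based terms: mean squared error on embeddings and hidden states.
    loss_emb = F.mse_loss(student["embeddings"], teacher["embeddings"])
    loss_hid = sum(F.mse_loss(s, t) for s, t in zip(student["hidden"], teacher["hidden"]))
    loss_hid = loss_hid / len(student["hidden"])

    # Attention term: mean squared error between attention matrices.
    loss_attn = sum(F.mse_loss(s, t) for s, t in zip(student["attentions"], teacher["attentions"]))
    loss_attn = loss_attn / len(student["attentions"])

    # Response-based term: KL divergence between temperature-softened distributions,
    # scaled by temperature squared to keep gradient magnitudes comparable.
    vocab = student["logits"].size(-1)
    log_p_student = F.log_softmax(student["logits"] / temperature, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(teacher["logits"] / temperature, dim=-1).reshape(-1, vocab)
    loss_out = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    return w_emb * loss_emb + w_hid * loss_hid + w_attn * loss_attn + w_out * loss_out
```

In practice this term is often added to the ordinary next-token cross-entropy loss on the training corpus, and the weights are tuned per model.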

Key Features

Tiny Transformer architecture (edge-optimized)
Full distillation (feature-based + attention + response-based)
ONNX export for cross-platform deployment
INT8 quantization support
Designed for memory-bound inference

Architecture

Model type: Causal Language Model (GPT-style)
Transformer layers: (set in config)
Hidden size: (set in config)
Attention heads: (set in config)
Vocabulary size: (matches tokenizer)
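
Because the layer count, hidden size, and vocabulary size are left to the config, a rough size estimate helps when checking whether a given configuration fits a target device. The sketch below assumes a GPT-2 style block (learned position embeddings, a 4x feed-forward expansion, biases, two layer norms per block, and an output head tied to the token embedding). Krull-Micro's actual layout may differ, and the example values are hypothetical, not the released config.

```python
def count_parameters(vocab_size, hidden_size, num_layers, max_positions):
    """Estimate the parameter count of a GPT-2 style decoder (assumed layout)."""
    d = hidden_size
    embeddings = vocab_size * d + max_positions * d   # token + position tables
    attention = 4 * d * d + 4 * d                     # Q, K, V, output projections with biases
    feed_forward = 8 * d * d + 5 * d                  # d -> 4d -> d with biases
    layer_norms = 4 * d                               # two layer norms per block (scale + shift)
    final_norm = 2 * d
    return embeddings + num_layers * (attention + feed_forward + layer_norms) + final_norm


def footprint_mib(num_params, bytes_per_param):
    """Weight storage in MiB; activations and runtime overhead are not included."""
    return num_params * bytes_per_param / (1024 ** 2)


# Hypothetical configuration, for illustration only.
params = count_parameters(vocab_size=8000, hidden_size=128, num_layers=4, max_positions=256)
print(f"parameters: {params:,}")
print(f"FP32: {footprint_mib(params, 4):.2f} MiB, INT8: {footprint_mib(params, 1):.2f} MiB")
```

The number of attention heads does not appear because, in this layout, it only changes how the hidden size is split, not the number of weights.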

Usage (PyTorch)

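import torch

# The checkpoint is loaded as a full pickled model object, so the class that
# defines it must be importable. PyTorch 2.6+ defaults to weights_only=True,
# which rejects such files; only pass weights_only=False for files you trust.
model = torch.load("krull_micro.pt", map_location="cpu", weights_only=False)
model.eval()

# Example input (placeholder token IDs; real IDs come from the tokenizer)
x = torch.tensor([[1, 5, 23, 42]])

with torch.no_grad():
    logits = model(x)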
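
To turn those logits into text, the forward pass is usually repeated one token at a time. The loop below is a minimal greedy-decoding sketch, assuming the model returns a tensor of shape (batch, sequence, vocabulary) as in the snippet above. `greedy_generate`, `max_new_tokens`, and `eos_id` are illustrative names, and mapping IDs back to text is left to the tokenizer.

```python
import torch


@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=20, eos_id=None):
    """Append the highest-scoring next token until the budget or `eos_id` is hit."""
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                                   # assumed shape: (batch, seq, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        # Stop early once every sequence in the batch has produced the end token.
        if eos_id is not None and bool((next_id == eos_id).all()):
            break
    return ids
```

This sketch re-runs the full sequence at every step and does not truncate to the model's maximum context length, so it is only suitable for short prompts.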

ONNX Inference (Edge Deployment)

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("krull_micro.onnx")

input_ids = np.array([[1, 5, 23, 42]], dtype=np.int64)

outputs = session.run(None, {"input_ids": input_ids})
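
The session returns raw arrays, so picking the next token is left to the caller. The helper below is a sketch that assumes `outputs[0]` holds logits of shape (batch, sequence, vocabulary); the output order of the exported graph is not stated here, so check it before relying on this. `sample_next_token` is a hypothetical name.

```python
import numpy as np


def sample_next_token(logits, temperature=0.5, rng=None):
    """Sample one token ID per batch row from the last position's logits."""
    if rng is None:
        rng = np.random.default_rng()
    last = logits[:, -1, :].astype(np.float64) / temperature
    last -= last.max(axis=-1, keepdims=True)          # stabilize the softmax
    probs = np.exp(last)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(probs.shape[-1], p=row) for row in probs], dtype=np.int64)
```

A generation loop would call `sample_next_token(outputs[0])`, append the result to `input_ids` with `np.concatenate`, and run the session again.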

Training

Training is performed using:

python scripts/train_lm.py \
  --config configs/krull_micro.json \
  --tokenizer artifacts/tokenizer.json \
  --corpus data/tiny_corpus.txt \
  --out artifacts/krull_micro.pt \
  --epochs 3 \
  --batch-size 8 \
  --lr 3e-4

Optimization Pipeline

Train distilled model
Export to ONNX
Apply INT8 quantization
Deploy with ONNX Runtime
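
The export and quantization steps can be sketched with `torch.onnx.export` and ONNX Runtime's `quantize_dynamic`. This is an assumption about how the steps could be carried out, not the repository's own export script: the helper names, the output name `logits`, and the `.int8.onnx` file naming are hypothetical. Dynamic quantization is used here because it needs no calibration data.

```python
from pathlib import Path


def quantized_path(onnx_path):
    """Return the sibling path used for the INT8 model, e.g. a.onnx -> a.int8.onnx."""
    path = Path(onnx_path)
    return path.with_name(f"{path.stem}.int8{path.suffix}")


def export_and_quantize(model, example_ids, onnx_path="krull_micro.onnx"):
    """Export a PyTorch model to ONNX, then apply dynamic INT8 weight quantization."""
    # Imported here so the path helper above stays usable without these packages.
    import torch
    from onnxruntime.quantization import QuantType, quantize_dynamic

    model.eval()
    torch.onnx.export(
        model,
        (example_ids,),
        str(onnx_path),
        input_names=["input_ids"],      # matches the name used in the ONNX example
        output_names=["logits"],        # assumed output name
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "logits": {0: "batch", 1: "sequence"},
        },
    )
    int8_path = quantized_path(onnx_path)
    quantize_dynamic(str(onnx_path), str(int8_path), weight_type=QuantType.QInt8)
    return int8_path
```

Calling `export_and_quantize(model, torch.tensor([[1, 5, 23, 42]]))` would write both files. It is worth comparing the quantized model's outputs against the FP32 model on a few prompts before deploying.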

Intended Use

Embedded systems
Mobile devices
Offline text generation
Low-latency inference environments

Limitations

Small model capacity → limited long-range coherence
Sensitive to training data quality
Not suitable for large-scale reasoning tasks

License

MIT License

Author

Igor Machado de Castro

Contributing

Contributions are welcome. Feel free to open issues or submit pull requests.
