Krull-Micro

Krull is an acronym for Knowledge Running Under Lightweight Language.


GitHub

https://github.com/machadodecastro/krull-micro.git


Language: en

License: mit

Tags:

  • krull
  • tiny-transformer
  • distillation
  • edge-ai
  • low-memory
  • onnx
  • quantization

Pipeline tag: text-generation

Library name: pytorch

Krull-Micro is a distilled, ultra-lightweight GPT-style language model designed for edge devices with limited RAM and compute.

It uses a multi-level distillation strategy that transfers the following from the teacher model:

  • Embedding representations
  • Transformer hidden states
  • Attention matrices
  • Output distributions (soft targets)

Matching the teacher at all four levels is intended to keep Krull-Micro very small while preserving as much of the teacher's language modeling quality as possible.
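The four transfer signals listed above could be combined into a single training objective roughly as follows. This is a minimal sketch, not the repository's actual loss: the dictionary keys, the equal default weights, the temperature of 2.0, and the `fake_outputs` helper are all assumptions made for illustration, and student and teacher tensors are assumed to have matching shapes already (for example after a linear projection).

```python
import torch
import torch.nn.functional as F


def distillation_loss(student, teacher, temperature=2.0,
                      w_emb=1.0, w_hid=1.0, w_attn=1.0, w_soft=1.0):
    """Combine four distillation terms into one scalar loss.

    `student` and `teacher` are hypothetical dicts with keys:
      "embeddings": (batch, seq, dim)
      "hidden":     list of (batch, seq, dim), one per matched layer
      "attention":  list of (batch, heads, seq, seq), one per matched layer
      "logits":     (batch, seq, vocab)
    """
    # Feature-based terms: mean squared error against the teacher's features
    emb = F.mse_loss(student["embeddings"], teacher["embeddings"])
    hid = torch.stack([F.mse_loss(s, t) for s, t in
                       zip(student["hidden"], teacher["hidden"])]).mean()
    attn = torch.stack([F.mse_loss(s, t) for s, t in
                        zip(student["attention"], teacher["attention"])]).mean()

    # Response-based term: KL divergence between temperature-softened
    # distributions, flattened so the mean is taken per token
    vocab = student["logits"].size(-1)
    log_p_student = F.log_softmax(student["logits"] / temperature, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(teacher["logits"] / temperature, dim=-1).reshape(-1, vocab)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures
    soft = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    return w_emb * emb + w_hid * hid + w_attn * attn + w_soft * soft


def fake_outputs(seed, batch=2, seq=4, dim=8, heads=2, layers=2, vocab=16):
    """Random stand-in for model outputs, for demonstration only."""
    g = torch.Generator().manual_seed(seed)
    return {
        "embeddings": torch.randn(batch, seq, dim, generator=g),
        "hidden": [torch.randn(batch, seq, dim, generator=g) for _ in range(layers)],
        "attention": [torch.softmax(torch.randn(batch, heads, seq, seq, generator=g), dim=-1)
                      for _ in range(layers)],
        "logits": torch.randn(batch, seq, vocab, generator=g),
    }


teacher_out = fake_outputs(seed=0)
student_out = fake_outputs(seed=1)
loss = distillation_loss(student_out, teacher_out)
```

In a real training loop the teacher outputs would be computed under `torch.no_grad()` so that gradients flow only into the student.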


Key Features

  • Tiny Transformer architecture (edge-optimized)
  • Full distillation (feature-based + attention + response-based)
  • ONNX export for cross-platform deployment
  • INT8 quantization support
  • Designed for memory-bound inference

Architecture

  • Model type: Causal Language Model (GPT-style)
  • Transformer layers: (set in config)
  • Hidden size: (set in config)
  • Attention heads: (set in config)
  • Vocabulary size: (matches tokenizer)
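Because the concrete sizes live in the config file, it can help to estimate how many parameters, and how much weight memory, a given config implies. The sketch below assumes a GPT-2-style block layout (learned position embeddings, a 4x MLP expansion, biases, and an output head tied to the token embedding); Krull-Micro's real layout may differ. The config values shown are hypothetical placeholders, not the model's actual settings.

```python
def gpt_param_count(vocab_size, hidden_size, num_layers, max_positions,
                    tie_embeddings=True):
    """Approximate parameter count for a GPT-2-style decoder (an assumed layout)."""
    d = hidden_size
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * d + max_positions * d
    # Per layer: attention (4*d^2 + 4*d), MLP with 4x expansion (8*d^2 + 5*d),
    # and two layer norms (4*d). The number of heads does not change the count.
    per_layer = 12 * d * d + 13 * d
    final_norm = 2 * d
    lm_head = 0 if tie_embeddings else vocab_size * d
    return embeddings + num_layers * per_layer + final_norm + lm_head


def weight_megabytes(num_params, bytes_per_param):
    """Memory needed for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


# Hypothetical config values, NOT Krull-Micro's real ones
cfg = dict(vocab_size=8000, hidden_size=128, num_layers=4, max_positions=256)
n_params = gpt_param_count(**cfg)
print(f"parameters: {n_params:,}")
print(f"FP32 weights: {weight_megabytes(n_params, 4):.2f} MB")
print(f"INT8 weights: {weight_megabytes(n_params, 1):.2f} MB")
```

The estimate covers weights only; activations and any key/value cache add to the runtime footprint.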

Usage (PyTorch)

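import torch

# Assumes krull_micro.pt stores the full pickled model object, not just a state_dict.
# PyTorch 2.6+ defaults to weights_only=True, which rejects pickled modules, so
# weights_only=False is passed here; only load checkpoints from a source you trust.
model = torch.load("krull_micro.pt", map_location="cpu", weights_only=False)
model.eval()

# Example input (token IDs), shape (batch, sequence)
x = torch.tensor([[1, 5, 23, 42]])

# Disable gradient tracking for inference
with torch.no_grad():
    logits = model(x)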
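To turn the logits into generated text, the forward pass can be repeated, appending one token per step. The following is a minimal greedy-decoding sketch, assuming the model returns logits of shape (batch, sequence, vocab). `greedy_generate` and the `ToyLM` stand-in are hypothetical names introduced here so the example runs without the checkpoint; they are not part of the repository.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens):
    """Append the highest-scoring next token, one step at a time."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)  # assumed shape: (batch, seq, vocab)
        # Only the last position predicts the next token
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids


class ToyLM(nn.Module):
    """Hypothetical stand-in so the sketch runs without krull_micro.pt."""

    def __init__(self, vocab_size=64, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))


torch.manual_seed(0)
toy = ToyLM().eval()
prompt = torch.tensor([[1, 5, 23, 42]])
generated = greedy_generate(toy, prompt, max_new_tokens=5)
```

With the real model, `toy` would be replaced by the loaded `model`, and the resulting IDs decoded with the matching tokenizer. The sketch does not crop the input to a maximum context length, which a real model with a fixed position limit would need.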

ONNX Inference (Edge Deployment)

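import onnxruntime as ort
import numpy as np

# Request the CPU provider explicitly so the call also works on GPU-enabled builds
session = ort.InferenceSession("krull_micro.onnx", providers=["CPUExecutionProvider"])

# Token IDs as int64, shape (batch, sequence)
input_ids = np.array([[1, 5, 23, 42]], dtype=np.int64)

# The key must match the input name chosen at export time;
# session.get_inputs()[0].name reports it if "input_ids" is rejected
outputs = session.run(None, {"input_ids": input_ids})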

Training

Training is performed using:

python scripts/train_lm.py \
  --config configs/krull_micro.json \
  --tokenizer artifacts/tokenizer.json \
  --corpus data/tiny_corpus.txt \
  --out artifacts/krull_micro.pt \
  --epochs 3 \
  --batch-size 8 \
  --lr 3e-4

Optimization Pipeline

  1. Train distilled model
  2. Export to ONNX
  3. Apply INT8 quantization
  4. Deploy with ONNX Runtime
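To make step 3 concrete, the sketch below shows what INT8 quantization does to a weight matrix numerically. It uses symmetric per-tensor quantization in plain NumPy, chosen here for simplicity; the scheme actually applied by the ONNX tooling may differ (for example per-channel scales or asymmetric zero points). The function names are hypothetical.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 -> (int8 values, float scale)."""
    max_abs = float(np.max(np.abs(weights)))
    # Map the largest magnitude to 127; guard against an all-zero tensor
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q, scale):
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
max_error = float(np.max(np.abs(w - w_hat)))

print(f"storage: {w.nbytes} bytes (FP32) vs {q.nbytes} bytes (INT8)")
print(f"max reconstruction error: {max_error:.5f} (scale = {scale:.5f})")
```

Storage drops to a quarter of the FP32 size, and the rounding error per weight is bounded by about half the scale, which is why tensors with a few large outliers tend to quantize less accurately.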

Intended Use

  • Embedded systems
  • Mobile devices
  • Offline text generation
  • Low-latency inference environments

Limitations

  • Small model capacity → limited long-range coherence
  • Sensitive to training data quality
  • Not suitable for large-scale reasoning tasks

License

MIT License


Author

Igor Machado de Castro


Contributing

Contributions are welcome. Feel free to open issues or submit pull requests.

