Krull-Micro
Krull is an acronym for Knowledge Running Under Lightweight Language.
GitHub
https://github.com/machadodecastro/krull-micro.git
Language: en
License: mit
Tags:
- krull
- tiny-transformer
- distillation
- edge-ai
- low-memory
- onnx
- quantization
pipeline_tag: text-generation
library_name: pytorch
Krull-Micro is a distilled, ultra-lightweight GPT-style language model designed for edge devices with limited RAM and compute.
It is trained with a multi-level distillation strategy that transfers the following from the teacher model:
- Embedding representations
- Transformer hidden states
- Attention matrices
- Output distributions (soft targets)
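Matching the teacher at each of these levels, rather than on output probabilities alone, is intended to keep Krull-Micro very small while preserving as much of the teacher's language modeling quality as possible.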
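The four transferred signals listed above are commonly combined into a single weighted training loss. The sketch below is a framework-free illustration of that idea, not the project's training code: the function names, the dictionary keys, the loss weights, the temperature, and the use of mean squared error for the feature terms are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2

def mse(student, teacher):
    """Mean squared error between two equally sized flat lists."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def distillation_loss(student, teacher, weights=(1.0, 1.0, 1.0, 1.0), temperature=2.0):
    """Weighted sum of embedding, hidden-state, attention, and soft-target terms.

    `student` and `teacher` are dicts with the hypothetical keys
    "embeddings", "hidden", "attention" (flat lists) and "logits".
    """
    w_emb, w_hid, w_att, w_out = weights
    return (
        w_emb * mse(student["embeddings"], teacher["embeddings"])
        + w_hid * mse(student["hidden"], teacher["hidden"])
        + w_att * mse(student["attention"], teacher["attention"])
        + w_out * soft_target_loss(student["logits"], teacher["logits"], temperature)
    )
```

The sketch assumes student and teacher features already have the same size. When the student's hidden size is smaller than the teacher's, a learned projection is usually applied to one side before the feature terms are compared.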
Key Features
- Tiny Transformer architecture (edge-optimized)
- Full distillation (feature-based + attention + response-based)
- ONNX export for cross-platform deployment
- INT8 quantization support
- Designed for memory-bound inference
Architecture
- Model type: Causal Language Model (GPT-style)
- Transformer layers: (set in config)
- Hidden size: (set in config)
- Attention heads: (set in config)
- Vocabulary size: (matches tokenizer)
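Because the layer count, hidden size, and vocabulary size all come from the config, the parameter count and weight memory can be estimated before deploying to a constrained device. The sketch below assumes a GPT-2-style block layout (biased QKV and output projections, a 4x feed-forward expansion, learned positional embeddings, and an output head tied to the token embedding). Krull-Micro's actual layout may differ, and the function names and example values are hypothetical.

```python
def gpt_style_param_count(n_layers, hidden, vocab_size, context_len):
    """Parameter count for a GPT-2-style decoder with a tied output head (assumed layout)."""
    embeddings = vocab_size * hidden + context_len * hidden  # token + positional tables
    attention = 4 * hidden * hidden + 4 * hidden             # QKV + output projection, with biases
    feed_forward = 8 * hidden * hidden + 5 * hidden          # two linear layers, 4x expansion
    layer_norms = 4 * hidden                                 # two LayerNorms per block
    per_layer = attention + feed_forward + layer_norms
    final_norm = 2 * hidden
    return embeddings + n_layers * per_layer + final_norm

def weight_memory_mib(n_params, bytes_per_param):
    """Memory needed for the weights alone, in MiB."""
    return n_params * bytes_per_param / (1024 ** 2)

# Illustrative values only; read the real ones from the model config.
n = gpt_style_param_count(n_layers=4, hidden=128, vocab_size=8000, context_len=256)
print(n, weight_memory_mib(n, 4), weight_memory_mib(n, 1))  # params, FP32 MiB, INT8 MiB
```

At small hidden sizes the embedding table tends to dominate the total, so vocabulary size is often the most effective knob for shrinking the model.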
Usage (PyTorch)
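import torch
# Assumes the file holds a full pickled module rather than a state_dict;
# PyTorch 2.6+ then requires weights_only=False (only load files you trust).
model = torch.load("krull_micro.pt", map_location="cpu", weights_only=False)
model.eval()
# Example input (token IDs)
x = torch.tensor([[1, 5, 23, 42]])
with torch.no_grad():
    logits = model(x)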
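Turning logits into text means repeatedly picking a next token and appending it to the input. A minimal greedy-decoding sketch follows. It is written against a hypothetical `next_token_logits` callable that returns the scores for the next position, so it does not depend on the model's exact output shape; `max_new_tokens` and `eos_id` are likewise illustrative names.

```python
from typing import Callable, List, Optional, Sequence

def greedy_generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int = 16,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Append the highest-scoring token until max_new_tokens or eos_id is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        # Index of the largest score is the greedy choice.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids
```

If the model returns a tensor of shape (batch, sequence, vocabulary), which is an assumption about Krull-Micro's output, the callable could wrap the model as `lambda ids: model(torch.tensor([ids]))[0, -1].tolist()`.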
ONNX Inference (Edge Deployment)
import onnxruntime as ort
import numpy as np
session = ort.InferenceSession("krull_micro.onnx")
input_ids = np.array([[1, 5, 23, 42]], dtype=np.int64)
outputs = session.run(None, {"input_ids": input_ids})
Training
Training is performed using:
python scripts/train_lm.py \
--config configs/krull_micro.json \
--tokenizer artifacts/tokenizer.json \
--corpus data/tiny_corpus.txt \
--out artifacts/krull_micro.pt \
--epochs 3 \
--batch-size 8 \
--lr 3e-4
Optimization Pipeline
- Train distilled model
- Export to ONNX
- Apply INT8 quantization
- Deploy with ONNX Runtime
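To make the INT8 step concrete, the sketch below shows symmetric per-tensor quantization arithmetic on a plain list of weights. It illustrates the idea only: it is neither the project's quantization code nor ONNX Runtime's implementation, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # The largest magnitude maps to 127; an all-zero tensor gets a dummy scale of 1.0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

Each weight is stored in one byte instead of four, and the round-trip error per weight is bounded by half a quantization step (`scale / 2`).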
Intended Use
- Embedded systems
- Mobile devices
- Offline text generation
- Low-latency inference environments
Limitations
- Small model capacity → limited long-range coherence
- Sensitive to training data quality
- Not suitable for large-scale reasoning tasks
License
MIT License
Author
Igor Machado de Castro
Contributing
Contributions are welcome. Feel free to open issues or submit pull requests.