programming-language-identification-100plus-lite
Byte-level programming-language identification across 107 languages. 2.35M parameters, no tokenizer, ships at ~9 MB fp32 / ~4.5 MB bf16.
The architecture is ByteHybrid (3 × Conv1D → 1 × bidirectional attention with
RoPE → masked mean-pool → classifier head, with a 4096-bucket trigram-hash
embedding), vendored from
PleIAs/CommonLingua (Apache-2.0)
and trained from scratch on Rosetta Code + The Stack v1 across 107 canonical
programming languages.
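For orientation, here is a minimal PyTorch sketch of that pipeline. All dimensions are illustrative, RoPE and the trigram-hash embedding are simplified away, and the real layer shapes live in `CONFIGS` inside `code_language_id.byte_hybrid`; treat this as a shape-level sketch, not the shipped model.

```python
import torch
import torch.nn as nn

class ByteHybridSketch(nn.Module):
    """Illustrative sketch of the ByteHybrid pipeline. Dims are made up;
    RoPE and the trigram-hash embedding are omitted here."""
    def __init__(self, num_classes=107, d=128, pad_id=256):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(257, d)      # byte values 0-255 plus a padding id
        self.convs = nn.Sequential(            # 3 x Conv1D over the byte axis
            nn.Conv1d(d, d, 3, padding=1), nn.GELU(),
            nn.Conv1d(d, d, 3, padding=1), nn.GELU(),
            nn.Conv1d(d, d, 3, padding=1), nn.GELU(),
        )
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = nn.Linear(d, num_classes)

    def forward(self, byte_ids):               # (B, L) int64
        mask = byte_ids != self.pad_id         # True on real bytes
        x = self.embed(byte_ids)               # (B, L, d)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        # one bidirectional (unmasked) attention layer; padding keys ignored
        x, _ = self.attn(x, x, x, key_padding_mask=~mask)
        # masked mean-pool: average only over non-padding positions
        x = (x * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True).clamp(min=1)
        return self.head(x)                    # (B, num_classes) logits
```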
Comparison with philomath-1209/programming-language-identification
Evaluated on 3,057 test rows, restricted to the 26 labels philomath supports. Both
models run under ONNX Runtime, CPUExecutionProvider, batch 64; speed is throughput
relative to the philomath baseline.
| model | params | accuracy | macro F1 | weighted F1 | speed |
|---|---|---|---|---|---|
| programming-language-identification-100plus-lite (ONNX) | 2.35 M | 0.9094 | 0.9410 | 0.9361 | 2.37× |
| philomath-1209/programming-language-identification (ONNX) | 84 M | 0.8449 | 0.8445 | 0.8467 | 1.00× |
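The comparison protocol, in outline: filter the test split to the label intersection, run both models, and score with scikit-learn. A hedged sketch (the `test_rows`, `shared_labels`, and `predict` names below are placeholders for this illustration, not files or functions in this repo):

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholders: in the real eval these come from the test split and each model.
test_rows = [("print('hi')", "Python"), ("fn main() {}", "Rust")]
shared_labels = {"Python", "Rust"}      # 26 shared labels in the real run

def predict(src):
    # Stand-in for a model's single-snippet inference.
    return "Python" if "print" in src else "Rust"

rows = [(x, y) for x, y in test_rows if y in shared_labels]
y_true = [y for _, y in rows]
y_pred = [predict(x) for x, _ in rows]

print("accuracy   ", accuracy_score(y_true, y_pred))
print("macro F1   ", f1_score(y_true, y_pred, average="macro"))
print("weighted F1", f1_score(y_true, y_pred, average="weighted"))
```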
Files
- `model.pt` – fp32 PyTorch checkpoint (CommonLingua format)
- `model.bf16.pt` – bf16 sidecar checkpoint (smaller, same accuracy in eval)
- `lang2idx.json` – 107-label index
- `training_metadata.json` – hyperparameters and dataset stats
- `training_history.json` – per-epoch loss / val_acc / val_macro_f1
- `onnx/model.onnx` – ONNX export (opset 20, dynamic batch)
- `onnx/model.onnx.data` – external weights blob
- `onnx/lang2idx.json` – mirror of the top-level label index
- `onnx/onnx_metadata.json` – parity report vs PyTorch
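If you pull these from the Hub, something like the following works (repo id taken from the citation URL below; `hf_hub_download` keeps the repo layout inside its cache snapshot, which should leave the external-data blob next to `model.onnx` where ONNX Runtime expects it):

```python
# Hedged sketch: fetch the files listed above from the Hub.
from huggingface_hub import hf_hub_download

repo = "FrameByFrame/programming-language-identification-100plus-lite"
ckpt_path  = hf_hub_download(repo, "model.pt")
onnx_path  = hf_hub_download(repo, "onnx/model.onnx")
_          = hf_hub_download(repo, "onnx/model.onnx.data")  # must sit next to model.onnx
label_path = hf_hub_download(repo, "onnx/lang2idx.json")
```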
Quick start – PyTorch
```python
import sys
import numpy as np
import torch

# ByteHybrid lives in the vendored CommonLingua source tree
sys.path.append("path/to/code-language-id/src")
from code_language_id.byte_hybrid import ByteHybrid, CONFIGS

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
model = ByteHybrid(num_classes=ckpt["num_classes"], max_len=ckpt["max_len"],
                   **CONFIGS[ckpt["config"]]).eval()
model.load_state_dict(ckpt["model_state_dict"])
idx2lang = {v: k for k, v in ckpt["lang2idx"].items()}

def encode(texts, max_len=ckpt["max_len"]):
    # Byte id 256 is the padding value; real bytes occupy 0-255.
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return torch.from_numpy(out)

with torch.no_grad():
    logits = model(encode(["def hello():\n print('hi')"]))
print(idx2lang[int(logits.argmax(-1))])  # -> Python
```
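If you want ranked guesses rather than a single argmax, softmax the logits and take the top few; a small usage extension of the snippet above:

```python
# Top-3 guesses with probabilities for the batch-of-one above
probs = logits.softmax(-1)
values, indices = probs.topk(3, dim=-1)
for p, i in zip(values[0], indices[0]):
    print(f"{idx2lang[int(i)]}: {float(p):.3f}")
```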
Quick start – ONNX Runtime
```python
import json
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
lang2idx = json.load(open("onnx/lang2idx.json"))
idx2lang = {v: k for k, v in lang2idx.items()}
MAX_LEN = 1023  # export-time sequence length

def encode(texts, max_len=MAX_LEN):
    # Byte id 256 is the padding value; real bytes occupy 0-255.
    out = np.full((len(texts), max_len), 256, dtype=np.int64)
    for i, t in enumerate(texts):
        b = t.encode("utf-8", errors="replace")[:max_len]
        out[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return out

logits = sess.run(None, {"byte_ids": encode(["fn main() {}"])})[0]
print(idx2lang[int(logits.argmax(-1))])  # -> Rust
```
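Since the export has a dynamic batch axis (see Files above), several snippets can share one `sess.run` call:

```python
# Batched inference: one run() call scores multiple snippets at once
snippets = ["SELECT 1;", "package main\nfunc main() {}", "print('hi')"]
out = sess.run(None, {"byte_ids": encode(snippets)})[0]   # shape (3, 107)
for text, row in zip(snippets, out):
    print(repr(text[:24]), "->", idx2lang[int(row.argmax())])
```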
Training summary
- Data: Rosetta Code (cakiki/rosetta-code) + The Stack v1 (bigcode/the-stack), task-split to prevent leakage. 72,549 / 9,495 / 8,880 rows (train / val / test) across 107 canonical labels.
- Snippets: variable-window (64–1023 bytes), UTF-8.
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.01) with a cosine-with-warmup schedule: peak LR 3e-3, 5% warmup, gradient clipping at 1.0.
- Schedule: 30 epochs, bf16 autocast, batch 128 (effective batch 128, i.e. no gradient accumulation; SDPA fused attention).
- Best val macro F1: 0.9085 at epoch 26 (early stopped).
See training_metadata.json for the full hyperparameter dump.
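For reference, the optimizer and schedule above translate to roughly the following PyTorch setup. This is a sketch: `model` and `steps_per_epoch` are placeholders, and the authoritative values live in `training_metadata.json`.

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(8, 107)     # placeholder; the real model is ByteHybrid
steps_per_epoch = 567         # placeholder; derived from the train loader

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3,
                              betas=(0.9, 0.95), weight_decay=0.01)

total_steps  = 30 * steps_per_epoch        # 30 epochs
warmup_steps = int(0.05 * total_steps)     # 5% linear warmup

def cosine_with_warmup(step):
    # LR multiplier: linear ramp to 1.0, then cosine decay to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * t))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)

# Per training step: clip to 1.0, then step optimizer and scheduler.
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```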
Citation
If you use this model, please cite:
```bibtex
@misc{mariappan2026codelangidlite,
  author    = {Mariappan, Vijayachandran},
  title     = {programming-language-identification-100plus-lite: Byte-level Programming Language Identification across 107 Languages},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/FrameByFrame/programming-language-identification-100plus-lite}
}
```
Upstream architecture:
```bibtex
@misc{commonlingua,
  author    = {{PleIAs}},
  title     = {CommonLingua: Byte-level Language Identification for 334 Languages},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/PleIAs/CommonLingua}
}
```
License & attribution
Apache-2.0. Architecture and reference inference code derive from PleIAs/CommonLingua (Apache-2.0). Trained weights and dataset curation are original to this repository.