---
language: en
license: mit
tags:
- transformers
- language-model
- graph-attention
- adaptive-depth
- temporal-decay
- efficient-llm
- pytorch
datasets:
- wikitext
- c4
metrics:
- perplexity
model-index:
- name: TemporalMesh-Transformer
  results:
  - task:
      type: text-generation
    dataset:
      name: WikiText-2
      type: wikitext
    metrics:
    - type: perplexity
      value: 29.4
      name: Validation Perplexity
  - task:
      type: text-generation
    dataset:
      name: WikiText-103
      type: wikitext
    metrics:
    - type: perplexity
      value: 36.1
      name: Validation Perplexity
---

# TemporalMesh Transformer (TMT v3)

**Author:** Vigneshwar LK

**Paper:** [DOI 10.5281/zenodo.20287197](https://zenodo.org/records/20287390)

**Code:** [github.com/vignesh2027/TemporalMesh-Transformer](https://github.com/vignesh2027/TemporalMesh-Transformer)

**Live Demo:** [HuggingFace Space](https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo)

**Benchmarks:** [TMT-Benchmarks Dataset](https://huggingface.co/datasets/vigneshwar234/TMT-Benchmarks)

---

## What is TMT?

TMT is a novel PyTorch transformer architecture that addresses five inefficiencies of standard transformers at once:

| Problem | Standard Transformer | TMT Solution |
|---|---|---|
| Quadratic attention cost | $O(S^2)$ per layer | **Mesh Attention**: $O(S \cdot k)$ over a dynamic $k$NN graph |
| Static attention topology | Fixed, fully connected | **Dynamic graph**: rebuilt at every layer from cosine similarity |
| Uniform token compute | All tokens use all $N$ layers | **Adaptive Depth Routing**: per-token exit gate, average 5.8 of 12 layers |
| Flat positional encoding | Position only | **Temporal Decay**: learned multiplicative semantic attenuation |
| No cross-sequence memory | Stateless | **EMA Memory Anchors**: 16 persistent fast-weight vectors |

---

## Results

| Model | WikiText-2 PPL ↓ | WikiText-103 PPL ↓ | LongBench ↑ | Compute (vs. vanilla) |
|---|---|---|---|---|
| Vanilla Transformer | 42.1 | 51.3 | 41.2 | 100% |
| Longformer | 39.6 | 47.2 | 49.8 | 62% |
| Mamba | 31.8 | 38.4 | 51.3 | 55% |
| RWKV | 33.1 | 40.9 | 48.7 | 50% |
| **Full TMT** | **29.4** | **36.1** | **53.4** | **48%** |

All models have ~120M parameters. TMT was trained for 10K steps on WikiText-2 (AdamW, cosine LR schedule, seeds 42, 1337, and 2024).

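
To make the Mesh Attention row of the comparison table more concrete, here is a minimal sketch of attention restricted to a cosine-similarity kNN graph. It is an illustration under stated assumptions, not the repository's implementation: the function name `knn_mesh_attention` is hypothetical, there are no learned projections, heads, or causal mask, and the graph is built from a dense similarity matrix, so only the attention step itself scales as $O(S \cdot k)$.

```python
import torch
import torch.nn.functional as F


def knn_mesh_attention(x: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Single-head attention over each token's k nearest neighbours.

    Hypothetical helper for illustration only. x has shape (B, S, D) and
    serves as queries, keys, and values (no learned projections).
    """
    B, S, D = x.shape
    k = min(k, S)

    # Build the graph from pairwise cosine similarity.
    # This dense step is still O(S^2); only the attention below is O(S*k).
    unit = F.normalize(x, dim=-1)
    sim = unit @ unit.transpose(1, 2)             # (B, S, S)
    neighbours = sim.topk(k, dim=-1).indices      # (B, S, k)

    # Gather the k neighbour vectors of every token.
    batch_idx = torch.arange(B, device=x.device).view(B, 1, 1)
    neigh = x[batch_idx, neighbours]              # (B, S, k, D)

    # Scaled dot-product attention over k neighbours instead of S tokens.
    scores = (neigh @ x.unsqueeze(-1)).squeeze(-1) / D ** 0.5  # (B, S, k)
    weights = scores.softmax(dim=-1)
    return (weights.unsqueeze(-1) * neigh).sum(dim=2)          # (B, S, D)


x = torch.randn(2, 16, 32)
out = knn_mesh_attention(x, k=4)
print(out.shape)
```

With `k=1` each token's only neighbour is itself (its cosine similarity with itself is the maximum), so the function returns `x` unchanged; larger `k` mixes in the most similar tokens.
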

---

## Architecture at a Glance

```
Input → Token Embedding + RoPE
  → [× 12 layers]
      MeshBuilder (kNN graph, cosine sim, top-k=8)
      Mesh Attention O(S·k) + Temporal Decay Encoding
      EMA Memory Anchor Cross-Attention (16 anchors, β=0.99)
      Dual-Stream FFN (syntax stream ‖ semantic stream, sigmoid gate)
      Exit Gate σ(W_gate · x) > 0.85 → token frozen
  → LayerNorm → Tied Output Projection → Logits (B, S, V)
```

**Output fields** (TMTOutput dataclass):

- `logits`: (B, S, V) next-token predictions
- `exit_masks`: list of (B, S) booleans, one per layer
- `confidences`: gate confidence per token per layer
- `graph_edges`: sparse kNN edge list from the final layer
- `memory_state`: (M, D) final EMA anchor states
- `decay_scalars`: temporal decay weights applied

---

## Quick Start

```bash
git clone https://github.com/vignesh2027/TemporalMesh-Transformer
cd TemporalMesh-Transformer
pip install -e ".[dev]"
```

```python
from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch

config = TMTConfig(
    vocab_size=50257,
    d_model=512,
    n_heads=8,
    n_layers=12,
    graph_k=8,
    exit_threshold=0.85,
    memory_anchors=16,
)
model = TMTModel(config)  # ~120M params

tokens = torch.randint(0, 50257, (1, 256))
out = model(tokens)

print(out.logits.shape)    # (1, 256, 50257)
print(out.exit_masks[-1])  # which tokens have exited by layer 12

# Assumes each mask is cumulative (True once a token has been frozen), so a
# token runs roughly one layer per False entry across the 12 masks.
avg_depth = sum((~m).float().mean() for m in out.exit_masks)
print(f"Avg layers per token: {avg_depth:.2f}")  # reported ~5.8 for the trained model
```

---

## Training

```bash
python scripts/train.py \
  --dataset wikitext-2 \
  --model_size base \
  --steps 10000 \
  --lr 3e-4 \
  --batch_size 16 \
  --seq_len 256 \
  --exit_threshold 0.85 \
  --graph_k 8
```

---

## Ablation Summary

| Config | WikiText-2 PPL ↓ | Compute (vs. vanilla) | VRAM |
|---|---|---|---|
| Vanilla Transformer | 42.1 | 100% | 18.4 GB |
| + Mesh Attention only | 37.8 | 62% | 11.2 GB |
| + Temporal Decay only | 40.3 | 98% | 18.4 GB |
| + Adaptive Exit only | 39.6 | 51% | 18.4 GB |
| Mesh + Decay | 34.2 | 61% | 11.2 GB |
| Mesh + Exit | 35.1 | 50% | 11.2 GB |
| **Full TMT** | **29.4** | **48%** | **11.2 GB** |

The full combination achieves **superadditive gains**: Full TMT improves on the vanilla baseline by 12.7 PPL, which is 4.1 PPL more than the 8.6 PPL sum of the three single-component gains (4.3 + 1.8 + 2.5).

---

## Citation

```bibtex
@misc{vigneshwar2026tmt,
  title  = {TemporalMesh Transformer: Dynamic Graph Attention with Temporal Semantic Decay and Per-Token Adaptive Depth Routing},
  author = {Vigneshwar LK},
  year   = {2026},
  doi    = {10.5281/zenodo.20287197},
  url    = {https://zenodo.org/records/20287390}
}
```

---

## License

MIT License · © 2026 Vigneshwar LK