FELA replaces standard self-attention with a hybrid FNO+GLA block. For a sequence of N tokens, prefill runs in O(N log N) time; during autoregressive generation, a fixed-depth ring buffer keeps the cost at O(1) per token.
Paper: FELA: Fourier Encoder with Linear Attention for Efficient Long Context Language Modeling (ACML 2026)
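As a rough illustration of why a fixed-depth ring buffer gives constant per-token cost, the sketch below applies a short causal filter one token at a time. It is not FELA's implementation: `RingBuffer`, `decode_step`, and the filter taps are hypothetical, and the truncated filter is only an assumed stand-in for whatever state the model actually keeps.

```python
class RingBuffer:
    """Fixed-depth buffer that keeps the last `depth` values (illustrative only)."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = [0.0] * depth
        self.pos = 0  # index of the next slot to overwrite

    def push(self, x):
        # Overwrite the oldest entry; memory never grows with sequence length.
        self.slots[self.pos] = x
        self.pos = (self.pos + 1) % self.depth

    def window(self):
        # Stored values ordered from newest to oldest.
        return [self.slots[(self.pos - 1 - i) % self.depth]
                for i in range(self.depth)]


def decode_step(buf, taps, x):
    """One generation step of a causal filter: y_t = sum_i taps[i] * x_{t-i}.

    The work per call depends on the buffer depth only, not on how many
    tokens have been generated so far.
    """
    buf.push(x)
    return sum(w * v for w, v in zip(taps, buf.window()))


taps = [1.0, 0.5, 0.25]            # hypothetical filter of depth 3
buf = RingBuffer(depth=len(taps))
outputs = [decode_step(buf, taps, x) for x in [1.0, 2.0, 3.0, 4.0]]
```

Each call to `decode_step` touches exactly `depth` slots, so generating token 10 and token 10,000 cost the same.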
| Checkpoint | Params | Description |
|---|---|---|
| `checkpoints/fno_gla_gpu_v3_22b/step_111897_final.pt` | 1.13B | Main FELA model, trained on 22B tokens (6×B200) |
| `checkpoints/chunk_ablation/variant_64/final.pt` | 109M | GLA chunk ablation, C=64 |
| `checkpoints/chunk_ablation/variant_256/final.pt` | 109M | GLA chunk ablation, C=256 |
| `checkpoints/chunk_ablation/variant_512/final.pt` | 109M | GLA chunk ablation, C=512 |
| `checkpoints/chunk_ablation/variant_1024/final.pt` | 109M | GLA chunk ablation, C=1024 |
| `checkpoints/cpu_diloco_v9/v9_step001500.pt` | 109M | CPU DiLoCo v9, 10× c7i.4xlarge, 1.5B tokens |
| `checkpoints/fela_byte_128m/step_038146_final.pt` | 128M | Byte-level LM, no tokenizer, vocab=261, 5B bytes, val_bpb=1.692 |
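The byte-level checkpoint reports `val_bpb`, bits per byte. The conversion from a summed cross-entropy loss in nats can be sketched as below; the helper names are hypothetical, and whether the repository computes the metric exactly this way is an assumption. By this definition, a value of 1.692 bits per byte corresponds to a mean loss of about 1.173 nats per byte.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) to bits per byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    # Divide by ln(2) to go from nats to bits, then normalise by byte count.
    return total_nll_nats / (num_bytes * math.log(2))


def nats_per_byte(bpb: float) -> float:
    """Inverse direction: mean loss in nats per byte for a given bits-per-byte."""
    return bpb * math.log(2)
```

For a tokenizer-based model, the same function applies as long as `num_bytes` counts the UTF-8 bytes of the evaluated text rather than the number of tokens, which makes the figure comparable across vocabularies.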
Install dependencies with uv:
pip install uv
uv sync
Or directly:
pip install torch datasets tiktoken lm-eval boto3
Train on GPU:
python scripts/train_gpu.py \
--config gpt2-1b \
--data-dir /path/to/data \
--ckpt-dir checkpoints/run
Train on CPU across multiple nodes:
python train_cpu_gpt2.py \
--config gpt2-small \
--tokens 5e8 \
--node-rank $RANK \
--num-nodes 10 \
--master-ip $MASTER_IP \
--s3-bucket my-bucket \
--run-name cpu_gpt2
Evaluate a checkpoint:
python scripts/run_eval.py \
--checkpoint checkpoints/fno_gla_gpu_v3_22b/step_111897_final.pt \
--tasks wikitext,boolq,hellaswag,arc_easy,piqa
Evaluation outputs from the paper are in results/paper_eval_22b/:
| File | Contents |
|---|---|
| `lm_eval_results.json` | BoolQ, HellaSwag, ARC-E, PIQA, WinoGrande |
| `longctx_bpb.csv` | BPB vs context length (256→65K tokens) |
| `throughput.csv` | Prefill throughput on NVIDIA B200 |
| `wallclock.csv` | Prefill throughput on NVIDIA A10G vs SDPA |
| `vram.csv` | Peak VRAM vs sequence length (A10G) |
| `needle_heatmap.json` | Passkey retrieval accuracy heatmap |
python prepare.py
Downloads and tokenizes OpenWebText using tiktoken (GPT-2 BPE).
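The tokenization step can be sketched roughly as follows. This is not the contents of `prepare.py`: `pack_documents` and `gpt2_encode` are hypothetical helpers, and separating documents with the end-of-text token and storing ids as 16-bit integers are assumptions about a typical GPT-2 data pipeline.

```python
from array import array

GPT2_EOT = 50256  # id of <|endoftext|> in the GPT-2 BPE vocabulary


def pack_documents(docs, encode, eot_id=GPT2_EOT):
    """Encode each document and join them into one token stream.

    `encode` is any callable mapping a string to a list of token ids.
    Documents are separated by `eot_id` (an assumed convention).
    """
    stream = array("H")  # unsigned 16-bit; GPT-2 ids (< 50257) fit
    for text in docs:
        stream.extend(encode(text))
        stream.append(eot_id)
    return stream


def gpt2_encode():
    """Return tiktoken's GPT-2 encoder function (imported lazily)."""
    import tiktoken

    # encode_ordinary ignores special-token strings that appear in raw text.
    return tiktoken.get_encoding("gpt2").encode_ordinary
```

With tiktoken installed, `pack_documents(texts, gpt2_encode())` produces a flat token array that can be written to disk with `tofile` and read back for training.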