FELA: Fourier Encoder with Linear Attention

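FELA replaces standard self-attention with a hybrid stack of FNO and GLA layers. Prefill runs in O(N log N) time in the sequence length N. Autoregressive generation costs O(1) per token with respect to N, because the FNO layers read from a fixed-depth ring buffer and the GLA layers carry a fixed-size recurrent state.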
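The two execution modes of a causal FIR filter can be illustrated with a small, dependency-free sketch. It assumes a single scalar channel and a made-up 3-tap filter; `fft`, `causal_fir_fft`, and `RingBufferFIR` are illustrative names, not identifiers from the FELA code base. The FFT path processes a whole sequence at once, while the ring buffer produces the same outputs one token at a time with work that depends only on the number of taps, not on how many tokens came before.

```python
import cmath
from collections import deque


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def causal_fir_fft(x, taps):
    """y[t] = sum_k taps[k] * x[t - k], computed with a zero-padded FFT."""
    n = len(x)
    size = 1
    while size < n + len(taps) - 1:  # pad so circular conv equals linear conv
        size *= 2
    X = fft([complex(v) for v in x] + [0j] * (size - n))
    H = fft([complex(v) for v in taps] + [0j] * (size - len(taps)))
    y = fft([a * b for a, b in zip(X, H)], inverse=True)
    return [v.real / size for v in y[:n]]  # keep the first n (causal) outputs


class RingBufferFIR:
    """Streaming version: keeps only the last len(taps) inputs."""

    def __init__(self, taps):
        self.taps = list(taps)
        self.buf = deque([0.0] * len(self.taps), maxlen=len(self.taps))

    def step(self, x_t):
        self.buf.appendleft(x_t)  # newest sample first; the oldest is dropped
        return sum(h * v for h, v in zip(self.taps, self.buf))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
taps = [0.5, 0.25, -1.0]  # hypothetical filter, for illustration only
parallel = causal_fir_fft(x, taps)
stream = RingBufferFIR(taps)
sequential = [stream.step(v) for v in x]
```

Both paths yield the same sequence up to floating-point error, which is what lets a model of this kind train with the FFT form and decode with the buffered form.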

Paper: FELA: Fourier Encoder with Linear Attention for Efficient Long Context Language Modeling (ACML 2026)

Architecture

FNO (Fourier Neural Operator): causal M-tap FIR convolution, computed via FFT during training and prefill and via a fixed-depth ring buffer at generation
GLA (Gated Linear Attention): recurrent state of size O(H × Dh²) for H heads of dimension Dh, independent of sequence length
SSSL pattern: 3 FNO : 1 GLA per 4 layers
Main model: 1.13B params, 24 layers, d=2048, 16 heads, M=512 FNO modes
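The layer layout and the constant-size GLA state can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: it reads "S" as an FNO layer and "L" as a GLA layer, takes the head dimension as Dh = d / heads = 128, and uses the commonly cited gated linear attention recurrence S ← diag(g) S + kᵀv, o = q S, with a per-key-dimension decay gate g. The function names are hypothetical, and FELA's actual gating parameterization may differ.

```python
def layer_types(num_layers, pattern="SSSL"):
    """Repeat the pattern across the stack (assumes S = FNO, L = GLA)."""
    return [pattern[i % len(pattern)] for i in range(num_layers)]


def gla_state_elements(d_model, num_heads):
    """Elements in one layer's recurrent state: H * Dh * Dh (assumes Dh = d / H)."""
    head_dim = d_model // num_heads
    return num_heads * head_dim * head_dim


def gla_step(state, q, k, v, gate):
    """One decoding step for one head.

    state: Dk x Dv matrix (list of lists); gate: per-key-dimension decay in (0, 1).
    Returns the updated state and the output vector o = q @ state.
    """
    dk, dv = len(k), len(v)
    new_state = [
        [gate[i] * state[i][j] + k[i] * v[j] for j in range(dv)]
        for i in range(dk)
    ]
    out = [sum(q[i] * new_state[i][j] for i in range(dk)) for j in range(dv)]
    return new_state, out


layers = layer_types(24)
state = [[0.0, 0.0], [0.0, 0.0]]  # toy 2 x 2 state; its shape never changes
state, out1 = gla_step(state, q=[1.0, 0.0], k=[1.0, 2.0], v=[3.0, 4.0], gate=[0.5, 0.5])
state, out2 = gla_step(state, q=[1.0, 0.0], k=[1.0, 2.0], v=[3.0, 4.0], gate=[0.5, 0.5])
```

Under these assumptions the 24-layer model has 18 FNO layers and 6 GLA layers, and each GLA layer stores 16 × 128 × 128 = 262,144 state elements regardless of how many tokens have been generated.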

Checkpoints

Checkpoint	Params	Description
`checkpoints/fno_gla_gpu_v3_22b/step_111897_final.pt`	1.13B	Main FELA model, trained on 22B tokens (6×B200)
`checkpoints/chunk_ablation/variant_64/final.pt`	109M	GLA chunk ablation - C=64
`checkpoints/chunk_ablation/variant_256/final.pt`	109M	GLA chunk ablation - C=256
`checkpoints/chunk_ablation/variant_512/final.pt`	109M	GLA chunk ablation - C=512
`checkpoints/chunk_ablation/variant_1024/final.pt`	109M	GLA chunk ablation - C=1024
`checkpoints/cpu_diloco_v9/v9_step001500.pt`	109M	CPU DiLoCo v9 - 10x c7i.4xlarge, 1.5B tokens
`checkpoints/fela_byte_128m/step_038146_final.pt`	128M	Byte-level LM - no tokenizer, vocab=261, 5B bytes, val_bpb=1.692
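As a quick sanity check on the table, the step count embedded in a checkpoint file name can be compared with the stated training budget. The sketch below assumes that the number after `step` is the optimizer step count and that the token totals in the table are rounded; `step_from_checkpoint` and `tokens_per_step` are hypothetical helpers, not part of the repository.

```python
import re


def step_from_checkpoint(path):
    """Extract the step number from names like 'step_111897_final.pt' or 'v9_step001500.pt'."""
    match = re.search(r"step_?(\d+)", path)
    if match is None:
        return None  # e.g. '.../variant_64/final.pt' carries no step number
    return int(match.group(1))


def tokens_per_step(total_tokens, path):
    """Approximate tokens (or bytes) consumed per step, or None if no step is found."""
    step = step_from_checkpoint(path)
    if not step:
        return None
    return total_tokens / step


main_ckpt = "checkpoints/fno_gla_gpu_v3_22b/step_111897_final.pt"
byte_ckpt = "checkpoints/fela_byte_128m/step_038146_final.pt"
main_tps = tokens_per_step(22e9, main_ckpt)  # 22B tokens, as listed in the table
byte_bps = tokens_per_step(5e9, byte_ckpt)   # 5B bytes, as listed in the table
```

With the rounded totals from the table, this works out to roughly 197K tokens per step for the main model and roughly 131K bytes per step for the byte-level model. These figures are implied by the table, not reported values.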

Setup

pip install uv
uv sync

Or directly:

pip install torch datasets tiktoken lm-eval boto3

Training (GPU)

python scripts/train_gpu.py \
  --config gpt2-1b \
  --data-dir /path/to/data \
  --ckpt-dir checkpoints/run

Training (CPU DiLoCo)

python train_cpu_gpt2.py \
  --config gpt2-small \
  --tokens 5e8 \
  --node-rank $RANK \
  --num-nodes 10 \
  --master-ip $MASTER_IP \
  --s3-bucket my-bucket \
  --run-name cpu_gpt2

Evaluation

python scripts/run_eval.py \
  --checkpoint checkpoints/fno_gla_gpu_v3_22b/step_111897_final.pt \
  --tasks wikitext,boolq,hellaswag,arc_easy,piqa

Results

Evaluation outputs from the paper are in results/paper_eval_22b/:

File	Contents
`lm_eval_results.json`	BoolQ, HellaSwag, ARC-E, PIQA, WinoGrande
`longctx_bpb.csv`	BPB vs context length (256→65K tokens)
`throughput.csv`	Prefill throughput on NVIDIA B200
`wallclock.csv`	Prefill throughput on NVIDIA A10G, compared against an SDPA (scaled dot-product attention) baseline
`vram.csv`	Peak VRAM vs sequence length (A10G)
`needle_heatmap.json`	Passkey retrieval accuracy heatmap
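A short sketch for pulling headline accuracies out of `lm_eval_results.json`. It assumes the file follows the usual lm-eval report layout, with a top-level `results` object that maps task names to metric dictionaries keyed like `acc,none`. The exact keys depend on the lm-eval version, and `summarize_lm_eval` is a hypothetical helper. The numbers in the example are placeholders, not results from the paper.

```python
import json


def summarize_lm_eval(report, metric="acc"):
    """Return {task: value} for metrics whose base name (before ',') equals `metric`."""
    summary = {}
    for task, metrics in report.get("results", {}).items():
        for name, value in metrics.items():
            if name.split(",")[0] == metric and isinstance(value, (int, float)):
                summary[task] = value
                break
    return summary


# Placeholder report with made-up values, shaped like an lm-eval output file.
example = json.loads("""
{
  "results": {
    "boolq": {"alias": "boolq", "acc,none": 0.5, "acc_stderr,none": 0.01},
    "piqa":  {"alias": "piqa", "acc,none": 0.25, "acc_norm,none": 0.75}
  }
}
""")
summary = summarize_lm_eval(example)
normed = summarize_lm_eval(example, metric="acc_norm")

# For the real file (path from the table above):
# with open("results/paper_eval_22b/lm_eval_results.json") as f:
#     summary = summarize_lm_eval(json.load(f))
```

Matching on the base metric name keeps `acc_stderr` and `acc_norm` from being mistaken for `acc`.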

Data Preparation

python prepare.py

Downloads and tokenizes OpenWebText using tiktoken (GPT-2 BPE).

Dataset used to train itstheraj/fela-acml2026

Evaluation results

BPB (32K context) on WikiText-103 (self-reported): 1.490
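Bits per byte (BPB) normalizes the language-modeling loss by the size of the raw text, which makes models with different tokenizers comparable. The sketch below uses the usual definition, total negative log-likelihood in bits divided by the number of UTF-8 bytes. It is an assumption that the paper's evaluation follows exactly this normalization, and the function names are illustrative.

```python
import math


def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * num_bytes)


def bpb_from_token_loss(mean_loss_nats, num_tokens, num_bytes):
    """Same quantity, starting from a mean per-token cross-entropy loss."""
    return bits_per_byte(mean_loss_nats * num_tokens, num_bytes)


# Toy check: a loss of ln(2) nats per token is 1 bit per token;
# at 4 bytes per token that is 0.25 bits per byte.
toy_bpb = bpb_from_token_loss(math.log(2), num_tokens=100, num_bytes=400)
```

For a byte-level model, tokens and bytes coincide apart from any special tokens, so BPB is the per-token loss expressed in bits.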