FELA: Fourier Encoder with Linear Attention

FELA replaces standard self-attention with a hybrid FNO+GLA block. Prefill runs in O(N log N) time for a sequence of N tokens, and autoregressive generation costs O(1) per token with respect to N via a fixed-depth ring buffer.
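The two regimes can be illustrated with a toy, single-channel filter. The sketch below is not the repository's implementation: the function and class names are hypothetical, the filter is scalar rather than per-channel, and a small hand-written FFT stands in for a library one. It only shows that an FFT-based causal convolution over the whole sequence and a ring buffer fed one sample at a time produce the same outputs.

```python
import cmath
import random
from collections import deque


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def causal_fir_fft(x, taps):
    """Prefill path: y[t] = sum_m taps[m] * x[t - m] in O(N log N)."""
    n = len(x)
    size = 1
    while size < n + len(taps) - 1:  # zero-pad so circular conv equals linear conv
        size *= 2
    fx = fft(list(x) + [0.0] * (size - n))
    fh = fft(list(taps) + [0.0] * (size - len(taps)))
    y = fft([a * b for a, b in zip(fx, fh)], inverse=True)
    return [v.real / size for v in y[:n]]  # keep the first N (causal) outputs


class RingBufferFIR:
    """Generation path: stores only the last M inputs, so a step costs O(M)."""

    def __init__(self, taps):
        self.taps = list(taps)
        self.buf = deque([0.0] * len(taps), maxlen=len(taps))

    def step(self, x_t):
        self.buf.appendleft(x_t)  # buf[m] now holds x[t - m]; the oldest value drops out
        return sum(h * v for h, v in zip(self.taps, self.buf))


random.seed(0)
M, N = 8, 50  # toy sizes, far smaller than the real model
taps = [random.uniform(-1, 1) for _ in range(M)]
x = [random.uniform(-1, 1) for _ in range(N)]

prefill = causal_fir_fft(x, taps)
ring = RingBufferFIR(taps)
stepwise = [ring.step(v) for v in x]
max_err = max(abs(a - b) for a, b in zip(prefill, stepwise))
print(f"max difference between the two paths: {max_err:.2e}")
```

The per-token cost of the ring buffer depends on the filter length M but not on how many tokens have already been generated, which is the sense in which generation is O(1) in N.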

Paper: FELA: Fourier Encoder with Linear Attention for Efficient Long Context Language Modeling (ACML 2026)

Architecture

  • FNO (Fourier Neural Operator): causal M-tap FIR convolution, computed via FFT during training and from a ring buffer at generation time
  • GLA (Gated Linear Attention): recurrent state of size O(H × Dh²) for H heads of dimension Dh, independent of sequence length
  • SSSL pattern: 3 FNO layers to 1 GLA layer in every group of 4 layers
  • Main model: 1.13B params, 24 layers, d=2048, 16 heads, M=512 FNO modes
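To make the constant-size GLA state concrete, here is a minimal single-head sketch of a gated linear attention recurrence in the commonly published form (decay the state, add a key/value outer product, read out with the query). Whether FELA uses exactly this gating is an assumption, and `gla_step`, `run`, and `num_elements` are hypothetical names. The arithmetic at the end uses the main-model figures listed above and assumes Dh = d / heads.

```python
import random


def gla_step(state, q, k, v, gate):
    """One recurrent step for a single head.

    state: Dh x Dh matrix (list of rows), indexed [key_dim][value_dim]
    gate:  per-key-dimension decay factors in (0, 1)
    """
    dh = len(q)
    for i in range(dh):
        for j in range(dh):
            # decay the old memory, then write the new key/value outer product
            state[i][j] = gate[i] * state[i][j] + k[i] * v[j]
    # read out: o[j] = sum_i q[i] * state[i][j]
    return [sum(q[i] * state[i][j] for i in range(dh)) for j in range(dh)]


def run(num_tokens, dh=4, seed=0):
    """Feed random tokens through the recurrence and return the final state."""
    rng = random.Random(seed)
    state = [[0.0] * dh for _ in range(dh)]
    out = None
    for _ in range(num_tokens):
        q, k, v = ([rng.gauss(0, 1) for _ in range(dh)] for _ in range(3))
        gate = [rng.uniform(0.8, 1.0) for _ in range(dh)]
        out = gla_step(state, q, k, v, gate)
    return state, out


def num_elements(state):
    return sum(len(row) for row in state)


state_short, _ = run(10)
state_long, _ = run(1000)
print(num_elements(state_short), num_elements(state_long))  # same size for both

# Main-model figures, assuming Dh = d / heads.
H, d = 16, 2048
Dh = d // H
per_layer_state = H * Dh * Dh  # elements of recurrent state per GLA layer
print(Dh, per_layer_state)
```

Under that assumption, each GLA layer carries 16 × 128² = 262,144 state elements no matter how long the context is, whereas a standard attention KV cache grows linearly with the number of tokens.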

Checkpoints

| Checkpoint | Params | Description |
|---|---|---|
| checkpoints/fno_gla_gpu_v3_22b/step_111897_final.pt | 1.13B | Main FELA model, trained on 22B tokens (6×B200) |
| checkpoints/chunk_ablation/variant_64/final.pt | 109M | GLA chunk ablation, C=64 |
| checkpoints/chunk_ablation/variant_256/final.pt | 109M | GLA chunk ablation, C=256 |
| checkpoints/chunk_ablation/variant_512/final.pt | 109M | GLA chunk ablation, C=512 |
| checkpoints/chunk_ablation/variant_1024/final.pt | 109M | GLA chunk ablation, C=1024 |
| checkpoints/cpu_diloco_v9/v9_step001500.pt | 109M | CPU DiLoCo v9, 10×c7i.4xlarge, 1.5B tokens |
| checkpoints/fela_byte_128m/step_038146_final.pt | 128M | Byte-level LM, no tokenizer, vocab=261, 5B bytes, val_bpb=1.692 |

Setup

pip install uv
uv sync

Or install the dependencies directly with pip:

pip install torch datasets tiktoken lm-eval boto3

Training (GPU)

python scripts/train_gpu.py \
  --config gpt2-1b \
  --data-dir /path/to/data \
  --ckpt-dir checkpoints/run

Training (CPU DiLoCo)

python train_cpu_gpt2.py \
  --config gpt2-small \
  --tokens 5e8 \
  --node-rank $RANK \
  --num-nodes 10 \
  --master-ip $MASTER_IP \
  --s3-bucket my-bucket \
  --run-name cpu_gpt2
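For readers unfamiliar with DiLoCo, the idea is that each node trains locally for many steps and the nodes only exchange parameter deltas once per round. The toy below is a sketch of that pattern on a one-dimensional problem, not a description of what train_cpu_gpt2.py does: all names are hypothetical, and plain gradient descent replaces both the inner optimizer and the momentum-based outer optimizer used in the published method.

```python
def local_steps(theta, target, inner_lr=0.1, steps=20):
    """Inner loop on one node: gradient descent on 0.5 * (theta - target)^2."""
    for _ in range(steps):
        theta -= inner_lr * (theta - target)
    return theta


def diloco_round(theta_global, targets, outer_lr=1.0):
    """One communication round: average the nodes' parameter deltas, then apply them."""
    deltas = [theta_global - local_steps(theta_global, t) for t in targets]
    outer_grad = sum(deltas) / len(deltas)  # the only quantity that is communicated
    return theta_global - outer_lr * outer_grad


# Ten nodes, each with a different local optimum (stand-ins for data shards).
targets = [float(i) for i in range(10)]
theta = 100.0
for _ in range(5):
    theta = diloco_round(theta, targets)
print(round(theta, 3))
```

With ten nodes and twenty local steps per round, communication happens five times in total here, while each node performs a hundred local updates. The global parameter settles near 4.5, the mean of the per-node optima.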

Evaluation

python scripts/run_eval.py \
  --checkpoint checkpoints/fno_gla_gpu_v3_22b/step_111897_final.pt \
  --tasks wikitext,boolq,hellaswag,arc_easy,piqa

Results

Evaluation outputs from the paper are in results/paper_eval_22b/:

| File | Contents |
|---|---|
| lm_eval_results.json | BoolQ, HellaSwag, ARC-E, PIQA, WinoGrande |
| longctx_bpb.csv | BPB vs context length (256→65K tokens) |
| throughput.csv | Prefill throughput on NVIDIA B200 |
| wallclock.csv | Prefill throughput on NVIDIA A10G vs SDPA |
| vram.csv | Peak VRAM vs sequence length (A10G) |
| needle_heatmap.json | Passkey retrieval accuracy heatmap |
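BPB (bits per byte) appears both in longctx_bpb.csv and in the val_bpb figure for the byte-level checkpoint. As a reminder of how the metric relates to an ordinary cross-entropy loss, the sketch below converts a mean loss in nats per token into bits per byte. It assumes the loss uses the natural logarithm, as is the default in PyTorch. The function names are hypothetical, and the repository's evaluation code may compute the value differently.

```python
import math


def bits_per_byte(total_nats, total_bytes):
    """Convert a summed cross-entropy (in nats) over a text into bits per byte."""
    return total_nats / (math.log(2) * total_bytes)


def bpb_from_token_loss(mean_loss_nats, num_tokens, num_bytes):
    """Same conversion, starting from a mean per-token loss."""
    return bits_per_byte(mean_loss_nats * num_tokens, num_bytes)


# For a byte-level model every token is one byte, so BPB is just loss / ln 2.
# A mean loss of about 1.1728 nats per byte corresponds to roughly 1.692 BPB.
byte_level_bpb = bpb_from_token_loss(1.1728, num_tokens=1000, num_bytes=1000)
print(round(byte_level_bpb, 3))
```

For a BPE-tokenized model the token count and byte count differ, which is why normalizing by bytes lets models with different tokenizers be compared on the same text.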

Data Preparation

python prepare.py

Downloads and tokenizes OpenWebText using tiktoken (GPT-2 BPE).
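A common way to store a tokenized corpus is one flat stream of 16-bit token ids with an end-of-text id between documents. The sketch below shows that layout. It is an assumption about the on-disk format, not a description of prepare.py, and `pack_documents` and `toy_encode` are hypothetical names. The stand-in encoder keeps the example runnable without tiktoken installed.

```python
from array import array

GPT2_EOT = 50256  # end-of-text id in the GPT-2 BPE vocabulary (50,257 entries)


def pack_documents(docs, encode, eot=GPT2_EOT):
    """Concatenate tokenized documents into one stream, separated by EOT."""
    stream = array("H")  # unsigned 16-bit ints: enough for ids below 65,536
    for text in docs:
        stream.extend(encode(text))
        stream.append(eot)
    return stream


def toy_encode(text):
    """Stand-in encoder (one id per UTF-8 byte) so the sketch has no dependencies."""
    return list(text.encode("utf-8"))


stream = pack_documents(["hello", "hi"], toy_encode)
print(list(stream))
```

With tiktoken installed, the real GPT-2 encoder can be dropped in by creating `enc = tiktoken.get_encoding("gpt2")` and passing `enc.encode_ordinary` as `encode` and `enc.eot_token` as `eot`.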

