license: mit
pipeline_tag: text-generation
tags:
  - research
  - experimental
  - gravity-attention
  - qwen2

Gravity-2


Experimental research model by squ11z1.

A 3B reasoning model in which the standard scaled dot-product attention is replaced by a physically motivated gravity attention, then adapted with LoRA. This card documents a stage-1 proof-of-mechanism.

The experiment

Transformer attention scores tokens by alignment, i.e. the dot product q·k. Gravity-2 asks a different question: what if tokens attended by proximity instead? We replace the score with an inverse-square law borrowed from gravitation. Each token is pulled toward others that are close in query/key space, weighted by a learnable per-head "mass":

                         M_h²
score(i, j)  =  ─────────────────────          →   softmax_j( score )
                  ‖q_i − k_j‖²  +  ε

  • M_h = softplus(gravity_mass_log[h]): one learnable mass per query head (16 / layer), initialised at 0.5; softplus keeps it strictly positive.
  • ‖q_i − k_j‖²: squared L2 distance, computed stably as ‖q‖² + ‖k‖² − 2·q·k.
  • ε = 0.1: softening length; prevents the q → k singularity.
  • The raw gravity scores are then passed through the usual softmax (see Honest limitations).
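
The score definition above can be sketched for a single head in plain Python. This is only an illustration of the formula, not the model's kernel: the function names, the toy vectors, the `mass_log=0.5` value, and the clamp on round-off are assumptions made for the example, and the explicit `j <= i` loop stands in for the causal mask that the framework normally supplies.

```python
import math

def softplus(x):
    """log(1 + exp(x)) in an overflow-safe form; the result is always > 0."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sq_dist(q, k):
    """Squared L2 distance via the expansion ||q||^2 + ||k||^2 - 2 q.k."""
    qq = sum(a * a for a in q)
    kk = sum(b * b for b in k)
    qk = sum(a * b for a, b in zip(q, k))
    return max(qq + kk - 2.0 * qk, 0.0)  # assumption: clamp tiny negative round-off

def gravity_attention_weights(queries, keys, mass_log, eps=0.1):
    """Causal gravity attention weights for one head.

    queries, keys: lists of equal-length float vectors, one per token.
    mass_log: raw per-head parameter; the mass is M_h = softplus(mass_log).
    Returns one row per query; row i holds weights over keys 0..i.
    """
    m2 = softplus(mass_log) ** 2
    rows = []
    for i, q in enumerate(queries):
        # score(i, j) = M_h^2 / (||q_i - k_j||^2 + eps), only for j <= i
        scores = [m2 / (sq_dist(q, keys[j]) + eps) for j in range(i + 1)]
        top = max(scores)  # subtract the max before exp for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Toy 2-d example (hypothetical values, not taken from the model)
queries = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
keys = [[0.0, 0.0], [1.0, 0.1], [3.0, 3.0]]
weights = gravity_attention_weights(queries, keys, mass_log=0.5)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1, and within a row the largest weight goes to the key nearest the query in Euclidean distance.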

Why it's interesting

  • Different inductive bias. Dot-product attention rewards directional alignment; inverse-distance rewards locality in the learned embedding geometry, a metric prior rather than an inner-product one.
  • Interpretable per-head masses. Each head learns a scalar "mass" controlling how sharply it concentrates: a compact, inspectable knob (see figures/04_mass_heatmap.png).
  • A bridge to physics-style sparsity. An inverse-square field is naturally local, which later stages (pruning / QUBO, "Gravity-6") aim to exploit for structured sparsity.
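
To make the first two points concrete, the sketch below compares scaled dot-product weights with gravity weights for one query and two keys: one key close to the query, and one pointing in the same direction but far away. The vectors, the mass values, and the helper names are all hypothetical and chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_weights(q, keys):
    """Standard scaled dot-product weights: softmax(q.k / sqrt(d))."""
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in keys.values()]
    return dict(zip(keys, softmax(scores)))

def gravity_weights(q, keys, mass, eps=0.1):
    """Gravity weights: softmax(M^2 / (||q - k||^2 + eps))."""
    scores = [mass ** 2 / (sq_dist(q, k) + eps) for k in keys.values()]
    return dict(zip(keys, softmax(scores)))

q = [1.0, 0.0]
keys = {
    "near": [0.9, 0.1],         # close to q in Euclidean distance
    "aligned_far": [4.0, 0.0],  # same direction as q, but far away
}

by_dot = dot_weights(q, keys)                 # favours "aligned_far"
by_gravity = gravity_weights(q, keys, mass=1.0)  # favours "near"

# The mass acts as a sharpness knob: a larger mass concentrates more
# weight on the nearest key.
soft = gravity_weights(q, keys, mass=0.5)
sharp = gravity_weights(q, keys, mass=2.0)

print("dot:    ", by_dot)
print("gravity:", by_gravity)
print("near weight at mass 0.5 vs 2.0:", soft["near"], sharp["near"])
```

With these toy values the dot product prefers the long aligned key, while the gravity score prefers the nearby one, and raising the mass pushes the gravity weights further toward the nearest key.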

Architecture

Qwen2-3B class: 36 layers, hidden 2048, 16 query heads / 2 KV heads (GQA, group size 8), head_dim 128. The 2 KV heads are repeat_kv-expanded to 16 before the distance, so each query head gets its own mass. Integrated via the transformers-5.x AttentionInterface (a registered "gravity" op + eager causal-mask reuse). RoPE / KV-cache / masking are left to the framework; only the score function changes.
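
The head bookkeeping above can be checked with a few lines of Python. This sketch works on labels rather than tensors, and it assumes the consecutive ordering used by the `repeat_kv` helper in the transformers Qwen2 modelling code (each KV head repeated `n_rep` times in a row); the constant and function names are illustrative.

```python
NUM_LAYERS = 36
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = 128

GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS  # 8 query heads share one KV head
HIDDEN = NUM_QUERY_HEADS * HEAD_DIM           # 16 * 128 = 2048

def repeat_kv(kv_heads, n_rep):
    """Expand per-KV-head items so each one appears n_rep times consecutively."""
    return [kv for kv in kv_heads for _ in range(n_rep)]

def kv_head_for(query_head):
    """Index of the KV head that a given query head reads after expansion."""
    return query_head // GROUP_SIZE

expanded = repeat_kv(["kv0", "kv1"], GROUP_SIZE)

# One mass per query head per layer, so this many scalar masses overall.
total_masses = NUM_LAYERS * NUM_QUERY_HEADS

for h in range(NUM_QUERY_HEADS):
    print(f"query head {h:2d} -> {expanded[h]} (mass index {h})")
print("hidden size:", HIDDEN, "| total masses:", total_masses)
```

Query heads 0 to 7 read the first KV head and heads 8 to 15 the second, so the eight heads in a group see identical keys and differ in their queries and in their mass.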

Results

[Figure panels: loss, masses, grad, heatmap, aer concept]

Honest limitations

  • Not "pure" gravity. The inverse-square scores are renormalised by a softmax on top (softmax_j(M²/(d²+ε))). Without it training was unstable, but it means this is a distance-biased softmax attention, not a literal gravitational field: the normalisation reintroduces global competition between keys.
  • MHA → GQA transfer is an open question. The mechanism was first prototyped on MHA (1 KV head per query head). Here it runs on GQA by repeat_kv-expanding 2 KV heads to 16 and giving each query head its own mass; whether this is the right granularity (vs. one mass per KV group) is unresolved and may matter for convergence.
  • Loading requires the patch (below). GGUF builds run standard attention, not gravity (llama.cpp has no kernel for M²/(‖q−k‖²+ε)), so the *.gguf files are format placeholders and produce incorrect output.
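
The "global competition" in the first limitation can be seen in a small numeric sketch. A key's raw gravity score depends only on its own distance to the query, but after the softmax its weight also depends on every other key. The mass, distances, and names below are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

MASS, EPS = 1.0, 0.1  # illustrative values

def gravity_score(sq_distance):
    """Raw score M^2 / (d^2 + eps); bounded above by M^2 / eps at d = 0."""
    return MASS ** 2 / (sq_distance + EPS)

# Squared distances from one query to two keys: one near, one far.
sq_distances = [0.05, 4.0]
before = softmax([gravity_score(d) for d in sq_distances])

# Add a third key at the same distance as the near one.
after = softmax([gravity_score(d) for d in sq_distances + [0.05]])

upper_bound = gravity_score(0.0)  # M^2 / eps

print("raw score of key 0 (unchanged):", gravity_score(sq_distances[0]))
print("weight of key 0 before:", before[0])
print("weight of key 0 after: ", after[0])
print("largest possible raw score:", upper_bound)
```

The raw score of the first key is identical in both cases, yet its weight roughly halves once an equally close key appears. In a literal field the pull from one body would not change when another is added; here the softmax makes the keys share a fixed budget of 1.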

Loading (requires the gravity patch)

python load_gravity2.py   # from_pretrained -> patch_qwen_with_gravity -> load gravity_mass_log.pt

Weights are LoRA-merged into the base but were trained under gravity scoring; loading them under vanilla attention gives garbage. config.json ships _attn_implementation="eager" only so the checkpoint loads; the patch switches it to gravity.

License & attribution

Released under the MIT License. This is a derivative work of WeiboAI/VibeThinker-3B (the base model for the experiment), which is distributed under the MIT License; that license is inherited here and the original authors are credited accordingly.