
license: mit
pipeline_tag: text-generation
tags:
  - research
  - experimental
  - gravity-attention
  - qwen2

Gravity-2

Experimental research model by squ11z1.

A 3B reasoning model in which the standard scaled dot-product attention is replaced by a physically motivated gravity attention, then adapted with LoRA. This card documents a stage-1 proof-of-mechanism.

The experiment

Transformer attention scores tokens by alignment — the dot product q·k. Gravity-2 asks a different question: what if tokens attended by proximity instead? We replace the score with an inverse-square law borrowed from gravitation — each token is pulled toward others that are close in query/key space, weighted by a learnable per-head "mass":

score(i, j) = M_h² / (‖q_i − k_j‖² + ε)   →   softmax_j(score)

M_h = softplus(gravity_mass_log[h]) — one learnable mass per query head (16 / layer), initialised at 0.5; softplus keeps it strictly positive.
‖q_i − k_j‖² — squared L2 distance, computed stably as ‖q‖² + ‖k‖² − 2·q·k.
ε = 0.1 — softening length; prevents the q → k singularity.
The raw gravity scores are then passed through the usual softmax (see Limitations).
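
The scoring rule above can be sketched for a single head in plain Python. This is a minimal illustration rather than the patch shipped with the model: the function name gravity_attention_weights and the toy vectors are hypothetical, RoPE and batching are left out, and the clamp at zero is an added assumption that guards against the expanded distance rounding slightly below zero.

```python
import math

EPS = 0.1  # softening length quoted in the card


def softplus(x: float) -> float:
    """log(1 + e^x); strictly positive for every real x."""
    return math.log1p(math.exp(x))


def gravity_attention_weights(queries, keys, mass_log, eps=EPS):
    """Causal attention weights for one head (hypothetical helper).

    queries, keys: lists of equal-length vectors, one per token.
    mass_log: raw scalar parameter; the head mass is softplus(mass_log).
    """
    mass_sq = softplus(mass_log) ** 2
    weights = []
    for i, q in enumerate(queries):
        q_sq = sum(x * x for x in q)
        scores = []
        for k in keys[: i + 1]:  # causal: token i only sees keys 0..i
            k_sq = sum(x * x for x in k)
            qk = sum(a * b for a, b in zip(q, k))
            # Expanded form of ||q - k||^2, clamped at 0 (assumption).
            dist_sq = max(q_sq + k_sq - 2.0 * qk, 0.0)
            scores.append(mass_sq / (dist_sq + eps))
        # Usual softmax over the raw gravity scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy inputs: three tokens in a 2-dimensional head.
queries = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
keys = [[0.0, 0.0], [1.0, 0.0], [3.0, 3.0]]
weights = gravity_attention_weights(queries, keys, mass_log=0.0)
for row in weights:
    print([round(w, 3) for w in row])
```

In the last row the query [1, 1] puts the most weight on the key [1, 0], the nearest one, and the least on [3, 3], the farthest, which is the locality behaviour the formula is meant to produce.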

Why it's interesting

Different inductive bias. Dot-product attention rewards directional alignment; inverse-distance rewards locality in the learned embedding geometry — a metric prior rather than an inner-product one.
Interpretable per-head masses. Each head learns a scalar "mass" controlling how sharply it concentrates — a compact, inspectable knob (see figures/04_mass_heatmap.png).
A bridge to physics-style sparsity. An inverse-square field is naturally local, which later stages (pruning / QUBO, "Gravity-6") aim to exploit for structured sparsity.
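
One consequence of the formula is worth making concrete: since the squared distance is non-negative, every score lies in (0, M_h²/ε], so the mass bounds how sharply a head can concentrate after the softmax. The sketch below works through that arithmetic. It assumes the stated initial value 0.5 refers to M_h itself rather than to the raw gravity_mass_log parameter, which the card does not spell out, and the helper names are hypothetical.

```python
import math

EPS = 0.1  # softening length quoted in the card


def softplus(x: float) -> float:
    """log(1 + e^x); maps the raw parameter to a positive mass."""
    return math.log1p(math.exp(x))


def softplus_inverse(y: float) -> float:
    """Raw parameter x with softplus(x) == y, valid for y > 0."""
    return math.log(math.expm1(y))


def max_score(mass: float, eps: float = EPS) -> float:
    """Upper bound of M^2 / (d^2 + eps), reached when d = 0."""
    return mass ** 2 / eps


def max_weight_ratio(mass: float, eps: float = EPS) -> float:
    """Bound on the ratio of two softmax weights in the same row.

    All logits lie in (0, M^2/eps], so any two differ by less than M^2/eps.
    """
    return math.exp(max_score(mass, eps))


# Raw parameter that yields M = 0.5 (assumed reading of "initialised at 0.5").
raw_init = softplus_inverse(0.5)
print("raw parameter for M = 0.5:", raw_init)

for mass in (0.5, 1.0, 2.0):
    print(mass, max_score(mass), max_weight_ratio(mass))
```

Under these assumptions, M_h = 0.5 with ε = 0.1 caps the score at 2.5, so no key can outweigh another in the same row by more than a factor of about e^2.5 ≈ 12. At M_h = 2 the cap is 40, and the softmax is free to behave almost like a hard selection.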

Architecture

Qwen2 architecture, 3B class: 36 layers, hidden size 2048, 16 query heads / 2 KV heads (GQA, group size 8), head_dim 128. The 2 KV heads are repeat_kv-expanded to 16 before the distance is computed, so each query head gets its own mass. The mechanism is integrated via the transformers-5.x AttentionInterface (a registered "gravity" op plus reuse of the eager causal mask). RoPE, the KV cache, and masking are left to the framework; only the score function changes.
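
The head bookkeeping in this paragraph can be checked with a few lines of Python. The sketch uses only the numbers quoted above. The constant names and the list-based repeat_kv stand-in are illustrative (the real repeat_kv in transformers operates on tensors), and the mass counts assume every one of the 36 layers carries gravity masses, which the card implies but does not state.

```python
NUM_LAYERS = 36
HIDDEN_SIZE = 2048
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = 128

GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS  # query heads per KV head


def repeat_kv(kv_heads, n_rep):
    """Repeat each KV head n_rep times consecutively, keeping groups together."""
    return [head for head in kv_heads for _ in range(n_rep)]


# Stand-in labels instead of tensors: what matters here is the ordering.
expanded = repeat_kv(["kv0", "kv1"], GROUP_SIZE)

# Query head h is paired with expanded[h], i.e. KV head h // GROUP_SIZE.
kv_head_for_query = [h // GROUP_SIZE for h in range(NUM_QUERY_HEADS)]

# One mass per query head (the card's choice) vs. one mass per KV group.
masses_per_query_head = NUM_LAYERS * NUM_QUERY_HEADS
masses_per_kv_group = NUM_LAYERS * NUM_KV_HEADS

print(GROUP_SIZE, len(expanded), masses_per_query_head, masses_per_kv_group)
```

With one mass per query head the model carries 576 scalar masses, against 72 for a per-group scheme. That difference is the granularity question raised under the limitations.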

Results

Honest limitations

Not "pure" gravity. The inverse-square scores are renormalised by a softmax on top (softmax_j(M²/(d²+ε))). Without it training was unstable, but it means this is a distance-biased softmax attention, not a literal gravitational field — the normalisation reintroduces global competition between keys.
MHA → GQA transfer is an open question. The mechanism was first prototyped on MHA (1 KV head per query head). Here it runs on GQA by repeat_kv-expanding 2 KV heads to 16 and giving each query head its own mass; whether this is the right granularity (vs. one mass per KV group) is unresolved and may matter for convergence.
Loading requires the patch (below). GGUF builds run standard attention, not gravity (llama.cpp has no kernel for M²/(‖q−k‖²+ε)) — the *.gguf files are format placeholders and produce incorrect output.

Loading (requires the gravity patch)

python load_gravity2.py   # from_pretrained -> patch_qwen_with_gravity -> load gravity_mass_log.pt

Weights are LoRA-merged into the base but were trained under gravity scoring; loading them under vanilla attention gives garbage. config.json ships _attn_implementation="eager" only so the checkpoint loads — the patch switches it to gravity.
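
A toy comparison suggests why the two scoring rules are not interchangeable. The sketch below scores the same query against the same two keys under scaled dot-product attention and under the gravity rule. The vectors, the function names, and the choice M = 0.5 are illustrative assumptions, not values taken from the checkpoint.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_weights(q, keys):
    """Standard scaled dot-product scores followed by softmax."""
    scale = 1.0 / math.sqrt(len(q))
    return softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])


def gravity_weights(q, keys, mass=0.5, eps=0.1):
    """Gravity scores M^2 / (||q - k||^2 + eps) followed by softmax."""
    scores = []
    for k in keys:
        dist_sq = sum((a - b) ** 2 for a, b in zip(q, k))
        scores.append(mass ** 2 / (dist_sq + eps))
    return softmax(scores)


q = [1.0, 0.0]
keys = [
    [1.0, 0.0],  # coincides with q: distance 0, modest dot product
    [5.0, 0.0],  # same direction but far away: large dot product
]

dot_w = dot_product_weights(q, keys)
grav_w = gravity_weights(q, keys)
print("dot-product:", [round(w, 3) for w in dot_w])
print("gravity:    ", [round(w, 3) for w in grav_w])
```

Dot-product scoring prefers the long, aligned key, while gravity scoring prefers the key that coincides with the query. Projections trained to place keys near the queries that should attend to them are therefore read very differently by a vanilla attention kernel.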

License & attribution

Released under the MIT License. This is a derivative work of WeiboAI/VibeThinker-3B (the base model for the experiment), which is distributed under the MIT License; that license is inherited here and the original authors are credited accordingly.