Papers
arxiv:2604.17698

The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

Published on Apr 20
· Submitted by
Prashant Raju
on Apr 21

Abstract

Geometric stability measures predict language model controllability and detect structural degradation, with supervised variants excelling at steering prediction and unsupervised variants at drift detection.

AI-generated summary

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89–0.97) across 35–69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62–0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.

Community

Paper author Paper submitter

The Geometric Canary introduces geometric stability as a dual diagnostic for LLM deployment. Supervised Shesha predicts which embedding models will accept linear steering with near-perfect accuracy (ρ = 0.89–0.96 across 35–69 models and three NLP tasks), capturing unique variance beyond class separability. A critical dissociation: unsupervised stability fails entirely for steering (ρ ≈ 0.10) but excels at detecting post-training drift, measuring up to 5.23× more geometric change than CKA in Llama-family models while maintaining a 6× lower false alarm rate than Procrustes. Together, the two variants form complementary diagnostics for the deployment lifecycle: supervised stability for pre-deployment controllability assessment, unsupervised stability for post-deployment monitoring. Code available via shesha-geometry on PyPI.
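The core idea, stability of a representation's pairwise distance structure, can be sketched independently of the paper's implementation. The snippet below is a simplified illustration, not the Shesha metric itself: it scores how well the rank order of pairwise distances among a fixed probe set is preserved between a base model's embeddings and a post-trained model's embeddings, with a drop signaling geometric drift. The function names and the choice of Spearman rank correlation are illustrative assumptions.

```python
import numpy as np

def pairwise_distances(X):
    # Condensed vector of pairwise Euclidean distances between rows of X.
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices(len(X), k=1)
    return d[iu]

def geometric_stability(X_before, X_after):
    """Spearman rank correlation between the two pairwise-distance profiles.

    X_before, X_after: (n_probes, dim) embeddings of the SAME probe inputs
    under two model checkpoints. Values near 1 mean the relational
    geometry is preserved; lower values signal structural drift.
    (Illustrative stand-in for the paper's unsupervised stability metric.)
    """
    a = pairwise_distances(X_before)
    b = pairwise_distances(X_after)
    # Ordinal ranks via double argsort (ties are rare for continuous distances).
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

# Example: a uniform rescaling preserves distance ranks exactly (score = 1),
# while an unrelated representation scores much lower.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
stable_score = geometric_stability(X, 2.0 * X)
drift_score = geometric_stability(X, rng.normal(size=(30, 8)))
```

A monitoring loop would recompute this score on a fixed probe set after each alignment stage and alarm when it falls below a calibrated threshold; the paper's reported comparisons against CKA and Procrustes suggest rank-based distance structure is the more sensitive signal.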


Get this paper in your agent:

hf papers read 2604.17698
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 0
