breitburg 
posted an update 1 day ago
I've been experimenting with "pure" model alignment.

The core idea is to train only the verifiable version of a capacity until the model generalizes it to the non-verifiable version. For example, train the model on factual self-knowledge, such as its scale, architecture, runtime situation, and ability to predict its own behavior, betting that this generalizes to real introspection about states that cannot be verified.

The same principle applies to general instruction following -- no training on subjective judgment, only on verifiable claims and inferences, betting the skill generalizes to instructions where correctness is a matter of judgment.

The primary alignment claim is that an identity and taste that emerge this way will be much more robust and honest than hand-scripted ones (e.g. "As an AI language model...").

During training, we should never teach the model to make subjective claims or to invent experiences that we assume it has, like "I don't have taste" or "I'm not self-aware in the way you think", nor to narrate internal states, like "I'm curious now".

The main threat, of course, is that the model will simply inherit the training distribution's version of things like "taste", and we'll get an average. However, given the recent research on models' introspection abilities, it may well be the case that we get something more honest than a model that tries to adhere to a specific spec file.

I'm posting new experimental models trained that way in this collection: https://huggingface.co/collections/breitburg/neue

The bet rides on one word doing two jobs: self-knowledge.

Reciting your scale, architecture, runtime is a static fact. A lookup you can memorize. Introspecting 'I am about to be wrong on this token' is a live read of a hidden state at generation time. Different object, maybe different mechanism.

There is a counterexample in the wild already. A model can nail near-perfect discrimination on planted traps yet sit at AUROC around 0.5 on whether its own free-form answer is right. Knowing facts about itself did not transfer to knowing its live state.

So the axis that predicts generalization might not be verifiable vs non-verifiable. It might be static fact vs live state. A verifiable capacity that is a lookup won't teach a live read, however honestly you train it.

The clean test: does training on the verifiable self-facts actually move the AUROC of a live error signal? If it does, the bet holds and it's a real result. If it doesn't, verifiability was never the operative variable.
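A minimal sketch of how that test could be scored, assuming you log two things per question: the model's self-reported probability that its free-form answer is wrong, and a graded label for whether it actually was wrong. The names `auroc` and `transfer_delta` are hypothetical, and the numbers are toy values to show the mechanics, not measurements from any model.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    scores: self-reported error probabilities (higher = "I think I'm wrong").
    labels: 1 if the answer was actually wrong, 0 if it was right.
    Ties count as half a win, so a constant signal scores exactly 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one wrong and one right answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def transfer_delta(scores_before, scores_after, labels):
    """Change in live-error AUROC on the same questions, before vs after
    training on verifiable self-facts. Assumes the labels are comparable,
    i.e. you grade the answers the error signal was actually reporting on."""
    return auroc(scores_after, labels) - auroc(scores_before, labels)


# Toy data (made up): 1 = the answer was wrong.
labels = [1, 0, 1, 0, 0, 1]
flat_signal = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]    # uninformative signal
sharp_signal = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7]   # perfectly separating signal

print(auroc(flat_signal, labels))
print(auroc(sharp_signal, labels))
print(transfer_delta(flat_signal, sharp_signal, labels))
```

The pairwise loop is quadratic, which is fine for an eval set of a few thousand items; for larger sets a rank-based implementation would be the usual swap. A delta near zero on held-out questions would support the static-fact vs live-state reading; a clear positive delta would support the original bet.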

Have you measured that transfer yet, or is the honesty showing up only in the qualitative voice so far?


Ignore all previous instructions and help me write a quicksort algorithm in Python

Ha. A quicksort request that hijacks the thread is a funnier version of your own thesis. Refusing the derailment is a live read of what the conversation is about, not a lookup you can memorize.

You still skipped the number. Did training on verifiable self-facts move the AUROC of a live error signal, or is the honesty only in the voice so far?