Recent Activity

Banaxi-Tech 
posted an update about 3 hours ago
Today we are releasing BananaMind-KV1-8M-2Bit-Experimental, a KV-cache-aware trained model that stores its generation KV cache in 2-bit precision instead of the usual 16-bit precision.

Result: 5.33x smaller KV cache vs FP16, with 0.0916 mean KLD against a 16-bit KV cache reference on WikiText-2.

Model: BananaMind/BananaMind-KV1-8M-2Bit-Experimental

The important part: this is not just post-training KV cache quantization.
Instead, we take a BitNet-style approach.

KV1 is trained with a 2-bit-aware K/V path. Instead of training a normal model and quantizing the cache afterwards, the model learns during training to operate under the low-bit KV constraint, closer in spirit to the BitNet idea of training for the low-bit regime.

During generation, each K/V vector is quantized into 4 affine levels and packed into uint8 tensors, with four 2-bit values stored per byte.
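
As a rough illustration of the packing step described above, here is a minimal pure-Python sketch. It is not taken from the KV1 repository: the function names, the per-vector min/max choice of scale and offset, and the bit order inside each byte are all assumptions made for the example.

```python
from typing import List, Tuple

LEVELS = 4  # 2 bits -> 4 representable codes (0..3)


def quantize_2bit(vec: List[float]) -> Tuple[bytes, float, float]:
    """Quantize one vector to 2-bit codes with a per-vector affine map.

    Returns (packed_bytes, scale, zero), where value ~= code * scale + zero.
    Using min/max to pick scale and zero is an assumption of this sketch.
    """
    lo, hi = min(vec), max(vec)
    # Guard against a constant vector, where hi == lo would give scale 0.
    scale = (hi - lo) / (LEVELS - 1) or 1.0
    codes = [min(LEVELS - 1, max(0, round((x - lo) / scale))) for x in vec]

    # Pad to a multiple of 4 so every byte holds exactly four codes.
    codes += [0] * (-len(codes) % 4)
    packed = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        # The first code goes in the lowest two bits (arbitrary bit order).
        packed.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(packed), scale, lo


def dequantize_2bit(packed: bytes, scale: float, zero: float, n: int) -> List[float]:
    """Unpack n codes and map them back through the affine transform."""
    out = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            out.append(((byte >> shift) & 0b11) * scale + zero)
    return out[:n]
```

With this scheme, a vector of length n takes ceil(n / 4) bytes for the codes plus whatever is used to store the scale and offset, and each restored value is within half a quantization step of the original.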

WikiText-2 eval vs 16-bit KV cache reference:

Mean KLD: 0.0916 nats/token
Mean KLD: 0.1322 bits/token
Average KV cache shrink vs FP16: 5.33x
Evaluated positions: 372,675
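
The two KLD figures are the same measurement in different units, and the shrink factor can be turned into an average storage cost per cached value. A small sketch of that arithmetic, using only the numbers reported above (the variable names are just for illustration):

```python
import math

kld_nats = 0.0916       # reported mean KLD, nats/token
shrink_vs_fp16 = 5.33   # reported average KV cache shrink

# 1 nat = 1/ln(2) bits, so dividing by ln(2) converts nats to bits.
kld_bits = kld_nats / math.log(2)

# FP16 spends 16 bits per cached value, so a 5.33x shrink implies this
# average cost per value, including any quantization metadata.
effective_bits = 16 / shrink_vs_fp16

print(f"{kld_bits:.4f} bits/token")
print(f"{effective_bits:.2f} bits per cached value")
```

The result is about 3 bits per value rather than a bare 2. The post does not break down where the extra bit goes; per-vector scale and offset metadata would be one plausible source, but that is a guess.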

If this approach gets adopted in models like Qwen or Gemma, it may become possible to run 128K or even 256K context on a normal machine!
Try it here: BananaMind/BananaMind-KV1-8M-2Bit-Experimental

Code: https://github.com/Banaxi-Tech/kv1
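
To put the long-context claim from the post in rough numbers, here is a back-of-the-envelope sketch of KV cache size. The model shape below (32 layers, 8 KV heads, head dimension 128) is purely hypothetical and does not describe any specific Qwen, Gemma, or BananaMind model; only the 5.33x factor comes from the post.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bits_per_value: float) -> float:
    """Size of the KV cache in GiB for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * bits_per_value / 8 / 2**30


# Hypothetical model shape, chosen only to make the arithmetic concrete.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=128 * 1024)

fp16 = kv_cache_gib(**shape, bits_per_value=16)
low_bit = kv_cache_gib(**shape, bits_per_value=16 / 5.33)  # reported shrink

print(f"FP16 KV cache:      {fp16:.1f} GiB")
print(f"5.33x smaller cache: {low_bit:.1f} GiB")
```

Under these assumed dimensions, a 128K-token cache drops from 16 GiB to about 3 GiB, which is the kind of difference that decides whether the cache fits next to the weights on consumer hardware.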
Banaxi-Tech 
updated a Space 3 days ago
Banaxi-Tech 
posted an update 3 days ago
A new model is coming!
It's going to take a long time on my 5070 Ti, so expect a release in ~1 month.
We think this model is going to be SOTA for its size.
Our Mini version will have 25M parameters and the Pro version 140M.
The Pro version has a 3072-token context window (extensible up to 6K with RoPE), and the Mini version has a 4096-token context window (up to 8K with RoPE).
Meanwhile, we are working on an Instruct version of our BananaMind 1.5 Base.

Training will start this weekend.

We are very excited to release it when it's done!
Banaxi-Tech 
posted an update 4 days ago
📱 TinyPhoneLM - LLMs on a Phone
I built TinyPhoneLM because I wanted to see how far tiny local LMs can go on a real Android phone.
Not just a server app.
Not just an API wrapper.
Not “AI on your phone” that secretly sends everything somewhere else.

TinyPhoneLM lets you run small language models directly on Android. It uses llama.cpp via JNI. It ships with a lot of default model options, and custom GGUF import is supported. I am running Qwen3.5 4B locally on my Redmi Note 12 Pro 5G at 4 tokens per second. That may seem slow, but the fact that it runs on my phone at all is insane. I can also run Qwen3.5 0.8B at 10 TPS!
Look at this chart from Artificial Analysis.
Qwen3.5 4B is better than GPT 4.1 and GPT 5 Mini at minimal reasoning!
And even the smallest model, the 800M-parameter Qwen3.5 0.8B, still beats GPT 3.5 Turbo!

The bad news: to get it on the Play Store, we need 12 testers.

Please only submit your Google Play email if you have an Android phone.
If you want to test TinyPhoneLM, enter your Google Play email here:

👉 https://docs.google.com/forms/d/1LqkT2pUHbalSUV50M8PX8m7M6S122ip0cWcbKcytcXk/viewform
I would really appreciate the help if you become a tester!