Openllava

https://opceanai.com

Inject Vision Into Any Language Model.

Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.
Architecture-agnostic. Multi-backend. Production-ready. Built by OpceanAI.

What is OpenLLaVA?

OpenLLaVA is an open-source framework for injecting vision capabilities into any language model. It covers the complete pipeline, from model construction through training, inference, serving, export, and evaluation, and exposes all of it through a unified Python API and CLI.

The framework supports any LLM architecture (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, and more) and any HuggingFace-compatible vision encoder. It automatically detects model dimensions, constructs the appropriate projector, patches the tokenizer with visual tokens, and configures the full training and inference pipelines.

The central design goal: when a new language model drops, you should have a vision version in 48 hours.

OpenLLaVA is backend-agnostic. The same code runs on CUDA, ROCm, Apple MLX, Intel XPU, Google TPU, and CPU, with automatic hardware detection and configuration selection. Backend maturity varies; see the Status column under Multi-Backend Support.

Quickstart

pip install openllava          # Core
pip install "openllava[cli]"   # With CLI tools
pip install "openllava[serve]" # With serving
pip install "openllava[all]"   # Full installation (quotes stop shells like zsh from globbing the brackets)

Inject Vision Into Any LLM

from openllava import OpenLLaVA, Backend

model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
    backend=Backend.AUTO,
)

OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
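
To make the auto-detection step more concrete, here is a minimal sketch of how projector dimensions could be derived from two HuggingFace-style config dictionaries. The names `ProjectorSpec`, `hidden_size`, and `projector_spec` are hypothetical and the config values are inlined for illustration; this is not OpenLLaVA's internal code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectorSpec:
    """Hypothetical container for the two widths a projector must bridge."""
    in_features: int   # width of one vision-encoder patch feature
    out_features: int  # width of one LLM token embedding


def hidden_size(config: dict) -> int:
    """Return the hidden size from a HuggingFace-style config dict.

    Composite image-text checkpoints (for example SigLIP or CLIP) nest the
    image tower settings under "vision_config"; plain models do not.
    """
    cfg = config.get("vision_config", config)
    if "hidden_size" not in cfg:
        raise KeyError("config does not expose 'hidden_size'")
    return cfg["hidden_size"]


def projector_spec(llm_config: dict, vision_config: dict) -> ProjectorSpec:
    """Pair the vision width (input) with the LLM width (output)."""
    return ProjectorSpec(hidden_size(vision_config), hidden_size(llm_config))


# Illustrative values only (assumed); real values come from each config.json.
llm_cfg = {"hidden_size": 4096, "vocab_size": 128256}
vision_cfg = {"vision_config": {"hidden_size": 1152, "patch_size": 14, "image_size": 384}}

spec = projector_spec(llm_cfg, vision_cfg)
print(spec)
```

In a real setup the two dictionaries would be read from the model repositories rather than written by hand.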

Train with LoRA

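# Configure LoRA adapters
model.lora(r=64, alpha=128, dropout=0.05)

model.train(
    # Phase 1: vision-language alignment
    phase1=dict(dataset="liuhaotian/LLaVA-Pretrain", samples=100_000),
    # Phase 2: instruction tuning
    phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K", learning_rate=2e-4),
    resume=True,
)

# Publish the result under this repository id
model.push("my-org/my-vision-model")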
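
For a sense of what `r=64, alpha=128` means in practice, the arithmetic below follows the standard LoRA formulation: each adapted linear layer gains two low-rank matrices with `r * (d_in + d_out)` parameters, and the update is scaled by `alpha / r`. The 4096 x 4096 layer shape is an assumed example, and the helper names are hypothetical.

```python
def lora_scaling(r: int, alpha: float) -> float:
    """Scale applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Assumed example: a square 4096 x 4096 projection layer.
d_in = d_out = 4096
scale = lora_scaling(r=64, alpha=128)
added = lora_param_count(d_in, d_out, r=64)
trainable_fraction = added / (d_in * d_out)

print(f"scaling={scale}, added parameters={added}, fraction of layer={trainable_fraction:.2%}")
```

Under these assumptions the adapter adds about 3% of the layer's original parameter count, which is why LoRA fits on hardware where full fine-tuning would not.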

FastVisionModel API

from openllava.api import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Openllava/Yaki",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastVisionModel.get_peft_model(model, r=16, alpha=32)

Serve as OpenAI-Compatible API

# Shell: start the server
openllava serve Openllava/Yaki --port 8000

# Python: query the server with the OpenAI client
from openai import OpenAI

client = OpenAI(api_key="openllava", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="yaki",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)

print(response.choices[0].message.content)
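
The request above references a remote image. The OpenAI chat format also accepts base64 data URLs in the `image_url` field, so a local file can be sent the same way. The sketch below builds such a message with the standard library only. The helper names are hypothetical, and whether the OpenLLaVA server accepts data URLs is an assumption that should be checked against its documentation.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_message(prompt: str, image_path: str) -> dict:
    """Build one user message in the OpenAI chat format with text plus image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
        ],
    }
```

The returned dictionary can be passed as the single element of `messages` in `client.chat.completions.create`.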

Key Features

Model Construction

Vision injection into any HuggingFace LLM in 3 lines
AnyRes dynamic high-resolution with patch grouping
YakiProjector: configurable MLP alignment
Auto-detects hidden dimensions, attention heads, vocabulary size
Supports LoRA-patched models

Training Pipeline

3-phase training: alignment, instruction tuning, RL alignment
LoRA, LoRA+, DoRA, QLoRA, Split LoRA, LoRAGA, LoRAFA
BitNet ternary training (b1.58)
MoE + LoRA fusion
FP8 training on H100
Padding-free and sequence packing
Curriculum learning

RL Alignment

DPO, GRPO, ORPO, PPO
Composable reward functions
Visual reasoning reward support

Inference and Serving

Continuous batching
PagedAttention (4x memory efficiency)
Speculative decoding (Eagle, Medusa, NGram)
KV cache: quantization, eviction, compression
OpenAI-compatible FastAPI server
Streaming support

Optimization Suite (40+)

torch.compile full-graph compilation
GPTQ / AWQ / FP4 / NVFP4 quantization
GaLore gradient projection
torchao integration
EMA training stability
Selective activation checkpointing

Distributed Training

FSDP2, DeepSpeed ZeRO (stages 0-3)
Tensor, Pipeline, Expert parallelism
Ring Attention for long context
Heterogeneous GPU + CPU + TPU training
Auto-parallelism detection

Multi-Backend Support

Backend	Hardware	Status
CUDA	NVIDIA GPUs (Ampere, Ada, Hopper, Blackwell)	Production
ROCm	AMD GPUs (MI250, MI300X, RX 7000)	Production
CPU FP32	x86-64 and ARM CPUs (AVX-512, AVX2, NEON)	Production
TPU (XLA/SPMD)	Google TPU v3-v5	Beta
MLX	Apple Silicon M1-M4	Beta
XPU	Intel Arc, Data Center GPU	Beta
Heterogeneous	GPU + CPU + TPU mixed	Beta

Stack

Layer	Technology	Purpose
CUDA Kernels	C/CUDA	Fused projector ops, cross-attention, VQ lookup
Core	C++	Memory management, tensor routing, async streams
Bindings	pybind11	C++ to Python bridge
Triton	OpenAI Triton	Fused attention, RoPE, SwiGLU, RMSNorm
API	Python	Public interface, FastVisionModel, Trainer
Backends	CUDA/ROCm/MLX/TPU/XPU	Hardware abstraction
Export	GGUF/ONNX/SafeTensors/vLLM/MLX	Deployment formats

Architecture

The image is passed through a vision encoder (SigLIP2, CLIP, DINOv2, or any HuggingFace encoder). The resulting patch features go through the YakiProjector, which groups patches in 3x3 blocks and applies a 2-layer MLP that maps vision_dim x 9 to llm_dim. The projected embeddings are merged with the text embeddings and passed to the language model (any AutoModelForCausalLM, with QLoRA 4-bit NF4 quantization and LoRA r=64), which generates the text output, including <think> reasoning blocks when applicable.
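
The 3x3 patch grouping can be illustrated without any tensor library. The sketch below concatenates each non-overlapping 3x3 block of patch features into one token, which is why the MLP input width is vision_dim x 9. The function names are hypothetical, and the 384-pixel input, 14-pixel patch, and 1152-wide features are assumed example values, not figures taken from the YakiProjector source.

```python
def group_patches(grid, group=3):
    """Concatenate each non-overlapping group x group block of patch features.

    grid is a square list of rows; each cell is one patch feature (a list of
    floats). Returns the grouped tokens in row-major block order.
    """
    side = len(grid)
    if side % group != 0 or any(len(row) != side for row in grid):
        raise ValueError("grid must be square with a side divisible by group")
    tokens = []
    for top in range(0, side, group):
        for left in range(0, side, group):
            token = []
            for dy in range(group):
                for dx in range(group):
                    token.extend(grid[top + dy][left + dx])
            tokens.append(token)
    return tokens


def grouped_shape(image_size, patch_size, vision_dim, group=3):
    """Return (token count, token width) after grouping, before the MLP."""
    side = image_size // patch_size
    if side % group != 0:
        raise ValueError("patch grid side must be divisible by group")
    return (side // group) ** 2, vision_dim * group * group


# Toy grid: 6 x 6 patches, each feature is [row, column].
toy = [[[float(r), float(c)] for c in range(6)] for r in range(6)]
toy_tokens = group_patches(toy)
print(len(toy_tokens), len(toy_tokens[0]))

# Assumed encoder settings: 384 px input, 14 px patches, 1152-wide features.
print(grouped_shape(384, 14, 1152))
```

With the assumed settings, a 27 x 27 patch grid becomes 81 visual tokens of width 10368, which the 2-layer MLP would then map to llm_dim. Fewer, wider tokens shorten the sequence the language model has to attend over.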

Yadis Architecture

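Yadis is OpenLLaVA's flagship multimodal architecture and the long-term direction of the framework. It combines discrete visual tokens, MLP projection, and cross-attention at every LLM layer. The roadmap lists discrete visual tokens and multi-expert routing as planned, so the examples in this section show the intended API.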

# Yadis Routing — multiple vision experts with MoE router
from openllava import OpenLLaVA, experts

model = OpenLLaVA(
    llm="OpceanAI/OwO-32B",
    architecture="yadis_routing",
    experts=[
        experts.Visual("google/siglip2-so400m-patch14-384"),
        experts.OCR("deepseek-ai/DeepSeek-OCR-2"),
    ],
)

# Yadis Full — discrete tokens + cross-attention per layer
model = OpenLLaVA(
    llm="OpceanAI/OwO-32B",
    architecture="yadis_full",
    vision_encoder="google/siglip2-so400m-patch14-384",
)

Mode	Description
`llava`	LLaVA-style MLP projection (default)
`yadis_routing`	Multiple expert encoders with MoE router
`yadis_full`	Discrete visual tokens with cross-attention per layer
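
The idea behind routing across expert encoders can be sketched with a simple softmax gate. The example below is a generic dense mixture written for illustration: the function names are hypothetical, the gate logits are supplied by hand rather than learned, and nothing here is taken from the Yadis implementation, which may route differently (for example with sparse top-k selection).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(expert_outputs, gate_logits):
    """Blend same-width expert feature vectors using softmax gate weights.

    Returns the mixed feature vector and the weights that produced it.
    """
    if len(expert_outputs) != len(gate_logits):
        raise ValueError("one gate logit is required per expert")
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return mixed, weights


# Two toy experts (say, a general visual encoder and an OCR encoder).
visual_features = [1.0, 0.0]
ocr_features = [0.0, 1.0]
mixed, weights = route([visual_features, ocr_features], gate_logits=[0.0, 0.0])
print(mixed, weights)
```

Equal gate logits give each expert the same weight; in a trained router the logits would depend on the input, so text-heavy images could lean on the OCR expert.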

OpceanAI Vision Models

OpceanAI uses OpenLLaVA with the goal of publishing vision versions of new language models within 48 hours of their release.

Yaki v1

Vision-language model built on Yuuki RxG 8B. Designed for complex visual reasoning with bilingual support (ES/EN). Preserves the <think> chain-of-thought behavior of the base model for multimodal tasks.

Base: Yuuki RxG 8B (a DeepSeek-R1-Qwen3-8B fine-tune)
Encoder: SigLIP 2 SO400M
LoRA: r=64, alpha=128

Yaki v2 (planned)

Built on Yuuki ExG 14B with cross-attention architecture (OpenLLaVA v4).

Yaki v3 (planned)

Built on OwO 32B with full Yadis routing architecture, combining visual and OCR expert encoders.

Philosophy

Architecture Agnostic by Design

Most existing multimodal frameworks are hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.

Speed Over Ceremony

When a new model is released, the window to publish a vision version is 48 to 72 hours. OpenLLaVA is designed for that constraint: minimal configuration, automated phase management, and one-command training.

Low Level Where It Matters

The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget research organization.

Fully Open

Apache 2.0. No gating. No commercial restrictions. The framework exists so that any researcher — with any model, any hardware, any budget — can build a competitive vision-language model.

Roadmap

Version	Features	Status
v1 - v3	LLaVA-style, QLoRA, AnyRes, 3-phase pipeline, multi-backend	Released
v4 - v5	CUDA kernels, GGUF vision export, CPU offloading, cross-attention	Active
v6 - v7	Discrete visual tokens (VQ-VAE), multi-expert routing	Planned
v8 - v9	Video support, hybrid architectures	Planned
v10	Yadis complete, omnimodal preparation	Planned

Built by OpceanAI

OpenLLaVA is the vision infrastructure layer of OpceanAI — an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on consumer hardware and validated on standard benchmarks.