PerceptionDLM Region Captioning
Parallel region captioning with multimodal diffusion LLM
2x latent super-resolution with FlowUpscaler in Flux.2 space
Text-to-image with SeFi-Image-5B Semantic-First Diffusion
Keep identity from reference, follow lineart structure
Music understanding model for caption and analysis
Text/speech to spoken response + 3D talking-avatar video
Multi-image instruction-guided image editing
Word-level timestamp alignment from audio + transcript
Subject-driven text-to-video from reference images (Wan2.2)
Image matting with diverse prompts via SAM2Matting
Multi-modal generation with diffusion transformers
Anima depth-conditioned image generation via VACE ControlNet
Separate audio into vocals and instruments with BS-Roformer
Polish speech recognition with fine-tuned Whisper Small
Real-time zero-shot stereo disparity estimation
Phone-use GUI agent - screenshot + task to next action
GUI grounding with VISTA-9B - predict click coordinates
Multi-view visual reasoning VLM based on Qwen3-VL 4B
Object and Material Selection VLM
Document-parsing VLM (1.2B) by KoreaDeep
Vietnamese text-to-speech with Kokoro TTS
Interleaved text and image generation with SenseNova-U1
Unified AR model for image understanding & generation