MobileWorld-html-8B

MobileWorld-html-8B is a mobile GUI world model. Given the current screenshot and a candidate action, it predicts the next screen state as single-file HTML. The generated HTML can be rendered in a browser such as Chromium (for example, driven by Playwright) to obtain a visual next-state prediction.

Input

The model is used as a multimodal chat model. Provide:

One current GUI screenshot.
A system prompt describing the HTML generation task.
A user prompt containing the action description, action target, and relative coordinates.

System Prompt

You are a graphical user interface (GUI) HTML code generator. You are given an action, an action target, the relative coordinates of the object being manipulated, and a screenshot. Your task is to predict the state of the next page after this action is performed. Note that you need to generate a single-file HTML code that can realistically render the visual effects of the original image at a 1:1 ratio as possible.

User Prompt Template

<image> Predict the next page state via HTML code from this current screenshot using action description "{action_description}" and action tartget "{action_target}" and relative coordinates "{relative_coordinates}".

The misspelling "tartget" is intentional: it matches the template used for training and inference, so do not correct it in your prompts.

Fields

action_description: natural-language action, for example click, scroll down, or input text: hello.
action_target: target UI element or target point, for example point(536, 1280) or search input field.
relative_coordinates: normalized coordinates in (x, y) format, for example (0.496296, 0.533333). Use N/A for actions without a coordinate target.
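
The example values above are consistent with dividing the pixel point by a 1080 x 2400 screen (536 / 1080 ≈ 0.496296 and 1280 / 2400 ≈ 0.533333). The following is a minimal sketch of filling the template under that assumption. The helper names normalize_point and build_user_prompt are hypothetical, and the six-decimal formatting is inferred from the example rather than documented.

```python
# Template copied from this card; "tartget" is kept on purpose.
USER_PROMPT_TEMPLATE = (
    '<image> Predict the next page state via HTML code from this current '
    'screenshot using action description "{action_description}" and action '
    'tartget "{action_target}" and relative coordinates "{relative_coordinates}".'
)


def normalize_point(x, y, width, height):
    """Format a pixel point as "(x, y)" relative to the screen size.

    Assumption: coordinates are x / width and y / height with six decimals,
    which reproduces the example values in this card.
    """
    return f"({x / width:.6f}, {y / height:.6f})"


def build_user_prompt(action_description, action_target, relative_coordinates="N/A"):
    """Fill the user prompt template; "N/A" covers actions without a target point."""
    return USER_PROMPT_TEMPLATE.format(
        action_description=action_description,
        action_target=action_target,
        relative_coordinates=relative_coordinates,
    )


coords = normalize_point(536, 1280, width=1080, height=2400)
user_prompt = build_user_prompt("click", "point(536, 1280)", coords)
```

For an action with no coordinate target, such as a scroll, build_user_prompt("scroll down", "N/A") leaves the coordinates field as N/A.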

Output

The expected output is HTML code for the predicted next screen. Save it as an .html file and, if you need an image, render it with a fixed mobile viewport.
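
Chat models sometimes wrap code in Markdown fences. This card does not say whether this model does, so the sketch below is a defensive assumption, not documented behavior: it unwraps a fenced block if one is present and otherwise returns the text unchanged apart from trimming. The name extract_html is hypothetical.

```python
import re

# Matches an optional ```html ... ``` wrapper around the generated page.
FENCE_RE = re.compile(r"```(?:html)?\s*\n(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_html(text):
    """Return the HTML payload, removing a Markdown code fence if one exists."""
    match = FENCE_RE.search(text)
    body = match.group(1) if match else text
    return body.strip()
```

The result can then be written to a file such as prediction.html before rendering.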

Example OpenAI-Compatible Call

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def image_url(path):
    # Encode the screenshot as a base64 data URL (assumes a PNG file).
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{payload}"

system_prompt = """You are a graphical user interface (GUI) HTML code generator. You are given an action, an action target, the relative coordinates of the object being manipulated, and a screenshot. Your task is to predict the state of the next page after this action is performed. Note that you need to generate a single-file HTML code that can realistically render the visual effects of the original image at a 1:1 ratio as possible."""

user_prompt = '<image> Predict the next page state via HTML code from this current screenshot using action description "click" and action tartget "point(536, 1280)" and relative coordinates "(0.496296, 0.533333)".'

response = client.chat.completions.create(
    model="xwk123/MobileWorld-html-8B",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url("screenshot.png")}},
                {"type": "text", "text": user_prompt},
            ],
        },
    ],
    max_tokens=8000,
    temperature=0.0,
)

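html = response.choices[0].message.content.strip()

# Save the prediction so it can be rendered later.
with open("prediction.html", "w", encoding="utf-8") as f:
    f.write(html)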

Rendering HTML

The output can be rendered with Playwright:

from pathlib import Path
from playwright.sync_api import sync_playwright

html = Path("prediction.html").read_text(encoding="utf-8")
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1080, "height": 2400})
    page.set_content(html, wait_until="load")
    page.screenshot(path="render.png", full_page=True)
    browser.close()
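
The 1080 x 2400 viewport above is a fixed choice. If your screenshots have a different resolution, one option is to size the viewport from the input screenshot so the render can be compared with it at the same scale. The sketch below reads the width and height from the PNG header using only the standard library. The names png_size and viewport_for are hypothetical, and the approach assumes the screenshot is a PNG file.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # A PNG starts with an 8-byte signature, then a 4-byte chunk length,
    # the chunk type "IHDR", and big-endian 32-bit width and height.
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def viewport_for(data):
    """Build a Playwright-style viewport dict matching the screenshot size."""
    width, height = png_size(data)
    return {"width": width, "height": height}
```

For example, viewport_for(Path("screenshot.png").read_bytes()) can be passed as the viewport argument of browser.new_page. Note that full_page=True captures the whole document, which may be taller than the viewport if the generated page overflows.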

Model size: 9B params (Safetensors)
Tensor type: BF16
