MobileWorld-html-8B is a mobile GUI world model. Given the current screenshot and a candidate action, it predicts the next screen state as single-file HTML. The generated HTML can be rendered in a browser such as Chromium (for example, driven by Playwright) to obtain a visual next-state prediction.
The model is used as a multimodal chat model. Provide the current screenshot together with the two prompts below.
System prompt:
You are a graphical user interface (GUI) HTML code generator. You are given an action, an action target, the relative coordinates of the object being manipulated, and a screenshot. Your task is to predict the state of the next page after this action is performed. Note that you need to generate a single-file HTML code that can realistically render the visual effects of the original image at a 1:1 ratio as possible.
User prompt template:
<image> Predict the next page state via HTML code from this current screenshot using action description "{action_description}" and action tartget "{action_target}" and relative coordinates "{relative_coordinates}".
The word tartget is intentionally kept as in the training/inference template.
action_description: natural-language action, for example click, scroll down, or input text: hello.
action_target: target UI element or target point, for example point(536, 1280) or search input field.
relative_coordinates: normalized coordinates in (x, y) format, for example (0.496296, 0.533333). Use N/A for actions without a coordinate target.
The expected output is HTML code for the predicted next screen. Save it as .html and render it with a fixed mobile viewport if you need an image.
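The user prompt can be filled in programmatically from the three fields. The sketch below is illustrative only: `build_user_prompt` and `relative_coordinates` are hypothetical helper names, and the normalization (pixel coordinates divided by a 1080x2400 screen, six decimals) is an assumption inferred from the example values, where 536/1080 ≈ 0.496296 and 1280/2400 ≈ 0.533333.

```python
# Template copied from the prompt above; the spelling "tartget" is kept on purpose.
USER_TEMPLATE = (
    '<image> Predict the next page state via HTML code from this current '
    'screenshot using action description "{action_description}" and action '
    'tartget "{action_target}" and relative coordinates "{relative_coordinates}".'
)


def relative_coordinates(point=None, screen_size=(1080, 2400)):
    """Assumption: pixel (x, y) divided by screen (width, height), 6 decimals."""
    if point is None:
        # Actions without a coordinate target use the literal string N/A.
        return "N/A"
    x, y = point
    width, height = screen_size
    return f"({x / width:.6f}, {y / height:.6f})"


def build_user_prompt(action_description, action_target, point=None,
                      screen_size=(1080, 2400)):
    """Fill the user prompt template for one candidate action."""
    return USER_TEMPLATE.format(
        action_description=action_description,
        action_target=action_target,
        relative_coordinates=relative_coordinates(point, screen_size),
    )


prompt = build_user_prompt("click", "point(536, 1280)", point=(536, 1280))
print(prompt)
```

If your screenshots use a different resolution, pass the matching `screen_size` so the normalized coordinates stay consistent with the pixel target.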
import base64
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
def image_url(path):
    # Encode the screenshot as a base64 data URL for the chat API.
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{payload}"
system_prompt = """You are a graphical user interface (GUI) HTML code generator. You are given an action, an action target, the relative coordinates of the object being manipulated, and a screenshot. Your task is to predict the state of the next page after this action is performed. Note that you need to generate a single-file HTML code that can realistically render the visual effects of the original image at a 1:1 ratio as possible."""
user_prompt = '<image> Predict the next page state via HTML code from this current screenshot using action description "click" and action tartget "point(536, 1280)" and relative coordinates "(0.496296, 0.533333)".'
response = client.chat.completions.create(
    model="xwk123/MobileWorld-html-8B",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url("screenshot.png")}},
                {"type": "text", "text": user_prompt},
            ],
        },
    ],
    max_tokens=8000,
    temperature=0.0,
)
html = response.choices[0].message.content.strip()
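The rendering step reads the prediction from prediction.html, so the model output has to be written to disk first. A minimal sketch, assuming the response may sometimes arrive wrapped in a Markdown code fence (this is an assumption, not documented behavior of the model); `extract_html` and `save_prediction` are hypothetical helper names.

```python
from pathlib import Path


def extract_html(text: str) -> str:
    """Return the HTML payload, dropping a surrounding Markdown fence if present.

    Assumes the opening and closing fences, if any, sit on their own lines.
    """
    text = text.strip()
    fence = "`" * 3  # built at runtime to avoid a literal fence in this snippet
    if text.startswith(fence) and text.endswith(fence):
        lines = text.splitlines()
        # Drop the opening fence line (with optional language tag) and the closing one.
        text = "\n".join(lines[1:-1]).strip()
    return text


def save_prediction(text: str, path: str = "prediction.html") -> Path:
    """Write the cleaned HTML to disk and return the output path."""
    out = Path(path)
    out.write_text(extract_html(text), encoding="utf-8")
    return out
```

With the response above, `save_prediction(html)` would produce the prediction.html file used for rendering.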
The output can be rendered with Playwright:
from pathlib import Path
from playwright.sync_api import sync_playwright
html = Path("prediction.html").read_text(encoding="utf-8")
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1080, "height": 2400})
    page.set_content(html, wait_until="load")
    page.screenshot(path="render.png", full_page=True)
    browser.close()
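The viewport above is fixed at 1080x2400. If your screenshots come from devices with other resolutions, one option is to derive the viewport from the input screenshot so the render can be compared against it at the same size. The sketch below reads the dimensions from the PNG header using only the standard library. It assumes the screenshot is a PNG and that one screenshot pixel maps to one CSS pixel; `png_size` and `viewport_for` are hypothetical helper names.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are two big-endian 32-bit integers right after "IHDR".
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def viewport_for(screenshot_path: str) -> dict:
    """Build a Playwright-style viewport dict matching the screenshot size."""
    with open(screenshot_path, "rb") as f:
        width, height = png_size(f.read(24))
    return {"width": width, "height": height}
```

The result can be passed as `browser.new_page(viewport=viewport_for("screenshot.png"))` in place of the hard-coded size.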