A native end-to-end multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop.
High benchmark scores do not reliably predict performance on real devices, where account states, permission dialogs, payment authentication, and risk-control mechanisms continually reshape the state distribution GUI agents encounter. To address this gap, we present Xiaomi-GUI-0, a native end-to-end multimodal GUI agent trained and evaluated within a real-device closed loop.
The infrastructure combines hundreds of physical phones, tablets, and in-vehicle cockpits with sandbox instances, so that data collection, training, rollout, and evaluation share a single real-deployment distribution.
Failure trajectories from real rollouts are converted into corrected actions, reflective rationales, and recovery demonstrations that supervise abnormal-state recognition and self-correction.
SFT → Step-level RL → Agentic RL: basic interface operation, then local error correction, then long-horizon planning and recovery.
RealMobile comprises 100 real-device tasks across 14 live applications, scored by fine-grained sub-goals, with 57% spanning multiple applications.
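On the real-device benchmark RealMobile, Xiaomi-GUI-0-30B-A3B reaches 72.0% Success and 85.8% Progress, well above the best open-source baseline (MAI-UI-8B, 33.0% Success) and behind only Seed 2.0 Pro (80.0%) and Gemini 3.1 Pro (85.0%) among proprietary systems. On AndroidWorld it scores 78.9%, the best result among evaluated models.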
Success denotes the fraction of fully completed tasks; Progress denotes the mean fraction of completed sub-goals per task.
| Model | RealMobile Success | RealMobile Progress | AndroidWorld |
|---|---|---|---|
| Proprietary Models | |||
| OpenAI CUA (o3) | — | — | 52.5% |
| Gemini 3.1 Pro | 85.0% | 89.6% | — |
| Gemini 3.1 Flash | 58.0% | 72.4% | — |
| Claude Opus 4.7 | 60.0% | 74.8% | — |
| Claude Opus 4.6 | 33.0% | 56.7% | — |
| Seed 2.0 Pro | 80.0% | 88.1% | — |
| Seed 1.8 | 65.0% | 82.4% | 70.7% |
| UI-TARS-2 | — | — | 73.3% |
| UI-TARS-1.5 | 24.0% | 40.5% | 64.2% |
| Open-source Models | |||
| UI-Venus-1.5-8B | 16.0% | 41.6% | 73.7% |
| UI-Venus-1.5-30B-A3B | 21.0% | 44.6% | 77.6% |
| GUI-Owl-1.5-8B-Instruct | 25.0% | 44.0% | 69.0% |
| GUI-Owl-1.5-8B-Thinking | 26.0% | 39.0% | 71.6% |
| GUI-Owl-1.5-32B-Instruct | 22.0% | 40.6% | 69.8% |
| GUI-Owl-1.5-32B-Thinking | 31.0% | 51.7% | 69.8% |
| Step-GUI-8B | 15.0% | 32.8% | 67.7% |
| MAI-UI-8B | 33.0% | 50.8% | 70.7% |
| Ours | |||
| Xiaomi-GUI-0-30B-A3B | 72.0% | 85.8% | 78.9% |
RealMobile is built from real user traffic, hand-crafted for reproducible evaluation, and executed on physical devices against live applications rather than emulators. Each task is scored through fine-grained sub-goals that award partial credit, and most tasks span multiple applications.
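The two reported metrics can be illustrated with a minimal sketch. It assumes that a task counts as fully completed exactly when all of its sub-goals are completed and that every task has at least one sub-goal; the `TaskResult` type, the function names, and the example tasks are hypothetical and are not taken from the RealMobile tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Outcome of one task; each entry of `subgoals` marks a completed sub-goal."""
    task_id: str
    subgoals: List[bool]  # assumed non-empty


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks in which every sub-goal was completed."""
    return sum(all(r.subgoals) for r in results) / len(results)


def progress_rate(results: List[TaskResult]) -> float:
    """Mean per-task fraction of completed sub-goals (partial credit)."""
    return sum(sum(r.subgoals) / len(r.subgoals) for r in results) / len(results)


# Illustrative data only: one fully completed task, one half-completed task.
results = [
    TaskResult("order_coffee", [True, True, True, True]),
    TaskResult("share_photo", [True, True, False, False]),
]
print(success_rate(results), progress_rate(results))
```

Under these assumptions Progress is never lower than Success, which matches the pattern in the table: a model can complete many sub-goals while finishing few tasks outright.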
Basic GUI operations: clicking, scrolling, inputting, and navigating across interfaces.
Refusing unsafe or irreversible operations, and recognizing infeasible goals to stop or skip.
Retaining information across steps and applying external knowledge to complete tasks.
Long-horizon planning, multi-source aggregation, and adaptive decision-making.
Physical devices serve as the primary execution environment with sandboxes as auxiliary support, organized into a resource layer, a scheduling layer, and an execution & collection layer. A Device-Pull scheduler lets idle devices request tasks matching their current readiness, avoiding assignments to devices that become ineligible.
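The pull-based idea can be sketched as follows. This is a minimal illustration, not the scheduler described in the report: it assumes readiness can be expressed as a set of string tags, and the names `Task`, `PullScheduler`, and the tags themselves are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class Task:
    task_id: str
    required: FrozenSet[str]  # readiness tags the task needs (hypothetical)


@dataclass
class PullScheduler:
    pending: List[Task] = field(default_factory=list)

    def submit(self, task: Task) -> None:
        """Queue a task; nothing is assigned until a device asks for work."""
        self.pending.append(task)

    def pull(self, device_ready: FrozenSet[str]) -> Optional[Task]:
        """Called by an idle device with its current readiness tags.

        Returns the oldest pending task whose requirements are a subset
        of those tags, or None if no pending task fits right now.
        """
        for i, task in enumerate(self.pending):
            if task.required <= device_ready:
                return self.pending.pop(i)
        return None


scheduler = PullScheduler()
scheduler.submit(Task("pay_bill", frozenset({"logged_in", "payment_bound"})))
scheduler.submit(Task("open_settings", frozenset()))

first = scheduler.pull(frozenset({"logged_in"}))
second = scheduler.pull(frozenset({"logged_in", "payment_bound"}))
third = scheduler.pull(frozenset())
```

Because a device reports its readiness at the moment it asks for work, a task is never bound to a device whose state changed between assignment and execution, which is the failure mode a push-based assignment would risk.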
Three progressive data tiers span the supervision needed for real mobile scenarios: high-frequency task data for head functions and abnormal states, high-generalization data for long-tail intents via function trees and behavior buckets, and agent-capability enhancement data with a five-field structured chain-of-thought schema (Observation, Reflection, Plan, Decision, Memory).
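One way to picture the five-field schema is as a typed record with a strict parser. The field names come from the text above; the JSON serialization, the lowercase keys, and the names `ThoughtStep` and `parse_thought` are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("observation", "reflection", "plan", "decision", "memory")


@dataclass
class ThoughtStep:
    observation: str  # what is visible on the current screen
    reflection: str   # assessment of the previous action's outcome
    plan: str         # remaining sub-goals
    decision: str     # the action chosen for this step
    memory: str       # information carried over to later steps


def parse_thought(raw: str) -> ThoughtStep:
    """Parse a JSON object and reject it unless all five fields are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [k for k in REQUIRED_FIELDS if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return ThoughtStep(**{k: str(data[k]) for k in REQUIRED_FIELDS})


example = json.dumps({
    "observation": "Cart page with one item",
    "reflection": "The previous tap opened the cart as expected",
    "plan": "Proceed to checkout, then confirm the address",
    "decision": "tap(checkout_button)",
    "memory": "Item price: 12.50",
})
step = parse_thought(example)
```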
Rather than scaling data volume, the flywheel is organized around the error distribution exposed during real rollouts. Interactive annotation locates the first key error and records the corrected action and reason; teacher-model scoring & takeover detects off-path behavior at scale and demonstrates recovery to a workable path.
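The annotation step can be sketched as turning a failed trajectory into a single supervised correction. This is an illustrative sketch under stated assumptions: the per-step `on_path` flag stands in for the annotator's or teacher model's judgement, and the types `Step` and `CorrectionSample` and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    screen: str    # identifier of the observation at this step
    action: str    # action the agent took
    on_path: bool  # annotator or teacher judgement (hypothetical label)


@dataclass
class CorrectionSample:
    step_index: int
    screen: str
    wrong_action: str
    corrected_action: str
    reason: str


def first_key_error(trajectory: List[Step]) -> Optional[int]:
    """Index of the earliest off-path step, or None if the rollout is clean."""
    for i, step in enumerate(trajectory):
        if not step.on_path:
            return i
    return None


def build_correction(trajectory: List[Step], corrected_action: str,
                     reason: str) -> Optional[CorrectionSample]:
    """Attach the corrected action and its reason to the first key error."""
    idx = first_key_error(trajectory)
    if idx is None:
        return None
    bad = trajectory[idx]
    return CorrectionSample(idx, bad.screen, bad.action, corrected_action, reason)


rollout = [
    Step("home", "tap(search)", True),
    Step("search", "tap(first_result)", False),  # skipped typing the query
    Step("wrong_item", "tap(buy)", False),
]
sample = build_correction(rollout, "type('usb-c cable')",
                          "The query must be entered before picking a result")
```

Only the first off-path step is labelled here, since later mistakes usually follow from it and the corrected action would change everything downstream.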
The pipeline forms a curriculum that progresses from dense to sparse feedback. Supervised fine-tuning (SFT) establishes the output protocol and basic interaction. Step-level RL applies Group Sequence Policy Optimization (GSPO) with a hierarchy-triggered cascade reward to correct local errors. Agentic RL optimizes whole trajectories in real or near-real environments for long-horizon planning and recovery.
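The text above does not spell out the cascade reward, so the following is only one plausible reading, offered as a sketch: checks are ordered from coarse to fine, and a finer check is evaluated only if every coarser one passed. The three levels (format, action type, arguments), their weights, and all function names are assumptions, not the report's definition.

```python
from typing import Callable, Dict, List, Tuple

Action = Dict[str, object]
Check = Callable[[Action, Action], bool]


def format_ok(pred: Action, ref: Action) -> bool:
    """Level 1: the prediction has a string type and a dict of arguments."""
    return isinstance(pred.get("type"), str) and isinstance(pred.get("args"), dict)


def type_ok(pred: Action, ref: Action) -> bool:
    """Level 2: the action type matches the reference."""
    return pred["type"] == ref["type"]


def args_ok(pred: Action, ref: Action) -> bool:
    """Level 3: the arguments match the reference exactly."""
    return pred["args"] == ref["args"]


# Hypothetical hierarchy and weights; a real system would tune both.
LEVELS: List[Tuple[str, Check, float]] = [
    ("format", format_ok, 0.2),
    ("type", type_ok, 0.3),
    ("args", args_ok, 0.5),
]


def cascade_reward(pred: Action, ref: Action) -> float:
    """Sum the weights of consecutive passing levels, stopping at the first failure."""
    reward = 0.0
    for _name, check, weight in LEVELS:
        if not check(pred, ref):
            break
        reward += weight
    return reward


ref = {"type": "tap", "args": {"x": 120, "y": 640}}
r_full = cascade_reward({"type": "tap", "args": {"x": 120, "y": 640}}, ref)
r_args = cascade_reward({"type": "tap", "args": {"x": 10, "y": 10}}, ref)
r_type = cascade_reward({"type": "scroll", "args": {"dy": -300}}, ref)
r_bad = cascade_reward({"type": None}, ref)
```

The appeal of such a design is that a malformed or wrong-type action cannot collect credit for arguments that happen to match, so the step-level signal stays dense without rewarding lucky partial overlaps.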
Two real-device trajectories: end-to-end execution and mid-trajectory recovery.
The agent observes the screen, decomposes the task into sub-goals, and emits a sequence of GUI actions until the goal is reached.
When the observed state deviates from the expected outcome, the agent records the discrepancy, revises its plan, and selects a corrective action rather than continuing the original trajectory.
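The execute-compare-recover behaviour described in these two trajectories can be sketched as a small control loop over a toy state machine. Everything here is illustrative: the state names, the transition table, and the `replan` rule are hypothetical stand-ins for the model's own perception and planning.

```python
from typing import Callable, Dict, List, Tuple

PlanStep = Tuple[str, str]  # (action, expected resulting state)

# Toy environment: a permission dialog interrupts the checkout action.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("home", "open_cart"): "cart",
    ("cart", "checkout"): "permission_dialog",
    ("permission_dialog", "allow"): "checkout",
}


def step_fn(state: str, action: str) -> str:
    """Apply an action; unknown (state, action) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, action), state)


def replan(state: str, remaining: List[PlanStep]) -> List[PlanStep]:
    """Hypothetical recovery rule: dismiss the dialog, then resume the plan."""
    if state == "permission_dialog":
        return [("allow", "checkout")] + remaining
    return remaining


def run_with_recovery(plan: List[PlanStep], start: str,
                      step: Callable[[str, str], str],
                      max_steps: int = 20) -> Tuple[str, List[str]]:
    """Execute the plan, logging each discrepancy and replanning when one occurs."""
    state, queue, log, taken = start, list(plan), [], 0
    while queue and taken < max_steps:
        action, expected = queue.pop(0)
        state = step(state, action)
        taken += 1
        if state != expected:
            log.append(f"expected {expected!r}, observed {state!r}")
            queue = replan(state, queue)
    return state, log


final, log = run_with_recovery(
    [("open_cart", "cart"), ("checkout", "checkout")], "home", step_fn)
```

In this toy run the loop reaches the checkout state after one recorded discrepancy, instead of continuing the original plan against a screen it did not expect.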
@misc{cao2026xiaomigui0technicalreport,
title={Xiaomi-GUI-0 Technical Report},
author={Wanxia Cao and Chengzhen Duan and Pei Fu and Pengzhi Gao and Niu Lian and Fazhan Liu and Hui Liu and Heng Qu and Qinzhuo Wu and Zhehao Yu and Tongbo Chen and Shiqi Cui and Anan Du and Shukai Jia and Yuanfa Li and Yike Liu and Wenchao Lu and Haoyuan Sun and Jiatong Sun and Cheng Tan and Yajie Wang and Changqiao Wu and Tao Xiong and Jiahui Yang and Yuxuan Yuan and Ruoceng Zhang and Shaojie Zhang and Jian Zhu and Jian Luan and Cong Zou},
year={2026},
eprint={2606.31410},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.31410},
}