Xiaomi-GUI-0: A Native GUI Agent for Real Mobile Environments

Overview

High benchmark scores do not reliably predict performance on real devices, where account states, permission dialogs, payment authentication, and risk-control mechanisms continually reshape the state distribution GUI agents encounter. To address this gap, we present Xiaomi-GUI-0, a native end-to-end multimodal GUI agent trained and evaluated within a real-device closed loop.

📱

Real-Device Infrastructure

Hundreds of physical phones, tablets, and in-vehicle cockpits, complemented by sandbox instances, let data collection, training, rollout, and evaluation share a single real-deployment distribution.

🔁

Error-Driven Data Flywheel

Failure trajectories from real rollouts are converted into corrected actions, reflective rationales, and recovery demonstrations that supervise abnormal-state recognition and self-correction.

🎯

Three-Stage Training

SFT → Step-level RL → Agentic RL: basic interface operation, then local error correction, then long-horizon planning and recovery.

🧪

RealMobile Benchmark

100 real-device tasks across 14 live applications, scored by fine-grained sub-goals, with 57% spanning multiple applications.

Demos

Task 1 · E-Commerce Filtering · Dewu (得物)

“在得物给我找个足球鞋，我要 24 小时发货的、300 元以上的、蓝色、适合男性的，并且是李宁 / Nike / 阿迪达斯这三个品牌的；有合适的话帮我加入‘我想要’，最多加入三个就可以了。”

“On Dewu, find me football boots that ship within 24 hours, priced above ¥300, in blue, for men, and from these three brands: Li-Ning, Nike, or Adidas. Add the suitable ones to my wishlist, up to three.”

Task 2 · Cross-Application Image Search · Xiaohongshu (小红书) → Dewu (得物)

“在小红书上搜索‘樊振东东京奥运会蓝色球衣’，将其中一个产品图片保存到相册，然后用保存的图片在得物上搜索相同的商品，并将其添加到‘我想要’。”

“On Xiaohongshu, search for ‘Fan Zhendong’s blue Tokyo Olympics jersey’, save one of the product images to the album, then use the saved image to search for the same item on Dewu and add it to my wishlist.”

Task 3 · Multi-Application Planning · Ctrip (携程) → Amap (高德地图)

“去携程查询明天下午从北京飞往上海的最早一班机票，记下起飞时间；假设我必须在起飞前 2 小时到达机场安检，然后打开高德地图，查从北京颐和园到机场地铁需要多少分钟，最后帮我总结一份明早出行计划。”

“On Ctrip, look up the earliest flight from Beijing to Shanghai tomorrow afternoon and note its departure time. Assuming I must reach airport security two hours before departure, open Amap and check how many minutes the subway takes from the Summer Palace to the airport, then summarize a travel plan for tomorrow morning.”

Results

On the real-device benchmark RealMobile, Xiaomi-GUI-0 reaches 72.0% success, well ahead of the best open-source model evaluated (33.0%) and behind only Gemini 3.1 Pro (85.0%) and Seed 2.0 Pro (80.0%) among proprietary systems. On AndroidWorld it achieves the best result among evaluated models (78.9%).

RealMobile: Success Rate (%)

AndroidWorld: Success Rate (%)

Navigation Results

Success denotes the fraction of fully completed tasks; Progress denotes the mean fraction of completed sub-goals per task.

Model	RealMobile Success	RealMobile Progress	AndroidWorld Success
Proprietary Models
OpenAI CUA (o3)	—	—	52.5%
Gemini 3.1 Pro	85.0%	89.6%	—
Gemini 3.1 Flash	58.0%	72.4%	—
Claude Opus 4.7	60.0%	74.8%	—
Claude Opus 4.6	33.0%	56.7%	—
Seed 2.0 Pro	80.0%	88.1%	—
Seed 1.8	65.0%	82.4%	70.7%
UI-TARS-2	—	—	73.3%
UI-TARS-1.5	24.0%	40.5%	64.2%
Open-source Models
UI-Venus-1.5-8B	16.0%	41.6%	73.7%
UI-Venus-1.5-30B-A3B	21.0%	44.6%	77.6%
GUI-Owl-1.5-8B-Instruct	25.0%	44.0%	69.0%
GUI-Owl-1.5-8B-Thinking	26.0%	39.0%	71.6%
GUI-Owl-1.5-32B-Instruct	22.0%	40.6%	69.8%
GUI-Owl-1.5-32B-Thinking	31.0%	51.7%	69.8%
Step-GUI-8B	15.0%	32.8%	67.7%
MAI-UI-8B	33.0%	50.8%	70.7%
Ours
Xiaomi-GUI-0-30B-A3B	72.0%	85.8%	78.9%
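
The Success and Progress columns above follow directly from per-task sub-goal outcomes. The sketch below is a minimal illustration, not the benchmark's actual scorer: the `TaskResult` type and the rule that a task counts as fully completed only when every sub-goal is completed are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical type); sub_goals[i] is True if sub-goal i was completed."""
    task_id: str
    sub_goals: List[bool]


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks with every sub-goal completed (assumed meaning of 'fully completed')."""
    return sum(all(r.sub_goals) for r in results) / len(results)


def progress(results: List[TaskResult]) -> float:
    """Mean, over tasks, of the fraction of completed sub-goals in each task."""
    return sum(sum(r.sub_goals) / len(r.sub_goals) for r in results) / len(results)


results = [
    TaskResult("t1", [True, True, True]),           # fully completed
    TaskResult("t2", [True, False, False, False]),  # one of four sub-goals
]
print(success_rate(results))  # 0.5
print(progress(results))      # 0.625
```

Because Progress awards partial credit, it is never lower than Success under this definition, which matches every row of the table.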

The RealMobile Benchmark

RealMobile is built from real user traffic, hand-crafted for reproducible evaluation, and executed on physical devices against live applications rather than emulators. Each task is scored through fine-grained sub-goals that award partial credit, and 57% of the tasks span multiple applications.

Figure: Application frequency across the 100 RealMobile tasks.

Figure: Number of applications per task.

10 tasks

Foundation

Basic GUI operations: clicking, scrolling, inputting, and navigating across interfaces.

16 tasks

Safety & Reflection

Refusing unsafe or irreversible operations, and recognizing infeasible goals to stop or skip.

33 tasks

Memory & Knowledge

Retaining information across steps and applying external knowledge to complete tasks.

41 tasks

Reasoning & Planning

Long-horizon planning, multi-source aggregation, and adaptive decision-making.

Approach

Real-Device-Dominant Hybrid Infrastructure

Physical devices serve as the primary execution environment with sandboxes as auxiliary support, organized into a resource layer, a scheduling layer, and an execution & collection layer. A Device-Pull scheduler lets idle devices request tasks matching their current readiness, avoiding assignments to devices that become ineligible.

Figure: Hybrid infrastructure overview. Hundreds of physical phones and dozens of tablets form the primary execution substrate, complemented by hundreds of sandbox instances.
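
The pull-based idea behind the Device-Pull scheduler can be sketched as follows. This is an illustrative sketch only: the class names, the tag-based notion of "readiness", and the first-in-first-out matching rule are all assumptions made for the example, not details taken from the report.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, FrozenSet, Optional, Set


@dataclass
class Task:
    task_id: str
    required: FrozenSet[str]  # readiness tags the task needs (hypothetical)


@dataclass
class Device:
    device_id: str
    ready: Set[str] = field(default_factory=set)  # tags the device satisfies right now (hypothetical)


class PullScheduler:
    """Tasks wait in a queue; an idle device asks for work it can run in its current state."""

    def __init__(self) -> None:
        self._queue: Deque[Task] = deque()

    def submit(self, task: Task) -> None:
        self._queue.append(task)

    def pull(self, device: Device) -> Optional[Task]:
        """Return the oldest queued task whose requirements the device meets now, else None."""
        for index, task in enumerate(self._queue):
            if task.required <= device.ready:
                del self._queue[index]
                return task
        return None


sched = PullScheduler()
sched.submit(Task("pay", frozenset({"logged_in", "payment_bound"})))
sched.submit(Task("browse", frozenset({"logged_in"})))

phone = Device("phone-1", {"logged_in"})
print(sched.pull(phone).task_id)  # browse
print(sched.pull(phone))          # None
```

Because eligibility is checked at the moment the device asks, a task is never handed to a device whose state changed after a central assignment was made; the unmatched "pay" task simply stays queued for a device that qualifies.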

Multi-Source Training Data

Three progressive data tiers span the supervision needed for real mobile scenarios: high-frequency task data for head functions and abnormal states, high-generalization data for long-tail intents via function trees and behavior buckets, and agent-capability enhancement data with a five-field structured chain-of-thought schema (Observation, Reflection, Plan, Decision, Memory).

Figure: High-generalization data pipeline, covering function trees, behavior-bucket query synthesis, rollout, and two-level cleaning.

Figure: Query synthesis pipeline across single- and cross-application task types, with LLM-judge filtering and function-point back-tagging.
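
The five-field chain-of-thought schema (Observation, Reflection, Plan, Decision, Memory) could be represented as a typed record like the one below. Only the five field names come from the text above; the JSON serialization, the `StepThought` name, and the validation rule are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class StepThought:
    """One step of structured reasoning; field names follow the five-field schema."""
    observation: str
    reflection: str
    plan: str
    decision: str
    memory: str


FIELD_NAMES = tuple(f.name for f in fields(StepThought))


def parse_step(raw: str) -> StepThought:
    """Parse a JSON object (assumed wire format) and reject it if any field is missing."""
    data = json.loads(raw)
    missing = [name for name in FIELD_NAMES if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return StepThought(**{name: str(data[name]) for name in FIELD_NAMES})


raw = json.dumps({
    "observation": "Search results page is shown.",
    "reflection": "The previous tap opened the correct page.",
    "plan": "Apply the brand filter, then the price filter.",
    "decision": "tap(filter_button)",
    "memory": "Target price: above 300 yuan.",
})
step = parse_step(raw)
print(step.decision)  # tap(filter_button)
```

Validating every field at parse time makes a malformed model output fail loudly during data cleaning instead of silently entering the training set.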

Error-Driven Data Flywheel

Rather than scaling data volume, the flywheel is organized around the error distribution exposed during real rollouts. Interactive annotation locates the first key error and records the corrected action and reason; teacher-model scoring & takeover detects off-path behavior at scale and demonstrates recovery to a workable path.

Figure: Teacher scoring and takeover. The student rolls out while the teacher scores each step; sustained below-threshold scores trigger a bounded takeover that produces a deviation–diagnosis–recovery segment.
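
The trigger described above, sustained below-threshold scores followed by a bounded takeover, can be sketched as a run-length check. The threshold, the `patience` window, the step cap, and both function names are hypothetical values and names picked for this example; the report does not specify them here.

```python
from typing import List, Optional, Sequence, Tuple


def find_takeover(
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off for an acceptable step
    patience: int = 3,       # assumed number of consecutive low scores needed
) -> Optional[Tuple[int, int]]:
    """Return (deviation_start, trigger_step) for the first sustained low-score run, else None."""
    streak = 0
    for step, score in enumerate(scores):
        streak = streak + 1 if score < threshold else 0
        if streak >= patience:
            return step - patience + 1, step
    return None


def bounded_takeover(teacher_actions: Sequence[str], max_steps: int) -> List[str]:
    """Keep at most max_steps teacher actions so the takeover stays bounded."""
    return list(teacher_actions[:max_steps])


print(find_takeover([0.9, 0.8, 0.4, 0.3, 0.2, 0.7]))  # (2, 4)
print(find_takeover([0.9, 0.4, 0.8, 0.3]))            # None
print(bounded_takeover(["back", "tap_search", "type_query", "submit"], max_steps=2))
# ['back', 'tap_search']
```

Requiring a run of low scores rather than a single one keeps an isolated noisy score from triggering a takeover, and the returned start index marks where the deviation segment begins.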

Progressive Three-Stage Training

The pipeline forms a curriculum that progresses from dense to sparse feedback. SFT establishes the output protocol and basic interaction. Step RL applies GSPO (Group Sequence Policy Optimization) with a hierarchy-triggered cascade reward to correct local errors. Agentic RL optimizes whole trajectories in real or near-real environments for long-horizon planning and recovery.

SFT (Supervised Fine-Tuning) → Step RL (Step-Level RL) → Agentic RL (Trajectory-Level RL)
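
One way to read "hierarchy-triggered cascade reward" is that checks are ordered from coarse to fine and a finer check is evaluated only once every coarser check has passed. The sketch below illustrates that reading only. The three levels (output format, action type, action arguments), their weights, and the dictionary layout of an action are assumptions for this example and are not taken from the report.

```python
from typing import Callable, Dict, Sequence, Tuple

Action = Dict[str, object]
Check = Callable[[Action, Action], bool]


def cascade_reward(pred: Action, ref: Action, levels: Sequence[Tuple[Check, float]]) -> float:
    """Sum level weights in order, stopping at the first failed check."""
    reward = 0.0
    for check, weight in levels:
        if not check(pred, ref):
            break  # finer levels are never triggered
        reward += weight
    return reward


# Hypothetical hierarchy, coarse to fine.
def format_ok(pred: Action, ref: Action) -> bool:
    return "action" in pred and "args" in pred


def type_ok(pred: Action, ref: Action) -> bool:
    return pred["action"] == ref["action"]


def args_ok(pred: Action, ref: Action) -> bool:
    return pred["args"] == ref["args"]


LEVELS = [(format_ok, 0.25), (type_ok, 0.25), (args_ok, 0.5)]

ref = {"action": "click", "args": {"x": 10, "y": 20}}
print(cascade_reward({"action": "click", "args": {"x": 10, "y": 20}}, ref, LEVELS))  # 1.0
print(cascade_reward({"action": "scroll", "args": {"x": 10, "y": 20}}, ref, LEVELS))  # 0.25
print(cascade_reward({"action": "click"}, ref, LEVELS))  # 0.0
```

Under this reading, a step with the wrong action type earns nothing for coincidentally matching arguments, so the step-level signal stays dense while still ranking errors by how early in the hierarchy they occur.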

Case Studies

Two real-device trajectories: end-to-end execution and mid-trajectory recovery.

Case 1 · End-to-end execution
Figure: Complete real-device execution trajectory.
The agent observes the screen, decomposes the task into sub-goals, and emits a sequence of GUI actions until the goal is reached.

Case 2 · Reflection & recovery
Figure: Reflection and plan revision trajectory.
When the observed state deviates from the expected outcome, the agent records the discrepancy, revises its plan, and selects a corrective action rather than continuing the original trajectory.

Citation

@misc{cao2026xiaomigui0technicalreport,
      title={Xiaomi-GUI-0 Technical Report},
      author={Wanxia Cao and Chengzhen Duan and Pei Fu and Pengzhi Gao and Niu Lian and Fazhan Liu and Hui Liu and Heng Qu and Qinzhuo Wu and Zhehao Yu and Tongbo Chen and Shiqi Cui and Anan Du and Shukai Jia and Yuanfa Li and Yike Liu and Wenchao Lu and Haoyuan Sun and Jiatong Sun and Cheng Tan and Yajie Wang and Changqiao Wu and Tao Xiong and Jiahui Yang and Yuxuan Yuan and Ruoceng Zhang and Shaojie Zhang and Jian Zhu and Jian Luan and Cong Zou},
      year={2026},
      eprint={2606.31410},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.31410},
}