Xiaomi-GUI-0

A native end-to-end multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop.

Overview

High benchmark scores do not reliably predict performance on real devices, where account states, permission dialogs, payment authentication, and risk-control mechanisms continually reshape the state distribution GUI agents encounter. To address this gap, we present Xiaomi-GUI-0, a native end-to-end multimodal GUI agent trained and evaluated within a real-device closed loop.

📱

Real-Device Infrastructure

Hundreds of physical phones, tablets, and in-vehicle cockpits, complemented by sandbox instances, so that data collection, training, rollout, and evaluation share a single real-deployment distribution.

🔁

Error-Driven Data Flywheel

Failure trajectories from real rollouts are converted into corrected actions, reflective rationales, and recovery demonstrations that supervise abnormal-state recognition and self-correction.

🎯

Three-Stage Training

SFT → Step-level RL → Agentic RL: basic interface operation, then local error correction, then long-horizon planning and recovery.

🧪

RealMobile Benchmark

100 real-device tasks across 14 live applications, scored by fine-grained sub-goals, with 57% spanning multiple applications.

Results

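On the real-device benchmark RealMobile, Xiaomi-GUI-0 reaches a 72.0% success rate, well above the best open-source model evaluated (MAI-UI-8B, 33.0%) and approaching the strongest proprietary systems (Gemini 3.1 Pro, 85.0%; Seed 2.0 Pro, 80.0%). On AndroidWorld it scores 78.9%, the best result among the evaluated models.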

Figure: RealMobile success rate (%).

Figure: AndroidWorld success rate (%).

Navigation Results

Success denotes the fraction of fully completed tasks; Progress denotes the mean fraction of completed sub-goals per task.

| Model | RealMobile Success | RealMobile Progress | AndroidWorld |
| --- | --- | --- | --- |
| Proprietary Models | | | |
| OpenAI CUA (o3) | - | - | 52.5% |
| Gemini 3.1 Pro | 85.0% | 89.6% | - |
| Gemini 3.1 Flash | 58.0% | 72.4% | - |
| Claude Opus 4.7 | 60.0% | 74.8% | - |
| Claude Opus 4.6 | 33.0% | 56.7% | - |
| Seed 2.0 Pro | 80.0% | 88.1% | - |
| Seed 1.8 | 65.0% | 82.4% | 70.7% |
| UI-TARS-2 | - | - | 73.3% |
| UI-TARS-1.5 | 24.0% | 40.5% | 64.2% |
| Open-source Models | | | |
| UI-Venus-1.5-8B | 16.0% | 41.6% | 73.7% |
| UI-Venus-1.5-30B-A3B | 21.0% | 44.6% | 77.6% |
| GUI-Owl-1.5-8B-Instruct | 25.0% | 44.0% | 69.0% |
| GUI-Owl-1.5-8B-Thinking | 26.0% | 39.0% | 71.6% |
| GUI-Owl-1.5-32B-Instruct | 22.0% | 40.6% | 69.8% |
| GUI-Owl-1.5-32B-Thinking | 31.0% | 51.7% | 69.8% |
| Step-GUI-8B | 15.0% | 32.8% | 67.7% |
| MAI-UI-8B | 33.0% | 50.8% | 70.7% |
| Ours | | | |
| Xiaomi-GUI-0-30B-A3B | 72.0% | 85.8% | 78.9% |

The RealMobile Benchmark

RealMobile tasks are derived from real user traffic, hand-crafted for reproducible evaluation, and executed on physical devices against live applications rather than emulators. Each task is scored through fine-grained sub-goals that award partial credit, and 57% of the tasks span multiple applications.
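The two reported metrics can be illustrated with a minimal sketch. It assumes each task is recorded as a list of pass/fail flags, one per sub-goal, and that a task counts as fully completed only when every sub-goal passes. `TaskResult` and `score` are hypothetical names, not part of the benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task; sub_goals holds one pass/fail flag per sub-goal."""
    task_id: str
    sub_goals: list[bool]

    @property
    def progress(self) -> float:
        # Fraction of this task's sub-goals that were completed (partial credit).
        return sum(self.sub_goals) / len(self.sub_goals)

    @property
    def success(self) -> bool:
        # Assumption: full completion means every sub-goal passed.
        return all(self.sub_goals)


def score(results: list[TaskResult]) -> tuple[float, float]:
    """Return (success_rate, mean_progress), both in [0, 1]."""
    n = len(results)
    success_rate = sum(r.success for r in results) / n
    mean_progress = sum(r.progress for r in results) / n
    return success_rate, mean_progress
```

Under this reading, Progress can never be lower than Success, which matches every row of the results table that reports both.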

Figure: Application frequency across the 100 tasks in RealMobile.

Figure: Number of applications per task.

10 tasks

Foundation

Basic GUI operations: clicking, scrolling, inputting, and navigating across interfaces.

16 tasks

Safety & Reflection

Refusing unsafe or irreversible operations, and recognizing infeasible goals to stop or skip.

33 tasks

Memory & Knowledge

Retaining information across steps and applying external knowledge to complete tasks.

41 tasks

Reasoning & Planning

Long-horizon planning, multi-source aggregation, and adaptive decision-making.

Approach

Real-Device-Dominant Hybrid Infrastructure

Physical devices serve as the primary execution environment, with sandboxes as auxiliary support. The infrastructure is organized into a resource layer, a scheduling layer, and an execution & collection layer. A Device-Pull scheduler lets idle devices request tasks that match their current readiness, which avoids assigning work to devices that have since become ineligible.
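The pull-based idea can be sketched as follows. This is an illustration under stated assumptions, not the actual scheduler: readiness is modeled as a set of string tags, and `Task`, `PullScheduler`, and the tag names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    task_id: str
    required: frozenset  # hypothetical readiness tags, e.g. {"logged_in"}


@dataclass
class PullScheduler:
    """Holds pending tasks; devices ask for work instead of being assigned it."""
    pending: deque = field(default_factory=deque)

    def submit(self, task: Task) -> None:
        self.pending.append(task)

    def pull(self, readiness: set) -> Optional[Task]:
        # The device reports its readiness at request time, so matching uses
        # its current state rather than a snapshot taken at scheduling time.
        for task in self.pending:
            if task.required <= readiness:
                self.pending.remove(task)
                return task
        return None  # nothing eligible; the device stays idle and retries later
```

A device that is logged in but has no payment method set up would skip a payment task and receive the next one it qualifies for:

```python
scheduler = PullScheduler()
scheduler.submit(Task("pay", frozenset({"logged_in", "payment_ready"})))
scheduler.submit(Task("browse", frozenset()))
task = scheduler.pull({"logged_in"})  # returns the "browse" task
```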

Figure: Hybrid infrastructure overview. Hundreds of physical phones and dozens of tablets form the primary execution substrate, complemented by hundreds of sandbox instances.

Multi-Source Training Data

Three progressive data tiers span the supervision needed for real mobile scenarios: high-frequency task data for head functions and abnormal states, high-generalization data for long-tail intents via function trees and behavior buckets, and agent-capability enhancement data with a five-field structured chain-of-thought schema (Observation, Reflection, Plan, Decision, Memory).
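One way to picture the five-field schema is as a typed record attached to each step. Only the five field names come from the text above. The JSON serialization, the `render` helper, and the action format are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Thought:
    """One step of structured reasoning; field names follow the five-field schema."""
    observation: str
    reflection: str
    plan: str
    decision: str
    memory: str


def render(thought: Thought, action: dict) -> str:
    # Hypothetical serialization: the thought fields followed by the GUI action.
    return json.dumps({"thought": asdict(thought), "action": action}, ensure_ascii=False)
```

Keeping the fields separate rather than in one free-form rationale makes it possible to supervise or inspect each one independently, for example checking that Memory carries a value needed several steps later.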

Figure: High-generalization data construction: function trees, behavior-bucket query synthesis, rollout, and two-level cleaning.

Figure: Query synthesis across single- and cross-application task types, with LLM-judge filtering and function-point back-tagging.

Error-Driven Data Flywheel

Rather than scaling data volume, the flywheel is organized around the error distribution exposed during real rollouts. Interactive annotation locates the first key error and records the corrected action and reason; teacher-model scoring & takeover detects off-path behavior at scale and demonstrates recovery to a workable path.
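The interactive-annotation path can be sketched as a small transformation from a failed trajectory to one supervised sample. The record layout below is an assumption: `Step`, `Annotation`, and the output keys are hypothetical and only mirror the three things the text says are recorded (the first key error, the corrected action, and the reason).

```python
from dataclasses import dataclass


@dataclass
class Step:
    screenshot: str  # path or ID of the observation at this step
    action: str      # action the agent took


@dataclass
class Annotation:
    first_error_index: int  # index of the first key error in the trajectory
    corrected_action: str
    reason: str


def to_correction_sample(trajectory: list[Step], ann: Annotation) -> dict:
    """Build one supervised sample from a failed trajectory and its annotation."""
    bad = trajectory[ann.first_error_index]
    return {
        # Steps before the first key error are kept as context.
        "history": [s.action for s in trajectory[: ann.first_error_index]],
        "observation": bad.screenshot,
        "rejected_action": bad.action,
        "target_action": ann.corrected_action,
        "rationale": ann.reason,
    }
```

Cutting at the first key error keeps the sample focused on the decision that caused the failure, since later steps were taken from an already off-path state.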

Figure: Teacher scoring and takeover. The student rolls out while the teacher scores each step; sustained below-threshold scores trigger a bounded takeover producing a deviation–diagnosis–recovery segment.
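The trigger logic can be sketched as below. The threshold, the number of consecutive low scores that counts as "sustained" (`patience`), and the takeover length cap (`budget`) are all hypothetical parameters, since the source gives no values for them.

```python
from typing import Optional, Sequence


def find_takeover(scores: Sequence[float], threshold: float = 0.5,
                  patience: int = 3) -> Optional[tuple]:
    """Return (deviation_start, trigger_step), or None if no takeover is needed.

    A takeover triggers at the first step that completes `patience`
    consecutive teacher scores below `threshold`.
    """
    streak = 0
    for i, s in enumerate(scores):
        streak = streak + 1 if s < threshold else 0
        if streak == patience:
            return i - patience + 1, i
    return None


def bounded_takeover(teacher_actions: Sequence[str], budget: int = 5) -> list:
    # The takeover is bounded: keep at most `budget` teacher steps.
    return list(teacher_actions[:budget])
```

Requiring a run of low scores rather than a single one avoids handing control to the teacher after an isolated dip that the student may recover from on its own.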

Progressive Three-Stage Training

The pipeline forms a curriculum that progresses from dense to sparse feedback. SFT establishes the output protocol and basic interaction. Step RL applies GSPO with a hierarchy-triggered cascade reward to correct local errors. Agentic RL optimizes whole trajectories in real or near-real environments for long-horizon planning and recovery.

SFT: Supervised Fine-Tuning
Step RL: Step-Level RL
Agentic RL: Trajectory-Level RL
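For readers unfamiliar with GSPO (Group Sequence Policy Optimization), its defining feature is a length-normalized, sequence-level importance ratio combined with group-normalized advantages. The sketch below computes that clipped surrogate for one group of sampled responses in plain Python. It is a generic illustration, not this project's training code: rewards are taken as given inputs because the cascade reward is not defined here, and `eps=0.2` is an arbitrary illustrative clip range.

```python
import math
from statistics import mean, pstdev


def gspo_objective(logp_new, logp_old, rewards, eps=0.2):
    """Clipped sequence-level surrogate for one group of sampled responses.

    logp_new / logp_old: per-token log-probabilities under the current and
    the sampling policy, one list per response.
    rewards: one scalar reward per response.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    total = 0.0
    for new, old, r in zip(logp_new, logp_old, rewards):
        adv = (r - mu) / (sigma + 1e-8)  # group-normalized advantage
        # Length-normalized sequence-level importance ratio.
        ratio = math.exp((sum(new) - sum(old)) / len(new))
        clipped = min(max(ratio, 1 - eps), 1 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(rewards)
```

Because the ratio is computed once per response rather than per token, a single clipping decision applies to the whole output of a step.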

Case Studies

Two real-device trajectories: end-to-end execution and mid-trajectory recovery.

Case 1 · End-to-end execution

The agent observes the screen, decomposes the task into sub-goals, and emits a sequence of GUI actions until the goal is reached.

Case 2 · Reflection & recovery

When the observed state deviates from the expected outcome, the agent records the discrepancy, revises its plan, and selects a corrective action rather than continuing the original trajectory.

Citation

@misc{cao2026xiaomigui0technicalreport,
      title={Xiaomi-GUI-0 Technical Report},
      author={Wanxia Cao and Chengzhen Duan and Pei Fu and Pengzhi Gao and Niu Lian and Fazhan Liu and Hui Liu and Heng Qu and Qinzhuo Wu and Zhehao Yu and Tongbo Chen and Shiqi Cui and Anan Du and Shukai Jia and Yuanfa Li and Yike Liu and Wenchao Lu and Haoyuan Sun and Jiatong Sun and Cheng Tan and Yajie Wang and Changqiao Wu and Tao Xiong and Jiahui Yang and Yuxuan Yuan and Ruoceng Zhang and Shaojie Zhang and Jian Zhu and Jian Luan and Cong Zou},
      year={2026},
      eprint={2606.31410},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.31410},
}