macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

Pei Yang*, Hai Ci*, and Mike Zheng Shou

Show Lab, National University of Singapore

Teaser image

macOSWorld is an interactive benchmark for evaluating GUI agents. It features an interactive macOS environment, multilingual benchmarking, and a subset dedicated to safety evaluation.

202 Tasks
5 Languages
28 macOS-Exclusive Apps
29 Safety-Subset Tasks

macOSWorld tasks span 7 categories and include both planning-oriented and action-oriented tasks.

Statistics of macOSWorld tasks

Each macOSWorld task contains task instructions in multiple languages, environment preparation configurations for restoring the OS state, and evaluation scripts that determine the reward.

An example of a macOSWorld task
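The three parts of a task described above can be pictured as a small record. The following is only an illustrative sketch: the class name `Task`, its field names, the language codes, and the example values are all hypothetical and do not reflect the actual macOSWorld task file format.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical in-memory view of one benchmark task (not the real schema)."""

    task_id: str
    instructions: dict[str, str]  # language code -> task instruction text
    prep_commands: list[str]      # commands that restore the OS to a known state
    eval_script: str              # script whose result decides the reward

    def instruction_for(self, language: str) -> str:
        """Return the instruction in the requested language, or fail loudly."""
        if language not in self.instructions:
            raise KeyError(f"task {self.task_id} has no instruction for {language!r}")
        return self.instructions[language]


# Placeholder values for illustration only; none of these come from the benchmark.
example_task = Task(
    task_id="example-001",
    instructions={
        "en": "Create a new folder named Reports on the Desktop.",
        "zh": "在桌面上新建一个名为 Reports 的文件夹。",
    },
    prep_commands=["rm -rf ~/Desktop/Reports"],
    eval_script="test -d ~/Desktop/Reports && echo True || echo False",
)

print(example_task.instruction_for("en"))
```

Keeping the per-language instructions in one mapping makes it straightforward to run the same preparation and evaluation steps under each language setting.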

A Python testbench script coordinates the evaluation process, letting the GUI agent interact with an AWS-hosted macOS environment via VNC and SSH.

Architecture of the macOSWorld benchmarking system
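The coordination role of the testbench can be sketched as a loop that prepares the environment, lets the agent act on screenshots, and then runs the evaluation. This is a minimal sketch under stated assumptions, not the actual macOSWorld testbench: the `Environment` interface, the `"done"` stop signal, the `"True"` output convention, and the step limit are all hypothetical, and an in-memory stand-in replaces the real VNC and SSH connections so that the example runs on its own.

```python
from typing import Callable, Protocol


class Environment(Protocol):
    """Hypothetical interface; a real one would wrap SSH and VNC connections."""

    def run_command(self, command: str) -> str: ...  # e.g. over SSH
    def screenshot(self) -> bytes: ...               # e.g. over VNC
    def send_action(self, action: str) -> None: ...  # e.g. over VNC


Agent = Callable[[str, bytes], str]  # (instruction, screenshot) -> action


def run_episode(env: Environment, agent: Agent, instruction: str,
                prep_commands: list[str], eval_command: str,
                max_steps: int = 15) -> bool:
    """Prepare the environment, let the agent act, then evaluate the outcome."""
    for command in prep_commands:
        env.run_command(command)
    for _ in range(max_steps):
        action = agent(instruction, env.screenshot())
        if action == "done":  # assumed stop signal
            break
        env.send_action(action)
    # Assumed convention: the evaluation command prints "True" on success.
    return env.run_command(eval_command).strip() == "True"


class FakeEnvironment:
    """In-memory stand-in so the sketch runs without any remote machine."""

    def __init__(self) -> None:
        self.actions: list[str] = []

    def run_command(self, command: str) -> str:
        if command == "check":
            return "True" if "click" in self.actions else "False"
        self.actions.clear()  # any other command acts as a state reset
        return ""

    def screenshot(self) -> bytes:
        return b""

    def send_action(self, action: str) -> None:
        self.actions.append(action)


def make_scripted_agent(actions: list[str]) -> Agent:
    """Agent that replays a fixed list of actions, then signals completion."""
    remaining = iter(actions)
    return lambda instruction, screenshot: next(remaining, "done")


succeeded = run_episode(FakeEnvironment(), make_scripted_agent(["click"]),
                        "Click the button.", ["reset"], "check")
print(succeeded)
```

Separating the environment interface from the loop means the same `run_episode` logic could be pointed at a remote machine or at a local stub for testing.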

On macOSWorld, proprietary Computer-Use Agents lead in performance, with overall success rates reaching 40%. Agents tend to perform better on languages written in the Roman alphabet. However, agents with higher general performance also tend to be more vulnerable to context deception attacks.

Main benchmark performance

Acknowledgement

macOSWorld benefited from valuable discussions with and feedback from Kevin Qinghong Lin, Zhiqiang Chen, Noorbakht Khan, Brandon Ng, Mingyu Ouyang, Siyuan Hu, Xiangwu Guo, Henry Hengyuan Zhao, Difei Gao, Christopher Rawles, and Kun Shao.

@article{macosworld,
  title={macOSWorld: A Multilingual Interactive Benchmark for GUI Agents},
  author={Pei Yang and Hai Ci and Mike Zheng Shou},
  year={2025},
  eprint={2506.04135},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.04135},
}