Pei Yang*, Hai Ci*, and Mike Zheng Shou✉
Show Lab, National University of Singapore
macOSWorld is an interactive benchmark for testing the performance of GUI agents. It features an interactive macOS environment, multilingual benchmarking, and a subset dedicated to safety evaluation.
macOSWorld tasks span 7 categories and include both planning-oriented and action-oriented tasks.
Each macOSWorld task contains task instructions in multiple languages, environment preparation configurations that restore the OS state, and evaluation scripts that determine the reward.
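To make the three parts of a task concrete, the following is a minimal sketch of how one task record could be modeled. It is not the benchmark's actual schema: the class name `Task`, every field name, the category label, and the example shell commands are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical container for one task; all field names are assumptions."""

    task_id: str
    category: str                      # one of the benchmark's categories (label is made up here)
    instructions: dict[str, str]       # language code -> instruction text
    prep_commands: list[str] = field(default_factory=list)  # restore the OS state before a run
    eval_command: str = ""             # script whose output decides the reward

    def instruction_for(self, lang: str, fallback: str = "en") -> str:
        # Return the instruction in the requested language, or the fallback language.
        return self.instructions.get(lang, self.instructions[fallback])


# Illustrative record; the task content and commands are invented for this sketch.
example = Task(
    task_id="demo-001",
    category="file_management",
    instructions={
        "en": "Create a folder named Reports on the Desktop.",
        "zh": "在桌面上创建一个名为 Reports 的文件夹。",
    },
    prep_commands=["rm -rf ~/Desktop/Reports"],
    eval_command="test -d ~/Desktop/Reports && echo True || echo False",
)
```

Keeping all language variants in one record means the same preparation and evaluation steps are reused for every language, so only the instruction text changes between runs.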
A Python testbench script coordinates the evaluation process, letting the GUI agent interact with an AWS-hosted macOS environment via VNC and SSH.
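The coordination described above can be sketched as a simple loop: prepare the environment, let the agent act on screenshots, then run the evaluation check. This is only an illustration under stated assumptions, not the project's testbench. `FakeEnvironment`, `run_task`, `scripted_agent`, and the `"reset"`/`"check"`/`"done"` strings are all hypothetical, and the fake environment is an in-memory stand-in so the sketch runs without a remote machine. In a real setup, screen capture and input would presumably go over VNC and shell commands over SSH.

```python
from typing import Callable


class FakeEnvironment:
    """In-memory stand-in for a remote macOS machine (hypothetical)."""

    def __init__(self) -> None:
        self.state: set[str] = set()

    def run_command(self, command: str) -> str:
        # Stand-in for a shell command that would be sent over SSH.
        if command == "reset":
            self.state.clear()
            return ""
        if command == "check":
            return "True" if "folder_created" in self.state else "False"
        return ""

    def screenshot(self) -> bytes:
        # Stand-in for a screen capture that would be fetched over VNC.
        return b""

    def perform(self, action: str) -> None:
        # Stand-in for a mouse or keyboard action; here it just records the action.
        self.state.add(action)


Agent = Callable[[str, bytes], str]  # (instruction, screenshot) -> next action


def run_task(env: FakeEnvironment, agent: Agent, instruction: str,
             prep: list[str], check: str, max_steps: int = 5) -> bool:
    """Prepare the environment, let the agent act, then evaluate the result."""
    for command in prep:
        env.run_command(command)
    for _ in range(max_steps):
        action = agent(instruction, env.screenshot())
        if action == "done":
            break
        env.perform(action)
    return env.run_command(check).strip() == "True"


def scripted_agent() -> Agent:
    """Toy agent that replays a fixed action list instead of calling a model."""
    steps = iter(["folder_created", "done"])
    return lambda instruction, shot: next(steps)
```

Bounding the loop with `max_steps` keeps a stuck agent from running indefinitely, and evaluating only after the loop ends means the reward reflects the final OS state rather than any intermediate step.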
On macOSWorld, proprietary computer-use agents lead, with overall success rates reaching 40%. Agents tend to perform better in languages written in the Latin alphabet. However, agents with higher overall performance also tend to be more vulnerable to context deception attacks.
macOSWorld benefited from valuable discussions with and feedback from Kevin Qinghong Lin, Zhiqiang Chen, Noorbakht Khan, Brandon Ng, Mingyu Ouyang, Siyuan Hu, Xiangwu Guo, Henry Hengyuan Zhao, Difei Gao, Christopher Rawles, and Kun Shao.
@article{macosworld,
title={macOSWorld: A Multilingual Interactive Benchmark for GUI Agents},
author={Pei Yang and Hai Ci and Mike Zheng Shou},
year={2025},
eprint={2506.04135},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.04135},
}