macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

Pei Yang^*, Hai Ci^*, and Mike Zheng Shou^✉

Show Lab, National University of Singapore

Paper NeurIPS Page Github (AWS) Github (VMware) View Data

macOSWorld is an interactive benchmark for evaluating GUI agents. It features an interactive macOS environment, multilingual benchmarking, and a dedicated subset for safety evaluation.

202 Tasks

Languages

macOS-Exclusive Apps

Safety-Subset Tasks

macOSWorld tasks span 7 categories and include both planning-oriented and action-oriented tasks.

Each macOSWorld task contains task instructions in multiple languages, environment preparation configurations that restore the OS state, and evaluation scripts that determine the reward.
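To make the three parts of a task concrete, here is a minimal sketch of how such a record could be modeled. The class name, field names, language codes, and shell commands are all hypothetical and chosen for illustration; they are not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    category: str
    # Language code -> instruction text shown to the agent.
    instructions: Dict[str, str]
    # Commands that restore the OS to a known state before the run.
    prep_commands: List[str] = field(default_factory=list)
    # Commands that inspect the final OS state to decide the reward.
    eval_commands: List[str] = field(default_factory=list)

    def instruction_for(self, language: str) -> str:
        """Return the instruction in the requested language, or raise KeyError."""
        if language not in self.instructions:
            raise KeyError(
                f"task {self.task_id!r} has no instruction for {language!r}"
            )
        return self.instructions[language]


# An invented example task, not taken from the benchmark.
example = Task(
    task_id="demo-001",
    category="file_management",
    instructions={
        "en": "Create a folder named Reports on the Desktop.",
        "zh": "在桌面上创建一个名为 Reports 的文件夹。",
    },
    prep_commands=["rm -rf ~/Desktop/Reports"],
    eval_commands=["test -d ~/Desktop/Reports"],
)

print(example.instruction_for("en"))
```

Keeping the instructions in a per-language mapping lets the same preparation and evaluation steps be reused across every language variant of a task.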

A testbench Python script coordinates the evaluation process, letting the GUI agent interact with an AWS-hosted macOS environment via VNC and SSH.
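The coordination described above can be sketched as a simple loop: prepare the environment, let the agent act on screenshots until it stops or runs out of steps, then run the evaluation. This is only an illustration under stated assumptions, not the actual testbench. The `Environment` and `Agent` interfaces, the `run_task` function, the `max_steps` limit, and the assumed split of roles (SSH for preparation and evaluation, VNC for screenshots and actions) are all hypothetical. An in-memory `FakeEnvironment` stands in for the remote macOS host so the sketch runs on its own.

```python
from typing import List, Optional, Protocol


class Environment(Protocol):
    """Hypothetical interface to the remote macOS host."""
    def run_ssh(self, command: str) -> int: ...       # returns an exit code
    def screenshot(self) -> bytes: ...                # assumed to come over VNC
    def send_action(self, action: str) -> None: ...   # assumed to go over VNC


class Agent(Protocol):
    """Hypothetical agent interface; returning None means 'done'."""
    def act(self, instruction: str, screenshot: bytes) -> Optional[str]: ...


def run_task(env: Environment, agent: Agent, instruction: str,
             prep_commands: List[str], eval_commands: List[str],
             max_steps: int = 15) -> bool:
    """Prepare the OS, run the agent loop, then evaluate the final state."""
    for command in prep_commands:
        env.run_ssh(command)
    for _ in range(max_steps):
        action = agent.act(instruction, env.screenshot())
        if action is None:
            break
        env.send_action(action)
    # The task counts as solved only if every evaluation command exits with 0.
    return all(env.run_ssh(command) == 0 for command in eval_commands)


class FakeEnvironment:
    """In-memory stand-in so the sketch runs without a real macOS host."""
    def __init__(self) -> None:
        self.folder_exists = True  # stale state left by an earlier run
        self.log: List[str] = []

    def run_ssh(self, command: str) -> int:
        self.log.append(f"ssh: {command}")
        if command == "reset":
            self.folder_exists = False
            return 0
        if command == "check":
            return 0 if self.folder_exists else 1
        return 127  # unknown command

    def screenshot(self) -> bytes:
        return b"folder" if self.folder_exists else b"empty"

    def send_action(self, action: str) -> None:
        self.log.append(f"vnc: {action}")
        if action == "create_folder":
            self.folder_exists = True


class ScriptedAgent:
    """Toy agent: creates the folder once, then reports it is done."""
    def act(self, instruction: str, screenshot: bytes) -> Optional[str]:
        return "create_folder" if screenshot == b"empty" else None


env = FakeEnvironment()
success = run_task(env, ScriptedAgent(), "Create a folder", ["reset"], ["check"])
print(success, env.log)
```

Running preparation before every task matters here: without the reset step, the stale folder from the earlier run would make the evaluation pass even if the agent did nothing.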

Architecture of the macOSWorld benchmarking system

On macOSWorld, proprietary computer-use agents perform best, with overall success rates reaching 40%. Agents tend to perform better in languages written in the Roman alphabet. However, agents with higher general performance also tend to be more vulnerable to context deception attacks.

Acknowledgement

macOSWorld benefited from valuable discussions with and feedback from Kevin Qinghong Lin, Zhiqiang Chen, Noorbakht Khan, Brandon Ng, Mingyu Ouyang, Siyuan Hu, Xiangwu Guo, Henry Hengyuan Zhao, Difei Gao, Christopher Rawles, and Kun Shao.

@article{macosworld,
    title={macOSWorld: A Multilingual Interactive Benchmark for GUI Agents},
    author={Pei Yang and Hai Ci and Mike Zheng Shou},
    year={2025},
    eprint={2506.04135},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2506.04135},
}