- [18 Sep 2025] macOSWorld accepted to NeurIPS 2025
- [15 Sep 2025] Optimised the automated benchmark execution experience; Added a GUI display for real-time benchmark progress and results
- Faster: Compared to the official AWS-based implementation, this one cuts snapshot recovery time per task from 10-15 minutes to under 1 minute, significantly reducing time overhead
- Cheaper: The macOS environment is deployed locally, avoiding the cost of AWS dedicated host tenancy
- Minimum hardware requirements:
- Ubuntu machines with Intel/AMD CPUs supporting AVX2 (Check CPU AVX2 support)
- 32 GB of RAM
- 400 GB of free disk space
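A quick way to confirm AVX2 support on the Ubuntu host is to scan `/proc/cpuinfo` (equivalent to `grep avx2 /proc/cpuinfo`); a minimal sketch:

```python
# Minimal AVX2 check for Linux hosts: the "flags" lines in /proc/cpuinfo
# list every instruction-set extension the CPU advertises.
from pathlib import Path

def has_avx2(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if any CPU flag line mentions avx2."""
    try:
        text = Path(cpuinfo_path).read_text()
    except OSError:
        return False  # not a Linux host, or /proc unavailable
    return "avx2" in text.lower()

if __name__ == "__main__":
    print("AVX2 supported" if has_avx2() else "AVX2 not detected")
```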
- This implementation cannot cover benchmarking of iMovie-related tasks. Referring to Table 3 (main results) in the macOSWorld paper, it supports reporting the following metrics:
| Category | Supported |
|---|---|
| System & Interface | ✔️ |
| System Apps | ✔️ |
| File Management | ✔️ |
| Productivity | ✔️ |
| Media | ✔️ |
| Advanced Apps* | ⭕ |
| Multi-Apps | ✔️ |
| Overall | ✔️ |

*This VMware implementation excludes iMovie tasks.
- Interactive macOS environment: 30 macOS-native apps with their exclusive user interfaces
- Multilingual benchmarking: Tasks and environments available in English, Chinese, Arabic, Japanese and Russian
- Safety evaluation: Dedicated subset for benchmarking agents' resilience under context deception attacks
This implementation consists of a Python testbench script and a VMware-based macOS environment. Both run locally, although the testbench may require access to online APIs. The benchmark process involves four main steps:
- Step 1: Local Environment Configuration -- General
- Step 2: Local Environment Configuration -- VMware
- Step 3: Running the Benchmark
- Step 4: Releasing VMware Resources
```shell
conda create -n macosworld python=3.9.21
conda activate macosworld
pip install vncdotool==1.2.0 boto3==1.36.20 sshtunnel httpx[socks] openai anthropic google-auth google_cloud_aiplatform jupyter
```
For GPT-4o with SoM Annotations:
- Install additional dependencies:
  ```shell
  conda activate macosworld
  pip install timm easyocr paddlepaddle paddleocr einops==0.8.0 supervision==0.18.0 ultralytics==8.3.70
  ```
- Download model weights:
  ```shell
  mkdir OmniParser
  cd OmniParser
  git clone https://huggingface.co/microsoft/OmniParser-v2.0
  mv OmniParser-v2.0 weights
  mv weights/icon_caption weights/icon_caption_florence
  ```
For Gemini Models:
macOSWorld uses the VertexAI API. Follow this guide to set up credentials and obtain a JSON credential file. Set the environment variable:
```shell
export GOOGLE_APPLICATION_CREDENTIALS='/path/to/gen-lang-client-xxx.json'
```
For ShowUI:
```shell
conda activate macosworld
pip install torch==2.6.0 qwen-vl-utils transformers==4.47.0 accelerate==0.26.0
```
For UI-TARS:
- Install dependencies:
  ```shell
  conda create -n uitars python=3.9.21
  conda activate uitars
  pip install -U transformers==4.49.0
  pip install vllm==0.6.6 --extra-index-url https://download.pytorch.org/whl/cu124
  ```
- Start the vLLM service on the same device running the testbench:
  ```shell
  NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve "bytedance-research/UI-TARS-7B-DPO" --served-model-name UI-TARS-7B-DPO --limit-mm-per-prompt image=3 -tp 4
  ```
Note: The `-tp` parameter specifies the number of GPUs to use and must be a common divisor of 28 and 16.
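The admissible values follow directly from that constraint; a small sketch that enumerates the common divisors of 28 and 16:

```python
# Enumerate valid tensor-parallel sizes: the common divisors of 28 and 16
# are exactly the divisors of their greatest common divisor.
from math import gcd

def valid_tp_values(a: int = 28, b: int = 16) -> list:
    g = gcd(a, b)  # gcd(28, 16) = 4
    return [d for d in range(1, g + 1) if g % d == 0]

print(valid_tp_values())  # [1, 2, 4]
```

So `-tp` may be 1, 2, or 4, matching the 4-GPU example above.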
Please follow the instructions here to set up the VMware environment.
Configure your environment variables and run the testbench. Replace the 🧩 placeholders with your actual values:
```shell
# API Keys (configure as needed)
export OPENAI_API_KEY=🧩'sk-proj-...'  # For OpenAI models
export ANTHROPIC_API_KEY=🧩'sk-ant-...'  # For Anthropic models
export GOOGLE_APPLICATION_CREDENTIALS=🧩'/path/to/gen-lang-client-xxx.json'  # For Gemini models

# Set credential permission
chmod 400 credential.pem

# Run the benchmark
python run.py \
    --vmx_path 🧩/path/to/macOSWorld.vmx \
    --ssh_pkey credential.pem \
    --gui_agent_name 🧩gpt-4o-2024-08-06 \
    --paths_to_eval_tasks 🧩./tasks/sys_apps ./tasks/sys_and_interface ./tasks/productivity ./tasks/media ./tasks/file_management ./tasks/multi_apps \
    --languages 🧩task_en_env_en task_zh_env_zh task_ar_env_ar task_ja_env_ja task_ru_env_ru \
    --base_save_dir ./results/gpt_4o \
    --max_steps 15 \
    --snapshot_recovery_timeout_seconds 120 \
    --task_step_timeout 120
```
| Parameter | Description |
|---|---|
| `instance_id` | EC2 instance ID |
| `vmx_path` | Path to the VM configuration file |
| `ssh_pkey` | Local file path to the provided `credential.pem` file |
| `gui_agent_name` | GUI agent identifier (see supported models below) |
| `paths_to_eval_tasks` | Directory paths containing evaluation task JSON files |
| `languages` | Language pairs in the format `task_<lang>_env_<lang>` |
| `base_save_dir` | Local directory for storing evaluation results |
| `max_steps` | Maximum dialogue turns per task |
| `snapshot_recovery_timeout_seconds` | Timeout for snapshot recovery (usually doesn't need adjustment) |
Supported GUI Agents

- OpenAI GPT Series: `gpt-4o`, `gpt-4o-2024-08-06`
  - With SoM: `gpt-4o/omniparser`, `gpt-4o-2024-08-06/omniparser`
  - Computer Use: `openai/computer-use-preview`
- Google Gemini Series: `gemini-1.5-pro-002`, `gemini-2.5-pro-preview-03-25`
- Anthropic Claude: `claude-3-7-sonnet-20250219/computer-use-2025-01-24`
- Open Source Models: `UI-TARS-7B-DPO`, `showlab/ShowUI-2B`

Supported Language Codes: English (`en`), Chinese (`zh`), Arabic (`ar`), Japanese (`ja`), Russian (`ru`)
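The `--languages` values combine these codes in the `task_<lang>_env_<lang>` pattern, so the argument list can be generated programmatically; a small illustrative helper (not part of the testbench):

```python
# Build --languages arguments from language codes (task_<lang>_env_<lang>).
def language_pairs(codes):
    return [f"task_{c}_env_{c}" for c in codes]

print(" ".join(language_pairs(["en", "zh", "ar", "ja", "ru"])))
# task_en_env_en task_zh_env_zh task_ar_env_ar task_ja_env_ja task_ru_env_ru
```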
The Safety Subset: run this subset separately, since it is provided in English only.
```shell
python run.py \
    --vmx_path 🧩/path/to/macOSWorld.vmx \
    --ssh_pkey credential.pem \
    --gui_agent_name 🧩gpt-4o-2024-08-06 \
    --paths_to_eval_tasks 🧩./tasks/sys_apps ./tasks/safety \
    --languages 🧩task_en_env_en \
    --base_save_dir ./results/gpt_4o \
    --max_steps 15 \
    --snapshot_recovery_timeout_seconds 120 \
    --task_step_timeout 120
```
For debugging purposes, you can run only the testbench:
```shell
python testbench.py \
    --vmx_path 🧩/path/to/macOSWorld.vmx \
    --ssh_pkey credential.pem \
    --gui_agent_name 🧩gpt-4o-2024-08-06 \
    --paths_to_eval_tasks 🧩./tasks/sys_apps ./tasks/sys_and_interface ./tasks/productivity ./tasks/media ./tasks/file_management ./tasks/multi_apps \
    --languages 🧩task_en_env_en task_zh_env_zh task_ar_env_ar task_ja_env_ja task_ru_env_ru \
    --base_save_dir ./results/gpt_4o \
    --max_steps 15 \
    --snapshot_recovery_timeout_seconds 120 \
    --task_step_timeout 120
```
When the testbench is interrupted, the `base_save_dir` needs to be cleaned up before continuing. This cleanup is already integrated into `run.py`, but you can also perform it manually:

```shell
python cleanup.py --base_save_dir /path/to/base_save_dir
```

Clean up the `base_save_dir` before rerunning the testbench. Previously completed tasks will not be deleted or re-executed.
Use the provided Jupyter notebook, `scripts/display_progress.ipynb`, to view benchmark progress and results. The notebook provides a GUI that displays progress and results through a hierarchical menu.
After completing the benchmark, shut down the virtual machine in VMware Workstation. Alternatively, run:
```shell
vmrun -T ws stop "/path/to/macOSWorld.vmx" hard
```
To benchmark other agents, follow these two steps:

1. Create `agent/your_custom_agent.py`, either by modifying an existing agent or by starting from `agent/template_for_custom_agent.py`.
2. Register your agent in `agent/get_gui_agent.py`.
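A minimal sketch of what these two steps might look like (the class name, method signature, and registry shape here are hypothetical; the real interface is defined by `agent/template_for_custom_agent.py` and `agent/get_gui_agent.py`):

```python
# Hypothetical custom-agent sketch; the actual interface in
# agent/template_for_custom_agent.py may differ.
class YourCustomAgent:
    def __init__(self, model_name: str):
        self.model_name = model_name

    def predict(self, screenshot: bytes, instruction: str) -> str:
        # Map the current screenshot + task instruction to an action string.
        return "WAIT"  # placeholder action

# Registration mirrors the idea of agent/get_gui_agent.py:
# map a --gui_agent_name identifier to an agent class.
_AGENTS = {"your-custom-agent": YourCustomAgent}

def get_gui_agent(name: str):
    return _AGENTS[name](name)
```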
The VMware environments were built following https://www.youtube.com/watch?v=akgl2urRIyk and https://www.youtube.com/watch?v=PZXS3AztbmQ.
The authors thank Kevin Qinghong Lin, Zhiqiang Chen, Noorbakht Khan, Brandon Ng, Mingyu Ouyang, Siyuan Hu, Xiangwu Guo, Henry Hengyuan Zhao, Difei Gao, Christopher Rawles, and Kun Shao for their valuable discussions and feedback.
```bibtex
@article{macosworld,
  title={macOSWorld: A Multilingual Interactive Benchmark for GUI Agents},
  author={Pei Yang and Hai Ci and Mike Zheng Shou},
  year={2025},
  eprint={2506.04135},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.04135},
}
```