NEBULA: A Unified Ecosystem for Embodied AI Agent Development & Evaluation

Computer & Data Science Department
Case Western Reserve University

NEBULA unifies fragmented VLA datasets and APIs for cross-dataset training and benchmarking. It introduces a dual-axis evaluation (capability and stress testing) with controlled variable isolation for skill-specific diagnosis. With hierarchical task difficulty, multi-modal annotations, and visual performance summaries, NEBULA converts success rate into a diagnostic signal, exposing failure modes and reliability limits.

Abstract

The evaluation of Vision-Language-Action (VLA) agents is hindered by coarse end-task success metrics that fail to provide precise skill diagnosis or measure robustness to real-world perturbations. This challenge is exacerbated by a fragmented data landscape that impedes reproducible research and the development of generalist models. To address these limitations, we introduce NEBULA, a unified ecosystem for single-arm manipulation that enables diagnostic and reproducible evaluation. NEBULA features a novel dual-axis evaluation protocol that combines fine-grained capability tests for precise skill diagnosis with systematic stress tests that measure robustness. A standardized API and a large-scale, aggregated dataset are provided to reduce fragmentation and support cross-dataset training and fair comparison. Using NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities such as spatial reasoning and dynamic adaptation, weaknesses that are consistently obscured by conventional end-task success metrics. By measuring both what an agent can do and when it does so reliably, NEBULA provides a practical foundation for robust, general-purpose embodied agents.

Dual-Axis Evaluation Framework

Capability Test

Figure: Capability Tasks

Examples of NEBULA Capability Test tasks across six core capabilities (Control, Perception, Dynamic Adaptation, Language, Spatial Reasoning, and Robustness), organized into three difficulty levels. Tasks isolate specific skills with controlled complexity. Green marks objects, red marks targets, and blue indicates contextual cues. Bold underlined text shows actions; italic underlined text gives clarifications.

The core impact of this design is its diagnostic power. By isolating a single capability per task, it moves beyond ambiguous success rates to pinpoint the precise cause of failure. This transforms a simple performance metric into a clear, interpretable signal, revealing specific weaknesses (e.g., poor spatial reasoning) that traditional benchmarks would otherwise obscure.

Stress Test


This figure provides examples from the NEBULA Stress Test suite, illustrating the Stability and Adaptability tasks across three progressive difficulty levels. The Stability Score tasks evaluate the smoothness of an agent's actions by increasing the precision required, moving from a simple stack at Level 1 to a more complex, multi-object arrangement at Level 3. The Adaptability tasks assess an agent's capacity to adjust to dynamic changes during execution, beginning with a sudden object movement (Level 1), advancing to a mid-task instruction change (Level 2), and culminating in a command abort that requires rapid re-planning (Level 3).

The true impact of our Stress Tests lies in revealing an agent's breaking point. By applying targeted pressure, they move beyond success rates to uncover hidden bottlenecks and predict an agent's readiness for the real world. This diagnostic approach allows us to understand why an agent fails in different scenarios by directly linking its system responsiveness (e.g., inference speed) to its ability to handle different difficulty levels.

NEBULA Data

Native Dataset

Figure: NEBULA Dataset

To facilitate reproducible and scalable research, NEBULA provides two large-scale, aggregated dataset variants designed to balance completeness with ease of use. All data is collected within the NEBULA ecosystem, ensuring consistency across all tasks.

  • Alpha: The full-scale dataset containing over 222,000 expert demonstrations across the five core capability families. This version is generated entirely using expert trajectories from motion planning and is ideal for training robust, large-scale models.
  • Beta: A compact, lightweight version containing 10% of the data per task, designed for rapid development, prototyping, and ablation studies. For high-difficulty tasks, this version includes data from human teleoperation to capture more diverse and realistic behaviors.

Both datasets provide multimodal inputs, including videos, language instructions, and trajectories, and are available in standardized PyTorch and TFRecord formats to reduce integration overhead.
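To make the PyTorch path concrete, here is a minimal sketch of consuming NEBULA-style episodes with a standard DataLoader. Note that `NebulaEpisodeDataset` and all field names below are hypothetical stand-ins, not the released loader; consult the HuggingFace page for the actual formats.

```python
# Minimal sketch of consuming NEBULA-style episodes with PyTorch.
# NebulaEpisodeDataset and its field names are hypothetical stand-ins.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NebulaEpisodeDataset(Dataset):
    """Episode-level dataset over pre-deserialized samples."""

    def __init__(self, samples):
        # Each sample: {"rgb": (T, H, W, 3), "instruction": str, "actions": (T, A)}
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        ep = self.samples[idx]
        return {
            "rgb": torch.as_tensor(ep["rgb"], dtype=torch.float32),
            "instruction": ep["instruction"],
            "actions": torch.as_tensor(ep["actions"], dtype=torch.float32),
        }

# Toy usage with fabricated data, just to show the loop shape.
toy = [{"rgb": np.zeros((4, 64, 64, 3)), "instruction": "stack the red cube",
        "actions": np.zeros((4, 7))}]
for batch in DataLoader(NebulaEpisodeDataset(toy), batch_size=1):
    print(batch["rgb"].shape, batch["actions"].shape)  # (1, 4, 64, 64, 3), (1, 4, 7)
```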

To access the NEBULA datasets, please refer to our HuggingFace page.

Customize Your Data with NEBULA

Figure: Motion Planning and Teleoperation demos

NEBULA also provides the tools necessary for you to generate your own customized datasets. We support two primary methods for data collection: automated Motion Planning and manual Human Teleoperation.

  • Motion Planning: For generating large-scale expert trajectories automatically, you can use our motion planning pipeline. This process utilizes robot descriptions and motion planners within our simulation environments to generate optimal solutions and collect data systematically (a minimal collection-loop sketch follows this list).
  • Human Teleoperation: To capture more diverse and realistic behaviors for complex tasks, we provide an intuitive teleoperation interface. This system allows you to manually control the robot and record demonstrations. As shown, it is designed for cross-platform use, supporting both macOS and Ubuntu.
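As a rough sketch of what the automated path might look like, the loop below records observation-action pairs from a planned trajectory and keeps only successful episodes. Everything here (the env, `planner.plan_to_pose`, `goal_pose`, the `info["success"]` flag) is a hypothetical placeholder assuming a Gymnasium-style step API, not NEBULA's actual pipeline.

```python
# Illustrative collection loop; every name below is a hypothetical placeholder
# for NEBULA's motion-planning pipeline, assuming a Gymnasium-style env.
def collect_episode(env, planner, max_steps=200):
    """Roll out a planned trajectory and record (obs, action) pairs."""
    episode = []
    obs, info = env.reset()
    # Hypothetical: ask the planner for an action sequence reaching the goal.
    path = planner.plan_to_pose(env.unwrapped.goal_pose)
    for action in path[:max_steps]:
        episode.append({"obs": obs, "action": action})
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            break
    # Keep only successful demonstrations, mirroring expert-data filtering.
    return episode if info.get("success", False) else None
```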

For more details, please refer to our GitHub repository.

NEBULA Unified Data Platform


NEBULA introduces a unified data platform designed to streamline research and foster collaboration. Our platform ingests varied source data formats into a single, standardized NEBULA format, providing the core infrastructure for both unified training and reproducible evaluation. This allows researchers to focus on innovation rather than data engineering.

  • Unified & Structured Data Format: A standardized data schema gives robot interactions a consistent, episode-structured representation, consolidating observations, actions, and metadata for plug-and-play compatibility across datasets.
  • Extensible, Robot-Agnostic Architecture: Our platform is not hardcoded to specific hardware. Instead, robot properties are defined in a centralized configuration system, allowing support for various arm setups, gripper types, and sensor configurations.
  • Powerful and Intuitive Python SDK: Includes a high-level Python API that abstracts low-level data loading and indexing. This enables researchers to easily query, filter, sample data, and perform machine learning tasks like train-test splits with minimal code (see the sketch after this list).
  • Seamless Model Integration: Provides model input adapters for widely used VLA architectures. These adapters bridge the gap between the unified data format and a model's expected input, allowing for immediate benchmarking with minimal integration overhead.
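To make the SDK's intent concrete, here is a hedged guess at what querying, splitting, and model adaptation might look like. The `nebula` package name and every call below (`load_dataset`, `filter`, `split`, `sample`, `adapt_for`) are illustrative assumptions about the API surface described above, not confirmed signatures; consult the GitHub repository for the real interface.

```python
# Hypothetical SDK usage; all names are illustrative, not the actual API.
import nebula  # assumed package name

ds = nebula.load_dataset("alpha")                             # aggregated dataset
easy_control = ds.filter(capability="control", level="easy")  # metadata query
train, test = easy_control.split(ratio=0.9, seed=0)           # reproducible split

for episode in train.sample(8):
    print(episode.instruction, episode.actions.shape)

# Assumed adapter hook bridging unified episodes to a model's expected inputs.
batch = nebula.adapters.adapt_for("openvla", train.sample(8))
```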

NEBULA Simulation

This video showcases a variety of NEBULA tasks simulated across multiple skill dimensions, illustrating our platform's diverse manipulation scenarios. Each clip demonstrates interactions rendered from six distinct camera viewpoints and three sensory modalities (RGB, depth, and segmentation), enabling rich, multi-perspective observation of the agent's behavior. The tasks span key capability families including perception, control, spatial reasoning, and dynamic adaptation, highlighting NEBULA's support for structured, multi-view, and multi-modal benchmarking in embodied AI research.
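For intuition, an observation of the kind described here (six cameras, three modalities each) might be laid out as the nested dictionary below; the camera names, resolutions, and key names are assumptions for illustration only, not NEBULA's actual observation spec.

```python
# Assumed layout of a six-camera, three-modality observation (illustrative).
import numpy as np

CAMERAS = ["front", "back", "left", "right", "top", "wrist"]  # hypothetical names

obs = {
    cam: {
        "rgb": np.zeros((480, 640, 3), dtype=np.uint8),        # color image
        "depth": np.zeros((480, 640), dtype=np.float32),       # meters
        "segmentation": np.zeros((480, 640), dtype=np.int32),  # per-pixel object ids
    }
    for cam in CAMERAS
}
print(obs["front"]["rgb"].shape)  # (480, 640, 3)
```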

Current Bottlenecks

Figure: NEBULA capability analysis

This figure presents two radar charts summarizing model performance across six capability task families.
Left chart: mean and standard deviation of success rates across all models for each task family, broken down by three difficulty levels.
Right chart: average performance of individual models on Easy and Medium tasks.

=> Key Weaknesses: Spatial Reasoning remains a persistent bottleneck for most models, while nearly all fail at Dynamic Adaptation and Robustness.

Figure: NEBULA stress test results

This figure shows the stress test evaluations for four different models, comparing their performance across three increasing stress levels. The evaluation consists of four distinct tests: inference frequency, measured in Hertz; latency, measured in milliseconds; a stability score, on a scale from 0 to 1; and adaptability, measured by success rate.
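The page does not spell out the metric definitions, so the sketch below is one plausible reading of the four quantities; in particular, the jerk-based stability normalization is our assumption, not NEBULA's published formula.

```python
import numpy as np

def stress_metrics(step_times_s, actions, successes):
    """One assumed reading of the four stress-test quantities."""
    dt = np.diff(step_times_s)
    hz = 1.0 / dt.mean()                      # inference frequency (Hz)
    latency_ms = 1000.0 * dt.mean()           # per-step latency (ms)
    # Assumed stability score: penalize action jerk, squashed into (0, 1].
    jerk = np.diff(actions, n=2, axis=0)
    stability = 1.0 / (1.0 + np.linalg.norm(jerk, axis=-1).mean())
    adaptability = float(np.mean(successes))  # success rate under perturbation
    return hz, latency_ms, stability, adaptability

# Toy example with fabricated numbers (~20 Hz control loop).
t = np.cumsum(np.full(50, 0.05))
a = np.random.default_rng(0).normal(size=(50, 7))
print(stress_metrics(t, a, successes=[1, 0, 1, 1]))
```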

=> Key Weaknesses: Under pressure, most VLAs fail to handle dynamic changes due to critical drops in inference speed and control stability, indicating they are not ready for real-world deployment.

Acknowledgements

NEBULA is built on top of the excellent work from SAPIEN and ManiSkill3. We have leveraged their logic and assets throughout the development of our platform. We express our deep gratitude and sincere respect to their development teams for making such powerful tools openly available to the community.

Salute! 🫡

Citation

If you find our work helpful, please consider citing us.
