Dual-Axis Evaluation Framework
Capability Test
Examples of NEBULA Capability Test tasks across six core capabilities (Control, Perception, Dynamic Adaptation, Language, Spatial Reasoning, and Robustness), organized into three difficulty levels. Each task isolates a specific skill at a controlled level of complexity. Green marks objects, red marks targets, and blue indicates contextual cues. Bold underlined text shows actions; italic underlined text gives clarifications.
The main strength of this design is its diagnostic power. Because each task isolates a single capability, a failure can be attributed to that capability instead of being folded into an ambiguous overall success rate. This turns a simple performance metric into a clear, interpretable signal, revealing specific weaknesses (e.g., poor spatial reasoning) that traditional benchmarks would obscure.
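To make the contrast concrete, the following is a minimal sketch of how per-capability results could be aggregated next to a single overall success rate. It is not NEBULA's evaluation code. The record layout, the function names, and the episode outcomes are all hypothetical and chosen only for illustration; only the capability names come from the text above.

```python
from collections import defaultdict

# Hypothetical episode records: (capability, difficulty_level, success).
# Outcomes are made up for illustration; they are not benchmark results.
EPISODES = [
    ("Control", 1, True),
    ("Control", 2, True),
    ("Perception", 1, True),
    ("Perception", 2, True),
    ("Spatial Reasoning", 1, False),
    ("Spatial Reasoning", 2, False),
    ("Language", 1, True),
    ("Language", 2, False),
]


def overall_success_rate(episodes):
    """Single aggregate number, as a conventional benchmark might report."""
    return sum(ok for _, _, ok in episodes) / len(episodes)


def success_rate_by_capability(episodes):
    """Success rate per capability, possible because each task targets one capability."""
    counts = defaultdict(lambda: [0, 0])  # capability -> [successes, total]
    for capability, _level, ok in episodes:
        counts[capability][0] += int(ok)
        counts[capability][1] += 1
    return {cap: successes / total for cap, (successes, total) in counts.items()}


def weakest_capability(episodes):
    """Capability with the lowest success rate (ties resolved by first occurrence)."""
    rates = success_rate_by_capability(episodes)
    return min(rates, key=rates.get)


if __name__ == "__main__":
    print("overall:", overall_success_rate(EPISODES))
    for cap, rate in success_rate_by_capability(EPISODES).items():
        print(f"{cap}: {rate:.2f}")
    print("weakest:", weakest_capability(EPISODES))
```

In this toy data the overall rate alone gives no hint about where the agent struggles, while the per-capability breakdown singles out one capability as the weak point.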
Stress Test
This figure provides examples from the NEBULA Stress Test suite, illustrating the Stability and Adaptability tasks across three progressive difficulty levels. The Stability Score tasks evaluate the smoothness of an agent's actions by increasing the precision required, moving from a simple stack at Level 1 to a more complex, multi-object arrangement at Level 3. The Adaptability tasks assess an agent's capacity to adjust to dynamic changes during execution, beginning with a sudden object movement (Level 1), advancing to a mid-task instruction change (Level 2), and culminating in a command abort that requires rapid re-planning (Level 3).
The value of our Stress Tests lies in revealing an agent's breaking point. By applying targeted pressure, they move beyond success rates to uncover hidden bottlenecks and indicate how ready an agent is for the real world. This diagnostic approach lets us understand why an agent fails in a given scenario by directly relating its system responsiveness (e.g., inference speed) to the difficulty levels it can handle.
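One way to read "breaking point" operationally is as the lowest difficulty level at which an agent's success rate drops below some threshold, reported alongside its inference speed. The sketch below follows that reading. The definition, the 0.5 threshold, the agent names, and all numbers are assumptions made for illustration and are not taken from NEBULA.

```python
def breaking_level(success_by_level, threshold=0.5):
    """Return the lowest difficulty level whose success rate is below
    `threshold`, or None if the agent stays at or above it on every level.

    Assumed definition for illustration; not NEBULA's official metric.
    """
    for level in sorted(success_by_level):
        if success_by_level[level] < threshold:
            return level
    return None


# Hypothetical agents: name -> (inference frequency in Hz, {level: success rate}).
AGENTS = {
    "fast_agent": (20.0, {1: 0.9, 2: 0.7, 3: 0.4}),
    "slow_agent": (2.0, {1: 0.6, 2: 0.3, 3: 0.1}),
}


def summarize(agents, threshold=0.5):
    """List (frequency_hz, name, breaking_level) rows, slowest agent first."""
    rows = [
        (hz, name, breaking_level(levels, threshold))
        for name, (hz, levels) in agents.items()
    ]
    return sorted(rows, key=lambda row: row[0])


if __name__ == "__main__":
    for hz, name, level in summarize(AGENTS):
        print(f"{name}: {hz} Hz, breaks at level {level}")
```

Placing inference frequency next to the breaking level in one table is what allows responsiveness to be compared against tolerated difficulty. With real measurements, many more agents and episodes would be needed before drawing any conclusion about that relationship.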