Abstract

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, or propagate trajectory-level labels uniformly across every timestep, which obscures localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, π₀, and π_0.5. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy–timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.

We propose Hide-and-Seek, a lightweight runtime failure detector that draws a novel connection between coarsely supervised learning and embodied failure detection. Rather than propagating trajectory-level labels uniformly across timesteps, Hide-and-Seek seeks out where failures hide within largely normal behavior. The framework consists of three key ideas:

Seeking Across Trajectories: An inter-trajectory contrastive loss enforces that the most failure-indicative step in a failure trajectory scores higher than the hardest false-positive step in a successful trajectory. This adaptively discovers the most salient failure signal across trajectories without assuming where the failure occurs, replacing uniform label propagation with a sharper, instance-level supervision signal.
Seeking Within a Trajectory: An intra-trajectory contrastive loss sharpens the temporal boundary between normal execution and the failure phase by encouraging average post-onset scores to exceed average pre-onset scores, anchored at a proxy failure onset that emerges naturally from the score dynamics. This converts coarse trajectory-level supervision into a temporally structured failure signal, without any step-level annotation.
Timely Runtime Monitoring: At deployment, Hide-and-Seek raises alarms via a time-varying threshold calibrated by functional conformal prediction. The detector achieves state-of-the-art performance with a practical accuracy–timeliness trade-off, running over 2,000× faster than VLM-based monitors.
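
The two contrastive objectives above can be illustrated with a minimal sketch over per-step failure scores. The text does not give the exact loss forms, so the hinge formulation, the `margin` parameter, and the `proxy_onset` heuristic (the step with the largest one-step rise in score) are assumptions made for illustration, not the authors' definitions.

```python
from typing import Sequence


def inter_trajectory_loss(fail_scores: Sequence[float],
                          success_scores: Sequence[float],
                          margin: float = 1.0) -> float:
    """Assumed hinge form: the highest-scoring step of a failed rollout
    should exceed the highest-scoring (hardest false-positive) step of a
    successful rollout by at least `margin`."""
    return max(0.0, margin - max(fail_scores) + max(success_scores))


def proxy_onset(scores: Sequence[float]) -> int:
    """Hypothetical onset proxy: index of the step with the largest
    one-step increase in score. The source only says the onset emerges
    from the score dynamics, so this rule is a stand-in."""
    if len(scores) < 2:
        raise ValueError("need at least two steps to locate an onset")
    rises = [scores[t] - scores[t - 1] for t in range(1, len(scores))]
    best = max(range(len(rises)), key=rises.__getitem__)
    return best + 1  # rises[i] is the change arriving at step i + 1


def intra_trajectory_loss(fail_scores: Sequence[float],
                          onset: int,
                          margin: float = 1.0) -> float:
    """Assumed hinge form: the mean score from the onset onward should
    exceed the mean score before the onset by at least `margin`."""
    pre, post = fail_scores[:onset], fail_scores[onset:]
    if not pre or not post:
        return 0.0  # no boundary to sharpen if one side is empty
    mean_pre = sum(pre) / len(pre)
    mean_post = sum(post) / len(post)
    return max(0.0, margin - (mean_post - mean_pre))


# Toy per-step scores for one failed and one successful rollout.
fail = [0.1, 0.2, 0.1, 0.9, 0.8]
success = [0.1, 0.3, 0.2, 0.1, 0.2]
onset = proxy_onset(fail)
total = inter_trajectory_loss(fail, success) + intra_trajectory_loss(fail, onset)
```

Both terms need only the trajectory-level label (failed or successful) to decide which role a rollout plays; no timestep is labeled individually. In practice the scores would come from a learned scoring network and the losses would be written in a differentiable framework.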

We evaluate Hide-and-Seek on two simulation benchmarks—LIBERO-10 and VLABench—as well as a real-world UFactory xArm 6 platform, using three representative VLA policies spanning both autoregressive and flow-matching paradigms: OpenVLA, π₀, and π_0.5. We compare against 12 action-based failure detection baselines spanning OOD detection, multi-sampling, classifier-based, and token uncertainty methods, and report balanced accuracy (bACC), weighted accuracy (wACC), and time-weighted accuracy (TWA) on both seen and unseen tasks.
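
For reference, balanced accuracy is conventionally the mean of the true-positive rate and the true-negative rate. The sketch below uses that standard definition, with a failed trajectory as the positive class; the function name and the boolean encoding are illustrative assumptions. The definitions of wACC and TWA are not given in this text, so they are not sketched here.

```python
from typing import Sequence


def balanced_accuracy(is_failure: Sequence[bool],
                      alarm_raised: Sequence[bool]) -> float:
    """Mean of the true-positive rate (failures that were flagged) and
    the true-negative rate (successes that were left unflagged)."""
    pairs = list(zip(is_failure, alarm_raised))
    n_pos = sum(1 for truth, _ in pairs if truth)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one failed and one successful trajectory")
    tp = sum(1 for truth, pred in pairs if truth and pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Two failed and four successful rollouts: one missed failure, one false alarm.
labels = [True, True, False, False, False, False]
alarms = [True, False, False, False, False, True]
bacc = balanced_accuracy(labels, alarms)  # TPR = 0.5, TNR = 0.75
```

Unlike plain accuracy, this metric is not inflated when successful rollouts heavily outnumber failed ones.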

Qualitative Results

Hide-and-Seek's failure scores align with observable failure indicators in the visual scene. The discovered onset point t_onset aligns with subtle early indicators of failure (e.g., a book starting to slip from the gripper), while the peak point t_max corresponds to the most salient and critical failure event (e.g., the robot dropping the object)—matching human intuition about when failure occurs. A failure is declared at the earliest timestep where the failure score (red curve) exceeds the conformal threshold (green region). The failure score remains consistently low across successful trajectories and rises sharply at the onset of failure.
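
The alarm rule described above (declare a failure at the earliest timestep whose score exceeds a time-varying threshold) can be sketched as follows. The calibration shown is a deliberately simplified stand-in for functional conformal prediction: it assumes equal-length successful calibration rollouts, uses the largest upward deviation from the mean score curve as the nonconformity score, and applies a constant-width band. The function names and the `alpha` parameter are assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence


def calibrate_threshold(success_runs: Sequence[Sequence[float]],
                        alpha: float = 0.1) -> List[float]:
    """Build a time-varying threshold from successful calibration rollouts.

    Simplified sketch: threshold[t] = mean score at t + q, where q is the
    conformal quantile of each rollout's largest upward deviation from
    the mean curve. All rollouts are assumed to have the same length."""
    n = len(success_runs)
    horizon = len(success_runs[0])
    mean = [sum(run[t] for run in success_runs) / n for t in range(horizon)]
    deviations = sorted(
        max(run[t] - mean[t] for t in range(horizon)) for run in success_runs
    )
    # Finite-sample conformal rank; clipped to n here for simplicity
    # (strictly, a rank above n would call for an infinite threshold).
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = deviations[rank - 1]
    return [m + q for m in mean]


def first_alarm(scores: Sequence[float],
                threshold: Sequence[float]) -> Optional[int]:
    """Return the earliest timestep where the score exceeds the
    threshold, or None if no alarm is raised."""
    for t, (score, limit) in enumerate(zip(scores, threshold)):
        if score > limit:
            return t
    return None


# Toy calibration set of successful rollouts (scores stay low throughout).
calibration = [
    [0.1, 0.2, 0.1],
    [0.2, 0.1, 0.2],
    [0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2],
]
threshold = calibrate_threshold(calibration, alpha=0.25)
alarm_step = first_alarm([0.1, 0.15, 0.6], threshold)  # score jumps at step 2
no_alarm = first_alarm([0.1, 0.1, 0.1], threshold)     # stays below the band
```

A smaller `alpha` widens the band, which lowers the false-alarm rate on successful rollouts at the cost of later detection; this is the accuracy–timeliness trade-off in miniature.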