Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Seongheon Park¹, Wendi Li¹, Changdae Oh¹, Samuel Yeh¹,
Zsolt Kira², Michael Hagenow¹, Sharon Li¹
¹University of Wisconsin–Madison, ²Georgia Institute of Technology
[Figure: Hide-and-Seek teaser image]

Hide-and-Seek discovers localized failure-indicative actions in VLA trajectories using only trajectory-level labels, achieving state-of-the-art multi-task failure detection performance.

Abstract

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods rely on either expensive action resampling or external models, while alternatives propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, π0, and π0.5. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy–timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.

Hide-and-Seek: Coarsely Supervised Failure Detector for VLA Models

[Figure: Hide-and-Seek framework overview]

We propose Hide-and-Seek, a lightweight runtime failure detector that draws a novel connection between coarsely supervised learning and embodied failure detection. Rather than propagating trajectory-level labels uniformly across timesteps, Hide-and-Seek seeks out where failures hide within largely normal behavior. The framework consists of three key ideas:

  • Seeking Across Trajectories: An inter-trajectory contrastive loss enforces that the most failure-indicative step in a failure trajectory scores higher than the hardest false-positive step in a successful trajectory. This adaptively discovers the most salient failure signal across trajectories without assuming where the failure occurs, replacing uniform label propagation with a sharper, instance-level supervision signal.
  • Seeking Within a Trajectory: An intra-trajectory contrastive loss sharpens the temporal boundary between normal execution and the failure phase by encouraging average post-onset scores to exceed average pre-onset scores, anchored at a proxy failure onset that emerges naturally from the score dynamics. This converts coarse trajectory-level supervision into a temporally structured failure signal, without any step-level annotation.
  • Timely Runtime Monitoring: At deployment, Hide-and-Seek raises alarms via a time-varying threshold calibrated by functional conformal prediction. The detector achieves state-of-the-art performance with a practical accuracy–timeliness trade-off, running over 2,000× faster than VLM-based monitors.
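
The two contrastive objectives above can be sketched as margin-based hinge losses over per-step failure scores. This is a minimal illustration under stated assumptions, not the authors' implementation: the hinge form, the `margin` parameter, the function names, and the `proxy_onset` heuristic (the step with the largest one-step score increase) are all hypothetical choices introduced here. The text only states that the proxy onset emerges from the score dynamics.

```python
from typing import List


def inter_trajectory_loss(fail_scores: List[float],
                          success_scores: List[float],
                          margin: float = 1.0) -> float:
    """Hinge loss (assumed form): the peak score of a failure trajectory
    should exceed the peak score of a successful trajectory (its hardest
    false-positive step) by at least `margin`."""
    return max(0.0, margin - max(fail_scores) + max(success_scores))


def proxy_onset(scores: List[float]) -> int:
    """Hypothetical onset heuristic: index of the step that follows the
    largest one-step increase in score. Always in [1, len(scores) - 1]."""
    if len(scores) < 2:
        raise ValueError("need at least two steps to locate an onset")
    jumps = [scores[t] - scores[t - 1] for t in range(1, len(scores))]
    return 1 + max(range(len(jumps)), key=jumps.__getitem__)


def intra_trajectory_loss(fail_scores: List[float],
                          margin: float = 1.0) -> float:
    """Hinge loss (assumed form): the mean post-onset score should exceed
    the mean pre-onset score by at least `margin`."""
    onset = proxy_onset(fail_scores)
    pre, post = fail_scores[:onset], fail_scores[onset:]
    pre_mean = sum(pre) / len(pre)
    post_mean = sum(post) / len(post)
    return max(0.0, margin - post_mean + pre_mean)
```

Both losses consume only per-step scores and a trajectory-level label (which trajectory failed), so no step-level annotation enters the computation. In practice the scores would come from a learned detector and the losses would be written in a differentiable framework; plain Python is used here only to keep the logic visible.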

Quantitative Results

We evaluate Hide-and-Seek on two simulation benchmarks—LIBERO-10 and VLABench—as well as a real-world UFactory xArm 6 platform, using three representative VLA policies spanning both autoregressive and flow-matching paradigms: OpenVLA, π0, and π0.5. We compare against 12 action-based failure detection baselines spanning OOD detection, multi-sampling, classifier-based, and token uncertainty methods, and report balanced accuracy (bACC), weighted accuracy (wACC), and time-weighted accuracy (TWA) on both seen and unseen tasks.
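
As a point of reference for the first metric, balanced accuracy is conventionally the mean of the true-positive and true-negative rates. The sketch below assumes that convention, treats failure as the positive class (label 1), and uses a hypothetical function name. The weighted and time-weighted variants are not defined on this page, so they are not sketched.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the true-positive rate and true-negative rate.

    Assumption: label 1 marks a failure trajectory, label 0 a success.
    """
    pairs = list(zip(y_true, y_pred))
    pos = [p for t, p in pairs if t == 1]  # predictions on failure trajectories
    neg = [p for t, p in pairs if t == 0]  # predictions on successful trajectories
    if not pos or not neg:
        raise ValueError("both classes must be present")
    tpr = sum(1 for p in pos if p == 1) / len(pos)
    tnr = sum(1 for p in neg if p == 0) / len(neg)
    return 0.5 * (tpr + tnr)
```

Averaging the two rates keeps the metric informative when successful trajectories heavily outnumber failures, or the reverse.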

Qualitative Results


Hide-and-Seek's failure scores align with observable failure indicators in the visual scene. The discovered onset point t_onset aligns with subtle early indicators of failure (e.g., a book starting to slip from the gripper), while the peak point t_max corresponds to the most salient and critical failure event (e.g., the robot dropping the object), matching human intuition about when failure occurs. A failure is declared at the earliest timestep where the failure score (red curve) exceeds the conformal threshold (green region). The failure score remains consistently low across successful trajectories and rises sharply at the onset of failure.
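
The alarm rule described above (declare failure at the earliest step whose score exceeds a time-varying threshold) can be sketched as follows. To keep the example self-contained, the threshold is calibrated with a simpler pointwise split-conformal quantile over successful calibration rollouts. This is a stand-in for, not an implementation of, the functional conformal prediction the method uses. The function names, the `alpha` parameter, and the equal-length assumption are all introduced here for illustration.

```python
import math
from typing import List, Optional, Sequence


def pointwise_thresholds(calib_scores: Sequence[Sequence[float]],
                         alpha: float = 0.1) -> List[float]:
    """Per-timestep conformal quantile over successful calibration rollouts.

    Assumptions: every calibration trajectory has the same length, and a
    pointwise quantile replaces functional conformal prediction.
    """
    n = len(calib_scores)
    horizon = len(calib_scores[0])
    # Standard split-conformal rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration rollouts for this alpha: never raise an alarm.
        return [math.inf] * horizon
    return [sorted(traj[t] for traj in calib_scores)[rank - 1]
            for t in range(horizon)]


def first_alarm(scores: Sequence[float],
                thresholds: Sequence[float]) -> Optional[int]:
    """Earliest timestep whose score strictly exceeds its threshold, else None."""
    for t, (s, th) in enumerate(zip(scores, thresholds)):
        if s > th:
            return t
    return None
```

A rollout whose scores stay at or below the band at every step returns `None`, which corresponds to the success case where no failure is declared.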

Success

Failure: Robot lets the alphabet soup slip and collides with nearby objects.

Success

Failure: Robot mis-contacts the top of the moka pot and drifts away.

Success

Failure: Robot misgrasps the moka pot and drifts away.

Success

Failure: Robot fails to grasp the alphabet soup and stops moving.

Success

Failure: Robot gets stuck after placing the mug, then drifts away without closing the microwave.

BibTeX

[.]