The Neverwhere Visual Parkour Benchmark Suite

Authors anonymized




Abstract

State-of-the-art visual locomotion controllers are increasingly capable of handling complex visual environments, which makes evaluating their real-world performance before deployment increasingly difficult. This work aims to narrow this training/evaluation gap with a collection of hyper-photorealistic, closed-loop evaluation environments - the Neverwhere Benchmark Suite - comprising over sixty 3D Gaussian Splatting reconstructions of urban indoor and outdoor scenes. Our goal is to encourage large-scale, reproducible robot evaluation by making it easier to create Gaussian-splat-based reconstructions and integrate them into simulated continuous-testing setups. We also underscore the potential pitfalls of relying exclusively on 3D Gaussian-generated data for training by providing policy checkpoints trained on multiple Neverwhere scenes, together with their performance when evaluated in novel scenes. Our analysis illustrates the necessity of sourcing diverse data to ensure performance.

[Figure: Method overview]

The Neverwhere Benchmark Suite. We offer over sixty high-quality Gaussian-splatting-based evaluation environments, along with the Neverwhere graphics toolchain for producing accurate collision meshes. Our aim is to promote reproducible robotics research via fully automated, continuous testing in closed-loop evaluation.



Tasks, Scenes, and Evaluation

Tasks

For closed-loop evaluation in our benchmark, we consider four challenging locomotion tasks: hurdles, gaps, ramps, and stairs.

Scene Setups

Each scene comprises three key components:

1. Gaussian Splats for realistic scene appearance modeling.
2. Collision Meshes for accurate contact modeling and robot simulation.
3. Scene Configuration for defining the scene layout and specific tasks.

Within the scene configuration, we label waypoints on the collision geometry of each scene. These waypoints define a specific trail, clearly outlining the exact task and target path for the robot.
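To make this concrete, here is a minimal sketch of what such a scene configuration could look like. Every field name and path below is illustrative, not the actual Neverwhere schema.

    # Illustrative scene configuration; all field names and paths are
    # hypothetical, not the actual Neverwhere schema.
    scene_config = {
        "splat_file": "scenes/stairs_campus_01/splat.ply",     # Gaussian splats (appearance)
        "collision_mesh": "scenes/stairs_campus_01/mesh.obj",  # collision geometry (contacts)
        "task": "stairs",
        # Waypoints labeled on the collision geometry; visited in order,
        # they define the trail the robot must follow.
        "waypoints": [
            [0.0, 0.0, 0.0],
            [1.5, 0.0, 0.2],
            [3.0, 0.0, 0.4],
        ],
    }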

[Figure: Task definition and scene labeling]

Rendering Pipeline

Diverse Rendering Wrappers for Observation Modalities: Our system provides a set of rendering wrappers that support various observation modalities for robot perception, including Gaussian-splat rendering wrappers, depth wrappers, point-cloud wrappers, segmentation wrappers, and more.
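As a rough illustration, the sketch below shows how an observation wrapper of this kind might compose with a gym-style environment. The class, the `depth_buffer` key, and the environment interface are assumptions made for illustration, not Neverwhere's actual API.

    import numpy as np

    class DepthWrapper:
        """Hypothetical observation wrapper: replaces the RGB observation
        with a clipped, normalized depth image. The gym-style environment
        interface and the `depth_buffer` key are assumed for illustration."""

        def __init__(self, env, max_depth=10.0):
            self.env = env
            self.max_depth = max_depth  # clip range in meters

        def reset(self):
            return self.env.reset()

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            # Swap the observation for normalized depth in [0, 1].
            depth = np.clip(info["depth_buffer"], 0.0, self.max_depth)
            return depth / self.max_depth, reward, done, info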

[Figure: Rendering pipeline]

Neverwhere Data Collection Toolchain

Along with the benchmark scene assets, we present a data-collection toolchain that facilitates the generation of new benchmark environments with minimal human intervention, allowing users to create reconstructed scenes directly from uncalibrated images or videos.

[Figure: Neverwhere toolchain pipeline]

The toolchain takes multi-view images as input and proceeds through six sequential stages (a sketch of the first stage follows the list):

  1. A Structure-from-Motion module is applied to obtain camera calibrations.
  2. An optimization-based multi-view stereo (MVS) module estimates the scene geometry.
  3. Points are sampled from the textured meshes.
  4. Enhanced 3DGS training is run with improved initialization and refined geometric constraints.
  5. The mesh is reoriented and rescaled to align with real-world conventions.
  6. Waypoints are labeled for the visual parkour tasks.
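
As a concrete example of stage 1, the sketch below runs Structure-from-Motion with pycolmap, one common choice for camera calibration. Neverwhere's actual SfM module may differ, and the paths are illustrative.

    # Stage 1 sketch: camera calibration via Structure-from-Motion with
    # pycolmap. Paths are illustrative; the benchmark's actual SfM module
    # may differ.
    import os
    import pycolmap

    image_dir = "captures/stairs_campus_01"   # uncalibrated multi-view images
    workspace = "workspace/stairs_campus_01"
    database = f"{workspace}/database.db"
    os.makedirs(workspace, exist_ok=True)

    pycolmap.extract_features(database, image_dir)   # detect and describe keypoints
    pycolmap.match_exhaustive(database)              # pairwise feature matching
    maps = pycolmap.incremental_mapping(database, image_dir, workspace)
    maps[0].write(workspace)                         # camera intrinsics + poses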

Experiments

We designed Neverwhere to be used as part of an automated, continuous-testing setup that assesses a policy in closed-loop simulation, quickly and at scale, before its real-world deployment. Although Neverwhere is primarily designed for closed-loop evaluation, we also offer an additional closed-loop training experiment to support further research.
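A minimal sketch of what such an automated evaluation sweep could look like follows, assuming a gym-style environment API; `make_env`, the `reached_goal` flag, and the scene identifiers are hypothetical names introduced for illustration.

    # Hypothetical evaluation sweep over a set of scenes; `make_env` and
    # the `reached_goal` flag are illustrative, not the benchmark's API.
    def evaluate(policy, make_env, scene_ids, episodes_per_scene=10):
        success_rates = {}
        for scene_id in scene_ids:
            env = make_env(scene_id)
            successes = 0
            for _ in range(episodes_per_scene):
                obs, done, info = env.reset(), False, {}
                while not done:
                    obs, reward, done, info = env.step(policy(obs))
                # An episode counts as a success if the final waypoint is reached.
                successes += int(info.get("reached_goal", False))
            success_rates[scene_id] = successes / episodes_per_scene
        return success_rates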

Closed-Loop Training

We performed closed-loop training using a teacher-student behavior cloning approach as shown below.
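As a rough illustration of such a loop (not the exact training recipe), the sketch below has a privileged teacher label actions while the vision-based student is rolled out and regressed onto those labels. The observation keys and network interfaces are assumptions made for illustration.

    # Sketch of teacher-student behavior cloning: a privileged teacher
    # (e.g., with ground-truth state) supervises a vision-based student
    # that is rolled out in the environment (DAgger-style). Observation
    # keys and interfaces are illustrative.
    import torch
    import torch.nn.functional as F

    def bc_update(student, teacher, optimizer, env, horizon=200):
        obs = env.reset()
        losses = []
        for _ in range(horizon):
            with torch.no_grad():
                target = teacher(obs["state"])   # privileged action label
            action = student(obs["rgb"])         # student acts from vision
            losses.append(F.mse_loss(action, target))
            obs, _, done, _ = env.step(action.detach())
            if done:
                obs = env.reset()
        loss = torch.stack(losses).mean()
        optimizer.zero_grad()
        loss.backward()   # backprop through all student forward passes
        optimizer.step()
        return loss.item()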

Single-Scene Training Results

The student policies perform well on the training scene but generalize poorly to unseen scenes, which matches our expectations since the training domain is limited. One interesting finding is that performance improves slightly on some unseen scenes that share similar visual characteristics with the training domain (e.g., both being outdoor environments).

[Figure: Single-scene training results]

Multi-Scene Training Results

The performance gap between training and evaluation sets is large (about 50% on average) for the Stairs task, but relatively small (about 10% on average) for the Gaps task. This suggests that the trained visual policies exhibit limited generalization on our benchmark, particularly for more challenging tasks.

[Figure: Multi-scene training results]