1. Introduction
Vision–Language–Action (VLA) models have rapidly become the dominant paradigm for general-purpose robot policy learning, fusing pretrained vision and language backbones with a low-level action head trained on large-scale demonstration corpora [
1,
2]. Despite remarkable in-distribution behaviour in simulation, sim-to-real transfer of these monolithic policies remains brittle: average success rates on unseen real-world tabletop tasks frequently drop by 30–50 percentage points relative to their simulation counterparts. The dominant explanations have been domain gap in pixel statistics [
3], mis-calibrated dynamics [
4], and task-distribution mismatch [
5]. We argue that an under-examined source of failure is the
symmetry gap: the geometric and bilateral invariances that govern manipulation are present implicitly in pixels but never made explicit in the policy. We provide a diagnostic experiment in
Section 4.1 that, on the OpenVLA baseline, finds the per-task mirror-consistency violation rate to be a stronger predictor of real-world failure (
) than lighting variance (
) or contact-noise estimates (
), motivating an explicit treatment of symmetry.
Symmetry has long been recognised as a powerful inductive bias [
7,
8]. In manipulation, two classes of geometric symmetry are particularly salient and admit a clean group action on the state–action space. First, the
bilateral (
) symmetry of dual-arm robots and of many objects (cups, drawers, cloth) implies that mirror-reflected demonstrations should yield mirror-reflected actions. Second, a workspace subgroup
of
, namely arbitrary translations together with rotations about the table normal, captures the equivariance of contact-rich tabletop actions under nuisance pose variation. Crucially,
does
not commute with
as a direct product (e.g.
and
); instead,
acts on
by automorphism, giving the
semidirect product
where
is the lateral-axis reflection. We deliberately exclude rotations about horizontal axes (
,
): they would form a representation of
but lie outside any closed subgroup well-behaved under
. We treat cross-modal paraphrase and visual alignment as a soft equivalence relation rather than a group action, and bundle them into a regularisation term
. This distinction—group action vs. soft alignment—turns out to be essential for interpreting the per-axis equivariance error in
Section 4.
We propose
SymBridge, a symmetry-aware VLA framework with four components. (i) A head-mounted display teleoperation rig built around a Meta Quest 3 and dual Franka Panda arms collects high-fidelity bimanual demonstrations, with end-to-end teleop latency below 25 ms. Body-anchored hand tracking yields kinematically natural trajectories that contain richer symmetry structure than scripted simulation rollouts. (ii) A
symmetry-preserving augmentation pipeline applies on-the-fly bilateral-mirror and
-pose transformations to every batch, with a consistency penalty enforcing that the policy commutes with the group action. (iii) A
dual-pathway encoder processes paired sim/real observations through a shared visual backbone but with separate domain heads, regularised by a sim–real contrastive objective. (iv) An
equivariant action decoder based on a steerable mixer head whose mixing weights are constrained by Equation (
10) to commute with the bilateral generator, yielding decoder-level exact
-equivariance under input-feature paring; full-policy bilateral equivariance is encouraged—not architecturally guaranteed—by
, and equivariance with respect to
is augmentation-enforced.
We evaluate SymBridge on a benchmark of 28 bimanual tabletop tasks, 16 used for training and 12 reserved as unseen real-world evaluations. Under a controlled-variable ablation that fixes backbone, demonstration count, and training schedule and toggles only the symmetry components, SymBridge raises the average real-world success rate from 47.3% (OpenVLA baseline) to 78.9%, exceeds strong equivariant baselines (EquiBot [
11] and EquiAct [
12]) by 12–15 percentage points, reduces the sim-to-real Wasserstein-2 distance between trajectory distributions by 41%, and lowers the trained-axes equivariance error from 0.184 to 0.054 rad. Bilateral mirror augmentation alone contributes
percentage points on bimanual tasks; HMD demonstration quality contributes a further
percentage points compared with retargeted scripted demonstrations.
Our contributions are: (1) a
geometrically correct factorisation of manipulation symmetries on the workspace-restricted
semidirect product group (
1), with a clean separation between group actions and soft alignment objectives; (2) a symmetry-aware VLA architecture and training objective whose ablation shows additive gains for each subgroup; (3) an HMD teleoperation pipeline that produces demonstrations with measurably stronger bilateral symmetry; and (4) a diagnostic study that ties baseline symmetry violations to real-world failure rates, together with a controlled-variable comparison against equivariant baselines under matched data and compute. We do not claim to be the first to combine bilateral and rotational equivariance; we claim that, when the group is correctly restricted and combined with a decoder-level bilateral-equivariant architecture under paired feature representation and symmetry-aware HMD data, the resulting policy outperforms prior equivariant designs on bimanual tabletop sim-to-real.
Figure 1 summarises the system.