Environment-Specific Skill Transfer for Decentralized Multi-Robot Navigation via Hybrid RRT and Behavior Cloning in Grid-Based Industrial Environments

Yovel Atia; Chen Giladi

doi:10.20944/preprints202606.0456.v1

Submitted: 04 June 2026

Posted: 05 June 2026


Abstract

Decentralized multi-robot navigation in grid-based industrial environments, such as automated warehouses, must reach goals and avoid collisions without centralized control or direct robot-to-robot communication. We study a hybrid framework pairing an offline Rapidly-Exploring Random Tree (RRT) expert with a trained Behavior Cloning (BC) local policy and route reuse, evaluated in a fully reproducible, deterministic, seeded simulator. Our central result is that navigation quality is governed by how well the route library matches the deployment environment: a library generated for one map and deployed unchanged on another leaves robots blocked by the new obstacles, whereas an environment-adapted library transfers the learned skill and cuts collisions by 37-85% and task failures by 43-69% across fleets of two to ten robots (15 seeds; Mann-Whitney U, all p < 10^-5). An online RRT baseline attains a lower collision rate, but only through costly frequent replanning, so the environment-adapted hybrid recovers most of its navigation quality while reusing pre-computed routes. We further evaluate an optional, communication-free collision-history-sharing add-on; within the full framework its benefit is limited and layout-dependent, a nominally significant 13% collision reduction on one map that does not survive multiple-comparison correction.

Keywords: decentralized multi-robot navigation; behavior cloning; RRT; skill transfer; imitation learning; collision-history sharing; reproducibility; grid-based environments; warehouse robotics

Subject: Computer Science and Mathematics - Robotics

1. Introduction

Mobile robots must reach their goals while avoiding collisions with obstacles [1,2]. In structured industrial settings, automated warehouses in particular, fleets of such robots must additionally avoid collisions with one another [3]. Operating such fleets in a decentralized way, with each robot deciding from local information and with neither a central coordinator nor direct inter-robot communication [4], improves robustness and scalability but makes coordination harder; in dense decentralized fleets, the dominant failure modes are deadlock and collision [5].

Two families of methods dominate grid navigation. Sampling-based planners such as RRT explore the configuration space and return feasible paths [6], but in dynamic, multi-robot settings, they incur frequent, costly replanning. Learning-based policies such as Behavior Cloning (BC) give fast, low-latency reactive control by imitating expert demonstrations, but degrade over long horizons and when the deployment environment diverges from the training data [7]. A hybrid framework that uses a pre-computed RRT route library together with a BC local policy [8]—reusing well-tested paths while reacting locally—can capture the strengths of both.

We study exactly such a framework and ask a question central to its practical use: when does the learned navigation transfer to a new environment, and what governs its quality? Our answer, obtained from a fully reproducible re-implementation that reuses the original trained network, RRT planner and route data verbatim, is that the decisive factor is the match between the route library and the deployment map. A library generated on one map and deployed unchanged on another leaves robots colliding with the new obstacles; an environment-adapted library transfers the skill and sharply reduces both collisions and task failures. We also compare against an online RRT baseline, clarifying the computational trade-off that motivates route reuse. Finally, we evaluate, across repeated seeded trials, a lightweight add-on that needs no direct inter-robot messaging: collision-history sharing, proposed as future work in the framework’s originating study (an earlier unpublished graduate thesis by the first author, provided as supplementary material), which we re-implement faithfully here.

Our contributions are:

A faithful, deterministic, seeded re-implementation of a hybrid RRT + Behavior Cloning decentralized navigation framework that reuses the original trained BC network, RRT planner, and route datasets, with the non-scalable multiprocessing prototype replaced by a reproducible single-process scheduler (code and raw results provided with the submission).
A quantification of environment-specific skill transfer: an environment-adapted route library cuts collisions by 37–85% (Mann–Whitney U on per-episode collision counts, all $p < 10^{-5}$) and task failures by 43–69% relative to a Map2-naive library across five fleet sizes (2, 4, 6, 8, 10 robots), recovering most of the navigation quality of an online RRT baseline at a fraction of its planning cost.
A repeated-seed evaluation of collision-history sharing (indirect coordination through a shared map, with no direct messaging) within the full framework, finding only a limited benefit (full participation: a nominally significant 13% fewer collisions on Map1, $p = 0.023$, which does not survive Holm correction; selective participation and a second layout: not significant). This corrects the originating thesis’s preliminary finding, a near-halving of collisions inferred from a single unseeded run, and shows that the effect is undetectable at low task throughput, where too few tasks accumulate to populate the shared map; we document this regime explicitly.
A transparent, reproducible methodology in which every figure and number regenerates from the provided code and raw results, including the route-filtering and throughput conditions that determine whether each effect appears.

2. Related Work

Decentralized multi-robot navigation. Decentralized schemes avoid the single point of failure of centralized coordinators; approaches include cooperative reinforcement learning with a shared policy and inter-robot information exchange [3] and multi-agent RL more broadly [9], control-barrier functions [10,11], optimization-based trajectory planning [12], and swarm-intelligence approaches (reviewed for multi-UAV systems in [13]). Coping with deadlock and collision in dense fleets remains a persistent difficulty, including for decentralized aerial swarms [5].

Planning and learning. A* and sampling-based planners (RRT/RRT*) underpin robot path planning [6,14,15,16]; Behavior Cloning learns reactive policies from expert demonstrations [8,17,18,19,20,21], with well-known covariate-shift limits when the deployment distribution differs from the demonstrations [7]. Hybrid planner–policy methods combine a model-based planner with a learned model that guides it [22]; the framework we reuse couples an RRT expert, a BC policy, and route reuse and, unlike hybrids that use BC only to smooth or refine planner paths, uses BC for decentralized reactive control and online RRT only for connecting segments.

Skill transfer and adaptation. Transferring a learned policy to a new deployment environment (for example, from simulation to a real robot) is a core challenge for imitation-learning methods [23]; performance hinges on alignment between training and deployment distributions [7]. We give a concrete grid-navigation instance: adapting the expert route library to the deployment map restores performance lost to environment mismatch.

Experience sharing. Multi-agent policies can be improved by modeling other agents’ behaviors and interactions [24,25], by training against diverse partner populations [26], and by leveraging human experience [27]; we study a minimal instance that uses no direct messaging, only a shared collision-frequency map, and evaluate it across repeated seeded trials.

3. Materials and Methods

All components are implemented in a deterministic, seeded simulator provided with the paper (manuscript/code/reproducible/); every result regenerates from the provided code, with reproduction instructions provided in the public repository. The simulator reuses the original implementation verbatim—the trained BC network (trained_model_NN.pth), the RRT planner, the offline route datasets and the per-step navigation logic—and replaces only the original multiprocessing / shared-memory harness, which did not scale, with a single-process round-robin scheduler so that every episode is reproducible from its seed.
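The reproducibility property described above (one process, a fixed visiting order, and a single seed) can be pictured with a minimal sketch. This is not the provided simulator: `run_episode`, the random-walk robot step, and all parameter names are hypothetical stand-ins, and the only point illustrated is that a seeded generator plus round-robin scheduling makes an episode repeat exactly.

```python
import random


def run_episode(seed, n_robots=3, n_steps=5, grid=50):
    """Toy single-process episode with one seeded RNG and round-robin order.

    Hypothetical stand-in for the paper's simulator: each robot simply takes
    a random unit step, clipped to the grid boundary.
    """
    rng = random.Random(seed)  # all randomness flows from this one seed
    robots = [(rng.randrange(grid), rng.randrange(grid)) for _ in range(n_robots)]
    trace = []
    for tick in range(n_steps):
        for i in range(n_robots):  # round-robin: robot 0, 1, 2, ... on every tick
            x, y = robots[i]
            dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x = min(max(x + dx, 0), grid - 1)
            y = min(max(y + dy, 0), grid - 1)
            robots[i] = (x, y)
            trace.append((tick, i, x, y))
    return trace
```

Calling `run_episode(7)` twice returns identical traces, which is the behavior a multiprocessing harness with nondeterministic interleaving cannot guarantee.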

3.1. Environment and Metrics

N robots navigate a $50 \times 50$ occupancy grid in one of two layouts (Map1, Map2; Figure 1); obstacles are inflated by one cell to account for the robot footprint. The base navigation is fully decentralized: each robot acts on local observations only, with no shared state and no inter-robot communication. The optional collision-history sharing of Section 3.4 is the single exception: it introduces one shared data structure (a global collision-frequency map) through which a subset of robots coordinate indirectly, in a stigmergy-style manner, without exchanging messages; we evaluate it separately as an optional add-on rather than as part of the base framework. Over a fixed step budget, a task is a robot reaching its goal, after which a new random start/goal pair is drawn; tasks completed measures throughput. A collision is a predicted blocked-move event, in which the robot’s three-step look-ahead lands in a cell occupied by a static obstacle or by another robot; it is counted per event, matching the originating study’s definition (interactions with obstacles or other agents), and is not a physical contact. A task that cannot make progress within a step cap is counted as a failure; the failure rate is the number of failed tasks divided by the total number of task attempts (completed plus failed). Because a fixed step budget yields different numbers of completed tasks across conditions, we report collisions both absolutely and per completed task, the workload-normalized metric.
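The two derived metrics defined above reduce to simple ratios. The sketch below is illustrative only: the function name `episode_metrics` and the example counts are hypothetical, and the handling of an episode with zero attempts or zero completed tasks is an assumption not stated in the text.

```python
def episode_metrics(collisions, completed, failed):
    """Return (failure_rate, collisions_per_completed_task) for one episode.

    failure_rate  = failed / (completed + failed)
    coll_per_task = collisions / completed   (workload-normalized metric)
    """
    attempts = completed + failed
    # Assumption: an episode with no attempts has failure rate 0.
    failure_rate = failed / attempts if attempts else 0.0
    # Assumption: the normalized rate is undefined (NaN) with no completed task.
    coll_per_task = collisions / completed if completed else float("nan")
    return failure_rate, coll_per_task


# Hypothetical counts, not taken from the paper's tables.
rate, per_task = episode_metrics(collisions=300, completed=240, failed=10)
```

With these made-up counts the failure rate is 10/250 = 0.04 and the normalized rate is 300/240 = 1.25 collisions per completed task.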

3.2. Hybrid RRT + Behavior Cloning Framework

An offline RRT expert generates a library of collision-free routes on a map. At run time each robot (i) selects, from the library, routes whose bounding box brings it toward its goal and that are locally obstacle-free (the Get_best_routes filter); (ii) drives toward the selected route under the trained BC network, which maps the robot’s position, a target waypoint on that route, and a $5 \times 5$ local occupancy window to an incremental move, so the chosen route enters the policy through its target-waypoint input; and (iii) reconnects with a bounded online RRT segment toward the goal in two cases (Algorithm 1): as a fallback when no suitable library route is available (the nearest is too far to reach directly), and, when a three-step look-ahead along the chosen route predicts a blocked cell, probabilistically, logging the event and reconnecting with a probability that decreases as the task nears completion (more readily early in a task than near the goal). The BC network is a multilayer perceptron with 29 inputs (2 position + 2 target-waypoint + 25 occupancy) and hidden layers 256–256–64. We use the original trained weights and RRT/route code without modification. Figure 2 summarizes the offline and deployment phases, the skill-transfer test condition, and the optional collision-history-sharing add-on; Algorithm 1 states the per-robot run-time decision step, and Table 1 lists the implementation settings, including the online-RRT iteration cap. In Algorithm 1, robots interact only through the shared map H: every robot writes its own collisions to H, while only the sharing subset (a fraction $q_{share}$) reads the aggregate to bias route choice; there is no direct robot-to-robot messaging. The routine $\mathrm{OnlineRRT}(x, g)$ plans a fresh collision-free segment from x toward goal g, bounded by that cap.
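To make the 29-dimensional policy input concrete, the sketch below assembles it from a position, a target waypoint, and a $5 \times 5$ occupancy window. The ordering of the components, the absence of any normalization, and the treatment of out-of-grid cells as occupied are all assumptions made for illustration; the original network's exact input encoding is not specified in the text, and `local_window` and `bc_input` are hypothetical names.

```python
def local_window(grid, x, y, size=5):
    """Return the size x size occupancy patch centred on (x, y), row-major.

    grid is indexed as grid[row][col], i.e. grid[y][x]. Cells outside the
    grid are treated as occupied (an assumption, not from the paper).
    """
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    patch = []
    for r in range(y - half, y + half + 1):
        for c in range(x - half, x + half + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            patch.append(grid[r][c] if inside else 1)
    return patch


def bc_input(grid, position, waypoint):
    """Concatenate 2 position + 2 target-waypoint + 25 occupancy values."""
    x, y = position
    return [x, y, waypoint[0], waypoint[1]] + local_window(grid, x, y)


empty_map = [[0] * 50 for _ in range(50)]
features = bc_input(empty_map, position=(10, 10), waypoint=(12, 15))
```

The resulting list has 2 + 2 + 25 = 29 entries, matching the input width of the network described above.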

Algorithm 1: Per-robot decision step, run independently by each robot i on every scheduler tick. A predicted blocked-move is a counted event (Section 3.1), not a physical crash, and does not by itself halt the robot.

Require: map M, route library R, BC policy $\pi_{BC}$, shared map H, sharing flag $z_{i}$, robot state $x_{i}$, goal $g_{i}$, counters $c_{i}$ (steps) and $h_{i}$ (collisions)
1: $C \leftarrow \mathrm{Get\_best\_routes}(R, x_{i}, g_{i}, M)$
2: if $z_{i}$ and $C \neq \emptyset$ then
3:     $r^{\star} \leftarrow \arg\min_{r \in C} \sum_{c \in r} H[c]$ ▹ sharing robots prefer low-H routes
4: else if $C \neq \emptyset$ then
5:     $r^{\star} \leftarrow$ route in C nearest $x_{i}$
6: else
7:     $r^{\star} \leftarrow$ current route if still valid, else $\mathrm{OnlineRRT}(x_{i}, g_{i})$ ▹ nearest route too far / none available
8: end if
9: $w \leftarrow$ next target waypoint on $r^{\star}$
10: $a \leftarrow \pi_{BC}(x_{i}, w, W_{5 \times 5})$ ▹ BC drives toward w
11: if three-step look-ahead along $r^{\star}$ predicts a blocked cell (obstacle or robot) then
12:     $\mathit{collisions} \leftarrow \mathit{collisions} + 1$; record cell in $h_{i}$ ▹ counted event, not a crash
13:     with prob. $1 - \rho_{i}$: $r^{\star} \leftarrow \mathrm{OnlineRRT}(x_{i}, g_{i})$; recompute $w, a$ ▹ $\rho_{i}$: fraction of $r^{\star}$ done
14: end if
15: $x_{i} \leftarrow$ advance by a along $r^{\star}$; $c_{i} \leftarrow c_{i} + 1$
16: if $x_{i} = g_{i}$ then ▹ task complete
17:     $H \leftarrow H + h_{i}$; register task; resample; reset $c_{i}, h_{i}$
18: else if $c_{i} >$ step cap then
19:     register failure; resample; reset $c_{i}, h_{i}$
20: end if
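The two decision points of Algorithm 1 that differ between sharing and non-sharing robots, and the probabilistic reconnection, can be sketched as follows. This is a hedged illustration rather than the original code: routes are assumed to be lists of `(x, y)` cells, H is assumed to be a dictionary from cell to collision count, "nearest" is assumed to mean smallest Manhattan distance from the robot to any cell of the route, and `select_route` and `should_reconnect` are hypothetical names.

```python
import random


def route_cost(route, H):
    """Summed collision history along a route: sum of H[c] over its cells."""
    return sum(H.get(cell, 0) for cell in route)


def select_route(candidates, H, sharing, position):
    """Pick a route from the candidate set C.

    Returns None when C is empty; the caller would then keep its current
    route or fall back to online RRT.
    """
    if not candidates:
        return None
    if sharing:
        # Sharing robots prefer the route with the lowest summed history.
        return min(candidates, key=lambda r: route_cost(r, H))

    def distance(route):
        # Assumption: Manhattan distance from the robot to the closest cell.
        return min(abs(cx - position[0]) + abs(cy - position[1]) for cx, cy in route)

    return min(candidates, key=distance)


def should_reconnect(rho, rng):
    """Reconnect with probability 1 - rho, where rho is the fraction done."""
    return rng.random() < 1.0 - rho


near = [(0, 0), (1, 0)]
far = [(5, 5), (5, 6)]
history = {(0, 0): 3}  # hypothetical collision counts
chosen_sharing = select_route([near, far], history, True, (0, 0))
chosen_plain = select_route([near, far], history, False, (0, 0))
```

In this toy case the sharing robot avoids the nearer route because it passes through a cell with recorded collisions, while the non-sharing robot takes it.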

3.3. Route Libraries and Skill Transfer

The framework’s prior knowledge lives in its route library. We compare two libraries in the Map2 deployment environment: a Map2-naive library (the “original” dataset, generated by the RRT expert on Map1) and a Map2-adapted library (the “alternative” dataset, whose routes were shaped by Map2’s obstacle configuration). Following the original pipeline, each library is filtered for obstacle-free routes against the map on which it was generated (Map1 for the original library, Map2 for the adapted one) and then deployed on Map2; the Map2-naive library therefore retains routes that intersect Map2 obstacles, which is precisely the environment-mismatch the adapted library is meant to overcome. The two filtering stages are distinct: this offline filter is applied once against the generation map, whereas the run-time Get_best_routes filter (Algorithm 1) only rejects routes blocked in the robot’s immediate local occupancy window (other robots and nearby cells). A Map1-generated route that crosses a Map2 obstacle outside that local window therefore survives both filters and stays active, so its three-step look-ahead eventually predicts a blocked move—the collision the diagnostics below quantify. This setup isolates skill transfer: the BC network, planner, and deployment map are identical, and only the route library differs. Figure 3 makes the mismatch concrete: routes valid on Map1 (panel A) cross Map2’s obstacles when deployed unchanged (panel B), and panels C–D illustrate the deployment consequence on two examples: a robot following a naive Map1 route reaches a blocked contact where the route would enter a Map2 obstacle, so online RRT must plan a fresh collision-free continuation to the goal—the run-time replanning that the static mismatch forces. 
Routes in panels C–D are shown after collision-checked simplification for display: every plotted segment is validated against the Map2 configuration-space obstacle, so simplification removes raw-planner jitter without moving a route into obstacle geometry. Across the entire library, 8.1% of the Map2-naive library’s route cells, spanning 58% of its routes, fall inside Map2 obstacles. The Map2-adapted library has no such overlap by construction (its routes are generated by running RRT on Map2 directly); whether this static gap translates to a run-time benefit is the subject of Section 4 (Figure 4 and Figure 5).

Table 2 quantifies the two libraries. They contain a comparable number of routes (3543 vs 3592 raw; 3304 vs 3256 after source-map filtering, within about 2%), so the Map2-adapted library is not simply larger than the Map2-naive one; its routes are modestly longer (median 33 vs 37 cells) and cover somewhat more free space (1207 vs 1363 unique cells). These differences are small relative to the non-trivial one: the Map2-naive library leaves 8.1% of its route cells, spanning 58% of its routes, inside Map2 obstacles, which is the static gap that drives the run-time collision difference reported in Section 4.
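The two static-overlap figures (share of route cells inside obstacles, and share of routes touching at least one obstacle cell) can be computed along the following lines. This is a sketch under stated assumptions: routes are lists of `(x, y)` cells, obstacles are a set of cells, and cells are counted with multiplicity across routes; whether the reported 8.1% uses that convention or unique cells is not stated in the text. The name `overlap_stats` and the toy data are hypothetical.

```python
def overlap_stats(routes, obstacles):
    """Return (fraction of route cells in obstacles, fraction of routes affected).

    Assumption: a cell that appears on several routes is counted once per route.
    """
    total_cells = sum(len(route) for route in routes)
    blocked_cells = sum(1 for route in routes for cell in route if cell in obstacles)
    blocked_routes = sum(1 for route in routes if any(cell in obstacles for cell in route))
    return blocked_cells / total_cells, blocked_routes / len(routes)


# Toy library: one route crosses the single obstacle cell, one does not.
toy_routes = [[(0, 0), (1, 0), (2, 0), (3, 0)], [(0, 1), (1, 1)]]
toy_obstacles = {(2, 0)}
cell_fraction, route_fraction = overlap_stats(toy_routes, toy_obstacles)
```

The toy example shows why the two figures can differ widely: a single blocked cell out of six (about 17% of cells) already invalidates half of the routes, just as a small cell fraction in the naive library corresponds to a majority of affected routes.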

3.4. Collision-History Sharing

As a lightweight add-on that needs no direct robot-to-robot message passing (proposed as future work in the originating study), a global per-cell map H accumulates the collisions logged during completed tasks: coordination, when enabled, is indirect, through this shared map rather than through messages exchanged between robots. The robots in a randomly chosen subset (a fraction $q_{share}$, fixed per episode) bias their route selection toward rarely-collided cells, selecting, among valid candidate routes, the one that minimizes the summed collision history $\sum_{c \in r} H[c]$ along its cells, while the rest plan normally. Setting $q_{share} = 0$ gives no sharing, $0 < q_{share} < 1$ selective sharing, and $q_{share} = 1$ full sharing. Every robot writes its own logged collisions into H upon completing a task, while only the sharing subset reads the aggregate; no robot exchanges messages with another. Collision records from tasks that fail (exceed the per-task step cap) are discarded with the task rather than added to H, so H reflects the collision history of completed tasks only. Because H accrues only as tasks complete, the mechanism can act only once enough collision history has been observed; we therefore run these episodes to a high task throughput (2000 scheduler steps, ≈245 completed tasks at five robots).
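The write path of the shared map, in which a task's collision log is committed only when the task completes and is dropped when it fails, can be sketched as follows. The names `finish_task` and `pick_sharing_subset` are hypothetical, H is assumed to be a `collections.Counter` keyed by cell, and rounding the subset size to the nearest integer is an assumption about how a fraction of the fleet is chosen.

```python
import random
from collections import Counter


def finish_task(H, task_log, completed):
    """Commit a task's per-cell collision log to H only if the task completed.

    task_log is a list of cells, one entry per logged collision; logs from
    failed tasks are discarded, so H reflects completed tasks only.
    """
    if completed:
        H.update(task_log)  # Counter.update adds one count per listed cell
    return H


def pick_sharing_subset(n_robots, q_share, rng):
    """Choose which robots read H for the whole episode (fixed per episode).

    Assumption: the subset size is round(q_share * n_robots).
    """
    k = round(q_share * n_robots)
    return set(rng.sample(range(n_robots), k))


H = Counter()
finish_task(H, [(1, 1), (1, 1), (2, 3)], completed=True)
finish_task(H, [(9, 9)], completed=False)  # failed task: log is dropped
readers = pick_sharing_subset(5, 0.6, random.Random(0))
```

After these calls H holds two collisions at cell (1, 1) and one at (2, 3), nothing at (9, 9), and three of the five robots are designated as readers.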

3.5. Experimental Design

With 15 seeds per condition we run six experiments (Table 3): (i) a skill-transfer experiment on Map2 comparing the Map2-naive and Map2-adapted libraries across fleet sizes $N \in \{2, 4, 6, 8, 10\}$; (ii) an online RRT baseline over the same fleet sizes; (iii) a collision-history-sharing ablation ($q_{share} \in \{0, 0.6, 1.0\}$ at five robots) on both maps; (iv) an access-fraction sweep ($q_{share} \in \{0, 0.2, 0.4, 0.6, 0.8, 1.0\}$); (v) a sharing scalability sweep ($N \in \{2, 4, 6, 8, 10\}$, with and without selective sharing); and (vi) a throughput-dependence sweep (Map1, five robots, $q_{share} \in \{0, 0.6, 1.0\}$ over episode lengths from 300 to 2000 steps). Each of the main experiments runs for 2000 scheduler steps (about 245 completed tasks at five robots) so that the shared collision database fills before collision-rate comparisons are made; the throughput sweep additionally varies the episode length to expose the short-horizon regime in which the database stays nearly empty. We report mean ± SD, Mann–Whitney U tests (one-sided, where directional), and Spearman trends. Scripts (sim_rrtbc.py, sim_rrt_online.py, scenario.py, run_experiments_rrtbc.py, run_experiments_rrt_online.py, analyze_rrtbc.py, make_heatmaps_rrtbc.py) and raw results are provided with the submission.
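For readers unfamiliar with the one-sided Mann–Whitney U test used throughout, the sketch below computes the U statistic and a normal-approximation p-value for the hypothesis that one sample tends to be smaller than the other. It is a self-contained illustration, not the analysis script: the provided code may use a library routine or an exact method, the approximation here applies a tie correction but no continuity correction, and the sample values are hypothetical.

```python
import math


def mann_whitney_u_less(x, y):
    """One-sided test of 'x tends to be smaller than y'.

    Returns (U_x, p) with a normal approximation (tie-corrected variance,
    no continuity correction). Not valid if every pooled value is identical.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid = (i + 1 + j) / 2.0  # midrank of the tied block (ranks i+1 .. j)
        for k in range(i, j):
            ranks[k] = mid
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0  # pairs where an x value exceeds a y value
    n = n1 + n2
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - mean) / math.sqrt(var)
    p = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z): small when x ranks low
    return u1, p


# Hypothetical per-episode collision counts for two conditions.
adapted = [190, 205, 210, 198, 220]
naive = [650, 700, 720, 690, 710]
u, p = mann_whitney_u_less(adapted, naive)
```

When every value in the first sample lies below every value in the second, as in this made-up example, U is 0 and the p-value is small.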

4. Results

The figures separate three effects. Figure 4 measures the result of route-library adaptation; Figures 5 and 6 diagnose the failure mechanism by locating and classifying collisions; and Figure 7 quantifies the computational trade-off against online RRT. Finally, Figures 8, 9 and 10 evaluate the optional collision-history-sharing add-on.

4.1. Environment-Adapted Route Libraries Transfer the Navigation Skill

Deploying the Map2-adapted library instead of the Map2-naive library on Map2 sharply improves navigation (Figure 4). At four robots, the Map2-naive library incurs $703 \pm 61$ collisions and a 0.23 task-failure rate, while the Map2-adapted library incurs $206 \pm 36$ collisions and a 0.09 failure rate: a 71% reduction in collisions (Mann–Whitney U on per-episode collision counts, $p < 10^{-5}$) and a 62% reduction in failures (the significance test is on collisions; failure rates are reported descriptively; full per-fleet-size numbers, bootstrap CIs and effect sizes are given in Table 4). The reduction is largest where obstacle mismatch dominates (85% at two robots) and remains substantial as agent–agent interference grows (54% at six, 37% at ten robots); the adapted library’s collision rate (per completed task) and failure rate stay below the naive library’s at every fleet size $N \in \{2, 4, 6, 8, 10\}$ tested, all with $p < 10^{-5}$. This is the environment-specific skill-transfer effect: the BC network and planner are unchanged, so the gain comes entirely from aligning the route library with the deployment geometry. The collision heatmaps (Figure 5) make the mechanism visible: the Map2-naive library produces collisions inside Map2’s obstacle regions (routes inherited from Map1 cut through walls that did not exist there), whereas the Map2-adapted library confines residual collisions to natural high-traffic corridors.
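As a worked check of the headline four-robot figure, the relative reduction in mean collision count follows directly from the two reported means. The symbols $\bar{c}_{\text{naive}}$ and $\bar{c}_{\text{adapted}}$ are notation introduced here for the mean per-episode collision counts of the two libraries; they do not appear in the original text.

```latex
\[
\text{reduction}
= \frac{\bar{c}_{\text{naive}} - \bar{c}_{\text{adapted}}}{\bar{c}_{\text{naive}}}
= \frac{703 - 206}{703}
= \frac{497}{703}
\approx 0.707
\approx 71\%.
\]
```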

4.2. Throughput, Collision Composition, and Planning Cost Across Fleet Density

Throughput (tasks completed within the fixed budget) increases with fleet size on Map2: the completed-task counts in Table 4 rise monotonically with N for every method. The collision rate per completed task also increases gradually with density as path intersections become more frequent, reproducing the qualitative scalability behavior reported in the originating thesis; the same throughput trend appears in the Map1 scalability sweep (Figure 8, left; Spearman $\rho = +0.98$ between fleet size and tasks, $p < 0.001$). Decomposing collisions by type (Figure 6) shows why: the Map2-naive library carries a large static-obstacle collision component, roughly 2.7–2.9 collisions per completed task at every fleet size (its routes cross Map2 walls), that is essentially absent from the Map2-adapted hybrid and from online RRT, both of which plan around the static obstacles and so incur no obstacle collisions. This map-mismatch component dominates the naive library’s collisions at small fleets and persists undiminished as N grows, even as robot–robot congestion (the residual common to all three methods) rises with density and overtakes it at the largest fleets. Figure 6 shows $N = 4$ and $N = 10$; the complete breakdown for all fleet sizes is given in Table 5. This separates the map-mismatch failure mode (curable by adapting the library and roughly constant in N) from the remaining density-driven congestion.

Online RRT achieves the lowest collision and failure rates of the three methods (Figure 4), unsurprisingly, as it computes a fresh path toward each goal, but it does so at a steep and growing computational cost (Figure 7, Table 6): its search had to be bounded (capped RRT iterations) merely to remain tractable at higher densities, and each episode was markedly slower than the route-reusing hybrid, which invokes the planner only to reconnect. The environment-adapted hybrid closes most of the gap to online RRT (e.g. 1.08 vs. 0.82 collisions per task at four robots, 3.99 vs. 2.69 at ten) while avoiding per-step replanning, which is the efficiency trade-off that motivates the hybrid design. A faithful comparison of throughput under a fixed time budget, in which online RRT’s planning cost would lower its completed-task count (the metric used by the originating study), is left to future work.

4.3. Collision-History Sharing Gives Only a Limited Benefit

Evaluated at a high task throughput (2000 scheduler steps, ≈245 completed tasks at five robots), collision-history sharing yields only a limited benefit within the full framework (Table 7). In that table, the Change column and the significance tests refer to the mean per-episode collision count (the quantity tested), while Coll./task is the corresponding workload-normalized rate, which changes by a similar amount; p is from the one-sided Mann–Whitney U test against no sharing, $p_{Holm}$ applies a Holm–Bonferroni correction across the four ablation comparisons, and the Significant column is judged after correction. On Map1 at five robots, the mean per-episode collision count falls from $362 \pm 66$ (no sharing) to $346 \pm 62$ under selective access ($q_{share} = 0.6$) and $316 \pm 54$ under full access: full sharing gives a modest, nominally significant 13% reduction in collision count (Mann–Whitney U on these per-episode counts, one-sided $p = 0.023$); the corresponding collisions-per-completed-task rate falls from 1.42 to 1.29 (about 9%, the slightly smaller figure reflecting the near-constant task count). The 4% reduction under selective access is not significant ($p = 0.32$). The sharing analysis comprises four comparisons (two maps × two participation levels), which we treat as a single family and report both nominally and with a Holm correction (Table 7). Only the Map1 full-participation comparison is even nominally significant, and its nominal $p = 0.023$ does not survive the family-wise correction (Holm-adjusted $p = 0.09$); therefore, we read the sharing benefit as limited and layout-dependent rather than robustly established. We find no evidence that full sharing increases collisions relative to selective access (one-sided $p = 0.94$ in that direction), and no monotonic dose–response: the access-fraction sweep is essentially flat (Figure 9; Spearman $\rho = -0.09$, $p = 0.42$). On Map2, the effect is still smaller and not significant (full and selective $\approx 2\%$; $p = 0.37$ and $0.43$, respectively). The framework’s three-step look-ahead and environment-matched routes already keep collisions manageable, leaving little persistent congestion for the shared map to act on, so the near-halving reported from the originating study’s single preliminary run does not reproduce here. Crucially, this small benefit is undetectable at low task throughput (Figure 10): because the shared map H fills only as tasks complete (panel A), short episodes leave it nearly empty, and the collision-rate gap between full and no sharing is statistically indistinguishable from zero (panels B,C). In this regime, we cannot separate two compatible explanations (the mechanism has little accumulated history to act on, and few completed tasks give the test little power to detect such a small effect), so we report only that sharing has no measurable effect below roughly 1000 scheduler steps (the horizon at which the bootstrap interval first excludes zero in Figure 10), without claiming that the underlying effect is exactly zero.
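The Holm–Bonferroni step-down adjustment applied to the four sharing comparisons can be reproduced with a few lines. The sketch below is a generic implementation, not the provided analysis script; the input p-values are the rounded figures quoted in the text (Map1 full, Map1 selective, Map2 full, Map2 selective), so the output matches the reported adjusted value only to that rounding.

```python
def holm_adjust(pvalues):
    """Holm-Bonferroni step-down adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Adjusted values must be non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Rounded nominal p-values quoted in the text for the four comparisons.
nominal = [0.023, 0.32, 0.37, 0.43]
adjusted = holm_adjust(nominal)
```

The smallest p-value is multiplied by four, giving 4 × 0.023 = 0.092, which agrees with the Holm-adjusted value of about 0.09 reported for the Map1 full-participation comparison and lies above the conventional 0.05 threshold.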

5. Discussion

5.1. What Governs Navigation Quality

The dominant factor in this decentralized framework is the alignment between the route library and the deployment environment. Because the BC policy imitates expert routes, its behavior is only as good as the routes it can draw on: a library generated for a different map carries that map’s assumptions—including paths through cells that are obstacles in the new environment—and the robots inherit those mistakes. Adapting the library to the deployment map restores performance, transferring the learned navigation skill with the planner and policy unchanged. This is a concrete, measurable instance of the covariate-shift limitation of imitation learning [7], and of its remedy by distribution alignment.

This sensitivity to the gap between training and deployment conditions is not specific to our setup. Learning-based navigation is known to be sensitive to distribution drift between the conditions a policy is trained on and those it meets at deployment [28], and methods that learn purely from demonstration similarly falter once a robot must act in cluttered configurations outside the demonstrated range [29]. Our route-library result is the grid-navigation instance of the same phenomenon: the policy is only as good as the offline routes it imitates, so realigning those routes to the deployment map is what restores performance. A complementary route to robustness, pursued mainly in manipulation, is to distill a large corpus of sampling-planner solutions into a single reactive policy that generalizes across scene layouts [30]; our results indicate that, for a fixed policy, the cheaper lever of regenerating the route library on the target map is already decisive.

Behavior Cloning is attractive here precisely because it is the simplest form of imitation learning—a supervised mapping from observations to actions that needs neither a reward function nor a dynamics model [31,32]. That simplicity is also its weakness: with no mechanism to correct for states unseen in the demonstrations, environment-specific assumptions are baked into the policy, which is why demonstration- or planner-derived priors are frequently paired with additional components to improve generalization [33]. Our hybrid retains BC for fast reactive control and delegates only the harder, environment-specific reconnection to the planner, which localizes the covariate-shift problem to the route library.

5.2. The Role of Collision-History Sharing

Collision-history sharing is a low-cost way to steer robots away from cells that have repeatedly caused collisions, requiring no direct inter-robot messaging. Evaluated within the full hybrid framework and at a realistic task throughput, its benefit is limited: full participation yields a small, nominally significant reduction on Map1 (13%, not surviving Holm correction across the sharing comparisons), while selective participation and a second layout show no significant change, and there is no clear dose–response. The mechanism is most useful where the base navigation leaves persistent congestion; here the framework’s three-step look-ahead and environment-matched routes already keep collisions manageable, so little remains for the shared map to remove—and the near-halving reported from the originating study’s single preliminary run does not reproduce over repeated, seeded trials. The effect also depends on the shared map having accumulated enough history: at low throughput, the database stays nearly empty, and no effect is detectable, so reporting this throughput dependence is essential to evaluating the mechanism fairly.

Collision-history sharing is an indirect, stigmergy-like form of coordination: robots influence one another only through a shared spatial record of where collisions recur, never through direct messages. This contrasts with the bulk of decentralized multi-robot navigation, where coordination is achieved through explicit communication—for example, decentralized multi-agent RRT in which agents pass a planning token and exchange planned trajectories to guarantee mutual collision avoidance [34]—and with classical reactive schemes that must resolve static-obstacle and inter-robot collisions at once and are prone to local-minima deadlock [35]. The appeal of a message-free shared signal is its near-zero coordination cost; our results temper that appeal by showing that, once the base navigation is strong, the residual congestion that such a signal can remove is small.

5.3. Generality and Relation to Planning–Learning Hybrids

The framework sits within a broad and active effort to combine classical planning with learning, spanning sampling-based planners, supervised policies, and reinforcement learning [36]. Sampling-based planners such as RRT remain a workhorse for mobile robot path planning, and their cost is dominated by collision checking [37] and by parameters such as step size and iteration budget [38]; pairing them with learned components is increasingly common in industrial mobile robotics, for example, combining an RRT global planner with learned perception inside an ROS navigation stack [39]. Other hybrids augment a reinforcement-learning agent’s action space with a motion planner [40], or pair a demonstration-learned task model with a sampling-based planner so that a robot reproduces demonstrated behavior while still avoiding obstacles absent from the demonstrations [41]. Against this backdrop our contribution is deliberately narrow: we reuse the original planner and policy verbatim and isolate the effect of route-library provenance, which is what lets the skill-transfer effect be attributed cleanly rather than confounded with architectural changes.

5.4. Limitations

The study is confined to grid-based simulation; real deployments add sensing noise, continuous dynamics, and safety constraints. Behavior-cloning policies in particular carry a well-documented sim-to-real gap, so a simulation-trained policy needs physical validation before its numbers can be expected to transfer [42]. Absolute counts depend on modeling choices (the collision metric counts predicted blocked-move events; the shared map accumulates without decay; the access subset is fixed per episode). The robust claims are therefore the large skill-transfer effects and their directional consistency across seeds, not the precise magnitudes and not the marginal collision-history-sharing effect, which does not survive multiple-comparison correction.

5.5. Future Work

Several directions follow directly from these limitations. The environment-adapted route library here was produced offline; because offline plans are computed once and grow stale as the scene changes [43], generating such adaptation online and at scale is a natural next step. Further directions include comparing against optimal multi-robot solvers (e.g. conflict-based search) and continuous-space collision-avoidance methods (e.g. ORCA), adding a decaying shared collision-history map, and validating the framework on physical robots.
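As an illustration of one direction listed above, a decaying shared collision-history map might look roughly like the sketch below. Everything in it is hypothetical: the class name `DecayingCollisionMap`, the per-step exponential decay factor, and the cell-keyed dictionary are illustrative assumptions and are not taken from the released code, in which the shared map accumulates without decay.

```python
class DecayingCollisionMap:
    """Hypothetical shared map H: grid cell -> decayed count of blocked-move events."""

    def __init__(self, decay=0.99):
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay   # decay = 1.0 reproduces plain accumulation
        self.values = {}     # (row, col) -> float

    def record(self, cell, weight=1.0):
        """A robot with write access logs a blocked move at `cell`."""
        self.values[cell] = self.values.get(cell, 0.0) + weight

    def step(self, prune_below=1e-6):
        """Advance one scheduler step: fade every entry, drop negligible ones."""
        self.values = {
            cell: value * self.decay
            for cell, value in self.values.items()
            if value * self.decay >= prune_below
        }

    def read(self, cell):
        """A robot with read access queries the current penalty at `cell`."""
        return self.values.get(cell, 0.0)
```

With `decay` below 1, old congestion records fade unless they are refreshed by new events, which is one simple way to keep the map from growing stale; the appropriate decay rate would have to be tuned empirically.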

6. Conclusions

We presented a faithful, fully reproducible re-implementation of a hybrid RRT + Behavior Cloning framework for decentralized multi-robot navigation in grid-based industrial environments, reusing the original trained network, planner, and route data. Our central finding is that navigation quality is governed by the match between the expert route library and the deployment environment: an environment-adapted library transfers the learned skill and cuts collisions by 37–85% and task failures by 43–69% relative to a Map2-naive library across five fleet sizes (2, 4, 6, 8, 10 robots), recovering most of the collision performance of an online RRT baseline at a fraction of its planning cost. A collision-history-sharing add-on that needs no direct messaging, evaluated across repeated seeded trials, gives only a limited benefit: a nominally significant 13% collision reduction under full participation on Map1 (not surviving multiple-comparison correction), and no significant change under selective participation or on a second layout. The near-halving suggested by the originating study’s preliminary single run therefore does not reproduce within the full framework, and the effect is undetectable at low task throughput. By providing the deterministic code and raw results with this submission, and by documenting the throughput and route-filtering conditions under which each effect appears, we aim to make these results directly reproducible and the framework a reliable basis for further work.

Author Contributions

Conceptualization, C.G.; methodology, Y.A. and C.G.; software, Y.A. and C.G.; validation, C.G.; formal analysis and investigation, Y.A. and C.G.; writing—original draft preparation, Y.A. and C.G.; writing—review and editing, C.G.; supervision and project administration, C.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The code, data, and trained model supporting the reported results are openly available in the GitHub repository at https://github.com/ChenGiladi/skill-transfer-multi-robot-nav.

Acknowledgments

During the preparation of this manuscript, the authors used Writefull and an AI-based assistant for language editing and to condense portions of the text. The authors have reviewed and edited all such output and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hekmati, A.; Gupta, K. On Safe Robot Navigation Among Humans as Dynamic Obstacles in Unknown Indoor Environments. In 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO); IEEE: Kuala Lumpur, Malaysia, 2018; pp. 1082–1087.
2. Silva, S.; Verdezoto, N.; Paillacho, D.; Millan-Norman, S.; Hernández, J.D. Online Social Robot Navigation in Indoor, Large and Crowded Environments. In 2023 IEEE International Conference on Robotics and Automation (ICRA); IEEE: London, United Kingdom, 2023; pp. 9749–9756.
3. Han, R.; Chen, S.; Hao, Q. Cooperative Multi-Robot Navigation in Dynamic Environment with Deep Reinforcement Learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Paris, France, 2020; pp. 448–454.
4. Xu, P.; Karamouzas, I. Human-Inspired Multi-Agent Navigation Using Knowledge Distillation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); 2021; pp. 8105–8112.
5. Wu, S.; Chen, G.; Shi, M.; Alonso-Mora, J. Decentralized Multi-Agent Trajectory Planning in Dynamic Environments with Spatiotemporal Occupancy Grid Maps, 2024. arXiv:2404.15602 [cs].
6. Karaman, S.; Frazzoli, E. Incremental Sampling-Based Algorithms for Optimal Motion Planning, 2010. arXiv:1005.0416 [cs].
7. Codevilla, F.; Santana, E.; López, A.M.; Gaidon, A. Exploring the Limitations of Behavior Cloning for Autonomous Driving, 2019. arXiv:1904.08980 [cs].
8. Bojarski, M.; Testa, D.D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; Zhang, X.; Zhao, J.; Zieba, K. End to End Learning for Self-Driving Cars, 2016. arXiv:1604.07316 [cs].
9. Zhang, R.; Hou, J.; Walter, F.; Gu, S.; Guan, J.; Röhrbein, F.; Du, Y.; Cai, P.; Chen, G.; Knoll, A. Multi-Agent Reinforcement Learning for Autonomous Driving: A Survey, 2024. arXiv:2408.09675 [cs].
10. Gao, Z.; Yang, G.; Prorok, A. Online Control Barrier Functions for Decentralized Multi-Agent Navigation. In 2023 International Symposium on Multi-Robot and Multi-Agent Systems (MRS); IEEE: Boston, MA, USA, 2023; pp. 107–113.
11. Mestres, P.; Nieto-Granda, C.; Cortés, J. Distributed Safe Navigation of Multi-Agent Systems Using Control Barrier Function-Based Optimal Controllers, 2024. arXiv:2402.06195 [eess].
12. Kondo, K.; Tewari, C.T.; Peterson, M.B.; Thomas, A.; Kinnari, J.; Tagliabue, A.; How, J.P. PUMA: Fully Decentralized Uncertainty-Aware Multiagent Trajectory Planner with Real-Time Image Segmentation-Based Frame Alignment, 2024. arXiv:2311.03655 [cs].
13. Tang, J.; Duan, H.; Lao, S. Swarm Intelligence Algorithms for Multiple Unmanned Aerial Vehicles Collaboration: A Comprehensive Review. Artif. Intell. Rev. 2023, 56, 4295–4327.
14. Li, J.; Wang, K.; Chen, Z.; Wang, J. An Improved RRT* Path Planning Algorithm in Dynamic Environment. In Methods and Applications for Modeling and Simulation of Complex Systems; Fan, W., Zhang, L., Li, N., Song, X., Eds.; Communications in Computer and Information Science; Springer Nature Singapore: Singapore, 2022; Vol. 1713, pp. 301–313.
15. Zhao, P.; Chang, Y.; Wu, W.; Luo, H.; Zhou, Z.; Qiao, Y.; Li, Y.; Zhao, C.; Huang, Z.; Liu, B.; Liu, X.; He, S.; Guo, D. Dynamic RRT: Fast Feasible Path Planning in Randomly Distributed Obstacle Environments. J. Intell. Robot. Syst. 2023, 107, 48.
16. Da Silva Costa, L.; Tonidandel, F. DVG+A* and RRT Path-Planners: A Comparison in a Highly Dynamic Environment. J. Intell. Robot. Syst. 2021, 101, 58.
17. Florence, P.; Lynch, C.; Zeng, A.; Ramirez, O.; Wahid, A.; Downs, L.; Wong, A.; Lee, J.; Mordatch, I.; Tompson, J. Implicit Behavioral Cloning, 2021. arXiv:2109.00137 [cs].
18. Chi, Z.; Zhu, L.; Zhou, F.; Zhuang, C. A Collision-Free Path Planning Method Using Direct Behavior Cloning. In Intelligent Robotics and Applications; Yu, H., Liu, J., Liu, L., Ju, Z., Liu, Y., Zhou, D., Eds.; Springer International Publishing: Cham, 2019; pp. 529–540.
19. Samak, T.V.; Samak, C.V.; Kandhasamy, S. Robust Behavioral Cloning for Autonomous Vehicles Using End-to-End Imitation Learning. SAE Int. J. Connect. Autom. Veh. 2021, 4, 12-04-03-0023.
20. Farag, W.; Saleh, Z. Behavior Cloning for Autonomous Driving Using Convolutional Neural Networks. In 2018 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT); IEEE: Sakhier, Bahrain, 2018; pp. 1–7.
21. Ran, L.; Zhang, Y.; Zhang, Q.; Yang, T. Convolutional Neural Network-Based Robot Navigation Using Uncalibrated Spherical Images. Sensors 2017, 17, 1341.
22. Pan, Z.; Manocha, D. Feedback Motion Planning for Liquid Pouring Using Supervised Learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: Vancouver, BC, 2017; pp. 1252–1259.
23. Jia, B.; Manocha, D. Sim-to-Real Robotic Sketching Using Behavior Cloning and Reinforcement Learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Yokohama, Japan, 2024; pp. 18272–18278.
24. Dharmavaram, A.; Gupta, T.; Li, J.; Sycara, K.P. SS-MAIL: Self-Supervised Multi-Agent Imitation Learning, 2021. arXiv:2110.08963 [cs].
25. Fang, B.; Zheng, C.; Wang, H. Fact-Based Agent Modeling for Multi-Agent Reinforcement Learning, 2023. arXiv:2310.12290 [cs].
26. Strouse, D.J.; McKee, K.R.; Botvinick, M.; Hughes, E.; Everett, R. Collaborating with Humans without Human Data, 2021. arXiv:2110.08176 [cs].
27. Hu, H.; Wu, D.J.; Lerer, A.; Foerster, J.; Brown, N. Human-AI Coordination via Human-Regularized Search and Learning, 2022. arXiv:2210.05125 [cs].
28. Yu, H.; Hirayama, C.; Yu, C.; Herbert, S.; Gao, S. Sequential Neural Barriers for Scalable Dynamic Obstacle Avoidance. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: Detroit, MI, USA, 2023; pp. 11241–11248.
29. Ye, G.; Alterovitz, R. Demonstration-Guided Motion Planning. In Robotics Research; Christensen, H.I., Khatib, O., Eds.; Springer Tracts in Advanced Robotics; Springer International Publishing: Cham, 2017; Vol. 100, pp. 291–307.
30. Dalal, M.; Yang, J.; Mendonca, R.; Khaky, Y.; Salakhutdinov, R.; Pathak, D. Neural MP: A Generalist Neural Motion Planner. 2024.
31. Morga-Bonilla, S.I.; Rivas-Cambero, I.; Torres-Jiménez, J.; Téllez-Cuevas, P.; Núñez-Cruz, R.S.; Perez-Arista, O.V. Behavioral Cloning Strategies in Steering Angle Prediction: Applications in Mobile Robotics and Autonomous Driving. World Electr. Veh. J. 2024, 15, 486.
32. Zhan, E.; Zheng, S.; Yue, Y.; Lucey, P. Generative Multi-Agent Behavioral Cloning. 2018.
33. Zhang, Z.; Hong, J.; Enayati, A.S.; Najjaran, H. Using Implicit Behavior Cloning and Dynamic Movement Primitive to Facilitate Reinforcement Learning for Robot Motion Planning, 2024. arXiv:2307.16062 [cs].
34. Desaraju, V.R.; How, J.P. Decentralized Path Planning for Multi-Agent Teams in Complex Environments Using Rapidly-Exploring Random Trees. In 2011 IEEE International Conference on Robotics and Automation; IEEE: Shanghai, China, 2011; pp. 4956–4961.
35. Zhaofeng, Y.; Ruizhe, Z. Path Planning of Multi-robot Cooperation for Avoiding Obstacle Based on Improved Artificial Potential Field Method. 2014; 165.
36. Zhou, C.; Huang, B.; Fränti, P. A Review of Motion Planning Algorithms for Intelligent Robotics. 2021.
37. Yu, C.; Gao, S. Reducing Collision Checking for Sampling-Based Motion Planning Using Graph Neural Networks. 2021.
38. Heng, H.; Ghazali, M.H.M.; Rahiman, W. Comparative Analysis of Navigation Algorithms for Mobile Robot. J. Ambient Intell. Humaniz. Comput. 2024, 15, 3861–3871.
39. Adiuku, N.; Avdelidis, N.P.; Tang, G.; Plastropoulos, A. Improved Hybrid Model for Obstacle Detection and Avoidance in Robot Operating System Framework (Rapidly Exploring Random Tree and Dynamic Windows Approach). Sensors 2024, 24, 2262.
40. Yamada, J.; Lee, Y.; Salhotra, G.; Pertsch, K.; Pflueger, M.; Sukhatme, G.S.; Lim, J.J.; Englert, P. Motion Planner Augmented Reinforcement Learning for Robot Manipulation in Obstructed Environments. 2020.
41. Bowen, C.; Alterovitz, R. Closed-Loop Global Motion Planning for Reactive, Collision-Free Execution of Learned Tasks. ACM Trans. Hum.-Robot Interact. 2018, 7, 1–16.
42. Verma, A.; Bagkar, S.; Allam, N.V.S.; Raman, A.; Schmid, M.; Krovi, V.N. Implementation and Validation of Behavior Cloning Using Scaled Vehicles; 2021; p. 2021-01-0248.
43. De Luca, A.; Muratore, L.; Tsagarakis, N.G. Autonomous Navigation With Online Replanning and Recovery Behaviors for Wheeled-Legged Robots Using Behavior Trees. IEEE Robot. Autom. Lett. 2023, 8, 6803–6810.

Figure 1. The two $50 \times 50$ grid workspaces (Map1, Map2): dark gray = inflated obstacle, white = free cell, with grid ticks every ten cells. Obstacles are inflated by the robot footprint. The vermillion outlines on Map2 mark the obstacles that are new relative to Map1, i.e., the cells the Map1-generated route library does not account for.

Figure 2. Hybrid RRT + Behavior Cloning framework. (A) Offline, an RRT expert uses the map to build the route library. (B) At deployment, each robot selects routes, applies BC local control, then either reconnects with online RRT or advances to the next timestep. (C) Skill transfer changes only the route library; BC network, RRT planner and Map2 deployment are fixed. (D) Optional sharing: the bracket marks nonzero-$q_{share}$ conditions with H read/write access; $q_{share}$ is an episode condition, not an output of H.

Figure 3. Route libraries encode the map on which they were built. Color and marker encodings are decoded in the bottom legend strip; the obstacle mask is shown in two gray tones (darker = physical obstacle, lighter = one-cell robot-footprint inflation) and is the same configuration-space mask the planner collision-checks against, so a valid route never overlaps it. (A) A sampled subset ($n = 80$) of the Map1-generated library is valid on Map1. (B) The same sampled subset evaluated on Map2; orange hatching marks route–obstacle conflict cells (sampled route cells overlapping the Map2 obstacle mask). (C, D) Two examples of the deployment recovery step: the robot follows a naive Map1-library route (orange) from the start S until a blocked contact (×) where the route would enter a Map2 obstacle (invalid continuation dashed); online RRT then plans a fresh collision-free continuation (blue) from the contact to the goal G. Arrows show travel direction.

Figure 4. Map2 transfer performance versus fleet size N, over 15 seeds. Markers show seed means; (A) collisions per completed task and (B) task-failure rate (failed tasks as a percentage of attempted tasks). Shaded bands are percentile-bootstrap $95\%$ confidence intervals for the mean across seeds. Three methods are compared: the hybrid framework with a Map2-naive route library (orange), the hybrid framework with a Map2-adapted library (blue), and online RRT (black, dashed). The Map2-adapted library substantially improves the hybrid framework relative to the Map2-naive library across fleet sizes, reducing collisions per completed task by, e.g., about $73\%$ at $N = 4$ (bracket in panel A), while online RRT remains the lowest-rate replanning baseline on both metrics.
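The percentile-bootstrap intervals mentioned in this caption can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name `percentile_bootstrap_ci`, the number of resamples, the nearest-rank percentile rule, and the example data are all assumptions.

```python
import random
import statistics


def percentile_bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded, so the interval is repeatable
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    # Simple nearest-rank percentiles of the resampled means.
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Made-up per-seed collisions-per-task values for 15 seeds (not the paper's data).
per_seed = [1.02, 1.10, 0.97, 1.21, 1.05, 1.13, 0.99, 1.08,
            1.17, 1.01, 1.06, 1.12, 0.95, 1.09, 1.14]
low, high = percentile_bootstrap_ci(per_seed)
print(f"mean={statistics.fmean(per_seed):.2f}, 95% CI=[{low:.2f}, {high:.2f}]")
```

Resampling whole seeds, rather than individual tasks, treats each seeded episode as the unit of replication, which matches the caption's "mean across seeds".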

Figure 5. Projected blocked-contact density on Map2 (panels A–F; rows = fleet size $N = 4$, $N = 10$; columns = method). A blocked contact is counted when a robot’s three-step look-ahead predicts the next cell is occupied; the event is assigned to the free cell immediately preceding the blocked transition and smoothed within the free-space mask with a mass-conserving Gaussian kernel, so the density sits on the reachable side of the geometry and never inside it. Obstacles are drawn opaque above the density in two grays (darker core = physical obstacle, lighter one-cell ring = robot-footprint / configuration-space inflation) with a boundary outline; white = zero (no displayed density). Color encodes blocked-contact events per grid cell per 100 completed tasks, aggregated over 15 seeds on a single shared scale (color bar); the scale bar in panel F spans ten cells. A dashed contour at the same absolute level $H \geq 1.2$ in every panel bounds the high-density regions, and a small crosshair marks the maximum of the plotted density in the Map2-naive panels (A, D); both annotations are read directly from the displayed field. The Map2-naive hybrid concentrates blocked contacts against the obstacles its transferred Map1 routes steer into, whereas the Map2-adapted hybrid and online RRT leave only diffuse residual contacts along the central corridors, reaching the high-density contour nowhere.
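One way to obtain a mass-conserving Gaussian smoothing restricted to free space, as described in this caption, is to renormalise the kernel over the free cells around each source cell. The sketch below illustrates that idea under stated assumptions: the function name, `sigma`, and `radius` are hypothetical, and the kernel uses straight-line distance, so mass can spread across a thin wall, which the paper's figure code may or may not allow.

```python
import math


def smooth_in_free_space(events, free, sigma=1.0, radius=3):
    """Spread each cell's event count over nearby FREE cells only.

    events, free: equal-shape 2D lists (counts and booleans).
    Weights are renormalised per source cell, so the grid total is conserved.
    """
    rows, cols = len(events), len(events[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            mass = events[r][c]
            if mass == 0:
                continue
            if not free[r][c]:
                raise ValueError("events must be assigned to free cells")
            targets = []
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and free[rr][cc]:
                        weight = math.exp(-(dr * dr + dc * dc) / (2 * sigma * sigma))
                        targets.append((rr, cc, weight))
            total = sum(w for _, _, w in targets)  # > 0: the source itself is free
            for rr, cc, w in targets:
                out[rr][cc] += mass * w / total
    return out
```

Because obstacle cells never receive weight, the smoothed density stays on the reachable side of the geometry, and dividing by `total` keeps the sum of the field equal to the number of recorded events.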

Figure 6. Collision composition on Map2 (collisions per completed task), split into static-obstacle (hatched) and robot–robot events. (A, B) Stacked bars at $N = 4$ and $N = 10$ for the three methods (error bars: bootstrap 95% CI of the total). (C) The Map2-naive library’s two components versus fleet size: a large static-obstacle component (absent from the other two methods) dominates at small fleets and persists, while robot–robot congestion grows with N and overtakes it near $N = 8$. The adapted hybrid and online RRT incur essentially no obstacle collisions (all fleet sizes in Table 5).

Figure 7. Planning-cost vs collision-rate trade-off on Map2: mean wall-clock runtime per episode against collisions per completed task, one marker per method and fleet size ($N = 2, 4, 6, 8, 10$, annotated); horizontal bars are the across-seed runtime standard deviation and vertical bars the percentile bootstrap 95% confidence interval on collisions per completed task. Online RRT reaches the lowest collision rate but at roughly three times the Map2-adapted hybrid’s runtime (e.g., 522 s vs 164 s at ten robots), calling the planner $\sim 2.5$ times per task versus at most $0.05$ for the Map2-adapted hybrid; the Map2-naive hybrid calls it more often than the adapted hybrid, up to $0.16$ per task. The Map2-adapted hybrid recovers most of that collision reduction at a fraction of the cost, while the Map2-naive hybrid costs about as much as the adapted hybrid yet collides far more (Table 6).

Figure 8. Collision-history sharing at realistic task throughput (Map1, 15 seeds): throughput (left) and collisions per completed task (right) versus fleet size, with and without selective sharing ($q_{share} = 0.6$). Selective sharing preserves throughput and produces only small changes in collision rate; the five-robot ablation (Table 7) shows that this selective-access effect is not statistically significant.

Figure 9. Access-fraction sweep (Map1, five robots, 15 seeds): collisions per completed task versus the fraction $q_{share}$ of robots with collision-history access. Points are means with percentile bootstrap 95% confidence intervals (not connected, to avoid implying a dose–response); the dashed line is the no-sharing baseline. The sweep is essentially flat (Spearman $\rho = -0.09$, $p = 0.42$): there is no monotonic dose–response.
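For readers who want to check a monotonic-trend statistic of this kind, Spearman's $\rho$ is the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration with made-up data; the helper names are hypothetical, the p-value is not computed, and the paper's own numbers come from its released analysis, not from this code.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Pearson correlation of the rank vectors (undefined if either is constant)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Made-up sweep: access fraction per run and a collisions-per-task value.
q_share = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
coll_per_task = [1.31, 1.35, 1.28, 1.33, 1.30, 1.29]
print(round(spearman_rho(q_share, coll_per_task), 2))
```

Because the access fraction takes only a few distinct levels while each level has many seeds, the real data contain many ties in one variable, which is why the average-rank handling matters.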

Figure 10. Throughput dependence of collision-history sharing (Map1, five robots). (A) The shared collision map H populates only as tasks complete (a representative run, seed 0). (B) Collisions per completed task for the three sharing levels ($q_{share} = 0, 0.6, 1.0$) across episode lengths; error bars are bootstrap 95% CIs. (C) Full-sharing collision reduction versus episode length, with a bootstrap 95% CI band (gray) about the zero reference line: the difference is statistically indistinguishable from zero at short horizons (when H is nearly empty) and becomes measurable only once enough completed tasks have populated H.

Table 1. Implementation settings shared by all experiments.

Component	Setting
Grid size	$50 \times 50$ occupancy grid
Obstacle inflation	1 cell (robot footprint)
Robot control	decentralized; no direct inter-robot messaging
Collision-history sharing	optional, via a shared map H (written and read; no messaging)
Collision definition	predicted blocked-move (three-step look-ahead)
BC input	robot position, target route waypoint, $5 \times 5$ local occupancy window
BC architecture	29-input MLP, hidden layers 256–256–64
Episode budget	2000 scheduler steps
Per-task step cap	300 steps (else counted as a failure)
Online RRT iteration cap	8000 (hybrid reconnect) / 3000 (online baseline)
Seeds per condition	15
Fleet sizes	$N \in \{2, 4, 6, 8, 10\}$
Sharing fraction $q_{share}$	$\{0, 0.6, 1.0\}$ and a 0–1 sweep
Statistical tests	Mann–Whitney U, Spearman $\rho$
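The BC input row of Table 1 is consistent with a 29-element vector if the robot position and the target waypoint each contribute two coordinates and the $5 \times 5$ window contributes 25 cells (2 + 2 + 25 = 29). The sketch below assembles such a vector; the function name `bc_input`, the feature ordering, the lack of normalisation, and the treatment of out-of-grid cells as occupied are all assumptions made for illustration, not details confirmed by the paper.

```python
def bc_input(position, waypoint, occupancy, window=5):
    """Build a flat feature list: 2 position + 2 waypoint + window*window cells."""
    half = window // 2
    r0, c0 = position
    rows, cols = len(occupancy), len(occupancy[0])
    local = []
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            r, c = r0 + dr, c0 + dc
            inside = 0 <= r < rows and 0 <= c < cols
            # Assumption: cells outside the grid count as occupied (1.0).
            local.append(float(occupancy[r][c]) if inside else 1.0)
    return [float(r0), float(c0), float(waypoint[0]), float(waypoint[1])] + local


grid = [[0] * 50 for _ in range(50)]       # empty 50 x 50 occupancy grid
features = bc_input((10, 10), (12, 15), grid)
print(len(features))                        # 29
```

A vector of this shape would then be fed to the MLP listed in the table; the network's output layer and training details are outside the scope of this sketch.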

Table 2. Route-library diagnostics (both libraries deployed on Map2), computed from the released route datasets and map layouts. The two libraries hold a comparable number of routes (within about 2%), so the adapted library is not simply larger; their routes differ modestly in median length and free-cell coverage, but the decisive difference is that the Map2-naive library leaves route cells inside Map2 obstacles while the adapted one does not.

Library	Generated on	Raw routes	Filtered routes	Median length	Free cells covered	Cells in obstacles	Routes overlapping
Map2-naive	Map1	3543	3304	33	1207	8.1%	58%
Map2-adapted	Map2	3592	3256	37	1363	0.0%	0%
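The two right-hand columns of Table 2 can be computed from a route set and an obstacle mask along the following lines. This is a hedged sketch: the function name and data layout are hypothetical, and it counts every route-cell occurrence (not unique cells), which may differ from the convention used for the table.

```python
def library_diagnostics(routes, obstacle):
    """routes: list of routes, each a list of (row, col); obstacle: 2D list of bools."""
    if not routes:
        raise ValueError("route library is empty")
    total_cells = blocked_cells = overlapping_routes = 0
    for route in routes:
        hits = sum(1 for r, c in route if obstacle[r][c])
        total_cells += len(route)
        blocked_cells += hits
        overlapping_routes += hits > 0  # True counts as 1
    return {
        "cells_in_obstacles_pct": 100.0 * blocked_cells / total_cells,
        "routes_overlapping_pct": 100.0 * overlapping_routes / len(routes),
    }


# Toy 3 x 3 map with one obstacle cell in the centre.
toy_obstacle = [[False, False, False],
                [False, True, False],
                [False, False, False]]
toy_routes = [[(0, 0), (0, 1), (0, 2)],   # stays in free space
              [(0, 0), (1, 1), (2, 2)]]   # crosses the obstacle
print(library_diagnostics(toy_routes, toy_obstacle))
```

Running the same function with a library generated on one map and the obstacle mask of another is the kind of check that separates the naive row of the table from the adapted one.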

Table 3. Experimental design. All conditions use 15 seeds; “library” is the route set deployed (Map2-naive = generated on Map1; Map2-adapted = generated on Map2; native = generated on the same map). Online RRT uses no library.

Experiment	Map	Library	N	$q_{share}$	Steps
Skill transfer	Map2	naive vs. adapted	2,4,6,8,10	0	2000
Online RRT baseline	Map2	none (online)	2,4,6,8,10	—	2000
Sharing ablation	Map1, Map2	native	5	0, 0.6, 1.0	2000
Access-fraction sweep	Map1	native	5	0–1 (step $0.2$)	2000
Sharing scalability	Map1	native	2,4,6,8,10	0, 0.6	2000
Throughput dependence	Map1	native	5	0, 0.6, 1.0	300–2000

Table 4. Full skill-transfer results on Map2 (15 seeds, 2000 scheduler steps). Tasks is the mean number of completed tasks (rounded to an integer); collisions are mean ± SD; collisions per completed task are reported with a percentile bootstrap 95% CI. The Reduction column is the percentage reduction in mean collisions relative to the Map2-naive hybrid; the Mann–Whitney U p-value (one-sided, naive > method) and Cliff’s $\delta$ are computed against that same baseline. The Map2-adapted hybrid and the online RRT baseline both improve on the naive library at every fleet size, with $\delta = 1$ and $p < 10^{-5}$ throughout.

N	Method	Tasks	Collisions	Coll./task [95% CI]	Fail rate	Reduction	p	$\delta$
2	Map2-naive hybrid	93	$281 \pm 28$	3.05 [2.86, 3.22]	0.20	—	—	—
	Map2-adapted hybrid	99	$41 \pm 16$	0.42 [0.34, 0.52]	0.06	85%	$< 10^{- 5}$	$1.00$
	Online RRT	120	$28 \pm 11$	0.23 [0.19, 0.28]	0.01	90%	$< 10^{- 5}$	$1.00$
4	Map2-naive hybrid	177	$703 \pm 61$	4.00 [3.75, 4.26]	0.23	—	—	—
	Map2-adapted hybrid	191	$206 \pm 36$	1.08 [0.98, 1.19]	0.09	71%	$< 10^{- 5}$	$1.00$
	Online RRT	221	$180 \pm 26$	0.82 [0.75, 0.88]	0.04	74%	$< 10^{- 5}$	$1.00$
6	Map2-naive hybrid	254	$1263 \pm 78$	4.98 [4.76, 5.17]	0.26	—	—	—
	Map2-adapted hybrid	276	$576 \pm 85$	2.09 [1.92, 2.26]	0.12	54%	$< 10^{- 5}$	$1.00$
	Online RRT	312	$428 \pm 57$	1.37 [1.28, 1.47]	0.06	66%	$< 10^{- 5}$	$1.00$
8	Map2-naive hybrid	327	$1944 \pm 140$	5.97 [5.63, 6.36]	0.29	—	—	—
	Map2-adapted hybrid	353	$1028 \pm 107$	2.92 [2.76, 3.10]	0.16	47%	$< 10^{- 5}$	$1.00$
	Online RRT	399	$788 \pm 55$	1.98 [1.90, 2.06]	0.09	59%	$< 10^{- 5}$	$1.00$
10	Map2-naive hybrid	394	$2687 \pm 120$	6.83 [6.62, 7.06]	0.32	—	—	—
	Map2-adapted hybrid	426	$1698 \pm 114$	3.99 [3.84, 4.15]	0.18	37%	$< 10^{- 5}$	$1.00$
	Online RRT	468	$1259 \pm 76$	2.69 [2.60, 2.80]	0.12	53%	$< 10^{- 5}$	$1.00$
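The Reduction and $\delta$ columns of Table 4 can be reproduced in spirit with a few lines. The sketch below recomputes two Reduction entries from the tabulated means and shows Cliff's delta on made-up per-seed counts; the function names and the toy data are illustrative assumptions. The last line evaluates the exact one-sided Mann–Whitney p-value for 15 versus 15 samples under complete separation and no ties, which is well below $10^{-5}$; the paper's test may instead use a normal approximation.

```python
import math


def percent_reduction(baseline_mean, method_mean):
    """Reduction of method_mean relative to baseline_mean, in percent."""
    return 100.0 * (baseline_mean - method_mean) / baseline_mean


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all (x, y) pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Mean collisions taken from Table 4 (Map2-naive vs Map2-adapted).
print(round(percent_reduction(281, 41)))     # N = 2  -> 85
print(round(percent_reduction(2687, 1698)))  # N = 10 -> 37

# Made-up per-seed counts with no overlap between groups (not the paper's data).
naive = [270, 281, 295, 260, 300]
adapted = [30, 41, 55, 38, 47]
print(cliffs_delta(naive, adapted))          # 1.0: every naive run is worse

# Exact one-sided p-value for 15 vs 15 seeds, complete separation, no ties.
print(1 / math.comb(30, 15))
```

A value of $\delta = 1$ therefore means that every seed of the naive library produced more collisions than every seed of the compared method.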

Table 5. Collision composition on Map2 at all fleet sizes (15 seeds, 2000 steps): static-obstacle and robot–robot collisions per completed task, and the static-obstacle fraction. The static-obstacle component is exclusive to the Map2-naive library and roughly constant across N (it dominates at small fleets and is overtaken by robot–robot congestion at the largest fleets); the adapted hybrid and online RRT incur no obstacle collisions at any N. Full-fleet complement to Figure 6.

N	Method	Static-obstacle/task	Robot–robot/task	Static fraction
2	Map2-naive hybrid	2.69	0.36	88%
	Map2-adapted hybrid	0.00	0.42	0%
	Online RRT	0.00	0.23	0%
4	Map2-naive hybrid	2.70	1.30	68%
	Map2-adapted hybrid	0.00	1.08	0%
	Online RRT	0.00	0.82	0%
6	Map2-naive hybrid	2.77	2.21	56%
	Map2-adapted hybrid	0.00	2.09	0%
	Online RRT	0.00	1.37	0%
8	Map2-naive hybrid	2.89	3.09	48%
	Map2-adapted hybrid	0.00	2.92	0%
	Online RRT	0.00	1.98	0%
10	Map2-naive hybrid	2.90	3.93	43%
	Map2-adapted hybrid	0.00	3.99	0%
	Online RRT	0.00	2.69	0%

Table 6. Planning cost on Map2 (15 seeds, 2000 steps). Runtime is mean ± SD wall-clock for the navigation loop on one pinned CPU thread (11th Gen Intel Core i7-11700KF, 3.60 GHz, Python 3.12.3), excluding scenario construction, figure rendering and disk I/O; the RRT-calls-per-task column is the number of online RRT planner invocations per completed task. Online RRT attains the lowest collision rate but at far higher planning cost; the adapted hybrid reuses routes and calls the planner rarely.

N	Method	Runtime (s)	RRT calls/task	Coll./task	Fail rate
2	Map2-naive hybrid	$25.5 \pm 3.8$	0.07	3.05	0.20
	Map2-adapted hybrid	$19.3 \pm 2.7$	0.02	0.42	0.06
	Online RRT	$66.9 \pm 5.8$	1.16	0.23	0.01
4	Map2-naive hybrid	$52.5 \pm 4.9$	0.08	4.00	0.23
	Map2-adapted hybrid	$43.2 \pm 3.9$	0.03	1.08	0.09
	Online RRT	$151.8 \pm 7.7$	1.44	0.82	0.04
6	Map2-naive hybrid	$94.5 \pm 8.3$	0.12	4.98	0.26
	Map2-adapted hybrid	$77.7 \pm 8.6$	0.04	2.09	0.12
	Online RRT	$256.4 \pm 22.6$	1.74	1.37	0.06
8	Map2-naive hybrid	$134.7 \pm 14.2$	0.14	5.97	0.29
	Map2-adapted hybrid	$117.1 \pm 9.7$	0.05	2.92	0.16
	Online RRT	$384.3 \pm 43.8$	2.06	1.98	0.09
10	Map2-naive hybrid	$184.5 \pm 11.2$	0.16	6.83	0.32
	Map2-adapted hybrid	$163.6 \pm 8.3$	0.05	3.99	0.18
	Online RRT	$521.8 \pm 90.9$	2.45	2.69	0.12
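
To put a number on the cost gap between online RRT and the adapted hybrid, the ratios can be read directly off Table 6. The sketch below only transcribes the table's mean values; the dictionary layout and function name are illustrative. Because the RRT-calls-per-task entries are rounded to two decimals, the call ratios are coarse estimates.

```python
# Mean values transcribed from Table 6, keyed by fleet size N:
# (adapted runtime s, online RRT runtime s,
#  adapted RRT calls/task, online RRT calls/task)
table6 = {
    2: (19.3, 66.9, 0.02, 1.16),
    4: (43.2, 151.8, 0.03, 1.44),
    6: (77.7, 256.4, 0.04, 1.74),
    8: (117.1, 384.3, 0.05, 2.06),
    10: (163.6, 521.8, 0.05, 2.45),
}


def cost_ratios(row):
    """Return (runtime ratio, planner-call ratio) of online RRT over adapted hybrid."""
    adapted_rt, online_rt, adapted_calls, online_calls = row
    return online_rt / adapted_rt, online_calls / adapted_calls


ratios = {n: cost_ratios(row) for n, row in table6.items()}
for n, (runtime_ratio, call_ratio) in ratios.items():
    print(f"N={n}: runtime x{runtime_ratio:.1f}, planner calls x{call_ratio:.0f}")
```

On these means, online RRT takes roughly 3.2 to 3.5 times the wall-clock time of the adapted hybrid at every fleet size, and invokes the planner about 40 to 60 times as often per completed task.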

Table 7. Collision-history sharing on the native route library of each map (five robots, 15 seeds, 2000 steps). Only Map1 full participation ($q_{share} = 1$) is nominally significant, and it does not survive correction.

Map	$q_{share}$	Collisions	Coll./task	Tasks	Change	p	$p_{Holm}$	Significant
Map1	0.0	$362 \pm 66$	1.42	255	—	—	—	—
	0.6	$346 \pm 62$	1.41	248	$-4.4\%$	$0.324$	$0.972$	no
	1.0	$316 \pm 54$	1.29	246	$-12.7\%$	$0.023$	$0.093$	no
Map2	0.0	$386 \pm 60$	1.68	231	—	—	—	—
	0.6	$377 \pm 38$	1.64	231	$-2.2\%$	$0.434$	$0.972$	no
	1.0	$379 \pm 61$	1.63	233	$-1.7\%$	$0.370$	$0.972$	no
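
The $p_{Holm}$ column is consistent with a Holm step-down adjustment over the four sharing-versus-baseline comparisons in Table 7. The following is a minimal sketch under that assumption: the family size of four and the 0.05 threshold are inferred from the table rather than stated in the caption, and the function name is illustrative. The inputs are the rounded p-values from the table, so the smallest adjusted value comes out as 0.092 rather than the reported 0.093.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Visit p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Raw p-values from Table 7, in row order:
# Map1 q=0.6, Map1 q=1.0, Map2 q=0.6, Map2 q=1.0.
raw_p = [0.324, 0.023, 0.434, 0.370]
adjusted_p = holm_adjust(raw_p)
significant = [p < 0.05 for p in adjusted_p]  # 0.05 threshold is an assumption

for raw, adj, sig in zip(raw_p, adjusted_p, significant):
    print(f"p = {raw:.3f} -> p_Holm = {adj:.3f}, significant: {sig}")
```

The Map1 full-participation comparison is the only one with a raw p below 0.05; multiplying it by the family size of four pushes it above the threshold, which is why no row is marked significant.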

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
