Preprint
Article

This version is not peer-reviewed.

Active Vision for Social Navigation

Submitted: 01 April 2026
Posted: 02 April 2026


Abstract
Traditional social navigation systems often treat perception and motion as decoupled tasks, leading to reactive behaviors and perceptual surprise due to limited field of view. While active vision—the ability to choose where to look—offers a solution, most existing frameworks decouple sensing from execution to simplify the learning process. This article introduces a novel joint reinforcement learning (RL) framework (Active Vision for Social Navigation) that unifies locomotion and discrete gaze control within a single, end-to-end policy. Unlike existing factored approaches, our method leverages a model-based RL architecture with a latent world model to explicitly address the credit assignment problem inherent in active sensing. Experimental results in cluttered, dynamic environments demonstrate that our joint policy outperforms factored sensing-action approaches by prioritizing viewpoints specifically relevant to social safety, such as checking blind spots and tracking human trajectories. Our findings suggest that tight sensorimotor coupling is essential for reducing perceptual surprise and ensuring safe, socially aware navigation in unstructured spaces.

1. Introduction

A central limitation of the social navigation pipeline is often the narrow instantaneous field of view (FOV) of a forward-facing camera. In cluttered, occluded, or dynamic scenes, pedestrians can enter the robot’s path from oblique angles without being observed until they are already dangerously close, reducing the effectiveness of downstream prediction and increasing the likelihood of high-risk encounters. Active vision addresses this limitation by treating viewpoint selection as a policy in which the robot learns to choose where to look while simultaneously controlling locomotion. Concretely, we implement active vision as a discrete gaze state over a small set of yaw-offset views, enabling proactive sensing behaviors, such as checking blind spots and following human trajectories, that reduce perceptual surprise and improve social navigation in unstructured environments.
This article introduces a new reinforcement learning framework: Active Vision for Social Navigation. Our proposed method augments a perception–prediction–control pipeline with 1) a discrete view-selection mechanism, 2) joint reinforcement learning over locomotion and gaze actions, and 3) a spatiotemporal attention mechanism to predict near-future human occupancy. Each control step consists of three stages: selecting a gaze window index (pan) together with a wheel-motion command; rectifying the selected view and computing compact perceptual channels as the state observation; and providing the RL agent with a fused low-dimensional observation that includes both geometry and semantic cues alongside predicted human occupancies.
Learning viewpoint control and navigation control simultaneously is substantially harder than learning either in isolation, and much of the active-vision literature explicitly avoids this fully coupled formulation. Prior work typically reduces complexity by factorizing sensing and task behavior into separate policies or asynchronous loops, thereby shortening credit-assignment paths and stabilizing learning. Shang and Ryoo formulate active vision with separate sensory and motor policies to address limited observability and indirect supervision for viewpoint control [1]. Dass et al. show that separating information-seeking from task execution makes learning more tractable by assigning distinct roles to context acquisition and downstream decision making [2], while Wang et al. explicitly decouple observation and action in an asynchronous active vision–action framework [3]. Even self-supervised approaches such as [4] sidestep the coupling by first rewarding camera behaviors that improve predictability, rather than requiring task reward to supervise sensing choices directly.
Our formulation learns a single joint policy over the product action space (navigation × view angle), forcing the agent and world model to solve coupled credit assignment. This is challenging for three reasons: gaze actions have delayed utility since a pan action improves future observation quality rather than yielding immediate reward; gradients compete, as early navigation mistakes dominate returns while the signal for beneficial gaze shifts is weak and indirect; and view transitions expand the learned dynamics, changing the observation distribution even when the robot does not move. Despite this difficulty, joint optimization is well-motivated here because viewpoint control is not an auxiliary behavior but a task-dependent component of navigation itself, where the robot looks directly determines what it can infer about nearby humans and traversable space, which in turn shapes the optimal motion decision. Factored approaches [1,2,3] risk biasing the sensing strategy toward broadly informative views rather than those specifically useful for socially safe locomotion, whereas a joint policy can learn task-specific sensorimotor coordination such as briefly redirecting gaze to resolve an ambiguous human interaction before choosing a safer path. This long-horizon coupling is precisely what model-based RL is designed to capture through imagined trajectories over a learned latent world model [5].
For this to work, the world model must learn to predict how viewpoint changes alter subsequent observations. Because gaze and motor commands are jointly sampled and modeled, the latent world model associates pan actions with their resulting visual consequences: when the agent pans toward a pedestrian, the world model links that action to the ensuing observation change and any downstream proxemic penalty. Imagined rollouts can therefore assign higher value to action sequences that proactively redirect gaze toward anticipated human occupancy, producing look-before-you-drive behavior through temporal credit assignment.
Section 2 provides an overview of related work on social navigation. Our proposed architecture is described in Section 3. Section 4 presents a comparison of our method against several competitive benchmarks in three social navigation scenarios. Our results show that active vision yields a statistically significant improvement, achieving more navigation goals with fewer near-collisions.

3. Materials and Methods

Our system architecture is a multi-process pipeline that fuses human intent prediction with terrain-aware reinforcement learning. For each new camera frame, a YOLOv11 detector identifies current pedestrian locations while a spatiotemporal attention mechanism predicts future human occupancy patterns from temporal frame sequences. Concurrently, monocular depth estimation extracts geometric scene structure from the latest RGB frame. These three information streams—current pedestrian masks overlaid on grayscale imagery, attention-based trajectory predictions, and depth maps—are fused into a compact 96×96×3 observation tensor that encodes both social and geometric constraints. This multimodal representation serves as input to a DreamerV3 [5] reinforcement learning agent that learns socially compliant navigation policies through latent world modeling, enabling real-time decision making that respects both human proxemics and terrain traversability in unstructured environments. Figure 1 shows the active vision pipeline in which a single wide-angle fisheye camera provides five discrete view frames. The RL agent learns a joint policy with motor commands and view selection.

3.1. Datasets

We employ the Social Interactive Trajectory (SiT) corpus [17], augmented with sequences collected in the DUnE-Gazebo [18] simulator to ensure coverage of both real and synthetic unstructured scenes, totaling just over 100k images from multiple video sequences. For every video sequence, frames are first subsampled at 10 Hz and assembled into fixed-length windows that respect the temporal offsets in $T_{\text{past}} = \{0, 0.5, 1.0, 1.5, 2.0\}$ s and a prediction horizon $\tau_f = +2.0$ s. Each window thus contains $T = 5$ RGB frames (current + 4 previous) $\{I_{t-k}\}_{k \in T_{\text{past}}}$. A lightweight YOLOv11n model detects pedestrians in all frames; detections are linked across time via a distance-based tracker to yield short trajectories. From the centroids $p_i = (x_i, y_i)$ of pedestrians in the future frame we generate a heatmap
$$H(x, y) = \max_i \exp\!\left(-\frac{\|p_i - (x, y)\|^2}{2\sigma^2}\right), \qquad \sigma = 16\ \text{px},$$
which is normalized to [ 0 , 1 ] . Simultaneously we rasterize pedestrian YOLO center boxes into binary masks M t + k and compute monocular depth D t + k with Depth-Anything V2 [19]. All tensors are resized to 320 × 320 , stored as half-precision mem-maps, and streamed in constant-time batches during training, eliminating GPU out-of-memory crashes when scaling to the full ( ≈  107,000) image dataset.
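The max-of-Gaussians target construction above can be sketched as follows. This is a minimal NumPy illustration with the function name chosen here; it uses the fixed isotropic σ = 16 px from the equation, whereas the training targets in Section 3.2 scale per-axis σ by the bounding-box size.

```python
import numpy as np

def future_occupancy_heatmap(centroids, size=320, sigma=16.0):
    """Rasterize future pedestrian centroids (cx, cy) in pixel coordinates
    into H(x, y) = max_i exp(-||p_i - (x, y)||^2 / (2 sigma^2)).
    Unit-height Gaussians keep the map in [0, 1] by construction."""
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float32)
    heat = np.zeros((size, size), dtype=np.float32)
    for cx, cy in centroids:
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2
        heat = np.maximum(heat, np.exp(-d2 / (2.0 * sigma ** 2)))
    return heat
```

Taking the pixel-wise maximum rather than the sum keeps overlapping pedestrians from saturating the target above 1.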

3.2. Spatiotemporal Attention Mechanism

We predict near-future human occupancy as a dense heatmap from a short RGB history and per-frame person masks. At each step we form a sequence $\{(I_{t-k}, M_{t-k})\}_{k=0}^{T-1}$ of $T$ RGB frames and their binary masks (from YOLO detections). Each pair is encoded by lightweight CNNs with shared structure: strided convolutions downsample from $320 \times 320$ to $80 \times 80$ and $40 \times 40$, producing a bottleneck feature $F_{t-k} \in \mathbb{R}^{40 \times 40 \times C}$ and skip maps at $80 \times 80$ and $40 \times 40$ for decoding. We then concatenate RGB and mask features channel-wise, stack them across the $T$ timesteps, and reshape to a token sequence of length $T \cdot 40 \cdot 40$:
$$Z \in \mathbb{R}^{B \times (T \cdot 40 \cdot 40) \times C}.$$
A linear layer projects Z to an embedding dimension D (configurable; defaults provided in code), followed by LayerNorm and the addition of a learned spatiotemporal positional embedding of the same shape. This absolute embedding lets the model encode both spatial indices and relative time implicitly.
We apply a single SelfAttention block with 8 heads over the token axis. The attention is the standard scaled dot-product:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
implemented inside multi-head projections so each head attends over the D / h -dimensional subspace. The attended sequence is added residually and normalized, then passed through a linear projection and dropout. We then reshape back to [ B , T , 40 , 40 , D ] and perform temporal mean pooling to obtain a single 40 × 40 latent per batch.
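The scaled dot-product step above can be written compactly as follows. This is a single-head NumPy sketch for clarity; the actual model wraps it inside 8-head projections in Flax.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention softmax(Q K^T / sqrt(d_k)) V over the token axis.
    Q, K, V: [tokens, d_k] arrays."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # [tokens_q, tokens_k]
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With multi-head attention, each head applies this same operation to a $D/h$-dimensional projection of the tokens.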
Decoding starts from the pooled 40 × 40 latent. We concatenate the 40 × 40 skip feature via a 1 × 1 projection and upsample with transposed convolutions to 80 × 80 and finally 320 × 320 , concatenating the 80 × 80 skip along the way. A final 3 × 3 convolution with sigmoid produces the occupancy heatmap H ^ [ 0 , 1 ] 320 × 320 .
Training targets H are built from future YOLO detections by rasterizing each person into an elliptical Gaussian centered at the detection centroid, with per-axis standard deviations scaled by the bounding-box size. The model is optimized with a balanced focal+Dice objective:
$$\mathcal{L} = \mathcal{L}_{\text{focal}}(\hat{H}, H) + \mathcal{L}_{\text{dice}}(\hat{H}, H),$$
where the focal term uses the usual ( α , γ ) parameters and the Dice term promotes overlap with soft targets (both implemented in code). For monitoring, we additionally report RMSE over the heatmap.
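A minimal NumPy sketch of the balanced focal+Dice objective follows. The $(\alpha, \gamma)$ defaults and function names here are illustrative assumptions, since the paper defers the exact parameter values to the released code.

```python
import numpy as np

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over pixels; soft targets are allowed."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = -alpha * (1 - pred) ** gamma * target * np.log(pred)
    neg = -(1 - alpha) * pred ** gamma * (1 - target) * np.log(1 - pred)
    return (pos + neg).mean()

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss 1 - 2|X.Y| / (|X| + |Y|), promoting overlap."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def heatmap_loss(pred, target):
    """Balanced focal + Dice objective over predicted/target heatmaps."""
    return focal_loss(pred, target) + dice_loss(pred, target)
```

The focal term down-weights easy background pixels, while the Dice term directly rewards overlap with the soft Gaussian targets.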
We train with Adam (Optax) and gradient clipping (global $\ell_2$ norm 1.0). The learning rate follows a warmup+cosine schedule (linear warmup, then cosine decay), with default mini-batches of 8 sequences and sequence length $T = 5$ (all configurable). The model and optimizer are instantiated in JAX/Flax; training loops include on-the-fly validation and optional TensorBoard logging. Checkpoints save the full parameter PyTree and config for resumption.

3.3. Reinforcement Learning

We build on DreamerV3 [5], which learns a compact latent-space dynamics model from pixels and proprioception and then performs "imagined" rollouts in that latent space, estimating long-horizon returns to optimize an actor–critic policy before executing in the environment. Specifically, DreamerV3 learns a recurrent state-space model (RSSM) whose latent evolves stochastically, $z_t \sim p_\phi(z_t \mid z_{t-1}, a_{t-1})$, while reconstructing observations through a decoder $o_t \sim p_\phi(o_t \mid h_t, z_t)$. Let $h_t$ denote the deterministic recurrent state (e.g., a GRU hidden state), $z_t$ the stochastic latent, and $s_t \triangleq [h_t, z_t]$ the concatenated latent consumed by the actor, critic, and decoders. Actor and critic are then trained on imagined sequences $\{s_t\}_{t=1}^{T}$, enabling sample-efficient control from high-dimensional inputs and reducing sensitivity to pixel noise. In our navigation setting, a fused encoder (RGB + attention + depth) supplies the world model with traversability and goal cues; the learned dynamics support long-horizon, goal-directed planning that captures both skid-steer kinematics and terrain structure.
The attention-based prediction module injects explicit social priors into long-horizon planning. Let $O_{t-\tau:t}$ denote the past $\tau$ seconds of observations and $\hat{p}_{t+\delta} = f_\theta(O_{t-\tau:t})$ the predicted pedestrian positions $\delta$ seconds ahead. These forecasts are fused into the agent’s representation (as an additional observation channel and latent feature), so DreamerV3’s imagined rollouts evaluate candidate actions under anticipated human occupancy rather than relying solely on raw pixels. Practically, this biases both the learned world model and the agent toward trajectories that steer around predicted human locations and sustain comfortable separation.

3.3.1. RL Agent Reward Design

The reward function is designed to balance efficient goal-directed navigation with socially compliant behavior in dynamic, unstructured environments. At each timestep t, the agent receives a composite reward R t that encourages progress toward the navigation target while discouraging unsafe or inefficient behaviors. The reward is computed as:
$$R_t = \alpha\,\Delta d_t + \beta\left(1 - \frac{|\theta_t|}{\pi}\right) + \gamma\,v_t - \delta\,|\omega_t| - \epsilon - \sum_{i=1}^{N} P(d_i) + \kappa\,\Delta d_t\,\mathbb{1}_{\text{aligned}} + R_{\text{goal}}\,\mathbb{1}_{\text{success}}$$
where $\Delta d_t = d_{t-1} - d_t$ represents the decrease in Euclidean distance to the goal (positive for progress), $\theta_t \in [-\pi, \pi]$ is the heading difference between the robot’s current orientation and the target direction, $v_t$ is the linear velocity, and $\omega_t$ is the angular velocity. Social safety is incorporated through a proxemic penalty function $P(d_i)$ applied to each of $N$ nearby human actors at distance $d_i$. Based on Hall’s proxemic zones [8], the penalty grows in discrete stages:
$$P(d) = \begin{cases} 25.0 & \text{if } d < 0.5\ \text{m (intimate zone)} \\ 8.0 & \text{if } 0.5\ \text{m} \le d < 0.8\ \text{m (personal zone)} \\ 2.0 & \text{if } 0.8\ \text{m} \le d < 1.2\ \text{m (social zone)} \\ 0 & \text{otherwise} \end{cases}$$
The indicator function $\mathbb{1}_{\text{aligned}}$ provides an alignment bonus when the robot makes forward progress ($\Delta d_t > 0$) while well-aligned with the target ($|\theta_t| < \pi/4$). The goal reward $R_{\text{goal}} = 100.0$ is granted when the robot reaches within 0.3 m of the goal. The hyperparameters are set to $\alpha = 2.0$ (distance progress), $\beta = 0.03$ (heading alignment), $\gamma = 0.015$ (velocity), $\delta = 1.8$ (spin penalty), $\epsilon = 0.008$ (time penalty), and $\kappa = 0.3$ (alignment bonus). Episodes terminate after 6000 steps. Together, these terms provide a dense and balanced learning signal that guides the agent toward socially aware, efficient navigation through rough terrain populated with dynamic pedestrians.
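The reward terms above can be assembled as follows. This is a sketch using the stated hyperparameters; the sign conventions on the penalty terms reflect our reading of the composite equation, and the function names are our own.

```python
import math

def proxemic_penalty(d):
    """Staged penalty from Hall's proxemic zones (distance d in meters)."""
    if d < 0.5:
        return 25.0   # intimate zone
    if d < 0.8:
        return 8.0    # personal zone
    if d < 1.2:
        return 2.0    # social zone
    return 0.0

def step_reward(delta_d, theta, v, omega, human_dists, goal_reached=False,
                alpha=2.0, beta=0.03, gamma=0.015, delta=1.8,
                eps=0.008, kappa=0.3, r_goal=100.0):
    """Composite per-step reward R_t balancing progress, alignment,
    motion smoothness, a per-step time cost, and proxemic safety."""
    aligned = delta_d > 0 and abs(theta) < math.pi / 4
    r = (alpha * delta_d                         # progress toward goal
         + beta * (1.0 - abs(theta) / math.pi)   # heading alignment
         + gamma * v                             # forward velocity
         - delta * abs(omega)                    # spin penalty
         - eps                                   # time penalty
         - sum(proxemic_penalty(d) for d in human_dists)
         + kappa * delta_d * (1.0 if aligned else 0.0))
    if goal_reached:
        r += r_goal
    return r
```

A step with 0.1 m of aligned progress and no nearby humans yields a small positive reward, while a single intimate-zone intrusion dominates everything else, which is the intended proxemic pressure.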

3.3.2. Environment Configuration

We conduct all experiments in the DUnE simulation testbed [18]. Utilizing the Gazebo Fortress simulation engine, we test in three different environments: Inspection, Construction, and Island (Figure 2). The robot used in testing is the FictionLab Leo Rover (0.985 m wheel-base, differential drive)  [20]. Each scenario is initialized with a random goal position and 2–3 dynamic human actors that follow pre-scripted patrol waypoints within the same area as the random goal points.

3.3.3. Multi-Process Architecture Integration

A separate inference process takes camera images and applies a YOLOv11n-based pedestrian detector to each sampled frame. The resulting detections are converted into binary pedestrian masks, which are fused with the RGB inputs and passed to a spatiotemporal attention model. This model predicts a heatmap representing the likelihood of future human locations in the scene, leveraging both motion history and semantic context. Concurrently, the most recent RGB frame is processed by a monocular depth estimator (Depth-Anything V2) to obtain a normalized depth map. These three signals (the grayscale image with the current pedestrian mask, the predicted pedestrian attention heatmap, and the depth image) are resized and fused into a $96 \times 96 \times 3$ tensor suitable for downstream reinforcement learning. The resulting observation is written to a third shared memory block (rl_observation) using an atomic update protocol that includes a timestamp and validity flag.
The overall architecture ensures isolation between perception and control, enabling asynchronous inference and action cycles. By relying on shared memory rather than message-passing, the system minimizes latency and avoids blocking between modules. Synchronization is enforced via timestamp-based freshness checks and atomic memory layout conventions. This integration strategy supports continuous, low-latency operation in real-time robotics environments, with clear modular boundaries between ROS-based frame acquisition, attention inference, and RL-based navigation.
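A minimal sketch of the timestamp-plus-validity-flag update protocol described above, using Python's stdlib multiprocessing.shared_memory. The header layout (one validity byte plus an 8-byte timestamp) and the function names are illustrative assumptions, not the paper's exact wire format.

```python
import struct
import time
from multiprocessing import shared_memory
import numpy as np

OBS_SHAPE = (96, 96, 3)
OBS_BYTES = int(np.prod(OBS_SHAPE))   # uint8 payload size
HEADER = struct.Struct("<Bd")         # validity flag, float64 timestamp

def write_observation(shm, obs):
    """Invalidate, write the payload, then set flag + timestamp last,
    so a reader never observes a torn (half-written) frame."""
    shm.buf[0] = 0                                        # mark stale
    shm.buf[HEADER.size:HEADER.size + OBS_BYTES] = obs.tobytes()
    HEADER.pack_into(shm.buf, 0, 1, time.time())          # publish

def read_observation(shm, max_age=0.5):
    """Return (obs, timestamp) if the block is valid and fresh, else None."""
    valid, ts = HEADER.unpack_from(shm.buf, 0)
    if not valid or time.time() - ts > max_age:
        return None
    payload = bytes(shm.buf[HEADER.size:HEADER.size + OBS_BYTES])
    return np.frombuffer(payload, dtype=np.uint8).reshape(OBS_SHAPE), ts
```

Writing the flag last (and clearing it first) gives single-writer/single-reader consistency without locks, matching the freshness-check synchronization described in the text.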

3.3.4. RL Agent Image Observation Space Design

At each time $t$ the agent receives a $96 \times 96 \times 3$ tensor
$$O_t = I_t^{\text{gray}} \,\|\, \hat{H}_t \,\|\, D_t,$$
where $\|$ denotes channel-wise concatenation.
  • Channel 1: Grayscale image with masks. The current RGB frame is converted to luminance and normalised to $[-1, 1]$. Pixels inside the YOLO pedestrian bounding boxes are overwritten with the constant value 1 to emphasize occupied regions.
  • Channel 2: Human traversability heat-map $\hat{H}_t \in [0, 1]^{96 \times 96}$ predicted by the attention model; values are logit-scaled to $(0, 1)$.
  • Channel 3: Depth map $D_t$ acquired from a simulated Intel RealSense at 30 Hz, clipped to 0–10 m and linearly mapped to $[-1, 1]$.
Before concatenation each channel is resized with bilinear interpolation and subjected to per-channel mean–variance normalisation computed over the training set. This fusion encodes physical obstacles (depth), socially preferred regions (heat-map), and fine-grained appearance cues (grayscale) in a compact representation compatible with convolutional encoders. Figure 3 illustrates the information received by the RL agent.
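The three-channel fusion can be sketched as follows in NumPy. Nearest-neighbor subsampling stands in for the bilinear resize, and the per-channel dataset normalisation is omitted; the value ranges follow the channel descriptions above.

```python
import numpy as np

def fuse_observation(gray, person_mask, heatmap, depth, size=96):
    """Fuse luminance+mask, occupancy heatmap, and depth into [96, 96, 3].
    gray: [H, W] in [0, 255]; person_mask: [H, W] bool; heatmap: [H, W]
    in [0, 1]; depth: [H, W] in meters."""
    def resize(img):
        h, w = img.shape
        ys = np.arange(size) * h // size   # nearest-neighbor subsampling
        xs = np.arange(size) * w // size
        return img[np.ix_(ys, xs)]

    c1 = resize(gray).astype(np.float32) / 127.5 - 1.0   # luminance -> [-1, 1]
    c1[resize(person_mask.astype(np.uint8)) > 0] = 1.0   # overwrite detections
    c2 = resize(heatmap).astype(np.float32)              # already in [0, 1]
    c3 = np.clip(resize(depth), 0.0, 10.0) / 5.0 - 1.0   # 0-10 m -> [-1, 1]
    return np.stack([c1, c2, c3], axis=-1)
```

Keeping the pedestrian overwrite after normalisation pins detected regions to the channel maximum, so they survive the convolutional encoder's downsampling.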

3.4. Perception Pipeline

At runtime the robot maintains a discrete set of $K = 5$ view windows corresponding to fixed yaw offsets $\{-60°, -30°, 0°, +30°, +60°\}$, spanning approximately 160° of azimuthal coverage. The policy maintains a vis_window state variable indicating the currently selected window, and the action includes an incremental pan command (left / hold / right) that moves the window index by at most one view per step. Any view can therefore be reached within at most four decisions. Each raw fisheye frame is converted to a rectified, rectilinear window via a precomputed remap look-up table (LUT) bank containing one LUT per yaw bin. This rectification step is applied to the currently selected window before computing the downstream perceptual channels.
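The incremental pan mechanics reduce to a clamped index update, sketched below; the 0/1/2 encoding for left/hold/right is an assumption, since the text only names the three options.

```python
# Yaw bins for the K = 5 rectified view windows (degrees).
YAW_OFFSETS_DEG = (-60, -30, 0, +30, +60)

def update_view_window(vis_window, pan_action, n_views=5):
    """Apply an incremental pan command to the discrete view index.
    pan_action: 0 = left, 1 = hold, 2 = right (assumed encoding).
    The index moves by at most one view per step and is clamped,
    so any view is reachable within four decisions."""
    delta = {0: -1, 1: 0, 2: +1}[pan_action]
    return max(0, min(n_views - 1, vis_window + delta))
```

Clamping at the extremes (rather than wrapping) matches the fixed 160° azimuthal span: panning left from the leftmost window is a no-op.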
Given the rectified selected-view image, three perception modules run in parallel:
1. YOLO inference produces person detections, which are rendered as thick bounding-box borders that survive downsampling.
2. Monocular depth inference produces a geometry channel from the same selected view.
3. Sobel edge magnitude provides a lightweight structural signal that preserves obstacle boundaries and traversability-relevant texture at low resolution.
These outputs are combined (the YOLO mask is applied to the Sobel edge image) and concatenated with the grayscale image to form a compact fused image $I_t^{\text{fused}} \in \mathbb{R}^{96 \times 96 \times 3}$, which serves as the primary visual input to the RL agent.
The spatiotemporal attention mechanism requires a consistent view angle to maintain temporal coherence; it therefore always operates on the center view stream rather than the actively selected view. The attention network consumes a short RGB-plus-YOLO mask history from the center view and produces a dense occupancy heatmap indicating near-future human presence. This center-view heatmap is summarized into a compact density vector by concatenating column-wise and row-wise sums, producing a 192-dimensional feature:
$$h_t \in \mathbb{R}^{192}.$$
This representation captures the spatial distribution of predicted human occupancy while remaining small enough to include directly in the RL observation dictionary.
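The density-vector summarization is a pair of axis sums, sketched below; we assume the center-view heatmap has been resized to 96 × 96 so that 96 column sums plus 96 row sums yield the 192-dimensional feature.

```python
import numpy as np

def heatmap_density_vector(heatmap_96):
    """Summarize a 96x96 occupancy heatmap into a 192-D feature by
    concatenating column-wise sums (over rows) and row-wise sums
    (over columns)."""
    h = np.asarray(heatmap_96, dtype=np.float32)
    return np.concatenate([h.sum(axis=0), h.sum(axis=1)])
```

The two marginals preserve where predicted occupancy concentrates horizontally and vertically while discarding the full 2-D layout, which keeps the RL observation dictionary compact.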
At each step the RL process reads the latest observation from shared memory. The observation dictionary is structured as follows:
  • Fused image $I_t^{\text{fused}} \in \{0, \ldots, 255\}^{96 \times 96 \times 3}$: the stacked outputs of YOLO/Sobel, depth, and grayscale from the selected rectified view.
  • Heatmap-derived vector $h_t \in \mathbb{R}^{192}$: row and column sums of the center-view attention heatmap.
  • View token vis_window $\in \{0, 1, 2, 3, 4\}$: the active window index.
  • IMU $(\text{roll}, \text{pitch}, \text{yaw}) \in \mathbb{R}^3$: orientation from the onboard inertial measurement unit.
  • Target $(\text{distance}, \text{relative azimuth}) \in \mathbb{R}^2$: goal-relative features for PointNav.
  • Velocities $(\text{linear}, \text{angular}) \in \mathbb{R}^2$: current wheel velocities.
The agent acts in a flat discrete action space that is factored into a movement branch and a pan branch:
$$N_{\text{move}} \times N_{\text{pan}} = (5 \times 12) \times 3 = 180.$$
The movement branch enumerates five speed levels ($\{-0.5, 0.0, 0.2, 0.35, 0.7\}$ m/s) and twelve heading targets uniformly spaced in $[-\pi, \pi)$, while the pan branch provides three incremental options: pan left, hold, or pan right. A small penalty of 0.02 is subtracted from the reward function (5) when a non-zero pan is chosen, discouraging random scanning and encouraging deliberate gaze shifts.
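One plausible decoding of the flat 180-way action index is sketched below. The index layout and the sign of the first (reverse) speed level are assumptions; only the branch sizes (5 speeds × 12 headings × 3 pan options) are given in the text.

```python
import math

# Speed levels in m/s; the sign of the first level is an assumption
# recovered from the flattened text (a reverse speed).
SPEEDS = (-0.5, 0.0, 0.2, 0.35, 0.7)

def decode_action(a):
    """Decode a flat action index in [0, 180) into (speed, heading, pan).
    Assumed layout: a = (speed_idx * 12 + heading_idx) * 3 + pan_idx,
    with pan_idx in {0: left, 1: hold, 2: right}."""
    pan_idx = a % 3
    move = a // 3
    heading_idx = move % 12
    speed_idx = move // 12
    heading = -math.pi + heading_idx * (2.0 * math.pi / 12)  # in [-pi, pi)
    return SPEEDS[speed_idx], heading, pan_idx
```

Flattening the factored space this way lets a standard discrete-action policy head cover movement and gaze jointly, which is what makes the coupled credit assignment of Section 1 possible in the first place.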

3.5. Curriculum Training

Because joint gaze-plus-navigation learning is brittle early in training, we employ a staged curriculum that gradually increases scene complexity while keeping the joint action coupling fixed throughout. As shown in Figure 4, training begins in the Inspection environment with point-goal navigation over unstructured terrain and minimal obstacles, allowing the agent to acquire basic locomotion and goal-directed control. We then incrementally increase static obstacle density, followed by progressive insertion of dynamic human actors (one, then two, then three). This staging encourages discovery of proactive gaze shifts that reduce surprise encounters while maintaining forward progress. The agent is trained in Inspection until convergence at approximately $1.6 \times 10^7$ environment steps.
After convergence in Inspection, the same agent (initialized from learned weights) is transferred to the Island environment, which emphasizes close-proximity human interactions with comparatively few static obstacles. This phase stresses social awareness and near-field proxemic behavior, refining gaze selection for early detection of pedestrians entering the robot’s path. Training continues until convergence at approximately $3.6 \times 10^7$ steps.
Figure 4. Episodic reward during curriculum training of the joint Active Vision + social-navigation RL agent policy. The agent is first trained in the Inspection world; task difficulty is increased by introducing dynamic pedestrians (and additional obstacles), producing a temporary reward drop followed by recovery as the coupled gaze–motion policy adapts. Training then transfers to the Island world (close-proximity interactions) and finally to the Construction world (dense clutter and constrained routes), each transition causing a sharp reward decrease and subsequent reconvergence. Blue shows per-episode returns and orange shows a smoothed trend.
Finally, the agent is transferred to the Construction environment, characterized by dense static clutter, tight corridors, two dynamic actors, and limited feasible routes. This stage couples constrained motion planning with human avoidance under severe occlusion and funneling effects. Training proceeds until convergence at approximately $5.0 \times 10^7$ steps.

3.6. Comparison Method: NAV2

Navigation2 (Nav2) is the standard open-source navigation framework for ROS 2, providing a modular, lifecycle-managed autonomy stack that integrates mapping/localization, planning, control, recovery behaviors, and behavior-tree task execution via replaceable plugins and standardized interfaces [21,22]. For global planning we use the Smac Planner family, which offers high-performance search-based planners (Cost-Aware A*, Hybrid-A*, and State Lattice) that enforce kinematic feasibility while leveraging cost fields to improve behavior in cluttered environments; Smac also exposes interpretable penalty terms (e.g., for non-straight motion, turn-direction changes, and reverse motion) and supports multiple motion models and goal-orientation modes, making it a strong, reproducible baseline within the ROS 2 ecosystem [23]. We compare against Nav2 as a geometric baseline because a 2-D LiDAR-centric stack provides inherently wide situational awareness that vision-only navigation often lacks: our configuration uses a full-azimuth planar scan (306 beams), enabling continuous obstacle detection, which is advantageous with moving pedestrians. We additionally evaluate a Nav2 w/ reverse variant that permits bidirectional motion. Under uniformly sampled PointNav goals, targets frequently fall behind the robot’s current heading, and allowing reverse maneuvers avoids time-consuming in-place skid steer rotations in those cases. This stronger bidirectional variant gives Nav2 additional advantage for goal completion, providing a more rigorous baseline against which to evaluate AVSN’s social navigation capabilities.

4. Results

To quantify changes in the robot’s gaze while learning the social navigation policy, we plot the distribution of view windows. The window usage distribution plots shown in the upper row of Figure 5 and Figure 6 summarize how often each discrete active-vision window is selected over an evaluation. Each bar corresponds to one window index and its height indicates the total number (or fraction) of timesteps for which that window was active. A highly peaked distribution implies the policy repeatedly favors a small subset of views (potentially indicating a persistent goal/obstacle bias or limited exploration of viewpoints), whereas a flatter distribution indicates more diverse sensing behavior with frequent switching across windows.
The bottom row of Figure 5 and Figure 6 shows a histogram of the goal-alignment error, aggregated over all valid timesteps in the evaluation. Each bar counts the frequency of alignment errors within a fixed angular bin spanning 0° to 180° (36 bins total, i.e., 5° per bin), providing a compact view of how consistently the robot’s active view is oriented toward the current navigation goal. Two reference thresholds are overlaid: 30° (“good” alignment) and 60° (“acceptable” alignment). Concentration of mass to the left of 30° indicates that the sensing direction is well aligned with the goal bearing, whereas a heavier tail beyond 60° indicates frequent misalignment and less goal-directed sensing.
Table 1 summarizes robot–human encounter statistics across the three Gazebo environments under six conditions: Attention, YOLO+Depth, No Social Nav, Nav2 (forward-only), Nav2 (with reverse enabled), and the proposed Active Vision variant. The YOLO+Depth condition provides human detections without attention-based future state prediction. Encounters are binned by minimum separation thresholds (0.5 m, 0.8 m, and 1.2 m), where smaller distances indicate higher proxemic risk, and task throughput is measured by total navigation goals completed. Active Vision yields the lowest critical close-proximity interaction counts ($d_{\min} < 0.5$ m) and the highest goals-per-encounter ratios across all environments. Measured by goals per $d_{\min} < 0.5$ m encounter, Active Vision outperforms Attention by 68%, 42%, and 145% in Inspection, Island, and Construction respectively, and outperforms YOLO+Depth by 115%, 122%, and 403%. These gains are achieved while maintaining comparable goal completion on Island and competitive throughput in Inspection and Construction, indicating that proactive gaze selection improves social safety without consistently sacrificing task performance. Active Vision reduces high-risk interactions relative to both classical navigation and our Attention method and ablations across all three unstructured environments.
To assess statistical differences between navigation conditions, we segmented each 90-minute evaluation into non-overlapping windows of $G = 10$ consecutive goals. For each window we computed risk seconds per goal, defined as the number of 1 Hz time steps during which the minimum robot–human distance fell below 0.5 m, divided by $G$. Individual navigation episodes time out after 207 seconds, after which the robot and actors respawn at a randomly selected starting position, reducing spatial autocorrelation between episodes within a run. This goal-aligned windowing normalizes for differences in navigation speed across conditions, producing multiple observations per condition suitable for inferential testing.
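The goal-aligned windowing can be sketched as follows. The function signature and the convention that a window spans all steps up to its final goal completion are assumptions about the bookkeeping, not details given in the text.

```python
import numpy as np

def risk_seconds_per_goal(min_dists, goal_steps, G=10, threshold=0.5):
    """Segment a 1 Hz run into non-overlapping windows of G consecutive
    goals and compute, per window, the number of steps with minimum
    robot-human distance below `threshold`, divided by G.
    min_dists: per-step minimum distance (m);
    goal_steps: sorted step indices at which each goal was completed."""
    out = []
    for w in range(len(goal_steps) // G):
        start = 0 if w == 0 else goal_steps[w * G - 1] + 1
        end = goal_steps[(w + 1) * G - 1]              # inclusive
        seg = np.asarray(min_dists[start:end + 1])
        out.append(float((seg < threshold).sum()) / G)
    return out
```

Each returned value is one observation for the nonparametric tests described next; normalizing by goals rather than wall-clock time is what removes the speed confound between conditions.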
For each environment we conducted six planned pairwise comparisons using the Mann–Whitney U test, a nonparametric rank-based test that does not assume normally distributed observations. Effect sizes are reported as Cohen’s d with pooled standard deviation, and 95% confidence intervals for the difference in means use Welch–Satterthwaite degrees of freedom. To control the family-wise error rate across comparisons within each metric, we applied Holm–Bonferroni step-down correction ( α = 0.05 ).
As shown in Table 2, across all three environments, Active Vision consistently reduced proxemic risk, measured as risk seconds per goal (time steps with $d_{\min} < 0.5$ m, normalized by goals). In Inspection, Active Vision achieved significantly lower risk than both the No Social Nav baseline ($\Delta = 0.1825$, $p_{\text{corr}} < 0.001$) and Nav2 (w/reverse) ($\Delta = 0.4456$, $p_{\text{corr}} < 0.001$), and also improved over Attention ($\Delta = 0.0722$, $p_{\text{corr}} = 0.01$) (Table 1). In Island, where close-proximity interactions dominate, Active Vision produced a large and significant reduction in risk relative to No Social Nav ($\Delta = 0.4308$, $p_{\text{corr}} < 0.001$) and Nav2 (w/reverse) ($\Delta = 0.3919$, $p_{\text{corr}} < 0.001$), while remaining statistically indistinguishable from Attention and YOLO+Depth. In the most constrained Construction world, Active Vision again significantly reduced risk compared to No Social Nav ($\Delta = 0.2407$, $p_{\text{corr}} < 0.001$) and Nav2 (w/reverse) ($\Delta = 0.3316$, $p_{\text{corr}} < 0.001$), but did not differ from Attention. Overall, the primary social-navigation outcome is that both attention-based policies and Active Vision substantially decrease time spent within 0.5 m of pedestrians compared to baselines lacking social reasoning, with the strongest improvements observed against Nav2 and No Social Nav across environments.

5. Conclusions

This article demonstrates the value of active vision for social navigation. In our implementation, viewpoint selection is a jointly learned decision variable alongside locomotion. The key contributions are: (1) a discrete gaze mechanism over five yaw-offset rectified views spanning 160°, integrated into a flat 180-action space; (2) a dual-stream perception architecture that decouples temporally consistent center-view social prediction from actively controlled selected-view scene understanding; (3) a compact observation interface combining a 96 × 96 × 3 fused image, a 192-D heatmap density vector, a view token, and proprioceptive features; and (4) a three-stage curriculum (Inspection → Island → Construction) that stabilizes the inherently difficult joint gaze-plus-navigation credit assignment.
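For concreteness, the compact observation interface of contribution (3) could be assembled as below (shapes taken from the text; the dictionary keys and the proprioceptive dimensionality are illustrative assumptions, not identifiers from the released code):

```python
import numpy as np

def make_observation(num_views=5, proprio_dim=6):
    """Illustrative layout of the AVSN observation; values are placeholders."""
    return {
        "fused_image": np.zeros((96, 96, 3), dtype=np.uint8),  # YOLO/depth/edge fusion
        "heatmap": np.zeros(192, dtype=np.float32),            # social-prediction density
        "view_token": np.zeros(num_views, dtype=np.float32),   # one-hot over 5 gaze views
        "proprio": np.zeros(proprio_dim, dtype=np.float32),    # dimensionality assumed
    }
```

A flat 180-action policy head would then consume this dictionary at every control step, emitting both the locomotion command and the next gaze view.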
Experimental evaluation across three Gazebo environments demonstrated that AVSN consistently achieves the lowest critical encounter counts (d_min < 0.5 m) and the highest goals-per-encounter ratios among all tested conditions. Statistical testing confirmed that Active Vision significantly reduces proxemic risk relative to both the No Social Nav baseline and Nav2 (w/reverse) in all three environments (p_corr < 0.001), while in Inspection it also improved significantly over the Attention-only condition (p_corr = 0.01). These results confirm that proactive gaze control substantially improves social compliance in unstructured environments without consistently sacrificing task throughput.

Author Contributions

Conceptualization, J.V. and G.S.; software, J.V.; investigation, J.V.; writing—original draft preparation, J.V.; writing—review and editing, G.S.; visualization, J.V.; supervision, G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Source code is available at https://github.com/jackvice/attention, https://github.com/jackvice/dreamerJMV3, and https://github.com/jackvice/RoboTerrain.

Acknowledgments

During the preparation of this study, the authors used ChatGPT5 and Gemini 3 for the purposes of grammar correction and paragraph structuring. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Network
FOV: Field of View
LSTM: Long Short-Term Memory
MPC: Model Predictive Control
RL: Reinforcement Learning
YOLO: You Only Look Once

References

  1. Shang, J.; Ryoo, M.S. Active vision reinforcement learning under limited visual observability. Advances in Neural Information Processing Systems 2023, 36, 10316–10338. [Google Scholar]
  2. Dass, S.; Hu, J.; Abbatematteo, B.; Stone, P.; Martín-Martín, R. Learning to look: Seeking information for decision making via policy factorization. arXiv 2024. arXiv:2410.18964. [CrossRef]
  3. Wang, G.; Li, H.; Zhang, S.; Guo, D.; Liu, Y.; Liu, H. Observe then act: Asynchronous active vision-action model for robotic manipulation. IEEE Robotics and Automation Letters, 2025. [Google Scholar]
  4. Grimes, M.K.; Modayil, J.V.; Mirowski, P.W.; Rao, D.; Hadsell, R. Learning to look by self-prediction. Transactions on Machine Learning Research, 2023. [Google Scholar]
  5. Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T. Mastering diverse control tasks through world models. Nature 2025, 1–7. [Google Scholar]
  6. Rios-Martinez, J.; Spalanzani, A.; Laugier, C. From proxemics theory to socially-aware navigation: A survey. International Journal of Social Robotics 2015, 7, 137–153. [Google Scholar] [CrossRef]
  7. Dautenhahn, K. Socially intelligent robots: Dimensions of human-robot interaction. Philosophical Transactions of the Royal Society B: Biological Sciences 2007, 362, 679–704. [Google Scholar] [CrossRef] [PubMed]
  8. Hall, E.T. The Hidden Dimension; Anchor Books: New York, NY, USA, 1966. [Google Scholar]
  9. Thrun, S.; Fox, D.; Burgard, W. The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine 1997, 4, 23–33. [Google Scholar]
  10. Alonso-Mora, J.; Breitenmoser, A.; Rufli, M.; Beardsley, P.; Siegwart, R. Optimal reciprocal collision avoidance for multiple non-holonomic robots. In Distributed Autonomous Robotic Systems: The 10th International Symposium; Springer, 2013; pp. 203–216. [Google Scholar]
  11. Charalampous, K.; Kostavelis, I.; Gasteratos, A. Robot navigation in large-scale social maps: An action recognition approach. Expert Systems with Applications 2016, 66, 261–273. [Google Scholar] [CrossRef]
  12. Vega, A.; Cintas, R.; Manso, L.J.; Bustos, P.; Núñez, P. Socially-accepted path planning for robot navigation based on social interaction spaces. In Proceedings of the Iberian Robotics Conference; Springer, 2019; pp. 644–655. [Google Scholar]
  13. Tai, L.; Zhang, J.; Liu, M.; Burgard, W. Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia; 2018; pp. 1111–1117. [Google Scholar]
  14. Xie, Z.; Dames, P. DRL-VO: Learning to navigate through crowded dynamic scenes using velocity obstacles. IEEE Transactions on Robotics 2023, 39, 2700–2719. [Google Scholar] [CrossRef]
  15. Golchoubian, M.; Ghafurian, M.; Dautenhahn, K.; Azad, N.L. Uncertainty-aware DRL for autonomous vehicle crowd navigation in shared space. IEEE Transactions on Intelligent Vehicles, 2024. [Google Scholar]
  16. Wang, Y.; Xie, Y.; Xu, D.; Shi, J.; Fang, S.; Gui, W. Heuristic dense reward shaping for learning-based map-free navigation of industrial automatic mobile robots. ISA Transactions 2025, 156, 579–596. [Google Scholar] [CrossRef] [PubMed]
  17. Bae, J.; Kim, J.; Yun, J.; Kang, C.; Choi, J.; Kim, C.; Lee, J.; Choi, J.; Choi, J. SiT Dataset: Socially Interactive Pedestrian Trajectory Dataset for Social Navigation Robots. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, New Orleans, LA, USA, 2023. [Google Scholar]
  18. Vice, J.M.; Sukthankar, G. DUnE: A Versatile Dynamic Unstructured Environment for Off-Road Navigation. Robotics 2025, 14, 35. [Google Scholar] [CrossRef]
  19. Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. arXiv 2024. arXiv:2406.09414. [Google Scholar]
  20. Ali, N. Computer vision-guided autonomous grasping system using Leo Rover with robotic arm. Master’s thesis, University of South-Eastern Norway, 2024. [Google Scholar]
  21. Macenski, S.; Martín, F.; White, R.; Ginés Clavero, J. The Marathon 2: A Navigation System. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020. [Google Scholar]
  22. Macenski, S.; Moore, T.; Lu, D.; Merzlyakov, A.; Ferguson, M. From the desks of ROS maintainers: A survey of modern and capable mobile robotics algorithms in the robot operating system 2. Robotics and Autonomous Systems 2023. [Google Scholar]
  23. Macenski, S.; Booker, M.; Wallace, J. Open-Source, Cost-Aware Kinematically Feasible Planning for Mobile and Surface Robotics. 2024. [Google Scholar]
Figure 1. Active vision social-navigation pipeline. A wide-angle fisheye camera provides five discrete view frames spanning a 160° field of view. At each step the RL agent outputs both motor commands and a view-selection action. The selected view is rectified and processed in parallel by YOLO person detection, monocular depth estimation, and Sobel edge extraction; these three outputs are stacked into a compact 3-channel fused image used for control. In parallel, the center view feeds YOLO bounding boxes into a temporal buffer that is passed to the spatiotemporal attention mechanism, which produces a social-prediction observation for the agent. This design decouples temporally consistent social prediction from actively controlled viewpoint perception while learning joint gaze and locomotion policies end-to-end.
Figure 2. The three environments used for our experiments: (a) Inspection, (b) Construction, and (c) Island. Inspection features sloped terrain with industrial obstacles, Construction has constrained navigation space with debris, and Island features uneven terrain with a few large obstacles. Inspection and Island contain three human actors each; Construction contains two.
Figure 3. The three channels of the RL agent observation with grayscale (left), prediction heatmap with current and predicted future locations (middle), and depth (right). In the upper row, we can see the current and predicted positions in the heatmap. The middle row has two pedestrian detections.
Figure 5. Initial alignment distribution as the agent learns static obstacles. Top row: window-selection histograms at (a) 1e6 steps, (b) 2e6 steps, and (c) 3e6 steps. Bottom row: corresponding alignment distributions, in degrees of deviation from the next-goal direction, where alignment combines robot orientation and view selection.
Figure 6. Window-selection distribution and goal alignment, from static-obstacle convergence (a, left) to full social-navigation convergence (c, right). For static obstacles the center window dominates the distribution, while social navigation requires looking left and right to prevent social encounters.
Table 1. Aggregate performance over 90-minute evaluations across three Gazebo environments (Inspection, Construction, Island) for six navigation conditions: Attention, YOLO+Depth, No Social Nav, Nav2 (forward-only), Nav2 (with reverse enabled), and Active Vision. Robot–human interactions are summarized as counts of time steps in which the minimum robot–human distance falls below 0.5 m, 0.8 m, and 1.2 m (smaller thresholds indicate higher proxemic risk). Goals reports the total number of navigation goals completed. Goals per <0.5 m and Goals per all <0.8 m normalize task throughput by close-proximity interactions (higher is better), where "all <0.8 m" includes encounters with d_min < 0.5 m.
| Condition | <0.5 m | <0.8 m | <1.2 m | Goals | Goals per <0.5 m | Goals per all <0.8 m |
|---|---|---|---|---|---|---|
| Construction | | | | | | |
| Nav2 (forward) | 92 | 65 | 124 | 278 | 3.02 | 1.77 |
| Nav2 (w/reverse) | 97 | 61 | 138 | 320 | 3.30 | 2.03 |
| YOLO+Depth | 14 | 17 | 77 | 142 | 10.14 | 4.58 |
| No Social Nav | 64 | 88 | 168 | 284 | 4.44 | 1.87 |
| Attention (ours) | 5 | 11 | 54 | 104 | 20.8 | 6.50 |
| Active Vision (ours) | 3 | 13 | 59 | 153 | 51 | 9.56 |
| Island | | | | | | |
| Nav2 (forward) | 192 | 213 | 245 | 485 | 2.53 | 1.20 |
| Nav2 (w/reverse) | 236 | 217 | 261 | 597 | 2.53 | 1.32 |
| YOLO+Depth | 20 | 40 | 137 | 440 | 22 | 7.33 |
| No Social Nav | 214 | 208 | 247 | 531 | 2.48 | 1.26 |
| Attention (ours) | 13 | 46 | 139 | 448 | 34.46 | 7.59 |
| Active Vision (ours) | 11 | 45 | 186 | 537 | 48.82 | 9.59 |
| Inspection | | | | | | |
| Nav2 (forward) | 206 | 250 | 239 | 196 | 0.95 | 0.43 |
| Nav2 (w/reverse) | 196 | 211 | 283 | 197 | 1.01 | 0.48 |
| YOLO+Depth | 77 | 84 | 123 | 361 | 4.69 | 2.24 |
| No Social Nav | 108 | 108 | 129 | 439 | 4.06 | 2.03 |
| Attention (ours) | 54 | 74 | 121 | 324 | 6 | 2.53 |
| Active Vision (ours) | 29 | 39 | 126 | 293 | 10.10 | 4.31 |
Table 2. Statistical comparisons for Construction, Island, and Inspection environments using goal-normalized windows (G = 10 goals per window). Δ: difference in means (A−B). d: Cohen's d. p_corr: Holm–Bonferroni corrected p-value. Risk secs per goal: seconds with d_min < 0.5 m per goal achieved (lower = safer). * p < 0.05, ** p < 0.01, *** p < 0.001.
| Social Encounters per Goal (d_min < 0.5 m) | Δ | d | p_corr |
|---|---|---|---|
| Construction World | | | |
| Attention vs YOLO+Depth | -0.0808 | -0.68 | 0.26 |
| Attention vs No Social Nav | -0.2107 | -1.14 | 0.002 ** |
| Attention vs Nav2 (w/reverse) | -0.3016 | -1.10 | 0.008 ** |
| Active Vision vs Attention | -0.0300 | -0.65 | 0.26 |
| Active Vision vs No Social Nav | -0.2407 | -1.39 | <0.001 *** |
| Active Vision vs Nav2 (w/reverse) | -0.3316 | -1.28 | <0.001 *** |
| Island World | | | |
| Attention vs YOLO+Depth | -0.0216 | -0.28 | 0.71 |
| Attention vs No Social Nav | -0.4262 | -2.57 | <0.001 *** |
| Attention vs Nav2 (w/reverse) | -0.3874 | -1.90 | <0.001 *** |
| Active Vision vs Attention | -0.0045 | -0.08 | 0.71 |
| Active Vision vs No Social Nav | -0.4308 | -2.72 | <0.001 *** |
| Active Vision vs Nav2 (w/reverse) | -0.3919 | -2.00 | <0.001 *** |
| Inspection World | | | |
| Attention vs YOLO+Depth | -0.0684 | -0.60 | 0.01 * |
| Attention vs No Social Nav | -0.1103 | -0.66 | 0.01 * |
| Attention vs Nav2 (w/reverse) | -0.3734 | -1.72 | <0.001 *** |
| Active Vision vs Attention | -0.0722 | -0.63 | 0.01 * |
| Active Vision vs No Social Nav | -0.1825 | -1.11 | <0.001 *** |
| Active Vision vs Nav2 (w/reverse) | -0.4456 | -2.04 | <0.001 *** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.