Submitted:
30 January 2026
Posted:
02 February 2026
Abstract

Keywords:
1. Introduction
- Probabilistic orchestration for heterogeneous agents. We introduce an orchestration framework that fuses fast ego perception with delayed, high-confidence external evidence using calibrated confidence, association reliability, and latency-aware policies.
- Uncertainty-aware decision policy. We define an interpretable fusion mechanism that produces an uncertainty-aware output distribution, reducing brittleness compared to deterministic selection and enabling stable behaviour under delayed evidence.
- Controlled teacher–student adaptation. We integrate pseudo-labelling and knowledge distillation into the orchestration layer through explicit gating, enabling continual improvement while reducing the risk of error amplification.
- Dual-camera obstacle recognition case study. We demonstrate the approach in an infrastructure-assisted perception setting and analyse robustness under distance, occlusion, and latency variability.
2. Related Work
2.1. Multi-Agent Orchestration and Decision Coordination
2.2. Uncertainty-Aware Fusion and Confidence Calibration
2.3. Teacher–Student Learning, Pseudo-Labelling, and Knowledge Distillation
2.4. Multi-Camera and Infrastructure-Assisted Perception for Autonomous Driving
2.5. Positioning of the Proposed Approach
2.6. Implications for Generative Multi-Agent Systems
3. Use Case and Problem Formulation: Dual-Camera Road Obstacle Recognition
3.1. Problem Motivation and Failure Modes in Ego-Only Perception
- Long-range degradation: Distant obstacles occupy few pixels, reducing discriminative features and increasing confusion with background patterns.
- Occlusion and partial visibility: Obstacles may be partially blocked by other vehicles or road geometry, resulting in fragmented detections or missed objects.
- Illumination changes: Shadows, glare, and nighttime conditions can distort appearance and reduce detector confidence.
- Motion blur: Rapid ego motion and vibration degrade image quality, especially at higher speeds.
- Domain shift: Weather, camera settings, and geographic differences lead to distribution shift and degraded generalization.
3.2. Infrastructure-Assisted Perception Scenario
- Latency and jitter: Sensor outputs are delayed by acquisition, processing, and network transmission, and may arrive with variable delay.
- Association uncertainty: A stationary camera’s detection must be matched to the ego camera’s observation of the same physical object across different perspectives.
- Partial coverage: Not all environments are instrumented; stationary evidence may be intermittent.
- Heterogeneous failure modes: Stationary detectors can also fail due to weather, occlusion, or miscalibration, and their confidence may not be directly comparable to ego confidence.
3.3. System Overview and Multi-Agent Perception Pipeline
- An ego detection agent that performs real-time object detection from the vehicle’s forward-facing camera.
- A geometric localization agent that maps detection outputs into a shared spatial representation (e.g., using homography or camera calibration).
- A stationary detection agent that detects obstacles from a roadside camera and communicates predictions to the vehicle.
- A cross-view association agent that establishes correspondence between the ego and the stationary detections.
| Agent | Function | Typical Latency | Typical Confidence | Output |
| A1: Ego Detector | Real-time obstacle detection on ego camera | Low | Medium (drops at distance/occlusion) | Bounding boxes + class probabilities |
| A2: Ego Localization | Map ego detections to ground plane / shared coordinates | Low | Medium | Localized boxes / estimated position |
| A3: Stationary Detector | Obstacle detection from road-side camera | Medium | High (better long-range) | Bounding boxes + class probabilities |
| A4: Cross-View Association | Match detections across ego and stationary views | Medium | High when confident match exists | Association score + matched object IDs |
3.4. Cross-View Association and Shared Coordinate Representation
- Geometric alignment: Ego detections are projected into a shared ground-plane or world coordinate frame using a homography transformation derived from camera calibration and planar road assumptions (details in Appendix A). Stationary detections are similarly mapped to the shared frame.
- Probabilistic matching: Candidate matches are evaluated based on:
  ∘ spatial proximity in the shared frame (e.g., Euclidean distance between projected centers),
  ∘ temporal consistency accounting for latency and ego motion (penalizing large delays),
  ∘ semantic agreement between predicted classes (e.g., similarity of class distributions).
3.5. Learning Setup: Offline Proof-of-Concept with Feedback-Driven Adaptation
- curate pseudo-labelled training pairs,
- fine-tune the ego detector on examples where it is weak (e.g., distant obstacles),
- distil stationary soft labels into the ego model,
- and deploy updated ego model versions under controlled conditions.
3.6. Scope and Assumptions
In scope:
- Vision-based detection/classification of road obstacles using ego and stationary cameras.
- Confidence-aware and latency-aware orchestration across heterogeneous agents.
- Cross-view association and reliability scoring.
- Offline/nearline adaptation through pseudo-labelling and knowledge distillation.
Out of scope:
- End-to-end motion planning and control.
- Multi-modal fusion with lidar/radar (future work).
- High-definition mapping and full SLAM integration.
- Online model updates without validation (unsafe in practice).
4. Probabilistic Orchestrator Architecture and Decision Policy
4.1. Orchestrator Design Goals
- Robustness under uncertainty: reconcile conflicting predictions and preserve uncertainty when evidence is insufficient.
- Latency-aware decision making: produce real-time outputs even when higher-confidence evidence is delayed.
- Association-aware fusion: incorporate cross-view correspondence reliability as first-class evidence.
- Controlled learning enablement: generate decision traces and high-confidence supervision signals without amplifying errors.
4.2. Teacher-Student Orchestrator Design, Flow, and Realization
4.2.1. From Student Agent to Orchestrator Agent
- Fine-tuning and transfer learning for YOLO
- Knowledge Distillation (KD) and KD integration
- Extended loss functions for KD-enhanced YOLO training
4.2.2. Component I: Fine-Tuning and Transfer Learning for YOLO
- Either public pretrained weights: yolov8n.pt, yolov8s.pt, yolo11n.pt, etc.
- Or a custom car-camera model: carcam_yolov8n.pt.
- Basic visual concepts (e.g., edges or the shape of a car) are not re-learned.
- The model is only adapted to:
  ∘ our camera viewpoint (car + intersection geometry),
  ∘ tiny or distant humans and cars.
- Optionally, early layers are frozen so that mainly the high-level features and the detection head are adapted.
4.2.3. Component II: Knowledge Distillation and Cross-View Integration
- Teacher says: “This is PERSON.”
- Student uses a hard label: class=person, 1 vs 0.
- Teacher says:
  ∘ class probabilities: P_teacher = [person: 0.82, car: 0.17, dog: 0.01, ...]
  ∘ bounding box regression confidences
- Stationary camera = stronger teacher (better angle, stable, less motion blur).
- Car camera = weaker student (far objects, motion blur, etc.).
4.2.4. Component III: KD Loss Functions (Improved YOLO Loss Function)
- Classification loss
- Box regression loss
- Objectness loss
1. Classification KD Loss
| Symbol | Meaning |
| $z_{\text{teacher}}$ | Teacher logits (vector of un-normalized scores, one per class) |
| $z_{\text{student}}$ | Student logits |
| $\mathrm{softmax}(z / T)$ | Probability distribution over classes at temperature $T$ |
| $\mathrm{KL}(p \parallel q)$ | KL divergence between distributions $p$ and $q$ |
| $T$ | Temperature (usually 2–10) |
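As a rough illustration of the classification KD term described by the table above, the following sketch computes a temperature-softened KL divergence between teacher and student logits in plain Python. The function names, the $T^2$ scaling (a common convention in the distillation literature), and the example logits are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math

def softmax_t(logits, temperature):
    """Softmax of logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_cls_loss(teacher_logits, student_logits, temperature=4.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is an assumed convention, not confirmed by the source.
    """
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

# Hypothetical logits for the classes [person, car, dog].
teacher = [4.0, 2.4, -0.4]
student = [1.0, 1.5, 0.2]
loss = kd_cls_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge.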
2. Bounding Box Distillation Loss
- $b_{\text{student}}$ = student bounding box prediction
- $b_{\text{teacher}}$ = teacher bounding box prediction
- Student (car camera) learns better geometry, even when its own camera view is poor
- Helps especially on:
  ∘ far pedestrians
  ∘ side-view objects
  ∘ small objects (< 16 px)
3. Objectness Distillation Loss
- Ego (car) camera gets more false positives (reflections, poles, lights…)
- Stationary camera sees objects more clearly → gives reliable objectness
- Fewer false positives
- Stronger positive signals for far but real objects
- Better small object recall
| Component | Purpose |
| YOLO classification | Student still learns hard labels |
| YOLO bounding-box | Student learns from its own view |
| YOLO objectness | Student filters own FP/TP |
| KD-cls | Learns teacher’s “soft opinion” |
| KD-bounding-box | Learns teacher’s geometry |
| KD-obj | Learns teacher’s sense of objectness |
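One possible reading of the table above is a weighted sum of the three native YOLO terms and the three distillation terms. The sketch below shows that reading in plain Python. The L1 form of the box term, the binary cross-entropy form of the objectness term, all weights, and all numeric values are hypothetical choices, since the source does not state the exact formulas.

```python
import math

def box_l1(student_box, teacher_box):
    """Mean absolute difference between two (x, y, w, h) boxes (assumed form)."""
    return sum(abs(s - t) for s, t in zip(student_box, teacher_box)) / 4.0

def objectness_bce(student_obj, teacher_obj, eps=1e-7):
    """Binary cross-entropy of student objectness against a soft teacher target."""
    s = min(max(student_obj, eps), 1.0 - eps)
    return -(teacher_obj * math.log(s) + (1.0 - teacher_obj) * math.log(1.0 - s))

def total_loss(yolo_terms, kd_terms, kd_weights):
    """Sum the native YOLO terms and the weighted distillation terms."""
    native = sum(yolo_terms.values())
    distilled = sum(kd_weights[name] * value for name, value in kd_terms.items())
    return native + distilled

yolo_terms = {"cls": 0.40, "box": 0.30, "obj": 0.20}  # hypothetical values
kd_terms = {
    "cls": 0.10,  # e.g. a temperature-softened KL term
    "box": box_l1((0.50, 0.50, 0.20, 0.40), (0.52, 0.49, 0.22, 0.41)),
    "obj": objectness_bce(0.60, 0.90),
}
kd_weights = {"cls": 1.0, "box": 0.5, "obj": 0.5}     # hypothetical weights
loss = total_loss(yolo_terms, kd_terms, kd_weights)
```

Keeping the native and distilled terms in separate dictionaries makes it easy to switch distillation off (all weights set to zero) for an ablation.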
4.3. Decision Objective (Expected Utility Formulation)
- a utility function capturing the benefit/cost of decisions (e.g., reward for correct obstacle classification, penalty for missed obstacles, penalty for false alarms),
- a latency cost.
- a real-time constraint on decision latency,
- an association reliability constraint for fusion.
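To make the expected-utility idea concrete, the sketch below picks the action that maximizes the expected utility under a class distribution, minus a latency cost. The action names, the utility table, and the linear latency cost are hypothetical; the source only states that a utility function and a latency cost exist.

```python
def expected_utility(action, class_probs, utility, latency_s, latency_cost_per_s):
    """Expected utility of an action under class_probs, minus a linear latency cost."""
    gain = sum(p * utility[(action, label)] for label, p in class_probs.items())
    return gain - latency_cost_per_s * latency_s

def best_action(actions, class_probs, utility, latency_s, latency_cost_per_s):
    """Return the action with the highest latency-penalized expected utility."""
    return max(
        actions,
        key=lambda a: expected_utility(a, class_probs, utility, latency_s, latency_cost_per_s),
    )

# Hypothetical utilities: missing an obstacle is far worse than a false alarm.
utility = {
    ("report", "obstacle"): 1.0,   # correct obstacle classification
    ("report", "clear"): -0.5,     # false alarm
    ("ignore", "obstacle"): -5.0,  # missed obstacle
    ("ignore", "clear"): 0.0,
}
probs = {"obstacle": 0.2, "clear": 0.8}
choice = best_action(["report", "ignore"], probs, utility,
                     latency_s=0.1, latency_cost_per_s=0.5)
```

With these asymmetric utilities, even a 20% obstacle probability favours reporting. In this toy example the latency cost is identical for both actions, so it only matters when alternatives with different delays (such as waiting for stationary evidence) are compared.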
4.4. Bounding Box Selection and Fusion
- Selection (default): if stationary evidence is used (α > 0), select the box from the higher-confidence source; otherwise, use the ego box.
- Fusion (optional): compute a weighted average of the two boxes in a shared coordinate frame.
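The two options above can be sketched as follows. Using the mixing weight α as the averaging weight is an assumption, as are the function names and the box layout; both boxes are assumed to be expressed in the same shared frame already.

```python
def select_box(ego_box, ego_conf, stat_box, stat_conf, alpha):
    """Selection: use the higher-confidence box only when stationary evidence is used."""
    if alpha > 0.0 and stat_box is not None and stat_conf > ego_conf:
        return stat_box
    return ego_box

def fuse_box(ego_box, stat_box, alpha):
    """Fusion: convex combination of two boxes given in the same shared frame."""
    return tuple((1.0 - alpha) * e + alpha * s for e, s in zip(ego_box, stat_box))

# Hypothetical boxes as (x_min, y_min, x_max, y_max) in shared coordinates.
ego = (0.0, 0.0, 2.0, 2.0)
stationary = (1.0, 1.0, 4.0, 4.0)
fused = fuse_box(ego, stationary, alpha=0.5)
```

Selection keeps one source's geometry intact, while fusion smooths between the two and is only meaningful when the association is reliable.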
4.5. Confidence Calibration and Uncertainty Measures
- Entropy: $-\sum_k p_k \log p_k$ of the output class distribution (high entropy implies uncertainty).
- Margin: difference between top-1 and top-2 probabilities.
- Consistency: agreement between ego and stationary top labels.
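The three measures above are simple to compute from class distributions. The following is a minimal sketch with hypothetical function names; it assumes each distribution is a list of probabilities that sums to one and has at least two classes.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin(probs):
    """Difference between the top-1 and top-2 probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def consistent(ego_probs, stat_probs):
    """True when ego and stationary agree on the top label."""
    return ego_probs.index(max(ego_probs)) == stat_probs.index(max(stat_probs))

p_ego = [0.45, 0.40, 0.15]    # hypothetical, ambiguous ego prediction
p_stat = [0.82, 0.17, 0.01]   # hypothetical, confident stationary prediction
```

Here the ego distribution has higher entropy and a smaller margin than the stationary one, while both agree on the top label.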
4.6. Orchestrator as a Gating Mechanism for Controlled Learning
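- the association confidence is above its threshold (high association confidence),
- the teacher confidence is above the teacher confidence threshold,
- the stationary delay is within the acceptable delay for labelling, and
- the orchestrator uncertainty is below its threshold (low uncertainty).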
4.7. Algorithm Summary
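- Receive the ego prediction and its localization.
- If a stationary prediction is available, receive its association score.
- Calibrate the ego and stationary class distributions.
- Compute the mixing weight α.
- If the stationary delay exceeds the fusion window or the association score is below its threshold, set α = 0 (ego-only output); otherwise fuse the calibrated distributions with weight α.
- Output the fused distribution, its uncertainty, and the box (selection or fusion).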
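The steps above can be sketched as a single decision function. The exponential form of the mixing weight, the convex combination used for fusion, and all default thresholds are assumptions for illustration; the inputs are assumed to be calibrated already.

```python
import math

def mixing_weight(match_score, delay_s, decay_per_s):
    """Assumed form: association score discounted exponentially by delay."""
    return match_score * math.exp(-decay_per_s * delay_s)

def orchestrate(p_ego, p_stat, match_score, delay_s,
                max_delay_s=0.5, min_match=0.6, decay_per_s=2.0):
    """Return (output distribution, alpha) for one ego detection."""
    # Fall back to ego-only output when stationary evidence is missing,
    # too late, or unreliably associated.
    if p_stat is None or delay_s > max_delay_s or match_score < min_match:
        return list(p_ego), 0.0
    alpha = mixing_weight(match_score, delay_s, decay_per_s)
    fused = [(1.0 - alpha) * e + alpha * s for e, s in zip(p_ego, p_stat)]
    return fused, alpha

# Hypothetical two-class example: [clear, obstacle].
fused, alpha = orchestrate([0.6, 0.4], [0.1, 0.9], match_score=0.9, delay_s=0.1)
```

Because the output is a convex combination of two distributions, it remains a valid distribution, and the stationary influence shrinks smoothly as delay grows instead of switching abruptly.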
5. Feedback-Driven Adaptation: Teacher–Student Fine-Tuning and Knowledge Distillation
5.1. Learning Loop Overview
- Candidate collection: identify time steps where stationary predictions are available and association confidence is high.
- Pseudo-label gating: filter candidates using teacher confidence, latency constraints, and uncertainty thresholds to reduce label noise.
- Student adaptation: fine-tune the ego detector using pseudo-labels, optionally combined with a small amount of ground-truth labeled data.
- Controlled deployment: evaluate and register the updated model; deploy only if it improves key metrics without increasing false alarms beyond an acceptable limit.
5.2. Pseudo-Label Curation and Quality Gates
- the association confidence from A4,
- the calibrated teacher probability distribution from A3,
- the stationary latency,
- the orchestrator uncertainty.
- an association reliability threshold,
- a teacher confidence threshold,
- a maximum delay allowed for supervision,
- an uncertainty threshold to avoid ambiguous learning signals.
- spatial consistency: projected positions must be within a fixed radius in the shared coordinate frame,
- temporal consistency: association must persist for a minimum number of consecutive frames.
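The quality gates listed above amount to a conjunction of threshold checks. The sketch below is one way to express them; the class names, field names, and every threshold value are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    match_score: float      # association confidence from A4
    teacher_probs: list     # calibrated teacher distribution from A3
    delay_s: float          # stationary latency
    uncertainty: float      # orchestrator uncertainty
    distance_m: float       # gap between projected positions in the shared frame
    persisted_frames: int   # consecutive frames with a stable association

@dataclass
class Gates:
    # All thresholds are hypothetical placeholders.
    min_match: float = 0.8
    min_teacher_conf: float = 0.9
    max_delay_s: float = 0.3
    max_uncertainty: float = 0.5
    max_distance_m: float = 1.0
    min_frames: int = 3

def accept(c, g):
    """Accept a pseudo-label only when every gate passes."""
    return (
        c.match_score >= g.min_match
        and max(c.teacher_probs) >= g.min_teacher_conf
        and c.delay_s <= g.max_delay_s
        and c.uncertainty <= g.max_uncertainty
        and c.distance_m <= g.max_distance_m
        and c.persisted_frames >= g.min_frames
    )

good = Candidate(0.92, [0.95, 0.04, 0.01], 0.12, 0.2, 0.4, 5)
late = Candidate(0.92, [0.95, 0.04, 0.01], 0.80, 0.2, 0.4, 5)
```

A rejected candidate is simply skipped, so a failed gate produces no learning update rather than a noisy one.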
5.3. Student Model and Training Objective
- the detection loss is the standard detector loss under pseudo-label supervision,
- the distillation loss transfers teacher soft labels to the student,
- a weighting coefficient controls distillation strength.
5.4. Detection Loss (Pseudo-Label Supervision)
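- the target class and box are the pseudo-label class and box from the teacher,
- the predicted class and box are student outputs,
- the classification term is typically cross-entropy or focal loss,
- the box term is an IoU/GIoU/DIoU loss,
- the objectness term models objectness confidence.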
5.5. Knowledge Distillation Loss (Soft Label Transfer)
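- the teacher and student logits are the un-normalized class scores of the two models,
- the distillation temperature softens both class distributions,
- the two softened distributions are compared using the Kullback–Leibler divergence.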
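For orientation, a commonly used form of such a training objective is written out below. The symbols ($\lambda$ for the distillation weight, $T$ for the temperature, $z_{\text{teacher}}$ and $z_{\text{student}}$ for the logits), the unweighted sum inside the detection loss, and the $T^2$ factor are assumptions based on standard distillation practice, not a reconstruction of the authors' exact equations.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{det}} + \lambda\,\mathcal{L}_{\text{KD}},
\qquad
\mathcal{L}_{\text{det}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{box}} + \mathcal{L}_{\text{obj}},
\qquad
\mathcal{L}_{\text{KD}} = T^{2}\,\mathrm{KL}\!\left(\mathrm{softmax}\!\left(z_{\text{teacher}}/T\right)\,\middle\|\,\mathrm{softmax}\!\left(z_{\text{student}}/T\right)\right)
```

Setting $\lambda = 0$ recovers plain fine-tuning on pseudo-labels, which corresponds to a fine-tuning-only ablation.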
5.6. Triggering and Scheduling Adaptation
- the pseudo-label set size exceeds a minimum size, or
- a fixed time window elapses.
- pseudo-label yield and diversity are sufficient (e.g., enough long-range examples),
- mean teacher confidence exceeds a threshold,
- and (optionally) drift indicators increase (e.g., rising uncertainty or reconstruction error for hard examples).
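The trigger and readiness conditions above can be sketched as two small predicates. The function names and all default values are hypothetical, and the optional drift indicators are omitted for brevity.

```python
def should_trigger(num_pseudo_labels, seconds_since_last,
                   min_labels=500, window_s=86400.0):
    """Trigger when enough pseudo-labels have accumulated or the time window elapsed."""
    return num_pseudo_labels > min_labels or seconds_since_last >= window_s

def ready_to_train(num_long_range, mean_teacher_conf,
                   min_long_range=50, min_mean_conf=0.9):
    """Proceed only if the set is diverse enough and the teacher is confident on average."""
    return num_long_range >= min_long_range and mean_teacher_conf >= min_mean_conf

# Hypothetical state of the pseudo-label buffer.
triggered = should_trigger(num_pseudo_labels=620, seconds_since_last=3600.0)
ready = ready_to_train(num_long_range=80, mean_teacher_conf=0.93)
```

Separating the trigger from the readiness check means that an elapsed time window alone never starts training on a thin or low-confidence pseudo-label set.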
5.7. Controlled Deployment, Rollback, and Safety Checks
- a held-out validation set (if available),
- a curated test set representing difficult cases (distant/occluded obstacles),
- and operational constraints (latency, false alarm tolerance).
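A deployment gate of the kind described in this section can be expressed as a comparison between baseline and candidate metrics. In the sketch below, the metric names, the dictionary layout, and the tolerance values are hypothetical.

```python
def deployment_gate(baseline, candidate,
                    max_fp_increase=0.01, max_latency_increase_ms=2.0):
    """Deploy only if recall improves while false positives and latency stay bounded."""
    improves_recall = candidate["recall"] > baseline["recall"]
    fp_ok = candidate["fp_rate"] - baseline["fp_rate"] <= max_fp_increase
    latency_ok = candidate["latency_ms"] - baseline["latency_ms"] <= max_latency_increase_ms
    return improves_recall and fp_ok and latency_ok

# Hypothetical metrics measured on the curated hard-case test set.
baseline = {"recall": 0.71, "fp_rate": 0.020, "latency_ms": 18.0}
candidate = {"recall": 0.78, "fp_rate": 0.025, "latency_ms": 18.5}
deploy = deployment_gate(baseline, candidate)
```

If the gate fails, the previously registered checkpoint stays in service, which is the rollback behaviour implied by the section title.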
5.8. Relation to Reinforcement Learning and “Judging-as-Reward”
- pseudo-label acceptance corresponds to positive reinforcement signals,
- pseudo-label rejection corresponds to negative feedback (no learning update),
- deployment gating corresponds to policy constraints and safe rollout.
6. Implementation
6.1. System Components and Runtime Pipeline
- predict(input) → output
- metadata() → confidence/latency statistics
- trace() → evidence and internal signals used for auditing
- Ego inference runs synchronously, frame-by-frame, at each time index.
- Stationary camera predictions arrive asynchronously with a delay (and optional jitter).
- The orchestrator performs event-driven fusion whenever new stationary evidence becomes available within the fusion window.
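The predict/metadata/trace interface listed above could be captured with a structural type so that heterogeneous agents can be swapped without changing the orchestrator. The sketch below uses Python's typing.Protocol; the return types and the toy agent are assumptions, since the source gives only the three method names.

```python
from typing import Any, Dict, Protocol

class Agent(Protocol):
    """Uniform interface assumed for all agents (A1 to A4)."""
    def predict(self, input: Any) -> Any: ...
    def metadata(self) -> Dict[str, float]: ...
    def trace(self) -> Dict[str, Any]: ...

class ConstantAgent:
    """Toy agent that always returns the same class distribution."""
    def __init__(self, probs, latency_ms):
        self._probs = probs
        self._latency_ms = latency_ms
        self._last_input = None

    def predict(self, input):
        self._last_input = input
        return self._probs

    def metadata(self):
        return {"mean_confidence": max(self._probs), "latency_ms": self._latency_ms}

    def trace(self):
        return {"last_input": self._last_input, "probs": self._probs}

def run(agent: Agent, frame):
    """The orchestrator only relies on the three interface methods."""
    return agent.predict(frame), agent.metadata(), agent.trace()

output, meta, audit = run(ConstantAgent([0.8, 0.2], latency_ms=12.0), "frame-0")
```

Any class with these three methods satisfies the protocol, so a real detector wrapper and a simulated stationary agent can be used interchangeably in experiments.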
6.2. Ego Detector (Student Model)
- Resize to the detector input resolution (e.g., 640×640).
- Normalize pixel values to a fixed range (e.g., [0, 1]).
- Optional augmentation for robustness (color jitter, blur, random crop).
6.3. Stationary Detector (Teacher Agent A3)
- stable viewpoint,
- reduced motion blur,
- improved long-range visibility.
- bounding boxes,
- class distribution,
- objectness scores.
- timestamp,
- estimated delay,
- calibration parameters or confidence summary statistics.
- fixed delay, or
- stochastic delay with jitter (a nominal delay plus random jitter), truncated to positive values.
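The two delay models can be simulated with a few lines of Python. The Gaussian form of the jitter is an assumption (the source states only that the stochastic delay is truncated to positive values), and the function name and parameters are hypothetical.

```python
import random

def sample_delay(mean_s, jitter_std_s=0.0, rng=None):
    """Fixed delay when jitter_std_s is zero; otherwise a Gaussian delay
    truncated to positive values by rejection sampling."""
    if jitter_std_s <= 0.0:
        return mean_s
    rng = rng or random.Random()
    while True:
        delay = rng.gauss(mean_s, jitter_std_s)
        if delay > 0.0:  # reject non-positive draws instead of clamping them
            return delay

rng = random.Random(0)  # seeded for repeatable experiments
fixed = sample_delay(0.3)
jittered = [sample_delay(0.3, 0.1, rng) for _ in range(5)]
```

Rejection sampling keeps the shape of the distribution above zero, whereas clamping would pile probability mass at the lower bound.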
6.4. Confidence Calibration (Required for Fusion)
6.5. Geometric Localization (Agent A2) and Shared Coordinate Frame
6.6. Cross-View Association (Agent A4)
- Spatial score: Gaussian penalty over the distance between projected positions in the shared frame.
- Temporal score: exponential penalty in the stationary delay.
- Semantic score: class agreement (e.g., cosine similarity between class distributions or KL divergence).
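The three scores above can be combined into a single match confidence. The sketch below multiplies them, which is one plausible combination rule rather than the one confirmed by the source; the bandwidth and decay parameters are hypothetical, and cosine similarity is used for the semantic term.

```python
import math

def spatial_score(dist_m, sigma_m=1.5):
    """Gaussian penalty over the distance between projected positions."""
    return math.exp(-(dist_m ** 2) / (2.0 * sigma_m ** 2))

def temporal_score(delay_s, decay_per_s=2.0):
    """Exponential penalty in the stationary delay."""
    return math.exp(-decay_per_s * delay_s)

def semantic_score(p, q):
    """Cosine similarity between two class distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0.0 else 0.0

def match_score(dist_m, delay_s, p_ego, p_stat):
    """Product of the three scores (the combination rule is an assumption)."""
    return spatial_score(dist_m) * temporal_score(delay_s) * semantic_score(p_ego, p_stat)

p_ego = [0.55, 0.35, 0.10]   # hypothetical class distributions
p_stat = [0.82, 0.17, 0.01]
score = match_score(dist_m=0.8, delay_s=0.15, p_ego=p_ego, p_stat=p_stat)
```

Each factor lies in [0, 1] for valid inputs, so the product does too, and a poor score on any single criterion is enough to suppress the match.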
6.7. Orchestrator Runtime and Logging
- collects ego predictions,
- checks for matched stationary evidence within window constraints,
- computes the fusion weight α,
- outputs the fused belief, the label, and the uncertainty.
- agent contributions (ego-only vs fused),
- calibrated ego and stationary distributions,
- match confidence,
- delay,
- fusion weight,
- uncertainty,
- threshold values and gating decisions.
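A decision trace with the fields listed above could be stored as one JSON line per decision. The record layout and field names in this sketch are hypothetical; only the kinds of logged quantities come from the text.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class DecisionTrace:
    frame_index: int
    source: str               # "ego-only" or "fused"
    p_ego: list               # calibrated ego distribution
    p_stat: list              # calibrated stationary distribution (empty if unmatched)
    match_confidence: float
    delay_s: float
    fusion_weight: float
    uncertainty: float
    thresholds: dict          # threshold values in force for this decision
    pseudo_label_accepted: bool

def to_json_line(trace):
    """Serialize one trace as a single JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)

trace = DecisionTrace(
    frame_index=42, source="fused",
    p_ego=[0.6, 0.4], p_stat=[0.1, 0.9],
    match_confidence=0.9, delay_s=0.1, fusion_weight=0.7, uncertainty=0.55,
    thresholds={"min_match": 0.6, "max_delay_s": 0.5},
    pseudo_label_accepted=False,
)
line = to_json_line(trace)
```

Append-only JSON lines are easy to filter afterwards, for example to collect accepted pseudo-labels or to audit which agent influenced a given decision.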
6.8. Pseudo-Label Logging and Dataset Construction
6.9. Student Fine-Tuning and Distillation Workflow
- checkpoint versioning,
- validation-based early stopping,
- and a deployment gate requiring recall improvement while bounding false positive increase and latency regression (Section 5.7).
6.10. Configuration and Reproducibility
- model checkpoint paths,
- calibration temperature values,
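- gating thresholds (association reliability, teacher confidence, uncertainty),
- timing windows (fusion window, maximum supervision delay),
- latency decay rate,
- training triggers (minimum pseudo-label set size, time window),
- distillation parameters (temperature, distillation weight).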
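One way to keep these settings reproducible is a single immutable configuration object whose hash is stored with every run. In the sketch below, all field names and values are hypothetical placeholders (the checkpoint file names are the examples mentioned in Section 4.2.2).

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    # All values are hypothetical placeholders.
    ego_checkpoint: str = "carcam_yolov8n.pt"
    stationary_checkpoint: str = "yolov8s.pt"
    calibration_temperature: float = 1.5
    min_match: float = 0.8
    min_teacher_conf: float = 0.9
    max_uncertainty: float = 0.5
    fusion_window_s: float = 0.5
    max_supervision_delay_s: float = 0.3
    latency_decay_per_s: float = 2.0
    min_pseudo_labels: int = 500
    retrain_window_s: float = 86400.0
    kd_temperature: float = 4.0
    kd_weight: float = 0.5

def fingerprint(cfg):
    """Short stable hash of the configuration, useful for tagging runs."""
    blob = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

cfg = ExperimentConfig()
run_tag = fingerprint(cfg)
```

Because the dataclass is frozen, a changed threshold requires a new object and therefore a new fingerprint, which makes silent configuration drift between runs easier to spot.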
7. Evaluation
7.1. Evaluation Goals
7.2. Dataset and Experimental Setup
7.3. Baselines
7.4. Experimental Protocol
7.4.1. Calibration and Threshold Selection
7.4.2. Latency Modelling
- fixed delay conditions (e.g., 100 ms, 300 ms, 500 ms),
- jittered delay conditions (a nominal delay with random jitter).
7.4.3. Association Reliability
7.4.4. Adaptation Schedule
- performance of the initial student model (before adaptation),
- performance of the adapted student model (after one or more adaptation cycles).
7.5. Results
7.5.1. Overall Performance Comparison
- B2 improves recall over B1 but increases false positives under association errors and delayed evidence.
- B3 improves recall over B1 while controlling false positives due to match-weighted, latency-aware fusion.
- B4 yields additional gains in long-range recall by improving the ego detector itself.
7.5.2. Effect of Latency and Jitter
- B2 may show oscillations and increased decision churn at moderate delays.
- B3 remains stable and degrades gracefully as the delay increases.
- B4 remains comparable to B3 at inference time, and also reduces reliance on stationary evidence through improved ego performance.
7.5.3. Effect of Association Gating
- B3 exhibits a favourable trade-off where a moderate association threshold maximizes benefit.
- B2 is more sensitive to the association threshold due to hard switching.
7.5.4. Ablation: Benefit of Distillation and Pseudo-Label Gating
- Ablation A: B4 without KD (fine-tuning only).
- Ablation B: B4 without pseudo-label gating (accept all).
- Ablation C: B4 without latency constraints in pseudo-label selection.
- Ablation D: B4 without association thresholding.
7.6. Discussion of Findings
7.7. Limitations
8. Discussion
8.1. Why Deterministic Orchestration Fails in Heterogeneous Agent Systems
- Conflict without uncertainty: deterministic policies require a single “winner,” even when evidence is ambiguous, producing unstable or overconfident decisions.
- Confidence mismatch: deep models often produce miscalibrated probabilities, so using raw confidence as a switching criterion can amplify errors.
- Association fragility: cross-view association is inherently uncertain; deterministic fusion treats association as binary, leading to incorrect overrides and false alarms.
- Latency blindness: delayed evidence may arrive after action is taken; deterministic policies may oscillate when late signals are applied without timing-aware logic.
8.2. The Contribution of Probabilistic Orchestration
- calibrated confidence from heterogeneous agents,
- association reliability as a match probability, and
- latency-aware weighting to limit the influence of delayed evidence.
8.3. Orchestration as a Learning Manager: Safe Feedback vs Error Amplification
8.4. Generalization Beyond the Dual-Camera Use Case
- fast/low-cost agents (real-time edge models),
- slow/high-quality agents (infrastructure sensors, cloud services, heavy models),
- association/verification agents (matching, checking consistency),
- and an orchestrator that must decide under uncertainty and timing constraints.
- cooperative perception and V2X sensor networks,
- robotics perception with multi-view cameras and remote supervision,
- anomaly detection pipelines where slow forensic analysis informs fast detection,
- multi-sensor medical imaging triage pipelines.
8.5. Relevance to GenAI Multi-Agent Orchestration
- a fast, low-cost LLM for drafting responses,
- a slower, high-quality LLM for refinement,
- retrieval agents that fetch evidence,
- verifiers that validate claims,
- policy agents that enforce safety and compliance.
- Teacher agent: a higher-quality LLM or verified tool output,
- Student agent: a smaller or faster model operating under latency/cost constraints,
- Association/verification: evidence linking, citation checks, tool validity checks,
- Orchestrator: a probabilistic policy that fuses model outputs and triggers refinement or verification based on uncertainty.
8.6. Safety, Accountability, and Explainability
- auditability: tracking which agent influenced each decision,
- explainability: communicating the evidence used (confidence, match score, latency),
- intervention hooks: enabling downstream policies such as “do not act when uncertainty is high.”
8.7. Limitations and Future Work
- Dependence on calibration and association: The orchestrator assumes that confidence can be calibrated and association reliability can be estimated. Severe calibration failures or systematic association errors may still degrade performance.
- Planar-road assumption: Homography-based projection is approximate and may fail in non-planar environments or under calibration drift.
- Infrastructure availability: Stationary cameras may not always be available; while adaptation reduces reliance over time, deployment depends on infrastructure coverage.
- Offline adaptation: We focus on offline/nearline learning; future work could explore safe online adaptation with stronger safeguards.
- learned or adaptive gating policies that adjust thresholds under drift,
- integration with additional sensors (lidar/radar) and temporal tracking,
- richer uncertainty models (ensembles or evidential learning),
- broader evaluation across environments and weather conditions,
- extension to full multi-agent GenAI orchestration pipelines with verification and policy gating.
9. Conclusion
Appendix A
Ground-Plane Homography (Fast and Robust Algorithm)
- Solve for the homography $H$ once using 4–8 surveyed road points visible in the image (OpenCV findHomography with RANSAC).
- For every detection, take the bottom-centre of the bounding box (contact point) $(u, v)$, map it with $H$ (or $H^{-1}$, depending on convention) to ground coordinates, normalize by the homogeneous scale, and obtain $(X, Y)$ in meters.
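The per-detection mapping step can be sketched without any dependencies once a 3×3 homography is available. The example matrix and box below are hypothetical, and the matrix is assumed to map image pixels to ground-plane meters; estimating it from surveyed points (for example with OpenCV findHomography) is outside this sketch.

```python
def contact_point(box):
    """Bottom-centre of an (x_min, y_min, x_max, y_max) pixel box."""
    x_min, _, x_max, y_max = box
    return (x_min + x_max) / 2.0, y_max

def to_ground(H, u, v):
    """Map an image point (u, v) to ground coordinates with a 3x3 homography."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity (on the horizon line)")
    return x / w, y / w  # normalize by the homogeneous scale

# Hypothetical image-to-ground homography and detection box.
H = [
    [0.01, 0.0, 0.0],
    [0.0, 0.02, 0.0],
    [0.0, 0.0, 1.0],
]
u, v = contact_point((100.0, 50.0, 300.0, 400.0))
X, Y = to_ground(H, u, v)
```

The bottom-centre is used because it approximates where the object touches the road, which is the only part of the box for which the planar assumption holds.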





Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).