Submitted:
17 August 2025
Posted:
19 August 2025
Abstract
Keywords:
1. Introduction
1.1. Context and Importance
1.2. Problem Statement
- Synchronisation issues due to heterogeneous hardware and real-time constraints. CPS environments often consist of diverse hardware components, and ensuring seamless integration and real-time data processing across these heterogeneous platforms is complex. Synchronisation becomes particularly challenging when multiple sensors and processing units work together to provide a coherent and timely response. Variations in processing power, data transfer rates, and latency can lead to discrepancies or delays that undermine the system's overall performance. Addressing these issues requires sophisticated algorithms and synchronisation protocols that can harmonise the operation of different hardware components while meeting stringent real-time constraints.
- Optimisation difficulties related to balancing accuracy and computational efficiency. ML models, particularly deep learning architectures, often demand substantial computational resources to achieve high accuracy. In CPS, where real-time decision-making is crucial, striking a balance between model performance and computational efficiency is essential. Resource-constrained environments, such as embedded systems or edge devices, may not have the capacity to run large models or handle intensive computations. Therefore, optimising models to deliver accurate predictions without overloading system resources is a significant challenge. Techniques such as model pruning, quantisation, and knowledge distillation are commonly explored, but implementing them effectively without compromising performance remains an ongoing area of research.
- Adaptation requirements to ensure robust performance across varying environments and tasks. CPS often operates in dynamic and unpredictable environments where conditions can change rapidly. For instance, an autonomous vehicle must adapt to different weather conditions, lighting variations, and traffic scenarios. Similarly, industrial CPS must handle fluctuations in sensor data and operational conditions. ML models trained in controlled settings may struggle to maintain accuracy when faced with such variability. This necessitates adaptive learning strategies and robust models capable of generalising across different tasks and environments. Techniques such as transfer learning, online learning, and domain adaptation are crucial, but integrating them into CPS without causing disruptions or requiring constant retraining poses significant challenges.
1.3. Objectives
1.4. Structure
2. Methodology
2.1. Systematic Review Framework
2.2. Search Strategy
2.3. Selection Process
2.4. Data Extraction
3. Background
3.1. Overview of Cyber-physical Systems (CPS) and Computer Vision (CV)
- Sensors: collect data from the physical environment, converting real-world information into digital signals. Examples include temperature sensors, cameras, LIDAR, GPS, and accelerometers. In CPS, sensors play a vital role in:
  - Monitoring environmental conditions (e.g., in smart buildings).
  - Detecting anomalies in industrial processes.
  - Providing input for control decisions in autonomous vehicles.
- Actuators: perform actions based on decisions made by the computational units, transforming digital commands into physical actions. They can control various devices, such as motors, valves, or robotic arms. Key functions include:
  - Adjusting machinery operations in manufacturing.
  - Steering autonomous vehicles based on sensor data.
  - Regulating power distribution in smart grids.
- Computational Units: process sensor data, run control algorithms, and send commands to actuators. They can range from embedded microcontrollers to powerful cloud-based systems. Functions include:
  - Real-time data analysis.
  - Running predictive models to anticipate system behaviours.
  - Ensuring system security and reliability through robust software protocols.
3.1.1. Perception and Sensing
3.1.2. Real-time Monitoring and Feedback
3.1.3. Autonomy and Decision Making
3.1.4. Safety and Surveillance
3.1.5. Human-machine Interfaces
3.2. Machine Learning Techniques in CV
3.2.1. Convolutional Neural Networks (CNNs) for Image Recognition
- Convolutional Layers: These layers apply convolution operations to the input image, using filters (or kernels) to detect various features such as edges, textures, and patterns.
- Activation Functions: After convolution, activation functions are applied to introduce non-linearity, helping the network learn more complex patterns.
- Pooling Layers: These layers reduce the dimensionality of feature maps, preserving essential information while minimising computational load and making the network more robust to variations in input.
- Fully Connected Layers: After several convolutional and pooling layers, these layers combine the features to make predictions or classifications.
- Output Layer: The final layer usually uses a softmax activation function to produce a probability distribution over the possible classes, allowing the network to make a prediction.
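To make these layer roles concrete, the following is a minimal PyTorch sketch of such a network; the layer sizes, input resolution, and ten-class output are illustrative assumptions rather than an architecture taken from the surveyed literature:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN: convolution -> activation -> pooling -> fully connected -> softmax."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer: detects edges/textures
            nn.ReLU(),                                    # activation: introduces non-linearity
            nn.MaxPool2d(2),                              # pooling: halves spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        logits = self.classifier(x)
        return torch.softmax(logits, dim=1)  # output layer: probability distribution over classes

# Example: a batch of four 32x32 RGB images.
probs = SimpleCNN()(torch.randn(4, 3, 32, 32))
print(probs.shape)  # torch.Size([4, 10])
```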
3.2.2. Recurrent Neural Networks (RNNs) for Sequential Data Processing
- Long Short-Term Memory (LSTM): LSTMs are a type of RNN designed to address the vanishing gradient problem, allowing them to capture long-term dependencies more effectively.
- Gated Recurrent Unit (GRU): GRUs are a simplified version of LSTMs that also mitigate the vanishing gradient problem while being computationally more efficient.
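A minimal PyTorch sketch of an LSTM-based classifier over a sequence of per-frame feature vectors follows; the dimensions and the last-step classification head are illustrative assumptions, and nn.GRU is a drop-in alternative:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """LSTM over a sequence of per-frame feature vectors; GRU is a drop-in alternative."""
    def __init__(self, feat_dim: int = 64, hidden: int = 128, num_classes: int = 5):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)  # swap in nn.GRU for fewer gates
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(x)          # out: (batch, seq_len, hidden)
        return self.head(out[:, -1])  # classify from the last time step

logits = SequenceClassifier()(torch.randn(2, 30, 64))  # 2 sequences of 30 frames
print(logits.shape)  # torch.Size([2, 5])
```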
3.2.3. Transformer-Based Architectures for Advanced Feature Extraction
3.3. Challenges in Synchronisation, Optimisation, and Adaptation
3.3.1. Synchronisation Challenges
- Real-time Data Fusion: CV systems process visual data alongside data from other sensors such as LiDAR, RADAR, and accelerometers. Poor decision-making may result from system lags or timestamp misalignments.
- Latency in Decision Making: The processing of deep learning-based CV algorithms is time-consuming, making real-time synchronisation with CPS controls essential. Delays can compromise safety in systems like autonomous vehicles and drones.
- Distributed Processing: Coordinating CV tasks among nodes in a distributed CPS network is challenging, particularly while handling time-sensitive communications and preserving system dependability.
3.3.2. Optimisation Challenges
- Resource Constraints: Memory and processing power on CPS devices, particularly edge devices, are frequently constrained. Because deep learning models require many resources, optimising them within these limitations might be challenging.
- Model Efficiency: Techniques like model compression and pruning are necessary to reduce the size and complexity of neural networks for tasks such as object detection and recognition on resource-constrained edge devices.
- Real-time Optimisation: There is a trade-off between inference speed and accuracy, which is particularly challenging for low-latency applications like autonomous navigation.
- Communication Bandwidth: In distributed CPS, efficiently transmitting high-dimensional CV data requires methods such as video compression and local processing using edge computing.
3.3.3. Adaptation Challenges
- Dynamic Environments: CV algorithms must adapt continuously to changing conditions, such as variations in lighting, weather, and the presence of new obstacles, rather than the static conditions under which they were developed.
- Transfer Learning and Domain Adaptation: It is challenging to adapt pre-trained models to new environments with minimal retraining, such as when autonomous vehicles move from urban to rural areas.
- Online Learning and Incremental Updates: Due to continuous data streaming, CPS requires real-time model updates without full retraining, which is computationally costly.
- Handling Uncertainty and Noise: Models must remain robust to noisy, incomplete, or uncertain sensor data to ensure accurate decision-making.
4. Results and Analysis
4.1. Key Findings
4.1.1. CNN
4.1.2. Federated Learning
- Privacy Preservation: FL retains data on local devices, sharing only model updates. This safeguards sensitive visual data, such as surveillance footage or medical records, addressing significant privacy concerns.
- Scalability: FL efficiently handles large-scale distributed systems, making it ideal for extensive CPS networks with numerous devices.
- Reduced Latency: Local data processing and updates minimise communication overhead and latency compared to centralized training methods.
- Heterogeneity Handling: FL can leverage adaptive aggregation techniques and personalized models to address the heterogeneity among nodes, ensuring synchronisation is maintained even in diverse and resource-imbalanced environments.
- Robustness and Adaptability: FL supports continuous learning and adapts to new data, enhancing the robustness of models in dynamic environments.
- Synchronous FL: All nodes synchronize their updates simultaneously, which can be challenging due to varying computational capabilities and network conditions.
- Asynchronous FL: Nodes update the model independently, offering more flexibility and efficiency but potentially leading to stale updates.
- Non-IID Data: Data from different nodes may not be identically distributed, which can affect model performance. Techniques like data augmentation and domain adaptation can help mitigate this issue. [39]
- Communication Overhead: Efficient communication protocols and compression techniques are essential to reduce the bandwidth required for model updates. [37]
- Model Heterogeneity: Different devices may have varying computational capabilities. Federated learning frameworks need to account for this by using adaptive algorithms that can handle heterogeneous environments. [37]
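A minimal sketch of one synchronous federated averaging (FedAvg-style) round illustrates the mechanics discussed above: each client trains on local data and shares only weights, which the server aggregates in proportion to local dataset sizes. The client data loaders and hyperparameters are illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

def local_update(model: nn.Module, data, epochs: int = 1, lr: float = 0.01) -> dict:
    """Train a copy of the global model on one client's local data; return only its weights."""
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data:                # data: iterable of (inputs, labels) batches on-device
            opt.zero_grad()
            loss_fn(local(x), y).backward()
            opt.step()
    return local.state_dict()

def fed_avg(global_model: nn.Module, client_weights: list, sizes: list) -> None:
    """FedAvg: weight each client's parameters by its local dataset size."""
    total = sum(sizes)
    avg = {k: sum(w[k].float() * (n / total) for w, n in zip(client_weights, sizes))
           for k in client_weights[0]}
    global_model.load_state_dict(avg)
```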
4.1.3. Meta-Learning
- Prototypical Networks: These networks address the problem of few-shot classification by enabling generalisation to new classes with only a few examples per class. They learn a metric space where classification is based on distances to class prototype representations. They offer a simpler inductive bias compared to other few-shot learning methods, producing excellent results with limited data. [40]
- Siamese Networks: These networks consist of twin neural networks that share parameters and weights, trained to minimise the distance between similar pairs and maximise the distance between dissimilar pairs, mapping similar observations close together in feature space and dissimilar ones farther apart. Experiments on cross-domain datasets demonstrate the network's ability to handle forgery detection across various languages and handwriting styles. [41]
- Model-Agnostic Meta-Learning (MAML): The MAML algorithm is compatible with any model trained by gradient descent and is applicable to tasks such as classification, regression, and reinforcement learning. The objective is to train a model on diverse tasks so that it generalises to new tasks with minimal training samples. This method optimises model parameters to enable rapid adaptation with just a few gradient steps on new tasks, making the model easy to fine-tune. MAML achieves state-of-the-art performance on few-shot image classification benchmarks, delivers strong results in few-shot regression, and accelerates fine-tuning in policy gradient reinforcement learning. [42] (A minimal first-order sketch of the inner/outer loop appears at the end of this subsection.)
- Memory-augmented Models: These models, such as Neural Turing Machines (NTMs), can efficiently incorporate new information without relearning their parameters by quickly encoding and retrieving it. They can quickly assimilate data and predict accurately with only a few samples. Santoro et al. [43] introduce a novel method for accessing external memory that focuses on memory content, eliminating the dependence on location-based mechanisms used in previous approaches.
- Fast Adaptation: Meta-learning enables models to quickly adapt to new tasks with minimal data. It is critical for dynamic applications, such as autonomous vehicles or drones operating in changing environments.
- Data Efficiency: By leveraging prior knowledge from related tasks, meta-learning reduces the need for extensive training data. This efficiency is crucial in applications like medical imaging, where annotated data is often scarce.
- Cross-Domain Learning: Meta-learning helps models generalize better across different tasks and domains, facilitating adaptation such as transferring knowledge from medical imaging to aerial imagery. Google Vizier includes features such as transfer learning, which allow models to use knowledge from previously optimised tasks to accelerate and enhance the optimisation of new ones [44].
- Personalisation: Meta-learning adapts models to individual preferences or environments, such as tailoring AR applications for unique users.
- Image Classification: Meta-learning algorithms can quickly adapt to classify new categories of images with minimal data, recognising unseen classes in few-shot or zero-shot settings.
- Object Detection and Tracking: By leveraging prior knowledge, meta-learning models can enhance object detection and tracking capabilities, making them more robust to variations in the visual environment.
- Image Segmentation: Meta-learning can improve the performance of image segmentation tasks, where the goal is to partition an image into meaningful segments. This is particularly useful in medical imaging and autonomous driving.
- Facial Recognition: Meta-learning techniques can be used to develop facial recognition systems that adapt quickly to new faces with limited training data, enhancing security and personalisation applications.
- Pose Estimation: Meta-learning can be applied to pose estimation tasks, where the model needs to predict the pose of objects or humans in images. This is useful in the fields of robotics and augmented reality.
- Scene Understanding: Meta-learning allows CV systems to interpret new or unseen scenes for applications such as navigation or augmented reality (AR).
- Scalability: Meta-learning algorithms often struggle with scalability when applied to large-scale datasets and high-dimensional data typical for CV tasks. Efficiently scaling these algorithms while maintaining performance is a significant challenge.
- Generalisation: Ensuring that meta-learning models generalize well across a wide range of tasks and domains is difficult. Models trained on specific tasks may not perform well on unseen tasks, highlighting the need for better generalisation techniques.
- Computational Complexity: Meta-learning methods can be computationally intensive, requiring significant resources for training and adaptation. This complexity can limit their practical application, especially in resource-constrained environments.
- Data Efficiency: While meta-learning aims to be data-efficient, achieving this in practice can be challenging. Models often require a careful balance between leveraging prior knowledge and adapting to new data with minimal samples.
- Task Diversity: The diversity of tasks used during meta-training is crucial for the model’s ability to generalize. However, creating a sufficiently diverse set of tasks that accurately represent real-world scenarios is challenging.
- Optimisation Stability: Ensuring stable and efficient optimisation during the meta-training phase is another challenge. Meta-learning models can be sensitive to hyperparameters and the choice of optimisation algorithms.
- Interpretability: Meta-learning models, especially those based on deep learning, can be difficult to interpret. Understanding how these models make decisions and adapt to new tasks is important for trust and transparency.
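As promised above, here is a minimal first-order MAML (FOMAML) sketch of the inner/outer optimisation loop; full MAML additionally differentiates through the inner-loop updates. The task structure, loss, and hyperparameters are illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

def fomaml_step(model: nn.Module, tasks, inner_lr: float = 0.01,
                outer_lr: float = 0.001, inner_steps: int = 1) -> None:
    """One first-order MAML meta-update. Each task is a pair (support, query),
    where support and query are (x, y) tensor pairs."""
    loss_fn = nn.MSELoss()
    meta_opt = torch.optim.SGD(model.parameters(), lr=outer_lr)
    meta_opt.zero_grad()
    for support, query in tasks:
        fast = copy.deepcopy(model)                 # task-specific copy of the shared initialisation
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                # inner loop: adapt to the support set
            x, y = support
            inner_opt.zero_grad()
            loss_fn(fast(x), y).backward()
            inner_opt.step()
        x, y = query                                # outer loss: how well did adaptation generalise?
        inner_opt.zero_grad()
        loss_fn(fast(x), y).backward()
        for p, fp in zip(model.parameters(), fast.parameters()):
            p.grad = fp.grad.clone() if p.grad is None else p.grad + fp.grad
    meta_opt.step()                                 # meta-update of the shared initialisation
```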
4.2. Themes and Categories
4.2.1. Synchronisation Strategies
- Timestamping: Timestamping involves attaching precise time metadata to each data packet as it is generated, enabling the alignment and correlation of data streams from heterogeneous sources. Yang and Kupferschmidt [45] implement timestamp synchronisation specifically for video and audio signals, demonstrating its effectiveness. This approach is typically simpler and less computationally intensive than more complex synchronisation methods. (A minimal timestamp-alignment sketch appears at the end of this subsection.)
- Sensor Fusion: This technique is widely used in embedded systems to integrate data from multiple sensors, providing a more accurate and reliable representation of the environment. It is commonly applied in areas such as autonomous vehicles, robotics, and wearable devices. [?] introduce a real-time hybrid multi-sensor fusion framework that combines data from cameras, LiDAR, and radar to enhance environment perception tasks, including road segmentation, obstacle detection, and tracking. The framework employs a Fully Convolutional Neural Network (FCN) for road detection and an Extended Kalman Filter (EKF) for state estimation. Designed to be cost-effective, lightweight, modular, and robust, the approach achieves real-time efficiency while delivering superior performance in road segmentation, obstacle detection, and tracking. Evaluated on 3,000 scenes and real vehicles, it outperforms existing benchmark models. Moreover, Robyns et al. [9] demonstrate how to communicate from the physical system to the digital twin for visualising the industrial operation by using Unreal Engine. The digital twin features a modular architecture based on the publish-subscribe pattern, enabling the integration of multiple data processing modules from heterogeneous data streams.
- Real-time Task Scheduling: This technique involves orchestrating machine learning and computer vision tasks to ensure timely and reliable operations. CPS applications, such as autonomous vehicles, robotics, and smart manufacturing, demand low-latency, high-accuracy processing while operating under strict deadlines and resource constraints, as illustrated in Figure 9. As the figure suggests, it is often hard to neatly separate machine learning techniques or technologies into just one category of Synchronisation, Optimisation, or Adaptation; many methods operate across multiple dimensions simultaneously, especially as systems become more complex, distributed, and real-time.
- The Interplay Between Synchronisation, Optimisation, and Adaptation: This interplay is pivotal for the seamless operation of ML-based CV in CPS.
  - Synchronisation and Optimisation: Efficient synchronisation reduces redundant computations and data transmissions, thereby optimising resource usage.
  - Synchronisation and Adaptation: Timely synchronised data streams enable models to adapt quickly to environmental changes, enhancing responsiveness.
  - Optimisation and Adaptation: Optimised models with reduced complexity facilitate faster adaptation to new data, ensuring real-time performance.
By integrating these components, CPS can achieve higher levels of autonomy, efficiency, and resilience, essential for applications like autonomous vehicles, smart manufacturing, and intelligent surveillance. Techniques like federated learning have grown in scope, adopting strategies from all three areas. The authors of [46] propose a reinforcement learning-based federated training scheme for object detection that involves synchronisation through the edge server's use of a shared Q-table, which coordinates and accelerates policy decisions across multiple mobile devices. This approach optimises device selection and training data management to balance energy consumption, detection accuracy, and latency while reducing communication overhead in mobile edge computing. Additionally, the scheme adapts dynamically to varying channel conditions, prior training results, and the presence of jamming attacks by adjusting training policies on each device. This adaptability enhances the robustness and effectiveness of computer vision model training in challenging, real-world environments.
Hu et al. [47] propose a framework to enhance the efficiency of AI-based perception systems in applications like autonomous drones and vehicles. The framework focuses on prioritising the processing of critical image regions, such as foreground objects, while de-emphasising less significant background areas. This strategy optimises the use of limited computational resources. The study leverages real LiDAR measurements for rapid image segmentation, enabling the identification of critical regions without requiring a perfect sensor. By resizing images, the framework balances accuracy and execution time, offering a flexible approach to handling less important input areas. This method avoids the extremes of full-resolution processing or completely discarding data. Experiments are conducted on an AI-embedded platform with real-world driving data to validate the framework's practicality and efficiency.
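As promised under the timestamping strategy above, here is a minimal, illustrative sketch of aligning two heterogeneous sensor streams by nearest timestamp; the 20 ms skew bound and the toy data are assumptions for illustration only:

```python
import bisect

def align_streams(cam, lidar, max_skew=0.02):
    """Pair each camera sample with the nearest-in-time LiDAR sample.
    Items are (timestamp_seconds, payload); pairs further apart than
    max_skew are dropped rather than fused from stale data."""
    lidar_ts = [t for t, _ in lidar]
    pairs = []
    for t, frame in cam:
        i = bisect.bisect_left(lidar_ts, t)
        # candidate neighbours: the sample just before and just after t
        best = min((j for j in (i - 1, i) if 0 <= j < len(lidar_ts)),
                   key=lambda j: abs(lidar_ts[j] - t))
        if abs(lidar_ts[best] - t) <= max_skew:
            pairs.append((frame, lidar[best][1]))
    return pairs

cam = [(0.000, "f0"), (0.033, "f1"), (0.066, "f2")]
scans = [(0.005, "s0"), (0.040, "s1"), (0.100, "s2")]
print(align_streams(cam, scans))  # [('f0', 's0'), ('f1', 's1')] -- f2 has no scan within 20 ms
```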
4.2.2. Optimisation Approaches
- Model Compression Techniques: Techniques [25] such as pruning, quantisation, knowledge distillation, low-rank factorisation, and transfer learning are applied to reduce the size of deep learning models without sacrificing significant performance. This is particularly critical for edge devices and CPS with limited hardware resources [13,48]. (A minimal pruning and quantisation sketch follows this list.)
  - Pruning: Reducing the number of neurons or connections in a neural network by removing weights that have little influence on the output. This decreases the size of the model, making it computationally more efficient without significantly sacrificing accuracy.
  - Quantisation: Reducing the precision of the weights and activations in the model, e.g., from 32-bit floating-point to 8-bit integer or even binary. This leads to a reduced memory footprint and faster computation, especially on specialised hardware (like FPGAs and GPUs).
  - Deep Compression: Han, Mao, and Dally [15] introduce "deep compression", a three-stage pipeline (pruning, quantisation, and Huffman coding) designed to reduce the storage and computational demands of neural networks, enabling deployment on resource-constrained embedded systems. Pruning removes unnecessary connections, reducing the number of connections by 9× to 13×. Quantisation enforces weight sharing, reducing the representation of each connection from 32 bits to as few as 5 bits. Huffman coding further compresses the quantised weights. Experiments on AlexNet showed a 35× reduction in weight storage, with VGG-16 and LeNet achieving 49× and 39× reductions, respectively, while maintaining accuracy. This compression enables these networks to fit into the on-chip SRAM cache, significantly reducing energy consumption compared to off-chip DRAM access. The approach enhances the feasibility of deploying complex neural networks in mobile applications by addressing storage, energy efficiency, and download bandwidth constraints.
  - Knowledge Distillation: A process where a smaller, less complex "student" model learns to approximate the outputs of a larger, more complex "teacher" model. This can yield a more computationally efficient model with similar accuracy. Hinton, Vinyals, and Dean [49] demonstrate the effectiveness of distillation, successfully transferring knowledge from ensembles or highly regularised large models into a smaller model. On MNIST, this method works well even when the distilled model's training set lacks examples of certain classes. For deep acoustic models, such as those used in Android voice search, nearly all performance gains from ensembles can be distilled into a single, similarly sized neural net, making deployment more practical. For very large neural networks, performance can be further improved by training specialist models that handle highly confusable class clusters; however, distilling the knowledge from these specialists back into a single large model remains an open challenge. This approach highlights the potential of distillation to balance performance and efficiency in machine learning systems.
  - Low-rank Factorisation: This reduces the number of parameters in deep learning models by approximating weight matrices with lower-rank matrices, helping compress models and speed up training and inference. Cai et al. [50] propose a joint function optimisation framework that integrates low-rank matrix factorisation and a linear compression function into a unified optimisation approach, designed to reduce the number of parameters in DNNs and their computational and storage costs while preserving or enhancing model accuracy.
  - Transfer Learning: A machine learning method that reuses a model trained on one task to solve a related task, allowing the model to leverage prior knowledge and learn new tasks effectively even with limited data. In CPS applications, transfer learning minimises the need for extensive manual labelling by transferring insights from similar domains. By utilising models pre-trained on large-scale datasets (e.g., ImageNet) as a foundation, it avoids training from scratch; fine-tuning only a few layers enables CPS to adapt quickly to new tasks or environments, significantly reducing computational costs. Bird et al. [16] explore unsupervised transfer learning between electroencephalography (EEG) and electromyography (EMG) using both MLP and CNN approaches. The models were trained with fixed hyperparameters and a limited set of network topologies determined through a multi-objective evolutionary search, with identical mathematical features extracted to ensure compatibility between the networks. Their research demonstrates the application of cross-domain transfer learning in human-machine interaction systems, significantly reducing computational costs compared to training models from scratch.
  - Lightweight Architectures: Specialised architectures designed for efficiency while maintaining good accuracy, such as MobileNet and EfficientNet, which are built to run efficiently on resource-constrained devices.
  - MobileNet: A class of efficient models designed for mobile and embedded vision applications. Howard et al. [51] utilise a streamlined architecture with depthwise separable convolutions to create lightweight deep neural networks. Two global hyperparameters are introduced to balance latency and accuracy, enabling model customisation based on application constraints. Extensive experiments show that MobileNets perform well compared to other popular models on ImageNet classification. Their effectiveness is demonstrated across diverse applications, including object detection, fine-grain classification, face attribute analysis, and large-scale geo-localisation.
  - EfficientNets: A family of CNNs designed to achieve high accuracy with significantly improved computational efficiency, introduced as a solution to the challenge of scaling CNNs while balancing resource usage and performance. Tan and Le [52] propose a compound scaling method, a simple and effective approach for systematically scaling up a baseline CNN while maintaining efficiency under resource constraints. Using this method, the EfficientNet models achieve state-of-the-art accuracy with significantly fewer parameters and FLOPS, and high performance on both ImageNet and five transfer learning datasets, demonstrating their scalability and efficiency.
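The pruning and quantisation sketch promised above, using PyTorch's built-in utilities; the layer sizes, the 50% sparsity target, and the choice of dynamic int8 quantisation are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Pruning: zero out the 50% of weights with the smallest magnitude in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

# Quantisation: store Linear weights as 8-bit integers instead of 32-bit floats.
quantised = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```

In practice, pruned and quantised models are usually fine-tuned briefly afterwards to recover any lost accuracy.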
- Hardware Acceleration and Optimisation: This approach often involves leveraging parallelism (e.g., through graphics processing units (GPUs) or specialised hardware like tensor processing units (TPUs)) or optimising the inference pipeline to speed up processing, as illustrated in Figure 10. Since 2015, distributed-memory architectures with GPU acceleration have become the standard for machine learning workloads due to their growing computational demands [48]. Maier et al. [11] describe a GPU implementation of the parallel auction algorithm, optimised for both Open Computing Language (OpenCL) and Compute Unified Device Architecture (CUDA) environments, which reduces memory usage and increases speed compared to previous implementations, making it ideal for embedded systems with large problem sizes. Experimental results across two GPUs and six datasets show a best-case speedup of 1.7x, with an average speedup of 1.24x across platforms. Additionally, this approach meets strict real-time requirements, especially for large-scale problems, as demonstrated in sensor-based sorting applications. However, optimisation is further constrained by fixed initial parameters, such as GPU architecture or model accuracy, limiting flexibility for future adjustments. Different GPUs deliver varied performance depending on factors such as batch size and execution context. Achieving optimal performance requires carefully balancing trade-offs between accuracy, throughput, and latency [17].
To illustrate the trade-off between accuracy, latency, and throughput, consider the performance of a spiking neural network (SNN) at different time steps (T) with VGG16 on CIFAR-10 in [53]. Figure 11, reproduced from [53], shows the relationship between SNN accuracy and sparsity across various time window sizes when training the network for 100 epochs with a learning rate of 1e-2. The figure demonstrates that increasing the time window size from 4 to 8 improves the model's accuracy: a larger T allows neurons more time to integrate input spikes, leading to more precise outputs. This improvement comes at the cost of increased latency, as the model requires more time steps to process each input. As latency increases with larger T values, throughput decreases; while the model's accuracy rises, it processes fewer inputs in the same amount of time, which limits its usefulness in real-time applications. In short, increasing T improves accuracy but worsens latency and energy consumption, while enhancing sparsity can mitigate energy costs but may affect accuracy. Optimising SNNs therefore involves carefully tuning T and sparsity to meet specific application requirements for accuracy, throughput, and latency.
- Data Augmentation: Data augmentation involves applying various transformations (such as rotation, scaling, and cropping) to the training dataset, thereby artificially expanding its size and diversity. This approach helps enhance the performance of smaller models. In many real-world scenarios, collecting sufficient training data can be challenging. Data augmentation [54] addresses this issue by increasing the volume, quality, and variety of the training data. Techniques for augmentation include deep learning-based strategies, feature-level modifications, and meta-learning approaches, as well as data synthesis methods using 3D graphics modelling, neural rendering, and generative adversarial networks (GANs). (A minimal augmentation pipeline sketch follows this list.)
  - Deeply Learned Augmentation Strategies: These techniques use deep learning models to generate augmentations automatically, improving the diversity and quality of the data. Neural networks are employed to create realistic data variations, thus enhancing the model's robustness.
  - Feature-level Augmentation: This method modifies specific features of the data, rather than the raw image itself. Common operations include changing attributes like contrast, brightness, or texture. Such adjustments can improve the model's ability to generalise across different scenarios.
  - Meta-learning-based Augmentation: Meta-learning approaches focus on learning how to generate useful augmentations based on the characteristics of the data. These methods aim to optimise the augmentation strategy itself, improving the model's learning efficiency across various tasks.
  - Data Synthesis Methods: These involve generating synthetic data through techniques like 3D graphics modelling. This approach creates realistic data variations, which is particularly useful for simulating rare or hard-to-capture events in real-world scenarios.
  - Neural Rendering: This technique uses neural networks to generate images from 3D models or abstract representations, producing realistic augmentations that can improve the diversity and realism of the training data.
  - Generative Adversarial Networks (GANs): GANs are employed to create synthetic data by training two competing networks, the generator and the discriminator. The generator produces new images, while the discriminator evaluates their authenticity. GANs can generate highly realistic augmentations, significantly boosting the dataset's diversity.
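The augmentation pipeline sketch promised above, using torchvision; the specific transforms and parameter values are illustrative assumptions chosen to mirror the geometric and photometric operations just described:

```python
from torchvision import transforms

# Geometric and photometric augmentations applied on the fly during training;
# each epoch therefore sees a slightly different version of every image.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),    # scaling + cropping
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),                          # rotation up to +/-15 degrees
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # photometric: brightness/contrast
    transforms.ToTensor(),
])
```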
- Edge Computing: This paradigm involves moving computational tasks closer to the data source, such as onto embedded devices at the network's edge. By processing data locally, edge computing reduces the latency associated with transmitting data to and from remote servers, enabling real-time responses critical for applications like autonomous navigation and real-time surveillance. This approach also conserves bandwidth and enhances data privacy. Significant improvements in latency and throughput have been observed when deploying trained networks on mobile devices and remote servers [17]. Deng et al. [55] expand the scope of edge computing by integrating it with AI into a concept called Edge Intelligence, categorised into AI for Edge and AI on Edge:
  - AI for Edge: Utilises AI technologies to address key challenges in edge computing, such as optimising resource allocation, reducing latency, and managing data efficiently.
  - AI on Edge: Focuses on performing the entire AI lifecycle, including model training and inference, directly on edge devices.
In distributed learning, the model is trained collaboratively across multiple edge devices, with only model updates, rather than raw data, being transmitted to a central server. This approach reduces communication bandwidth requirements and enhances data privacy. Tron and Vidal [56] demonstrate the application of distributed computer vision algorithms, highlighting that the storage requirements at each node depend solely on local data and remain constant irrespective of the number of cameras involved. For accelerating deep learning training, integrating distributed architectures with techniques such as gradient compression and adaptive learning rates is essential [14].
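As a sketch of the gradient-compression idea just mentioned, the following top-k scheme transmits only the largest-magnitude gradient entries; the 1% ratio is an illustrative assumption, and production schemes typically add error feedback to preserve convergence:

```python
import math
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude entries of a gradient tensor;
    sending (indices, values) instead of the dense tensor cuts the
    communication volume roughly by a factor of 1/ratio."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices], grad.shape

def topk_decompress(indices, values, shape):
    """Rebuild the sparse gradient on the receiving side."""
    flat = torch.zeros(math.prod(shape))
    flat[indices] = values
    return flat.view(shape)

g = torch.randn(64, 128)
idx, vals, shape = topk_compress(g, ratio=0.05)
print(topk_decompress(idx, vals, shape).count_nonzero())  # ~5% of 8192 entries survive
```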
4.2.3. Adaptation Mechanisms
- Data-driven Adaptation: This approach involves leveraging data to enable models or systems to adjust and optimise their performance in response to dynamic conditions or specific challenges. In Shen et al.'s studies [18], the parallel light field platform supports the collection of realistic datasets that capture diverse lighting conditions, material properties, and geometric details. These datasets empower data-driven adaptation by providing models with inputs that closely mimic real-world scenarios, ensuring robust generalisation across varying environments. To handle self-occlusion, the conditional visibility module adopts a data-driven strategy, dynamically computing visibility along rays based on input viewpoints. Instead of relying on predefined rules, the module learns and predicts visibility directly from data, enabling it to adapt effectively to diverse viewing conditions. Moreover, data-driven techniques are applied to address specular reflection challenges and depth inconsistencies, showcasing the system's capability to adapt to complexities arising from changing viewpoints. These adaptations, powered by data, enhance the model's ability to adjust predictions under varying environmental and geometric conditions. Another example is presented in Kaur et al.'s article [19], where data augmentation techniques are used to generate variations in the dataset, allowing models to learn from a wide range of scenarios. This helps models adapt to unseen conditions during inference. The techniques discussed include Geometric Transformations, Photometric Transformations, Random Occlusion, and Deep Learning-based Approaches. The choice of augmentation methods depends on the nature of the dataset, the problem domain, and the number of training samples available for each class.
- Online Learning: This approach involves continuously updating a model with new labelled or pseudo-labelled data collected during deployment. In machine learning, models must learn and adapt in real time as fresh data becomes available. This is especially crucial in CPS, where the system must adjust to changes such as varying lighting conditions for cameras or evolving cybersecurity threats. Implementing online learning in production environments typically requires several steps: debugging offline, continuous model evaluation, managing data drift, performing regular offline retraining, using efficient algorithms, ensuring data quality, having a rollback plan, and applying incremental updates [21]. For online learning, Hu et al. [30] introduce the pre-trained Truncated Gradient Confidence-weighted (Pt-TGCW) model, which combines offline and online learning techniques for tasks like image classification. This model highlights the effectiveness of incremental learning approaches. Additionally, Lu et al. [57] propose Passive-Aggressive Active (PAA) learning algorithms, which update models using misclassified instances and leverage correctly classified examples with low confidence. Their methods enhance performance across various online learning tasks, including binary and multi-class classification.
- Transfer Learning: This approach involves leveraging models pre-trained on large datasets and fine-tuning them for specific tasks, utilising existing knowledge to improve robustness. In CPS, models trained on one dataset may need to be adapted to different environments or contexts. Transfer learning enables this adaptation by fine-tuning pre-trained models with smaller, task-specific datasets, making it easier to adjust models to new situations. This is particularly important in CPS, where models must be trained in one context and then applied to another. For instance, Wang et al. [20] propose a transfer-learning approach for detecting attacks in CPS using a Residual Network (ResNet). Their method refines source model parameters through an intentional sampling technique, constructing distinct sample sets for each class and extracting relevant features from attack behaviours. This approach results in a robust network capable of accurately detecting attacks across different CPS environments.
- Ensemble Methods: The method combines multiple models to enhance prediction accuracy and reliability, addressing the weaknesses of individual models. The ensemble model proposed by Tahir et al. [22] incorporates diverse architectures (MobileNetV2, Vgg16, InceptionV3, and ResNet50), each capable of adapting to different features or patterns within the dataset. These models may excel in recognising distinct aspects of the data, and their combination allows the system to handle a wider range of scenarios and data variations, such as differences in X-ray image quality or fracture types. By aggregating predictions from multiple models, the ensemble approach adapts to changes in data quality and characteristics, improving robustness and generalisation. This is particularly important when working with medical datasets like Mura-v1.1, where data can vary in terms of noise, resolution, and imaging conditions. Preprocessing techniques such as histogram equalisation and feature extraction using Global Average Pooling further support adaptation, helping the model adjust to variations in image quality. These methods ensure that the model can effectively handle different input characteristics. The combination of diverse architectures and preprocessing techniques in the ensemble model enhances its adaptability, robustness, and accuracy, which is crucial for reliable performance in the complex and variable field of medical image analysis.
- Adversarial Training: This technique enhances the model's robustness by making it more resistant to small, intentional perturbations in the input data that could otherwise lead to misclassifications. By generating adversarial examples [58] and incorporating them into the training process, the model learns to recognize and correctly classify inputs that would typically confuse it, thus improving its generalisation capability. This approach provides insights into how neural networks can adapt to better resist adversarial perturbations, ultimately strengthening their robustness. By using adversarial examples during training, the model becomes more adaptable to a wider range of input variations, making it more resilient and capable of generalising effectively across different datasets, architectures, and training conditions. (A minimal FGSM-style sketch appears at the end of this list.) Another example [59] involves handling adversarial perturbations through randomized smoothing, which strengthens a model's robustness against adversarial attacks by adding Gaussian noise to the input data. This technique ensures the model is "certifiably robust" to adversarial perturbations, enabling it to maintain reliable performance even when confronted with modified inputs. Training the model with both original and noise-augmented data enhances its capacity to generalize across varied conditions, including adversarial scenarios. This adaptation process equips the model to handle a broader range of input variations, increasing its resilience to unforeseen changes in data distribution. As a formal adaptation technique, randomized smoothing ensures stability and high performance, even under adversarial conditions. By incorporating noise during training, this method significantly bolsters the model's ability to manage adversarial inputs, enhancing its robustness and generalisation in challenging environments.
- Federated Learning: In distributed CPS, where devices are spread across different locations (e.g., smart cities, industrial IoT), FL allows individual devices to train models locally and share updates, improving model performance across the system without centralising sensitive data. In Himeur's article [37], FL is used to distribute computational tasks across multiple clients, alleviating the load on central servers and enabling collaborative machine learning while ensuring data privacy. FL employs various aggregation methods, such as averaging, Progressive Fourier, and FedGKT, while incorporating privacy-preserving technologies like Secure Multi-Party Computation (MPC), differential privacy, and homomorphic encryption to safeguard sensitive information. Despite its advantages, FL in Computer Vision (CV) encounters several challenges, including high communication overhead, diverse device capabilities, and issues related to non-IID (non-independent and identically distributed) data, complicating model training and performance consistency. To lower resource constraints, Jiang et al. [60] introduce a Federated Local Differential Privacy scheme, named Fed-MPS (Federated Model Parameter Selection). Fed-MPS employs a parameter selection algorithm based on update direction consistency to address the limited resource issue in CPS environments. This method selectively extracts parameters that improve model accuracy during training while simultaneously reducing communication overhead.
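The FGSM-style sketch promised under the adversarial training item above, using the well-known Fast Gradient Sign Method; the epsilon value and the 50/50 clean/adversarial mix are illustrative assumptions rather than settings from the cited works:

```python
import torch
import torch.nn as nn

def fgsm_example(model: nn.Module, x: torch.Tensor, y: torch.Tensor, eps: float = 0.03):
    """Fast Gradient Sign Method: perturb the input in the direction
    that most increases the loss, bounded by eps per pixel."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, optimiser, x, y, eps: float = 0.03):
    """Train on a 50/50 mix of clean and adversarial examples."""
    x_adv = fgsm_example(model, x, y, eps)
    optimiser.zero_grad()
    loss_fn = nn.CrossEntropyLoss()
    loss = 0.5 * loss_fn(model(x), y) + 0.5 * loss_fn(model(x_adv), y)
    loss.backward()
    optimiser.step()
    return loss.item()
```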
4.3. A Practical Case Study - A Smart Surveillance System
4.3.1. Motivation
- Real-Time Distributed Architectures: Modern surveillance often requires rapid decision making in multiple locations. Low-latency data collection, real-time processing, and communication in a distributed system require robust coordination between edge devices, cloud infrastructure, and control units, especially under bandwidth and resource constraints.
- Awareness and Intelligence: Surveillance systems must move beyond passive monitoring to active interpretation of scenes. This includes contextual understanding, anomaly detection, and behaviour prediction using AI and ML. The challenge lies in integrating these intelligent capabilities without overwhelming computational resources or compromising privacy.
- Video Analysis Limitations: Traditional video analytics struggles in low-light conditions, high-speed scenarios, or scenes with occlusions and noise. They also often rely heavily on high-bandwidth video streams. Overcoming these issues requires more adaptive sensing techniques, such as event-based cameras or multimodal sensor fusion.
- Energy Efficiency in Remote Sensors: Designing energy-efficient sensing, computation, and communication protocols is essential for sustained deployment without frequent maintenance.
- Scalability: As surveillance networks grow in size and complexity, maintaining consistent performance, data integrity, and manageability becomes increasingly difficult. Scalable architectures must support the seamless integration of new sensors, load balancing, and adaptive processing without introducing bottlenecks. Referring to Figure 8, the optical detection system comprises event cameras, RGB cameras, and LiDAR, which form a complementary multimodal sensing strategy. Each sensor modality contributes distinct, non-overlapping information. For example, RGB cameras capture rich colour details, LiDAR provides precise spatial depth, and event cameras offer high temporal resolution. When these data streams are effectively fused, they significantly enhance the robustness and accuracy of perception systems, particularly in challenging conditions such as complete darkness or heavy rain.
4.3.2. Physical Entities
- RGB Cameras are the core of a visual monitoring system. Their clear and easily interpreted images are valuable, especially at high resolution and fast frame rates. However, their performance drops significantly in challenging situations like nighttime, rain, fog, or snow.
- Event Cameras are bio-inspired sensors that detect brightness changes asynchronously at each pixel. Each pixel operates independently, continuously monitoring the scene and triggering an event whenever the brightness change exceeds a predefined threshold. This results in a continuous stream of events, providing a dynamic, sparse representation of visual information [62]. Compared to conventional cameras, event-based sensors offer several advantages, including high temporal resolution, wide dynamic range, low latency, low power consumption, and reduced motion blur, which are beneficial for tasks such as object reconstruction, segmentation, and recognition [62,63,64,65]. However, event cameras can be sensitive to noise, especially in low-light situations. In fast-changing scenes, they can produce a very high number of events, leading to large data loads. Snowflakes or raindrops can create many false events, making it hard to detect real moving objects [66].
- LiDAR (Light Detection and Ranging) is a sensing technology that uses laser pulses to measure distance and create detailed 3D maps of the environment. When combined with cameras in surveillance systems, LiDAR adds precise depth information to the colour and detail that cameras provide. This combination improves how we see, understand, and respond to what's happening in an area. The main benefits of using LiDAR with cameras are as follows [?]:
  - Better 3D Awareness: LiDAR accurately measures depth, helping build a clear 3D map. When combined with camera images, this gives a more complete picture, making it easier to detect and understand what's in the scene.
  - Improved Object Recognition: By using both LiDAR's shape and distance data along with a camera's visual detail, it becomes easier to tell different objects apart and reduce false alarms.
  - Works in the Dark and Bad Weather: LiDAR doesn't rely on ambient light, so it works well at night or in dark areas. It also performs better than cameras in fog, smoke, or rain.
  - Smart Alerts: LiDAR can be used to set up virtual fences or tripwires. If something crosses them, the system can send an alert automatically.
  However, in low-visibility conditions such as fog, LiDAR measurements degrade more than those from regular cameras because the laser pulses travel twice the distance, increasing the chances of scattering and attenuation.
- Computer Systems: The system can leverage parallelism through GPUs or TPUs, with implementations optimised for both OpenCL and CUDA environments to reduce memory usage and increase speed, making it well suited to embedded deployment [11,48]. The operating system used is Ubuntu 24.04 LTS with the popular deep learning framework PyTorch.
- Outputs: Visualisation screens
4.3.3. Cyber World
- Data Fusion and Synchronisation
  - RGB camera-LiDAR calibration: The calibration is achieved by identifying the extrinsic parameters that maximise the mutual information between the two modalities. This optimisation can be carried out using standard tools in SciPy [66].
  - Event camera-LiDAR calibration: [67] propose a novel method to calibrate the extrinsic parameters between a dyad of an event camera and a LiDAR without the need for a calibration board or other equipment. From the event camera, edges are obtained by accumulating events over time and applying edge detection filters. From the LiDAR, geometric edge features are extracted by identifying sharp changes in depth (e.g., using surface normals or curvature). The method matches the edge features across the two modalities (event images and LiDAR projections) based on spatial alignment, and the objective is to find the extrinsic parameters that best align the edges from both sensors.
  - Extrinsic calibration: While the RGB-LiDAR and event-LiDAR calibrations can be performed separately, an additional constraint is introduced to enforce consistency by incorporating the extrinsic transformation between the event camera and the RGB camera. This transformation is obtained through standard stereo calibration techniques, such as those available in OpenCV. By using a checkerboard visible in both cameras' fields of view, we estimate their relative pose by minimising the reprojection error. The resulting extrinsics are then applied as a constraint in a joint mutual information (MI) optimisation framework, where we aim to maximise the MI between the aligned sensor data [66].
  - RGB Cameras: [68] propose a novel subframe synchronisation technique for multicamera systems that interpolates between frames, allowing the system to estimate what a camera would have captured at any precise point in time, even between actual recorded frames. This ensures much tighter temporal alignment across cameras, which is critical for applications like 3D reconstruction, motion capture, and scientific analysis of fast phenomena.
  - Synchronisation: LiDAR is synced using PTP (Precision Time Protocol) with software timestamping. Similarly, the RGB camera is programmed to be a PTP slave and additionally sends a trigger signal to the event camera at the start of data acquisition. The event camera receives the synthetic event from the RGB camera, which marks the start time [66]. (A projection sketch using the calibrated parameters follows this list.)
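Once the extrinsics from the calibrations above and the camera intrinsics are known, fusing LiDAR with RGB reduces to projecting 3D points into the image plane. A minimal sketch using OpenCV's projectPoints follows; the intrinsic matrix, extrinsics, and sample points are illustrative assumptions, not values from the system described here:

```python
import numpy as np
import cv2

def project_lidar_to_image(points_lidar, rvec, tvec, K, dist):
    """Project 3D LiDAR points onto the RGB image plane using the
    calibrated extrinsics (rvec, tvec) and camera intrinsics K."""
    pts, _ = cv2.projectPoints(points_lidar.astype(np.float32), rvec, tvec, K, dist)
    return pts.reshape(-1, 2)  # pixel coordinates, one row per point

# Assumed intrinsics and extrinsics, for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
rvec = np.zeros(3)                    # rotation as a Rodrigues vector
tvec = np.array([0.1, 0.0, 0.0])      # translation in metres
points = np.array([[1.0, 0.2, 5.0],   # 3D points in front of the sensor
                   [2.0, -0.5, 8.0]])
print(project_lidar_to_image(points, rvec, tvec, K, np.zeros(5)))
```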
- Architecture: The proposed architecture is shown in Figure 12.
  - MobileNetV3: It serves as a robust and efficient backbone architecture for the surveillance system. Its lightweight design, incorporating depthwise separable convolutions and neural architecture search (NAS), makes it highly suitable for real-time inference on edge devices [69]. When integrated with the Single Shot MultiBox Detector (SSD), MobileNetV3 enhances object detection by combining its high-level feature extraction with SSD's fast, one-shot bounding box prediction, ensuring both speed and accuracy [70,71].
  - Real-time Pixel-wise Estimation Flow (RPEFlow): To support reliable object tracking across video frames, RPEFlow is integrated into the detection pipeline because it provides dense motion vectors that help maintain object identity through occlusions or rapid movements, enabling consistent object association and precise motion trajectory estimation [72].
  - Contextual Understanding and Enhancement Network (CUE-Net): It is employed for anomaly detection, leveraging spatial appearance and temporal motion cues. It receives fused inputs from MobileNetV3-SSD detections and RPEFlow's motion stream to detect irregular behaviours such as sudden movements, crowding, or violent actions. Its ability to process spatiotemporal features enables real-time recognition of abnormal activities, enhancing situational awareness and proactive threat response [70].
  - SSD: The bounding boxes produced by SSD undergo post-processing via a Kalman filter bank. This lightweight and efficient tracker predicts future object positions based on motion dynamics, manages occlusions and missed detections through probabilistic estimation, and reduces false positives, resulting in smoother and more reliable object tracking [71].
- Optimisation
  - MobileNetV3: It incorporates several key optimisations to enhance performance in resource-constrained environments. It utilises depthwise separable convolutions to significantly reduce computational load, squeeze-and-excitation (SE) modules to recalibrate channel-wise feature importance, and NAS to balance accuracy and latency effectively for real-time applications [69]. RPEFlow contributes by generating dense optical flow through real-time pixel-wise motion estimation, optimised for the temporal domain. It is designed for low-latency execution on edge devices, requiring minimal computational resources due to its reduced parameter footprint [72].
  - SSD: It enhances object detection by dividing the input image into a grid and applying multiscale default anchor boxes to capture objects of various sizes. Through its convolutional layers, SSD predicts both the object class and its bounding box location in a single forward pass, ensuring fast detection [71].
  - Kalman filter: It optimises the tracking algorithm by predicting future positions of detected objects based on their velocity and motion patterns. It also helps reduce false positives and maintain accurate object tracking, especially during occlusion or brief detection failures [72]. Additionally, a scalable edge-to-cloud architecture allows models to run efficiently on edge devices for real-time operation while leveraging cloud resources for periodic retraining and long-term performance optimisation.
- Adaptation
  - MobileNetV3 + SSD: MobileNetV3 can be pre-trained on large-scale general-purpose datasets and subsequently fine-tuned using domain-specific surveillance footage to improve object recognition accuracy within targeted environments. The model supports dynamic input scaling and incremental retraining, enabling it to adapt over time to new object classes, evolving visual patterns, or changing surveillance contexts. When combined with SSD, MobileNetV3 supports adaptive object detection by leveraging SSD's default anchor boxes at multiple scales and aspect ratios, effectively identifying objects of various sizes in complex scenes [71]. SSD's performance can be further refined through feedback loops or active learning mechanisms, allowing the system to improve detection accuracy with minimal manual supervision.
  - RPEFlow: It plays a crucial role in motion analysis by providing real-time, frame-wise dense optical flow, which is particularly useful in scenarios involving variable lighting, occlusions, or low visibility [72]. Its adaptive capability ensures robust motion estimation and consistent object tracking even under dynamic environmental changes. In parallel, CUE-Net is designed for anomaly detection, learning appearance features and motion dynamics to identify irregular or suspicious behaviours in surveillance footage. It employs weakly supervised learning, minimising the need for extensive labelled datasets, and supports domain adaptation to maintain performance across different surveillance settings. At the system level, machine learning adaptation is further enhanced through sensor fusion strategies: inputs from RGB cameras, LiDAR, and event-based sensors are dynamically weighted based on environmental conditions [72]. For example, LiDAR and event data are strongly preferred during nighttime or in low-light situations. Temporal adaptation is also implemented, with the system learning behavioural baselines over time and adjusting anomaly detection thresholds accordingly.
4.3.4. Evaluation Metrics
-
Object and anomaly detections: To evaluate these detection capabilities of the CV, we utilize a set of standardized quantitative metrics in conjunction with benchmark datasets such as MS COCO and OpenImages. The core evaluation metrics are as follows: [78,79]
- -
-
Precision: It measures how many of the predicted positive detections (e.g. detected objects) are correct. It is defined as:High precision indicates a low false positive rate. In surveillance, this means fewer false alarms from incorrectly detected objects.
- -
-
Recall: It quantifies how many of the actual positives are correctly detected by the model. It is defined as:High recall ensures the system misses as few real objects as possible.
- -
-
F1-Score: It is the harmonic mean of precision and recall, balancing both metrics:It is especially useful when there is an uneven class distribution or when both false positives and false negatives are important.
  - Accuracy: It measures the proportion of all predictions, positive and negative, that are correct. It is defined as $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where $TN$ is the number of true negatives.
  - Intersection over Union (IoU): It measures the overlap between a predicted bounding box and the ground-truth box: $\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$. A higher IoU value means the prediction is more accurate. In evaluation, a detection is often counted as "correct" when its IoU exceeds a threshold (e.g., 0.5).
  - Mean Average Precision (mAP): It summarises the precision-recall curve across different IoU thresholds and object classes. It is computed as $\text{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i$, where $AP_i$ is the average precision for class $i$ and $N$ is the number of classes. It reflects both the localisation and classification performance of an object detection system.
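As referenced above, the sketch below shows how these detection metrics can be computed from raw counts and bounding boxes. The helper names and example numbers are illustrative only; benchmark toolkits such as the official COCO evaluator implement the full mAP computation across classes and IoU thresholds.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision = TP/(TP+FP); Recall = TP/(TP+FN); F1 = their harmonic mean.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); IoU = intersection area / union area.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A detection typically counts as a true positive when IoU exceeds 0.5.
print(precision_recall_f1(tp=80, fp=10, fn=20))   # (0.888..., 0.8, 0.842...)
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))        # 25 / 175 ≈ 0.143
```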
- System-level performance
  - Latency Measurement: Latency is defined as the time elapsed between the acquisition of an input (e.g., an image, video frame, or event stream) and the generation of the corresponding system output (such as an object detection, anomaly alert, or visual feedback). It is a critical metric for evaluating the responsiveness of real-time surveillance systems. Latency can be measured in two main ways: end-to-end latency, which captures the total time from sensor input to final system output; and component-level latency, which isolates the processing time of individual modules (e.g., sensor capture, preprocessing, inference, and output rendering). Accurate measurement of both types helps identify bottlenecks and optimize system performance [69] (see the timing sketch after this group).
  - Processing Speed: Processing speed refers to the rate at which the surveillance system can handle input data, typically measured in frames per second (FPS) for image and video streams or points per second for point cloud data. This metric indicates the system’s capacity to maintain real-time operation under varying workloads. [80] presented a comparative analysis of processing speed across different perception models, highlighting its importance in evaluating real-time performance for safety-critical applications such as autonomous driving and surveillance.
  - Throughput: It refers to the total volume of data that the surveillance system can process per second, reflecting the overall data-handling capacity of the system under operational conditions. [72] demonstrated throughput-aware optimisation for multisensor systems.
- Scalability: It refers to the system’s capacity to sustain acceptable performance as deployment parameters expand, such as the number of integrated sensors, the size of the monitored area, the volume of concurrently tracked objects, or increasing computational demands.
  - Sensor Scalability: To assess sensor scalability, we progressively increase the number of input sources (ranging from 2 to 20 cameras or LiDAR units) and measure key performance indicators such as system latency, frame rate (frames per second), and throughput under each configuration. This evaluation reveals the system’s capability to handle additional sensory input without significant degradation in performance.
  - Network Scalability: It is evaluated by monitoring system bandwidth usage as sensor data transmission scales. Tools such as NetHogs can be used to analyze real-time bandwidth consumption, as demonstrated in [81].
- Robustness: It assesses the system’s ability to sustain functional performance under adverse or variable conditions, such as poor lighting, sensor degradation, network interruptions, or adversarial inputs.
  - Synchronisation Robustness: This test evaluates the synchronisation between two video sequences. The method in [82] computes a matching score between 0 and 1 for each pair of frames from the two videos to build a matching matrix (frame matching), and then uses that matrix to estimate the delay between the sequences by analyzing the pattern of highest matching scores across frames (delay estimation).
  - Corruption Robustness: To enhance resilience against visual corruptions, two augmentation strategies are applied: (1) replacing each training image with a stylized version; and (2) augmenting the dataset by including stylized variants alongside the original images. This approach, shown to improve robustness in object detection tasks, follows the methodology outlined in [75,83].
  - Robust Perception Under Adverse Conditions: To evaluate perception robustness in challenging environments (e.g., rain, fog, and low lighting), the following procedures from [84] are adopted: (1) data augmentation via unpaired image-to-image (I2I) synthesis; (2) two-branch architecture design: implement a generalized two-branch network that processes both the original and the I2I-enhanced images; (3) comparative analysis: conduct a comprehensive performance analysis across three configurations (image enhancement only, augmentation only, and the combined two-branch architecture) to assess their effectiveness in improving robust perception.
- Security Evaluation: It involves a comprehensive assessment of how effectively the system safeguards against unauthorized access, data breaches, and malicious manipulation.
  - Physical Security Measures: These form the first line of defence and include robust housing for sensors, cables, and computing infrastructure to protect them from environmental factors and tampering. Access control must be enforced through controlled physical access zones, complemented by visible signage to deter unauthorized entry and ensure clear surveillance coverage.
  - Data Protection and Privacy: The system must implement role-based access control to restrict sensitive data access to authorized personnel only. Multifactor authentication adds an essential layer of security for system access. Sensitive data should be securely stored, managed according to defined retention policies, and deleted when no longer needed. To protect the integrity and confidentiality of data in transit, end-to-end encryption should be adopted across video streams, sensor data, and command signals. Audit logs should be maintained and regularly reviewed to detect and respond to unauthorized access attempts.
5. Discussions
5.1. Critical Evaluation
5.1.1. Increased Focus on Real-Time Performance
- Synchronisation of Data Streams
  - Time Synchronisation: Accurate time synchronisation with stringent latency is essential for real-time CV applications in CPS, where multiple sensors, devices, and processing units must operate cohesively. The Flooding Time Synchronisation Protocol (FTSP) [85] is designed for real-time distributed environments, aiming to minimise synchronisation errors between multiple nodes and achieve tighter timing accuracy. However, this approach can introduce significant communication overhead, leading to network congestion and increased energy consumption. To address these challenges, [12] demonstrates that an event-based consensus clock synchronisation scheme incorporating a low-pass filter and a novel event-triggered mechanism can effectively reduce communication overhead. Furthermore, [86] explores various asynchronous time synchronisation protocols across different applications to improve FTSP scalability, adaptability, latency, and accuracy. Moreover, integrating FTSP with techniques such as data compression, prediction algorithms, and data aggregation can further improve its efficiency [87]. For wired networks, the IEEE 1588 Precision Time Protocol (PTP) is specifically designed for high-precision synchronisation in industrial automation and robotics; [88] demonstrates that a multi-domain PTP system design can effectively mitigate network faults, ensuring reliable real-time control (a minimal sketch of the two-way offset estimation behind such protocols follows this group). For lightweight synchronisation in CPS applications, the Collaborative Siamese Network (CoSiNeT) is a state-of-the-art method for handling network packet delay variations in industrial environments with constrained hardware and software resources [89].
  - Real-time Data Fusion: [6] demonstrates that a fusion framework combining FCNx with EKF provides a cost-effective, lightweight, modular, and robust solution for real-time road detection, including road segmentation, obstacle detection, and tracking in autonomous vehicles equipped with LiDAR, camera, and radar sensors. In industrial applications such as recycling, mining, and food processing, the auction algorithm [11] outperforms traditional multi-target tracking methods for sensor-based sorting.
  - Edge Computing: Recent advancements in edge computing focus on processing data locally to reduce latency and enhance real-time inference. However, operations across distributed nodes can lead to inconsistencies due to time drift. To address this, the time synchronisation-aware edge-end collaborative network routing management (TSA-RM) scheme in [90] optimizes routing to minimise the weighted sum of the model training loss function and delay, ensuring coherent and timely data processing while balancing training loss and latency. Balancing real-time performance with energy efficiency remains a key challenge for embedded vision hardware. Research in [91] on energy-aware edge computing indicates that offloading computational tasks to edge servers can significantly reduce power consumption while maintaining low latency. This strategy is particularly advantageous for mobile and embedded devices in autonomous systems, where energy resources are constrained.
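To make the clock-offset arithmetic underlying PTP/NTP-style protocols concrete (as referenced in the time synchronisation item above), the sketch below implements the standard two-way time-transfer equations. It is a simplified illustration that assumes a roughly symmetric network path and omits the filtering and servo loop that real IEEE 1588 implementations run.

```python
def estimate_offset_and_delay(t1, t2, t3, t4):
    """Two-way time-transfer arithmetic used, in essence, by PTP/NTP.

    t1: request sent (local clock)    t2: request received (remote clock)
    t3: reply sent (remote clock)     t4: reply received (local clock)
    Assumes the forward and return path delays are approximately equal.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0   # remote clock minus local clock
    delay = ((t4 - t1) - (t3 - t2)) / 2.0    # estimated one-way path delay
    return offset, delay

# Example timestamps in seconds:
# t1=100.000, t2=100.015, t3=100.020, t4=100.010
# -> offset = (0.015 + 0.010)/2 = 0.0125 s (remote clock ahead)
# -> delay  = (0.010 - 0.005)/2 = 0.0025 s
print(estimate_offset_and_delay(100.000, 100.015, 100.020, 100.010))
```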
- Optimisation of ML Models
  - Model Compression: Real-time optimisation techniques are designed to reduce computational demands while addressing the scalability and complexity of neural networks. Model compression methods [15,49,50] have been effective in simplifying neural architectures, while lightweight architectures [51,52] provide efficient solutions for real-time inference.
  - Hardware Acceleration: Specialized hardware accelerators, including parallel and distributed capabilities, are pivotal for enhancing performance and efficiency. GPU-based implementations [11,17] optimize performance by minimising latency and increasing throughput. As scalable ML models evolve, the shift toward specialized hardware has accelerated: TPUs, designed for tensor operations, optimize large-scale, low-precision computations, while FPGAs offer reconfigurable logic for real-time tasks in industrial automation and robotics. ASICs provide ultra-low latency and high throughput for dedicated applications but lack flexibility and require high initial investment. Each hardware accelerator presents trade-offs in throughput, precision, power consumption, and area [92]. The industry is transitioning beyond GPU-centric approaches, leveraging a diverse array of specialized chips. Approaches like COSMOS (Coordination of High-Level Synthesis and Memory Optimisation for Hardware Accelerators) utilize Pareto-optimal implementations to balance these trade-offs, reducing computational overhead and optimising energy efficiency [93].
  - Other Techniques: Edge computing reduces latency by processing data locally, minimising dependence on centralized servers while enhancing efficiency and data privacy [55,94]. Real-time data augmentation dynamically transforms data during preprocessing, optimising it for run-time processing [19,54]. To further reduce computational overhead in video processing, frame-skipping techniques selectively process key frames instead of running deep learning models on every frame; this leverages the temporal correlation between consecutive frames, ensuring minimal performance loss while significantly improving efficiency. For instance, FrameHopper [95] employs a reinforcement learning-based approach to determine optimal frame skip lengths, balancing detection accuracy with reduced processing load (a simplified frame-skipping sketch follows this group). Similarly, adaptive sampling strategies dynamically adjust the sampling rate based on scene complexity or motion patterns. Methods like MGSampler [96] optimize sampling by prioritising frames containing significant changes, reducing redundant computations while preserving critical information for object detection and tracking. These techniques collectively improve real-time performance in resource-constrained environments, making them vital for edge-based computer vision applications in CPS.
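The frame-skipping idea referenced above can be illustrated with a fixed-threshold variant: run the detector only when the scene has changed sufficiently since the last processed frame. FrameHopper [95] learns the skip length with reinforcement learning; this stand-in uses a hand-set pixel-difference threshold and is purely illustrative.

```python
import numpy as np

def process_stream(frames, detect, diff_threshold=12.0):
    """frames: iterable of numpy image arrays; detect: expensive model call.
    Runs `detect` only on frames that differ enough from the last processed one."""
    last_processed, results = None, []
    for frame in frames:
        if last_processed is None:
            changed = True                           # always process the first frame
        else:
            # Mean absolute pixel difference as a cheap scene-change measure.
            changed = np.mean(np.abs(frame.astype(np.float32)
                                     - last_processed)) > diff_threshold
        if changed:
            results.append(detect(frame))            # expensive detector call
            last_processed = frame.astype(np.float32)
        else:
            results.append(results[-1])              # reuse detections for skipped frame
    return results
```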
- Adaptation to Dynamic Environments
  - Model-Agnostic Meta-Learning (MAML): It is a meta-learning technique designed to enable machine learning models to adapt quickly to new tasks with minimal training data, particularly effective in few-shot image classification and object detection [42].
  - Domain Adaptation: In many CPS applications, the visual environment is non-stationary; for instance, an autonomous vehicle must handle variations in lighting and weather. A domain-adaptive object detection framework in [97] can address performance degradation under foggy and rainy conditions. The approach includes image-level and object-level adaptations to minimise domain gaps in image style and object appearance. Additionally, a novel adversarial gradient reversal layer enhances model robustness by mining hard examples, while an auxiliary domain built through data augmentation enforces domain-level metric regularisation. [98] introduces the Gated Image-Adaptive Network (GIANet), a novel framework that enhances low-light images and clusters bounding boxes effectively.
  - Online Continual Learning: This approach allows models to adapt to new tasks or data distributions without catastrophic forgetting. Methods such as Elastic Weight Consolidation (EWC) [99,100] or memory replay mechanisms [101] help preserve previously learned knowledge while integrating new information, developing models capable of real-time adaptation.
  - Transfer Learning: Fine-tuning pre-trained models on a small subset of new data helps the model quickly adapt to new tasks or environments; techniques such as few-shot learning are particularly valuable when only limited new data is available (see the fine-tuning sketch after this list).
5.1.2. Hybrid Approaches
- Digital Twin (DT): This approach integrates physical and virtual optimisation layers, which together provide a powerful means of improving system efficiency and robustness in CPS. These methods leverage the strengths of both the physical domain (e.g., real-world sensors, actuators, and processes) and the virtual domain (e.g., simulations, predictive algorithms, and digital twins) to create a cohesive and adaptive system. A bio-inspired LIDA (Learning Intelligent Distribution Agent) cognitive-based DT architecture [102] facilitates unmanned maintenance of machine tools by enabling self-construction, self-evaluation, and self-optimisation; it provides valuable insights into implementing real-time monitoring in dynamic production environments. In the manufacturing industry, DTs enhance flexibility and efficiency while addressing safety and reliability challenges in collaborative tasks between human operators and heavy machinery, enabling accurate detection and action classification under diverse conditions, as demonstrated in studies [9,12]. Another prominent application is an autonomous-driving test system under hybrid reality [31], which improves efficiency, reduces costs, and enhances safety, offering a robust solution for autonomous driving development. However, the continuous data transmission from the physical world to the virtual DT system demands extensive wireless resources. A continual reinforcement learning algorithm in [103] can learn a stable policy across historical experiences to quickly adapt to physical states and network capacity dynamics, optimising digital twin synchronisation. Nevertheless, the current state of DT technology often requires offline system halts for model updates, with implementations depending on backends that enforce strict data exchange constraints. To overcome these limitations, the CoTwin framework [104] introduces a dynamic approach that enables online model refinement in CPS without operational disruptions. By leveraging a blockchain-based collaborative space for secure data management and integrating neural network algorithms for fast, time-sensitive execution, this framework ensures stable and efficient performance while meeting the stringent temporal requirements of CPS, a significant advantage in industrial applications.
- Hybrid Optimisation Algorithms: Integrating classical optimisation techniques with cutting-edge ML models offers a powerful approach to tackling complex challenges in CPS. For example, real-time tracking demands significant computational resources, posing challenges for deployment in resource-constrained environments. [105] addresses this by combining a classical optical flow algorithm with a deep learning model, striking a balance between accuracy and efficiency; designed for human-crowd tracking, this method effectively reduces computational costs while maintaining high tracking performance. Similarly, [106] presents a hybrid approach for real-time object tracking and classification that merges frame-based and event-based methods, enabling real-time performance on embedded platforms without compromising accuracy and making it well-suited for energy-efficient applications.
- Integration with Transformers: Transformers utilize self-attention mechanisms [107] to capture global contextual information in images, unlike CNNs, which process information locally. With compression techniques and hardware acceleration, transformers can achieve faster and more efficient decision-making in resource-constrained environments. For instance, [108] optimizes the ARM Keyword Transformer (KWT) on a RISC-V platform, achieving a 5× speedup and power reduction for edge devices. Similarly, [109] employs distinct quantisation and approximation strategies in Softmax and LayerNorm, enhancing energy and area efficiency for end-to-end inference on GPU and ASIC platforms.
5.1.3. Human-in-the-loop
5.1.4. Standardized Benchmarks
- Diversity in Application Requirements: CPS applications have highly varied requirements in terms of latency, fault tolerance, and real-time responsiveness. For example, autonomous driving systems require low latency and strict real-time synchronisation [24], whereas construction operations prioritize robustness and fault tolerance [9]. These differences make it difficult to create universal benchmarks that address the needs of all domains effectively.
- Heterogeneous Architectures: CPS systems involve a complex mix of hardware, software, and communication protocols. Variability in processing speeds, sensor accuracies, and network latencies requires synchronisation and optimisation solutions customized to diverse architectures. Standard benchmarks often fail to account for these architectural disparities.
- Dynamic Operating Environments: CPS must perform reliably in environments with unpredictable changes, such as varying workloads, communication delays, and environmental disturbances. Creating benchmarks that accurately simulate such dynamic conditions is a complex and resource-intensive task that makes standardisation challenging.
- Inconsistent Performance Metrics: Without common benchmarks, researchers and practitioners rely on ad hoc evaluation methods. This inconsistency makes it challenging to compare the efficiency, scalability, and effectiveness of different synchronisation and optimisation techniques.
- Limited Reproducibility: The absence of standardized frameworks impedes reproducibility, as the experimental setup and evaluation criteria vary widely between studies. This inconsistency hinders progress in developing reliable CPS solutions.
- Barriers to Collaboration: Standardized benchmarks foster collaboration by providing a shared foundation for evaluating CPS technologies. Without them, it becomes difficult for researchers, engineers, and domain experts to collaborate effectively within a cohesive ecosystem.
- Challenges in Real-World Applications: Many CPS applications, such as automotive systems and smart grids, require rigorous testing and validation to meet safety and performance standards. The lack of standardized benchmarks hampers this process, potentially affecting system reliability and trustworthiness.
5.1.5. Ethical and Societal Implications
- Privacy Infringement
  - Implication: ML-based CV systems often process images and video data that can identify individuals in both public and private spaces. These images are frequently collected without explicit consent, raising ethical concerns regarding the unauthorized use of personal data. Even when datasets are anonymized, there remains a significant risk of re-identification through advanced analytical techniques, particularly when combined with other data sources [?]. This presents substantial threats to individual privacy [?].
  - Mitigation: To address these concerns, policies and implementation practices must prioritize obtaining explicit consent from individuals before data collection, especially in sensitive environments. In cases where direct consent is impractical, visible signage, opt-out mechanisms, or public notifications should be provided to ensure transparency. Developers and system operators should adopt strong privacy frameworks grounded in informed consent and apply advanced anonymisation techniques when consent is not feasible. These anonymisation methods must be regularly updated to keep pace with emerging re-identification techniques [?]. In addition, encryption-based approaches offer further protection: Homomorphic Encryption allows computations on encrypted data without the need for decryption, while Secure Multiparty Computation (SMC) enables multiple users to collaboratively compute a function without revealing their data [114]. Finally, privacy-by-design principles should be integrated throughout the system’s lifecycle, ensuring minimal data collection, transparency in data usage, and rigorous safeguards against misuse.
- Algorithmic Bias and Discrimination
  - Implications: CV systems trained on biased datasets can result in unfair outcomes such as misclassification or discriminatory treatment based on race, gender, or age. These datasets often encode societal biases, which are inadvertently learned and propagated by machine learning models. As a result, marginalized communities may face disproportionate harm, especially when these technologies are used in surveillance applications that operate without individuals’ knowledge or consent, infringing on civil liberties. Furthermore, CV models may detect statistically valid but non-causal patterns, leading to incorrect or misleading conclusions. The effectiveness and fairness of these systems are heavily dependent on the quality and integrity of their training data. When the data is incomplete, outdated, or unrepresentative, the resulting models will inevitably reflect and amplify these deficiencies [115].
  - Mitigation: Addressing these challenges requires a multi-pronged approach. First, datasets should be carefully curated to ensure diversity and adequate representation of all demographic groups. Second, routine algorithmic audits using fairness metrics must be conducted to monitor and mitigate biased outcomes. Third, embedding explainability into the model design is essential to promote transparency, enabling stakeholders to interpret decisions and identify potential bias [114]. Moreover, involving ethicists, domain experts, and representatives from affected communities throughout the development lifecycle ensures that ethical considerations are prioritized and embedded from the outset [115].
- Safety and Adversarial Vulnerabilities
  - Implications: ML-driven CV systems are highly sensitive to data distribution shifts, adversarial inputs, and environmental variability, which can lead to unsafe or unintended decisions [116]. Overreliance on automated CV systems risks reducing essential human oversight, potentially leading to unchecked system behaviour in critical scenarios. These systems are also vulnerable to spoofing, tampering, and adversarial attacks, such as adversarial examples crafted to deceive neural networks into misclassifying objects, posing severe security risks [?]. Furthermore, the growing use of generative technologies capable of producing hyper-realistic yet fabricated audiovisual content introduces serious threats to democratic processes, including election manipulation and the erosion of public trust.
  - Mitigation: Enhancing the resilience of ML-driven CV systems requires integrating adversarial training and developing robust detection mechanisms to identify and neutralize spoofing or tampering attempts. Critical applications should maintain meaningful human-in-the-loop oversight to safeguard against over-automation and preserve accountability. Additionally, ethical system design must include built-in fail-safes and strong authentication protocols to prevent the malicious use of synthetic media, such as deepfakes, thereby protecting both individual rights and societal stability.
- Labour Displacement and Job Restructuring
  - Implications: The integration of automated CV technologies into CPS (such as in retail checkout, quality control, and industrial inspection) poses a significant risk of displacing both manual and cognitive labour. While early automation concerns centred on physical jobs, generative AI now threatens knowledge-based professions across all educational levels. Increasingly, robots are performing roles in security, food service, and logistics, while AI is capable of completing tasks historically assigned to radiologists, software developers, designers, and journalists. Although automation can enhance efficiency and productivity, it also risks driving down wages and reducing employment opportunities, especially in the absence of adequate retraining infrastructure. The lack of robust vocational training and career transition programs further compounds the vulnerability of the workforce [?]. Nevertheless, some forecasts remain optimistic, suggesting that, with appropriate policy and educational reforms, AI and robotics could ultimately generate more jobs than they displace [117]. The central challenge lies in preparing workers to thrive in these emerging roles through strategic investments in education and human-centred skill development.
  - Mitigation: Addressing labour displacement requires proactive and collaborative action from policymakers, educators, and industry leaders. Large-scale reskilling and upskilling initiatives must be prioritized, emphasising digital literacy, adaptive learning, and human-AI collaboration. These programs should be tailored to prepare workers for evolving roles and enable lifelong learning pathways. In parallel, transitional support systems including career counselling, job placement services, and mental health resources can ease the socioeconomic impact on displaced individuals. Incorporating inclusive design principles that involve workers in co-developing AI systems can ensure that automation complements rather than replaces human labour [117]. Moreover, governments should create regulatory frameworks and incentive structures that encourage ethically aligned innovation, rewarding companies that pursue socially responsible automation. Collectively, these strategies foster an ecosystem where technological progress supports equity, resilience, and human dignity [??].
5.2. Interdisciplinary Perspectives
5.2.1. Interoperability, complexity, and sustainability of CPS
- Interoperability: It refers to the ability of diverse systems or components to work together seamlessly. The development of CPS necessitates collaboration across multiple engineering disciplines, each specialising in different aspects of the product. Engineers, developers, and designers operate within various departments, each with distinct priorities and experience levels, relying on specialized software tools throughout the product life cycle [118]. To ensure proper system functionality and seamless integration of interconnected components, [119] introduces a domain-specific language for CPS that improves the design, verification, and deployment of these intricate systems [120,121].
- Complexity: It arises from multiple interconnected components and variables, leading to unpredictability, non-linearity, and randomness. Addressing complexity requires interdisciplinary insights. [122] introduces the parallel systems method, based on Artificial systems, Computational experiments, and Parallel execution, to develop parallel intelligence, a cycle that iteratively generates data, acquires knowledge, and refines real-world systems. This approach provides versatile solutions for managing complex systems across diverse fields.
- Sustainability: It refers to a system’s ability to endure indefinitely by balancing environmental, economic, and social factors, which necessitates interdisciplinary strategies. [118] proposes integrating three mindsets (systems mindset, futuristic mindset, and design mindset) to navigate the uncertainty and complexity inherent in sustainability challenges.
5.2.2. Designing Effective Solutions
5.2.3. Data Interpretation in Real-World Implementation
5.2.4. Real-Time Decision-Making
5.3. Emerging Trends
5.3.1. Edge Artificial Intelligence
- Real-Time Processing and Low Latency: Edge AI revolutionizes real-time decision-making by enabling on-device data processing, which minimises latency and ensures instant responses. This capability is indispensable for applications that demand immediate and reliable decision-making, such as autonomous vehicles and health care, where rapid responses are not merely beneficial but critical. For example, automotive systems must handle vast amounts of heterogeneous sensor data, which requires high-performance, energy-efficient hardware that can process this information in real time, allow functional modules to interact seamlessly with low overhead, and operate under strict energy constraints, emphasising the need for optimised hardware and computational techniques. By decentralising intelligence, edge AI brings ML model training and inference directly to the network, enabling communication between edge systems and infrastructure and reducing the computational burden on the edge systems [10]. Edge AI also brings numerous benefits to the functionality and efficiency of medical devices, especially in the realm of the Internet of Medical Things (IoMT) [132]. By processing data locally, edge AI ensures faster, real-time decision-making, which is crucial in medical contexts; in remote monitoring systems, for instance, critical health alerts can be generated instantly and communicated to caregivers or medical professionals, improving the reliability and responsiveness of these systems. In such cases, however, local storage capacity and the synchronisation of sensor data may pose challenges for application developers.
- Enhanced Security and Privacy: Edge AI minimises the need to transmit sensitive data to central servers, significantly enhancing the security and privacy of decentralized CPS applications. This localized processing not only reduces exposure to potential data breaches but also strengthens the overall resilience of the system. Ensuring the reliability, security, privacy, and ethical integrity of edge AI applications is paramount, as edge devices handle sensitive information with potentially severe consequences in the event of a breach. Robust encryption methods, stringent access controls, and secure processing and storage frameworks are indispensable for safeguarding data and maintaining trust [129]. Hardware-supported Trusted Execution Environments are often employed to enhance security by isolating sensitive computations. However, these solutions present challenges related to performance and integration, necessitating a delicate balance between maintaining robust security and ensuring efficient system operations. Addressing these challenges is critical for the successful deployment of edge AI in secure and decentralized CPS environments.
- Energy Efficiency: The growing demand for AI applications highlights the need for energy-efficient and sustainable edge AI algorithms. Advanced AI, particularly deep learning, consumes substantial energy, posing sustainability challenges. Developing lightweight, energy-efficient AI models is essential for supporting edge devices with limited computational resources, thereby enhancing the sustainability of CPS applications; computational offloading is another effective method to reduce energy consumption on edge devices [132]. However, achieving a balance between high performance and energy efficiency is crucial: small gains in accuracy often require significantly more energy, which is inefficient and environmentally unsustainable when ultrahigh accuracy is not necessary, so researchers must carefully evaluate the trade-offs between accuracy and energy use. Given the significant energy impact of the operation, production, and lifecycle of edge devices, creating durable, upgradeable, and recyclable devices is vital to minimise ecological impact. Implementing policies that promote energy-efficient AI and regulate the environmental footprint of device manufacturing and disposal are critical steps toward achieving sustainability in edge AI [129].
- Interoperability: Efforts are being made to develop comprehensive standards and frameworks that ensure seamless interoperability between edge devices and CPS components across diverse applications. These standards aim to establish uniform protocols for data exchange, device communication, and system integration, enabling heterogeneous edge devices and CPS components to work together cohesively. Such interoperability is critical for supporting scalability, reducing system fragmentation, and fostering a more unified ecosystem that can accommodate advancements in hardware and software technologies. Moreover, the development of such frameworks addresses challenges related to compatibility, security, and system resilience, providing a robust foundation for reliable decentralized operations. These initiatives also incorporate mechanisms for managing dynamic environments, where edge devices and CPS components must adapt to changing conditions in real time while maintaining performance and reliability [133].
5.3.2. Self-adaptive Systems Leveraging Reinforcement Learning (RL)
- Addressing Design-Time Uncertainty: One of the most significant challenges in developing self-adaptive systems is the uncertainty inherent at design time. Online RL provides a compelling solution [134]. By enabling systems to learn directly from interaction with their environment, RL equips self-adaptive systems with the ability to respond effectively to previously unencountered conditions. This adaptive capacity is critical for systems deployed in dynamic environments, such as autonomous vehicles or distributed cloud-edge networks, where operational contexts can shift unpredictably.
- Real-Time Decision-Making: The ability to make real-time decisions is the cornerstone of self-adaptive systems. RL excels in this domain by continually refining its policies based on operational feedback, ensuring the system remains responsive to changes. RL-driven systems autonomously optimize their behaviour, balancing competing objectives such as performance, energy efficiency, and reliability [134]. This capability is particularly valuable in applications like IoT-driven healthcare, where immediate responses to patient data can be life-saving, and in autonomous systems, where split-second decisions are vital for safety.
- Enhancing Efficiency: Efficiency is a critical consideration in the operation of self-adaptive systems. RL supports this by enabling dynamic resource allocation and optimising the use of computational, energy, and network resources based on current demands. Deep RL integrates energy optimisation with load balancing strategies, in order to minimise energy consumption while ensuring server load balance under stringent latency constraints [135]. Furthermore, RL’s ability to handle nonlinear and stochastic environments makes it particularly well-suited for real-world applications, where unpredictability and instability are the norm. This adaptability ensures robust performance in dynamic and challenging conditions, reinforcing its utility across various domains.
- Generalisation and Scalability: Deep RL extends the capabilities of RL by integrating neural networks to represent the learned knowledge. This allows self-adaptive systems to generalize their learning to unseen states and handle high-dimensional input spaces, such as sensor data or video streams. This generalisation capability is crucial for scalability, enabling RL-driven self-adaptive systems to operate effectively in diverse and complex environments. Applications such as smart cities, where systems must manage vast amounts of real-time data from interconnected devices, benefit immensely from Deep RL’s scalability and adaptability.
5.3.3. Hybrid Machine Learning Models
- Enhanced Performance: Deep learning excels at extracting high-level features from unstructured data, such as images and text, but often requires significant computational resources, whereas traditional algorithms handle structured data well and provide clear interpretability [136]. In [137], the authors applied CNNs and autoencoders to extract features, then used the particle swarm optimisation (PSO) algorithm to select optimal features and reduce dataset dimensionality while maintaining performance. Finally, in a third stage, the selected features were classified using learnable classifiers (decision tree, SVM, KNN, ensemble, Naive Bayes, and discriminant classifiers) to assess the model’s correctness. Combining these techniques yields models that deliver high performance without the prohibitive costs of standalone deep learning methods.
- Improved Generalisation: Hybrid models combine the strengths of deep learning and traditional algorithms, capitalising on deep learning’s ability to handle complex, non-linear relationships in data while utilising traditional methods to enhance interpretability and generalisation, particularly in scenarios involving smaller datasets. For example, the Adaptive Neuro-Fuzzy Inference System (ANFIS), as discussed in [137], exemplifies a hybrid network where fuzzy logic intuitively models nonlinear systems based on expert knowledge or data. Neural networks complement this by introducing adaptive learning capabilities, enabling the system to optimise parameters such as membership functions through input-output data. This integration empowers ANFIS to effectively model complex, nonlinear relationships, making it highly applicable in tasks such as prediction, control, and pattern recognition.
- Scalability and Adaptability to Diverse Tasks: Hybrid models offer remarkable flexibility, enabling customisation for specific applications by integrating the most advantageous features of distinct paradigms. In [138], by combining Statistical Machine Translation (SMT), which uses statistical models to derive translation patterns from bilingual corpora, with Neural Machine Translation (NMT), which employs Sequence-to-Sequence (Seq2Seq) models with RNNs and dynamic attention mechanisms, these approaches capitalize on the statistical precision of SMT and the contextual richness of neural networks. Additionally, ensemble methods enhance translation quality further by amalgamating multiple models, proving particularly effective for domain-specific adaptations and ensuring robust performance.
- Limitations: Hybrid learning systems offer robust solutions for complex data-driven challenges by combining the strengths of both methodologies. However, they face several challenges [136], including high model complexity, which complicates configuration, optimisation, and interpretation. Despite advances in transparency, their layered architecture often obscures decision-making processes, raising issues of interpretability. The extensive and diverse datasets required for training pose significant privacy and security risks. Additionally, deploying and maintaining these systems is resource-intensive due to their sophisticated architecture and the need for regular updates to stay aligned with evolving data and technologies. Real-time processing capabilities can be hindered by the computational intensity of DL components, and the energy demands of training and operating hybrid models raise environmental concerns. Long-term maintenance further demands substantial effort to ensure these models remain effective and relevant in dynamic environments.
- Future Research: Future research in hybrid learning should focus on deeper interdisciplinary integration with fields like cognitive science, medicine, and computing to achieve AI systems that more closely emulate human cognition. Advancing model generalisation is equally critical, emphasising the development of adaptive systems capable of autonomously adjusting to varying datasets and environmental conditions. Additionally, enhancing AI accessibility is essential to democratize its use, improve educational resources, and support community-driven initiatives, thereby broadening the impact of AI as a universal problem-solving tool [136].
5.3.4. Vision Foundation Models
- Segment Anything Model (SAM): A foundational vision model designed for promptable image segmentation. Trained on the large-scale SA-1B dataset, comprising over 1 billion segmentation masks across 11 million images, SAM demonstrates strong zero-shot performance, generalising to new image distributions and tasks without task-specific fine-tuning [142]. SAM has become a cornerstone for high-level vision tasks such as image segmentation, captioning, and editing; however, the high computational cost of its transformer-based architecture limits its practicality in real-time or resource-constrained industrial settings. To address these limitations, [143] proposed FastSAM, a real-time, CNN-based alternative. By reframing segmentation as an instance segmentation task with prompting and training on only 1/50 of the SA-1B dataset, FastSAM achieves performance comparable to SAM while operating at 50× faster inference speed, making it more suitable for real-time and edge deployments. Building on the SAM foundation, [144] reviewed the SAM family, highlighting SAM2’s enhancements in segmentation accuracy and a streaming memory architecture that enables real-time video segmentation. In contrast, [145] proposed the Fovea-Like Input Patching (FLIP) approach, which encodes visual input in an object-centric, off-grid manner from the outset. By decoupling spatial location encoding from perceptual object features, FLIP enhances data efficiency and excels at segmenting small objects within high-resolution scenes. It achieves Intersection-over-Union (IoU) scores comparable to SAM with lower computational overhead and consistently outperforms FastSAM across benchmarks such as Hypersim, KITTI-360, and OpenImages.
- Multimodal Large Language Models (MLLMs): These are advanced AI systems capable of processing and generating content across multiple modalities, typically text and images and, increasingly, video and audio. These models support a wide range of tasks such as image captioning, VQA, and even text-to-image generation. At their core, MLLMs integrate visual encoders with language models like GPT or BERT to enable reasoning over both modalities. A prominent example of a visual encoder is CLIP (Contrastive Language–Image Pretraining). CLIP learns joint image-text representations through contrastive training on 400 million (image, text) pairs from the internet, using dual encoders, one for images and one for text. This enables strong zero-shot generalisation to novel image domains via text prompts [141]. While CLIP’s language grounding and cross-modal capabilities are powerful, its coarse-grained representations and lack of pixel-level detail limit its effectiveness for dense prediction tasks like semantic segmentation. Self-distillation with No Labels (DINO) is a vision-only, self-supervised model that learns high-quality visual features without requiring labelled data. It uses a self-distillation strategy in which a student ViT is trained to match the output of a momentum-updated teacher network. DINO excels at discovering semantic object boundaries and performs well in tasks such as unsupervised object segmentation and fine-grained categorisation [146]. Unlike CLIP, DINO is not trained with text, making it unsuitable for prompt-based or language-guided tasks. However, recent advances with DINOv2 demonstrate that, with a large and diverse curated image dataset and improved training stability, vision-only models can produce universal features that outperform CLIP on many image-level and pixel-level benchmarks [147]. Further insights into DINOv2’s architecture reveal that its layers capture complementary information: lower layers specialize in fine-grained details useful for localisation, while deeper layers encode more global semantic concepts [148]. This insight has led to the design of multi-level feature fusion strategies. Despite being unimodal, DINOv2’s rich spatial representations can be effectively aligned with language models using a simple multilayer perceptron (MLP) alignment head, making it competitive within MLLM architectures. In contrast, other visual backbones such as the Masked Autoencoder (MAE) lack semantic richness, while Data-efficient Image Transformer (DeiT) models, being strongly supervised, struggle with cross-modal alignment [148]. To bridge the gap, a hybrid vision encoder combining CLIP and DINOv2 (COMM) leverages the global-semantic strengths of CLIP and the fine-grained, structure-aware features of DINOv2, resulting in superior performance across a range of MLLM benchmarks. This hybrid approach exemplifies the trend toward combining vision foundation models for robust multimodal understanding [148].
5.4. Research Gaps
5.4.1. Limited Scalability of Synchronisation Methods
- Resource constraints in CPS: One of the most prominent issues with current synchronisation methods is their limited scalability in CPS environments. These systems often operate under resource-constrained conditions, with devices such as sensors, cameras, and actuators limited in bandwidth, energy, and computational power. Many existing synchronisation techniques assume abundant resources, which is unrealistic in practical CPS deployments. Methods are needed that balance synchronisation accuracy against energy efficiency and computational cost; consequently, there is a need for lightweight synchronisation algorithms that optimize resource usage without compromising accuracy or efficiency. In [149], the results show an improvement in precision with increasing sampling rate, at the cost of increased memory consumption and computation time. Similarly, in [7], post-deployment processing to align and synchronize data streams introduces computational overhead, which can challenge resource-constrained systems with limited processing power or memory. Moreover, in [?], the synchronisation approach is based on standard components, which may have limitations in precision and robustness, especially when scaling up or when higher performance is required.
- Bottlenecks in distributed systems: Distributed systems, another core aspect of CPS, face significant synchronisation bottlenecks due to the communication overhead and latency associated with global updates. This is particularly problematic in real-time applications like autonomous vehicles, where even minor delays can have critical consequences. One possible solution, shown in [150], is the use of a polychronous model of computation for concurrent systems, which frees programming from synchronous timing models and enhances robustness against attacks that exploit clock synchronisation failures. This approach allows processes to execute and communicate at their own pace without requiring rigid synchronisation, thereby reducing bottlenecks caused by contention for shared resources. However, current research has not sufficiently addressed techniques to minimise communication requirements, such as model pruning, gradient sparsification, or local aggregation. Becker et al. [151] show that contention among system modules severely affects latency and performance predictability, that LiDAR-related components contribute significantly to system latency, and that even high-end CPU and GPU platforms cannot achieve real-time performance for the complete end-to-end system. Furthermore, ensuring a balance between local computation and global model updates remains an unresolved challenge. Hybrid synchronisation techniques that adapt dynamically to the system’s real-time state could address this issue, but their development is still in its infancy.
5.4.2. Insufficient Focus on Real-World Deployment Challenges
- Environmental Variability: The real-world deployment of ML-based synchronisation methods in CPS introduces a host of challenges that have not received sufficient attention. One major issue is the variability of real-world environments, which often include unpredictable network latency, device failures, and dynamic workloads. Current synchronisation methods are not robust enough to handle these variations, and research on fault-tolerant approaches that can recover gracefully from such disruptions is limited. Developing methods that maintain performance despite environmental variability is essential to advance the reliability of CPS.
- Deployment at scale: Many synchronisation methods are tested in controlled environments or small-scale settings, which do not reflect the challenges of real-world CPS deployments involving hundreds or thousands of nodes. In [102], the datasets and scenarios used are relatively simple and may not fully demonstrate the generality of the proposed cognitive Digital Twin architecture. More experiments in real industrial maintenance scenarios are needed to validate performance in multi-task, resource-allocation contexts involving personnel, spare parts, and materials. Additionally, the example focuses on updating microservices and the knowledge graph post-data analytics, rather than the physical-world operational responses. Future iterations could incorporate deviations between expected and actual outcomes into the self-evolution process to develop more realistic maintenance solutions.
- Real-time constraints: Real-time constraints further complicate deployment. Many CPS applications, such as surveillance and industrial automation, require synchronisation methods that can operate in real time to process high-frequency data streams. However, the latency introduced by current synchronisation methods makes them unsuitable for such applications. Research on event-driven or asynchronous synchronisation mechanisms that prioritize low-latency processing is still nascent and demands further exploration.
5.4.3. Adaptive Models for Diverse CPS Environments
- Heterogeneity in devices: The diversity of CPS environments presents another significant research gap. These systems often involve a wide range of devices with varying capabilities, such as sensors, drones, and cameras, each with different levels of processing power, storage, and communication bandwidth. Current synchronisation methods are not designed to account for this heterogeneity, leading to inefficiencies in resource utilisation. Adaptive synchronisation algorithms that dynamically adjust to the capabilities and constraints of individual nodes are needed to address this gap.
- Diverse tasks: CPS tasks vary widely, ranging from object detection to anomaly detection and action recognition. Each task has unique synchronisation requirements, but current methods often adopt a one-size-fits-all approach, failing to optimize for the priorities of individual tasks. Developing task-specific synchronisation strategies and exploring multi-tasking synchronisation approaches could significantly enhance the performance and flexibility of CPS.
- Dynamic environments: Dynamic environments pose further challenges, as CPS systems often operate under non-stationary conditions where data distributions, network topologies, or operational requirements can change over time. Existing models lack the adaptability to handle such conditions effectively. Self-learning synchronisation methods that adjust based on feedback from the environment offer a promising direction for future research, enabling models to remain robust and effective in evolving CPS scenarios.
5.4.4. Integration with Emerging Technologies
- Edge and Federated Learning: Edge and federated learning paradigms have shifted the focus from centralized to decentralized systems, necessitating new synchronisation strategies tailored for these frameworks. Efficient edge-to-cloud synchronisation techniques and privacy-preserving methods for federated learning are critical areas that require further investigation.
- Neuromorphic Computing and Event-Based Vision: CNNs have achieved remarkable success in computer vision but remain computationally expensive and energy-intensive, making them less practical for edge devices and real-time applications. In contrast, neuromorphic computing, inspired by neural processes in the brain, offers a promising alternative by leveraging spiking neural networks (SNNs) and event-driven processing to create efficient and biologically plausible visual recognition systems. However, despite its advantages in power efficiency and real-time processing, neuromorphic computing is still primarily a research tool rather than a mainstream solution: most current applications focus on benchmark datasets, and neuromorphic systems have yet to consistently outperform deep learning models in accuracy [152]. Moreover, [53] highlighted a potential misconception: SNNs may appear more energy-efficient relative to ANNs than they truly are. A significant limitation of neuromorphic computers is their heavy dependence on traditional host machines for defining software structures and managing communication with sensors and actuators. This reliance can reduce the performance benefits of neuromorphic computing, as the overhead costs of data transfer and processing on conventional systems may offset its efficiency advantages. A key challenge for the future is to minimise the dependence on conventional computers and optimize communication architectures to fully unlock the potential of neuromorphic computing [152,153]. One of the most promising applications of neuromorphic computing in computer vision is edge computing. When integrated with event cameras, neuromorphic systems enable low-latency, real-time applications, making them ideal for high-speed object detection, motion tracking, semantic segmentation, optical flow estimation, and 3D vision in resource-constrained environments [154]. These capabilities are particularly valuable for autonomous systems, robotics, and surveillance, where real-time decision-making with minimal power consumption is crucial. Another key challenge is the integration of curiosity-based SNNs with traditional deep learning models or other machine learning paradigms. Hybrid approaches could combine the strengths of both deep learning and neuromorphic computing, but this integration requires significant research into training methodologies, architectural compatibility, and hardware support. Developing efficient hybrid neuromorphic-deep learning frameworks could unlock new capabilities in adaptive vision systems, self-learning models, and real-time AI applications [152].
6. Recommendations
6.1. For Practitioners
6.1.1. Implementing Edge Computing
- Hardware Utilisation: Select hardware platforms designed specifically for edge computing, such as NVIDIA Jetson, Intel Movidius, or Google Coral. Additionally, leverage accelerators such as GPUs and TPUs to improve computational efficiency.
- Runtime Frameworks: Use optimised inference frameworks such as TensorRT, ONNX Runtime, or PyTorch Mobile to ensure efficient and reliable execution of models on edge devices.
- Partitioning Workloads: Distribute tasks strategically between edge devices and the cloud. Execute time-sensitive computations on the edge while offloading resource-intensive processes, such as retraining or extensive analytics, to cloud infrastructure.
- Real-Time Monitoring: Develop mechanisms to continuously monitor the performance and health of edge devices, ensuring reliability and consistent operation even in varying environmental conditions.
6.1.2. Leveraging Federated Learning
- Data Locality: Maintain sensitive data on local devices and transfer only model updates to a central server. This approach minimises the risk of data breaches and ensures compliance with privacy regulations such as GDPR.
- Communication Efficiency: Employ techniques like model compression, update sparsification, and asynchronous communication to reduce the bandwidth required for transmitting model updates between devices and the central server.
- Security Measures: Implement safeguards such as differential privacy and secure multi-party computation to protect data and model parameters from adversarial attacks and ensure the integrity of the learning process.
- Federated Averaging: Utilise aggregation algorithms such as FedAvg to combine model updates from multiple devices effectively, and enhance the robustness of the aggregation process by integrating outlier detection or Byzantine-resilient methods to handle potentially malicious updates (a minimal FedAvg sketch follows this list).
- Heterogeneity Handling: Design systems capable of accommodating a wide range of edge devices with varying computational capabilities and network conditions. This can be achieved by dynamically allocating tasks based on each device’s capabilities, ensuring efficient and equitable participation in the learning process.
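To make the aggregation step concrete, the following is a minimal sketch of FedAvg-style weighted averaging of client parameters; the flat parameter vectors and sample counts are illustrative assumptions.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """FedAvg: average client parameter vectors weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Hypothetical round: three clients with different amounts of local data.
clients = [np.random.rand(10) for _ in range(3)]   # stand-in model parameters
sizes = [100, 400, 500]                            # local sample counts
global_weights = fed_avg(clients, sizes)
print(global_weights.shape)
```

In practice, only these parameter updates leave the device, which is what preserves the data locality described above.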
6.1.3. Integration of Edge Computing and Federated Learning with CPS
- Monitoring and Continuous Learning: To remain effective in dynamic environments, edge computing and federated learning systems require ongoing monitoring and adaptability. Key elements include:
  - Performance Metrics Tracking: Continuously monitor model performance metrics, such as latency, accuracy, and confidence, to promptly detect and address potential issues (a minimal tracking sketch follows this list).
  - Periodic Updates and Retraining: Incorporate mechanisms for regular model updates and retraining with newly collected data to maintain accuracy and relevance.
  - Anomaly Detection: Implement systems to flag anomalous data or behaviour, enabling swift intervention when deviations from expected operations occur.
- Security and Privacy: Robust security and privacy measures are vital for safeguarding data and ensuring trust in CPS operations. These measures include:
  - End-to-End Encryption: Secure all data transmissions between edge devices, cloud servers, and central aggregation points using encryption.
  - Role-Based Access: Restrict access to models, data, and system components based on user roles and authentication protocols.
  - Defences Against Adversarial Attacks: Employ strategies like input sanitisation, adversarial training, and anomaly detection to protect against malicious activities.
- Ethical and Regulatory Compliance: Ethical standards and regulatory requirements are critical for public trust and the lawful deployment of ML models in CPS. Key considerations include:
  - Transparency: Provide clear and transparent documentation and explanations of model operations, especially in safety-critical applications.
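As an illustration of the metrics-tracking point above, here is a minimal sketch of a rolling latency monitor for an edge device; the window size and latency budget are illustrative assumptions.

```python
from collections import deque

class LatencyMonitor:
    """Rolling-window latency tracker that flags anomalous slowdowns."""
    def __init__(self, window=100, budget_ms=50.0):
        self.samples = deque(maxlen=window)
        self.budget_ms = budget_ms  # hypothetical real-time budget

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def degraded(self):
        # Flag when the rolling mean exceeds the latency budget.
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.budget_ms

monitor = LatencyMonitor(window=4)
for t in [12.0, 14.5, 80.2, 95.1]:  # stand-in per-frame inference latencies
    monitor.record(t)
print("degraded:", monitor.degraded())
```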
6.2. For Researchers
6.2.1. Developing Lightweight ML Models for Resource-Constrained CPS
- Search Space (SSp): Defines the range of architectures that NAS can explore. Advances in SSp have broadened the scope of candidate designs, enabling the discovery of innovative architectures that were previously unattainable.
- Search Strategy (SSt): Encompasses methods for exploring the defined search space. Recent research has focused on improving the efficiency of search strategies and optimising the balance between computational resources and performance outcomes.
- Validation Strategy (VSt): Refers to the techniques used to evaluate the performance of candidate architectures. Enhanced validation strategies have increased the reliability of NAS results while minimising the time and resources required for evaluation (a random-search sketch covering all three components follows this list).
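To ground these three components, here is a minimal random-search NAS sketch; the search space and the proxy score are illustrative assumptions, since a real validation strategy would briefly train each candidate and measure held-out accuracy.

```python
import random

# Search space (SSp): hypothetical candidate depths, widths, and kernel sizes.
SEARCH_SPACE = {"depth": [2, 4, 8], "width": [16, 32, 64], "kernel": [3, 5, 7]}

def sample_architecture():
    """Search strategy (SSt): here, plain random sampling."""
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def validate(arch):
    """Validation strategy (VSt): a stand-in proxy that favours smaller models."""
    return 1.0 / (arch["depth"] * arch["width"] * arch["kernel"])

best = max((sample_architecture() for _ in range(20)), key=validate)
print("best candidate under the proxy:", best)
```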
6.2.2. Exploring TL for Faster Adaptation to New Tasks
- Few-Shot Learning: Facilitates robust training with minimal data in the target domain.
- Zero-Shot Learning: Uses cross-domain knowledge to predict unseen target classes without any labelled data.
- Generalisation: Focuses on transferring knowledge across related tasks (e.g., object detection and semantic segmentation) to save retraining time and resources.
- Federated Transfer Learning (FTL): Extends TL to decentralised, privacy-sensitive contexts where source and target data cannot be shared or centralised [164]. A minimal fine-tuning sketch, the mechanism underlying most of these paradigms, follows below.
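As an illustration of the fine-tuning step at the heart of TL, here is a minimal sketch that freezes an ImageNet-pre-trained torchvision backbone and retrains only a new head; the 10-class target task is an illustrative assumption.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Load an ImageNet-pre-trained backbone and freeze its weights.
model = resnet18(weights=ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for a hypothetical 10-class target task;
# only this new layer is trained, enabling fast adaptation with little data.
model.fc = nn.Linear(model.fc.in_features, 10)
```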
6.3. For Policymakers
6.3.1. Establish Standardized Benchmarks for ML Performance in CPS
- Create comprehensive benchmark repositories to include typical and worst-case models.
- Mitigate IP and security concerns by forming regulatory bodies to redact and standardize real-world models.
- Develop industry-specific benchmarks tailored to the needs of various CPS domains.
- Mandate performance evaluations against benchmarks to ensure baseline performance and reliability prior to deployment.
6.3.2. Promote Open-Source Tools
- Collaboration: Enable global developers to share ideas and improvements, driving innovation and creative problem-solving.
- Accessibility: Make CPS development affordable by eliminating proprietary software costs, levelling the playing field for all.
- Transparency: Build trust through open review, auditing, and enhancement of code, promoting ethical and reliable systems.
- Accelerated Development: Leverage pre-trained models and ready-to-use tools to reduce development time significantly.
6.3.3. Encourage Cross-Domain Collaboration
- Foster research programs that integrate engineering, computer science, economics, and social sciences.
- Establish CPS innovation hubs to tackle common challenges like security, real-time data processing, and model generalisation.
- Create industry consortia to set shared goals, develop common standards, and address deployment challenges collaboratively.
6.3.4. Regulate and Promote Ethical Use of CPS
- Develop ethical guidelines and regulations to ensure CPS operates safely and fairly, especially in sensitive sectors such as healthcare and autonomous vehicles.
- Support explainable CPS for transparency in machine learning models, fostering public trust and accountability.
- Enforce robust cybersecurity and privacy standards to safeguard physical assets and data against attacks or misuse.
- Promote diversity in research and development teams to create inclusive CPS technologies that consider diverse societal impacts.
7. Conclusion
7.1. Summary of Findings
7.1.1. CNN
- R-CNN: A two-stage object detector that achieves high accuracy but is computationally intensive. Variants like Fast R-CNN, Faster R-CNN, and Mask R-CNN improve efficiency and extend functionality to instance segmentation and pose estimation.
- ResNet: Focuses on training deep networks efficiently using residual learning and skip connections, widely applied in image classification, segmentation, and as a backbone for detection models.
- YOLO: A single-stage real-time detector that processes images in one pass, offering high speed with moderate localisation errors. Versions like YOLOv2–YOLOv5 enhance accuracy while maintaining real-time performance, making them suitable for applications like autonomous vehicles.
- SSD: An alternative to YOLO that improves detection accuracy and addresses specific limitations in network design and loss functions.
7.1.2. Federated Learning
- Privacy Preservation: Retains data on local devices, sharing only model updates to protect sensitive information.
- Scalability: Efficiently handles extensive distributed networks with numerous devices.
- Reduced Latency: Minimises communication overhead by processing data locally.
- Heterogeneity Handling: Adapts to diverse and resource-imbalanced environments.
- Robustness and Adaptability: Supports continuous learning and dynamic updates to maintain model reliability.
- Synchronous FL: All client updates are aggregated in lockstep, which is challenging in heterogeneous environments.
- Asynchronous FL: Updates occur independently, improving flexibility but risking stale updates.
7.1.3. Meta-learning
- Prototypical Networks: Facilitate few-shot classification by learning a metric space in which classification is based on distances to class prototypes (a minimal distance-based classification sketch follows this list).
- Siamese Networks: Twin networks that map similar data points closer together in feature space; useful for tasks such as forgery detection across languages and styles.
- Model-Agnostic Meta-Learning: Trains models to adapt quickly to new tasks with a few gradient steps, excelling in few-shot classification, regression, and reinforcement learning.
- Memory-Augmented Models: Use external memory mechanisms to rapidly encode and retrieve new information without extensive retraining.
- Fast Adaptation: Quickly adapts to new tasks with limited data, essential for dynamic environments like drones or robotics.
- Data Efficiency: Leverages prior knowledge, reducing the need for extensive labelled data, critical in areas like medical imaging.
- Cross-Domain Learning: Enhances generalisation across different visual domains, aiding in knowledge transfer between tasks.
- Personalisation: Tailors models to individual preferences or unique environments, such as user-specific AR applications.
- Image Classification: Quickly identifies unseen categories with few-shot or zero-shot learning.
- Object Detection and Tracking: Enhances robustness to visual variations.
- Image Segmentation: Useful in medical imaging and autonomous driving.
- Facial Recognition: Adapts to new faces with minimal training data.
- Pose Estimation and Scene Understanding: Critical for robotics and AR applications.
- Scalability: Difficulty in handling large-scale datasets and high-dimensional CV tasks.
- Generalisation: Struggles to perform well on unseen tasks and domains.
- Computational Complexity: High resource requirements can limit applicability in constrained environments.
- Task Diversity: Developing diverse task sets for training is challenging but essential for real-world generalisation.
- Optimisation Stability: Sensitive to hyperparameters and optimisation methods, requiring careful tuning.
- Interpretability: Deep meta-learning models lack transparency, complicating trust and usability.
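To make the prototypical-network idea concrete, here is a minimal sketch that computes class prototypes from a support set and classifies queries by distance; the embedding dimension, shot counts, and random embeddings are illustrative assumptions standing in for a learned encoder.

```python
import torch

def class_prototypes(support, labels, n_classes):
    # Prototype = mean embedding of each class's support examples.
    return torch.stack([support[labels == c].mean(dim=0) for c in range(n_classes)])

def classify(queries, protos):
    # Negative squared Euclidean distance to each prototype acts as a logit.
    dists = torch.cdist(queries, protos) ** 2
    return (-dists).softmax(dim=1)

# Hypothetical 5-way, 5-shot episode with 64-dimensional embeddings.
support = torch.randn(25, 64)
labels = torch.arange(5).repeat_interleave(5)
protos = class_prototypes(support, labels, n_classes=5)
print(classify(torch.randn(3, 64), protos).argmax(dim=1))
```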
7.1.4. Synchronisation in CPS
- Timestamping: Attaches precise time metadata to data packets to align heterogeneous data streams (a minimal nearest-timestamp alignment sketch follows this list).
- Sensor Fusion: Integrates data from multiple sensors for accurate environmental representation, used in applications like autonomous vehicles and robotics.
- Real-Time Task Scheduling: Ensures low-latency, high-accuracy processing under strict resource and time constraints, crucial for autonomous vehicles, drones, and robotics.
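As an illustration of timestamp-based alignment, here is a minimal sketch that pairs each camera frame with the nearest LIDAR sample by timestamp; the sensor names, sample times, and the 20 ms tolerance are illustrative assumptions.

```python
import bisect

def align(cam_ts, lidar_ts, tol=0.02):
    """Pair each camera timestamp with its nearest LIDAR timestamp (both sorted)."""
    pairs = []
    for t in cam_ts:
        i = bisect.bisect_left(lidar_ts, t)
        # Candidates: the neighbours on either side of the insertion point.
        candidates = lidar_ts[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda lt: abs(lt - t))
        if abs(best - t) <= tol:  # accept only within the sync tolerance
            pairs.append((t, best))
    return pairs

# Stand-in 30 Hz camera stream against slightly offset LIDAR samples.
print(align([0.000, 0.033, 0.066], [0.001, 0.034, 0.070]))
```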
7.1.5. Optimisation Approaches
- Model Compression Techniques (see the compression sketch after this list):
  - Pruning: Removes redundant neurons or connections to reduce model size without significantly impacting accuracy.
  - Quantisation: Lowers the precision of model weights, reducing memory usage and speeding up computation.
  - Knowledge Distillation: Transfers knowledge from a large, complex model (teacher) to a smaller, simpler one (student), maintaining similar accuracy while enhancing computational efficiency.
  - Low-rank Factorisation: Approximates weight matrices with lower-rank matrices, reducing parameters, speeding up training, and improving inference efficiency.
  - Transfer Learning: Reuses pre-trained models for related tasks, reducing the need for extensive data labelling and speeding up adaptation to new tasks.
- Lightweight Architectures:
  - MobileNet: Uses depthwise separable convolutions to create lightweight models, customisable for different trade-offs between latency and accuracy.
  - EfficientNets: Achieve high accuracy with fewer parameters and FLOPS, designed for efficient scaling while maintaining performance.
- Hardware Acceleration and Optimisation:
  - Parallelism: Utilises GPUs and specialised hardware (e.g., TPUs) for faster processing, crucial for large-scale problems.
  - Inference Pipeline Optimisation: Streamlines processing to meet real-time requirements, balancing trade-offs between accuracy, throughput, and latency.
- Data Augmentation:
  - Deeply Learned Augmentation: Uses deep learning models to generate data variations, enhancing robustness.
  - Feature-Level Augmentation: Alters specific data features (e.g., brightness, contrast) to improve generalisation.
  - Meta-learning: Learns to generate optimal augmentations based on data characteristics.
  - Data Synthesis: Uses techniques like 3D modelling and GANs to generate synthetic data, increasing the diversity of the training set.
- Edge Computing:
  - Edge Intelligence: Combines AI and edge computing to address challenges such as reducing latency, optimising resources, and enhancing data privacy.
  - Distributed Learning: Collaborative model training across edge devices reduces bandwidth and enhances privacy, with local data storage requirements.
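To illustrate the pruning and quantisation techniques listed above, here is a minimal PyTorch sketch that applies magnitude pruning and dynamic quantisation to a small model; the layer sizes and the 30% pruning ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Pruning: zero out the 30% smallest-magnitude weights in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# Quantisation: convert Linear weights to int8 for a smaller, faster model.
quantised = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantised(torch.randn(1, 128)).shape)
```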
7.1.6. Adaptation Mechanisms
- Data-driven Adaptation: Utilises real-world data to enable models to adjust to varying conditions, such as lighting, material properties, and geometric complexities. Techniques like data augmentation (geometric and photometric transformations) allow models to adapt to unseen scenarios and improve generalisation across diverse environments.
- Online Learning: Models are continuously updated with new data collected during deployment. This approach is critical for real-time adjustments to environmental changes, such as varying lighting or evolving cybersecurity threats. Examples include combining offline and online learning techniques, such as the Truncated Gradient Confidence-weighted model, for tasks like image classification.
- Transfer Learning: Involves fine-tuning pre-trained models on smaller task-specific datasets, enabling adaptation to new environments without training from scratch. This is especially useful in CPS where models trained in one context are adapted to others. For example, transfer learning is used to detect attacks in CPS using a ResNet, adapting the model to different operational conditions.
- Ensemble Methods: Combines predictions from multiple models to improve accuracy and robustness. By leveraging diverse architectures (e.g., MobileNetV2, ResNet50), ensemble methods enhance the model’s ability to adapt to variations in data quality and features, as seen in medical image analysis tasks where data can differ in noise, resolution, and conditions.
- Adversarial Training: Enhances model resilience to small, intentional perturbations by incorporating adversarial examples into the training process. This helps models adapt to a broader range of input variations, making them more robust against adversarial attacks. Randomised smoothing further strengthens this by adding noise to the input, ensuring the model is resistant to adversarial modifications (a minimal adversarial-example sketch follows this list).
- Federated Learning: Allows distributed CPS devices to train models locally and share updates without centralising sensitive data. This improves performance while maintaining privacy. FL employs various aggregation methods (e.g., averaging, FedGKT) and privacy-preserving techniques (e.g., differential privacy, homomorphic encryption). Challenges include high communication overhead and non-IID data issues, but methods like Fed-MPS reduce resource constraints and communication costs.
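As an illustration of how the adversarial examples used in adversarial training are generated, here is a minimal FGSM sketch; the epsilon value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=0.03):
    """Fast Gradient Sign Method: perturb x in the direction that increases loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

# During adversarial training, the model would be fit on both x and
# fgsm_example(model, x, y), improving robustness to small perturbations.
```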
7.2. Current Advancements and Future Directions
Funding
Author Contributions: Kai Hung Tang
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021, 372. [Google Scholar] [CrossRef]
- Phung, V.H.; Rhee, E.J. A High-Accuracy Model Average Ensemble of Convolutional Neural Networks for Classification of Cloud Image Patches on Small Datasets. Applied Sciences 2019, 9. [Google Scholar] [CrossRef]
- Mirzaei, S.; Kang, J.L.; Chu, K.Y. A comparative study on long short-term memory and gated recurrent unit neural networks in fault diagnosis for chemical processes using visualization. Journal of the Taiwan Institute of Chemical Engineers 2022, 130. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Esterle, L.; Grosu, R. Cyber-physical systems: challenge of the 21st century. Elektrotech. Inftech 2016, 299–303. [Google Scholar] [CrossRef]
- Shahian Jahromi, B.; Tulabandhula, T.; Cetin, S. Real-Time Hybrid Multi-Sensor Fusion Framework for Perception in Autonomous Vehicles. Sensors 2019, 19, 4357. [Google Scholar] [CrossRef]
- Bennett, T.R.; Gans, N.; Jafari, R. A data-driven synchronization technique for cyber-physical systems. In Proceedings of the Second International Workshop on the Swarm at the Edge of the Cloud (SWEC '15), New York, NY, USA, 2015; pp. 49–54.
- Wu, H.; Zeng, L.; Chen, M.; Wang, T.; He, C.; Xiao, H.; Luo, S. Weak surface defect detection for production-line plastic bottles with multi-view imaging system and LFF YOLO. Optics and Lasers in Engineering 2024, 181, 108369. [Google Scholar] [CrossRef]
- Robyns, S.; Heerwegh, W.; Weckx, S. A Digital Twin of an Off Highway Vehicle based on a Low Cost Camera. Procedia Computer Science 2024, 232, 2366–2375, 5th International Conference on Industry 4.0 and Smart Manufacturing (ISM 2023). [Google Scholar] [CrossRef]
- Reddy, G.J.; Sharma, D.S.G. Edge AI in Autonomous Vehicles: Navigating the Road to Safe and Efficient Mobility. International Journal of Scientific Research in Engineering and Management 2024, 08, 1–13. [Google Scholar] [CrossRef]
- Maier, G.; Pfaff, F.; Wagner, M.; Pieper, C.; Gruna, R.; Noack, B.; Kruggel-Emden, H.; Längle, T.; Hanebeck, U.D.; Wirtz, S.; et al. Real-time multitarget tracking for sensor-based sorting. Journal of Real-Time Image Processing 2019, 16, 2261–2272. [Google Scholar] [CrossRef]
- Wang, S.; Zhang, J.; Wang, P.; Law, J.; Calinescu, R.; Mihaylova, L. A deep learning-enhanced Digital Twin framework for improving safety and reliability in human–robot collaborative manufacturing. Robotics and Computer-Integrated Manufacturing 2024, 85, 102608. [Google Scholar] [CrossRef]
- Prasad, K.P.S.P. Compressed MobileNet V3: An Efficient CNN for Resource-Constrained Platforms. Purdue University Graduate School. Thesis. 2021. [Google Scholar]
- Wang, S.; Zheng, H.; Wen, X.; Shang, F. Distributed High-Performance Computing Methods for Accelerating Deep Learning Training. Journal of Knowledge Learning and Science Technology 2024, 3. [Google Scholar] [CrossRef]
- Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2016. [Google Scholar] [CrossRef]
- Bird, J.J.; Kobylarz, J.; Faria, D.R.; Ekárt, A.; Ribeiro, E.P. Cross-Domain MLP and CNN Transfer Learning for Biological Signal Processing: EEG and EMG. IEEE Access 2020, 8, 54789–54801. [Google Scholar] [CrossRef]
- Hanhirova, J.; Kämäräinen, T.; Seppälä, S.; Siekkinen, M.; Hirvisalo, V.; Ylä-Jääski, A. Latency and throughput characterization of convolutional neural networks for mobile computer vision. In MMSys '18: Proceedings of the 9th ACM Multimedia Systems Conference; 2018; pp. 204–215. [Google Scholar]
- Shen, Y.; Li, Y.; Liu, Y.; Wang, Y.; Chen, L.; Wang, F.Y. Conditional visibility aware view synthesis via parallel light fields. Neurocomputing 2024, 588, 127644. [Google Scholar] [CrossRef]
- Kaur, P.; Khehra, B.S.; Mavi, E.B.S. Data Augmentation for Object Detection: A Review. In Proceedings of the 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS); 2021; pp. 537–543. [Google Scholar]
- Wang, H.; Zhang, H.; Zhu, L.; Wang, Y.; Deng, J. ResADM: A Transfer-Learning-Based Attack Detection Method for Cyber–Physical Systems. Applied Sciences 2023, 13, 13019. [Google Scholar] [CrossRef]
- Awan, A.A. What is Online Machine Learning? datacamp.com/blog/what-is-online-machine-learning, 2023. [Google Scholar]
- Tahir, A.; Saadia, A.; Khan, K.; Gul, A.; Qahmash, A.; Akram, R. Enhancing diagnosis: ensemble deep-learning model for fracture detection using X-ray images. Clinical Radiology 2024, 79, 1394–1402. [Google Scholar] [CrossRef]
- Liang, F.; Wu, B.; Wang, J.; Yu, L.; Li, K.; Zhao, Y.; Misra, I.; Huang, J.B.; Zhang, P.; Vajda, P.; et al. FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis. arXiv 2023, arXiv:2312.17681. [Google Scholar]
- Bengar, J.Z.; Gonzalez-Garcia, A.; Villalonga, G.; Raducanu, B.; Aghdam, H.H.; Mozerov, M. Temporal Coherence for Active Learning in Videos. arXiv 2019. [Google Scholar] [CrossRef]
- Dantas, P.V.; da Silva Jr, W.S.; Cordeiro, L.C.; Carvalho, C.B. A comprehensive review of model compression techniques in machine learning. Applied Intelligence 2024, 54, 11804–11844. [Google Scholar] [CrossRef]
- Cai, E.; Juan, D.C.; Stamoulis, D.; Marculescu, D. NeuralPower: Predict and Deploy Energy-Efficient Convolutional Neural Networks. arXiv 2017, arXiv:1710.05420. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. Computer Vision - ECCV 2016. Lecture Notes in Computer Science () 2016, 9905, 21–37. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
- Hu, J.; Yan, C.; Liu, X.; Li, Z.; Ren, C.; Zhang, J.; Peng, D.; Yang, Y. An integrated classification model for incremental learning. Multimedia Tools and Applications 2021, 80, 17275–17290. [Google Scholar] [CrossRef]
- Shoukat, M.U.; Yan, L.; Yan, Y.; Zhang, F.; Zhai, Y.; Han, P.; Nawaz, S.A.; Raza, M.A.; Akbar, M.W.; Hussain, A. Autonomous driving test system under hybrid reality: The role of digital twin technology. Internet of Things 2024, 27, 101301. [Google Scholar] [CrossRef]
- Pan, Y.; Luo, K.; Liu, Y.; Xu, C.; Liu, Y.; Zhang, L. Mobile edge assisted multi-view light field video system: Prototype design and empirical evaluation. Future Generation Computer Systems 2024, 153, 154–168. [Google Scholar] [CrossRef]
- Sakina, M.; Muhammad, I.; Abdullahi, S.S. A multi-factor approach for height estimation of an individual using 2D image. Procedia Computer Science 2024, 231, 765–770. [Google Scholar] [CrossRef]
- Kaushik, H.; Kumar, T.; Bhalla, K. iSecureHome: A deep fusion framework for surveillance of smart homes using real-time emotion recognition. Applied Soft Computing 2022, 122, 108788. [Google Scholar] [CrossRef]
- Murel, J.; Kavlakoglu, E. What is object detection? ibm.com/topics/object-detection, 2024. [Google Scholar]
- GeeksforGeeks. What is Object Detection in Computer Vision? geeksforgeeks.org/what-is-object-detection-in-computer-vision/, 2024. [Google Scholar]
- Himeur, Y.; Varlamis, I.; Kheddar, H.; Amira, A.; Atalla, S.; Singh, Y.; Bensaali, F.; Mansoor, W. Federated Learning for Computer Vision. arXiv 2023, arXiv:2308.13558. [Google Scholar] [CrossRef]
- Sprague, M.R.; Jalalirad, A.; Scavuzzo, M.; Capota, C.; Neun, M.; Do, L.; Kopp, M. Asynchronous Federated Learning for Geospatial Applications. Communications in Computer and Information Science 2019, 967. [Google Scholar]
- He, C.; Shah, A.D.; Tang, Z.; Sivashunmugam, D.F.N.; Bhogaraju, K.; Shimpi, M.; Shen, L.; Chu, X.; Soltanolkotabi, M.; Avestimehr, S. FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks. Computer Vision and Pattern Recognition 2021. [Google Scholar]
- Snell, J.; Swersky, K.; Zemel, R.S. Prototypical Networks for Few-shot Learning. arXiv 2017. [Google Scholar]
- Dey, S.; Dutta, A.; Toledo, J.I.; Ghosh, S.K.; Lladós, J.; Pal, U. SigNet: Convolutional Siamese Network for Writer Independent Offline Signature Verification. arXiv 2017, arXiv:1707.02131. [Google Scholar]
- Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv 2017. [Google Scholar]
- Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; Lillicrap, T. Meta-Learning with Memory-Augmented Neural Networks. In Proceedings of The 33rd International Conference on Machine Learning; Balcan, M.F.; Weinberger, K.Q., Eds., New York, New York, USA, 20–22 Jun 2016; Vol. 48, Proceedings of Machine Learning Research, pp. 1842–1850.
- Golovin, D.; Solnik, B.; Moitra, S.; Kochanski, G.; Karro, J.E.; Sculley, D. (Eds.) Google Vizier: A Service for Black-Box Optimization. 2017. [Google Scholar]
- Yang, H.s.; Kupferschmidt, B. Time Stamp Synchronization in Video System. International Telemetering Conference Proceedings 2010. [Google Scholar]
- Lin, Z.; Li, P.; Xiao, Y.; Xiao, L.; Luo, F. Learning Based Efficient Federated Learning for Object Detection in MEC Against Jamming. In Proceedings of the 2021 IEEE/CIC International Conference on Communications in China (ICCC); 2021; pp. 115–120. [Google Scholar] [CrossRef]
- Hu, Y.; Liu, S.; Abdelzaher, T.; Wigness, M.; David, P. Real-time task scheduling with image resizing for criticality-based machine perception. Real-Time Systems 2022, 58, 430–455. [Google Scholar] [CrossRef]
- Ben-Nun, T.; Hoefler, T. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Computing Surveys (CSUR) 2019, 52, 1–43. [Google Scholar] [CrossRef]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015. [Google Scholar] [CrossRef]
- Cai, G.; Li, J.; Liu, X.; Chen, Z.; Zhang, H. Learning and Compressing: Low-Rank Matrix Factorization for Deep Neural Network Compression. Applied Sciences 2023, 13. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017. [Google Scholar] [CrossRef]
- Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020. [Google Scholar] [CrossRef]
- Yan, Z.; Bai, Z.; Wong, W.F. Reconsidering the energy efficiency of spiking neural networks. arXiv 2024, arXiv:cs.NE/2409.08290. [Google Scholar]
- Mumuni, A.; Mumuni, F. Data augmentation: A comprehensive survey of modern approaches. Array 2022, 16, 100258. [Google Scholar] [CrossRef]
- Deng, S.; Zhao, H.; Fang, W.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence. IEEE Internet of Things Journal 2020, 7, 7457–7469. [Google Scholar] [CrossRef]
- Tron, R.; Vidal, R. Distributed computer vision algorithms through distributed averaging. In Proceedings of CVPR 2011; 2011; pp. 57–63. [Google Scholar]
- Lu, J.; Zhao, P.; Hoi, S.C.H. Online Passive-Aggressive Active Learning. Machine Learning 2016, 103, 141–183. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv 2015. [Google Scholar] [CrossRef]
- Cohen, J.M.C.; Rosenfeld, E.; Kolter, J.Z. Certified Adversarial Robustness via Randomized Smoothing. arXiv 2019. [Google Scholar] [CrossRef]
- Jiang, S.; Wang, X.; Que, Y.; Lin, H. Fed-MPS: Federated learning with local differential privacy using model parameter selection for resource-constrained CPS. Journal of Systems Architecture 2024, 150. [Google Scholar] [CrossRef]
- Nazare Jr., A. C.; Schwartz, W.R. A scalable and flexible framework for smart video surveillance. Computer Vision and Image Understanding 2016, 144, 258–275, Individual and Group Activities in Video Event Analysis. [Google Scholar] [CrossRef]
- Gallego, G.; Delbruck, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-Based Vision: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, 44, 154–180. [Google Scholar] [CrossRef] [PubMed]
- Ren, H.; Zhou, Y.; Huang, Y.; Fu, H.; Lin, X.; Song, J.; Cheng, B. SpikePoint: An Efficient Point-based Spiking Neural Network for Event Cameras Action Recognition. arXiv 2024, arXiv:cs.CV/2310.07189. [Google Scholar]
- Ji, M.; Wang, Z.; Yan, R.; Liu, Q.; Xu, S.; Tang, H. SCTN: Event-based object tracking with energy-efficient deep convolutional spiking neural networks. Frontiers in Neuroscience 2023, 17, 1123698. [Google Scholar] [CrossRef]
- Cordone, L.; Miramond, B.; Ferrante, S. Learning from Event Cameras with Sparse Spiking Convolutional Neural Networks. arXiv 2021, arXiv:2104.12579. [Google Scholar] [CrossRef]
- Brödermann, T.; Bruggemann, D.; Sakaridis, C.; Gool, L.V. EFCL Dataset - Technical Report. Technical report, ETH Zurich. 2022. [Google Scholar]
- Xing, W.; Lin, S.; Yang, L.; Pan, J. Target-free Extrinsic Calibration of Event-LiDAR Dyad using Edge Correspondences. arXiv 2023, arXiv:cs.RO/2305.04017. [Google Scholar] [CrossRef]
- Zhou, X.; Dai, Y.; Qin, H.; Qiu, S.; Liu, X.; Dai, Y.; Li, J.; Yang, T. Subframe-Level Synchronization in Multi-Camera System Using Time-Calibrated Video. Sensors 2024, 24. [Google Scholar] [CrossRef]
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar] [CrossRef]
- Senadeera, D.C.; Yang, X.; Kollias, D.; Slabaugh, G. CUE-Net: Violence Detection Video Analytics with Spatial Cropping, Enhanced UniformerV2 and Modified Efficient Additive Attention. arXiv 2024, arXiv:cs.CV/2404.18952. [Google Scholar]
- Cob-Parro, A.C.; Losada-Gutiérrez, C.; Marrón-Romera, M.; Gardel-Vicente, A.; Bravo-Muñoz, I. Smart Video Surveillance System Based on Edge Computing. Sensors 2021, 21. [Google Scholar] [CrossRef]
- Wan, Z.; Mao, Y.; Zhang, J.; Dai, Y. RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation. arXiv 2023, arXiv:cs.CV/2309.15082. [Google Scholar]
- Pei, K.; Zhu, L.; Cao, Y.; Yang, J.; Vondrick, C.; Jana, S. Towards Practical Verification of Machine Learning: The Case of Computer Vision Systems. arXiv 2022, arXiv:1712.01785. [Google Scholar] [CrossRef]
- Santos, V.F.; Albuquerque, C.; Passos, D.; Quincozes, S.E.; Mossé, D. Assessing Machine Learning Techniques for Intrusion Detection in Cyber-Physical Systems. Energies 2023, 16. [Google Scholar] [CrossRef]
- Windmann, A.; Steude, H.; Niggemann, O. Robustness and Generalization Performance of Deep Learning Models on Cyber-Physical Systems: A Comparative Study. arXiv 2023, arXiv:2306.07737. [Google Scholar] [CrossRef]
- Calba, C.; Goutard, F.L.; Hoinville, L.; Hendrikx, P.; Lindberg, A.; Saegerman, C.; Peyre, M. Surveillance systems evaluation: a systematic review of the existing approaches. BMC Public Health 2015, 15. [Google Scholar] [CrossRef]
- Kim, Y.; Jeong, J. A Simulation-Based Approach to Evaluate the Performance of Automated Surveillance Camera Systems for Smart Cities. Applied Science 2023, 13, 10682. [Google Scholar] [CrossRef]
- Pagire, V.; Chavali, M.; Kale, A. A comprehensive review of object detection with traditional and deep learning methods. Signal Processing 2025, 237, 110075. [Google Scholar] [CrossRef]
- Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; Da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279. [Google Scholar] [CrossRef]
- Luo, C.; He, X.; Zhan, J.; Wang, L.; Gao, W.; Dai, J. Comparison and Benchmarking of AI Models and Frameworks on Mobile Devices. arXiv 2020, arXiv:2005.05085. [Google Scholar] [CrossRef]
- Park, H.D.; Min, O.G.; Lee, Y.J. Scalable Architecture for an Automated Surveillance System Using Edge Computing. J Supercomput 2017. [Google Scholar] [CrossRef]
- Boizard, N.; Haddad, K.E.; Ravet, T.; Cresson, F.; Dutoit, T. Deep learning-based stereo camera multi-video synchronization. arXiv 2023, arXiv:cs.CV/2303.12916. [Google Scholar]
- Michaelis, C.; Mitzkus, B.; Geirhos, R.; Rusak, E.; Bringmann, O.; Ecker, A.S.; Bethge, M.; Brendel, W. Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming. arXiv 2020, arXiv:1907.07484. [Google Scholar] [CrossRef]
- Zheng, Z.; Cheng, Y.; Xin, Z.; Yu, Z.; Zheng, B. Robust Perception Under Adverse Conditions for Autonomous Driving Based on Data Augmentation. IEEE Transactions on Intelligent Transportation Systems 2023, 24, 13916–13929. [Google Scholar] [CrossRef]
- Maróti, M.; Kusy, B.; Simon, G.; Lédeczi, A. The flooding time synchronization protocol. In SenSys '04: Proceedings of the 2nd international conference on Embedded networked sensor systems; 2004; pp. 33–49. [Google Scholar]
- Abdul-Rashid, R.; Al-Shaikhi, A.; Masoud, A. Accurate, Energy-Efficient, Decentralized, Single-Hop, Asynchronous Time Synchronization Protocols for Wireless Sensor Networks. arXiv 2018, arXiv:eess.SP/1811.01152. [Google Scholar]
- Kreković, D.; Krivić, P.; Žarko, I.P.; Kušek, M.; Le-Phuoc, D. Reducing Communication Overhead in the IoT-Edge-Cloud Continuum: A Survey on Protocols and Data Reduction Strategies. arXiv 2024, arXiv:cs.NI/2404.19492. [Google Scholar] [CrossRef]
- Kyriakakis, E.; Tange, K.; Reusch, N.; Zaballa, E.O.; Fafoutis, X.; Schoeberl, M.; Dragoni, N. Fault-tolerant Clock Synchronization using Precise Time Protocol Multi-Domain Aggregation. In Proceedings of the 2021 IEEE 24th International Symposium on Real-Time Distributed Computing (ISORC); 2021; pp. 114–122. [Google Scholar]
- Gore, R.N.; Lisova, E.; Åkerberg, J.; Björkman, M. CoSiNeT: A Lightweight Clock Synchronization Algorithm for Industrial IoT. In Proceedings of the 2021 4th IEEE International Conference on Industrial Cyber-Physical Systems (ICPS); 2021; pp. 92–97. [Google Scholar]
- Yao, Z.; Jia, L.; Wang, Y.; Wang, Z.; Zhou, Z.; Liao, B.; Mumtaz, S.; Wang, X. Time Synchronization-Aware Edge-End Collaborative Network Routing Management for FL-Assisted Distributed Energy Scheduling. In Proceedings of the ICC 2023 - IEEE International Conference on Communications; 2023; pp. 1780–1785. [Google Scholar]
- Jiang, C.; Fan, T.; Gao, H.; Shi, W.; Liu, L.; Cérin, C.; Wan, J. Energy aware edge computing: A survey. Computer Communications 2020, 151, 556–580. [Google Scholar] [CrossRef]
- Vasile, C.E.; Ulmămei, A.A.; Bîră, C. Image Processing Hardware Acceleration—A Review of Operations Involved and Current Hardware Approaches. Journal of Imaging 2024, 10. [Google Scholar] [CrossRef]
- Piccolboni, L.; Mantovani, P.; Guglielmo, G.D.; Carloni, L.P. COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators. ACM Trans. Embed. Comput. Syst. 2017, 16. [Google Scholar] [CrossRef]
- Rehman, M.; Petrillo, A.; Forcina, A.; Felice, F.D. Metaverse Simulator for Emotional Understanding. Procedia Computer Science 2024, 232, 3216–3228. [Google Scholar] [CrossRef]
- Adnan Arefeen, M.; Tabassum Nimi, S.; Yusuf Sarwar Uddin, M. FrameHopper: Selective Processing of Video Frames in Detection-driven Real-Time Video Analytics. In Proceedings of the 2022 18th International Conference on Distributed Computing in Sensor Systems (DCOSS); 2022; pp. 125–132. [Google Scholar]
- Zhi, Y.; Tong, Z.; Wang, L.; Wu, G. MGSampler: An Explainable Sampling Strategy for Video Action Recognition. arXiv 2021, arXiv:2104.09952. [Google Scholar] [CrossRef]
- Li, J.; Xu, R.; Liu, X.; Ma, J.; Li, B.; Zou, Q.; Ma, J.; Yu, H. Domain Adaptation based Object Detection for Autonomous Driving in Foggy and Rainy Weather. IEEE Transactions on Intelligent Vehicles 2024, 1–12. [Google Scholar]
- Wang, Y.; Lu, Y.; Qiu, Y. Gated image-adaptive network for driving-scene object detection under nighttime conditions. Multimedia Syst. 2024, 31. [Google Scholar] [CrossRef]
- Madasu, A.; Vijjini, A.R. Sequential Domain Adaptation through Elastic Weight Consolidation for Sentiment Analysis. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR); 2021; pp. 4879–4886. [Google Scholar]
- Aslam, S.; Rasool, A.; Wu, H.; Li, X. CEL: A Continual Learning Model for Disease Outbreak Prediction by Leveraging Domain Adaptation via Elastic Weight Consolidation. arXiv 2024, arXiv:cs.LG/2401.08940. [Google Scholar] [CrossRef]
- Zhou, Z.; Yeung, G.; Schapiro, A.C. Self-recovery of memory via generative replay. arXiv 2023, arXiv:2301.06030. [Google Scholar] [CrossRef]
- Lv, J.; Li, X.; Sun, Y.; Zheng, Y.; Bao, J. A bio-inspired LIDA cognitive-based Digital Twin architecture for unmanned maintenance of machine tools. Robotics and Computer–Integrated Manufacturing 2023, 102489. [Google Scholar] [CrossRef]
- Tong, H.; Chen, M.; Zhao, J.; Hu, Y.; Yang, Z.; Liu, Y.; Yin, C. Continual Reinforcement Learning for Digital Twin Synchronization Optimization. arXiv 2025, arXiv:cs.NI/2501.08045. [Google Scholar] [CrossRef]
- García-Valls, M.; Chirivella-Ciruelos, A.M. CoTwin: Collaborative improvement of digital twins enabled by blockchain. Future Generation Computer Systems 2024, 157, 408–421. [Google Scholar] [CrossRef]
- Scarrica, V.M.; Panariello, C.; Ferone, A.; Staiano, A. A Hybrid Approach To Real-Time Multi-Object Tracking. arXiv 2023, arXiv:cs.CV/2308.01248. [Google Scholar]
- Ussa, A.; Rajen, C.S.; Pulluri, T.; Singla, D.; Acharya, J.; Chuanrong, G.F.; Basu, A.; Ramesh, B. A Hybrid Neuromorphic Object Tracking and Classification Framework for Real-Time Systems. IEEE Transactions on Neural Networks and Learning Systems 2024, 35, 10726–10735. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:cs.CL/1706.03762. [Google Scholar]
- Al-Qawlaq, A.; M, A.K.; John, D. KWT-Tiny: RISC-V Accelerated, Embedded Keyword Spotting Transformer. arXiv 2024, arXiv:cs.AR/2407.16026. [Google Scholar]
- Wang, W.; Sun, W.; Liu, Y. Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems; 2024; p. 1. [Google Scholar]
- Piardi, L.; Leitão, P.; Queiroz, J.; Pontes, J. Role of digital technologies to enhance the human integration in industrial cyber–physical systems. Annual Reviews in Control 2024, 57, 100934. [Google Scholar] [CrossRef]
- Wang, B.; Zheng, P.; Yin, Y.; Shih, A.; Wang, L. Toward human-centric smart manufacturing: A human-cyber-physical systems (HCPS) perspective. Journal of Manufacturing Systems 2022, 63, 471–490. [Google Scholar] [CrossRef]
- Xia, L.; Lu, J.; Lu, Y.; Hao, Z.; Fan, Y.; Zhang, Z. Augmented reality and indoor positioning based mobile production monitoring system to support workers with human-in-the-loop. Robotics and Computer-Integrated Manufacturing 2024, 86, 102664. [Google Scholar] [CrossRef]
- Xiao, W.; Li, A.; Cassandras, C.G.; Belta, C. Toward model-free safety-critical control with humans in the loop. Annual Reviews in Control 2024, 57, 100944. [Google Scholar] [CrossRef]
- Saeed, M.M.; Alsharidah, M. Security, privacy, and robustness for trustworthy AI systems: A review. Computers and Electrical Engineering 2024, 119, 109643. [Google Scholar] [CrossRef]
- Tahir, G.A. Ethical Challenges in Computer Vision: Ensuring Privacy and Mitigating Bias in Publicly Available Datasets. arXiv 2024, arXiv:cs.CV/2409.10533. [Google Scholar]
- Khargonekar, P.P.; Sampath, M. A Framework for Ethics in Cyber-Physical-Human Systems. IFAC-PapersOnLine 2020, 53, 17008–17015. [Google Scholar] [CrossRef]
- Zahidi, S.; Ratcheva, V.; Hingel, G.; Brown, S. The Future of Jobs Report 2020. World Economic Forum 2020. [Google Scholar]
- Broo, D.G. Transdisciplinarity and three mindsets for sustainability in the age of cyber-physical systems. Journal of Industrial Information Integration 2022, 27, 100290. [Google Scholar] [CrossRef]
- Gerhold, M.; Kouzel, A.; Mangal, H.; Mehmed, S.; Zaytsev, V. Modelling of Cyber-Physical Systems through Domain-Specific Languages: Decision, Analysis, Design. In MODELS Companion ’24: Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems; 2024; pp. 1170–1179. [Google Scholar]
- Ghanem, M.; Dawoud, F.; Gamal, H.; Soliman, E.; El-Batt, T.; Sharara, H. FLoBC: A decentralized blockchain-based federated learning framework. In Proceedings of the 2022 Fourth International Conference on Blockchain Computing and Applications (BCCA). IEEE; 2022; pp. 85–92. [Google Scholar]
- Ghanem, M.; Mouloudi, A.; Mourchid, M. Towards a scientific research based on semantic web. Procedia Computer Science 2015, 73, 328–335. [Google Scholar] [CrossRef]
- Zhao, Y.; Zhu, Z.; Chen, B.; Qiu, S.; Huang, J.; Lu, X.; Yang, W.; Ai, C.; Huang, K.; He, C.; et al. Toward parallel intelligence: An interdisciplinary solution for complex systems. The Innovation 2023, 4, 100521. [Google Scholar] [CrossRef] [PubMed]
- Tomelleri, F.; Sbaragli, A.; Piacariello, F.; Pilati, F. Safe Assembly in Industry 5.0: Digital Architecture for the Ergonomic Assembly Worksheet. Procedia CIRP 2024, 127, 68–73. [Google Scholar] [CrossRef]
- Ghanem, M.C.; Ratnayake, D.N. Enhancing WPA2-PSK four-way handshaking after re-authentication to deal with de-authentication followed by brute-force attack a novel re-authentication protocol. In Proceedings of the 2016 International Conference On Cyber Situational Awareness, Data Analytics And Assessment (CyberSA). IEEE; 2016; pp. 1–7. [Google Scholar]
- Loumachi, F.Y.; Ghanem, M.C.; Ferrag, M.A. Advancing Cyber Incident Timeline Analysis Through Retrieval-Augmented Generation and Large Language Models. Computers 2025, 14, 67. [Google Scholar] [CrossRef]
- Yedilkhan, D.; Kyzyrkanov, A.E.; Kutpanova, Z.A.; Aljawarneh, S.; Atanov, S.K. Intelligent obstacle avoidance algorithm for safe urban monitoring with autonomous mobile drones. Journal of Electronic Science and Technology 2024, 22, 100277. [Google Scholar] [CrossRef]
- Ghanem, M.C. Towards an efficient automation of network penetration testing using model-based reinforcement learning. Ph.D. thesis, University of London, 2022.
- Ghanem, M.C.; Salloum, S. Integrating AI-Driven Deep Learning for Energy-Efficient Smart Buildings in Internet of Thing-based Industry 4.0. Babylonian Journal of Internet of Things 2025, 2025, 121–130. [Google Scholar] [CrossRef] [PubMed]
- Meuser, T.; Lovén, L.; Bhuyan, M.; Patil, S.G.; Dustdar, S.; Aral, A.; Bayhan, S.; Becker, C.; Lara, E.d.; Ding, A.Y.; et al. Revisiting Edge AI: Opportunities and Challenges. IEEE Internet Computing 2024, 28, 49–59. [Google Scholar] [CrossRef]
- Farzaan, M.A.; Ghanem, M.C.; El-Hajjar, A.; Ratnayake, D.N. AI-powered system for an efficient and effective cyber incidents detection and response in cloud environments. IEEE Transactions on Machine Learning in Communications and Networking 2025. [Google Scholar] [CrossRef]
- Basnet, A.S.; Ghanem, M.C.; Dunsin, D.; Kheddar, H.; Sowinski-Mydlarz, W. Advanced Persistent Threats (APT) Attribution Using Deep Reinforcement Learning. Digital Threats: Research and Practice. [CrossRef]
- Rocha, A.; Monteiro, M.; Mattos, C.; Dias, M.; Soares, J.; Magalhães, R.; Macedo, J. Edge AI for Internet of Medical Things: A literature review. Computers and Electrical Engineering 2024, 116, 109202. [Google Scholar] [CrossRef]
- Alshinwan, M.; Memon, A.G.; Ghanem, M.C.; Almaayah, M. Unsupervised text feature selection approach based on improved Prairie dog algorithm for the text clustering. Jordanian Journal of Informatics and Computing 2025, 2025, 27–36. [Google Scholar] [CrossRef]
- Feit, F.; Metzger, A.; Pohl, K. Explaining Online Reinforcement Learning Decisions of Self-Adaptive Systems. arXiv 2022. [Google Scholar] [CrossRef]
- Zhou, X.; Yang, J.; Li, Y.; Li, S.; Su, Z. Deep reinforcement learning-based resource scheduling for energy optimization and load balancing in SDN-driven edge computing. Computer Communications 2024, 226-227, 107925. [Google Scholar] [CrossRef]
- Singh, R.; Bengani, V. Hybrid Learning Systems: Integrating Traditional Machine Learning With Deep Learning Techniques. ResearchGate 2024. [Google Scholar]
- Kadhim, Y.A.; Guzel, M.S.; Mishra, A. A Novel Hybrid Machine Learning-Based System Using Deep Learning Techniques and Meta-Heuristic Algorithms for Various Medical Datatypes Classification. Diagnostics 2024, 14. [Google Scholar] [CrossRef] [PubMed]
- Jia, J.; Liang, W.; Liang, Y. A Review of Hybrid and Ensemble in Deep Learning for Natural Language Processing. arXiv 2023. [Google Scholar] [CrossRef]
- Bommasani, R.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2022, arXiv:2108.07258. [Google Scholar]
- Awais, M.; Naseer, M.; Khan, S.; Anwer, R.M.; Cholakkal, H.; Shah, M.; Yang, M.H.; Khan, F.S. Foundational Models Defining a New Era in Vision: A Survey and Outlook. arXiv 2023, arXiv:2307.13721. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:cs.CV/2304.02643. [Google Scholar]
- Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast Segment Anything. arXiv 2023, arXiv:2306.12156. [Google Scholar] [CrossRef]
- Zhang, C.; Cho, J.; Puspitasari, F.D.; Zheng, S.; Li, C.; et al. A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering. arXiv 2024, arXiv:2306.06211. [Google Scholar]
- Traub, M.; Butz, M.V. Rethinking Vision Transformer for Object Centric Foundation Models. arXiv 2025, arXiv:cs.CV/2502.02763. [Google Scholar]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. arXiv 2021, arXiv:2104.14294. [Google Scholar] [CrossRef]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193. [Google Scholar] [CrossRef]
- Jiang, D.; Liu, Y.; Liu, S.; Zhao, J.; Zhang, H.; Gao, Z.; Zhang, X.; Li, J.; Xiong, H. From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models. arXiv 2024, arXiv:2310.08825. [Google Scholar] [CrossRef]
- Abobeah, R.; Shoukry, A.; Katto, J. Video Alignment Using Bi-Directional Attention Flow in a Multi-Stage Learning Model. IEEE Access 2020, 8, 18097–18109. [Google Scholar] [CrossRef]
- Gautier, T.; Le Guernic, P.; Besnard, L.; Talpin, J.P. The polychronous model of computation and Kahn process networks. Science of Computer Programming 2023, 228, 102958. [Google Scholar] [CrossRef]
- Becker, P.H.E.; Arnau, J.M.; González, A. Demystifying Power and Performance Bottlenecks in Autonomous Driving Systems. In Proceedings of the 2020 IEEE International Symposium on Workload Characterization (IISWC); 2020; pp. 205–215. [Google Scholar]
- Schuman, C.D.; Kulkarni, S.R.; Parsa, M.; Mitchell, J.P.; Date, P.; Kay, B. Opportunities for neuromorphic computing algorithms and applications. Nature Computational Science 2022, 2, 10–19. [Google Scholar] [CrossRef]
- Deng, L.; Roy, K.; Tang, H. Understanding and Bridging the Gap between Neuromorphic Computing and Machine Learning. Frontiers Media SA. 2021. [Google Scholar]
- Zheng, X.; Liu, Y.; Lu, Y.; Hua, T.; Pan, T.; Zhang, W.; Tao, D.; Wang, L. Deep Learning for Event-based Vision: A Comprehensive Survey and Benchmarks. arXiv 2024, arXiv:2302.08890. [Google Scholar] [CrossRef]
- Benyahya, M.; Collen, A.; Nijdam, N.A. Analyses on standards and regulations for connected and automated vehicles: Identifying the certifications roadmap. Transportation Engineering 2023, 14, 100205. [Google Scholar] [CrossRef]
- Wolford, B. What is GDPR, the EU’s new data protection law? Available online: https://gdpr.eu/what-is-gdpr/ (accessed on 9 January 2025).
- Wang, X.; Zhu, W. Advances in neural architecture search. National Science Review 2023, 11. [Google Scholar] [CrossRef] [PubMed]
- Avval, S.S.P.; Eskue, N.D.; Groves, R.M.; Yaghoubi, V. Systematic review on neural architecture search. Artificial Intelligence Review 2025, 58. [Google Scholar] [CrossRef]
- Chandrala, J. Transfer Learning: Leveraging Pre-trained Models for New Tasks. International Journal of Research and Analytical Reviews 2017, 4, 809–815. [Google Scholar]
- Liu, S. Unified Transfer Learning Models in High-Dimensional Linear Regression. arXiv 2024. [Google Scholar] [CrossRef]
- He, Z.; Sun, Y.; Liu, J.; Li, R. TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression. arXiv 2024. [Google Scholar]
- He, Z.; Sun, Y.; Liu, J.; Li, R. AdaTrans: Feature-wise and Sample-wise Adaptive Transfer Learning for High-dimensional Regression. arXiv 2024. [Google Scholar]
- Lee, J.H.; Kvinge, H.J.; Howland, S.; New, Z.; Buckheit, J.; Phillips, L.A.; Skomski, E.; Hibler, J.; Corley, C.D.; Hodas, N.O. Adaptive Transfer Learning: A simple but effective transfer learning, 2021.
- Guo, W.; Zhuang, F.; Zhang, X.; Tong, Y.; Dong, J. A comprehensive survey of federated transfer learning: challenges, methods and applications. Frontiers of Computer Science 2024, 18. [Google Scholar] [CrossRef]
| Challenges | Reference | Techniques/Methods |
|---|---|---|
| Real-time Data Fusion | [6] | A hybrid framework integrating an FCNx and an EKF |
| | [7] | Known templates method using predefined patterns; an information-theoretic approach |
| | [8] | LFF YOLO network |
| | [9] | A digital twin architecture integrating the different modules |
| | [10] | Multi-sensor fusion algorithms |
| Latency in Decision Making | [8] | LFF YOLO network |
| | [9] | A publish-subscribe pattern architecture |
| | [11] | Parallel algorithm for multitarget tracking |
| | [12] | Coordinate transformation or homography to map 2D face coordinates onto the 3D space |
| Distributed Processing | [6] | A hybrid framework integrating an FCNx and an EKF |
| | [?] | A distributed motion control system for reconfigurable manufacturing systems |
| Resource Constraints | [13] | Compressed MobileNet V3 architecture |
| | [14] | Key optimisation techniques, including distributed optimisation algorithms and gradient compression |
| Model Efficiency | [15] | Pruning (removes unimportant connections); quantisation; Huffman coding; weight sharing |
| | [10] | Dynamic power management; efficient algorithms; hardware optimisation |
| Real-time Optimisation | [16] | A hyperheuristic multi-objective evolutionary search method |
| | [17] | CNNs for mobile computer vision systems |
| | [14] | Federated learning and neuromorphic computing |
| Communication Bandwidth | [14] | Communication-efficient algorithms, including Ring All-Reduce and decentralised training methods |
| | [10] | Open communication protocols |
| Dynamic Environments | [18] | A physical-virtual interactive parallel light fields collection method |
| Transfer Learning and Domain Adaptation | [19] | Neural style transfer and GANs |
| | [20] | ResADM: a transfer-learning-based attack detection method |
| Online Learning and Incremental Updates | [21] | Model-Agnostic Meta-Learning and Conditional Neural Processes |
| Handling Uncertainty and Noise | [22] | Ensemble model |
| Reference | CNN Techniques / Models | Key Contributions | Main Tasks / Application Domains | Major Limitations |
|---|---|---|---|---|
| [23] | U-Net architecture, diffusion models, optical flow estimation, image-to-image models, and frame interpolation | Manages imperfections in flow estimation effectively through a decoupled edit-propagate design | Local edits and short-video creation | Dependence on the first frame; struggles with highly complex or rapid motions |
| [24] | ResNet, Dense Networks (DenseNet), Generative Adversarial Networks (GANs) and Multi-scale Networks | Advancing techniques | Video object detection for medical imaging, surveillance, and autonomous driving | Artificial degradation may not apply to real-world situations. |
| [6] | Fully convolution neural network (FCNx) for classification tasks and ResNet for feature extraction | Hybrid multi-sensor fusion uses encoder-decoder FCNx with extended Kalman Filter for environmental perception. | Environmental perception for autonomous driving | Significant computational resources |
| [25] | CNNs various model compression techniques | Comprehensive review | Mobile devices, edge computing, IoT and embedded systems | A trade-off between computation and performance |
| [17] | CNNs on TensorFlow and TensorRT platforms via parallelism | Comprehensive latency analysis, novel measurement techniques, optimisation strategies, and latency-throughput trade-offs | Reducing latency in cloud gaming, optimising AR and VR delay applications, and strategies for object detection and recognition models | Sensor dependency and significant computational resources |
| [26] | Sparse Polynomial Regression and Energy-Precision Ratio (EPR) | Predictive Framework: NeuralPower | Mobile Devices, Data Centres, and embedded systems | Specific GPU platforms and may not generalize well to all hardware configurations |
| [27] | Mask R-CNN: extends Faster R-CNN, Region of Interest (RoI) Align, FCNs for mask prediction | Instance segmentation, accuracy improvements in pose estimation | Object detection and segmentation, human pose estimation, AR applications | Significant computational resources; performance varies with specific applications and datasets |
| [28] | Single Shot MultiBox Detector (SSD), use of default boxes and multiscale feature maps in detecting objects | Unified framework: SSD for real-time detection | Real-time object detection for autonomous driving, embedded systems, and AR applications | Significant computational resources; poor performance on very small objects |
| [29] | DEtection TRansformer (DETR): combines a common CNN backbone with a transformer architecture | End-to-end object detection and bipartite matching loss | Object detection in various applications (autonomous driving, surveillance, and robotics) and panoptic segmentation | Significant computational resources; poor performance on very small objects |
| [30] | Pre-trained CNN model for image extraction and Truncated Gradient Confidence-Weighted (TGCW) Model for online classification | Improved accuracy and efficiency by noise handling | Image classification in medical imaging and personal credit evaluation | Significant computational resources and noise sensitivity |
| [9] | Preprocessing step using OpenCV, YOLOv5 for real-time object detection | Integrated framework for precise position estimation with error levels below 1 degree, and 3D rendering of vehicles and their surroundings in digital twin visualisation | Accurate position estimation | Inaccuracies in varying lighting or occlusion scenarios |
| [31] | 3D Coordinate Mapping and Hybrid Reality Integration | Development of a Hybrid Reality-Based Driving Testing Environment | Autonomous Driving Development and extends to Internet of Vehicles | Reducing stability in higher frequency and incomplete real-world testing |
| [18] | Parallel light field platform, a data-driven approach for self-occlusion and inconsistency in viewpoints, COLMAP for offline reconstruction | Improvements in PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) metrics | Applications requiring accurate 3D modelling and relighting, such as virtual reality, game development, and visual effects | Variations in colour temperature affecting 3D reconstruction and low-quality reconstruction models |
| [32] | Adaptive LFV coding and future integration with decentralised deep learning | Balancing computation and communication latency to optimise performance | Enabling realistic digital twins, VR, AR, and IoT-driven applications | Offline processing; poor performance under dynamic lighting conditions and occlusions |
| [33] | Integration of YOLOv7 for human pose estimation and the DeepFace pre-trained model for age, gender, and race estimation | While YOLOv7 performed well, the DeepFace model fell short in accuracy | Estimating human height from a single full-body image | Inaccurate performance of the DeepFace model; accepts only a single image as input |
| [34] | EmoFusioNet, a deep fusion-based model | EmoFusioNet uses stacked and late fusion methods to ensure a colour-neutral ER system, achieving high accuracy | A real-time facial-emotion-based security system | Underperformance for very dark-skinned individuals due to poor resolution of CMOS cameras |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).