Deep Learning (DL) is a subset of machine learning that employs multiple layers of neural networks to model intricate patterns and relationships within a dataset [47,48]. Popular DL techniques include the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). According to [48], the CNN specializes in image classification problems, while the RNN is a powerful DL tool specialized for sequential data such as speech recognition and natural language processing [49]. Today, the CNN has become a cornerstone of image classification owing to properties such as spatial hierarchy and locality, parameter sharing, local connectivity of feature maps, and the translation invariance provided by pooling [50].
Figure 1 presents the architecture of a CNN, with components such as input layers, convolutional layers, fully connected layers, output layers, pooling layers, feature maps, and activation functions [51,52].
Figure 1 presents the block diagram of a CNN, with an input layer responsible for the input image dimensions in terms of height, width, and color channel. The image pixels are scanned by a filter that performs a convolution over the image's receptive fields to extract feature maps, which are then downsampled using pooling techniques such as maximum pooling or average pooling [53]. The CNN in this case has five convolutional layers, each formulated from the convolutional scan of the previous layer. At the end of the convolutional process, the feature vectors are flattened and fed to the fully connected layer (FCL), where batch normalization and optimization algorithms are used to train the network. The softmax activation function transforms the features into probability scores in the output layer. While the CNN has recorded great success on image classification problems, it has struggled to replicate the same success on small datasets due to issues such as over-fitting, slow convergence, poor generalization, lack of shift invariance, and bias [54,55], hence the need for pre-trained models. These are heterogeneous CNN architectures trained on a large corpus of data to form the building block when developing new models, thereby addressing the aforementioned challenges of the traditional CNN [56]. Popular CNN-based pre-trained models are AlexNet, ResNet, MobileNet, and DenseNet [57], while many other pre-trained models include GoogleNet [58], Inception-V3 [59], VGG16 [60], VGG19 [61], Inception-ResNet [62], DarkNet [63], Xception [64], ShuffleNet [65], and SqueezeNet [66].
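For illustration, the following is a minimal PyTorch sketch of the pipeline just described: five convolution-pooling stages, a flattening step, a fully connected layer, and a softmax output. The layer widths, 224 × 224 input size, and 1000-class head are illustrative assumptions, not values taken from Figure 1.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN mirroring the pipeline above: five convolution/pooling
    stages -> flatten -> fully connected layer -> softmax probabilities."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        layers, in_ch = [], 3                       # RGB input
        for out_ch in (32, 64, 128, 256, 512):      # five convolutional layers
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # convolutional scan
                nn.BatchNorm2d(out_ch),                              # batch normalization
                nn.ReLU(inplace=True),                               # activation function
                nn.MaxPool2d(2),                                     # maximum pooling
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.fc = nn.Linear(512 * 7 * 7, num_classes)   # fully connected layer

    def forward(self, x):
        x = torch.flatten(self.features(x), 1)          # flatten feature vectors
        return torch.softmax(self.fc(x), dim=1)         # probability scores

probs = SimpleCNN()(torch.randn(1, 3, 224, 224))        # one 224x224 RGB image
print(probs.shape)                                      # -> torch.Size([1, 1000])
```

In practice, training would typically apply a cross-entropy loss to the raw logits and leave the softmax implicit; it is shown explicitly here to match the description of the output layer above.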
2.1. Popular CNN Models and Their Architectures
This section discusses popular CNN-based pre-trained models, namely AlexNet, ResNet, MobileNet, and DenseNet, identifying the major components that constitute them and discussing their functions.
- 1. MobileNet
MobileNet is another popular pre-trained model, specially designed for image classification on devices with limited computational resources, such as mobile applications [53,67]. It is a lightweight model with multiple layers and the ability to adapt when fine-tuned as a transfer learning algorithm.
Figure 2 presents the architecture of MobileNet.
Figure 2 illustrates MobileNet and its layers. The input layer sets the image size to 224 × 224 × 3 (height, width, and color channel); convolution is then applied over the image strides to extract feature maps, which pass through the depth-wise layer and the point-wise layer, the latter doubling the number of channels ahead of global average pooling for better output and full connections [68]. The depth-wise layer uses a single filter per input channel to minimize complexity, while the point-wise layer uses its convolutional process to fuse feature maps across channels, with global average pooling applied to extract the features up to the final layer, which has a reduced spatial size and an increased number of channels [13]. Overall, the 224 × 224 × 3 input image is transformed by the MobileNet components to 7 × 7 × 1024. This increased channel depth plays a significant role in MobileNet's classification efficiency, reduces parameters, and ensures model compactness during deployment [69].
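To make the depth-wise/point-wise distinction concrete, the following is a minimal PyTorch sketch of a depthwise separable convolution block; the channel counts and stride are illustrative assumptions, not values from the MobileNet paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One MobileNet-style block: a depth-wise conv (one filter per input
    channel) followed by a 1x1 point-wise conv that fuses channels."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch)   # groups=in_ch -> per-channel filter
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # cross-channel fusion
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# A stride-2 block that doubles the channels, as described above.
block = DepthwiseSeparableConv(in_ch=32, out_ch=64, stride=2)
out = block(torch.randn(1, 32, 112, 112))   # -> shape (1, 64, 56, 56)
```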
- 2. AlexNet
AlexNet is one of the popular object detection algorithms, developed with multiple layers inspired by the CNN architecture as a pre-trained model for object detection and recognition. Since its introduction by Krizhevsky et al. in 2012, trained on the ImageNet dataset with 1000 classes, it has gained increased attention, particularly for image classification problems [70]. The AlexNet in Figure 3 is made up of the input layer, five convolutional layers, three fully connected layers, and the output layer [71].
The input layer of the AlexNet in Figure 3 is the first layer, through which data is imported into the network. The imported data is dimensioned in this layer; the features are then scanned using a convolutional filter, and maximum pooling is applied to extract the identified feature maps and form each convolutional layer (CONV). The process continues until the final convolutional layer, where the features are flattened and fed to the fully connected layers (FC). The first FC layer consists of 4096 neurons, which combine the flattened high-level feature maps into a deeper representation of the image; this representation is refined further in the second FC layer into a more complex representation, before the final FC layer, which has 1000 neurons corresponding to the 1000 ImageNet training classes, each representing an output class triggered by the softmax activation function [72].
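For illustration, the stock AlexNet distributed with torchvision exposes exactly this stack of five convolutional layers and three fully connected layers; the sketch below loads the ImageNet-pre-trained weights and runs one forward pass (the random tensor stands in for a real image).

```python
import torch
from torchvision import models

# Load AlexNet pre-trained on ImageNet (1000 classes).
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()
print(alexnet.classifier)   # Linear(9216, 4096) -> ... -> Linear(4096, 1000)

# Classify one 224x224 RGB image; convert logits to softmax probabilities.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    probs = torch.softmax(alexnet(x), dim=1)
print(probs.argmax(dim=1))  # index of the predicted ImageNet class
```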
- 3. DenseNet
DenseNet is another deep learning model designed for image classification problems. It is a densely connected convolutional neural network that serves different purposes in computer vision applications [73]. This dense connectivity ensures that feature maps can be reused, thereby mitigating the vanishing-gradient problem, improving the propagation of features, and reducing the number of training parameters [74].
Figure 4 presents a block diagram of DenseNet.
Figure 4 shows the components of DenseNet, which comprise dense blocks, a bottleneck layer, and a transition layer. Each dense block is made of Batch Normalization-Rectified Linear Unit-2D Convolution (BN-ReLU-Conv2D) units and concatenation (Concat) operations [73]. The BN-ReLU-Conv2D unit helps stabilize and accelerate feature extraction, while Concat merges the information extracted along different pathways within the densely connected network, structured in this case as five layers [73,74]. The bottleneck layer manages the number of channels using a compression factor to control computational efficiency, while the transition layer applies pooling to reduce the spatial dimensions of the image [75].
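For illustration, a minimal PyTorch sketch of a five-layer dense block with BN-ReLU-Conv2D units and concatenation follows; the growth rate and channel counts are illustrative assumptions, not values from the DenseNet paper.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One BN-ReLU-Conv2D unit; its output is concatenated with its input,
    so every later layer sees the feature maps of all earlier layers."""
    def __init__(self, in_ch: int, growth_rate: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.conv = nn.Conv2d(in_ch, growth_rate, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv(torch.relu(self.bn(x)))
        return torch.cat([x, out], dim=1)   # Concat: reuse earlier feature maps

class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth_rate: int = 12, num_layers: int = 5):
        super().__init__()
        self.layers = nn.Sequential(*[
            DenseLayer(in_ch + i * growth_rate, growth_rate)
            for i in range(num_layers)
        ])

    def forward(self, x):
        return self.layers(x)

block = DenseBlock(in_ch=24)              # five-layer dense block, as in Figure 4
out = block(torch.randn(1, 24, 32, 32))   # -> (1, 24 + 5*12, 32, 32) = (1, 84, 32, 32)
```

The channel count grows by the growth rate at every layer, which is exactly the feature-reuse property credited above with easing gradient flow.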
- 4. Residual Network (ResNet)
Residual Network (ResNet), introduced in 2016 by He et al., is another deep learning model designed primarily to address the difficulty of training very deep neural networks by using residual blocks [76]. Over the years, ResNet has been applied to the real-time classification of objects in various applications [77].
Figure 5 presents the architecture of ResNet with its major components: the input layer, residual blocks, the fully connected layer, pooling, the softmax function, and the output.
Figure 5 shows the architecture of ResNet. It contains the input layer, which takes the input data; an initial convolution and pooling stage, used to extract features from the input; a series of residual stages, each consisting of several residual blocks; global average pooling, used to reduce the spatial dimensions to 1 × 1; and a fully connected layer used for the final classification [76]. The convolutional layers (conv0, conv1), batch normalization, and ReLU collectively extract features from the input images, which are summed with the skip connection at the output of each residual block before pooling and transformation into the fully connected layer (FC), where the network is trained and the softmax function predicts the classification results [77].
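For illustration, the following is a minimal PyTorch sketch of a basic residual block, showing the skip connection summed at the block output; the channel count is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two conv-BN stages whose output is summed
    with the block input (the skip connection) before the final ReLU."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn0 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn0(self.conv0(x)))
        out = self.bn1(self.conv1(out))
        return self.relu(out + x)   # sum at the output of the residual block

block = ResidualBlock(channels=64)
y = block(torch.randn(1, 64, 56, 56))   # same shape out as in: (1, 64, 56, 56)
```

Because the block learns only the residual on top of its input, gradients can flow through the identity path, which is what makes very deep stacks of such blocks trainable.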
2.2. Literature on Image Classification Problem Using Pre-Trained Models
This section reviews relevant literature that applied pre-trained models such as AlexNet, ResNet, MobileNet, and DenseNet to image classification problems. The literature is reported systematically, and at the end of every subsection a summary of the review, findings, and research gaps is presented.
- 1. Relevant Literature on Image Classification with AlexNet
The study in [78] applied a fine-tuned AlexNet with a CNN model trained on the CIFAR-10 version-2019 dataset for object detection and recorded 98% classification accuracy. The authors of [79] trained a CNN with four brain tumor datasets (Figshare, Brain MRI Kaggle, Medical MRI, and BraTS 2019) to generate a model used to fine-tune AlexNet as a 3D model for the classification of brain tumors. The average performance was 99% accuracy, 99% mean Average Precision (mAP), 87% sensitivity, and a detection time of 1300 ms. While these models recorded very high accuracy for object detection and classification, [25] argued that the delay in classification makes AlexNet less than ideal for real-time object detection problems. Another study [70] applied AlexNet to the classification of eye behavior using the Zhejiang University (ZJU) and Closed Eyes in the Wild (CEW) datasets. The average results for both datasets, considering accuracy, sensitivity, specificity, and precision, were 95%, which is good; however, [71] argued that traditional training algorithms such as gradient descent are prone to over-fitting when training AlexNet, especially on small datasets. This was addressed in [72] using an improved Adam optimization technique achieved with corrosion expansion and Gaussian filtering. In addition, the quality of data collection was improved using weight sharing, local connection, and learned image representations; applied to AlexNet for the classification of oil spillage, the tested model reported 99.76% recall and 98.91% accuracy. Since most of these models lack practical validation despite their success, [80] applied Field Programmable Gate Array (FPGA) hardware to validate the AlexNet model for real-time object detection. However, [19,20,21] argued that for real-time classification tasks the speed of object detection is vital, and AlexNet's detection times of 0.78 s [81] and 1.24 s [82] suggest that, while it performs well on classification problems, it is not the most suitable model for real-time object detection. In the context of a navigation guidance system for the blind, the speed of object detection is critical for informing immediate decisions to avoid accidents; any delay impairs the system's responsiveness and the user's ability to react to obstacles along the navigation path, potentially leading to collisions and undermining user confidence and safety. It is therefore necessary to adopt a model that not only detects and recognizes objects with high accuracy but also does so in real time.
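For readers unfamiliar with the fine-tuning step these studies build on, the following is a hypothetical minimal sketch of adapting a pre-trained AlexNet to a new task by replacing its 1000-class output layer; the 10-class head, learning rate, and dummy batch are placeholders, not the settings used in the cited works.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights and replace the 1000-class output layer.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 10)   # e.g., 10 target classes (placeholder)

# Freeze the convolutional feature extractor; train only the classifier head.
for p in model.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```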
Table 1 summarizes the systematic literature review on AlexNet for object detection problems.
- 2. Relevant Literature on Object Detection with MobileNet
Over the years, MobileNet has been applied to solve object detection problems [53]. A Single-Shot Detector (SSD) and the OpenCV library were applied in [15] to optimize the object detection capabilities of MobileNet, producing a MobileNet-SSD. The SSD operated as a multi-scale detector over feature maps from different layers of the MobileNet and applied bounding boxes to predict the positions of objects [67]. The MobileNet-SSD was trained on MS Common Objects in Context (COCO). The results reported an accuracy of 0.92, a recall of 0.81, and a mAP of 0.85, which is good but leaves room for improvement on these metrics. Similarly, [68] applied SSD-MobileNet-2 for object detection and obstacle avoidance in an Autonomous Driving Assistance System (ADAS); testing on five different objects reported an average accuracy of 97.8%. Meanwhile, [69], who also applied the SSD for multi-scale feature-map detection and bounding-box prediction in MobileNet for real-time object detection, reported an average accuracy of 89.53%, which is acceptable but again leaves room for improvement. From the literature review, it was observed that, while MobileNet is carefully designed for lightweight application systems with limited computing resources, its overall detection accuracy leaves room for improvement when compared with other deep learning models such as YOLO [83]. Moreover, [84], after comparing MobileNet with YOLOv5 on three different platforms (Nvidia Tesla, 1160, and Jetson Nano), showed that MobileNet has a lower object detection speed than YOLOv5. For a guide navigation application, this means that, with improvements still needed in speed and accuracy despite its success, MobileNet may not be the most suitable model for real-time classification problems. Speed is a vital factor that informs downstream processes such as the audio feedback mechanism; when the speed and accuracy of classification are suboptimal, the system's reliability for navigation by the blind is affected.
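As a concrete reference point, torchvision ships a pre-trained SSD-style detector with a MobileNet backbone; the sketch below runs it on a dummy frame. The specific model variant and the 0.5 confidence threshold are illustrative choices, not those of the studies above.

```python
import torch
from torchvision.models import detection

# SSDLite detector with a MobileNetV3 backbone, pre-trained on COCO.
model = detection.ssdlite320_mobilenet_v3_large(
    weights=detection.SSDLite320_MobileNet_V3_Large_Weights.COCO_V1)
model.eval()

frame = torch.rand(3, 320, 320)          # one RGB frame with values in [0, 1]
with torch.no_grad():
    out = model([frame])[0]              # dict of boxes, labels, scores

keep = out["scores"] > 0.5               # illustrative confidence threshold
print(out["boxes"][keep], out["labels"][keep])
```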
Table 2 presents a summary of the literature reviewed on MobileNet for object detection and classification.
- 3. Relevant Literature on Object Detection with DenseNet
The study in [85] applied two DenseNet models for object detection using the Pascal VOC2007 dataset. The first is an F-RCNN [22,23]-based DenseNet, where the dense network extracts multi-scale features from the images and the F-RCNN acts as the predictor of object regions using bounding boxes. An SSD [53] was also applied as a multi-scale object detector and bounding-box predictor for the second DenseNet. When comparatively analyzed, the two models reported a mAP of 5.63 for the F-RCNN-based DenseNet and 3.86 for the SSD-based DenseNet. When tested on video data, an improved mAP of 0.9854 was reported in [86] for an SSD-based DenseNet applied to object classification; precision reached 98.5% and recall 97.0%, suggesting that the SSD-DenseNet performs better on video than on a standard image dataset. Another study applied a Region Proposal Network (RPN) to identify potential regions of interest for processing by the CNN for prediction [73], after the DenseNet had extracted the feature maps from the input image. The RPN-based DenseNet was trained for roadside object detection using the PASCAL VOC and MS COCO datasets, reporting a mAP of 80.30% on PASCAL VOC and 55.0% on MS COCO. Meanwhile, [74] used Deep Pyramidal Residual Networks (DPRN) with DenseNet for image recognition; the DPRN was applied to facilitate the training of the DenseNet through parallel convolutional feature extraction. Trained on the CIFAR10 and CIFAR100 datasets, the model reported accuracies of 83.98% and 51.19%, respectively. Finally, [75] proposed Multi-Scale DenseNets (MS-DenseNet) for aircraft detection in remote sensing images; a Feature Pyramid Network (FPN) was applied to extract multi-scale features, with a focus on small objects. The MS-DenseNet reported a recall of 94%, an F1-score of 92.7%, a training time of 0.168 s, and a detection time of 0.094 s. Overall, these studies investigated the performance of DenseNet for object detection across diverse datasets. From the review, it was observed that, while high success was recorded for metrics such as accuracy, recall, and mAP, the delay during object detection [87,88] may affect DenseNet's reliability when deployed as the computer vision model [76] in a guidance assistance system for blind navigation. In addition, the mAP values, despite their success, need to be improved: precision is a crucial indicator of system reliability, and poor precision can compromise the detection output, degrade the accuracy of the audio feedback mechanism, and lead to poor user decisions, object collisions, and accidents. In summary, while DenseNet is successful for object detection in images, it may not be the best choice for real-time classification problems. The summary of the literature review on DenseNet is presented in Table 3.
- 4. Relevant Literature on Object Detection with ResNet
Today, ResNet is applied in many areas for object detection and image classification. For instance, a Multi-scale ResNet (M-ResNet) was proposed for underwater object detection in [77]; the multi-scale segment [75] was applied to enable the detection of small underwater objects, and the trained model reported a mAP of 96.5%. In [89], the Detection Transformer (DETR) algorithm was applied to improve ResNet for end-to-end object detection, with the aim of improving object feature detection for an effective image representation task. Tested on diverse objects, the model reported an average precision of 0.82 and a mean average recall of 0.63, which is good but was improved in [90] using a hybrid approach combining YOLO and ResNet. The ResNet was utilized as the backbone of the YOLO network, responsible for extracting and pooling the diverse feature maps of the image, which are concentrated in the neck of the YOLO network for object detection. The results reported a mAP of 95.1% and an F1-score of 98%. The training speed of ResNet classification was considered in [91] using the Kaggle indoor scenes dataset, reporting 142.2 minutes with an accuracy of 74%. From the review of ResNet for object detection applications, it was observed that, while standalone ResNet recorded good accuracy, mAP, and recall, combining it with YOLO or a multi-scale feature detection mechanism has the potential to improve real-time object detection. The summary of the literature review on ResNet for object detection is reported in Table 4.
2.5. Literature Review on the Application of YOLO for Real-Time Object Detection
Over the years, many studies have applied YOLO to the classification of objects in real time. For instance, in [95], Super-Resolution Reconstruction (SRR) was applied to optimize the performance of YOLOv5. The SRR focused on image enhancement for small, dense faces in real time, while YOLOv5 solved the real-time classification problem. The model, trained on the WIDER FACE dataset, reported a precision of 88.2% when tested, a 2.9% improvement over the standard YOLOv5 algorithm. In another work [96], an improved performance (accuracy of 90.06%, mAP of 90.06%) was recorded when Repetitive Convolution (RepConv), Transformer Encoder (T-E), and Bidirectional Feature Pyramid Network (BiFPN) modules were integrated into the architecture of YOLOv5 and trained with the UCAS-AOD and HRSC2016 datasets. Meanwhile, [97] applied a BiFPN for multi-scale feature-map fusion in the neck of YOLOv5; a Convolutional Block Attention Module (CBAM) was also applied, and a new non-maximum suppression technique using Distance Intersection over Union (DIoU) was integrated into the model to address feature enhancement and bounding-box overlaps. The BDD100K dataset was used to train the model, and a precision of 72.0%, recall of 42.1%, and mAP of 49.4% were reported. Overall, the literature in this section is dominated by YOLOv5, with various techniques employed to optimize classification performance; from the studies, it was observed that [96], who applied the RepConv, T-E, and BiFPN modules to optimize YOLOv5, recorded the best success rate for real-time object detection.
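To ground the bounding-box overlap problem that these non-maximum suppression variants address, the following is a minimal sketch of IoU computation and greedy NMS (plain IoU rather than DIoU, for brevity):

```python
import torch

def iou(box, boxes):
    """IoU between one box and many; boxes are (x1, y1, x2, y2)."""
    x1 = torch.maximum(box[0], boxes[:, 0]); y1 = torch.maximum(box[1], boxes[:, 1])
    x2 = torch.minimum(box[2], boxes[:, 2]); y2 = torch.minimum(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # -> [0, 2]: the second box overlaps the first
```

DIoU-based NMS, as used in [97], additionally penalizes the distance between box centres, which helps keep overlapping detections of genuinely distinct objects.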
A voice assistance system for real-time object detection using YOLOv4 was presented in [98]. The YOLO algorithm was trained on the COCO dataset and deployed in an Android application. In addition, an algorithm was developed to identify objects of interest based on distance and relay the output to the user via sound. The model's mAP, tested with a car and a motorcycle as the objects of interest, was 0.74 and 0.55, respectively. The study in [99] compared the Faster Region-based Convolutional Network (F-RCNN), YOLOv4, a YOLOv4 hybrid, YOLOv3, and SSD on the COCO dataset, considering frames captured per second (fps), speed, and mAP. The results reported 110.56 fps, 0.986 mAP, and 16.01 ms for the YOLOv4 hybrid, against 60.2 fps, 0.976 mAP, and a 16.47 ms detection speed for YOLOv4; 61.3 fps, 0.974 mAP, and 19.76 ms for YOLOv3; 5.8 fps, 0.883 mAP, and 178.6 ms for SSD; and 3.7 fps, 0.925 mAP, and 275 ms for F-RCNN. The authors of [100] applied YOLOv3 to an indoor object recognition problem; the CNN in YOLOv3 was used to determine the bounding-box class probabilities of objects, OpenCV was applied for data acquisition, and classification performance, measured by accuracy and mAP, reached 99% and 100%, respectively. However, while this study [100] recorded significant success, [101] argued that over-fitting and the vanishing-gradient problem [16], arising from the small indoor dataset utilized, may affect the model's reliability. To address this challenge, [101] applied DenseNet for feature connection and spatially separable convolution in place of the standard convolution, with the aim of reducing parameters and optimizing YOLOv3. Comparative analysis on ship images reported that the DenseNet-based YOLOv3 recorded a mAP of 0.96 against 0.93 for the standard YOLOv3. In another study [102], F-RCNN was applied to optimize the bounding-box prediction in YOLOv3 as a hybrid model. The model was trained on the PASCAL VOC, VGG-M, and CaffeNet datasets and compared with the standard F-RCNN on mAP; the hybrid YOLO model reported 75.7, 64.2, and 62.2 on the respective datasets, against 66.9, 59.2, and 57.1 for the F-RCNN. In the same vein, [103] used recall and fps to compare the two models and recorded 0.451 fps and 0.83 mAP for YOLOv3, against the 0.891 and 0.893 yielded by Faster R-CNN. While these studies have all made significant contributions to real-time object detection, it was observed that YOLOv3, when improved with F-RCNN [22,24] or DenseNet [16], has the potential to optimize classification performance; however, it was also argued that while YOLOv3 records high classification accuracy and mAP, YOLOv4 is better in terms of speed and mAP. This suggests that higher versions of the YOLO series supersede lower versions for real-time classification problems. The summary of the literature review on YOLO for object detection is reported in Table 5.
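For orientation, a typical single-frame YOLOv5 inference call looks like the sketch below, which loads a small pre-trained model from the public ultralytics/yolov5 repository via PyTorch Hub; the image path and confidence threshold are placeholders.

```python
import torch

# Load a small pre-trained YOLOv5 model from the public ultralytics repo.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.5                       # confidence threshold for detections

results = model("path/to/frame.jpg")   # placeholder image path
df = results.pandas().xyxy[0]          # one row per detection: box, score, class
print(df[["xmin", "ymin", "xmax", "ymax", "confidence", "name"]])
```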
2.7. Review of Techniques to Address Occlusion and Objects in a Changing Environment
The Region of Interest (ROI) is a crucial component of various real-time image classification systems, such as wearable devices for the visually impaired [106], autonomous vehicles [107], surveillance systems [108,109], industrial automation [110], and human-computer interaction [111]. It facilitates the primary goal of the computer vision task by identifying and localizing a particular object, which informs speedy decision-making. While deep learning algorithms have dominated basic object detection models and form the foundation for Object of Interest (OOI) detection, other approaches have been proposed to make the classification model's output more object-specific. For instance, [30] applied Class Activation Mapping (CAM) and an Adaptive Offloading Algorithm (AOA) to extract an edge-assisted lightweight region of interest for vehicle perception. The CAM was used for ROI mapping, while the AOA was applied to speed up inference by adjusting the down-sampling rate of the boxes in vehicle-to-edge communication. The benchmark YOLOv5 model was also integrated with ResNet in the backbone to facilitate feature extraction, while the ROI and AOA search for the object of interest. The results, evaluated on YouTube video generated from in-vehicle cameras, reported a 16% improvement over standard YOLOv5 and a 49% reduction in transmission demand.
While this study recorded a great improvement in ROI extraction, [33] argued that occlusion, especially when multiple objects of the same kind are present, remained a major challenge. To address this problem, YOLO was used as the object detector, a Kalman filter was applied to estimate the object state, and spatio-temporal feature information between the object and its environment was used for multiple-object tracking [34]. The results, tested on a multi-object tracking dataset, reported an accuracy of 74.5% with the detection threshold set to 0.5 and a tracking accuracy of 73% with the threshold set to 0.2. However, this method cannot be applied to all systems, such as visually impaired navigation, where only one object of interest needs to be detected at a time. In another study, [41] applied grid virtual division, which operates on pixel grayscale values, for high-speed extraction of the OOI in an optical camera communication system. The grid division strategy divides the received image into blocks and randomly samples several pixels in different blocks to locate the OOI, characterized by high grayscale values in the original image. Testing reported a transmission frequency of 5 kHz; however, the study did not consider occlusion when multiple similar objects appear in the same scene. An object occlusion detection algorithm was proposed in [114] using camera information calibration based on a Gaussian mixture model [115], a depth estimator using a projective matrix in 3D space, and an occlusion region detector using the estimated object depth. Testing showcased the ability to detect occlusion by background, occlusion by another object, and occlusion without depth estimation; however, if the object is slanted, the accuracy of the depth information may be affected. The authors of [116] applied a Hard Example Mining (HEM) and Augmented Policy Optimization (APO) approach. The APO applied nine augmentation policies [117], such as CutOut, MixUp, CutMix, blur, contrast, hue, grayscale, and brightness [118], which positively influence learning performance. The HEM extracted the hard positive data generated during model training, using the false-positive detection rate to detect occluded objects. Testing demonstrated the model's ability to detect occlusion, with a mAP of 90.49% for the YOLO model. In [119], an adaptive Spatio-Temporal Context (STC)-based algorithm for online tracking was proposed by combining a Context-Aware Formulation (CAF), a Kalman filter, and an Adaptive Model Learning Rate (AMLR). The CAF provided a context-aware filter for tracking the object, while the Kalman filter was applied for object state estimation. The AMLR computes the mean of the image frames as the target motion changes and is used to update the target model; however, despite its OOI success, the system is not interactive, which limits its application diversity. Overall, while these studies have made significant contributions to OOI detection, [114] is the preferred approach due to its consideration of diverse occlusion problems and their impact on the OOI. Therefore, this algorithm will be applied as the OOI technique to optimize the proposed YOLO algorithm for real-time object detection, facilitating free navigation by visually impaired persons.
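Since several of the tracking pipelines above rest on Kalman-filter state estimation, the following is a minimal sketch of a constant-velocity Kalman filter tracking a detected object's 2D position; all matrices and noise levels are illustrative assumptions, not values from the cited systems.

```python
import numpy as np

# State: [x, y, vx, vy]; we observe only the position (e.g., a box centre).
dt = 1.0                                    # illustrative frame interval
F = np.array([[1, 0, dt, 0],                # constant-velocity transition
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],                 # we measure x and y only
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                        # process noise (assumed)
R = 1.0 * np.eye(2)                         # measurement noise (assumed)

x = np.zeros(4)                             # initial state estimate
P = np.eye(4)                               # initial covariance

def kalman_step(x, P, z):
    """One predict-update cycle given a new detection z = [x, y]."""
    x, P = F @ x, F @ P @ F.T + Q                  # predict
    S = H @ P @ H.T + R                            # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ (z - H @ x)                        # update with measurement
    P = (np.eye(4) - K @ H) @ P
    return x, P

for z in [np.array([1.0, 1.0]), np.array([2.1, 1.9]), np.array([3.0, 3.1])]:
    x, P = kalman_step(x, P, z)
print(x[:2], x[2:])   # estimated position and velocity
```

The predicted state carries the track through frames in which the detector misses the object, which is precisely how these pipelines bridge short occlusions.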
2.8. Open Research Gaps Not Addressed in Object of Interest Detection Literature Reviewed
The literature reviewed addressed occlusion and the detection of objects in challenging environments, such as areas with low light or objects whose behavior changes over time. Techniques such as the Spatio-Temporal Context (STC)-based algorithm, Context-Aware Formulation (CAF), Kalman filter, Adaptive Model Learning Rate (AMLR), Gaussian mixture model, Class Activation Mapping (CAM), and Adaptive Offloading Algorithm (AOA) have all made significant contributions to detecting an object of interest. However, in applications such as wearable devices for visually impaired navigation, vehicle collision detection systems, and surveillance systems, where one particular object of interest is required at a time to inform other decisions, no solution has yet been presented. For instance, in visually impaired navigation or surveillance systems, the occlusion detection methods identified in the literature reviewed are crafted for multiple objects or for specific objects of interest; when multiple objects of the same kind appear in one scene, no solution has been presented for such a problem. In addition, multiple objects of the same type in a constantly changing environment remain an unsolved problem, which applies, for instance, to accident detection and control systems. There is therefore a need for an occlusion-aware object-of-interest detection technique capable of solving these problems.