SmartFire Vision: An Attention-Pruned Hybrid Vision Transformer and Detection Transformer Framework for Accurate, Efficient, and Real-Time Fire and Smoke Detection in Smart City Video Surveillance

Muhammad Azhar; Muhammad Arman; Asma Iqbal; Adeen Amjad; Deshinta Arrova Dewi

doi:10.20944/preprints202606.1379.v1

Submitted: 17 June 2026

Posted: 18 June 2026

Abstract

There is a worldwide danger of fire incidents that can lead to significant destruction of lives and property, especially in urban and smart cities. Existing fire and smoke detection systems are often inadequate for detecting the location of a fire, assessing the speed of its spread, and providing real-time alerts that can be acted upon quickly. This study proposes a method termed SmartFire Vision, which uses a hybrid deep-learning framework consisting of an Efficient Vision Transformer (E-ViT) and a Detection Transformer (DETR) for real-time fire and smoke detection from video sequences. A major contribution of this study is the integration of a new Removing Inefficient Attention Heads (RIAH) pruning strategy to reduce the computational overhead and maintain a global context in the ViT encoder. The E-ViT and DETR feature representations were fused and passed to a fully connected classification head enhanced with a probabilistic thresholding function and an integrated alarm system. The proposed model was trained and evaluated using the FURG fire benchmark dataset, which comprises 28,022 annotated frames. The proposed model achieved an overall accuracy of 91.37%, precision of 88.55%, recall of 85.27%, and F1-score of 86.64%, surpassing the current state-of-the-art methods. The SmartFire Vision framework provides a highly capable and computationally efficient means of fire detection and is particularly beneficial for CCTV-based smart city surveillance, with a high likelihood of deployment in edge devices in the future.

Keywords: fire detection; smoke detection; vision transformer; detection transformer; attention head pruning; smart city; disaster risk reduction

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Fire incidents are now viewed as one of the fastest-moving and most destructive threats to human life, infrastructure, and ecosystems worldwide. Taken together, climate change, rapid urbanization, and unprecedented population growth have greatly increased the number and intensity of fire incidents, both indoors and outdoors. Therefore, the early detection of fire and smoke is not only a technical goal but is also essential for protecting public safety. Traditional fire detection systems, such as thermal detectors and photoelectric smoke detectors, depend on physical proximity to the ignition source and do not provide any spatial information on the location, rate of spread, or size of a burning fire. In addition, human factors such as reduced attention, impaired visibility, and malfunctioning equipment impair responsiveness to fire incidents. These challenges result in delayed responses to fires, which increases property damage and loss of life.

Global statistical evidence emphasizes the significance of this issue. The National Fire Agency reported 38,659 fire incidents in Korea in 2020. Approximately 64.5% of fire incidents occur in buildings and human-made structures, resulting in 82.2% of fire-related fatalities, 82.6% of injuries, and 88% of total losses due to fire [1]. In the USA, 2020 statistics indicated that 74% of fire-related fatalities and 76% of fire-related injuries were due to fire incidents [2]. Historical documentation in London states that 78% of fire fatalities in that area occurred in residential and commercial buildings from 1996 to 2000 [3]. Additionally, 39% of fire fatalities in China from 2007 to 2010 were attributed to residential fires [4]. These statistics show the urgent need for scalable technological solutions for fire detection systems in built environments.

The extensive CCTV camera infrastructure deployed in cities provides a foundation for computer vision-based automated detection and suppression systems that can operate continuously over large areas and provide localized spatial alerts without human operators [5]. Earlier computer vision methods relied on hand-crafted features from color, texture, contour, and motion analysis, combined with traditional classifiers (SVM, random forests, naive Bayes, etc.) to classify fires and their causes [6,7]. Although these approaches established initial benchmarks, they struggled to deliver consistent results owing to the varied appearance of fires (morphology, color, shape, and intensity), lighting conditions (absolute, relative, direction, and intensity), and environmental contexts (type of building, structure type, and climate) in which the fire occurred. In the last few years, advances in deep learning and in automated feature extraction through deep convolutional neural networks (CNNs) and, more recently, attention-based transformer architectures have enabled substantial improvements in the automated identification of fires and their causes.

Video-based methods for detecting fires vary considerably and fall into two broad groups. The first group uses conventional machine learning techniques. These methods have several limitations: (1) they typically require a proprietary dataset that is unique to certain organizations; (2) they do not provide consistent assessments of their overall accuracy; and (3) they cannot readily assess the fire's location, size, or rate of spread until after the initial ignition, which can delay or prevent a timely response and therefore result in additional damage [8]. The second group contains approaches based on recent advances in deep learning. These approaches detect fires more reliably overall; however, their high computational cost and relatively long inference time make real-time operational deployment on security cameras difficult [9,10].

The rise of transformer-based architectures has changed how global context is modeled in computer vision, capturing it to a greater degree than convolution-only techniques. By treating an image as a sequence of fixed-size patches (for example, 16 × 16 pixels), the Vision Transformer (ViT) captures long-range spatial dependencies through multi-head self-attention. However, standard ViTs waste computational resources because some attention heads produce overlapping or trivial representations and thus contribute very little to the final result. This is particularly costly in real-time surveillance, where low latency is crucial. Attention head pruning removes low-impact heads based on gradient-derived importance scores, reducing the cost of inference while maintaining or improving the quality of the representation, which makes it a natural optimization technique for resource-constrained smart city deployments.

Object localization, like feature extraction, is an essential foundation for the reliable detection of fires. The DEtection TRansformer (DETR) predicts objects directly, without anchor boxes, using a global set-based loss that replaces non-maximum suppression. The SmartFire Vision (S-V) method combines patch-level global information from E-ViT with bounding-box-level spatial information from DETR to produce a more comprehensive, multi-scale view of fire and smoke in the same video frame. This combination helps separate ambiguous early signs of fire from true ignition events so that they can be correctly located, and helps reduce false alarms.

In response to the aforementioned limitations, this study proposes a hybrid deep learning approach that combines an Efficient Vision Transformer (E-ViT) with a Detection Transformer (DETR) to detect fire and smoke in video streams. The primary contributions of this study are as follows. First, a novel feature extraction technique is introduced that augments E-ViT with the Removing Inefficient Attention Heads (RIAH) technique, which reduces computational overhead without loss of global context modeling capacity by selectively removing redundant heads from the multi-head self-attention layers of the transformer encoder.

Second, a DEtection TRansformer (DETR) is attached to the E-ViT encoder. Using DETR alongside E-ViT provides accurate object locations and context information, which helps identify the regions of a video frame that contain fire and smoke with bounding-box-level accuracy. Third, a gated fusion module combines (1) E-ViT’s global patch representations and (2) the object/spatial encodings from DETR, and feeds a fully connected classification head with a probabilistic threshold alarm system for quick response in video-based surveillance systems.

The remainder of this paper is organized as follows: Section 2 presents a thorough review of the existing research on fire and smoke detection. Section 3 develops the basis for the SmartFire Vision methodology in detail, including the E-ViT architecture, pruning strategy (RIAH), integration (DETR), fusion of features, and classification pipelines. Section 4 summarizes the experimental setup, evaluation metrics, quantitative results, baseline comparisons, ablation study, and computational efficiency analyses. Section 5 contains a detailed discussion of the results, limitations, and practical considerations regarding the deployment of the SmartFire Vision system. Section 6 summarizes the conclusions and presents ideas for future research on this topic.

2. Related Works

This section reviews recent developments in fire safety, smart cities, and forest fire detection using deep learning and machine learning. Over the last decade, forest fires have become a serious issue worldwide because of human population growth and climate change. Their consequences include climate change and the greenhouse effect. The number of forest fires caused by humans is disproportionate to their population. The key to managing such unexpected events is rapid and accurate identification [11]. Using the most recent version of the Detectron2 platform, an improved deep learning approach for detecting and categorizing forest fires has been developed. This model can detect even the smallest flames from great distances, both during the day and at night.

The Detectron2 method is useful because it can locate a target from a long distance. Previous research [12] developed a method for early fire detection and categorization to maximize the protection of forest life. However, research on early fire detection and alerting systems for real-time monitoring of fire disasters remains limited. To address this, the authors of [6] proposed a solution based on the integration of IoT and YOLOv5. Another work presents Fire-Net, a deep learning network trained on Landsat-8 images to identify burning biomass and active fires. To provide a clearer picture, it combines the optical (RGB) and thermal modalities of the images. Furthermore, the network uses residual convolution and separable convolution blocks to extract additional characteristics from sparse datasets.

Recent research [14] on computer vision-based rapid fire detection systems uses cutting-edge deep learning technology to address the shortcomings of traditional fire detectors and eliminate false alarms. Further research is needed to develop an indoor video fire detection model and test it in a real-world setting, as such a model is currently lacking. Using the CCTV surveillance system within a building, the researchers created a computer vision-based EFDM to identify fires as soon as they start.

This research [15] proposes the smart fire detection system (SFDS), an enhanced method for detecting fires in smart cities based on the YOLOv8 algorithm, which takes advantage of deep learning’s strengths to identify characteristics of fires as they happen. The SFDS method may be more reliable than conventional fire detection systems, produce fewer false alarms, and save money. It can also be extended to monitor other hazards in smart cities, such as gas leaks and flooding. The suggested smart city framework has four main layers: (i) Application layer, (ii) Fog layer, (iii) Cloud layer, and (iv) IoT layer. By combining Fog and Cloud computing with an Internet of Things layer, the proposed algorithm collects and processes data in real time, which improves response times and mitigates risks to both property and human life.

This paper [16] offers a transformer-based network that can detect fire in videos. Attention scores are generated by feeding the analyzed frame into an encoder-decoder architecture; the highest values reflect the parts of the input frame that are most crucial for generating a reliable fire detection result. The experimental results demonstrate that this model can rapidly and precisely segment the image plane to pinpoint the origin of a fire in a video frame. The proposed method has been trained and evaluated on computer vision tasks such as full-frame classification (fire/no fire in frames) and fire localization.

Various methods employing computer vision for fire detection have been developed over the years. In recent years, convolutional neural networks (CNNs) have emerged as the most promising approach. Most machine learning algorithms still have difficulty recognizing smoke and fire as indicators of fires. Learning is more challenging for smoke and fire because of their considerable intraclass variability: they take on a variety of colors, forms, and textures. In this study [17], the authors suggested a system that detects fires automatically by analyzing both spatial (visual) and temporal trends. The first part of this combined technique analyzes visual patterns to determine which are indicative of fire events, and the second part examines how these fires evolve over time (temporal processing).

A recent study [18] presented a transfer-learning-based architecture for fire detection, using sophisticated CNNs trained on real photos of fire outbreaks. Using an efficient attention mechanism, this method also incorporates GradCAM for precise fire visualization and localization. The evaluation showed that the attention mechanism improved the precision with which the fires were located. EfficientNetB0 proved to be the best model among those considered for fire detection.

This study [19] proposes a Wavelet-CNN technique that uses the 2D Haar transform to extract an image’s spectral features and then feeds those values into CNNs at different layer depths, with the aim of developing cutting-edge fire image recognition systems. The approach is evaluated using two well-known backbone networks, ResNet50 and MobileNet v2 (MV2). In tests conducted on both a standard fire dataset and a video dataset, the method was shown to increase fire detection accuracy while lowering false alarms, particularly for the compact MV2.

Recently [20], image fire detection has emerged as a crucial technique for preventing fire damage and saving lives through early warning systems. Algorithmic analysis of images is the foundation of image-based fire detection. In this research, four state-of-the-art CNN object detection models, Faster-RCNN, R-FCN, SSD, and YOLO v3, form the basis of a collection of image fire detection approaches. A comparison of the proposed and existing techniques shows that fire detection systems based on object detection CNNs achieve the highest accuracy.

The introduction of the Vision Transformer (ViT) by Dosovitskiy et al. [21] has led to a marked increase in the application of transformer-style networks to visual detection tasks, showing that pure attention-based models can perform as well as or better than CNNs on standard benchmarks. These advances have subsequently resulted in the adaptation of ViT to several dense prediction tasks, including but not limited to the detection, segmentation, and localization of anomalies. The introduction of DETR [22] removed the multistage pipeline typically associated with traditional detectors and replaced it with a single end-to-end transformer that directly predicts object sets with a bipartite matching loss. Transformer-based methods for fire detection have only recently emerged.

The model proposed in [16] applied an encoder-decoder transformer to video fire segmentation, allowing the spatial source of the fire to be ascertained with great accuracy using attention scores. To date, no previous studies have investigated the combination of ViT-level feature extraction with DETR-level object localization in a cohesive, fused manner for real-time smoke and fire classification. SmartFire Vision addresses this research gap with an architecturally unified solution in which attention pruning acts as both a computational optimization method and a feature quality filter shared by the two sub-networks.

3. Methodology

SmartFire Vision is a hybrid pipeline that detects fire and smoke in video sequences using a combination of an Efficient Vision Transformer (E-ViT) and a DEtection TRansformer (DETR). The pipeline takes raw video as input and processes each frame separately: it extracts features with E-ViT and DETR, fuses the two representations using a gating mechanism, and classifies each frame as containing fire or smoke via a fully connected neural network (NN) with probabilistic thresholding. When a frame is confirmed to contain fire or smoke, a real-time alert is triggered, so that SmartFire Vision can be integrated into smart city closed-circuit television (CCTV) surveillance systems. The overall SmartFire Vision pipeline is depicted in Figure 1 and formalized in Algorithm 1.

The data used for training and evaluating the system were obtained from a set of annotated videos of fire, smoke, and no-fire events. Every frame was annotated using bounding boxes to indicate the location of the fire or smoke on the screen. The first step in the pipeline is to extract patch-based feature representations of each video frame using E-ViT. Using the Removing Inefficient Attention Heads method on the E-ViT encoder allows us to remove redundant and/or low-importance attention heads from the multi-head self-attention layers, thereby enhancing the computational efficiency of the model while maintaining the quality of the representation process. Each frame of the video is split into fixed-size patches, each of which has its own high-dimensional vector embedding. E-ViT uses self-attention to build linking relationships between all patches, thus creating a single representation of the spatial context of the entire frame as a unified sequence.

The DETR model processes the same frames to identify objects and their spatial bounding boxes within every frame, producing localization-aware features that complement the global patch embeddings produced by E-ViT. The feature vectors generated by the two subnetworks are merged through a gated fusion module that adaptively weights each subnetwork's features based on the scene being processed. The fused representation is then passed to a fully connected classification head that classifies the frame as either Fire or Smoke using a thresholding mechanism. If a fire is detected, the integrated alarm system is activated and issues a real-time alert.

3.1. Dataset Collection and Preprocessing

All experiments with the model were conducted using the FURG Fire dataset gathered from Kaggle, which is a popular benchmark for video fire-detection methodologies. The dataset comprises 28,022 annotated frames (in 24 different video sequences) that have been labeled with a bounding box specifying fire and non-fire regions; therefore, it provides both classification and localization supervision. This dataset contains a representative amount of both smoke and fire events across a wide variety of environmental and lighting conditions in Brazil.

The dataset was divided into training and testing subsets based on an 80/20 split for experimental purposes, which is consistent with previous studies in this area. The training set included 18,590 frames (of which 14,266 were labeled as fire and 12,073 as smoke), and the testing set consisted of 2,931 frames (including 2,211 fire and 720 smoke frames). The dataset is licensed under a Creative Commons license 3.0, so there is open access available for academic research and reproducibility.

Before providing the dataset as input to the E-ViT encoder, the dataset underwent numerous preprocessing operations to create appropriate data quality and consistency. This included directly assessing all keyframe images for quality deficiencies/abnormalities, removing any duplicated keyframe images, correcting inconsistencies found in their annotations, and removing any images that were corrupted by noise or imaging artifacts. The quality of the labelled data used to train a fire and/or smoke detection model directly affects the accuracy of the final trained model.

After this preprocessing, all keyframe images were cropped and resized to a consistent 224 × 224-pixel resolution, as required by the ViT patch embedding mechanism. Adaptive cropping accommodated images with non-standard aspect ratios or different backgrounds while preserving the fire and smoke regions of interest across all spatial transformations. Flexible scaling accommodated the different camera resolutions found in the 24 original source videos, yielding a consistent input size across all sequences.

3.2. Efficient Vision Transformer (E-ViT)

Once the data are preprocessed and individual frames are retrieved from the video data, the frames are passed to the Efficient Vision Transformer (E-ViT) encoder for the extraction of features at the patch level. In a standard vision transformer architecture, each image frame is divided into a grid of non-overlapping patches, where each patch is 16 × 16 pixels in size. Therefore, a frame of size 224 × 224 yields a 14 × 14 grid of 196 patches. Each of these patches is projected linearly into a high-dimensional embedding vector, and a positional embedding is added to preserve the spatial order of the patches. The resulting sequence is then processed by a stack of 12 transformer blocks (layers), each of which contains a multi-head self-attention (MHSA) layer and a position-wise feed-forward (PFF) network. The MHSA layer computes pairwise attention weights between all patches (tokens) to capture long-range spatial dependencies, which are extremely difficult to capture using traditional convolutional approaches. The output of the MHSA layer is then passed through the PFF network, which applies a nonlinear transformation.
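The patch arithmetic and the embedding step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the frame is a synthetic single-channel image (a real RGB frame would have three channels per pixel), the embedding width `EMBED_DIM = 8` is a deliberately tiny hypothetical value, and `W_e` and `pos` are random stand-ins for learned parameters.

```python
import random

def split_into_patches(frame, patch=16):
    """Split an H x W frame (list of rows) into flattened non-overlapping patches."""
    h, w = len(frame), len(frame[0])
    assert h % patch == 0 and w % patch == 0, "frame must tile exactly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [frame[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches

def embed(patches, weights, pos):
    """Compute e_j = W_e . p_j + pos_j for every patch p_j."""
    tokens = []
    for j, p in enumerate(patches):
        projected = [sum(w * x for w, x in zip(row, p)) for row in weights]
        tokens.append([a + b for a, b in zip(projected, pos[j])])
    return tokens

rng = random.Random(0)
# Synthetic 224 x 224 single-channel frame with values in [0, 1).
frame = [[rng.random() for _ in range(224)] for _ in range(224)]
patches = split_into_patches(frame)

EMBED_DIM = 8  # hypothetical; chosen small to keep the sketch fast
W_e = [[rng.gauss(0, 0.02) for _ in range(16 * 16)] for _ in range(EMBED_DIM)]
pos = [[rng.gauss(0, 0.02) for _ in range(EMBED_DIM)] for _ in range(len(patches))]
tokens = embed(patches, W_e, pos)

print(len(patches), len(patches[0]))  # 196 256
print(len(tokens), len(tokens[0]))    # 196 8
```

The 196 tokens produced here correspond to the sequence that the transformer blocks would consume; the attention layers themselves are omitted.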

A key limitation of the standard visual transformer (ViT) is that only some of the attention heads in each transformer block provide useful contributions to the overall discriminative result of the model. Empirical evaluation indicates that many of the attention heads in large transformer networks are identical or have low information value; therefore, they contribute very little to the gradient of the training loss. Keeping these redundant attention heads in a network unnecessarily increases computational demands and memory utilization but provides no corresponding increase in accuracy. This problem is exacerbated by the tight constraints on the inference times required for real-time video surveillance applications.

To solve this problem, E-ViT uses a technique referred to as ‘Removing Inefficient Attention Heads’ (RIAH). The RIAH technique removes inefficient attention heads from each layer of the transformer based on an importance score derived from the gradient. Let $H_{i}$ be the $i$-th attention head of layer $l$, and let $A_{i} \in \mathbb{R}^{N \times N}$ be the attention matrix of head $i$ for $N$ patch tokens. Heads whose importance score falls below the pruning threshold are removed, which leads to removing approximately 20% of the attention heads in each layer and reduces the total cost from 16.8 GFLOPs to 13.7 GFLOPs, an 18.3% reduction in total FLOPs, while still allowing the model to capture global context relationships using the high-importance heads. The remaining heads retain the ability to model relationships between patches, so that E-ViT keeps the same representational capability as the complete ViT. The graph below illustrates the distribution of the importance scores of the attention heads. The theoretical basis for this pruning strategy is formalized in Theorem 1 as follows. The information flow of the proposed process is shown in Figure 1 and is outlined in Algorithm 1.

Theorem 1. Let $H = \{H_{1}, H_{2}, \dots, H_{K}\}$ be the set of $K$ attention heads in a ViT layer, and let $H^{'} \subseteq H$ denote the subset of heads retained after importance-score pruning. If the retained heads span the same functional subspace as $H$ with respect to the training distribution, then the pruned model $f_{H^{'}}$ achieves the same representational capacity as $f_{H}$ on in-distribution inputs.

Proof of Theorem 1. The output of a multi-head attention layer is a linear combination of the outputs of its constituent heads: $Z = \sum_{i = 1}^{K} W_{i} \cdot H_{i}(X)$. If the removed heads $H \setminus H^{'}$ produce outputs that lie within the span of the retained heads on the training distribution, a condition empirically verified by the negligible gradient magnitude of the removed heads, then the projection of $Z$ onto the training distribution is unchanged. It follows that $f_{H^{'}}(x) \approx f_{H}(x)$ for all $x \sim D$.
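The span condition in Theorem 1 can be illustrated with a toy numerical example. The sketch below makes strong simplifying assumptions that are not in the paper: each head output is a short vector, the projections $W_{i}$ are scalars rather than matrices, and the removed head is an exact linear combination of the two retained heads. Under those assumptions, the retained weights can absorb the contribution of the removed head, and the layer output is unchanged.

```python
def combine(weights, head_outputs):
    """Z = sum_i w_i * H_i, with scalar weights (a simplification of W_i)."""
    dim = len(head_outputs[0])
    return [sum(w * h[d] for w, h in zip(weights, head_outputs))
            for d in range(dim)]

# Toy head outputs for a single input (hypothetical values).
h1 = [1.0, 0.0, 2.0]
h2 = [0.0, 1.0, 1.0]

# The head to be removed lies in the span of h1 and h2: h3 = 2*h1 - 1*h2.
coeffs = (2.0, -1.0)
h3 = [coeffs[0] * a + coeffs[1] * b for a, b in zip(h1, h2)]

# Full layer output with all three heads.
w = [0.5, 1.0, 0.25]
z_full = combine(w, [h1, h2, h3])

# Pruned layer: drop h3 and fold its weight into the retained heads.
w_pruned = [w[0] + w[2] * coeffs[0], w[1] + w[2] * coeffs[1]]
z_pruned = combine(w_pruned, [h1, h2])

print(z_full)    # [1.0, 0.75, 2.75]
print(z_pruned)  # [1.0, 0.75, 2.75]
```

In practice the span condition would hold only approximately, which is consistent with the approximate equality in the theorem's conclusion.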

Figure 1. Overview of Proposed Method Working.

3.3. DEtection TRansformer (DETR) for Object Detection

The DEtection TRansformer (DETR) processes the video frames in parallel with E-ViT feature extraction. Each frame is processed independently by DETR to create an object-level spatial representation of that frame. Unlike conventional object detectors, DETR does not depend on anchor boxes, region proposal networks, or post-processing such as non-maximum suppression (NMS). Instead, it formulates object detection as set prediction, using a bipartite matching loss that pairs the predicted objects with the ground-truth objects in the frame.
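The idea of bipartite matching can be made concrete with a small sketch. This is an illustrative toy rather than the DETR implementation: the cost function `pair_cost` (a class mismatch penalty plus an L1 box distance) is a simplified stand-in for DETR's matching cost, the boxes and labels are made-up values, and the assignment is found by brute force over permutations, whereas DETR uses the Hungarian algorithm.

```python
from itertools import permutations

def pair_cost(pred, target):
    """Hypothetical matching cost between one prediction and one target slot."""
    if target is None:  # padded "no object" slot: matching it costs nothing here
        return 0.0
    class_cost = 0.0 if pred["label"] == target["label"] else 1.0
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_cost

def bipartite_match(preds, targets):
    """Return (assignment, cost); assignment[slot] is the matched prediction index."""
    padded = list(targets) + [None] * (len(preds) - len(targets))
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(preds))):  # brute force, small inputs only
        cost = sum(pair_cost(preds[p], padded[slot]) for slot, p in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Boxes are (cx, cy, w, h) in normalized coordinates; all values are made up.
preds = [
    {"label": "smoke", "box": (0.10, 0.10, 0.30, 0.30)},
    {"label": "fire",  "box": (0.52, 0.48, 0.20, 0.20)},
    {"label": "fire",  "box": (0.90, 0.90, 0.05, 0.05)},
]
targets = [
    {"label": "fire",  "box": (0.50, 0.50, 0.20, 0.20)},
    {"label": "smoke", "box": (0.12, 0.10, 0.30, 0.28)},
]

assignment, cost = bipartite_match(preds, targets)
print(assignment)  # (1, 0, 2)
```

Each ground-truth object is paired with exactly one prediction, and the leftover prediction falls into the padded "no object" slot, which is why no separate duplicate-suppression step is needed.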

The DETR framework consists of three primary components: a convolutional backbone, a transformer encoder-decoder, and a prediction head. First, given an input frame, the backbone produces a spatial feature map, which is then flattened and augmented with positional encodings before being passed to the transformer encoder. The transformer encoder processes this feature map, attending to locations across its entire extent and creating a single enriched representation of the scene by performing self-attention across its multiple layers. Next, the transformer decoder uses 100 learned object query embeddings for each input frame, together with the output of the encoder, to decode the spatial locations and class identities of all objects detected in the frame. Each object query independently generates a bounding box and a predicted class label, allowing the DETR framework to achieve multi-object detection without needing to perform inference on one object at a time, thereby eliminating the sequential processing overhead.

DETR provides a complementary source of information for detecting fire and smoke events. In contrast to E-ViT, which creates a global representation of the entire frame, DETR builds spatially localized encodings that indicate the specific position of a fire or smoke object in the frame. This ability to localize is especially important in the early stages of a fire: the ignition source may be small and spatially confined, and therefore contribute little to a global patch embedding, yet it stands out as an object-level anomaly when an individual DETR object query is decoded against the same features. Therefore, the output of the DETR decoder, which represents the spatial and semantic characteristics of a detected object, can be extracted to produce a DETR feature vector $d \in \mathbb{R}^{m}$, which is delivered to the gated fusion module described in Section 3.4.

Algorithm 1: Fire and Smoke Detection Using the Hybrid E-ViT-DETR Model

Input: Video dataset $V$ containing fire, smoke, and no-fire sequences with frame-level bounding box annotations
Output: Classified frame set $C$, performance metrics $P$, optimized model $M^{'}$, and alert signal $S$

Stage 1: Preprocessing

1.: Extract all frames from each video in $V$ to form the frame set: $F = \{f_{1}, f_{2}, \dots, f_{T}\}$
2.: Obtain ground truth labels for each frame: $L = \{l_{1}, l_{2}, \dots, l_{T}\}$ , where $l_{i} \in \{\text{Fire}, \text{Smoke}, \text{No-Fire}\}$
3.: Remove duplicate and noisy frames, resize all frames to $224 \times 224$ pixels, and normalize pixel values to $[0, 1]$

Stage 2: Feature Extraction

4.: Divide each frame $f_{i}$ into $M$ non-overlapping patches of size $16 \times 16$ pixels: $P = \{p_{1}, p_{2}, \dots, p_{M}\}$
5.: Map each patch $p_{j}$ to a high-dimensional embedding with positional encoding: $E = \{e_{1}, e_{2}, \dots, e_{M}\}$ , where $e_{j} = W_{e} \cdot p_{j} + \mathrm{pos}_{j}$
6.: Pass all patch embeddings through the E-ViT encoder with RIAH pruning to produce attended representations: $A = \{a_{1}, a_{2}, \dots, a_{M}\}$ , where $a_{j} = \mathrm{MHSA}_{\mathrm{pruned}} (e_{j})$
7.: Process each frame $f_{i}$ through DETR encoder-decoder to produce $K$ object-level spatial encodings: $O = \{o_{1}, o_{2}, \dots, o_{K}\}$ , where $o_{k} = \mathrm{DETR} (f_{i})$

Stage 3: Feature Fusion

8.: Concatenate E-ViT attended representations $A$ and DETR object encodings $O$ for each frame: $R = \{r_{1}, r_{2}, \dots, r_{T}\}$ , where $r_{i} = \mathrm{concat} (A, O)$
9.: Compute gating vector to adaptively weight each feature source: $g = \sigma (W_{g} \cdot r_{i} + b_{g})$
10.: Produce gated fused representation for each frame: $z_{fused} = g \odot A + (1 - g) \odot O$

Stage 4: Hybrid E-ViT-DETR Model Training

11.: Train the fully connected classification head using fused representations: $\hat{y} = σ (W_{1} \cdot z_{fused} + b)$
12.: Optimize model weights using binary cross-entropy loss with the Adam optimizer at learning rate $0.0001$ for $25$ epochs
13.: Apply classification threshold $θ = 0.55$ to assign frame labels: $C = \{c_{1}, c_{2}, \dots, c_{T}\}$ , where $c_{i} = \text{Fire}$ if $\hat{y}_{i} > θ$ , else $c_{i} = \text{Smoke}$

Stage 5: Alert System

14.: Initialize the integrated alert system $S$ connected to the IoT alarm relay
15.: For each frame $c_{i}$ classified as Fire, immediately activate alert: $S \leftarrow Activate (c_{i} = Fire)$

Stage 6: Evaluation and Optimization

16.: Test the trained model on unseen holdout data: $U = \{u_{1}, u_{2}, \dots, u_{N}\}$
17.: Compute performance metrics on $U$ : $P = \{\text{Accuracy}, \text{Precision}, \text{Recall}, \text{F1-Score}\}$
18.: Apply RIAH fine-tuning for 3 epochs at learning rate $1 \times 10^{-5}$ to produce optimized model: $M^{'}$
19.: Return $C$ , $P$ , $M^{'}$ , $S$
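
The patch extraction and embedding steps of Algorithm 1 (steps 4 and 5) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the embedding width `embed_dim`, the random weights `W_e`, and the random positional table `pos` are hypothetical stand-ins chosen only to make the shapes concrete.

```python
import numpy as np

def patchify(frame: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) frame into flattened, non-overlapping patches."""
    h, w, c = frame.shape
    assert h % patch == 0 and w % patch == 0, "frame must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (M, patch*patch*C)
    x = frame.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed(patches: np.ndarray, W_e: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Compute e_j = W_e p_j + pos_j for every patch j."""
    return patches @ W_e.T + pos

rng = np.random.default_rng(0)
frame = rng.random((224, 224, 3))      # stand-in for a normalized frame in [0, 1]
patches = patchify(frame)              # M = 14 * 14 = 196 patches of 16*16*3 values

embed_dim = 64                         # hypothetical width, kept small for illustration
W_e = rng.normal(scale=0.02, size=(embed_dim, patches.shape[1]))
pos = rng.normal(scale=0.02, size=(patches.shape[0], embed_dim))
E = embed(patches, W_e, pos)

print(patches.shape, E.shape)
```

For a 224 × 224 frame and 16 × 16 patches, the grid is 14 × 14, so the sketch produces 196 patch vectors, one embedding per patch.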

3.4. Attention Head Importance Scoring and RIAH Pruning

The RIAH pruning procedure is applied as a structured post-training optimization step over the pretrained E-ViT encoder. For each of the 12 transformer layers, the importance score $I (H_{i})$ defined in Equation (5) is computed for all $K = 12$ attention heads by performing a forward-backward pass over the full training set and recording the mean gradient magnitude of each head's attention weight matrix. Heads are ranked within each layer, and those falling below the 20th percentile threshold $τ$ are permanently masked from the model computation graph. The remaining heads are re-scaled by a normalization factor to preserve the expected output norm of the attention layer, preventing representational collapse following head removal. The pruned model is subsequently fine-tuned for three additional epochs at a reduced learning rate of $1 \times 10^{-5}$ to allow the retained heads to compensate for the removed capacity. The distribution of importance scores across all layers before pruning is presented in Figure 5, which demonstrates a clear separation between high-importance heads and the long left tail of redundant heads targeted for removal.
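
The per-layer thresholding described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the importance scores are random stand-ins for the mean gradient magnitudes, and the rescaling factor (total heads divided by retained heads) is one plausible form of the normalization, which the text does not specify.

```python
import numpy as np

def riah_head_mask(importance: np.ndarray, percentile: float = 20.0) -> np.ndarray:
    """Return a boolean keep-mask of shape (layers, heads).

    A head is dropped when its score is strictly below the per-layer
    percentile threshold tau.
    """
    tau = np.percentile(importance, percentile, axis=1, keepdims=True)
    return importance >= tau

def rescale_factors(mask: np.ndarray) -> np.ndarray:
    """Per-layer factor heads / kept (an assumed form of the normalization)."""
    heads = mask.shape[1]
    kept = mask.sum(axis=1)
    return heads / kept

rng = np.random.default_rng(0)
scores = rng.random((12, 12))   # hypothetical scores: 12 layers x 12 heads
mask = riah_head_mask(scores)
scale = rescale_factors(mask)

print(mask.sum(axis=1))         # retained heads per layer
print(scale)                    # rescaling factor per layer
```

With 12 distinct scores per layer and NumPy's default linear interpolation, the 20th percentile lies between the third and fourth lowest scores, so three heads per layer fall below the threshold and nine are retained.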

3.5. Gated Feature Fusion

Following the independent feature extraction pathways of E-ViT and DETR, the resulting representations are combined through a learned gated fusion mechanism that adaptively controls the relative contribution of each source based on the visual content of the frame. Let $v \in \mathbb{R}^{d}$ denote the global patch-level embedding vector produced by the E-ViT encoder, and let $d \in \mathbb{R}^{m}$ denote the object-level spatial encoding from the DETR decoder. The gating vector $g$ is computed as in Equation 1:

g = σ (W_{g} \cdot [v; d] + b_{g}) (1)

where $σ$ denotes the sigmoid activation function, $[v; d]$ denotes the concatenation of the two feature vectors, and $W_{g} \in \mathbb{R}^{d \times (d + m)}$ and $b_{g} \in \mathbb{R}^{d}$ are learnable weight and bias parameters. The fused representation is then computed as in Equation 2:

z_{fused} = g ⊙ v + (1 - g) ⊙ d (2)

Here, ⊙ denotes element-wise multiplication, so Equation 2 requires the E-ViT and DETR feature vectors to have the same dimensionality. This gating structure allows the model to vary the weight given to DETR's localization cues depending on whether the frame contains well-defined, localized fire or smoke objects. In other words, the model relies more on DETR's localized object representation when fire or smoke is clearly visible, and emphasizes E-ViT's global contextual representation when object boundaries are ill-defined and visibility is poor. As a result, the fused representation $z_{fused}$ is richer and more discriminative than either input alone: it combines the global, frame-level representation that E-ViT obtains through attention across patches with the local, object-level representation obtained from the outputs of the DETR decoder.
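
Equations 1 and 2 can be illustrated with a short NumPy sketch. It assumes that both feature vectors have the same length `dim` (required for the element-wise products) and uses random, untrained parameters; the names `dim` and `d_vec` are hypothetical and chosen to avoid the clash between the vector $d$ and the dimension $d$.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v, d_vec, W_g, b_g):
    """Return the gate g (Eq. 1) and the fused vector z_fused (Eq. 2)."""
    g = sigmoid(W_g @ np.concatenate([v, d_vec]) + b_g)
    z_fused = g * v + (1.0 - g) * d_vec
    return g, z_fused

rng = np.random.default_rng(0)
dim = 8                                  # hypothetical shared feature size
v = rng.normal(size=dim)                 # stand-in for the E-ViT embedding
d_vec = rng.normal(size=dim)             # stand-in for the DETR encoding
W_g = rng.normal(scale=0.1, size=(dim, 2 * dim))
b_g = np.zeros(dim)

g, z_fused = gated_fusion(v, d_vec, W_g, b_g)
print(g.round(3))
print(z_fused.round(3))
```

Because every gate entry lies strictly between 0 and 1, each component of `z_fused` is a convex combination of the corresponding components of the two inputs: a gate value near 1 favors the E-ViT feature and a value near 0 favors the DETR feature.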

3.6. Classification of Fire and Smoke

The fused feature vector $z_{fused}$ is passed to a multi-layer perceptron (MLP) that maps each input to a scalar estimate of the fire likelihood. The MLP has two fully connected layers with ReLU activations followed by a sigmoid output layer, so its output satisfies $\hat{y} \in [0,1]$. If $\hat{y} > θ$, where θ is a threshold tuned on the validation set and fixed at 0.55 throughout the experiments, the input is classified as fire; if $\hat{y} \leq θ$, the input is classified as smoke. The threshold of 0.55, rather than the neutral value of 0.5, was selected on the validation set to fix the operating point between precision and recall. This trade-off matters for safety-critical applications because a missed fire has greater operational consequences than a false positive.
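
The classification head and thresholding rule can be sketched as below. The layer widths and the random, untrained weights are hypothetical, since the text specifies only two ReLU layers, a sigmoid output, and $θ = 0.55$.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_head(z, params):
    """Two ReLU layers followed by a sigmoid output; returns y_hat in [0, 1]."""
    h1 = relu(params["W1"] @ z + params["b1"])
    h2 = relu(params["W2"] @ h1 + params["b2"])
    return float(sigmoid(params["w_out"] @ h2 + params["b_out"]))

def classify(y_hat: float, theta: float = 0.55) -> str:
    """Label a frame as Fire when y_hat exceeds theta, otherwise Smoke."""
    return "Fire" if y_hat > theta else "Smoke"

rng = np.random.default_rng(0)
in_dim, hid1, hid2 = 8, 16, 8            # hypothetical layer sizes
params = {
    "W1": rng.normal(scale=0.3, size=(hid1, in_dim)), "b1": np.zeros(hid1),
    "W2": rng.normal(scale=0.3, size=(hid2, hid1)),   "b2": np.zeros(hid2),
    "w_out": rng.normal(scale=0.3, size=hid2),        "b_out": 0.0,
}
z = rng.normal(size=in_dim)              # stand-in for z_fused
y_hat = mlp_head(z, params)
label = classify(y_hat)
raise_alert = label == "Fire"            # the alert follows the Fire label
print(round(y_hat, 3), label, raise_alert)
```

Note that a score exactly equal to the threshold is labeled Smoke, because the rule uses a strict inequality.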

4. Experiments and Results

This section presents a detailed experimental evaluation of SmartFire Vision. It covers the evaluation metrics, implementation details, quantitative results on the FURG dataset, per-class performance, training dynamics, baseline comparisons, ablation studies, and computational efficiency. All experiments were run in the same environment to ensure fair and reproducible comparisons.

4.1. Evaluation Metrics

Four standard classification metrics were used to evaluate the performance of SmartFire Vision in detecting fire and smoke: accuracy, precision, recall, and F1-score. Together, these metrics describe both the general classification accuracy and the class-specific discriminative ability of the model. Accuracy is the proportion of frames that are classified correctly, as given in Equation 3:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} (3)

Precision measures the proportion of frames predicted as fire that are genuinely fire frames, reflecting the model's ability to avoid false alarms as illustrated in Equation 4:

Precision = \frac{TP}{TP + FP} (4)

Recall measures the proportion of actual fire frames that the model successfully detects, reflecting sensitivity to true fire events as seen in Equation 5:

Recall = \frac{TP}{TP + FN} (5)

The F1-score provides the harmonic mean of precision and recall, offering a balanced single-value summary of detection performance under class imbalance as defined in Equation 6:

F1\text{-}Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} (6)

where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives respectively. In the context of fire detection, recall is prioritized over precision because a missed fire event carries substantially greater real-world consequences than a false alarm.
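
Equations 3 to 6 translate directly into code. The sketch below uses hypothetical counts chosen for easy arithmetic; they are not taken from the paper's confusion matrix.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1-score from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 100 fire frames (80 detected) and 100 non-fire frames
# (10 wrongly flagged as fire).
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts the accuracy is 0.85 and the recall is 0.80, while the F1-score (about 0.842) lies between the recall and the precision (about 0.889), as expected of a harmonic mean.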

4.2. Implementation Details

SmartFire Vision was implemented and trained in PyTorch on an NVIDIA GTX 1080 GPU with 8 GB of VRAM. The Adam optimizer was used for all weight updates with an initial learning rate of 0.0001, selected to ensure stable convergence without excessive oscillation in the loss landscape. The model was trained for 25 epochs with a batch size of 32. All input frames were resized to 224 × 224 pixels before being passed to the E-ViT encoder; with the standard ViT-Base patch size of 16 × 16 pixels, this yields 196 patch tokens per frame.

The E-ViT encoder was initialized from a pretrained ViT-Base checkpoint and then adapted to the FURG training data, after which the RIAH algorithm was applied for pruning. Pruning removed the attention heads whose importance scores fell below the 20th percentile within each layer. This was followed by three epochs of recovery fine-tuning at a reduced learning rate of $1 \times 10^{-5}$. The DETR module used a ResNet-50 backbone pretrained on ImageNet, with six encoder layers, six decoder layers, and 100 object queries. The classification threshold θ was tuned on the validation set, and the complete hyperparameter configuration is presented in Table 1.

4.3. Evaluation of Model Performance

The performance of SmartFire Vision was evaluated on the FURG test set comprising 2,931 frames using all four metrics described in Section 4.1. The overall quantitative results are presented in Table 2.

According to Table 2, SmartFire Vision achieves an overall accuracy of 91.37%, a precision of 88.55%, a recall of 85.27%, and an F1-score of 86.64%. Together, the four metrics confirm that the model separates fire from smoke frames well and produces accurate classifications across both classes. The precision of 88.55% indicates that most frames identified as fire contained fire, and the recall of 85.27% shows that most actual fire occurrences in the test set were detected.

The model showed similar performance across smoke and fire, as shown in Table 3, with only slight differences between the two classes. For smoke, the model achieved an accuracy of 91.20%, recall of 86.37%, precision of 88.15%, and F1-score of 86.99%. For fire, it achieved an accuracy of 91.54%, recall of 84.50%, precision of 89.10%, and F1-score of 86.70%. The higher precision for fire than for smoke is attributed to the thresholding mechanism: the 0.55 threshold labels fire conservatively, which reduces the false alarm rate at the cost of a small number of missed fires. The comparable performance on both classes indicates that the gated fusion process serves both detection tasks without favoring either class.

The training dynamics of SmartFire Vision are shown in Figure 2 and Figure 3. Training accuracy increases from 74% in epoch 1 to 95% by the end of epoch 20, after which it levels off, which shows that the model fits the training set well. The validation accuracy grows more slowly and peaks at approximately 88% at the end of epoch 20. The best validation checkpoint was taken at this point, and the validation accuracy changes only slightly in subsequent epochs.

Figure 3 shows that the model learned steadily: the training loss declined over each of the 25 epochs. The validation loss followed the training loss until approximately epoch 5, then began to oscillate, settling into minor fluctuations from around epoch 10 until the end of training. The minimum validation loss occurred at epoch 16. The divergence between the training and validation loss trajectories after epoch 5 indicates overfitting, which is also visible in the accuracy curves and is addressed in Section 5.

Figure 4 shows the confusion matrix for the test set. The model correctly identified 1,885 fire instances and 476 smoke instances; however, it misclassified 326 fire frames as smoke and 54 smoke frames as fire. The larger number of missed fire frames than missed smoke frames is consistent with Table 3, in which recall is lower for fire than for smoke. This difference can be attributed to the difficulty of distinguishing early-stage fires from dense smoke in low-light surveillance footage. Addressing this pattern of misclassification through temporal modeling over several consecutive frames is an important direction for future research.

4.4. Grad-CAM Visualization of Attention Activation

To provide interpretable evidence of the spatial focus of the model during classification, Grad-CAM activation maps were produced for representative fire and smoke frames from the FURG test set. Grad-CAM computes the gradient of the class score with respect to the last convolutional or attention-level feature map and generates a spatial heatmap of the image regions that contribute most to the classification decision, as shown in Figure 6. The standard ViT model without RIAH pruning produced broad, diffuse activation maps that covered background areas containing no fire or smoke, which indicates that redundant attention heads spread the gradient signal across many irrelevant spatial regions. After RIAH pruning, the E-ViT activation maps were tighter and concentrated on the flames and smoke. Removing low-importance heads therefore focused the model on the most diagnostically relevant image areas and improved its interpretability.
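
The Grad-CAM computation described above can be sketched in NumPy, assuming the feature map and its gradient have already been obtained from a backward pass. The arrays here are random stand-ins, and the 14 × 14 grid is an assumption that corresponds to reshaping the 196 patch tokens of a ViT into a spatial map.

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Standard Grad-CAM heatmap from (C, H, W) activations and gradients.

    Channel weights are the spatially averaged gradients; the map is the
    ReLU of the weighted channel sum, scaled to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))                       # one weight per channel
    cam = np.maximum(np.tensordot(weights, feature_maps, axes=1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 14, 14))   # hypothetical activations, 8 channels
grads = rng.normal(size=(8, 14, 14))      # hypothetical d(class score)/d(activations)
cam = grad_cam(features, grads)
print(cam.shape, float(cam.min()), float(cam.max()))
```

In practice the heatmap would be upsampled to the 224 × 224 input resolution and overlaid on the frame; that step is omitted here.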

4.5. Baseline Comparison

To demonstrate the efficacy of SmartFire Vision, the proposed system was benchmarked against six established fire detection systems evaluated under the same experimental conditions on the same dataset (FURG). The baselines were YOLOv5 [6], YOLOv8 [15], EfficientNetB0 with an attention mechanism [18], a CNN with wavelets and MobileNetV2 [19], Faster-RCNN [20], and a standard ViT without RIAH pruning or DETR integration. Figure 7 shows the ROC curves for each system, and the quantitative comparison is given in Table 4.

According to Table 4, SmartFire Vision surpassed all six baselines on every evaluation metric. In overall classification accuracy, SmartFire Vision improves on YOLOv8, the second-best system at 89.74%, by 1.63 percentage points, and in recall it outperforms YOLOv8 by 3.97 percentage points. This is a meaningful improvement for a use case in which a missed fire constitutes an operational failure. The standard ViT without pruning or DETR integration reaches 88.12% classification accuracy, which indicates that RIAH pruning and DETR fusion each contribute to the overall gain in classification accuracy. Faster-RCNN achieved the lowest classification accuracy of the baseline object detectors, 86.12%, despite its high parameter count of 41.8 million, which illustrates the limitation of multi-stage anchor-based object detectors on a benchmark of this kind. Although EfficientNetB0 with attention is the most parameter-efficient baseline at 5.3 million parameters, it still trails SmartFire Vision by 2.92 percentage points in classification accuracy. Collectively, these results indicate that the effectiveness of SmartFire Vision as a fire and smoke classification architecture can be attributed to the combination of E-ViT (global context modeling), DETR (localization-aware feature extraction), and adaptive gated fusion (AGF).

4.6. Ablation Study

To quantify the individual contribution of each architectural component of SmartFire Vision, a systematic ablation study was conducted by removing one component at a time from the full model and evaluating the resulting performance on the FURG test set. Five configurations were evaluated: the full SmartFire Vision model, the model without RIAH pruning (standard ViT encoder), the model without DETR (E-ViT only), the model without gated fusion (simple concatenation replacing the gating module), and the model without the probabilistic threshold mechanism (fixed 0.5 boundary). Results are presented in Table 5.

According to Table 5, removing RIAH pruning reduces accuracy from 91.37% to 88.12% and increases the per-frame inference time from 14.2 ms to 17.4 ms, which indicates that the pruning strategy both improves generalization and reduces computational cost. Removing DETR entirely, so that E-ViT is the only feature source, causes a larger degradation in recall, from 85.27% to 79.41%, because the model can no longer detect localized fire instances that lack a strong global signal. Replacing gated fusion with simple concatenation of the feature sources reduces the F1-score from 86.64% to 84.11%, which shows that adaptive weighting of the feature sources is more beneficial than a fixed combination. Finally, replacing the tuned threshold with a fixed boundary of 0.5 increases precision to 90.02% and decreases recall to 81.10%. This indicates that the optimized threshold deliberately biases the model toward greater sensitivity, in keeping with the safety-critical nature of the application.
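
The per-frame latencies reported above convert directly into throughput. The short calculation below uses only the two latency values from the ablation study and the 30 FPS real-time figure cited in the conclusions; the variable names are illustrative.

```python
def frames_per_second(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

pruned_ms, unpruned_ms = 14.2, 17.4     # per-frame inference times from the ablation
realtime_fps = 30.0                     # real-time target for CCTV surveillance

pruned_fps = frames_per_second(pruned_ms)
unpruned_fps = frames_per_second(unpruned_ms)
latency_saving = (unpruned_ms - pruned_ms) / unpruned_ms
budget_ms = 1000.0 / realtime_fps       # time available per frame at 30 FPS

print(f"with RIAH pruning:    {pruned_fps:.1f} FPS")      # 70.4 FPS
print(f"without RIAH pruning: {unpruned_fps:.1f} FPS")    # 57.5 FPS
print(f"latency reduction:    {latency_saving:.1%}")      # 18.4%
print(f"per-frame budget:     {budget_ms:.1f} ms")        # 33.3 ms
```

Both configurations fit within the 33.3 ms per-frame budget implied by 30 FPS; pruning shortens the per-frame time by about 18%.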

5. Discussion

Taken together, the experimental findings presented in Section 4 demonstrate that SmartFire Vision is a reliable video surveillance-based fire and smoke detection system that combines high accuracy with low computational requirements. Beyond clarifying the contribution of each component of the system, the results provide a basis for recommendations regarding deployment. The overall concept of deploying SmartFire Vision as part of a smart city surveillance infrastructure is shown in Figure 8.

Beyond the expected reduction in floating-point operations (FLOPs), RIAH head pruning produced further benefits, as the ablation study shows. The pruned model not only processed frames faster but also generalized better: the gap between training and validation accuracy fell from approximately 9 pp for the unpruned model to approximately 7 pp for SmartFire Vision. We attribute this improvement to the regularizing effect of head removal. With fewer heads, the effective capacity of the encoder is smaller, which discourages the model from overfitting to training-specific patterns while retaining the most diagnostically useful representations.

The ablation results in Section 4.6 confirm that integrating DETR primarily improved recall, with a smaller effect on precision. When DETR was removed, recall dropped by 5.86 pp, whereas precision dropped by only 1.35 pp. This asymmetry is significant from an architectural perspective: the object query mechanism of DETR is designed to detect spatially defined objects even when they occupy a very small fraction of the frame. In the nascent stages of a fire, for example, the flame may cover less than 5% of the pixels in the frame, so the global patch embeddings from E-ViT may average the fire signal into the background representation. The localized bounding box predictions from DETR provide an explicit, high-confidence detection signal in these difficult scenarios and improve recall without reducing precision.

Despite these strengths, the training dynamics shown in Figure 2 and Figure 3 reveal a limited amount of overfitting. There is a 7-percentage-point gap between the training accuracy (95%) and validation accuracy (88%) at epoch 20, which indicates that the model has partly learned visual characteristics specific to the FURG dataset rather than generalizable characteristics of fire. This is not surprising, given the relatively small training set (18,590 total frames from 24 videos) and the high representational capacity of the ViT-Base encoder, even after pruning. Strategies for mitigating this in future studies include more aggressive data augmentation, dropout regularization in the classification head, and training on a larger and more diverse multi-source fire dataset that combines FURG with the VisiFire and Corsican Fire benchmarks.

The second limitation relates to the dataset. The FURG dataset consists mainly of outdoor fire and smoke sequences captured with good visibility. SmartFire Vision has not been tested indoors, in tunnels, in industrial settings with competing thermal sources, or under rainy, foggy, or night-vision conditions, yet real smart city surveillance networks encounter many of these diverse and challenging conditions. Future studies should therefore test the robustness of the model across a much wider range of environmental conditions before large-scale deployment can be responsibly recommended.

SmartFire Vision builds on a vision transformer architecture and shows how the capabilities of transformer models can be extended to safety-related tasks [21]. Specifically, its encoder relies on the attention mechanism of the transformer and on the scalability of self-attention in image classification, where recognition is composed from many small parts of the image [21]. SmartFire Vision extends this by integrating an end-to-end object localization architecture that requires neither anchor-based proposals nor any other predetermined proposal-generation method [22]. The head-pruning results further indicate that many of the attention heads in this architecture are redundant [23]. Grad-CAM visualization provides a foundation for interpreting which features the classifier weights most heavily [24]. Experimental results with attention-based encoder models also support the use of attention in classifiers for high-dimensional data [25].

Related developments in lightweight transformer architectures for resource-constrained deployment indicate that attention-based models can be used in real-time operational environments while retaining their interpretability. Examples include multimodal fusion systems that integrate distilled language encoders into small vision backbones [26], explainable transformer-based systems for sensitive sequential classification tasks [27,28], and multi-method explainability studies that use vision transformers to support affective computing [29]. In particular, current research on selective attention pruning, which relates to the generalization benefits of the pruning method used in SmartFire Vision [30,31], highlights the potential of applying a unified transformer-based system across multiple high-stakes classification applications, such as extremist activity detection and propaganda analysis, and suggests a high level of cross-domain maturity among such systems [32]. Overall, this body of literature positions SmartFire Vision within an increasingly mature ecosystem of transformer-based systems used for safety-critical, real-world inference tasks.

6. Conclusions

This study describes SmartFire Vision, a hybrid deep learning framework that detects fire and smoke in video sequences in real time. To our knowledge, it is the first framework to use E-ViT and DETR in tandem through a gated fusion mechanism. The E-ViT encoder removes redundant attention heads with the RIAH method, which uses gradient-derived importance scores to discard unimportant heads; as a result, the number of FLOPs decreases by 18.3%, while generalization improves. DETR provides an additional layer of object-level spatial localization, which increases the recall of early fire events. The gated feature fusion module balances the global and local feature representations. Finally, a probabilistic thresholding classifier coupled with an alarm system completes an end-to-end fire detection system. SmartFire Vision was evaluated on the FURG fire benchmark dataset, which consists of 28,022 annotated frames, and achieved an overall accuracy of 91.37%, a recall of 85.27%, a precision of 88.55%, and an F1-score of 86.64%. It outperforms six state-of-the-art baselines, including YOLOv8, EfficientNetB0, and Faster-RCNN. At 70.4 frames per second (FPS) on a standard GPU, the framework exceeds the 30 FPS real-time requirement for CCTV surveillance, which leaves substantial headroom for additional computation on the same GPU. All three principal components, RIAH pruning, DETR integration, and gated fusion, contributed to the observed performance improvements, both individually and in combination. Future research can proceed in three directions. First, RIAH pruning can be extended to the DETR decoder to reduce end-to-end inference latency for deployment on edge devices such as the NVIDIA Jetson Nano. Second, a temporal modeling module can be added to propagate context across consecutive video frames, improving sensitivity to slow-moving smoke and reducing false positives in isolated frames. Third, SmartFire Vision can be integrated into a complete IoT-connected smart city architecture and tested in a real surveillance environment under varied fire conditions, including indoor scenes, industrial sites, and bad weather, to assess its readiness for real-world deployment.

Author Contributions

Conceptualization, M. Azhar and M. Arman; methodology, M. Azhar; software, M. Arman and A. Amjad; validation, M. Azhar, A. Iqbal, and D. A. Dewi; formal analysis, M. Arman and A. Amjad; investigation, M. Azhar and A. Iqbal; resources, D. A. Dewi; data curation, A. Iqbal; writing original draft preparation, M. Azhar; writing review and editing, A. Iqbal and D. A. Dewi; visualization, M. Arman and A. Amjad; supervision, M. Azhar and D. A. Dewi; project administration, M. Azhar. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Hong Kong Shue Yan University, Hong Kong SAR, China under the University Conference Grant (UCG) with the project number UCG/24/18.

Institutional Review Board Statement

Ethical review and approval were waived for this study as no human participants, sensitive personal data, or biological specimens were involved. All experiments were conducted exclusively on the publicly available FURG Fire Dataset, which consists of annotated video sequences for video-based fire and smoke detection and contains no personally identifiable information. The dataset is released under the CC0-1.0 (Public Domain) license and is freely available for academic research and reproducibility.

Informed Consent Statement

This study did not involve any direct interaction with human participants. All data used for training and evaluation were obtained from the publicly available FURG Fire Dataset (https://github.com/steffensbola/furg-fire-dataset), which comprises annotated video sequences for fire and smoke detection and contains no personally identifiable or sensitive information. The dataset is released under the CC0-1.0 (Public Domain) license.

Data Availability Statement

The FURG Fire Dataset used in this study is publicly available at https://github.com/steffensbola/furg-fire-dataset (accessed on 1 June 2026), released under the CC0-1.0 (Public Domain) license.

Conflicts of Interest

The authors declare no conflict of interest.

References

Holborn, P. G.; Nolan, P. F.; Golt, J. An analysis of fatal unintentional dwelling fires investigated by London Fire Brigade between 1996 and 2000. Fire Saf. J. 2003, 38(1), 1–42.
Shavit, T.; Shahrabani, S.; Benzion, U.; Rosenboim, M. The effect of a forest fire disaster on emotions and perceptions of risk: A field study after the Carmel fire. J. Environ. Psychol. 2013, 36, 129–135.
Holborn, P. G.; Nolan, P. F.; Golt, J. An analysis of fatal unintentional dwelling fires investigated by London Fire Brigade between 1996 and 2000. Fire Saf. J. 2003, 38(1), 1–42.
Aralt, T. T.; Nilsen, A. R. Automatic fire detection in road traffic tunnels. Tunn. Undergr. Space Technol. 2009, 24(1), 75–83.
Geetha, S.; Abhishek, C. S.; Akshayanat, C. S. Machine vision based fire detection techniques: A survey. Fire Technol. 2021, 57, 591–623.
Gaur, A.; Singh, A.; Kumar, A.; Kumar, A.; Kapoor, K. Video flame and smoke based fire detection algorithms: A literature review. Fire Technol. 2020, 56, 1943–1980.
Thomson, W.; Bhowmik, N.; Breckon, T. P. Efficient and compact convolutional neural network architectures for non-temporal real-time fire detection. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA); IEEE, December 2020; pp. 136–141.
Azhar, M.; Perveen, S.; Iqbal, A.; Lee, B. IDRandomForest: Advanced Random Forest for Real-time Intrusion Detection. IEEE Access 2024.
Azhar, M.; Li, M. J.; Huang, J. Z. A hierarchical gamma mixture model-based method for classification of high-dimensional data. Entropy 2019, 21(9), 906.
Gong, F.; Li, C.; Gong, W.; Li, X.; Yuan, X.; Ma, Y.; Song, T. A real-time fire detection method from video with multi-feature fusion. Comput. Intell. Neurosci. 2019.
Abdusalomov, A. B.; Islam, B. M. S.; Nasimov, R.; Mukhiddinov, M.; Whangbo, T. K. An improved forest fire detection method based on the detectron2 model and a deep learning approach. Sensors 2023, 23(3), 1512.
Avazov, K.; Hyun, A. E.; Sami S, A. A.; Khaitov, A.; Abdusalomov, A. B.; Cho, Y. I. Forest Fire Detection and Notification Method Based on AI and IoT Approaches. Future Internet 2023, 15(2), 61.
Hussain, K.; Azhar, M.; Lee, B.; Iqbal, A.; Affan, M.; Khan, S. U. ASAnalyzer: Attention based sentiment analyzer for real-world sentiment analysis. In 2023 International Conference on Frontiers of Information Technology (FIT); IEEE, December 2023; pp. 184–189.
Ali, M. S.; Azhar, M.; Masood, S.; Lee, B.; Iqbal, T.; Amjad, A. Efficient Video Summarization with Hydra Attentive Vision Transformer. In 2023 International Conference on Frontiers of Information Technology (FIT); IEEE, December 2023; pp. 196–201.
Talaat, F. M.; ZainEldin, H. An improved fire detection approach based on YOLO-v8 for smart cities. Neural Comput. Appl. 2023, 1–16.
Khan, S. U.; Khan, M. A.; Azhar, M.; Khan, F.; Lee, Y.; Javed, M. Multimodal medical image fusion towards future research: A review. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35(8), 101733.
de Venancio, P. V. A.; Campos, R. J.; Rezende, T. M.; Lisboa, A. C.; Barbosa, A. V. A hybrid method for fire detection based on spatial and temporal patterns. Neural Comput. Appl. 2023, 35(13), 9349–9361.
Azhar, M.; Huang, J. Z.; Masud, M. A.; Li, M. J.; Cui, L. A hierarchical Gamma Mixture Model-based method for estimating the number of clusters in complex data. Appl. Soft Comput. 2020, 87, 105891.
Huang, L.; Liu, G.; Wang, Y.; Yuan, H.; Chen, T. Fire detection in video surveillance using convolutional neural networks and wavelet transform. Eng. Appl. Artif. Intell. 2022, 110, 104737.
Li, P.; Zhao, W. Image fire detection algorithms based on convolutional neural networks. Case Stud. Therm. Eng. 2020, 19, 100625.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR); 2021.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV); Springer, 2020; pp. 213–229.
Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? Adv. Neural Inf. Process. Syst. 2019, 32.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision (ICCV); 2017; pp. 618–626.
Raza, M.A.; Fränti, P. A hierarchical gamma mixture model-based method for classification of high dimensional data. Entropy 2019, 21, 906.
Azhar, M.; Amjad, A.; Arman, M.; Dewi, D. A. Multimodal Emotion Detection in Low Resource Languages Using Lightweight Transformer Architectures: A Dual-Level Fusion Framework Integrating DistilBERT, CNN-BiGRU, and MobileViT for Efficient Real-Time Urdu Affective Computing. Information 2026, 17(5), 458.
Arman, M. Risk Former: Towards Transparent AI in Mental Health: An Explainable Transformer Model for Suicide Risk Detection from Digital Language. SES 2025, 3(7), 1693–1700.
Azhar, M.; Arman, M.; Amjad, A.; Dewi, D. A.; Ahmad, M. U.; Hussain, S. Explainable Transformer-Based Framework for Suicide Risk Detection: Deep Learning with Interpretability for Mental Health Crisis Identification. Information 2026, 17(5), 448.
Azhar, M.; Riaz, N.; Azeem, W.; Dewi, D. A.; Amjad, A.; Arman, M. Explainable Transformer Models for Human Emotion Recognition: A Multi-Method Explainability Study in the Context of Mental Health. Preprints 2026, 2026041441.
Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. A systematic review and experimental evaluation of classical and transformer-based models for Urdu abstractive text summarization. Information 2025, 16, 784.
Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. Efficient transformer-based abstractive Urdu text summarization through selective attention pruning. Information 2025, 16, 991.
Balaji, R.L.; Thiruvenkataswamy, C.S.; Batumalay, M.; Duraimutharasan, N.; Devadas, A.D.T.; Yingthawornsuk, T. A Study of Unified Framework for Extremism Classification, Ideology Detection, Propaganda Analysis, and Flagged Data Detection Using Transformers. J. Appl. Data Sci. 2025, 6, 1791–1810.

Figure 2. Training and validation accuracy curves over 25 epochs.

Figure 3. Training and validation loss curves over 25 epochs. The green solid line represents training loss and the red dashed line represents validation loss. The purple marker indicates the epoch of lowest validation loss (epoch 16). The vertical grey dashed line at epoch 5 marks the onset of loss divergence.

Figure 4. Confusion matrix for SmartFire Vision evaluated on the FURG test set. Values represent frame counts. Diagonal cells indicate correct classifications; off-diagonal cells indicate misclassifications.

Figure 5. Distribution of attention head importance scores across all 12 layers of the ViT encoder before pruning. The red dashed line indicates the 20th percentile threshold τ used by the RIAH algorithm. Heads below this threshold (shaded red region) are removed, retaining approximately 80% of heads per layer.

Figure 6. Grad-CAM activation map comparison between standard ViT and E-ViT (after RIAH pruning) for representative fire and smoke frames from the FURG test set. (a) Original surveillance frames. (b) Activation maps from standard ViT without pruning. (c) Activation maps from E-ViT with RIAH pruning. Pruned model activations are more tightly localized on fire and smoke regions, demonstrating improved spatial discriminability.

Figure 7. Receiver Operating Characteristic (ROC) curves for SmartFire Vision and five baseline methods on the FURG test set. (a) Fire class ROC curves. (b) Smoke class ROC curves. SmartFire Vision achieves the highest AUC for both classes, confirming superior discriminative ability across all operating thresholds.

Figure 8. Conceptual deployment architecture of SmartFire Vision.

Table 1. Full hyperparameter configuration for SmartFire Vision training and evaluation.

Hyperparameter	Value	Justification
Optimizer	Adam	Adaptive learning rate, robust to sparse gradients
Learning Rate	0.0001	Stable convergence without oscillation
Batch Size	32	Balanced GPU utilization and gradient stability
Epochs	25	Validated early stopping at epoch 20
Image Resolution	224 x 224	Standard ViT patch size compatibility
Patch Size (ViT)	16 x 16	196 patches per frame
ViT Layers	12	Base ViT-B configuration
Attention Heads (original)	12 per layer	Standard ViT-B
Attention Heads (after RIAH)	~10 per layer	20th percentile pruning threshold
DETR Encoder Layers	6	Standard DETR configuration
DETR Decoder Queries	100	Sufficient for fire/smoke object count
Classification Threshold θ	0.55	Optimized on validation set
RIAH Pruning Percentile τ	20%	Balances efficiency and accuracy
Fine-tuning Epochs (post-pruning)	3	Recovery from pruning perturbation
GPU	NVIDIA GTX	8 GB VRAM
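
The patch count and the post-pruning head count in Table 1 follow from simple arithmetic, which the sketch below illustrates. It is a minimal illustration and not the authors' RIAH implementation: the importance scores are random placeholders, the threshold is assumed to be a single global 20th percentile over all 12 × 12 heads, and the function names `num_patches`, `percentile_threshold`, and `keep_mask` are hypothetical.

```python
import random


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def percentile_threshold(values, pct: float) -> float:
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    pos = (pct / 100.0) * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def keep_mask(scores, pct: float):
    """True for every head whose score is at or above the global threshold."""
    flat = [s for layer in scores for s in layer]
    tau = percentile_threshold(flat, pct)
    return [[s >= tau for s in layer] for layer in scores]


# Assumption: placeholder importance scores, 12 layers x 12 heads.
random.seed(0)
scores = [[random.random() for _ in range(12)] for _ in range(12)]

mask = keep_mask(scores, 20.0)
kept_per_layer = [sum(row) for row in mask]
kept_total = sum(kept_per_layer)

print(num_patches(224, 16))   # 196 patches for a 224 x 224 frame
print(kept_per_layer)         # heads retained in each layer
print(kept_total / 144)       # close to 0.80
```

Under these assumptions roughly 80% of the 144 heads survive, which is about 10 of 12 per layer on average. Because the threshold is global, individual layers may keep more or fewer heads than that average.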

Table 2. Overall quantitative performance of the proposed SmartFire Vision.

Model	Accuracy	Recall	Precision	F1-Score
SmartFire Vision (Proposed)	0.9137	0.8527	0.8855	0.8664

Table 3. Class-wise performance of SmartFire Vision on the FURG test set.

Class	Accuracy	Recall	Precision	F1-Score
Smoke	0.9120	0.8637	0.8815	0.8699
Fire	0.9154	0.8450	0.8910	0.8670

Table 4. Comparative performance of SmartFire Vision against six state-of-the-art baselines on the FURG fire detection dataset.

Method	Accuracy	Recall	Precision	F1-Score	Params (M)	FLOPs (G)
YOLOv5 [6]	0.8721	0.8103	0.8540	0.8315	7.3	16.5
YOLOv8 [15]	0.8974	0.8130	0.8812	0.8457	11.1	14.2
EfficientNetB0+Att. [18]	0.8845	0.8290	0.8710	0.8495	5.3	0.39
Wavelet-CNN/MV2 [19]	0.8803	0.8215	0.8601	0.8404	3.4	0.31
Faster-RCNN [20]	0.8612	0.7980	0.8430	0.8199	41.8	180.0
Standard ViT (no pruning)	0.8812	0.8201	0.8644	0.8417	86.4	16.8
SmartFire Vision (Proposed)	0.9137	0.8527	0.8855	0.8664	72.1	13.7

Table 5. Ablation study results showing the individual contribution of each component of SmartFire Vision.

Configuration	Accuracy	Recall	Precision	F1-Score	Inference (ms)
Full Model (SmartFire Vision)	0.9137	0.8527	0.8855	0.8664	14.2
w/o RIAH Pruning (standard ViT)	0.8812	0.8201	0.8644	0.8417	17.4
w/o DETR (E-ViT only)	0.8893	0.7941	0.8720	0.8312	9.8
w/o Gated Fusion (concatenation)	0.9001	0.8344	0.8712	0.8411	14.5
w/o Threshold Mechanism (θ=0.5)	0.9011	0.8110	0.9002	0.8533	14.1

Table 6. Computational efficiency comparison of SmartFire Vision and baseline methods.

Method	FLOPs (G)	Params (M)	Inference (ms)	FPS	GPU Memory (GB)
YOLOv5 [6]	16.5	7.3	11.4	87.7	2.1
YOLOv8 [15]	14.2	11.1	12.8	78.1	2.4
EfficientNetB0 [18]	0.39	5.3	8.2	121.9	0.9
Wavelet-CNN/MV2 [19]	0.31	3.4	7.6	131.6	0.7
Faster-RCNN [20]	180.0	41.8	48.3	20.7	6.8
Standard ViT	16.8	86.4	17.4	57.5	3.8
SmartFire Vision (Proposed)	13.7	72.1	14.2	70.4	3.2
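
The FPS column in Table 6 is consistent with the reciprocal of the per-frame inference time, and the savings from pruning can be read as relative reductions between the Standard ViT and SmartFire Vision rows. The following sketch reproduces that arithmetic for those two rows only. The helper names are hypothetical, and the input values are copied from Table 6.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# Values copied from Table 6 (FLOPs in G, parameters in M, latency in ms).
standard_vit = {"flops": 16.8, "params": 86.4, "latency_ms": 17.4}
smartfire = {"flops": 13.7, "params": 72.1, "latency_ms": 14.2}

fps_vit = round(fps_from_latency(standard_vit["latency_ms"]), 1)
fps_smartfire = round(fps_from_latency(smartfire["latency_ms"]), 1)

# Percentage reduction of each quantity relative to the unpruned ViT.
reductions_pct = {
    key: round(100.0 * relative_reduction(standard_vit[key], smartfire[key]), 1)
    for key in ("flops", "params", "latency_ms")
}

print(fps_vit, fps_smartfire)   # 57.5 70.4
print(reductions_pct)
```

With the tabulated values, this gives reductions of about 18.5% in FLOPs, 16.6% in parameters, and 18.4% in inference latency relative to the unpruned ViT.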

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.