Submitted: 17 June 2026
Posted: 18 June 2026
Abstract
Keywords:
1. Introduction
2. Related Works
3. Methodology
3.1. Dataset Collection and Preprocessing
3.2. Efficient Vision Transformer (E-ViT)

3.3. DEtection TRansformer (DETR) for Object Detection
| Algorithm 1: Fire and Smoke Detection Using the Hybrid E-ViT-DETR Model |
|
Input: Video dataset containing fire, smoke, and no-fire sequences with frame-level bounding box annotations
Output: Classified frame set, performance metrics, optimized model, and alert signal
Stage 1: Preprocessing
Stage 2: Feature Extraction
Stage 3: Feature Fusion
Stage 4: Hybrid E-ViT-DETR Model Training
Stage 5: Alert System
Stage 6: Evaluation and Optimization
|
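
Stage 3 of Algorithm 1 names a feature fusion step, and Section 3.5 calls it gated. The sketch below shows one common form of gated fusion, in which a sigmoid gate computed from both feature vectors decides, per dimension, how much of each source to keep. The function name `gated_fusion`, the argument names, and the exact gate formula are assumptions made for illustration; the formulation used in the paper may differ.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    vit_feat: Sequence[float],
    detr_feat: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    gate_bias: Sequence[float],
) -> List[float]:
    """Blend two equal-length feature vectors with a per-dimension gate.

    Assumed form (hypothetical, not taken from the paper):
        g     = sigmoid(W @ [vit_feat; detr_feat] + b)
        fused = g * vit_feat + (1 - g) * detr_feat
    """
    dim = len(vit_feat)
    if not (len(detr_feat) == len(gate_weights) == len(gate_bias) == dim):
        raise ValueError("all inputs must share the same feature dimension")

    concat = list(vit_feat) + list(detr_feat)
    fused = []
    for i in range(dim):
        if len(gate_weights[i]) != len(concat):
            raise ValueError("each gate weight row must have length 2 * dim")
        # Gate pre-activation for output dimension i.
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        fused.append(g * vit_feat[i] + (1.0 - g) * detr_feat[i])
    return fused


# With zero weights and zero bias the gate is 0.5, so fusion is a plain average.
print(gated_fusion([1.0, 2.0], [3.0, 4.0], [[0.0] * 4, [0.0] * 4], [0.0, 0.0]))
```

Unlike plain concatenation, which is the alternative compared in the ablation table, a gate of this kind lets the model weight the two feature sources differently for each input.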
3.4. Attention Head Importance Scoring and RIAH Pruning
3.5. Gated Feature Fusion
3.6. Classification of Fire and Smoke
4. Experiments and Results
4.1. Evaluation Metrics
4.2. Implementation Details
4.3. Evaluation of Model Performance
4.4. Grad-CAM Visualization of Attention Activation
4.5. Baseline Comparison
4.6. Ablation Study
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Holborn, P. G.; Nolan, P. F.; Golt, J. An analysis of fatal unintentional dwelling fires investigated by London Fire Brigade between 1996 and 2000. Fire Saf. J. 2003, 38(1), 1–42. [Google Scholar] [CrossRef]
- Shavit, T.; Shahrabani, S.; Benzion, U.; Rosenboim, M. The effect of a forest fire disaster on emotions and perceptions of risk: A field study after the Carmel fire. J. Environ. Psychol. 2013, 36, 129–135. [Google Scholar] [CrossRef]
- Holborn, P. G.; Nolan, P. F.; Golt, J. An analysis of fatal unintentional dwelling fires investigated by London Fire Brigade between 1996 and 2000. Fire Saf. J. 2003, 38(1), 1–42. [Google Scholar] [CrossRef]
- Aralt, T. T.; Nilsen, A. R. Automatic fire detection in road traffic tunnels. Tunn. Undergr. Space Technol. 2009, 24(1), 75–83. [Google Scholar] [CrossRef]
- Geetha, S.; Abhishek, C. S.; Akshayanat, C. S. Machine vision based fire detection techniques: A survey. Fire Technol. 2021, 57, 591–623. [Google Scholar]
- Gaur, A.; Singh, A.; Kumar, A.; Kumar, A.; Kapoor, K. Video flame and smoke based fire detection algorithms: A literature review. Fire Technol. 2020, 56, 1943–1980. [Google Scholar] [CrossRef]
- Thomson, W.; Bhowmik, N.; Breckon, T. P. Efficient and compact convolutional neural network architectures for non-temporal real-time fire detection. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA); IEEE, December 2020; pp. 136–141. [Google Scholar]
- Azhar, M.; Perveen, S.; Iqbal, A.; Lee, B. IDRandomForest: Advanced Random Forest for Real-time Intrusion Detection. IEEE Access 2024. [Google Scholar]
- Azhar, M.; Li, M. J.; Huang, J. Z. A hierarchical gamma mixture model-based method for classification of high-dimensional data. Entropy 2019, 21(9), 906. [Google Scholar] [CrossRef]
- Gong, F.; Li, C.; Gong, W.; Li, X.; Yuan, X.; Ma, Y.; Song, T. A real-time fire detection method from video with multi-feature fusion. Comput. Intell. Neurosci. 2019. [Google Scholar]
- Abdusalomov, A. B.; Islam, B. M. S.; Nasimov, R.; Mukhiddinov, M.; Whangbo, T. K. An improved forest fire detection method based on the detectron2 model and a deep learning approach. Sensors 2023, 23(3), 1512. [Google Scholar] [CrossRef] [PubMed]
- Avazov, K.; Hyun, A. E.; Sami S, A. A.; Khaitov, A.; Abdusalomov, A. B.; Cho, Y. I. Forest Fire Detection and Notification Method Based on AI and IoT Approaches. Future Internet 2023, 15(2), 61. [Google Scholar] [CrossRef]
- Hussain, K.; Azhar, M.; Lee, B.; Iqbal, A.; Affan, M.; Khan, S. U. ASAnalyzer: Attention based sentiment analyzer for real-world sentiment analysis. In 2023 International Conference on Frontiers of Information Technology (FIT); IEEE, December 2023; pp. 184–189. [Google Scholar]
- Ali, M. S.; Azhar, M.; Masood, S.; Lee, B.; Iqbal, T.; Amjad, A. Efficient Video Summarization with Hydra Attentive Vision Transformer. In 2023 International Conference on Frontiers of Information Technology (FIT); IEEE, December 2023; pp. 196–201. [Google Scholar]
- Talaat, F. M.; ZainEldin, H. An improved fire detection approach based on YOLO-v8 for smart cities. In Neural Computing and Applications; 2023; pp. 1–16. [Google Scholar]
- Khan, S. U.; Khan, M. A.; Azhar, M.; Khan, F.; Lee, Y.; Javed, M. Multimodal medical image fusion towards future research: A review. J. King Saud. Univ.-Comput. Inf. Sci. 2023, 35(8), 101733. [Google Scholar] [CrossRef]
- de Venancio, P. V. A.; Campos, R. J.; Rezende, T. M.; Lisboa, A. C.; Barbosa, A. V. A hybrid method for fire detection based on spatial and temporal patterns. Neural Comput. Appl. 2023, 35(13), 9349–9361. [Google Scholar] [CrossRef]
- Azhar, M.; Huang, J. Z.; Masud, M. A.; Li, M. J.; Cui, L. A hierarchical Gamma Mixture Model-based method for estimating the number of clusters in complex data. Appl. Soft Comput. 2020, 87, 105891. [Google Scholar] [CrossRef]
- Huang, L.; Liu, G.; Wang, Y.; Yuan, H.; Chen, T. Fire detection in video surveillance using convolutional neural networks and wavelet transform. Eng. Appl. Artif. Intell. 2022, 110, 104737. [Google Scholar] [CrossRef]
- Li, P.; Zhao, W. Image fire detection algorithms based on convolutional neural networks. Case Stud. Therm. Eng. 2020, 19, 100625. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021, 2021. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In ECCV; Springer, 2020; Volume 2020, pp. 213–229. [Google Scholar]
- Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. ICCV 2017, 2017; pp. 618–626. [Google Scholar]
- Raza, M.A.; Fränti, P. A hierarchical gamma mixture model-based method for classification of high dimensional data. Entropy 2019, 21, 906. [Google Scholar] [CrossRef]
- Azhar, M.; Amjad, A.; Arman, M.; Dewi, D. A. Multimodal Emotion Detection in Low Resource Languages Using Lightweight Transformer Architectures: A Dual-Level Fusion Framework Integrating DistilBERT, CNN-BiGRU, and MobileViT for Efficient Real-Time Urdu Affective Computing. Information 2026, 17(5), 458. [Google Scholar] [CrossRef]
- Arman, M. Risk Former: Towards Transparent AI in Mental Health: An Explainable Transformer Model for Suicide Risk Detection from Digital Language. SES 2025, 3(7), 1693–1700. [Google Scholar]
- Azhar, M.; Arman, M.; Amjad, A.; Dewi, D. A.; Ahmad, M. U.; Hussain, S. Explainable Transformer-Based Framework for Suicide Risk Detection: Deep Learning with Interpretability for Mental Health Crisis Identification. Information 2026, 17(5), 448. [Google Scholar] [CrossRef]
- Azhar, M.; Riaz, N.; Azeem, W.; Dewi, D. A.; Amjad, A.; Arman, M. Explainable Transformer Models for Human Emotion Recognition: A Multi-Method Explainability Study in the Context of Mental Health. Preprints 2026, 2026041441. [Google Scholar] [CrossRef]
- Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. A systematic review and experimental evaluation of classical and transformer-based models for Urdu abstractive text summarization. Information 2025, 16, 784. [Google Scholar] [CrossRef]
- Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. Efficient transformer-based abstractive Urdu text summarization through selective attention pruning. Information 2025, 16, 991. [Google Scholar] [CrossRef]
- Balaji, R.L.; Thiruvenkataswamy, C.S.; Batumalay, M.; Duraimutharasan, N.; Devadas, A.D.T.; Yingthawornsuk, T. A Study of Unified Framework for Extremism Classification, Ideology Detection, Propaganda Analysis, and Flagged Data Detection Using Transformers. J. Appl. Data Sci. 2025, 6, 1791–1810. [Google Scholar]







| Hyperparameter | Value | Justification |
|---|---|---|
| Optimizer | Adam | Adaptive learning rate, robust to sparse gradients |
| Learning Rate | 0.0001 | Stable convergence without oscillation |
| Batch Size | 32 | Balanced GPU utilization and gradient stability |
| Epochs | 25 | Validated early stopping at epoch 20 |
| Image Resolution | 224 × 224 | Standard ViT input size; divides evenly into 16 × 16 patches |
| Patch Size (ViT) | 16 × 16 | 196 patches per frame |
| ViT Layers | 12 | Base ViT-B configuration |
| Attention Heads (original) | 12 per layer | Standard ViT-B |
| Attention Heads (after RIAH) | ~10 per layer | 20th percentile pruning threshold |
| DETR Encoder Layers | 6 | Standard DETR configuration |
| DETR Decoder Queries | 100 | Sufficient for fire/smoke object count |
| Classification Threshold θ | 0.55 | Optimized on validation set |
| RIAH Pruning Percentile τ | 20% | Balances efficiency and accuracy |
| Fine-tuning Epochs (post-pruning) | 3 | Recovery from pruning perturbation |
| GPU | NVIDIA GTX | 8 GB VRAM |
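
Two entries in the hyperparameter table follow from simple arithmetic: a 224 × 224 frame cut into 16 × 16 patches gives 14 × 14 = 196 patches, and pruning at the 20th percentile of 12 heads leaves roughly 12 × 0.8 = 9.6, or about 10, heads per layer on average. The sketch below reproduces the patch count and illustrates percentile-based head pruning. The helper names and the importance scores are hypothetical, and the way RIAH actually scores heads is not specified here, so the scores are plain placeholder numbers.

```python
from typing import List, Sequence


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def prune_heads(scores: Sequence[float], pct: float = 20.0) -> List[int]:
    """Indices of heads whose score is at or above the pct-th percentile."""
    cutoff = percentile(scores, pct)
    return [i for i, s in enumerate(scores) if s >= cutoff]


print(num_patches(224, 16))  # 196, matching the table

# Hypothetical importance scores for the 12 heads of one layer.
scores = [0.91, 0.15, 0.47, 0.62, 0.08, 0.33, 0.78, 0.55, 0.29, 0.84, 0.11, 0.70]
print(prune_heads(scores, pct=20.0))
```

With these placeholder scores, the three lowest-scoring heads fall below the cutoff and nine remain. How many heads survive in any one layer depends on where the cutoff lands, which is consistent with the table giving the post-pruning count as approximate.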

| Model | Accuracy | Recall | Precision | F1-Score |
|---|---|---|---|---|
| SmartFire Vision (Proposed) | 0.9137 | 0.8527 | 0.8855 | 0.8664 |

| Class | Accuracy | Recall | Precision | F1-Score |
|---|---|---|---|---|
| Smoke | 0.9120 | 0.8637 | 0.8815 | 0.8699 |
| Fire | 0.9154 | 0.8450 | 0.8910 | 0.8670 |
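
The four metrics reported in these tables have standard definitions in terms of true and false positives and negatives. The sketch below computes them from confusion counts. The function name and the example counts are hypothetical and are not drawn from the paper's data.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for one class, for illustration only.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a multi-class setting such as fire versus smoke, these quantities are typically computed per class in a one-vs-rest fashion and then averaged; which averaging scheme applies to the reported figures is not stated here.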

| Method | Accuracy | Recall | Precision | F1-Score | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLOv5 [6] | 0.8721 | 0.8103 | 0.8540 | 0.8315 | 7.3 | 16.5 |
| YOLOv8 [15] | 0.8974 | 0.8130 | 0.8812 | 0.8457 | 11.1 | 14.2 |
| EfficientNetB0+Att. [18] | 0.8845 | 0.8290 | 0.8710 | 0.8495 | 5.3 | 0.39 |
| Wavelet-CNN/MV2 [19] | 0.8803 | 0.8215 | 0.8601 | 0.8404 | 3.4 | 0.31 |
| Faster-RCNN [20] | 0.8612 | 0.7980 | 0.8430 | 0.8199 | 41.8 | 180.0 |
| Standard ViT (no pruning) | 0.8812 | 0.8201 | 0.8644 | 0.8417 | 86.4 | 16.8 |
| SmartFire Vision (Proposed) | 0.9137 | 0.8527 | 0.8855 | 0.8664 | 72.1 | 13.7 |

| Configuration | Accuracy | Recall | Precision | F1-Score | Inference (ms) |
|---|---|---|---|---|---|
| Full Model (SmartFire Vision) | 0.9137 | 0.8527 | 0.8855 | 0.8664 | 14.2 |
| w/o RIAH Pruning (standard ViT) | 0.8812 | 0.8201 | 0.8644 | 0.8417 | 17.4 |
| w/o DETR (E-ViT only) | 0.8893 | 0.7941 | 0.8720 | 0.8312 | 9.8 |
| w/o Gated Fusion (concatenation) | 0.9001 | 0.8344 | 0.8712 | 0.8411 | 14.5 |
| w/o Threshold Mechanism (θ=0.5) | 0.9011 | 0.8110 | 0.9002 | 0.8533 | 14.1 |
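
The last ablation row replaces the tuned classification threshold θ with the default 0.5, and the hyperparameter table states that θ was optimized on the validation set. One simple way to do such tuning is a sweep over candidate thresholds that keeps the one with the best validation F1. The sketch below illustrates that idea only: the function names, the selection criterion (F1), and the toy scores and labels are assumptions, not the paper's procedure or data.

```python
from typing import Sequence


def f1_at_threshold(
    scores: Sequence[float], labels: Sequence[int], theta: float
) -> float:
    """F1 of the positive class when predicting positive for score >= theta."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < theta and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    scores: Sequence[float], labels: Sequence[int], candidates: Sequence[float]
) -> float:
    """Candidate with the highest F1; ties go to the earliest candidate."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Toy validation scores and binary labels (hypothetical).
val_scores = [0.20, 0.40, 0.52, 0.58, 0.61, 0.90]
val_labels = [0, 0, 0, 1, 1, 1]
candidates = [0.45, 0.50, 0.55, 0.60]

print(best_threshold(val_scores, val_labels, candidates))  # 0.55 on this toy data
```

The threshold chosen this way is specific to the validation distribution, so it would normally be fixed before the test set is evaluated.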

| Method | FLOPs (G) | Params (M) | Inference (ms) | FPS | GPU Memory (GB) |
|---|---|---|---|---|---|
| YOLOv5 [6] | 16.5 | 7.3 | 11.4 | 87.7 | 2.1 |
| YOLOv8 [15] | 14.2 | 11.1 | 12.8 | 78.1 | 2.4 |
| EfficientNetB0 [18] | 0.39 | 5.3 | 8.2 | 121.9 | 0.9 |
| Wavelet-CNN/MV2 [19] | 0.31 | 3.4 | 7.6 | 131.6 | 0.7 |
| Faster-RCNN [20] | 180.0 | 41.8 | 48.3 | 20.7 | 6.8 |
| Standard ViT | 16.8 | 86.4 | 17.4 | 57.5 | 3.8 |
| SmartFire Vision (Proposed) | 13.7 | 72.1 | 14.2 | 70.4 | 3.2 |
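
The FPS column in this table is consistent with FPS = 1000 / inference time in milliseconds, and the savings of the proposed model over the unpruned ViT can be read off as relative reductions. The short sketch below reproduces that arithmetic from the table values; the helper names are illustrative.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-frame latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - value) / baseline


# Values taken from the table: Standard ViT vs. SmartFire Vision.
print(f"FPS at 14.2 ms: {fps_from_latency(14.2):.1f}")                  # 70.4
print(f"FLOPs reduction: {relative_reduction(16.8, 13.7):.1%}")         # 18.5%
print(f"Parameter reduction: {relative_reduction(86.4, 72.1):.1%}")     # 16.6%
print(f"Latency reduction: {relative_reduction(17.4, 14.2):.1%}")       # 18.4%
```

On these figures, pruning removes a little under one fifth of the FLOPs and latency of the standard ViT, while the lightweight CNN baselines remain far cheaper in absolute terms.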

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).