In this study, a detection framework is presented and evaluated that integrates sensor data (e.g., temperature, humidity, and gas readings) with machine learning (ML) models and computer-vision-based smoke and fire detection, aiming to improve overall accuracy and robustness while reducing false alarms. To this end, sixteen (16) ML and deep learning (DL) models are employed on an Internet of Things (IoT) sensor dataset. In addition, a range of YOLO models, from older versions (YOLOv5n, YOLOv8n) to newer ones (YOLOv10n, YOLOv11n, YOLOv12n), is applied to a labeled image dataset. Model selection initially prioritizes lightweight architectures suitable for resource-constrained edge devices. The selected models are then evaluated using well-known criteria such as parameter count, F1-score, mean average precision (mAP), and real-time inference latency. In the same context, explainable AI (XAI) techniques, namely SHAP (SHapley Additive exPlanations) for the ML models and LIME (Local Interpretable Model-agnostic Explanations) for the YOLO detectors, are integrated into the platform. According to the presented results, the proposed Explainable Sensor Fusion (ESF) framework achieves solid performance on a resource-constrained hardware device, demonstrating a viable, explainable, and highly efficient solution for real-time smoke and fire emergency response in industrial environments.
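To make the sensor-side XAI step concrete, the sketch below shows how SHAP attributions of the kind described above can be produced for a tree-based classifier on sensor readings. It is a minimal illustration only, not the paper's pipeline: the feature names, the synthetic data, and the choice of a random forest are all assumptions introduced here.

```python
# Minimal sketch (illustrative assumptions, not the authors' implementation):
# SHAP feature attributions for a tree-based classifier trained on
# hypothetical IoT sensor readings (temperature, humidity, gas).
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["temperature", "humidity", "gas"]

# Synthetic readings: fire-like samples have high temperature/gas, low humidity.
X_normal = rng.normal(loc=[25.0, 50.0, 200.0], scale=[3.0, 10.0, 50.0], size=(500, 3))
X_fire = rng.normal(loc=[70.0, 20.0, 900.0], scale=[10.0, 8.0, 150.0], size=(500, 3))
X = np.vstack([X_normal, X_fire])
y = np.array([0] * 500 + [1] * 500)  # 0 = normal, 1 = fire/smoke event

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])

# Per-sample attributions: how much each sensor pushed the prediction
# toward the "fire" class relative to the model's expected output.
print(feature_names)
print(shap_values)
```

Analogously, LIME would be applied to individual frames from the YOLO detectors by perturbing superpixels and fitting a local surrogate model; the exact wiring depends on the detector's output format and is therefore omitted here.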