Road traffic accidents are a critical global public health problem: according to the World Health Organization's 2023 report, they cause an estimated 1.19 million deaths each year and cost most countries the equivalent of 3% of GDP. Traditional forensic analysis methods are significantly limited in latency, coverage and scalability. This study proposes an Intelligent Multimodal Framework comprising four sequential phases: object detection, scene classification, visual understanding using vision-language models, and forensic report generation using large language models. All phases were evaluated on a specialised multimodal dataset purpose-built from heterogeneous sources. The models for each phase were fine-tuned independently on a platform with two NVIDIA RTX A4500 GPUs. To select the optimal configuration from the 320 combinations in the factorial search space formed by the candidate models for each phase, the five-layer MODM-MCDM Hybrid Protocol was applied, combining seven multi-criteria decision-making (MCDM) methods with Analytic Hierarchy Process (AHP) weighting across 49 normalised criteria. The results identified two deployment configurations: S1, which offers maximum performance, and S2, which offers maximum methodological robustness.
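As a rough illustration of the selection step, the sketch below enumerates a small factorial search space, derives criterion weights from an AHP pairwise comparison matrix, and ranks the resulting pipelines. It is not the MODM-MCDM Hybrid Protocol itself. The model names, scores, criteria and pairwise judgement are hypothetical placeholders. The row geometric mean is used as a common approximation of the AHP priority vector, and simple additive weighting stands in for the seven MCDM methods, which are not named here.

```python
from itertools import product
from math import prod


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric mean method."""
    n = len(pairwise)
    geo_means = [prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def rank_by_weighted_sum(scores, weights):
    """Rank alternatives by simple additive weighting (higher is better).

    `scores` maps each alternative to its normalised criterion values in [0, 1].
    """
    totals = {
        alt: sum(w * v for w, v in zip(weights, vals))
        for alt, vals in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical candidates per phase (placeholders, not the study's models).
phases = {
    "detection": ["det_a", "det_b"],
    "classification": ["cls_a", "cls_b"],
}

# Hypothetical normalised scores per model: (accuracy, speed).
model_scores = {
    "det_a": (0.9, 0.4), "det_b": (0.7, 0.8),
    "cls_a": (0.8, 0.6), "cls_b": (0.6, 0.9),
}

# Assumed pairwise judgement: accuracy is 3 times as important as speed.
weights = ahp_weights([[1.0, 3.0], [1.0 / 3.0, 1.0]])

# Factorial search space: one candidate per phase; each criterion is
# averaged over the phases of the pipeline (an assumption of this sketch).
pipeline_scores = {}
for combo in product(*phases.values()):
    per_criterion = zip(*(model_scores[m] for m in combo))
    pipeline_scores[combo] = [sum(vals) / len(combo) for vals in per_criterion]

ranking = rank_by_weighted_sum(pipeline_scores, weights)
print(ranking[0])
```

With four phases and more candidates per phase, the same `product` call would enumerate the full search space; the size of 320 reported above depends on per-phase candidate counts that are not given here.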