Intrusion detection in microgrid systems is a cyber-physical task that requires correlating heterogeneous data from networks, hosts, and endpoints to produce actionable evidence. Existing approaches largely treat intrusion detection as a classification problem and provide explanations at the sample or feature level. However, these explanations lack physical interpretability and fail to reveal the cross-modal interactions underlying system decisions. As a result, operators cannot reliably trace detected anomalies to the physical layer, which limits their ability to diagnose root causes, leads to incorrect or delayed responses, and can compromise the safety of microgrid operations. This work proposes an explainable intrusion detection framework for the physical and data-link layers, based on cross-modal evidence reasoning. The framework reformulates intrusion detection as an operational Q\&A task over structured multi-modal evidence, including network flows, Software-Defined Networking (SDN) states, system calls, and power measurements. An evidence-based explanation mechanism aligns sample importance with structured evidence and aggregates it by physical modality to construct evidence representations. These representations are then transformed into structured features to build joint decision models, from which decision paths are extracted and converted into interpretable reasoning processes grounded in physical evidence. The proposed framework is evaluated on realistic cyber-physical microgrid datasets. It provides consistent and physically meaningful explanations and reveals distinct cross-modal evidence patterns across different cyber attacks. This work advances intrusion detection from sample-level explanation to physical-layer reasoning, enabling trustworthy security analysis in microgrid systems.