Motivation: Gene regulatory network (GRN) inference from single-cell RNA-seq (scRNA-seq) data remains hampered by technical noise, high false-positive rates, and high computational cost. Existing methods often require hours or days to process developmental datasets, yet they fail to capture the physical and topological constraints on regulatory interactions that are essential for accurate regulatory mapping.

Results: We developed AENetMoX, a multimodal autoencoder that integrates transcriptomic correlations with transcription factor (TF) binding motifs and protein-protein interaction (PPI) networks. We evaluated AENetMoX against SCENIC, GRNBoost2, and CLR across 48 independent configurations using three human brain organoid lineages. At K=100, AENetMoX achieved 7.7±5.1% precision (95.7% relative improvement over SCENIC; p=1.039×10^(-3), one-sided Wilcoxon test; Vargha-Delaney Â=0.635). ChIP-seq validation showed 51.6% precision (+9.5% over SCENIC, p=0.141, Â=0.500) and a 168.4% improvement in F1 over SCENIC (p=2.883×10^(-9), Â=0.854). Ablation studies identified PPI integration as the primary driver of performance: it increased precision by 492% (~5.9×) and ChIP-seq recovery precision by 198% (~3×) relative to the expression-only variant. AENetMoX also completes inference in under 5 minutes, a 24-fold speedup over SCENIC (~2 hours) and far faster than CLR (>12 hours). Analysis of novel predictions identified 334 unique regulatory edges, including temporally persistent SMAD3 and SOX2 hubs that remain stable across multiple developmental stages.

Availability: Source code is available at https://github.com/Shirshak52/AENetMoX. All datasets and databases used are available at OSF (Project DOI: https://doi.org/10.17605/OSF.IO/K6EHW).
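To make the reported quantities concrete, the following minimal Python sketch shows how precision at K, the percent-to-fold conversion used in the ablation figures, and the Vargha-Delaney Â effect size can be computed. It is an illustration only and is not taken from the AENetMoX source code: the function names are hypothetical, and it assumes that K denotes the number of top-ranked predicted edges scored against a reference edge set.

```python
from typing import Sequence, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target); hypothetical representation


def precision_at_k(ranked_edges: Sequence[Edge], reference: Set[Edge], k: int) -> float:
    """Fraction of the top-k ranked edges found in the reference set.

    Assumes K means "number of top-ranked predictions evaluated".
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_edges[:k]
    if not top:
        return 0.0
    hits = sum(1 for edge in top if edge in reference)
    return hits / len(top)


def pct_increase_to_fold(pct_increase: float) -> float:
    """Convert a percent increase into a fold change (492% -> about 5.9x)."""
    return 1.0 + pct_increase / 100.0


def vargha_delaney_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Vargha-Delaney A: probability that a value from x exceeds one from y,
    counting ties as half. A = 0.5 indicates no stochastic difference."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))
```

Under this conversion, a 492% increase corresponds to roughly 5.9× and a 198% increase to roughly 3×, matching the fold changes quoted for the ablation study.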