Assessing human embryo quality is a critical step in in vitro fertilization (IVF), yet traditional manual grading remains subjective and is physically limited by the shallow depth of field of conventional microscopy. This study develops a novel "soft optical sensor" architecture that turns a standard optical microscope into an automated, high-precision instrument for embryo quality assessment. The system integrates two computational components: 1) a multi-focal image fusion module that recovers morphological details lost to defocus by combining the focal planes of a Z-stack, yielding a 3D-aware representation from 2D inputs; and 2) a retrieval-augmented generation (RAG) framework coupled with a Swin Transformer, which provides both high-accuracy classification and explainable clinical rationales. Validated on a large-scale clinical dataset of 102,308 images (prior to augmentation), the system achieves a diagnostic accuracy of 94.11%, surpassing standard single-plane analysis methods by over 10% and underscoring the value of fusing multi-focal data. Furthermore, the RAG module grounds model predictions in the ESHRE consensus guidelines and generates natural-language explanations. These results demonstrate that the soft sensor approach significantly reduces inter-observer variability and offers a viable pathway toward fully automated, transparent embryo evaluation in clinical settings.
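The abstract does not specify how the fusion module combines focal planes. A classical way to build an all-in-focus image from a Z-stack is to score local sharpness in each plane and, at every pixel, keep the value from the sharpest plane. The sketch below implements that baseline, using squared-Laplacian energy averaged over a small window as the focus measure. It is an illustrative stand-in, not the authors' module (which may well be learned). The function names `laplacian_energy` and `fuse_zstack`, the `window` parameter, and the assumption of a grayscale `(Z, H, W)` input array are all hypothetical.

```python
import numpy as np


def laplacian_energy(plane: np.ndarray, window: int = 5) -> np.ndarray:
    """Hypothetical focus measure: squared 4-neighbour Laplacian, box-averaged.

    Sharp (in-focus) regions have strong second derivatives, so they score high;
    flat or blurred regions score low.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    p = np.pad(plane.astype(np.float64), 1, mode="edge")
    # Discrete Laplacian: up + down + left + right - 4 * centre.
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
           - 4.0 * p[1:-1, 1:-1])
    energy = lap ** 2
    # Average over a window x window neighbourhood so the per-pixel
    # plane choice is locally smooth rather than noisy.
    r = window // 2
    e = np.pad(energy, r, mode="edge")
    h, w = plane.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(window):
        for dx in range(window):
            out += e[dy:dy + h, dx:dx + w]
    return out / (window * window)


def fuse_zstack(stack: np.ndarray, window: int = 5) -> tuple[np.ndarray, np.ndarray]:
    """Fuse a (Z, H, W) focal stack into one all-in-focus (H, W) image.

    Returns the fused image and the per-pixel index of the plane it came from.
    """
    if stack.ndim != 3:
        raise ValueError("expected a (Z, H, W) stack")
    focus = np.stack([laplacian_energy(z, window) for z in stack])  # (Z, H, W)
    best = np.argmax(focus, axis=0)                                 # (H, W)
    # Gather, at each pixel, the intensity from the sharpest plane.
    fused = np.take_along_axis(stack, best[None, ...], axis=0)[0]
    return fused, best
```

For example, given two planes where one is textured (in focus) on the left half and the other on the right half, `fuse_zstack` selects plane 0 on the left and plane 1 on the right, away from the seam where the window straddles both regions. Hard per-pixel selection is the simplest design. Weighted blending of planes or a learned fusion network would avoid seams at the cost of more computation.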