Embedded GPUs with unified memory (UMA) often hit the memory wall: modern restoration and segmentation pipelines generate heavy DRAM traffic and suffer long-tail latency jitter. We present MW-DSNet (Memory-Wall-aware Dual-Stream Network), a latency-deterministic hardware–software co-design that combines roofline-based diagnosis, DRAM traffic accounting, and activation-bounding deployment rules with a static-shape TensorRT pipeline and a lightweight Sigmoid-based Inverted-Parabola Attention Module (IP-SIAM). On Jetson Orin Nano (15 W), MW-DSNet sustains 720p at 30 FPS with a P95 latency of 35.1 ms and reduces per-frame DRAM traffic by 3–9× relative to transformer and diffusion baselines under fixed power and clock settings. The 30 FPS and 35.1 ms (P95) figures are measured on the visual restoration stage only; the downstream segmentation head is evaluated separately to isolate restoration-induced robustness gains. The resulting Design Rules provide practical guidance for deterministic real-time perception on memory-wall-bounded edge GPUs.
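As a rough illustration of the two measurements named above, the following minimal sketch shows how a roofline check and a P95 latency figure might be computed. It is not the authors' tooling: the function names, the peak compute and bandwidth values, and the per-layer numbers are all hypothetical placeholders, and the percentile uses the nearest-rank definition, which may differ from the one used in the paper.

```python
import math

# Hypothetical device ceilings (placeholders, not measured Jetson values).
PEAK_FLOPS = 1.0e12   # peak compute, FLOP/s
PEAK_BW = 50.0e9      # peak DRAM bandwidth, bytes/s


def arithmetic_intensity(flops, dram_bytes):
    """FLOPs performed per byte moved to or from DRAM."""
    return flops / dram_bytes


def attainable_flops(intensity, peak_flops, peak_bw):
    """Roofline bound: the lower of the compute and bandwidth ceilings."""
    return min(peak_flops, intensity * peak_bw)


def is_memory_bound(intensity, peak_flops, peak_bw):
    """A kernel left of the ridge point (peak_flops / peak_bw) is memory-bound."""
    return intensity < peak_flops / peak_bw


def percentile_nearest_rank(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical layer: 2 GFLOP of work and 400 MB of DRAM traffic.
    ai = arithmetic_intensity(2.0e9, 400e6)
    print("arithmetic intensity (FLOP/byte):", ai)
    print("memory-bound:", is_memory_bound(ai, PEAK_FLOPS, PEAK_BW))
    print("attainable FLOP/s:", attainable_flops(ai, PEAK_FLOPS, PEAK_BW))

    # Hypothetical per-frame latencies in milliseconds.
    latencies_ms = [31.0, 32.5, 30.8, 33.1, 35.0, 31.7, 32.2, 34.4, 31.1, 32.9]
    print("P95 latency (ms):", percentile_nearest_rank(latencies_ms, 95))
```

In this framing, a layer whose arithmetic intensity falls below the ridge point is limited by DRAM bandwidth rather than compute, which is why per-frame DRAM traffic, and not FLOP count alone, is the quantity to reduce.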