Urban air pollution, particularly fine particulate matter (PM2.5), demands scalable monitoring approaches that go beyond the sparse networks of reference-grade stations installed in most cities worldwide. Over the past decade, deep learning methods applied to ground-level photographs have emerged as a low-cost complement to conventional sensing: convolutional neural networks (CNNs) and hybrid architectures can extract visually detectable pollution cues (haze opacity, color temperature shifts, and reduced horizon contrast) and map them onto continuous air-quality index (AQI) or PM2.5 estimates. This review synthesizes 40 primary studies and several additional supporting sources published between 2020 and 2025 to characterize the state of the art in image-based AQI estimation, identify its key technical and infrastructural limitations, and outline research directions relevant to data-scarce, under-monitored cities such as Bishkek, Kyrgyzstan. Three interlocking themes structure the review: (1) deep learning architectures and training strategies, from single-modality CNNs to multimodal and spatiotemporal hybrid models; (2) dataset characteristics and their decisive influence on regression accuracy; and (3) the monitoring infrastructure gap in low- and middle-income cities of Central Asia and comparable regions. The evidence consistently shows that positive R² values (that is, models that outperform a baseline predicting the mean of the observed values) require at least 3,000–5,000 labeled image–pollutant pairs, controlled temporal stratification, and, ideally, auxiliary meteorological inputs. Promising directions include vision transformers, structured state-space models, Grad-CAM-based interpretability, and cross-city transfer learning. The review concludes with a structured research agenda for image-based air-quality monitoring in Central Asia.