Preprint
Article

This version is not peer-reviewed.

Convolutional Neural Networks: Biological Foundations, Hidden Limitations, and Future Directions

Submitted: 15 May 2026

Posted: 19 May 2026

Abstract
Convolutional neural networks (CNNs) have transformed visual recognition, yet robust geometric reasoning, reliable out-of-distribution generalization, and recognition from limited data remain substantially unsolved. CNNs draw their architectural inspiration from the mammalian visual cortex, but the translation from biology to engineering was selective and in places imprecise, and those imprecisions have well-documented consequences. This paper examines where the biological fidelity holds and where it gives way, grounding the analysis in formal results that predate deep learning and in recent empirical findings on CNN failure modes. We identify three diagnosable architectural limitations. First, CNNs conflate visual modalities that the biological system separates structurally at the lateral geniculate nucleus, feeding raw RGB pixels into a single undifferentiated filter bank and entangling orientation, color, and texture signals from the first layer onward. Second, CNNs repeat a spatial subsampling operation across the full depth of the network, far beyond the early visual cortex stages where it has biological warrant. Barnard and Casasent established formally in 1990 that this operation discards positional information irreversibly at every layer where it is applied, and repeating it into regions that correspond to V4 and inferotemporal cortex compounds this loss without the compensating transition to qualitatively different computations that the biological hierarchy performs. Third, the pooling-as-complex-cell analogy that motivated this design reflects a misreading of what complex cells compute. The spatiotemporal energy model formalizes complex cell behavior as geometry extraction: it robustly detects the presence and orientation of local edge structure while abstracting over photometric accidents of contrast polarity and sub-wavelength phase that are not geometrically meaningful. Pooling attempts something categorically different, namely object-level position invariance for recognition through spatial subsampling, and it achieves that goal by discarding exactly the geometric information that the energy model preserves. Treating pooling as an approximate or scaled-up implementation of the energy model conflates two operations that differ not in degree but in kind. Crucially, this reading also removed the principled criterion for confining the S-C operation to early visual cortex: because pooling was understood as a general-purpose invariance mechanism rather than an approximation of a first-stage geometry extractor with a natural biological endpoint at V3, the field had no architectural reason to stop repeating it. We survey how capsule networks, group-equivariant CNNs, PDE-based networks, and vision transformers each address one or two of these limitations while leaving the others intact. We propose six desiderata that a more biologically complete architecture would need to satisfy, and argue that satisfying them requires treating the visual cortex’s solution as a coherent package in which each component depends on the others working correctly, rather than as a menu of independently selectable principles.
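
The contrast between the energy model and pooling can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation. It assumes a quadrature pair of Gabor filters as the energy-model front end and non-overlapping max pooling as the subsampling operation. The function names (gabor_pair, energy, grating, max_pool_1d) and all parameter values are hypothetical, and spatial frequency stands in for orientation because the example is 1-D.

```python
import numpy as np

def gabor_pair(size=33, wavelength=8.0, sigma=5.0):
    """Even (cosine) and odd (sine) 1-D Gabor filters in quadrature.

    size, wavelength, and sigma are illustrative values, not from the paper.
    """
    x = np.arange(size) - size // 2
    envelope = np.exp(-x**2 / (2.0 * sigma**2))
    phase = 2.0 * np.pi * x / wavelength
    return envelope * np.cos(phase), envelope * np.sin(phase)

def energy(signal, even, odd):
    """Energy-model response: sum of squared quadrature-filter outputs."""
    return float(np.dot(signal, even) ** 2 + np.dot(signal, odd) ** 2)

def grating(size=33, wavelength=8.0, phase=0.0, contrast=1.0):
    """1-D sinusoidal grating centred on the filter."""
    x = np.arange(size) - size // 2
    return contrast * np.cos(2.0 * np.pi * x / wavelength + phase)

def max_pool_1d(x, window=2):
    """Non-overlapping max pooling; len(x) must be a multiple of window."""
    return np.asarray(x, dtype=float).reshape(-1, window).max(axis=1)

even, odd = gabor_pair()

# Energy barely changes as the grating's phase shifts within a wavelength,
# and is unchanged when contrast polarity is flipped.
e_phase = [energy(grating(phase=p), even, odd) for p in np.linspace(0, np.pi, 5)]
e_flipped = energy(grating(contrast=-1.0), even, odd)

# The response still depends on the structure itself: a grating at a
# mismatched spatial frequency produces far lower energy.
e_mismatch = energy(grating(wavelength=4.0), even, odd)

# Max pooling maps a feature at two different positions to the same output,
# so the original position cannot be recovered from the pooled signal.
a = max_pool_1d([0, 1, 0, 0])
b = max_pool_1d([1, 0, 0, 0])

print("energy across phases:", e_phase)
print("energy, flipped polarity:", e_flipped)
print("energy, mismatched frequency:", e_mismatch)
print("pooled outputs:", a, b)
```

Under these assumptions, the energy response discards only phase and polarity while remaining selective for the underlying structure, whereas the pooled outputs for the two shifted inputs coincide. This is the many-to-one loss of position that the abstract attributes to repeated subsampling.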
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated