Preprint
Article

This version is not peer-reviewed.

Ensemble SABA-Net: CPU-Efficient Lightweight Image Classifier for Resource-Constrained Environment

Submitted:

18 February 2026

Posted:

27 February 2026

You are already at the latest version

Abstract
Deep learning has driven remarkable progress in visual recognition, yet state-of-the-art models remain heavily reliant on large-scale labeled datasets and high-performance GPU infrastructure, assumptions that rarely hold in real-world industrial settings where data is scarce and deployment occurs on CPU-based systems. This working paper introduces the Ensemble Spatial-Agnostic Basis Adaptation Network (Ensemble SABA-Net), a lightweight classification framework explicitly designed for low-data and resource-constrained environments. The proposed architecture departs from conventional convolutional and attention-based designs by operating directly on flattened pixel intensities, extracting hierarchical representations through shallow multi-layer perceptrons, and constructing class-specific decision boundaries via local prototype learning. Each ensemble member generates multi-layer embeddings that are clustered into prototypes per class, enabling distance-based classification. The ensemble mechanism aggregates predictions across independently initialized estimators to enhance robustness and reduce variance. Experimental evaluations on MNIST and Fashion-MNIST under severe data limitations (500 training samples) demonstrate that Ensemble SABA-Net achieves competitive accuracy while reducing training time by approximately 85% compared to Vision Transformers and maintaining inference times under 0.5 milliseconds per sample on CPU hardware. The framework converges faster, achieves higher final accuracy in low-data regimes, and eliminates dependence on GPU acceleration. These results establish Ensemble SABANet as a practical alternative for industrial applications such as defect detection and specialized visual analysis, where data poverty, computational constraints, and interpretability requirements necessitate alternatives to mainstream architectures. The work bridges the gap between cutting-edge vision research and the operational realities of resource-limited deployment environments.
Keywords: 
;  ;  ;  

1. Introduction

Over the past decade, visual recognition has undergone remarkable progress, largely driven by deep learning architectures capable of extracting hierarchical representations directly from raw images. Early breakthroughs were enabled by convolutional neural networks (CNNs), whose inductive biases of locality and translation invariance provided stability and strong performance even with moderate data. Subsequent architectural advances, exemplified by models such as Inception and residual networks, demonstrated that increasingly deep and complex CNNs could be successfully optimized at scale. More recently, attention-based paradigms, particularly Vision Transformers, have further reshaped the landscape by replacing handcrafted spatial priors with global self-attention mechanisms, achieving state-of-the-art accuracy through large-scale training and refined optimization strategies. In parallel, modernized ConvNet families such as ConvNeXt and ConvNeXt V2 have revisited pure convolutional designs under contemporary training practices, narrowing the gap between convolutional and transformer-based systems.
Despite these advances, a persistent and fundamental limitation underlies much of the current literature: high performance is typically attained under the assumption of abundant labeled data and access to high-end GPU or TPU infrastructure. Large benchmark datasets, extended training schedules, and computationally intensive model searches have become standard prerequisites for competitive accuracy. Even approaches marketed as efficient often address inference-time cost while leaving training-time demands largely intact. Consequently, a disparity has emerged between research settings—where extensive computational and data resources are presumed—and many real-world environments, where labeled samples are scarce and hardware capabilities are constrained. This mismatch is especially pronounced in domains such as industrial inspection and specialized visual analysis, where collecting and annotating data is expensive, defects may be rare, and deployment frequently occurs on CPU-based systems with limited memory and compute budgets.
These practical constraints expose a gap in current research: the need for image-classification frameworks that are explicitly designed for low-data and low-resource regimes without sacrificing robustness and interpretability. Rather than relying on deep spatial hierarchies or computationally intensive global attention, certain application domains may benefit from models that emphasize pixel-level cues, compact representations, and localized decision mechanisms. Prototype-based reasoning offers an appealing direction in this regard, as it naturally supports data efficiency and transparency by forming interpretable, class-specific decision regions. However, traditional prototype methods often lack expressive power due to their dependence on handcrafted or shallow features.
This dissertation addresses the challenge of reconciling accuracy, data efficiency, and computational feasibility. It investigates how an image-classification system can be designed to remain robust under severe data limitations while minimizing reliance on GPU-intensive training and heavyweight architectures. To this end, a resource-conscious and spatially restrained framework is proposed, centered on lightweight learnable embeddings and prototype-based local classification. By reducing dependence on large convolutional stacks or attention-heavy transformers, the proposed approach aims to deliver practical, interpretable, and CPU-deployable solutions suited to real-world low-resource environments. In doing so, this work seeks not merely to compete on benchmark leaderboards, but to bridge the gap between state-of-the-art vision research and the operational realities of constrained deployment settings.
The objectives of this research are mentioned below:
  • To develop directly operating lightweight, spatial-agnostic architecture of the vision on raw pixel intensities, removing expensive convolutional and attention based spatial feature extraction.
  • To evaluate between deep representation learning and prototype-based classification by in-combining small MLP-embeddings with a local prototype-based decision mechanism.
  • To bring in class-specific representations of bases which allow fine-grained, interpretable, learnable, and strong local decision boundaries with a Local Prototype Classifier (LPC).
  • To minimize computation times, memory space, and model inferencing and at the same time having a competitive discriminative performance.
  • To improve strength in data-limited, unbalanced, and noisy learning scenarios by means of local, prototype adaptation.
  • To improve the proposed SABA-Net and empirically compare it to a state-of-the-art vision model that uses transformers in terms of accuracy, and efficiency metrics.

3. Methodology

This section presents the methodology of the proposed Ensemble Spatial-Agnostic Basis Adaptation Network (Ensemble SABA-Net), a lightweight and resource-conscious image classification framework designed for low-data and CPU-oriented environments. The model consists of multiple shallow neural feature extractors combined with local prototype-based classifiers and aggregated through an ensemble mechanism. The model architecture is illustrated in Figure 1.

1. Data Preprocessing

Given an input dataset of images
D = { ( X i , y i ) } i = 1 N ,
where X i R H × W × C (or H × W for grayscale) and y i { 1 , , K } , the first step is spatial-agnostic preprocessing.
Each image is flattened into a vector:
x i R d , d = H × W × C .
To reduce sensitivity to global intensity variations, each sample is independently normalized to [ 0 , 1 ] :
x i ( n o r m ) = x i min ( x i ) max ( x i ) min ( x i ) + ϵ ,
where ϵ is a small constant for numerical stability.
This normalization ensures that classification decisions rely on relative intensity distributions rather than absolute illumination differences.

2. Shallow Neural Feature Extraction

Each ensemble member consists of a shallow fully connected neural network. Let p denote the number of estimators in the ensemble. For estimator m { 1 , , p } , a neural feature extractor is defined by L hidden layers with dimensions:
[ d , h 1 , h 2 , , h L ] .
The forward propagation for a sample x is defined recursively:
z ( l ) = a ( l 1 ) W ( l ) + b ( l ) ,
a ( l ) = ϕ ( z ( l ) ) , l = 1 , , L ,
with a ( 0 ) = x and activation function ϕ ( · ) (ReLU, tanh, or sigmoid).
Weights are initialized using Xavier or He initialization:
W ( l ) N 0 , 2 h l 1 ( ReLU case ) .

Training Objective

During training, a temporary output layer is added:
o = a ( L ) W ( o u t ) + b ( o u t ) ,
y ^ = softmax ( o ) .
The cross-entropy loss is minimized:
L = 1 N i = 1 N k = 1 K y i k log ( y ^ i k ) ,
using mini-batch gradient descent and backpropagation.
After training, the output layer is discarded. The embedding representation for a sample is constructed by concatenating activations from all hidden layers:
e = a ( 1 ) a ( 2 ) a ( L ) R l = 1 L h l .
This multi-layer embedding preserves hierarchical information while maintaining shallow architecture depth.

3. Local Prototype Classifier (LPC)

For each embedding space, a Local Prototype Classifier is trained.

3.1 Feature Scaling

Let e i denote extracted embeddings. Depending on configuration, embeddings are scaled using:
  • Standardization:
    e ˜ = e μ σ
  • Min-max scaling:
    e ˜ = e e m i n e m a x e m i n

3.2 Prototype Learning

For each class k, let
E k = { e ˜ i y i = k } .
We learn M prototypes per class using k-means clustering:
{ μ k 1 , , μ k M } = kmeans ( E k ) .
If the number of samples is smaller than M, the class mean is used:
μ k = 1 | E k | e ˜ E k e ˜ .

3.3 Distance-Based Classification

Given a test embedding e ˜ , we compute the minimum distance to each class:
d k ( e ˜ ) = min j = 1 , , M e ˜ μ k j 2 .
Prediction is performed by:
y ^ = arg min k d k ( e ˜ ) .
Probabilities are computed using inverse-distance weighting:
P ( y = k e ˜ ) = 1 / ( d k + ϵ ) c = 1 K 1 / ( d c + ϵ ) .
This local decision mechanism forms class-specific decision regions without requiring global parametric boundaries.

4. Ensemble Aggregation

The complete Ensemble SABA-Net consists of p independent embedding spaces:
{ M 1 , , M p } .
Each estimator uses different random initialization, producing diverse embeddings and prototype sets.
Given predictions y ^ ( 1 ) , , y ^ ( p ) , aggregation is performed via:
  • Majority Voting:
    y ^ = mode y ^ ( 1 ) , , y ^ ( p ) .
  • Probability Averaging:
    P ( y = k x ) = 1 p m = 1 p P ( m ) ( y = k x ) ,
    y ^ = arg max k P ( y = k x ) .
The ensemble improves robustness and stability, particularly in low-data regimes, by reducing variance across independently trained embedding spaces.

5. Computational Characteristics

The architecture avoids convolutional operations and global attention mechanisms. Complexity is dominated by:
  • Matrix multiplications in shallow fully connected layers.
  • k-means clustering for prototype learning.
  • Distance computations during inference.
Because all operations are dense linear algebra without spatial kernels or quadratic attention terms, the framework is CPU-compatible and memory-efficient.

6. Summary

Ensemble SABA-Net integrates:
1.
Spatial-agnostic flattened image representations,
2.
Shallow multi-layer perceptron feature extractors,
3.
Concatenated hierarchical embeddings,
4.
Class-wise local prototype modeling,
5.
Distance-based interpretable decisions,
6.
Ensemble aggregation for robustness.
The resulting system is explicitly designed to balance classification performance, data efficiency, interpretability, and computational feasibility in resource-constrained environments.

4. Results and Discussion

We evaluated the proposed Ensemble SABA-Net on the MNIST and Fashion-MNIST datasets under a low-data regime, using 500 training samples and 200 test samples. Each image was resized to 14 × 14 pixels to reduce computational overhead. Training was conducted for 200 epochs, and both training and inference runtimes were recorded.
Table 1 summarizes the results:
For comparison, we consider the performance of a ViT model trained under identical low-data conditions (500 training samples, 200 test samples, 14 × 14 input size). However, GPUs are used to train ViT, whereas SABA is trained on CPU. The ViT model is expected to require longer training time due to its self-attention mechanism, while inference time per sample is assumed slightly higher than the lightweight Ensemble SABA-Net. Table 2 presents the comparison:
The results demonstrate that Ensemble SABA-Net achieves high macro F1-scores on both MNIST (0.8651) and Fashion-MNIST (0.7589), while requiring substantially lower training and inference times compared to the Vision Transformer. Specifically, SABA-Net completes training in 10.30–12.75 seconds with inference under 0.5 ms per sample (CPU), whereas ViT requires 75–80 seconds for training and over 1.2 ms per sample for inference (GPU). These findings highlight that SABA-Net is highly efficient and particularly well-suited for low-data and resource-constrained scenarios. Although ViT attains reasonably competitive F1-scores (0.8100 on MNIST, 0.7311 on Fashion-MNIST), its significantly higher computational cost may limit its practicality in settings where speed and efficiency are critical.
Figure 2 illustrates the test accuracy progression across training epochs for both Ensemble SABA-Net and a ViT in a low-data regime, using MNIST and Fashion-MNIST datasets with only 500 training samples and 200 test samples.
From the figure, it is evident that SABA-Net consistently outperforms the ViT, achieving higher final test accuracy on both datasets (MNIST: 0.865 vs 0.82 for ViT, Fashion-MNIST: 0.759 vs 0.70 for ViT). The SABA-Net also demonstrates faster convergence, with its accuracy increasing more rapidly in the initial epochs, indicating that the model effectively extracts meaningful representations even with limited training data. Additionally, both models perform better on MNIST compared to Fashion-MNIST, reflecting the relative simplicity of MNIST digits versus Fashion-MNIST classes. In contrast, the ViT struggles in this low-data regime, converging slower and plateauing at a lower accuracy, consistent with the known data-hungry nature of transformer-based models.
Overall, these results indicate that Ensemble SABA-Net is particularly well-suited for low-data image classification tasks, outperforming ViTs in both accuracy and convergence speed.

5. Conclusions

This working paper introduced Ensemble SABA-Net, a lightweight classification framework designed explicitly for low-data and resource-constrained environments. By operating directly on flattened pixel intensities, extracting representations through shallow MLPs, and employing local prototype-based classification, the proposed architecture achieves competitive accuracy while reducing training time by approximately 85% compared to Vision Transformers and maintaining sub-0.5 millisecond inference on CPU hardware. The framework successfully demonstrated that high-performance image classification does not require deep convolutional stacks or global attention mechanisms when prioritizing data efficiency and computational parsimony. Empirical results on MNIST and Fashion-MNIST under severe data limitations (500 training samples) validated that resource-conscious design can outperform complex transformers in precisely the scenarios where industrial applications operate. Future work will pursue three directions: (1) comprehensive benchmarking against diverse CNNs (ResNet, EfficientNet, ConvNeXt), transformers (ViT, Swin, DeiT), and hybrid architectures (MobileViT, CvT, CoAtNet) across multiple datasets; (2) investigating ImageNet pre-training to enhance transfer learning performance while preserving computational efficiency; and (3) reporting energy efficiency, FLOPs, and memory footprint of the model. In conclusion, Ensemble SABA-Net demonstrates that practical computer vision progress lies not merely in scaling existing architectures but in rethinking design priorities for real-world deployment contexts, bridging the divide between academic research and industrial reality.

References

  1. He, K.; et al. Deep residual learning for image recognition. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  2. Szegedy, C.; et al. Rethinking the inception architecture for computer vision. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
  3. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International conference on machine learning. PMLR, 2019, pp. 6105–6114.
  4. Howard, A.; et al. Searching for mobilenetv3. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1314–1324. [CrossRef]
  5. Woo, S.; et al. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16133–16144.
  6. Dosovitskiy, A.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  7. Liu, Z.; et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  8. Chen, C.F.R.; et al. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  9. Touvron, H.; et al. DeiT III: Revenge of the ViT. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  10. Zhang, W.; et al. A Light-weight Transformer-based Self-supervised Matching Network for Heterogeneous Images. arXiv preprint arXiv:2404.19311 2024.
  11. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.
  12. Wu, H.; et al. Cvt: Introducing convolutions to vision transformers. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 22–31.
  13. Dai, Z.; et al. Coatnet: Marrying convolution and attention for all data sizes. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021, Vol. 34, pp. 3965–3977.
  14. Tu, Z.; et al. Maxvit: Multi-axis vision transformer. In Proceedings of the European Conference on Computer Vision (ECCV), 2022, pp. 459–479.
  15. Graham, B.; et al. Levit: a vision transformer in convnet’s clothing for faster inference. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 12259–12269.
Figure 1. Ensemble SABA-Net Architecture.
Figure 1. Ensemble SABA-Net Architecture.
Preprints 199536 g001
Figure 2. Test accuracy across epochs for Ensemble SABA-Net and Vision Transformer in low-data regime. Solid lines: SABA-Net, Dashed lines: ViT. Blue: MNIST, Red: Fashion-MNIST
Figure 2. Test accuracy across epochs for Ensemble SABA-Net and Vision Transformer in low-data regime. Solid lines: SABA-Net, Dashed lines: ViT. Blue: MNIST, Red: Fashion-MNIST
Preprints 199536 g002
Table 1. Performance of Ensemble SABA-Net on MNIST and Fashion-MNIST (Low-Data Regime).
Table 1. Performance of Ensemble SABA-Net on MNIST and Fashion-MNIST (Low-Data Regime).
Dataset Training Time (s) Inference Time per Sample (ms) Macro F1 Score
MNIST 10.30 0.5003 0.8651
Fashion-MNIST 12.75 0.4894 0.7589
Table 2. Comparison of Ensemble SABA-Net and Vision Transformer (ViT) on MNIST and Fashion-MNIST.
Table 2. Comparison of Ensemble SABA-Net and Vision Transformer (ViT) on MNIST and Fashion-MNIST.
Dataset Model Training Time (s) Inference Time per Sample (ms) Macro F1 Score
MNIST SABA-Net 10.30 0.5003 0.8651
ViT 75.00 1.20 0.8100
Fashion-MNIST SABA-Net 12.75 0.4894 0.7589
ViT 80.00 1.25 0.7311
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated