Preprint
Article

This version is not peer-reviewed.

Exploring the Collaboration Between Vision Models and LLMs for Enhanced Image Classification

Submitted: 12 December 2025

Posted: 15 December 2025


Abstract
This paper introduces a task that combines vision and language models to improve image classification on the CIFAR-10 and CIFAR-100 benchmarks. The pipeline is decomposed into two stages: image classification followed by visual description generation. For the classification stage, we use BEiT and Swin models as state-of-the-art, task-specific components, selecting the strongest publicly available image classification checkpoints, which achieve 99.00% accuracy on CIFAR-10 and 92.01% on CIFAR-100. For dense, contextually rich descriptions, we use BLIP. These expert models perform well on their respective tasks with minimal noisy data. When BART is applied as a text classifier to the synthesized descriptions, it reaches new state-of-the-art accuracies of 99.73% on CIFAR-10 and 98.38% on CIFAR-100. Together, these results demonstrate that our integrated vision-language decomposition pipeline surpasses existing state-of-the-art results on these common benchmarks. The full framework, along with the classified images and generated datasets, is available at https://github.com/bhavyarupani/LLM-Img-Classification.
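To make the two-stage decomposition concrete, the following is a minimal sketch in Python of how a caption-then-classify pipeline of this kind can be wired together with Hugging Face pipelines. The specific checkpoints (Salesforce/blip-image-captioning-base, facebook/bart-large-mnli), the zero-shot formulation, and the example file name are illustrative assumptions, not the exact models or training setup used in the paper.

    from PIL import Image
    from transformers import pipeline

    # Stage 1: image -> descriptive caption with a BLIP captioning model
    # (checkpoint assumed for illustration).
    captioner = pipeline("image-to-text",
                         model="Salesforce/blip-image-captioning-base")

    # Stage 2: caption -> class label with a BART-based text classifier.
    # A zero-shot BART-MNLI checkpoint is used here as a stand-in; the paper
    # may instead fine-tune BART on the generated descriptions.
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    cifar10_labels = ["airplane", "automobile", "bird", "cat", "deer",
                      "dog", "frog", "horse", "ship", "truck"]

    def classify_image(path: str) -> str:
        image = Image.open(path).convert("RGB")
        caption = captioner(image)[0]["generated_text"]  # dense description
        result = classifier(caption, candidate_labels=cifar10_labels)
        return result["labels"][0]  # top-ranked class for the caption

    # Hypothetical usage:
    # print(classify_image("example_cifar_image.png"))

The design point illustrated here is that the final label is predicted from generated text rather than directly from pixels, which is what allows a text classifier such as BART to act as the last stage of the pipeline.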
Keywords: 
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.