Submitted:
08 December 2025
Posted:
09 December 2025
Abstract
Keywords:
1. Introduction
2. Methodology


3. Experiments
- 1) 5-classification models input: In this setting, we included code from 5 different image classification models from LEMUR. For each run, a random subset of 5 model snippets was sampled (excluding the one corresponding to the captioning model's own encoder, to avoid trivial copies). Examples of models whose snippets were used include EfficientNet-B0, ConvNeXt-T, DenseNet-121, VGG16, and . Each snippet typically consisted of a few key layers or blocks (e.g., a convolutional block with BatchNorm and ReLU from ResNet, or a self-attention block from ViT). The idea was to expose the LLM to a diverse set of architectural "ideas" in a small number of snippets.
- 2) 10-classification models input: Here, we doubled the number of snippets, providing 10 different classification model excerpts. The pool included the above models, as well as others such as MobileNetV2, Inception-v3, and SqueezeNet, aiming for even greater diversity. The prompt length increases significantly with 10 snippets, which could overwhelm the LLM's context or lead to confusion.
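The sampling step described in the two settings above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the pool contents, the function names `sample_snippets` and `build_prompt`, the prompt layout, and the choice of `"ResNet"` as the excluded encoder are all assumptions made for the example.

```python
import random

# Hypothetical, abbreviated stand-in for the LEMUR snippet pool:
# model name -> short code excerpt (placeholders, not real LEMUR code).
SNIPPET_POOL = {
    "ResNet": "# conv block with BatchNorm and ReLU",
    "ViT": "# self-attention block",
    "EfficientNet-B0": "# excerpt placeholder",
    "ConvNeXt-T": "# excerpt placeholder",
    "DenseNet-121": "# excerpt placeholder",
    "VGG16": "# excerpt placeholder",
    "MobileNetV2": "# excerpt placeholder",
    "Inception-v3": "# excerpt placeholder",
    "SqueezeNet": "# excerpt placeholder",
}


def sample_snippets(pool, own_encoder, k, seed=None):
    """Draw k distinct snippets, never the captioning model's own encoder."""
    # Sort the names so that a fixed seed gives a reproducible draw.
    candidates = sorted(name for name in pool if name != own_encoder)
    if k > len(candidates):
        raise ValueError(f"requested {k} snippets, only {len(candidates)} available")
    rng = random.Random(seed)
    return {name: pool[name] for name in rng.sample(candidates, k)}


def build_prompt(snippets):
    """Concatenate the sampled snippets into one prompt section (assumed layout)."""
    parts = [f"### Snippet from {name}\n{code}" for name, code in snippets.items()]
    return "\n\n".join(parts)


# Assumed example: the captioning model's encoder is taken to be ResNet.
five = sample_snippets(SNIPPET_POOL, own_encoder="ResNet", k=5, seed=0)
print(build_prompt(five))
```

With this abbreviated pool, requesting `k=10` raises a `ValueError`, since only eight candidates remain after the exclusion. The 10-snippet setting therefore needs a correspondingly larger pool.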
4. Results and Discussion
- Prefixes and counts.
| Models (Prefix) | Decoder Type | # Models |
|---|---|---|
| C1C-RESNETLSTM | LSTM | 1 |
| C5C-RESNETLSTM | LSTM | 100 |
| C10C-RESNETLSTM | GRU(+Attn) | 3 |
| C5C-ResNetTransformer | Transformer | 250 |
| C8C-ResNetTransformer | GRU (feat-init) | 3 |
| Total | | 357 |
- Main COCO results (val).
- Enhanced CNN Encoders: Many models retained a CNN backbone (often the same ResNet-50 from the baseline), as the LLM wasn't explicitly asked to change the entire backbone; it could, but often chose to augment rather than replace it. A common modification was to add a squeeze-and-excitation (SE) block or a CBAM () to the ResNet encoder path. For instance, one generated model inserted an SE module after the ResNet's final convolutional block, as inspired by a classification snippet that contained an SE implementation. This model () achieved a BLEU-4 score of , slightly lower than the best model, indicating that channel-wise attention effectively focused the image features.
- Alternative Encoders: In a few cases, the LLM did swap the encoder entirely. One notable architecture was a decoder model. The prompt included a snippet from ConvNeXt (a modern CNN architecture), and the LLM ultimately used ConvNeXt's stem and stages to construct an encoder, rather than ResNet. It then attached a Transformer decoder. This model trained successfully and achieved a BLEU-4 score of approximately . Qualitatively, it generated slightly more descriptive captions than the baseline (likely due to the stronger visual features of ConvNeXt), although its performance was not the highest.
- Recurrent vs. Transformer Decoders: Approximately half of the successful models utilised LSTM decoders, while the other half employed Transformer decoders. Interestingly, no GRU-based decoder was chosen by the LLM in our runs. Even though GRU was allowed, the LLM seemed to prefer either sticking to LSTM (perhaps because the baseline and example had LSTM) or going for the more complex Transformer for a potential performance boost. To check this, we also generated some models by explicitly requiring a GRU decoder in the prompt. The Transformer-based models generally yielded higher BLEU scores. Our top model, which we call , combined the ResNet-50 encoder (with some modifications) and a Transformer decoder with 768 hidden size and 8 attention heads. This model achieved a BLEU-4 score of , the highest among the generated models. It closely mirrored the structure of the ResNetTransformer example provided in LEMUR, but the LLM introduced a learnable positional encoding and set the number of decoder layers to 4 (compared to 6 in the LEMUR reference). The LSTM-based models tended to plateau at BLEU-4 in the range of –. We suspect the Transformer's multi-head attention was able to better utilize the encoder features, as expected (and indeed our prompt hinted to the LLM that multi-head attention can improve BLEU [7]).
- Attention Mechanisms: Transformer decoder models use attention by design. For the LSTM models, the LLM sometimes implemented a form of attention on the encoder output. In one example, the LLM added a simple additive attention: at each decoding timestep it learned a weight that multiplies the encoder's feature vector (in effect, a linear layer applied to the feature concatenated with the decoder's hidden state).
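One way to read the simple attention described in the last bullet is as a scalar gate on the encoder feature. The sketch below is a framework-free illustration under that reading, not the generated model's code: the function name `gated_feature`, the sigmoid squashing, and the scalar (rather than per-channel) weight are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function, used here to keep the gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_feature(feature, hidden, weights, bias=0.0):
    """Scale the encoder feature by a weight computed for this timestep.

    feature: encoder feature vector (list of floats)
    hidden:  decoder hidden state at the current timestep (list of floats)
    weights: parameters of a linear layer over concat(feature, hidden)
    """
    concat = list(feature) + list(hidden)
    if len(weights) != len(concat):
        raise ValueError("weights must match len(feature) + len(hidden)")
    # Linear layer on the concatenated vector gives a single score.
    score = sum(w * x for w, x in zip(weights, concat)) + bias
    gate = sigmoid(score)  # assumed squashing; the source does not specify one
    # The same scalar weight multiplies every entry of the feature vector.
    return gate, [gate * f for f in feature]


# Toy usage: a 3-dimensional feature and a 2-dimensional hidden state.
gate, context = gated_feature(
    feature=[1.0, 2.0, 3.0],
    hidden=[0.5, -0.5],
    weights=[0.0, 0.0, 0.0, 0.0, 0.0],
)
print(gate, context)
```

In an LSTM decoder, the returned `context` would be fed to the recurrent cell together with the current word embedding, and the gate would be recomputed from the new hidden state at every step.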
5. Conclusion
References
- Goodarzi, A.T.; Kochnev, R.; Khalid, W.; Qin, F.; Uzun, T.A.; Dhameliya, Y.S.; Kathiriya, Y.K.; Bentyn, Z.A.; Ignatov, D.; Timofte, R. LEMUR Neural Network Dataset: Towards Seamless AutoML. arXiv 2025, arXiv:2504.10552.
- Kochnev, R.; Khalid, W.; Uzun, T.A.; Zhang, X.; Dhameliya, Y.S.; Qin, F.; Vysyaraju, C.; Duvvuri, R.; Goyal, A.; Ignatov, D.; et al. NNGPT: Rethinking AutoML with Large Language Models. arXiv 2025, arXiv:2511.20333.
- DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948.
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the ACL, 2002.
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. In Proceedings of the CVPR, 2015.
- Vaswani, A.; Shazeer, N.; Parmar, N.; et al. Attention Is All You Need. In Proceedings of the NeurIPS, 2017.
- Kochnev, R.; Goodarzi, A.T.; Bentyn, Z.A.; Ignatov, D.; Timofte, R. Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning? In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2025; pp. 5664–5674.
- Gado, M.; Taliee, T.; Memon, M.D.; Ignatov, D.; Timofte, R. VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? arXiv 2025, arXiv:2504.19267.
- Rupani, B.; Ignatov, D.; Timofte, R. Exploring the Collaboration Between Vision Models and LLMs for Enhanced Image Classification. Dimensions 2025, 27.
- Khalid, W.; Ignatov, D.; Timofte, R. A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks. arXiv 2025, arXiv:2512.04329.
- Uzun, T.A.; Khalid, W.; Din, S.U.; Mulukuledu, S.R.; Singh, A.; Vysyaraju, C.; Duvvuri, R.; Goyal, A.; Lukhi, Y.R.; Hussain, A.; et al. LEMUR 2: Unlocking Neural Network Diversity for AI. arXiv preprint 2025.
- Mittal, Y.; Ignatov, D.; Timofte, R. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis. arXiv 2025, arXiv:2511.07329.


| Method / Family | BLEU-4 |
|---|---|
| Baseline ResNet+LSTM | 0.3246 |
| Baseline ResNet+Transf. | 0.2336 |
| C1C-RESNETLSTM | |
| C5C-RESNETLSTM | 0.1192 |
| C10C-RESNETLSTM | 0.0914 |
| C5C-ResNetTransformer | 0.0862 |
| C10C-ResNetTransformer | 0.0637 |
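For readers unfamiliar with the metric in the table above, the following is a minimal sketch of sentence-level BLEU-4 as defined by Papineni et al.: the geometric mean of clipped 1- to 4-gram precisions multiplied by a brevity penalty. It is an illustration only. The function names are hypothetical, no smoothing is applied, and the reported COCO scores are presumably corpus-level values from a standard evaluation toolkit rather than from code like this.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Unsmoothed sentence-level BLEU-4 for one candidate and its references."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        # Clip each candidate n-gram by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / 4)


# Toy usage with made-up captions (not COCO data).
reference = "a man rides a horse on the beach".split()
candidate = "a man rides a horse on a beach".split()
print(round(bleu4(candidate, [reference]), 4))
```

Because every n-gram order must match at least once, short or loosely worded captions often score exactly zero under this unsmoothed variant, which is one reason corpus-level aggregation is normally used for reporting.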
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).