Submitted: 17 March 2025
Posted: 18 March 2025
Abstract
Keywords:
1. Introduction
- Introduced a novel CLIP-based framework that employs zero-shot learning to accurately recognize and validate food items in food packaging, improving order accuracy.
- Achieved 92.92% precision and 76.65% recall in identifying food items from images, showcasing the model's image-to-text matching capabilities.
- Demonstrated an overall order accuracy of 85.86% in verifying that food packages match customer orders, highlighting the framework's potential for automated order validation.
2. Related Work
3. Material and Method
3.1. Food Segmentation
3.2. Contrastive Language-Image Pre-training (CLIP)
- $\mathrm{sim}(I, T)$ denotes the cosine similarity between the image and text embeddings.
- $\tau$ is a temperature parameter that scales the similarity scores.

Zero-shot classification with CLIP proceeds as follows:

- The image encoder maps the input image to its embedding $I$.
- The text encoder encodes each class description to obtain a set of text embeddings $\{T_1, \dots, T_K\}$, where $K$ is the number of potential classes.
- The similarity score $\mathrm{sim}(I, T_k)$ is computed between the image embedding and the text embedding of each class $k$.
- The class whose text embedding has the highest similarity score with the image embedding is predicted. The predicted class is given by:

$$\hat{y} = \arg\max_{k \in \{1, \dots, K\}} \mathrm{sim}(I, T_k)$$

CLIP's ability to correlate textual descriptions with images enables it to generalize well across various domains. This procedure allows CLIP to perform classification tasks, including identifying food items, without having been explicitly trained on those classes.
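The zero-shot steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the framework's implementation: the three-dimensional embeddings, the class names, and the function names `l2_normalize` and `zero_shot_predict` are hypothetical stand-ins for the outputs of the real CLIP encoders, and the temperature value 0.07 is an assumed default.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def zero_shot_predict(image_emb, text_embs, tau=0.07):
    """Return the index of the best-matching class and the class probabilities.

    image_emb: array of shape (D,), the image embedding I.
    text_embs: array of shape (K, D), one text embedding T_k per class.
    tau: temperature that scales the similarity scores (assumed value).
    """
    I = l2_normalize(np.asarray(image_emb, dtype=float))
    T = l2_normalize(np.asarray(text_embs, dtype=float))
    sims = T @ I                   # cosine similarities sim(I, T_k), shape (K,)
    logits = sims / tau            # temperature scaling
    logits = logits - logits.max() # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(sims)), probs


# Toy embeddings standing in for real encoder outputs (hypothetical values).
class_names = ["pasta", "salad", "rice"]
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.2, 0.9, 0.1])

pred, probs = zero_shot_predict(image_emb, text_embs)
print(class_names[pred], probs.round(3))
```

Because the softmax is monotonic, taking the argmax of the raw similarities or of the temperature-scaled probabilities selects the same class; the temperature only changes how peaked the probabilities are.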
3.3. CLIP for Food Recognition
Algorithm 1: Order validation algorithm
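As a rough illustration of what validating an order can involve, the sketch below compares the multiset of recognized food labels with the multiset of ordered items. It is an assumption about one plausible check, not the authors' Algorithm 1; the function name `validate_order` and the item labels are hypothetical.

```python
from collections import Counter


def validate_order(ordered_items, recognized_items):
    """Compare ordered and recognized items as multisets of labels.

    Returns a dict with a validity flag plus the items that are missing
    from the package and the items that were recognized but not ordered.
    """
    ordered = Counter(ordered_items)
    recognized = Counter(recognized_items)
    missing = ordered - recognized      # ordered but not found in the package
    unexpected = recognized - ordered   # found in the package but not ordered
    return {
        "valid": not missing and not unexpected,
        "missing": dict(missing),
        "unexpected": dict(unexpected),
    }


# Hypothetical order and recognition output.
result = validate_order(
    ordered_items=["pasta", "salad", "salad"],
    recognized_items=["salad", "pasta", "rice"],
)
print(result)
```

Counting labels rather than comparing sets keeps duplicate items distinct, so an order with two salads is not satisfied by a package containing only one.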
3.4. Dataset

4. Experiments
5. Results and Analysis
6. Conclusion



| Model | Images | Labels | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| YOLOv8 | 250 | food | 99.00% | 98.90% | 98.90% | 95.90% |
| Mask R-CNN | 250 | food | 98.20% | 84.00% | 84.40% | 84.30% |

| Dataset | Model | Images | Accuracy | Precision | Recall |
|---|---|---|---|---|---|
| Food101 | CLIP | 25250 | 95.40% | 96.31% | 94.43% |
| FR (EN) | CLIP | 3850 | 85.40% | 92.92% | 76.65% |
| FR (IT) | CLIP | 3850 | 71.40% | 69.64% | 75.87% |

| Dataset | Model | Containers | Accuracy |
|---|---|---|---|
| FR (EN) | YOLO + CLIP | 1000 | 85.86% |
| FR (IT) | YOLO + CLIP | 1000 | 58.71% |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).