Reliable visual characterization of food composition is a fundamental prerequisite for image-based dietary assessment and health-oriented food analysis. Fine-grained food recognition is hampered by large intra-class variation and small inter-class differences: visually similar dishes differ only subtly in ingredient composition, spatial distribution, and structural organization, yet these differences are closely associated with distinct nutritional characteristics and health relevance. Capturing such composition-related visual structures in a non-invasive manner remains challenging.
In this work, we propose a fine-grained food classification framework that enhances spatial relation modeling and key-region awareness to improve discriminative feature representation. The proposed approach strengthens sensitivity to composition-related visual cues while suppressing background interference. We further introduce a lightweight multi-branch fusion strategy for stable integration of heterogeneous features. To support reliable classification under large intra-class variation, we also design a token-aware, subcenter-based classification head.
The proposed framework is evaluated on the public FoodX-251 and UEC Food-256 datasets, achieving accuracies of 82.28% and 82.64%, respectively. Beyond benchmark performance, the framework is designed to support practical image-based dietary analysis under real-world dining conditions, where variations in appearance, viewpoint, and background are common. It recognizes the same food category stably across diverse acquisition conditions and discriminates accurately among visually similar dishes with different ingredient compositions. The resulting food characterization is reliable enough for dietary interpretation, supporting practical dietary monitoring and health-oriented food analysis applications.
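To illustrate the general idea behind a subcenter-based classification head, the following is a minimal sketch and not the proposed implementation. It assumes the common formulation in which each class keeps several prototype vectors (subcenters) and the class score is the maximum cosine similarity over them, so that one class can cover several appearance modes. The function names, the toy prototypes in `SUBCENTERS`, and the `scale` parameter are hypothetical, and the token-aware component is not modeled here because its design is not specified above.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def subcenter_logits(feature, subcenters, scale=1.0):
    """Score each class by its best-matching subcenter.

    subcenters[c] is a list of prototype vectors for class c (assumed layout).
    """
    return [
        scale * max(cosine(feature, center) for center in centers)
        for centers in subcenters
    ]


def predict(feature, subcenters):
    """Return the index of the class with the highest subcenter score."""
    logits = subcenter_logits(feature, subcenters)
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical toy setup: 2 classes, 2 subcenters each, 2-D features.
SUBCENTERS = [
    [[1.0, 0.0], [0.0, 1.0]],    # class 0: two distinct appearance modes
    [[-1.0, 0.0], [0.0, -1.0]],  # class 1: two distinct appearance modes
]

if __name__ == "__main__":
    print(predict([0.1, 0.9], SUBCENTERS))
    print(predict([-0.8, -0.2], SUBCENTERS))
```

Taking the maximum over subcenters means a sample only needs to match one prototype of its class, which is one way to tolerate large intra-class variation; a trained head would learn the prototypes rather than fix them by hand.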