1. Introduction
With the accelerating pace of modern life, dietary patterns have become increasingly diverse, accompanied by substantial visual variability in food presentation across regions, cooking styles, and consumption contexts. Such variability poses significant challenges for fine-grained food recognition, especially when visually similar food categories exhibit subtle yet discriminative differences in ingredient dominance and spatial organization, which in turn affect the reliability of image-based dietary assessment and food-related health analysis.
An ideal dietary analysis approach should be efficient, convenient, and accurate. However, traditional manual recording approaches are often tedious, error-prone, and highly dependent on users’ subjective judgment, making them difficult to apply in daily life and limiting their scalability in practical food monitoring scenarios. Consequently, recent studies have incorporated computer vision and deep learning techniques into dietary image analysis, enabling tasks such as automatic food category recognition and, in some cases, calorie-related analysis based on photographs captured in real-world dining environments [1]. These image-based approaches can lower the usage threshold [2] and improve the efficiency of dietary data acquisition, motivating further research on reliable fine-grained food recognition in health-related contexts. Against this background, fine-grained food recognition has gradually become a research hotspot in the field of intelligent nutrition analysis [3]. When dealing with diverse and fine-grained food image data, improving a model’s ability to recognize visually similar food items has become an urgent problem to address [4].
Compared with traditional fine-grained tasks involving animals or vehicles, food images pose greater challenges due to significant intra-class variations and inter-class visual overlaps caused by ingredient composition, cooking processes, and presentation styles. Specifically, food images often exhibit intra-class diversity [5], where the same dish may appear with noticeable differences in color, shape, and background due to variations in region, cooking style, or shooting angle [6]. Meanwhile, there also exists inter-class similarity: certain stir-fried dishes or desserts made from similar ingredients or arranged in similar plating styles present highly comparable visual appearances, making category distinction much more difficult [7] and resulting in dispersed intra-class distributions and small inter-class distances in the feature space.
A number of large-scale datasets dedicated to food recognition have been successively introduced [8], such as Food-101 [9], UEC Food-256, and FoodX-251 [10]. The release of these datasets has not only accelerated research on fine-grained food classification models but also revealed the complexity and challenges of this task in real-world scenarios, where complex backgrounds, diverse plating patterns, and fine-grained visual ambiguity are commonly observed [11]. In professional fields such as medical nutrition and health monitoring, accurate food recognition serves as an important prerequisite for subsequent dietary analysis and decision support. Therefore, developing models capable of accurately identifying and distinguishing fine-grained food categories is not merely a technical breakthrough, but also key to improving the applicability, reliability, and trustworthiness of image-based dietary analysis systems [12].
Although mainstream models such as Swin-DR have demonstrated promising recognition capabilities in fine-grained food image classification, their fundamental architectures—based on local window modeling and multi-scale feature fusion—only partially improve feature representation [13], while significant limitations remain in spatial perception and key-region modeling [14].
On the one hand, the Swin Transformer has inherently limited spatial modeling capacity, making it difficult to effectively capture the complex variations in structure and arrangement within the same food category. For example, although both are labeled “sushi”, rolled sushi usually appears in a compact strip arrangement, whereas scattered sushi distributes rice and toppings freely across the plate. Such global layout differences are difficult to model adequately through local attention alone, thereby affecting classification stability and accuracy [15]. On the other hand, the model’s positional awareness and attention mechanisms are relatively weak [16], often leading to overfitting on background areas or irrelevant edges and lacking precise localization and emphasis on the main object. Particularly when dealing with mixed dishes such as boiled fish with chili or spicy hot pot, the model struggles to focus accurately on the key food regions. As a result, existing models struggle to jointly model global–local spatial relations, intra-class fine-grained variations, and background suppression [17]. As can be seen from Figure 1, food categories usually exhibit small inter-class differences, large intra-class variations, and high diversity in ingredients and plating forms. Therefore, models must not only finely model local structures but also maintain sensitivity to overall layout and positional relationships to break through existing performance bottlenecks.
Food images often require the joint consideration of heterogeneous visual cues, such as local texture patterns, global spatial layouts, and salient ingredient regions; however, existing modeling strategies tend to integrate these cues in a coarse manner, which can result in redundant representations, noise amplification, and inefficient use of discriminative information. Meanwhile, existing recognition pipelines often rely on overly simplified decision mechanisms, where global feature aggregation tends to discard important spatial distribution cues, and single-prototype representations are insufficient to accommodate the pronounced intra-class variability commonly observed within the same food category. These issues collectively constrain further performance improvement in fine-grained food recognition. Hybrid models that integrate Convolutional Neural Networks with the Swin Transformer aim to simultaneously capture intricate local textures and global spatial dependencies; by embedding specialized convolutional blocks directly into the Transformer architecture, such frameworks significantly enhance the recognition of visually similar food items for precision dietary analysis [18].
To address these challenges, we propose a unified fine-grained food recognition framework that emphasizes spatial structure modeling and composition-aware feature representation, aiming to improve the robustness and reliability of food recognition in complex real-world scenarios. From a food analysis perspective, explicitly modeling spatial structure and ingredient-dominated regions enables more stable characterization of composition patterns within a dish, which is essential for distinguishing visually similar foods with different underlying ingredient arrangements. Enhancing positional awareness of key food regions allows the model to focus on composition-related visual cues, such as ingredient concentration areas, while effectively suppressing background elements unrelated to food structure. Existing feature integration strategies often struggle to effectively leverage heterogeneous visual cues, leading to redundant representations, noise amplification, or inefficient information utilization [19], particularly in complex fine-grained food recognition scenarios [20]. Accordingly, an adaptive feature integration strategy is required to preserve complementary information while suppressing redundancy and noise in fine-grained food recognition. Previous studies indicate that adaptive normalization and gating mechanisms can help balance heterogeneous feature contributions and improve robustness in complex visual recognition tasks [21]. Low-rank interaction and attention-based enhancement have also been shown to improve fine-grained discrimination while maintaining computational efficiency [22,23]. At the decision stage, conventional classification designs often suffer from spatial information loss and a limited ability to model intra-class diversity, which constrains performance in fine-grained food recognition. Preserving spatial distribution cues during pooling has been shown to improve recognition of food categories with subtle structural and compositional differences [24]. Such spatially aware pooling strategies help maintain sensitivity to ingredient distribution and local structure in visually similar food categories [25].
Experimental results demonstrate the effectiveness of the proposed framework for fine-grained food image recognition under complex visual conditions. The proposed approach achieves accuracies of 82.64% on UEC FOOD-256 and 82.28% on FoodX-251, demonstrating strong discrimination among visually similar food categories, especially dishes sharing similar ingredients or culinary styles.
In the context of globally diversified food culture, a single food category often exhibits substantial visual variation across regions and preparation styles, resulting in a widespread “same-category, different appearance” phenomenon. Failure to capture variations in overall layout and local ingredient arrangements may lead to representation confusion, while insufficient positional awareness often causes attention to drift toward irrelevant regions (e.g., plates or tablecloths), weakening focus on key ingredients.
To address these challenges, we introduce a unified framework that jointly considers spatial structure and key food region awareness. By integrating global layout cues with local ingredient arrangements, the framework improves sensitivity to complex spatial structures and subtle appearance variations in food images. With enhanced positional awareness, the framework focuses on ingredient-dominant regions while suppressing background interference, thereby improving classification accuracy and robustness in food recognition. This approach provides reliable technical support for image-based dietary composition analysis and health-oriented food assessment pipelines.
The main contributions of this work are summarized as follows:
- To handle complex spatial structures and subtle appearance variations caused by ingredient composition and preparation styles, we design AGRA (Adaptive Grouped Residual Attention) to jointly capture global layout and local ingredient arrangements in food images.
- We introduce CEAG (Coordinate-Enhanced Adaptive Gating) to improve the model’s ability to localize key regions while suppressing background distractions such as plates and tablecloths, thus enhancing classification accuracy and robustness.
- We propose SGLR-Mixer, a soft-gated low-rank fusion strategy that adaptively integrates heterogeneous visual cues in food images while avoiding redundant representations and excessive computational overhead.
- We design TAPSubCenterArcHead, which preserves spatial distribution cues during pooling and improves intra-class diversity modeling, enabling more reliable discrimination among visually confusing food categories.
- We propose Swin-ACST, a fine-grained food classification framework that integrates spatial relationship modeling with key-region awareness, effectively enhancing feature discrimination and overall classification performance in complex food image analysis scenarios.
3. Methods
3.1. Overall Architecture of the Approach
The overall architecture of the proposed model is shown in Figure 2. The model mainly consists of three parts: the first part is the backbone network, the second part is the spatial perception and modeling enhancement module, and the third part is the fine-grained food classifier.
Among them, the backbone network is composed of multiple Swin Transformer modules. The spatial perception and modeling enhancement module sequentially integrates global spatial relation modeling enhancement and coordinate-aware spatial enhancement, combined with a DRConvBlock to further improve local feature representation. After multi-branch enhancement, the SGLR-Mixer is introduced to adaptively integrate heterogeneous features from the multiple enhancement paths, enabling stable fusion of complementary spatial and semantic information. Then, spatial selective amplification is applied to further strengthen the response of salient regions and suppress background interference.
The fine-grained food classifier no longer adopts a simple global average and single-center metric, but instead employs a combination of spatial-distribution-preserving weighted aggregation and a multi-center angular metric, namely the TAPSubCenterArcHead, to preserve the spatial distribution of key food components and improve the modeling of intra-class appearance diversity.
The overall process is as follows: First, the three-channel RGB fine-grained food image is input into the backbone network, and the Swin Transformer extracts global feature representations. Then, the feature information is processed by AGRA for global and local relation modeling, which attends to both overall structure and local arrangement to enhance spatial sensitivity and subtle-difference discrimination. Next, the CEAG module, based on coordinate awareness and spatial gating, improves the accuracy of the model’s attention to target regions, effectively suppresses irrelevant background areas, and avoids background overfitting. After that, local feature enhancement is performed to improve edge and texture representations.
For the features obtained from the above multi-branch enhancement, SGLR-Mixer performs path alignment and soft-gated low-rank fusion, highlighting salient regions and producing stable and more discriminative global representations. Finally, the enhanced features are fed into the TAPSubCenterArcHead, where weighted aggregation and multi-center angular metric are used to complete category prediction.
Overall, through multi-level spatial relation modeling, salient region localization, and robust fusion, the proposed model effectively enhances spatially aware feature representation and category discrimination for fine-grained food images, providing more stable and reliable predictions under diverse presentation conditions.
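To make this data flow concrete, the sketch below shows one plausible composition of the stages described in this subsection in PyTorch; all module interfaces and names are assumptions for illustration, not the authors’ code.

```python
import torch.nn as nn

class SwinACST(nn.Module):
    """Illustrative composition of the described pipeline (interfaces assumed)."""
    def __init__(self, backbone, agra, ceag, drconv, mixer, head):
        super().__init__()
        self.backbone = backbone          # Swin Transformer stages
        self.agra = agra                  # global/local spatial relation modeling
        self.ceag = ceag                  # coordinate-aware gating
        self.drconv = drconv              # local edge/texture enhancement
        self.mixer = mixer                # SGLR-Mixer fusion of the three paths
        self.head = head                  # TAPSubCenterArcHead classifier

    def forward(self, x, labels=None):
        feat = self.backbone(x)           # (B, C, H, W) feature map
        f1 = self.agra(feat)              # spatial relation branch
        f2 = self.ceag(feat)              # key-region branch
        f3 = self.drconv(feat)            # local enhancement branch
        fused = self.mixer(f1, f2, f3)    # soft-gated low-rank fusion
        return self.head(fused, labels)   # margin-based class logits
```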
3.2. CEAG Block
Following the Swin-DR backbone, this section details the proposed CEAG Block, designed to improve spatial awareness in fine-grained food image analysis. In food images, discriminative visual cues are often closely tied to the spatial distribution of ingredients and their relative positions within a dish, while irrelevant background elements such as plates or tablecloths may introduce significant interference. However, the window-based attention mechanism of Swin Transformer exhibits limited explicit positional sensitivity, making it difficult to consistently emphasize key food regions. The CEAG Block addresses this issue by incorporating coordinate-aware attention to explicitly encode spatial location information into channel representations. This design enables the model to better align visual responses with ingredient-dominated regions while suppressing non-food background areas. By enhancing spatial localization and region-level discrimination, CEAG provides more reliable feature representations for fine-grained food recognition and downstream food composition analysis. The CEAG Block comprises two components: Coordinate Attention and Spatial Gating. Coordinate Attention encodes spatial locations and channel dependencies to provide strong coordinate guidance. It uses horizontal and vertical global average pooling to produce spatial embeddings:

$$z^{h}_{c}(h) = \frac{1}{W}\sum_{0 \le i < W} x_{c}(h, i), \qquad z^{w}_{c}(w) = \frac{1}{H}\sum_{0 \le j < H} x_{c}(j, w).$$

These embeddings are concatenated and processed via $1 \times 1$ convolutions, BatchNorm2d, and h-swish to yield fused spatial features $f$:

$$f = \delta\big(\mathrm{BN}(W_{1}[\,z^{h};\, z^{w}\,])\big),$$

where $\delta$ denotes the h-swish activation. $f$ is then split into horizontal and vertical attention weights via separate $1 \times 1$ convolutions and sigmoid activation:

$$g^{h} = \sigma(W_{h}f^{h}), \qquad g^{w} = \sigma(W_{w}f^{w}).$$

Input features $X$ are recalibrated in space and across channels to produce $X'$:

$$X' = X \odot g^{h} \odot g^{w},$$

where ⊙ is element-wise multiplication. This encodes spatial coordinates explicitly, improving localization and suppressing background. The CEAG Block uses a dual gating mechanism. The first gate uses a convolution and sigmoid to produce the feature gate $G_{f}$, and modulates the recalibrated features via a residual path:

$$Y_{1} = X + \lambda_{1}\,(X' \odot G_{f}).$$

Here, $\lambda_{1}$ is a scaling factor (set to 0.1), and $G_{f}$ is the feature gate. This gate focuses on spatially relevant regions and reduces background contributions. The second gate produces the spatial gate $G_{s}$ via a convolution and sigmoid, modulating the coordinate-enhanced features; the results are merged with the input via a residual path to yield the output $Y$:

$$Y = Y_{1} + \lambda_{2}\,(Y_{1} \odot G_{s}).$$

Here, $\lambda_{2}$ is a scaling factor (set to 0.1), and $G_{s}$ is the spatial gate. The dual gating refines features by focusing on relevant spatial regions and suppressing background to reduce overfitting. BatchNorm2d and nonlinearities (e.g., h-swish or ReLU) are applied after each convolution for stable, expressive distributions. The design enables progressive refinement: coordinate attention provides spatial awareness, and the dual gates refine representations. By integrating coordinate attention and dual gating into a single block, CEAG improves focus on relevant regions, suppresses background, reduces overfitting, and enhances recognition on complex scenes and fine-grained targets.
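To illustrate how these operations compose, the following is a minimal PyTorch sketch of the block, assuming a channel reduction ratio of 16 and 1 × 1 gate convolutions, neither of which is specified above; it is an interpretation of the description rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class CEAG(nn.Module):
    """Sketch of CEAG: coordinate attention + dual residual gating (assumed settings)."""
    def __init__(self, c, reduction=16, scale=0.1):
        super().__init__()
        mid = max(8, c // reduction)
        self.conv1 = nn.Conv2d(c, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                  # h-swish activation
        self.conv_h = nn.Conv2d(mid, c, 1)
        self.conv_w = nn.Conv2d(mid, c, 1)
        self.gate_f = nn.Conv2d(c, c, 1)           # feature gate (kernel size assumed)
        self.gate_s = nn.Conv2d(c, 1, 1)           # spatial gate (kernel size assumed)
        self.scale = scale                         # lambda_1 = lambda_2 = 0.1

    def forward(self, x):
        b, c, h, w = x.shape
        # Directional global average pooling (coordinate embeddings).
        zh = x.mean(dim=3, keepdim=True)                       # (B, C, H, 1)
        zw = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([zh, zw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        gh = torch.sigmoid(self.conv_h(yh))                     # (B, C, H, 1)
        gw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2))) # (B, C, 1, W)
        x_ca = x * gh * gw                         # coordinate-recalibrated features
        # Dual gating with scaled residual paths.
        y1 = x + self.scale * (x_ca * torch.sigmoid(self.gate_f(x_ca)))
        return y1 + self.scale * (y1 * torch.sigmoid(self.gate_s(y1)))
```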
3.3. AGRA Block
The Swin Transformer employs window-based local self-attention, which limits its capacity for comprehensive spatial modeling. To address this, we design a novel Adaptive Grouped Residual Attention (AGRA) module with lightweight adaptive weights. The core innovation of AGRA lies in its ability to simultaneously capture both global structural information and local spatial arrangements, thereby significantly enhancing the model’s sensitivity to fine-grained spatial details. By explicitly modeling relationships between distant and nearby spatial positions, AGRA enables the network to better distinguish subtle structural differences arising from ingredient distribution and arrangement, which is crucial for fine-grained food recognition.
Given the input feature $X \in \mathbb{R}^{N \times C}$, we first split the channel dimension into two equal groups, denoted as $X_{1}$ and $X_{2}$. For each group, we apply a grouped residual linear transformation to generate the query, key, and value representations, while introducing learnable adaptive weights $\alpha_{i}$ to control the contribution of the residual branch. Specifically, the transformation for each group is formulated as

$$Q_{i} = W^{Q}_{i}X_{i} + \alpha_{i}X_{i}, \qquad K_{i} = W^{K}_{i}X_{i} + \alpha_{i}X_{i}, \qquad V_{i} = W^{V}_{i}X_{i} + \alpha_{i}X_{i}, \qquad i \in \{1, 2\}.$$

The grouped Q, K, and V are then concatenated and reshaped for multi-head attention computation. We normalize Q and K and compute the cosine similarity, scaled by a learnable parameter $\tau$, to obtain the attention map. The attention calculation is given by

$$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{\hat{Q}\hat{K}^{\top}}{\tau} + B_{\mathrm{exp}}\right)V,$$

where $\hat{Q}$ and $\hat{K}$ are the L2-normalized query and key, and $B_{\mathrm{exp}}$ denotes the exponential-space relative position bias. To further enhance the model’s ability to capture spatial relationships and suppress background noise, we adopt an exponential mapping for the relative position coordinates. For any two tokens with relative coordinates $(\Delta x, \Delta y)$, we compute

$$\widetilde{\Delta x} = \operatorname{sign}(\Delta x)\big(e^{\lvert\Delta x\rvert} - 1\big), \qquad \widetilde{\Delta y} = \operatorname{sign}(\Delta y)\big(e^{\lvert\Delta y\rvert} - 1\big),$$

and the mapped coordinates are used to generate $B_{\mathrm{exp}}$. The output of the attention is then split into two groups, and each group is projected back to the original feature space using a grouped residual linear layer with the same adaptive weights:

$$Y_{i} = W^{O}_{i}Z_{i} + \alpha_{i}Z_{i}, \qquad Y = [\,Y_{1};\, Y_{2}\,],$$

where $Z_{i}$ denotes the attention output of group $i$.
This design enables adaptive control of information flow across grouped attention branches, thereby enhancing the robustness and discriminative capability of spatial attention modeling. The introduction of exponential-space relative position bias further strengthens the modeling of spatial relevance across different structural regions, facilitating more consistent discrimination of fine-grained food patterns with complex spatial arrangements. All components are fully differentiable and can be seamlessly integrated into window-based self-attention frameworks such as Swin Transformer, enabling effective spatial relation modeling for fine-grained food classification under complex presentation variations.
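A condensed PyTorch sketch of the grouped residual projections and cosine-similarity attention is given below. The head count, initialization, and the construction of the exponential-space bias (passed in precomputed here) are assumptions for illustration, not reported settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedResidualLinear(nn.Module):
    """Linear projection per channel group plus an adaptively weighted residual."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.proj = nn.ModuleList([nn.Linear(half, half) for _ in range(2)])
        self.alpha = nn.Parameter(torch.zeros(2))   # learnable adaptive weights

    def forward(self, x):                            # x: (B, N, C)
        x1, x2 = x.chunk(2, dim=-1)
        outs = [p(xi) + a * xi for p, a, xi in zip(self.proj, self.alpha, (x1, x2))]
        return torch.cat(outs, dim=-1)

class AGRA(nn.Module):
    """Sketch of adaptive grouped residual attention with cosine similarity.
    The exponential-space relative position bias is assumed precomputed."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.dim = heads, dim
        self.q = GroupedResidualLinear(dim)
        self.k = GroupedResidualLinear(dim)
        self.v = GroupedResidualLinear(dim)
        self.out = GroupedResidualLinear(dim)
        self.tau = nn.Parameter(torch.ones(heads) * 0.07)  # learnable temperature

    def forward(self, x, bias=None):                 # x: (B, N, C)
        b, n, c = x.shape
        def split(t):                                # (B, N, C) -> (B, h, N, d)
            return t.view(b, n, self.heads, c // self.heads).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)  # cosine similarity
        attn = (q @ k.transpose(-2, -1)) / self.tau.view(1, -1, 1, 1).clamp(min=0.01)
        if bias is not None:                         # exponential-space position bias
            attn = attn + bias
        attn = attn.softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.out(y)                           # grouped residual output projection
```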
3.4. SGLR-Mixer
In fine-grained food classification, multi-branch feature extraction is commonly adopted to capture complementary cues related to global layout, ingredient-dominant regions, and local structural details. However, in food images with complex composition and high visual variability, simple fusion strategies such as direct concatenation or uniform weighting often introduce feature redundancy, noise accumulation, and unnecessary computational overhead, which may lead to unstable category predictions under real-world conditions. To address this issue, we propose a Soft-Gated Low-Rank Mixer (SGLR-Mixer) that performs adaptive fusion of three heterogeneous feature streams in a lightweight and structured manner. By selectively integrating spatial structure–aware features and key-region–focused representations, SGLR-Mixer enhances the consistency and reliability of the final classification output, which is critical for composition-aware food recognition serving downstream dietary analysis applications. The framework diagram of SGLR-Mixer and its role within the overall multi-branch enhancement pipeline are illustrated in Figure 3.
Given three inputs $F_{1}, F_{2}, F_{3} \in \mathbb{R}^{C \times H \times W}$, we first apply batch normalization to each stream to align magnitude differences:

$$\tilde{F}_{k} = \mathrm{BN}(F_{k}), \qquad k = 1, 2, 3.$$

Then fuse via learnable soft-gated weights $w_{k}$:

$$F_{\mathrm{fuse}} = \sum_{k=1}^{3} w_{k}\,\tilde{F}_{k}, \qquad w = \mathrm{Softmax}(\theta),$$

where $\theta$ are learnable gating logits. To capture complementary information, add a small second-order interaction term:

$$F_{\mathrm{enh}} = F_{\mathrm{fuse}} + \epsilon \sum_{j<k} \tilde{F}_{j} \odot \tilde{F}_{k},$$

where $\epsilon$ is the interaction coefficient and ⊙ is element-wise multiplication. A low-rank channel mixer performs lightweight cross-modal interaction. Reduce channels to a low-rank space:

$$M = \delta\big(W_{\mathrm{down}}F_{\mathrm{enh}}\big), \qquad W_{\mathrm{down}} \in \mathbb{R}^{r \times C},\; r \ll C,$$

then map back to the original channel space:

$$F_{\mathrm{mix}} = W_{\mathrm{up}}M, \qquad W_{\mathrm{up}} \in \mathbb{R}^{C \times r}.$$

Finally, add the mixed features back to the enhanced features via a residual:

$$F_{\mathrm{out}} = F_{\mathrm{enh}} + \gamma\,F_{\mathrm{mix}},$$

where $\gamma$ is a scaling factor controlling the residual contribution.
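The fusion can be sketched as follows in PyTorch; the rank $r$, the interaction coefficient, the residual scale, and the GELU nonlinearity are illustrative choices, as the text does not fix their values.

```python
import torch
import torch.nn as nn

class SGLRMixer(nn.Module):
    """Sketch of soft-gated low-rank fusion of three feature streams
    (rank and coefficients are assumed values)."""
    def __init__(self, c, rank=None, eps=0.1, gamma=0.1):
        super().__init__()
        rank = rank or max(8, c // 8)
        self.norms = nn.ModuleList([nn.BatchNorm2d(c) for _ in range(3)])
        self.gate_logits = nn.Parameter(torch.zeros(3))   # soft-gated weights
        self.down = nn.Conv2d(c, rank, 1)                 # project to low-rank space
        self.up = nn.Conv2d(rank, c, 1)                   # map back to C channels
        self.act = nn.GELU()
        self.eps, self.gamma = eps, gamma

    def forward(self, f1, f2, f3):
        feats = [bn(f) for bn, f in zip(self.norms, (f1, f2, f3))]
        w = torch.softmax(self.gate_logits, dim=0)
        fused = w[0] * feats[0] + w[1] * feats[1] + w[2] * feats[2]
        # Second-order interaction term over pairs of streams.
        inter = feats[0] * feats[1] + feats[0] * feats[2] + feats[1] * feats[2]
        enh = fused + self.eps * inter
        mixed = self.up(self.act(self.down(enh)))         # low-rank channel mixing
        return enh + self.gamma * mixed                   # scaled residual fusion
```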
3.5. TSCA-Classifier
In fine-grained food classification, global average pooling and single-centroid classifiers tend to discard spatial distribution cues and inadequately model intra-class diversity, which are critical for distinguishing visually similar food categories with different composition-related characteristics. To address this limitation, we propose the TSCA-Classifier, as illustrated in Figure 4.
Given input feature maps $F \in \mathbb{R}^{C \times H \times W}$, we use Token-aware Pooling (TAP) to aggregate spatial features. TAP uses two-layer convolutions to generate spatial attention:

$$A = \sigma\big(W_{2}\,\delta(W_{1}F)\big),$$

where $W_{1}$ reduces the channel dimension from $C$ to $C/r$ and $W_{2}$ maps to a single channel. The pooled descriptor is the attention-weighted aggregation over spatial positions:

$$f = \frac{\sum_{p} A(p)\,F(p)}{\sum_{p} A(p)}.$$

A lightweight enhancement follows:

$$f \leftarrow f + \mathrm{Dropout}\big(\mathrm{MLP}(f)\big),$$

where the MLP maps $C \to C$ and Dropout regularizes the enhancement branch. BatchNorm is applied:

$$f \leftarrow \mathrm{BN}(f).$$

SubCenter ArcFace keeps $K$ sub-centers per class, with weight matrix $W \in \mathbb{R}^{NK \times C}$ for $N$ classes. Normalize features and weights:

$$\hat{f} = \frac{f}{\lVert f \rVert_{2}}, \qquad \hat{W} = \frac{W}{\lVert W \rVert_{2}}.$$

Cosine similarity:

$$\cos\theta = \hat{W}\hat{f}.$$

Reshape to $N \times K$ and take the max over the $K$ sub-centers:

$$\cos\theta_{j} = \max_{k=1,\dots,K} \cos\theta_{j,k}.$$

Train with an angular margin $m$: the target logit for the ground-truth class $y$ is

$$\cos(\theta_{y} + m).$$

Final logits are obtained by scaling with $s$:

$$z_{j} = \begin{cases} s\cos(\theta_{j} + m), & j = y,\\ s\cos\theta_{j}, & j \neq y. \end{cases}$$
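A compact PyTorch sketch of the head is shown below, assuming $K = 3$ sub-centers, scale $s = 30$, margin $m = 0.3$, and a channel reduction of 4 in the attention branch; these hyperparameters are placeholders, not reported settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAPSubCenterArcHead(nn.Module):
    """Sketch of token-aware pooling + sub-center ArcFace (assumed hyperparameters)."""
    def __init__(self, c, num_classes, k=3, s=30.0, m=0.3):
        super().__init__()
        self.attn = nn.Sequential(                 # two-layer spatial attention
            nn.Conv2d(c, c // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // 4, 1, 1))
        self.bn = nn.BatchNorm1d(c)
        self.weight = nn.Parameter(torch.randn(num_classes * k, c) * 0.01)
        self.k, self.s, self.m = k, s, m
        self.num_classes = num_classes

    def forward(self, x, labels=None):             # x: (B, C, H, W)
        a = torch.sigmoid(self.attn(x))            # (B, 1, H, W) attention map
        f = (a * x).flatten(2).sum(-1) / a.flatten(2).sum(-1).clamp(min=1e-6)
        f = self.bn(f)                             # (B, C) pooled descriptor
        cos = F.linear(F.normalize(f), F.normalize(self.weight))
        cos = cos.view(-1, self.num_classes, self.k).max(dim=-1).values
        if labels is None:                         # inference: scaled cosine logits
            return self.s * cos
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)         # margin on the ground-truth class
        onehot = F.one_hot(labels, self.num_classes).to(cos.dtype)
        return self.s * (onehot * target + (1 - onehot) * cos)
```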
4. Experiments and Results
4.1. Dataset
This paper employs two common fine-grained food image datasets to evaluate the proposed method. Both datasets adhere to their official standard partitions.
FoodX-251. FoodX-251 (Kaur et al., 2019) comprises 251 visually similar fine-grained categories (e.g., cakes with varying decorations, sandwiches with distinct fillings, and pasta in diverse shapes), with 120,216 training images carrying raw web labels, 12,170 validation images, and 28,399 test images, where the validation and test sets have human-verified labels.
UEC FOOD-256. UEC FOOD 256 (Kawano and Yanai, 2014) includes 256 categories of food images, each annotated with bounding boxes precisely localizing food regions. The dataset primarily features Japanese cuisine (e.g., tamagoyaki and takoyaki) alongside international dishes, where certain Japan-specific categories may present recognition challenges for non-native observers.
4.2. Evaluation Metrics
To comprehensively assess the effectiveness of our proposed method in image classification, three widely accepted evaluation metrics were reported: Top-1 Accuracy, F1 Score, and Precision; Recall is also defined below, as it underlies the F1 Score.
- Accuracy measures the proportion of correctly predicted samples over the total number of test instances, indicating overall performance.
- Precision reflects how many of the predicted positive results are actually correct.
- Recall represents the ability of the model to correctly identify all actual positive instances.
- F1 score is the harmonic mean of precision and recall, providing a balanced metric between the two.
The definitions of these metrics are given as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

Here, TP (True Positives) refers to correctly classified positive samples, FP (False Positives) are negative samples mistakenly predicted as positive, FN (False Negatives) are positive samples incorrectly classified as negative, and TN (True Negatives) are correctly classified negative samples.
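For reference, these metrics can be computed from predicted and ground-truth labels with standard tooling; the macro averaging below is an assumption, since the averaging mode is not stated in the text.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute the four reported metrics; macro-averaging is assumed."""
    return {
        "top1_accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```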
All experiments were conducted under consistent settings and hardware configurations to ensure the reliability of the comparisons. In addition to benchmarking against state-of-the-art image recognition methods, ablation studies were also performed to analyze the individual contributions of each module within the proposed architecture.
4.3. Performance Comparative Experiments
To comprehensively evaluate the effectiveness of the proposed method in fine-grained food recognition tasks, a series of systematic comparative experiments were conducted on two authoritative public datasets, FoodX-251 and UEC FOOD-256. The compared methods include representative image recognition architectures, covering both conventional convolutional neural networks (e.g., ResNet, TResNet, EfficientNet) and a variety of recently developed Vision Transformer-based models (e.g., ViT, Swin Transformer, Twins, CaiT, ConvNeXt, CSwin, VOLO) [39,40], all of which have demonstrated strong benchmark performance in food image classification and fine-grained recognition tasks.
All experiments were performed on a single NVIDIA GeForce RTX 4090 GPU with mixed-precision training to improve computational efficiency. The batch size was set to 8, and the AdamW optimizer was employed. Both datasets were split into training and validation sets following their official protocols, without any manual intervention. To ensure fairness and reproducibility, three commonly used evaluation metrics (Top-1 Accuracy, F1 Score, and Precision) were adopted to comprehensively assess the classification capability of different models under multi-class and imbalanced food image distributions.
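For concreteness, the stated setup (batch size 8, AdamW, mixed precision on a single GPU) corresponds to a training loop of roughly the following shape; the learning rate and weight decay are illustrative assumptions, and `model` and `train_loader` are placeholders.

```python
import torch
import torch.nn.functional as F

# `model` is the Swin-ACST network and `train_loader` yields (image, label)
# batches of size 8 from the official training split (both assumed here).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scaler = torch.cuda.amp.GradScaler()              # mixed-precision training

model.cuda().train()
for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():               # FP16 autocast for efficiency
        logits = model(images, labels)            # margin-based logits
        loss = F.cross_entropy(logits, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```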
As shown in Table 1, the performance differences among the compared models are evident across both datasets. Traditional convolutional networks exhibit certain advantages in local texture modeling, yet they often fail to establish global semantic consistency when facing food images with complex backgrounds or diverse structural forms. Transformer-based architectures, by contrast, improve the integration of global features through self-attention mechanisms, thereby achieving higher recognition accuracy. However, these methods still face limitations when handling food categories with large intra-class variations and scattered discriminative features.
Our proposed model incorporates several optimization mechanisms tailored for fine-grained food imagery, effectively enhancing its capabilities in spatial modeling, region perception, and detail discrimination. On the FoodX-251 and UEC FOOD-256 datasets, our method achieves 82.28% and 82.64% Top-1 accuracy, respectively, confirming its superiority and robustness in complex food recognition scenarios.
The model centers around unified spatial relation modeling and composition-related salient region localization, strengthening spatial dependency representations across both global and local levels. Through robust alignment of heterogeneous semantic branches and lightweight low-rank fusion, the proposed architecture enables effective feature integration that preserves the spatial distribution of key regions while suppressing background noise and redundancy. Based on this, the classification head replaces the single-center paradigm with a multi-center angular metric, enhancing intra-class diversity representation and boundary robustness. With only marginal additional computational overhead, the entire pipeline achieves end-to-end optimization from feature extraction and cross-branch fusion to decision-space refinement, significantly improving recognition stability and discriminative capability in "same-class samples with different composition layouts" and cluttered background scenarios.
Notably, the proposed method demonstrates strong discriminative capability in handling visually similar food categories that belong to different semantic classes, a scenario that frequently occurs in real-world food image analysis. For example, soup-based dishes such as beef soup and pork bone soup often present highly comparable visual appearances in terms of color tone, liquid dominance, and serving context, despite differing in primary ingredients and dietary relevance. Conventional models tend to confuse such categories due to their reliance on coarse global representations. In contrast, the proposed framework can accurately attend to composition-related regions while modeling global spatial relationships, enabling more reliable differentiation between visually confusing yet semantically distinct food categories and maintaining stable classification performance under complex background conditions.
These results demonstrate that the proposed approach achieves systematic improvements in feature alignment, category boundary modeling, and local–global information fusion by integrating more efficient spatial dependency modeling and region-aware mechanisms. Consequently, it enhances the model’s ability to discriminate subtle inter-class differences and greatly improves its classification robustness and stability in fine-grained food recognition. The above experimental results confirm that our method provides a more accurate and reliable solution for complex fine-grained food recognition tasks.
4.4. Ablation Analysis
To verify the effectiveness of the overall architectural design proposed in this study, we conducted a series of ablation experiments on two fine-grained food image datasets, FoodX-251 and UEC FOOD-256. Starting from the baseline backbone, the experiments follow the overarching principle of “enhancing spatial relation modeling — emphasizing key regions — robust multi-branch fusion and discrimination”. Each capability is introduced progressively to assess its individual and cumulative contribution to the overall recognition performance, while avoiding isolated or fragmented evaluations of single modules.
Specifically, the original Swin-DR model without any structural enhancement is designated as the baseline. As shown in Table 2, its Top-1 accuracies on the two datasets are 81.07% and 82.15%, respectively. After incorporating spatial relation modeling enhancement, the accuracies increase to 81.41% and 82.36%, indicating that such modeling facilitates better representation of global layouts and cross-region dependencies, yielding stable gains particularly on the UEC FOOD-256 dataset. Further introducing discriminative region focusing and guidance leads to accuracies of 81.67% and 82.48%. When these capabilities are jointly integrated under a unified framework with lightweight low-rank multi-branch fusion, the model achieves 81.97% and 82.54%, respectively. Finally, after convergence through a more robust multi-center discriminative space, the model attains its best performance of 82.28% on FoodX-251 and 82.64% on UEC FOOD-256. These results are clearly superior to those achieved by any single modification, validating the synergistic effect of the overall design in spatial representation, salient localization, and heterogeneous feature integration.
Taking visually similar soup-based dishes as an example, categories such as beef soup and chicken soup often exhibit highly comparable appearances, characterized by dominant liquid regions, similar color distributions, and overlapping serving contexts. Differences in primary ingredients are frequently reflected only in subtle local regions and spatial composition cues, making them prone to confusion under background interference and plating variations. The baseline model tends to be distracted by dominant background or container regions, resulting in unstable attention and inconsistent predictions. In contrast, when the proposed integrated strategy is enabled, the model can consistently emphasize ingredient-related regions within the global layout, preserve meaningful spatial distribution patterns, and suppress irrelevant background responses, thereby achieving more reliable discrimination between visually confusing yet semantically distinct food categories.
In summary, the ablation results indicate that the observed performance gains are not attributable to any single component in isolation, but arise from the coordinated optimization of spatial relation modeling, salient region localization, robust low-rank fusion, and discriminative decision modeling. This integrated design enhances the consistency of feature representation, stabilizes region-aware responses, and improves fine-grained alignment across complex visual conditions. These findings further validate the structural rationality of the proposed framework and its practical suitability for reliable fine-grained food image recognition in application-oriented settings.
5. Dietary Analysis Application
In recent years, increasing attention has been paid to dietary health and nutritional balance. However, in daily dining scenarios, many consumers lack accurate knowledge of food composition and ingredient structure, especially when visually similar dishes differ subtly in their main ingredients or preparation styles. This often leads to misunderstandings in dietary assessment and limits the effectiveness of image-based dietary analysis.
To address this problem, we develop an image-based dietary analysis application in which the proposed Swin-ACST framework serves as the core food recognition module. The application is designed to support reliable food category identification under real-world dining conditions and to provide users with consistent food-related information for subsequent dietary interpretation.
As illustrated in Figure 5, users capture food images using mobile devices in unconstrained environments, where variations in shooting angle, illumination, distance, and background are common. The captured images are uploaded through the application interface and transmitted to the inference server, where the trained Swin-ACST model performs fine-grained food recognition. Despite substantial visual variations in image appearance, the system produces stable and consistent recognition results for the same food category. This indicates that the proposed framework effectively captures food-related structural and compositional cues that remain reliable across different acquisition conditions, which is essential for practical dietary monitoring.
Beyond robustness to appearance variation, accurate discrimination among visually similar food categories is particularly important for dietary analysis. As shown in Figure 6, several dishes with highly similar visual appearances but different ingredient compositions are evaluated. The visualization results demonstrate that the proposed system focuses on ingredient-dominant regions while suppressing irrelevant background areas, enabling more precise differentiation between visually confusing dishes. Such behavior is critical in dietary applications, where misclassification between similar-looking foods may lead to incorrect interpretation of nutritional characteristics.
After food recognition is completed, the system automatically retrieves corresponding food-related information and health-oriented references from the database according to the recognized category and returns the results to the application interface. By reliably distinguishing food categories with subtle compositional and structural differences, the application supports more meaningful interpretation of meal composition and provides users with clearer dietary insights. These results indicate that Swin-ACST is well suited for deployment in real-world image-based dietary analysis applications, providing a reliable visual foundation for intelligent food monitoring and health-oriented dietary assessment.