3.2. Performance and Validation of Chemometric Models
The performance of the supervised models was evaluated using nested cross-validation, a fundamental strategy to reduce optimistic bias in high-dimensional metabolomic datasets with limited sample size. The evaluated metrics included Overall Accuracy, Balanced Accuracy, and Cohen’s Kappa coefficient, allowing simultaneous assessment of global predictive performance, class balance, and agreement beyond random classification.
The results demonstrated important differences in the behavior of the algorithms in response to the metabolomic complexity of Coffea arabica genealogical groups (
Table 2;
Figure 2). Overall, the models showed distinct performances according to their mathematical characteristics and the correlation structure present in the NMR spectral dataset.
Repeated stratified cross-validation demonstrated that the supervised models exhibited distinct performances in response to the metabolomic complexity of Coffea arabica genealogical groups. Overall, the results confirmed that discrimination among groups is feasible using ^1H NMR spectra, although efficiency varied according to the mathematical structure of each algorithm.
The PLS-DA model achieved the highest overall accuracy (87.9%), the highest Balanced Accuracy (89.4%), and the most balanced performance among the three genealogical classes. In addition, the confidence interval associated with Balanced Accuracy indicated moderate to high predictive stability, considering the limited sample size and the high dimensionality of the dataset. The homogeneous behavior of the individual recalls (Bourbon = 92.5%; Mundo Novo = 91.7%; Timor Hybrid = 84.0%) demonstrated that PLS-DA was the least affected by class imbalance, a particularly important characteristic in metabolomic datasets with reduced sample size and uneven class distribution. These results reinforce the suitability of PLS-DA for highly collinear spectral datasets, since the method projects variables into latent components optimized to simultaneously maximize group separation and explained variance.
The Elastic Net model showed overall accuracy very close to that of PLS-DA (86.6%) and a high Kappa coefficient (0.790), indicating agreement between predicted and observed classifications. The good performance of this model suggests that genealogical discrimination may be partially explained by reduced subsets of highly informative spectral variables, since Elastic Net combines L1 and L2 regularization, simultaneously promoting variable selection and control of metabolomic collinearity. The model exhibited particularly high performance for Bourbon (recall = 90.0%) and Timor Hybrid (88.0%). However, the lower performance for Mundo Novo (75.0%) indicates greater metabolomic overlap of this group with the others, especially due to its intermediate nature between Bourbon and Timor Hybrid.
The SVM-RBF model showed intermediate performance, with overall accuracy of 76.8% and Balanced Accuracy of 60.6%. The algorithm demonstrated high ability to discriminate Timor Hybrid (92.5%) and Bourbon (89.2%), suggesting that nonlinear relationships partially contributed to the separation of these groups. However, the extremely low recall for Mundo Novo (0.0%) indicates greater sensitivity of SVM-RBF to intraclass heterogeneity and metabolomic overlap observed in this group. In small and highly correlated datasets, highly flexible nonlinear models may exhibit increased susceptibility to overfitting local classification boundaries, reducing their overall stability.
Random Forest showed the lowest overall performance, with overall accuracy of 72.9% and Balanced Accuracy of 59.4%. Despite the good identification of Timor Hybrid (85.0%) and Bourbon (85.0%), the model exhibited low sensitivity for Mundo Novo (8.3%), indicating substantial difficulty in establishing consistent classification boundaries for this genealogical group. This behavior suggests that the metabolomic structure of the dataset was better represented by linear latent-variable models, favoring projection-based and regularized approaches rather than decision tree-based algorithms. Furthermore, the high dimensionality associated with the small sample size tends to reduce the stability of individual trees, increasing variability in ensemble decisions.
Overall, the results demonstrate that genealogical separation in Coffea arabica is mainly associated with gradual and predominantly linear metabolomic patterns, particularly related to lipids, amino acids, trigonelline, caffeine, and chlorogenic acids. The superior performance of PLS-DA and Elastic Net reinforces that regularized linear models provide greater stability and interpretability for metabolomic applications involving ^1H NMR data. On the other hand, the consistently lower performance observed for Mundo Novo across nearly all models suggests a partially intermediate metabolomic profile, which is consistent with its genetic origin derived from the natural cross between Bourbon and Typica, although the reduced number of samples within the Mundo Novo group may also have contributed to greater classification instability.
Despite the promising results obtained from the supervised models, some limitations should be considered when interpreting the findings. The sample set used in this study is still relatively small for high-dimensional supervised metabolomic approaches, particularly for the Mundo Novo group, which contained fewer samples than the other genealogical groups. Class imbalance may affect model stability and favor classification bias, especially in more complex algorithms sensitive to sample distribution. Nevertheless, the use of robust metrics such as balanced accuracy, combined with repeated cross-validation and nested cross-validation strategies, partially minimized these effects and increased the statistical reliability of the analyses. Even so, future studies involving a larger number of accessions, multiple crop seasons, and independent external validation will be important to confirm the stability of the identified metabolomic markers.
3.3. Main Chemical Compounds Responsible for the Separation of Coffea Arabica Genealogical Groups
PLS-DA analysis of the ^1H NMR spectra revealed the main chemical compounds responsible for the separation of Coffea arabica genealogical groups. The Bourbon group, associated with negative coefficients, showed higher relative abundance of sucrose, lipids, and trigonelline, compounds related to sweetness, light body, and delicate aroma, which are characteristic of high-quality coffees. The Mundo Novo group, associated with intermediate coefficients, was characterized mainly by the contribution of amino acids and organic acids, which are precursors of aroma and body-related compounds, reflecting a balanced sensory profile between acidity and intensity. In contrast, the Timor Hybrid group, associated with positive coefficients, was characterized by the predominance of trigonelline, chlorogenic acids, and caffeine, phenolic and nitrogen-containing compounds associated with higher bitterness and acidity, which are commonly related to resistant genotypes. It is important to emphasize that these identified compounds represent potential precursors or chemical markers associated with sensory attributes described in the literature, rather than compounds directly determining such attributes.
Figure 3 shows the distribution of chemical annotations across the PLS-DA coefficients: 7.9–8.8 ppm (orange) corresponds to the caffeine region, more strongly associated with Timor Hybrid (positive coefficients); 7.0–7.6 ppm (red) corresponds to trigonelline and chlorogenic acids, which also predominated in Timor Hybrid; 4.0–5.5 ppm (green) corresponds to sucrose and sugars, associated with Bourbon (negative coefficients); and 0.6–1.5 ppm (blue) corresponds to fatty acids and lipids. Therefore, the plot visually demonstrates which spectral regions discriminate against each genealogical group: Bourbon is associated with negative regions (blue and green), Timor Hybrid with positive regions (red and orange), and Mundo Novo occupies an intermediate position between both groups.
Therefore, the PLS-DA classification model was constructed to discriminate the genealogical groups.
Figure 4A shows the projection of the accession codes onto the first two PLS-DA components according to the loadings of the corresponding components from matrix X. A certain degree of overlap among groups was still observed, indicating the existence of common metabolomic regions among accession codes, as evidenced by the 95% confidence ellipses shown in
Figure 4A.
Figure 4B and 4C present the correlation plot and the main discriminant buckets selected from components 2 and 3 of the PLS-DA model, respectively, highlighting the ^1H NMR spectral regions most associated with the discrimination of genealogical groups. For the Bourbon group, the variables that most influenced classification were lipids, amino acids, and organic acids. In the Mundo Novo group, amino acids and organic acids were also the compounds that most contributed to separation, although with lower intensity. Chlorogenic acids were the variables that most strongly influenced the classification of the Timor Hybrid group.
According to the PLS-DA results, it was possible to identify the variables with the greatest influence on the classification of genealogical groups and, consequently, determine potential chemical markers associated with these genotypes. The chemical compounds present in green coffee beans are important precursors of flavor and aroma compounds formed during the roasting process. Breeding programs aimed at improving beverage quality are complex because the sensory parameters evaluated during cupping are strongly influenced by environmental conditions [
16,
17]. Therefore, analytical tools capable of identifying chemical markers associated with beverage quality are required ([
18]).
To verify and validate the constructed model, the classification error rate was evaluated (
Figure 4D), and the selected error rate corresponded to component 5 (10.5%). It should be emphasized that this error rate was obtained through a repeated 5-fold cross-validation procedure, in which one subset was used to fit the model and the remaining subsets were used for testing across 100 simulations. The permutation test demonstrated that the observed classification performance was significantly superior to that obtained by random assignment of class labels (p < 0.05). The real PLS-DA model exhibited a substantially lower balanced error rate than the distribution generated by the permuted models, indicating that the observed discrimination was not due to random chance compared with the distribution generated from random permutation of class labels, corroborating the robustness and non-random nature of the metabolomic discrimination among
Coffea arabica genealogical groups.
According to the PLS-DA model, each accession code was classified into the genealogical group presenting the highest loading value in the PLS-DA component associated with the lowest classification error rate (
Figure 4D). The PLS-DA model containing five components was selected as the final model because it provided the best balance between predictive performance, chemical interpretability, and statistical parsimony. According to the PLS-DA model fitted with five latent variables, the genealogical groups showed moderate separation with partial overlap among classes (
Figure 5). The Bourbon group presented the highest classification consistency, whereas the Timor Hybrid group showed greater dispersion and higher misclassification rates. The overall model performance reached 89.5% accuracy after repeated 5-fold cross-validation (100 simulations), indicating satisfactory discrimination power based on the chemical composition of coffee beans. Thus,
Figure 5 shows the classifications resulting from the PLS-DA model, in which one sample from the Bourbon group, one from Mundo Novo, and two from Timor Hybrid were not correctly classified (
Figure 5 and
Table 3). Therefore, the samples with the highest correct classification rate belonged to the Bourbon genealogical group, which presented the lowest error rate (8.3%). The performance metrics presented in
Table 2 correspond to repeated stratified 5-fold cross-validation (100 repetitions) performed after hyperparameter optimization through nested cross-validation, whereas
Table 3 summarizes the performance of the final optimized model containing five latent variables.
The variables were selected according to the highest absolute loadings in component 3 combined with positive class association. Putative metabolite assignments were based on literature-reported ^1H NMR spectral regions for coffee metabolites.
The PLS-DA model (5 components, 5-fold cross-validation × 100 simulations) showed discrimination among the genealogical groups based on the chemical composition of coffee samples.
The PLS-DA model with five components demonstrated high classification accuracy for the Bourbon and Timor Hybrid, whereas Mundo Novo exhibited the highest misclassification rate. Overall, the model showed relative consistent and robust performance across 100 cross-validation simulations, indicating a satisfactory ability to discriminate genealogical groups based on chemical composition. The values presented in Table 3 correspond to the optimized final model and therefore are not directly comparable to the repeated cross-validation estimates presented in Table 2. 3.4. Metabolic Signatures for Genealogical Discrimination
The analysis of the variables with the highest coefficients in the models (
Figure 6) provided strong biochemical plausibility for the separation of the genealogical groups (
Table 4).
The identification of discriminant metabolites was performed based on loading plots (PC1 × PC2) and confirmed by bidimensional heteronuclear correlation experiments (^1H–^13C HSQC and HMBC). Thus, the assignment of chlorogenic acids, amino acids, organic acids, and lipids was confirmed, whereas additional signals were annotated putatively. The consistency between the identified chemical markers and the genetic background of the samples validates the metabolomic approach.
The integration of PLS-DA coefficients, VIP values, and bidimensional NMR experiments (HSQC and HMBC) allowed the identification of consistent metabolomic signatures associated with the different genealogical groups of Coffea arabica. The Bourbon group showed greater association with lipids, organic acids, and amino acids, compounds related to creaminess and aromatic refinement. The Mundo Novo group exhibited an intermediate metabolic profile, mainly characterized by amino acids and organic acids, suggesting a balance between aroma precursors and compounds associated with beverage body. In contrast, the Timor Hybrid group showed higher relative intensities of chlorogenic acids, reflecting a more phenolic and acidic chemical profile associated with its interspecific genetic origin. These results reinforce the biological consistency of the chemometric models and demonstrate the potential of ^1H NMR metabolomics for genealogical discrimination of Coffea arabica.
The heatmap displays the thirty most discriminant chemical variables (VIP scores) identified by the PLS-DA model, normalized by Z-score across the three genealogical coffee groups (Bourbon, Mundo Novo, and Timor Hybrid). Red shades indicate higher relative intensities (positive Z-scores), whereas blue shades represent lower relative intensities (negative Z-scores). Each variable is identified by its chemical shift (δ, ppm) and its major compound class detected in the ^1H NMR spectra. The color distribution highlights the distinct metabolomic profiles characterizing each genealogical group. The Bourbon group showed relatively higher intensities in spectral regions associated with lipids, amino acids, and organic acids, whereas Mundo Novo displayed intermediate intensities. In contrast, Timor Hybrid was enriched in chlorogenic acids and aromatic phenolic compounds.
3.4.1. Chemical Interpretation of the Bourbon Genealogical Group
The signal observed at δ 0.83 ppm is typical of terminal methyl protons (–CH₃) from aliphatic chains of saturated fatty acids, such as palmitic acid (C₁₆:₀) and stearic acid (C₁₈:₀), in addition to possible contributions from triacylglycerols [
19,
20]. In the ^1H NMR spectrum, this region (0.7–1.0 ppm) is widely attributed to neutral lipid components of green and roasted coffee.
These compounds are classical chemical markers associated with beverage body and creamy texture, since lipids participate in the retention of volatile compounds and contribute to viscosity perception (Speer & Kölling-Speer, 2006). The high intensity of this signal in the Bourbon group suggests a higher lipid content compared with other groups, which is consistent with the metabolic profile of
C. arabica, generally characterized by higher proportions of total oils (~15% dry weight) compared with hybrids related to
C. canephora [
22,
23].
From a sensory perspective, higher lipid content may potentially be associated with compounds previously related to enhanced body and aromatic persistence, in addition to attenuating acidity and bitterness perception ([
24]). Thus, lipid-associated signals may contribute to chemical characteristics previously related to beverage body and creaminess.
The bucket centered at δ 1.69 ppm was putatively associated with overlapping signals from aliphatic amino acids and organic acids located within the 1.6–2.0 ppm region of the ^1H NMR spectrum ([
19]). These metabolites play an essential role as precursors of volatile aroma compounds formed during the roasting process. During the Maillard reaction, amino acids react with reducing sugars, generating pyrazines, furanones, and aldehydes responsible for fruity, caramel-like, and floral notes [
24,
25]. Thus, the significant presence of this signal in the Bourbon group may be associated with characteristics previously related to aromatic potential and sensory complexity, resulting in coffees with intense and pleasant aroma.
The exact identification of the compounds within this bucket is hindered by signal overlap, an inherent characteristic of NMR analyses of complex mixtures. However, studies using bidimensional spectra (HSQC and HMBC) confirm that this region is predominantly composed of free amino acids and carboxylic acids associated with fruit maturation metabolism ([
19]).
The signal at δ 7.25 ppm is in the aromatic proton region of conjugated phenolic compounds, including chlorogenic acids (CGAs), caffeic acid, and trigonelline derivatives. This region is typical of aromatic phenolic compounds recognized for their antioxidant activity and contribution to acidity and aroma ([
25]). During roasting, CGAs undergo thermal degradation, generating phenol, guaiacol, vanillin, and cresols, compounds responsible for caramel, vanilla, and chocolate notes ([
26]). Although the Bourbon group presented lower relative intensities of these compounds compared with Timor Hybrid, their moderate presence contributes to balanced acidity and aromatic complexity without the excessive bitterness commonly associated with resistant hybrids.
3.4.2. Chemical Interpretation of the Mundo Novo Genealogical Group
The Mundo Novo group was mainly discriminated against by amino acids and organic acids. In the PLS-DA model, the Mundo Novo genealogical group showed the most discriminant variables at the chemical shifts δ 1.75, δ 1.95, and δ 7.31 ppm, corresponding to the classes’ amino acids/organic acids and chlorogenic acids/aromatic phenolics. These compounds reflect a balanced and intermediate chemical profile between the Bourbon and Timor Hybrid groups, consistent with the hybrid origin of Mundo Novo, a natural cross between the Typica and Bourbon varieties of Coffea arabica.
The signals at δ 1.75 and δ 1.95 ppm are associated with aliphatic amino acids (such as alanine, valine, leucine, and isoleucine) and low-molecular-weight organic acids (such as malic, citric, succinic, and lactic acids), which act as precursors of volatile aroma compounds during roasting [
24,
25]. During thermal processing, these metabolites participate in Maillard reactions with reducing sugars, forming volatile nitrogen-containing compounds (pyrazines, furanones, and aldehydes) responsible for sweet, floral, and fruity notes ([
24]). Therefore, the expressive presence of these compounds in the Mundo Novo group suggests metabolically balanced beans with potential for coffees exhibiting medium body and complex aroma, sensory characteristics typically associated with this variety.
On the other hand, the signal at δ 7.31 ppm is characteristic of aromatic protons from chlorogenic acids (CGAs) and other aromatic phenolic compounds. CGAs are the major phenolic compounds present in green coffee and play a central role in defining beverage acidity, bitterness, and antioxidant potential ([
25]). During roasting, these compounds undergo partial thermal degradation, generating volatile derivatives such as phenol, guaiacol, vanillin, and cresols, which contribute caramel, vanilla, and chocolate notes ([
26]).
The coexistence of amino acids and chlorogenic acids at expressive levels in the Mundo Novo group indicates a chemical balance between aroma precursors and acidity-related compounds, resulting in a balanced sensory profile with good sweetness, medium body, and bright acidity, characteristics frequently observed in coffees from this lineage ([
18]).
In summary, the Mundo Novo group stands out for presenting an intermediate chemical profile between the extremes represented by Bourbon (sweetness and body) and Timor Hybrid (acidity and bitterness), representing a metabolic and sensory balance highly appreciated in specialty coffees.
3.4.3. Chemical Interpretation of the Timor Hybrid Genealogical Group
The Timor Hybrid genealogical group presented, in the PLS-DA model, the most expressive discriminant variables at the chemical shifts δ 7.29, δ 6.85, and δ 6.77 ppm, all located within the aromatic region of the ^1H NMR spectrum. These signals correspond predominantly to phenolic compounds and chlorogenic acid (CGA) derivatives, which are recognized as chemical markers associated with acidity, bitterness, and antioxidant activity in coffee ([
25]).
Chlorogenic acids are esters formed between quinic acid and caffeic acid and represent one of the major classes of phenolic metabolites in green coffee, accounting for approximately 4–10% of the bean dry matter ([
27]). The variations observed in the intensity of these spectral regions in Timor Hybrid indicate higher accumulation of CGAs, which is consistent with the genetic background of this group, derived from the cross between
Coffea arabica and
Coffea canephora (robusta). This inheritance confers to Timor Hybrid a chemical profile closer to robusta coffee, characterized by higher levels of phenolic and bitter compounds, as well as greater resistance to biotic and abiotic stresses [
18,
28].
During the roasting process, chlorogenic acids undergo thermal rearrangements and oxidative degradation, generating a variety of volatile aromatic compounds such as phenol, guaiacol, cresols, and vanillin, which are directly responsible for roasted, bitter, and full-bodied beverage notes ([
23]). However, excessive degradation of these phenolic compounds may also generate reactive quinones associated with increased astringency and residual bitterness.
The predominance of signals in the δ 6.7–7.3 ppm region in Timor Hybrid demonstrates the strong presence of conjugated aromatic structures, typical of caffeic, ferulic, and p-coumaric acids, in addition to trigonelline and nicotinamide derivatives, which also contain active aromatic protons in this region ([
19]). Trigonelline is a precursor of niacin (vitamin B3) and contributes to sweet and slightly spicy notes after roasting.
This high concentration of aromatic phenolic compounds confers on Timor Hybrid a distinct chemical profile characterized by high acidity, perceptible bitterness, and elevated antioxidant potential. In sensory analyses, coffee with these characteristics usually presents pronounced acidity and intense body, frequently associated with spicy and cocoa-like notes ([
26]).
Thus, the set of observed variables suggests that the Timor Hybrid group is distinguished by the chemical intensity of phenolic compounds, being the group that differs most strongly from the others due to the pronounced presence of chlorogenic acids and aromatic phenolics. This profile reinforces its hybrid genetic origin and explains the behavior observed in the Top 30 heatmap, in which Timor Hybrid exhibited higher Z-scores in the aromatic region of the spectrum.