6. Validation Using Foundation Models in WatsonX
After generating the complete semantic corpora using the proposed Semantic Flow Encoding (SFE) mechanism and demonstrating the informational separability between BENIGN and DDOS traffic through formal measures (Shannon entropy and Jensen–Shannon Divergence), an additional level of interpretative validation was needed. The numerical analysis confirmed the existence of a structural difference at the distributional level; however, to consolidate the central hypothesis of the paper, it was essential to verify whether this difference would also be detectable at the level of language abstraction, in the absence of access to the original numerical values.
For this purpose, a multi-model validation was performed through controlled inference within the IBM WatsonX platform. At this stage, LLMs are used as semantic auditing mechanisms, capable of abstracting the symbolic distribution generated through SFE. The objective is not classification or re-computation of statistical properties, but rather evaluation of semantic structural robustness and the testing of interpretative convergence across distinct models [37,38,39,40].
From the fully programmatically generated corpora, representative samples of 1,000 instances were extracted for each semantic configuration:
BENIGN_SFE-5_1K
DDoS_SFE-5_1K
BENIGN_SFE-8_1K
DDoS_SFE-8_1K
The analysis was organized into BENIGN–DDoS pairs corresponding to each configuration (SFE-5 and SFE-8), keeping both the sample size and the semantic parametrization constant.
The global distributional properties had already been quantified on the complete corpora during the informational analysis stage. The samples preserve the dominant patterns and the relevant informational structure, and processing the entire corpus would not provide significant additional information.
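The claim that a 1,000-instance sample preserves the structure of the full corpus can be checked directly by measuring the Jensen–Shannon Divergence between the token distributions of the sample and of the complete corpus. The sketch below is illustrative only: the text does not state how the samples were drawn, so uniform random sampling with a fixed seed is an assumption, and the function names and toy token strings are hypothetical.

```python
import math
import random
from collections import Counter


def token_distribution(corpus):
    """Relative frequency of each whitespace-separated token in a list of descriptions."""
    counts = Counter(tok for line in corpus for tok in line.split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD in bits (base-2 logs) between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in support}

    def kl_to_mixture(a):
        # Kullback-Leibler divergence from a to the mixture m; zero-mass terms contribute 0.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def draw_sample(corpus, size=1000, seed=0):
    """Uniform sample without replacement; the fixed seed makes the draw repeatable (assumed)."""
    return random.Random(seed).sample(corpus, size)


# Toy stand-in for an SFE corpus (hypothetical token strings, not real data).
rng = random.Random(1)
levels = ["very_low", "low", "medium", "high", "very_high"]
full = [
    f"throughput_{rng.choice(levels)} packet_rate_{rng.choice(levels)}"
    for _ in range(20000)
]
sample = draw_sample(full)
drift = jensen_shannon_divergence(token_distribution(full), token_distribution(sample))
```

A `drift` value close to zero would indicate that the sample follows the token distribution of the full corpus; how small is small enough remains a judgment call that the text does not quantify.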
The semantic analysis was conducted in IBM WatsonX Prompt Lab using three distinct foundation models: Mistral, LLaMA-3-70B-Instruct, and Granite (IBM) [41,42,43]. The choice of the WatsonX platform was motivated by the need for a unified and controlled environment that enables us to run different models with identical generation parameters, ensuring direct comparability of results. For all experiments, identical parameters were used: temperature = 0.2, top-p = 1, frequency penalty = 0, presence penalty = 0, and a constant generation limit (~300 tokens). The low temperature limits stochastic variation and promotes stable inferential behavior. No retrieval mechanism or vector store was used; each file was processed in full as a single context, reducing the risk of introducing selection bias and ensuring a global analysis of the symbolic distribution.
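The shared decoding configuration described above can be pinned in one immutable object, so that no model can receive different settings by accident. This is a minimal sketch under stated assumptions: the field names, the model labels, and the `build_request` helper are hypothetical and do not reflect the WatsonX SDK, and the approximate generation limit is written as exactly 300 for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Decoding parameters reported in the text; field names are illustrative."""
    temperature: float = 0.2
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    max_new_tokens: int = 300  # the text gives "~300 tokens"; exact value assumed


# Model labels as named in the text, not platform model identifiers.
MODELS = ("Mistral", "LLaMA-3-70B-Instruct", "Granite")


def build_request(model, prompt, settings):
    """Assemble a plain dict describing one run (hypothetical request shape)."""
    return {"model": model, "prompt": prompt, "parameters": asdict(settings)}


SETTINGS = GenerationSettings()
run_requests = [build_request(m, "<prompt + corpus>", SETTINGS) for m in MODELS]
```

Because the dataclass is frozen, any attempt to change a parameter between runs raises an error, which keeps the comparison across models on equal footing.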
Within this experimental framework, IBM WatsonX provides the necessary infrastructure to test the hypothesis that the informational differences numerically demonstrated between BENIGN and DDOS are systematically reflected in the language abstraction produced by distinct LLMs. Positive results would thus suggest the existence of a correspondence between the numerical space of distributions and the symbolic space of semantic interpretation. Across all runs, the same prompt was used, formulated in a neutral and descriptive manner, without references to classification or informational measures. The prompt imposed four analytic dimensions: attribute variability, intensity (throughput/packet rate), temporal structure, and dominant behavioral motifs. Testing was carried out step by step, applying the identical prompt to each pair of datasets and each model, to observe whether the numerically demonstrated informational differences are associated with consistent differences in the generated semantic characterization.
The prompt used was: You are analyzing a corpus of structured semantic descriptions of network flows. Examine the overall behavioral structure of this corpus. Focus strictly on variability of attributes, intensity patterns (throughput, packet rate), temporal structure (inter-arrival timing, active phase duration) and dominant behavioral motifs. Do NOT classify the data. Do NOT assume labels. Provide a structural analysis of the corpus in approximately 200–300 words.
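The experimental grid implied above (three models, four corpora, one fixed prompt) can be enumerated explicitly. The sketch below is an illustration under assumptions: the text does not specify how the corpus file was attached to the prompt, so the `compose_context` layout is hypothetical, as are the helper names.

```python
from itertools import product

# Prompt text as quoted in the paper.
PROMPT = (
    "You are analyzing a corpus of structured semantic descriptions of network flows. "
    "Examine the overall behavioral structure of this corpus. "
    "Focus strictly on variability of attributes, intensity patterns (throughput, packet rate), "
    "temporal structure (inter-arrival timing, active phase duration) and dominant behavioral motifs. "
    "Do NOT classify the data. Do NOT assume labels. "
    "Provide a structural analysis of the corpus in approximately 200–300 words."
)

MODELS = ("Mistral", "LLaMA-3-70B-Instruct", "Granite")
CORPORA = ("BENIGN_SFE-5_1K", "DDoS_SFE-5_1K", "BENIGN_SFE-8_1K", "DDoS_SFE-8_1K")


def compose_context(prompt, corpus_text):
    """Place the whole corpus after the instructions as one context (layout assumed)."""
    return f"{prompt}\n\nCorpus:\n{corpus_text}"


def plan_runs(models, corpora):
    """Every (corpus, model) combination; each run reuses the same prompt."""
    return [{"corpus": c, "model": m} for c, m in product(corpora, models)]


runs = plan_runs(MODELS, CORPORA)  # 4 corpora x 3 models
```

Enumerating the grid up front makes it easy to confirm that every corpus was submitted to every model with an unchanged prompt.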
The results for each pair of datasets are presented in Table 4, Table 5, Table 6 and Table 7. These tables highlight both inter-model convergence and the semantic differences between the BENIGN and DDOS corpora. The comparative tables were produced through a direct and structured analysis of the outputs generated by the three foundation models for each dataset. For each corpus (BENIGN_SFE-5_1K, DDoS_SFE-5_1K, BENIGN_SFE-8_1K, DDoS_SFE-8_1K), the three responses generated by Mistral, LLaMA, and Granite were analyzed separately.
The analysis was organized strictly along the four dimensions that were explicitly imposed by the prompt: attribute variability, intensity (throughput and packet rate), temporal structure, and dominant behavioral motifs. For each dimension, the descriptions generated by the three models were compared directly, and the table synthesizes the common ideas and recurrent elements consistently observed across models. No additional automated summarization or classification procedures were applied to the generated outputs. The tables represent a systematic and comparative organization of the content produced by the models, without modification or semantic reinterpretation. The “Convergence” column indicates the degree of conceptual similarity among the descriptions generated by the three models for the same analytical dimension.
The comparative tables synthesize the results generated by each model across four analytical dimensions corresponding to the structure imposed by the prompt. Each row reflects a distinct category of semantic characterization, ensuring that comparisons are performed using homogeneous criteria across models and datasets. The row “Attribute variability” aggregates references to the distribution of descriptive flow values, such as packet size variability, average packet size, or directional balance. This row synthesizes how the models describe the degree of dispersion or amplitude of values along these dimensions, without introducing additional interpretation; it effectively reflects how broad the attribute spectrum is described to be within the analyzed corpus.
The row “Intensity (throughput/packet rate)” consolidates observations related to traffic levels, namely combinations of throughput and packet rate. It synthesizes mentions of very low or very high values, discrepancies between throughput and packet rate, and the way these intensities are distributed across the corpus. The intensity dimension is treated separately from attribute variability to avoid interpretative overlaps.
The row “Temporal” synthesizes elements related to the temporal organization of flows, including explicit references to inter-arrival timing and active phase duration, as well as descriptions of stable or burst-like transmission patterns when correlated with temporal dynamics. This row excludes intensity and focuses exclusively on rhythm, duration, and the temporal distribution of activity.
The row “Dominant motives” reflects how the models articulate recurrent behavioral patterns observed in the corpus, such as stable communication behavior, burst-like transmission, or persistent low-volume communication. This row does not introduce external evaluations but explicitly synthesizes the recurrent narrative structures used by the models to characterize the global behavior of the flows.
The “Convergence” column indicates the degree of semantic similarity among the descriptions generated by the three models for the same analytical dimension. It does not assess performance or accuracy, but rather the consistency of formulations across distinct architectures under identical experimental conditions.
Through this structure the tables do not represent a reinterpretation of the results, but rather a systematic organization of outputs according to the specifications imposed by the prompt, facilitating inter-model and inter-corpus comparison in a transparent and reproducible way.
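The convergence judgments reported in the tables were obtained by direct comparison of the model outputs, with no automated procedure. Purely as an illustration of what "degree of semantic similarity" could mean operationally, the sketch below computes a simple lexical proxy, the mean pairwise Jaccard overlap of descriptor terms. This is not the authors' method; the function names are hypothetical and the descriptor terms are invented for the example rather than taken from model outputs.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two term sets: shared terms divided by all terms (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_overlap(terms_by_model):
    """Average Jaccard overlap over all model pairs; requires at least two models."""
    pairs = list(combinations(terms_by_model.values(), 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


# Hypothetical descriptor terms for one analytical dimension (not real model outputs).
temporal_terms = {
    "Mistral": {"stable", "burst-like", "long active phase"},
    "LLaMA": {"stable", "burst-like", "low inter-arrival"},
    "Granite": {"stable", "burst-like", "low inter-arrival"},
}
score = mean_pairwise_overlap(temporal_terms)
```

A lexical proxy of this kind would miss paraphrases that a human reader recognizes as equivalent, which is one reason a qualitative comparison may be preferable for outputs of this length.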
For BENIGN_SFE-5_1K (Table 4), all three models describe the corpus as heterogeneous, dispersed, and exhibiting wide variation across all dimensions. No clearly dominant core emerges. Stable and burst-like patterns coexist without evident concentration. Inter-model convergence is high.
Compared with BENIGN_SFE-5_1K (Table 4), in the case of DDOS_SFE-5_1K (Table 5) the models place more explicit emphasis on recurrent intensity combinations and on the coexistence of distinct behavioral modes, frequently described as steady communication and burst-like transmission. Granite and LLaMA describe in almost identical terms the idea of two main behavioral modes: steady transmission and short-term high-intensity activity. Although feature variability remains present, the semantic description is more clearly organized around the correlation between throughput, packet rate, and the temporal patterns associated with these high-intensity episodes.
For BENIGN_SFE-8_1K (Table 6), all three models describe the corpus as exhibiting extensive attribute variability, with distributed values across the full spectrum (very low to very high) for throughput, packet rate, and packet size variability.
Active phase duration is frequently mentioned as being high; however, this feature appears in combination with variable intensities and different temporal patterns (stable and burst-like), without the models explicitly indicating the predominance of a single behavioral type. The overall characterization remains one of structural diversity, convergently supported by all three architectures.
In the case of DDOS_SFE-8_1K (Table 7), all three models frequently report high values of active phase duration and elevated levels of packet size variability and directional balance. Intensity (throughput and packet rate) is described as varying across a wide spectrum, but it is often associated with low inter-arrival timing and episodes of intense activity. Granite and LLaMA formulate very similar descriptions, using comparable conceptual structures to highlight the coexistence of stable and burst-like behaviors. Semantic characterization is consistent across models and remains stable across all analyzed dimensions.
The comparative analysis of the results for the BENIGN and DDOS datasets highlights a high level of inter-model convergence and consistent semantic differences between the two traffic types. Although the prompt does not mention classes, the models independently identified patterns that align with the JSD and entropy differences shown earlier.
According to Table 4, for the BENIGN_SFE-5_1K dataset all three models (Mistral, Granite, and LLaMA) describe the corpus as characterized by extensive attribute variability, a full spectrum of intensity levels (very low to very high), and the coexistence of stable and burst-like patterns without identifying a clearly dominant core. The global characterization is convergent: a diverse, heterogeneous, and structurally dispersed corpus.
In Table 5, corresponding to the DDOS_SFE-5_1K dataset, inter-model convergence is again visible, but there are differences in emphasis compared to BENIGN. The models more frequently highlight extreme intensities, recurrent combinations of throughput and packet rate, and a clearer structuring of behavior into distinct modes (steady versus burst). Granite and LLaMA formulate almost identical descriptions of the existence of two dominant behavioral modes. Compared to Table 4, the structure described is more focused on intensity and recurrent burst-like transmission patterns.
The results for the extended dimensionality SFE-8 maintain this trend. According to Table 6 (BENIGN_SFE-8_1K), the benign corpus is again described as exhibiting wide variability across multiple dimensions (packet size variability, average packet size, directional balance), intensities distributed across the full spectrum, and the coexistence of stable and burst-like patterns without evident structural concentration. Inter-model convergence remains strong, and the global characterization is consistent with the one observed for SFE-5.
In Table 7 (DDOS_SFE-8_1K), the differences relative to BENIGN become more pronounced. The models frequently emphasize very high active phase duration, high packet size variability, and intensity-supported patterns. Although formal variability remains present, the descriptions indicate a more coherent structure around burst-like behaviors or sustained high activity. Semantic convergence among Mistral, LLaMA, and Granite is again high, particularly regarding emphasis on intensity and prolonged activity.
Taken together, Table 4, Table 5, Table 6 and Table 7 reveal two distinct semantic characterization patterns. BENIGN datasets are consistently described as heterogeneous and dispersed, with broad distributions and no rigid dominance. In contrast, DDOS datasets are repeatedly associated with extreme intensities, bimodal or dual structuring, and the presence of recurrent intense transmission patterns. These differences appear stably regardless of the model used and persist across variations in SFE dimensionality (5 vs. 8).
Therefore, the results suggest robust inter-model convergence and a consistent association between the numerically demonstrated informational differences and the differences observed in language characterizations. Without assuming internal model mechanisms, it can be stated that the generated outputs reflect sensitivity to the distributional structure of the corpora. Thus, the semantic analysis performed through controlled inference in IBM WatsonX provides additional support for the hypothesis that the numerical informational separability between BENIGN and DDOS is associated with stable differences at the level of language abstraction.