4.5. Results
Table 4 and Table 5 present the evaluation results for baseline sequence-to-sequence models and various LLMs on the Kazakh text simplification task.
On the automatically generated test set (Table 4), Seq2Seq baselines fail to generalize, producing negligible BLEU and ROUGE scores and indicating that traditional encoder-decoder architectures lack sufficient capacity to model simplification under low-resource constraints. Zero-shot LLMs also struggle: all variants overgenerate, with length ratios between 3.17 and 5.38. While Llama-3.3-70B shows moderate gains in BLEU (5.53) and F1 (73.24), the absence of length control limits its overall performance.
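For reference, the surface metrics discussed here can be computed along the following lines. This is a minimal sketch, assuming the standard sacrebleu implementation of corpus BLEU and whitespace tokenization for the length ratio; the paper's exact evaluation script and tokenizer are not shown, so treat these choices as illustrative.

```python
# Minimal sketch of the surface metrics above: corpus BLEU via sacrebleu
# and the length ratio (total output tokens / total reference tokens).
# Whitespace tokenization for the length ratio is an assumption.
import sacrebleu

def surface_metrics(outputs: list[str], references: list[str]) -> dict:
    # sacrebleu expects a list of reference streams, one per reference set
    bleu = sacrebleu.corpus_bleu(outputs, [references])
    out_tokens = sum(len(o.split()) for o in outputs)
    ref_tokens = sum(len(r.split()) for r in references)
    # A ratio well above 1.0 flags the overgeneration seen in zero-shot outputs
    return {"BLEU": bleu.score, "length_ratio": out_tokens / ref_tokens}

# Placeholder example: an overlong output against a short reference
print(surface_metrics(
    ["бұл модель шығарған едәуір ұзын жеңілдетілген сөйлем"],
    ["қысқа сөйлем"],
))
```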
Domain-specific models, including kazLLM and Sherkala, demonstrate stable performance across all metrics and clearly outperform both Seq2Seq baselines and zero-shot LLMs. BLEU scores range from 19.59 to 21.52, and all three configurations achieve F1 scores above 83, indicating strong surface-level fluency and content preservation. In contrast to zero-shot outputs, length ratios remain close to 1.0, confirming better control over generation length.
Among these models, kazLLM-70B achieves the highest BLEU score (21.52) and the highest recall (86.65), with a length ratio of 1.06. Sherkala-8B, while slightly behind in BLEU (19.59), achieves the highest precision (82.58) and a longer average output (length ratio = 1.17). kazLLM-8B falls between the two, with balanced precision and recall (82.52 / 84.51) and a BLEU score of 20.72, showing that smaller-scale models can still generalize well when exposed to sufficient domain-specific data.
KazSim models outperform all baselines. KazSim (Llama-3.3-70B) achieves the best overall results, with a BLEU of 33.5, an F1 of 87.56, and a near-optimal length ratio of 0.98. Other KazSim variants (Qwen2-72B and Llama-3.2-3B) also perform well, confirming that targeted fine-tuning on task-specific Kazakh data is critical for achieving high-quality simplification.
Table 5 reports results on the semi-manually created test set for Kazakh text simplification. Compared to the automatically generated test set, performance patterns remain consistent, but scores are generally lower across all metrics. This drop suggests that the manually curated references are more diverse and structurally dissimilar from the model outputs, increasing the difficulty of achieving high lexical overlap.
Zero-shot models again show weak performance, with BLEU scores below 5 and F1 scores ranging from 58.42 to 72.08. Length ratios remain substantially inflated (3.06–5.60), indicating persistent overgeneration. Despite minor gains in F1, these models continue to underperform across all metrics, confirming the limitations of zero-shot simplification in low-resource settings.
Domain-specific models, kazLLM and Sherkala, maintain relatively good performance. BLEU scores fall between 16.35 and 17.09, and F1 remains above 82 for all configurations. Length ratios range from 1.04 to 1.25, consistent with more controlled generation behavior.
KazSim again outperforms all other approaches. KazSim (Llama-3.3-70B) achieves the highest BLEU (20.33) and F1 (84.27), with a near-optimal length ratio of 0.99. Other KazSim variants (Llama-3.2-3B and Qwen2-72B) also perform strongly, confirming that the benefits of fine-tuning extend to harder test cases with more diverse simplification references.
In comparison to the automatic test split, all models show slightly reduced BLEU and ROUGE scores on the manual set. This suggests that the manual references contain more lexical and syntactic variation, reducing surface-level overlap. However, models like KazSim that are trained on task-aligned supervision still generalize well, with minimal drop in F1 and consistent length control. Overall, the results reinforce the robustness of KazSim across evaluation settings and confirm that simplification in low-resource languages requires explicit adaptation not only to the language but also to the task.
Figure 4 presents SARI scores for all models evaluated on both the automatic and semi-manual test sets. SARI is used as the primary metric for measuring simplification quality, as it captures the balance between content preservation, deletion of unnecessary information, and appropriate addition of simplified expressions.
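As a concrete illustration, corpus-level SARI can be computed with the easse toolkit, which is commonly used for simplification evaluation. The paper does not name its SARI implementation, so the call below is one reasonable option rather than the authors' exact setup, and the Kazakh sentences are placeholder examples.

```python
# Sketch of corpus-level SARI with easse (https://github.com/feralvam/easse).
# SARI compares system outputs against both the source and the references,
# scoring the kept, added, and deleted n-grams that the references support.
from easse.sari import corpus_sari

orig_sents = ["Күрделі бастапқы сөйлем."]   # complex source sentences
sys_sents = ["Қарапайым сөйлем."]           # system simplifications
# refs_sents has shape (n_references, n_sentences): the outer list holds
# one full reference stream per reference set
refs_sents = [["Жеңіл сөйлем."]]

score = corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents,
                    refs_sents=refs_sents)
print(f"SARI = {score:.2f}")
```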
Seq2Seq baselines perform the worst, with SARI scores of 33.56 and 33.60. These results reflect the inability of standard encoder-decoder models to generalize under low-resource scenarios. Zero-shot models demonstrate slight performance improvements, with scores ranging from 33.92 (Llama-3.2-3B) to 40.02 (Llama-3.3-70B). Notably, performance on the semi-manual test set remains stable or slightly improves across all zero-shot models. This suggests that zero-shot models are not particularly sensitive to test set construction and may rely on generic rewriting patterns that generalize equally across both test types.
Domain-specific models, including kazLLM and Sherkala, achieve higher SARI scores in the 42.86–45.80 range, and show relatively small gaps between the two test sets. This indicates better stability and improved simplification quality when models are pretrained on Kazakh data.
KazSim models achieve the highest scores overall. KazSim (Llama-3.3-70B) reaches 56.38 on the automatic test set and 48.42 on the semi-manual set. Other KazSim variants (Qwen2 and Llama-3.2-3B) also outperform all baselines and zero-shot models. Although there is a drop in SARI when moving to the manual test set, KazSim maintains a clear advantage, showing that task-specific supervision enables better generalization.
To assess prompt sensitivity, we compare model outputs when the same simplification task is instructed in English versus Kazakh; results are summarized in Table 6. Overall, model performance remains relatively stable across prompt languages, though minor variations are observed. For domain-specific models (e.g., kazLLM and Sherkala), Kazakh prompts yield slightly higher SARI scores, indicating better alignment with the simplification objective when instructions are given in the target language; Sherkala, for instance, improves from 44.23 to 46.15 SARI with a Kazakh prompt. Zero-shot performance is more sensitive to prompt language: Llama-3.3-70B sees a notable increase in BLEU (from 4.53 to 8.26) and F1 (from 72.08 to 76.42) under Kazakh instructions, suggesting that instruction-following improves when the prompt matches the target output language. The proposed KazSim model remains robust under both conditions and achieves the best results across all metrics; performance is slightly higher with Kazakh prompts, reaching a SARI of 48.78 and an F1 of 84.39, confirming the benefit of aligning the prompt language with the generation task. These findings suggest that while multilingual LLMs can generalize across instruction languages, prompt-language alignment can further improve output quality, particularly in zero-shot settings and for domain-specific models trained on Kazakh data.
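For illustration, the two prompting conditions can be set up as below. Both templates are hypothetical stand-ins: the exact prompt wording used in the experiments is not reproduced here.

```python
# Hypothetical English and Kazakh instruction templates for the
# prompt-language comparison; the wording is illustrative, not the
# exact prompts used in the experiments.
PROMPT_EN = ("Simplify the following Kazakh sentence while preserving "
             "its meaning:\n{sentence}")
PROMPT_KK = ("Келесі қазақ тіліндегі сөйлемді мағынасын сақтай отырып "
             "жеңілдетіңіз:\n{sentence}")

def build_prompt(sentence: str, prompt_lang: str = "kk") -> str:
    # Only the instruction language changes; the input sentence and the
    # expected Kazakh output stay the same across both conditions.
    template = PROMPT_KK if prompt_lang == "kk" else PROMPT_EN
    return template.format(sentence=sentence)
```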