A. Dataset
This study uses the open e-SNLI dataset as its primary data source. e-SNLI contains natural language inference pairs augmented with human-annotated explanations: it extends the traditional inference task by attaching a free-form rationale to every sample, so each input consists of a premise, a hypothesis, and the corresponding explanation. This tight coupling of inference and rationale offers a natural basis for building the questioning, reflection, and renewed-reasoning mechanisms, and it allows the model to learn interpretable logical structures and reflection cues from human reasoning chains.
Since e-SNLI covers the three relations of entailment, contradiction, and neutrality, its explanations are rich in logical markers, contrastive cues, and key semantic evidence, which makes the dataset well suited for training a self-questioning mechanism. The model can use these explanation signals to generate questions about its own reasoning, and it can learn from the logical patterns in the explanations to identify inconsistency, semantic drift, or insufficient evidence. The paired natural-language samples are also linguistically diverse, which supports cross-sentence reasoning, semantic alignment, and reflective generation.
In addition, the open license of e-SNLI enables further exploration in complex reasoning, reflection-enhanced generation, and process supervision, and it allows natural integration with the cyclic supervision system proposed in this study. With the help of the explanatory annotations, the model can form a continuous structure that spans initial reasoning, question generation, reflection construction, and semantic calibration. This makes the dataset a key resource for exploring unified mechanisms of self-questioning and renewed reasoning: its scale is suitable, its structure is clear, and its logical tasks are well defined, all of which fit the unified cyclic reasoning supervision framework proposed in this research.
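For concreteness, the snippet below sketches how e-SNLI can be loaded and flattened into the (premise, hypothesis, explanation) triples the framework consumes. It relies on the public `esnli` release on the Hugging Face Hub; the field names follow that release, while the flattening itself (keeping only the first annotator's explanation) is an illustrative choice rather than the exact preprocessing used in this study.

```python
from datasets import load_dataset

# Public e-SNLI release: each record pairs an SNLI example with up to
# three human-written explanations (explanation_1 .. explanation_3).
esnli = load_dataset("esnli")

LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

def to_example(record):
    """Flatten one record into the (premise, hypothesis, explanation) triple
    used downstream; keeping only the first explanation is illustrative."""
    return {
        "premise": record["premise"],
        "hypothesis": record["hypothesis"],
        "label_name": LABELS[record["label"]],
        "explanation": record["explanation_1"],
    }

train = esnli["train"].map(to_example)
print(train[0])
```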
B. Experimental Results
First, this paper carries out a comparative evaluation between the proposed framework and several representative baseline methods, and the detailed performance of each approach is summarized in Table 1. This experiment provides an overall view of how the method behaves under the same task setting as existing models.
Compared with representative existing models, the cyclic self-questioning framework proposed in this study shows consistent and significant advantages across several key metrics. Traditional methods rely on static single-round generation or limited process supervision, so they often suffer from broken reasoning chains or local deviations when handling complex semantic relations or cross-sentence logical judgments. The results in Table 1 show that single-stage reasoning models generally reach accuracies of 78 to 83 percent. In contrast, the proposed method embeds questioning, reflection, and semantic calibration into the reasoning chain, allowing the final reasoning stage to continuously absorb self-feedback and correct unstable early-stage decisions. As a result, accuracy rises to 87.8 percent, which demonstrates the strong effect of the cyclic supervision structure on decision quality.
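To make the structure of this loop concrete, the sketch below shows one plausible organization of the four stages. All of the callables (`reason`, `question`, `reflect`, `calibrate`) are hypothetical placeholders for model invocations; the paper does not commit to this exact interface, and the convergence test on the predicted label is likewise an assumption.

```python
from typing import Callable, Tuple

def cyclic_reasoning(
    premise: str,
    hypothesis: str,
    reason: Callable[[str], Tuple[str, str]],     # context -> (label, rationale)
    question: Callable[[str, str, str], str],     # (context, label, rationale) -> question
    reflect: Callable[[str, str, str], str],      # (context, rationale, question) -> reflection
    calibrate: Callable[[str, str], str],         # (context, reflection) -> calibrated context
    max_rounds: int = 3,
) -> Tuple[str, str]:
    """Hypothetical sketch of the closed loop: initial reasoning, then rounds
    of self-questioning, reflection, and semantic calibration until the
    predicted label stabilizes or the round budget is exhausted."""
    context = f"Premise: {premise}\nHypothesis: {hypothesis}"
    label, rationale = reason(context)            # initial reasoning
    for _ in range(max_rounds):
        q = question(context, label, rationale)   # self-questioning
        r = reflect(context, rationale, q)        # reflection on the question
        context = calibrate(context, r)           # fold reflection back into context
        new_label, rationale = reason(context)    # renewed reasoning
        if new_label == label:                    # converged to a stable decision
            break
        label = new_label
    return label, rationale
```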
For explanation consistency, the differences in BLEU and BERTScore are especially clear. They indicate a structural improvement in the quality of reflective content and final reasoning explanations. Traditional models depend on a single explanation step and often fail to reduce semantic drift between the input and the internal reasoning process. The semantic calibration module in this study integrates reflective content in a structured way. It provides more precise contextual information for the renewed reasoning stage. This significantly improves the alignment between generated explanations and the true semantic chain. The higher BLEU and BERTScore values show that the cyclic feedback mechanism enhances the coherence of the entire reasoning chain and makes explanation generation more reliable and stable.
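As a point of reference, explanation consistency of this kind can be scored with standard open-source tooling. The sketch below uses `sacrebleu` and `bert-score`, which is one reasonable choice rather than a confirmed description of the evaluation pipeline used here; scoring against a single gold explanation per example is also a simplification.

```python
import sacrebleu
from bert_score import score as bert_score

def explanation_metrics(generated, references):
    """Corpus BLEU and mean BERTScore F1 between generated explanations and
    gold e-SNLI explanations (one reference per example, for simplicity)."""
    bleu = sacrebleu.corpus_bleu(generated, [references]).score
    _, _, f1 = bert_score(generated, references, lang="en",
                          rescale_with_baseline=True)
    return {"bleu": bleu, "bertscore_f1": f1.mean().item()}
```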
More importantly, the self-consistency metric shows that the proposed method achieves higher stability across multiple rounds of reasoning. Traditional approaches often produce scattered or even contradictory answers under repeated inference because they lack internal correction mechanisms. The unified cyclic supervision system introduced in this study nests questioning, reflection, and calibration into a closed reasoning loop and guides the model to converge step by step toward a consistent semantic state. Self-consistency rises from the 63 to 72 percent range of traditional models to 81.4 percent under the proposed approach, showing that the model not only generates more accurate results but also maintains its reasoning trajectory more stably. This provides structural evidence for the effectiveness of cyclic self-supervision.
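Self-consistency can be operationalized in several ways; one common reading, assumed here for concreteness, is the average rate at which repeated sampled runs agree with their own majority answer:

```python
from collections import Counter

def self_consistency(runs_per_example):
    """For each example, the fraction of K sampled answers that agree with
    the majority answer, averaged over the dataset."""
    scores = []
    for answers in runs_per_example:
        _, count = Counter(answers).most_common(1)[0]
        scores.append(count / len(answers))
    return sum(scores) / len(scores)

# Five sampled rounds for each of two examples -> (0.8 + 1.0) / 2 = 0.9
print(self_consistency([
    ["entailment"] * 4 + ["neutral"],
    ["contradiction"] * 5,
]))
```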
Furthermore, this paper investigates how different learning rate configurations influence the training behavior and final performance of the model, with the corresponding settings and outcomes organized in Table 2. This analysis helps to identify a reasonable learning rate range that balances convergence stability and optimization efficiency.
The learning rate sensitivity experiment shows that the cyclic self-questioning framework has a clear and stable response pattern during optimization. A small learning rate, such as 0.0001, ensures stable gradient updates. However, it fails to fully absorb the high-level semantic feedback produced during questioning and reflection in the early training stage. This leads to lower performance in reasoning accuracy, explanation quality, and self-consistency. When the learning rate increases to a medium range, such as 0.0002 to 0.0003, the model can integrate questioning signals and reflective information more effectively. The semantic calibration module can adjust the reasoning chain in a more timely manner. As a result, BLEU, BERTScore, and self-consistency show clear improvements, and the semantic cycle becomes more coherent.
When the learning rate further increases to 0.0005, the model converges faster. Yet the larger update step reduces the ability to make fine-grained semantic adjustments guided by questioning and reflection. This causes slight declines in reasoning consistency and explanation stability. This trend aligns closely with the characteristics of the internal cyclic supervision mechanism. The framework relies on a stable semantic evolution process and injects questioning, reflection, and calibration into the reasoning structure step by step. A small learning rate limits absorption, while a large learning rate undermines the detailed structure of the reasoning chain. A medium learning rate achieves the best balance. The overall results confirm that the cyclic reasoning system requires careful tuning of the learning rate to ensure that internal reasoning feedback is absorbed fully and stably.
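The sweep itself is straightforward to express. The sketch below lays out an illustrative grid over the four learning rates discussed above using Hugging Face `TrainingArguments`; every setting other than the learning rate (batch size, epochs, warmup) is an assumed value held fixed across runs, and `train_and_evaluate` is a hypothetical driver.

```python
from transformers import TrainingArguments

for lr in (1e-4, 2e-4, 3e-4, 5e-4):          # the rates discussed above
    args = TrainingArguments(
        output_dir=f"runs/lr_{lr}",
        learning_rate=lr,
        per_device_train_batch_size=16,      # assumed, held fixed
        num_train_epochs=3,                  # assumed, held fixed
        warmup_ratio=0.06,                   # brief warmup stabilizes early updates
        logging_steps=100,
    )
    # train_and_evaluate(args)               # hypothetical training/eval driver
```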
In addition, this paper designs a sensitivity study on the upper bound of the reflection length, and the corresponding configurations and observations are illustrated in Figure 2. By varying the maximum reflection length, the experiment explores how much reflective content is needed to effectively support multi-step reasoning without introducing excessive redundancy.
As the maximum reflection length is increased, the model’s overall reasoning capability is consistently enhanced, indicating that richer reflective content provides more effective support for complex inference. When the reflection length is short, the model captures only limited information during the reflection stage. It cannot accurately locate errors in the intermediate reasoning chain or complete missing semantic links. This results in weaker performance in accuracy and BLEU. When the reflection length increases from 64 to 128 and 256, the model can generate more complete self-questioning content. The semantic coverage of the reflection stage expands. This allows the model to better identify weak points in its own reasoning and to form a more reliable logical path in the renewed reasoning stage.
A similar trend appears in explanation-related metrics. BERTScore improves steadily as the maximum reflection length increases. This indicates that a longer reflection space helps the model align the semantic calibration mechanism with the input more effectively. It enhances consistency within and across the reasoning chain. A short reflection length often leads to brief or fragmented reflective content and fails to address implicit relations in the reasoning process. A larger reflection length allows the model to generate more complete and better supported reflective text. This strengthens the semantic alignment between the final explanation and the input.
The self-consistency metric also increases with longer reflection lengths. This shows that a more complete reflection chain helps the model form a more stable semantic state during cyclic reasoning. A long reflection space enables the model to absorb feedback from self-questioning more fully in each reasoning round. This makes the internal semantic representation more stable. When the reflection length reaches 512, self-consistency achieves its highest level. This indicates that the flow of information across questioning, reflection, and calibration is used most effectively when the reflection capacity is large. The model maintains a high level of consistency across multiple reasoning rounds. This highlights the essential role of the cyclic self-supervision system in enhancing reasoning stability.
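In generation terms, the reflection-length cap corresponds to limiting the number of new tokens produced at the reflection stage. A minimal sketch is shown below, assuming a decoder-only model with the standard Hugging Face `generate` interface; the prompt wording is illustrative, not the exact template used in this work.

```python
def generate_reflection(model, tokenizer, context, max_reflection_len):
    """Cap the reflection stage at `max_reflection_len` new tokens
    (64 / 128 / 256 / 512 in the sensitivity study)."""
    prompt = f"{context}\nReflect on the reasoning above and point out any gaps:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs,
                         max_new_tokens=max_reflection_len,
                         do_sample=False)
    # Decode only the newly generated tokens (decoder-only models prepend
    # the prompt to the output sequence).
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```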
Finally, this paper examines the effect of different inference temperature settings on the behavior of the reasoning process, and the relevant curves and comparisons are presented in Figure 3. This part focuses on how the degree of sampling randomness influences not only accuracy but also the diversity and consistency of the generated reasoning paths.
Varying the reasoning temperature produces pronounced changes in decision stability and noticeably alters how well the model can internalize and utilize feedback within the cyclic reasoning process. At low temperature settings, the generation process becomes more deterministic. The information flow among questioning, reflection, and renewed reasoning remains stable and compact. At temperatures of 0.2 and 0.4, accuracy remains high. This indicates that the cyclic supervision structure can receive and utilize internal feedback more effectively in low randomness conditions and produce more reliable final reasoning results. When the temperature increases to 0.6 and 0.8, the randomness of generation grows. Reflective content and renewed reasoning paths become more prone to drift or local jumps, which leads to a gradual decline in accuracy.
A similar trend appears in explanation consistency metrics. BLEU and BERTScore remain high under low temperature settings. This shows that the model can generate reflective and explanatory texts with more stable semantic structures and maintain stronger coherence within the reasoning chain. As the temperature rises, the uncertainty of generation increases. The model is more likely to produce redundant or weakly related content during the reflection stage. This weakens its ability to maintain semantic alignment. At a temperature of 0.8, the evaluation metrics decrease noticeably. This indicates that high randomness disrupts the semantic calibration mechanism and weakens the correspondence between the explanation and the input evidence.
The self-consistency metric further reveals the deep influence of temperature on the cyclic reasoning framework. Self-consistency reaches its best levels under low temperature settings. This shows that the model can maintain a consistent logical state across multiple rounds of reasoning and that questioning, reflection, and renewed reasoning reinforce each other within a stable cycle. As temperature increases, the controllability of internal feedback declines. The model tends to produce divergent reasoning paths during consecutive cycles, which leads to a clear reduction in self-consistency. This shows that the cyclic supervision structure is highly sensitive to output stability and that temperature has a direct impact on whether its core mechanism can function effectively.
Overall, the results confirm that the cyclic chain of questioning, reflection, calibration, and renewed reasoning depends strongly on stability in the generation process. Moderate determinism helps ensure accurate transmission of reflective information and allows continuous optimization of the reasoning path. Excessive randomness weakens the feedback mechanism and prevents the model from maintaining a consistent semantic state across iterative cycles. The reasoning temperature experiment clearly illustrates how the cyclic self-supervision system behaves under different levels of generation control and provides important guidance for future model optimization.
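Operationally, the temperature settings studied above map directly onto the sampling configuration of the decoder. A brief sketch, again assuming a standard Hugging Face generation interface; the nucleus cutoff and token budget are assumed constants:

```python
import torch

def sample_reasoning(model, tokenizer, prompt, temperature):
    """Sample one reasoning path at the given temperature (0.2-0.8 above).
    Lower temperatures make the cyclic feedback more deterministic."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,   # scales the next-token distribution
            top_p=0.95,                # assumed nucleus cutoff, held fixed
            max_new_tokens=256,        # assumed budget per reasoning step
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```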
This paper also presents an experiment on the sensitivity of accuracy to inference parallelism settings, aiming to examine how different levels of parallel reasoning influence the stability and reliability of the model's decision process. By adjusting the degree of parallelism during the inference stage, the experiment evaluates how expanded or reduced branching paths interact with the cyclic self-supervision mechanism. This design helps reveal whether the internal feedback loop, consisting of questioning, reflection, and semantic calibration, can remain effective under varying inference configurations. The experimental results corresponding to this analysis are shown in Figure 4.
Adjusting the degree of inference parallelism leads to distinct differences in accuracy, and this influence is tightly coupled with the structural properties of the cyclic self-questioning framework itself. At low levels of parallelism, the model can fully absorb feedback from the questioning and reflection stages in each reasoning round. This allows the paths of semantic calibration and renewed reasoning to remain aligned and produces a stable and compact reasoning chain. The high consistency of this information flow promotes gradual convergence of the internal reasoning state. As a result, the model maintains high accuracy when the parallelism is set to 1 or 2.
When the parallelism increases to 4, the diversity of generated content expands. Different reasoning branches begin to diverge more in their semantic structure. The self-questioning mechanism can still correct local reasoning steps. However, the noise brought by multiple branches interferes with the ability of the reflection stage to capture the most important evidence. The calibration step cannot fully cover all reasoning paths. This leads to a slight decrease in accuracy. The results suggest that moderate parallelism can provide useful redundancy, but too many branches weaken the focused feedback effect of the cyclic supervision mechanism.
When the parallelism increases to 8, the number of reasoning branches grows sharply. The model faces greater uncertainty in the renewed reasoning stage, and the internal semantic alignment becomes harder to maintain across branches. Conflicts among a large number of parallel reasoning results increase the burden on semantic calibration. The internal representation space struggles to form a unified semantic center during iteration. This causes accuracy to drop more noticeably. High parallelism enlarges the exploration space, but it reduces the effectiveness of the reflection chain within the cyclic reasoning system.
Overall, the results further confirm the structural dependency of the cyclic self-questioning framework on reasoning stability. Moderate determinism and a limited number of generated branches help maintain a robust semantic feedback loop and allow the model to make full use of the internal supervision signals provided by questioning and reflection. Excessive parallelism disrupts the consistency of the internal iterative chain and makes it difficult for the reasoning process to maintain a coherent direction under noise. This highlights the need to tune inference parameters carefully around the cyclic supervision mechanism to ensure that the model remains in a reasoning state conducive to convergence.
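For completeness, the parallelism knob can be realized by drawing several reasoning branches per input and reducing them to one decision. The sketch below uses `num_return_sequences` with a simple majority vote; the vote is an assumed reduction rule, since the calibration stage may aggregate branches differently, and `extract_label` is a hypothetical parser.

```python
from collections import Counter

def parallel_reasoning(model, tokenizer, prompt, k, extract_label):
    """Draw k parallel reasoning branches (k = 1/2/4/8 in the study) and
    reduce them by majority vote. `extract_label` is a hypothetical parser
    that pulls the predicted relation out of a generated continuation."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.4,            # assumed fixed sampling temperature
        num_return_sequences=k,     # the parallelism knob under study
        max_new_tokens=256,
    )
    prompt_len = inputs["input_ids"].shape[1]
    branches = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True)
                for o in outs]
    return Counter(extract_label(b) for b in branches).most_common(1)[0][0]
```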