1. Introduction
Code generation is an important application of large language models (LLMs): it can automate software development tasks and accelerate programming workflows. Ultra-large models such as Qwen-72B have shown strong results, yet producing consistently high-quality code remains difficult. Full-parameter fine-tuning is effective but costly, while lightweight adaptation methods reduce cost at the expense of the deep semantics and structural integrity that executable code requires. Instruction alignment through supervised learning or reinforcement learning from human feedback improves task conformity, but it neither exploits execution semantics nor enforces structural fidelity. Current methods also rely heavily on token-level accuracy, which overlooks deeper program dependencies, and they offer little control over style and redundancy. These limitations motivate a framework that unifies semantic, syntactic, and structural optimization with controllable generation and adaptive context use. CodeFusion-Qwen72B is built to address these problems through progressive adaptation, hybrid optimization, multimodal integration, and structure-preserving objectives. This work takes a technical approach to making Qwen-72B more adaptable and effective for code generation, aiming for outputs that are both functionally correct and structurally coherent.
2. Related Work
Recent work in code generation stresses structural modeling and efficiency. Tipirneni et al. [1] proposed StructCoder, a Transformer that uses AST and data-flow information to improve structural consistency. Sirbu and Czibula [2] applied abstract-syntax-based encoding to generate malware detection code from MITRE ATT&CK techniques. Gong et al. [3] presented AST-T5, a structure-aware pretraining model that uses syntactic segmentation to enhance code generation and understanding.

Work on parameter-efficient adaptation has also grown. Wang et al. [4] proposed LoRA-GA, which uses gradient approximation to improve optimization. Hayou et al. [5] introduced LoRA+, which uses different learning rates for better convergence. Hounie et al. [6] extended this line with LoRTA, which uses tensor adaptation. Zhang et al. [7] presented AutoLoRA, which tunes matrix ranks via meta-learning. Chen et al. [8] combined structure-aware signals with parameter-efficient tuning to approach the results of full fine-tuning.

Li et al. [9] showed that large code models work well as few-shot information extractors. Guan [10] used logistic regression and decision trees for medical claim denial prediction, demonstrating that interpretable models remain useful in specialized tasks.
3. Methodology
Enhancing the code generation capabilities of ultra-large language models requires not only parameter-efficient fine-tuning but also structure-aware and context-rich optimization. We propose CodeFusion-Qwen72B, a multi-stage fine-tuning framework for the Qwen-72B model that integrates Progressive Low-Rank Adaptation (PLoRA), Hybrid Instruction Optimization (HIO), Multi-Context Fusion (MCF), Structure-Preserving Hybrid Loss (SPHL), Controllable Generation Decoder (CGD), and Adaptive Prompt Retrieval (APR). PLoRA progressively adapts model layers to domain-specific semantics, while HIO combines supervised fine-tuning with reinforcement learning from human feedback to align both functionality and readability. MCF enriches inputs with problem descriptions, code comments, and execution traces, and SPHL enforces token, syntactic, and semantic coherence. CGD and syntax-constrained decoding enable controllable, compilable generation, and APR provides dynamic retrieval of relevant code snippets. We evaluate CodeFusion-Qwen72B on three benchmarks: HumanEval, MBPP, and CodeContests. Compared to a fully fine-tuned Qwen-72B baseline, our framework improves Pass@1 accuracy from 53.8% to 65.4% (+11.6 points), BLEU from 71.2 to 78.9 (+7.7), and execution success rate from 68.3% to 81.5% (+13.2 points). Ablation studies confirm that MCF and SPHL contribute most to execution gains, while PLoRA and HIO drive significant improvements in syntactic and semantic alignment. These results demonstrate that multi-stage, multi-context, and structure-aware adaptation can substantially advance the state of the art in LLM-based code generation. The pipeline of CodeFusion-Qwen72B is shown in Figure 1.
4. Algorithm and Model
The proposed CodeFusion-Qwen72B is a multi-stage, multi-objective fine-tuning framework for enhancing the code generation capability of ultra-large language models, built upon the Qwen-72B architecture. This section introduces the key components of the framework, their technical motivations, and implementation details.
4.1. Qwen-72B Backbone Architecture
Qwen-72B adopts a Transformer-based autoregressive decoder architecture with \(L\) stacked layers, each containing multi-head self-attention (MHSA) and feed-forward networks (FFN). The standard attention mechanism computes:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V,
\]

where \(M\) is an attention mask ensuring causal dependency. While this formulation captures long-range dependencies, it does not explicitly enforce programming-specific constraints such as variable scope or syntactic validity, motivating the introduction of structure-aware adaptations.
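As a concrete illustration, the sketch below computes the masked attention above for a single head in plain NumPy; the tensor shapes and the additive -inf mask convention are assumptions made for readability, not details of the Qwen-72B implementation.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention as in the formula above (single head, batch omitted).

    Q, K, V: arrays of shape (seq_len, d_k). The additive mask M places -inf
    above the diagonal so position t can only attend to positions <= t.
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                          # (seq_len, seq_len)
    M = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)   # causal mask
    scores = scores + M
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V                                       # (seq_len, d_k)
```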
4.2. Progressive Low-Rank Adaptation (PLoRA)
Full-parameter fine-tuning of Qwen-72B is computationally prohibitive and risks overfitting to domain-specific patterns. We employ Progressive LoRA (PLoRA), a staged adaptation strategy where LoRA updates are first applied to high-level decoder layers to capture semantic patterns, then gradually extended to lower layers to refine syntactic generation. The parameter update is:

\[
W' = W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k),
\]

where \(W_0\) is a frozen pretrained weight matrix and only the low-rank factors \(B\) and \(A\) are trained. This gradual adaptation mitigates catastrophic forgetting and allows stable convergence.
Figure 2 details the Progressive LoRA adaptation mechanism, which strategically fine-tunes the 72-layer transformer architecture in three stages.
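The sketch below illustrates the staged PLoRA schedule described above; the stage boundaries, rank, and scaling factor are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np

RANK, ALPHA = 16, 32           # illustrative LoRA rank and scaling (assumed values)
NUM_LAYERS = 72
# Stage boundaries are assumptions: adapt top layers first, then extend downward.
STAGES = [range(48, 72), range(24, 72), range(0, 72)]

class LoRAWeight:
    """W' = W0 + (alpha / r) * B @ A, with W0 frozen and only A, B trained."""
    def __init__(self, W0):
        d_out, d_in = W0.shape
        self.W0 = W0                                    # frozen pretrained weight
        self.A = np.random.randn(RANK, d_in) * 0.01     # low-rank factor A
        self.B = np.zeros((d_out, RANK))                # B starts at zero => W' == W0
        self.trainable = False

    def effective_weight(self):
        return self.W0 + (ALPHA / RANK) * self.B @ self.A

def set_stage(layers, stage_idx):
    """Enable LoRA training only for the layers covered by the current stage."""
    active = set(STAGES[stage_idx])
    for i, layer in enumerate(layers):
        layer.trainable = i in active
```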
4.3. Hybrid Instruction Optimization (HIO)
Code generation quality depends heavily on alignment with user instructions. HIO combines supervised fine-tuning with reinforcement learning from human feedback (RLHF). The RLHF loss is:

\[
\mathcal{L}_{\mathrm{RLHF}} = -\,\mathbb{E}_{y \sim \pi_\theta}\big[\, R_{\mathrm{pref}}(x, y) + \lambda\, R_{\mathrm{func}}(x, y) \,\big],
\]

where \(R_{\mathrm{pref}}(x, y)\) measures preference alignment and \(R_{\mathrm{func}}(x, y)\) scores code functionality and readability. This hybrid approach ensures the model learns both semantic compliance and executable correctness.
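A minimal sketch of how the supervised and reward-based terms could be combined is shown below; the reward decomposition and the weights are assumptions for illustration, not the exact training objective.

```python
def hio_loss(sft_loss, pref_reward, func_reward, lam_pref=0.5, lam_func=0.5):
    """Hybrid instruction objective: supervised loss minus weighted rewards.

    sft_loss:    token-level cross-entropy on instruction-following data
    pref_reward: scalar from a preference model (human-feedback alignment)
    func_reward: scalar scoring functionality/readability (e.g. unit-test pass rate)
    lam_pref and lam_func are illustrative hyperparameters.
    """
    return sft_loss - lam_pref * pref_reward - lam_func * func_reward

# Example: hio_loss(2.31, pref_reward=0.8, func_reward=0.6) -> 1.61
```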
4.4. Multi-Context Fusion (MCF)
High-quality code generation requires integrating multiple context sources beyond the textual problem description. MCF fuses three modalities: natural language descriptions, code comments, and execution traces. Each is encoded separately:

\[
H_{\mathrm{nl}} = \mathrm{Enc}(X_{\mathrm{nl}}), \qquad H_{\mathrm{cm}} = \mathrm{Enc}(X_{\mathrm{cm}}), \qquad H_{\mathrm{ex}} = \mathrm{Enc}(X_{\mathrm{ex}}).
\]

The fusion is implemented via gated attention:

\[
H_{\mathrm{fused}} = \sum_{m \in \{\mathrm{nl},\, \mathrm{cm},\, \mathrm{ex}\}} g_m \odot H_m, \qquad g = \mathrm{softmax}\big(W_g\,[H_{\mathrm{nl}}; H_{\mathrm{cm}}; H_{\mathrm{ex}}]\big),
\]

where \(W_g\) is a learned gating projection. This ensures that execution semantics and natural language semantics are jointly considered during generation.
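The following sketch shows one way such gated fusion could be realized; the gating parameterization (a single projection producing per-position modality weights) is an assumption for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(h_desc, h_comment, h_trace, W_gate):
    """Fuse three context encodings with a learned gate.

    h_desc, h_comment, h_trace: arrays of shape (seq, d), one per modality.
    W_gate: (3*d, 3) projection producing one gate logit per modality and position
            (an assumed parameterization for illustration).
    """
    stacked = np.concatenate([h_desc, h_comment, h_trace], axis=-1)   # (seq, 3d)
    gates = softmax(stacked @ W_gate, axis=-1)                        # (seq, 3)
    return (gates[:, 0:1] * h_desc
            + gates[:, 1:2] * h_comment
            + gates[:, 2:3] * h_trace)                                # (seq, d)
```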
4.5. Structure-Preserving Hybrid Loss (SPHL)
Token-level accuracy is insufficient for functional code correctness. SPHL integrates token-level cross-entropy, AST-based structure regularization, and semantic similarity:

\[
\mathcal{L}_{\mathrm{SPHL}} = \lambda_1\, \mathcal{L}_{\mathrm{CE}} + \lambda_2\, \mathcal{L}_{\mathrm{AST}} + \lambda_3\, \mathcal{L}_{\mathrm{sem}},
\]

where \(\mathcal{L}_{\mathrm{AST}}\) is computed via tree edit distance between predicted and reference ASTs, and \(\mathcal{L}_{\mathrm{sem}}\) penalizes low cosine similarity between program dependency graph embeddings to preserve logical flow.
Figure 3 demonstrates the Multi-Context Fusion module’s gated attention mechanism for integrating heterogeneous code-related inputs and the Structure-Preserving Hybrid Loss formulation.
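A compact sketch of the hybrid loss follows; the loss weights are placeholders, the tree_edit_distance callback is a hypothetical helper supplied by the caller, and the cosine-based semantic term is one plausible instantiation of the formulation above.

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def sphl_loss(ce_loss, pred_ast, ref_ast, pred_pdg, ref_pdg,
              tree_edit_distance, lam_ast=0.3, lam_sem=0.2):
    """Structure-Preserving Hybrid Loss = CE + AST penalty + semantic penalty.

    tree_edit_distance: caller-supplied function over AST pairs (hypothetical
    helper, e.g. an APTED/zss-style implementation). pred_pdg / ref_pdg are
    program dependency graph embeddings. lam_ast and lam_sem are illustrative.
    """
    ast_penalty = tree_edit_distance(pred_ast, ref_ast)      # structural deviation
    sem_penalty = 1.0 - cosine_sim(pred_pdg, ref_pdg)        # low similarity => high loss
    return ce_loss + lam_ast * ast_penalty + lam_sem * sem_penalty
```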
4.6. Controllable Generation Decoder (CGD)
The CGD module enables control over the style and verbosity of generated code by injecting a control vector \(c\) into the decoder:

\[
h_t' = h_t + W_c\, c,
\]

where \(h_t\) is the decoder hidden state at step \(t\) and \(W_c\) is a learned projection. This mechanism supports different generation modes, such as minimal implementation or detailed, readable code, while maintaining syntax validity through constrained decoding:

\[
y_t = \arg\max_{v \in \mathcal{V}_{\mathrm{valid}}(y_{<t})} p_\theta\big(v \mid y_{<t}, x\big),
\]

where \(\mathcal{V}_{\mathrm{valid}}(y_{<t})\) is the set of tokens permitted by the target language grammar given the partial output.
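The decoding step below sketches both mechanisms together; the control-vector injection and the grammar-derived set of valid token ids are simplified assumptions for illustration.

```python
import numpy as np

def controllable_step(hidden, control_vec, W_ctrl, vocab_logits_fn, valid_token_ids):
    """One decoding step with style control and syntax-constrained token selection.

    hidden:          decoder hidden state, shape (d,)
    control_vec:     style/verbosity control vector, shape (c,)
    W_ctrl:          (c, d) projection injecting the control signal (assumed form)
    vocab_logits_fn: maps a hidden state to vocabulary logits
    valid_token_ids: token ids the grammar/parser currently allows
    """
    steered = hidden + control_vec @ W_ctrl          # inject control signal
    logits = vocab_logits_fn(steered)                # (vocab_size,)
    mask = np.full_like(logits, -np.inf)
    mask[valid_token_ids] = 0.0                      # keep only syntactically valid tokens
    return int(np.argmax(logits + mask))
```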
4.7. Adaptive Prompt Retrieval (APR)
APR dynamically augments prompts with semantically similar code snippets retrieved from an indexed corpus:

\[
\mathcal{R}(q) = \operatorname*{top\text{-}k}_{c \in \mathcal{C}} \; \cos\big(e_q, e_c\big),
\]

where \(e_q\) and \(e_c\) are embeddings of the query and a candidate snippet, respectively.
These retrieved examples provide explicit patterns for the model to adapt, improving zero-shot and few-shot performance.
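A minimal retrieval sketch, assuming precomputed snippet embeddings from an unspecified encoder:

```python
import numpy as np

def retrieve_snippets(query_emb, corpus_embs, corpus_snippets, k=3):
    """Return the k corpus snippets most similar to the query embedding.

    query_emb:   (d,) embedding of the task query
    corpus_embs: (n, d) precomputed snippet embeddings (the encoder choice
                 is left open in this sketch)
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    c = corpus_embs / (np.linalg.norm(corpus_embs, axis=1, keepdims=True) + 1e-8)
    scores = c @ q                                    # cosine similarities
    top = np.argsort(-scores)[:k]
    return [corpus_snippets[i] for i in top]
```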
5. Data Preprocessing
The quality and consistency of training data are critical in adapting Qwen-72B for code generation. Our preprocessing pipeline ensures syntactic correctness, semantic diversity, and domain alignment before fine-tuning.
5.1. Code Deduplication and Normalization
Large-scale code corpora often contain near-duplicate functions, which can cause overfitting and reduce model generalization. We apply a MinHash-based similarity filtering mechanism to remove code pairs with Jaccard similarity above a threshold \(\tau\):

\[
J(S_i, S_j) = \frac{\lvert N(S_i) \cap N(S_j) \rvert}{\lvert N(S_i) \cup N(S_j) \rvert} > \tau,
\]

where \(N(S)\) denotes the set of n-grams extracted from code snippet \(S\). Normalization steps such as consistent indentation, removal of trailing whitespace, and unification of naming conventions are also applied to reduce variability from irrelevant factors.
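For clarity, the sketch below computes exact Jaccard similarity over character n-grams; at corpus scale this comparison would be approximated with MinHash/LSH rather than evaluated pairwise, and the threshold value shown is illustrative.

```python
def char_ngrams(code, n=5):
    """Set of character n-grams, playing the role of N(S) in the formula above."""
    return {code[i:i + n] for i in range(max(len(code) - n + 1, 1))}

def jaccard(code_a, code_b, n=5):
    a, b = char_ngrams(code_a, n), char_ngrams(code_b, n)
    return len(a & b) / max(len(a | b), 1)

def deduplicate(snippets, tau=0.85):
    """Greedy filter: drop any snippet whose similarity to a kept one exceeds tau."""
    kept = []
    for s in snippets:
        if all(jaccard(s, k) <= tau for k in kept):
            kept.append(s)
    return kept
```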
5.2. Syntactic Validation and Semantic Tagging
We use language-specific parsers to validate the syntactic correctness of each code snippet. Invalid code is automatically repaired using a lightweight sequence-to-sequence correction model. Additionally, we apply semantic tagging by extracting function names, API calls, and language constructs, and encoding them as auxiliary features \(t\) that are concatenated with the input representation:

\[
\tilde{x} = \big[\, x \,;\, E_{\mathrm{tag}}(t) \,\big],
\]

where \(E_{\mathrm{tag}}\) embeds the extracted tags.
This tagging improves the model’s ability to attend to function-level semantics during fine-tuning.
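As an illustration restricted to Python, the snippet below uses the standard ast module to validate a snippet and extract function and API-call tags; other languages would require their own parsers, and the repair step is omitted from this sketch.

```python
import ast

def validate_and_tag(code):
    """Parse a Python snippet; return (is_valid, tags) with function and API-call names."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False, {}               # would be routed to the repair model instead
    tags = {"functions": [], "api_calls": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            tags["functions"].append(node.name)
        elif isinstance(node, ast.Call):
            fn = node.func
            if isinstance(fn, ast.Name):
                tags["api_calls"].append(fn.id)
            elif isinstance(fn, ast.Attribute):
                tags["api_calls"].append(fn.attr)
    return True, tags
```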
6. Prompt Engineering Techniques
Effective prompt design significantly impacts the code generation quality of LLMs. In CodeFusion-Qwen72B, we integrate retrieval-augmented prompting and progressive prompt construction to enhance both accuracy and controllability.
6.1. Retrieval-Augmented Prompt Construction
We maintain an indexed database of high-quality code exemplars. Given a new task query \(q\), we compute its embedding \(e_q\) and retrieve the top-\(k\) similar examples based on cosine similarity:

\[
\mathrm{sim}(q, c_i) = \frac{e_q \cdot e_{c_i}}{\lVert e_q \rVert\, \lVert e_{c_i} \rVert}.
\]
These retrieved examples are concatenated into the prompt before the target instruction. This method injects explicit patterns for the model to imitate, improving few-shot and zero-shot code generation performance.
6.2. Progressive Prompt Structuring
Instead of providing the entire problem description at once, we adopt a progressive reveal strategy, where the prompt is expanded in stages:
1. Present a high-level task description.
2. Include a partial code skeleton or function signature.
3. Gradually add constraints, examples, and edge cases.

Formally, let \(p_1, p_2, \ldots, p_K\) denote the progressive prompt segments; the model receives:

\[
y \sim p_\theta\big(y \mid p_1 \oplus p_2 \oplus \cdots \oplus p_K\big),
\]

where \(\oplus\) denotes concatenation.
This technique conditions the model’s attention sequentially, improving coherence and reducing hallucination in generated code.
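A simple helper that assembles the staged segments into a single prompt is sketched below; the section headers are illustrative formatting choices, not a fixed template.

```python
def build_progressive_prompt(task_description, skeleton=None, constraints=None, examples=None):
    """Assemble the staged prompt segments in the order described above.

    Later stages are optional so the prompt can be revealed incrementally.
    """
    segments = [f"### Task\n{task_description}"]
    if skeleton:
        segments.append(f"### Code skeleton\n{skeleton}")
    if constraints:
        segments.append("### Constraints\n" + "\n".join(f"- {c}" for c in constraints))
    if examples:
        segments.append("### Examples\n" + "\n".join(examples))
    return "\n\n".join(segments)
```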
7. Evaluation Metrics
To comprehensively evaluate the performance of CodeFusion-Qwen72B and baseline models, we adopt six complementary metrics that assess syntactic correctness, semantic fidelity, and functional accuracy.
7.1. Pass@1 Accuracy
Pass@1 measures the percentage of tasks where the first generated solution passes all unit tests:

\[
\mathrm{Pass@1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\text{the first solution for task } i \text{ passes all unit tests}\big].
\]
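A direct implementation of this metric, assuming a sandboxed unit_test_runner callback whose details are outside the scope of this sketch:

```python
def pass_at_1(first_solutions, unit_test_runner):
    """Fraction of tasks whose first generated solution passes all unit tests.

    unit_test_runner(solution) -> bool stands in for executing the task's test
    suite in a sandbox.
    """
    passed = sum(1 for sol in first_solutions if unit_test_runner(sol))
    return passed / max(len(first_solutions), 1)
```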
7.2. BLEU Score
BLEU evaluates n-gram overlap between generated and reference code:

\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\]

where BP is the brevity penalty and \(p_n\) is the n-gram precision with weight \(w_n\).
7.3. Code Execution Success Rate (CESR)
CESR measures the proportion of compilable and runnable code snippets:

\[
\mathrm{CESR} = \frac{N_{\mathrm{exec}}}{N_{\mathrm{total}}},
\]

where \(N_{\mathrm{exec}}\) is the number of generated snippets that compile and run without error and \(N_{\mathrm{total}}\) is the total number of generated snippets.
7.4. Abstract Syntax Tree Similarity (ASTSim)
ASTSim measures structural similarity using normalized tree edit distance:

\[
\mathrm{ASTSim} = 1 - \frac{\mathrm{TED}\big(T_{\mathrm{gen}}, T_{\mathrm{ref}}\big)}{\max\big(\lvert T_{\mathrm{gen}} \rvert, \lvert T_{\mathrm{ref}} \rvert\big)},
\]

where \(T_{\mathrm{gen}}\) and \(T_{\mathrm{ref}}\) are the ASTs of the generated and reference code.
7.5. Semantic Similarity (SemSim)
SemSim is computed as cosine similarity between program dependency graph (PDG) embeddings:

\[
\mathrm{SemSim} = \frac{v_{\mathrm{gen}} \cdot v_{\mathrm{ref}}}{\lVert v_{\mathrm{gen}} \rVert\, \lVert v_{\mathrm{ref}} \rVert},
\]

where \(v_{\mathrm{gen}}\) and \(v_{\mathrm{ref}}\) are PDG embeddings of the generated and reference programs.
7.6. Code Readability Score (CRS)
CRS measures readability using a learned regression model \(f_\phi\) trained on human-rated code samples:

\[
\mathrm{CRS} = f_\phi\big(\mathbf{x}_{\mathrm{code}}\big),
\]

where \(\mathbf{x}_{\mathrm{code}}\) is a feature representation of the generated code.
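The sketch below shows how such a regressor could be fit with scikit-learn; the readability features are assumed examples, not the feature set used in our experiments.

```python
import numpy as np
from sklearn.linear_model import Ridge

def readability_features(code):
    """Toy feature extractor (assumed features): line length, comment density, nesting."""
    lines = code.splitlines() or [""]
    avg_len = np.mean([len(l) for l in lines])
    comment_ratio = sum(l.strip().startswith("#") for l in lines) / len(lines)
    max_indent = max(len(l) - len(l.lstrip()) for l in lines)
    return [avg_len, comment_ratio, max_indent]

def fit_crs_model(snippets, human_ratings):
    """Fit the readability regressor on human-rated samples and return it."""
    X = np.array([readability_features(s) for s in snippets])
    return Ridge(alpha=1.0).fit(X, human_ratings)
```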
8. Experiment Results
8.1. Experimental Setup
We evaluate CodeFusion-Qwen72B and baselines on three code generation benchmarks: HumanEval, MBPP, and CodeContests. Each model generates solutions under identical decoding settings, with the same temperature and beam size across methods. For fairness, all methods use the same training data and tokenization scheme. We compare against Qwen-72B Full FT, Qwen-72B LoRA, and CodeGen-16B. We also conduct ablation experiments by removing key components of CodeFusion-Qwen72B.
8.2. Overall and Ablation Results
Table 1 presents the overall performance and ablation results. The top section compares different models across all datasets, while the lower section shows the impact of removing each module from CodeFusion-Qwen72B. Six metrics are reported: Pass@1, BLEU, CESR, ASTSim, SemSim, and CRS. The evolution of training indicators during fine-tuning is shown in Figure 4.
9. Conclusion
We proposed CodeFusion-Qwen72B, a multi-stage fine-tuning framework for Qwen-72B incorporating progressive LoRA, multi-context fusion, structure-preserving loss, and adaptive prompt retrieval. Experiments on HumanEval, MBPP, and CodeContests demonstrate consistent improvements across six metrics, with Pass@1 gains of more than 11 percentage points on average compared to full-parameter fine-tuning. The integrated ablation analysis confirms that context fusion and structure-aware objectives are the key contributors to these performance gains.
The approach has practical constraints. First, progressive adaptation with RLHF requires substantial compute; our experiments consumed thousands of GPU-hours, which smaller teams may not be able to reproduce. Second, reliance on retrieval corpora and language-specific AST losses can reduce out-of-distribution and cross-language generalization. Third, AST-based structure penalties require robust parsers for each language, increasing engineering complexity. Finally, syntax-constrained decoding and control vectors add inference latency that may hinder real-time interactive use.
References
- Tipirneni, S.; Zhu, M.; Reddy, C.K. StructCoder: Structure-aware transformer for code generation. ACM Transactions on Knowledge Discovery from Data 2024, 18, 1–20.
- Sirbu, A.G.; Czibula, G. Automatic code generation based on Abstract Syntax-based encoding. Application on malware detection code generation based on MITRE ATT&CK techniques. Expert Systems with Applications 2025, 264, 125821.
- Gong, L.; Elhoushi, M.; Cheung, A. AST-T5: Structure-aware pretraining for code generation and understanding. arXiv preprint arXiv:2401.03003 2024.
- Wang, S.; Yu, L.; Li, J. LoRA-GA: Low-rank adaptation with gradient approximation. Advances in Neural Information Processing Systems 2024, 37, 54905–54931.
- Hayou, S.; Ghosh, N.; Yu, B. LoRA+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354 2024.
- Hounie, I.; Kanatsoulis, C.; Tandon, A.; Ribeiro, A. LoRTA: Low Rank Tensor Adaptation of Large Language Models. arXiv preprint arXiv:2410.04060 2024.
- Zhang, R.; Qiang, R.; Somayajula, S.A.; Xie, P. AutoLoRA: Automatically tuning matrix ranks in low-rank adaptation based on meta learning. arXiv preprint arXiv:2403.09113 2024.
- Chen, N.; Sun, Q.; Wang, J.; Li, X.; Gao, M. Pass-tuning: Towards structure-aware parameter-efficient tuning for code representation learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 577–591.
- Li, P.; Sun, T.; Tang, Q.; Yan, H.; Wu, Y.; Huang, X.; Qiu, X. CodeIE: Large code generation models are better few-shot information extractors. arXiv preprint arXiv:2305.05711 2023.
- Guan, S. Predicting Medical Claim Denial Using Logistic Regression and Decision Tree Algorithm. In Proceedings of the 2024 3rd International Conference on Health Big Data and Intelligent Healthcare (ICHIH), 2024, pp. 7–10. [CrossRef]