1. Introduction
Code generation is an important application of large language models (LLMs): it can automate software development tasks and accelerate programming workflows. Ultra-large models such as Qwen-72B have shown strong results, yet producing consistently high-quality code remains difficult. Full-parameter fine-tuning is effective but costly, while lightweight adaptation methods reduce cost at the expense of the deep semantics and structural integrity that executable code requires. Instruction alignment through supervised learning or reinforcement learning from human feedback improves task conformity, but it neither exploits execution semantics nor enforces structural fidelity. Current methods also rely heavily on token-level accuracy, which overlooks deeper program dependencies, and they offer little control over style and redundancy. These limitations motivate a framework that unifies semantic, syntactic, and structural optimization with controllable generation and adaptive context use. CodeFusion-Qwen72B is built to address these problems through progressive adaptation, hybrid optimization, multimodal integration, and structure-preserving objectives. This work takes a technical approach to making Qwen-72B more adaptable and effective for code generation, aiming for outputs that are both functionally correct and structurally coherent.
2. Related Work
Recent work in code generation stresses structural modeling and efficiency. Tipirneni et al. [1] proposed StructCoder, a Transformer that uses AST and data-flow information to improve structural consistency. Sirbu and Czibula [2] applied abstract-syntax-based encoding to generate malware detection code from MITRE ATT&CK techniques. Gong et al. [3] presented AST-T5, a structure-aware pretraining model that uses syntactic segmentation to enhance code generation and understanding.

Work on parameter-efficient adaptation has also grown. Wang et al. [4] proposed LoRA-GA, which uses gradient approximation to improve optimization. Hayou et al. [5] introduced LoRA+, which uses different learning rates for better convergence. Hounie et al. [6] extended this line with LoRTA, which uses tensor adaptation. Zhang et al. [7] presented AutoLoRA, which tunes matrix ranks via meta-learning. Chen et al. [8] combined structure-aware signals with parameter-efficient tuning to approach the results of full fine-tuning.

Li et al. [9] showed that large code models work well as few-shot information extractors. Guan [10] used logistic regression and decision trees for medical claim denial prediction, demonstrating that interpretable models remain useful in specialized tasks.
3. Methodology
Enhancing the code generation capabilities of ultra-large language models requires not only parameter-efficient fine-tuning but also structure-aware and context-rich optimization. We propose CodeFusion-Qwen72B, a multi-stage fine-tuning framework for the Qwen-72B model that integrates Progressive Low-Rank Adaptation (PLoRA), Hybrid Instruction Optimization (HIO), Multi-Context Fusion (MCF), Structure-Preserving Hybrid Loss (SPHL), Controllable Generation Decoder (CGD), and Adaptive Prompt Retrieval (APR). PLoRA progressively adapts model layers to domain-specific semantics, while HIO combines supervised fine-tuning with reinforcement learning from human feedback to align both functionality and readability. MCF enriches inputs with problem descriptions, code comments, and execution traces, and SPHL enforces token, syntactic, and semantic coherence. CGD and syntax-constrained decoding enable controllable, compilable generation, and APR provides dynamic retrieval of relevant code snippets. We evaluate CodeFusion-Qwen72B on three benchmarks: HumanEval, MBPP, and CodeContests. Compared to a fully fine-tuned Qwen-72B baseline, our framework improves Pass@1 accuracy from 53.8% to 65.4% (+11.6 points), BLEU from 71.2 to 78.9 (+7.7), and execution success rate from 68.3% to 81.5% (+13.2 points). Ablation studies confirm that MCF and SPHL contribute most to execution gains, while PLoRA and HIO drive significant improvements in syntactic and semantic alignment. These results demonstrate that multi-stage, multi-context, and structure-aware adaptation can substantially advance the state of the art in LLM-based code generation. The pipeline of CodeFusion-Qwen72B is shown in Figure 1.
4. Algorithm and Model
The proposed CodeFusion-Qwen72B is a multi-stage, multi-objective fine-tuning framework for enhancing the code generation capability of ultra-large language models, built upon the Qwen-72B architecture. This section introduces the key components of the framework, their technical motivations, and implementation details.
4.1. Qwen-72B Backbone Architecture
Qwen-72B adopts a Transformer-based autoregressive decoder architecture with \(L\) stacked layers, each containing multi-head self-attention (MHSA) and feed-forward networks (FFN). The standard attention mechanism computes:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V,
\]

where \(M\) is an attention mask ensuring causal dependency. While this formulation captures long-range dependencies, it does not explicitly enforce programming-specific constraints such as variable scope or syntactic validity, motivating the introduction of structure-aware adaptations.
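As a concrete illustration, the sketch below computes the masked attention above for a single head in plain NumPy; the tensor shapes and the additive -inf mask convention are assumptions made for readability, not details of the Qwen-72B implementation.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention as in the formula above (single head, batch omitted).

    Q, K, V: arrays of shape (seq_len, d_k). The additive mask M places -inf
    above the diagonal so position t can only attend to positions <= t.
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                          # (seq_len, seq_len)
    M = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)   # causal mask
    scores = scores + M
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V                                       # (seq_len, d_k)
```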
4.2. Progressive Low-Rank Adaptation (PLoRA)
Full-parameter fine-tuning of Qwen-72B is computationally prohibitive and risks overfitting to domain-specific patterns. We employ Progressive LoRA (PLoRA), a staged adaptation strategy where LoRA updates are first applied to high-level decoder layers to capture semantic patterns, then gradually extended to lower layers to refine syntactic generation. The parameter update is:

\[
W' = W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k),
\]

where \(W_0\) is a frozen pretrained weight matrix and only the low-rank factors \(B\) and \(A\) are trained. This gradual adaptation mitigates catastrophic forgetting and allows stable convergence.
Figure 2 details the Progressive LoRA adaptation mechanism, which strategically fine-tunes the 72-layer transformer architecture in three stages.
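The sketch below illustrates the staged PLoRA schedule described above; the stage boundaries, rank, and scaling factor are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np

RANK, ALPHA = 16, 32           # illustrative LoRA rank and scaling (assumed values)
NUM_LAYERS = 72
# Stage boundaries are assumptions: adapt top layers first, then extend downward.
STAGES = [range(48, 72), range(24, 72), range(0, 72)]

class LoRAWeight:
    """W' = W0 + (alpha / r) * B @ A, with W0 frozen and only A, B trained."""
    def __init__(self, W0):
        d_out, d_in = W0.shape
        self.W0 = W0                                    # frozen pretrained weight
        self.A = np.random.randn(RANK, d_in) * 0.01     # low-rank factor A
        self.B = np.zeros((d_out, RANK))                # B starts at zero => W' == W0
        self.trainable = False

    def effective_weight(self):
        return self.W0 + (ALPHA / RANK) * self.B @ self.A

def set_stage(layers, stage_idx):
    """Enable LoRA training only for the layers covered by the current stage."""
    active = set(STAGES[stage_idx])
    for i, layer in enumerate(layers):
        layer.trainable = i in active
```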
4.3. Hybrid Instruction Optimization (HIO)
Code generation quality depends heavily on alignment with user instructions. HIO combines supervised fine-tuning with reinforcement learning from human feedback (RLHF). The RLHF loss is:

\[
\mathcal{L}_{\mathrm{RLHF}} = -\,\mathbb{E}_{y \sim \pi_\theta}\big[\, R_{\mathrm{pref}}(x, y) + \lambda\, R_{\mathrm{func}}(x, y) \,\big],
\]

where \(R_{\mathrm{pref}}(x, y)\) measures preference alignment and \(R_{\mathrm{func}}(x, y)\) scores code functionality and readability. This hybrid approach ensures the model learns both semantic compliance and executable correctness.
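A minimal sketch of how the supervised and reward-based terms could be combined is shown below; the reward decomposition and the weights are assumptions for illustration, not the exact training objective.

```python
def hio_loss(sft_loss, pref_reward, func_reward, lam_pref=0.5, lam_func=0.5):
    """Hybrid instruction objective: supervised loss minus weighted rewards.

    sft_loss:    token-level cross-entropy on instruction-following data
    pref_reward: scalar from a preference model (human-feedback alignment)
    func_reward: scalar scoring functionality/readability (e.g. unit-test pass rate)
    lam_pref and lam_func are illustrative hyperparameters.
    """
    return sft_loss - lam_pref * pref_reward - lam_func * func_reward

# Example: hio_loss(2.31, pref_reward=0.8, func_reward=0.6) -> 1.61
```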
4.4. Multi-Context Fusion (MCF)
High-quality code generation requires integrating multiple context sources beyond the textual problem description. MCF fuses three modalities: natural language descriptions, code comments, and execution traces. Each is encoded separately:

\[
H_{\mathrm{nl}} = \mathrm{Enc}(X_{\mathrm{nl}}), \qquad H_{\mathrm{cm}} = \mathrm{Enc}(X_{\mathrm{cm}}), \qquad H_{\mathrm{ex}} = \mathrm{Enc}(X_{\mathrm{ex}}).
\]

The fusion is implemented via gated attention:

\[
H_{\mathrm{fused}} = \sum_{m \in \{\mathrm{nl},\, \mathrm{cm},\, \mathrm{ex}\}} g_m \odot H_m, \qquad g = \mathrm{softmax}\big(W_g\,[H_{\mathrm{nl}}; H_{\mathrm{cm}}; H_{\mathrm{ex}}]\big),
\]

where \(W_g\) is a learned gating projection. This ensures that execution semantics and natural language semantics are jointly considered during generation.
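The following sketch shows one way such gated fusion could be realized; the gating parameterization (a single projection producing per-position modality weights) is an assumption for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(h_desc, h_comment, h_trace, W_gate):
    """Fuse three context encodings with a learned gate.

    h_desc, h_comment, h_trace: arrays of shape (seq, d), one per modality.
    W_gate: (3*d, 3) projection producing one gate logit per modality and position
            (an assumed parameterization for illustration).
    """
    stacked = np.concatenate([h_desc, h_comment, h_trace], axis=-1)   # (seq, 3d)
    gates = softmax(stacked @ W_gate, axis=-1)                        # (seq, 3)
    return (gates[:, 0:1] * h_desc
            + gates[:, 1:2] * h_comment
            + gates[:, 2:3] * h_trace)                                # (seq, d)
```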
4.5. Structure-Preserving Hybrid Loss (SPHL)
Token-level accuracy is insufficient for functional code correctness. SPHL integrates token-level cross-entropy, AST-based structure regularization, and semantic similarity:

\[
\mathcal{L}_{\mathrm{SPHL}} = \lambda_1\, \mathcal{L}_{\mathrm{CE}} + \lambda_2\, \mathcal{L}_{\mathrm{AST}} + \lambda_3\, \mathcal{L}_{\mathrm{sem}},
\]

where \(\mathcal{L}_{\mathrm{AST}}\) is computed via tree edit distance between predicted and reference ASTs, and \(\mathcal{L}_{\mathrm{sem}}\) penalizes low cosine similarity between program dependency graph embeddings to preserve logical flow.
Figure 3 demonstrates the Multi-Context Fusion module’s gated attention mechanism for integrating heterogeneous code-related inputs and the Structure-Preserving Hybrid Loss formulation.
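A compact sketch of the hybrid loss follows; the loss weights are placeholders, the tree_edit_distance callback is a hypothetical helper supplied by the caller, and the cosine-based semantic term is one plausible instantiation of the formulation above.

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def sphl_loss(ce_loss, pred_ast, ref_ast, pred_pdg, ref_pdg,
              tree_edit_distance, lam_ast=0.3, lam_sem=0.2):
    """Structure-Preserving Hybrid Loss = CE + AST penalty + semantic penalty.

    tree_edit_distance: caller-supplied function over AST pairs (hypothetical
    helper, e.g. an APTED/zss-style implementation). pred_pdg / ref_pdg are
    program dependency graph embeddings. lam_ast and lam_sem are illustrative.
    """
    ast_penalty = tree_edit_distance(pred_ast, ref_ast)      # structural deviation
    sem_penalty = 1.0 - cosine_sim(pred_pdg, ref_pdg)        # low similarity => high loss
    return ce_loss + lam_ast * ast_penalty + lam_sem * sem_penalty
```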
4.6. Controllable Generation Decoder (CGD)
The CGD module enables control over the style and verbosity of generated code by injecting a control vector \(c\) into the decoder:

\[
h_t' = h_t + W_c\, c,
\]

where \(h_t\) is the decoder hidden state at step \(t\) and \(W_c\) is a learned projection. This mechanism supports different generation modes, such as minimal implementation or detailed, readable code, while maintaining syntax validity through constrained decoding:

\[
y_t = \arg\max_{v \in \mathcal{V}_{\mathrm{valid}}(y_{<t})} p_\theta\big(v \mid y_{<t}, x\big),
\]

where \(\mathcal{V}_{\mathrm{valid}}(y_{<t})\) is the set of tokens permitted by the target language grammar given the partial output.
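The decoding step below sketches both mechanisms together; the control-vector injection and the grammar-derived set of valid token ids are simplified assumptions for illustration.

```python
import numpy as np

def controllable_step(hidden, control_vec, W_ctrl, vocab_logits_fn, valid_token_ids):
    """One decoding step with style control and syntax-constrained token selection.

    hidden:          decoder hidden state, shape (d,)
    control_vec:     style/verbosity control vector, shape (c,)
    W_ctrl:          (c, d) projection injecting the control signal (assumed form)
    vocab_logits_fn: maps a hidden state to vocabulary logits
    valid_token_ids: token ids the grammar/parser currently allows
    """
    steered = hidden + control_vec @ W_ctrl          # inject control signal
    logits = vocab_logits_fn(steered)                # (vocab_size,)
    mask = np.full_like(logits, -np.inf)
    mask[valid_token_ids] = 0.0                      # keep only syntactically valid tokens
    return int(np.argmax(logits + mask))
```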
4.7. Adaptive Prompt Retrieval (APR)
APR dynamically augments prompts with semantically similar code snippets retrieved from an indexed corpus:

\[
\mathcal{R}(q) = \operatorname*{top\text{-}k}_{c \in \mathcal{C}} \; \cos\big(e_q, e_c\big),
\]

where \(e_q\) and \(e_c\) are embeddings of the query and a candidate snippet, respectively.
These retrieved examples provide explicit patterns for the model to adapt, improving zero-shot and few-shot performance.
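A minimal retrieval sketch, assuming precomputed snippet embeddings from an unspecified encoder:

```python
import numpy as np

def retrieve_snippets(query_emb, corpus_embs, corpus_snippets, k=3):
    """Return the k corpus snippets most similar to the query embedding.

    query_emb:   (d,) embedding of the task query
    corpus_embs: (n, d) precomputed snippet embeddings (the encoder choice
                 is left open in this sketch)
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    c = corpus_embs / (np.linalg.norm(corpus_embs, axis=1, keepdims=True) + 1e-8)
    scores = c @ q                                    # cosine similarities
    top = np.argsort(-scores)[:k]
    return [corpus_snippets[i] for i in top]
```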
5. Data Preprocessing
The quality and consistency of training data are critical in adapting Qwen-72B for code generation. Our preprocessing pipeline ensures syntactic correctness, semantic diversity, and domain alignment before fine-tuning.
5.1. Code Deduplication and Normalization
Large-scale code corpora often contain near-duplicate functions, which can cause overfitting and reduce model generalization. We apply a MinHash-based similarity filtering mechanism to remove code pairs with Jaccard similarity above a threshold \(\tau\):

\[
J(S_i, S_j) = \frac{\lvert N(S_i) \cap N(S_j) \rvert}{\lvert N(S_i) \cup N(S_j) \rvert} > \tau,
\]

where \(N(S)\) denotes the set of n-grams extracted from code snippet \(S\). Normalization steps such as consistent indentation, removal of trailing whitespace, and unification of naming conventions are also applied to reduce variability from irrelevant factors.
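For clarity, the sketch below computes exact Jaccard similarity over character n-grams; at corpus scale this comparison would be approximated with MinHash/LSH rather than evaluated pairwise, and the threshold value shown is illustrative.

```python
def char_ngrams(code, n=5):
    """Set of character n-grams, playing the role of N(S) in the formula above."""
    return {code[i:i + n] for i in range(max(len(code) - n + 1, 1))}

def jaccard(code_a, code_b, n=5):
    a, b = char_ngrams(code_a, n), char_ngrams(code_b, n)
    return len(a & b) / max(len(a | b), 1)

def deduplicate(snippets, tau=0.85):
    """Greedy filter: drop any snippet whose similarity to a kept one exceeds tau."""
    kept = []
    for s in snippets:
        if all(jaccard(s, k) <= tau for k in kept):
            kept.append(s)
    return kept
```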
5.2. Syntactic Validation and Semantic Tagging
We use language-specific parsers to validate the syntactic correctness of each code snippet. Invalid code is automatically repaired using a lightweight sequence-to-sequence correction model. Additionally, we apply semantic tagging by extracting function names, API calls, and language constructs, and encoding them as auxiliary features \(t\) that are concatenated with the input representation:

\[
\tilde{x} = \big[\, x \,;\, E_{\mathrm{tag}}(t) \,\big],
\]

where \(E_{\mathrm{tag}}\) embeds the extracted tags.
This tagging improves the model’s ability to attend to function-level semantics during fine-tuning.
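As an illustration restricted to Python, the snippet below uses the standard ast module to validate a snippet and extract function and API-call tags; other languages would require their own parsers, and the repair step is omitted from this sketch.

```python
import ast

def validate_and_tag(code):
    """Parse a Python snippet; return (is_valid, tags) with function and API-call names."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False, {}               # would be routed to the repair model instead
    tags = {"functions": [], "api_calls": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            tags["functions"].append(node.name)
        elif isinstance(node, ast.Call):
            fn = node.func
            if isinstance(fn, ast.Name):
                tags["api_calls"].append(fn.id)
            elif isinstance(fn, ast.Attribute):
                tags["api_calls"].append(fn.attr)
    return True, tags
```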
6. Prompt Engineering Techniques
Effective prompt design significantly impacts the code generation quality of LLMs. In CodeFusion-Qwen72B, we integrate retrieval-augmented prompting and progressive prompt construction to enhance both accuracy and controllability.
6.1. Retrieval-Augmented Prompt Construction
We maintain an indexed database of high-quality code exemplars. Given a new task query \(q\), we compute its embedding \(e_q\) and retrieve the top-\(k\) similar examples based on cosine similarity:

\[
\mathrm{sim}(q, c_i) = \frac{e_q \cdot e_{c_i}}{\lVert e_q \rVert\, \lVert e_{c_i} \rVert}.
\]
These retrieved examples are concatenated into the prompt before the target instruction. This method injects explicit patterns for the model to imitate, improving few-shot and zero-shot code generation performance.
6.2. Progressive Prompt Structuring
Instead of providing the entire problem description at once, we adopt a progressive reveal strategy, where the prompt is expanded in stages:
1. Present a high-level task description.
2. Include a partial code skeleton or function signature.
3. Gradually add constraints, examples, and edge cases.

Formally, let \(p_1, p_2, \ldots, p_K\) denote the progressive prompt segments; the model receives:

\[
y \sim p_\theta\big(y \mid p_1 \oplus p_2 \oplus \cdots \oplus p_K\big),
\]

where \(\oplus\) denotes concatenation.
This technique conditions the model’s attention sequentially, improving coherence and reducing hallucination in generated code.
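A simple helper that assembles the staged segments into a single prompt is sketched below; the section headers are illustrative formatting choices, not a fixed template.

```python
def build_progressive_prompt(task_description, skeleton=None, constraints=None, examples=None):
    """Assemble the staged prompt segments in the order described above.

    Later stages are optional so the prompt can be revealed incrementally.
    """
    segments = [f"### Task\n{task_description}"]
    if skeleton:
        segments.append(f"### Code skeleton\n{skeleton}")
    if constraints:
        segments.append("### Constraints\n" + "\n".join(f"- {c}" for c in constraints))
    if examples:
        segments.append("### Examples\n" + "\n".join(examples))
    return "\n\n".join(segments)
```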
7. Evaluation Metrics
To comprehensively evaluate the performance of CodeFusion-Qwen72B and baseline models, we adopt six complementary metrics that assess syntactic correctness, semantic fidelity, and functional accuracy.
7.1. Pass@1 Accuracy
Pass@1 measures the percentage of tasks where the first generated solution passes all unit tests:

\[
\mathrm{Pass@1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\text{the first solution for task } i \text{ passes all unit tests}\big].
\]
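A direct implementation of this metric, assuming a sandboxed unit_test_runner callback whose details are outside the scope of this sketch:

```python
def pass_at_1(first_solutions, unit_test_runner):
    """Fraction of tasks whose first generated solution passes all unit tests.

    unit_test_runner(solution) -> bool stands in for executing the task's test
    suite in a sandbox.
    """
    passed = sum(1 for sol in first_solutions if unit_test_runner(sol))
    return passed / max(len(first_solutions), 1)
```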
7.2. BLEU Score
BLEU evaluates n-gram overlap between generated and reference code:

\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\]

where BP is the brevity penalty and \(p_n\) is the n-gram precision with weight \(w_n\).
7.3. Code Execution Success Rate (CESR)
CESR measures the proportion of compilable and runnable code snippets:

\[
\mathrm{CESR} = \frac{N_{\mathrm{exec}}}{N_{\mathrm{total}}},
\]

where \(N_{\mathrm{exec}}\) is the number of generated snippets that compile and run without error and \(N_{\mathrm{total}}\) is the total number of generated snippets.
7.4. Abstract Syntax Tree Similarity (ASTSim)
ASTSim measures structural similarity using normalized tree edit distance:

\[
\mathrm{ASTSim} = 1 - \frac{\mathrm{TED}\big(T_{\mathrm{gen}}, T_{\mathrm{ref}}\big)}{\max\big(\lvert T_{\mathrm{gen}} \rvert, \lvert T_{\mathrm{ref}} \rvert\big)},
\]

where \(T_{\mathrm{gen}}\) and \(T_{\mathrm{ref}}\) are the ASTs of the generated and reference code.
7.5. Semantic Similarity (SemSim)
SemSim is computed as cosine similarity between program dependency graph (PDG) embeddings:

\[
\mathrm{SemSim} = \frac{v_{\mathrm{gen}} \cdot v_{\mathrm{ref}}}{\lVert v_{\mathrm{gen}} \rVert\, \lVert v_{\mathrm{ref}} \rVert},
\]

where \(v_{\mathrm{gen}}\) and \(v_{\mathrm{ref}}\) are PDG embeddings of the generated and reference programs.
7.6. Code Readability Score (CRS)
CRS measures readability using a learned regression model \(f_\phi\) trained on human-rated code samples:

\[
\mathrm{CRS} = f_\phi\big(\mathbf{x}_{\mathrm{code}}\big),
\]

where \(\mathbf{x}_{\mathrm{code}}\) is a feature representation of the generated code.
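The sketch below shows how such a regressor could be fit with scikit-learn; the readability features are assumed examples, not the feature set used in our experiments.

```python
import numpy as np
from sklearn.linear_model import Ridge

def readability_features(code):
    """Toy feature extractor (assumed features): line length, comment density, nesting."""
    lines = code.splitlines() or [""]
    avg_len = np.mean([len(l) for l in lines])
    comment_ratio = sum(l.strip().startswith("#") for l in lines) / len(lines)
    max_indent = max(len(l) - len(l.lstrip()) for l in lines)
    return [avg_len, comment_ratio, max_indent]

def fit_crs_model(snippets, human_ratings):
    """Fit the readability regressor on human-rated samples and return it."""
    X = np.array([readability_features(s) for s in snippets])
    return Ridge(alpha=1.0).fit(X, human_ratings)
```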
8. Experiment Results
8.1. Experimental Setup
We evaluate CodeFusion-Qwen72B and baselines on three code generation benchmarks: HumanEval, MBPP, and CodeContests. Each model generates solutions under identical decoding settings, with the same temperature and beam size across methods. For fairness, all methods use the same training data and tokenization scheme. We compare against Qwen-72B Full FT, Qwen-72B LoRA, and CodeGen-16B. We also conduct ablation experiments by removing key components of CodeFusion-Qwen72B.
8.2. Overall and Ablation Results
Table 1 presents the overall performance and ablation results. The top section compares different models across all datasets, while the lower section shows the impact of removing each module from CodeFusion-Qwen72B. Six metrics are reported: Pass@1, BLEU, CESR, ASTSim, SemSim, and CRS. The evolution of training indicators during fine-tuning is shown in Figure 4.
9. Conclusion
We proposed CodeFusion-Qwen72B, a multi-stage fine-tuning framework for Qwen-72B incorporating progressive LoRA, multi-context fusion, structure-preserving loss, and adaptive prompt retrieval. Experiments on HumanEval, MBPP, and CodeContests demonstrate consistent improvements across six metrics, with Pass@1 gains of more than 11 percentage points on average compared to full-parameter fine-tuning. The integrated ablation analysis confirms that context fusion and structure-aware objectives are the key contributors to these performance gains.
The approach has practical constraints. First, progressive adaptation with RLHF requires substantial compute; our experiments consumed thousands of GPU-hours, which smaller teams may not be able to reproduce. Second, reliance on retrieval corpora and language-specific AST losses can reduce out-of-distribution and cross-language generalization. Third, AST-based structure penalties require robust parsers for each language, increasing engineering complexity. Finally, syntax-constrained decoding and control vectors add inference latency that may hinder real-time interactive use.
References
- Tipirneni, S.; Zhu, M.; Reddy, C.K. StructCoder: Structure-aware transformer for code generation. ACM Transactions on Knowledge Discovery from Data 2024, 18, 1–20.
- Sirbu, A.G.; Czibula, G. Automatic code generation based on Abstract Syntax-based encoding. Application on malware detection code generation based on MITRE ATT&CK techniques. Expert Systems with Applications 2025, 264, 125821.
- Gong, L.; Elhoushi, M.; Cheung, A. AST-T5: Structure-aware pretraining for code generation and understanding. arXiv preprint arXiv:2401.03003 2024.
- Wang, S.; Yu, L.; Li, J. LoRA-GA: Low-rank adaptation with gradient approximation. Advances in Neural Information Processing Systems 2024, 37, 54905–54931.
- Hayou, S.; Ghosh, N.; Yu, B. LoRA+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354 2024.
- Hounie, I.; Kanatsoulis, C.; Tandon, A.; Ribeiro, A. LoRTA: Low Rank Tensor Adaptation of Large Language Models. arXiv preprint arXiv:2410.04060 2024.
- Zhang, R.; Qiang, R.; Somayajula, S.A.; Xie, P. AutoLoRA: Automatically tuning matrix ranks in low-rank adaptation based on meta learning. arXiv preprint arXiv:2403.09113 2024.
- Chen, N.; Sun, Q.; Wang, J.; Li, X.; Gao, M. Pass-tuning: Towards structure-aware parameter-efficient tuning for code representation learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 577–591.
- Li, P.; Sun, T.; Tang, Q.; Yan, H.; Wu, Y.; Huang, X.; Qiu, X. CodeIE: Large code generation models are better few-shot information extractors. arXiv preprint arXiv:2305.05711 2023.
- Guan, S. Predicting Medical Claim Denial Using Logistic Regression and Decision Tree Algorithm. In Proceedings of the 2024 3rd International Conference on Health Big Data and Intelligent Healthcare (ICHIH), 2024, pp. 7–10. [CrossRef]