Preprint
Article

This version is not peer-reviewed.

Investigating the Refactoring Capabilities of Small Open-Weight Language Models

Submitted: 03 March 2026

Posted: 04 March 2026


Abstract
Refactoring is essential for developing maintainable software. Large Language Models are now widely used in software engineering, but compared with well-established applications such as code generation, reliable refactoring remains relatively underexplored. In this paper, we perform a broad analysis of the refactoring capabilities of small open-weight language models (SLMs), evaluating 12 models on 3,453 Python programs. Our study focuses on the two defining properties of refactoring: behavior preservation and code quality improvement. We assess these properties using unit tests and a range of code metrics. Across models ranging from 0.5B to 8B parameters, most models improve code quality. Larger models are more reliable, preserving behavior more consistently. Reasoning models tend to make more substantial changes while refactoring. Allowing models to generate reasoning traces improves performance, but only for models larger than 4B parameters; for smaller models, reasoning in fact reduces refactoring reliability. The difficulty of the underlying task also affects refactoring performance, with more complex tasks associated with higher failure rates. Our results indicate that current open SLMs, especially larger ones with reasoning capabilities, can support refactoring tasks, but they are best used with human oversight.
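The two properties the abstract names can be illustrated with a minimal sketch. The helper names (`behavior_preserved`, `complexity_proxy`), the toy `clamp` example, and the branch-counting metric below are all illustrative assumptions, not the paper's actual harness or metrics; the paper's evaluation uses its own unit-test suites and code metrics.

```python
import ast

def behavior_preserved(f, g, cases):
    """Illustrative check: both versions must agree on every test case.
    (A stand-in for running the paper's unit-test suites.)"""
    return all(f(*args) == g(*args) for args in cases)

def complexity_proxy(source: str) -> int:
    """Crude quality metric: count branching nodes in the AST.
    (A stand-in for the paper's unspecified code metrics.)"""
    tree = ast.parse(source)
    return sum(isinstance(node, (ast.If, ast.For, ast.While, ast.BoolOp))
               for node in ast.walk(tree))

# Hypothetical example: a nested-branch function and a refactored version.
ORIGINAL = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    else:
        if x > hi:
            return hi
        else:
            return x
"""

REFACTORED = """
def clamp(x, lo, hi):
    return max(lo, min(x, hi))
"""

def load(src):
    # Execute the source in a fresh namespace and pull out the function.
    ns = {}
    exec(src, ns)
    return ns["clamp"]

cases = [(-5, 0, 10), (5, 0, 10), (15, 0, 10)]
preserved = behavior_preserved(load(ORIGINAL), load(REFACTORED), cases)
improved = complexity_proxy(REFACTORED) < complexity_proxy(ORIGINAL)
print(preserved, improved)  # → True True
```

A refactoring counts as successful under this sketch only when both conditions hold: the refactored code passes every behavioral check *and* scores better on the quality metric, mirroring the two criteria evaluated in the study.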
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

