Submitted: 21 February 2025
Posted: 24 February 2025
Abstract
Keywords:
1. Introduction
2. Related Work
3. Preliminaries
3.1. Transformer Architecture
3.2. Autoregressive Training of Transformers
3.3. Improving Trained Model Using Reinforcement Learning
4. Proposed Method
4.1. TSP Graph Representation
4.2. Generation of Training Data for TSP
4.3. Transformer Architecture
4.4. Training and Generation for TSP Optimization
5. Results
- Using only step 1 of training where the model is trained for next-node prediction based on the cross-entropy loss only.
- Using only step 2 of training where the model is trained for tour prediction using a DPO loss only.
- Using both step 1 and step 2 where the model is trained using both cross-entropy loss and DPO loss.
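The two losses named in these configurations can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `beta` value, and the use of whole-tour log-probabilities are hypothetical, and the DPO expression is the standard formulation (negative log-sigmoid of the scaled log-ratio margin between a preferred and a rejected sample), here assuming the preferred tour is the shorter one.

```python
import math


def cross_entropy_next_node(logits, target):
    """Cross-entropy for a single next-node prediction step.

    logits: unnormalized scores, one per candidate node (assumed input).
    target: index of the node that follows in the reference tour.
    """
    m = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) tour pair.

    logp_w, logp_l: log-probabilities of the preferred and rejected tours
        under the model being trained (sum of per-step log-probabilities).
    ref_logp_w, ref_logp_l: the same quantities under a frozen reference model.
    beta: scaling hyperparameter (0.1 is an illustrative value, not from the paper).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) is the same as softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    # Uniform scores over four candidate nodes: the loss equals log(4).
    print(cross_entropy_next_node([0.0, 0.0, 0.0, 0.0], 2))
    # Model identical to the reference: zero margin, loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

In this sketch, the first configuration would minimize only `cross_entropy_next_node` averaged over tour positions, the second only `dpo_loss` over tour pairs, and the third would use both.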
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest






| Nodes | % Optimal, 10,000 training data, cross-entropy only | % Optimal, 10,000 training data, DPO only | % Optimal, 10,000 training data, cross-entropy and DPO | % Optimal, 50,000 training data, cross-entropy and DPO |
|---|---|---|---|---|
| 20 | 3.4 | 3.5 | 3.3 | 2.9 |
| 40 | 8.8 | 8.9 | 8.4 | 6.3 |
| 60 | 13.6 | 13.9 | 13.4 | 11.8 |
| 80 | 17.4 | 17.9 | 17.2 | 15.9 |
| 100 | 21.8 | 21.3 | 21.6 | 19.6 |

| TSPLIB benchmark | Number of nodes | % Optimal, 10,000 training data, cross-entropy only | % Optimal, 10,000 training data, DPO only | % Optimal, 10,000 training data, cross-entropy and DPO | % Optimal, 50,000 training data, cross-entropy and DPO |
|---|---|---|---|---|---|
| Bays29 | 29 | 3.5 | 3.7 | 3.3 | 2.5 |
| Berlin52 | 52 | 4.5 | 4.8 | 4.2 | 3.7 |
| Eil76 | 76 | 7.8 | 8.1 | 7.6 | 5.6 |
| KroA100 | 100 | 15.6 | 15.9 | 15.4 | 9.8 |
| KroB100 | 100 | 17.1 | 17.7 | 17.7 | 10.1 |
| KroC100 | 100 | 16.3 | 16.4 | 16.9 | 10.9 |
| KroD100 | 100 | 19.8 | 19.9 | 18.8 | 12.9 |
| KroE100 | 100 | 18.8 | 18.2 | 18.7 | 11.9 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).