Submitted: 26 February 2026
Posted: 28 February 2026
Abstract
Keywords:
1. Introduction
- Formulation of an ex-ante t → t+1 prediction problem: Using KoTaP, we define a one-year-ahead forecasting task for tax-avoidance proxies (CETR, GETR, TSTA, TSDA) and present an application scenario that reflects the firm–year panel structure.
- Leakage-free evaluation protocol: We provide a reproducible evaluation design that enables fair comparisons across methods by enforcing chronological splits and explicitly applying fit-on-train-only principles in preprocessing and model selection to prevent future information from leaking into training.
- Quantification of Raw-only vs. Derived-only vs. Raw+Derived feature effects (ablation): By constructing three input configurations (Raw-only, Derived-only, and Raw+Derived), we quantify, for each target, the standalone predictive contribution of derived variables and the additional gain from combining them with raw variables.
- ML/DL model comparison and selection rationale under limited-sample conditions: Under a unified protocol, we compare three machine-learning models and one deep-learning model, summarizing performance and robustness differences under practical constraints of Korean financial panel data (sample size, missingness, heterogeneity) and providing deployment-oriented guidance for model choice.
- Analysis of feasibility and limitations of prediction-based risk screening: We compare target-specific prediction difficulty and performance patterns and discuss practical caveats and limitations for real-world use (e.g., availability constraints, persistence differences across indicators).
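The ex-ante t → t+1 formulation in the first contribution can be illustrated with a minimal sketch. It pairs each firm-year's inputs at year t with the same firm's target at year t+1 and drops the sample when the following year is missing. The use of `stock` as the firm identifier and the toy panel values are assumptions made for illustration only; the actual KoTaP pipeline may differ.

```python
from typing import Dict, List, Tuple

def build_pairs(rows: List[dict], target: str) -> List[Tuple[dict, float]]:
    """Pair inputs observed at year t with the same firm's target at year t+1."""
    # Index observations by (firm, year); `stock` as firm key is an assumption.
    by_key: Dict[Tuple[str, int], dict] = {(r["stock"], r["year"]): r for r in rows}
    pairs = []
    for (firm, year), row in sorted(by_key.items()):
        nxt = by_key.get((firm, year + 1))
        # Keep the sample only if year t+1 exists and its target is observed.
        if nxt is not None and nxt.get(target) is not None:
            pairs.append((row, nxt[target]))
    return pairs

# Toy panel (hypothetical values, not taken from KoTaP).
panel = [
    {"stock": "A", "year": 2018, "asset": 100.0, "CETR": 0.21},
    {"stock": "A", "year": 2019, "asset": 110.0, "CETR": 0.19},
    {"stock": "A", "year": 2021, "asset": 130.0, "CETR": 0.25},  # 2020 is missing
    {"stock": "B", "year": 2018, "asset": 50.0, "CETR": 0.30},
    {"stock": "B", "year": 2019, "asset": 55.0, "CETR": None},   # target unobserved
]
pairs = build_pairs(panel, "CETR")
print(len(pairs), pairs[0][0]["year"], pairs[0][1])
```

Only firm A's 2018 row survives in this toy panel, because it is the only observation whose following year exists with an observed target.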
2. Related Works
2.1. Algorithms for Firm–Year Panel (Temporal/Panel) Data Analysis
2.2. Algorithms for Tabular Data Analysis
2.3. Application Studies on Prediction/Screening Using Financial and Tax Data
3. Methodology
3.1. Data and Variable Construction
3.1.1. Target Variables (Tax-Avoidance Proxies)
3.1.2. Input Variables and Feature Sets
| Group | Variables |
| Identifier / Meta | name, stock, year, KOSPI, fnd_year, fiscal, ind |
| Raw (directly extracted) | big4, forn, own, c_asset, inv, asset, sales, cogs, dep, tax, rec, ni, ocf, cash, tan, land, cip, intan, liab, c_liab, pti, total, equit |
| Derived (ratios / indicators) | SIZE, LEV, CUR, GRW, ROA, ROE, CFO, PPE, AGE, INVREC, MB, TQ, LOSS |
| Derived (lag features) | lag_asset, lag_liab, lag_equit, lag_sales, lag_total, lag1_ni, lag_c_asset, lag_c_liab |
| Tax-avoidance proxies (targets) | CETR, GETR, TSTA, TSDA |
| Tax-avoidance proxies (auxiliary) | CETR3, GETR3, CETR5, GETR5, A_CETR, A_GETR, A_CETR3, A_GETR3, A_CETR5, A_GETR5 |
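As a rough illustration of the lag-feature group in the table above, the sketch below fills each lag column with the same firm's raw value from the previous fiscal year and leaves it empty when that year is absent. The column names follow the table; the construction rule itself (a one-year lag within firm, keyed by `stock` and `year`) is an assumption and may not match how the dataset was actually built.

```python
from typing import Dict, List, Tuple

# Raw column -> lag column, using the names listed in the variable table.
LAG_MAP = {
    "asset": "lag_asset", "liab": "lag_liab", "equit": "lag_equit",
    "sales": "lag_sales", "total": "lag_total", "ni": "lag1_ni",
    "c_asset": "lag_c_asset", "c_liab": "lag_c_liab",
}

def add_lag_features(rows: List[dict]) -> List[dict]:
    """Return copies of the rows with one-year lags taken within each firm."""
    by_key: Dict[Tuple[str, int], dict] = {(r["stock"], r["year"]): r for r in rows}
    out = []
    for r in rows:
        prev = by_key.get((r["stock"], r["year"] - 1))
        new = dict(r)
        for raw, lag in LAG_MAP.items():
            # No previous-year row means the lag is left missing, not imputed.
            new[lag] = prev.get(raw) if prev is not None else None
        out.append(new)
    return out

# Toy rows (hypothetical values).
rows = [
    {"stock": "A", "year": 2018, "asset": 100.0},
    {"stock": "A", "year": 2019, "asset": 110.0},
    {"stock": "A", "year": 2021, "asset": 130.0},  # 2020 is missing
]
with_lags = add_lag_features(rows)
print([r["lag_asset"] for r in with_lags])
```

Matching on the exact previous year, rather than on the previous available row, avoids silently treating a two-year gap as a one-year lag.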
3.2. Ex-ante Forecasting and Leakage-Free Evaluation
3.3. Benchmark Models and Implementation Details
3.3.1. Benchmark Models
3.3.2. Training Objective and Target Handling
3.3.3. Preprocessing and Feature Protocol (Train-Only Fit)
3.3.4. Hyperparameter Tuning and Early Stopping
3.3.5. Reproducibility
4. Experiments
4.1. Machine Learning Results
4.2. Deep Learning Results
4.3. Discussion
5. Conclusions
References
- Financial Services Commission (FSC). Korea Accounting Standards Board Announces Korean Translation of International Financial Reporting Standards. Press Release, 24 December 2007. Available online: https://www.fsc.go.kr/eng/pr010101/21771 (accessed on 22 January 2026).
- IFRS Foundation. IFRS Adoption and Implementation in Korea, and the Lessons Learned (IFRS Country Report). 2013. Available online: https://www.ifrs.org/content/dam/ifrs/meetings/2013/june/ifrs-advisory-council/ap2b-adoption-and-implementation-in-korea.pdf (accessed on 22 January 2026).
- Deloitte IAS Plus. Korea—The Korean Experience with IFRS Adoption. 19 March 2013. Available online: https://www.iasplus.com/en/news/2013/03/korea (accessed on 22 January 2026).
- Financial Supervisory Service (FSS). About DART | How Does DART Work? Available online: https://englishdart.fss.or.kr/about/engAbout1.do (accessed on 22 January 2026).
- XBRL International. Jurisdictions—XBRL Korea (DART Filing). Available online: https://www.xbrl.org/the-consortium/about/jurisdictions/ (accessed on 22 January 2026).
- Shin, H.; Oh, H. Mandatory Adoption of IFRS and Earnings Transparency in Korea. J. Appl. Bus. Res. 2017, 33, 1129–1138. [CrossRef]
- Hong, J.Y. Significance and Challenges of Expanding Mandatory XBRL-Based Financial Disclosure for Korean Firms. Capital Market Focus (KCMI) 2023, No. 2023-10. Korea Capital Market Institute, 15 May 2023. Available online: https://www.kcmi.re.kr/publications/pub_detail_view?cno=6119&syear=2023&zcd=002001016&zno=1722 (accessed on 22 January 2026).
- Jung, D.J.; Hur, J.A.; Jung, A.R. The Precondition of Benefits from IFRS Adoption: Financial Statement Comparability. J. Asian Financ. Econ. Bus. 2020, 7, 255–265. [CrossRef]
- Na, H.; Song, W.; Han, S.; Jo, D.; Myung, S.; Kim, H. KoTaP: A Panel Dataset for Corporate Tax Avoidance, Performance, and Governance in Korea. Scientific Data 2026. doi:10.1038/s41597-026-06722-5.
- Na, H.; Kim, H.; Song, W.; Myung, S.; Han, S.; Jo, D. KoTaP: A Panel Dataset for Corporate Tax Avoidance, Performance, and Governance in Korea (2011–2024). Version v2. Zenodo, 2025. doi:10.5281/zenodo.17149808.
- Hyndman, R.J.; Khandakar, Y. Automatic Time Series Forecasting: The forecast Package for R. J. Stat. Softw. 2008, 27, 1–22. [CrossRef]
- Hyndman, R.J.; Koehler, A.B.; Snyder, R.D.; Grose, S. A State Space Framework for Automatic Forecasting Using Exponential Smoothing Methods. Int. J. Forecast. 2002, 18, 439–454. [CrossRef]
- Arellano, M.; Bond, S. Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. Rev. Econ. Stud. 1991, 58, 277–297. [CrossRef]
- Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. Int. J. Forecast. 2020, 36, 1181–1191. [CrossRef]
- Rangapuram, S.S.; Seeger, M.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. Deep State Space Models for Time Series Forecasting. In Advances in Neural Information Processing Systems; 2018. Available online: https://papers.nips.cc/paper/8004-deep-state-space-models-for-time-series-forecasting (accessed on 22 January 2026).
- Lai, G.; Chang, W.-C.; Yang, Y.; Liu, H. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18), Ann Arbor, MI, USA, 8–12 July 2018; pp. 95–104. [CrossRef]
- Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for Interpretable Multi-Horizon Time Series Forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [CrossRef]
- Chronopoulos, I.; Chrysikou, K.; Kapetanios, G.; Mitchell, J.; Raftapostolos, A. Deep Neural Network Estimation in Panel Data Models. Federal Reserve Bank of Cleveland Working Paper No. 23-15, 2023. Available online: https://arxiv.org/abs/2305.19921 (accessed on 22 January 2026).
- Yang, B.; Long, W.; Cai, Z. Machine Learning Based Panel Data Models. Working Papers Series in Theoretical and Applied Economics 202402, University of Kansas, 2024 (revised January 2024). Available online: https://kuwpaper.ku.edu/2024Papers/202402.pdf (accessed on 22 January 2026).
- Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [CrossRef]
- Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320. [CrossRef]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
- Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [CrossRef]
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [CrossRef]
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); 2017; pp. 3146–3154. Available online: https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html (accessed on 22 January 2026).
- Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient Boosting with Categorical Features Support. arXiv 2018, arXiv:1810.11363. [CrossRef]
- Arik, S.Ö.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. In Proceedings of the AAAI Conference on Artificial Intelligence 2021, 35, 6679–6687. Available online: https://ojs.aaai.org/index.php/AAAI/article/view/16826 (accessed on 22 January 2026).
- Huang, X.; Khetan, A.; Cvitkovic, M.; Karnin, Z. TabTransformer: Tabular Data Modeling Using Contextual Embeddings. arXiv 2020, arXiv:2012.06678. [CrossRef]
- Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting Deep Learning Models for Tabular Data. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021); 2021; pp. 18932–18943. Available online: https://dl.acm.org/doi/10.5555/3540261.3541708 (accessed on 22 January 2026).
- Hanlon, M.; Heitzman, S. A Review of Tax Research. J. Account. Econ. 2010, 50, 127–178. [CrossRef]
- Frank, M.M.; Lynch, L.J.; Rego, S.O. Tax Reporting Aggressiveness and Its Relation to Aggressive Financial Reporting. Account. Rev. 2009, 84, 467–496. [CrossRef]
- Guenther, D.A.; Peterson, K.; Searcy, J.; Williams, B.M. How Useful Are Tax Disclosures in Predicting Effective Tax Rates? A Machine Learning Approach. Account. Rev. 2023, 98, 297–322. [CrossRef]
- Borrotti, M.; Rabasco, M.; Santoro, A. Using Accounting Information to Predict Aggressive Tax Location Decisions by European Groups. Econ. Syst. 2023, 47, 101090. [CrossRef]
- Rahman, R.A.; Masrom, S.; Omar, N.; Zakaria, M. An Application of Machine Learning on Corporate Tax Avoidance Detection Model. IAES Int. J. Artif. Intell. 2020, 9, 721–725. [CrossRef]
- Beneish, M.D. The Detection of Earnings Manipulation. Financ. Anal. J. 1999, 55, 24–36. [CrossRef]
- Dechow, P.M.; Ge, W.; Larson, C.R.; Sloan, R.G. Predicting Material Accounting Misstatements. Contemp. Account. Res. 2011, 28, 17–82. [CrossRef]
- Altman, E.I. Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy. J. Financ. 1968, 23, 589–609. [CrossRef]
- Cecchini, M.; Aytug, H.; Koehler, G.J.; Pathak, P. Making Words Work: Using Financial Text as a Predictor of Financial Events. Decis. Support Syst. 2010, 50, 164–175. [CrossRef]
- Dyreng, S.D.; Hanlon, M.; Maydew, E.L. Long-Run Corporate Tax Avoidance. Account. Rev. 2008, 83, 61–82. [CrossRef]
- Desai, M.A.; Dharmapala, D. Corporate Tax Avoidance and High-Powered Incentives. J. Financ. Econ. 2006, 79, 145–179. [CrossRef]
- Choi, G.; et al. LFTD: Transformer-Enhanced Diffusion Model for Realistic Financial Time-Series Data Generation. Preprints.org 2026.

| Feature set | Raw/Derived setting | Aux tax-proxies used? | Target proxies at time t used? |
| FS1 | Raw-only / Derived-only / Raw+Derived | No | No |
| FS2 | Derived-only / Raw+Derived | Yes | No |
| FS3 | Derived-only / Raw+Derived | No | Yes |
| FS4 | Derived-only / Raw+Derived | Yes | Yes |
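The feature-set definitions above can be written as a small column-selection helper. This is a sketch under two assumptions: that the lag features belong to the derived group (as the variable table groups them), and that each feature set simply appends the auxiliary proxies and/or the year-t target proxies to the chosen Raw/Derived base. Under these assumptions the Raw+Derived column counts for FS2, FS3, and FS4 come to 54, 48, and 58, which coincide with the "Features" column of the results tables. The helper does not enforce the table's restriction that FS2 to FS4 are run only with Derived-only or Raw+Derived inputs.

```python
RAW = ["big4", "forn", "own", "c_asset", "inv", "asset", "sales", "cogs", "dep",
       "tax", "rec", "ni", "ocf", "cash", "tan", "land", "cip", "intan", "liab",
       "c_liab", "pti", "total", "equit"]
DERIVED = ["SIZE", "LEV", "CUR", "GRW", "ROA", "ROE", "CFO", "PPE", "AGE",
           "INVREC", "MB", "TQ", "LOSS"]
LAGS = ["lag_asset", "lag_liab", "lag_equit", "lag_sales", "lag_total",
        "lag1_ni", "lag_c_asset", "lag_c_liab"]
TARGETS = ["CETR", "GETR", "TSTA", "TSDA"]
AUX = ["CETR3", "GETR3", "CETR5", "GETR5", "A_CETR", "A_GETR",
       "A_CETR3", "A_GETR3", "A_CETR5", "A_GETR5"]

# Feature set -> (use auxiliary proxies?, use target proxies at time t?)
FEATURE_SETS = {
    "FS1": (False, False),
    "FS2": (True, False),
    "FS3": (False, True),
    "FS4": (True, True),
}

def feature_columns(setting: str, feature_set: str) -> list:
    """Return the input columns for a Raw/Derived setting and a feature set."""
    base = {
        "raw": RAW,
        "derived": DERIVED + LAGS,          # assumption: lags count as derived
        "raw+derived": RAW + DERIVED + LAGS,
    }[setting]
    use_aux, use_target_t = FEATURE_SETS[feature_set]
    cols = list(base)
    if use_aux:
        cols += AUX
    if use_target_t:
        cols += TARGETS                     # proxies observed at year t only
    return cols

for fs in FEATURE_SETS:
    print(fs, len(feature_columns("raw+derived", fs)))
```

Only values observed at year t enter the inputs in FS3 and FS4; the year t+1 values of the same proxies remain the prediction targets.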
| Split | Input year(s) t | Target year(s) t+1 | Purpose | Notes |
| Train | 2011–2019 | 2012–2020 | Model fitting (parameter learning) | Samples are constructed only when both years exist for firm i |
| Validation | 2021 | 2022 | Hyperparameter/epoch selection and model/setting selection (out-of-time) | No parameter updates |
| Test | 2023 | 2024 | Final evaluation (reported results) | Used once after model selection; no iterative tuning |
| Gap (unused) | 2020, 2022 | 2021, 2023 | — | Buffer years not used in training/validation/testing to reduce temporal leakage/overlap effects |
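The chronological split in the table, combined with the fit-on-train-only principle, might be sketched as follows. Samples are assigned by their input year t, buffer years are dropped, and standardization statistics are estimated on the training split only before being applied unchanged to validation and test. The helper names and the toy values are hypothetical, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple

TRAIN_YEARS = range(2011, 2020)  # input years 2011-2019
VALID_YEARS = {2021}
TEST_YEARS = {2023}

def assign_split(input_year: int) -> Optional[str]:
    """Map an input year t to its split; buffer years (2020, 2022) return None."""
    if input_year in TRAIN_YEARS:
        return "train"
    if input_year in VALID_YEARS:
        return "valid"
    if input_year in TEST_YEARS:
        return "test"
    return None

def fit_scaler(train_values: List[float]) -> Tuple[float, float]:
    """Estimate mean and standard deviation from training values only."""
    sigma = pstdev(train_values) or 1.0  # guard against zero variance
    return mean(train_values), sigma

def transform(values: List[float], scaler: Tuple[float, float]) -> List[float]:
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Toy (input year, feature value) samples; values are hypothetical.
samples = [(2018, 1.0), (2019, 3.0), (2020, 9.0), (2021, 5.0), (2023, 7.0)]
by_split = {"train": [], "valid": [], "test": []}
for year, value in samples:
    split = assign_split(year)
    if split is not None:
        by_split[split].append(value)

scaler = fit_scaler(by_split["train"])          # statistics from train only
scaled = {name: transform(vals, scaler) for name, vals in by_split.items()}
print(scaled)
```

Because the scaler never sees validation or test values, later-year distribution shifts show up as out-of-range standardized inputs instead of being absorbed into the preprocessing.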
| Model | Objective (Loss) | Key hyperparameters (search space) | Early stopping / training notes |
| XGBoost [24] | Regression (squared error) | n_estimators: 500, 1000, 2000 | Early stopping on out-of-time Validation (t=2021→2022); use best_iteration for final fit |
| | | learning_rate: 0.01, 0.05, 0.1 | |
| | | max_depth: 3, 5, 7 | |
| | | min_child_weight: 1, 5, 10 | |
| | | subsample: 0.6, 0.8, 1.0 | |
| | | colsample_bytree: 0.6, 0.8, 1.0 | |
| | | reg_lambda (L2): 0, 1, 10 | |
| | | reg_alpha (L1): 0, 0.1, 1 | |
| LightGBM [25] | Regression (L2) | n_estimators: 1000, 3000, 8000 | Early stopping on out-of-time Validation (t=2021→2022); use best_iteration |
| | | learning_rate: 0.01, 0.05, 0.1 | |
| | | num_leaves: 31, 63, 127 | |
| | | max_depth: -1, 6, 10 | |
| | | min_data_in_leaf: 20, 50, 100 | |
| | | feature_fraction: 0.7, 0.9, 1.0 | |
| | | bagging_fraction: 0.7, 0.9, 1.0 | |
| | | lambda_l2: 0, 1, 10 | |
| | | lambda_l1: 0, 0.1, 1 | |
| CatBoost [26] | Regression (RMSE) | iterations: 2000, 5000, 8000 | Early stopping on out-of-time Validation (t=2021→2022); categorical/binary features handled by CatBoost mechanism |
| | | learning_rate: 0.01, 0.05, 0.1 | |
| | | depth: 4, 6, 8, 10 | |
| | | l2_leaf_reg: 1, 3, 10 | |
| | | random_strength: 0, 1, 5 | |
| | | bagging_temperature: 0, 1, 5 | |
| | | rsm: 0.7, 0.9, 1.0 | |
| TabTransformer [28] | Regression (MSE) | n_layers: 2, 4, 6 | AdamW optimizer; early stopping on out-of-time Validation (t=2021→2022); continuous features standardized (fit on Train only) |
| | | d_model (embedding/hidden): 64, 128, 256 | |
| | | n_heads: 4, 8 | |
| | | dropout: 0.0, 0.1, 0.2 | |
| | | mlp_hidden: 128, 256, 512 | |
| | | batch_size: 256, 512 | |
| | | learning_rate: 1e-4, 3e-4, 1e-3 | |
| | | weight_decay: 0, 1e-5, 1e-4 | |
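To give a sense of scale for the search spaces above, the sketch below encodes the XGBoost space from the table, counts the full grid, and draws a few random configurations. It also includes a generic patience-based rule for picking the best iteration from a validation-loss curve. The table does not state whether the search was exhaustive or sampled, nor what patience was used, so the random sampling, the seed, and the `patience` value here are assumptions. The helpers are library-agnostic and do not call XGBoost itself.

```python
import math
import random
from typing import Dict, List

# XGBoost search space as listed in the table.
XGB_SPACE: Dict[str, list] = {
    "n_estimators": [500, 1000, 2000],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 5, 10],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "reg_lambda": [0, 1, 10],
    "reg_alpha": [0, 0.1, 1],
}

def grid_size(space: Dict[str, list]) -> int:
    """Number of configurations in the full Cartesian grid."""
    return math.prod(len(values) for values in space.values())

def sample_configs(space: Dict[str, list], n: int, seed: int = 0) -> List[dict]:
    """Draw n configurations uniformly at random (an assumed search strategy)."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]

def best_iteration(val_losses: List[float], patience: int) -> int:
    """0-based index of the best validation loss seen before stopping.

    Training is treated as stopped once `patience` consecutive iterations
    pass without improving on the best validation loss.
    """
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            break
    return best_idx

print(grid_size(XGB_SPACE))  # 3**8 = 6561 configurations
configs = sample_configs(XGB_SPACE, n=5)
stop_at = best_iteration([0.50, 0.40, 0.45, 0.46, 0.30], patience=2)
print(stop_at)  # 1: stopping triggers before the later 0.30 is reached
```

The last example shows the usual trade-off of early stopping: a short patience saves compute but can stop before a later improvement on the validation year.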
| Target | Model | Raw-only RMSE | Raw-only MAE | Raw-only R2 | Derived-only RMSE | Derived-only MAE | Derived-only R2 | Raw+Derived RMSE | Raw+Derived MAE | Raw+Derived R2 |
| CETR | XGBoost | 0.234 | 0.185 | -0.114 | 0.231 | 0.183 | -0.093 | 0.231 | 0.183 | -0.087 |
| CETR | LightGBM | 0.232 | 0.185 | -0.095 | 0.233 | 0.185 | -0.111 | 0.231 | 0.185 | -0.094 |
| CETR | CatBoost | 0.229 | 0.181 | -0.068 | 0.230 | 0.182 | -0.078 | 0.228 | 0.18 | -0.064 |
| GETR | XGBoost | 0.134 | 0.091 | -0.02 | 0.136 | 0.094 | -0.042 | 0.135 | 0.094 | -0.031 |
| GETR | LightGBM | 0.134 | 0.092 | -0.017 | 0.135 | 0.093 | -0.023 | 0.134 | 0.093 | -0.014 |
| GETR | CatBoost | 0.134 | 0.091 | -0.013 | 0.136 | 0.094 | -0.042 | 0.135 | 0.092 | -0.027 |
| TSTA | XGBoost | 0.231 | 0.141 | 0.282 | 0.235 | 0.144 | 0.258 | 0.224 | 0.136 | 0.327 |
| TSTA | LightGBM | 0.230 | 0.141 | 0.289 | 0.238 | 0.145 | 0.239 | 0.221 | 0.134 | 0.345 |
| TSTA | CatBoost | 0.240 | 0.150 | 0.231 | 0.241 | 0.149 | 0.22 | 0.229 | 0.14 | 0.295 |
| TSDA | XGBoost | 0.237 | 0.140 | 0.273 | 0.239 | 0.142 | 0.259 | 0.23 | 0.133 | 0.314 |
| TSDA | LightGBM | 0.228 | 0.135 | 0.326 | 0.238 | 0.141 | 0.268 | 0.223 | 0.131 | 0.357 |
| TSDA | CatBoost | 0.241 | 0.146 | 0.185 | 0.241 | 0.143 | 0.191 | 0.232 | 0.134 | 0.252 |
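The negative R2 values for CETR and GETR in the table above are easier to read with the metric definitions in hand. The sketch below assumes the standard definitions, with R2 computed as one minus the ratio of residual to total sum of squares on the evaluation set; the paper's exact computation is not shown here, so this is an assumption. Under this definition, R2 is zero for a model that always predicts the evaluation-set mean and negative for anything worse than that.

```python
from math import sqrt
from typing import List, Tuple

def regression_metrics(y_true: List[float], y_pred: List[float]) -> Tuple[float, float, float]:
    """Return (RMSE, MAE, R2) using the standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
    rmse = sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Toy targets (hypothetical values).
y_true = [0.1, 0.2, 0.3, 0.4]

# Predicting the evaluation-set mean gives R2 = 0.
mean_metrics = regression_metrics(y_true, [0.25] * 4)

# A constant prediction that misses the mean gives a negative R2.
biased_metrics = regression_metrics(y_true, [0.35] * 4)

print(mean_metrics[2], biased_metrics[2])
```

A slightly negative R2 therefore indicates a model that performs about as well as, or marginally worse than, a constant forecast at the evaluation-year mean.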
| Target | RMSE | MAE | R2 | Features | ΔRMSE vs FS1 |
| CETR | 0.216 | 0.160 | 0.052 | 54 | -0.016 |
| GETR | 0.130 | 0.086 | 0.053 | 54 | -0.005 |
| TSTA | 0.223 | 0.132 | 0.332 | 54 | 0.002 |
| TSDA | 0.222 | 0.130 | 0.358 | 54 | 0.000 |
| Target | RMSE | MAE | R2 | Features | ΔRMSE vs FS1 |
| CETR | 0.216 | 0.160 | 0.048 | 48 | -0.015 |
| GETR | 0.128 | 0.084 | 0.082 | 48 | -0.007 |
| TSTA | 0.108 | 0.065 | 0.843 | 48 | -0.113 |
| TSDA | 0.104 | 0.060 | 0.859 | 48 | -0.118 |
| Target | RMSE | MAE | R2 | Features | ΔRMSE vs FS1 |
| CETR | 0.214 | 0.158 | 0.061 | 58 | -0.017 |
| GETR | 0.128 | 0.084 | 0.082 | 58 | -0.007 |
| TSTA | 0.107 | 0.065 | 0.846 | 58 | -0.114 |
| TSDA | 0.102 | 0.060 | 0.865 | 58 | -0.120 |
| Target | Raw-only RMSE | Raw-only MAE | Raw-only R2 | Derived-only RMSE | Derived-only MAE | Derived-only R2 | Raw+Derived RMSE | Raw+Derived MAE | Raw+Derived R2 |
| CETR | 0.235 | 0.184 | -0.126 | 0.232 | 0.185 | -0.095 | 0.233 | 0.184 | -0.105 |
| GETR | 0.139 | 0.093 | -0.088 | 0.139 | 0.096 | -0.087 | 0.14 | 0.096 | -0.105 |
| TSTA | 0.287 | 0.165 | -0.108 | 0.272 | 0.161 | 0.008 | 0.265 | 0.158 | 0.055 |
| TSDA | 0.291 | 0.165 | -0.095 | 0.274 | 0.159 | 0.029 | 0.267 | 0.155 | 0.074 |
| Target | RMSE | MAE | R2 | Features | ΔRMSE vs FS1 |
| CETR | 0.219 | 0.165 | 0.018 | 54 | -0.012 |
| GETR | 0.135 | 0.09 | -0.026 | 54 | -0.005 |
| TSTA | 0.273 | 0.161 | -0.004 | 54 | 0.004 |
| TSDA | 0.275 | 0.16 | 0.016 | 54 | 0.003 |
| Target | RMSE | MAE | R2 | Features | ΔRMSE vs FS1 |
| CETR | 0.221 | 0.165 | -0.001 | 48 | -0.009 |
| GETR | 0.134 | 0.088 | -0.01 | 48 | -0.006 |
| TSTA | 0.109 | 0.073 | 0.841 | 48 | -0.161 |
| TSDA | 0.106 | 0.069 | 0.855 | 48 | -0.167 |
| Target | RMSE | MAE | R2 | Features | ΔRMSE vs FS1 |
| CETR | 0.220 | 0.158 | 0.011 | 58 | -0.011 |
| GETR | 0.133 | 0.087 | 0.007 | 58 | -0.007 |
| TSTA | 0.112 | 0.074 | 0.833 | 58 | -0.158 |
| TSDA | 0.108 | 0.07 | 0.848 | 58 | -0.165 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).