3.1. Data Introduction
As part of the research design, this section introduces and explains the data used in the study. The data come from the 2024 Property Tax Roll data set, which provides the latest real estate tax data for U.S. states and forms the basis for the empirical analysis.
In the time dimension, the study uses real estate tax data up to 2024; the data set is kept current through annual updates. In the geographical dimension, the data cover the tax situation of U.S. states and include key indicators such as the real estate tax rate and assessed value, which makes cross-regional comparison possible.
To ensure reliable and accurate results, the raw data were systematically preprocessed. First, duplicate records and outliers were removed during data cleaning. Second, missing values were imputed using appropriate statistical methods. Third, because the indicators are measured on different scales, the data were standardized. For data quality control, the study carried out consistency checks, outlier identification and handling, data integrity verification, and variable correlation analysis.
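The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the text does not state which outlier rule, imputation method, or scaling was used, so the IQR fence, median imputation, and z-score standardization below are assumptions, as are the function name `preprocess` and the toy values.

```python
import numpy as np
import pandas as pd


def preprocess(df, numeric_cols, iqr_k=1.5):
    """Deduplicate, drop IQR outliers, impute medians, then z-score.

    All three concrete choices (IQR rule, median, z-score) are assumptions
    made for illustration; the paper does not specify them.
    """
    out = df.drop_duplicates().copy()

    # Step 1: remove outliers column by column using IQR fences.
    # Rows with a missing value in the column are kept for later imputation.
    for col in numeric_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - iqr_k * iqr, q3 + iqr_k * iqr
        keep = out[col].between(lower, upper) | out[col].isna()
        out = out.loc[keep].copy()

    # Step 2: fill remaining missing values with the column median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # Step 3: standardize to zero mean and unit variance.
    std = out[numeric_cols].std(ddof=0).replace(0, 1.0)  # avoid division by zero
    out[numeric_cols] = (out[numeric_cols] - out[numeric_cols].mean()) / std
    return out.reset_index(drop=True)


# Toy data only; column names follow the variables mentioned in the text.
raw = pd.DataFrame({
    "TOTAL_ASSMT": [100.0, 100.0, 120.0, np.nan, 110.0, 105.0, 10000.0],
    "TOTAL_TAXES": [10.0, 10.0, 12.0, 11.0, np.nan, 10.5, 11.5],
})
clean = preprocess(raw, ["TOTAL_ASSMT", "TOTAL_TAXES"])
```

Outliers are removed before imputation here so that extreme values do not distort the medians used to fill the gaps.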
Figure 1 shows the distribution of P_ID. The distribution is bimodal: most values lie between 0 and 50,000, where they are spread relatively uniformly; there is a clear trough between 50,000 and 60,000; and a second, smaller peak appears between 60,000 and 80,000. This pattern may reflect the coding rules used for different areas or different types of real estate, which could provide a basis for classification in the subsequent analysis.
Figure 2 shows the distribution of Property Classes, which is clearly right-skewed. Classes 1.0 and 2.0 are the most common, with about 14,000 and 13,000 samples respectively, far more than any other class. As the class value increases, the number of samples decreases markedly. Low-numbered classes therefore dominate the sample, which may be related to the use type or valuation grade of the properties and is relevant to understanding the property tax structure.
Figure 3 shows the correlations among the variables. The strongest correlation is between TOTAL_ASSMT (total assessed value) and TOTAL_EXEMPT (total exemption), with a coefficient of 0.98, indicating a strong positive relationship between the two. P_ID and plat show a moderate positive correlation (0.51). Notably, TOTAL_TAXES is only weakly correlated with most other variables, and its correlation with TOTAL_ASSMT is only 0.21. This suggests that the real estate tax amount may be influenced by several factors rather than determined by any single one.
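A correlation analysis of this kind can be sketched with pandas. The example below is illustrative only: the data are synthetic, the helper name `strongest_pairs` is hypothetical, and the use of the Pearson coefficient is an assumption because the text does not name the coefficient used. Only the column names are taken from the text.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df, top=3):
    """Return the variable pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation of 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top)


# Synthetic data: exemptions track assessed value closely, taxes do not.
rng = np.random.default_rng(0)
assmt = rng.lognormal(mean=12.0, sigma=0.5, size=500)
frame = pd.DataFrame({
    "TOTAL_ASSMT": assmt,
    "TOTAL_EXEMPT": 0.3 * assmt + rng.normal(0.0, 1000.0, size=500),
    "TOTAL_TAXES": rng.normal(5000.0, 1500.0, size=500),
})
top_pairs = strongest_pairs(frame)
print(top_pairs)
```

Passing the full matrix from `DataFrame.corr()` to `seaborn.heatmap` would give a figure in the style of a correlation heat map.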
3.2. Software and Hardware Configuration
For the software environment, Python 3.8 was adopted as the main programming language because of its wide use in data science and machine learning and its rich library support. Data processing and analysis rely on pandas 1.5.3 and numpy 1.23.5, and data visualization on matplotlib 3.7.1 and seaborn 0.12.2. For machine learning, scikit-learn 1.0.2 serves as the basic modeling tool, XGBoost 1.7.3 is used for ensemble learning, and TensorFlow 2.12.0 is used to build the deep learning model. To address class imbalance, the SMOTE technique from the imbalanced-learn 0.10.1 library is applied.
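The idea behind SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step in plain NumPy. It is a simplified stand-in written for this explanation, not the imbalanced-learn implementation used in the study, and the function name `smote_like_oversample` and its parameters are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min.

    Simplified illustration of SMOTE-style interpolation; requires at
    least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours,
    # then place the new point at a random position on the segment between them.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_oversample(minority, n_new=20, k=3, seed=1)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.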
For the hardware environment, the experiments were run on a workstation with an Intel Core i7 processor and 32 GB of RAM, and training of the deep learning model was accelerated with an NVIDIA GeForce RTX 3080 graphics card. Detailed parameters are listed in Table 1 and Table 2.
Table 1. Software environment configuration.
| Category | Component | Version |
| --- | --- | --- |
| Programming Language | Python | 3.8 |
| Data Processing | pandas | 1.5.3 |
| Data Processing | numpy | 1.23.5 |
| Data Visualization | matplotlib | 3.7.1 |
| Data Visualization | seaborn | 0.12.2 |
| Machine Learning Framework | scikit-learn | 1.0.2 |
| Machine Learning Framework | XGBoost | 1.7.3 |
| Data Balancing Processing | imbalanced-learn | 0.10.1 |
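To reproduce the environment in Table 1, the installed package versions can be compared against the listed ones. The following is a minimal sketch using the standard-library `importlib.metadata` module. The helper name `find_mismatches` and the `lookup` parameter are hypothetical, and the distribution names are assumed to match the PyPI names of the libraries in the table.

```python
from importlib import metadata

# Versions copied from Table 1; keys are assumed PyPI distribution names.
EXPECTED = {
    "pandas": "1.5.3",
    "numpy": "1.23.5",
    "matplotlib": "3.7.1",
    "seaborn": "0.12.2",
    "scikit-learn": "1.0.2",
    "xgboost": "1.7.3",
    "imbalanced-learn": "0.10.1",
}


def find_mismatches(expected, lookup=metadata.version):
    """Return {package: installed_version_or_None} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for package, found in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {EXPECTED[package]}, found {found}")
```

An empty result means the environment matches the table; the `lookup` argument exists only so the check can be exercised without installing the packages.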
Table 2. Hardware environment configuration.
| Device Type | Configuration Parameter |
| --- | --- |
| Processor | Intel Core i7 |
| Memory | 32GB RAM |
| Operating System | Windows 10 |
| Storage Device | 512GB SSD |
3.3. Model Introduction
On the modeling side, this study uses four representative machine learning algorithms to classify property classes. The first is Random Forest, which builds many decision trees and combines their predictions by majority voting. The diversity among the trees improves the accuracy and robustness of the prediction. The algorithm handles high-dimensional data effectively and provides feature importance scores, which helps in understanding the key factors that affect real estate classification.
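A Random Forest classifier with feature importance scores can be sketched with scikit-learn as below. The data are synthetic and the hyperparameters (200 trees, a fixed seed, a 75/25 split) are assumptions chosen for illustration; the text does not report the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_random_forest(n_trees=200, seed=42):
    """Train a Random Forest on synthetic 3-class data and return it with its test accuracy."""
    X, y = make_classification(
        n_samples=600, n_features=8, n_informative=4, n_redundant=0,
        n_classes=3, random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Each tree is grown on a bootstrap sample; predictions are combined by voting.
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


forest, accuracy = fit_random_forest()
# Impurity-based importances, one value per feature, normalized to sum to 1.
importances = forest.feature_importances_
print(f"test accuracy: {accuracy:.3f}")
print("feature importances:", importances.round(3))
```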
The second is XGBoost, an efficient implementation of gradient boosting decision trees. It approximates the objective function with a second-order Taylor expansion and adds a regularization term to control model complexity. This design lets the model maintain high accuracy while limiting over-fitting, which makes it well suited to classification tasks on structured data. XGBoost also has a built-in feature importance mechanism, which helps identify the most influential factors in real estate classification.
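The second-order approximation mentioned above can be written out as follows. The notation is the standard one from the XGBoost literature rather than from this study: $l$ is the loss, $\hat{y}_i^{(t-1)}$ the prediction after $t-1$ trees, $f_t$ the tree added at step $t$, $T$ its number of leaves, $w_j$ the weight of leaf $j$, $I_j$ the set of samples in leaf $j$, and $\gamma$, $\lambda$ the regularization parameters.

```latex
\begin{aligned}
\mathcal{L}^{(t)} &= \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \\
&\approx \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}, \\
\Omega(f_t) &= \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, \qquad
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.
\end{aligned}
```

The term $l(y_i, \hat{y}_i^{(t-1)})$ is constant at step $t$, so only the gradient and Hessian terms and the penalty $\Omega$ influence the new tree; larger $\gamma$ and $\lambda$ favour fewer leaves and smaller leaf weights.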
The third is the support vector machine (SVM), which maps the data into a high-dimensional feature space through a kernel function and finds the optimal separating hyperplane in that space. This makes it well suited to nonlinear classification problems. In this study, the radial basis function (RBF) kernel is used, which can capture nonlinear relationships between features and improve classification performance.
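The benefit of the RBF kernel on nonlinear problems can be illustrated with a small scikit-learn comparison. This is a sketch on synthetic concentric-circle data, not the study's data, and the settings `C=1.0` and `gamma="scale"` are assumed defaults rather than values reported in the text.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_kernels(seed=0):
    """Return test accuracy of a linear and an RBF SVM on two concentric circles."""
    X, y = make_circles(n_samples=400, noise=0.08, factor=0.4, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    scores = {}
    for kernel in ("linear", "rbf"):
        # Features are standardized first because SVMs are scale-sensitive.
        clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
        clf.fit(X_train, y_train)
        scores[kernel] = clf.score(X_test, y_test)
    return scores


kernel_scores = compare_kernels()
print(kernel_scores)
```

No straight line separates the two circles, so the linear kernel performs poorly here, while the RBF kernel can separate the classes in its implicit feature space.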
The fourth is Logistic Regression, a classical statistical learning method that maps a linear combination of the features to class probabilities through the logistic (sigmoid) function. Although its form is simple, adding L1 or L2 regularization helps prevent over-fitting, and L1 regularization additionally performs feature selection by shrinking some coefficients to exactly zero. The main advantage of this method is its interpretability: the coefficients show directly how strongly each feature influences the classification result.
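The different effects of L1 and L2 regularization on the coefficients can be seen in a short scikit-learn sketch. The data are synthetic, with 3 informative features out of 20, and the regularization strength `C=0.05` and the `liblinear` solver are assumptions for illustration, not settings taken from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def count_zero_coefficients(C=0.05, seed=0):
    """Fit L1- and L2-regularized logistic regression and count exact-zero coefficients."""
    X, y = make_classification(
        n_samples=500, n_features=20, n_informative=3, n_redundant=0,
        random_state=seed,
    )
    X = StandardScaler().fit_transform(X)
    zeros = {}
    for penalty in ("l1", "l2"):
        # Smaller C means stronger regularization; liblinear supports both penalties.
        model = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        model.fit(X, y)
        zeros[penalty] = int(np.sum(model.coef_ == 0))
    return zeros


zero_counts = count_zero_coefficients()
print(zero_counts)
```

The L1 model typically sets the coefficients of uninformative features to exactly zero, whereas the L2 model only shrinks them, which is why only L1 acts as a feature selector.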