Submitted:
12 June 2025
Posted:
12 June 2025
Abstract

Keywords:
1. Introduction
1.1. Background and Motivation
1.2. Problem Statement
1.3. Objectives
1.4. Scope and Limitations
2. Literature Review
2.1. Traditional Approaches
2.2. Machine Learning in Real Estate
| | Statistical method error | Machine learning method error |
|---|---|---|
| London (AAPE) | 0.0016 | -0.073 |
| Nizhny Novgorod (MAPE) | 14.5 | 10.3 |
2.3. Challenges in Real Estate Price Prediction
- Online data does not contain sufficient time information and may be distorted by seasonal fluctuations.
- Government data is often unreliable because of off-the-books transactions and understated declared prices.
3. Methodology
3.1. Data Collection
3.2. Data Preprocessing
- Translation: All feature names and categorical values were translated from Russian to English to maintain uniformity.
- Outlier Detection and Removal: Listings with extreme or implausible values (e.g., unusually high prices) were identified and excluded.
- Data Imputation: Missing values in both numerical and categorical fields were handled using appropriate imputation strategies, such as mean substitution or assigning default category labels.
- Encoding Categorical Features: Categorical variables, such as building type and district, were encoded using either label encoding or one-hot encoding, depending on the model requirements.
- Normalization of Numerical Features: Numerical features like price, area, and number of rooms were normalized to improve convergence in models that are sensitive to feature scaling.
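The steps above can be sketched with pandas and scikit-learn. This is a minimal illustration, not the pipeline used in the study: the column names, the toy listings, the IQR outlier rule, the `"Unknown"` default label, and the choice of `StandardScaler` for normalization are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, loosely modeled on the dataset's features.
NUMERIC = ["area", "rooms_number"]
CATEGORICAL = ["wall_material", "heating"]


def remove_price_outliers(df, column="price", k=1.5):
    """Drop rows outside the IQR fences (one possible outlier rule, assumed here)."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask].reset_index(drop=True)


def build_preprocessor():
    """Mean-impute and scale numeric columns; label-fill and one-hot encode categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])


# Toy listings, invented for illustration; the last row has an implausible price.
listings = pd.DataFrame({
    "area": [45.0, 60.0, np.nan, 80.0, 55.0, 70.0],
    "rooms_number": [1, 2, 2, 3, 2, 3],
    "wall_material": ["Brick", "Panel", np.nan, "Brick", "Panel", "Brick"],
    "heating": ["Gas", "Electric", "Gas", np.nan, "Gas", "Gas"],
    "price": [50_000, 65_000, 60_000, 90_000, 58_000, 5_000_000],
})

clean = remove_price_outliers(listings)
features = build_preprocessor().fit_transform(clean[NUMERIC + CATEGORICAL])
```

Wrapping imputation, encoding, and scaling in a single `ColumnTransformer` means the same fitted transformations can be reapplied to unseen listings, which helps avoid leaking test-set statistics into training.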
3.3. Modeling Approaches
3.3.1. Linear Regression
- To evaluate how well the model’s predictions align with the true values, we define a loss function. One commonly used function is the Mean Squared Error (Linear models, 2025):
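In one common notation, the loss can be written as follows. The symbols are assumptions made for this sketch rather than the source's own: $N$ listings, feature vectors $x_i$, observed prices $y_i$, weight vector $w$, and bias $b$.

```latex
\mathrm{MSE}(w, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \left( w^{\top} x_i + b \right) \right)^{2}
```

Minimizing this quantity over $w$ and $b$ yields the ordinary least-squares fit.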
3.3.2. Ridge/Lasso Regression
3.3.3. Random Forest
3.3.4. Support Vector Machine (SVM)
3.4. Evaluation Metrics
3.4.1. Coefficient of Determination (R²)
3.4.2. Mean Absolute Percentage Error (MAPE)
4. Exploratory Data Analysis
4.1. Overview of the Dataset
| Feature | Type | Description |
|---|---|---|
| Area | Number | Total area of the apartment in m² |
| Series | Category | Apartment series¹ |
| Floors | Number | The floor where the apartment is located |
| Floors number | Number | Total number of floors in the building |
| Rooms number | Number | Total number of rooms in the apartment |
| Construction Year | Number | Year when apartment was constructed |
| Heating | Category | Heating type (“Gas”, “Electric” etc) |
| Condition | Category | Technical condition of the house |
| Wall material | Category | Wall material (“Brick”, “Panel” etc) |
| Latitude | Number | |
| Longitude | Number | |
| | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| Area (sqm) | 76.6 | 43.1 | 10 | 650 |
| Floor | 6.6 | 3.9 | 1 | 21 |
| Number of Floors | 10.8 | 3.9 | 1 | 25 |
| Number of Rooms | 2.2 | 0.99 | 1 | 6 |
| Built Year | 2017.6 | 13 | 1952 | 2028 |
| Price ($) | 110,707 | 75,059 | 19,000 | 1,500,000 |
| Price per sqm ($) | 1,430 | 345.8 | 305 | 3,440 |
4.2. Distribution of the Target Variables
4.3. Correlation Analysis
- r = 1: perfect positive linear correlation
- r = -1: perfect negative linear correlation
- r = 0: no linear correlation
- As can be seen in Figure 4, Area has the strongest correlation with the total price, which is expected. Very few features correlate negatively with either of our target variables: only derived features such as Number of hospitals within 1 km show negative coefficients, none stronger than -0.2.
- A scatterplot can give further insight into the correlation of specific features. For example, Figure 4 clearly shows that construction year correlates positively with price. The chart also suggests that the market became more diverse in price and area after the collapse of the Soviet Union.
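As a brief illustration of the three reference values of the correlation coefficient, the sketch below computes the Pearson coefficient with NumPy. The function name `pearson_r` and the toy numbers are assumptions made for this example and are not taken from the dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Invented toy data: price is an exact linear function of area (1300 per sqm).
area = [40, 55, 70, 90]
price = [52_000, 71_500, 91_000, 117_000]

r_positive = pearson_r(area, price)                # close to +1
r_negative = pearson_r(area, [-p for p in price])  # close to -1
# A symmetric V-shaped relationship has no linear component.
r_none = pearson_r([1, 2, 3, 4, 5], [2, 1, 0, 1, 2])  # close to 0
```

The last case shows why a coefficient near zero rules out only a linear relationship: the two variables are clearly dependent, yet the coefficient vanishes.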
5. Implementation
5.1. Model Training & Evaluation
5.1.1. Metrics
5.1.2. Linear Regression (Baseline)
5.1.3. Ridge & Lasso
5.1.4. Decision Tree
6. Results
6.1. Model Performance
6.2. Residual Analysis
7. Conclusions
References
- Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2623–2631). Anchorage.
- International Association of Assessing Officers. (2018). Standard on Automated Valuation Models (AVMs). Retrieved from International Association of Assessing Officers: https://www.iaao.org/.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York: Springer.
- Lee, S. (2025, March 27). How Machine Learning Enhances Property Value and Investment. Retrieved from Number Analytics: https://www.numberanalytics.com/blog/machine-learning-enhances-property-value-investment.
- Linear models. (2025, May 7). Retrieved from education.yandex.ru: https://education.yandex.ru/handbook/ml/article/linear-models.
- Liu, Y. (2018, November 1). Analytical Solution of Linear Regression. Retrieved from medium.com: https://medium.com/data-science/analytical-solution-of-linear-regression-a0e870b038d5.
- (n.d.). Massovaya ocenka ob"ektov nedvizhimosti na osnove tekhnologij mashinnogo obucheniya. Analiz tochnosti razlichnyh metodov na primere opredeleniya rynochnoj stoimosti kvartir [Mass appraisal of real estate based on machine learning technologies. Accuracy analysis of various methods using the example of determining the market value of apartments].
- minstroy.gov.kg. (2025, May 7). Retrieved from minstroy.gov.kg: https://minstroy.gov.kg/ru/news/430/show.
- salyk.kg. (2025, May 7). Retrieved from calculator.salyk.kg: https://calculator.salyk.kg/infosti086.
- (2022). The future of automated real estate valuations. Saïd Business School.
- Vapnik, V., & Cortes, C. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
- Vasques, X. (2024). Machine Learning Theory and Applications. Bois-Colombes: Wiley.
¹ Apartments built during the Soviet era typically follow specific standardized series (building types), which can influence their layout, construction quality, and market value. See https://www.salut.kg/serii.php





| Model | MAE (1) | MAE (2) | RMSE (1) | RMSE (2) | R² (%) (1) | R² (%) (2) | MAPE (%) (1) | MAPE (%) (2) |
|---|---|---|---|---|---|---|---|---|
| LR | 13,458 | 12,737 | 19,007 | 18,565 | 78.19 | 79.31 | 15.16 | 13.56 |
| LR (Ridge) | 13,455 | 12,737 | 19,006 | 18,566 | 78.19 | 79.31 | 15.16 | 13.67 |
| LR (Lasso) | 13,511 | 12,813 | 19,014 | 18,663 | 78.18 | 79.09 | 15.24 | 13.68 |
| Decision Tree | 10,766 | 10,64 | 16,374 | 17,305 | 83.82 | 82.03 | 11.73 | 10.68 |
| Random Forest | 8,883 | 8,367 | 13,154 | 14,219 | 89.55 | 87.87 | 10.32 | 9.02 |
| SVM | 12,937 | 12,727 | 19,065 | 18,933 | 78.06 | 78.49 | 14 | 13.24 |
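For readers who want to reproduce figures of this kind, the four reported metrics can be computed as in the following sketch. The function name `regression_metrics` and the toy prices are assumptions made for illustration; the standard textbook definitions are used, which may differ in detail from the study's own implementation.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2 (in percent), and MAPE (in percent) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # R2 compares the residual sum of squares with that of a mean-only baseline.
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    # MAPE assumes every true value is non-zero, which holds for prices.
    mape = np.mean(np.abs(err / y_true))
    return {
        "MAE": float(mae),
        "RMSE": float(rmse),
        "R2 (%)": float(100 * r2),
        "MAPE (%)": float(100 * mape),
    }


# Invented toy prices, not taken from the dataset.
metrics = regression_metrics(
    y_true=[100_000, 200_000, 300_000],
    y_pred=[110_000, 190_000, 300_000],
)
```

MAE and RMSE are expressed in the units of the target, while R² and MAPE are scale-free, so only the latter two can be compared directly across targets with different scales.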
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).