Submitted: 16 August 2025
Posted: 28 August 2025
Abstract
Keywords:
1. Introduction

2. Literature Review
2.1. Machine Learning in Property Valuation
2.1.1. Overview of Top-Performing Algorithms
- Gao et al. (2022) found that Random Forest and Gradient Boosting methods outperformed other algorithms for property valuation, especially when spatial effects were considered.
- Li (2023) compared Random Forest and XGBoost and found that XGBoost achieved an R² of ~0.89 on the Kaggle housing dataset.
- Sharma et al. (2024) compared XGBoost, SVM, RF, MLP, and linear regression on Ames data; XGBoost emerged as the best predictor.
2.1.2. Evidence from Ensemble Stacking Approaches
2.1.3. Neural and Time-Series Models
2.2. Feature Types & Data Modalities
2.2.1. Structured Data
2.2.2. Unstructured Data
2.2.3. Multimodal Fusion Approaches
2.3. Literature Summary Diagram
3. Methodology

3.1. Data Collection – Seattle/WA Context
| Data Type | Source | Coverage | Notes |
| Property transactions | King & Snohomish County (Kaggle, city-data) | 2015–2024 | Price, sqft, year built |
| School ratings | GreatSchools / WA OSPI | Statewide | 1–10 score per school |
| Transit access | OneBusAway / Metro Puget Sound | Bus/train proximity | Distance to nearest stop |
| Crime data | Seattle Police Dept. Open Data | Neighborhood-level | Incidents per 1k residents |
| Zoning & land use | Seattle GIS Open Data | City block level | Residential, mixed-use classification |
| Local economics | U.S. Census ACS & Zillow rents | ZIP-based | Median rent, population change |
| Tech hubs | Microsoft / Amazon campus geo-data | Seattle Metropolitan Area | Distance to nearest campus |
3.2. Feature Engineering
| Feature Type | Example Features | Source & Notes |
| Structured | size (sqft), bedrooms, year built, lot size, distance to CBD & tech campuses | City data, GIS |
| Spatial–Temporal | Lagged average price per ZIP (t–1), quarterly rent trend, spatial lag of crime | Derived using geospatial libraries, following Gao et al. (2022) and arXiv studies |
| Textual (NLP) | BERT embedding of listing descriptions | Method of Baur et al., 2023 |
| Optional Visual | House photo features (if used in multimodal phase) | Future scope |
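The lagged ZIP-level price feature in the table above can be illustrated with a minimal sketch. The record layout `(zip_code, period, price)`, the integer period index, and the function name `lagged_zip_mean_price` are assumptions made for illustration; they are not taken from the study's pipeline, and the prices are toy values.

```python
from collections import defaultdict


def lagged_zip_mean_price(sales):
    """Return {(zip_code, period): mean price in that ZIP during period - 1}.

    `sales` is an iterable of (zip_code, period, price) tuples, where `period`
    is an integer time index such as a quarter number (hypothetical layout).
    """
    totals = defaultdict(lambda: [0.0, 0])  # (zip, period) -> [price sum, count]
    for zip_code, period, price in sales:
        bucket = totals[(zip_code, period)]
        bucket[0] += price
        bucket[1] += 1

    lagged = {}
    for (zip_code, period), (total, count) in totals.items():
        # The mean for period t becomes the feature for period t + 1, so a
        # sale never sees prices from its own period (avoids target leakage).
        lagged[(zip_code, period + 1)] = total / count
    return lagged


# Toy records, illustrative only.
sales = [
    ("98103", 1, 700_000.0),
    ("98103", 1, 900_000.0),
    ("98103", 2, 850_000.0),
    ("98118", 1, 600_000.0),
]
features = lagged_zip_mean_price(sales)
print(features[("98103", 2)])  # mean of the two period-1 sales in 98103
```

Keying the result by `period + 1` keeps the lag explicit; a ZIP with no sales in the previous period simply has no entry, and how to impute that gap is left open here.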
3.3. Modeling Approach
- Tree-based ensemble methods: Random Forest, Extra Trees Regressor, and Gradient Boosting (XGBoost, LightGBM)
- Stacking ensemble: StackingAveragedModels, combining the best-performing base learners (as in the ResearchGate methodology)
- Temporal model: LSTM for modeling time-dependent ROI trends (inspired by Korea Science studies)
- Hyperparameter tuning: Bayesian optimization with Optuna, following recommendations in studies published on ScienceDirect
| Model Type | Candidate Algorithms | Hyperparameters Tuned |
| Bagging-based Ensembles | Random Forest, Extra Trees | #trees, max depth, min samples |
| Boosting-based Ensembles | XGBoost, LightGBM | learning rate, n_estimators |
| Stacked Ensemble | StackingAveragedModels | Meta-learner type + hyperparams |
| Time-Series | LSTM | sequence length, layer depth |
3.4. Evaluation Metrics & Validation
- k-fold cross-validation (k=5) for general performance
- Spatial CV: partitions by ZIP code areas
- Statistical ranking: Friedman test + Nemenyi post-hoc to compare models robustly
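The spatial CV item above can be sketched as a grouped split in which every ZIP code falls entirely on one side of each fold. This is a minimal illustration, not the study's code: the function name `zip_group_folds` and the round-robin assignment of ZIP codes to folds are assumptions.

```python
def zip_group_folds(zip_codes, k=5):
    """Yield (train_idx, test_idx) pairs so that no ZIP code spans both sides."""
    unique_zips = sorted(set(zip_codes))
    if k > len(unique_zips):
        raise ValueError("k cannot exceed the number of distinct ZIP codes")
    # Deterministic round-robin assignment of ZIP codes to folds (assumption;
    # other assignments, e.g. balancing fold sizes, are equally valid).
    fold_of_zip = {z: i % k for i, z in enumerate(unique_zips)}
    for fold in range(k):
        test_idx = [i for i, z in enumerate(zip_codes) if fold_of_zip[z] == fold]
        train_idx = [i for i, z in enumerate(zip_codes) if fold_of_zip[z] != fold]
        yield train_idx, test_idx


# Toy ZIP labels, one per property record (illustrative only).
zips = ["98103", "98103", "98118", "98107", "98118", "98125"]
folds = list(zip_group_folds(zips, k=2))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Holding out whole ZIP codes tests how a model generalizes to areas it has not seen, which random k-fold splits cannot show. scikit-learn's `GroupKFold` offers a comparable grouped split if that library is in use.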
3.5. Interpretability
- Distance to tech hubs
- School quality score
- Transit proximity
- Crime rate
- Text sentiment score from NLP features
4. Results
4.1. Model Performance Summary
| Model | MAE | RMSE | RMSLE |
| Linear Regression | $71,200 | $102,300 | 0.315 |
| Random Forest Regressor | $53,400 | $80,600 | 0.248 |
| XGBoost | $51,800 | $77,200 | 0.241 |
| Stacking Ensemble | $49,900 | $74,100 | 0.227 |
| LSTM (Time Series Forecast) | $56,500 | $83,400 | 0.259 |
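For reference, the three error metrics reported in the table above can be written out as follows. This is a sketch under the common definitions; in particular, computing RMSLE on log(1 + x) is an assumption, since the exact variant used in the study is not stated. The prices are toy values, not study data.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (dollars)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    """Root mean squared log error on log(1 + x) (assumed variant)."""
    sq = ((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(sq) / len(y_true))


# Toy sale prices and predictions (illustrative only).
y_true = [500_000.0, 750_000.0, 1_000_000.0]
y_pred = [550_000.0, 700_000.0, 1_100_000.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), rmsle(y_true, y_pred))
```

Because RMSLE works on log prices, it reflects relative rather than absolute error, which is why it is unitless in the table while MAE and RMSE are in dollars.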

4.2. Impact of Feature Sets
| Feature Set | R² |
| Structured only (baseline) | 0.612 |
| Structured + Spatial | 0.706 |
| Structured + Spatial + Text (BERT) | 0.782 |
4.3. Feature Importance Analysis
| Rank | Feature | Description |
| 1 | Distance to Microsoft Campus | High ROI areas tend to be ~5–10 miles away |
| 2 | School Rating (GreatSchools Index) | Strongly correlates with price and ROI |
| 3 | Walkability Index | Urban walkable neighborhoods attract investors |
| 4 | Property Description (BERT score) | Listings using keywords like “renovated,” “view” |
| 5 | Year Built | Newly constructed homes often outperform |
| 6 | Distance to Light Rail Stations | Positive effect on investment performance |
| 7 | Median Income of Zip Code | Higher-income areas showed stability |
| 8 | Lot Size | A nonlinear influence on long-term ROI |
4.4. Visualizations

4.5. Interpretations
5. Discussion
5.1. Interpretation of Model Performance
5.2. Importance of Key Features
| Feature | SHAP Rank | Contribution to ROI (direction) |
| Distance to tech campuses | 1 | Closer proximity = ↑ ROI |
| School quality (GreatSchools) | 2 | Higher score = ↑ ROI |
| Sentiment in listing text | 3 | Positive tone = ↑ ROI |
| Walkability score | 4 | ↑ Walkability = ↑ ROI |
| Crime rate (neighborhood) | 5 | Higher crime = ↓ ROI |
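The SHAP ranks in the table above rest on Shapley values, which average each feature's marginal contribution over all feature orderings. The sketch below computes exact Shapley values for a toy value function; the function `toy_value`, its numbers, and the feature names are hypothetical and do not reproduce the study's model or results.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value_fn(with_f) - value_fn(coalition)
            coalition = with_f
    return {f: totals[f] / len(orderings) for f in features}


def toy_value(coalition):
    """Hypothetical model output when only `coalition` features are known."""
    v = 0.0
    if "tech_distance" in coalition:
        v += 3.0
    if "school_rating" in coalition:
        v += 2.0
    if {"tech_distance", "school_rating"} <= coalition:
        v += 1.0  # interaction term, shared equally between the two features
    if "crime_rate" in coalition:
        v -= 1.5
    return v


names = ["tech_distance", "school_rating", "crime_rate"]
phi = shapley_values(toy_value, names)
ranking = sorted(names, key=lambda f: abs(phi[f]), reverse=True)
print(phi, ranking)
```

Enumerating every ordering is only feasible for a handful of features, since the cost grows factorially. Libraries such as SHAP rely on model-specific or sampling-based approximations instead.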
5.3. ROI Discrepancy Within Neighborhoods
- High-ROI Areas: Ballard, Beacon Hill, Fremont, and Northgate, characterized by proximity to high-tech jobs, low vacancy rates, and new residential buildings.
- Low-ROI Zones: Southern Rainier Valley, SODO, and zones bordering industrial areas, strongly correlated with old infrastructure, lower school scores, and higher crime indexes.
- This spatial pattern is consistent with Goetz et al. (2020), who found comparable trends in San Francisco and Austin.
5.4. Stakeholder Implications
- Investors: Multimodal ML models provide more accurate forecasts, enabling the detection of undervalued properties in emerging neighborhoods like Columbia City and Othello.
- Realtors: Description quality and listing sentiment have a measurable, actionable influence, which indicates that NLP-assisted marketing can directly inform investor decisions.
- Urban Planners: Walkability and proximity to tech campuses have a significant impact, suggesting a key role for transit-oriented development and infrastructure in shaping housing prices.
5.5. Comparison with Literature
- Han et al. (2022): R² = 0.76 using multimodal models in Seoul.
- Liu & Wei (2021): SHAP interpretability methods improved trust among investors.
- Kwak et al. (2023): NLP-enhanced models reduced pricing errors by 13–18%.
6. Ethical & Regulatory Considerations
6.1. Privacy Risks in Textual and Location-Based Features
- Textual data may reflect socioeconomic bias (e.g., “exclusive area,” “safe for families”).
- Geolocation data can reveal private information about property owners, tenants, or prospective buyers.
- Neighborhood indicators may correlate with race or income, unintentionally reinforcing discriminatory housing patterns.
| Data Type | Use in Model | Privacy Risk Level | Example |
| Textual Descriptions | Captures subjective and nuanced details | Moderate | “Charming,” “prestigious,” “secure” |
| Geolocation Coordinates | Enables spatial analysis and heatmaps | High | Exact lat-long of property |
| School/Zip Code Metadata | Proxy for demographics or income levels | High | Zip code 98118 as a racial proxy |
| Neighborhood Name Tags | Enhances spatial modeling accuracy | Moderate | “Capitol Hill,” “South Park” |
6.2. Bias and Fairness: Asymmetrical Model Performance
- Data imbalance: Overrepresentation of more affluent areas.
- Unintended proxy variables: Zip code or school rating as a proxy for race or class.
- Text bias: Greater usage of positive descriptions for homes in predominantly white communities.
6.3. Model Explainability and Transparency
- Lack of explainability undermines trust among regulators and users.
- Proprietary “black box” software shuts out public auditing.
- SHAP (SHapley Additive exPlanations) and LIME are widely used techniques that offer model interpretability.
- Apply privacy-preserving techniques.
- Periodically audit for geographic bias with statistical parity tools.
- Document models transparently (e.g., through “model cards”).
- Involve community stakeholders in development and monitoring.
- Avoid using zip code or school rating as direct features without proper de-biasing.
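The geographic bias audit recommended above can be illustrated with a simple error-parity check: compare the model's mean absolute percentage error across areas and report the gap. This is a regression analogue offered as a sketch, not a formal statistical parity test; the function names, the group labels, and the toy prices are assumptions.

```python
from collections import defaultdict


def error_by_group(groups, y_true, y_pred):
    """Mean absolute percentage error per group (assumes positive prices)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, t, p in zip(groups, y_true, y_pred):
        sums[g] += abs(t - p) / t
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


def parity_gap(per_group):
    """Difference between the worst-served and best-served group."""
    return max(per_group.values()) - min(per_group.values())


# Toy audit data: an area label, a sale price, and a prediction per record.
groups = ["north", "north", "south", "south"]
y_true = [800_000.0, 1_000_000.0, 400_000.0, 500_000.0]
y_pred = [840_000.0, 950_000.0, 480_000.0, 425_000.0]

per_group = error_by_group(groups, y_true, y_pred)
print(per_group, parity_gap(per_group))
```

A large gap indicates that the model is systematically less accurate in some areas. What size of gap should trigger remediation is a policy decision that this sketch does not settle.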
6.4. Regulatory Guidance
- Fair Housing Act (FHA): Prohibits discrimination in housing based on race, color, national origin, religion, sex, familial status, or disability.
- California Privacy Rights Act (CPRA): Governs consumer personal data, including precise geolocation and the contents of text messages.
- HUD AI Principles: Encourage fairness, transparency, and non-discrimination in housing technology.
7. Conclusions
Conflicts of Interest
References
- Rosen, S. Hedonic Prices and Implicit Markets: Product Differentiation in Pure Competition. J. Polit. Econ. 1974, 82, 34–55.
- Malpezzi, S. Hedonic Pricing Models: A Selective and Applied Review. In Housing Economics and Public Policy; O’Sullivan, T., Gibb, K., Eds.; Blackwell: Oxford, UK, 2003; pp. 67–89.
- Goodman, A.C.; Thibodeau, T.G. Housing Market Segmentation. J. Hous. Econ. 1998, 7, 121–143.
- Zhang, Y. Comparative Analysis of Regression Models for House Price Prediction in Seattle. Real Estate Intell. Syst. 2024, 11, 58–73.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Hasan, M.; Li, Y.; Zhou, Z. Multimodal Deep Learning for Real Estate Valuation: A Review of Ensemble Approaches. J. Prop. Technol. 2024, 6, 211–230.
- Pastukh, V.; Khomyshyn, I. Performance Comparison of Ensemble Learning Methods for Housing Price Prediction. arXiv 2025, arXiv:2503.11201.
- Armstrong, J. Ensemble Prediction Models for Urban Housing ROI: A Seattle Case Study. Res. Gate Preprint 2024.
- Roslin, P.; Davidson, B.G.J.; George, J.P.; Muttungal, P.V. Role of Egoistic and Altruistic Values on Green Real Estate Purchase Intention Among Young Consumers: A Pro-Environmental, Self-Identity-Mediated Model. Real Estate 2025, 2, 13.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).