Submitted: 03 April 2024
Posted: 04 April 2024
Abstract
Keywords:
1. Introduction
- What are the dynamics of (that is, the features affecting) housing prices in the UK? Are these dynamics the same across geographies (regions within England)?
- Is it possible to use machine learning algorithms as a research methodology to develop a housing price prediction model?
2. Literature Review
3. Methods and Data
3.1. Data Collection
3.2. Data Linkage
3.3. Data Cleansing
3.4. Determination of Variables or Features
3.5. Data Visualization
3.6. Correlational Analysis
- Households, as it is highly correlated with Population.
- Postcode_Area, as it is highly correlated with Postcode.
- Price per Square Area, as it is highly correlated with Price (the dependent variable).
- Number of Rooms, as it is highly correlated with Total Floor Area.
- Latitude and Longitude, as they are correlated with Postcode.
- Average Income, as it is correlated with Postcode, Index of Multiple Deprivation, and Postcode_Area.
- Average Distance Field, as it is correlated with Average Distance Parks.

- Households, as it is highly correlated with Population.
- Postcode_Area, as it is highly correlated with Postcode.
- Index of Multiple Deprivation, as it is highly correlated with Average Income.
- Price per Square Area, as it is highly correlated with Price (the dependent variable).
- Number of Rooms, as it is highly correlated with Total Floor Area.
- Latitude and Longitude, as they are correlated with Postcode.

- Households, as it is highly correlated with Population.
- Postcode_Area, as it is highly correlated with Postcode.
- Index of Multiple Deprivation, as it is highly correlated with Average Income.
- Price per Square Area, as it is highly correlated with Price (the dependent variable).
- Number of Rooms, as it is highly correlated with Total Floor Area.
- Latitude and Longitude, as they are correlated with Postcode.

- Households, as it is highly correlated with Population.
- Postcode_Area, as it is highly correlated with Postcode.
- Price per Square Area, as it is highly correlated with Price (the dependent variable).
- Number of Rooms, as it is highly correlated with Total Floor Area.
- Latitude and Longitude, as they are correlated with Postcode.
- LSOA, as it is correlated with Postcode.
- Average Distance Field, as it is correlated with Average Distance Parks.
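
The correlation-based exclusions listed above can be illustrated with a minimal sketch. It assumes a Pearson correlation and a cut-off of 0.9; neither the coefficient nor the threshold is stated in the source, and the toy values and the helper names `pearson` and `drop_correlated` are hypothetical. In practice a data-frame library would normally be used for this step; the standard library is used here only to keep the example self-contained.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Return the names of features to drop.

    For every pair of features whose absolute correlation exceeds
    `threshold` (an assumed cut-off), the later-listed feature is dropped.
    """
    dropped = []
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue  # one member of this pair has already been removed
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.append(b)
    return dropped


# Toy data, invented for illustration (not taken from the study).
features = {
    "Population": [120, 340, 560, 210, 480],
    "Households": [50, 140, 230, 90, 200],  # tracks Population closely
    "Total Floor Area": [85, 60, 120, 95, 70],
}

to_drop = drop_correlated(features, threshold=0.9)
print(to_drop)  # ['Households']
```

Keeping the first-listed feature of each pair is an arbitrary choice in this sketch; the lists above reflect the authors' own decisions about which variable of each pair to retain.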

3.7. Linear Regression
3.8. Machine Learning Algorithms
3.8.1. K-Nearest Neighbour (KNN)
3.8.2. Gradient Boosting
3.8.3. XGBoost Modelling
3.8.4. Random Forest
3.8.5. Extra Tree
3.8.6. Bagging Tree
3.8.7. Artificial Neural Network
3.9. Data Splitting
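
As a minimal sketch of a train/test split, the following shuffles the records reproducibly and holds out a fraction for testing. The 80/20 ratio, the random seed, the toy records, and the helper name `train_test_split` are all assumptions made for illustration; the source does not state the split that was used.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle `rows` reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records (total floor area in square metres, sale price in pounds),
# invented for illustration.
sales = [(50 + i, 150_000 + 2_000 * i) for i in range(10)]

train, test = train_test_split(sales, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split identical between runs, so that models are compared on the same held-out records.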
4. Results
4.1. Model Prediction
4.2. Performance Metrics
- Accuracy
- Mean Absolute Error (MAE)
- Mean Square Error (MSE) and Root Mean Square Error (RMSE)
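
The metrics listed above can be computed as in the sketch below. It assumes that the "Accuracy" reported for a regression model corresponds to the coefficient of determination (R-squared); the source does not confirm this. The helper name `regression_metrics` and the sample prices are hypothetical.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R-squared, MAE, MSE and RMSE for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mse = sum(e * e for e in errors) / n       # mean square error
    rmse = sqrt(mse)                           # same units as the price
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot                # 1 - SS_res / SS_tot
    return {"r2": r2, "mae": mae, "mse": mse, "rmse": rmse}


# Toy sale prices in pounds, invented for illustration.
actual = [200_000, 250_000, 300_000, 350_000]
predicted = [210_000, 240_000, 310_000, 340_000]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because RMSE is the square root of MSE, the two columns of a results table can be cross-checked against each other; MAE and RMSE are both expressed in pounds, whereas MSE is in pounds squared.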
5. Discussion and Conclusion
5.1. Discussion
5.2. Conclusion








| Number | Name of table | Source | Geographic hierarchy |
|---|---|---|---|
| 1 | UK House Price from 1995 to 2017 | Land Registry https://www.gov.uk/guidance/about-the-price-paid-data | Postcode |
| 2 | England base rate 1979-2017 Bank of England | Bank of England Official Bank Rate History https://www.bankofengland.co.uk/boeapps/database/Bank-Rate.asp | |
| 3 | Gross Disposable Household Income (GDHI) per head of population at current basic price (1997 to 2017) | Office for National Statistics (ONS, 2021) https://www.ons.gov.uk/economy/regionalaccounts/grossdisposablehouseholdincome/bulletins/regionalgrossdisposablehouseholdincomegdhi/1997to2017 | Regional Level |
| 4 | Postcode Headcounts and Household Estimates - 2011 Census | Office for National Statistics https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/2011censusheadcountsandhouseholdestimatesforpostcodesinenglandandwales | Postcode |

| Data item | Explanation (where appropriate) |
|---|---|
| Price Per Square Area | Price per square metre |
| Price | Housing sale price specified on the transfer deed. |
| Year | Completion date of the house sale as recorded on the transfer deed. |
| Postcode | Postcode recorded at the time of the sale. |
| Property Type | D = Detached, S = Semi-Detached, T = Terraced, F = Flats/Maisonettes, O = Other |
| Old Or New | Specifies the age of the house and relates to all price paid sales, non-residential and residential: Y = a newly built house, N = an established residential building |
| Duration | Tenure: L = Leasehold, F = Freehold, etc. |
| Total Floor Area | In square metres |
| Number Of Rooms | |
| Latitude | |
| Longitude | |
| Population | in tenth |
| Households | |
| MSOA | Middle Layer Super Output Area |
| Rural Or Urban | |
| IMD | Index Of Multiple Deprivation |
| Distance To Station | |
| Quality | |
| LSOA | Lower Layer Super Output Area |
| Average Income | In pounds |
| Average Distance Parks | Average distance to nearest park or public garden (m) |
| Median Number Parks | Median number of parks and public gardens within 1,000 m radius |
| Average Distance Field | Average distance to nearest park or public garden or playing field (m) |
| Median Number Field | Median number of parks and public gardens and playing fields within 1,000 m radius |
| Potential Energy Efficiency | |
| Current Energy Efficiency | |
| Region | |
| Postcode Area | |
| Interest Rate | |

| Region | R-squared |
|---|---|
| York | 0.913 |
| South East England | 0.908 |
| North East England | 0.845 |
| London | 0.926 |

| York | South East England | North East England | London |
|---|---|---|---|
| Year | Year | Property Type | Year |
| Property Type | Property Type | Duration | Property Type |
| Duration | Duration | Total Floor Area | Duration |
| Postcode | Total Floor Area | Current Energy Efficiency | Postcode |
| Total Floor Area | Current Energy Efficiency | Potential Energy Efficiency | Total Floor Area |
| Current Energy Efficiency | Potential Energy Efficiency | Interest Rate | Current Energy Efficiency |
| Interest Rate | Interest Rate | Population | Potential Energy Efficiency |
| Population | Population | MSOA | Old Or New |
| LSOA | MSOA | Distance To Station | Interest Rate |
| Average Income | Rural Or Urban | LSOA | Population |
| Average Distance Parks | LSOA | Average Income | MSOA |
| Median Number Parks | Average Income | Average Distance Field | Rural Or Urban |
| Average Distance Field | Median Number Parks | | Distance To Station |
| Median Number Field | Average Distance Field | | LSOA |
| | Median Number Field | | Average Income |
| | | | Average Distance Parks |
| | | | Median Number Parks |
| | | | Average Distance Field |
| | | | Median Number Field |

| York | South East England | North East England | London |
|---|---|---|---|
| Old Or New | Old Or New | Old Or New | Population |
| MSOA | Postcode | Postcode | Rural Or Urban |
| Distance To Station | Distance To Station | Year | |
| Potential Energy Efficiency | Average Distance Parks | Average Distance Parks | |
| Rural Or Urban | | Rural Or Urban | |
| | | Median Number Parks | |
| | | Median Number Field | |

| Model | Accuracy Training Dataset | Accuracy Test Dataset | MAE | MSE | RMSE |
|---|---|---|---|---|---|
| Catboost Modelling | 0.93 | 0.92 | 57206.13 | 8634971075.26 | 92924.55 |
| Gradient Boosting Modelling | 0.95 | 0.94 | 58100.19 | 9039008051.25 | 95073.70 |
| Random Forest Modelling | 0.97 | 0.97 | 63472.69 | 11412995478.65 | 106831.62 |
| Bagging Modelling | 0.96 | 0.96 | 67157.41 | 12478228588.71 | 111705.99 |
| Extra Tree Modelling | 0.99995742 | 0.99996333 | 62722.57 | 10594822494.29 | 102931.15 |
| K Nearest Neighbour | 0.42 | 0.36 | 105393.98 | 23500114089.14 | 153297.47 |
| Artificial neural network (ANN) | 0.69 | 0.70 | 72055.89 | 13692922445.71 | 117016.76 |

| Model | Accuracy Training Dataset | Accuracy Test Dataset | MAE | MSE | RMSE |
|---|---|---|---|---|---|
| Catboost Modelling | 0.94 | 0.93 | 27949.50 | 2045005470.94 | 45221.74 |
| Gradient Boosting Modelling | 0.96 | 0.96 | 27941.77 | 2078620579.04 | 45591.89 |
| Random Forest Modelling | 0.96 | 0.96 | 29216.17 | 2546272818.35 | 50460.61 |
| Bagging Modelling | 0.95 | 0.94 | 31351.48 | 2922195661.99 | 54057.34 |
| Extra Tree Modelling | 0.999 | 0.999 | 28581.57 | 2319013654.12 | 48156.14 |
| K Nearest Neighbour | 0.46 | 0.40 | 40591.47 | 4561387630.17 | 67538.05 |
| Artificial neural network (ANN) | 0.10 | 0.23 | 36507.80 | 4451194801.00 | 66717.28 |

| Model | Accuracy Training Dataset | Accuracy Test Dataset | MAE | MSE | RMSE |
|---|---|---|---|---|---|
| Catboost Modelling | 0.94 | 0.94 | 29800.33 | 2235504818.96 | 47281.13 |
| Gradient Boosting Modelling | 0.96 | 0.96 | 29860.13 | 2385737831.62 | 48844.02 |
| Random Forest Modelling | 0.97 | 0.98 | 30835.19 | 2410152520.95 | 49093.30 |
| Bagging Modelling | 0.96 | 0.97 | 32369.18 | 2749485237.01 | 52435.53 |
| Extra Tree Modelling | 0.9998 | 0.9999 | 30532.97 | 2358225634.50 | 48561.57 |
| K Nearest Neighbour | 0.42 | 0.36 | 44280.62 | 5446350837.94 | 73799.40 |
| Artificial neural network (ANN) | 0.80 | 0.77 | 32067.49 | 2630997906.78 | 51293.25 |

| Model | Accuracy Training Dataset | Accuracy Test Dataset | MAE | MSE | RMSE |
|---|---|---|---|---|---|
| Catboost Modelling | 0.92 | 0.92 | 46672.64 | 7892542250.45 | 88839.98 |
| Gradient Boosting Modelling | 0.95 | 0.94 | 46572.09 | 7965198433.66 | 89247.96 |
| Random Forest Modelling | 0.97 | 0.97 | 51823.78 | 9703789163.98 | 98507.81 |
| Bagging Modelling | 0.96 | 0.96 | 54467.37 | 10018417668.38 | 100092.05 |
| Extra Tree Modelling | 0.9999 | 0.9999 | 52019.27 | 9921407318.15 | 99606.26 |
| K Nearest Neighbour | 0.52 | 0.46 | 76937.46 | 16912925841.32 | 130049.71 |
| Artificial neural network (ANN) | 0.65 | 0.61 | 58174.53 | 10562884875.36 | 102775.90 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).