Submitted:
16 August 2023
Posted:
17 August 2023
Abstract
Keywords:
- A fully automated method for data extraction and data wrangling that gives immediate access to the data.
- The method reaches an accuracy of 97.5% when classifying the input structure.
- Automating the task makes the solution more efficient.
1. Introduction
1.1. Related work
1.2. The novelty of the method
- An end-to-end method: Many approaches successfully solve one part of the data mining process, but very few encompass data extraction, data wrangling and data preprocessing so that assets from different projects become directly comparable.
- Strong validation: The suggested method has been assessed with a large number of assets from real historical projects, with reliable and robust results.
- Different approach: The suggested approach relies on existing technologies from the fields of data mining and machine learning, assembled in an alternative way to target a different purpose. This combination yields a unique method that encompasses the whole process.
2. The method
2.1. Materials
- Anaconda Navigator version 2.2, for creating the environment.
- The Jupyter Notebook IDE (integrated development environment), version 6.4.5.
- Python version 3.7.
- Several open-source libraries, most notably “pandas” for building the data structures and “scikit-learn” for the machine learning capabilities.
2.2. Input data
2.3. Understanding all processes of the method
- Id: An integer that increases sequentially and numerically identifies each asset registered in the dataset.
- Bill attribute: A string that identifies the number of the bill where the asset is located, together with a short description of it. For example: “Bill 123 Mechanical and plumbing”.
- Bill description: Another string attribute, redundant with the bill attribute, that contains only the short description of the bill. It is used later for categorization purposes.
- Category: A categorical attribute containing a string that uniquely identifies the top-level SMM7 category that the asset belongs to.
- Subcategory: Another categorical attribute that identifies the second level of the SMM7 standard and gives a more specific categorization. For example, the category “D groundwork” contains the subcategory “D20: excavating and filling”.
- Description 1, 2 and 3: As additional information, each row contains three descriptions, where the first is the most generic and the last is the most specific. Their content varies widely. For example, they can contain measurement limits such as “maximum depth not exceeding 1.50m”, or they can specify the type of work carried out, such as “Site preparation”.
- Quantity: An integer that specifies the number of items needed.
- Unit: A string that describes the unit of measure, such as meter, item or square meter. For example, if the quantity of an item is 100 and the unit of measure is square meter, the dataset indicates that 100 square meters of that asset were needed on the project.
- Rate: A decimal number giving the price charged for each unit of measure. For example, for each square meter of a constructed wall, the client may be charged 157.57 GBP.
- Total cost: The number obtained by multiplying the rate by the quantity. Following the previous examples, if the rate for each square meter of a wall is 157.57 GBP and the quantity is 100, the total cost is 15,757 GBP.
- Letter: The BoQ files used as input contain a letter that uniquely identifies each asset within the same category and subcategory.
- Page number: As helpful information, the processed data structure includes the page number where the original item was registered in the input file. In this way, the accountant surveyor can double-check the correctness of the attributes more quickly.
- Trade based category name: One of the projects also contains a trade-based classification of all its assets. This string attribute therefore works as a classification attribute that identifies the trade-based category the asset belongs to.
- Trade based category number: The amount of the total cost that is allocated to that trade-based category. When the asset belongs to only one category, this number is the same as the total cost attribute.
- Second Trade based category name: Since SMM7 is not a trade-based standard, there are a few cases where a single SMM7 asset belongs to two categories under a trade-based approach. This attribute is therefore blank in most cases, and it specifies the second category that the asset belongs to when such a conflict occurs.
- Second Trade based category number: In those cases where the asset belongs to more than one trade-based category, this number indicates the cost allocated to the second category. For example, a fictitious asset classified under the SMM7 class “Masonry” with a total cost of 10,000 GBP could have 4,000 GBP allocated to “Substructure” and 6,000 GBP to “External walls” under the trade-based standard.
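The attribute list above can be read as a typed record. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `BoQItem`, the field names and the helper `trade_split_is_consistent` are hypothetical, `Decimal` is an assumed choice for monetary values, and the placeholder strings as well as the rate and quantity of the Masonry item are invented only so that its total matches the 10,000 GBP of the fictitious example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Tuple


@dataclass
class BoQItem:
    """One processed row; field names are illustrative, not from the paper."""
    id: int
    bill: str
    bill_description: str
    category: str                      # top-level SMM7 category
    subcategory: str                   # second-level SMM7 category
    descriptions: Tuple[str, str, str]  # from most generic to most specific
    quantity: int
    unit: str                          # e.g. "m", "m2", "item"
    rate: Decimal                      # price per unit of measure
    letter: str
    page_number: int
    trade_category: Optional[str] = None
    trade_cost: Decimal = Decimal("0")
    second_trade_category: Optional[str] = None  # blank in most cases
    second_trade_cost: Decimal = Decimal("0")

    @property
    def total_cost(self) -> Decimal:
        # Total cost = rate x quantity, as described in the attribute list.
        return self.rate * self.quantity

    def trade_split_is_consistent(self) -> bool:
        # The costs allocated to the trade-based categories should add up
        # to the total cost of the asset.
        return self.trade_cost + self.second_trade_cost == self.total_cost


# Wall example from the text: 100 m2 at 157.57 GBP per m2.
# All strings marked "placeholder" are assumptions.
wall = BoQItem(
    id=0, bill="Bill 1 placeholder", bill_description="placeholder",
    category="placeholder", subcategory="placeholder",
    descriptions=("placeholder", "placeholder", "placeholder"),
    quantity=100, unit="m2", rate=Decimal("157.57"),
    letter="a", page_number=1,
    trade_category="External walls", trade_cost=Decimal("15757.00"),
)

# Fictitious Masonry asset: 10,000 GBP split 4,000 / 6,000.
# The rate and quantity are assumed values that multiply to 10,000.
masonry = BoQItem(
    id=1, bill="Bill 1 placeholder", bill_description="placeholder",
    category="Masonry", subcategory="placeholder",
    descriptions=("placeholder", "placeholder", "placeholder"),
    quantity=100, unit="m2", rate=Decimal("100.00"),
    letter="b", page_number=1,
    trade_category="Substructure", trade_cost=Decimal("4000"),
    second_trade_category="External walls", second_trade_cost=Decimal("6000"),
)

print(wall.total_cost)                      # 15757.00
print(masonry.trade_split_is_consistent())  # True
```

A consistency check of this kind is one way a surveyor could be pointed to rows that need a manual look, complementing the page number attribute.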
3. Results

4. Conclusions
References
- Ahn, S. J., Han, S. U., & Al-Hussein, M. (2020). Improvement of transportation cost estimation for prefabricated construction using geo-fence-based large-scale GPS data feature extraction and support vector regression. Advanced Engineering Informatics, 43.
- Akanbi, T., & Zhang, J. (2021). Design information extraction from construction specifications to support cost estimation. Automation in Construction, 131.
- Desai, V. S. (n.d.). Improved Decision Tree Methodology for the Attributes of Unknown or Uncertain Characteristics-Construction Project Prospective. The International Journal of Applied Management and Technology, 6, 201.
- Fisher, D., Miertschin, S., & Pollock Jr., D. R. (1995). Benchmarking in Construction Industry. Journal of Management in Engineering, 11(1), 50–57.
- Hong, H., Tsangaratos, P., Ilia, I., Liu, J., Zhu, A.-X., & Chen, W. (2017). Application of fuzzy weight of evidence and data mining techniques in construction of flood susceptibility map of Poyang County, China.
- Ilmi, A. A., Supriadi, L. S. R., Latief, Y., & Muslim, F. (2020). Development of dictionary and checklist based on Work Breakdown Structure (WBS) at seaport project construction for cost estimation planning. IOP Conference Series: Materials Science and Engineering, 930(1).
- Moreno, V., Génova, G., Parra, E., & Fraga, A. (n.d.). Application of machine learning techniques to the flexible assessment and improvement of requirements quality.
- Murray, G. P. (1997). Rules and Techniques for Measurement of Services. Measurement of Building Services, 9–18.
- Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1–47.
- Soibelman, L., Wu, J., Caldas, C., Brilakis, I., & Lin, K. Y. (2008). Management and analysis of unstructured construction data types. Advanced Engineering Informatics, 22(1), 15–27.
- Stoy, C., Dreier, F., & Schalcher, H.-R. (n.d.). Construction duration of residential building projects in Germany.
- Symonds, B., Barnes, P., & Robinson, H. (2015). New Approaches and Rules of Measurement for Cost Estimating and Planning. Design Economics for the Built Environment: Impact of Sustainability on Project Evaluation, 31–46.
- Yan, H., Yang, N., Peng, Y., & Ren, Y. (2020). Data mining in the construction industry: Present status, opportunities, and future trends.
- Zhong, Y. (n.d.). Research on Construction Engineering Project Management Optimization Based on C4.5 Improved Algorithm.
- Zou, Y., Kiviniemi, A., & Jones, S. W. (2017). Retrieving similar cases for construction project risk management using Natural Language Processing techniques. Automation in Construction, 80, 66–76.




| Aim | Approach | Reference |
|---|---|---|
| To identify similar construction projects for risk management. | The combination of NLP (Natural Language Processing) and Machine learning with a case base reasoning approach. | (Zou et al., 2017) |
| To enhance the attributes classification in construction projects. | The combination of data analysis and machine learning to identify the main factors that drive these classifications and provide reliable predictions. | (Desai, n.d.) |
| The optimization of risks applied to construction projects. | A two-step method is suggested based on the generation of the optimization attributes and the implementation of the algorithm C4.5. | (Zhong, n.d.) |
| Automatic text categorization of the project’s assets. | A system that harnesses the benefits of NLP and machine learning for making an automatic text categorization. | (Sebastiani, 2002) |
| To analyze the variability and the types of data structures used in construction projects. | A method that combines data extraction, data mining and analysis to assess the variability of structures among different projects. | (Soibelman et al., 2008) |
| To identify the non-flood areas in Poyang County, China. | Different processes of data extraction and analysis are carried out, leading to the identification of the flood risk areas. | (Hong et al., 2017) |
| To review and assess the current state of data mining in construction projects. | A systematic review of the historical application of data mining through the years to construction projects. | (Yan et al., 2020) |
| To decrease the transportation costs of prefabricated construction pieces | The approach extracts and processes geospatial data to feed the support vector machine for regression. | (Ahn et al., 2020) |
| To automate the process of data extraction to support cost estimation | A method composed of three processes: extracting the design information, matching the specified materials to items in the database, and retrieving the price information of those materials | (Akanbi & Zhang, 2021) |
| To make a dictionary based on the WBS standard to support costs estimation | To carry out different surveys based on experts’ opinions to develop the dictionary | (Ilmi et al., 2020) |
| To assess the main factors of the duration of construction projects | A data analysis is performed to assess the main factors that influence in determining the length of the construction projects. | (Stoy et al., n.d.) |

| Id | Bill description | Category | Subcategory | Description level 1 | Description level 2 |
|---|---|---|---|---|---|
| 0 | Groundworks & substruct. | C demolition /… | C90 alterations… | Various loc. on site | Existing perimeter fencing and disp… |
| 1 | Groundworks & substruct. | C demolition /… | C90 alterations… | Various loc. on site | Remove existing timber fencing int… |
| 2 | Groundworks & substruct. | D groundwork | D20 excavating… | Site preparation | Site preparation |
| 3 | Groundworks & substruct. | D groundwork | D20 excavating… | excavating | To reduce levels |
| 4 | Groundworks & substruct. | D groundwork | D20 excavating… | excavating | Basements and the like |

| Id | Description level 3 | Quantity | Unit | Rate | Total cost | Letter | Page num. |
|---|---|---|---|---|---|---|---|
| 0 | Complete; provisional | 113 | m | 2258 | 255154 | a | 1 |
| 1 | Complete; provisional | 154 | m | 2258 | 347732 | b | 1 |
| 2 | Brushes, scrub, undergrowth, hedges, trees and … | 3328 | m2 | 237 | 765036 | a | 1 |
| 3 | Maximum depth not exceeding 2.00m | 1140 | m3 | 339 | 38646 | b | 1 |
| 4 | Maximum depth not exceeding 1.00m | 242 | m3 | 339 | 82038 | c | 1 |

| Id | Trade-based category name | Trade-based category code | Trade-based cat. name 2 | Trade-based cat. code 2 |
|---|---|---|---|---|
| 0 | Site works | 255154 | - | 0 |
| 1 | Site works | 347732 | - | 0 |
| 2 | Substructure | 76036 | - | 0 |
| 3 | Substructure | 38646 | - | 0 |
| 4 | Substructure | 82038 | - | 0 |
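
Because each asset can carry up to two trade-based categories, summing costs per trade category requires stacking both name and cost pairs before grouping. The snippet below is a minimal pandas sketch of that step using the five sample rows of the table above, with values copied as printed. The column names are hypothetical, the "-" entries are represented as missing values, and the "code" columns are interpreted as the allocated cost, following the attribute description in Section 2.3.

```python
import pandas as pd

# Sample rows from the table above; column names are illustrative.
rows = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "trade_name_1": ["Site works", "Site works",
                     "Substructure", "Substructure", "Substructure"],
    "trade_cost_1": [255154, 347732, 76036, 38646, 82038],
    "trade_name_2": [None, None, None, None, None],  # "-" in the table
    "trade_cost_2": [0, 0, 0, 0, 0],
})

# Bring both (name, cost) pairs into the same two columns.
first = rows[["trade_name_1", "trade_cost_1"]].rename(
    columns={"trade_name_1": "trade_category", "trade_cost_1": "cost"})
second = rows[["trade_name_2", "trade_cost_2"]].rename(
    columns={"trade_name_2": "trade_category", "trade_cost_2": "cost"})

# Stack them and drop the rows without a second category.
allocations = pd.concat([first, second], ignore_index=True)
allocations = allocations.dropna(subset=["trade_category"])

# Total allocated cost per trade-based category.
cost_per_trade = allocations.groupby("trade_category")["cost"].sum()
print(cost_per_trade)
```

With this layout, an asset split across two trade categories contributes to both totals without being counted twice.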
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).