Submitted:
18 August 2023
Posted:
18 August 2023
Abstract
Keywords:
1) Introduction
2) Related work
2.1) Data science and machine learning on infrastructure costs
2.2) Railway infrastructure studies
2.3) Strongly related studies
3) Materials and methods
3.1) Materials
- Anaconda Navigator version 2.2.
- The Jupyter Notebook integrated development environment (IDE), version 6.4.5.
- The Python programming language.
- Several open-source libraries, notably “xlrd” for reading Excel files and “os” for accessing operating system capabilities.
3.2) The scenario
- Tier 1: Describes costs at the project or sub-project level for either ‘buildings’ or ‘civil engineering’. This attribute takes the following categories: buildings and property, civil engineering, electrical power plant, operational telecommunications, permanent way, railway control systems and train power systems.
- Tier 2: Describes broad ‘cost categories’ such as Acquisition Costs; Construction Costs; Renewal Costs; Operation Costs; Maintenance Costs; End of Life Costs; and Life Cycle Cost. It takes a wider range of values: AC (OLE), AC Traction Power System, Buildings, Businesses, Canopies, Car parks and roads, DC, DC Traction Power System, Depot plant, Drainage, Earthworks, Electrical, Fencing, Level crossing, Lifts and escalators, Mechanical, Network, Operational telecoms, Plain line, Platforms, Signaling, Signaling power supplies, Station Information and Security Systems, Structures, Switches and crossings, Train sheds.
- Tier 3: Describes ‘cost groups’ covering the sub-division of cost category totals into a more detailed breakdown in each case. For instance, in the construction costs category, this includes key elements such as Substructure, Structure, Preliminaries, Services and Equipment. It can take the following values: Approaches, Auto (MSL), Auto (RTL), Auxiliary Transformer, Ballast, Business Voice, Cables, Cabling and Containments, Clocks, Closed Circuit Television, Coastal and Estuarine Defenses, Concentrator, Conductor Rail System, Control, Control System Only, Controls and Interlocking, Culverts, Customer Information Systems, Disconnectors, DNO Supply, Driver Only Operation System Components, Embankments, Footbridges, FSP Auto Reconfigurable, FSP Manual Reconfigurable, FSP Radial Feed, Generator, GSM-R, HV Cables, HV Switchgear, HV Transformers, Interlocking Only, Level Crossing Refurbishment Treatments, Lineside Telephone, LV dc Cables, LV Switchgear, Negative Short Circuit Device, Neutral Section, OLE system, Operational Voice, Over bridges, Phones only, Power, Principal Supply Point, Protection Relays, Protection System Upgrade, Public Address, Public Address / Voice Alarm, Public Emergency Telephone System, Radio, Rail, Rail Ballast, Rail Sleepers, Rail sleepers ballast, Retaining Walls, Rock Cuttings, RTU (SCADA), Signaling System, Sleepers, Soil cuttings, Station Help Points, Structures, TNO/DNO HV Supply, Trackside Equipment Only, Transformers/Rectifiers, Transmission FTN, Transmission IP, Transmission Legacy, Tunnels, Under bridges, Uninterruptible Power Supply, User Operated, Voice Recorders, Wire Run.
- Work Type: A label that describes the work that has been done such as refurbishment, replace full or replace partial.
- Work Type code: The unique identifier code linking the work type that has been carried out.
- Primary reference: A group of eight numbers and letters uniquely identifying each asset of each project.
- Asset: A generic classification attribute, broadly similar to the old Tier 1 attribute described above. It can take the following values: Buildings and property, civils (drainage - resilience), civils (drainage - earthworks), civils (drainage - track), civils (earthworks), civils (structures), electric power and plant, permanent way, railway control systems, telecommunications, train power systems.
- Structures: A more specific classification attribute, broadly similar to the previous Tier 3 categories, with a wider range of values: AC HV Cables, AC HV switchgear, AC HV transformer, AC overhead line equipment (OLE), AC protection Relay, AC remote terminal unit, AC transmission or distribution network operator HV supply, auxiliary transformer, bespoke color light signaling, buildings, canopies, car parks and roads, chamber, channel, coastal defenses, conductor rail heating, control system, controls and interlocking, culvert, DC conductor rail system, DC disconnectors, DC HV cables, DC HV switchgear, DC HV transformer, DC LV cables, DC LV switchgear, DC negative short circuit device, DC protection relay, DC remote terminal unit, depot plant, distribution network operator (DNO), electrical wiring and lighting system, embankment, European train control system (ETCS), fencing, footbridges, FSP auto reconfiguration, FSP manual reconfiguration, FSP radial feed, generator, gravel drain, hot axle box detector (HABD), interlocking, level crossing, lifts and escalators, lighting, mechanical heating, mineworkings – deep, mineworkings – shallow, mineworkings – surface, moving bridges, network, operational communications, over bridge, pantograph measuring system (PMS), pipe, plain line, platforms, points heating, principal supply point (PSP), pumps, ramp, remote condition monitoring (RCM), retaining wall, rock cutting, signaling cables, simple modular color light signaling, soil cutting, station information and surveillance system, switch and crossings, trackside equipment, train sheds, tunnel, under bridge, uninterruptible power supply, water tanking, wheel force measuring system.
- Work type: A label that uniquely identifies the work that has been done, such as refurbishment or new building.
- Work solution: An attribute that briefly describes the work carried out to accomplish the task.
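The four attributes of the newer format can be pictured as a small record type. The following is a minimal sketch only: the class name `CostRecord`, its field names and the validation rule are hypothetical and are not taken from the CAF files themselves. The set of asset values is copied from the list above.

```python
from dataclasses import dataclass

# Asset values as enumerated in the text for the newer format (lower-cased).
ASSET_VALUES = {
    "buildings and property",
    "civils (drainage - resilience)",
    "civils (drainage - earthworks)",
    "civils (drainage - track)",
    "civils (earthworks)",
    "civils (structures)",
    "electric power and plant",
    "permanent way",
    "railway control systems",
    "telecommunications",
    "train power systems",
}


@dataclass(frozen=True)
class CostRecord:
    """One entry in the newer format (hypothetical field names)."""

    asset: str          # generic classification, comparable to the old Tier 1
    structure: str      # more specific classification, comparable to Tier 3
    work_type: str      # e.g. "refurbishment" or "new building"
    work_solution: str  # short free-text description of the work

    def __post_init__(self):
        # Reject asset categories that are not in the enumerated list.
        if self.asset.lower() not in ASSET_VALUES:
            raise ValueError(f"unknown asset category: {self.asset!r}")


# Example values chosen for illustration only.
example = CostRecord(
    asset="Permanent way",
    structure="plain line",
    work_type="refurbishment",
    work_solution="illustrative description",
)
```

Validating the asset on construction is one possible design choice; it surfaces spelling variants between file versions early instead of during later analysis.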
3.3) The output structure
3.4) The method step by step
- Description: In this step, the suggested method iteratively loads each of the CAF files taken as input and extracts four types of information from them: project details, cost details, GRIP stage details and possession strategy.
- Input: The input of this step is the information distributed across 23 CAF files from real historical projects. Their structure differs depending on their version, which ranges from 1.5 to 2.3.
- Output: This step creates four folders: one for project information, one for project details, one for GRIP stage details and one for possession strategy. Each folder contains 299 Excel files with information extracted from the initial CAF files.
- Description: In the data merging step, the data generated in the previous step is gathered and combined. The merge must account not only for the four types of information that are combined into one file, but also for the fact that different versions of the CAF files contain different attributes.
- Input: The input for this step is the output of the previous step: four folders, each containing 299 files with the information extracted from the CAF files.
- Output: This step has two main outputs. First, a new folder is generated with 299 files, each combining the four types of information. Second, five breakdown documents are created, summarizing all projects according to the existing CAF version (1.5, 1.7, 2.0, 2.1 and 2.2).
- Description: As a final step, some analysis techniques are applied to demonstrate that converting the data to a common format makes it possible to see the whole picture and to find relationships between the different attributes. Additionally, three machine learning algorithms are implemented to predict future project costs: linear regression, lasso regression and random forest.
- Input: All the attributes extracted in the previous step from the 23 CAF files, combined for analysis and comparison.
- Output: Inferences are drawn and knowledge is extracted from the current data to validate the suggested method.
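The merging step described above has to reconcile file versions that carry different attributes. A minimal sketch of one way to do this is shown below. It assumes that each version's rows have already been extracted into dictionaries. The column names per version, the `COMMON_COLUMNS` list, the helper names and the sample cost values are all hypothetical, since the actual CAF layouts are not given here. The mapping of Tier 1 to asset and Tier 3 to structure follows the correspondence suggested in Section 3.2.

```python
# Hypothetical per-version column names mapped onto one common schema.
COLUMN_ALIASES = {
    "1.5": {
        "Tier 1": "asset",
        "Tier 3": "structure",
        "Work Type": "work_type",
        "Cost": "cost",
    },
    "2.2": {
        "Asset": "asset",
        "Structures": "structure",
        "Work type": "work_type",
        "Work solution": "work_solution",
        "Cost": "cost",
    },
}

COMMON_COLUMNS = ["asset", "structure", "work_type", "work_solution", "cost"]


def harmonize(row, version):
    """Rename version-specific columns; attributes a version lacks become None."""
    aliases = COLUMN_ALIASES[version]
    renamed = {aliases[key]: value for key, value in row.items() if key in aliases}
    return {column: renamed.get(column) for column in COMMON_COLUMNS}


def merge(extracted):
    """Combine (version, rows) pairs into one list of common-format records."""
    merged = []
    for version, rows in extracted:
        for row in rows:
            record = harmonize(row, version)
            record["caf_version"] = version  # keep provenance for later breakdowns
            merged.append(record)
    return merged


# Made-up sample rows, for illustration only.
SAMPLE = [
    ("1.5", [{"Tier 1": "Permanent way", "Tier 3": "Ballast",
              "Work Type": "Refurbishment", "Cost": 1000.0}]),
    ("2.2", [{"Asset": "permanent way", "Structures": "plain line",
              "Work type": "refurbishment", "Work solution": "illustrative",
              "Cost": 1200.0}]),
]

merged_records = merge(SAMPLE)
```

Keeping the source version on each record makes it straightforward to produce per-version breakdowns afterwards, and filling missing attributes with `None` keeps every record on the same set of keys.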
4) Results
| Algorithm | First fold | Second fold | Average |
|---|---|---|---|
| Linear regression | 0.845 | 0.832 | 0.839 |
| Lasso regression | 0.844 | 0.833 | 0.838 |
| Random Forest | 0.939 | 0.928 | 0.934 |
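The per-fold and average columns of the table correspond to a two-fold evaluation. The sketch below illustrates how such scores can be produced, under two assumptions that the source does not confirm: that the reported score is the coefficient of determination (R²), and that the data is split into two halves that alternate as training and test sets. It uses only the standard library, with a one-feature least-squares model as a stand-in for the three algorithms named above, and synthetic data in place of the project costs.

```python
import random
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


class OneFeatureLinearModel:
    """Ordinary least squares on one feature; a stand-in for a real estimator."""

    def fit(self, xs, ys):
        x_bar, y_bar = mean(xs), mean(ys)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = y_bar - self.slope * x_bar
        return self

    def predict(self, xs):
        return [self.intercept + self.slope * x for x in xs]


def two_fold_scores(make_model, xs, ys, seed=0):
    """Return (first fold, second fold, average) R^2 scores."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    folds = [indices[:half], indices[half:]]
    scores = []
    for k in range(2):
        test_idx, train_idx = folds[k], folds[1 - k]
        model = make_model().fit(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        predictions = model.predict([xs[i] for i in test_idx])
        scores.append(r_squared([ys[i] for i in test_idx], predictions))
    return scores[0], scores[1], mean(scores)


# Synthetic data: a noisy straight line, not the project data.
rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [3.0 * x + 5.0 + rng.gauss(0.0, 2.0) for x in xs]

first, second, average = two_fold_scores(OneFeatureLinearModel, xs, ys)
```

Any object exposing `fit` and `predict` in this form could be passed as `make_model`, so the same loop can compare several algorithms on identical folds.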
5) Conclusions
Acknowledgements
References
- Allan, J. J., Wessex Institute of Technology., & International Conference on Computer Aided Design, M. and O. in the R. and O. A. M. T. S. (9th : 2004 : D. (2004). Swedish Data For Railway Infrastructure Maintenance And Renewal Cost Modelling. WIT Transactions on The Built Environment, 74, 1015.
- Caíno-Lores, S., García, A., García-Carballeira, F., & Carretero, J. (2017). Efficient design assessment in the railway electric infrastructure domain using cloud computing. Integrated Computer-Aided Engineering, 24(1), 57–72.
- Chen, D., Hajderanj, L., & Fiske, J. (2019). Towards automated cost analysis, benchmarking and estimating in construction: A machine learning approach. Multi Conference on Computer Science and Information Systems, MCCSIS 2019 - Proceedings of the International Conferences on Big Data Analytics, Data Mining and Computational Intelligence 2019 and Theory and Practice in Modern Computing 2019, 85–91.
- Deb, S., & Zhang, Y. (2004). An overview of content-based image retrieval techniques. Proceedings - International Conference on Advanced Information Networking and Application (AINA), 1, 59–64.
- Desai, V. S. (n.d.). Improved Decision Tree Methodology for the Attributes of Unknown or Uncertain Characteristics-Construction Project Prospective. The International Journal of Applied Management and Technology.
- Durazo-Cardenas, I., Starr, A., Turner, C. J., Tiwari, A., Kirkwood, L., Bevilacqua, M., Tsourdos, A., Shehab, E., Baguley, P., Xu, Y., & Emmanouilidis, C. (2018). An autonomous system for maintenance scheduling data-rich complex infrastructure: Fusing the railways’ condition, planning and cost. Transportation Research Part C: Emerging Technologies, 89, 234–253.
- Fan, M., Fan, H., Chen, N., Chen, Z., & Du, W. (2013). Active on-demand service method based on event-driven architecture for geospatial data retrieval. Computers and Geosciences, 56, 1–11.
- Fereshtehnejad, E., & Shafieezadeh, A. (2018). A multi-type multi-occurrence hazard lifecycle cost analysis framework for infrastructure management decision making. Engineering Structures, 167, 504–517.
- Ji, C., Conferences, C. X.-E. W. of, & 2021, undefined. (n.d.). New method for allocating high-speed railway infrastructure costs among train types. E3s-Conferences.Org.
- Kouris, I. N., Makris, C. H., & Tsakalidis, A. K. (2005). Using Information Retrieval techniques for supporting data mining. Data & Knowledge Engineering, 52(3), 353–383.
- Miller, C., & Meggers, F. (2017). Mining electrical meter data to predict principal building use, performance class, and operations strategy for hundreds of non-residential buildings. Energy and Buildings, 156, 360–373.
- Rama, D., & Andrews, J. D. (2016). Railway infrastructure asset management: the whole-system life cost analysis. IET Intelligent Transport Systems, 10(1), 58–64.
- Schonlau, M., Gweon, H., & Wenemark, M. (2019). Automatic Classification of Open-Ended Questions: Check-All-That-Apply Questions. https://doi.org/10.1177/0894439319869210, 39(4), 562–572.
- Soibelman, L., Wu, J., Caldas, C., Brilakis, I., & Lin, K. Y. (2008). Management and analysis of unstructured construction data types. Advanced Engineering Informatics, 22(1), 15–27.
- Wang, Y., Kung, L. A., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3–13.
- Zhong, Y. (n.d.). Research on Construction Engineering Project Management Optimization Based on C4.5 Improved Algorithm.







| Reference | Main aim | Approach |
|---|---|---|
| (Desai, n.d.) | To enhance data classification in construction projects | The creation of a method that combines machine learning with knowledge of variable correlation |
| (Zhong, n.d.) | To optimize management in construction engineering projects | The creation of a method that performs a risk assessment, an evaluation using rough set theory and the implementation of machine learning for optimization |
| (Soibelman et al., 2008) | To identify and analyze a wide variety of data structures in construction projects | A study that encompasses the search and extraction of different data structures used across a wide range of projects |
| (Chen et al., 2019) | To analyze and estimate costs in construction projects | The development of a method that combines surveyors’ knowledge with machine learning to effectively assess and predict costs |

| Reference | Main aim | Approach |
|---|---|---|
| (Ji et al., n.d.) | To perform a deeper analysis of high-speed railway infrastructure costs | The development of a framework that considers the type of train to produce a better cost estimation |
| (Caíno-Lores et al., 2017) | To run a massive number of simulations to design railway electric infrastructures efficiently | A simulation model that runs a massive number of simulations efficiently in a cloud environment |
| (Durazo-Cardenas et al., 2018) | Automatic and efficient maintenance job scheduling on railway infrastructures | The fusion of technical and business drivers to schedule and optimize the intervention plans that impact costs |
| (Allan et al., 2004) | To analyze infrastructure, costs, and traffic on Swedish railway infrastructures | A study that incorporates data gathering and data recovery techniques and concludes with some data analysis |
| (Rama & Andrews, 2016) | Railway infrastructure asset management | A proposed framework for lifecycle cost analysis |

| Reference | Main aim | Approach |
|---|---|---|
| (Kouris et al., 2005) | The use of information retrieval techniques to support data mining | The development of a two-step algorithm that acts as a search engine for making recommendations to customers using data mining |
| (Fan et al., 2013) | An on-demand service for geospatial data retrieval | The development of a prototype based on sensor web technologies |
| (Deb & Zhang, 2004) | To review the extraction of information using content-based image retrieval techniques | A systematic review analyzing a group of selected papers on content-based image retrieval systems |
| (Miller & Meggers, 2017) | To predict the building use, performance, and operations strategies of non-residential buildings | The use of data mining and machine learning for analyzing and predicting data |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
