Cost Modelling from the Contractor Perspective: Application to Residential and Office Buildings

For the majority of the contractual arrangements used in construction projects, the owner is not responsible for the cost deviations due to the variability of labor productivity or material price, amongst many other aspects. Consequently, the cost performance of a project may be entirely distinct for the owner and the contractor. Since the majority of the quantitative research on cost estimation and deviation found in the literature adopts the owners’ perspective, this research provides a contribution towards modelling costs and cost deviation from a contractors’ perspective. From an initial sample of 13 residential building and 10 office building projects, it was possible to develop models for cost estimation at the early stage of development including both endogenous and exogenous variables. Although the sample is relatively small, the authors were able to fully analyze all the cost data without resorting to secondary sources of data (which are very frequent in cost modelling studies). The statistically significant variables in the cost estimation models were the areas above and below ground and the years following the 2008 financial crisis, including the international bailout period (2011-2014). For estimating the unit cost, a nonlinear model was obtained with the number of underground and total floors, the floor ratio and the years following the 2008 financial crisis, including the international bailout period (2011-2014), as predictors. For the office buildings, a statistically significant correlation was also found between the cost deviation and the number of underground floors.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 13 October 2021 doi:10.20944/preprints202110.0201.v1
Table 1 summarizes the main research on the topic, along with the methods and explanatory variables used in each study. It should be noted that some models were developed to estimate the total cost (when the area is included in the model) whereas others were developed to estimate the unit cost (when the area is not included in the model). Some variables listed in Table 1 should be interpreted as a category of variables rather than a single variable, in some cases simply because they are measured differently depending on the author. For instance, the construction area may be gross, usable, or other; the number of stories may be total, above ground or underground; the height may be of the building or of the floor. Others are naturally a category of variables, such as the structural characteristics that may include the type of structure or foundation (e.g., Jin et al., 2012). A few are even impossible to quantify adequately at the early stages of the project development, namely the duration. In fact, it is far more common to use cost as an independent variable to estimate the construction duration (e.g., see Sousa et al. (2014a,b) or Sousa and Meireles (2018) for examples of time-cost relationships), because the cost estimate tends to be done by the designer before the contractor develops the construction schedule.
There are also authors attempting to use BIM for conceptual cost estimation (e.g., Muratova and Ptukhina, 2019).
However, this approach requires a quantities takeoff, which implies a degree of project development that is incompatible with the early stages of development considered in this research (definition of general characteristics of the project, such as area and number of floors, and a preliminary sketch). In fact, even some models reported in the literature review presented herein use variables that may be unavailable at this stage of the project development (e.g., proportion of walls and windows in the external envelope). There is a clear trade-off between model fit (i.e., estimation accuracy) and the availability of information in the early stages of the project. The review presented focused on cost estimation for building projects and is not intended to be exhaustive, but rather to illustrate that different tools, sample sizes and variables have been used. There is also an extensive literature on other types of projects (e.g., transportation infrastructure projects - Karaca et al., 2020; Swei et al., 2017; Flyvbjerg et al., 2016; Gunduz et al., 2011; Al-Tabtabai et al., 1999).
The topic of cost deviations is closely related to cost estimation, since a more accurate cost estimation should reduce cost deviations. There is an extensive literature on the magnitude (e.g., Shehu et al., 2014; Sweis et al., 2013; Love et al., 2013, 2015, 2019, 2020) and causes (e.g., Kaming et al., 1997; Abusafiya and Suliman, 2017; Derakhshanalavijeh and Teixeira, 2017; Annamalaisami and Kuppuswamy, 2019; Balali et al., 2020) of cost deviations. The former tends to be quantitative, based on the analysis of the performance of past projects, while the latter is mostly qualitative, resorting to questionnaires or interviews with experts. The research relating the magnitude with the causes of cost deviation is less extensive and the causes are limited to macro variables of the projects, such as: i) the size of the project (Shrestha and Fathi, 2019; Flyvbjerg et al., 2004); ii) the nature of ownership/promotor (public or private - e.g., Flyvbjerg et al., 2002; Sweis et al., 2013); iii) the type of intervention (new build or refurbishment/rehabilitation - Shehu et al., 2014); iv) the type of project (residential, infrastructure, commercial, and other - e.g., Pearl et al., 2003); v) the procurement model (design-bid-build, design and build, project management - e.g., Bucciol et al., 2013; Shrestha and Fathi, 2019); or vi) the tender method (open, selection, negotiated tendering - e.g., Reyers and Mansfield, 2001).
Most research on cost modeling in general (cost estimation and cost deviations) tends to focus on variables endogenous to the projects. Table 1 provides a clear illustration of this claim, with the variables used by the various authors being exclusively related to the project or its management. There is a smaller body of literature on the influence of exogenous variables on the financial performance of construction projects. For instance, Catalão et al. (2019a,b, 2020) demonstrated the relation between political and economic cycles and the cost deviation in public projects.
The quantitative research available in the literature, both in terms of cost estimation and quantitative analysis of cost variations, tends to reflect the construction projects' financial performance from the owners' perspective.
The records used by most of the authors were obtained from the owners (or from the contractors) and represent the payments made to the contractors and not the expenses of the contractors. However, even after deducting the profit margin, the amounts paid by the owners do not perfectly match the amounts spent by the contractors to execute the projects. Regarding the cost estimation, the owners' perspective is affected by the commercial strategy adopted by the contractors at each moment, frequently represented by the margin defined in their bids. In highly competitive contexts, the margins tend to decrease, whereas in less competitive contexts the margins tend to increase. Concerning cost deviations, the variability of materials prices, labor productivity or site overheads, amongst other potential causes of cost deviation (e.g., accidents, equipment breakdown or failure), is not measured when analyzing historical construction cost data from the owners' perspective. From the owners' perspective, change orders and errors/omissions (if the design is provided by the owner) are the most relevant causes of cost deviations.
The literature has recently hosted an active discussion on whether cost deviations are motivated by more technical aspects of the projects (e.g., cost escalation, scope changes, unforeseen events/conditions) (Love and Ahiaga-Dagbui, 2018; Love et al., 2019) or by estimator bias (Flyvbjerg et al., 2018, 2019). However, this discussion is outside the scope of the present research. It focuses on the cost deviations between the first estimate and the final cost, in the context of major infrastructure projects, and is therefore more applicable to public projects. This includes references to the benefits of the projects for society. Herein, the scope is restricted to private projects and cost deviations between the detailed design and final cost. Furthermore, the cost-benefit ratio is simply the cost of the project versus the income generated by its commercialization. So, fundamentally, the technical aspects will drive the cost deviations and the potential estimator bias will relate more to the expected market valuation of the project.

DATA AND METHODS
As mentioned above, the data used was obtained from a large industrial group in Portugal that includes a real estate company and a contractor in its portfolio of companies. All projects were developed in collaboration between these two companies of the group and, despite the formal split between them, they end up working as a single entity with complementary expertise.
The 23 building projects were developed mostly in Portugal, with only 2 being abroad (Angola and Mozambique).
The projects in Portugal are concentrated in the Lisbon and Porto metropolitan areas (the two major cities in Portugal) and can be classified as premium. The information on the projects includes: i) the proportion of the cost by major category of works (structure, architecture, technical installations and site overheads); ii) the estimated cost; iii) the profit margin; iv) the estimated price; v) the final price; vi) the total, above ground and underground gross-built area; and vii) the total, above ground and underground number of floors. There is also information on the start year and duration of the projects. Both the cost and prices of the projects were updated to 2019 values using the formulas for price adjustment applicable to public residential and office buildings in Portugal. In Portugal, the reimbursements to contractors in public construction projects are corrected to account for inflation. Since this is mandatory, there are formulas defined by law for estimating the increase (or decrease) in the payments to the contractor for 23 different types of projects (Law-Decree nº 6/2004). These formulas represent the average weight of labor, materials (a selection from 51 different materials) and equipment on the total price of the projects. The price indexes of the labor, materials and equipment are published monthly by the government based on the official inflation data. The estimated and final unit prices and the cost deviations were calculated from the available data. Not all fields could be retrieved for all the projects, particularly the final price, which was available for only 16 projects.
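The price updating described above can be illustrated with a short sketch. It assumes that the adjustment coefficient is a fixed (non-revisable) share plus a weighted sum of index ratios between the base date and the target date; the function name, the weights, the fixed share and the index values are all hypothetical and are not the ones defined in Law-Decree nº 6/2004.

```python
def update_coefficient(weights, base_idx, target_idx, fixed_share=0.0):
    """Weighted ratio of price indexes between two dates (assumed formula shape)."""
    total_weight = fixed_share + sum(weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights plus fixed share must sum to 1")
    coeff = fixed_share
    for item, weight in weights.items():
        # Each component is scaled by how much its index changed since the base date.
        coeff += weight * target_idx[item] / base_idx[item]
    return coeff


# Hypothetical weights and index values, for illustration only.
weights = {"labor": 0.40, "materials": 0.45, "equipment": 0.05}
base = {"labor": 100.0, "materials": 100.0, "equipment": 100.0}
target = {"labor": 110.0, "materials": 120.0, "equipment": 100.0}

coeff = update_coefficient(weights, base, target, fixed_share=0.10)
price_2019 = 1_000_000 * coeff  # historical price brought to target-date values
```

With these made-up inputs the coefficient is the weighted average of the three index ratios plus the fixed share, so a price is scaled up only by the portion exposed to index changes.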
In addition to the endogenous variables, the influence of the 2008 financial crisis and subsequent international bailout that Portugal had between 2011 and 2014 was also included. This exogenous variable was modeled with a categorical predictor assuming the value of 1 between 2008 and 2014 and 0 in the remaining years. A lag of 1 year was also considered at the start and end of the crisis to evaluate if there was a delay between these events and the impact on the cost of the projects.
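As a minimal sketch of this coding, the crisis predictor can be written as a function of a project year. The assumption here is that the year used is the project start year, which the text does not state explicitly; the function name and the list of years are illustrative.

```python
def crisis_dummy(year, crisis_start=2008, crisis_end=2014, lag=0):
    """Return 1 if the year falls inside the (optionally lagged) crisis window."""
    return int(crisis_start + lag <= year <= crisis_end + lag)


# Hypothetical project start years.
years = [2006, 2008, 2011, 2014, 2015, 2018]
no_lag = [crisis_dummy(y) for y in years]
one_year_lag = [crisis_dummy(y, lag=1) for y in years]  # window shifted to 2009-2015
```

Shifting both ends of the window by the same lag is one way to read "a lag of 1 year at the start and end of the crisis"; other readings (lagging only one end) would need a small change to the comparison.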
Due to confidentiality issues regarding some of the data (revealing the cost without the profit margin of the contractor for an external client), indexes were computed by dividing the value of each project by the average of all the projects in the sample. This was done particularly for the projects' profit margin, total and unit cost, and total and unit initial and final prices. Area and floor ratios were also computed by dividing the values above ground by the values underground, since the two are typically related due to parking requirements.
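The normalization and the ratios described above amount to two small operations, sketched here with made-up numbers. The function names are hypothetical, and returning None for a project without underground area is an assumption, since the text does not say how such cases were handled.

```python
def to_index(values):
    """Divide each value by the sample average, so the index averages 1."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]


def ratio(above, below):
    """Above-ground value divided by underground value (None if nothing underground)."""
    return above / below if below else None


# Hypothetical values, for illustration only.
costs = [2.0, 4.0, 6.0]          # e.g., total cost in million euros
cost_index = to_index(costs)     # confidential values become dimensionless
area_ratio = ratio(3000.0, 1000.0)
```

Because every project is divided by the same constant, correlations and regression fit are unaffected; only the scale of the coefficients changes.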
A statistical approach was used to analyze the data, comprising two steps: i) a preliminary data analysis; and ii) data modeling. The preliminary data analysis included calculation of descriptive statistics, assumptions testing and unidimensional statistical analysis. The normality and homogeneity of variance were tested using the Shapiro-Wilk and Levene's tests, respectively, and the unidimensional analysis was done using either parametric or non-parametric distribution comparison (t-test/ANOVA or Mann-Whitney/Kruskal-Wallis), for categorical variables, and correlation (Pearson or Spearman), for continuous variables. The data modeling was based on the traditional least squares multiple linear regression. Non-linear regression was also used, when necessary, but given the sample size the use of artificial intelligence tools (e.g., artificial neural networks, support vector machines, random forests) was not considered. Given the small sample size, bootstrapping (1000 simulations with simple sampling and 95% confidence interval based on percentile) was used to strengthen the confidence in the results.
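The bootstrapping step can be sketched for a single statistic, the Pearson correlation. This is an illustrative implementation using only the Python standard library, assuming simple resampling of projects with replacement and a percentile interval; the data, the seed and the handling of degenerate resamples are assumptions, and the software actually used by the authors is not stated.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # resample projects with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # skip degenerate resamples where the correlation is undefined
        stats.append(pearson(xs, ys))
    stats.sort()
    lower = stats[round((1 - level) / 2 * n_resamples)]
    upper = stats[round((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Hypothetical above-ground areas (m2) and normalized total costs.
area = [1200, 2500, 3100, 4000, 5200, 6100, 7500, 9000]
cost = [0.9, 1.6, 2.4, 2.6, 3.9, 4.1, 5.6, 6.0]
lo, hi = bootstrap_ci(area, cost)
```

Fixing the seed makes the interval reproducible, which is convenient when the sample is as small as the one described here.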
The restriction of the context (projects from a single company), scope (all buildings are classified as premium in terms of quality) and location (the spatial variability of the locations is small) limits the generalization of the results. However, it removes these sources of variability from the cost estimation and deviations of the projects and makes it possible to capture the cost estimation and deviation drivers that are specific to the projects. This is an important difference from most past research efforts, which in most cases use data samples with projects that may be very different, developed by distinct contractors, designed by different teams and, in some cases, promoted by various owners in many locations. This broader scope allows capturing an overall average cost performance of the projects, but makes it impossible to assess whether that performance was due to the contractor competence, design quality, owner experience, nature of the project, local factors or other aspects that are controlled for in the present analysis. Consequently, using large mixed samples of data may fail in terms of applicability to a specific project.

RESULTS AND DISCUSSION
The projects total a cost of over 155 million euros, with the residential buildings contributing 57% and the office buildings 43%. The initial price (cost plus typical margin used by the contractor for external clients) of each individual project ranged from 1.5 to 20 million euros. The average initial unit price is 560 €/m², for office buildings, and 785 €/m², for residential buildings. This difference is, however, strongly influenced by the two residential buildings outside Portugal (one in Angola and another in Mozambique) that had an average initial unit price of 1 408 €/m². Table 2 presents some descriptive statistics characterizing the dataset. Comparing the weight of the cost categories between residential and office buildings, a difference is visible in all cost categories except for the site overheads. These differences were found to be statistically significant (Table 3), and the site overheads would also be considered statistically significant for a significance level of 0.10 instead of the typical 0.05. The parametric t-test was used since the data was found to be normally distributed for both the residential and office building subsets according to the Shapiro-Wilk test.
The unit cost and initial price are also statistically different between residential and office buildings, if a 10% threshold is considered for the unit cost. The same is not verified for the final cost, but this can be attributed to the combination of the cost deviations and, mostly, to the smaller sample of projects with final price data available. The bootstrapping results (the full table of results is not presented herein) confirm the results obtained for the parameters (unit cost, initial or final price), with the unit cost difference closer to being statistically significant at a 5% significance level (p-value = 0.055).
It is interesting to notice that the total cost and prices (initial and final) of office buildings are slightly higher than for residential buildings, but the unit cost and prices are slightly lower. This implies that the office buildings in the sample are larger, on average, than the residential buildings, but that the lower expenses on architecture are only partially compensated by the more expensive structure and technical installations.
The economic crisis impacted more severely on labor cost (there was high unemployment and salary cuts) than on materials and equipment (a portion of which are imported and subject to less devaluation). This is consistent with the statistical significance of the site overheads on the office building projects, considering that a large portion of the cost in this category is due to the management team.
Since the majority of the data was found to be normally distributed based on the Shapiro-Wilk test (the non-normally distributed variables were the site overheads, margin and the underground and above ground floors), the Pearson correlation was used. The results (Table 5) reveal the expected correlation between the cost and prices with the areas and between the areas and the weight of the structure. Some less obvious results include the negative correlation between the unit cost and prices and the underground area, total area and area ratio.
However, this is logical since the underground areas tend to be for parking spaces, with lower demands for architecture (and technical installations works) that justify lower unit cost and prices compared to the areas above ground. The negative relation between the unit cost and price and the total area may indicate the existence of a scale effect. The bootstrap results confirm the correlations (the full table of results is not presented herein). For instance, the 95% confidence interval of the correlation between the total cost and the above ground area is estimated to be between 0.705 and 0.980.
For the variables that are not normally distributed, the non-parametric Spearman correlation was also used (not presented herein), leading to similar results. The exception was a positive statistically significant correlation between the number of floors above ground and the weight of the architecture costs.
The previous unidimensional statistical analysis provides some insight on the data, but fails to account for the potential interaction between the variables. In fact, a comparison of means assumes that all the projects in each category are identical regarding all other variables and the same applies to the correlation between two variables. Since all projects are distinct from each other, modeling the data with multiple linear regression allows identifying the independent variables that are statistically significant to explain the dependent variable, while controlling for the variability of the other independent variables. This approach has its own limitations, namely the fact that a linear and additive relation between the variables is assumed. The cost and prices, both total and unit, were selected as dependent variables, along with the cost deviation.
All other variables were considered as potential predictors. A hybrid approach was used to select the predictors to include in the models, combining expert judgment and the best subsets tool with the Akaike Information Criterion. The hybrid approach was chosen because an experimental stage using only statistical tools to select the predictors (stepwise and best subsets using the Akaike Information Criterion, Adjusted R2 and Overfit Prevention Criterion) produced models with very high fit that were not robust from an engineering point of view.
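The purely statistical half of this selection, best subsets ranked by the Akaike Information Criterion, can be sketched as follows. This is an illustration only: it uses ordinary least squares with an intercept, the Gaussian AIC up to an additive constant (n ln(RSS/n) + 2k), and synthetic data; all names are hypothetical and the expert-judgment step described above is not captured by the code.

```python
import itertools
import math


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    n = len(y)
    X = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))


def best_subset(candidates, y):
    """Return (AIC, predictor names) of the subset with the lowest AIC."""
    best = None
    for size in range(1, len(candidates) + 1):
        for subset in itertools.combinations(candidates, size):
            rss = ols_rss([candidates[s] for s in subset], y)
            score = len(y) * math.log(rss / len(y)) + 2 * (len(subset) + 1)
            if best is None or score < best[0]:
                best = (score, subset)
    return best


# Synthetic data: y depends on x1 only; x2 carries no extra information.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
y = [2.0 + 3.0 * a + e for a, e in zip(x1, noise)]
best = best_subset({"x1": x1, "x2": x2}, y)
```

In this synthetic case the criterion keeps only x1, because adding x2 does not reduce the residual sum of squares enough to offset the penalty of an extra parameter.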
Furthermore, the models for predicting total cost and price were developed without intercept to ensure that the value tends to zero when the project size decreases. There were no signs of heteroscedasticity (White and Breusch-Pagan tests), non-normal distribution of the residuals (Shapiro-Wilk test) or influential observations (Cook's distance) in any of the hybrid models. Still, robust standard errors were used in all models. There is also no evidence of specification problems (linktest), and the functional forms seem appropriate (Ramsey test).
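A regression without intercept can be sketched for the two-predictor case using the closed-form solution of the normal equations. The predictors (areas above and below ground), the coefficients and the data are hypothetical, chosen only to show that the prediction is zero when both areas are zero; this is not the authors' model.

```python
def fit_through_origin(x1, x2, y):
    """Least-squares coefficients of y = b1*x1 + b2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # zero only if the two predictors are proportional
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


# Hypothetical areas (m2) and prices generated from known coefficients.
above = [1000.0, 2000.0, 3000.0, 1500.0]
below = [500.0, 400.0, 1200.0, 300.0]
price = [0.8 * a + 0.5 * b for a, b in zip(above, below)]

b1, b2 = fit_through_origin(above, below, price)
price_at_zero_area = b1 * 0.0 + b2 * 0.0  # no intercept, so this is exactly zero
```

Note that R2 values from no-intercept models are usually computed against an uncentered total sum of squares, so they are not directly comparable with those of models that include an intercept.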
The regression models for the initial and final price are presented in Table 6. The R2 of the models is 0.92.
Given the high R2 obtained, the models with the predictors selected with statistical tools alone produced similar results in terms of fit to the data. For instance, using the best subsets with the adjusted R2 as criterion to select variables, it was possible to obtain a model for the initial price with an R2 of 0.95 using the following variables: i) area above ground; ii) area x type; iii) floors above ground; iv) total floors; and v) area ratio. However, this comes at a cost in terms of outliers (3 cases were identified as outliers using the Cook's distance) and represents a potential overfit (a model with 5 variables for a dataset with 18 cases). Due to the reduced size of the sample available (8 residential and 6 office buildings) for developing the final price model, the result should be interpreted with due care. Due to confidentiality, the model for the total cost cannot be disclosed. The variables in the model were the same as those of the initial price models, which is logical since the difference between both is the margin set by the contractor. However, the results of the model are depicted in Figure 2, corresponding to an R2 of 0.94.

Figure 2 - Observed versus estimated total cost and initial and final price
Both total and unit cost or prices are connected, but the high correlation between the total cost or price and the construction area may mask the influence of other variables. Considering the confidentiality issues and the limitations of sample size, only the initial unit price was modelled. The first model obtained attained an R2 of 0.505 using as predictors the variables: i) floors above ground; ii) total floors; iii) floor ratio; and iv) economic crisis.
However, since a clear non-linear pattern was visible when plotting observed versus predicted initial unit prices, a non-linear multiple regression model was developed. The non-linearity was accounted for by including power coefficients in the scale predictors. The best model resulted in a power of 1.011 for the floors above ground and 1.608 for the total floors, increasing the R2 to 0.720 (Table 7).
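One simple way to estimate a power coefficient of this kind is a profile search: for each candidate exponent, transform the predictor, fit a straight line, and keep the exponent with the smallest residual sum of squares. The sketch below does this for a single predictor with synthetic data; it is a simplified illustration under these assumptions and not the estimation procedure used by the authors, which is not described in detail.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, rss)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    b = sum((v - mx) * (c - my) for v, c in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((c - a - b * v) ** 2 for v, c in zip(x, y))
    return a, b, rss


def fit_power(x, y, powers):
    """Profile search over the exponent p in y = a + b * x**p."""
    best = None
    for p in powers:
        a, b, rss = fit_line([v ** p for v in x], y)
        if best is None or rss < best[3]:
            best = (a, b, p, rss)
    return best


# Synthetic data generated with a known exponent of 1.6.
floors = [2, 3, 4, 5, 6, 8, 10, 12]
unit_price = [1.0 + 0.05 * f ** 1.6 for f in floors]
powers = [round(0.5 + 0.1 * i, 1) for i in range(26)]  # 0.5, 0.6, ..., 3.0

a, b, p, rss = fit_power(floors, unit_price, powers)
```

With several scale predictors, the same idea extends to a grid over exponent combinations or to a general non-linear least squares routine.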
The influence of the economic crisis remains, and the proportion of underground and above ground floors became statistically significant with the removal of the area from the model. The difference between the linear and nonlinear models can be observed in Figure 3, evidencing the fit increase in the latter.
The apparently lower fit of the models for the unit price is misleading. In fact, multiplying the area by the initial unit prices estimated with the non-linear model to determine the total initial price achieves an R2 of 0.97 (Figure 4). This fit difference between the models for the total and unit prices results from the correlation between the total area and the number of floors. This correlation produces multicollinearity between the variables, resulting in the exclusion of the number of floors from any model in which the area is also used. Removing the influence of the area by modelling the unit price allows for the influence of the number of floors to be accounted for, which explains the accuracy increase.
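The comparison above can be reproduced in a few lines: predicted unit prices are multiplied by the area and the fit is then evaluated on the total price. The sketch assumes that R2 is computed as 1 minus the ratio of the residual to the total sum of squares, which the text does not state, and all numbers are made up.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical areas (m2), observed and predicted unit prices (EUR/m2).
areas = [2000.0, 5000.0, 9000.0, 15000.0]
unit_obs = [800.0, 700.0, 620.0, 560.0]
unit_pred = [760.0, 720.0, 650.0, 540.0]

total_obs = [a * u for a, u in zip(areas, unit_obs)]
total_pred = [a * u for a, u in zip(areas, unit_pred)]

r2_unit = r_squared(unit_obs, unit_pred)
r2_total = r_squared(total_obs, total_pred)
```

Because the total price varies mostly with the area, the same unit-price errors represent a much smaller share of the variance of the total price, which is why the fit measured on totals is higher.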
Bootstrapping was also used in the development of the regression models and confirms the statistical significance of the regression coefficients for a 95% confidence interval. Generally, the significance of the regression coefficients decreased, but the p-value remained lower than 0.05 in all cases except for the final price model.
With the purpose of testing and validating the models developed in this research, the model for the initial price was applied to a project currently under development by the organization. Considering that the project used for validation was estimated at over 45 million euros, significantly higher than the projects in the dataset, and that the difference to the price estimated by the organization was less than 5%, the feedback from the organization regarding the accuracy and the extrapolation capability of the model was positive.
In the sample of 13 projects (6 office and 7 residential) for which initial and final prices were available, an average cost deviation of 3.5% was obtained. Only 3 projects had a final price lower than the initial estimate (average of -6.5%). The projects with positive cost deviations were, on average, 6.5% costlier and there was no project without cost deviation. Compared with the available literature, which generally adopts the owner perspective, the magnitude of the cost deviation is clearly smaller than usually reported and it becomes evident that the contractor always experiences some cost deviation, even if that is not reflected on the bill of the owner.
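The summary figures above can be obtained with a short calculation. The sketch assumes that the cost deviation is the difference between the final and the initial price, expressed as a percentage of the initial price; the function names and the project values are hypothetical.

```python
def cost_deviation(initial, final):
    """Cost deviation in percent of the initial price (assumed definition)."""
    return 100.0 * (final - initial) / initial


def mean(values):
    """Arithmetic mean, or None for an empty list."""
    return sum(values) / len(values) if values else None


def summarize(pairs):
    """Average deviation overall, for overruns and for underruns."""
    devs = [cost_deviation(i, f) for i, f in pairs]
    return {
        "all": mean(devs),
        "overruns": mean([d for d in devs if d > 0]),
        "underruns": mean([d for d in devs if d < 0]),
    }


# Hypothetical (initial, final) prices in normalized units.
projects = [(100.0, 110.0), (200.0, 190.0), (100.0, 105.0)]
summary = summarize(projects)
```

Reporting overruns and underruns separately, as done in the text, avoids the two cancelling each other in the overall average.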
Either due to the limitations of the dataset, the fact that the projects are limited in type, spatial context and stakeholders involved, or a combination of these and other factors, the cost deviation depends on specific aspects of each project that are not captured by the general information used herein and it was not possible to model them. The only statistically significant result obtained was the high Pearson correlation (0.814) between the number of underground floors and the cost deviation of office buildings. The corresponding regression model indicates that the average cost deviation in office buildings increases by 0.65% per underground floor, but this was obtained from a sample of only 6 projects and its validity is questionable.

CONCLUSIONS
This research revisits the topic of cost estimation and deviation of construction projects, but adopts the innovative perspective of a contractor, which seems uncommon in the literature reviewed. Furthermore, to the best of our knowledge, this is one of the few efforts linking endogenous and exogenous variables in cost estimation functions.
Contrary to most research available, only similar projects (premium residential and office buildings) from a single promotor-contractor are used. This compromises the size of the database available, but eliminates the variability of cost estimates and deviations due to: i) factors related to the contractor or the designer (e.g., experience; competencies; dimension; management models); ii) characteristics of the projects (e.g., premium buildings, social buildings, public buildings); iii) relation between owner and contractor (e.g., type of owner - public, private; type of contract - design-bid-build, design-build; payment method - lump sum, unit prices); and iv) aspects associated with the location (e.g., weather conditions; laws and regulations). Since the projects are promoted by the real estate company of the same group, the commercial strategy issues related to the degree of competition of the market have less effect on the cost of the projects. The contractor does not have to adjust its margin to win the contract and so the influence of the level of competition in the market is limited to the portion of the project that is executed by subcontractors. By doing so, the results presented herein grasp the "real" cost estimation and deviations driven by project related factors. The high accuracy of the cost estimation m The results obtained with these restrictions support the importance of the technical expertise of the involved parties in the cost estimation and deviations reported in the literature. Comparing the average and range of the cost deviations in this study with other authors, it is reasonable to assume that at least a portion of the difference is due to the experience of the teams involved and not only due to project (e.g., construction technology) or context (e.g., weather conditions) specificities.
Another factor possibly underlying the differences in terms of the magnitude of the cost deviations is the collaborative effort of promoter and contractor in this case, reducing the conflicts that are not rare in the traditional design-bid-build contracts where the promoter has limited expertise/resources regarding the execution stage of the construction project.
Despite the reduced sample size when compared to other studies, it is noticeable that the cost deviations in this context are smaller than what is typically reported when adopting the owner perspective, either public or private. The generalization of the results may be limited, but they do provide a source for other contractors to benchmark their performance, and the methodology proposed sets a basis for developing similar studies in both research and practical contexts. In fact, the linear and non-linear regression models developed are easy for an expert to interpret and assess, which was done with good results, whereas artificial intelligence models are black boxes that are impossible or very difficult for experts to validate. The practical expert validation carried out, along with the bootstrapping results, reinforces the applicability of the models to the specific context in which they were developed and corroborates the applicability of the methodology in other contexts.
The models developed for estimating cost have a very high fit to the data and highlight the influence of the economic crisis and international bailout on the construction costs. In Portugal, the price of construction projects in open competition also suffered a strong reduction during this period due to the lack of both private and public construction projects. However, since the price is driven not only by the cost but also by the market conditions (e.g., relation between demand and supply), the variation is not necessarily identical, and this research is able to capture the pattern of the cost.
The cost deviations seem to depend more on particular aspects of each project than on overall characteristics, despite the positive statistically significant relation found between the number of underground floors and the cost deviations in office buildings.

Data Availability Statement
Some or all data, models, or code generated or used during the study are proprietary or confidential in nature and may only be provided with restrictions. The estimated cost and initial and final values of each individual project can only be disclosed in normalized format (actual value divided by the sample average). The model for the estimated cost cannot be disclosed.