Multiple Linear Regression, its Statistical Analysis and Application in Energy Efficiency

In this project, we use multiple linear regression to study the impact of eight predictors (relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, glazing area distribution) on the cooling load of residential buildings. We analyze and visualize the effect of each predictor on the response variable using classical statistical tools for linear models, so that the most strongly related predictor variables can be identified. We first apply a stepwise regression technique for model selection, comparing candidate models by AIC to identify the best among them. Simulations on 768 diverse residential buildings show that the classical linear regression approach predicts the cooling load (CL) with low mean absolute error. We use ANOVA to examine the variation in the residuals and a non-constant variance test to check homoscedasticity. Furthermore, we examine leverage points, influence points and outliers, and compute Cook's distance to identify influential observations. Using a Box-Cox transformation and estimated weights, we also introduce a weighted least squares (WLS) fit to improve the model, and we carry out the standard diagnostic analyses needed to understand the energy efficiency problem. Finally, we use 5-fold cross-validation to validate our model.

Building designers need information about the characteristics of the buildings. The ultimate goal of this paper is to achieve minimum energy consumption and improve system efficiency. Meng et al. (2014) studied multi-zone variable air volume and variable water volume air-conditioning systems; the dynamic models of the HVAC sub-systems were built with the adaptive directional forgetting method. Growth in population, increasing demand for building services and comfort levels, together with the rise in time spent inside buildings, ensure that the upward trend in energy demand will continue in the future. Lombard et al. (2008) analyzed information concerning energy consumption in buildings, particularly in relation to HVAC systems, and presented comparisons between different countries, especially for commercial buildings. In this study we work only with the cooling load of buildings.

Source of Data:
Applicable data is the primary requirement of any regression model, because the model is built from that data: if the data is flawed, the model will be flawed. Thus, the first step in regression modeling is to ensure that the data is reliable. Unfortunately, there is no universal approach to verifying data quality, so our job is to verify the source's reliability and correctness as far as possible. Here, the data we use comes from a reliable source.
The dataset authors performed an energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, among other parameters. Various settings were simulated as functions of the aforementioned characteristics to obtain 768 building configurations. The dataset comprises 768 samples and 8 features, and the aim is to predict two real-valued responses; it can also be treated as a multi-class classification problem if the responses are rounded to the nearest integer. To facilitate the presentation of the subsequent regression analysis and results, the predictors and the response are denoted as follows:
X1 Relative compactness; X2 Surface area; X3 Wall area; X4 Roof area; X5 Overall height; X6 Orientation; X7 Glazing area; X8 Glazing area distribution; Y Cooling load (CL).
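As a minimal sketch, assuming the data has been exported to a CSV file named energy.csv with illustrative column names matching the list above, it can be loaded in R as follows:

# Load the energy-efficiency data (file name and column names are assumptions)
energy <- read.csv("energy.csv")
# Expected columns: RelCompactness, SurfaceArea, WallArea, RoofArea,
# OverallHeight, Orientation, GlazingArea, GlazingDist, CoolingLoad
str(energy)                    # should show 768 observations
summary(energy$CoolingLoad)    # the response analysed in this study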
Classical linear regression and Bayesian regression with uninformative priors give similar results when n >> p. However, the results can differ for challenging problems, and the interpretation differs in all cases.
The Pearson correlation coefficient measures the strength of a linear association between two variables: r = 1 indicates a perfect positive correlation and r = -1 a perfect negative correlation. For example, this test could be used to find out whether people's height and weight are correlated (they are: taller people tend to be heavier). There are a few requirements for Pearson's correlation coefficient: the scale of measurement should be interval or ratio, the variables should be approximately normally distributed, the association should be linear, and there should be no outliers in the data. We can use a heat map to visualize the correlations among the predictors and with the response.
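A minimal sketch of such a heat map in R, assuming the energy data frame from above (corrplot is one of several packages that can draw it):

library(corrplot)
# Pearson correlation matrix of the eight predictors and the cooling load
r <- cor(energy, method = "pearson")
corrplot(r, method = "color", addCoef.col = "black")   # heat map with coefficients overlaid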
First, we need to carry out model selection. Model selection can be based on p-values or on the adjusted R-squared via stepwise regression; these methods allow us to assess the relationship between the predictors in a data set and a continuous response variable. Sometimes variables are chosen based on expert opinion. We can also compare models by AIC: the model with the lowest AIC is preferred. There are, however, many ways to select a model for multiple linear regression (MLR).
One of the most commonly used model selection methods is backward elimination. Here we start from the full model and drop predictor variables until we arrive at a parsimonious model: we record the adjusted R-squared of the full model and of each reduced model, and pick the model with the highest adjusted R-squared and the minimum AIC. After selecting the predictors to include in the model, we estimate the coefficients by least squares. The least squares estimate of $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ is
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - \mu_i)^2, \quad \text{where } \mu_i = \beta_0 + X_{i1}\beta_1 + \cdots + X_{ip}\beta_p .$$
$\hat{\beta}$ is unbiased even if the errors are non-Gaussian. If the errors are Gaussian, the likelihood is proportional to $\exp\!\big\{-\tfrac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu_i)^2\big\}$, so $\hat{\beta}$ is also the MLE. Linear regression is often simpler to describe in matrix form.
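A minimal sketch of backward elimination by AIC in R (MASS::stepAIC applied to the full model; the data frame and column names are the illustrative ones assumed above):

library(MASS)
# Full model with all eight predictors
full <- lm(CoolingLoad ~ RelCompactness + SurfaceArea + WallArea + RoofArea +
             OverallHeight + Orientation + GlazingArea + GlazingDist,
           data = energy)
# If lm() reports aliased (NA) coefficients, drop the offending predictor first
reduced <- stepAIC(full, direction = "backward")   # drop terms while AIC keeps decreasing
AIC(full, reduced)                                 # lower AIC is preferred
summary(reduced)$adj.r.squared                     # adjusted R-squared of the chosen model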
Let $Y = (Y_1, \ldots, Y_n)^T$ be the response vector and $X$ the $n \times (p+1)$ matrix of covariates, so that the least squares estimate is $\hat{\beta} = (X^T X)^{-1} X^T Y$. If the errors are Gaussian, the sampling distribution is $\hat{\beta} \sim N\big(\beta, \sigma^2 (X^T X)^{-1}\big)$. If the variance $\sigma^2$ is estimated using the mean squared residual error, the sampling distribution is multivariate $t$. As with any least squares analysis, it is crucial to verify that the model is appropriate using QQ-plots and added-variable plots. In regression analysis we make certain assumptions about the conditional distribution of the dependent variable that we try to predict. The following three assumptions are quite similar to the assumptions made in ANOVA.
Normality: All conditional distributions are normally distributed (e.g., the distribution of sales volumes across all months in which advertising is held at some fixed level is normal).
Homoscedasticity: All conditional (normal) distributions have the same variance $\sigma^2$. To check this we can examine the plot of fitted values versus standardized residuals, and we can also carry out equal-variance hypothesis tests.
Linearity: The means of the conditional distributions are linearly related to the values of the independent variables. In statistics, the variance inflation factor (VIF) is the quotient of the variance of a coefficient in a model with multiple terms by its variance in a model with that term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. The VIF for $\hat{\beta}_j$ is
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
where $R_j^2$ is the coefficient of determination of the regression of $X_j$ on the remaining predictors. The magnitude of multicollinearity is assessed from the size of the VIF; a common rule of thumb is that $\mathrm{VIF}_j > 10$ indicates serious multicollinearity.
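A minimal sketch of the VIF check in R (car::vif on the reduced model from the selection step above; any aliased predictors must be removed before vif() will run):

library(car)
vif(reduced)    # values greater than about 10 indicate serious multicollinearity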
Again, the vector of predicted responses can alternatively be written in matrix form as $\hat{Y} = HY$, where the hat matrix $H = X(X^T X)^{-1} X^T$ involves the leverages $h_{ii}$, $i = 1, \ldots, n$, which depend only on the predictors.
The leverage $h_{ii}$ quantifies the influence that the observed response $Y_i$ has on its predicted value $\hat{Y}_i$: if $h_{ii}$ is small, the observed response plays only a small role in the value of the predicted response, whereas if $h_{ii}$ is large, it plays a large role. It is for this reason that the $h_{ii}$ are called "leverages"; they lie between zero and one.
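A minimal sketch of the leverage check in R, continuing with the reduced model assumed above; the cut-off 2(p+1)/n used below is one common screening rule, not the only option:

h <- hatvalues(reduced)                    # leverages h_ii from the hat matrix
p <- length(coef(reduced)) - 1             # number of predictors in the model
which(h > 2 * (p + 1) / nrow(energy))      # observations with unusually high leverage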
Cook's distance of observation $i = 1, \ldots, n$ can be expressed in terms of the leverage as
$$D_i = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}},$$
where $r_i$ is the standardized residual. Before investigating the extreme cases discussed above, we consider a transformation of the response, which we use when non-normal residuals show up in the QQ plot. The Box-Cox method considers a family of transformations of a strictly positive response variable,
$$y(\lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log y, & \lambda = 0, \end{cases}$$
and the parameter $\lambda$ is chosen by numerically maximizing the log-likelihood $L(\lambda)$.
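A minimal sketch of both diagnostics in R (cooks.distance for influence and MASS::boxcox for the transformation, again on the reduced model assumed above):

# Influential observations via Cook's distance
d <- cooks.distance(reduced)
which(d > 4 / nrow(energy))      # common screening cut-off; D_i near 1 is a stronger signal

# Box-Cox: profile log-likelihood over a grid of lambda values
library(MASS)
bc <- boxcox(reduced, lambda = seq(-2, 2, by = 0.1))
bc$x[which.max(bc$y)]            # lambda that maximises L(lambda)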
The method of ordinary least squares assumes constant variance in the errors (homoscedasticity). The method of weighted least squares (WLS) can be used when this assumption is violated (heteroscedasticity). The error is assumed to be (multivariate) normally distributed with mean vector 0 and a non-constant variance-covariance matrix. If we define the weight as the reciprocal of each variance, $w_i = 1/\sigma_i^2$, and let $W$ be the diagonal matrix containing these weights, then the weighted least squares estimate is
$$\hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W Y .$$
Moreover, a $100(1-\alpha)\%$ confidence interval for a regression coefficient $\beta_j$ is given by $\hat{\beta}_j \pm t_{1-\alpha/2,\; n-p-1}\, \widehat{\mathrm{se}}(\hat{\beta}_j)$.
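A minimal sketch of a WLS fit in R. The weights below are one common choice, obtained by regressing the absolute OLS residuals on the fitted values; this scheme is an assumption for illustration, not necessarily the exact weighting used in the study:

library(car)
ncvTest(reduced)                                   # score test for non-constant error variance

# Estimate weights w_i = 1 / sigma_i^2 from the OLS fit
sd_fit <- lm(abs(residuals(reduced)) ~ fitted(reduced))
w <- 1 / fitted(sd_fit)^2

wls <- lm(formula(reduced), data = energy, weights = w)
summary(wls)
confint(wls, level = 0.95)                         # 100(1 - alpha)% confidence intervals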

Flow chart 01: Cross Validation methodology
Cross validation of Model: K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set (k-1) times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.
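A minimal sketch of 5-fold cross-validation in R (boot::cv.glm on the reduced model refitted as a Gaussian glm; the mean absolute error cost mirrors the error measure reported in this study):

library(boot)
set.seed(1)
g   <- glm(formula(reduced), data = energy)        # Gaussian glm, same fit as lm
mae <- function(y, yhat) mean(abs(y - yhat))       # mean absolute error as the CV cost
cv5 <- cv.glm(energy, g, cost = mae, K = 5)
cv5$delta[1]                                       # 5-fold cross-validated MAE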

Results and Discussion:
Exploratory analysis of data: It is always good to plot the data first, to get a basic sense of its shape and to ensure that nothing looks out of place. For instance, we may expect to see a roughly linear relationship between two variables; if we see something else, such as a horizontal line, we should investigate further. Our assumption about a linear relationship could be wrong, the data may be corrupted (see figure 01, below), or perhaps something completely unexpected is going on. Regardless, one must understand what might be happening before developing the model. After this data exploration, we investigate the response (cooling load) of the data set; most importantly, we need to verify that the response obeys the assumptions of our model in equation (1). From table 03 below, we can see the rate of change of cooling load with respect to five different predictors. We deliberately include orientation to show that it has no effect on the prediction of cooling load, since its coefficient is not significant at the 5% level; the model selection step pointed to the same conclusion, as the model without orientation gave the lowest AIC. According to table 03, orientation is therefore not an important factor for the cooling load of a building in our data set, although we cannot rely on p-values alone, and orientation may still matter for the energy efficiency of buildings in general. It is therefore better to select model 2 for our analysis, and we examine its residual plots next.
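A minimal sketch of that comparison in R (model 1 with orientation, model 2 without; the model names and columns are illustrative):

model1 <- lm(CoolingLoad ~ RelCompactness + SurfaceArea + WallArea +
               OverallHeight + GlazingArea + Orientation, data = energy)
model2 <- update(model1, . ~ . - Orientation)   # model 2: drop orientation

AIC(model1, model2)      # the study reports the lower AIC for model 2
anova(model2, model1)    # partial F-test for the orientation term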
The summary statistics of model 2 are given below. From the regression model we observe that the rates of change with respect to relative compactness and surface area have a negative effect on the cooling load of buildings. On the other hand, the rates of change with respect to wall area, overall height and glazing area have a positive effect, although the coefficient for wall area is numerically small. This means that, according to our observations, the wall area has comparatively little effect on the cooling load.
There is a problem with $R^2$ for multiple regression. It is still the percentage of the total variation that can be explained by the regression equation, but the largest value of $R^2$ will always occur when all of the predictor variables are included, even if those predictors do not significantly contribute to the model. $R^2$ can only go down (or stay the same) as variables are removed; it never increases.
The adjusted $R^2$ uses variances instead of variations; it therefore takes into account the sample size and the number of predictor variables. The adjusted $R^2$ can actually increase with fewer variables or smaller sample sizes. We should always look at the adjusted $R^2$, not $R^2$, when comparing models with different sample sizes or numbers of predictor variables. If two models are tied on adjusted $R^2$, take the one with fewer variables, since it is the simpler model. Here is a summary of the table of coefficients. We make our decision at the $\alpha = 0.05$ level of significance: if the p-value < 0.05, we reject the null hypothesis, and we retain it otherwise. For this multiple linear regression we test
$$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0 \text{ for at least one } j.$$
The null hypothesis claims that there is no significant linear relationship at all, that is, all of the coefficients are zero and none of the variables belong in the model.
The alternative hypothesis is not that every variable belongs in the model but that at least one of the variables belongs in the model. If you remember back to probability, the complement of "none" is "at least one", and that is what we see here. In this case, because our p-value is very small (close to zero), we reject the claim that there is no relationship at all and conclude that we have a good model for prediction. When conducting a residual analysis, a "residuals versus fits" plot is the most frequently created plot. It is a scatter plot of residuals on the y-axis against fitted values (estimated responses) on the x-axis, and it is used to detect non-linearity, unequal error variances, and outliers. In this plot, we see that the spread of the residuals tends to increase as we move to the right, while the residuals are otherwise scattered above and below zero. An outlier is a data point whose response y does not follow the general trend of the rest of the data; such points could be removed from the data set. The QQ plot also tells us that our residuals are not approximately normal: the distribution has a fat tail. A Variance Gamma or Normal Inverse Gaussian distribution would describe such tails better, but in this study we do not need to consider such extremes.
To examine this we plotted the histogram of the response variable and checked normality, and we also drew the QQ plot of the response against the reference straight line. We found that the response, cooling load, does not behave like a perfectly normal variable. The quantile-quantile (Q-Q) plot shows the distribution of the data against the expected normal distribution; for normally distributed data, observations should lie approximately on a straight line. Our data does not look like this, so it is not quite normal. To make it closer to normal we can apply the Box-Cox transformation of equation (2) to the data. We obtain λ = −0.8 from the Box-Cox plot produced in R.
We see that λ = −0.8 is extremely close to the maximum of the profile log-likelihood, which suggests a transformation of the form $(y^{-0.8} - 1)/(-0.8)$; the corresponding figure shows the λ that maximizes the log-likelihood. After the transformation we get a better fit to the straight line, and the residuals are approximately normal; observations 45 and 48 remain outliers. We should also remember that the model failed the constant-variance test, i.e., it is heteroscedastic, so we introduce WLS. Collinearity is also present: the VIF tests identified values greater than 10, which indicate multicollinearity. Since each weight is inversely proportional to the error variance, it reflects the information content of that observation: an observation with a small error variance has a large weight because it contains relatively more information than an observation with a large error variance (small weight). The set of weights will (legitimately) affect the widths of the statistical intervals. In the weighted fit, changing these predictors by one unit produces no significant anomalies, which suggests that the variances of the error terms are equal; the few remaining non-constant-variance terms can be ignored. Finally, only a few residuals stand out from the basic random pattern, suggesting there are few outliers; again we see observations 45 and 48.
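A minimal sketch of applying this transformation and refitting in R (λ = −0.8 as read off the Box-Cox plot; the plain transformed refit is shown here, before the WLS weights discussed above are added):

lambda <- -0.8
energy$CL_bc <- (energy$CoolingLoad^lambda - 1) / lambda   # Box-Cox transform for lambda != 0

fit_bc <- lm(CL_bc ~ RelCompactness + SurfaceArea + WallArea +
               OverallHeight + GlazingArea, data = energy)
plot(fit_bc, which = 2)      # QQ plot of residuals after the transformation
library(car)
ncvTest(fit_bc)              # re-check the constant-variance assumption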

Figure 10: WLS model diagnostic plots
The scale-location plot (square-rooted standardized residuals versus predicted values) is useful for checking the assumption of homoscedasticity. In this plot we check whether there is a pattern in the residuals, and we found a concave pattern. Moreover, with the WLS fit we still found collinearity, so we need to drop a predictor and check again. The assumption of a random sample and independent observations cannot be tested with diagnostic plots; it can only be assessed by examining the study design. The second plot in figure 09 shows Cook's distance, a measure of the influence of each observation on the regression coefficients. The Cook's distance statistic measures, for each observation in turn, the extent of change in the model estimates when that particular observation is omitted. Any observation whose Cook's distance is close to 1 or more, or substantially larger than the other Cook's distances (a highly influential data point), requires investigation. Here we found four data points that may need attention. As we know, outliers may or may not be influential points; influential outliers are of the greatest concern and should never be disregarded. Careful scrutiny of the original data may reveal a data-entry error that can be corrected. In our model we did not exclude these points. From the QQ plot in figure 07 we can see that observations 45 and 48 become influential.
Hypothesis test for the final WLS model: $H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ versus $H_1: \beta_j \neq 0$ for at least one $j$. In this case, because the p-values are very small (close to zero) for glazing area, wall area and overall height, we reject the claim that there is no relationship at all and conclude that we have a good model for prediction.
Also, in table 05, we report the VIF values used to check collinearity. There is statistical theory showing that the appropriate choice of k depends on n and on the type of predictor. Below, we show our cross-validation results. There is a small problem with this method for assessing prediction error: the final predictor is based on all n observations (100% of the sample), but the estimated prediction error is based on predictors developed on a smaller sample of size n − n/k < n. So the cross-validation estimate of prediction error may be pessimistic; the true prediction error may be slightly better than estimated. With 10-fold cross-validation the estimate cannot be too far off, because at least 90% of the samples are used in each training fold. For a small data set such as ours, 5-fold CV is a good choice, and it validates our model.

Conclusion
▪ The rates of change with respect to surface area, overall height, wall area and glazing area have a positive effect on cooling load.
▪ However, the rates of change with respect to wall area and surface area are numerically small.
▪ MLR is not an ideal model for predicting cooling load, because important predictors are lost.
▪ Cross-validation nevertheless verified a good fit.
▪ An elastic net could be a better model, because it combines two different penalties whose regularization parameters can be chosen by cross-validation.