Cloud Service Providers Optimized Ranking Algorithm based on Machine Learning and Multi-criteria Decision Analysis

Multi-criteria decision analysis (MCDA), one of the prevalent branches of operations research, aims to design mathematical and computational tools for selecting the best alternative among several choices with respect to specific criteria. In the cloud, MCDA-based online brokers use customer-specified criteria to rank different service providers. However, owing to limited domain knowledge, the customer may exclude relevant criteria or include irrelevant ones, which can result in a suboptimal ranking of service providers. To deal with such misspecification, this research proposes a model that uses the notion of factor analysis from the domain of unsupervised machine learning. The model is evaluated using two quality-of-service (QoS) based datasets. The first dataset, i.e. feedback from customers, was compiled from leading review websites such as Cloud Hosting Reviews, Best Cloud Computing Providers, and Cloud Storage Reviews and Ratings. The second dataset, i.e. feedback from servers, was generated from a cloud brokerage architecture that was emulated using the high performance computing (HPC) cluster at the University of Luxembourg (HPC @ Uni.lu). The simulation runs in a stable cloud environment, i.e. when uncertainty is low, show that an online broker equipped with the proposed model produces an optimized ranking of service providers as compared to other brokers. This is due to the fact that the proposed model assigns priorities to criteria objectively (using machine learning) rather than using priorities based on subjective judgments of the customer. This research will benefit potential cloud customers who view insufficient domain knowledge as a limiting factor for the acquisition of web services in the cloud.

Keywords: multi-criteria decision analysis (MCDA), online broker, misspecification of criteria, structural uncertainty, unsupervised machine learning, factor analysis, quality of service (QoS).


I. INTRODUCTION
Multi-criteria decision analysis (MCDA), one of the prevalent branches of operations research, aims to design mathematical and computational tools for selecting the best alternative among several choices with respect to specific criteria [1,2]. It prescribes a methodology that deals with the most important components in the process of decision making and aims at supplying reliable information for taking an unbiased decision [3]. These components include a pre-established goal achievable under given constraints. The constraints are the criteria used to rank potential alternatives. An unbiased ranking of alternatives depends on the selection of relevant criteria by the decision maker, which in turn depends strongly on his or her knowledge of the subject matter [1,4]. Hence, the approach is considered ineffective when the decision maker has insufficient domain knowledge [5][6][7].
For example, assume that a startup called Moogle is using a cloud-based brokerage architecture to buy an online storage service for data backups. The goal of the online broker is to select the service provider with the best QoS from the list: Carbonite, Dropbox, iBackup, JustCloud, SOS Online Backup, SugarSync, and Zip Cloud. The online broker generates a ranking of these service providers using the following QoS-based criteria: availability, response time, price, speed, ease of use, technical support, and customer services. However, Moogle, owing to its insufficient domain knowledge of cloud-based storage environments, adds an additional criterion of storage space to the list. As a result, the ranking generated by the online broker is distorted and, consequently, Moogle bypasses an optimal choice of online storage service in the cloud.
Since the most common MCDA methods used by online brokers cannot operate without customer input, a model is needed that deals with the misspecification of criteria owing to insufficient knowledge of the customer [8][9][10]. For this purpose, the integration of unsupervised machine learning and MCDA has been explored in this research. The remaining parts of this paper are organized as follows. Section II presents the related literature and the research gap. Sections III and IV present the proposed model and its evaluation in an online cloud environment, respectively. Finally, Section V concludes the paper and presents directions for future research.
In [22] the authors propose a hybrid decision-making model based on the affinity diagram, the fuzzy Analytic Hierarchy Process (F-AHP), and the fuzzy Technique for Order Preference by Similarity to an Ideal Solution (F-TOPSIS) to evaluate cloud solutions for hosting Big Data projects. In the first stage of this model, the evaluation criteria are identified by a decision-making committee using the affinity diagram. Because the selected criteria vary in importance, F-AHP is used in the second stage to assign a weight to each criterion. In the third stage, F-TOPSIS employs these weighted criteria as inputs to evaluate and measure the performance of each alternative (cloud solution). In the last step, a sensitivity analysis is performed to evaluate the impact of the criteria weights on the final ranking of alternatives.
In [23] the authors discuss the evaluation of the Trade-offs based Methodology for Adoption of Cloud-based Services (TrAdeCIS) using TOPSIS and the Analytic Network Process (ANP). They argue that the decision to use such services is based upon criteria that can be mutually interdependent and conflicting and hence, a trade-offs-based methodology is needed to make such decisions. TrAdeCIS is the first methodology that supports automated and quantified trade-offs-based decision making for the selection of the best cloud-based service. In [29] the authors propose a model that uses Fuzzy TOPSIS for web service selection. Based on the fact that web service selection is highly influenced by customer preferences, a simulated environment represented by 8x8 LED matrices on a circuit board was used to demonstrate the selection.
In [24] the authors compare the behavior and quality of TOPSIS and VIKOR based multi-objective decision methods with Pareto optimal solutions. In [25] the authors propose a Service Measurement Index Cloud framework (SMICloud). It provides a holistic view of the criteria used to benchmark service providers. The criteria are divided into seven categories: accountability, agility, assurance, financial, performance, security and privacy, and usability. Each of these categories is further subdivided into three or more mid-level criteria. For example, the mid-level criteria assigned to agility include, among others, capacity and elasticity. Within each mid-level criterion, a set of low-level criteria is then defined for data collection. For example, the low-level criteria assigned to capacity include, among others, CPU and memory. For each criterion in these levels, relative weights are assigned using AHP to generate a relative ranking.
In [26] the authors propose a consumer-centered cloud service selection model. They note that QoS criteria in the cloud are usually related solely to the service provider. However, as cloud services are spread all over the internet, some of these criteria (e.g. availability and reliability) are largely influenced by the network, which eventually impacts customers. For this reason, the selection of a cloud service must take the customer's interest into account. In this regard, AHP is used to rank service providers based on customer preferences.
In [27] the authors propose a fuzzy AHP model for cloud service selection. They argue that it is often difficult for a customer to quantify his or her opinion exactly as a number, whereas an interval describes an opinion better. In this regard, the proposed model combines interval-valued fuzzy sets (IVFs) with AHP to generate a ranking.
In [28] the authors propose a fuzzy TOPSIS model for cloud service selection. They argue that QoS-based cloud service selection can be treated as a multi-criteria group decision-making problem when the selection is performed by a group of experts with different experiences and skills. In this regard, the proposed model uses triangular fuzzy numbers to represent the opinions of experts. Afterwards, these fuzzy numbers are transformed into crisp numbers by using the graded mean integration representation method. The canonical representation of addition and multiplication operations on triangular fuzzy numbers is then used to obtain the positive ideal solution (PIS) and the negative ideal solution (NIS). Because crisp numbers rather than triangular fuzzy numbers are used for the canonical representation, complicated calculations involving triangular fuzzy numbers are avoided. Afterwards, the Minkowski distance function is applied to measure the distance of each alternative (cloud service) from the PIS and the NIS. The alternative with the shortest distance from the PIS and the farthest distance from the NIS is selected as the best alternative.
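As a rough illustration of the pipeline summarized above for [28], the following Python sketch defuzzifies triangular fuzzy ratings (a, b, c) with the graded mean integration formula (a + 4b + c)/6 and then ranks alternatives by a crisp TOPSIS closeness score computed with a Minkowski distance. It is not the implementation from [28]: the vector normalization, the function names, the weights, and the ratings are assumptions made only for this example.

```python
import numpy as np

def graded_mean(tfn):
    """Graded mean integration of a triangular fuzzy number (a, b, c)."""
    a, b, c = tfn
    return (a + 4 * b + c) / 6.0

def topsis_closeness(scores, weights, benefit, p=2):
    """Closeness of each alternative (row) to the ideal solution, in [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)
    # Vector-normalize each criterion column, then apply normalized weights.
    v = scores / np.linalg.norm(scores, axis=0) * (weights / weights.sum())
    # Positive/negative ideal solutions: best/worst value per criterion.
    pis = np.where(benefit, v.max(axis=0), v.min(axis=0))
    nis = np.where(benefit, v.min(axis=0), v.max(axis=0))
    # Minkowski distance of order p to each ideal solution.
    d_pos = np.sum(np.abs(v - pis) ** p, axis=1) ** (1.0 / p)
    d_neg = np.sum(np.abs(v - nis) ** p, axis=1) ** (1.0 / p)
    return d_neg / (d_pos + d_neg)

# Hypothetical expert ratings: 3 cloud services x 2 criteria.
ratings = [
    [(5, 7, 9), (1, 3, 5)],
    [(3, 5, 7), (3, 5, 7)],
    [(7, 9, 10), (5, 7, 9)],
]
crisp = [[graded_mean(t) for t in row] for row in ratings]
# Assumed: first criterion is a benefit, second is a cost (e.g. price).
closeness = topsis_closeness(crisp, weights=[0.6, 0.4], benefit=[True, False])
ranking = np.argsort(-closeness)  # indices of services, best first
print(closeness, ranking)
```

A higher closeness value means the alternative lies nearer to the PIS and farther from the NIS; setting p=2 gives the usual Euclidean distance.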
In [30] the authors propose a cloud service selection model that uses the subjective assessment of customers and an objective performance assessment conducted by a trusted third party. The model is composed of four services: (i) Cloud Selection Service: it chooses the cloud services that meet all the objective requirements of a customer; (ii) Benchmark Testing Service: it is provided by a trusted third party, which designs a variety of testing scenarios to conduct an objective performance analysis; (iii) User Feedback Management Service: it is used to collect and manage the feedback from customers who are already consuming the selected cloud services. For every performance aspect of a cloud service, a customer gives his or her subjective assessment (e.g. "good", "fair", and "poor"); and (iv) Assessment Aggregation Service: it is responsible for accumulating the assessments (subjective and objective) and performing benchmarking with a fuzzy simple additive weighting system to generate a ranking.
Other techniques: ANP [23], VIKOR [24], and Fuzzy [30].
Based on the above review, Table I lists the MCDA methods most commonly used by online brokers to generate a ranking of service providers in the cloud. They are AHP and TOPSIS. The prime objective of AHP is to decompose the decision problem into a hierarchical structure of goal, criteria, and alternatives. It then evaluates them in a series of pairwise comparisons that use priorities provided by the decision maker [26]. TOPSIS, on the other hand, compares a set of alternatives by using weights for each criterion provided by the decision maker and afterwards calculates the geometric distance between each alternative and the expected ideal alternative [28].
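To make the AHP step concrete, the sketch below derives criterion priorities from a pairwise comparison matrix using the principal eigenvector, together with the consistency index CI = (λ_max − n)/(n − 1). The function name and the 3x3 comparison matrix are hypothetical and are not taken from the reviewed papers.

```python
import numpy as np

def ahp_priorities(pairwise):
    """Return (priority weights, consistency index) for a reciprocal
    pairwise comparison matrix where a[i][j] says how much more
    important criterion i is than criterion j."""
    a = np.asarray(pairwise, dtype=float)
    n = a.shape[0]
    eigvals, eigvecs = np.linalg.eig(a)
    k = np.argmax(eigvals.real)          # index of the principal eigenvalue
    w = np.abs(eigvecs[:, k].real)       # principal eigenvector (sign removed)
    w = w / w.sum()                      # normalize so the weights sum to 1
    ci = (eigvals[k].real - n) / (n - 1) # 0 for a perfectly consistent matrix
    return w, ci

# Hypothetical judgments by a decision maker for three criteria.
judgments = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights, ci = ahp_priorities(judgments)
print(weights, ci)
```

Because the weights come entirely from the decision maker's judgments, any criterion the decision maker includes receives a non-zero priority, which is the dependence on subjective input discussed in this section.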
It is evident that AHP and TOPSIS use distinct approaches to evaluate alternatives. However, at the very outset, both rely equally upon the subjective judgments of the decision maker to ensure that all relevant criteria are included in the process. Thus, it can be concluded that MCDA-based online brokers that use AHP or TOPSIS overlook the misspecification of criteria owing to the subjective judgments of the decision maker. Hence, there is a need to develop a model that deals with such misspecification owing to insufficient knowledge of the customer (decision maker) in the cloud.

III. PROPOSED MODEL
This research proposes a model called self-regulated MCDA, which resolves the misspecification of a criterion based on its statistical relevance, estimated using the notion of communality. Communality belongs to the broader concept of factor analysis from the domain of unsupervised machine learning [33,34]. Numerically, it is a measure of the relationship between a criterion and a goal [33]. A high value indicates a strong correlation between the two and hence endorses the criterion as relevant with reference to the goal. In the example of Moogle, except for the additional criterion of storage space, all other criteria have a strong correlation with QoS and are hence relevant for generating a QoS-based ranking of service providers.
Communality is estimated by using structural equation modeling (SEM). SEM is a statistical approach used to examine the association between a latent variable (or goal) and observed variables (or criteria) [33,34]. A latent variable is a theoretical construct that is inferred from the variables that are observed in the field. In the example of Moogle, QoS is a latent variable since it represents the intent of a customer and is inferred from the variables (availability, response time, price, speed, ease of use, technical support, and customer service) that are observed during the test or survey.
In SEM, the most popular and frequently used methods to estimate communality are Principal Factor Analysis (PFA) and Maximum Likelihood (ML) [33,34]. Considering that ML estimation assumes a normal distribution of the observed variables, and that this research deals with observed variables without making any prior assumption, PFA was used to estimate communality.
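In PFA, the relationship vector $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_p)'$ between a latent variable $F$ and the vector of observed variables $Y = (y_1, y_2, \ldots, y_p)'$ is expressed in variance-covariance matrix notation as:

$$\operatorname{cov}(Y) = \operatorname{cov}(\Lambda F) + \psi$$

where $\psi$ holds the unique variances $\psi_1, \psi_2, \ldots, \psi_p$ of the observed variables, i.e. the variance not shared with the latent variable, arranged as a diagonal matrix. By using the covariance property $\operatorname{cov}(AZ) = A \operatorname{cov}(Z) A'$, the term $\operatorname{cov}(\Lambda F)$ on the right-hand side of the above equation can be expanded to $\Lambda \operatorname{cov}(F) \Lambda'$. Moreover, since the latent variable is scaled to unit variance, $\operatorname{cov}(F) = 1$, the term $\Lambda \operatorname{cov}(F) \Lambda'$ reduces to $\Lambda\Lambda'$ and the equation becomes:

$$\operatorname{cov}(Y) = \Lambda\Lambda' + \psi$$

If $Y$ is not commensurate, i.e. the observed variables are measured in different units and scales, then the standardized $Y$ is used. After standardization, covariance becomes correlation ($r$) and subsequently, the covariance matrix $\operatorname{cov}(Y)$ becomes a correlation matrix $R$:

$$R = \Lambda\Lambda' + \psi \tag{1}$$

We can expand the above equation as:

$$\begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix} = \Lambda\Lambda' + \begin{pmatrix} \psi_1 & 0 & \cdots & 0 \\ 0 & \psi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \psi_p \end{pmatrix}$$

Bringing $\psi$ to the left-hand side and performing the subtraction:

$$\begin{pmatrix} 1 - \psi_1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 - \psi_2 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 - \psi_p \end{pmatrix} = \Lambda\Lambda'$$

Subtracting the unique variance from one ($1 - \psi_i$) yields the variance that an observed variable shares with the latent variable, which is equal to the square of $\lambda_i$ [33,34]. Accordingly, $\lambda_i^2$ can replace $1 - \psi_i$ and the above equation becomes:

$$\begin{pmatrix} \lambda_1^2 & r_{12} & \cdots & r_{1p} \\ r_{21} & \lambda_2^2 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & \lambda_p^2 \end{pmatrix} = \Lambda\Lambda' \tag{2}$$

where the left-hand side is $R - \psi$. Accordingly, in reduced form, equation 1 becomes:

$$R - \psi = \Lambda\Lambda' \tag{3}$$

$R - \psi$ is a 'reduced correlation matrix' with $\lambda_i^2$ on the diagonal. If $R - \psi$ is a positive semi-definite matrix, and hence symmetric, i.e. it satisfies $R - \psi = (R - \psi)'$, then the left-hand side of equation 3 has the following spectral decomposition:

$$R - \psi = U D U' \tag{4}$$

Spectral decomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvectors, which identify the latent variable, and the corresponding eigenvalues, which show the strength of the identified latent variable. In equation 4, $U$ is the matrix of eigenvectors of $R - \psi$ and $D$ is the diagonal matrix of the corresponding eigenvalues. Since the eigenvalues of a positive semi-definite matrix are non-negative, $D$ can be factored into $D^{1/2} D^{1/2}$, which gives:

$$R - \psi = U D^{1/2} D^{1/2} U' = (U D^{1/2})(U D^{1/2})' \tag{5}$$

Comparing equation 5 with equation 3 gives:

$$\Lambda = U D^{1/2} \tag{6}$$

In expanded form, the right-hand side of the above equation can be written as:

$$U D^{1/2} = \begin{pmatrix} u_{11}\sqrt{\Theta_1} & u_{12}\sqrt{\Theta_2} & \cdots & u_{1p}\sqrt{\Theta_p} \\ u_{21}\sqrt{\Theta_1} & u_{22}\sqrt{\Theta_2} & \cdots & u_{2p}\sqrt{\Theta_p} \\ \vdots & \vdots & \ddots & \vdots \\ u_{p1}\sqrt{\Theta_1} & u_{p2}\sqrt{\Theta_2} & \cdots & u_{pp}\sqrt{\Theta_p} \end{pmatrix}$$

Hence, from the right-hand side of the above equation we take the largest eigenvalue, denoted $\Theta_1$, and the corresponding eigenvector $U_1 = (u_{11}, u_{21}, \ldots, u_{p1})'$ for the calculation of $\Lambda$, i.e. $\Lambda = U_1 \sqrt{\Theta_1}$. The squared value of each element of $\Lambda$ is called communality ($\varsigma$) and can be written as:

$$\varsigma_i = \lambda_i^2 = u_{i1}^2 \, \Theta_1, \qquad i = 1, \ldots, p \tag{7}$$

In this equation, the eigenvector contains the estimated unit-scaled loadings or weights ($u_{i1}$) associated with each observed variable. The eigenvalue $\Theta_1$ is the variance shared among all the observed variables that represent the latent variable. Communality is obtained by multiplying the squared value of $u_{i1}$ by $\Theta_1$, and represents the relationship of the latent variable with an observed variable. A strong correlation between the two is identified by using the condition $\varsigma_i > \omega$,
where $\omega$ is a controlled variable (or constant) whose value is assigned by a substantive specialist in the field or by a statistical technique [33]. The value of $\omega$ lies between 0 and 1 and is used for the identification of relevant criteria. For example, $\omega = 0.60$ ensures that a criterion which contributes less than 60% to the goal is not selected for further processing. In the example of Moogle, storage space was one such criterion. Accordingly, equation 7 can be rewritten as:

$$\varsigma_i = u_{i1}^2 \, \Theta_1 > \omega \tag{8}$$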
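The derivation above can be sketched in a few lines of Python with NumPy. This is a minimal, single-pass illustration and not the authors' implementation: the function name, the use of the highest absolute off-diagonal correlation as the initial diagonal estimate, and the correlation matrix (built from hypothetical loadings 0.9, 0.8, 0.7, and 0.2) are assumptions made only for this example.

```python
import numpy as np

def pfa_communality(R, omega):
    """One-factor principal factor analysis on a correlation matrix R.
    Returns (communality per variable, boolean relevance mask)."""
    R = np.asarray(R, dtype=float)
    # Initial estimate of lambda_i^2: highest absolute correlation of
    # variable i with any other variable.
    off_diag = np.abs(R - np.diag(np.diag(R)))
    reduced = R.copy()
    np.fill_diagonal(reduced, off_diag.max(axis=1))   # R - psi
    # Spectral decomposition R - psi = U D U'; eigh returns the
    # eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(reduced)
    theta = max(eigvals[-1], 0.0)      # largest eigenvalue
    u = eigvecs[:, -1]                 # corresponding eigenvector
    loadings = u * np.sqrt(theta)      # Lambda = U_1 * sqrt(Theta_1)
    communality = loadings ** 2        # equation 7
    return communality, communality > omega   # equation 8

# Hypothetical correlation matrix for four criteria; the fourth
# criterion is only weakly related to the latent variable.
lam = np.array([0.9, 0.8, 0.7, 0.2])
R = np.outer(lam, lam)
np.fill_diagonal(R, 1.0)

communality, relevant = pfa_communality(R, omega=0.30)
print(np.round(communality, 3), relevant)
```

Because the communality is the squared loading, the arbitrary sign of the eigenvector returned by the solver does not affect the result.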

IV. EVALUATION AND RESULTS
A two-stage procedure was implemented in order to evaluate self-regulated MCDA in an online cloud environment. In stage one, the relevance of each criterion was assessed by using equation 8. In stage two, a comparative analysis was performed between two types of MCDA-based online brokers, only one of which was equipped with self-regulated MCDA. The two datasets used during these stages comprised "feedback from customers" and "feedback from servers" on the QoS of cloud storage providers.
The first dataset, i.e. feedback from customers, was compiled from leading review websites such as Cloud Hosting Reviews, Best Cloud Computing Providers, and Cloud Storage Reviews and Ratings. In this dataset, the feedback entries¹ were provided for the following QoS-based criteria (or observed variables): Availability (AV), Response Time (RT), Price (PR), Speed (SP), Storage Space (SS), Ease of Use (EU), Technical Support (TS), and Customer Services (CS). Each of these criteria was assessed on the following ordinal scale: excellent (1), very good (2), good (3), satisfactory (4), and sufficient (5). In total, the dataset contained 390 feedback entries for seven cloud storage providers: Carbonite, Dropbox, iBackup, JustCloud, SOS Online Backup, SugarSync, and Zip Cloud. The latent variable (or the goal) was QoS.
The second dataset, i.e. feedback from servers, was generated from a cloud brokerage architecture that was emulated using the high performance computing (HPC) cluster at the University of Luxembourg (HPC @ Uni.lu). More specifically, a virtual machine in the HPC cluster together with Docker (a software container platform) was used to emulate three cloud storage providers running NoSQL databases [35]: Redis, MongoDB, and Memcached. Each of these service providers was operating under a workload comprising operations ranging from 0 to 10,000, records ranging from 0 to 10,000, and threads ranging from 0 to 100.
The Yahoo! Cloud Serving Benchmark (YCSB) [36] was deployed at the customer end, i.e. on a second virtual machine in the HPC cluster, to continuously monitor the QoS of these storage providers in terms of throughput (operations per second), read latency (time to read data from the database), and update latency (time to update data in the database). For eight simulation runs with a small workload (number of operations < 5000) and a big workload (number of operations > 5000), Figure 1 depicts descriptive statistics of the three storage providers in terms of standardized values of throughput, read latency, and update latency. Based on these statistics, none of the storage providers can be classified as superior to the others. The data analysis, scripting, and visualization tools used during the two-stage procedure include: Python [37,38], R/R Studio [39], Arena Rockwell Input Analyzer [40,41], STATA - Data Analysis and Statistical Software [42][43][44], IBM Statistical Analysis Software Package (SPSS) [45,46], and Microsoft Excel.
¹ TrustFeedback@http://cs.adelaide.edu.au/~cloudarmor/ds.html

A. Assessing Relevance of a Criterion
This section presents the following five steps to assess the relevance of a criterion in the dataset containing customer feedback.
Step 1: The correlation matrix ($R$) is generated for the QoS-based criteria using the dataset. As these criteria are assessed on an ordinal scale, the generated matrix contains polychoric correlations, which are used to measure associations between ordinal variables [42][43][44].
Step 2: In order to generate the reduced correlation matrix, initial estimates for $\lambda_i^2$ are required, see equation 2. In [33] the author lists several approximation techniques, among which the most commonly used are the "average correlation of a variable with other variables" and the "highest correlation of a variable". In this research, the highest correlation of a variable was used as the initial estimate for $\lambda_i^2$.
Step 3: The reduced correlation matrix $R - \psi$ is generated with $\lambda_i^2$ on the diagonal of the matrix, see equation 3.
Step 4: $R - \psi$ is a positive semi-definite matrix, and hence symmetric, i.e. it satisfies $R - \psi = (R - \psi)'$, and so it has a spectral decomposition as per equation 4.
Step 5: Using equations 6 and 7, $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_p)'$ and $\varsigma$ are calculated, respectively.
Based on the opinion of a substantive specialist, the value of $\omega$ was set to 0.60 (60%) for the dataset with feedback from customers and to 0.30 (30%) for the dataset with feedback from servers. Afterwards, based on equation 8, criterion SS (0.552 < 0.60) was omitted from further processing and termed irrelevant for generating a QoS-based ranking of storage providers. The above steps were also used to calculate the relevance of each criterion in the dataset with feedback from servers. For throughput, $\varsigma$ was calculated to be 0.379, whereas for read latency and update latency it was 0.463 and 0.338, respectively. As none of the criteria has a shared variance of less than 30%, throughput, read latency, and update latency are all endorsed as relevant for generating a QoS-based ranking of storage providers.
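The relevance check of equation 8 can be reproduced directly from the communalities reported in the text. The snippet below uses only values stated in this paper (SS = 0.552 here, AV = 0.819 in Section IV-B, and the three server-side values); the remaining customer-side communalities are not reported and are therefore left out. The helper name is hypothetical.

```python
# Thresholds chosen by the substantive specialist, as reported in the text.
OMEGA_CUSTOMERS = 0.60
OMEGA_SERVERS = 0.30

# Communalities reported in the paper (customer-side list is partial).
customers = {"AV": 0.819, "SS": 0.552}
servers = {"throughput": 0.379, "read latency": 0.463, "update latency": 0.338}

def relevant_criteria(communality, omega):
    """Keep the criteria whose communality exceeds omega (equation 8)."""
    return [name for name, value in communality.items() if value > omega]

print(relevant_criteria(customers, OMEGA_CUSTOMERS))
print(relevant_criteria(servers, OMEGA_SERVERS))
```

With these inputs, SS is dropped from the customer dataset while all three server-side criteria are retained, in line with the outcome described above.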

B. Comparative Analysis
In this stage, a comparative analysis is performed between an MCDA-based online broker that uses self-regulated MCDA and one that does not. More specifically, it is a comparison between AHP (identified in Section II) and AHP based upon the proposed model, i.e. Self-regulated AHP.
For the dataset with feedback from customers, the former, i.e. AHP, performs a series of pairwise comparisons for eight QoS-based criteria using priorities provided by the customer (decision maker). The latter, i.e. Self-regulated AHP, uses seven QoS-based criteria (excluding SS) with priorities assigned based on the communality that was calculated in the preceding section. For example, Availability, with a communality of 0.819, was given the highest priority (followed by Customer Service, Technical Support, Response Time, Speed, Price, and Ease of Use). Since Self-regulated AHP in this dataset was using "relevant criteria" and "priorities assigned objectively", it was expected to produce better results than AHP.
For the dataset with feedback from servers, AHP performs a series of pairwise comparisons for three QoS-based criteria using priorities provided by the customer (experts at HPC @ Uni.lu). Since, as mentioned, no criterion was omitted based on the condition ς > ω, Self-regulated AHP uses the same three QoS-based criteria with priorities assigned based on the communality. However, since Self-regulated AHP in this dataset was only using "priorities assigned objectively", it was expected that it might not produce better results than AHP. This is the case when the priorities assigned by the customer in AHP are not substantially different from the priorities in Self-regulated AHP.
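The text does not spell out how communalities are converted into AHP priorities. One simple possibility, shown here purely as an assumption, is to make the weights proportional to the communalities and to fill the pairwise comparison matrix with their ratios. The sketch uses the three server-side communalities reported in Section IV-A; the function name is hypothetical.

```python
import numpy as np

def priorities_from_communality(communality):
    """Assumed mapping: weights proportional to communality, and a
    pairwise matrix a[i][j] = c_i / c_j, which is perfectly consistent."""
    names = list(communality)
    c = np.array([communality[n] for n in names], dtype=float)
    weights = dict(zip(names, c / c.sum()))
    pairwise = c[:, None] / c[None, :]
    order = [names[i] for i in np.argsort(-c)]   # highest priority first
    return weights, pairwise, order

servers = {"throughput": 0.379, "read latency": 0.463, "update latency": 0.338}
weights, pairwise, order = priorities_from_communality(servers)
print(order)
```

Under this assumption the priority order is read latency, then throughput, then update latency, which is the order the paper reports for Self-regulated AHP on the server dataset.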
A similar setting was also applied for the comparison between TOPSIS (identified in Section II) and Self-regulated TOPSIS. The motivation for performing two pairs of comparative assessments (AHP v. Self-regulated AHP and TOPSIS v. Self-regulated TOPSIS) for each dataset was to produce results for both certain and uncertain online cloud environments. A high degree of randomness was induced by using a random probability distribution to simulate uncertainty in the datasets for TOPSIS v. Self-regulated TOPSIS.
Figure 2 presents the results of the comparative assessment of AHP v. Self-regulated AHP for the dataset with feedback from customers. The following observations show that Self-regulated AHP produces a more explicit ranking of storage providers than AHP.
• Considering all simulations in both images (15+15=30), it can be observed that JustCloud is commonly ranked 1st. However, in simulation 10 in the AHP graph, the assigned rank for JustCloud is 2nd, whereas in the corresponding simulation in the Self-regulated AHP graph, the rank is 1st.
• Considering all simulations in both images, it can be observed that SOS Online Backup is commonly ranked in the range of 2nd to 6th. However, in simulations 10 and 15 in the AHP graph, the assigned ranks for SOS Online Backup are 1st and 7th, whereas in the corresponding simulations in the Self-regulated AHP graph, the ranks are 2nd and 6th.
• Considering all simulations in both images, it can be observed that Zip Cloud is commonly ranked in the range of 5th to 7th. However, in simulations 7 and 13 in the AHP graph, the assigned ranks for Zip Cloud are 4th and 2nd, whereas in the corresponding simulations in the Self-regulated AHP graph, the ranks are 6th and 5th.
• Considering all simulations in both images, it can be observed that SugarSync is commonly ranked at 7th position. However, in simulations 2, 10, 11, 14, and 15 in the AHP graph, the assigned rank for SugarSync is 6th, whereas in the corresponding simulations in the Self-regulated AHP graph (except for 2), the rank is 7th.
The remaining cloud storage providers, i.e. Carbonite, Dropbox, and iBackup, have almost similar ranks in both graphs.
Fig. 2. Comparative Assessment AHP v. Self-regulated AHP
Figures 3 and 4 present the results of the comparative assessment of AHP v. Self-regulated AHP for the dataset with feedback from servers. The assessment was performed for two workloads (small load and big load, see Figure 1). For the big load, the priorities assigned by the customer (experts at HPC @ Uni.lu) in AHP (Update Latency was given the highest priority, followed by Read Latency and Throughput) were substantially different from the priorities in Self-regulated AHP (Read Latency was given the highest priority, followed by Throughput and Update Latency). Hence, Self-regulated AHP produced better results than AHP. However, for the small load, the priorities were not substantially different and therefore, the results of Self-regulated AHP were the same as those of AHP.
Figure 5 presents the results of the comparative assessment of TOPSIS v. Self-regulated TOPSIS for the dataset with feedback from customers. Both graphs in the figure clearly show the effects of the induced uncertainty: every storage provider is ranked in the range of 1st to 7th. However, for SOS Online Backup and Dropbox the range is reduced to 2nd to 7th and 1st to 6th, respectively, in the Self-regulated TOPSIS graph. This highlights the limited ability of the proposed model to produce better results even in the presence of uncertainty. Figures 6 and 7 present the results of the comparative assessment of TOPSIS v. Self-regulated TOPSIS for the dataset with feedback from servers. The assessment was performed for two workloads (small load and big load, see Figure 1). The results are almost similar to the results in Figure 5, i.e. it is not clear which service provider outperforms the others. These results show a limitation of the proposed model and suggest a direction for future research: augmenting the proposed model to deal with uncertainty in the cloud. However, in a stable environment, i.e. when uncertainty is low, based on the above observations, it can be stated that MCDA-based online brokers equipped with Self-regulated AHP or Self-regulated TOPSIS will produce an optimized ranking of service providers in the cloud as compared to brokers that use AHP and TOPSIS.
In this regard, the research has successfully integrated the notions of unsupervised machine learning and multi-criteria decision analysis. A two-stage procedure was implemented in order to evaluate self-regulated MCDA in an online cloud environment. Two quality-of-service (QoS) based datasets were used during the evaluation. The first dataset, i.e. feedback from customers, was compiled from leading review websites such as Cloud Hosting Reviews, Best Cloud Computing Providers, and Cloud Storage Reviews and Ratings. The second dataset, i.e. feedback from servers, was generated from a cloud brokerage architecture that was emulated using the high performance computing (HPC) cluster at the University of Luxembourg (HPC @ Uni.lu).
The simulation runs in the stable cloud environment, i.e. when uncertainty is low, show that an online broker that uses Self-regulated AHP or Self-regulated TOPSIS produces an optimized ranking of service providers as compared to brokers that use AHP and TOPSIS. This is due to the fact that Self-regulated AHP or Self-regulated TOPSIS assigns priorities to criteria objectively (using unsupervised machine learning) rather than using priorities based on subjective judgments of the customer. In particular, the results have implications for enterprises that view insufficient domain knowledge as a limiting factor for the acquisition of cloud services. In the next stage of the research, the goal is twofold: first, to enhance the proposed model to deal with uncertainty in the system and second, to test in-field execution of the enhanced model in a real-time cloud brokerage architecture.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 15 January 2018 | doi:10.20944/preprints201801.0125.v1
In equation 4, $U$ is the matrix of eigenvectors of $R - \psi$ and $D$ is the diagonal matrix of the corresponding eigenvalues $\Theta_1, \Theta_2, \ldots, \Theta_p$. An important property of a positive semi-definite matrix is that its eigenvalues are always positive or null. Hence, $\Theta_i \geq 0$ and consequently, $D$ can be factored into $D^{1/2} D^{1/2}$, so that the right-hand side of equation 4 becomes $(U D^{1/2})(U D^{1/2})'$ (equation 5). Equation 5 is in the form of equation 3 and accordingly, $\Lambda = U D^{1/2}$ can be deduced (equation 6).

TABLE I. COMMONLY USED MCDA METHODS BY ONLINE BROKERS