DCR: Double Component Ranking for Building Reliable Cloud Applications

Since cloud applications are usually large-scale, enhancing the reliability of every component is too expensive. We therefore need to identify the significant components that have a great impact on system reliability. FTCloud, an existing approach, ranks components by considering only the impact of component internal failures and ignores error propagation. However, error propagation is also an important factor in system reliability. To address this problem, we propose an improved component ranking framework, named DCR, to identify significant components in cloud applications. DCR employs two separate algorithms to rank the components twice and determines a set of the most significant components based on the two ranking results. In addition, DCR does not require information on component invocation frequencies. Extensive experiments are conducted to evaluate DCR and compare it with FTCloud. The experimental results show that DCR outperforms FTCloud in almost all cases.


Introduction
Cloud computing is an Internet-based computing paradigm that provides shared processing resources and data to computers and other devices on demand [1,2]. In recent years, cloud computing has become increasingly popular, and many enterprises and individuals prefer to build their systems in the cloud environment. The software systems in the cloud are called cloud applications, which usually consist of various cloud components communicating with each other. Cloud applications are usually large-scale and very complex [3], which may pose a threat to system reliability and hinder the migration of critical systems to the cloud. End users have little tolerance for unreliable applications, and the demand for high reliability is continually increasing. Building highly reliable cloud applications has therefore become a challenging and necessary research problem.
The major approach for improving cloud application reliability is to enhance the reliability of each individual component. This may be accomplished either by employing functionally equivalent but more reliable components to reduce component failures or by adding fault-tolerance strategies to tolerate component failures. Unfortunately, both incur extra cost. As cloud applications usually involve a large number of components, it is too expensive to provide alternative components or add fault-tolerance strategies for all of them. Based on the 80-20 rule [4], FTCloud, an existing approach [5], attempts to improve the reliability of cloud applications by ranking the components to identify a small set of significant components and enhancing their reliability. However, FTCloud only considers the impact of component internal failures on the system and does not take into account error propagation, which is also a serious threat to global reliability [6].
To address this problem, we propose a component ranking framework for identifying significant components and helping designers build highly reliable cloud applications. The framework includes two component ranking algorithms, which take into account the direct impact of component internal failures on the system and the harm of error propagation, respectively. Based on the two algorithms, two ranking results are obtained, and a small set of the most significant components, which have great impact on the system reliability, is then determined. By enhancing the reliability of these significant components, the system reliability can be greatly improved.
The main contributions of this paper are:
1. This paper identifies the importance of error propagation in locating significant components of cloud applications, which is not considered by FTCloud, and proposes an improved component ranking framework, named DCR. DCR employs only component invocation relationships to independently rank the components twice and selects the critical components that have great impact on the system reliability from the two ranking results.
2. Extensive experiments are provided to evaluate the impact of the significant components identified by DCR on the reliability of cloud applications and to compare the performance of DCR and FTCloud. The results show that DCR is effective and outperforms FTCloud in almost all cases.
The rest of this paper is organized as follows. Section 2 introduces the two descriptions of significant components, the system architecture of DCR and related work. Section 3 details the double component ranking framework. Section 4 presents the experiments that evaluate DCR. Section 5 concludes the paper and outlines future work.

Preliminaries

Significant Components
A failure of a component in software systems can be attributed to two reasons [6], as shown in Figure 1. One is that an error caused by faults in the component (such as bugs) is delivered at the output interface, i.e. a component internal failure. The other is that the component receives an incorrect input and generates an erroneous output, i.e. error propagation leads to a component failure. A system failure occurs only if an error eventually reaches the system interface, no matter how the error is produced and propagated. In short, component internal failures and error propagation are the two major threats to system reliability. To build highly reliable cloud applications, we must therefore not only reduce the direct impact of component internal failures on the system but also minimize the harm of error propagation. Accordingly, the significant components in this paper can be described from two perspectives:
1. The significant components are the ones whose failures have great impact on the system.
2. The significant components are also the ones which, when they fail, may severely affect many other components and further harm the global reliability by propagating errors.

System Architecture
The system architecture of DCR is shown in Figure 2. It includes three parts: structure graph building, component ranking and significant component determination. The procedure of DCR is as follows:
1. The system designer provides the structure information of a cloud application to DCR. A structure graph is generated based on the component invocation relationships.
2. Two series of significance values of the cloud components are calculated by two different component ranking algorithms, which correspond to the two descriptions of significant components in the last subsection, respectively. According to the two series of significance values, the components are ranked twice.
3. Based on the two ranking results, the most significant components, which have strong impact on the global reliability, are determined and returned to the system designer for building a reliable cloud application.
ISSN: 1693-6930 DCR: Double Component Ranking for Building Reliable Cloud Applications (Lixing Xue) 1567
Figure 2. System architecture of DCR

Related Work
In traditional software reliability engineering, there are four common methods to build reliable software systems, namely fault prevention, fault removal, fault tolerance and fault forecasting [7]. However, fault prevention and fault removal cannot be applied when we build cloud applications, because cloud applications are usually built from existing cloud components whose development we cannot take part in. We can, however, select components with high reliability according to the design requirements. Another method we can employ is software fault tolerance. Software fault-tolerance techniques, such as recovery block [8] and N-Version Programming (N-Modular Redundancy) [9], are widely used in various systems. In the cloud environment, a great number of functionally equivalent but independently designed components can be used for designing fault-tolerance mechanisms.
As cloud computing becomes popular, a number of works have been carried out on it. Service component selection and composition is one of the hotspots. Many approaches have been proposed, such as QoS-aware web service composition [10], a web service reputation model [11], OWL-S service profile based web service selection [12] and web service selection based on concurrent requests [13]. Component ranking is a prerequisite for applying these research findings, and some studies have been carried out. However, these approaches do not take into account error propagation, which is also a major threat to the reliability of cloud applications. In addition, they require the structure information as well as the information of component invocation frequencies. Our approach addresses these weaknesses: it requires only the structure information and takes into account the harm of error propagation in the system, and in our experiments it outperforms FTCloud in almost all cases.

Double Component Ranking
As shown in Figure 2, DCR includes three parts, which are detailed in this section. Structure graph building is introduced in Section 3.1. The two component ranking algorithms are then proposed in Sections 3.2 and 3.3, according to the two descriptions of significant components. In Section 3.4, the determination of significant components is discussed.

Structure Graph Building
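The structure of a cloud application, that is, the component invocation relationships, can be modeled as a directed graph $G = (C, E)$, where a node $C_i$ in the node set $C$ denotes a component and a directed edge $e_{ij}$ from $C_i$ to $C_j$ in the edge set $E$ represents that $C_i$ invokes $C_j$ (denoted as $C_i \to C_j$). A graph with $n$ nodes can be represented by an $n \times n$ adjacency matrix $A = (a_{ij})$. Each entry $a_{ij}$ in the matrix is defined by:

$$a_{ij} = \begin{cases} 1, & \text{if } C_i \to C_j \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

In terms of the matrix, the number of edges starting from node $C_i$ is called the out-degree of $C_i$, denoted as $\deg^{+}(C_i)$. It can be obtained by:

$$\deg^{+}(C_i) = \sum_{j=1}^{n} a_{ij} \quad (2)$$

Similarly, the number of edges ending at node $C_i$ is called the in-degree of $C_i$, denoted as $\deg^{-}(C_i)$. It can be calculated by:

$$\deg^{-}(C_i) = \sum_{j=1}^{n} a_{ji} \quad (3)$$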
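As a small illustration of the structure graph, the following Python sketch builds the adjacency matrix from a list of invocation pairs and reads the out-degree and in-degree off its rows and columns. The function names and the four-component example are hypothetical and serve only to make the definitions concrete.

```python
def build_adjacency(n, invocations):
    """Return the n x n adjacency matrix: a[i][j] = 1 if component i invokes component j."""
    a = [[0] * n for _ in range(n)]
    for i, j in invocations:
        a[i][j] = 1
    return a


def out_degree(a, i):
    """Number of edges starting from component i (sum of row i)."""
    return sum(a[i])


def in_degree(a, i):
    """Number of edges ending at component i (sum of column i)."""
    return sum(row[i] for row in a)


# Hypothetical 4-component application: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 3
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = build_adjacency(4, edges)
out_degrees = [out_degree(A, i) for i in range(4)]  # [2, 1, 1, 0]
in_degrees = [in_degree(A, i) for i in range(4)]    # [0, 1, 2, 1]
```

For graphs with thousands of nodes, an adjacency list would likely be more economical than a dense matrix; the matrix is used here only because it mirrors the definitions directly.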

Failure-Based Component Ranking
In a cloud application, some components are frequently invoked by many other components. Their failures will directly affect the system reliability much more than the failures of other components [14]. These components match the first description of significant components discussed in the last section. Intuitively, these significant components in a structure graph are the ones which have many incoming links from other important components. On the basis of the PageRank algorithm [15], we propose an algorithm to calculate the first series of significance values of the cloud components, called failure-based significance values.
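For a cloud application which contains $n$ components, the failure-based significance value $VF(C_i)$ of a component $C_i$ is defined as:

$$VF(C_i) = \frac{1-d}{n} + d \sum_{C_j \in I(C_i)} \frac{VF(C_j)}{\deg^{+}(C_j)} \quad (4)$$

where $I(C_i)$ is the set of components that invoke $C_i$ and $\deg^{+}(C_j)$ is the out-degree of $C_j$. The parameter $d$ ($0 \le d \le 1$) in (4) is utilized to adjust the proportion of the two values, namely the basic value of the component itself and the value derived from the components that invoke it, and is usually set to 0.85. By (4), a component $C_i$ has a large failure-based significance value if the sum of the failure-based significance values of the components that invoke $C_i$ is large, indicating that $C_i$ is invoked by many other significant components.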
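Equation (4) can be written in matrix form:

$$\mathbf{VF} = \frac{1-d}{n}\,\mathbf{1} + d\,M\,\mathbf{VF} \quad (5)$$

where $\mathbf{VF} = [VF(C_1), \ldots, VF(C_n)]^{T}$, $\mathbf{1}$ is the all-ones vector of length $n$, $d$ is the parameter in (4), and the matrix $M = (m_{ij})_{n \times n}$ is defined by:

$$m_{ij} = \begin{cases} 1/\deg^{+}(C_j), & \text{if } C_j \to C_i \\ 0, & \text{otherwise} \end{cases} \quad (6)$$

The procedure for calculating the failure-based significance values is simple. First, randomly assign initial values between 0 and 1 to the failure-based significance values $VF(C_i)$ ($1 \le i \le n$). Then, solve (5) by repeating the computation until all significance values become stable.
Using the above approach, the failure-based significance values of the cloud components can be obtained. According to these values, the components are ranked. A component with a larger value is considered to be more significant. The failures of the significant components selected from this ranking result will have great impact on the system reliability.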
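The iterative procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes a PageRank-style update in which each component receives a basic value (1 - d)/n plus d times the values of its invokers, each divided by the invoker's out-degree. The function name, the tolerance `tol`, the iteration cap and the example graph are all hypothetical.

```python
import random


def failure_based_values(n, edges, d=0.85, tol=1e-10, max_iter=1000, seed=0):
    """Iterate VF(i) = (1 - d)/n + d * sum(VF(j) / outdeg(j)) over all invokers j of i."""
    out_deg = [0] * n
    invokers = [[] for _ in range(n)]  # invokers[i] = components that invoke i
    for i, j in set(edges):            # set() drops duplicate invocation pairs
        out_deg[i] += 1
        invokers[j].append(i)
    rng = random.Random(seed)
    vf = [rng.random() for _ in range(n)]  # random initial values in [0, 1)
    for _ in range(max_iter):
        new = [(1 - d) / n + d * sum(vf[j] / out_deg[j] for j in invokers[i])
               for i in range(n)]
        converged = max(abs(x - y) for x, y in zip(new, vf)) < tol
        vf = new
        if converged:                  # all values have become stable
            break
    return vf


# Hypothetical 4-component application: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 3
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
vf = failure_based_values(4, edges)
ranking = sorted(range(4), key=lambda i: vf[i], reverse=True)  # [3, 2, 1, 0]
```

In this example component 3 ranks first because it sits at the end of every invocation chain, so it is reached, directly or indirectly, from all other components. Because d < 1, the iteration converges to the same values regardless of the random starting point.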

Propagation-Based Component Ranking
In a cloud application, there are also components that frequently invoke many other components. Their failures may affect many subsequent components through error propagation and further harm the system reliability. These components are therefore considered to be important, and they match the second description of significant components in Section 2.1. Intuitively, these significant components in a structure graph are the ones which have many outgoing links to other important components. Inspired by the TrustRank algorithm [16], we propose another algorithm to calculate the second series of significance values of the cloud components, called propagation-based significance values.
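Assuming that a cloud application contains $n$ components, the propagation-based significance value $VP(C_i)$ of a component $C_i$ is defined as:

$$VP(C_i) = \frac{1-d}{n} + d \sum_{C_j \in O(C_i)} \frac{VP(C_j)}{\deg^{-}(C_j)} \quad (7)$$

where $O(C_i)$ is the set of components invoked by $C_i$ and $\deg^{-}(C_j)$ is the in-degree of $C_j$. The first term is the basic value of the component itself, and the second term is the value derived from other components which are invoked by $C_i$. Similarly, the parameter $d$ ($0 \le d \le 1$) in (7) is employed to adjust the proportion of the two values, and is usually set to 0.85. By (7), a component $C_i$ has a large propagation-based significance value if the sum of the propagation-based significance values of the components which are invoked by $C_i$ is large, showing that $C_i$ invokes a large number of other significant components. The equivalent matrix equation of (7) is:

$$\mathbf{VP} = \frac{1-d}{n}\,\mathbf{1} + d\,T\,\mathbf{VP} \quad (8)$$

where $\mathbf{VP} = [VP(C_1), \ldots, VP(C_n)]^{T}$, $\mathbf{1}$ is the all-ones vector of length $n$, and the matrix $T = (t_{ij})_{n \times n}$ is defined by:

$$t_{ij} = \begin{cases} 1/\deg^{-}(C_j), & \text{if } C_i \to C_j \\ 0, & \text{otherwise} \end{cases} \quad (9)$$

The procedure for calculating the propagation-based significance values is identical to that for calculating the failure-based significance values. First, randomly assign initial values between 0 and 1 to the propagation-based significance values $VP(C_i)$ ($1 \le i \le n$). Then, solve (8) by repeating the computation until all significance values become stable.
With the above approach, the propagation-based significance values of the cloud components can be obtained. On the basis of these values, the components are ranked. A component is considered to be more significant if it has a larger value. The failures of the significant components selected from this ranking result will severely affect other components in the cloud application and further affect the system reliability.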
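A corresponding Python sketch for the propagation-based values is given below. It is again an illustration under assumptions rather than the authors' code: the update is taken to be the mirror image of the failure-based one, so that a component collects value from the components it invokes, each divided by that component's in-degree. The function name, tolerance and example graph are hypothetical.

```python
import random


def propagation_based_values(n, edges, d=0.85, tol=1e-10, max_iter=1000, seed=0):
    """Iterate VP(i) = (1 - d)/n + d * sum(VP(j) / indeg(j)) over all j invoked by i."""
    in_deg = [0] * n
    invoked = [[] for _ in range(n)]   # invoked[i] = components that i invokes
    for i, j in set(edges):            # set() drops duplicate invocation pairs
        in_deg[j] += 1
        invoked[i].append(j)
    rng = random.Random(seed)
    vp = [rng.random() for _ in range(n)]  # random initial values in [0, 1)
    for _ in range(max_iter):
        new = [(1 - d) / n + d * sum(vp[j] / in_deg[j] for j in invoked[i])
               for i in range(n)]
        converged = max(abs(x - y) for x, y in zip(new, vp)) < tol
        vp = new
        if converged:                  # all values have become stable
            break
    return vp


# Same hypothetical application as before: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 3
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
vp = propagation_based_values(4, edges)
ranking = sorted(range(4), key=lambda i: vp[i], reverse=True)  # [0, 2, 1, 3]
```

Here component 0 ranks first because an error in it can propagate to every other component, whereas component 3 invokes nothing and keeps only its basic value.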

Significant Component Determination
Based on the two series of significance values, the components in the cloud application can be ranked in two ways. The failure-based significance values enable us to identify the significant components which have great direct impact on the system reliability, while the propagation-based significance values help us locate the significant components which severely affect other components and further harm the system reliability. Which ranking result is more important? We believe that there is no definitive answer.
To reduce both the direct and the indirect threats and better improve the system reliability, the Top-$k/2$ components ($2 \le k \le n$ and $k$ is even) are selected from each of the two ranking results, and hence a total of $k$ components are determined as the most significant components. In this way, the designer of the cloud application can improve the system reliability efficiently by enhancing the reliability of these components.
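The selection step can be sketched in Python as below. The function name and the sample values are hypothetical. The text does not say what happens when a component appears in both Top-k/2 lists; this sketch simply takes the union, which in that case contains fewer than k components, and that choice is an assumption.

```python
def select_significant(vf, vp, k):
    """Take the top k/2 components from each ranking (k even, 2 <= k <= n)."""
    n = len(vf)
    if len(vp) != n or k % 2 != 0 or not 2 <= k <= n:
        raise ValueError("need two rankings of equal length and an even k with 2 <= k <= n")
    half = k // 2
    by_failure = sorted(range(n), key=lambda i: vf[i], reverse=True)[:half]
    by_propagation = sorted(range(n), key=lambda i: vp[i], reverse=True)[:half]
    # Assumption: overlapping components are counted once (set union).
    return sorted(set(by_failure) | set(by_propagation))


# Hypothetical significance values for a 4-component application
vf = [0.0375, 0.0534, 0.0989, 0.1215]  # failure-based
vp = [0.1239, 0.0670, 0.0694, 0.0375]  # propagation-based
top2 = select_significant(vf, vp, 2)   # [0, 3]
top4 = select_significant(vf, vp, 4)   # [0, 2, 3]; component 2 is in both lists
```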

Experiments and Evaluation
In this section, extensive experiments are provided to validate DCR, evaluate the impact of different parameter settings on DCR and compare DCR with FTCloud.

Experimental Setup
In this section we compare the following approaches:
1. DCR: The components are ranked by DCR and the Top-K percent components are selected as the significant components for enhancing the reliability.
2. RandCR: K percent components are randomly selected as the significant components for enhancing the reliability.
3. FTCloud: The components are ranked by FTCloud and the Top-K percent components are selected as the significant components for enhancing the reliability.
The system reliability is considered to be the probability of generating correct output with correct input [17]. For a fair comparison, we assume that the internal failure probability (intf) of the selected components can be reduced to 80% after enhancing the reliability, no matter which approach is employed. In addition, in DCR and FTCloud, the parameter $d$ is used to balance the significance values derived from other components and the basic values of the components themselves. Previous studies [18,19] have shown that 0.85 is a good choice. Thus, in our experiments, $d$ is also set to 0.85.
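For concreteness, the RandCR baseline can be sketched in Python as follows. The function name, the seed parameter and the use of integer division to turn K percent into a component count are assumptions made for this illustration.

```python
import random


def rand_cr(n, k_percent, seed=None):
    """Randomly pick K percent of the n components as the significant ones."""
    count = n * k_percent // 100  # assumption: round down to a whole component
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), count))


# Hypothetical run: 2 percent of a 1000-component application
selected = rand_cr(1000, 2, seed=1)
repeat = rand_cr(1000, 2, seed=1)  # same seed, same selection
```

Fixing the seed makes a run repeatable, which helps when the same random selection has to be compared against several ranking approaches.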
A scale-free graph is a graph whose degree distribution follows a power law, at least asymptotically. Previous studies have demonstrated that not only the Internet [20] but also the internal structures of common software such as Linux Kernel, Mozilla, Xfree86 and MySQL [21,22] appear to be scale-free. Therefore, the network analysis software Pajek [23] is utilized to generate scale-free directed graphs as structure graphs of cloud applications in the experiments.
Three scale-free directed graphs with different node numbers (i.e. 500, 1000 and 2000) are generated by Pajek in our experiments. Then the component invocation frequencies of each graph are randomly generated to simulate the statistical data collected during a period of running online. These component invocation frequencies are used by FTCloud and in calculating the system reliability.
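The experiments use Pajek to generate the graphs. Purely as an illustration of how a directed graph with a heavy-tailed in-degree distribution can arise, the following Python sketch uses a simple preferential attachment rule. It is not Pajek's generator; the function name, the parameter `m` and the attachment rule (probability proportional to in-degree plus one) are assumptions of this sketch.

```python
import random


def preferential_attachment_digraph(n, m=2, seed=0):
    """Grow a directed graph: each new node invokes up to m existing nodes,
    chosen with probability proportional to (in-degree + 1)."""
    rng = random.Random(seed)
    edges = set()
    pool = [0]  # node i appears (in-degree of i + 1) times in the pool
    for new in range(1, n):
        targets = set()
        want = min(m, new)               # cannot pick more targets than existing nodes
        while len(targets) < want:
            targets.add(rng.choice(pool))
        for t in sorted(targets):
            edges.add((new, t))          # the new component invokes t
            pool.append(t)               # t gained one incoming edge
        pool.append(new)                 # the new node enters with weight 1
    return sorted(edges)


graph = preferential_attachment_digraph(500, m=2, seed=1)
edge_count = len(graph)  # 1 + 2 * (500 - 2) = 997 edges
```

Because every edge points from a newer node to an older one, the resulting graph is acyclic; real invocation structures and Pajek's generated graphs need not share that property.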

Validation and Performance Comparison
In order to validate DCR and compare DCR with FTCloud, the approaches are applied to the three graphs respectively and the experimental results of application reliability are ...

To study the impact of Top-K on the system reliability, we compare the approaches under different Top-K settings. In this experiment, the node number is still 1000 and ep is again set to 0.99. The experimental results of cloud application reliability in Figure 5 show that: 1. DCR consistently provides better reliability performance than FTCloud in all cases when intf ≤ 0.05 and almost all cases when 0. ...
and Top-K is set to 2%, the reliability provided by FTCloud is 0.0002 higher than that provided by DCR.
To sum up, DCR outperforms FTCloud in almost all cases. Only when Top-K = 2% and either ep is small or intf is large may the performance of FTCloud approach or slightly exceed that of DCR. This observation stems from the way DCR determines significant components and from the unequal impact of component internal failures and of error propagation on reliability in cloud applications. DCR treats the impact of component internal failures and of error propagation equally, and selects the Top-$k/2$ components from each of the two ranking results.
In these extreme cases, the impact of the first $k/2$ components of the propagation-based ranking result on the reliability may be a little weaker than the impact of the second $k/2$ components of the failure-based ranking result, causing the performance of DCR to be slightly worse than that of FTCloud. This observation arises only when the node number is not large. When the scale of the cloud application reaches 2000 nodes, DCR outperforms FTCloud without exception. In any case, the negligible performance difference in a few extreme cases does not detract from the effectiveness and advantages of DCR.

Conclusion
This paper proposes a component ranking framework for identifying the significant components that have great impact on cloud application reliability, in order to help designers build reliable cloud applications. The framework takes into account the impact of component internal failures as well as the harm of error propagation, and ranks the components twice using only the system structure information. The significant components are determined based on the two ranking results. The reliability of cloud applications can be greatly improved by enhancing the reliability of these significant components. Compared with FTCloud, the proposed framework considers more factors but requires less information. Extensive experiments are conducted to compare performance, and the results show that our framework is effective and outperforms FTCloud in almost all cases.
The future work includes: a) improving the determination of significant components, b) more experimental analysis of actual cloud applications, and c) considering more factors to identify significant components.

Figure 1. Two threats to reliability

Figure 3. Impact of component internal failure probability