
Using Random Test Selection to Gain Confidence in Modified Software

Wanchun Li and Mary Jean Harrold
College of Computing, Georgia Institute of Technology
Atlanta, GA 30332-0280
{wli7|harrold}@cc.gatech.edu

Abstract
This paper presents a method that addresses two practical issues concerning the use of random test selection for regression testing: the number of random samples needed from the test suite to provide reliable results, and the confidence levels of the predictions made by the random samples. The method applies the Chernoff bound, which has been applied in various randomized algorithms, to compute the error bound for random test selection. The paper presents three example applications, based on the method, for regression testing. The main benefits of the method are that it requires no distribution information about the test suite from which the samples are taken, and the computation of the confidence level is independent of the size of the test suite. The paper also presents the results of an empirical evaluation of the technique on a set of C programs, which have been used in many testing experiments, along with three of the GCC compilers. The results demonstrate the effectiveness of the method and show its potential for regression testing on real-world, large-scale applications.

1 Introduction

Regression testing is an important activity used to gain confidence in the quality of modified software. The testing is applied to the modified version of the software to ensure that it behaves as intended, and that the modifications have not adversely impacted its quality. Regression testing may be used to detect failures. In this case, the failure information is used to locate and fix the fault(s) causing the failures, and thus, to improve the quality of the software. Regression testing may also be used to estimate the number of failures that remain in the system. In this case, the failure information is used to indicate the software's quality, and thus, to help in deciding when the software is ready to release. One approach to regression testing saves the test suite T used to test one version of a program P, and uses it to

test the next (modified) version of the program, P'. Because it is sometimes too expensive or time consuming to rerun all of T on P', researchers have developed regression-test-selection techniques that select a subset T' of T and use it to test P' (e.g., [6, 7, 16, 21, 24, 27]). Studies have shown that these regression-test-selection techniques can be effective in reducing the cost of regression testing (e.g., [6, 21, 24, 27]). Most of these regression-test-selection techniques use some representation of the system, such as a system model or the source code, on which to gather coverage data when testing P using T, and use it to assist in identifying test cases in T to rerun on P'. Interactions with our industrial collaborators, however, have revealed that, in practice, such representations may not be available for use in the testing. Thus, these existing regression-test-selection techniques may not be applicable in many practical scenarios. In these situations, practitioners often use random test selection from T to find failures and to predict the number of failures before releasing the modified versions of the software. For predicting the number of failures, testers can randomly select a subset of test cases from T. By running these randomly-selected test cases (i.e., random samples) from T, testers can predict the number of remaining failures based on the failure information of the random samples. However, due to the randomness of the selection of test cases, testers are concerned about the potential error in the prediction: the deviation of the prediction from the actual number of remaining failures. Thus, one main issue in using random test selection is the computation of the error bound for gaining confidence in the prediction made by running the randomly-selected test cases. Computing the error bound can be expressed as two questions: (1) how many random samples are needed to reach a certain confidence level, and (2) what is the confidence level of the failures predicted by a specific number of random samples. Previous research has addressed aspects of random test selection. One approach uses operational testing [12, 14], which applies random selection to an operational profile

to estimate software quality. However, the technique does not address the confidence issue. Another set of techniques [10, 14, 29] investigates the probability that random testing finds at least one failure; these techniques assume that testing stops when a failure is discovered. However, when a failure is detected in real-world software, testers usually report the failure to developers but do not stop the testing. A third set of techniques [5, 10] studies the expected number of failures found by random testing of software with uniform distribution. However, it may be difficult to determine whether the input distribution of today's complex software systems is uniform. A final approach [18] applies Monte Carlo integration and the central limit theorem to estimate the failing rate of a test suite by executing randomly-selected test cases. However, the technique does not specify the number of random test cases required for achieving an estimation, given a confidence level. To address the lack of practical confidence measures for estimating the number of failures using randomly-selected test cases, we developed, and present in this paper, a new method that uses the Chernoff bound [20] to measure the confidence level of random regression-test selection. Using the method, testers select a random subset from a test suite, find the passing rate of the random samples, and, using this passing rate and the number of samples, apply the Chernoff bound to estimate the passing rate of the entire test suite with the corresponding confidence level. Our method can assist regression testing where a large number of test cases is available for the testing, but time constraints do not permit all test cases to be rerun on the modified software. We discuss three example applications in this paper. The first is operational-profile testing for regression testing of systems, including those whose regression test suite is obtained from events recorded in production logs. The second is coverage-based testing that requires running a large number of test cases to satisfy certain coverage criteria. The third is a regular-build (e.g., daily-build) process that tests the latest version of the software regularly, and that requires running a large number of test cases, has limited time for the testing, or both. We investigated the effectiveness of our method for evaluating random test selection with two empirical studies. The first study used eight C programs (seven developed by Siemens Corporate Research [15] and one developed by the European Space Agency [28]) that have been used for many previous testing experiments. The results of this study demonstrate the applicability and effectiveness of our method using the Chernoff bound for measuring confidence levels of random test selection. The second study used three GCC compilers. The results of this study demonstrate the potential benefits of our method for real-world, large-scale software.

Our method answers two practical questions about random test selection: how many test cases are needed, and what confidence level can a specific number of random samples provide. The main benefits of our method are that (1) it requires no distribution information about the input universe and (2) the confidence level that it computes is independent of the size of the test suite but dependent on the passing rate of the samples, the number of test cases, and a user-defined value of deviation. These benefits make our method scalable and practical for reducing the cost of testing when a large number of test cases are available. Instead of running the entire test suite, our method runs a randomly-selected subset of the test cases but provides an estimation of the expected execution results for the entire test suite with a specific confidence level. The main contributions of this paper are:

- A description of a new method that answers two practical questions about applying random regression-test selection
- A discussion of three applications of the method to regression testing
- The results of empirical studies that demonstrate the effectiveness of the method and the benefits to real-world, large-scale applications

2 The Chernoff Bound

In this section, we discuss the Chernoff bound, which we use for our evaluation of random test selection. The Chernoff bound is a probability technique for computing tail probability: the probability that a random variable deviates significantly from its expectation. To define tail probability more formally, let $Pr[X]$ denote the probability of a random variable $X$, and let $E[X]$ denote the expectation of $X$. Suppose that $X_1, X_2, \ldots, X_n$ are independent Bernoulli trials such that $Pr[X_i = 1] = p_i$, $Pr[X_i = 0] = 1 - p_i$, and $\mu = E[X] = \sum_{i=1}^{n} p_i$, where $1 \le i \le n$ and $0 < p_i < 1$. The tail probability is given by

$$Pr\left[\sum_{i=1}^{n} X_i > (1 + \delta)\mu\right] \qquad (1)$$

where $\delta$ is the deviation set by users and is usually a small real number. The Chernoff bound gives a bound, $\epsilon$, on the tail probability of Bernoulli trials as follows:¹

$$\epsilon < \left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu} \qquad (2)$$

Because Equation 2 is not easy to interpret, the bound is often used as Equation 3:

$$\epsilon < e^{-\mu\delta^2/3} \qquad (3)$$

¹ The Chernoff bound comprises a lower bound and an upper bound; in our approach, we apply the upper bound. Details of the Chernoff bound can be found in Reference [20].
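To make the relationship between Equations 2 and 3 concrete, the following minimal Python sketch (ours, not from the paper) evaluates both bounds for one expectation and deviation; Equation 3 is looser, but much easier to reason about.

```python
import math

def chernoff_exact(mu: float, delta: float) -> float:
    # Equation 2: epsilon < (e^delta / (1 + delta)^(1 + delta))^mu
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def chernoff_simple(mu: float, delta: float) -> float:
    # Equation 3: epsilon < e^(-mu * delta^2 / 3)
    return math.exp(-mu * delta ** 2 / 3)

mu, delta = 900, 0.1
print(chernoff_exact(mu, delta))   # ~0.013 (tighter, but harder to interpret)
print(chernoff_simple(mu, delta))  # ~0.050 (looser, simple closed form)
```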

[Figure 1. Example curves for the confidence level computed by Equation 4.]

For convenience of discussion, the confidence level, $1 - \epsilon$, is commonly used:

$$1 - \epsilon > 1 - e^{-\mu\delta^2/3} \qquad (4)$$

One property of the Chernoff bound is that the bound is dependent on the expectation (i.e., $\mu$) and the deviation (i.e., $\delta$), which means that $\epsilon$ is independent of the size of the universe. Another property of the Chernoff bound is that the effectiveness (i.e., the closeness of the Chernoff bound to the actual tail probability) increases with an increase in the expectation. To illustrate the way in which the confidence level changes with the changes in the number of random samples, in the user-defined deviation, and in the expectation, consider the example curves of the confidence levels shown in Figure 1. In the figure, the horizontal axis represents the random sample size, and the vertical axis represents the confidence level. These curves are computed using Equation 4, with deviations $\delta$ = 0.1, 0.05, 0.025 and expectations $\mu$ = 0.95m, 0.90m, 0.8m, 0.7m, where m is the sample size. There are three sets of curves: the top set represents the bounds with $\delta$ = 0.1; the middle set represents the bounds with $\delta$ = 0.05; and the bottom set represents the bounds with $\delta$ = 0.025. In each set, there are four curves: from top to bottom, the curves represent the bounds with $\mu$ = 0.95m, 0.90m, 0.8m, 0.7m, respectively. These curves present a general idea of the number of random samples that are needed to reach certain values of the confidence level, given a specific expectation $\mu$. For $\delta$ = 0.1 (the top set of curves), no more than 1000 samples are needed for the confidence level to be greater than 0.8; for $\delta$ = 0.05 (the middle set of curves), between 2000 and 3000 samples are needed; for $\delta$ = 0.025 (the bottom set of curves), between 8,000 and 10,000 samples are needed.
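The sample counts quoted above can be recomputed directly from Equation 4. The sketch below (our illustration; the helper names are not from the paper) inverts the bound to find the smallest sample size m whose confidence level exceeds a target, for the deviations and expectations used in Figure 1.

```python
import math

def confidence(mu: float, delta: float) -> float:
    """Confidence level 1 - epsilon given by Equation 4."""
    return 1 - math.exp(-mu * delta ** 2 / 3)

def samples_needed(rate: float, delta: float, target: float) -> int:
    """Smallest m with confidence(rate * m, delta) > target, where the
    expectation is mu = rate * m for an expected passing rate `rate`."""
    return math.ceil(-3 * math.log(1 - target) / (rate * delta ** 2))

for delta in (0.1, 0.05, 0.025):
    for rate in (0.95, 0.90, 0.8, 0.7):
        m = samples_needed(rate, delta, target=0.8)
        assert confidence(rate * m, delta) >= 0.8
        print(f"delta={delta}, mu={rate}m: m >= {m}")
```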

3 Evaluating Random Test Selection

In this section, we first present our method for evaluating random test selection using the Chernoff bound and describe the way in which we apply the method. We then discuss three applications of the method for testing modified software.

3.1 Overview of Our Method

To address the issues of measuring the confidence of using randomly-selected test cases to estimate the number of failures remaining in the software, we provide a method that uses the Chernoff bound for evaluating the estimation of the number of failures discovered by testing with random samples. Our method consists of (1) identifying a set of inputs or a test suite T, (2) selecting random samples from T, (3) executing these random samples and computing the passing rate, (4) estimating the passing rate of T, and (5) applying the Chernoff bound to compute the confidence level of the estimation. Thus, rather than running all test cases in T, a random subset T_r can be used for testing the software, to save the cost of the testing. Given a specific confidence level, our method determines the number of random samples that are needed. Given a specific number of random samples, our technique computes the confidence level. Given a test suite, the execution results can be represented as an array of 1s and 0s, where 1 represents a passing test case and 0 represents a failing test case. Then the problem of estimating the failures of a test suite is converted to the problem of estimating the number of 0s in an array of 1s and 0s. In such an array, the elements (i.e., 0s and 1s) are independent of one another, and thus, are Bernoulli trials.
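As a concrete illustration of steps (2) through (4), here is a minimal sketch (ours; the 0/1 universe below is hypothetical) that treats the execution results of T as an array of 1s and 0s and estimates the passing rate from a random sample:

```python
import random

# Hypothetical execution results for a suite T of 50,000 test cases:
# 1 = passing test case, 0 = failing test case.
results = [1] * 48000 + [0] * 2000

m = 1000
sample = random.sample(results, m)  # step (2): select random samples from T
p = sum(sample)                     # step (3): "execute" them; count passes
print(p / m)                        # step (4): estimated passing rate, ~0.96
```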

Therefore, we can apply the Chernoff bound to compute the error bound of estimating the number of 0s. Our method, therefore, is to select a random subset of elements from the array, use the number of 0s in the random samples to predict the number of 0s in the entire array, and apply the Chernoff bound to compute the error bound and confidence level of the prediction. Our method does not assume, and does not need, the distribution of inputs for the universe. Furthermore, the confidence level computed by our method is dependent on the number of random samples, the passing rate of the samples, and the user-defined deviation, but it is independent of the size of the input universe. Thus, our method is scalable.

3.2 Computation of the Confidence Level Using the Chernoff Bound

In this section, we first define the notation used in the rest of the paper (and summarized in Table 1). We then present the way in which we compute the confidence level of random selection using the Chernoff bound.

Table 1. Notation used for computing the confidence level using the Chernoff bound.

  Symbol        Meaning
  T             Test suite
  $\mu$         Passing rate of T
  m             Number of randomly-selected test cases
  p             Number of randomly-selected test cases that pass
  $\mu_r$       Prediction of $\mu$, given by $\mu_r = \frac{1}{1+\delta}\frac{p}{m}$
  $\delta$      Deviation, a user-defined small real number
  $\epsilon$    Error bound computed by the Chernoff bound
  $1-\epsilon$  Confidence level

Suppose T is a test suite. We randomly select m test cases from T and execute them. If p random test cases pass, we can estimate $\mu$ (the passing rate of T) using the Chernoff bound as follows. For each of the m randomly-selected test cases, there is probability $\mu$ that the test case passes and probability $1 - \mu$ that the test case fails. According to Equation 1,

$$Pr[p > (1 + \delta)\mu m] < \epsilon$$

Substituting and rewriting the equation above gives

$$Pr\left[\mu < \frac{1}{1+\delta}\frac{p}{m}\right] < \epsilon$$

that is,

$$Pr\left[\mu > \frac{1}{1+\delta}\frac{p}{m}\right] > 1 - \epsilon$$

According to Equation 4 (see Section 2),

$$Pr\left[\mu > \frac{1}{1+\delta}\frac{p}{m}\right] > 1 - e^{-\delta^2 p/3} \qquad (5)$$

Now, we can use Equation 5 to estimate $\mu$. For example, suppose m = 1000 and p = 900. If we set $\delta$ = 0.1, then

$$Pr\left[\mu > \frac{1}{1+0.1}\cdot\frac{900}{1000}\right] > 1 - e^{-(0.1^2 \cdot 900/3)} > 0.95$$

that is, $Pr[\mu > 0.81] > 0.95$. The interpretation of the result is that the probability is greater than 0.95 that $\mu$ is greater than 0.81. If we set $\delta$ = 0.05, then

$$Pr[\mu > 0.85] > 1 - e^{-(0.05^2 \cdot 900/3)} > 0.53$$

If we need a higher confidence level for the estimation of $\mu$ in this example, we can run more randomly-selected test cases. For example, if m = 2000, $\delta$ = 0.05, and p = 1800, then $Pr[\mu > 0.85] > 0.89$. Note that if no failure is detected, such as for m = 2000 and p = 2000 in the above case, we cannot claim that there is no failure in the entire test suite. In such a case, suppose $\delta$ = 0.05; we compute using Equation 5 as

$$Pr[\mu > 0.95] > 1 - e^{-(0.05^2 \cdot 1000/3)} > 0.92$$

that is, the probability is greater than 0.92 that 95% of the test cases pass.
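The numbers in this example follow mechanically from Equation 5. A minimal sketch (ours) that reproduces the first two calculations:

```python
import math

def estimate(m: int, p: int, delta: float):
    """Prediction mu_r and its confidence level, per Equation 5."""
    mu_r = (1 / (1 + delta)) * (p / m)        # predicted lower bound on mu
    conf = 1 - math.exp(-delta ** 2 * p / 3)  # confidence level 1 - epsilon
    return mu_r, conf

print(estimate(1000, 900, 0.10))  # (~0.818, ~0.950): Pr[mu > 0.81] > 0.95
print(estimate(1000, 900, 0.05))  # (~0.857, ~0.528): Pr[mu > 0.85] > 0.53
```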

3.3 Applications to Regression Testing

Our method has applications for regression testing. To illustrate its usefulness, in this section, we present three example applications.

Operational-profile Testing
As a straightforward application, our method can be applied to a test suite that represents an operational profile of expected usage. This test suite is then used for the regression testing. The operational-profile-based test suite can be created during development for use in estimating the reliability of the software, and the application of the Chernoff bound can provide the confidence level in the estimated reliability for the retesting. Although it is often difficult to create such an operational profile, we can approximate it using the data collected from actual users of the software. This approximation of the operational profile can come from many sources. One source is the set of events recorded in production logs, which can be (and often are) captured. Such events can be used as a

regression test suite for updating systems or migrating systems to new platforms or environments. However, the number of events, and thus, the size of the regression test suite, is often quite large, and rerunning the entire test suite may be too expensive. Our method selects a set of random samples from the entire set of events, and computes the passing rate for this set of samples. Using the number of random samples and the passing rate, the method applies the Chernoff bound to predict the number of passing events and provide the corresponding confidence level for the estimation of the software. The tester can then run a subset of the regression test suite and, within the confidence level, get similar results as running the entire regression test suite. This reduction in the number of test cases to run can save time in the running of the test cases and in the inspection of the outputs.

Coverage-based Testing
Coverage-based testing is used to create and execute test cases that satisfy certain coverage criteria. However, a test suite that satisfies certain coverage criteria can be quite large, especially for complex software. Full coverage using code-coverage criteria may be difficult to achieve. There are tools that can generate partial coverage. For example, Agitar claims to automatically generate test cases with coverage of 80% or better² for unit testing of Java programs. There are also heuristics for reducing the number of test cases for satisfying a coverage criterion. For example, Reference [3] applies covering arrays as a heuristic for reducing the number of test cases for configuration testing. Note that even such a reduced set of test cases generated by these heuristics or tools can still be quite large. For example, creating test cases to satisfy covering arrays is an NP-complete problem [26]. With our method, software testers can select a random subset of test cases and estimate the results of executing all the test cases. For example, consider a relatively small configurable system that has 20 options, each of which has two settings. Such a system requires up to 2^20 (i.e., 1,048,576) test cases to cover all possible configurations. If the software is modified, it may be too expensive to test all configurations. However, our method needs only thousands of test cases to reach a high confidence level for such a large number of test cases (see the sketch below). Furthermore, because of the scalability of our method (i.e., the confidence level computed by our method is independent of the size of the universe), when the number of settings increases, our method does not require an increase in the number of test cases. Thus, our method significantly reduces the cost of testing but provides information about the thoroughness of software testing.
² http://www.agitar.com/
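To see that the required sample size in the configuration example does not depend on the 2^20 universe size, here is a small sketch (ours; the 0.9 expected passing rate is an assumption for illustration) that solves Equation 4 for m:

```python
import math

def samples_needed(rate: float, delta: float, target: float) -> int:
    """Smallest m such that 1 - exp(-(rate * m) * delta^2 / 3) > target."""
    return math.ceil(-3 * math.log(1 - target) / (rate * delta ** 2))

# 2**20 = 1,048,576 possible configurations, yet m depends only on the
# deviation, the target confidence, and the expected passing rate.
print(samples_needed(rate=0.9, delta=0.1, target=0.95))  # 999 test cases
```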

Regular-build Testing
Regular-build tests, or smoke tests, are sets of test cases that are created for use in testing the built system often. Thus, they are usually run at specified times, possibly daily or after the software is changed, as the software is being developed, to find failures in the software early in the development. As the software system increases in size and complexity, these regular-build or smoke-test suites can increase in size so that the entire suite cannot be run in the time allotted for the testing. Our method can be used to select a random set of test cases from the test suite, and provide confidence in estimating the faultiness that would be found by running the entire suite. This approach can reduce the cost of the testing so that it can be performed in a reasonable time (e.g., overnight).

4 Empirical Studies

To evaluate the effectiveness of the Chernoff bound for estimating random test selection, we performed two empirical studies. The first study uses a set of eight C programs that have been used in many testing studies. The results demonstrate the applicability and the effectiveness of the Chernoff bound for use in random test selection. The second study uses three GCC compilers. The study illustrates the way in which our method can be applied to real-world applications, and the results further demonstrate the effectiveness of our method.

4.1 Study 1

The goal of this study is to demonstrate the applicability and effectiveness of the Chernoff bound applied to random test selection, by showing empirically that the error rate of estimating results of the random selection is bounded by the Chernoff bound. For the study, we used the set of programs shown in Table 2. In the table, the first column lists the programs. For each program, the second column lists the number of faulty versions (each version contains one fault), the third column lists the number of lines of code (averaged over the program and its versions), the fourth column lists the number of test cases in the universe, and the last column gives a brief description of the program. In all, the eight programs and their versions result in 190 different programs.

Table 2. Subject programs used for Study 1.

  Program         Faulty Versions   LOC    Test Cases   Description
  print_tokens          7            472      4056      lexical analyzer
  print_tokens2        10            399      4071      lexical analyzer
  schedule              9            292      2650      priority scheduler
  schedule2            10            301      2680      priority scheduler
  tcas                 41            141      1578      altitude separation
  replace              32            512      5542      pattern replacement
  tot_info             23            440      1054      information measure
  space                58           6218     13585      array definition interpreter

The first seven programs in the table, along with their versions and inputs, were assembled at Siemens Corporate Research for a study of the fault-detection capabilities of control-flow and data-flow coverage criteria [15], and the programs have been used for many testing and debugging studies. The last program, space, was developed by the European Space Agency. A test suite for space was constructed from 10,000 test cases generated randomly by Vokolos and Frankl [28] and 3,585 test cases created by researchers at Oregon State University. To describe the results of this study, we used the notations listed in Table 1, along with the newly defined notations shown in Table 3.

Table 3. Additional notation (over that in Table 1) used to describe the empirical studies.

  Symbol   Meaning
  S        Test suite distributed with a subject program
  N        Size of S
  P        Number of test cases in S that pass
  S^r      Test suite randomly selected from S
  r        Number of random test suites
  k        Number of test suites where $\mu \le \mu_r$ (i.e., the actual passing rate is outside the prediction made by the random samples)
  error    The error rate for predictions of random samples, error = k/r

We used the following process to perform the experiment:
1. For each subject program, we created 1000 test suites S^r by randomly selecting test cases from S, which resulted in 8000 random test suites S^r (i.e., r = 8000). Each S^r contains m test cases, and we determined the value of m using N:
   (a) if N ≥ 2000, m = 1000 (e.g., if N = 5000, then m = 1000)
   (b) otherwise, m = N/2 (e.g., if N = 1500, then m = 750)
2. We executed the 8000 random test suites S^r, collected the execution data for the faulty versions, and computed p.
3. For each S^r, we computed $\mu_r$ and $\epsilon$ using the Chernoff bound, with deviations $\delta$ = 0.05 and $\delta$ = 0.1.
4. For each $\mu_r$, we compared it with $\mu$. If $\mu \le \mu_r$ (i.e., the actual passing rate is outside the prediction), k = k + 1.
5. We computed error = k/8000.
6. We compared error and $\epsilon$.

Informally, in the process above, first, we executed 8000 randomly created test suites S^r by selecting test cases from the test suites S distributed with the subjects. Then, we applied the data for the 8000 executions to predict the passing rate of S, and applied the Chernoff bound to compute the error bound $\epsilon$. Finally, we verified the accuracy of the Chernoff bound by comparing $\epsilon$ with error, where error = k/8000 and k is the number of S^r's whose results are outside the passing-rate prediction made by the Chernoff bound. The Chernoff bound is accurate in evaluating random testing if error < $\epsilon$ in all the 8000 random test suites.

Tables 4 and 5 show the results of the study, with $\delta$ = 0.1 and $\delta$ = 0.05, respectively, which are conventional deviations used by statistical methods. Because of the large number of executions, the tables partition the data into five groups based on $\mu$. The first column shows the values of $\mu$; the second and third columns list the values of $1-\epsilon$ and $\epsilon$, respectively; the fourth and fifth columns list the values of k and error, respectively. For example, the data in the first row of Table 4 show that, for the test suites whose $\mu$ is greater than 80%, the actual error rate is 0, whereas the error bound is 0.07.

Table 4. Results of Study 1, with $\delta$ = 0.1.

  $\mu$     $1-\epsilon$   $\epsilon$   k      error
  ≥ 80%       0.93           0.07       0      0
  70-79%      0.90           0.09       0      0
  60-69%      0.86           0.14       0      0
  50-59%      0.81           0.19       1      0
  < 50%     < 0.81         > 0.19       1415   0.18

Table 5. Results of Study 1, with $\delta$ = 0.05.

  $\mu$     $1-\epsilon$   $\epsilon$   k      error
  ≥ 80%       0.49           0.51       0      0
  70-79%      0.44           0.56       17     0
  60-69%      0.39           0.61       40     0.01
  50-59%      0.34           0.66       99     0.01
  < 50%     < 0.34         > 0.66       2663   0.33

From the tables, we can observe two facts. First, $\epsilon$ is always greater than error. This result means that, for the subjects and test cases we studied, errors in estimating random test cases are bounded by the error bound computed by the Chernoff bound. This result demonstrates the applicability of the Chernoff bound to random test selection. Second, the value of error decreases as the value of $\mu$ increases. This result supports the property of the Chernoff bound discussed in Section 2: the effectiveness of the Chernoff bound increases with the increase in the expectation. Note that the goal of this study is to demonstrate that the error bounds computed by the Chernoff bound are effective; the values of the error bound themselves are not the focus. In some cases, the error bounds in Table 5 might be too high to be useful in practice. To lower the error bound, we can either increase the deviation $\delta$ or increase the number of random tests in each test suite.
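As a sanity check on the procedure above, the following sketch (ours; the synthetic suite with a 90% passing rate is hypothetical) replays the Study 1 loop on one subject and compares the empirical error rate with the Chernoff bound:

```python
import math
import random

def study1(results, m, delta, runs=1000):
    """Replay the Study 1 procedure on one subject: draw `runs` random
    suites of size m, count violated predictions, and compare the
    empirical error rate with the error bound epsilon (Equation 3)."""
    mu = sum(results) / len(results)             # actual passing rate of S
    k = 0
    for _ in range(runs):
        p = sum(random.sample(results, m))       # passes in one random S^r
        mu_r = (1 / (1 + delta)) * (p / m)       # prediction of mu
        if mu <= mu_r:                           # actual rate outside prediction
            k += 1
    eps = math.exp(-delta ** 2 * (mu * m) / 3)   # Chernoff error bound
    return k / runs, eps                         # expect error <= eps

universe = [1] * 4500 + [0] * 500                # synthetic suite, mu = 0.9
print(study1(universe, m=1000, delta=0.1))       # e.g., (0.0, ~0.05)
```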

4.2 Study 2

The goal of this study is to show the benefits of using random test selection on real-world, large-scale software. For the study, we used the GNU Compiler Collection (GCC), which is a set of compilers that handle multiple programming languages on a variety of target processors.³ After many years of evolution, GCC has accumulated tens of thousands of test cases, which were created using DejaGnu, a framework for testing.⁴ A new version of GCC is tested by all the test cases and is released only when a small number of test cases fail. Furthermore, it is recommended that the software be tested on all possible targets to identify the failing test cases before a new version is released.⁵ However, it is unlikely that such thorough testing will be done because of the large number of test cases and possible platforms. For our study, we used Red Hat Enterprise Linux 4 on the x86_64 platform, and we used three GCC 4.1.0 compilers as the subjects: gcc, g++, and gfortran. We used as the test pool for our study all the test cases designed for, and included with, this subject. We seeded two faults in the compilers, one at a time, to simulate a stage before the final release; we refer to the faults as Fault 1 and Fault 2. During this stage our method can be applied to estimate the possible number of failures in all the test cases, and can help to decide whether the software is ready to ship. For each subject, we used the following process to perform the experiment (see Table 1 for the meanings of the notations):
1. We executed all N test cases, collected the data from their executions to determine P, and computed $\mu$.
2. We executed m test cases randomly selected from the N test cases and computed p.
3. We computed $\mu_r$ and $\epsilon$.

Tables 6 and 7 show two sets of results of estimating the passing rate of GCC's test cases with different seeded
³ There are 51 target processors supported by version 4.1. http://en.wikipedia.org/wiki/GNU_Compiler_Collection
⁴ http://www.gnu.org/software/dejagnu/
⁵ http://gcc.gnu.org/install/test.html

Table 6. The first set of estimations for the passing rate of GCC 4.1.0 (with Fault 1).

                                   gcc      g++      gfortran   total
  N                                30646    11782    10046      53474
  P                                23486    11559     8385      43430
  $\mu$                             0.74     0.98     0.84       0.81
  $\delta$=0.1,  m=1000:  $\mu_r$   0.67     0.89     0.77       0.74
                 $1-\epsilon$      97.49%   99.27%   98.58%     98.27%
  $\delta$=0.1,  m=2000:  $\mu_r$   0.68     0.89     0.76       0.73
                 $1-\epsilon$      99.99%   99.9%    99.9%      99.9%
  $\delta$=0.05, m=2000:  $\mu_r$   0.71     0.93     0.81       0.75
                 $1-\epsilon$      83.99%   91.3%    87.9%      86.1%
  $\delta$=0.05, m=4000:  $\mu_r$   0.70     0.93     0.79       0.77
                 $1-\epsilon$      97.48%   99.25%   98.46%     98.3%

Table 7. The second set of estimations for the passing rate of GCC 4.1.0 (with Fault 2).

                                   gcc      g++      gfortran   total
  N                                38112    12074    11670      61856
  P                                37746    12050    11662      61458
  $\mu$                             0.99     0.99     0.99       0.99
  $\delta$=0.1,  m=1000:  $\mu_r$   0.90     0.91     0.91       0.90
                 $1-\epsilon$      99.29%   99.32%   99.32%     99.30%
  $\delta$=0.1,  m=2000:  $\mu_r$   0.90     0.91     0.91       0.90
                 $1-\epsilon$      99.99%   99.99%   99.99%     99.99%
  $\delta$=0.05, m=2000:  $\mu_r$   0.94     0.95     0.95       0.95
                 $1-\epsilon$      91.57%   91.77%   91.79%     91.70%
  $\delta$=0.05, m=4000:  $\mu_r$   0.94     0.95     0.95       0.95
                 $1-\epsilon$      99.29%   99.32%   99.32%     99.30%

faults. In the two tables, the second through fourth columns show the results for a specific subject, and the last column shows the results of considering the three subjects as one subject. The second row (labeled N) shows the number of test cases executed on a specific subject. The third row (labeled P) shows the number of passing test cases. The fourth row (labeled $\mu$) shows the value of the actual passing rate. The fifth through eighth rows show the predicted passing rate and the corresponding confidence level, computed by the Chernoff bound with different values of m and $\delta$. For example, in Table 6, the second column shows that there are 30,646 test cases for gcc, the number of passing test cases is 23,486, and the actual passing rate is 0.74. Using the Chernoff bound, we predict that the passing rates are 0.669, 0.678, 0.71, and 0.700, with confidence levels 97.49%, 99.99%, 83.99%, and 97.48%, respectively, with different combinations of m and $\delta$. Note that in the fifth and the sixth rows, the predictions and confidence levels are computed using the same value of $\delta$. According to Equation 5, we would have the same

prediction of the passing rate $\mu_r$. However, because we randomly selected test cases, $\mu_r$ is different in different runs. This difference accounts for the difference in random selection. The seventh and eighth rows show the same observation. Also note that the values of N (i.e., the number of test cases used to test a specific subject) differ in Tables 6 and 7, although the same test cases were executed on the same subject. The reason for the difference is that the outcome of test cases created by DejaGnu can have different status values, such as pass, fail, unexpected pass, expected fail, unsolved, and unsupported. We consider status values of pass, unexpected pass, and expected fail to be correct executions of the program under test, and a status value of fail to be an incorrect execution of the program (i.e., a detection of a failure). We do not consider the remaining status values, such as unsolved and unsupported, which can represent different meanings. For example, the status value of unsolved means that investigation by a human is needed, and the status value of unsupported means that a test case depends on some facility that does not exist in the testing environment.⁶ For the two experiments we conducted, the numbers of these remaining status values are different, and thus, the values of N are different in the two experiments. Table 6 shows that 1000 random test cases can provide a high confidence level (i.e., greater than 97%) when $\delta$ = 0.1 for all subjects. The table also shows that 2000 random test cases can provide a confidence level greater than 90% when the passing rate is high (e.g., 0.98), and that 4000 random test cases can provide a high confidence level (i.e., greater than 97%) when $\delta$ = 0.05 for all subjects. From Table 7, we can see that with a higher passing rate (i.e., higher expectation), 2000 random test cases can provide a confidence level greater than 90% for both $\delta$ = 0.05 and $\delta$ = 0.1. The results confirm the properties of the Chernoff bound: the confidence level increases with the number of random test cases and the passing rate, and decreases as the deviation $\delta$ decreases. This table also shows that when we consider all three subjects as one subject, the same number of random test cases provides the same confidence level as when the three subjects are viewed as individual subjects. This result demonstrates the scalability of our method: because the confidence level is independent of the size of the universe, our method can be applied to large-scale applications without increasing the number of random test cases.

⁶ Details of the test outcomes of DejaGnu can be found at http://ftp.gnu.org/old-gnu/Manuals/dejagnu1.3/html_chapter/auldejagnu_1.html#SEC1

5 Related Work

The Chernoff bound has been widely used in randomized algorithms for estimating the error [20]. There are other inequalities for measuring the error bound. One such inequality is the Chebyshev Inequality [20]. Compared to the Chebyshev Inequality, the Chernoff bound is a tighter bound if the majority (i.e., the expectation $\mu$ in Equation 4) is close to 1. Because it is expected that a small percentage of test cases fail, the Chernoff bound is tighter than the Chebyshev Inequality for estimating the number of failures remaining in a system. There are machine-learning framework methods, such as cross validation [9], bootstrapping [11], and bagging [2], for measuring errors. The common idea of these methods is the iterative selection of random samples from a data set to estimate the patterns of the data. The error bounds computed by these framework methods can be tighter than the Chernoff bound. These methods are similar to general random selection in that the more random samples (or iterations) that are selected, the more reliable is the prediction of the samples. However, unlike the Chernoff bound, these framework methods have the same practical issues as random selection: there are no prescribed numbers of iterations and random samples for these methods to reach a certain confidence level. In contrast, our method uses the Chernoff bound, which can provide a confidence level for a given number of random samples. Frankl and colleagues [12] and Hamlet [14] investigate using an operational profile as a large test suite for testing the software. They believe that even if a fault exists in the software, if the fault is never executed (i.e., inputs that invoke the fault are excluded from an operational profile), users of the software are not concerned about the fault. Thus, they propose random selection of test cases from the operational profile to estimate the number of failures in the test suite. However, unlike our method, they do not address the practical issue of a confidence level. Mankefors and colleagues [18] propose a method for estimating the number of failures remaining in the software using randomly-selected samples. This method applies Monte Carlo integration and the central limit theorem to compute the confidence level of the random selection. This method does not assume uniform distribution of the input universe. However, the benefits of this method need to be justified. First, it is unclear why integration is used to represent the sum of discrete events (i.e., execution results of individual test cases), in that on the right-hand side of the Monte Carlo integration, the value of the integration is simply the sum of the number of passed test cases, and the error term is the deviation of the random samples. Second, the method computes the confidence level based on the central limit theorem (i.e., as the sample size increases, the sample mean is approximately normally distributed). However, like many other random selection techniques, there is no specific number of samples corresponding to a certain confidence level. Our method is different from References [12, 14, 18] in that, given a specific

number of random samples, ours provides a bounded confidence level. There are a number of techniques that have investigated the evaluation of random testing. Duran and Ntafos [10], Hamlet [14], and Weyuker and Jeng [29] investigated the probability that random testing reveals at least one failure. Chen and Yu [5] and Duran and Ntafos [10] studied the expected number of failures detected by random testing. Our work differs from References [10, 29] in that they assume that testing stops when a failure is detected. Our work does not require such an assumption, which is unlikely to be realistic because the testing usually does not stop when a failure is detected. The work in References [5, 10] assumes the input universe is uniformly distributed, but our work does not. Thus, ours can be more practically useful because the input distributions can be complex and difficult to determine in today's software systems. One general type of research on random testing is the automatic generation of random test cases. Several researchers [13, 17, 25] combine random testing with symbolic testing to create test cases aimed at improving code coverage. Others [22, 23] make use of program-execution information to generate random test cases. Still others [1] focus on the automatic generation of random test inputs for programs with complex inputs. A final type of research on random testing explores the discovery of failures by randomly selecting test cases from a large set of test cases using adaptive selection techniques [4, 8, 19]. Our method differs from these methods in that it does not generate random test cases. Instead, given a large set of test cases generated by existing techniques [1, 13, 17, 22, 23, 25], our technique can reduce the cost of executing all the test cases by running a randomly-selected subset of the test cases and providing the confidence level of the random selection.

6 Conclusion and Future Work

In this paper, we present a new method for evaluating random regression-test selection using the Chernoff bound. Our method addresses two open practical questions: (1) how many test cases are needed to reach a certain level of confidence in the software and (2) what confidence level can a specific number of random samples provide. Our method does not require input-distribution information, which makes our method practical, and the confidence level computed is independent of the size of the universe used for the testing, which makes our method scalable. Our method can help to reduce the cost of different types of testing that run test suites containing a large number of test cases. We illustrate three example applications in this paper: operational-profile testing, coverage-based testing, and regular-build testing. The common thread in these

applications is the running of a subset of test cases that are randomly selected from the test suites and the providing of the confidence level of using such randomly-selected test cases. We investigated the effectiveness of our method using a set of empirical studies. We first conducted experiments on eight C programs that have been used in many testing experiments. The results demonstrate the effectiveness of our method. We then conducted experiments on three GCC compilers. The results demonstrate the benefits of our method for real-world, large-scale software. Our method applies to pure random selection but not to adaptive random selection or random selection that is mixed with preferences. For instance, in the example application of coverage-based testing for configuration testing, users can apply our method to randomly-selected configuration settings. However, if users select configuration settings with certain preferences (e.g., selecting settings whose legal values are easy to set), they cannot use our method to compute the confidence level based on these selections. Our method has several limitations. First, the confidence level computed by the Chernoff bound can be wide. For example, in Table 4, when the passing rate is above 60% (the fourth row), the confidence level is 0.864 (the second column), whereas the actual error is 0. However, as the passing rate increases, the confidence level becomes tighter, according to the property of the Chernoff bound (see Section 2 for details). For instance, in Table 7, where all passing rates are above 90%, most corresponding confidence levels are above 99% and none is below 90%. The passing rate is high in real-world applications, and thus, the results computed by our method can still be tight in practice. Second, the test suites used in our case studies may contain redundant test cases. In our case study of the GCC compilers, one of the simplest types of redundancy is that a test case appears several times because the same test case was added by different developers. When redundancy exists in the test suite, the independence of test cases can be weak and the prediction can be overestimated. One of our areas of future work is to remove redundant test cases from a test suite. However, although the Chernoff bound can be wide (as discussed above), the wide bounds provide our method with the capacity to tolerate redundancy. Thus, the results of our method can still present an important indication of the number of failures, even when redundancy exists in a test suite. Third, although our empirical results are encouraging, there is a threat to the validity of our studies in that the subjects used in our case studies may not generalize to other subjects. Although we conducted studies on a set of C programs including one large C program, we need to apply the method to more and larger subjects along with different kinds and numbers of faults. We plan to address this issue in future work, and perform additional studies.

Acknowledgements

This work was supported in part by the National Science Foundation under awards CCR-0205422, CCF-0429117, and CCF-0541049, and by Tata Consultancy Services, Ltd. Santosh Vempala, College of Computing, Georgia Institute of Technology, provided helpful advice on the statistical analysis that we used for our work. The anonymous reviewers provided many comments and suggestions that improved the presentation of the work.

References
[1] C. Boyapati, S. Khurshid, and D. Marinov. Korat: Automated testing based on Java predicates. In Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 123-133, July 2002.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[3] R. Brownlie, J. Prowse, and M. S. Phadke. Robust testing of AT&T PMX/StarMAIL using OATS. AT&T Technical Journal, 71(3):41-47, 1992.
[4] T. Y. Chen and R. Merkel. Quasi-random testing. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, pages 309-312, May 2005.
[5] T. Y. Chen and Y. T. Yu. On the expected number of failures detected by subdomain testing and random testing. IEEE Transactions on Software Engineering, 22(2):109-119, February 1996.
[6] Y. F. Chen, D. S. Rosenblum, and K. P. Vo. TestTube: A system for selective regression testing. In Proceedings of the 16th International Conference on Software Engineering, pages 211-222, May 1994.
[7] P. K. Chittimalli and M. J. Harrold. Regression test selection on system requirements. In 1st India Software Engineering Conference (ISEC 2008), pages 87-96, Hyderabad, India, February 2008.
[8] I. Ciupa, A. Leitner, M. Oriol, and B. Meyer. ARTOO: Adaptive random testing for object-oriented software. In Proceedings of the 30th International Conference on Software Engineering, pages 71-80, May 2008.
[9] P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, 1982.
[10] J. Duran and S. Ntafos. An evaluation of random testing. IEEE Transactions on Software Engineering, 10(4):438-444, July 1984.
[11] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1-26, 1979.
[12] P. G. Frankl and Y. Deng. Comparison of delivered reliability of branch, data flow and operational testing: A case study. In Proceedings of the 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 124-134, August 2000.
[13] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, pages 213-223, June 2005.
[14] R. Hamlet. Random testing. In J. Marciniak, editor, Encyclopedia of Software Engineering, pages 970-978. Wiley, 1994.
[15] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. In Proceedings of the 16th International Conference on Software Engineering, pages 191-200, May 1994.
[16] D. Kung, J. Gao, P. Hsia, Y. Toyashima, and C. Chen. Firewall regression testing and software maintenance of object-oriented systems. 1994.
[17] R. Majumdar and K. Sen. Hybrid concolic testing. In Proceedings of the 29th International Conference on Software Engineering, pages 416-426, May 2007.
[18] S. Mankefors, R. Torkar, and A. Boklund. New quality estimations in random testing. In Proceedings of the 14th International Symposium on Software Reliability Engineering, page 468, November 2003.
[19] J. Mayer. Lattice-based adaptive random testing. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, pages 333-336, May 2005.
[20] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[21] A. Orso, N. Shi, and M. J. Harrold. Scaling regression testing to large software systems. In Proceedings of the 12th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE 2004), pages 241-252, November 2004.
[22] C. Pacheco and M. D. Ernst. Eclat: Automatic generation and classification of test inputs. In 19th European Conference on Object-Oriented Programming (ECOOP 2005), pages 504-527, July 2005.
[23] C. Pacheco, S. K. Lahiri, M. D. Ernst, and T. Ball. Feedback-directed random test generation. In 29th International Conference on Software Engineering, pages 75-84, May 2007.
[24] G. Rothermel and M. J. Harrold. A safe, efficient regression test selection technique. ACM Transactions on Software Engineering and Methodology, 6(2):173-210, April 1997.
[25] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. In 5th Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2005), pages 263-272, September 2005.
[26] G. Seroussi and N. H. Bshouty. Vector sets for exhaustive testing of logic circuits. IEEE Transactions on Information Theory, 34(3):513-522, May 1988.
[27] F. Vokolos and P. Frankl. Pythia: A regression test selection tool based on text differencing. In International Conference on Reliability, Quality and Safety of Software Intensive Systems, May 1997.
[28] F. Vokolos and P. Frankl. Empirical evaluation of the textual differencing regression testing technique. In Proceedings of the International Conference on Software Maintenance, pages 44-53, November 1998.
[29] E. J. Weyuker and B. Jeng. Analyzing partition testing strategies. IEEE Transactions on Software Engineering, 17(7):703-711, July 1991.
