Professional Documents
Culture Documents
Better Testing Through Oracle Selection (NIER Track)
Better Testing Through Oracle Selection (NIER Track)
892
spite the apparent difficulty of doing so—by developing test ora- oracle. We are interested in how the addition of internal state infor-
cles based on methods other than concretely specified oracles [14]. mation to the oracle data influences the effectiveness of the testing
Baresi and Young evaluate several examples of these (and other) process. We term oracles consisting of all the outputs and one or
approaches to addressing the oracle problem [2]. more internal state variables output-base oracles, and term an ora-
In our domain of interest, however, concrete oracles with ex- cle considering all outputs and internal state variables the maximum
pected values manually specified by users are generally used. We oracle. We explore how the addition of internal state variables influ-
are therefore interested in determining which variables should be ences the effectiveness of the testing process by varying the oracle
considered by such an oracle. Our study explores how the selection data, ranging from the output-only oracle to the maximum oracle.
of these variables influences the effectiveness of the testing process.
To the best of our knowledge, only two other works (by Memon et
al. and Briand et al. [16, 4]) empirically study oracle selection.
4. PAIRING TEST DATA WITH ORACLES
Memon et al. define oracles for GUI systems using combina- Consider the process of testing a system. Suppose we plan to
tions of oracle information and procedures, and examine their fault use two test suites: (1) a test suite generated to meet the Modified
finding effectiveness using mutation testing [16]. Our work differs Condition/Decision Coverage (MCDC) criterion [5], and (2) a test
primarily in the method used to vary the oracle (we do not vary suite generated to meet the Unique First Cause (UFC) coverage cri-
the oracle procedure), the analyses performed, and the domain in terion [15], a black box test coverage criterion based on exercising
which the studies were conducted. In particular, while test cases Linear Temporal Logic (LTL) requirements. Now consider devel-
of varying length are explored by the authors, the joint influence of oping a test oracle for these test suites, and ask: what should the
test suite and oracle is not explored in the same manner or detail. oracle data be? One common answer in the domain of avionics,
Briand et al. demonstrate that in the domain of object-oriented and perhaps in software engineering research in general, is to sim-
systems: (1) a concrete, precise oracle outperforms a state-based ply defined expected values for all of the outputs, irrespective of
oracle defined as state invariants for the case examples used, (2) the test suite used.
there exist faults only detectable when using both precise oracles For M CDC, the tests are designed to exercise internal code
and rigorous coverage metrics, and (3) the cost effectiveness of structures. Accordingly, it is possible for each test to be very short,
precise oracles varies from system to system [4]. Our work dif- with its impact limited to the Boolean decision it is designed to
fers primarily in the method used to vary the oracle, the analysis exercise. In contrast, for U F C, the tests are designed to exercise
performed, and the domain in which the studies were conducted. functional requirements. These requirements are defined in terms
Voas has developed the propagation, infection, and execution of inputs and outputs, and tests, thus, tend to be longer, as they must
technique (PIE) for computing testability metrics for locations in typically cause the system to exhibit a sequence of behaviors.
source code. Of relevance here is Voas and Miller’s proposal to Given the differences between M CDC and U F C, can we not
determine where faults may be masked using PIE and monitor in- tailor a test oracle to their varying characteristics? It seems reason-
ternal state accordingly (using assertions) [13]. As PIE essentially able to use a test oracle considering internal state information when
performs an empirical study to determine masking, it is similar to using M CDC, as the tests are designed specifically to manipulate
running our empirical study to find internal variables to monitor. internal behavior, and might not cause information to propagate
However, no large scale empirical study has been conducted to de- to outputs. When using U F C, however, the tests are designed to
termine the efficacy of using the PIE technique for oracle selection. test black-box behavior, and, thus, seem more likely to propagate
information to the outputs—considering internal state information
seems less less valuable when using U F C.
3. TEST ORACLES In this example, we have paired a single test oracle with a test
We use the definition presented by Richardson et al. that divides suite. However, we can take this idea to its logical extreme by
test oracles into two parts: the oracle information and the oracle considering pairing a separate test oracle with each test input. For
procedure [11] . The oracle information specifies what is consid- example, if a test input T is designed to exercise a branch deep in
ered correct behavior, and the oracle procedure verifies that test the source code, we could consider only variables related to this
executions correspond to the oracle information. branch when running test T . When running a test T 0 exploring
Our case examples are all avionics systems developed in Simu- a different branch deep in another portion of the source code, we
link [8] and translated to Lustre [6]. Simulink and Lustre are ex- might consider a completely different separate set of variables.
amples of synchronous dataflow languages [6]. A test input for
a synchronous reactive system is defined as a sequence of steps,
with each step specifying a value for each input variable. This pro- 5. STUDY AND RESULTS
duces a corresponding sequence of values of the system’s internal In our study, we were interested in how the use of internal state
state variables and outputs. In our study, the oracle information is information in oracle data influences the effectiveness of the testing
specified as sequences of test inputs with corresponding expected process, as measured by the number of faults detected (as defined
changes to the AUT output variables and internal state variables, by [1]).
i.e., input/expected value(s) pairs. The oracle procedure simply
matches the actual results from the AUT with the oracle informa- 5.1 Experimental Setup Overview
tion. We term oracles constructed in this fashion as expected value We used four industrial synchronous reactive systems developed
oracles. In our experience with industrial partners, expected value by Rockwell Collins Inc. in our experiment. Two systems, DWM_1
oracles are commonly used in testing synchronous reactive systems and DWM_2, represent portions of a Display Window Manager
in the avionics domain. (DWM) for a commercial cockpit display system. Two systems
A key property of expected value oracles is the subset of outputs represent various aspects of a Flight Guidance System (FGS), which
and internal state variables considered by the oracle information; is a component of the overall Flight Control System (FCS) in a
variables that we term the oracle data. Currently, from our ob- commercial aircraft. The two FGS systems in this paper focus on
servations, the prevalent oracle data used in testing critical systems the mode logic of the FGS. The Vertmax_Batch and Latctl_Batch
consists of all of the outputs. We term such an oracle an output-only systems describe the vertical and lateral mode logic for a flight
893
Relative Improvement in FF (%) OO Size MX Size Rel. Size Inc.
10 15 20 25 30
DWM_1 9 53 489%
DWM_2 DWM_1
DWM_2 7 130 1757%
Latctl Vert
Latctl_Batch 1 23 2200%
Vertmax_Batch 2 79 3850%
We can also see the maximum oracle is between 4.8 and 133
0 200 400 600 800 1000 times larger than the output-only oracle—a substantial increase in
oracle size for potentially modest gains in fault finding. This high-
Test Suite Size
lights the need for effective methods of oracle selection: the fault
Figure 1: Relative FF Improvement Across Test Sizes
finding improvements are desirable, particularly in the domain of
critical systems, but using the maximum oracle is likely more ex-
guidance system. All represent sizable, operational systems. pensive than using the maximum output-only oracle. To gain these
For each case example, we performed the following steps: improvements without incurring potentially unacceptable increases
1. Generated random test data: We generated random inputs for in testing cost, we must develop methods of determine which com-
1,000 tests of test lengths between 2-10 steps. bination of variables is most effectively used in an oracle.
2. Generated mutants: We generated 200 mutants, each contain- Note that for several of the test-oracles/test-suite combinations,
ing a single fault, and removed functionally equivalent mutants. we manually examined how internal state improved the effective-
3. Ran test suite on mutants: We ran each mutant and the original ness of testing, and found that each internal variable reveals a small
case example using the entire randomized test suite and collected number of additional faults not caught by the output-only oracle,
the internal state and output variable values produced at every step. with many internal variables revealing no additional faults. Fur-
This yields raw data used for the remaining steps in our study. thermore, the set of additional faults caught is nearly always dis-
4. Generated oracles: We generated a sets of oracles ranging, joint between variables. Thus, while the maximum oracle is con-
in size from the output-only oracle to the maximum oracle. We siderably larger than the output-only oracle, only a small portion
randomly selected which internal variables were to be included in of the added internal state variables contribute to improving testing
an oracle. We also generated all oracles of size 1 (i.e., observing 1 effectiveness.
variable).
5. Generated test suites: We randomly generated 36 subsets of 5.3 Test Suites, Oracles, and Fault Finding
the randomized test suite, with subsets containing 10 to 1,000 tests We present the relationship between test suite size, oracle size
from the original test suite. and fault finding as a contour map in Figure 2 for the DWM_1
6. Assessed fault finding ability of each oracle and test suite and DWM_2 case examples. As with Figure 1, these figures have
combination: We determined how many mutants were detected by been smoothed using LOESS. In these figures, the light areas repre-
every oracle and a reduced test suite combination. Note that the full sent combinations of test suites and oracles that reveal a relatively
results of our analysis are available online. high number of faults, and dark areas represent combinations of
These steps are routine in experiments based on mutation testing; test suites and oracles that reveal a relatively low number of faults.
in this case, the details are similar to the steps used in [10]. Each contour line represents a constant level of fault finding. (The
One relevant risk of using mutation testing in examining these straight dashed line is explained below.) For example, we can
questions is functionally equivalent mutants, in which faults ex- see that for the DWM_1 system, 800 tests and the largest output-
ist but these faults cannot cause a failure, which is an externally only oracle reveals roughly the same number of faults as 200 tests
visible deviation from correct behavior. For our study, we used and an output-base oracle containing 60% of the internal variables
model checking to detect and remove functionally equivalent mu- (166 faults), whereas for the DWM_2 system, 800 tests and and the
tants. This is made possible due to our use of synchronous reactive largest output-only oracle reveals more faults than even 600 tests
systems as case examples—each system is finite, and, thus, deter- with the maximum oracle.
mining equivalence is decidable. Given this information and a suitable cost function, we can de-
After conducting our study, we performed a number of analyses. termine what pairing of test suite and oracle maximizes the number
Here we examine those analyses that we believe support the idea of of faults revealed. For example, assume that the testing costs are
pairing test inputs and test oracles. such that using a test suite of size x with the maximum oracle and
a test suite of size 3x with the output-only oracle cost the same.
5.2 Improvement Using Maximum Oracle Furthermore, assume that the testing resources allocated allow us
In Figure 1 we plot the relative improvement in fault finding to use 900 tests with the output-only oracle. Plotting this cost func-
when using the maximum oracle over the output-only oracle for tion yields the straight dashed lines present in Figure 2. Examining
each test suite (smoothed with LOESS using a smoothing factor of these lines, we can determine that the most effective, affordable
0.70). In Table 1, we list the oracle sizes for each oracle and the test suite and oracle combination for the DWM_1 system is 410
relative increase in oracle size when using the maximum oracle. tests with an oracle containing 80% of internal variables, whereas
For each case example, we can see that the relative improvement the most effective, affordable combination for the DWM_2 system
when using a maximum oracle starts fairly high, between 4% and is 900 tests with the output-only oracle.
27%, but drops off quickly as we near 200 tests, dropping below As shown, given a set cost function and finite resources, the most
10% for all case examples at 1,000 tests. This indicates that for effective test suite and oracle combination can vary depending on
894
1000
1000
800
800
600
600
# Tests
# Tests
400
400
200
200
10
10
0 20 40 60 80 100 0 20 40 60 80 100
Figure 2: Combined Effect of Oracle Size and Test Suite Size on Fault Finding
how test suite and oracle size jointly influence the number of faults [4] L. Briand, M. DiPenta, and Y. Labiche. Assessing and
revealed. For example, if we were to reverse our test suite and or- improving state-based class testing: A series of experiments.
acle selections in the previous paragraph—i.e., use 900 tests with IEEE Trans. on Software Engineering, 30 (11), 2004.
the output-only oracle for the DWM_1 system and 300 tests with [5] J. J. Chilenski and S. P. Miller. Applicability of Modified
the maximum oracle for the DWM_2 system—we would find ap- Condition/Decision Coverage to Software Testing. Software
proximately 5.6% fewer faults for the DWM_1 system and 11.1% Engineering Journal, pages 193–200, September 1994.
fewer faults for the DWM_2 system. [6] N. Halbwachs. Synchronous Programming of Reactive
Systems. Klower Academic Press, 1993.
6. CONCLUSION [7] M. Harrold. Testing: a roadmap. In Proc. of the Conf. on the
We believe this study demonstrates an opportunity to improve Future of Software Engineering, pages 61–72. ACM New
testing effectiveness through better selection of oracle data. How- York, NY, USA, 2000.
ever, given the large variability between case examples, determin- [8] Mathworks Inc. Simulink product web site.
ing how to improve the test oracle is non-obvious and precludes http://www.mathworks.com/products/simulink.
general guidelines. This brings us to the goal of this paper: moti- [9] M. Pezze and M. Young. Software Test and Analysis:
vating the need for further study on test oracles. Our initial results Process, Principles, and Techniques. John Wiley and Sons,
indicate that careful selection of the test oracle, perhaps consid- October 2006.
ering the test suite used, may improve testing. Nevertheless, our [10] A. Rajan, M. Whalen, and M. Heimdahl. The effect of
results also indicate this is a challenge, requiring future study. We program and model structure on MC/DC test adequacy
therefore propose the following challenges for future research: coverage. In Proc. of the 30th Int’l Conference on Software
engineering, pages 161–170. ACM New York, NY, USA,
• First, for a given method of selecting oracles and test suites, 2008.
how can we efficiently determine the most effective combi- [11] D. J. Richardson, S. L. Aha, and T. O’Malley.
nation of test suite and oracle for a given system? Specification-based test oracles for reactive systems. In Proc.
• Second, how does the use of test suites generated to meet of the 14th Int’l Conference on Software Engineering, pages
coverage criteria (both structural and requirements coverage 105–118. Springer, May 1992.
criteria) influence the effect of oracle selection on the number [12] M. Staats, M. Whalen, and M. Heimdahl. Programs, tests,
of faults revealed? and oracles: The foundations of testing revisited. In Proc. of
• Finally, is it possible to prioritize tests more effectively given the Int’l Conf. on Software Engineering 2011, 2011.
additional information about a program’s internal state? [13] J. Voas and K. Miller. Putting assertions in their place. In
Software Reliability Engineering, 1994., 5th Int’l Symposium
7. REFERENCES on, pages 152–157, 1994.
[1] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. [14] E. Weyuker. The oracle assumption of program testing. In
Basic Concepts and Taxonomy of Dependable and Secure 13th Int’l Conf on System Sciences, pages 44–49.
Computing. IEEE Trans. Dependable and Secure [15] M. Whalen, A. Rajan, and M. Heimdahl. Coverage metrics
Computing, 1(1):11–33, 2004. for requirements-based testing. In Proceedings of
[2] L. Baresi and M. Young. Test oracles. In Technical Report International Symposium on Software Testing and Analysis,
CIS-TR-01-02, Dept. of Computer and Information Science, pages 25–36. ACM, July 2006.
Univ. of Oregon. [16] Q. Xie and A. Memon. Designing and comparing automated
[3] A. Bertolino. Software testing research: Achievements, test oracles for gui-based software applications. ACM Trans.
challenges, dreams. In L. Briand and A. Wolf, editors, Future on Software Engineering and Methodology (TOSEM),
of Software Engineering 2007. IEEE-CS Press, 2007. 16(1):4, 2007.
895