Kitchenham - Repeatability of Systematic Literature Reviews - 2010
1Barbara Kitchenham, 1Pearl Brereton, 2Zhi Li, 3David Budgen and 3Andrew Burn

1School of Computing and Mathematics, Keele University, Staffordshire ST5 5BG, UK
{O.P.Brereton,B.A.Kitchenham}@cs.keele.ac.uk

2College of Computer Science and Information Technology, Guangxi Normal University,
No.15 Yu Cai Road, Guilin, Guangxi 541004, P.R. China
zhili@mailbox.gxnu.edu.cn

3School of Engineering and Computing Sciences, Durham University,
South Road, Durham City, DH1 3LE, UK
{David.Budgen,A.J.Burn}@durham.ac.uk
Abstract

Background: One of the anticipated benefits of systematic literature reviews (SLRs) is that they can be conducted in an auditable way to produce repeatable results.

Aim: This study aims to identify under what conditions SLRs are likely to be stable, with respect to the primary studies selected, when used in software engineering. The conditions we investigate in this report are when novice researchers undertake searches with a common goal.

Method: We undertook a participant-observer multi-case study to investigate the repeatability of systematic literature reviews. The "cases" in this study were the early stages, involving identification of relevant literature, of two SLRs of unit testing methods. The SLRs were performed independently by two novice researchers. The SLRs were restricted to the ACM and IEEE digital libraries for the years 1986-2005 so their results could be compared with a published expert literature review of unit testing papers.

Results: The two SLRs selected very different papers, with only six papers out of 32 in common, and both differed substantially from a published secondary study of unit testing papers, finding only three of its 21 papers. Of the 29 additional papers found by the novice researchers, only 10 were considered relevant. The 10 additional relevant papers would have had an impact on the results of the published study by adding three new categories to the framework and adding papers to three, otherwise empty, cells.

Conclusions: In the case of novice researchers, having broadly the same research question will not necessarily guarantee repeatability with respect to primary studies. Systematic reviews must be careful to report their search process fully or they will not be repeatable. Missing papers can have a significant impact on the stability of the results of a secondary study.

Keywords: Systematic Literature Review, Repeatability, Case Study

1. Introduction

The EPIC (Evidence-based Practices Informing Computing) project has been investigating properties of Systematic Literature Reviews (SLRs) in the Software Engineering (SE) domain. This paper reports the results for one of the research questions addressed by a case study to investigate the stability of the SLR process. The specific research question addressed is:

RQ1: To what extent does the SLR methodology provide repeatable results?

The basic methodology used is a participant-observer multi-case study using Yin's approach to case study design (Yin 2003). The "case" is the part of the SLR conduct stage that involves identifying and selecting the relevant primary studies. In this study, the results from two SLRs carried out independently by two research assistants are compared to one another. They are also compared with an expert literature review addressing the topic of empirical studies of unit testing undertaken by Juristo and colleagues (Juristo et al., 2004 and Juristo et al., 2006), henceforth referred to as the JMV study.

Following Yin's terminology, the case study protocol defined the following propositions related to the research question:

RQ1-P1: Where an SLR is conducted within well-defined parameters describing the topic, information sources and chronological bounds,
Table 2. Research questions for the independent SLRs

SLR researcher   Research question
Feng Bian        What empirical evidence has been found in the studies that assess the effectiveness and efficiency of different unit testing strategies?
Zhi Li           What empirical studies have been carried out to compare unit testing methods in software development?
Table 3. Number of papers found by the independent reviews
There are several possible explanations for the lack of overlap between the different searches. For example:
- The search strings used by Zhi Li and Feng Bian did not include the term "regression testing". They had been tasked specifically to search for unit testing papers, not regression testing papers. In fact, they found five regression testing papers. Strangely enough, the only overlap they had with the JMV papers was based on three regression testing papers.
- As an expert review, the JMV study might have rejected poor quality papers without explicitly stating the fact. In order to investigate this possibility we undertook a quality evaluation of the papers found by our search and the papers found by the JMV study. This is examined in the next section.
- As an expert review, the JMV study might have concentrated on the most prominent/important papers. This is discussed in a later section.
1 Including some screening of obviously irrelevant papers
Table 4. Quality checklist for unit testing papers

1. Are the study measures valid?
   None=0 / Some=0.33 / Most=0.67 / All=1. (Note: should score down if not actually detecting faults or changes to code.)
2. Was there any replication, i.e. multiple test objects, multiple test sets?
   Yes, both=1 / Yes, either=0.5 / No=0
3. If test cases were required by the Test Treatment, how were the test cases generated?
   Not applicable / By the experimenters=0 / By an independent third party=0.5 / Automatically=0.5 / By industry practitioners when the test object was created=1
4. How were Test Objects generated?
   Small programs=0 / Derived from industrial programs but simplified=0.5 / Real industrial programs=1
5. How were the faults/modifications found?
   Not applicable / Naturally occurring=1 (go to question 6); if not, go to question 5a
5a. For seeded faults/modifications, how were the faults identified?
   Introduced by the experimenters=0 / By an independent third party=0.25 / Generated automatically (including mutants)=0.5. Go to 5b.
5b. For seeded faults/modifications, were the type and number of faults introduced justified?
   Type and number (of seeds)=0.5 / Type or number (of seeds)=0.25 / No=0 / Mutants: Yes=0.25, No=0
6. Did the statistical analysis match the study design?
   No=0 / Somewhat=0.33 / Mostly=0.67 / Completely=1
7. Was any sensitivity analysis done to assess whether results were due to a specific test object or a specific type of fault?
   Not applicable / Yes=1 / Somewhat=0.5 / No=0. (Note: look out for a further experiment which is just to check original assumptions.)
8. Were limitations of the study reported either during the explanation of the study design or during the discussion of the study results?
   No=0 / Somewhat=0.5 / Extensively=1
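The excerpt does not say how the per-question scores are aggregated into an overall quality score for a paper. A minimal sketch, assuming purely for illustration that a paper's score is the mean of its applicable items, with "Not applicable" answers excluded from the denominator:

```python
# Sketch of scoring a paper against the quality checklist above.
# Assumption (not stated in the excerpt): the overall score is the
# mean of all applicable item scores; "Not applicable" answers are
# recorded as None and excluded entirely.

def quality_score(item_scores):
    """item_scores: dict mapping question id (e.g. '1', '5a') to a
    numeric score in [0, 1], or None for 'Not applicable'."""
    applicable = [s for s in item_scores.values() if s is not None]
    if not applicable:
        return 0.0
    return sum(applicable) / len(applicable)

# Hypothetical example: a study with naturally occurring faults,
# so questions 3, 5a and 5b do not apply.
scores = {"1": 0.67, "2": 1.0, "3": None, "4": 0.5,
          "5": 1.0, "5a": None, "5b": None,
          "6": 0.67, "7": 0.0, "8": 0.5}
print(round(quality_score(scores), 3))  # -> 0.62
```

The key design question such an aggregation must answer is how "Not applicable" is treated: dropping those items (as here) avoids penalising studies for which seeded-fault questions are irrelevant, whereas scoring them as zero would not.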