3.2.2 Feature Set 2: Ease of Introduction and Setup
This feature set focuses on the level of difficulty inherent in setting up and using the tool for the first time. Each tool should:

- have reasonable system requirements (F2-SF01) and not require any advanced hardware or software to function,
- have a simple installation and setup procedure (F2-SF02) that is supported by an installation guide (F2-SF03) and/or a tutorial (F2-SF04),
- be as self-contained as possible, i.e. able to function, primarily, as a stand-alone application with minimal requirements for other external technologies (F2-SF05).

3.2.3 Feature Set 3: SR Activity Support
These features relate to how well the tool supports each of the three main phases of an SR and the steps within these phases.

Planning Phase
For this phase, the tool should support the collaborative development of a review protocol, using a template, and the control of versions, to keep track of any changes to the protocol during its development (F3-SF01). It should also support validation of the protocol (F3-SF02). This might be achieved by enabling evaluation checklists to be distributed to and completed by members of a review team.

Conduct Phase
For the conduct phase, the tool should support:

- automated searching for relevant papers (F3-SF03). Ideally, the user should be able to perform an automated search from within the tool, which should identify duplicate papers and handle them accordingly.
- study selection and validation (F3-SF04). In particular, the tool should provide support for a multi-stage selection process (i.e. title/abstract then full paper), for multiple users to apply the inclusion/exclusion criteria independently and a facility to reconcile disagreements.
- quality assessment and validation (F3-SF05). The tool should enable the use of suitable quality assessment criteria, should allow multiple users to perform the scoring and should provide a facility to resolve conflicts.
- data extraction (F3-SF06). In particular, the tool should support the extraction and storage of qualitative data using classification and mapping techniques. In addition, the extraction of quantitative data, which manages specific numerical information from a reported study, should also be supported.
- data synthesis (F3-SF07). The tool should be able to provide automated analysis of extracted data. Other types of analysis, such as text analysis (F3-SF08) and meta-analysis (F3-SF09), would also be useful.

Reporting Phase
The tool should support the reporting phase of the SR process. This might be achieved using a template to assist the write-up (F3-SF10) and using automated checklists to support the validation (F3-SF11).
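To make the structure of this feature catalogue concrete, the sketch below encodes the conduct-phase subfeatures as data, in the form later consumed by the scoring scheme of Section 3.3. It is purely illustrative: the identifiers and descriptions come from the text above, and the mandatory level for quality assessment is stated later in the paper, but the remaining importance levels and judgement-scale interpretations shown here are placeholders, since the actual assignments appear in Table 1, which is not reproduced in this excerpt.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    feature_id: str      # subfeature identifier, e.g. "F3-SF03"
    description: str     # the capability the tool should provide
    importance: str      # "M", "HD", "D" or "N" (actual values are in Table 1)
    interpretation: str  # judgement-scale interpretation: "JI1", "JI2" or "JI3"

# Conduct-phase subfeatures from Section 3.2.3. Quality assessment (F3-SF05)
# is assigned a mandatory level later in the paper; the other importance and
# interpretation values below are illustrative placeholders, not Table 1.
CONDUCT_PHASE = [
    Feature("F3-SF03", "automated searching for relevant papers", "HD", "JI3"),
    Feature("F3-SF04", "study selection and validation", "M", "JI3"),
    Feature("F3-SF05", "quality assessment and validation", "M", "JI3"),
    Feature("F3-SF06", "data extraction", "M", "JI3"),
    Feature("F3-SF07", "data synthesis", "HD", "JI3"),
]
```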
3.2.4 Feature Set 4: Process Management
This set of features relates to the management of an SR. Undertaking an SR is a collaborative process. Therefore, the tool should allow multiple users to work on a single review (F4-SF01). It should support document management (F4-SF02), in particular, managing large collections of papers, studies and the relationships between them. The tool should be secure (F4-SF03) and include a user log-in or similar system. It should be able to manage the roles of users (F4-SF04). For example, it would be useful to state which users will perform certain activities (e.g. study selection, quality assessment, data extraction etc.) and allocate papers accordingly. Finally, the tool should be able to support multiple SR projects (F4-SF05).

3.3 Scoring Candidate Tools
In this section we look at three elements of the scoring process:

- scoring each tool against each feature to produce a raw score,
- assigning a level of importance to each feature, which is used as a weighting (i.e. a multiplier) to convert raw scores to weighted scores for each feature,
- determining scores for each feature set and an overall score for each candidate tool.

Each tool was initially scored against each feature by the first author (CM). The scores were then discussed by all of the authors to produce a set of validated raw scores. A spreadsheet was used to record raw scores, weighted scores and overall scores.

The following sections provide further details about the judgement scale and its interpretation for specific features, the assignment of a level of importance to each feature and the approach taken to calculating overall scores for each of the tools.

3.3.1 Judgement Scale and its Interpretation
A single, simple judgement scale was used to score the features. Where a feature was fully present or strongly supported, it was awarded a score of 1; where it was partly present or partially supported, it was awarded a score of 0.5; and where it was absent or minimally supported, it was awarded a score of 0.

The judgement scale was interpreted for each of the features in one of three ways, labelled JI1, JI2 and JI3 in Table 1. The interpretations are shown in Tables 2, 3 and 4.

Table 2. JI1 - Interpretation of Judgement Scale

  Is the feature present?    Score
  Yes                        1
  Partly                     0.5
  No                         0

Table 3. JI2 - Interpretation of Judgement Scale

  Is the tool simple to install and setup?                         Score
  Yes                                                              1
  Some difficulties - The tool could be installed, but there
  were a number of slight difficulties throughout the process.     0.5
  No - The tool could be installed but the process was very
  difficult, or the tool could not be installed.                   0

Table 4. JI3 - Interpretation of Judgement Scale

  Is the activity supported?                        Score
  Yes - Fully                                       1
  Partly - Support is limited. Some aspects of
  the activity are not supported.                   0.5
  No                                                0

3.3.2 Level of Importance
An effective tool is one that includes the features that are most important to its target users [10]. Kitchenham et al. state that if a tool fails to include a mandatory feature, then it is, by definition, unacceptable [10]. Non-mandatory features allow the evaluator to judge the relative merit of a group of otherwise acceptable tools [10]. For this study, a feature can be considered Mandatory (M) or one of three gradations of desirability, namely Highly Desirable (HD), Desirable (D) or Nice to have (N). Table 5 shows the multiplier (i.e. the weighting) associated with each level of importance. The level of importance assigned to each feature set is shown in Table 1. The importance levels were determined through discussion between the authors.

3.3.3 Feature Set and Overall Scores
As indicated above, a weighted score for each feature is calculated by multiplying the raw score by the importance weighting for that feature. These weighted scores can be combined to determine a percentage score for each feature set (as shown, for example, in column 5 of Table 7).

The percentage score for a feature set is determined as follows:

$$\text{Percentage Score} = \frac{\text{Sum of Weighted Scores}}{\text{Maximum Score}} \times 100\%$$

The maximum score for a feature set is assumed to be the sum of the weighted scores where all features in the set are fully present (or fully supported).

For example, Feature Set 1 (F1) has two subfeatures (F1-SF01 and F1-SF02). F1-SF01 has been classified as highly desirable (HD). This means the maximum weighted score for this subfeature is three. Similarly, F1-SF02 has been classified as HD, so its maximum score is also three. Therefore, the maximum score for F1 is six. Similarly, the maximum scores for the remaining feature sets are 16 for Feature Set 2, 23 for Feature Set 3 and 17 for Feature Set 4.

An overall percentage score for each tool can be determined by taking a (weighted) average of the percentage scores for each feature set. Since there is a different number of subfeatures in each of the feature sets, it is necessary to use normalised scores (i.e. the percentage scores) for this.

For this calculation, we use the Feature Set weighting shown in Table 6. The values here emphasise support for SR activities (F3) and for process management (F4). Other weightings could be used, perhaps to emphasise usability, as tools to support SRs become more mature. The overall score for each tool can be determined using the following equation:

$$\text{Overall score} = \frac{\sum_{i=1}^{4} w_i \, TP_i}{\sum_{i=1}^{4} w_i} \qquad (3.1)$$

where $w_i$ is the weighting for the $i$th feature set and $TP_i$ is the percentage score for the $i$th feature set.

Table 6. Feature Set Weighting

  Feature Set    Weight
  F1             0.1
  F2             0.2
  F3             0.4
  F4             0.3
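As a concrete illustration of the calculations described in Sections 3.3.1 to 3.3.3, the minimal sketch below implements the percentage-score formula and Equation 3.1. The HD multiplier of 3 follows from the F1 example above and the feature-set weights come from Table 6; the multipliers shown for M, D and N are assumptions, since Table 5 is not reproduced in this excerpt, and the per-set percentages in the final call are hypothetical.

```python
# Sketch of the scoring scheme from Sections 3.3.1-3.3.3.
# Only HD = 3 is confirmed by the text (the F1 example); the M, D and N
# multipliers below are assumed placeholders for Table 5.
IMPORTANCE_WEIGHT = {"M": 4, "HD": 3, "D": 2, "N": 1}

# Feature-set weights from Table 6.
FEATURE_SET_WEIGHT = {"F1": 0.1, "F2": 0.2, "F3": 0.4, "F4": 0.3}

def feature_set_percentage(raw_scores, importances):
    """Multiply each raw score (1, 0.5 or 0) by its importance weighting,
    sum, and express the result as a percentage of the maximum score
    (the sum of the weightings, i.e. all features fully present)."""
    weighted = sum(s * IMPORTANCE_WEIGHT[i]
                   for s, i in zip(raw_scores, importances))
    maximum = sum(IMPORTANCE_WEIGHT[i] for i in importances)
    return 100.0 * weighted / maximum

def overall_score(set_percentages):
    """Equation 3.1: the weighted average of the feature-set percentage
    scores, using the Table 6 feature-set weights."""
    num = sum(FEATURE_SET_WEIGHT[f] * tp for f, tp in set_percentages.items())
    den = sum(FEATURE_SET_WEIGHT[f] for f in set_percentages)
    return num / den

# Feature Set 1 from the worked example: two HD subfeatures. If one were
# fully present (1) and one partly present (0.5), the percentage score
# would be (3 + 1.5) / 6 * 100 = 75%.
assert feature_set_percentage([1, 0.5], ["HD", "HD"]) == 75.0

# Hypothetical per-set percentages combined via Equation 3.1 -> 56.0.
print(overall_score({"F1": 75.0, "F2": 60.0, "F3": 50.0, "F4": 55.0}))
```

Note that because the Table 6 weights sum to one, the denominator in Equation 3.1 acts only as a normalising safeguard should an alternative weighting be substituted.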
was agreed that the ability to load an example project into the tool is a highly useful feature (and more useful than initially thought). Finally, partial marks were originally awarded for SLR-Tool's support of role management. The tool allows the user to make a note of who (i.e. which members of the review team) will perform certain activities; specifically, the search, study selection and quality assessment. However, since the tool does not support multiple users, it was agreed that management of roles cannot be supported effectively. Therefore, the score was reduced.

4.4.6 Overall Score
Using Equation 3.1 and the Feature Set Weightings shown in Table 6, the overall score for SLR-Tool is 53.2%.

5. DISCUSSION
This section presents a discussion of the results of the feature analysis, highlighting the main strengths and weaknesses of each candidate tool. Limitations of the study are also discussed.

5.1 Discussion of Results
SRs in SE usually take one of two forms. The 'standard' form aims to address specific research questions relating to SE methods or procedures. The alternative form, termed a mapping study, aims to classify the literature on a specific SE topic [18]. For mapping studies, the search strategy is often less stringent than for standard SRs and quality assessment is not usually required. We consider these slightly different requirements within the discussion of results.

As shown in Table 11, SLuRp achieves the highest overall score of 65.4%² and so, within the constraints of the study, can be considered the most suitable tool to support SRs in SE. Its main strengths are:

- It provides full support for a team-based SR process.
- It can be used for standard SRs as well as for mapping studies (good support for quality assessment).
- It is actively supported by its developer.

SLuRp's main weaknesses are its complex installation, its lack of support for protocol development and the difficulties associated with the use of the performance form, as described in Section 4.1.3.

StArt has an overall score of 53.3%. Its main strengths are:

- Active support and maintenance by its developers.
- Its simple setup procedure.

StArt is the only tool that does not rely on the installation of any additional applications in order to function. One of StArt's weaknesses is an absence of support for multiple users. As a consequence, many of the SR stages that are considered collaborative activities are only partially supported by the tool. In addition, it does not support quality assessment. Since we have assigned quality assessment a mandatory level of importance, we suggest that StArt is not yet suitable for standard SRs (as opposed to mapping studies).

SLR-Tool has an overall score of 53.2%. The tool's main strengths are:

- Strong support for developing a review protocol.
- Effective support provided to new users; notably, the ability to load an example project into the tool.
- Effective support for automated analysis.

Its main weakness is its lack of support for multiple users. Also, as indicated in Section 4.4.4, we were unable to import collections of papers. This meant that papers had to be imported manually on a paper-by-paper basis.

SLRTOOL has the lowest overall score of 45.1%. The tool has a number of promising and potentially useful features, yet fails to implement them effectively. In particular, it is clear that support for collaboration amongst multiple users was a primary design objective. The facility to add and remove users to and from ongoing projects is impressive and, generally, works well. Unfortunately, SLRTOOL does not allow users to collaborate in any meaningful way. As a result, much of its support for the SR process is quite limited.

A number of limitations are common across all (or most) of the candidate tools. Support for protocol development is generally quite limited; only one tool, SLR-Tool, assisted this stage effectively. In addition, support for the search process (a frequently stressed issue within the community [5,6,7,8]) is largely absent. SLRTOOL is the only tool that provides an internal search facility (i.e. a facility within the tool for searching digital libraries). However, as noted in Section 4.2.3, its implementation is rather limited. Poor support for this aspect of an SR may be a consequence of the inherent difficulties associated with automated searching [7,19,20,21], which, we suggest, need to be addressed before effective tool support can be realised.

Support for collaboration within a team-based SR is also limited. Only SLuRp provides reasonably effective facilities for collaboration amongst multiple users. We believe that a deeper understanding of what are considered the collaborative activities in an SR is needed in order to develop effective support.

² Using the feature set weightings, as shown in Table 6.

5.2 Limitations of the Study
The main threats to validity arise from the subjective nature of many of the elements of the feature analysis process. The features used are essentially a preliminary set based on our own experiences and those reported in the literature. We hope that this exercise will provide the foundations for further study of the features expected from an SR tool. Similarly, the levels of importance, both for individual features and for feature sets, are
based on experience. However, these can easily be adjusted and the weighted scores re-calculated where priorities differ.

Of course, the scoring is also subjective; however, as independent evaluators, we have no vested interest in any of the candidate tools. Also, to mitigate any potential bias, we performed a substantial validation exercise with all authors reviewing all scores for all tools.

The performance form feature (see Section 4.1.3) of SLuRp and the 'bulk-import' feature of SLR-Tool (see Section 4.4.4) were not fully evaluated. It is, however, intended that these features be examined further, with additional support from their developers.

6. CONCLUSIONS
This study has evaluated four candidate tools, namely SLuRp, StArt, SLR-Tool and SLRTOOL, which aim to support the whole SR process. A set of features that such tools should possess has been developed and used as the criteria against which to evaluate the candidate tools.

SLuRp received the highest overall score and is, therefore, based on the results of this study, the most suitable tool to support SRs in SE. SLRTOOL received the lowest overall score, making it the least suitable.

The results of this study provide new insight into tools that support SRs in SE. We believe that one of its most interesting and significant outputs is the set of features presented in Table 1 and described in Section 3.2. The feature set is based on our assessment of what we believe an effective tool should include. The next stage is to circulate these features within the community in order to refine and validate them. It would also be interesting to explore SR tools in other domains to determine whether they could inform the development of tools in SE.

7. ACKNOWLEDGMENTS
The authors express thanks to the developers of each candidate tool for their cooperation. Thanks are also given to Keele University's Environment, Physical Sciences and Applied Mathematics Research Institute for its partial support for Christopher Marshall.

8. REFERENCES
[1] Kitchenham, B. A., Dyba, T., & Jorgensen, M. (2004). Evidence-based software engineering. In Proceedings of the 26th International Conference on Software Engineering (ICSE 2004), pp. 273-281.
[2] Kitchenham, B., Brereton, O. P., Budgen, D., Turner, M., Bailey, J., & Linkman, S. (2009). Systematic literature reviews in software engineering - a systematic literature review. Information and Software Technology, 51(1), 7-15.
[3] Kitchenham, B. A., & Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering. Keele University and University of Durham, EBSE Technical Report.
[4] Staples, M., & Niazi, M. (2007). Experiences using systematic review guidelines. Journal of Systems and Software, 80(9), 1425-1437.
[5] Riaz, M., Sulayman, M., Salleh, N., & Mendes, E. (2010). Experiences conducting systematic reviews from novices' perspective. In Proceedings of EASE, Vol. 10, pp. 1-10.
[6] Babar, M. A., & Zhang, H. (2009). Systematic literature reviews in software engineering: Preliminary results from interviews with researchers. In Proceedings of the 3rd International Symposium on Empirical Software Engineering and Measurement, pp. 346-355.
[7] Brereton, P., Kitchenham, B. A., Budgen, D., Turner, M., & Khalil, M. (2007). Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, 80(4), 571-583.
[8] Carver, J. C., Hassler, E., Hernandes, E., & Kraft, N. A. (2013). Identifying barriers to the systematic literature review process. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 203-212.
[9] Marshall, C., & Brereton, P. (2013). Tools to support systematic literature reviews in software engineering: A mapping study. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 296-299.
[10] Kitchenham, B., Linkman, S., & Law, D. (1997). DESMET: a methodology for evaluating software engineering methods and tools. Computing & Control Engineering Journal, 8(3), 120-126.
[11] Kitchenham, B. A. (1997). Evaluating software engineering methods and tools, part 7: planning feature analysis evaluation. ACM SIGSOFT Software Engineering Notes, 22(4), 21-24.
[12] Kitchenham, B. A., & Jones, L. (1997). Evaluating software engineering methods and tools, part 8: analysing a feature analysis evaluation. ACM SIGSOFT Software Engineering Notes, 22(5), 10-12.
[13] Grimán, A., Pérez, M., Mendoza, L., & Losavio, F. (2006). Feature analysis for architectural evaluation methods. Journal of Systems and Software, 79(6), 871-888.
[14] Hedberg, H., & Lappalainen, J. (2005). A preliminary evaluation of software inspection tools, with the DESMET method. In Fifth International Conference on Quality Software (QSIC 2005), pp. 45-52.
[15] Bowes, D., Hall, T., & Beecham, S. (2012). SLuRp: a tool to help large complex systematic literature reviews deliver valid and rigorous results. In Proceedings of the 2nd International Workshop on Evidential Assessment of Software Technologies, pp. 33-36.
[16] Hernandes, E., Zamboni, A., Fabbri, S., & Di Thommazo, A. (2012). Using GQM and TAM to evaluate StArt - a tool that supports systematic review. CLEI Electronic Journal, 15(1), 13-25.
[17] Fernández-Sáez, M., Bocco, M. G., & Romero, F. P. (2010). SLR-Tool - a tool for performing systematic literature reviews. In Proceedings of the 2010 International Conference on Software and Data Technologies, p. 144.
[18] Kitchenham, B. A., Budgen, D., & Pearl Brereton, O. (2011). Using mapping studies as the basis for further research - a participant-observer case study. Information and Software Technology, 53(6), 638-651.
[19] Dieste, O., Grimán, A., & Juristo, N. (2009). Developing search strategies for detecting relevant experiments. Empirical Software Engineering, 14(5), 513-539.
[20] Bailey, J., Zhang, C., Budgen, D., Charters, S., & Turner, M. (2007). Search engine overlaps: Do they agree or disagree? In Second International Workshop on Realising Evidence-Based Software Engineering (REBSE'07), p. 2.
[21] Dyba, T., Dingsoyr, T., & Hanssen, G. K. (2007). Applying systematic reviews to diverse study types: An experience report. In First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), pp. 225-234.