

Do software engineering practitioners cite research on software
testing in their online articles? A structured search of grey data.
Technical Report

Austen Rainer
Department of Computer Science and Software Engineering
University of Canterbury, NZ
austen.rainer@canterbury.ac.nz

Ashley Williams
Department of Computer Science and Software Engineering
University of Canterbury, NZ
ashley.williams@pg.canterbury.ac.nz

February 2018

Background: Software engineering (SE) research continues to study the degree to which practitioners perceive
research as relevant to practice. Such studies typically comprise surveys of practitioner opinions. These
surveys could be complemented with other in situ practitioner sources e.g. grey literature, grey data and grey
information. These other sources need to be evaluated for rigour and relevance.
Objective: To investigate whether and how practitioners cite software testing research in the (more rigorous
and relevant) grey data that they produce e.g. blogs.
Method: We develop an abstract search logic to distinguish relevance and rigour, determine keywords to
instantiate that logic, conduct 25,200 web searches using a customized Google-based search tool, scrape the
pages of over 3,200 results, and then analyse the articles for their rigour, and for the frequency and content of
citations to research.
Results: We find few citations to research (range 0% - 1% in our datasets) although this is similar to the
frequency of citations to practitioner sites (0% - 4%). Almost all citations to research are from source pages
that can themselves be classified as science and research, education or online encyclopedias. We find and discuss
the only two significant instances of practitioners citing research in our datasets.
Conclusion: Practitioners rarely cite research in their online articles. The lack of citation may be due to
reporting practices (practitioners have different expectations on citing sources compared to academics), and
this confounds our ability to assess the in situ use of research in practice.

Keywords: evidence, research impact, research relevance, grey literature

Version History

Version No. Date Comment


1.0.0 February 2018 Initial version
1.0.1 April 2018 Cleaned formatting for publishing online
1.0.2 May 2018 Updated data in Table 9.
1.0.3 May 2018 Updated data in abstract to reflect new Table 9 data.

1 Introduction
1.1 Contextual background
We are interested in the degree to which software practitioners cite research outputs in their online content, e.g. in
their blog posts, web pages and articles. These citations indicate the in situ relevance of research to practice (e.g.
the use of research to support or rebut a published opinion, argument or decision), and therefore provide an indication
of the potential impact of research.
A growing body of software engineering research is already investigating online content e.g. question and
answer sites such as StackOverflow [20], source code repositories such as GitHub [30] and SourceForge [24], and
development communities such as Python [25]. In these investigations, researchers must work with at least some
grey documents i.e. grey literature, grey data or grey information (we discuss these terms later in the paper).
We are also interested in the relevance and rigour of online content, and in the relationship of rigour to frequency
of citation. We hypothesise that the more relevant and rigorous grey documents would contain more citations to
research.

We investigate practitioners’ online citations to software testing research. We focus on software testing because
software testing is a well-established area within software engineering research [19, 14, 22]; because software
testing is often conducted in conjunction with industry and practitioners [12]; and because there is at least one
multi-vocal literature review on software testing [10], therefore providing a future opportunity for us to compare
our results with an independent study.

1.2 Research questions & contributions


We investigate three research questions:

• RQ1. Do software engineering practitioners cite research on software testing in their online articles?

• RQ2. Does the more relevant and rigorous online content contain more citations to research on software
testing?

• RQ3. How are citations to research on software testing used by practitioners in their online articles?

The paper makes the following contributions:


1. A new semi-automated approach for researchers to investigate the relevance of research to practice.

2. An initial benchmark for the frequency of citations to software testing research in grey documents. (This
benchmark will need corroborating.)

3. Preliminary insights on the way in which practitioners use research in software testing in relation to their
opinions and arguments.

1.3 Structure of the paper


The remainder of the paper is organised as follows: in section 2 we briefly review related literature; in section 3
we present our method of data collection and analysis; in section 4 we present our results; finally, in section 5,
we review the research questions, consider threats to validity, discuss directions for further research, and briefly
conclude.

2 Related work
2.1 Practitioners’ perceptions of research relevance
Lo et al. [18] conducted a survey to investigate practitioners' perceived relevance of research. Lo et al. summarised
571 full-track conference papers that were published over a five-year period (2009–2014). They then invited
3,000 Microsoft employees to rate the relevance of the 571 papers based on the summaries of those papers.
512 practitioners participated. Lo et al. found respondents to be positive toward the studies undertaken by the
software engineering research community. Lo et al. identified three main limitations to their work: 1) that they
used summaries of the papers; 2) that the survey was conducted at one company, albeit a large organisation
operating across many countries, and developing and supporting a diversity of software products and services;
and 3) that they assessed perceived relevance.
In their discussion of limitations to their study, Lo et al. write, “. . . summaries are needed because it is not
practical to ask survey participants to read entire papers and many abstracts are not concise enough.” ([18]).
Garousi and Felderer make a similar observation: “. . . many practitioners believe that academic papers are too
formal and hard to understand.” ([7]). This suggests that one barrier to practitioners adopting the findings of
research is the readability of research outputs. To help address readability, Cartaxo et al. [3, 4] are developing
evidence briefings, and Storey et al. [29] are developing a visual template to help communicate design science
research in software engineering.
In the study reported later in this paper, we seek to address Lo et al.'s three limitations through a complementary
research design (see section 5.2). We do not provide summaries but instead look for how practitioners cite research
of which they are already aware. We search for practitioners on the internet, and therefore are not constrained
to one company. And we look at the citation of research in situ, i.e. how practitioners cite research within the
articles they write and publish.
To the three limitations summarized above, we identify two further potential limitations to Lo et al.'s study.
The first potential limitation is that the respondents are a self-selecting sample from the company and could
therefore potentially self-select on the basis of being interested in research, or motivated to read research. The
second potential limitation is that the practitioners were asked to focus on the prescribed research (as summaries),
and therefore were not asked about research of which they were already aware. In other words, respondents were
asked to evaluate research outputs that had been identified for them. As a contrasting example for both of these
potential limitations, Garousi and Zhi [11] conducted a survey of 246 practitioners from Canada, finding that
56% of the respondents had never interacted with researchers during their career, and 32% reported infrequent
interaction.

Lo et al. identified several reasons, summarised here in Table 1, why practitioners consider some research
unwise. These reasons might explain why practitioners may be aware of (some) research but then not disseminate
it.

Table 1: Reasons why practitioners consider some research 'unwise' (from [18])
Reason | Brief explanation
A tool is not needed | The current state of practice is good enough and no additional support is needed.
An empirical study is not actionable | The research is not relevant, not deemed to be actionable, or of little benefit.
Generalisability | The research may not be applicable to the systems that matter to the respective practitioner.
Cost outweighs the benefits | The costs of adopting the technology would be higher than the benefits.
Questionable assumptions | A particular input or condition crucial to the research may not hold in practice.
Disbelief in particular technology | There is a (prior) strong disbelief in the technology.
Other solutions seem better | Alternative solutions are available and considered better.
Proposed solution has side effects | The proposed technology may have unintended, negative side effects.

Lo et al. also found no correlation between citations (by other researchers) of a conference paper and the
practitioners’ rating of the relevance of the paper. Again, this is a reminder that research needs to be better
disseminated to practitioners so that practitioners are aware of the research outputs, and can then evaluate the
relevance of those outputs. Devanbu et al. [5] write that disseminating empirical findings to developers is a key
step to ensure that findings have (the potential) to impact practice. Practitioners citing research can complement
researchers disseminating their research.
For Devanbu et al. [5], practitioners' beliefs are primarily formed from personal experience rather than from
findings of empirical research. Devanbu et al. examined the sources that influence belief formation and
found that the following sources (in ranked order) influenced belief formation: personal experience, peer opinion,
mentor/manager, trade journal, research paper, other. Thus, of the five explicit categories they identified, research
was ranked last. Conversely, encouraging practitioners to cite research would be a way to exploit the second-ranked
item, peer opinion.
Devanbu et al.’s ranking does not explicitly include social media. Storey et al. ([27]; see also [28]) recognise
a range of types of social media, observing that practitioners use blogs to write about their experiences and
opinions, and that software engineers increasingly use social media to become informed about new technologies.
Given the above, we focus on the frequency with which practitioners cite research on software testing, and on the
rigour and relevance of the online articles they publish. We focus on software testing because software testing is a
well-established area within software engineering research [19, 14, 22]; because software testing is often conducted
in conjunction with industry and practitioners [12]; and because there is at least one multi-vocal literature review
on software testing [10], therefore providing a future opportunity for us to compare our results with an independent
study.

2.2 Grey literature, grey information and grey data


There has been increasing interest in software engineering research in the value of grey literature. For exam-
ple, Garousi and his colleagues have investigated the development of multi-vocal literature reviews (MLRs) to
incorporate grey literature into systematic literature reviews (SLRs) [10, 9].
Adams et al. [1] recognise difficulties in defining the concept of grey literature, introduce the concept of ‘grey
information’, and distinguish grey literature, grey information and grey data. These distinctions – they overlap and
are not definitions – are reproduced in Table 2. There is the implication in Adams et al.’s paper that grey literature
is typically more credible than grey data, which is typically more credible than grey information: Adams et al.
write, “The term ‘grey data’ has also been used specifically to describe user-generated web content—something
that we feel is more formal and public than ‘grey information’ but less formal than ‘grey literature’.” ([1], p. 165).
In the current paper, we collectively refer to these three categories of information as grey documents.
Software engineering researchers typically seek grey literature that contains some element of empirical data.
For example, Bailey et al. [2] reported a literature survey of evidence for object oriented design. Their inclusion
criteria included, “. . . books, papers, technical reports and ‘grey literature’ describing empirical studies regarding
OO software design;. . . ” ([2], p. 483; emphasis added here). There is the implication that grey literature should
contain some element of empirical study. As a contrasting example, other software engineering researchers have
studied what Adams et al. define as grey data e.g. tweets, blogs, Q&A sites.

Table 2: Grey literature, information and data (from [1])
Term | Defining aspect | Examples
Grey literature | Not controlled by commercial publishing organisations | Internal reports, working papers, newsletters
Grey data | User-generated, web-based | Tweets, blogs, Facebook status updates
Grey information | Informally published or not published at all | Meeting notes, emails, personal memories

2.3 Relevance and rigour


As noted in the previous subsection, Adams et al.’s [1] paper implies a hierarchy of credibility for grey literature,
grey data and grey information. Adams et al. do not provide definitions, or criteria for determining credibility.
With their work on guidelines for multi-vocal literature reviews, Garousi et al. [8] are developing an approach to
help researchers identify and incorporate both the more relevant and also the more rigorous grey literature into
systematic literature reviews.
Evaluations of the rigour of grey data are challenging. For some social media, researchers rely on aggregating
readers' evaluations of the grey data as a proxy for rigour or credibility, e.g. up-voting of answers on a Q&A site,
or retweets. We are interested in evaluating the quality of longer written texts, such as blog articles, where it is
not always possible, or easy, to rely on aggregating a (simple) external measure of credibility. As a result, we
seek to evaluate the rigour of the content of the text in terms of the presence of reasoning indicators, drawing on
Rainer's [23] work. We discuss reasoning indicators in more detail later in this paper.

2.4 Summary
We are interested in how practitioners use research in their grey documents. To bound the investigation reported
in the current paper, we focus on practitioners' citations of software testing research in grey data. We also seek
to identify the more rigorous grey data. Previous work (reviewed above) has tended to survey practitioners for
their opinions on the relevance of research. In our study, we take a complementary perspective, investigating how
practitioners use research in situ e.g. how they use research in relation to the opinions or arguments they express
in their online articles.

3 Method
3.1 Conceptual framework
Relevance and rigour are inherently challenging to assess for research. Such challenges increase when using a
search engine to find appropriate online content. (We consider the threats to validity of search engines later in
this paper.) Consider, for example, that a search engine:

• typically uses a keyword-based search query, and such queries do not necessarily allow for finer-grained
searching;

• is likely to optimise the search results to the searcher, based on the search engine’s history of prior searches
by that searcher;

• is likely to maintain its own topic model to determine the relevance of results, e.g. whether an online article
relates to software testing.

We decided to use the Google Custom Search API (https://developers.google.com/custom-search/) [13] so that we
could automate the Google searches. Prior research in software engineering (e.g. [6, 21]) has tended to use manual
searching of Google. For the criterion of relevance, we chose to use a simple keyword search (i.e. the presence of
the words <"software" AND "testing"> in an online document) and to allow the Google search engine's internal
topic model(s) to find relevant content based on that simple keyword search. This is consistent with the way that
practitioners would look for documents online. For the criterion of rigour, we further distinguish between reasoning
and experience, and construct a set of keywords to seek to limit searches based on reasoning and experience. Our
reasoning indicators were derived from a review of prior research ([31]). There is little prior research on searching
for experience online (but see [16, 15, 17]), so we constructed a basic set of keywords. The keywords are summarised
in Table 3.
We then constructed a logic of nine distinct queries, summarised in Table 4 and visualised in Figure 1 (see
Table 6 for the actual search strings). (To clarify: search set S1 would generate dataset S1.) Ideally, we want the
search engine to find online content that contains reasoning and experience relating to software testing; content
that ideally also contains URL citations to research. The search query for set S6 targets that ideal content. We
conduct the other eight sets of searches to allow us to evaluate the quality of content in S6, and to evaluate the
frequency of URL citations to research. For example, search S3 is intended to find online content that contains
reasoning and experience, and (ideally) URL citations to research, but where the content is not about software
testing.

Table 3: Keyword indicators for reasoning and experience
Reasoning indicators: but, because, for example, due to, first of all, however, as a result, since, reason, therefore
Experience indicators: i, me, we, us, my, experience, experiences, experienced, our

Table 4: Logic for each set of searches and resulting datasets (T=topic; R=reasoning; E=experience; !=logical not)
Search set | T | R | E | !T | !R | !E
S1 |   |   |   | • | • | •
S2 |   | • |   | • |   | •
S3 |   | • | • | • |   |
S4 |   |   | • | • | • |
S5 | • | • |   |   |   | •
S6 | • | • | • |   |   |
S7 | • |   | • |   | • |
S8 | • |   |   |   | • | •
S9* |   |   |   | ◦ | • | •
(*S9 uses "software engineering" as the seed topic while excluding "testing"; see Table 6.)

Search sets S1 and S9 are special cases. Logically, search set S1 contains the universe of (other, potential)
online content and should therefore be included for completeness of evaluation. Practically, we do not have the
resources to adequately search the universe of online content (or even Google’s indices of the universe of online
content). We anticipate that we would have a sparse (relatively speaking) and uncertain dataset resulting from
search set S1. Accepting these constraints, we constructed a random sample of search queries (with query length
between two and five keywords) for set S1. We then complemented search set S1 with a more constrained search
set, S9. Search set S9 is defined as the set of all articles relating to “software engineering” excluding those articles
referring to “testing”. Table 5 summarises the searches executed to generate set S1.
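For illustration, the following is a minimal Python sketch of how such random queries could be generated. The word pool (drawn from the terms shown in Table 5), the fixed seed and the helper names are illustrative only; this is not the script used to produce Table 5.

    import random

    # Illustrative pool of candidate keywords (taken from the terms in Table 5);
    # the pool actually used is not reproduced here.
    WORD_POOL = ["jar", "cosal", "angle", "tires", "doses", "drawings", "collectors",
                 "examination", "rust", "screws", "halt", "dip", "bracing", "nut"]

    def random_query(rng, min_len=2, max_len=5):
        """Draw between min_len and max_len distinct keywords and join them into one query."""
        return " ".join(rng.sample(WORD_POOL, rng.randint(min_len, max_len)))

    if __name__ == "__main__":
        rng = random.Random(2017)        # fixed seed purely for repeatable illustration
        for day in range(1, 29):         # one random query for each day of the 28-day period
            print(day, random_query(rng))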

3.2 Operationalisation
We wrote a search tool in Python to call the Google Custom Search API. That tool executes the search queries
defined for the nine search sets, storing the results of the searches. We then used Pattern [26] to iterate through
the results, crawling the respective websites to scrape the HTML online content, and then removing the HTML
to leave the raw text for subsequent analyses. We tested and refined the search tool between September and
October 2017. To collect the data for the current study, we performed all nine sets of searches each day for a
continuous four-week (28-day) period between October and November 2017. The Google Custom Search API
places a limit of 100 free searches per day. This allowed us to execute 10 searches per search set per day, each
retrieving 10 pages of results. Each page returns 10 results, and overall we retrieve 1,000 results per day for each
set (we run the queries 10 times a day in an attempt to smooth the variation of results from the Google Custom
Search API). Table 6 presents the base search queries. We removed PDFs or other binary-formatted documents
from the results returned from the Google API (see Table 7). This allowed us to remove some reports that would
be classified as grey literature, helping us to focus on grey data.
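
For illustration, the following condensed sketch shows the shape of this pipeline: one call to the Custom Search JSON API for a single page of results, followed by stripping a result page down to raw text. The API key and engine identifier are placeholders, the helper names are illustrative, and the sketch uses the requests library and the standard-library HTMLParser rather than the actual search tool and Pattern.

    import requests
    from html.parser import HTMLParser

    API_KEY = "YOUR_API_KEY"      # placeholder credential for the Custom Search JSON API
    ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder custom search engine identifier
    ENDPOINT = "https://www.googleapis.com/customsearch/v1"

    def search(query, page):
        """Return one page (10 items) of results for a query from the Custom Search JSON API."""
        params = {"key": API_KEY, "cx": ENGINE_ID, "q": query,
                  "num": 10, "start": 1 + 10 * (page - 1)}
        response = requests.get(ENDPOINT, params=params, timeout=30)
        response.raise_for_status()
        return response.json().get("items", [])

    class TextExtractor(HTMLParser):
        """Collect the text content of a page, discarding all HTML tags."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    def scrape(url):
        """Download one result page and return its raw text for later analysis."""
        extractor = TextExtractor()
        extractor.feed(requests.get(url, timeout=30).text)
        return " ".join(extractor.chunks)

    # Example usage: ten pages of results (100 URLs) for one base query from Table 6.
    # urls = [item["link"] for page in range(1, 11) for item in search(base_query, page)]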
To identify URL citations, we wrote another Python script to search for anchor (’<a>’) HTML tags. We
constructed a three-dimensional table of search set, source URL and cited URLs. We classified source (citing)
URL domain names into categories. We used several categories to provide a means to evaluate the frequency of
citations to research.
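
A minimal sketch of this extraction step is shown below: it collects the href attribute of every anchor tag in a scraped page, reduces each cited URL to its domain name, and looks the domain up in a category map. The category map here is a small illustrative subset (cf. Table 8 and Table 9), not the full classification scheme used in the study.

    import collections
    import urllib.parse
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Record the href attribute of every anchor ('<a>') tag in a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href and href.startswith("http"):
                    self.links.append(href)

    # Illustrative subset of the domain-to-category map.
    CATEGORIES = {"ieee.org": "Research", "dx.doi.org": "Research",
                  "scholar.google.com": "Research search", "github.com": "Repository",
                  "stackexchange.com": "Developer Q&A", "twitter.com": "Social Media"}

    def cited_domains(page_html):
        """Tally the domain names cited by one source page."""
        collector = LinkCollector()
        collector.feed(page_html)
        return collections.Counter(urllib.parse.urlparse(link).netloc.lower()
                                   for link in collector.links)

    def cited_categories(page_html):
        """Map each cited domain to a category; unknown domains are left uncategorised."""
        return collections.Counter(CATEGORIES.get(domain, "Uncategorised")
                                   for domain in cited_domains(page_html).elements())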

Table 5: Summary of searches executed to generate search set S1
Day | Query terms | # results
1 | jar cosal angle tires | 80
2 | doses drawings collectors examination | 160
3 | associate tourniquet hitches | 0
4 | rust screws | 1000
5 | stationery stripe signature regulation | 10
6 | hearts mat safeguard maximum cash | 0
7 | halt dip | 1000
8 | navigations room cable | 1000
9 | bracing nut | 889
10 | opinion multisystems merchants armaments cams | 0
11 | lubrication plastic inch | 722
12 | junctions capacitance article mountain | 10
13 | conspiracies behaviors preserver programmer recruiters | 0
14 | abuses alloy | 1000
15 | finishes shoulder mail | 1000
16 | deviations drafts basin | 220
17 | writing gleams decibels mountain | 0
18 | chaplain report jump speeds dew | 0
19 | steels leak damping guideline | 160
20 | deeds cry may balloons | 10
21 | winters envelope jacket swallow | 20
22 | reliability preserver topic | 350
23 | punches wonder | 1000
24 | installations incline consequences intercoms settlement | 0
25 | advantage blows stream storm | 33
26 | patter polishes | 400
27 | dye speeder | 162
28 | bulbs water flashes misses platter | 30
Total (actual): 9,256
Total (potential): 28,000
Actual/potential: 0.33
Mean: 330.6

Table 6: Example search queries
S | Logic | Search term
1 | !(T + E + R) | today's random query + ' -"software" -"testing" -"i" -"me" -"we" -"us" -"my" -"experience" -"experiences" -"experienced" -"our" -"but" -"because" -"for example" -"due to" -"first of all" -"however" -"as a result" -"since" -"reason" -"therefore"'
2 | R + !(T + E) | '("but" OR "because" OR "for example" OR "due to" OR "first of all" OR "however" OR "as a result" OR "since" OR "reason" OR "therefore") -"software" -"testing" -"i" -"me" -"we" -"us" -"my" -"experience" -"experiences" -"experienced" -"our"'
3 | (R + E) + !T | '(("but" OR "because" OR "for example" OR "due to" OR "first of all" OR "however" OR "as a result" OR "since" OR "reason" OR "therefore") AND ("i" OR "me" OR "we" OR "us" OR "my" OR "experience" OR "experiences" OR "experienced" OR "our")) -"software" -"testing"'
4 | E + !(T + R) | '("i" OR "me" OR "we" OR "us" OR "my" OR "experience" OR "experiences" OR "experienced" OR "our") -"software" -"testing" -"but" -"because" -"for example" -"due to" -"first of all" -"however" -"as a result" -"since" -"reason" -"therefore"'
5 | (T + R) + !E | '(("software" AND "testing") AND ("but" OR "because" OR "for example" OR "due to" OR "first of all" OR "however" OR "as a result" OR "since" OR "reason" OR "therefore")) -"i" -"me" -"we" -"us" -"my" -"experience" -"experiences" -"experienced" -"our"'
6 | T + R + E | '("software" AND "testing") AND ("but" OR "because" OR "for example" OR "due to" OR "first of all" OR "however" OR "as a result" OR "since" OR "reason" OR "therefore") AND ("i" OR "me" OR "we" OR "us" OR "my" OR "experience" OR "experiences" OR "experienced" OR "our")'
7 | (T + E) + !R | '(("software" AND "testing") AND ("i" OR "me" OR "we" OR "us" OR "my" OR "experience" OR "experiences" OR "experienced" OR "our")) -"but" -"because" -"for example" -"due to" -"first of all" -"however" -"as a result" -"since" -"reason" -"therefore"'
8 | T + !(R + E) | '("software" AND "testing") -"but" -"because" -"for example" -"due to" -"first of all" -"however" -"as a result" -"since" -"reason" -"therefore" -"i" -"me" -"we" -"us" -"my" -"experience" -"experiences" -"experienced" -"our"'
9 | !(T + E + R) | '"software engineering" -"testing" -"i" -"me" -"we" -"us" -"my" -"experience" -"experiences" -"experienced" -"our" -"but" -"because" -"for example" -"due to" -"first of all" -"however" -"as a result" -"since" -"reason" -"therefore"' (inner universe: "software engineering" as seed)

Figure 1: Venn diagram for the sets of searches and resulting search sets (The shading suggests a sample is drawn
from the overall population.)

4 Quantitative analyses
4.1 Summary of search results
Table 7 summarises the results of our search for each search set. Successfully downloaded articles may still cause
problems for subsequent analysis e.g. a binary file, such as a PDF.

Table 7: Summary of unique articles found in each search

S | Dataset | Returned in search | Successfully crawled | Not analysed | Analysed
1 | !(T + E + R) | 984 | 925 | 247 PDF | 678
2 | R + !(T + E) | 257 | 242 | 8 PDF | 234
3 | (R + E) + !T | 545 | 536 | 26 PDF | 510
4 | E + !(T + R) | 597 | 567 | 0 | 567
5 | (T + R) + !E | 207 | 190 | 52 PDF + 1 PSA | 137
6 | T + R + E | 282 | 270 | 9 PDF | 261
7 | (T + E) + !R | 214 | 207 | 4 PDF | 203
8 | T + !(R + E) | 154 | 142 | 21 PDF | 121
9 | !(T + E + R) | 154 | 144 | 3 PDF | 141
Total | | 3394 | 3223 | 370 PDF, 1 PSA | 2852
Mean | | 377 | 358 | | 317

4.2 Domain frequency analysis for entire dataset


To provide a preliminary benchmark, we examine the frequency of domain names that were cited across all of the
nine search sets. Table 8 presents the top nine most frequently cited domain names, together with a selection of
relevant other domain names. The top three research domains together account for 1% of cited domain names.
For Wikipedia, the list entry is only for the English-language site and does not include, for example, the German
site (de.wikipedia.org), the English mobile site (en.m.wikipedia.org), the French site (fr.wikipedia.org), or the
generic main site (wikipedia.org).

4.3 Frequency of citations to research


Table 9 presents the percentage of the source URLs (organised by set) that cite the category of external URL.
The table also ranks the nine sets within each category. For example, 2% of the source URLs in search set S1
cite external URLs that we have categorised as education, and this is ranked 3rd against all nine sets. Values in
the cells can exceed 100% because more external URLs may be cited than there are source URLs, for example
one source URL could cite two external URLs.
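
As an illustration of the computation behind each cell, the sketch below assumes, for one category, the number of analysed source URLs per set (taken from the 'Analysed' column of Table 7) together with hypothetical counts of cited external URLs in that category (the counts are placeholders, not the study's data); tied sets share the higher rank.

    # Number of analysed source URLs per set (from the 'Analysed' column of Table 7).
    SOURCE_URLS = {"S1": 678, "S2": 234, "S3": 510, "S4": 567, "S5": 137,
                   "S6": 261, "S7": 203, "S8": 121, "S9": 141}

    # Hypothetical counts of cited external URLs in one category (placeholders only).
    CITED_IN_CATEGORY = {"S1": 2, "S2": 1, "S3": 5, "S4": 1, "S5": 2,
                         "S6": 3, "S7": 1, "S8": 1, "S9": 2}

    def percentages(cited, sources):
        """Percentage per set; can exceed 100% when one page cites several external URLs."""
        return {s: 100.0 * cited[s] / sources[s] for s in sources}

    def ranks(values):
        """Rank sets by percentage, highest first; tied sets share the higher rank."""
        ordered = sorted(values.values(), reverse=True)
        return {s: 1 + ordered.index(v) for s, v in values.items()}

    pct = percentages(CITED_IN_CATEGORY, SOURCE_URLS)
    print({s: round(p, 1) for s, p in pct.items()}, ranks(pct))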
Table 9 indicates that research and developer resources are infrequently cited, but that sets S5 – S8 tend
to cite research and developer resources more (as we would expect given the search queries).

Overall, the table indicates that there is very little citation of research amongst the articles that we have
downloaded. Set 6 has one of the higher percentages of citations to research (1%; compare with sets 8 and 9), the
highest percentage of citations to developer authorities, and one of the higher percentages of citations to Q&A
and to repositories. Curiously, set 5 has a very high percentage (relatively) of citations to repositories.
Table 8: Most frequently cited domain names
Rank | Cited domain name | Frequency | % total citations | Comment
1 | twitter.com | 2167 | 4.5 | Social media
2 | facebook.com | 1795 | 3.7 | Social media
3 | microfocus.com | 1470 | 3.1 | Software company
4 | bandcamp.com | 712 | 1.5 | Music service for artists and labels
5 | congress.gov | 666 | 1.4 | Government
6 | instagram.com | 660 | 1.4 | Social media
7 | linkedin.com | 638 | 1.3 | Professional network
8 | youtube.com | 613 | 1.3 | Social media
9 | plus.google.com | 449 | 0.9 | Social media
12 | software.microfocus.com | 323 | 0.7 | Enterprise software (see above)
15 | en.wikipedia.org | 287 | 0.6 | Encyclopedia
17 | github.com | 259 | 0.5 | Code repository
19 | dx.doi.org | 241 | 0.5 | Research: search page for Digital Object Identifiers
20 | t.co | 234 | 0.5 | Twitter server
35 | ieee.org | 129 | 0.3 | Research
41 | amazon.com | 103 | 0.2 | eCommerce
42 | umassd.edu | 101 | 0.2 | Highest ranked .edu site
45 | scholar.google.com | 96 | 0.2 | Research
63 | wordpress.org | 75 | 0.1 | Blogs
64 | microsoft.com | 73 | 0.1 | Technology corporation
74 | stackexchange.com | 65 | 0.1 | Q&A site
78 | agilemodeling.com | 63 | 0.1 | Topic-specific site
94 | doi.org | 56 | 0.1 | Related domain name to dx.doi.org
Unique domains: 9,280
Total number of URLs cited: 48,136

Table 9: Percentage and ranking of source URLs (articles) that cite the category of external URL
Category of external URL Search set (source URLs)
1 2 3 4 5 6 7 8 9
Research % 0% 0% 1% 0% 1% 1% 0% 1% 1%
Rank 2 8 1 9 5 2 6 6 4
Research search % 0% 0% 1% 0% 0% 0% 0% 0% 1%
Rank 2 7 1 7 4 4 4 7 3
Education % 2% 2% 3% 2% 3% 3% 1% 4% 12%
Rank 3 8 1 4 7 5 9 6 2
Government % 2% 4% 13% 13% 3% 6% 2% 2% 3%
Rank 3 5 2 1 6 4 7 9 8
Developer authorities % 0% 0% 1% 0% 2% 4% 2% 1% 1%
Rank 5 7 2 5 3 1 4 7 7
Developer Q&A % 0% 0% 0% 0% 0% 1% 1% 0% 0%
Rank 5 7 1 4 5 2 3 7 7
Repository % 0% 1% 1% 0% 4% 2% 1% 2% 1%
Rank 8 5 1 9 2 3 6 4 6
Sandbox (jsfiddle) % 0% 0% 0% 0% 0% 0% 0% 0% 0%
Rank 2 2 1 2 2 2 2 2 2
Social Media % 11% 14% 28% 28% 5% 30% 30% 11% 12%
Rank 3 6 2 1 9 4 5 8 7
e-Commerce % 1% 1% 1% 0% 1% 4% 0% 0% 0%
Rank 3 4 2 4 6 1 9 7 7
News % 0% 0% 2% 0% 0% 0% 0% 0% 0%
Rank 4 5 1 2 5 3 7 7 7
Sum of ranks all categories 40 64 15 48 54 31 62 70 60
Highest rank 2 2 1 1 2 1 2 2 2
Lowest rank 8 8 2 9 9 5 9 9 8
Range of ranks 6 6 1 8 7 4 7 7 6
Sum of research ranks (2 categories) 4 15 2 16 9 6 10 13 7
Sum of developer ranks (3 categories) 18 19 4 18 10 6 13 18 20
Highest research rank 2 7 1 7 4 2 4 6 3
Lowest research rank 2 8 1 9 5 4 6 7 4
Range of research ranks 0 1 0 2 1 2 2 1 1
Highest developer rank 5 5 1 4 2 1 3 4 6
Lowest developer rank 8 7 2 9 5 3 6 7 7
Range of developer ranks 3 2 1 5 3 2 3 3 1
Ranked sum of research ranks 2 8 1 9 5 3 6 7 4
Ranked sum of developer ranks 5 8 1 5 3 2 4 5 9

We are particularly interested in sets 5 – 8 as these sets explicitly include the topic of software testing in the
respective search queries. Table 10 compares the total (summed) and average percentages of research-related and
developer-related categories for those sets explicitly about software testing against those sets explicitly not about
software testing.
For the ranking of all categories, Table 9 indicates that set S3 consistently has the highest ranking across all
categories, with the lowest range in ranking too (see the rows Sum of ranks all categories, Highest rank, Lowest
rank, and Range of ranks). Set S6 is then the next highest ranked set, and also the next most consistent in ranking.
These two sets are generated using search queries that combine reasoning and experience (see Table 4).
For the ranking of the research and developer categories, Table 9 indicates that set S3 is the most consistently
and highest ranked on external URL citations to research and developer web content. Set S6 also has a relatively
high ranking. Curiously, set S1 is also ranked highly on external citations to research.

4.4 Who cites research?


We extracted the URLs of all articles that made a citation to the top-three research domains i.e. ieee, dx.doi.org,
or scholar.google. We then classified the source/citing URLs. We summarise the classification for two of our three
extractions below. We have not reported the third classification, for dx.doi.org, as it is very similar to the two
classifications presented.

Table 10: Comparison of summed percentages of select categories for software testing and not software testing
Category | Total percentage (S1–S4 + S9) | Total percentage (S5–S8) | Average percentage (S1–S4 + S9) | Average percentage (S5–S8)
Research | 0.115 | 0.135 | 0.023 | 0.034
Research search | 0.037 | 0.014 | 0.007 | 0.003
Developer authorities | 0.049 | 0.171 | 0.010 | 0.043
Q&A | 0.023 | 0.072 | 0.005 | 0.018
Repository | 0.12 | 0.311 | 0.024 | 0.078

Table 11: Source URLs citing IEEEXplore research

Type of citing domain | # of URLs | Total citations to IEEEXplore
IEEEXplore | 10 | 150
Academic conference | 6 | 12
IEEE Computer | 4 | 4
Wikis | 2 | 4
Academic's personal page | 1 | 10
Website for C++ testing framework | 1 | 1
Practitioner's blog | 1 | 1

Table 12: Source URLs citing scholar.google

Type of citing domain | # of URLs | Total citations to scholar.google
Nature.com | 7 | 41
Wikipedia | 3 | 3
Academic's personal page | 2 | 2
Springer | 2 | 53
Hindawi | 1 | 26
US Supreme Court | 1 | 1
ScienceMatter.io | 1 | 12
Academic's personal online repository of (others') research publications | 1 | 1

Table 13: Instances of practitioners citing research
URL: http://cute-test.com/projects/macronator/wiki (search set S5) (academic citations=0)
Quotation: “This software is based on the paper The demacrofier [link to research] by Kumar, Sutton,
and Stroustrup.”
Reference: Kumar, A., Sutton, A. and Stroustrup, B., 2012, September. The demacrofier. In Software
Maintenance (ICSM), 2012 28th IEEE International Conference on (pp. 658-661). IEEE.
URL: https://danluu.com/testing/ (search set S6) (academic citations=756)
Quotation: “That brings up the question of why QuickCheck and most of its clones don’t use heuristics
to generate random numbers. The QuickCheck paper mentions that it uses random testing because it’s
nearly as good as partition testing [link to IEEExplore] and much easier to implement. That may be true,
but it doesn’t mean that generating some values using simple heuristics can’t generate better results with
the same amount of effort. Since Zalewski has already done the work of figuring out, empirically, what
heuristics are likely to exercise more code paths, [link to blog] it seems like a waste to ignore that and just
generate totally random values.”
Reference: Duran, J.W. and Ntafos, S.C., 1984. An evaluation of random testing. IEEE transactions on
Software Engineering, (4), pp.438-444.

The tables indicate that, with two exceptions (discussed below), all source URLs were either from higher education,
science research or Wikipedia. In other words, there are almost no practitioner citations of research.

4.5 How are practitioners citing research?


As noted, we identified two instances where practitioners cited research. We summarise those instances in Table
13. The table indicates that both articles cite IEEE sources, not doi.org or scholar.google sources. The
cute-test.com article simply cites the respective IEEE paper, and does not discuss it. The cute-test.com URL was
returned by search S5. By contrast, the danluu.com article cites the respective IEEE paper but also critically
discusses the paper (see the excerpted discussion in Table 13). The respective IEEE paper is over 30 years old,
and has been cited over 750 times by academics.

Given the substance of the danluu.com article, and the background of the author (Dan Luu has worked at Google
and currently works at Microsoft), we looked at the About page of his blog site, where he writes:

“This [blog] started out as a way to jot down thoughts on areas that seem interesting but underap-
preciated. Since then, this site has grown to the point where it gets millions of hits a month and I see
that it’s commonly cited by professors in their courses and on stackoverflow.
That’s flattering, but more than anything else, I view that as a sign there’s a desperate shortage of
understandable explanation of technical topics. There’s nothing here that most of my co-workers don’t
know (with the exception of maybe three or four posts where I propose novel ideas). It’s just that
they don’t blog and I do. I’m not going to try to convince you to start writing a blog, since that has
to be something you want to do, but I will point out that there’s a large gap that’s waiting to be filled
by your knowledge. When I started writing this blog, I figured almost no one would ever read it; sure
Joel Spolsky and Steve Yegge created widely read blogs, but that was back when almost no one was
blogging. Now that there are millions of blogs, there’s just no way to start a new blog and get noticed.
Turns out that’s not true.”

Dan Luu's comments support our view that there is higher-quality (rigorous and relevant) practitioner-generated
grey data that seeks to incorporate research; however, this higher-quality grey data appears to occur very
infrequently.

5 Discussion and conclusion


5.1 The research questions
Returning to the three research questions:

RQ1. Do software engineering practitioners cite research on software testing in their online
articles?

Our quantitative results (Table 8 and Table 9) and qualitative results (Table 13) suggest that there is very little
citation of software testing research by practitioners in their online articles, even for those search sets (S5 – S8)
that particularly target software testing. As a comparison, practitioners also infrequently cite other practitioners.
One (obvious) explanation for the very low frequency and percentage of citation is that practitioners simply do
not write according to academic standards of good practice i.e. practitioners do not cite work that they draw
upon (this is not to imply that practitioners are plagiarizing). One implication, supported by Devanbu et al.'s
observations, is that practitioners do draw upon research, but simply do not report that they do. Similarly,
practitioners draw on the experience of other practitioners (cf. Devanbu et al.’s ranking of sources of opinion)
but often do not explicitly acknowledge those other practitioners in their online articles. A second implication is
a 'vicious circle' for researchers: practitioners do not cite research and therefore do not encourage and promote
others to consider research or to cite it.

RQ2. Does the more relevant and rigorous online content contain more citations to research
on software testing?

With such small percentages and a relatively small sample size, our results (Table 10) are inconclusive. There
appear to be some advantages to using our search logic and keyword indicators to identify the more rigorous
articles based on the presence of reasoning and experience indicators.

RQ3. How are citations to research on software testing used by practitioners in their online
articles?

The two relevant data points suggest a continuum: at least some practitioners simply cite an article and do not
discuss it, whilst other practitioners cite an article and critically discuss it in relation to other sources (although
not necessarily research sources). Our sample is too small to say any more towards answering this question in this
paper.

5.2 Addressing the limitations of previous research


In section 2, we listed the three limitations that Lo et al. [18] recognise for their study:

• They used summaries, because it is not practical to ask participants to read full papers and many abstracts
are not concise enough.

• The survey was conducted at one company, albeit a large organisation operating internationally.

• They assessed perceived relevance.

Further to this, we identified two more limitations to Lo et al.'s study:

• The respondents are a self-selecting sample from the company and could therefore potentially self-select on
the basis of being interested in research, or motivated to read research.

• The practitioners were asked to focus on the prescribed research (as summaries), and therefore were not
asked about research of which they were already aware.

In this paper, we attempt to address these issues by providing a complementary research design. The approach
that we undertake does not provide summaries of research to practitioners. Instead we look for how practitioners
cite research of which they are already aware and which they cite in their published content. Also, in
searching for practitioner-generated written documents online, we do not limit ourselves to practitioners from
only one organisation or industry sector. We instead sample from the entirety of the web and are only limited
by the number of results that the Google API allows us to retrieve, as well as by the Google API indices e.g. we
can’t sample from a blog that Google doesn’t index. In terms of representativeness, we think it is reasonable to
assume that the Google API indexes a high proportion of web content. But it is much more difficult to assess the
way in which the Google API ‘decides’ on what articles it returns in the results. For example, we assume that the
Google API is ‘optimising’ in some way the results it returns, and we assume the optimisation is based in some
way on prior search history for the respective searcher.
Where Lo et al. assess the perceived relevance of the research being analysed, our approach provides another
layer of insight. When a practitioner includes a citation into their blog article, they are using the citation in
relation to their reasoning and experience. This usage is an indicator of actual relevance (from the perspective of
the practitioner) rather than perceived relevance (the practitioner saying that they would use the citation).
In our study, practitioners are not self-selecting at the data collection stage because we automatically search
their publicly available data via the Google API without their knowledge. For most of our analyses, the identities
of the authors are not disclosed, and do not need to be. It is when we come to the analysis of specific blogs
and blog articles, such as Dan Luu’s, that the identity of the authors may (need to) be disclosed. It is at this
point that the respective author may choose to withdraw consent, and therefore may de-select themselves from
the study.

5.3 Threats to validity


There are several threats to the validity of this study. We rely on the Google search engine for determining
relevance. Google produces results based on a number of factors (search history, location, etc.) and these results
vary continuously. For this reason, it is not possible to reproduce the search results.

Another threat is that the sample of results that we retrieve over the 28 day period is not representative relative
to the total amount of content that is available on the web. Even with a much larger sample, it would still be
difficult to determine how representative the sample is as we have no index of the total web, and search engines
do not provide this data.
With regards to the query strings used within this study, <"software" AND "testing"> may be considered too
broad a topic. Future research plans to look at alternative topics (e.g. the phrase "software testing", and other
sub-topics such as automated software testing). Similarly, the results retrieved may have been very different if
we were to alter our reasoning and experience indicators. Although the quality of our reasoning indicators has
been evaluated in a previous study, the experience indicators have been chosen arbitrarily (a study that assesses
the quality of experience indicators has been left for future research). The experience indicators that we have
chosen in this study may be problematic in trying to identify blog articles. For example, one of our indicators is
the word 'i'. However, 'i' is also used a lot in code examples for looping. It may be a good indicator of expression
of self, but may also identify a lot of code tutorials and online code repositories.

5.4 Further research


The work reported here is part of a methodology we are developing that is intended to enable a searcher (e.g. a
researcher or a practitioner) to identify more rigorous and relevant grey documents. The identified grey documents
may then be used in further analyses or work (e.g. as part of a multi-vocal literature review, as examples of
industry opinion for teaching purposes, or as peer opinion in some practitioner decision-making or discussion). In
further research, we intend to evaluate the usefulness of the identified grey documents to candidate searchers.
For example, does the Dan Luu article have value in a multi-vocal literature review? We intend to both extend
our searches for software testing, and undertake complementary searches for other software engineering topics e.g.
agile. We plan to apply our criteria and approach to other kinds of grey documents. For example, the dataset of
Python Enhancement Proposal (PEP) emails. We also need to extend, validate and evaluate our criteria (e.g. the
keywords used for experience indicators). Finally, we also recognise the need to incorporate other search engines
(e.g. Bing, Yahoo) into our work.

5.5 Conclusion
In this paper, we have used a novel method for searching grey documents to create a dataset of practitioner-
generated, written content. This dataset was collected over 28 days using the Google Custom Search API. We
include both reasoning and experience indicators within the queries to help identify the type of grey documents
that we are interested in (practitioner online articles). We have then analysed this dataset to try to understand
how practitioners cite research on software testing. We found very little evidence of practitioners citing software
testing research in their online written articles. By comparison, we also found that practitioners rarely cite
the articles of other practitioners. One explanation relates to academic good practice (e.g. the expectation to cite sources).

Acknowledgement
Thanks to LB&Co for the countless stream of hot chocolates with Sante Bars dipped into them that have been
consumed within the past 12 months, without which this research would likely not have been possible.

References

[1] Jean Adams et al. "Searching and synthesising 'grey literature' and 'grey information' in public health: critical reflections on three case studies". In: Systematic Reviews 5.1 (2016), p. 164.
[2] John Bailey et al. "Evidence relating to Object-Oriented software design: A survey". In: Empirical Software Engineering and Measurement, 2007. ESEM 2007. First International Symposium on. IEEE. 2007, pp. 482–484.
[3] Bruno Cartaxo. "Integrating evidence from systematic reviews with software engineering practice through evidence briefings". In: Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering. ACM. 2016, p. 6.
[4] Bruno Cartaxo et al. "Evidence briefings: Towards a medium to transfer knowledge from systematic reviews to practitioners". In: Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. ACM. 2016, p. 57.
[5] Prem Devanbu, Thomas Zimmermann, and Christian Bird. "Belief & evidence in empirical software engineering". In: Proceedings of the 38th International Conference on Software Engineering. ACM. 2016, pp. 108–119.
[6] Oscar Dieste and Natalia Juristo. "Systematic review and aggregation of empirical studies on elicitation techniques". In: IEEE Transactions on Software Engineering 37.2 (2011), pp. 283–304.
[7] Vahid Garousi and Michael Felderer. "Living in two different worlds". In: ().
[8] Vahid Garousi, Michael Felderer, and Mika V. Mäntylä. "Guidelines for including the grey literature and conducting multivocal literature reviews in software engineering". In: arXiv preprint arXiv:1707.02553 (2017).
[9] Vahid Garousi, Michael Felderer, and Mika V. Mäntylä. "The need for multivocal literature reviews in software engineering: complementing systematic literature reviews with grey literature". In: Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering. ACM. 2016, p. 26.
[10] Vahid Garousi and Mika V. Mäntylä. "When and what to automate in software testing? A multi-vocal literature review". In: Information and Software Technology 76 (2016), pp. 92–117.
[11] Vahid Garousi and Junji Zhi. "A survey of software testing practices in Canada". In: Journal of Systems and Software 86.5 (2013), pp. 1354–1376.
[12] Vahid Garousi et al. "What industry wants from academia in software testing?: Hearing practitioners' opinions". In: Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering. ACM. 2017, pp. 65–69.
[13] Katelyn Godin et al. "Applying systematic review search methods to the grey literature: a case study examining guidelines for school-based breakfast programs in Canada". In: Systematic Reviews 4.1 (2015), p. 138.
[14] Børge Haugset and Geir Kjetil Hanssen. "Automated acceptance testing: A literature review and an industrial case study". In: Agile, 2008. AGILE '08. Conference. IEEE. 2008, pp. 27–38.
[15] Kentaro Inui et al. "Experience mining: Building a large-scale database of personal experiences and opinions from web documents". In: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01. IEEE Computer Society. 2008, pp. 314–321.
[16] Valentin Jijkoun et al. "Mining user experiences from online forums: an exploration". In: Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media. Association for Computational Linguistics. 2010, pp. 17–18.
[17] Takeshi Kurashima, Taro Tezuka, and Katsumi Tanaka. "Mining and visualizing local experiences from blog entries". In: DEXA. Springer. 2006, pp. 213–222.
[18] David Lo, Nachiappan Nagappan, and Thomas Zimmermann. "How practitioners perceive the relevance of software engineering research". In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM. 2015, pp. 415–425.
[19] Mika V. Mäntylä et al. "On rapid releases and software testing: a case study and a semi-systematic literature review". In: Empirical Software Engineering 20.5 (2015), pp. 1384–1425.
[20] Seyed Mehdi Nasehi et al. "What makes a good code example?: A study of programming Q&A in StackOverflow". In: Software Maintenance (ICSM), 2012 28th IEEE International Conference on. IEEE. 2012, pp. 25–34.
[21] Julio Cesar Sampaio do Prado Leite and Claudia Cappelli. "Software transparency". In: Business & Information Systems Engineering 2.3 (2010), pp. 127–139.
[22] Dudekula Mohammad Rafi et al. "Benefits and limitations of automated software testing: Systematic literature review and practitioner survey". In: Proceedings of the 7th International Workshop on Automation of Software Test. IEEE Press. 2012, pp. 36–42.
[23] Austen Rainer. "Using argumentation theory to analyse software practitioners' defeasible evidence, inference and belief". In: Information and Software Technology (2017).
[24] Austen Rainer and Stephen Gale. "Evaluating the quality and quantity of data on open source software projects". In: Proceedings of the 1st International Conference on Open Source Software. 2005.
[25] Pankajeshwara Sharma et al. "Investigating developers' email discussions during decision-making in Python language evolution". In: Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering. ACM. 2017, pp. 286–291.
[26] Tom De Smedt and Walter Daelemans. "Pattern for Python". In: Journal of Machine Learning Research 13.Jun (2012), pp. 2063–2067.
[27] Margaret-Anne Storey et al. "The impact of social media on software engineering practices and tools". In: Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research. ACM. 2010, pp. 359–364.
[28] Margaret-Anne Storey et al. "The (r)evolution of social media in software engineering". In: Proceedings of the Future of Software Engineering. ACM. 2014, pp. 100–116.
[29] Margaret-Anne Storey et al. "Using a Visual Abstract as a Lens for Communicating and Promoting Design Science Research in Software Engineering". In: 11th International Symposium on Empirical Software Engineering and Measurement (ESEM), 2017. IEEE. 2017.
[30] Bogdan Vasilescu, Vladimir Filkov, and Alexander Serebrenik. "StackOverflow and GitHub: Associations between software development and crowdsourced knowledge". In: Social Computing (SocialCom), 2013 International Conference on. IEEE. 2013, pp. 188–195.
[31] Ashley Williams. "Using reasoning markers to detect rigour of software practitioners' blog content for grey literature reviews (GLRs)". In: Submitted to EASE 2018.
