
A Landscape of the Adoption of Empirical Evaluations in the Brazilian Symposium on Human Factors in Computing Systems

Adson Damasceno (State University of Ceará, Fortaleza, Brazil, adson.damasceno@aluno.uece.br)
Andressa Ferreira (State University of Ceará, Fortaleza, Brazil, andressa.magda@aluno.uece.br)
Eliakim Gama (State University of Ceará, Fortaleza, Brazil, eliakim.gama@aluno.uece.br)
José Paulo Rodrigues Moraes (State University of Ceará, Fortaleza, Brazil, jpaulo.moraes@aluno.uece.br)
Lucas Vieira Alves (State University of Ceará, Fortaleza, Brazil, lucas.vieira@aluno.uece.br)
Marx Haron Barbosa (State University of Ceará, Fortaleza, Brazil, marx.barbosa@aluno.uece.br)
Matheus Lima Chagas (State University of Ceará, Fortaleza, Brazil, matheus.chagas@aluno.uece.br)
Emmanuel Sávio Silva Freire (Federal Institute of Ceará, Morada Nova, Brazil, savio.freire@ifce.edu.br)
Mariela Inés Cortés (State University of Ceará, Fortaleza, Brazil, mariela@larces.uece.br)

ABSTRACT
Context: Empirical evaluations have been widely applied as a formalism to validate and ensure the credibility of research works. In the Human-Computer Interaction (HCI) field, the usage of empirical methods plays a vital role due to its focus on evaluating the end-user and the usability of software solutions. Even though the adoption of experimental evaluation techniques has gained popularity in recent years, its application is still questioned both qualitatively and quantitatively. Goal: To analyze how empirical research has evolved in the Brazilian Human-Computer Interaction community. Method: We performed a quasi-experiment using the papers published over the 17 editions of the Brazilian Symposium on Human Factors in Computing Systems. Our experiment was divided into two phases, in order to classify the type of empirical study performed and to assess its quality, respectively. Results: From the sample of 231 studies, 113 reached satisfactory quality in the first phase. It was found that empirical studies have become more present in the latest years of the conference. Moreover, in the second phase, it was found with 95% confidence that the quality of the empirical studies increased, by comparing two different divisions of the 20-year conference period. Findings: The assessment questionnaire used in our research and the lessons learned about each kind of empirical study can support the conduction of new empirical evaluations in the HCI community, improving their quality.

CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI.

KEYWORDS
empirical evaluation, research protocol, quality assessment, human-computer research

ACM Reference Format:
Adson Damasceno, Andressa Ferreira, Eliakim Gama, José Paulo Rodrigues Moraes, Lucas Vieira Alves, Marx Haron Barbosa, Matheus Lima Chagas, Emmanuel Sávio Silva Freire, and Mariela Inés Cortés. 2019. A Landscape of the Adoption of Empirical Evaluations in the Brazilian Symposium on Human Factors in Computing Systems. In Proceedings of the 18th Brazilian Symposium on Human Factors in Computing Systems (IHC 2019), October 22–25, 2019, Vitoria - ES, Brazil. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3357155.3358465

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
IHC 2019, October 22–25, 2019, Vitoria - ES, Brazil
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6971-8/19/10...$15.00
https://doi.org/10.1145/3357155.3358465
1 INTRODUCTION
In the Human-Computer Interaction (HCI) field, specialists need to understand the psychological, organizational, and social factors of the combined human and computer system in order to build competitive software interfaces and evaluate their effects [24]. Hence, the usage of empirical methods based on systematic observations and experiments is a necessity in this field [2].

The main event in the HCI area in Brazil is the Brazilian Symposium on Human Factors in Computing Systems (IHC, the acronym in Portuguese), which reached its 17th edition in 2018 and annually gathers professionals from industry, lecturers, researchers, and students interested in scientific investigation and practices related to creating, building, and evaluating computing solutions to be used by people. In this context, empirical evaluation can help HCI specialists understand the social, physical, and cognitive environments and their effects as part of the interface design process, in order to develop methodologies that aid appropriate HCI design. We use the term IHC Symposium to identify the Brazilian Symposium in the remainder of this paper.

Empirical methods have been able to expand scientific knowledge because they can guide the development of new technologies and contribute information that supports the decision-making process, both in industry and in the services area [23]. Empirical evaluations have been widely applied as a systematic approach to validate and ensure the credibility of the works proposed by researchers [16, 21, 26]. Even though the adoption of empirical evaluation techniques has gained popularity in recent years, its application has been questioned both qualitatively and quantitatively by numerous studies [3, 18, 28].

Our work aims to analyze how empirical research has evolved in the Brazilian Human-Computer Interaction community. Therefore, we performed a controlled quasi-experiment divided into two phases: a classification process and a quality assessment. In the first one, we examined 231 papers to determine whether each of them comprises an empirical evaluation. In the second one, the papers that include empirical studies were analyzed from the researchers' perspective, to classify them as experiments, case studies, systematic reviews, or surveys. As a result, we aim to show the evolution of the utilization of experimental methods in HCI research, in quantitative and qualitative aspects.

In this paper, we developed a protocol to support the classification process and conducted a quality assessment process. The protocol includes checklists based on quality assessment criteria from the literature. In doing so, we aim at improving the conduction of studies and, consequently, helping to produce more reliable results by reducing or eliminating biases. All the material used in the execution of this experiment is available at http://split.to/ydo9X1C.

The remainder of this paper is organized as follows. Section 2 provides an overview of the types of empirical investigation. In Section 3, we discuss some of the main related work. Section 4 describes the composition of the research protocol and its operation. Section 5 presents the results, and Section 6 introduces the main inferences obtained from the results. Finally, Sections 7 and 8 present threats to validity and conclusions, respectively.

2 BACKGROUND
Empirical evaluations provide a framework to evaluate theories, verify hypotheses, and answer research questions through observations or experiments. One of the goals of empirical evaluations is to provide means that can be integrated with practical experience and human values in the decision-making process regarding the development and maintenance of software [11].

Empirical evaluations can be categorized as (quasi) controlled experiments, case studies, systematic reviews, and surveys [14, 16]. These empirical evaluations are characterized by a well-defined protocol that researchers should follow during the evaluation process. Since our study aims to evaluate the quality of these empirical evaluation categories, we present them below in detail.

An experiment is a method widely used in many areas of science to test an established hypothesis by finding the effect of variables of interest on the outcome variables [16]. Experiments can be oriented by humans or by technology. In the first case, humans apply treatments to objects, whereas in the second one, tools are responsible for performing the experiment, providing greater control. The usage of experiments is recommended to confirm theories and widespread knowledge or to evaluate models [26]. The experiment process involves scope definition, planning, operation, analysis, interpretation, and the experimentation report.

Case study research is a standard empirical method used in various sciences such as sociology, medicine, and psychology. A case study can be defined as an empirical study that investigates a single entity or contemporary phenomenon within its real-life context [26] and a specific time-space. In the case study process, researchers must not interfere, directly or indirectly, since it is a purely observational study of real-world scenarios. The primary sources of qualitative data are interviews, group discussions, and observations.

A systematic literature review provides a comprehensive and valid landscape of the current position of the literature in an area, covering the identification, analysis, and interpretation of evidence to answer research questions. Systematic reviews are methodologically undertaken with a specific search strategy and a well-defined methodology [26]. As reported by Kitchenham and Charters [7], the most common reasons to conduct a systematic review are: to summarize the existing evidence on a specific topic, to identify gaps in current research and suggest areas for further investigation, and to provide a background to position new research activities appropriately. The process to conduct a systematic review includes activities for planning, conducting, and reporting the review.
A survey is conducted to collect information, consistently and systematically, from or about people, in order to describe, compare, predict, or explain their knowledge, attitudes, and behavior [4]. A survey can be useful for determining the characteristics of a population, comparing groups, and making explanatory claims about a population [26]. In a survey, the primary techniques to collect quantitative or qualitative data are interviews and questionnaires.

There are other types of empirical evaluations which do not have a well-defined protocol, such as discussion [28], example application [21], experience report [19], exploratory study [13], and meta-analysis [5, 28]. However, our work concentrates on the types that are commonly used in the HCI area [14]: experiment, case study, systematic literature review, and survey.

3 RELATED WORK
This section presents an analysis of some works related to our paper's goal. We looked for works that propose (quasi) experiments to assess the quality of empirical studies published in research papers.

Zannier et al. [28] performed a quasi-random experiment on peer-reviewed papers published by the International Conference on Software Engineering (ICSE) across its past 29 years of proceedings. The experiment evaluated the quantity and soundness of the performed empirical evaluations. In order to perform the experiment, 5% of the papers of the total population were randomly drawn, and the results showed that 70% of the sample contained some form of evaluation. The experiment also indicated an increase in the number of published papers and a slight improvement in the papers' soundness. However, this study [28] did not define a formal protocol which could be used by other researchers to conduct or replicate quality assessments of empirical evaluations.

Silveira Neto et al. [22] investigated the evolution of the Software Engineering (SE) field by analyzing the first 24 editions of the Brazilian Symposium on Software Engineering (SBES, the acronym in Portuguese). Combining good practices of scoping studies and systematic reviews, the authors analyzed the state of the art regarding SE in Brazil and compared it with the international community, using the International Conference on Software Engineering (ICSE) as a similar scenario. Furthermore, the authors performed a quantitative analysis of empirical research in published papers and presented results that showed an increase in the use of empirical methods in recent years. Nonetheless, this study did not address the quality of the empirical evaluations.

The research of Barbosa et al. [1] empirically analyzed the papers published in SBES, presenting quantitative and qualitative results of a quasi-random experiment on empirical evaluations in the past ten years of proceedings. The results showed that several analyzed papers had limitations regarding the quality of the application of the empirical methods, such as a significant lack of definition about the empirical study type performed in the analyzed papers.

Although the research community has been interested in the assessment of empirical studies, we only found such evaluations in the SE field.

4 RESEARCH PROTOCOL
We defined a research protocol based on Barbosa et al. [1]. We needed to adapt Barbosa's protocol in order to reduce the risk of wrong identification and classification of the empirical study type and to guarantee a sound quality assessment. Consequently, the classification process, the research questions, and the checklist went through improvements.

Hypotheses
Our empirical evaluation took into account the following hypotheses, which address the quantitative and qualitative aspects of our study:

• H01. The quantity of empirical evaluations has not increased over the IHC Symposium proceedings.
• H11. The quantity of empirical evaluations performed has increased over the IHC Symposium proceedings.
• H02. The quality of empirical evaluations has not increased over the IHC Symposium proceedings.
• H12. The quality of empirical evaluations has increased over the IHC Symposium proceedings.

In order to statistically evaluate these hypotheses, one way of analyzing the results is to divide the sample into two populations (P1 and P2), where P1 and P2 represent the population from the oldest period and the population from the most recent period, respectively. Although a year-by-year comparison might be the ideal strategy, the usage of only two separate populations has been applied satisfactorily in experimental studies [1, 28], allowing the proposed hypotheses to be accepted or rejected.

In this sense, we can reject H01 and accept H11 whenever the ratio given by the number of empirical studies over the total number of works in P2 is greater than in P1. Otherwise, H11 is rejected and H01 is accepted. Moreover, to know whether H02 can be rejected and, consequently, H12 can be accepted, we performed the Mann-Whitney U Test [16, 26] on the quality rates obtained by the papers of both populations, and we also calculated the effect size between them. Such a test determines whether two samples are statistically different for a given confidence level.

As proposed by Vargha and Delaney [25], the effect size we computed is a real number ranging from 0 to 1. This number represents the relative frequency with which a value in one population is greater than a value in the other. We use Â21 to represent the effect size of P2 in relation to P1. Thus, the greater Â21, the bigger the values in P2 are in comparison to the ones in P1.
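To make this statistical procedure concrete, the sketch below shows how the Mann-Whitney U test and the Vargha-Delaney Â21 measure can be computed in Python, assuming SciPy is available. It is a minimal sketch, not the scripts we used: the quality-rate vectors are illustrative placeholders, and the magnitude labels follow commonly cited Vargha-Delaney thresholds, which may differ slightly from the cutoffs applied in our tables.

```python
from scipy.stats import mannwhitneyu

def vargha_delaney_a(p2, p1):
    """A21: probability that a random value from P2 exceeds one from P1 (ties count half)."""
    greater = sum(1 for x in p2 for y in p1 if x > y)
    ties = sum(1 for x in p2 for y in p1 if x == y)
    return (greater + 0.5 * ties) / (len(p1) * len(p2))

def magnitude(a):
    """Commonly cited Vargha-Delaney cutoffs on |A - 0.5| (an assumption, see above)."""
    d = abs(a - 0.5)
    if d < 0.06:
        return "negligible"
    if d < 0.14:
        return "small"
    if d < 0.21:
        return "medium"
    return "large"

# Illustrative quality rates for the older (P1) and more recent (P2) periods.
p1 = [0.50, 0.55, 0.60, 0.45, 0.70]
p2 = [0.65, 0.80, 0.75, 0.60, 0.85]

_, p_value = mannwhitneyu(p1, p2, alternative="two-sided")
a21 = vargha_delaney_a(p2, p1)
print(f"p-value = {p_value:.4f}, A21 = {a21:.3f} ({magnitude(a21)})")
```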
Research Questions
Aiming at providing an assessment of the current status of the empirical evaluations presented in the papers over the IHC Symposium, we formulated the following Research Questions (RQ):

• RQ1: Has the quantity of performed empirical evaluations increased over the IHC Symposium lifetime (from 1998 until 2018)? Rationale: Empirical evaluations are necessary for the HCI area because they allow theory to be compared with the real world. In this sense, we expected that the number of studies using empirical assessments had risen in the considered period.
• RQ2: What were the most widely used empirical evaluation methods? Rationale: This RQ has the purpose of identifying the most used empirical evaluation types. In order to answer this RQ, we considered the following kinds of empirical evaluations: experiment, case study, systematic literature review, and survey.
• RQ3: Were the empirical evaluation methods well implemented? Rationale: This RQ aims at investigating whether the empirical evaluations used in the papers under analysis were, in fact, adequately conducted by their authors. We only evaluated the quality of the papers classified as case studies, experiments, systematic reviews, or surveys.

Variables
The independent variables of this work are the checklists, the period under analysis, and the chosen conference. In turn, the dependent variable explored in this work is the quality of the empirical strategies, depending on the type of the paper and the reviewers' evaluation.

Material
The IHC Symposium is originally an annual event, except for the period between 2003 and 2010, in which the symposium was biennial. In this period, the conference had only four editions (2004, 2006, 2008, and 2010). In total, the conference had 17 editions until 2018. Given these editions and our research goal, we defined the following inclusion and exclusion criteria:

Inclusion criteria: Research full papers published from 1998 until 2018 in the main track of the conference.
Exclusion criteria: Short papers, invited talks, panels, banners, tutorials, or tool session papers.

Due to these criteria, we obtained a population size of 470 papers. From this population, we randomly drew half of the number of papers per year; whenever this number was not an integer, we rounded it up. At the end of this process, we obtained 243 papers. However, 12 papers were not available in the online IHC Symposium proceedings, leaving 231 papers in our sample, which represents 49.15% of the population. Also, we calculated that a 95% confidence level and a 5% confidence interval require a minimum of 212 papers. Therefore, our sample size is within the required confidence level range.
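The protocol does not spell out the sample-size calculation. A minimal sketch that reproduces the minimum of 212 papers, assuming Cochran's formula with finite-population correction and maximum variability (p = 0.5, z = 1.96 for 95% confidence), is shown below; this is one standard way to obtain that number, not a formula stated in the original protocol.

```python
import math

def min_sample_size(population, z=1.96, margin=0.05, p=0.5):
    # Cochran's formula with finite-population correction (assumed, see above):
    # n = N * z^2 * p * (1 - p) / (e^2 * (N - 1) + z^2 * p * (1 - p))
    z2pq = z ** 2 * p * (1 - p)
    return math.ceil(population * z2pq / (margin ** 2 * (population - 1) + z2pq))

print(min_sample_size(470))  # -> 212, the minimum reported above
```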
Procedure
Initially, we elaborated a checklist to capture the general description common to empirical assessments and to identify specific descriptions related to each type of assessment used in the papers. This checklist is available at http://split.to/61WnyBI. We randomly defined four groups of researchers composed of two members each, considering the first eight authors. The members of these groups did not change during the execution of our study. More specifically, the first seven authors are students of the Academic Master's in Computer Science at the State University of Ceará. They have some level of experience in software development and have recently completed the course Empirical Studies in Software Engineering as a requirement of the program. The team also comprises a doctoral student (penultimate author) and a doctor (last author) in the field.

A pilot project was conducted in order to validate the elaborated questions and synchronize each researcher's understanding of them. We used renowned research publications about each kind of empirical study (case study [6], experiment [27], systematic review [12], and survey [15]), and the group of researchers used the checklist to assess these publications. Our goal was to identify whether the checklist correctly guaranteed the identification of high-quality papers for each type of empirical evaluation. Moreover, we analyzed 8 papers (2 for each kind of empirical study) from the IHC proceedings after the improvements were applied to the checklist. The papers used in the pilot were not part of the sample.

After this, we performed two more phases. Firstly, we carefully examined each of the 231 papers to determine whether it contained the main empirical evaluation components, based on the general questions in the checklist. Secondly, we analyzed all the papers from the previous phase that contained an empirical study to determine its type, following the classification scheme and the quality assessment based on Section 2. In both phases, at least two researchers independently reviewed each paper to mitigate personal bias. In the following sections, we detail both phases.

Classification Process. In this process, as illustrated in Figure 1, papers were randomly assigned to pairs of reviewers. This process followed all the defined RQs. In order to answer RQ1, each pair analyzed each study to determine whether the validation of the contribution used any empirical evaluation. In this sense, we formulated general questions to identify experimental evidence in the validation process. These questions are standard for all empirical study types. Each pair of reviewers analyzed their group of papers individually.

After that, for the papers identified with empirical evaluations, the pair of reviewers answered the questions about the four categories described in Section 2. Each reviewer did this classification individually, to answer RQ2 and RQ3. The classification of the studies followed the empirical method used to validate the contribution. We documented this process collaboratively in a spreadsheet in Google Docs. When the reviewers entered the data extracted from the papers, they provided a short rationale to justify the classification of the paper into a particular kind of empirical study.

From the final spreadsheet, we calculated the frequencies of papers in each category. Moreover, in both steps, each reviewer from each pair individually answered the questions about the identification and the categorization of the empirical assessment performed in the papers. Whenever the title, abstract, keywords, introduction, and conclusion were not sufficient for this identification, the entire paper was read. Next, each pair of reviewers compared their answers and tried to solve the detected inconsistencies. When the pair of reviewers could not resolve a divergence, the last author supported its resolution.

Figure 1: Classification and quality assessment process.

Quality Assessment Process. In order to answer RQ3, we performed a quality assessment process applied to the papers that included empirical studies, i.e., those classified as experiments, case studies, systematic reviews, or surveys in the previous phase, which totaled 113 papers. These types can be characterized, and quality attributes for them can be established from the literature.

We created a checklist for each empirical method based on quality assessment criteria from the literature [8, 15, 20, 26]. Each question is evaluated as "Yes" (indicating that data for the specific question is clearly available), "Partially" (data is vaguely available), or "No" (data is unavailable), with a corresponding score of 1, 0.5, or 0, respectively. We defined this scale because a simple Yes/No answer may be misleading. This decision is also in line with the quality checklist recommendations presented by Kitchenham [9]. Some of the questions were not applicable (N/A) to some studies; these studies were not evaluated for those questions.

To assess a paper, we added the scores for each question and computed the percentage over the number of applicable questions for that paper. We call the resulting number the quality rate.
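A minimal sketch of this scoring scheme follows; the answer labels mirror the scale above, and the example answers are hypothetical.

```python
# Scores follow the scale described above: Yes = 1, Partially = 0.5, No = 0;
# questions marked N/A are excluded from the denominator.
SCORES = {"yes": 1.0, "partially": 0.5, "no": 0.0}

def quality_rate(answers):
    """answers: checklist answers for one paper, e.g. ['yes', 'n/a', 'partially']."""
    applicable = [a for a in answers if a != "n/a"]
    if not applicable:
        return None  # nothing applicable: the paper cannot be rated
    return sum(SCORES[a] for a in applicable) / len(applicable)

# Hypothetical answers for one paper's checklist (illustrative only).
print(quality_rate(["yes", "partially", "no", "n/a", "yes"]))  # -> 0.625
```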

5 RESEARCH RESULTS
In this section, we provide results for the entire sample in order to answer the research questions.

Quantity of performed empirical evaluations (RQ1)
In order to reject the null hypothesis H01, it was necessary to divide the period of the IHC Symposium proceedings into two groups, P1 and P2, and compare them to check whether there was an increase in P2 with respect to P1 over the years. Thus, 16 configurations were developed and analyzed, represented by the expression Ci = {P1i, P2i}, where i ranges from 1 to 16 (see Table 1) and

    P1i = A1 ∪ A2 ∪ ... ∪ Ai   and   P2i = A(i+1) ∪ ... ∪ A17.

Each term Ax belongs to the set S of all conference years present in the sample. Table 1 details how the papers of the sample (231) are distributed into P1i and P2i along the configurations, indicating the rate of papers whose authors used any empirical method to validate their contribution, from the reviewers' perspective.

To perform this comparison, we used the Mann-Whitney U test [17], considering a confidence level of 95% (α = 0.05), and the Vargha-Delaney A measure for effect size [25].
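A minimal sketch of how the configurations Ci can be generated is shown below; the `editions` structure is hypothetical (one list per IHC edition in chronological order, with True marking papers that contain an empirical evaluation), and it only illustrates the partition rule, not our actual tooling.

```python
def configurations(editions):
    """Yield (i, P1, P2): papers from the first i editions vs. the remaining ones."""
    for i in range(1, len(editions)):  # i = 1..16 for 17 editions
        p1 = [paper for edition in editions[:i] for paper in edition]
        p2 = [paper for edition in editions[i:] for paper in edition]
        yield i, p1, p2

# Each (P1, P2) pair can then be compared with the Mann-Whitney U test and the
# Vargha-Delaney A measure, as in the sketch shown earlier.
# Illustrative data: 3 editions with 2, 3, and 2 papers; True = empirical study.
editions = [[False, True], [False, False, True], [True, True]]
for i, p1, p2 in configurations(editions):
    print(i, sum(p1) / len(p1), sum(p2) / len(p2))  # per-group empirical rates
```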
Splitting the conference lifetime approximately in half, P1 represents the period from 1998 to 2008 (8 editions) and P2 refers to the period from 2010 to 2018 (9 editions) (see line 8 of Table 1). Considering this configuration, the data show an increase of 36.69% in the number of empirical studies. Since the obtained p-value is less than 5%, this increase represents a statistically significant difference. Moreover, the effect size shows that the magnitude of the difference is medium. Thus, we can reject the null hypothesis (H01) and conclude that there is an increase in the number of empirical studies, considering the comparison between these two periods.

Table 1: Distribution of studies with empirical evaluation.

Ci   Sample Size    Emp. Studies   Means                    p-value    Â21
     P1    P2       P1    P2       P1          P2
1    8     223      1     112      0.125       0.5022422    0.03666    0.6886211 (medium)
2    14    217      2     111      0.1428571   0.5115207    0.007668   0.6843318 (medium)
3    22    209      2     111      0.09090909  0.5311005    8.92E-05   0.7200957 (medium)
4    33    198      5     108      0.1515152   0.5454545    2.91E-05   0.6969697 (medium)
5    39    192      7     106      0.1794872   0.5520833    2.31E-05   0.6862981 (medium)
6    46    185      8     105      0.173913    0.5675676    1.86E-06   0.6968273 (medium)
7    56    175      11    102      0.1964286   0.5828571    5.09E-07   0.6932143 (medium)
8    69    162      16    97       0.2318841   0.5987654    3.52E-07   0.6834407 (medium)
9    79    152      23    90       0.2911392   0.5921053    1.49E-05   0.650483 (small)
10   95    136      30    83       0.3157895   0.6102941    1.11E-05   0.6472523 (small)
11   110   121      38    75       0.3454545   0.6198347    3.23E-05   0.6371901 (small)
12   123   108      47    66       0.3821138   0.6111111    0.0005297  0.6144986 (small)
13   140   91       58    55       0.4142857   0.6043956    0.004847   0.5950549 (small)
14   162   69       73    40       0.4506173   0.5797101    0.07325    0.5645464 (negligible)
15   179   52       82    31       0.4581006   0.5961538    0.08048    0.5690266 (negligible)
16   203   28       98    15       0.4827586   0.5357143    0.6012     0.5264778 (negligible)

In the last 3 configurations, the comparisons between P1 and P2 presented a p-value greater than α (5%), indicating that we cannot reject the null hypothesis H01 with 95% confidence. This finding could be an indication that, in recent years, the proportion of empirical studies has remained stable. Even if the obtained p-values were less than 0.05 and the effect sizes remained the same, they would show that the difference is negligible.

The most used empirical evaluations (RQ2)
Table 2 shows the number of each type of study in the sample (231 studies), from the authors' perspective and from the reviewers' perspective. Since other types of empirical evaluations are not addressed, they are labeled as "Others", and the studies that did not present any type of study that could be regarded as empirical are labeled as "No Empirical Evaluation". From the reviewers' perspective, the studies that did not reach a satisfactory quality to be classified as empirical studies are also labeled as "No Empirical Evaluation".

Table 2: Types of studies included in the sample.

                          Authors' Perspective   Reviewers' Perspective
Others                    93                     0
No Empirical Evaluation   53                     118
Case Study                43                     80
Experiment                23                     13
Systematic Review         12                     16
Survey                    7                      4

Considering the authors' perspective and the types of studies addressed in the protocol, Case Study was the most used empirical evaluation, present in 43 studies, 18.61% of the sample. The second most used was Experiment, used in 23 studies, 9.96% of the sample. In third and fourth place are Systematic Review and Survey, with 12 and 7 studies present in the sample, representing 5.19% and 3.03% of the total amount, respectively.

From the reviewers' perspective, note that 118 studies were classified as not empirical, and the quality assessment was then done with the remaining 113 empirical studies. In this perspective, the most used empirical evaluation was Case Study, with 80 studies, 34.63% of the complete sample (231 studies). Differently from the authors' perspective, Systematic Review was the second most used empirical evaluation, with 16 studies, which corresponds to 6.92% of the sample size. Experiment was third, with 13 studies (5.62%), followed by Survey, with 4 studies (1.73%).

After executing the first phase of the evaluation, there were several cases in which the authors' perspective differed from the reviewers' perspective (see Table 3). From the 43 studies that were supposed to contain a case study, 1 was classified as a Systematic Review, and 18 others did not even reach the minimum quality as empirical studies to be selected. Similarly, from the 23 studies alleged to contain an experiment, 8 were classified as Case Studies, 1 as a Survey, and 7 were not selected in the first phase. The Systematic Review type presented only one divergence, which was a study that was not selected as an empirical study. Finally, from the 7 Surveys, 3 were actually classified as case studies and 1 as an experiment. This result indicates a lack of clear understanding by the authors about the characteristics of each empirical method and its applicability in a specific context.

Table 3: Disagreements in the classification of the studies.

                    Case Study  Experiment  Systematic Review  Survey  Not Qualified
Case Study          0           0           1                  0       18
Experiment          8           0           0                  1       7
Systematic Review   0           0           0                  0       1
Survey              3           1           0                  0       0

Quality of the empirical evaluation methods (RQ3)
Similarly to the evaluation of the quantity of performed empirical evaluations, we performed several comparisons considering all the possible partitions of the sample of studies with empirical evaluation (113) into two groups, P1 for the oldest period and P2 for the most recent period. For each comparison, we performed a Mann-Whitney U Test and calculated the effect size of the difference with the Vargha-Delaney A measure.
Initially, we considered the partition analyzed in the previous subsection, splitting the conference lifetime approximately in half and assigning the papers from the first 8 editions to P1 and the rest to P2. Table 4 shows the P1 and P2 sizes, the quality rate average of each group (see the Quality Assessment Process in Section 4), the p-value, and the effect size between P1 (1998 to 2008) and P2 (2010 to 2018; there was no conference in 2009) for each empirical evaluation type and for all studies together. In this comparison, for each type of empirical evaluation, the obtained p-value is greater than α = 5%. Thus, we cannot reject the null hypothesis H02 with 95% confidence for any type of empirical evaluation. For the overall comparison, the obtained p-value is also greater than α = 5%. Therefore, we cannot reject the null hypothesis H02 with 95% confidence considering the overall set as well.

Table 4: Quality rate averages, p-values and effect sizes between P1 (1998 to 2008) and P2 (2010 to 2018).

Empirical Method    Sample Size   Quality Rate Average   p-value   Â21
                    P1    P2      P1          P2
All                 16    97      0.6291625   0.6914948  0.2356    0.5924613 (small)
Case Study          11    69      0.5454545   0.6608696  0.05058   0.6824769 (medium)
Experiment          3     10      0.8888667   0.75       0.3819    0.3166667 (medium)
Systematic Review   0     16      -           0.7734375  -         -
Survey              2     2       0.7         0.8        0.6171    0.75 (large)

After evaluating all possible partitions, it was not possible to reject the null hypothesis, and therefore not possible to demonstrate an increase in the quality of the articles, except for the configuration where P1 represents the period from 1998 to 2014 and P2 represents the period from 2015 to 2018. Table 5 presents the quality rates and the results of the Mann-Whitney U Test and the Vargha-Delaney A measure for this case. Analyzing each type of empirical evaluation individually, we cannot reject the null hypothesis H02 with 95% confidence, since the obtained p-values are greater than α = 5%. However, considering all the studies in the comparison, the obtained p-value is less than 5%. This indicates that the difference in terms of quality rate between these two periods is statistically significant. Moreover, the quality average in P2 is greater than in P1, and although the effect size shows that the magnitude of this difference is small, it is not negligible. Thus, we can reject H02 and affirm the alternative hypothesis H12, which means that the quality of the empirical studies has increased over the IHC Symposium proceedings, considering this partition.

Table 5: Quality rate averages, p-values and effect sizes between P1 (1998 to 2014) and P2 (2015 to 2018).

Empirical Method    Sample Size   Quality Rate Average   p-value   Â21
                    P1    P2      P1          P2
All                 58    55      0.6454017   0.7219691  0.04953   0.6065831 (small)
Case Study          45    35      0.6155556   0.6828571  0.1262    0.5990476 (small)
Experiment          6     7       0.80555     0.7619     1         0.5 (negligible)
Systematic Review   5     11      0.7         0.8068182  0.6386    0.5818182 (small)
Survey              2     2       0.7         0.8        0.6171    0.75 (large)

In short, it was found, with 95% confidence, that it is possible to reject the null hypothesis and affirm the alternative hypothesis, considering all the empirical studies, when the periods considered are 1998-2014 and 2015-2018. Therefore, we can state that there is evidence of an improvement in the quality of the works published in the symposium from the 2015 edition onward.

6 DISCUSSION OF KEY FINDINGS
In this section, we draw some conclusions about each type of empirical investigation based on the quality evaluations carried out by the reviewers. For each type of empirical investigation, we analyzed the resulting data, obtaining conclusions that allow for suggestions and critiques about how the authors of the IHC Symposium have conducted empirical studies over the years.

General Considerations about the Studies
Considering all the 231 papers from the sample, Table 6 presents the percentage distribution of the studies from the authors' perspective versus the researchers' perspective. As we can notice, 40.26% of the studies are labeled as "Others", involving studies that do not fit into any of the four types evaluated from the authors' perspective. From the researchers' perspective, these papers could be distributed into the four addressed empirical types or considered without the minimum required quality.

Table 6: Distribution of the Studies (%) by perspectives.

Perspective   Case Study   Experiment   Sys. Review   Survey     Others       No Empirical Evaluation
Author        18.61 (43)   9.96 (23)    5.19 (12)     3.03 (7)   40.26 (93)   22.94 (53)
Researcher    34.63 (80)   5.63 (13)    6.93 (16)     1.73 (4)   0.00         51.08 (118)

We can notice that over half (51.08%) of the papers did not present empirical validation or had poor quality. The latter characteristic makes the replication of studies extremely difficult or unfeasible and compromises the validity of the results. On the other hand, the considerable discrepancy in the classification of empirical studies (Case Studies at 18.61% from the authors' perspective versus 34.63% from the researchers' perspective) suggests a deviation in the authors' perspective about the requirements and suitability of each empirical study. This finding leads us to question the extent to which authors know about empirical evaluations and understand what was being evaluated.

Considerations about Case Studies
Regarding case studies, there were 24 papers with case studies in which the authors' perspective was confirmed by the researchers' perspective. This confirmation was given through scores extracted from the specific checklist for this type of empirical study.
In total, 80 papers with case studies were found and analyzed by the researchers, corresponding to approximately 70.8% of the 113 selected studies.

Table 7 shows the frequency of the scores for each question in the case study checklist, for all case studies identified by the researchers. The first question (Q1) assessed whether the authors provided precise details about the cases of their analysis. More than 80% of the authors provided some detail, while only about 10% did not provide any detail. This data shows that the authors have been attentive, exposing in a precise way the object analyzed in their work.

Table 7: Answers to specific case study questions.

Question   Yes (%)   Partially (%)   No (%)
Q1         89.38     -               10.62
Q2         63.72     12.39           23.89
Q3         19.47     24.78           55.75
Q4         29.20     1.77            69.03
Q5         77.88     -               22.12

In question Q2, the authors who avoided any interference in the data collection of their study were scored positively. In turn, the authors who interfered with the process but justified such interference were scored partially. Finally, a negative score was given to the authors who interfered in the process and did not provide a justification for this. As can be seen in Table 7, a higher percentage of authors avoided interference, thus avoiding possible biases in their studies. This point contributes mainly to the reliability of the analyzed studies.

Question Q3 investigates the use of data triangulation. A positive score was given if more than one data source was used. When only one data source was used, the assessment scored partially. Lastly, a negative score was given when no data source was described. As can be seen in Table 7, most of the authors did not do well on this point, which can mainly decrease the trustworthiness of the studies' results.

In question Q4, the studies that deal with ethical issues were scored positively. Studies scored as partially do not address these issues but justify the reason or do not yet need to address them. A negative score was awarded to studies that did not address the ethical issues of their study even though it was necessary; an example is a study involving human participants without guaranteeing the anonymity of the participants. As shown in Table 7, most researchers leave this point aside.

Question Q5 concerns the usefulness of the study. Case studies, ideally, provide much input to practice. A positive score was given to studies that provide a useful contribution, and a negative score to those that do not. As can be seen in Table 7, most studies did well on this point, which leads us to believe that researchers are directing their studies toward results that are beneficial for practice.

Considerations about Experiments
Among the 113 papers selected by the researchers as empirical studies, 16 papers (14.16%) were categorized as Experiments from the authors' perspective, but only 13 papers (11.5%) were categorized as Experiments from the researchers' perspective. Of these 16 articles classified as experiments by the authors, 9 (56.25%) were reclassified in the researchers' view as another type of empirical study. More than half of the empirical studies pointed out as experiments by their authors did not reach an expressive quality rate when evaluated as experiments, which suggests that the authors did not take into account the fundamental characteristics that an experiment should present as an empirical study. Through the evaluation of the answers to the questions related to the Experiment type, we found the results presented in Table 8.

Table 8: Answers to specific experiment questions.

Question   Yes (%)   Partially (%)   No (%)
Q1         69.23     -               30.77
Q2         69.23     -               30.77
Q3         84.62     0.00            15.38
Q4         92.31     -               7.69
Q5         100.00    -               0.00
Q6         53.85     -               46.15

Considering the questions about the presentation of the hypotheses (Q1) and the method used to reject or validate the hypotheses (Q2), the researchers' evaluation showed that 69.23% of the papers classified as experiments from the researchers' view presented the null and alternative hypotheses in a formally suitable way; the same proportion holds for the methods used to validate or reject them.

Hypothesis definition is part of the planning level of the experiment execution [26]. It is substantial to the statistical analysis - "The basis for the statistical analysis of an experiment is hypothesis testing" [26] - and to the goal achievement. Once the null and alternative hypotheses are defined, a methodology is necessary to verify the collected data in light of them, applying statistical models to validate or reject those hypotheses. "All these factors have to be considered when planning an experiment" [26]. Thus, the affirmative answers to questions Q1 and Q2 suggest the regular care necessary to achieve the experiment's goals, following a model proposal at the planning level of this type of empirical study.

The third most expressive result was pointed out by question Q3, which deals with the variables presented in the empirical study and their metrics.
In 84.62% of the studies, the authors selected the independent and dependent variables, as well as the metrics to evaluate those variables, throughout the analysis and results of the papers. The effect of treatments depends on the measurement of the variables, and this measurement increases the quality of the study analysis [26].

Quantitative methods (Q4) were identified in 92.31% of the studies, raising the hypothesis that the authors did care about basing their results on a quantitative statistical model; "to draw a meaningful conclusion from an experiment", it is necessary to "apply statistical analysis methods on the collected data to interpret the results" [26].

Question Q5, which refers to the presentation of the experiment's treatments, was answered positively in 100% of the evaluated papers. Every experiment is derived from a planning that encompasses a specific context, scope, and the treatment given to the variables listed to meet the proposed goals and the evaluation of the hypotheses [26]. In this view, the treatment is fundamental as a basis for the experiment, and the result suggests that this is a characteristic that the authors detailed attentively in their studies. Almost fifty percent of all reviewed studies categorized as experiments did not randomize the data sample (Q6), which leaves open the possibility of bias from the authors' perspective.

Considerations about Systematic Reviews
According to Table 6, only twelve (∼43%) papers were classified as Systematic Review from both the authors' and the researchers' perspectives. This finding was obtained after applying the checklist to the 113 papers classified by the researchers as empirical studies. More specifically, when we analyzed the answers given to each question related to the Systematic Review configuration, we could perceive that most of these studies have strong characteristics associated with this configuration. Table 9 presents this in detail.

Table 9: Answers to specific systematic review questions.

Question   Yes (%)   Partially (%)   No (%)
Q1         75.00     25.00           0.00
Q2         68.75     18.75           12.50
Q3         87.50     -               12.50
Q4         56.25     -               43.75

Question Q1 took into account the formulation of the primary research question and its elements: population, intervention, and outcome. These elements limit the scope of a systematic review, avoiding an unmanageable number of primary studies [26]. In our results, we identified that all assessed papers had at least one of these elements. Regarding the inclusion and exclusion criteria (Question Q2), only 12.5% of the papers did not present these criteria. These criteria support the researchers in deciding whether to include or exclude a study. Moreover, inclusion and exclusion criteria should be based on the main research question [16].

The majority of the papers classified as Systematic Review (87.5%) described their search process (Question Q3), increasing the possibility of finding relevant primary studies. Through this process, other researchers can understand how the primary papers were identified and can update a systematic review by adding new primary papers published afterward. Question Q4 considered the quality assessment of primary studies. We identified that a little over 50% of the systematic reviews did this evaluation. It is essential to highlight that this evaluation guides the interpretation of findings and the recommendations for further research [10].

Considerations about Surveys
In the Survey configuration, a general analysis indicated that there were three studies in which the perspective of the authors was confirmed by the perspective of the researchers of this article, as shown in Table 2. This confirmation was given by means of scores extracted from the specific checklist for this type of empirical study. Table 10 presents the results for the specific questions that characterize a study as a Survey.

Table 10: Answers to specific survey questions.

Question   Yes (%)   Partially (%)   No (%)
Q1         50.00     -               50.00
Q2         100.00    -               0.00
Q3         50.00     -               50.00
Q4         100.00    -               0.00
Q5         75.00     -               25.00

Questions Q2 and Q4 scored 100%, which shows that researchers are concerned with describing how the questionnaires are assembled, giving information such as the number, type, and text of the questions, their sequence, and their groupings (Q2), as well as information on the response rate (Q4). In Questions Q1 and Q3, the answers "Yes" and "No" each reached 50%. Q1 addresses the specification and detailing of the sampling method, while Q3 checks whether the questionnaire is available, either in the paper or through an external link. Both pieces of information can support the replication of the study in another context.

For Question Q5, 75% of the studies showed concern in evaluating their reliability, while 25%, which represents only 1 study, did not evaluate it. According to the presented data, some of the main problems are related to the lack of a more detailed specification of the sample used in the study and to the unavailability of the questionnaire, each of which affected 50% of the evaluated studies. For the first problem, it is necessary that authors seek to better detail who the participants of their studies are. For the other, it would be enough if the authors added the questionnaire used in the study as an attachment or appendix, or even through an access link.
7 THREATS TO VALIDITY
In this section, we evaluate the threats to our study following the taxonomy defined by Wohlin et al. [26].

External validity. We extracted a sample of papers from the IHC Symposium. Our sample covers 49.15% of the population, representing 231 papers. This number is within the range required for a 95% confidence level and a 5% confidence interval. We developed a detailed protocol focused on the four major types of empirical studies from the literature. Thus, this protocol cannot be applied to other empirical methods, such as discussions, example applications, and so on. All data used in this study has been made available for replication. Even so, we do not claim generalization of the results.

Internal validity. This threat could be present in the assessment process. In order to mitigate it, we randomly distributed all papers among all the reviewers, aiming to avoid biased results. However, the classification process involved subjective decisions by the researchers. Thus, a pair of researchers reviewed the papers individually and separately, following the protocol. Afterwards, they compared their results and tried to solve the divergences between them. When a divergence remained, the last author assisted the process.

Construct validity. The construction of the assessment questionnaire could jeopardize validity. We looked for characteristics of sound empirical studies in classical references published in the literature. To validate this questionnaire in a pilot project, the last author indicated renowned publications about each kind of empirical study. Another point was the kinds of empirical study considered in our analysis. Although there are nine kinds, we chose only the empirical evaluations that have a well-defined and formal protocol (case study, experiment, systematic review, and survey). These actions contributed to mitigating this threat.

Conclusion validity. We believe that this study could carry threats related to subjective bias when assessing the quality of studies from a renowned author or from a recent edition of the conference; in both situations, the reviewers could expect high-quality papers. To reduce this threat, we randomized the sample choice and the paper distribution among peer reviewers. We performed statistical tests to reinforce our conclusions, and different data partitions along the lifetime were considered, minimizing this threat.

8 FINAL REMARKS
This paper presents a quasi-random experiment with quantitative and validated results in order to understand how the adoption of empirical methods by the Brazilian HCI community has evolved. In this sense, we formulated two hypotheses to address quantitative and qualitative aspects. From the population of papers along 17 editions spread over 20 years, we extracted a sample representing more than 49% of this population.

We developed a protocol that involves the definition of classification and quality assessment processes that can be reapplied using the same data or in the context of another conference. Also, we defined a checklist to support the application of the protocol based on well-defined criteria from the literature. The classification process from the authors' perspective shows that there is a significant lack of definition concerning the type of study they performed. Moreover, the many considerable disagreements in the type classification between the authors' and reviewers' perspectives show that we, as a community, do not agree on the definitions of empirical evaluations or do not know precisely what types of studies are being published.

We accept our first research hypothesis, i.e., that the quantity of empirical methods performed has increased over the lifetime of the IHC Symposium, considering several data partitions. This finding seems promising, since it demonstrates an increase in the maturity of the area. Considering the reviewers' perspective, the most widely used empirical evaluation has been the case study (34.63%). This result might be explained by the fact that a case study is an observational method that is perfectly applicable to investigating instances of a contemporary phenomenon within its real-life context. Indeed, the usage of case studies is recurrent in the HCI field because it is a multidisciplinary field involving, beyond Computer Science, Psychology, Sociology, Social Work, and so on.

In terms of the second hypothesis, it is possible to confirm an increase in the soundness of the empirical validations from the XIV IHC Symposium (IHC 2015) onward. This finding seems to show that the HCI community became aware of the importance of empirical evaluation and is taking steps to accomplish it. In this sense, we discussed the results about the utilization of each type of empirical method in the IHC Symposium, to indicate the current strengths and weaknesses of the studies and to contribute to improvement efforts.

In short, empirical assessments support the validation of contributions in several areas. However, the conduction of these assessments should follow the characteristics of each kind of empirical evaluation in order to produce high-quality results. In this sense, we hope that our checklist and the discussion about the main types of empirical studies (case study, experiment, systematic review, and survey) can support the conduction of new empirical evaluations in the HCI community, improving their quality.

ACKNOWLEDGMENTS
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
REFERENCES
[1] Davi Monteiro Barbosa, Rômulo Gadelha, Thayse Alencar, Bruno Neves, Italo Yeltsin, Thiago Gomes, and Mariela I. Cortés. 2017. An Analysis of the Empirical Software Engineering over the last 10 Editions of Brazilian Software Engineering Symposium. In Proceedings of the 31st Brazilian Symposium on Software Engineering (SBES 2017), Fortaleza, CE, Brazil, September 20-22, 2017. 44-53. https://doi.org/10.1145/3131151.3131158
[2] Victor R. Basili. 1996. The Role of Experimentation in Software Engineering: Past, Current, and Future. In Proceedings of the 18th International Conference on Software Engineering (ICSE). IEEE Computer Society, 442-449.
[3] Jeffrey C. Carver, Eugene Syriani, and Jeff Gray. 2011. Assessing the Frequency of Empirical Evaluation in Software Modeling Research. In Proceedings of the First International Workshop on Experiences and Empirical Studies in Software Modelling. CEUR-WS.org, 28-37.
[4] A. Fink. 2003. The Survey Handbook. SAGE Publications.
[5] G.V. Glass, B. McGaw, and M.L. Smith. 1981. Meta-analysis in social research. Sage Publications.
[6] Daniel Karlström and Per Runeson. 2006. Integrating agile software development into stage-gate managed product development. Empirical Software Engineering 11, 2 (01 Jun 2006), 203-225. https://doi.org/10.1007/s10664-006-6402-8
[7] Staffs Keele. 2007. Guidelines for performing systematic literature reviews in software engineering. In Technical report, Ver. 2.3 EBSE Technical Report. EBSE.
[8] Barbara Kitchenham, O. Pearl Brereton, David Budgen, Mark Turner, John Bailey, and Stephen Linkman. 2009. Systematic Literature Reviews in Software Engineering - A Systematic Literature Review. Information and Software Technology 51, 1 (Jan. 2009), 7-15.
[9] Barbara A. Kitchenham, O. Pearl Brereton, David Budgen, and Zhi Li. 2009. An Evaluation of Quality Checklist Proposals: A Participant-observer Case Study. In Proceedings of the 13th International Conference on Evaluation and Assessment in Software Engineering (EASE'09). BCS Learning & Development Ltd., Swindon, UK, 55-64. http://dl.acm.org/citation.cfm?id=2227040.2227047
[10] Barbara Ann Kitchenham, David Budgen, and Pearl Brereton. 2015. Evidence-Based Software Engineering and Systematic Reviews. Chapman & Hall/CRC.
[11] Barbara A. Kitchenham, Tore Dyba, and Magne Jorgensen. 2004. Evidence-based software engineering. In Proceedings of the 26th International Conference on Software Engineering. IEEE Computer Society, 273-281.
[12] B. A. Kitchenham, E. Mendes, and G. H. Travassos. 2007. Cross versus Within-Company Cost Estimation Studies: A Systematic Review. IEEE Transactions on Software Engineering 33, 5 (May 2007), 316-329. https://doi.org/10.1109/TSE.2007.1001
[13] Robert V. Labaree. 2017. Research Methods in the Social Sciences. http://lynn-library.libguides.com/c.php?g=549455&p=3771805
[14] Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser. 2017. Research methods in human-computer interaction. Morgan Kaufmann.
[15] Johan Linåker, Sardar Muhammad Sulaman, Rafael Maiani de Mello, and Martin Höst. 2015. Guidelines for conducting surveys in software engineering. http://portal.research.lu.se/portal/files/6062997/5463412.pdf. Accessed 10 December 2016.
[16] Ruchika Malhotra. 2015. Empirical Research in Software Engineering: Concepts, Analysis, and Applications. Chapman & Hall/CRC.
[17] Patrick E. McKnight and Julius Najab. 2010. Mann-Whitney U Test. The Corsini Encyclopedia of Psychology (2010), 1-1.
[18] Marcus R. Munafò, Brian A. Nosek, Dorothy V.M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P.A. Ioannidis. 2017. A manifesto for reproducible science. Nature Human Behaviour 1 (2017), 0021.
[19] Dewayne E. Perry, Susan Elliott Sim, and Steve M. Easterbrook. 2004. Case Studies for Software Engineers. In Proceedings of the 26th International Conference on Software Engineering (ICSE '04). IEEE Computer Society, Washington, DC, USA, 736-738.
[20] Per Runeson, Martin Host, Austen Rainer, and Bjorn Regnell. 2012. Case Study Research in Software Engineering: Guidelines and Examples. Wiley Publishing.
[21] Mary Shaw. 2003. Writing good software engineering research papers. In Proceedings of the 25th International Conference on Software Engineering (ICSE). IEEE, 726-736.
[22] Paulo Anselmo Da Mota Silveira Neto, Joás Sousa Gomes, Eduardo Santana De Almeida, Jair Cavalcanti Leite, Thais Vasconcelos Batista, and Larissa Leite. 2013. 25 Years of Software Engineering in Brazil: Beyond an Insider's View. J. Syst. Softw. 86, 4 (April 2013), 872-889. https://doi.org/10.1016/j.jss.2012.10.041
[23] Dag I. K. Sjoberg, Tore Dyba, and Magne Jorgensen. 2007. The Future of Empirical Methods in Software Engineering Research. In Proceedings of the Future of Software Engineering (FOSE). IEEE Computer Society, 358-378.
[24] Raul Valverde. 2011. Principles of Human Computer Interaction Design. Lambert Academic Publishing.
[25] András Vargha and Harold D. Delaney. 2000. A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101-132.
[26] Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in software engineering. Springer Science & Business Media.
[27] C. Wohlin and A. Wesslen. 1998. Understanding software defect detection in the Personal Software Process. In Proceedings of the Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257). 49-58. https://doi.org/10.1109/ISSRE.1998.730773
[28] Carmen Zannier, Grigori Melnik, and Frank Maurer. 2006. On the Success of Empirical Studies in the International Conference on Software Engineering. In Proceedings of the 28th International Conference on Software Engineering (ICSE '06). ACM, New York, NY, USA, 341-350.
