PSYARXIV JASP Tutorial - Preprint-B
Abstract
This tutorial demonstrates the use of parallel Frequentist-Bayesian analyses using JASP,
and the plausible inferences one may be able to make from such combined analyses.
Chiefly, the first analytical step is to learn from the sample in a descriptive manner,
without getting confounded by inferential statistics outputs (Tukey, 1977). A second
step is to gauge the soundness of the method and of any inference using frequentist
inferential statistics, following on the severity testing approach by Mayo (1996; Mayo
and Spanos, 2010). Finally, the third step uses a Jeffreysian—Bayes factors—approach
(Jeffreys, 1961) in order to make sensible inferences about the hypotheses being tested.
The tutorial is eminently a learning exercise rather than a write-up of a formal article
for publication. Therefore, appropriate commentary is included where needed to foster
such learning.
1. This tutorial also appears as a book chapter in a Spanish methodology book: Perezgonzalez, J. D., & Vincent, N. (2022). Análisis paralelos frecuentista-bayesianos con JASP. In D. Frías-Navarro & M. Pascual-Soler (Eds.), Diseño de la investigación, análisis y redacción de los resultados (pp. 607-647). Valencia, Spain: Palmero Ediciones. ISBN 978841233367.
TABLE OF CONTENTS
JASP
References
As the saying goes, ‘There is more than one way to skin a cat’ and, in the
philosophy of statistics, two of those ways are the frequentist approach and the Bayesian
approach to data analysis and inferential work. The frequentist approach is the one that
prevails in Psychology (e.g., null hypothesis significance testing), although the Bayesian
approach is nowadays gaining traction, especially after the so-called ‘crisis in Psychology’
(e.g., Open Science Collaboration, 2012; Klein et al., 2014). The latter has been
traditionally disparaged because of its association with subjective beliefs in the estimation
of the prior probabilities of the hypotheses being researched. Meanwhile, the former has
been preferred because it is associated with objectivism and the repetition of empirical
events in the long run. We are not going to discuss the pros and cons of each perspective.
Instead, this chapter is a tutorial on how to analyse research data using both approaches,
with the help of a statistics software package (JASP) which combines both types of analysis in a
rather seamless manner (Perezgonzalez and Frías-Navarro, 2018).
The chapter is also going to be a bit different from the other chapters in the book
insofar as it is more “pedantic”, so to speak. In order to prevent confusion and foster
understanding, we are going to be fussy in the use of concepts and the interpretation of
results. For example, we will not use the concept of null hypothesis significance testing
(or NHST): on the one hand, because it is a concept often used to refer to at least three
different approaches to data testing (Fisher’s, Neyman-Pearson’s, and a conceptual blend
of both in an arguably debatable manner; e.g., Gigerenzer, 2004; Perezgonzalez, 2015a);
on the other hand, because what those approaches test is data, not hypotheses per se.
Similarly, other confusing concepts we will avoid are alpha levels, Type II errors, power,
and tests of hypotheses (see Perezgonzalez, 2014).
JASP
JASP is quite intuitive to use. It is also very flexible and allows exploring several
aspects of an analysis simply by selecting the appropriate boxes or radio buttons, which
immediately display the appropriate results on the same screen. By deselecting a
command, the corresponding results disappear, keeping the workspace nice and tidy.
Copying and pasting tables and figures from JASP onto another programme (e.g., a word
processing programme) is also straightforward, and figures and tables are transferred
already formatted as per the guidelines of the American Psychological Association (e.g.,
APA, 2010).
JASP’s main strength is that it allows for easy analysis of the same database under
the above two philosophical perspectives (Perezgonzalez and Frías-Navarro, 2018). The
Fisherian perspective allows for a frequentist approach to severe testing and error
statistics (Mayo and Spanos, 2010), based on theoretical probability models such as the
normal distribution. However, although Fisher’s approach may be adequate to disconfirm
or reject a null hypothesis, it is silent when it comes to interpreting a non-significant
result. Meanwhile, JASP does not yet cater for Neyman-Pearson’s approach (e.g., 1928) to
compensate for such limitation, so there are no power analyses, Type II errors, or alternative hypotheses.
Jeffreys’s Bayes factor analysis is able to provide information about the relative
likelihood of the same data under two different models. These models are more extreme
than the ones used by frequentists. For example, the null model is a nil model where the
entire probability density is collapsed onto ‘zero’ (Kruschke, 2011). A Cauchy-based
alternative model, on the other hand, is a flatter one, and research data gets a higher
probability under this model than it would under a frequentist one. By using these extreme
models, Jeffreysians are able to test whether the effect size is absolutely ‘zero’ or, instead,
diverges from ‘zero’. And by comparing both models, they are able to determine which
model is more likely and how strong the evidence is. In so doing, they are also able to
conclude in favour of or against the nil hypothesis in a manner that Fisherians cannot.
An opportune question is, thus, whether we could simply use Jeffreys’s Bayes
factors and skip the frequentist test altogether. Perhaps we could (Wagenmakers et al.,
2017; Rouder et al., 2009). However, the main criticism of Bayesian statistics is their
reliance on subjective prior constructs, which, in the case of Jeffreys’s approach, we
observe chiefly in the use of extreme models for data testing. That is, data tend not to be
distributed according to such extreme models (e.g., most data are distributed following
the frequentist normal curve). Therefore, there is little objective justification for using
such extreme models.
Data analysis using JASP
The strategy used for data analysis is, therefore, the focus of this tutorial. Such
strategy caters for three approaches: exploratory data analyses, tests of significance, and
Bayes factor analyses.
Exploratory data analysis is, possibly, the only intentional misnomer in this
tutorial, a nod to Tukey’s (1977) promotion of learning by exploring data over
exclusive reliance on inferential statistics. Therefore, all our analyses start with a
descriptive exploration of the data in question, as measured by a particular variable.
JASP provides some possibilities for such data exploration, although we will limit
ourselves to a few of those, such as the size of groups (valid cases), measures of centrality
and variability, and data interpretation in the context of the scale measuring such data.
We follow here the conventional use of Fisher’s tests of significance (e.g., 1954),
whereby we put our research data to test under a null hypothesis of no difference between
groups (H0). This hypothesis acts as a “straw man” (i.e., any difference between groups
ought not to be exactly ‘zero’; Perezgonzalez, 2015a) and provides a severe test to the
data (e.g., Mayo, 1996). That is, the test will capture as non-significant 95% of the
differences closest to ‘0’ (either two-tailed or one-tailed, depending on the test).
should be based on practicalities (i.e., practical importance in the real world), it was
challenging for us to gauge such minimum effect size a priori and, therefore, we settled
for a conventional (theoretical) standardized medium effect size for the most interesting
occurrence: that of a directional effect in favor of the effectiveness of the intervention.
The required sample size thus ensured that medium, or larger, effects would be flagged
as statistically significant results. Such sensitiveness analysis thus complements Fisher’s
tests of significance in the research project used for this tutorial.
Following Fisher’s test, we use a Bayes factor analysis (Jeffreys, 1961) to shed
light both on the likelihood of the null hypothesis in case of a non-significant result and
on the likelihood of an alternative hypothesis in the case of a significant result.
JASP’s Bayes Factor analysis compares the probability of the same data under
two different models. The default model for no effect is a nil model (M0), that the effect
size in the population is exactly ‘zero’. A Cauchy distribution is JASP’s default
alternative model (M1). The Cauchy distribution is a symmetric, fat-tailed distribution
(akin to a t-distribution with one degree of freedom), which gives a
higher probability to effect sizes different from ‘zero’. Compared to the “normal” (t)
distribution used with Fisher’s tests, Bayes Factors allow comparing data under two
extreme models.
2. JASP 0.9.0.1 also allows using other distributions for the alternative model, such as the normal distribution and the t distribution.
Brief research background
The original research project aimed to assess whether a cockpit checklist helped
increase the situation awareness of pilots when not actively operating an aircraft (i.e.,
when flying in autopilot mode).
Both groups faced similar test conditions: a flight simulation using Microsoft
Flight Simulator X, run on a laptop. The 10-minute simulation involved a Cessna
172 Skyhawk flying between two points at 3,500 feet, on a heading of
000, under visual flight rules (VFR). The flight had both a change in heading and a change
in altitude occurring during the flight. There were also three pre-set failures occurring
during the flight (at 3, 5, and 8 minutes, respectively). The flight was hands-off, as the
pilots were not required to fly the aircraft but simply to sit and keep a lookout—none was
briefed regarding the possibility of flight changes or failures. Participants were also
provided with an Apple iPad displaying aeronautical plate information for the destination
airport, and blank paper for notes. They were made aware of the cockpit clock and were
encouraged to take notes and refer to such clock if so desired.
awareness. The control group did not have access to such checklist. Therefore, the only
difference between both groups was the use of the checklist.
1. Data screening
Data screening prior to any data analysis allows us to quickly peruse the integrity of the
data (Tabachnick and Fidell, 2001), and the descriptives tab in JASP is quite handy for
this purpose. All variables should be screened at once, corrected if necessary (e.g., missing
values, outliers, etc.), and any corrections should be documented, as appropriate. Data
screening should be thorough but only the most informative results need to be
communicated to the reader.
JASP 0.9.0.1: We open JASP, upload the corresponding database, select the tab
‵Descriptives′, and choose ‶Descriptive Statistics″. We then select all relevant variables
for screening. (It is also possible to screen subgroups by selecting the corresponding
variable into the box ‵Split′.) JASP displays basic descriptive statistics, including missing
cases, and minimum and maximum values. The box ‵Frequency tables (nominal and
ordinal variables)′ allows for a quick perusal of frequencies, percentages, and missing
values for nominal and ordinal variables. Under ‵Plots′ we choose ‶Distribution plots″
and ‶Boxplots″. Under ‵Statistics′ we select other descriptives of interest: ‷Median‴,
‷Skewness‴, and ‷Kurtosis‴, whose outputs get automatically incorporated into the
earlier table.
Screening results for our research variables are summarized in Table 1. Valid
cases (and missing cases—not provided, but calculable by subtraction) and minimum and
maximum values accord with expectations. Medians are close to the corresponding means,
indicating relatively akin measures of centrality, and the z-scores for skewness and
kurtosis (calculated by dividing the statistic by their standard error, following Tabachnick
and Fidell, 2001) do not show serious departures from normality for Static SA and Active
3. Screening results, albeit carried out, were not provided in the original research.
4. There were more variables in the original research. Here we will restrict ourselves to the four variables used in this tutorial.
SA, but show non-normal skewness for Timing SA, and non-normal skewness and
kurtosis for Continual SA.
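The z-scores mentioned above divide each statistic by its standard error. For readers wishing to replicate this step outside JASP, a minimal sketch follows, assuming the usual SPSS-style standard-error formulas (which JASP also reports); the function names are our own.

```python
import math

def skewness_se(n):
    # standard error of sample skewness for n valid cases
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def kurtosis_se(n):
    # standard error of sample (excess) kurtosis for n valid cases
    return 2 * skewness_se(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

def normality_z(skew, kurt, n):
    """z-scores for skewness and kurtosis (statistic / standard error).

    Following Tabachnick and Fidell (2001); |z| > 3.29 (p < .001) is a
    common flag for a serious departure from normality.
    """
    return skew / skewness_se(n), kurt / kurtosis_se(n)
```

For example, with n = 36 the standard error of skewness is about 0.39, so a skewness statistic would need to exceed roughly 1.3 in absolute value before its z-score flags non-normality.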
Static SA and Active SA are relatively symmetric in their distributions (plots not
provided). Static SA contains several outliers (two each at scores of ‘7’, ‘2’, and ‘1’).
As the outliers are located in both tails of the distribution—thus, somewhat
balancing the mean—and the normality of the distribution seems not to be affected, we
decided to use the variable as is.
As we are going to carry out a dual frequentist-Bayesian analysis, and because the
normality of variables is of little concern to Bayesian statistics, we decided to leave both
variables unchanged, opting instead for nonparametric tests when doing frequentist
analyses on Timing SA and Continual SA.
5. Although the table was copied from JASP, we have nonetheless eliminated redundant information, reorganized the results, and substituted z-scores for skewness and kurtosis.
Static SA was measured using seven items, each assessing a piece of information
that a pilot should be aware of (such as the initial altitude of the flight, meteorological
conditions, and information about the destination aerodrome) but which is not part of
active cockpit operations. Responses to each of the items were scored as a dichotomy—
i.e., as being correct or not—then added up into a single component. The resulting
measure of Static SA could, thus, range between a minimum of ‘0’ (if no answer was
correct) and a maximum of ‘7’ (if all answers were correct).
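The scoring just described—dichotomous items added up into a single component—can be sketched as follows (the responses shown are hypothetical):

```python
# Hypothetical item responses for one pilot (1 = correct, 0 = incorrect),
# one response per Static SA item
responses = [1, 0, 1, 1, 0, 1, 0]

static_sa = sum(responses)  # component score, bounded between 0 and 7
```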
Basic descriptive data for an exploratory analysis can be obtained from the Bayesian
options in JASP. We are mainly interested in group descriptives and credible intervals.
(We prefer credible intervals over frequentist confidence intervals because they accord
well with a descriptive interpretation of the samples’ sampling distributions; that is, a
measure of centrality and 95% coverage of the sampling distribution. In any case, JASP
has the quirky feature that confidence intervals are for the group differences, not for the
individual groups, so such confidence intervals would have to be calculated and plotted
by hand.)
JASP 0.9.0.1: We select the tab ‵T-Tests′, and choose the option ‶Bayesian Independent
Samples T-Test″. We then select ‘Static SA’ as our dependent variable, and ‘Group’ as
our grouping variable. Under ‵Additional Statistics′ we tick the box ‶Descriptives″—
which compiles group descriptives—and under ‵Plots′ we choose ‶Descriptives plots″
with ‷Credible interval‴ set at 95%—which compiles the appropriate graph and also adds
the credible intervals to the earlier table.
The observed results were similar for each group, with very little difference in
centrality and variability (Table 2). The experimental group showed a slightly lower mean
(M = 4.17)—also reflected in the somewhat off-centred credible interval (95% BCI
[3.60, 4.75])—compared to the control group (M = 4.44; 95% BCI [3.87, 5.00]).
6. In the original research this variable was called ‘Checklist or no’.
7. BCI stands for Bayesian credible interval, as opposed to FCI, or frequentist confidence interval. In the original study, FCIs were provided. Here we substitute BCIs, instead.
In the context of the scale measuring them, both groups performed relatively well,
remembering, on average, four items out of seven (i.e., about 57% of the items).
Although Fisher’s tests of significance test the observed data under null hypotheses, JASP
lists research hypotheses instead. In our case, we expected the experimental intervention
to have no effect on Static SA; therefore, we are after a non-directional test (i.e., the
research hypothesis that the means of both groups differ, without specifying a direction).
Group allocation in JASP is automatic, so we need to refer to descriptive statistics in
order to interpret the results correctly. As far as inferential statistics go, we are interested
in the most informative ones, which are the standardized effect sizes of the difference between
groups, and the test statistics.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Independent
Samples T-Test″, then select ‘Static SA’ as our dependent variable, and ‘Group’ as our
grouping variable. (Before carrying out the test, we could check whether the assumptions
for normality and equality of variance hold. As assumptions were checked previously, we
skip this step.) We select ‶Student″ as test, ‶Group 1 ≠ Group 2″ as research hypothesis,
plus some ‵Additional Statistics′: ‶Effect size″ and its ‷Confidence Interval‴ set at 95%.
Because Static SA comprises items that a pilot should remember (thus, be aware
of) but which do not require continual monitoring, they were not expected to be
directionally affected by the experimental intervention. Table 3 thus shows non-
directional, two-tailed, statistics.
Results show that the standardized effect mean difference between groups was a
Cohen’s d of 0.20, not rare in psychology and one which is considered small but not
negligible. However, as we decided, a priori, to be interested only in medium to large
effect sizes, such a small effect size is considered trivial in context.
8. The null hypothesis used for the test would read as, H0: the Experimental group will not perform significantly differently to the Control group (for an appropriate measure of significance).
The confidence interval for such a standardized effect difference runs between
-0.38 and 0.78, thus crossing ‘0’ and signalling a non-significant result. Indeed, the two-
tailed t-test for independent samples returned such a non-significant result (p = 0.51).
From these results we thus learn that the observed data show a high probability
under the null hypothesis of no effect (i.e., they are expected to occur about 50% of the time
when no differences ensue). Therefore, the substantive null hypothesis of no effect
cannot be rejected; however, we cannot say there is no effect (i.e., we cannot accept
such null hypothesis).
9. Frequentist confidence intervals (FCI) are relevant here, given that a significance test is also frequentist
(Perezgonzalez, 2015c). Thus, a FCI calculates the population sizes that can be rejected with a margin of
error of 5%—i.e., those that do not fall within the 95% range of calculated locations for the population
parameter closest to the observed sample (Perezgonzalez, 2017b). Therefore, such inference is subjected to
error (there is a 5% chance the true parameter is outside the interval), yet neither the interval itself represents
a density distribution nor can we make probability statements in the manner Bayesian credible intervals do
(i.e., we cannot be 95% confident the parameter is in the interval).
10. This interval was mistakenly reported as [-0.53, 1.00] in the original manuscript.
11. The non-significance of a small effect size is not surprising, considering that the sensitiveness of the test
was tailored to a minimum medium effect size (d = 0.5), one-tailed. Indeed, we could say the significance
test is redundant once we know the effect size, as effects smaller than medium size will not be statistically
significant.
12. This is the only information a p-value really provides: the probability of the data under the statistical H0
(Perezgonzalez, 2015b).
13. The rejection of the substantive H0 depends not on the p-value but on a Modus Tollens determined by the
level of significance. The Modus Tollens sets the logical syllogism for the data to contradict the substantive
H0: if H0 is true, then the observed data will be not significant (at a given level of significance). Therefore,
non-significant results cannot contradict the hypothesis, rendering the syllogism inapplicable and any
subsequent conclusion illogical (Perezgonzalez, 2017c).
14. Because H0 is used as a ‘straw man’ hypothesis to be rejected, there is no equivalent Modus Ponens
syllogism in support of H0, so non-significant data cannot be used to confirm H0, either. In any case, the
test is not powerful enough to probe H0 with severity (Mayo, 1996).
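The effect-size and test statistics discussed in this section can also be reproduced outside JASP. The sketch below uses Python with hypothetical Static SA scores; only the standard pooled-SD formula for Cohen's d and scipy's independent-samples t-test are assumed.

```python
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1)
                  + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Hypothetical Static SA scores (0-7 scale) for two groups
experimental = [4, 5, 3, 4, 4, 5, 3, 4]
control = [5, 4, 4, 5, 3, 5, 4, 5]

d = cohens_d(experimental, control)
t, p = stats.ttest_ind(experimental, control)  # two-tailed by default
```

As in the tutorial, the effect size carries most of the information; the two-tailed p-value then tells us whether a difference of that size would be rare under the null hypothesis.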
2.3. Static SA; Bayes factor analysis
Jeffreys’s Bayes factor tests the observed data under two models, which JASP also lists
as research hypotheses and refers to them as H0 and H1, respectively. Regarding
Bayesian statistics, we are interested in the Bayes Factor as indicative of the strength of
the evidence in favor of either model.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option for ‶Bayesian
Independent Samples T-Test″, then select ‘Static SA’ as our dependent variable, and
‘Group’ as our grouping variable. We select ‶Group 1 ≠ Group 2″ as research
hypothesis. As there is no easy way of predicting the appropriate ‵Bayes Factor′ to run,
we randomly select ‶BF10″, check the results and, as the value is smaller than ‘1’ (i.e., 0.351),
we select ‶BF01″, instead (as both results are interchangeable, convention calls for
reporting Bayes factors in their most interpretable form, which is the one greater than
‘1’). Further, we choose informative ‵Plots′: ‶Prior and posterior″ distributions with
‷Additional info‴, and ‶Sequential analysis″.
The Bayes factor analysis shows that the nil model (M0) is almost three times
more likely than the alternative model (BF01 = 2.85; Table 4). Indeed, the posterior
distribution shows the sample effect size to be slightly off-centred yet ‘zero’ still
commands a large posterior probability.
15. Bayes factors test models, not hypotheses per se. In order to prevent confusion with fully developed
Bayesian inference, we will refer to models (M0 and M1), instead of to hypotheses (H0 or H1).
16. We will reserve the label ‘null’ for Fisher’s null hypothesis (H0) and use ‘nil’ for the Bayesian null model
(as the model is a point-nil distribution).
From these results we thus learn that the evidence provides anecdotal (to
moderate) support for the conclusion that the intervention had no effect on Static SA, as
expected.
Active SA was measured using three items that required ongoing, albeit not
necessarily constant, monitoring in the cockpit. This monitoring was captured via three
pre-set failures and assessed in the questionnaire by asking the pilots, after the flight had
ended, whether they could positively identify those failures. The response to each item
was, therefore, dichotomous—i.e., as having identified the failure correctly or not—and
all three responses were added up into a single component. Active SA could, thus, range
between a minimum of ‘0’ (if no failure was positively identified) and a maximum of ‘3’
(if all three failures were identified).
A 95% BCI covers 95% of the range of plausible locations for the inferred population
parameter, so that we may conclude that, with 95% confidence or certainty, the
population parameter is within the credible interval; said otherwise, that there is 95%
probability that the parameter is in the interval.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Bayesian
Independent Samples T-Test″. We then select ‘Active SA’ as our dependent variable and
‘Group’ as our grouping variable, as well as the same descriptive options selected
earlier: ‶Descriptives″, ‶Descriptives plots″, and 95% ‷Credible interval‴.
Results for Active SA (Table 5) show that the control group performed better than
the experimental group, obtaining a higher average on the component (M = 1.22; 95%
BCI [0.96, 1.48]) than the experimental group did (M = 0.70; 95% BCI [0.39, 1.00]).
Such difference in performance is obvious in the accompanying figure, which locates the
control BCI noticeably higher on the scale than the experimental BCI.
17. Bayesian inference allows one to claim support for the nil model if results so suggest. Such support rests on
the larger likelihood of one model over the other. For example, the BF here represents the odds favouring
the nil model over the alternative model. Such odds can be converted into a probability with the formula
P = BF/(1+BF). Thus, the nil model has a 74% chance of being correct (compared to the 26% probability,
by difference, of the alternative model).
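The conversion in the footnote above—from a Bayes factor to a posterior model probability, assuming 1:1 prior odds—is a one-liner; the function name below is our own:

```python
def bf_to_probability(bf):
    """Posterior probability of the favoured model, assuming 1:1 prior odds.

    P = BF / (1 + BF); the competing model gets 1 - P.
    """
    return bf / (1 + bf)

# The BF01 = 2.85 reported for Static SA gives the nil model about a 74% chance
p_nil = bf_to_probability(2.85)
```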
In the context of the scale measuring them, however, both groups performed
relatively poorly, perceiving, on average, just about one failure out of three possible
(meanwhile, as Table 1 shows, the maximum number of failures detected by either group
was two and the minimum was no detection of failures whatsoever). The control group
also performed better than the experimental group, a performance that goes against initial
expectations.
We expected the intervention to have a positive effect on Active SA; thus, we are after a
directional test. As JASP allocates groups automatically, we need to select which of the two
directional hypotheses to use. Therefore, before jumping straight to
interpreting results, it is necessary to ascertain that we have selected the correct
directional test (JASP offers a handy explanatory note as part of the outputs, which can
be used for this purpose).
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Independent
Samples T-Test″, then select ‘Active SA’ as our dependent variable, and ‘Group’ as our
grouping variable. We select ‶Student″ as test. In order to determine the appropriate
directional hypothesis, we randomly select ‶Group 1 > Group 2″, read the note given in
the output, realize that it is the wrong directional hypothesis, then select ‶Group 1 <
Group 2″. We also select the same ‵Additional Statistics′ as earlier: ‶Effect size″ and
its ‷Confidence Interval‴ set at 95%.
18. There is an inherent risk here, insofar as one can observe the results and thus be tempted to choose the
directional hypothesis with the most lucrative conclusion (i.e., the directional hypothesis that returns a
significant result).
As Active SA comprises items that are part of cockpit operations and that were
expected to be perceived with continual monitoring, they were also expected to be
positively affected by the experimental intervention, which motivated such continual
monitoring. Table 6 thus presents directional, one-tailed, statistics.
Results show that the standardized effect mean difference between groups was a
Cohen’s d of 0.80, conventionally considered a large effect in psychology. The
confidence interval of such effect, one-tailed, ranged from minus infinity to an upper
limit of d = 1.30, thus crossing ‘0’ and signalling a non-significant result. Indeed, the
one-tailed t-test for independent samples returned such a non-significant result (p = 0.995).
We thus learn that test results are statistically non-significant and, because of the
one-tailed nature of the inference, the null hypothesis of either no differential effect or
negative effects between groups cannot be rejected. We cannot, however, conclude in
favour of the null hypothesis, thus we are prevented from claiming a negative effect of
the checklist on performance, as suggested by the descriptive data (i.e., we cannot accept
such possibility even when it is contained within the scope of the null hypothesis).
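The directional logic just described can be mirrored outside JASP via scipy's `alternative` argument. In the hypothetical sketch below, the experimental group scores lower, so the one-tailed p-value for the predicted direction approaches 1, just as in the result above; the data are invented for illustration.

```python
from scipy import stats

# Hypothetical Active SA scores (0-3 scale); control outperforms experimental
experimental = [0, 1, 0, 1, 1, 0, 0, 1, 2, 1]
control = [1, 2, 1, 1, 2, 1, 2, 1, 1, 2]

# The directional H1 "experimental > control" maps to alternative='greater';
# because the observed difference runs the other way, the one-tailed p-value
# comes out close to 1 (a directional test cannot flag the opposite direction)
result = stats.ttest_ind(experimental, control, alternative='greater')
```

Selecting `alternative='less'` instead would test the opposite direction, which is exactly the trap footnote 18 warns against when the choice is made after seeing the results.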
Jeffreys’s Bayes factor tests the probability of the observed data under two models rather
than the probability of the hypotheses proper, as a fully-fledged Bayesian analysis would
do. Indeed, Bayes factors assume uninformative hypotheses—both hypotheses are given
the same prior odds, 50% each—which is a way of placing the weight of the evidence
shown by the posterior distribution onto the observed data exclusively.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Bayesian
Independent Samples T-Test″, then select ‘Active SA’ as our dependent variable, and
‘Group’ as our grouping variable. We also select ‶Group 1 < Group 2″ as research
hypothesis, and the appropriate ‵Bayes Factor′ to run—which turns out to be ‶BF01″—as
19. The directional null hypothesis would read as: H0: the Experimental group will not perform significantly
better than the Control group (for an appropriate measure of significance).
20. Cohen’s d reads a bit unintuitively here. However, we ought to remember that it reflects the difference
between groups yet in favour of the control group (i.e., the control group scored higher than the
experimental group, against expectation).
21. These statistics were mistakenly reported as [-∞, 0.10] in the original manuscript.
well as the same plots selected earlier: ‶Prior and posterior″ with ‷Additional info‴, and
‶Sequential analysis″.
The Bayes factor analysis shows that the nil model (M0) is almost eleven times
more likely than the alternative model (BF01 = 10.96; Table 7). Indeed, the one-tailed
posterior distribution shows the sample effect size to be practically ‘zero’ (median =
-0.06; 95% BCI [-0.263, -0.003]). The evidence in support of the nil model—that there
was no positive effect of the checklist on Active SA (without necessarily ruling out a
negative effect)—is strong.
This section shows further data analyses prompted by unexpected results. One handy
feature of JASP 0.9.0.1 is that it brings up any previous command screen by simply
clicking on a table or figure, thus reducing the need to click on tabs and upload the same
variables each time. Beware, however, that any alteration to commands will
automatically update the results rather than generate new ones.
JASP 0.9.0.1: In ‵Results′, we click on the table ‶Independent Samples T-Test″—which
brings us back to the command screen for the directional test—and change our research
hypothesis to the nondirectional one ‶Group 1 ≠ Group 2″. This updates the table with
the corresponding two-tailed results. We, then, click on the table ‶Bayesian Independent
Samples T-Test″—which brings us back to the appropriate command screen for the
Bayesian directional test—and also change our research hypothesis to the nondirectional
one ‶Group 1 ≠ Group 2″. This automatically updates both the table and the plots. We
double-check that the previously selected ‵Bayes Factor′ command is the appropriate
one—which, in this case, is not—and select ‶BF10″, instead.
Before moving on, it seems relevant to explore further the mismatch between our
initial expectations of improvement and the observed results, especially in view of the
latter noticeably going in the opposite direction. Therefore, we re-run both frequentist and
Bayesian tests for Active SA using a two-tailed approach, for exploratory purposes.
Table 8 summarizes the results for the test of significance. We observe that the
confidence interval for the effect size runs between 0.19 and 1.40, thus, between a small
and a very large effect size (while half of the interval is above large, i.e., above d = 0.80).
The interval does not cross ‘zero’, thus signalling a statistically significant result, as also
confirmed by the small p-value (p = 0.01).
Table 9 summarizes the results of the Bayesian test. The Bayes factor analysis
shows that the alternative model (M1) is about five times more likely than the nil
model (BF10 = 5.04). Indeed, the two-tailed posterior distribution shows the sample
effect size to be centered on a median effect d = 0.68, while the credible interval (95%
22. Given the post hoc exploratory nature of this secondary analysis, we prefer non-directional tests, as they
are statistically more conservative than directional ones.
²³ This statistic was mistakenly reported as d = 0.65 in the original manuscript.
²⁴ As said earlier, the Modus Tollens sets the logical syllogism for the data to contradict the substantive H0:
if H0 is true, then the observed data will not be significant (at a given level of significance). A significant
result thus contradicts the consequent of the syllogism, leading to denying the antecedent (the hypothesis)
in a logically sound manner. We can thus conclude that the substantive H0 is not true, and may reject it
(Perezgonzalez, 2017c).
²⁵ Bayesian tests were not provided in the original manuscript.
²⁶ These odds give the alternative model an 83% chance of being correct (compared to about a 17%
probability for the nil model; P = BF/[1+BF]). Said otherwise, the posterior probability of the alternative
hypothesis has increased by 33 percentage points and that of the nil hypothesis has decreased by 33
percentage points, compared to their prior probabilities of 50% each.
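The conversion in the footnote above, from a Bayes factor to a posterior model probability under equal priors, can be checked in a couple of lines of Python; the function name is ours, while the formula is the footnote's P = BF/[1+BF].

```python
def posterior_probability(bf):
    """Posterior probability of the favoured model given equal (50%) prior
    probabilities for both models: P = BF / (1 + BF)."""
    return bf / (1 + bf)

p_m1 = posterior_probability(5.04)  # alternative model, using BF10 = 5.04 from above -> about 0.83
p_m0 = 1 - p_m1                     # nil model -> about 0.17
```

The same function applies to any Bayes factor reported in this tutorial, as long as the equal-priors assumption is acceptable.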
BCI [0.12, 1.30]) gives ‘zero’ a very low credibility. The evidence in support of the
alternative model is moderate, thus supporting the conclusion that there was a pernicious
effect of the checklist on Active SA (i.e., that it decreased situation awareness in the
experimental group).
Table 9 | Active SA; Bayesian Independent Samples T-Test: BF10 = 5.044
Timing SA assesses the time it took the pilots to first perceive any of the three
failures, in full minutes (during the simulation, pilots were able to take notes of anything
happening and time them in reference to the cockpit clock). A positively identified failure
reported within one minute of occurring would score as ‘1’; within two minutes as ‘2’;
three minutes or longer as ‘3’; and as ‘missing’ when no time was reported or when the
failure was not correctly identified, irrespective of time²⁷. All three scores were then
averaged into a single component. Timing SA could, thus, range between a minimum of
‘1’ (if all three failures were first identified within one minute of occurring) and a
maximum of ‘3’ (if all three failures were identified, but each took three minutes or
longer to be first perceived).
²⁷ In the original manuscript this scale is reversed, counting as ‘1’ times of three minutes or more, and as
‘3’ times within the minute. Despite explicit acknowledgement of such reversed scale, however, results
were nonetheless misinterpreted, leading also to tests being carried out in the wrong direction. The results
provided here thus differ noticeably from those in the original manuscript.
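As a sanity check, the scoring rule just described can be sketched in Python. The function names are ours, and how missing scores enter the average is our assumption, as the text does not say:

```python
def timing_score(minutes, identified):
    """Score one failure per the rule above: 1 if first perceived within one
    minute, 2 if within two minutes, 3 if three minutes or longer; missing
    (None) if not reported or not correctly identified."""
    if not identified or minutes is None:
        return None
    return min(max(int(minutes), 1), 3)

def timing_sa(scores):
    """Average the available failure scores into a single component.
    (Averaging over non-missing scores only is our assumption.)"""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else None
```

For example, a pilot identifying the three failures after 1, 2, and 5 minutes would obtain scores of 1, 2, and 3, for a Timing SA of 2.0.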
4.1. Timing SA; exploratory data analysis
A descriptive BCI is calculated assuming flat priors, is centered on the mean of the
sample, and covers, for example, the central 95% of the posterior frequency distribution
either side of the mean. Such a posterior frequency distribution has a straightforward
interpretation in Bayesian statistics: 95% of credible estimates for the parameter are
within the interval, and the probability of such estimates diminishes as we move towards
the tails of the distribution. All in all, however, we are still interested in the entire interval.
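Numerically, such a flat-prior BCI for a mean coincides with the classical t-interval around the sample mean, so it can be sketched with SciPy; the sample values below are invented for illustration, and the function name is ours.

```python
import numpy as np
from scipy import stats

def flat_prior_bci(x, level=0.95):
    """Central credible interval for the mean under a flat prior. Numerically
    this coincides with the classical t-interval, although its interpretation
    here is Bayesian: 95% of credible estimates lie within it."""
    x = np.asarray(x, dtype=float)
    m, se = x.mean(), stats.sem(x)
    return stats.t.interval(level, len(x) - 1, loc=m, scale=se)

sample = [2.5, 3.0, 2.0, 3.5, 2.8, 3.1, 2.6]  # invented scores
lo, hi = flat_prior_bci(sample)
```

The interval is symmetric around the sample mean, with credibility diminishing towards its ends, as described above.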
JASP 0.9.0.1: We select the tab ‵T-Tests′, and choose the option ‶Bayesian Independent
Samples T-Test″. We then select ‘Timing SA’ as our dependent variable, and ‘Group’ as
our grouping variable, and the same descriptive options selected earlier: ‶Descriptives″,
‶Descriptives plots″, and 95% ‷Credible interval‴.
The observed results for Timing SA (Table 10) show that the experimental group
performed slightly worse than the control group, thus obtaining a larger time average
(M = 2.89; 95% BCI [2.70, 3.06]) than the control group (M = 2.43; 95% BCI [2.03,
2.82]).
In the context of the time scale measuring them, however, both groups performed
relatively poorly, taking longer than two minutes to first perceive whatever failures were
perceived. As the accompanying figure illustrates, the control group performed noticeably
better, although also with larger variability in performance, compared to the experimental
group. In any case, the performance of the experimental group also goes against initial
expectations. Of interest is the number of pilots reporting perceived failures, with 20
pilots in the control group reporting at least one failure, against only 13 pilots in the
experimental group doing so.
4.2. Timing SA; test of significance
Timing SA is a variable that does not accord with the basic assumptions of a parametric
t-test, especially as regards the normal distribution of the variable. JASP provides an
alternative rank-based, non-parametric test for this purpose: the Mann-Whitney U test.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Independent
Samples T-Test″, then select ‘Timing SA’ as our dependent variable, and ‘Group’ as our
grouping variable. (Under ‵Assumption Checks′ we could select both ‶Normality″ and
‶Equality of variances″ to check whether the data fulfill the expected parametric
assumptions for the t-test. As we already checked those earlier, we skip this step.) We
select both ‶Mann-Whitney″ and ‶Student″ under ‵Tests′ (in the context of this research,
Mann-Whitney’s U test is the one to inform the significance of the results, but Student’s
t-test gives us effect sizes which are more relatable, thus interpretable, by comparison).
In order to determine the appropriate directional hypothesis, we arbitrarily select ‶Group
1 > Group 2″, which, upon reading the note provided with its output, turns out to be the
correct option. We also select the same ‵Additional Statistics′ selected earlier: ‶Effect
size″ and its ‷Confidence Interval‴ set at 95%.
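For readers wanting to reproduce this pair of directional tests outside JASP, a SciPy sketch follows. The two groups below are invented stand-ins, `alternative='greater'` mirrors the ‶Group 1 > Group 2″ hypothesis, and the Cohen's d conversion at the end is a common approximation rather than a JASP output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(2.9, 0.4, 18)  # invented experimental scores
group2 = rng.normal(2.4, 0.6, 20)  # invented control scores

# Non-parametric Mann-Whitney U test, directional (Group 1 > Group 2)
u, p_u = stats.mannwhitneyu(group1, group2, alternative='greater')

# Parametric Student's t-test, same direction, for a relatable effect size
t, p_t = stats.ttest_ind(group1, group2, alternative='greater')

# Cohen's d recovered from the t statistic (pooled-SD assumption)
n1, n2 = len(group1), len(group2)
d = t * np.sqrt(1 / n1 + 1 / n2)
```

As in JASP, the U test informs significance while the t-based d offers a more relatable effect size for comparison across variables.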
Timing SA, because it is related to Active SA, was also expected to be positively
affected by the intervention, the research hypothesis thus being also a directional one.
Table 11 thus comprises directional, one-tailed, statistics. The main test is the non-
parametric Mann-Whitney U test, although the Student’s t-test is also given, as it helps
provide a common ground for comparison with previous variables.
From these results we thus learn that the test results are statistically non-significant
and, because of the one-tailed nature of the inference, the null hypothesis of either no
differential effect or a negative effect between groups cannot be rejected. We cannot,
however, conclude in favour of the null hypothesis either; thus, we are prevented from
claiming a negative effect of the checklist on performance, as suggested by the descriptive
data (i.e., non-rejection does not entitle us to accept the null hypothesis).
4.3. Timing SA; Bayes factor analysis
Bayesian inference, including Jeffreys’s Bayes factor, starts from the perspective that the
observed sample is given; that is, it is not treated as one of many potential samples from a
population of samples. As the inference is based on the observed sample as is, there is no
need to check whether the sample fits the assumptions of one test or another.
Therefore, the same analysis applies to any variable, irrespective of its normality,
homogeneity of variance, etc.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Bayesian
Independent Samples T-Test″, then select ‘Timing SA’ as our dependent variable, and
‘Group’ as our grouping variable. We also select ‶Group 1 > Group 2″ as research
hypothesis, and the appropriate ‵Bayes Factor′ to run—which turns out to be ‶BF01″—as
well as the same plots selected earlier: ‶Prior and posterior″ with ‷Additional info‴, and
‶Sequential analysis″.
The Bayes factor analysis shows that the nil model (M0) is about seven times more likely
than the alternative model (BF01 = 7.20; Table 12). The one-tailed posterior distribution
shows the sample effect size to be small (median = 0.105; 95% BCI [0.003, 0.438]) and
the evidence to moderately support the nil model—that there was no positive effect of
the checklist on the speed with which failures were perceived (without necessarily ruling
out a negative effect).
Table 12 | Timing SA; Bayesian Independent Samples T-Test: BF01 = 7.195
4.4. Timing SA; two-tailed analyses
JASP proves to be quite flexible for carrying out further data analyses. This flexibility,
however, risks being abused in a search for statistical significance and/or Bayesian
support for a positive result. Caution as well as full reporting ought to go hand-in-hand
with such statistical flexibility.
JASP 0.9.0.1: In ‵Results′, we click on the table ‶Independent Samples T-Test″ and
change our research hypothesis to the nondirectional one ‶Group 1 ≠ Group 2″. We, then,
click on the table ‶Bayesian Independent Samples T-Test″ and also change our research
hypothesis to the nondirectional one ‶Group 1 ≠ Group 2″. We double-check whether the
previously selected ‵Bayes Factor′ command is the appropriate one—which it is not—and
select ‶BF10″, instead.
As done earlier, it seems relevant to explore further the mismatch between our
initial expectations of improvement and the observed results. Table 13 summarizes the
results of the two-tailed test of significance. We observe that the p-value is not significant
(U = 98.50, p = 0.145). Therefore, we have no strong statistical backing to reject the null
hypothesis of no negative effect²⁸ (i.e., the probability of erring if rejecting it is about
15%). Comparatively speaking, the effect size is approximately moderate, yet its
interval still spans both sides of ‘0’ (r = -0.24; Cohen’s d = -0.67, 95% FCI [-1.38, 0.06]).
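JASP's rank-biserial correlation can be recovered directly from the U statistic as r = 1 - 2U/(n1*n2). With group sizes of 13 and 20 (our assumption, suggested by the numbers of pilots reporting failures in the descriptives), this reproduces the magnitude |r| = 0.24 above; the sign in JASP follows the direction of the comparison.

```python
def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from a Mann-Whitney U statistic:
    r = 1 - 2U / (n1 * n2)."""
    return 1 - 2 * u / (n1 * n2)

r = rank_biserial(98.5, 13, 20)  # group sizes are our assumption -> about 0.24
```

This conversion is handy for checking, or reporting, effect sizes when only the U statistic is available.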
Table 14 summarizes the results of the Bayesian test. The Bayes factor analysis
shows that the alternative model (M1) would be only slightly more likely than the nil
model (BF10 = 1.25), and in the direction of a negative effect. However, the evidence is
²⁸ We have already tested the positive effect earlier, so here we can simply focus on the potential for a
negative effect, yet still using the more conservative nondirectional test.
simply too flimsy to support any credible claim regarding a negative effect of the
intervention on the speed of failure perception²⁹.
Table 14 | Timing SA; Bayesian Independent Samples T-Test: BF10 = 1.247
²⁹ Notice how the Bayesian interpretation leads to an ‘anecdotal’ statement, which may encourage a claim
that is rather unwarranted. As Bayesian statistics do not work with error probabilities, a parallel frequentist
test helps provide a more moderate understanding of the Bayesian results (namely, that claiming anything
from the data, whether anecdotally or not, has a reasonably large probability of being erroneous [i.e.,
p = 0.15]).
³⁰ Continual SA was recalculated in this tutorial to account for performance in each failure more
systematically. These results thus differ noticeably from those reported in the original manuscript.
5.1. Continual SA; exploratory data analysis
Because descriptive BCIs are calculated assuming flat priors, they return similar results
to those returned by frequentist confidence intervals, although the interpretation of the two
types of interval necessarily differs. We prefer BCIs simply because their interpretation
accords better with that of a frequency distribution (although FCIs are often wrongly
interpreted as BCIs, e.g., by Cumming, 2012)³¹.
JASP 0.9.0.1: We select the tab ‵T-Tests′, and choose the option ‶Bayesian Independent
Samples T-Test″. We then select ‘Continual SA’ as our dependent variable, and ‘Group’
as our grouping variable, as well as the same descriptive options selected earlier:
‶Descriptives″, ‶Descriptives plots″, and 95% ‷Credible interval‴.
The observed results for Continual SA (Table 15) show that the control group
performed slightly better than the experimental group, obtaining a smaller average
(M = 3.38; 95% BCI [3.17, 3.58]) than the experimental group (M = 3.74; 95% BCI [3.62,
3.86]).
³¹ Cumming generates his ‘cat’s eye’ representations from the sampling distribution of means. He uses it
as a description of the sample, including its centrality (i.e., the mean) and a coverage of the distribution
(e.g., a 95% interval). As an inferential statistic, however, a FCI is an output calculated from such a sampling
distribution (thus, it does not describe the sample), gives equal probability to all estimates (thus, the mean
is as probable as any other location, so it is irrelevant to draw both the mean and the frequency distribution),
and simply covers the specified percentage of the sampling distribution closest to the mean as an inferential
statistic for the true location of the population parameter (thus, either the true parameter is one of the 95%
of estimates within the interval or one of the 5% outside it; that is, there is a 5% chance [of error] that the
parameter is outside the interval). A BCI, on the other hand, is a posterior frequency distribution and can
be represented as such: with a measure of centrality, a probability distribution, and an invitation to a
subjective belief or confidence that the parameter is in the interval, more probably closer to the mean than
to the tails of the distribution.
In the context of the scale measuring them, however, both groups performed
relatively poorly, being rather close to the maximum anchor for poor performance
(maximum = ‘4’). As the accompanying figure illustrates, the control group performed
noticeably better, in relative terms, a performance that also goes against initial expectations.
5.2. Continual SA; test of significance
Continual SA does not accord with the basic assumptions of a t-test either, so we shall rely
on the non-parametric test for interpreting statistical significance.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Independent
Samples T-Test″, then select ‘Continual SA’ as our dependent variable, and ‘Group’ as
our grouping variable. We select both ‶Mann-Whitney″ and ‶Student″ under ‵Tests′. In
order to determine the appropriate directional hypothesis, we arbitrarily select ‶Group 1
> Group 2″, which, upon reading the note provided with its output, turns out to be the
correct option. We also select the same ‵Additional Statistics′ as earlier: ‶Effect size″
and its ‷Confidence Interval‴ set at 95%.
Results show a non-significant U test (p = 0.999). The rank-biserial correlation
between the experimental and control groups is nonetheless large (r = -0.50;
Cohen’s d = -0.93, 95% FCI [-1.43, ∞]).
We thus learn that the test results are statistically non-significant and, because of the
one-tailed nature of the inference, the null hypothesis of either no differential effect or
a negative effect between groups cannot be rejected. We cannot, however, conclude in
favour of the null hypothesis either; thus, we are prevented from claiming a negative effect of
the checklist on performance, as suggested by the descriptive data (i.e., we cannot accept
such a null hypothesis).
5.3. Continual SA; Bayes factor analysis
Because only the observed data weigh in on the posterior distribution, evidence in
favour of one or the other model equally translates as evidence in favour of the hypothesis
modelled by the favoured model. But because Bayes factors do not actually work with the
prior probabilities of the hypotheses, there is a risk of wrongly concluding in favour of a
hypothesis unless we know that the prior probability of such a hypothesis was truly 50%.
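The risk flagged above can be made concrete: the same Bayes factor yields quite different posterior probabilities once the prior probability of the hypothesis departs from 50%. The function below is our illustration, not a JASP output.

```python
def posterior_h1(bf10, prior_h1=0.5):
    """Posterior probability of H1 given a Bayes factor BF10 and a prior
    probability for H1 (posterior odds = BF10 x prior odds)."""
    prior_odds = prior_h1 / (1 - prior_h1)
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Same evidence, different priors for H1:
p_even = posterior_h1(5.0, prior_h1=0.5)     # -> about 0.83
p_skeptic = posterior_h1(5.0, prior_h1=0.1)  # -> about 0.36
```

With a sceptical 10% prior, the very same moderate Bayes factor leaves the hypothesis less likely than not, which is precisely why concluding from the Bayes factor alone presumes 50/50 priors.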
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Bayesian
Independent Samples T-Test″, then select ‘Continual SA’ as our dependent variable, and
‘Group’ as our grouping variable. We also select ‶Group 1 > Group 2″ as research
hypothesis, and the appropriate ‵Bayes Factor′ to run—which turns out to be ‶BF01″—as
well as the same plots selected earlier: ‶Prior and posterior″ with ‷Additional info‴, and
‶Sequential analysis″.
The Bayes factor analysis shows that the nil model (M0) is about twelve times more
likely than the alternative model (BF01 = 12.08; Table 17). The one-tailed posterior
distribution shows the sample effect size to be small (median = 0.10, 95% BCI [0.052,
0.238]) and the evidence moderately-to-strongly supporting the nil model—that there was
no positive effect of the checklist on awareness, as measured by Continual SA.
Table 17 | Continual SA; Bayesian Independent Samples T-Test: BF01 = 12.084
5.4. Continual SA; two-tailed analyses
JASP 0.9.0.1: In ‵Results′, we click on the table ‶Independent Samples T-Test″ and
change our research hypothesis to the nondirectional one ‶Group 1 ≠ Group 2″. We, then,
click on the table ‶Bayesian Independent Samples T-Test″ and change our research
hypothesis to the nondirectional one ‶Group 1 ≠ Group 2″, double-check whether the
previously selected ‵Bayes Factor′ command is the appropriate one—which it is not—and
select ‶BF10″, instead.
The inferred effect sizes are large (rank-biserial correlation = -0.50; Cohen’s
d = -0.93, 95% FCI [-1.53, -0.31]), with the confidence interval suggesting a moderate-to-
large negative effect of the intervention on overall situation awareness.
Table 19 summarizes the results of the Bayesian test. The Bayes factor analysis
shows that the alternative model (M1) is about thirteen times more likely than the nil
model (BF10 = 12.70), with a median effect of d = -0.80 (95% BCI [-1.43, -0.19]), giving
the ‘nil’ effect very low credibility. The evidence in support of the alternative model is
deemed strong, thus supporting the conclusion that there was a pernicious effect of the
checklist on Continual SA (i.e., that it decreased overall situation awareness and
perceptive speed in the experimental group).
Table 19 | Continual SA; Bayesian Independent Samples T-Test: BF10 = 12.695
Final notes
In the previous section we had the opportunity to observe the flexibility of JASP
and the possibilities it offers for learning from the observed data. Indeed, we have
carried out exploratory, Fisherian, and Jeffreysian data analyses; planned analyses as well
as ad hoc analyses motivated by the inconsistency found between expectations and
observed results; one-tailed and two-tailed analyses; and parametric and non-parametric
analyses. We have also obtained both significant and non-significant results.
In the latter case, we have also experienced how a Jeffreysian approach allows us
to learn more from such results, providing us with the strength of evidence (likelihood) in
favour of either the alternative or the nil model, as appropriate. We even had the
opportunity to illustrate a case where both types of inference would contradict each other,
with the frequentist test being non-significant but the Bayes factor returning anecdotal
evidence in favour of the alternative model. In fact, such a case also served to argue in
favour of Mayo’s (2017) statement that a frequentist test could also be used to calibrate
a Jeffreysian result in order to prevent a potentially erroneous Bayesian conclusion.
In the journey above, we also called the reader’s attention to several issues via footnotes.
We will delve a bit deeper into those now.
A first issue concerns discrepancies with the original manuscript, some of them due to
errors of transcription or poor housekeeping (e.g., the discrepancies between expectations
and results led to double-checking the database against the original questionnaire data;
upon correcting an entry error, however, only some statistics in the corresponding table,
but not all, were updated). Other discrepancies arose because of the way a couple of
variables were coded (footnotes 26, 29), typically against advice to do it differently. Such
discrepancies serve to highlight that neither JASP nor Bayesian analyses (nor any other
analysis or statistics software, for that matter) are a magic pill to protect against this
sort of methodological error. It is the researcher’s responsibility to attend to the potential
for error at the methodological and analytical levels. As far as data analysis goes, the quality
of the output depends largely on the integrity of the data, something that prioritises
methodological integrity over data analysis every time.
We also had the opportunity to experience the simplicity of JASP’s command and
output screens, and the immediacy with which results can be perused. Oftentimes, we
need to calculate the results in order to ensure we have chosen the right hypothesis (e.g.,
footnote 18). But, as alerted in footnote 17, such flexibility offers a tempting opportunity
to adopt those research hypotheses which are most propitious towards a desired result.
Again, JASP offers no protection against such behaviour, which comes down to the
integrity of researchers themselves above and beyond the technical features of the
statistical software used.
A third issue to highlight is the constant reminder the tutorial has given us
that the observed results did not match our initial research hypotheses most of
the time. Indeed, even the sensitivity of the test was estimated according to the most
interesting, and expected, result: that of a positive effect of the intervention on situational
awareness. Of course, such a sensitivity analysis contained in itself the possibility for
the effect to be smaller than that selected, including ‘zero’ (footnote 10). However, at no
time prior to the investigation or during the analysis of results did we envisage the
possibility of a negative effect size, let alone a statistically significant one. (We do now
have a possible explanation, which may serve as hypothesis for a follow up study. Until
put to test, however, it is uncertain whether the results observed may actually describe a
real state of affairs.) This issue thus delves into how to understand such prior expectations
in the context of our data, which a fully-fledged Bayesian analysis will do by integrating
the prior probabilities of the hypotheses with the probability of the observed data under
such hypotheses. As we have discussed, however (footnotes 14, 16, 25), a Bayes factor
analysis bypasses such integration and only provides the weight of the evidence given by
the data. This is also the information that a frequentist approach provides. In any case, it
is worthwhile to remind the reader that neither Jeffreysian inference nor JASP deal with
the prior probability of the hypotheses; thus, that they cannot answer the question of the
posterior probability of the same after having observed the data.
[Summary table omitted. Note: for comparability purposes, all p-values are those of t-tests; SEV = Mayo’s severity tests.]
Therefore, it may seem that any one approach is sufficient and the rest redundant. What we
need to remember, however, is that they all seek a different type of learning; thus they are
not necessarily interchangeable. Mind you, the fact that approaches may coincide in their
relative conclusions is not in itself a corroboration of such conclusions, either. Using the
approach most appropriate for the purpose not only allows us to probe the data with the best
tool for a particular probe, but also prevents errors of interpretation and generalization.
References
APA (2010). Publication manual of the American Psychological Association (6th ed.).
Washington, DC: APA.
Fisher, R. A. (1954). Statistical methods for research workers (12th ed.). Edinburgh,
U.K.: Oliver and Boyd.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, N.Y.: Clarendon Press.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B. Jr., Bahník, Š., Bernstein, M. J.,
. . . Nosek, B. A. (2014). Investigating variation in replicability. A “Many Labs”
replication project. Social Psychology, 45, 142-152. doi:10.1027/1864-9335/a000178
Kruschke, J. K. (2011). Doing Bayesian data analysis. A tutorial with R and BUGS.
Oxford, UK: Academic Press.
Mayo, D. G. (1996). Error and the growth of experimental knowledge. Chicago, IL: The
University of Chicago Press.
Mayo, D. (2017). New venues for the statistics wars [Web log post]. Retrieved from
https://errorstatistics.com/2017/10/05/new-venues-for-the-statistics-wars
Mayo, D. G., and Spanos, A. (eds.). (2010). Error and inference. New York, NY:
Cambridge University Press.
Neyman, J., and Pearson, E. S. (1928). On the use and interpretation of certain test
criteria for purposes of statistical inference: part I. Biometrika, 20A, 175–240.
doi: 10.2307/2331945
Perezgonzalez, J. D. (2015c). Confidence intervals and tests are two sides of the same
research question. Frontiers in Psychology, 6, 341.
doi:10.3389/fpsyg.2015.00341
Perezgonzalez, J. D., and Frías-Navarro, M. D. (2018). Retract p < 0.005 and propose
using JASP, instead [version 2]. F1000Research, 6, 2122.
doi:10.12688/f1000research.13389.2
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian
t-tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin &
Review, 16, 225–237. doi:10.3758/PBR.16.2.225
Stengers, I. (2018). Another science is possible. A manifesto for slow science. Cambridge
(U.K.): Polity Press.
Tabachnick, B. G., and Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston,
MA: Allyn & Bacon.
Vincent, N. (2018). Situational awareness of pilots in the cruise (Master’s thesis, Massey
University, New Zealand).
Wagenmakers, E. J., Verhagen, J., Ly, A., Matzke, D., Steingroever, H., Rouder, J. N., .
. . Morey, R. D. (2017). The need for Bayesian hypothesis testing in psychological
science. In S. O. Lilienfeld and I. D. Waldman (Eds.), Psychological science
under scrutiny: recent challenges and proposed solutions (pp. 123–138).
Chichester, UK: John Wiley & Sons. doi:10.1002/9781119095910.ch8