
Frequentist-Bayesian analyses in parallel using JASP - A tutorial1


Jose D. Perezgonzalez
Massey University
(New Zealand)

Abstract

This tutorial demonstrates the use of parallel Frequentist-Bayesian analyses using JASP,
and the plausible inferences one may be able to make from such combined analyses.
Chiefly, the first analytical step is to learn from the sample in a descriptive manner,
without getting confounded by inferential statistics outputs (Tukey, 1977). A second
step is to gauge the soundness of the method and of any inference using frequentist
inferential statistics, following the severity testing approach of Mayo (1996; Mayo
and Spanos, 2010). Finally, the third step uses a Jeffreysian—Bayes factors—approach
(Jeffreys, 1961) in order to make sensible inferences about the hypotheses being tested.
The tutorial is eminently a learning exercise rather than the write-up of a formal article
for publication. Therefore, appropriate commentary is included where needed to foster
such learning.

1 This tutorial also appears as a book chapter in a Spanish book on methodology, as Perezgonzalez, J. D. and Vincent, N. (2022). Análisis paralelos frecuentista-bayesianos con JASP. In D. Frías-Navarro and M. Pascual-Soler (Eds.), Diseño de la investigación, análisis y redacción de los resultados (pp. 607-647). Valencia, Spain: Palmero Ediciones. ISBN 978841233367
TABLE OF CONTENTS

JASP
Data analysis using JASP
  Tukey's exploratory data analysis
  Fisher's tests of significance
  Jeffreys's Bayes factors
Brief research background
Tutorial and results
  1. Data screening
  2. Static Situation Awareness (Static SA)
  3. Active Situation Awareness (Active SA)
  4. Timing Situation Awareness (Timing SA)
  5. Continual Situation Awareness (Continual SA)
Final notes
References

As the saying goes, 'there are more ways than one to skin a cat' and, in the
philosophy of statistics, two of those ways are the frequentist approach and the Bayesian
approach to data analysis and inferential work. The frequentist approach is the one that
prevails in Psychology (e.g., null hypothesis significance testing), although the Bayesian
approach is nowadays gaining traction, especially after the so-called 'crisis in Psychology'
(e.g., Open Science Collaboration, 2012; Klein et al., 2014). The latter has been
traditionally disparaged because of its association with subjective beliefs in the estimation
of the prior probabilities of the hypotheses being researched. Meanwhile, the former has
been preferred because it is associated with objectivism and the repetition of empirical
events in the long run. We are not going to discuss the pros and cons of each perspective.
Instead, this chapter is a tutorial on how to analyse research data using both approaches,
with the help of statistics software (JASP) that combines both types of analysis in a
rather seamless manner (Perezgonzalez and Frías-Navarro, 2018).

The chapter is also going to be a bit different from the other chapters in the book
insofar as it is more "pedantic", so to speak. In order to prevent confusion and foster
understanding, we are going to be fussy in the use of concepts and the interpretation of
results. For example, we will not use the concept of null hypothesis significance testing
(or NHST): on the one hand, because it is a concept often used to refer to, at least, three
different approaches to data testing (Fisher's, Neyman-Pearson's, and a conceptual blend
of both in an arguably debatable manner; e.g., Gigerenzer, 2004; Perezgonzalez, 2015a);
on the other hand, because what those approaches test is data, not hypotheses per se.
Similarly, other confusing concepts we will avoid are alpha levels, Type II errors, power,
and tests of hypotheses (see Perezgonzalez, 2014).

JASP

JASP is open-source statistics software, nowadays in its version 0.9.0.1, and
quite accomplished for performing both frequentist and Bayesian analyses. JASP is free
to download from https://jasp-stats.org. The website also provides links to tutorials.

JASP is quite intuitive to use. It is also very flexible and allows exploring several
aspects of an analysis simply by selecting the appropriate boxes or radio buttons, which
immediately display the appropriate results on the same screen. By deselecting a
command, the corresponding results disappear, keeping the workspace nice and tidy.
Copying and pasting tables and figures from JASP onto another programme (e.g., a word
processing programme) is also straightforward, and figures and tables are transferred
already formatted as per the guidelines of the American Psychological Association (e.g.,
APA, 2010).

JASP allows us to do exploratory data analysis, frequentist inferential analyses,
and Bayesian inferential analyses. Exploratory analyses include descriptives, principal
component analysis, and exploratory factor analysis. Frequentist inferential analyses are
restricted to Fisher’s approach of testing data in reference to a null hypothesis (e.g.,
Fisher, 1954). Bayesian inferential analyses are restricted to Jeffreys’s approach of model
comparison, or Bayes Factor analysis (Jeffreys, 1961). Therefore, when we talk about
frequentist analyses, we are talking about Fisher’s tests of significance only; when we
talk about Bayesian analyses, we are talking about Jeffreys’s Bayes factors.

JASP's main strength is that it allows for easy analysis of the same database under
the above two philosophical perspectives (Perezgonzalez and Frías-Navarro, 2018). The
Fisherian perspective allows for a frequentist approach to severe testing and error
statistics (Mayo and Spanos, 2010), based on theoretical probability models such as the
normal distribution. However, although Fisher's approach may be adequate to disconfirm
or reject a null hypothesis, it is silent when it comes to interpreting a non-significant
result—meanwhile, JASP does not yet cater for Neyman-Pearson's approach (e.g., Neyman
and Pearson, 1928) to compensate for such limitation, so there are no power analyses,
Type II errors, or alternative hypotheses.

Jeffreys’s Bayes factor analysis is able to provide information about the relative
likelihood of the same data under two different models. These models are more extreme
than the ones used by frequentists. For example, the null model is a nil model where the
entire probability density is collapsed onto ‘zero’ (Kruschke, 2011). A Cauchy-based
alternative model, on the other hand, is a flatter one, and research data gets a higher
probability under this model than it would under a frequentist one. By using these extreme
models, Jeffreysians are able to test whether the effect size is absolutely ‘zero’ or, instead,
diverges from 'zero'. And by comparing both models, they are able to determine which
model is the more likely and how strong the evidence is. In so doing, they are also able to
conclude in favour of or against the nil hypothesis in a manner that Fisherians cannot.

An opportune question is, thus, whether we could simply use Jeffreys's Bayes
factors and skip the frequentist test altogether. Perhaps we could (Wagenmakers et al.,
2017; Rouder et al., 2009). However, the main criticism of Bayesian statistics is their
reliance on subjective prior constructs, which, in the case of Jeffreys's approach, is most
apparent in the use of extreme models for data testing. That is, data tend not to be
distributed according to such extreme models (e.g., most data are distributed following
the frequentist normal curve). Therefore, there is little objective justification for using
such extreme models.

Nonetheless, frequentist and Bayesian approaches seek different ways of learning
from experience. The frequentist approach is geared towards learning from error (Mayo
and Spanos, 2010), seeking to disprove hypotheses via testing. The use of more ‘realistic’
models is adequate here, even if we are limited in what we may learn: We learn about the
collective, not the individual sample, and we may be able to disprove the hypothesis yet
not to prove it. But once this hurdle has been passed (i.e., once we have carried out a
frequentist analysis), a Bayesian approach allows us to learn more deeply about the
individual sample. The use of extreme models is thus not conflicting but complementary.
For example, a frequentist test may be significant, yet it is unable to say where, exactly,
to locate the alternative hypothesis, if it exists at all! A Bayesian approach can shed light
onto it by providing a posterior distribution of plausible locations (including diminishing
certainty as we move towards the tails of such posterior distribution). On the other hand,
a frequentist test that is not significant is unable to prove the null hypothesis. A Bayesian
approach can shed light onto it by informing us how likely the nil model is, so that we
may conclude that the nil hypothesis is most likely, thus probably true, if the evidence so
warrants.

It is this dual Fisherian-Jeffreysian analysis that is used in this chapter, as a tutorial.
The information contained below has been extracted from research carried out by Vincent
(2018) as part of his Master’s in Aviation at Massey University, New Zealand. The focus
of the tutorial is a data analysis strategy using JASP. Therefore, although we provide a
summary of the research methodology, this is necessarily a brief account of the original
methods section. We also skip the literature review, discussion, and conclusion sections
of the original project. Furthermore, even the information about the research results has
been edited to accommodate it to a tutorial, with notes of attention interspersed for such
didactic purpose.

Data analysis using JASP

The strategy used for data analysis is, therefore, the focus of this tutorial. Such
strategy caters for three approaches: exploratory data analyses, tests of significance, and
Bayes factor analyses.

Tukey’s exploratory data analysis

Exploratory data analysis is, possibly, the only intentional misnomer in this
tutorial, a nod to Tukey's (1977) promotion of learning by exploring data over
exclusive reliance on inferential statistics. Therefore, all our analyses start with a
descriptive exploration of the data in question, as measured by a particular variable.

JASP provides some possibilities for such data exploration, although we will limit
ourselves to a few of those, such as the size of groups (valid cases), measures of centrality
and variability, and data interpretation in the context of the scale measuring such data.

We also expand our exploratory analysis to incorporate some inferential statistics
that provide a more nuanced description of our research groups, including effect sizes and
credible intervals.

Fisher’s tests of significance

We follow here the conventional use of Fisher’s tests of significance (e.g., 1954),
whereby we put our research data to test under a null hypothesis of no difference between
groups (H0). This hypothesis acts as a “straw man” (i.e., any difference between groups
ought not to be exactly ‘zero’; Perezgonzalez, 2015a) and provides a severe test to the
data (e.g., Mayo, 1996). That is, the test will capture as non-significant 95% of the
differences closest to ‘0’ (either two-tailed or one-tailed, depending on the test).
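
To see what this 'straw man' logic amounts to in practice, the following sketch (ours, in Python with scipy; the group sizes of 23 mirror the study used later in this tutorial) simulates many experiments in which the null hypothesis is exactly true and counts how often a two-tailed t-test flags them as significant—about 5% of the time, as per the level of significance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_significant = 10_000, 0
for _ in range(n_experiments):
    a = rng.normal(size=23)                    # control group, H0 true (no effect)
    b = rng.normal(size=23)                    # experimental group, H0 true
    if stats.ttest_ind(a, b).pvalue < 0.05:
        n_significant += 1
print(n_significant / n_experiments)           # close to 0.05, two-tailed
```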

As is inherent in Fisher's tests of significance, no alternative hypothesis is
specified, although it would be prompted by the rejection of the null hypothesis upon
obtaining a significant result. Therefore, it is not possible to control the power of the test
or to ascertain the probability of a Type II error. We would be equally unable to ascertain
the null hypothesis upon obtaining a non-significant result.

Despite being unable to control power, we could, nonetheless, control the
sensitiveness of the test if we decided on a minimum effect size of interest
(Perezgonzalez, 2016, 2017a). Although the reasoning for such minimum effect size
should be based on practicalities (i.e., practical importance in the real world), it was
challenging for us to gauge such minimum effect size a priori and, therefore, we settled
for a conventional (theoretical) standardized medium effect size for the most interesting
occurrence: that of a directional effect in favor of the effectiveness of the intervention.
The required sample size thus ensured that medium, or larger, effects would be flagged
as statistically significant results. Such sensitiveness analysis thus complements Fisher's
tests of significance in the research project used for this tutorial.
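
As an illustration of this sensitiveness logic (our own sketch, not code from Perezgonzalez, 2016, 2017a), the loop below finds the smallest per-group size at which a medium effect (d = 0.50) just reaches one-tailed significance at the 5% level; it lands on 23 per group, or 46 in total, the sample size of the research described below.

```python
from math import sqrt
from scipy import stats

d_min, alpha = 0.50, 0.05          # minimum effect size of interest; one-tailed level
n = 2                              # per-group size, assuming two equal groups
while True:
    t = d_min * sqrt(n / 2)        # t-value implied by d for two groups of size n
    if stats.t.sf(t, 2 * n - 2) <= alpha:
        break                      # the minimum effect is now flagged as significant
    n += 1
print(n, 2 * n)                    # 23 per group, 46 in total
```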

Jeffreys’s Bayes factors

Following Fisher’s test, we use a Bayes factor analysis (Jeffreys, 1961) to shed
light both on the likelihood of the null hypothesis in case of a non-significant result and
on the likelihood of an alternative hypothesis in the case of a significant result.

JASP’s Bayes Factor analysis compares the probability of the same data under
two different models. The default model for no effect is a nil model (M0), that the effect
size in the population is exactly ‘zero’. A Cauchy distribution is JASP’s default
alternative model (M1). The Cauchy distribution is symmetric yet fat-tailed (it is, in
fact, a t-distribution with one degree of freedom), which gives a
higher probability to effect sizes different from 'zero'2. Compared to the "normal" (t)
distribution used with Fisher’s tests, Bayes Factors allow comparing data under two
extreme models.

Therefore, following a significant frequentist outcome, a Bayes Factor allows us
to answer the question: How likely is an alternative hypothesis—of some effect in the
population—to the nil hypothesis—of no effect whatsoever? Furthermore, following a
non-significant frequentist outcome, a Bayes Factor allows us to answer the question:
How close to ‘zero’ is such non-significant result? Said otherwise, it allows us to decide
whether to accept the frequentist null hypothesis—a Fisher’s test of significance can only
reject the null hypothesis, never accept it.

2 JASP 0.9.0.1 also allows using other distributions for the alternative model, such as the normal distribution and the t distribution.
Brief research background

The original research project aimed to assess whether a cockpit checklist helped
increase the situation awareness of pilots when not actively operating an aircraft (i.e.,
when flying in autopilot mode).

The sample was a convenience one, comprising 46 fixed-wing student pilots
training in different flight schools in the North Island of New Zealand; participation was
voluntary. The total sample size was determined to be sensitive to a medium effect size
(Cohen’s d = 0.50) for a directional Student’s t-test for independent means at the
conventional 5% level of significance (Perezgonzalez, 2016, 2017a).

The research followed an experimental design with a control group. Participants
were split into the control and experimental groups by the flipping of a coin; to give
groups even numbers, when one participant was put into either the control or the
experimental group, the next participant would be put in the alternative group.
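
A sketch of this allocation scheme (purely illustrative; the seed and group labels are ours): every other participant is assigned by a coin flip, and the following participant goes to the opposite group, keeping group sizes even.

```python
import random

random.seed(1)                                           # illustrative seed
allocation = []
for _ in range(23):                                      # 46 participants, in pairs
    first = random.choice(['control', 'experimental'])   # coin flip
    second = 'experimental' if first == 'control' else 'control'
    allocation += [first, second]                        # the next one balances the groups
print(allocation.count('control'), allocation.count('experimental'))  # 23 and 23
```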

Both groups faced similar test conditions: a flight simulation using Microsoft
Flight Simulator X, run on a laptop. The simulation lasted 10 minutes, simulated a Cessna
172 Skyhawk, and comprised a flight between two points conducted at 3500 feet and at a
heading of 000, under visual flight rules (VFR). The flight had both a change in heading
and a change in altitude, and three pre-set failures occurring during the flight (at 3, 5,
and 8 minutes, respectively). The flight was hands-off, as the pilots were not required to
fly the aircraft but simply sit and keep a lookout—none was briefed regarding the
possibility of flight changes or failures. Participants were also provided with an Apple
iPad displaying aeronautical plate information for the destination
airport, and blank paper for notes. They were made aware of the cockpit clock and were
encouraged to take notes and refer to such clock if so desired.

Participants in the experimental group were also provided with a three-item
checklist to be used during the flight (lookout, check alignment of compass and
directional indicator, check temperature and pressure) and which could be carried out in
approximately 10 to 15 seconds. They were instructed to run the checklist at about
two-minute intervals (as per the cockpit clock). Although the check procedure was not
particularly geared towards any of the pre-set failures, it was expected to keep a higher
level of attention on how the flight was progressing, as well as motivate higher situational
awareness. The control group did not have access to such checklist. Therefore, the only
difference between both groups was the use of the checklist.

After the flight, all participants filled in a ten-item questionnaire querying
different aspects of the flight. When answering the questionnaire, the participants had no
access to either the iPad or laptop and needed to rely on memory or any notes taken during
the simulation.

Tutorial and results

1. Data screening3

Data screening prior to any data analysis allows us to quickly peruse the integrity of the
data (Tabachnick and Fidell, 2001), and the descriptives tab in JASP is quite handy for
this purpose. All variables should be screened at once, corrected if necessary (e.g., missing
values, outliers, etc.), and any corrections should be documented, as appropriate. Data
screening should be thorough but only the most informative results need to be
communicated to the reader.
JASP 0.9.0.1: We open JASP, upload the corresponding database, select the tab
‵Descriptives′, and choose ‶Descriptive Statistics″. We then select all relevant variables
for screening. (It is also possible to screen subgroups by selecting the corresponding
variable into the box ‵Split′.) JASP displays basic descriptive statistics, including missing
cases, and minimum and maximum values. The box ‵Frequency tables (nominal and
ordinal variables)′ allows for a quick perusal of frequencies, percentages, and missing
values for nominal and ordinal variables. Under ‵Plots′ we choose ‶Distribution plots″
and ‶Boxplots″. Under ‵Statistics′ we select other descriptives of interest: ‷Median‴,
‷Skewness‴, and ‷Kurtosis‴, whose outputs get automatically incorporated into the
earlier table.

Screening results for our research variables4 are summarized in Table 1. Valid
(and missing cases—not provided but calculable by subtraction) and minimum and
maximum values accord with expectations. Medians are close to the corresponding means,
indicating relatively similar measures of centrality, and the z-scores for skewness and
kurtosis (calculated by dividing each statistic by its standard error, following Tabachnick
and Fidell, 2001) do not show serious departures from normality for Static SA and Active
SA, but show non-normal skewness for Timing SA, and non-normal skewness and
kurtosis for Continual SA.

3 Screening results, albeit carried out, were not provided in the original research.
4 There were more variables in the original research. Here we will restrict ourselves to the four variables used in this tutorial.
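
The z-scores in Table 1 can be approximated outside JASP; a minimal sketch, assuming JASP reports bias-corrected sample statistics and the standard-error formulas given by Tabachnick and Fidell (2001):

```python
import numpy as np
from scipy import stats

def z_skew_kurt(x):
    # Z-scores for skewness and excess kurtosis: statistic divided by its
    # standard error (SE formulas as in Tabachnick and Fidell, 2001).
    x = np.asarray(x, dtype=float)
    n = len(x)
    se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = np.sqrt(24 * n * (n - 1) ** 2 /
                      ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    g1 = stats.skew(x, bias=False)           # bias-corrected skewness
    g2 = stats.kurtosis(x, bias=False)       # bias-corrected excess kurtosis
    return g1 / se_skew, g2 / se_kurt

# z_skew, z_kurt = z_skew_kurt(static_sa)    # hypothetical array of raw scores
```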

Static SA and Active SA are relatively symmetric in their distributions (plots not
provided). Static SA contains several outliers (two outliers at each of the values '7', '2',
and '1'). As the outliers are located in both tails of the distribution—thus, somewhat
balancing the mean—and the normality of the distribution seems not to be affected, we
decided to use the variable as is.

Table 1 | Descriptive Statistics5


Static SA Active SA Timing SA Continual SA
Valid 46 46 33 46
Minimum 1.000 0.000 1.000 2.000
Maximum 7.000 2.000 3.000 4.000
Mean 4.304 0.957 2.606 3.558
Median 4.000 1.000 3.000 3.670
Std. Deviation 1.314 0.698 0.715 0.429
Z Skewness -0.300 0.169 -3.804 -4.546
Z Kurtosis 0.118 -1.259 1.108 5.224

Timing SA is left-skewed, with five participants scoring towards the minimum
values ('1.5' and '1') appearing as outliers. The nature of the variable, which only counted
reported times, implies that the absence of such reporting counts as a missing value
(totalling 13 missing values). Therefore, the skewed distribution of the variable and the
presence of outliers is not surprising.

The limitations of Timing SA worsen somewhat with Continual SA, which is a
composite of both Active SA and Timing SA and includes all events, whether reported or
not (thus, it also counts non-perceived failures), as well as their reported times (with the
value ‘4’ standing for not-applicable timings). Continual SA shows an asymmetric and
peaked distribution, no missing cases, and two outliers towards the lower values on the
scale (‘2.5’ and ‘2’ respectively).

As we are going to carry out a dual frequentist-Bayesian analysis and because the
normality of variables is of little concern to Bayesian statistics, we decided to leave both
variables unchanged, opting instead for nonparametric tests when doing frequentist
analyses on Timing SA and Continual SA.

5 Although the table was copied from JASP, we have nonetheless eliminated redundant information, reorganized the results, and substituted z-scores for skewness and kurtosis.

2. Static Situation Awareness (Static SA)

Static SA was measured using seven items, each assessing a piece of information
that a pilot should be aware of (such as the initial altitude of the flight, meteorological
conditions, and information about the destination aerodrome) but which is not part of
active cockpit operations. Responses to each of the items were scored as a dichotomy—
i.e., as being correct or not—then added up into a single component. The resulting
measure of Static SA could, thus, range between a minimum of ‘0’ (if no answer was
correct) and a maximum of ‘7’ (if all answers were correct).

2.1. Static SA; exploratory data analysis

Basic descriptive data for an exploratory analysis can be obtained from the Bayesian
options in JASP. We are mainly interested in group descriptives and credible intervals.
(We prefer credible intervals over frequentist confidence intervals because they accord
well with a descriptive interpretation of the samples’ sampling distributions; that is, a
measure of centrality and 95% coverage of the sampling distribution. In any case, JASP
has the quirky feature that confidence intervals are for the group differences, not for the
individual groups, so such confidence intervals would have to be calculated and plotted
by hand.)
JASP 0.9.0.1: We select the tab ‵T-Tests′, and choose the option ‶Bayesian Independent
Samples T-Test″. We then select ‘Static SA’ as our dependent variable, and ‘Group’6 as
our grouping variable. Under ‵Additional Statistics′ we tick in the box ‶Descriptives″—
which compiles group descriptives—and under ‵Plots′ we choose ‶Descriptives plots″
with ‷Credible interval‴ set at 95%—which compiles the appropriate graph and also adds
the credible intervals to the earlier table.

The observed results were similar for each group, with only small differences in
centrality and variability (Table 2). The experimental group showed a slightly lower mean
(M = 4.17)—also reflected in the somewhat off-centred credible interval (95% BCI7
[3.60, 4.75])—compared to the control group (M = 4.44; 95% BCI [3.87, 5.00]).

6 In the original research this variable was called 'Checklist or no'.
7 BCI stands for Bayesian credible interval, as opposed to FCI, or frequentist confidence interval. In the original study, FCIs were provided. Here we substitute BCIs, instead.
In the context of the scale measuring them, both groups performed relatively well,
remembering, on average, four items out of seven (i.e., about 57% of the items).

Table 2 | Static SA; Group Descriptives


95% BCI
Group N Mean SD Lower Upper

Control 23 4.435 1.308 3.869 5.001

Experimental 23 4.174 1.337 3.596 4.752
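
Under flat priors these descriptive BCIs coincide numerically with t-based intervals around each group mean, so they can be checked by hand; the following sketch (ours) reproduces Table 2 to three decimals:

```python
from math import sqrt
from scipy import stats

def flat_prior_bci(mean, sd, n, coverage=0.95):
    # With a flat prior on the mean, the posterior is a scaled t distribution
    # centred on the sample mean, so the BCI is a t-based interval.
    margin = stats.t.ppf((1 + coverage) / 2, n - 1) * sd / sqrt(n)
    return mean - margin, mean + margin

print(flat_prior_bci(4.435, 1.308, 23))  # control: ~(3.869, 5.001)
print(flat_prior_bci(4.174, 1.337, 23))  # experimental: ~(3.596, 4.752)
```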

2.2. Static SA; test of significance

Although Fisher’s tests of significance test the observed data under null hypotheses, JASP
lists research hypotheses, instead. In our case, we expected the experimental intervention
to have no effect on Static SA; therefore, we are after a non-directional test (i.e., the
research hypothesis that the means of both groups differ, without specifying a direction8).
Group allocation in JASP is automatic, so we need to refer to descriptive statistics in
order to interpret the results correctly. As far as inferential statistics go, we are interested
in the most informative ones, which are the standardized effect sizes of the difference
between groups and the test statistics.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Independent
Samples T-Test″, then select ‘Static SA’ as our dependent variable, and ‘Group’ as our
grouping variable. (Before carrying out the test, we could check whether the assumptions
for normality and equality of variance hold. As assumptions were checked previously, we
skip this step.) We select ‶Student″ as test, ‶Group 1 ≠ Group 2″ as research hypothesis,
plus some ‵Additional Statistics′: ‶Effect size″ and its ‷Confidence Interval‴ set at 95%.

Because Static SA comprises items that a pilot should remember (thus, be aware
of) but which do not require continual monitoring, they were not expected to be
directionally affected by the experimental intervention. Table 3 thus shows non-
directional, two-tailed, statistics.

Results show that the standardized effect mean difference between groups was a
Cohen's d of 0.20, not rare in psychology and one which is considered small but not
negligible. However, as we decided, a priori, to be only interested in medium to large
effect sizes, such a small effect size is considered trivial in context.

8 The null hypothesis used for the test would read as, H0: the Experimental group will not perform significantly differently from the Control group (for an appropriate measure of significance).

Table 3 | Static SA; Independent Samples T-Test


95% FCI for Cohen's d
t df p Cohen's d Lower Upper
0.669 44 0.507 0.197 -0.383 0.776

The confidence interval⁹ for such a standardized effect difference runs between
-0.38 and 0.78¹⁰, thus crossing '0' and signalling a non-significant result. Indeed, the two-
tailed t-test for independent samples returned such a non-significant result (p = 0.51¹¹).

From these results we thus learn that the observed data shows a high probability
under the null hypothesis of no effect (i.e., it is expected to occur about 50% of the time
when no differences ensue)12. Therefore, the substantive null hypothesis of no effect
cannot be rejected13—however, we cannot say there is no effect (i.e., we cannot accept
such null hypothesis)14.
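
Table 3 can be reproduced from the group summaries in Table 2 alone; a quick sketch using scipy's summary-statistics t-test:

```python
from math import sqrt
from scipy import stats

m1, s1, n1 = 4.435, 1.308, 23    # control group (Table 2)
m2, s2, n2 = 4.174, 1.337, 23    # experimental group (Table 2)

t, p = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2)         # two-tailed Student
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))  # pooled SD
d = (m1 - m2) / sp                                                # Cohen's d
print(round(t, 3), round(p, 3), round(d, 3))                      # 0.669, 0.507, 0.197
```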

9 Frequentist confidence intervals (FCI) are relevant here, given that a significance test is also frequentist (Perezgonzalez, 2015c). Thus, a FCI calculates the population effect sizes that can be rejected with a margin of error of 5%—i.e., those that do not fall within the 95% range of calculated locations for the population parameter closest to the observed sample (Perezgonzalez, 2017b). Therefore, such inference is subjected to error (there is a 5% chance the true parameter is outside the interval), yet neither the interval itself represents a density distribution nor can we make probability statements in the manner Bayesian credible intervals do (i.e., we cannot be 95% confident the parameter is in the interval).
10 This interval was mistakenly reported as [-0.53, 1.00] in the original manuscript.
11 The non-significance of a small effect size is not surprising, considering that the sensitiveness of the test was tailored to a minimum medium effect size (d = 0.5), one-tailed. Indeed, we could say the significance test is redundant once we know the effect size, as effects smaller than medium size will not be statistically significant.
12 This is the only information a p-value really provides: the probability of the data under the statistical H0 (Perezgonzalez, 2015b).
13 The rejection of the substantive H0 depends not on the p-value but on a Modus Tollens determined by the level of significance. The Modus Tollens sets the logical syllogism for the data to contradict the substantive H0: if H0 is true, then the observed data will be not significant (at a given level of significance). Therefore, non-significant results cannot contradict the hypothesis, rendering the syllogism inapplicable and any subsequent conclusion illogical (Perezgonzalez, 2017c).
14 Because H0 is used as a 'straw man' hypothesis to be rejected, there is no equivalent Modus Ponens syllogism in support of H0, so non-significant data cannot be used to confirm H0, either. In any case, the test is not powerful enough to probe H0 with severity (Mayo, 1996).
2.3. Static SA; Bayes factor analysis

Jeffreys’s Bayes factor tests the observed data under two models, which JASP also lists
as research hypotheses and refers to them as H0 and H1, respectively15. Regarding
Bayesian statistics, we are interested in the Bayes Factor as indicative of the strength of
the evidence in favor of either model.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option for ‶Bayesian
Independent Samples T-Test″, then select ‘Static SA’ as our dependent variable, and
‘Group’ as our grouping variable. We select ‶Group 1 ≠ Group 2″ as research
hypothesis. As there is no easy way of predicting the appropriate ‵Bayes Factor′ to run,
we randomly select ‶BF10″, check the results and, as it is smaller than ‘1’ (i.e., 0.351),
we select ‶BF01″, instead (as both results are interchangeable, convention calls for
reporting Bayes factors in their most interpretable form, which are those greater than
‘1’). Further, we choose informative ‵Plots′: ‶Prior and posterior″ distributions with
‷Additional info‴, and ‶Sequential analysis″.

The Bayes factor analysis shows that the nil model (M016) is almost three times
more likely than the alternative model (BF01 = 2.85; Table 4). Indeed, the posterior
distribution shows the sample effect size to be slightly off-centred yet ‘zero’ still
commands a large posterior probability.

Table 4 | Static SA; Bayesian Independent Samples T-Test


BF01

2.853

15 Bayes factors test models, not hypotheses per se. In order to prevent confusion with fully developed Bayesian inference, we will refer to models (M0 and M1), instead of to hypotheses (H0 or H1).
16 We will reserve the label 'null' for Fisher's null hypothesis (H0) and use 'nil' for the Bayesian null model (as the model is a point-nil distribution).
From these results we thus learn that the evidence provides some anecdotal (to
moderate) support for the conclusion that the intervention had no effect on Static SA, as
expected17.
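
JASP does not expose its computation, but its default two-sided Bayes factor should follow the JZS formulation of Rouder et al. (2009), with a Cauchy prior scale of 0.707; a sketch by numerical integration, under that assumption, which comes out near the value in Table 4:

```python
from math import exp, pi, sqrt
from scipy import integrate

def jzs_bf10(t, n1, n2, r=0.707):
    # Two-sided JZS Bayes factor for an independent-samples t-test
    # (Rouder et al., 2009); r is the Cauchy prior scale on the effect size.
    nu = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)                       # effective sample size
    null_lik = (1 + t**2 / nu) ** (-(nu + 1) / 2)     # likelihood under M0

    def integrand(g):
        a = 1 + n_eff * g                             # variance inflation under M1
        lik = a**-0.5 * (1 + t**2 / (a * nu)) ** (-(nu + 1) / 2)
        # inverse-gamma(1/2, r^2/2) mixing density, i.e. a Cauchy(0, r) prior
        return lik * r / sqrt(2 * pi) * g**-1.5 * exp(-r**2 / (2 * g))

    alt_lik, _ = integrate.quad(integrand, 0, float('inf'))
    return alt_lik / null_lik

print(1 / jzs_bf10(t=0.669, n1=23, n2=23))            # BF01, near 2.85 (Table 4)
```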

3. Active Situation Awareness (Active SA)

Active SA was measured using three items that required ongoing, albeit not
necessarily constant, monitoring in the cockpit. This monitoring was captured via three
pre-set failures and assessed in the questionnaire by asking the pilots, after the flight had
ended, whether they could positively identify those failures. The response to each item
was, therefore, dichotomous—i.e., as having identified the failure correctly or not—and
all three responses were added up into a single component. Active SA could, thus, range
between a minimum of '0' (if no failure was positively identified) and a maximum of '3'
(if all three failures were identified).

3.1. Active SA; exploratory data analysis

A 95% BCI covers 95% of the range of plausible locations for the inferred population
parameter, so that we may conclude that, with 95% confidence or certainty, the
population parameter is within the credible interval; said otherwise, that there is 95%
probability that the parameter is in the interval.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Bayesian
Independent Samples T-Test″. We then select ‘Active SA’ as our dependent variable and
‘Group’ as our grouping variable, as well as the same descriptive options selected
earlier: ‶Descriptives″, ‶Descriptives plots″, and 95% ‷Credible interval‴.

Results for Active SA (Table 5) show that the control group performed better than
the experimental group, obtaining a higher average on the component (M = 1.22; 95%
BCI [0.96, 1.48]) than the experimental group did (M = 0.70; 95% BCI [0.39, 1.00]).
Such difference in performance is obvious in the accompanying figure, which locates the
control BCI noticeably higher on the scale than the experimental BCI.

17 Bayesian inference allows one to claim support for the nil model if results so suggest. Such support rests on the larger likelihood of one model over the other. For example, the BF here represents the odds favouring the nil model over the alternative model. Such odds can be converted into a probability with the formula P = BF/(1+BF). Thus, the nil model has a 74% chance of being correct (compared to the 26% probability, by difference, of the alternative model).
In the context of the scale measuring them, however, both groups performed
relatively poorly, perceiving, on average, just about one failure out of three possible
(meanwhile, as Table 1 shows, the maximum number of failures detected by either group
was two and the minimum was no detection of failures whatsoever). The control group
also performed better than the experimental group, a performance that goes against initial
expectations.

Table 5 | Active SA; Group Descriptives


95% BCI
Group N Mean SD Lower Upper

Control 23 1.217 0.600 0.958 1.477

Experimental 23 0.696 0.703 0.392 1.000

3.2. Active SA; test of significance

We expected the intervention to have a positive effect on Active SA; thus, we are after a
directional test. As JASP allocates groups automatically, we need to select which of two
directional hypotheses to use for this purpose. Therefore, before jumping straight to
interpreting results, it is necessary to ascertain that we have selected the correct
directional test (JASP offers a handy explanatory note as part of the outputs18, which can
be used for this purpose).
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Independent
Samples T-Test″, then select ‘Active SA’ as our dependent variable, and ‘Group’ as our
grouping variable. We select ‶Student″ as test. In order to determine the appropriate
directional hypothesis, we randomly select ‶Group 1 > Group 2″, read the note given in
the output, realize that it is the wrong directional hypothesis, then select ‶Group 1 <
Group 2″. We also select the same ‵Additional Statistics′ as earlier: ‶Effect size″ and
its ‷Confidence Interval‴ set at 95%.

18 There is an inherent risk here, insofar as one can observe the results, thus be tempted to choose the directional hypothesis with the most lucrative conclusion (i.e., the directional hypothesis that returns a significant result).
As Active SA comprises items that are part of cockpit operations and that were
expected to be perceived with continual monitoring, they were also expected to be
positively affected by the experimental intervention, which motivated such continual
monitoring. Table 6 thus presents directional, one-tailed, statistics19.

Results show that the standardized effect mean difference between groups was a
Cohen's d of 0.80²⁰, conventionally considered a large effect in psychology. The
confidence interval of such effect, one-tailed, ranged from minus infinity to an upper
limit of d = 1.30²¹, thus crossing '0' and signalling a non-significant result. Indeed, the
one-tailed t-test for independent samples returned such a non-significant result (p = 0.995).

We thus learn that test results are statistically non-significant and, because of the
one-tailed nature of the inference, the null hypothesis of either no differential effect or
negative effects between groups cannot be rejected. We cannot, however, conclude in
favour of the null hypothesis, thus we are prevented from claiming a negative effect of
the checklist on performance, as suggested by the descriptive data (i.e., we cannot accept
such possibility even when it is contained within the scope of the null hypothesis).

Table 6 | Active SA; Independent Samples T-Test


95% FCI for Cohen's d
t df p Cohen's d Lower Upper
2.708 44 0.995 0.799 -∞ 1.299
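
The relation between this directional p-value and its two-tailed counterpart (used in section 3.4 below) is easy to verify; a quick sketch:

```python
from scipy import stats

t, df = 2.708, 44                         # Table 6 (control minus experimental)
p_directional = stats.t.cdf(t, df)        # H1: experimental > control, so we need t < 0
p_two_tailed = 2 * stats.t.sf(abs(t), df)
print(round(p_directional, 3), round(p_two_tailed, 3))   # 0.995 and 0.01
```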

3.3. Active SA; Bayes factor analysis

Jeffreys’s Bayes factor tests the probability of the observed data under two models rather
than the probability of the hypotheses proper, as a fully-fledged Bayesian analysis would
do. Indeed, Bayes factors assume uninformative hypotheses—both hypotheses are given
the same prior probability, of 50% each—which is a way of placing the weight of the
evidence shown by the posterior distribution onto the observed data, exclusively.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Bayesian
Independent Samples T-Test″, then select ‘Active SA’ as our dependent variable, and
‘Group’ as our grouping variable. We also select ‶Group 1 < Group 2″ as research
hypothesis, and the appropriate ‵Bayes Factor′ to run—which turns out to be ‶BF01″—as
well as the same plots selected earlier: ‶Prior and posterior″ with ‷Additional info‴, and
‶Sequential analysis″.

19 The directional null hypothesis would read as, H0: the Experimental group will not perform significantly better than the Control group (for an appropriate measure of significance).
20 Cohen's d reads a bit unintuitively here. However, we ought to remember that it reflects the difference between groups yet in favour of the control group (i.e., the control group scored higher than the experimental group, against expectation).
21 These statistics were mistakenly reported as [-∞, 0.10] in the original manuscript.

The Bayes factor analysis shows that the nil model (M0) is almost eleven times
more likely than the alternative model (BF01 = 10.96; Table 7). Indeed, the one-tailed
posterior distribution shows the sample effect size to be practically ‘zero’ (median =
-0.06; 95% BCI [-0.263, -0.003]). The evidence in support of the nil model—that there
was no positive effect of the checklist on Active SA (without necessarily ruling out a
negative effect)—is strong.

Table 7 | Active SA; Bayesian Independent Samples T-Test


BF01

10.959
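
Applying footnote 17's conversion, P = BF/(1+BF), to Table 7 (and assuming, as Bayes factors do, equal prior probabilities):

```python
bf01 = 10.959                # Table 7
p_nil = bf01 / (1 + bf01)    # posterior probability of the nil model
print(round(p_nil, 2))       # 0.92: strong evidence in favour of M0
```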

3.4. Active SA; two-tailed analyses

This section shows further data analyses prompted by unexpected results. One handy
feature of JASP 0.9.0.1 is that it brings up any previous command screen by simply
clicking on a table or figure, thus reducing the need to click on tabs and upload the same
variables each time. Beware, however, that any alteration to commands will
automatically update the results rather than generate new ones.
JASP 0.9.0.1: In ‵Results′, we click on the table ‶Independent Samples T-Test″—which
brings us back to the command screen for the directional test—and change our research
hypothesis to the nondirectional one ‶Group 1 ≠ Group 2″. This updates the table with
the corresponding two-tailed results. We, then, click on the table ‶Bayesian Independent
Samples T-Test″—which brings us back to the appropriate command screen for the
Bayesian directional test—and also change our research hypothesis to the nondirectional
one ‶Group 1 ≠ Group 2″. This automatically updates both the table and the plots. We
double-check that the previously selected ‵Bayes Factor′ command is the appropriate
one—which, in this case, is not—and select ‶BF10″, instead.

Before moving on, it seems relevant to explore further the mismatch between our
initial expectations of improvement and the observed results, especially in view of the
latter noticeably going in the opposite direction. Therefore, we re-run both frequentist and
Bayesian tests for Active SA using a two-tailed approach, for exploratory purposes22.

Table 8 summarizes the results for the test of significance. We observe that the
confidence interval for the effect size runs between 0.19 and 1.40, thus, between a small
and a very large effect size (while half of the interval is above large, i.e., above d = 0.80²³).
The interval does not cross ‘zero’, thus signalling a statistically significant result, as also
confirmed by the small p-value (p = 0.01).

As shown by the descriptives in Table 5, the direction of the effect is in favour of
the control group. The rather large effect size and the limits of the confidence interval,
punctuated by the highly significant result, would see the null hypothesis of no effect
rejected24. These results thus suggest that the checklist has been a hindrance to Active SA
rather than simply unhelpful.

Table 8 | Active SA; Independent Samples T-Test


95% FCI for Cohen's d
t df p Cohen's d Lower Upper
2.708 44 0.010 0.799 0.193 1.396
Note. Two-tailed analyses

Table 9 summarizes the results of the Bayesian test25. The Bayes factor analysis
shows that the alternative model (M1) would be about five times more likely than the nil
model (BF10 = 5.04²⁶). Indeed, the two-tailed posterior distribution shows the sample
effect size to be centered on a median effect d = 0.68, while the credible interval (95%
BCI [0.12, 1.30]) gives 'zero' a very low credibility. The evidence in support of the
alternative model is moderate, thus supporting the conclusion that there was a pernicious
effect of the checklist on Active SA (i.e., that it decreased situation awareness in the
experimental group).

22 Given the post hoc exploratory nature of this secondary analysis, we prefer non-directional tests, as they are statistically more conservative than directional ones.
23 This statistic was mistakenly reported as d = 0.65 in the original manuscript.
24 As said earlier, the Modus Tollens sets the logical syllogism for the data to contradict the substantive H0: if H0 is true, then the observed data will be not significant (at a given level of significance). A significant result thus contradicts the consequent of the syllogism, leading to denying the antecedent (the hypothesis) in a logically sound manner. We can, thus, conclude that the substantive H0 is not true, and may reject it (Perezgonzalez, 2017c).
25 Bayesian tests were not provided in the original manuscript.
26 These odds give the alternative model an 83% chance of being correct (compared to about 17% probability for the nil model; P = BF/[1+BF]). Said otherwise, the posterior probability of the alternative hypothesis has increased by 33 percentage points and that of the nil hypothesis has decreased by 33 percentage points, compared to their prior probabilities of 50% each.

Table 9 | Active SA; Bayesian Independent Samples T-Test


BF10

5.044

Note. Two-tailed analyses

4. Timing Situation Awareness (Timing SA)

Timing SA assesses the time it took the pilots to first perceive any of the three
failures, in full minutes (during the simulation, pilots were able to take notes of anything
happening and time them in reference to the cockpit clock). A positively identified failure
reported within one minute of occurring would score as ‘1’; within two minutes as ‘2’;
three minutes or longer as ‘3’; and as ‘missing’ when no time was reported or when the
failure was not correctly identified, irrespective of time27. All three scores were then
averaged into a single component. Timing SA could, thus, range between a minimum of
‘1’ (if all three failures were first identified within one minute of occurring) and a
maximum of ‘3’ (if all three failures were identified but each was reported as having
occurred with an error longer than two minutes).
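
A sketch of this scoring rule as we read it (the helper and the example times are hypothetical, and we assume the component averages over the reported failures only):

```python
def timing_score(minutes):
    # Score one failure's reporting time, in full minutes; None marks a missing
    # value (no time reported, or the failure not correctly identified).
    if minutes is None:
        return None
    return 1 if minutes <= 1 else (2 if minutes <= 2 else 3)

reports = [1, 3, None]                     # hypothetical pilot: two failures timed
scores = [s for s in map(timing_score, reports) if s is not None]
timing_sa = sum(scores) / len(scores)      # averaged component: 2.0 here
```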

27 In the original manuscript this scale is reversed, counting as '1' times of three minutes or more, and as '3' times within the minute. Despite explicit acknowledgement of such reversed scale, however, results were nonetheless misinterpreted, leading also to tests being carried out in the wrong direction. The results provided here thus differ noticeably from those in the original manuscript.
4.1. Timing SA; exploratory data analysis

A descriptive BCI is calculated assuming flat priors, is centered on the mean of the
sample, and covers, for example, 95% of the posterior frequency probability distribution
on either side of the mean. Such posterior frequency distribution has a straightforward
interpretation in Bayesian statistics: 95% of credible estimates for the parameter are
within the interval, and the probability of such estimates diminishes as we move towards
the tails of the distribution. All in all, however, we are still interested in the entire interval.
JASP 0.9.0.1: We select the tab ‵T-Tests′, and choose the option ‶Bayesian Independent
Samples T-Test″. We then select ‘Timing SA’ as our dependent variable, and ‘Group’ as
our grouping variable, and the same descriptive options selected earlier: ‶Descriptives″,
‶Descriptives plots″, and 95% ‷Credible interval‴.

The observed results for Timing SA (Table 10) show that the experimental group
performed slightly worse than the control group, thus obtaining a larger time average
(M = 2.89; 95% BCI [2.70, 3.06]) than the control group (M = 2.43; 95% BCI [2.03,
2.82]).

In the context of the time scale measuring them, however, both groups performed
relatively poorly, taking longer than two minutes to first perceive whatever failures were
perceived. As the accompanying figure illustrates, the control group performed noticeably
better, although also with larger variability in performance, compared to the experimental
group. In any case, the performance of the experimental group also goes against initial
expectations. Of interest is the number of pilots reporting perceived failures, with 20
pilots in the control group reporting at least one failure, against only 13 pilots in the
experimental group doing so.

Table 10 | Timing SA; Group Descriptives


95% BCI
Group N Mean SD Lower Upper

Control 20 2.425 0.847 2.028 2.822

Experimental 13 2.885 0.300 2.704 3.066

4.2. Timing SA; test of significance

Timing SA is a variable that does not accord with the basic assumptions of a parametric
t-test, especially with regard to the normal distribution of the variable. JASP provides an
alternative rank-based, non-parametric test for this purpose: Mann-Whitney's U test.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Independent
Samples T-Test″, then select ‘Timing SA’ as our dependent variable, and ‘Group’ as our
grouping variable. (Under ‵Assumption Checks′ we could select both ‶Normality″ and
‶Equality of variances″ to check whether the data fulfill the expected parametric
assumptions for the t-test. As we already checked those earlier, we skip this step.) We
select both ‶Mann-Whitney″ and ‶Student″ under ‵Tests′ (in the context of this research,
Mann-Whitney’s U test is the one to inform the significance of the results, but Student’s
t-test gives us effect sizes which are more relatable, thus interpretable, by comparison).
In order to determine the appropriate directional hypothesis, we randomly select ‶Group
1 > Group 2″, which, upon reading the note provided with its output, turns out to be the
correct option. We also select the same ‵Additional Statistics′ selected earlier: ‶Effect
size″ and its ‷Confidence Interval‴ set at 95%.

Timing SA, because it is related to Active SA, was also expected to be positively
affected by the intervention, the research hypothesis thus being also a directional one.
Table 11 thus comprises directional, one-tailed, statistics. The main test is the non-
parametric Mann-Whitney’s U test, albeit the Student’s t-test is also given as it helps
provide a common ground for comparison with previous variables.

Table 11 | Timing SA; Independent Samples T-Test


95% FCI for Effect Size
Test Statistic df p Effect Size Lower Upper
Mann-Whitney 98.500 0.934 ᵃ -0.242 -0.530 ∞
Student -1.873 31 0.965 ᵃ -0.667 -1.264 ∞
Notes. For the Student’s t-test, effect size is given by Cohen's d; for the Mann-Whitney’s test, effect size
is given by the rank biserial correlation.
ᵃ Levene's test is significant (p < .05), suggesting a violation of the equal variance assumption

Results show a non-significant result for the Mann-Whitney's U test (p = 0.934).
The effect size is a medium-sized rank biserial correlation between the experimental and
control groups (r = -0.24). More relatable is the medium-to-large Cohen's d (d = -0.67,
95% FCI [-1.26, ∞]).

From these results we thus learn that test results are statistically non-significant
and, because of the one-tailed nature of the inference, the null hypothesis of either no
differential effect or negative effects between groups cannot be rejected. We cannot,
however, conclude in favor of the null hypothesis, thus we are prevented from claiming
a negative effect of the checklist on performance, as suggested by the descriptive data
(i.e., we cannot accept such possibility of the null hypothesis).
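
The rank biserial correlation reported in Tables 11 and 13 can be recovered from the U statistic with the conversion r = 2U/(n1·n2) − 1, one common definition, which we assume is the one JASP uses since it reproduces the reported value (with raw scores, scipy.stats.mannwhitneyu would return U and the p-value directly):

```python
u, n1, n2 = 98.5, 20, 13              # U (Table 11) and group sizes (Table 10)
rank_biserial = 2 * u / (n1 * n2) - 1
print(round(rank_biserial, 3))        # -0.242, matching Table 11
```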

4.3. Timing SA; Bayes factor analysis

Bayesian inference, including Jeffreys’s Bayes factor, starts from the perspective that the
observed sample is given, thus, that it is not one of the potential samples from a
population of samples. As the inference is based on the observed sample as is, there is no
need to check whether the sample fits assumptions for a particular test or another.
Therefore, the same analysis applies to any variable, irrespective of its normality,
homogeneity of variance, etc.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Bayesian
Independent Samples T-Test″, then select ‘Timing SA’ as our dependent variable, and
‘Group’ as our grouping variable. We also select ‶Group 1 > Group 2″ as research
hypothesis, and the appropriate ‵Bayes Factor′ to run—which turns out to be ‶BF01″—as
well as the same plots selected earlier: ‶Prior and posterior″ with ‷Additional info‴, and
‶Sequential analysis″.

The Bayes factor analysis shows that the nil model (M0) is seven times more likely
than the alternative model (BF01 = 7.20; Table 12). The one-tailed posterior distribution
shows the sample effect size to be small (median = 0.105; 95% BCI [0.003, 0.438]) and
the evidence moderately supporting the nil model—that there was no positive effect of
the checklist on the speed with which failures were perceived (without necessarily ruling
out a negative effect).

Table 12 | Timing SA; Bayesian Independent Samples T-Test


BF01

7.195

4.4. Timing SA; two-tailed analyses

JASP proves to be quite flexible for carrying out further data analyses. This, however,
has the risk of being abused in search of statistical significance and/or Bayesian
support for a positive result. Caution as well as full reporting ought to go hand-in-hand
with such statistical flexibility.
JASP 0.9.0.1: In ‵Results′, we click on the table ‶Independent Samples T-Test″ and
change our research hypothesis to the nondirectional one ‶Group 1 ≠ Group 2″. We, then,
click on the table ‶Bayesian Independent Samples T-Test″ and also change our research
hypothesis to the nondirectional one ‶Group 1 ≠ Group 2″. We double-check whether the
previously selected ‵Bayes Factor′ command is the appropriate one—which it is not—and
select ‶BF10″, instead.

As done earlier, it seems relevant to explore further the mismatch between our
initial expectations of improvement and the observed results. Table 13 summarizes the
results of the two-tailed test of significance. We observe that the p-value is not significant
(U = 98.50, p = 0.145). Therefore, we have no strong statistical backing to reject the null
hypothesis of no negative effect28 (i.e., the probability of erring if rejecting it is about
15%). Comparatively speaking, the effect size is approximately a moderate one yet its
interval still spans both sides of ‘0’ (r = -0.24; Cohen’s d = -0.67, 95% FCI [-1.38, 0.06]).

Table 13 | Timing SA; Independent Samples T-Test


95% FCI for Effect Size
Test Statistic df p Effect Size Lower Upper
Mann-Whitney 98.500 0.145 ᵃ -0.242 -0.576 0.161
Student -1.873 31 0.071 ᵃ -0.667 -1.380 0.056
Notes. For the Student’s t-test, effect size is given by Cohen's d; for the Mann-Whitney’s test, effect size
is given by the rank biserial correlation.
ᵃ Levene's test is significant (p < .05), suggesting a violation of the equal variance assumption.
Two-tailed analyses.

Table 14 summarizes the results of the Bayesian test. The Bayes factor analysis
shows that the alternative model (M1) would be only slightly more likely than the nil
model (BF10 = 1.25) and in the direction of a negative effect. However, the evidence is
just too flimsy for making any credible statement regarding a negative effect of the
intervention on the speed of failure perception29.

28 We have already tested the positive effect earlier, so here we can simply focus on the potential for a negative effect, yet still using the more conservative nondirectional test.

Table 14 | Timing SA; Bayesian Independent Samples T-Test


BF10

1.247

Note. Two-tailed analyses.

5. Continual Situation Awareness (Continual SA)30

Continual SA is a post hoc measure created by combining Active SA and Timing
SA. It assesses both the perception (or not) of failures in the cockpit and the time it took,
in full minutes, to perceive such failures. Continual SA could, thus, range between a
minimum of ‘1’ (if all three failures were identified, each within one minute of occurring)
and a maximum of ‘4’ (if no failure was identified), while ‘3’ stands for the same value
as in Timing SA (i.e., a failure perceived longer than two minutes after first occurring).
Therefore, as scores increase on the scale, they reflect poorer performance in regards to
overall situation awareness (i.e., taking into account not only the time it takes to perceive
a potential problem but also the possibility of failing to perceive one or more of them
altogether).

29 Notice how the Bayesian interpretation leads to an 'anecdotal' statement, which may encourage a claim that is rather unwarranted. As Bayesian statistics do not work with error probabilities, a parallel frequentist test helps provide a more moderate understanding of the Bayesian results (namely, that claiming anything from the data, whether anecdotally or not, has a reasonably large probability of being erroneous [i.e., p = 0.15]).
30 Continual SA was recalculated in this tutorial to account for performance on each failure more systematically. These results thus differ noticeably from those reported in the original manuscript.
5.1. Continual SA; exploratory data analysis

Because descriptive BCIs are calculated assuming flat priors, they return results similar
to those returned by frequentist confidence intervals, albeit the interpretation of the two
types of interval necessarily differs. We prefer BCIs simply because their interpretation
accords better with that of a frequency distribution (although FCIs are often wrongly
interpreted as BCIs, e.g., by Cumming, 2012)31.
JASP 0.9.0.1: We select the tab ‵T-Tests′, and choose the option ‶Bayesian Independent
Samples T-Test″. We then select ‘Continual SA’ as our dependent variable, and ‘Group’
as our grouping variable, as well as the same descriptive options selected earlier:
‶Descriptives″, ‶Descriptives plots″, and 95% ‷Credible interval‴.

The observed results for Continual SA (Table 15) show that the control group
performed slightly better than the experimental group, obtaining a smaller average
(M = 3.38; 95% BCI [3.17, 3.58]) than the experimental group (M = 3.74; 95% BCI [3.62,
3.86]).

Table 15 | Continual SA; Group Descriptives


95% BCI
Group N Mean SD Lower Upper

Control 23 3.377 0.476 3.171 3.583

Experimental 23 3.740 0.284 3.617 3.863

31 Cumming generates his 'cat's eye' representations from the sampling distribution of means. He uses it as a description of the sample, including its centrality (i.e., the mean) and a coverage of the distribution (e.g., a 95% interval). As an inferential statistic, however, a FCI is an output calculated from such sampling distribution (thus, it is not describing the sample), gives equal probability to all estimates (thus, the mean is as probable as any other location, so it is irrelevant to draw both the mean and the frequency distribution), and simply covers the specified percentage of the sampling distribution closest to the mean as an inferential statistic for the true location of the population parameter (thus, either the true parameter is one of the 95% of estimates within the interval or one of the 5% outside it; that is, there is a 5% chance [of error] that the parameter is outside the interval). A BCI, on the other hand, is a posterior frequency distribution and can be represented as such: with a measure of centrality, a probability distribution, and inviting a subjective belief or confidence that the parameter is in the interval, most probably closer to the mean than to the tails of the distribution.
-26-
In the context of the scale measuring them, however, both groups performed
relatively poorly, scoring rather close to the anchor for poorest performance
(maximum = ‘4’). As the accompanying figure illustrates, the control group performed
noticeably better in relative terms, a result that also goes against initial expectations.

5.2. Continual SA; test of significance

Continual SA does not meet the basic assumptions of a t-test either, so we shall rely
on the non-parametric test for interpreting statistical significance.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Independent
Samples T-Test″, then select ‘Continual SA’ as our dependent variable, and ‘Group’ as
our grouping variable. We select both ‶Mann-Whitney″ and ‶Student″ under ‵Tests′. In
order to determine the appropriate directional hypothesis, we randomly select ‶Group 1
> Group 2″, which, upon reading the note provided with its output, turns out to be the
correct option. We also select the same ‵Additional Statistics′ as earlier: ‶Effect size″
and its ‷Confidence Interval‴ set at 95%.

As Continual SA is partly based on Active SA items, and is thus expected to be
positively affected by the intervention, the research hypothesis is also a directional one.
Table 16 thus comprises directional, one-tailed statistics. The main test is the non-
parametric Mann-Whitney U test, although Student’s t-test is also provided for
comparability purposes.

The U test returns a non-significant result (p = 0.999). The effect size is a large
rank biserial correlation between experimental and control groups (r = -0.50; Cohen’s
d = -0.93, 95% FCI [-1.43, ∞]).

Table 16 | Continual SA; Independent Samples T-Test

                                             95% FCI for Effect Size
Test           Statistic   df    p        Effect Size    Lower     Upper
Mann-Whitney   131.500           0.999    -0.503         -0.684    ∞
Student        -3.142      44    0.999    -0.927         -1.433    ∞

Note. For the Student’s t-test, effect size is given by Cohen's d; for the Mann-Whitney test, effect size
is given by the rank biserial correlation.
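These statistics can be cross-checked outside JASP. A minimal Python sketch follows, using SciPy; the score vectors are hypothetical stand-ins with the correct group sizes (the raw data are not reproduced here), and the rank biserial correlation is derived from U as r = 2U/(n1·n2) − 1.

import numpy as np
from scipy import stats

# Hypothetical score vectors standing in for the two groups (n = 23 each);
# the actual raw data are those summarized in Table 15.
control = np.repeat([3.0, 3.33, 3.67, 4.0], [3, 6, 8, 6])
experimental = np.repeat([3.33, 3.67, 4.0], [2, 9, 12])

# One-tailed Mann-Whitney U test of 'control > experimental'
u, p = stats.mannwhitneyu(control, experimental, alternative='greater')

# Rank biserial correlation derived directly from U
r_rb = 2 * u / (len(control) * len(experimental)) - 1

# Cohen's d with a pooled standard deviation, for comparison with the t-test
pooled_sd = np.sqrt((control.var(ddof=1) + experimental.var(ddof=1)) / 2)
d = (control.mean() - experimental.mean()) / pooled_sd

print(f"U = {u:.1f}, p = {p:.3f}, rank biserial r = {r_rb:.3f}, d = {d:.3f}")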

We thus learn that the test results are statistically non-significant and, because of the
one-tailed nature of the inference, the null hypothesis of either no differential effect or
a negative effect between groups cannot be rejected. We cannot, however, conclude in
favour of the null hypothesis; thus, we are prevented from claiming a negative effect of
the checklist on performance, as suggested by the descriptive data (i.e., we cannot accept
such a null hypothesis).
5.3. Continual SA; Bayes factor analysis

Because only the observed data weigh on the posterior distribution, the evidence in
favour of one or the other model equally translates as evidence in favour of the hypothesis
modelled by the favoured model. But because Bayes factors do not actually work with the
prior probabilities of the hypotheses, there is a risk of wrongly concluding in favour of a
hypothesis unless we know that the prior probability of such hypothesis was truly 50%.
JASP 0.9.0.1: We go back to the tab ‵T-Tests′ and choose the option ‶Bayesian
Independent Samples T-Test″, then select ‘Continual SA’ as our dependent variable, and
‘Group’ as our grouping variable. We also select ‶Group 1 > Group 2″ as research
hypothesis, and the appropriate ‵Bayes Factor′ to run—which turns out to be ‶BF01″—as
well as the same plots selected earlier: ‶Prior and posterior″ with ‷Additional info‴, and
‶Sequential analysis″.

The Bayes factor analysis shows that the nil model (M0) is twelve times more
likely than the alternative model (BF01 = 12.08; Table 17). The one-tailed posterior
distribution shows the sample effect size to be small (median = 0.10, 95% BCI [0.052,
0.238]) and the evidence as moderately-to-strongly supporting the nil model—that there
was no positive effect of the checklist on awareness, as measured by Continual SA.

Table 17 | Continual SA; Bayesian Independent Samples T-Test

BF01
12.084
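JASP’s Bayesian t-tests implement the JZS setup of Rouder et al. (2009): a Cauchy prior on the standardized effect size under the alternative model. For readers curious about the machinery, here is a minimal Python sketch, assuming JASP’s default prior width of r = 0.707; the function name is ours, and the numerical integration is a plain reimplementation rather than JASP’s own code.

import numpy as np
from scipy import stats
from scipy.integrate import quad

def jzs_bf10(t_obs, nx, ny, r=0.707, alternative='two-sided'):
    """JZS Bayes factor for an independent-samples t-test (after Rouder
    et al., 2009): a Cauchy(0, r) prior on the effect size delta under
    M1, marginalized by numerical integration. With 'greater', the
    prior is folded onto delta > 0 (a directional hypothesis)."""
    df = nx + ny - 2
    n_eff = nx * ny / (nx + ny)       # effective sample size

    def integrand(delta):
        # likelihood of the observed t given delta, weighted by the prior
        return (stats.nct.pdf(t_obs, df, delta * np.sqrt(n_eff))
                * stats.cauchy.pdf(delta, 0, r))

    if alternative == 'greater':      # half-Cauchy on positive effects
        m1 = 2 * quad(integrand, 0, np.inf)[0]
    else:                             # full Cauchy, two-sided
        m1 = quad(integrand, -np.inf, 0)[0] + quad(integrand, 0, np.inf)[0]
    m0 = stats.t.pdf(t_obs, df)       # marginal likelihood under M0
    return m1 / m0

# Directional test 'Group 1 > Group 2', with t = -3.142 and n = 23 per group
bf_plus0 = jzs_bf10(-3.142, 23, 23, alternative='greater')
print(1 / bf_plus0)                   # BF0+ of about 12, cf. Table 17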
5.4. Continual SA; two-tailed analyses

JASP 0.9.0.1: In ‵Results′, we click on the table ‶Independent Samples T-Test″ and
change our research hypothesis to the nondirectional one, ‶Group 1 ≠ Group 2″. We then
click on the table ‶Bayesian Independent Samples T-Test″ and change our research
hypothesis to the nondirectional one, ‶Group 1 ≠ Group 2″, double-check whether the
previously selected ‵Bayes Factor′ command is the appropriate one—which it is not—and
select ‶BF10″ instead.

We explored the mismatch between our initial expectations of improvement and
the observed results further, by running both frequentist and Bayesian tests for
Continual SA using two-tailed statistics.

Table 18 summarizes the results of the two-tailed test of significance, which
turns out to be highly significant (U = 131.5, p = 0.002). We can, thus, reject the
null hypothesis of no effect.

Indeed, inferred effect sizes are large (rank biserial correlation = -0.50; Cohen’s
d = -0.93, 95% FCI [-1.53, -0.31]), with confidence intervals suggesting a moderate-to-
large negative effect of the intervention on overall situation awareness.

Table 18 | Continual SA; Independent Samples T-Test

                                             95% FCI for Effect Size
Test           Statistic   df    p        Effect Size    Lower     Upper
Mann-Whitney   131.500           0.002    -0.503         -0.712    -0.213
Student        -3.142      44    0.003    -0.927         -1.531    -0.312

Notes. For the Student’s t-test, effect size is given by Cohen's d; for the Mann-Whitney test, effect size
is given by the rank biserial correlation.
Two-tailed analyses.

Table 19 summarizes the results of the Bayesian test. The Bayes factor analysis
shows that the alternative model (M1) is twelve times more likely than the nil model
(BF10 = 12.70), with a median effect of d = -0.80 (95% BCI [-1.43, -0.19]), giving the
‘nil’ effect very low credibility. The evidence in support of the alternative model is
deemed strong, thus supporting the conclusion that there was a pernicious effect of the
checklist on Continual SA (i.e., that it decreased overall situation awareness and
perceptive speed in the experimental group).
Table 19 | Continual SA; Bayesian Independent Samples T-Test

BF10
12.695

Note. Two-tailed analyses.
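This two-tailed value can be checked with the same jzs_bf10 sketch introduced at the end of section 5.3, again assuming JASP’s default Cauchy scale:

# Two-sided JZS Bayes factor for t = -3.142 and n = 23 per group
print(jzs_bf10(-3.142, 23, 23, alternative='two-sided'))  # about 12.7, cf. Table 19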

Final notes

In the previous sections we had the opportunity to observe the flexibility of
JASP and the possibilities it offers for learning from the observed data. Indeed, we have
carried out exploratory, Fisherian, and Jeffreysian data analyses; planned analyses as well
as ad hoc analyses motivated by the inconsistency found between expectations and
observed results; one-tailed and two-tailed analyses; and parametric and non-parametric
analyses. We have also obtained both significant and non-significant results.

In the latter case, we have also experienced how a Jeffreysian approach allows us
to learn more from such results, providing us with the strength of evidence (likelihood)
in favour of either the alternative or the nil model, as appropriate. We even had the
opportunity to illustrate a case where both types of inference contradict each other,
with the frequentist test being non-significant but the Bayes factor returning anecdotal
evidence in favour of the alternative model. In fact, such a case also served to argue in
favour of Mayo’s (2017) statement that a frequentist test could as well be used to calibrate
a Jeffreysian result in order to prevent a potentially erroneous Bayesian conclusion.

Throughout the above journey, we have also called the reader’s attention to several
issues via footnotes. We will now delve a bit deeper into those.

For example, numerous footnotes called attention to several discrepancies
between the results described in the original manuscript and those provided in this
tutorial. Some of those discrepancies (footnotes 9, 20, 22) can be traced back to either
errors of transcription or poor housekeeping (e.g., the discrepancies between expectations
and results led us to double-check the database against the original questionnaire data;
upon correcting an entry error, however, only some statistics in the corresponding table,
but not all, were updated). Other discrepancies arose because of the way a couple of
variables were coded (footnotes 26, 29), typically against advice to do it differently. Such
discrepancies serve to highlight that neither JASP nor Bayesian analyses (nor any other
analysis or statistics software, for that matter) is a magic pill against this sort of
methodological error. It is the researcher’s responsibility to attend to the potential
for error at methodological and analytical levels. As far as data analysis goes, the quality
of the output depends largely on the integrity of the data, which makes methodological
integrity take priority over data analysis every time.

We also had the opportunity to experience the simplicity of JASP’s command and
output screens, and the immediacy with which results can be perused. Oftentimes, we
need to calculate the results in order to ensure we have chosen the right hypothesis (e.g.,
footnote 18). But, as warned in footnote 17, such flexibility offers a tempting opportunity
to adopt whichever research hypothesis is most propitious to a desired result. Again,
JASP offers no protection against such behaviour, which comes down to the integrity of
the researchers themselves, above and beyond the technical features of the statistical
software used.

A third issue to highlight is the constant reminder the tutorial has given us that
the observed results did not match our initial research hypotheses most of the time.
Indeed, even the sensitiveness of the test was estimated according to the most interesting,
and expected, result: a positive effect of the intervention on situational awareness. Of
course, such a sensitiveness analysis contained in itself the possibility of the effect being
smaller than that selected, including ‘zero’ (footnote 10). However, at no time prior to the
investigation or during the analysis of results did we envisage the possibility of a negative
effect size, less so of a statistically significant one. (We do now have a possible
explanation, which may serve as a hypothesis for a follow-up study. Until put to the test,
however, it is uncertain whether the observed results actually describe a real state of
affairs.) This issue thus delves into how to understand such prior expectations in the
context of our data, something that a fully-fledged Bayesian analysis would do by
integrating the prior probabilities of the hypotheses with the probability of the observed
data under such hypotheses. As we have discussed, however (footnotes 14, 16, 25), a
Bayes factor analysis bypasses such integration and only provides the weight of the
evidence given by the data. This is also the information that a frequentist approach
provides. In any case, it is worthwhile to remind the reader that neither Jeffreysian
inference nor JASP deals with the prior probability of the hypotheses; thus, neither can
answer the question of the posterior probability of those hypotheses after having observed
the data.

Related to the above is a fourth issue: whether a parallel frequentist–Bayesian
analysis really adds to our knowledge, beyond some exceptional instances such as those
highlighted in footnote 28. The issue is, namely, whether we are simply slicing the
differences too thinly. Indeed, Table 20 shows a summary of statistics that would help
inform reasonable conclusions for each of our research hypotheses. Both frequentist and
Bayesian conclusions would be equivalent, except for case V (see footnote 28). We could
have also reached similar conclusions (and the same discrepancy) as the frequentist tests
had we used an exploratory approach, relying on Cohen’s d (conditioned on our
sensitiveness analysis, footnote 10) for interpretation. Furthermore, Mayo’s (e.g., 1996)
severity analysis (SEV) would also conclude on a similar note to the Bayesian one, with
all Bayesian tests resulting in moderate-to-strong evidence having passed such tests with
high severity, with moderate severity in case I, and with practically no severity in case V.

Table 20 | Reasonable conclusions based on frequentist and Bayesian results

Case       Cohen’s d   p       Decision   SEV    BF             Evidence
I (2t)     0.20        0.507   H0         0.75   BF01 = 2.85    M0 = anecdotal
II (1t)    0.80        0.995   H0         0.99   BF01 = 10.96   M0 = strong
III (2t)   0.80        0.010   noH0       0.99   BF10 = 5.04    M1 = moderate
IV (1t)    -0.67       0.965   H0         0.99   BF01 = 7.20    M0 = moderate
V (2t)     -0.67       0.071   H0         0.51   BF10 = 1.25    M1 = anecdotal
VI (1t)    -0.93       0.999   H0         0.99   BF01 = 12.08   M0 = strong
VII (2t)   -0.93       0.003   noH0       0.99   BF10 = 12.70   M1 = strong

Note. For comparability purposes, all p-values are those of t-tests. SEV = Mayo’s severity tests.

Therefore, it may seem that any one approach is sufficient and all of them
redundant. What we need to remember, however, is that they each seek a different type
of learning; thus, they are not necessarily interchangeable. Mind you, the fact that the
approaches may coincide in their relative conclusions is not in itself a corroboration of
such conclusions, either. Using the approach most appropriate for the purpose not only
allows us to probe the data with the best tool for a particular probe, but also helps prevent
errors of interpretation and generalization.

Finally, we have an issue of philosophical perspective (e.g., Mayo, 1996; Mayo
and Spanos, 2010). As highlighted in footnotes 12, 13, and 23, a frequentist approach is
founded on an epistemology of logical argumentation by contradiction, with a Modus
Tollens built upon a given level of significance for testing substantive null hypotheses.
Such an approach is equally uninterested in an argumentation by affirmation, based on
a Modus Ponens, that might prove such null hypotheses.

Bayesian statistics, on the other hand, are chiefly founded on an epistemology of
logical argumentation by affirmation, whereby the most probable hypothesis (or model)
is the one for which there is more evidence in its favour.

Such a philosophical difference can be better ascertained in a different context.
The Bayesian perspective decides between innocence and guilt based on the weight of
the evidence—with some possible leeway for withholding a decision unless a minimum
weight is achieved. Meanwhile, the frequentist perspective only serves to pinpoint when
one is guilty beyond a reasonable doubt.

References

APA (2010). Publication manual of the American Psychological Association (6th ed.).
Washington, DC: APA.

Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence
intervals, and meta-analysis. New York, NY: Routledge.

Fisher, R. A. (1954). Statistical methods for research workers (12th ed.). Edinburgh,
UK: Oliver and Boyd.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587–
606. doi:10.1016/j.socec.2004.09.033

Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, UK: Clarendon Press.

Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B. Jr., Bahník, Š., Bernstein, M. J.,
. . . Nosek, B. A. (2014). Investigating variation in replicability: A “Many Labs”
replication project. Social Psychology, 45, 142–152. doi:10.1027/1864-9335/a000178

Kruschke, J. K. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS.
Oxford, UK: Academic Press.

Mayo, D. G. (1996). Error and the growth of experimental knowledge. Chicago, IL: The
University of Chicago Press.

Mayo, D. (2017). New venues for the statistics wars [Web log post]. Retrieved from
https://errorstatistics.com/2017/10/05/new-venues-for-the-statistics-wars

Mayo, D. G., and Spanos, A. (Eds.). (2010). Error and inference. New York, NY:
Cambridge University Press.

Neyman, J., and Pearson, E. S. (1928). On the use and interpretation of certain test
criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175–240.
doi:10.2307/2331945

Open Science Collaboration (2012). An open, large-scale, collaborative effort to
estimate the reproducibility of psychological science. Perspectives on
Psychological Science, 7, 657–660. doi:10.1177/1745691612462588

Perezgonzalez, J. D. (2014). A reconceptualization of significance testing. Theory &
Psychology, 24, 852–859. doi:10.1177/0959354314546157

Perezgonzalez, J. D. (2015a). Fisher, Neyman-Pearson or NHST? A tutorial for
teaching data testing. Frontiers in Psychology, 6, 223. doi:10.3389/fpsyg.2015.00223

Perezgonzalez, J. D. (2015b). P-values as percentiles. Frontiers in Psychology, 6, 34.
doi:10.3389/fpsyg.2015.00034

Perezgonzalez, J. D. (2015c). Confidence intervals and tests are two sides of the same
research question. Frontiers in Psychology, 6, 341. doi:10.3389/fpsyg.2015.00341

Perezgonzalez, J. D. (2016). Statistical sensitiveness for science. arXiv. Retrieved from
https://arxiv.org/abs/1604.01844

Perezgonzalez, J. D. (2017a). Statistical sensitiveness for the behavioural sciences.
PsyArXiv. doi:10.17605/OSF.IO/Y969T. Retrieved from https://psyarxiv.com/qd3gu

Perezgonzalez, J. D. (2017b). The fallacy of placing confidence in confidence intervals –
A commentary. PsyArXiv. doi:10.31234/osf.io/kvxc4. Retrieved from
https://psyarxiv.com/kvxc4

Perezgonzalez, J. D. (2017c). Commentary: The need for Bayesian hypothesis testing in
psychological science. Frontiers in Psychology, 8, 1434. doi:10.3389/fpsyg.2017.01434

Perezgonzalez, J. D., and Frías-Navarro, M. D. (2018). Retract p < 0.005 and propose
using JASP, instead [version 2]. F1000Research, 6, 2122.
doi:10.12688/f1000research.13389.2

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian
t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin &
Review, 16, 225–237. doi:10.3758/PBR.16.2.225

Stengers, I. (2018). Another science is possible: A manifesto for slow science.
Cambridge, UK: Polity Press.

Tabachnick, B. G., and Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston,
MA: Allyn & Bacon.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

Vincent, N. (2018). Situational awareness of pilots in the cruise (Master’s thesis, Massey
University, New Zealand).

Wagenmakers, E. J., Verhagen, J., Ly, A., Matzke, D., Steingroever, H., Rouder, J. N., .
. . Morey, R. D. (2017). The need for Bayesian hypothesis testing in psychological
science. In S. O. Lilienfeld and I. D. Waldman (Eds.), Psychological science
under scrutiny: Recent challenges and proposed solutions (pp. 123–138).
Chichester, UK: John Wiley & Sons. doi:10.1002/9781119095910.ch8
