
Application Report Master Data Science

Human vs. Machine: A Statistical Investigation into Team Performance under AI Integration using the Dash and
Dine Mini-Game

TU Dortmund University

Department of Statistics

Bekir Can Torun

Izmir

14.01.2024
CONTENTS
I. INTRODUCTION

II. PROBLEM DEFINITION

III. METHODS

IV. EVALUATION

V. SUMMARY

VI. BIBLIOGRAPHY
I. INTRODUCTION
Artificial intelligence, which has recently developed in line with the needs of societies, has become increasingly widespread and competitive. Because of the advantages it provides in many areas of life, it has become an indispensable tool, and this importance has created new tasks: analysing the effects of artificial intelligence, drawing conclusions from these analyses, and improving the results. Success metrics need to be measured in order to evaluate situations such as individuals using artificial intelligence in different fields, or artificial intelligence replacing humans in coordinated teams. In this technology, as in every technology, the most important drivers of progress are mathematics and statistics, because these basic sciences play a decisive role in the decision-making phase. To analyse the effects mentioned above, the methods and techniques to be applied should be researched and justified with experiments and observations.

In this report, I statistically analysed the data from the experiments conducted in the paper "Super Mario Meets AI: Experimental Effects of Automation and Skills on Team Performance and Coordination".

My motivation for examining this article lies in the need to observe how overall performance changes in situations requiring teamwork once artificial intelligence is integrated into the social structure. At the same time, I was able to examine the effects of different kinds of team members, including an artificial intelligence, on these performances.

Regarding the content of this report: I describe the problem definition in detail, define the applied methods in a plain and clear way in the Methods section, and answer the problem objectively with these methods in the Evaluation section. In the Summary section, I answer the problem and interpret it in the context of the real world. After analysing the data of these experiments, which were conducted with the motivation of analysing the effects of artificial intelligence, with appropriate methods, I did not find a statistically significant difference between the performances of the three groups (AI, new hire, control) after the team changes.

II. PROBLEM DEFINITION


In the experiment, the participants were divided into groups and asked to play the mini-game "Super Mario Party: Dash and Dine" on the "Nintendo Switch" console. Each group consisted of 4 players and was assigned either to be changed later or to remain the same. The aim of the game is to complete as many recipes as possible, earning 1 point for each completed recipe. The game is partnership-oriented, and the players act simultaneously with their partners. The completed recipes are recorded, and the groups play with the same partners for 6 rounds. At the end of these 6 rounds there are 3 possibilities. One of the team members can be replaced by a new member; in this case the group is called "newhire". One of the team members can be replaced by an artificial intelligence; in this case the team is called "ai". Finally, the team may remain unchanged, in which case it is called "control". The first 6 rounds are called phase 1 and the next 6 rounds are called phase 2. These experiments aimed to observe the effect of including an artificial intelligence in previously human-only teams on performance in a game that requires teamwork and coordination.

The data consist of the following variables: a unique "team_id" used for each team during the 12-round process; "phase", indicating the phase in which the team is located; "group", indicating the change status of the team between the two phases; "round", indicating the round in which the observation was recorded; and finally "totalingred", indicating the number of ingredients collected. In terms of scale types, team_id and group can be classified as nominal, phase and round as ordinal, and totalingred as a ratio variable.

The problem addressed in this report is the statistical analysis of the experimental data in the article "Super Mario Meets AI: Experimental Effects of Automation and Skills on Team Performance and Coordination": understanding the differences between the changed teams and observing the performance change after the teams are changed. We can define the problem as observing the performance differences, in a game that requires teamwork and coordination, between groups that received an artificial intelligence, groups that received a new human member, and groups that were left unchanged. The type of change applied to a team (artificial intelligence, new team member, or no change) can be treated as the independent variable in this problem.

III. METHODS
Statistical analyses begin with the application of descriptive statistics after the data have been obtained, because descriptive statistics are a way of summarising the characteristics of the data numerically. Inferential statistics are a more intensive category and determine the outcome of the analysis. In short, descriptive statistics summarise the characteristics of the data, and inferential statistics draw inferences from them. [1]

1. Descriptive Statistics

Descriptive statistics are used to make sense of and simplify a large data set and are the first step in scientific research. The aim is to determine the characteristics of the data set quantitatively and make them meaningful. Two basic ways of summarising data are used: numerical summaries and graphical summaries. Under this heading I used the arithmetic mean, mode, median, maximum and minimum values, standard deviation, quartile values, and a histogram. [2]
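As a minimal sketch of these numerical summaries, they can be computed with pandas; the values below are made up for illustration, not the experiment's data:

```python
import pandas as pd

# Hypothetical scores standing in for "totalingred"; not the experiment's data.
scores = pd.Series([22, 25, 19, 28, 25, 31, 24, 27, 25, 21])

mean = scores.mean()                     # arithmetic mean
median = scores.median()                 # 50th percentile
mode = scores.mode().iloc[0]             # most frequent value
std = scores.std()                       # sample standard deviation (ddof=1)
q1, q3 = scores.quantile([0.25, 0.75])   # lower and upper quartiles
print(mean, median, mode, std, scores.min(), scores.max(), q1, q3)
```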

1.1 Histogram

A histogram is a type of graph that shows the frequency distribution of statistical data sorted into groups (bins). Each bar in the histogram corresponds to a range of values, and the length of the bar shows the frequency of the data in that range. In this report, it helped me visualise the distribution of the totalingred values. [3]
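A histogram like the one used later for totalingred can be produced with Matplotlib; the data here are simulated normal values, not the experiment's:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=25, scale=4, size=660)  # simulated stand-in data

# Each bar covers one bin; its height is the number of observations in that bin.
counts, bin_edges, _ = plt.hist(values, bins=12, edgecolor="black")
plt.xlabel("totalingred (simulated)")
plt.ylabel("frequency")
plt.savefig("totalingred_hist.png")
```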

2. Inferential Statistics

Inferential statistics formulate hypotheses about samples and test them in order to make predictions and draw conclusions about large populations, using probability theory and statistical models. In this section, I give broad definitions of the inferential tools used in the report. Parametric comparison tests in inferential statistics usually involve assumptions such as homogeneity of variance and conformity to the normal distribution: the sample data should follow a normal distribution and their variances should be homogeneous, in other words equal. When these assumptions cannot be met, non-parametric tests are used. In this report, I used the Mann-Whitney U test among the non-parametric tests and one-way ANOVA among the parametric tests. [4]

2.1 P value

The p value is the probability, under the null hypothesis, of obtaining a result at least as extreme as the one observed. When the p value is below a predetermined significance level, the null hypothesis is rejected; in this case the result is said to be statistically significant. [5]

2.2 Significance Level

The significance level is the amount of evidence that must be shown in a sample before a hypothesis test result is declared statistically significant or not significant. It is denoted by the symbol α (alpha). [6]

2.3 Hypothesis Testing

It is a method used to determine the accuracy of an assumption as statistically significant. In hypothesis testing,
sample data are used instead of populations to decide between two hypotheses and to discuss their accuracy. [7]

The first stage of the test is to determine the hypothesis and its alternative. The first hypothesis represents the expected state, is called the null hypothesis, and is denoted by H0. The null hypothesis usually states that there is no difference in parameters between groups, populations, variables, or phenomena; in short, the expected situation. The alternative hypothesis, denoted by Ha, is the opposite and contradictory situation in which the null hypothesis is false. In the next stage, the significance level is determined. After the significance level is set to a reasonable value, sample data are collected and the p value and test statistic are calculated. Depending on the significance level and the result of the test statistic, the null hypothesis is rejected or not rejected. The choice of the significance level α is therefore important; it is conventionally set at 0.01, 0.05, or 0.1. [8]
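The decision step described above can be sketched as a small helper function (the function name is my own, for illustration):

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Standard hypothesis-test decision rule."""
    # H0 is rejected only when the p value falls below the significance level.
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.019))             # below alpha = 0.05
print(decide(0.371))             # above alpha = 0.05
print(decide(0.03, alpha=0.01))  # a stricter alpha changes the decision
```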

2.4 Shapiro-Wilk Test

The Shapiro-Wilk test is a goodness-of-fit test used as a normality test. It checks whether X, a random sample with n ordered observations (X(1) being the smallest), comes from a normal (Gaussian) probability distribution with mean µ and variance σ². The test uses the following hypotheses:

H0: The sample comes from a normally distributed population.

Ha: The sample does not come from a normally distributed population.

To test this hypothesis, the Shapiro-Wilk test statistic shown in Equation 1 is used:

W = ( Σᵢ₌₁ⁿ aᵢ x₍ᵢ₎ )² / Σᵢ₌₁ⁿ (xᵢ − x̄)²    (1)

Here the aᵢ are coefficients obtained from the means, variances and covariances of the order statistics of a sample of size n from a normal distribution. [9]

If the test statistic is less than the critical value for n samples at significance level α in the Shapiro-Wilk critical value table, the hypothesis H0 is rejected. When the Shapiro-Wilk test is compared with other normality tests, its power properties are observed to be better. [10]
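In scipy.stats, which I used throughout the report, the test is available as shapiro; here it is applied to simulated samples, not the experiment's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=25, scale=4, size=60)  # drawn from a normal
skewed_sample = rng.exponential(scale=4, size=60)     # clearly non-normal

# H0: the sample comes from a normally distributed population.
w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
print(p_norm, p_skew)  # H0 is typically kept for the first, rejected for the second
```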

2.5 Levene Homogeneity Test

Levene's homogeneity test is an inferential statistical test used to check the equality of variances of two or more groups. Homogeneity of variance means equality of variances. Some tools used in inferential statistics assume that the variances of the populations from which the different samples are taken are homogeneous. In this report, the test is used to check the homogeneity-of-variance assumption of the comparison tests. Levene's test can be defined as follows:

H0: σ₁² = σ₂² = ⋯ = σₖ²

Ha: Not all σⱼ² are equal (j = 1, 2, …, k)

Given a variable Y with a sample of size N divided into k subgroups, where Nᵢ is the sample size of the i-th subgroup, the Levene test statistic is defined in Equation 2. Equations 3, 4 and 5 define Zᵢⱼ, Z̄ᵢ. and Z̄.. in the test statistic, respectively.

W = [ (N − k) Σᵢ₌₁ᵏ Nᵢ (Z̄ᵢ. − Z̄..)² ] / [ (k − 1) Σᵢ₌₁ᵏ Σⱼ₌₁^{Nᵢ} (Zᵢⱼ − Z̄ᵢ.)² ]    (2)

Zᵢⱼ = |Yᵢⱼ − Ȳᵢ.|, where Ȳᵢ. is the mean of the i-th group    (3)

Z̄ᵢ. = (1/Nᵢ) Σⱼ₌₁^{Nᵢ} Zᵢⱼ, the mean of the Zᵢⱼ for group i    (4)

Z̄.. = (1/N) Σᵢ₌₁ᵏ Σⱼ₌₁^{Nᵢ} Zᵢⱼ, the mean of all Zᵢⱼ    (5)

The Levene test rejects the hypothesis that the variances are homogeneous if

W > F(α, k−1, N−k),

where F(α, k−1, N−k) is the critical value of the F distribution with k−1 and N−k degrees of freedom at significance level α. [11]
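The test is available in scipy.stats as levene; the three samples below are simulated, with one group given a deliberately larger spread:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(25, 4, size=55)    # similar spread
group_b = rng.normal(23, 4, size=55)    # similar spread
group_c = rng.normal(27, 12, size=55)   # three times the standard deviation

# H0: all group variances are equal.
w, p = stats.levene(group_a, group_b, group_c)
print(w, p)  # with such unequal spreads, p is expected to fall below 0.05
```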

2.6 Mann Whitney U Test

This test is used to test whether the medians of sample groups that do not satisfy the assumptions of normal distribution or homogeneity of variance are equal. It is a non-parametric alternative to the independent two-sample t test. In this report, the Mann-Whitney U test is applied to sample groups with sizes larger than 20. The test combines the two groups of data and ranks them from smallest to largest. The ranks assigned to the data are then summed within each group: R₁ is the sum of the ranks for the first group and R₂ is the sum of the ranks for the second group. The hypotheses of the test are as follows:

H0: The medians of the compared groups are the same.

Ha: The medians of the compared groups are different.

The test statistic is given in Equation 6:

Z = (U − n₁n₂/2) / √( n₁n₂(n₁ + n₂ + 1)/12 )    (6)

The U statistics for the two groups are found with Equations 7 and 8, and U is their minimum:

U₁ = R₁ − n₁(n₁ + 1)/2    (7)

U₂ = R₂ − n₂(n₂ + 1)/2    (8)

U = min(U₁, U₂)

For the specified significance level α, in two-tailed tests the null hypothesis is rejected if the test statistic is less than the lower critical value or greater than the upper critical value. In one-tailed tests, the null hypothesis is rejected if the test statistic is greater than the critical value. [12]
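The test is available in scipy.stats as mannwhitneyu; the two samples below are hypothetical phase scores, simulated here with a clear shift between them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical phase-1 and phase-2 scores for one team type (not the real data)
phase1 = rng.normal(23, 4, size=55)
phase2 = rng.normal(27, 4, size=55)

# H0: the medians of the two groups are the same.
u_stat, p = stats.mannwhitneyu(phase1, phase2, alternative="two-sided")
print(u_stat, p)  # a shift of one standard deviation usually yields p < 0.05
```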

2.7 Analysis of Variance (ANOVA)

Analysis of variance is an inferential statistical tool used to test whether there is a statistically significant difference between group means. The analysis compares the variation between groups with the variation between individuals within the groups. It allows the comparison of 2 or more groups and is used to determine whether there is a difference depending on one or more factors. One-way ANOVA is used when there is a single factor. [13] In this report, I used one-way analysis of variance because of the need to statistically observe the differences between groups in terms of the variable "totalingred". ANOVA tests the following hypotheses:

H0: µ₁ = µ₂ = ⋯ = µₐ

Ha: Not all µⱼ are equal (j = 1, 2, …, a)

In one-way ANOVA, the variation is decomposed to determine the difference between the group means. SST measures the total variation: it is the sum of the squared differences between all values of the compared groups and the overall mean, as shown in Equation 9.

SST = Σᵢ₌₁ᵃ Σⱼ₌₁^{nᵢ} (Xᵢⱼ − X̄)²    (9)

Total variation (SST) is the sum of the "error sum of squares" (SSE) and the "treatment sum of squares" (SSTr). The treatment sum of squares (SSTr) is calculated by summing the squared differences between each group's sample mean X̄ᵢ and the overall mean X̄, weighted by nᵢ, as shown in Equation 10.

SSTr = Σᵢ₌₁ᵃ nᵢ (X̄ᵢ − X̄)²    (10)

The error sum of squares (SSE) sums, for each group, the sample size minus one weighted by the variance of that group, as shown in Equation 11.

SSE = Σᵢ₌₁ᵃ (nᵢ − 1) sᵢ²    (11)

There are a − 1 degrees of freedom associated with the sum of squares between groups, where the groups are compared, and n − a degrees of freedom associated with the sum of squares within groups, where the values within groups are compared. Dividing these sums of squares by the associated degrees of freedom gives the mean squares, as shown in Equations 12 and 13.

MSA = SSTr / (a − 1)    (12)

MSW = SSE / (n − a)    (13)
In one-way analysis of variance, the F distribution is used to find out whether there is a significant difference between the group means. [14]

The one-way ANOVA F test statistic is obtained by dividing MSA by MSW, as seen in Equation 14.

F_STAT = MSA / MSW    (14)

The F statistic follows an F distribution with a − 1 degrees of freedom in the numerator and n − a degrees of freedom in the denominator.

To reach a conclusion about the H0 hypothesis, the critical value is found from the F distribution table according to the predetermined significance level and the degrees of freedom. If the test statistic F_STAT is greater than the upper-tail critical value of the F distribution, the hypothesis H0 is rejected; otherwise, it cannot be rejected.

Table 1 below shows the one-way ANOVA summary table used to present the results. This table contains the between-group, within-group and total sources of variation and their degrees of freedom, as discussed above. It also contains the p value, which allows us to draw a conclusion about the H0 hypothesis without consulting the critical value table of the F distribution: if the p value is less than the chosen significance level α, the hypothesis H0 is rejected.

Source                 df      SS     MS     F_STAT       p-value

Treatments (Between)   a − 1   SSTr   MSTr   MSTr/MSE     P[F ≥ F_obs]

Error (Within)         n − a   SSE    MSE

Total                  n − 1   SST

Table 1. ANOVA Summary Table [15]
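Equations 9 to 14 can be verified numerically and compared against scipy.stats.f_oneway; the three small samples below are invented purely for illustration:

```python
import numpy as np
from scipy import stats

groups = [np.array([26.0, 28, 25, 30, 27]),
          np.array([24.0, 27, 26, 25, 28]),
          np.array([29.0, 31, 27, 30, 28])]

a = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total sample size
grand_mean = np.concatenate(groups).mean()

sst = ((np.concatenate(groups) - grand_mean) ** 2).sum()           # Equation 9
sstr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # Equation 10
sse = sum((len(g) - 1) * g.var(ddof=1) for g in groups)            # Equation 11

msa = sstr / (a - 1)   # Equation 12
msw = sse / (n - a)    # Equation 13
f_stat = msa / msw     # Equation 14

f_ref, p_ref = stats.f_oneway(*groups)   # reference computation
print(f_stat, f_ref, p_ref)
```

The identity SST = SSTr + SSE and the agreement of the hand-computed F with f_oneway hold exactly (up to floating-point error).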

IV. EVALUATION
In the evaluation section, which is the purpose of this report, I made statistical evaluations using the data obtained from the "Super Mario Meets AI: Experimental Effects of Automation and Skills on Team Performance and Coordination" experiment. For data analysis and inferential statistics, I used libraries of the Python programming language, which stands out with its ease of use, such as NumPy, pandas, Matplotlib, statsmodels and scipy.stats.

Descriptive statistics: In the descriptive statistics section, I first converted the data into a pandas DataFrame and then extracted some basic information about it. Using a Python function I had prepared earlier, I obtained the shape, types, missing values, percentiles and quantiles of the data, as well as the minimum, maximum, mean and standard deviation values.

Shape: In this section, where I obtained information about the shape of the dataset, I observed that the data consist of 660 rows and 5 columns, as shown in Table 2.

Shape (660, 5)

Table 2. Shape

Types: In the Types section, I observed the variables, which we defined earlier in the problem definition, and their data types, shown in Table 3. Only the group variable is observed as "object", which I attribute to its being stored as a string. The int64 expression in the table shows that the value occupies 64 bits and is an integer.

team_id int64

phase int64

group object

round int64

totalingred int64

Table 3. Types
Missing Values: I observed that there were no missing values. However, since datasets sometimes contain zeros entered in place of missing values, I made the final decision about missing values only after checking the minimum values. The counts are shown in Table 4.

team_id 0

phase 0

group 0

round 0

totalingred 0

Table 4. Missing Values


Descriptives: In this section, I wanted to describe the data numerically with statistics such as the mean, standard deviation, maximum and minimum values and selected percentiles. I created a dynamic Python function that outputs these results; they are shown in Table 5. The Count column shows the number of values each variable has. Examining the minimum values again, I noticed that no variable takes the value zero, confirming that there are no disguised missing values. Looking at the percentile values of totalingred, the 1st percentile is 13.59; compared with this, the probability of observing the minimum value (9.00) is very low.

              count   mean     std.    min.   1%      5%      25%     50%     75%     95%     99%     max.

totalingred   660     24,815   4,389   9,00   13,59   16,95   22,00   25,00   28,00   31,00   34,00   36,00

Table 5. Descriptives
Histogram: The histogram gives visual information about the distribution of the totalingred variable, shown in Figure 1. As can be seen, the frequency of the data increases toward the mean value and decreases toward the extreme values. The quartile ranges can be considered roughly equally populated, and the shape resembles a bell curve. The bin containing our minimum value, which can be considered an outlier, appears as a very small column.

Figure 1. "totalingred" Distribution

Grouping operations: In this section, I performed some grouping operations and examined the means and standard deviations. Table 6 shows the means and standard deviations of totalingred broken down by phase.

phase mean std.

1 22,963 4,417

2 26,666 3,494

Table 6. Grouping Operations for “phase” Variable

In Table 7, I show the group means before the changes (phase 1) and after the changes (phase 2).

group & phase mean std.

ai_phase1 23,125 4,560

ai_phase2 26,233 3,379

control_phase1 22,441 4,250

control_phase2 26,750 3,262

newhire_phase1 23,444 4,419

newhire_phase2 27,133 3,892

Table 7. Grouping Operations for “group” & “phase” Variable
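Group summaries like Tables 6 and 7 can be produced with a pandas groupby; the miniature DataFrame below only mimics the shape of the real data:

```python
import pandas as pd

# Tiny made-up sample in the same shape as the experiment's data
df = pd.DataFrame({
    "group": ["ai", "ai", "ai", "control", "control", "newhire", "newhire"],
    "phase": [1, 2, 2, 1, 2, 1, 2],
    "totalingred": [22, 26, 27, 21, 27, 24, 28],
})

# Mean and standard deviation of totalingred per group and phase
summary = df.groupby(["group", "phase"])["totalingred"].agg(["mean", "std"])
print(summary)
```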

In Table 8, I show the means and standard deviations of totalingred broken down by round. The means generally increase with the round number, while the standard deviations remain roughly constant.

round totalingred/mean totalingred/std.

1 18,054 3,748

2 21,654 3,921

3 23,400 3,818

4 24,818 3,737

5 24,290 3,101

6 25,563 3,552

7 25,981 3,749

8 25,890 3,235

9 26,727 3,194

10 26,981 3,545

11 27,036 3,746

12 27,381 3,347

Table 8.Grouping Operations “round” Variable

Inferential Statistics: In the inferential statistics section, I analysed whether there was a statistically significant difference between the ingredients collected by the 3 groups in the "Super Mario Meets AI: Experimental Effects of Automation and Skills on Team Performance and Coordination" experiment. To check the performance differences between phase 1 and phase 2, I used the Mann-Whitney U test within each group, and to analyse the differences between the 3 groups, I used the one-way ANOVA test. I used the shapiro, levene, mannwhitneyu and f_oneway functions of the Python library scipy.stats. I set the significance level α for the hypothesis tests to 0.05. In the first stage, the assumptions need to be evaluated before examining the relationships between groups; as mentioned in the Methods section, these are conformity to the normal distribution and homogeneity of variance. I divided the data into 6 different sample groups: the phase 1 and phase 2 observations of each of the "newhire", "ai" and "control" groups. To test the first assumption, normality, I applied the Shapiro-Wilk test to the 6 groups; the p values are shown in Table 9. The null hypothesis of this test is that the tested sample comes from a normal distribution. We reject the null hypothesis for phase 1 of the "newhire" and "ai" groups, since their p values are below 0.05; for the other groups, the null hypothesis cannot be rejected. However, for the normality assumption of a phase comparison to be acceptable, both phases of a group must satisfy it. It follows that "newhire" and "ai" do not satisfy this assumption, while the "control" group does.

Group-phase p-value

ai_phase1 0.023

ai_phase2 0.371

control_phase1 0.347

control_phase2 0.245

newhire_phase1 0.019

newhire_phase2 0.087

Table 9. Shapiro-Wilk Test Results

To test the assumption of variance homogeneity, I applied the Levene test. Here I checked whether the variances of each group were equal between phases 1 and 2; the test statistics and p values are shown in Table 10. The null hypothesis of this test is that the variances of the two tested samples are equal. I rejected the null hypothesis for the "ai" and "control" groups, whose p values are below 0.05, and could not reject it for the "newhire" group, whose p value is greater than 0.05. As a result, I observed that the "ai" and "control" groups do not meet the assumption of variance homogeneity.

group test-stat p-value

newhire 0,815 0,367

ai 9,2698 0,002

control 6,789 0,009

Table 10. Levene Test Results

I observed that the phase 1 vs. phase 2 comparisons of all 3 groups could not meet the parametric assumptions, so I used the Mann-Whitney U test. The null hypothesis of this test is that there is no statistically significant difference in the medians of the totalingred numbers, which are the performance metrics, between phases 1 and 2 for each group. The p values, shown in Table 11, are all below 0.05, the significance level, and can practically be counted as zero. Therefore, the null hypothesis is rejected for all groups.

Group Test statistic p-value

Newhire 2192,500 <0,001

Ai 4214,000 <0,001

Control 3045,500 <0,001

Table 11. Mann-Whitney U Test Results

As in the independent two-sample t test, the same assumptions must be met by the sample data in order to obtain correct results from one-way ANOVA. At this stage, since the phase 2 performance values of the 3 groups are compared, the normality assumption should be checked only on the phase 2 values of the groups. As seen in Table 9, the p values for phase 2 are all greater than 0.05, so the null hypothesis of the Shapiro-Wilk test cannot be rejected for any of the three groups.

The test statistic and p value of the Levene test performed across the 3 groups for variance homogeneity are shown in Table 12. Since the p value is greater than the significance level, we cannot reject the homogeneity assumption.

For 3 groups Test statistic p-value

1,563 0,210

Table 12. Levene Test Results for 3 Groups.

All assumptions of the one-way ANOVA test are met. The null hypothesis of this test is that there is no difference between the "totalingred" means of the "newhire", "ai" and "control" groups in phase 2. As seen in Table 13, the p value is 0.172. Since it is greater than the significance level, the null hypothesis cannot be rejected.

For 3 groups Test statistic P-value

1,767 0,172

Table 13. One Way ANOVA Test for 3 Groups
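The phase-2 comparison above can be sketched end to end with scipy.stats; the three samples are simulated with means close to those in Table 7, not the real observations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
# Simulated phase-2 scores for the three groups (stand-ins for the real data)
phase2 = {name: rng.normal(loc, 3.5, size=55)
          for name, loc in [("newhire", 27.1), ("ai", 26.2), ("control", 26.8)]}

alpha = 0.05
# Step 1: Shapiro-Wilk normality check for each group
shapiro_p = {name: stats.shapiro(x).pvalue for name, x in phase2.items()}
# Step 2: Levene test for homogeneity of variances across the groups
_, levene_p = stats.levene(*phase2.values())
# Step 3: one-way ANOVA on the three group means (valid if steps 1-2 pass)
f_stat, anova_p = stats.f_oneway(*phase2.values())
print(shapiro_p, levene_p, anova_p)
```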

V. SUMMARY
With this report, I presented a statistical perspective on the experimental results in the article "Super Mario Meets AI: Experimental Effects of Automation and Skills on Team Performance and Coordination". My motivation was to measure the consequences of introducing artificial intelligence into our lives and to examine them from a scientific perspective. First, I established that the research question was a scientific comparison of groups in a game experiment requiring intra-team coordination. Then I listed and defined the methods to be used, examining them under the headings of descriptive and inferential statistics. Finally, I applied these methods to the data obtained from the experiments, using the statistical analysis packages and libraries available in the Python programming language.

When I examined the results in terms of descriptive statistics, I observed a clear difference between the performances of the groups in phases 1 and 2, while a comparison of the groups after the mentioned changes showed no difference. Although these descriptive statistics provide insight, I used inferential statistical tests to draw conclusions in the context of populations. These tests led to the same results: there is a difference between the performance of the groups in phases 1 and 2, and there is no difference between the groups' performances after the changes.

In conclusion, when I statistically examined these experiments investigating the effect of artificial intelligence on team performance and coordination, I observed no performance difference between the teams to which an artificial intelligence was added, the teams to which a new member was added, and the teams whose members were not changed. I did observe a change in performance within each team before and after the intervention, but this can be attributed to the increasing coordination of the existing team members: the steady increase in performance averages across the rounds supports this. The fact that performance does not change when an artificial intelligence replaces a team member suggests that artificial intelligence can take the place of a human team member in this task.

VI. BIBLIOGRAPHY
[1] Larson, M. G. (2006). Descriptive statistics and graphical displays. Circulation, 114(1), 76–81.

https://doi.org/10.1161/circulationaha.105.584474

[2] Franzese, M., & Iuliano, A. (2018). Descriptive Statistics. Institute for Applied Mathematics "Mauro Picone", Napoli, Italy.

[3] Berenson, M., Levine, D., Szabat, K. A., & Krehbiel, T. C. (n.d.). Basic Business Statistics: Concepts and
Applications (14th ed.). Pearson Higher Education AU.
[4] Berenson, M., Levine, D., Szabat, K. A., & Krehbiel, T. C. (n.d.). Basic Business Statistics: Concepts and
Applications (14th ed.). Pearson Higher Education AU. 400-410.
[5] Berenson, M., Levine, D., Szabat, K. A., & Krehbiel, T. C. (n.d.). Basic Business Statistics: Concepts and
Applications (14th ed.). Pearson Higher Education AU. 305-315.
[6] Shaver, J. P. (1993). What statistical significance testing is, and what it is not. The Journal of Experimental
Education, 61(4), 293-316.

[7] Petruccelli, J. D. (n.d.). Applied Statistics for Engineers and Scientists.398-400

[8] Berenson, M., Levine, D., Szabat, K. A., & Krehbiel, T. C. (n.d.). Basic Business Statistics: Concepts and
Applications (14th ed.). Pearson Higher Education AU. 316-320.
[9] Ramachandran, K. M., & Tsokos, C. P. (2021). Mathematical statistics with applications in R. Academic Press.
480-485.
[10] Shapiro, S. S., Wilk, M. B., & Chen, H. J. (1968). "A comparative study of various tests for normality". Journal of the American Statistical Association, 63, 1343–1372.
[11] Levene, Howard (1960). "Robust tests for equality of variances". In Ingram Olkin; Harold Hotelling; et al.
(eds.). Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford University Press.
pp. 278–292

[12] Mann, Henry B.; Whitney, Donald R. (1947). "On a Test of Whether one of Two Random Variables is
Stochastically Larger than the Other". Annals of Mathematical Statistics. 18 (1): 50–
60. doi:10.1214/aoms/1177730491. MR 0022058. Zbl 0041.26103.

[13] Berenson, M., Levine, D., Szabat, K. A., & Krehbiel, T. C. (n.d.). Basic Business Statistics: Concepts and
Applications (14th ed.). Pearson Higher Education AU. 398-401.
[14] Akyıldız, M. (2009). "Tek Faktörlü Varyans Analizi (One-Way ANOVA) ve bir SPSS örneği" [One-way analysis of variance and an SPSS example].

[15] Berenson, M., Levine, D., Szabat, K. A., & Krehbiel, T. C. (n.d.). Basic Business Statistics: Concepts and
Applications (14th ed.). Pearson Higher Education AU. 402-404.
