
What is statistics

• Statistics is a set of rules and procedures for
reducing large masses of data to manageable
proportions and for allowing us to draw
conclusions from those data. (technical)
• Statistics is the outcome of the application of the
rules/procedures to samples of data. (technical)
• Statistics is also used to mean data (layman term)
• Descriptive statistics: describing the data (average,
variation, distribution, etc.)
• Inferential statistics: drawing inferences/conclusions
about the population based on the sample.
What is statistics
• Statistics is a quantitative approach to research.
• Most language research is qualitative.
• Such research gives no quantitative measures (frequencies,
percentages, averages) of the language features in the
data.
• In quantitative research, features are counted,
classified and compared to explain the nature of the data.
• In fact, both quantitative methods and statistical
analyses can supplement qualitative analyses.
• Statistics is often ignored because it is seen as a science that
supposedly destroys the beauty of literary texts.
• This course illustrates how quantitative methods
Benefits of Statistics
• Research is equipped with quantitative,
measurable, hard data.
• Inferences or conclusions can be generalized
beyond the observed data.
• The level of confidence in our conclusions can
be stated explicitly.
Basic Issues in Statistics
1. Common tools of data presentation: frequency and
contingency tables, charts/graphs.
Contingency tables (= cross tabulations, crosstabs) are
tables in a matrix format that display the (multivariate) frequency
distributions of the variables.
A frequency distribution tells how frequencies are distributed over
values or categories.
• The graphs/histograms in Figures 1.1 and 1.2 and Table 1.1 show
differences in sentence length.
• They give a picture of the situation but not summary information,
such as the averages or means.
[Figure: Frequency distribution of PoS. Bar chart with categories Aux, Nouns, Verbs, Adjectives, Prepositions, Copula; frequency axis from 0 to 50.]
2. Measures of Central Tendency
• Mode: the value that occurs most frequently.
• Which one is it in Figures 1.1 & 1.2 and Table 1.1?
• Some data distributions have more than one mode →
bimodal, trimodal, etc.
• Median: the middle point/central score of the distribution; half
of the scores lie above it and half below.
• 12, 12, 13, 14, 15, 16, 16, 17, 20, 22 → (15+16)/2 = 15.5
Find the median location: (10+1)/2 = 5.5, which is between 15
(rank 5) and 16 (rank 6). So, M = (15+16)/2 = 15.5
• 12, 12, 13, 14, 15, 16, 17, 20, 22 → 15.
ML = (9+1)/2 = 5. So, M = the score at rank 5 → 15
• Mean/average: the sum of all scores divided by the total
number of scores → X̅ = Σx/N
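The three measures above can be checked with a short sketch. It uses Python's standard `statistics` module (an illustration choice, not part of the slides) and the ten sentence lengths from the median example.

```python
from statistics import mean, median, multimode

# Sentence lengths from the median example above (N = 10).
scores = [12, 12, 13, 14, 15, 16, 16, 17, 20, 22]

modes = multimode(scores)  # every value tied for "most frequent"
mid = median(scores)       # averages the two middle scores when N is even
avg = mean(scores)         # sum of all scores / number of scores

print("modes:", modes)
print("median:", mid)
print("mean:", avg)
```

Because 12 and 16 each occur twice, this small data set is bimodal, which is why `multimode` is used rather than a single-valued mode.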
Go to Worksheet 1
Lecture 2: Proportion and Percentages
• Proportion = frequency/total frequency
• Percentage= frequency/total frequency x 100%
• In Table 1.3 the proportion of could in art texts is
296/1758 = 0.168 = 16.8% (percentage)
• Proportions and percentages are useful to
summarize data, but they are not appropriate if the original
values (numbers of observations) are not given,
because the reader cannot tell how many observations they rest on.
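As a minimal sketch of the two formulas above, the following uses the Table 1.3 figures quoted in the slide (296 occurrences out of 1758). The function name `proportion` is only an illustrative choice.

```python
def proportion(frequency, total_frequency):
    """Proportion = frequency / total frequency."""
    return frequency / total_frequency

# Figures quoted above for "could" in art texts.
p = proportion(296, 1758)
pct = p * 100  # percentage = proportion x 100%

# Report the raw counts too, so the percentage can be interpreted.
print(f"296 of 1758 = {p:.3f} = {pct:.1f}%")
```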
Dispersion = spread of scores/range
        Text 1  Text 2  Text 3  Text 4  Text 5
Nouns   20      29      24      15      22
Verbs   25      23      18      25      19

• For nouns, the highest is 29 and the lowest is 15, so
the data dispersion range is (29 - 15 + 1) = 15
• For verbs, the highest is 25 and the lowest is 18, so
the data dispersion range is (25 - 18 + 1) = 8
• The dispersion values show the range of the data
variation, how far the data vary from the lowest to
the highest although they might have the same
mean (which is 22 for nouns, and also 22 for verbs)
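The range calculation above can be reproduced in a few lines. This sketch uses the noun and verb counts from the table and the inclusive range (highest - lowest + 1) used on this slide. The helper name `inclusive_range` is hypothetical.

```python
from statistics import mean

# Noun and verb counts for Text 1 to Text 5, from the table above.
counts = {
    "Nouns": [20, 29, 24, 15, 22],
    "Verbs": [25, 23, 18, 25, 19],
}

def inclusive_range(values):
    """Range as defined on the slide: highest - lowest + 1."""
    return max(values) - min(values) + 1

ranges = {pos: inclusive_range(vals) for pos, vals in counts.items()}
means = {pos: mean(vals) for pos, vals in counts.items()}

for pos in counts:
    print(pos, "mean =", means[pos], "range =", ranges[pos])
```

Both word classes have the same mean, but the ranges show that the noun counts are spread more widely.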
Lecture 2: Variance & Standard Deviation
• Variance = how far the individual values spread
around the mean.
• Sentence length: X̅ = 15.7
N = 10: 12, 12, 13, 14, 15, 16, 16, 17, 20, 22
Diff: -3.7, -3.7, -2.7, -1.7, -0.7, 0.3, 0.3, 1.3, 4.3, 6.3
Σdiff = 0; Σdiff² = 98.1
V = Σdiff² / N = 98.1/10 = 9.81 → OK when N is very large / a population
V = Σdiff² / (N-1) = 98.1/9 = 10.9 → OK when N is very small / a sample only
Standard Deviation = √V, the square root of the variance =
the variability of the distribution of the data
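The two variance formulas differ only in the divisor (N for a population, N-1 for a sample). The sketch below works through the sentence-length data by hand and then cross-checks against Python's `statistics` module, which is an illustration choice rather than something the slides prescribe.

```python
from math import sqrt
from statistics import pvariance, variance

# Sentence lengths from the slide (N = 10, mean = 15.7).
scores = [12, 12, 13, 14, 15, 16, 16, 17, 20, 22]

n = len(scores)
mean_score = sum(scores) / n
sum_sq_diff = sum((x - mean_score) ** 2 for x in scores)  # sum of squared differences

pop_var = sum_sq_diff / n         # divide by N: whole population
samp_var = sum_sq_diff / (n - 1)  # divide by N-1: sample only

pop_sd = sqrt(pop_var)    # standard deviation = square root of variance
samp_sd = sqrt(samp_var)

print(f"sum of squared differences = {sum_sq_diff:.1f}")
print(f"population variance = {pop_var:.2f}, SD = {pop_sd:.2f}")
print(f"sample variance     = {samp_var:.2f}, SD = {samp_sd:.2f}")

# Cross-check against the library implementations.
assert abs(pop_var - pvariance(scores)) < 1e-9
assert abs(samp_var - variance(scores)) < 1e-9
```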
Standard Score = Z
• A z-score is how far a given raw score is from
the mean, in standard deviation units
• Z = (score - mean) / standard deviation
A standard score can be transferred into a t-score (= T),
where T = 10z + 50, which also shows how far a score
is from the mean, but using positive numbers.
A standard score can also be transferred into a Percentile Rank
score, which represents the position of the individual score
in 100% = the percentage of scores that fall below the score. If
a student reaches a percentile rank of 60, it means
that the student is better than 60% of all the students.
Table 1.13 Z-score for Grammar, p. 11
            Score   Score-Mean   Z-score
Wandi       120     5            0.31
Tugino      90      -25          -1.58
Triman      130     15           0.94
Fortunata   125     10           0.63
Fivanto     110     -5           -0.31
Their scores are high because the score range is from 0 to 150
Mean (120+90+130+125+110)/5 = 115
Table 1.14 Z-score for HoE, p. 11
            Score   Score-Mean   Z-score
Wandi       4       0.2          0.24
Tugino      3       -0.8         -0.96
Triman      5       1.2          1.44
Fortunata   4       0.2          0.24
Fivanto     3       -0.8         -0.96
Their scores are low because the range is from 0 to 5
Mean (4+3+5+4+3)/5 = 3.8
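The z-score formula can be applied to the HoE scores in Table 1.14 as follows. This is a sketch under one assumption: the table's values match the sample standard deviation (divisor N-1), so that is used here. The code keeps full precision, so a value may differ from the table in the last digit where the table rounds intermediate results.

```python
from statistics import mean, stdev

# HoE scores from Table 1.14.
hoe = {"Wandi": 4, "Tugino": 3, "Triman": 5, "Fortunata": 4, "Fivanto": 3}

hoe_mean = mean(hoe.values())
hoe_sd = stdev(hoe.values())  # sample SD (divisor N-1): an assumption, see lead-in

# z = (score - mean) / standard deviation
z_scores = {name: (score - hoe_mean) / hoe_sd for name, score in hoe.items()}

for name, z in z_scores.items():
    print(f"{name:<10} {hoe[name]}  {hoe[name] - hoe_mean:+.1f}  {z:+.2f}")
```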
Comparison for HoE & Grammar z-score
          Wandi   Tugino   Triman   Fortunata   Fivanto
Grammar   0.31    -1.58    0.94     0.63        -0.31
HoE       0.24    -0.96    1.44     0.24        -0.96
Their scores are converted into standard
scores (z-scores), where the mean is always
0. As seen in this table, though their raw scores
are very different, their standard scores are
not so different.
t-score: T = 10z + 50
            Grammar (10z + 50)          HoE (10z + 50)
Wandi       (0.31 x 10) + 50 = 53.1     (0.24 x 10) + 50 = 52.4
Tugino      (-1.58 x 10) + 50 = 34.2    (-0.96 x 10) + 50 = 40.4
Triman      (0.94 x 10) + 50 = 59.4     (1.44 x 10) + 50 = 64.4
Fortunata   (0.63 x 10) + 50 = 56.3     (0.24 x 10) + 50 = 52.4
Fivanto     (-0.31 x 10) + 50 = 46.9    (-0.96 x 10) + 50 = 40.4
• To avoid minus numbers, a z-score can be
converted into a t-score (= T)
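The conversion T = 10z + 50 is a one-line function. The sketch below applies it to the HoE z-scores from Table 1.14. The name `t_score` is only illustrative.

```python
def t_score(z):
    """Convert a z-score to a T-score: T = 10z + 50."""
    return 10 * z + 50

# HoE z-scores as listed in Table 1.14.
hoe_z = {"Wandi": 0.24, "Tugino": -0.96, "Triman": 1.44,
         "Fortunata": 0.24, "Fivanto": -0.96}

# Round to one decimal place, as in the slide.
hoe_t = {name: round(t_score(z), 1) for name, z in hoe_z.items()}

for name, t in hoe_t.items():
    print(f"{name:<10} z = {hoe_z[name]:+.2f}  T = {t}")
```

A z-score of 0 maps to T = 50, so negative z-scores become T-scores below 50 rather than negative numbers.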
Percentile Rank: shows the % of scores below a certain score
            Wandi   Tugino   Triman   Fortunata   Fivanto
Grammar     120     90       130      125         110
REORDERED SCORES FROM LOWEST TO HIGHEST
Score             90    110   120   125   130
Percentile rank   10%   30%   50%   70%   90%
1. Order the scores from lowest to highest.
2. For each score, add the percentage of scores that fall below
the score to one-half the percentage of scores that
fall at the score.
The percentile rank of the score 125 is (3/5 x 100%) +
(1/5 x 100% / 2) = 60% + 10% = 70%
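The two steps above can be sketched as a small function, applied here to the five Grammar scores from the slide. The function name `percentile_rank` is an illustrative choice.

```python
def percentile_rank(score, scores):
    """Percentage of scores below `score` plus half the percentage equal to it."""
    below = sum(1 for s in scores if s < score)
    at = sum(1 for s in scores if s == score)
    return (below + at / 2) * 100 / len(scores)

# Grammar scores from the slide.
grammar = [120, 90, 130, 125, 110]

# Step 1: order from lowest to highest. Step 2: compute each rank.
ranks = {s: percentile_rank(s, grammar) for s in sorted(grammar)}

for s, r in ranks.items():
    print(f"{s}: {r:.0f}%")
```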
Go to worksheet 2
Lecture 3 Distribution and Probability
• Probability = an attitude of doubt about some future
event (layman definition).
• E.g. It will probably rain tomorrow.
The probability is small that I shall live to be 100.
• Probability = the ratio of the number of favorable
cases to the total number of equally likely cases
(technical) → p = e/n, where p = probability, e =
number of expected (favorable) outcomes, n = number of possible outcomes.
• What is the probability for a noun to be countable?
• P(count) = 1 (being count) / 2 (being count or non-count)
= 0.5 or 50%
Conditional Probability
• What is the probability for a count noun to be
followed by a non-count noun?
• p(count & non-count) = p(count) x p(non-count)
= 0.5 x 0.5 = 0.25 = 25% (assuming the two nouns are independent).
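The ratio definition p = e/n and the multiplication above can be written out as a short sketch. It uses exact fractions so that no rounding creeps in. Multiplying the two probabilities is valid only under the assumption that the two nouns are independent, which the code states in a comment.

```python
from fractions import Fraction

def probability(favorable, possible):
    """p = e/n: favorable outcomes over equally likely possible outcomes."""
    return Fraction(favorable, possible)

# A noun is either count or non-count: 1 favorable case out of 2.
p_count = probability(1, 2)
p_noncount = 1 - p_count

# Assumption: the two nouns are independent, so probabilities multiply.
p_both = p_count * p_noncount

print(f"p(count) = {float(p_count)}")
print(f"p(count & non-count) = {float(p_both)} = {float(p_both) * 100:.0f}%")
```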
FREQUENCY DISTRIBUTION (1)
= A DISTRIBUTION IN WHICH THE VALUES OF THE DEPENDENT VARIABLE ARE
TABLED OR PLOTTED AGAINST THEIR FREQUENCY OF OCCURRENCE.
E.G. THE DISTRIBUTION OF THE NUMBER OF ARTICLES PER SENTENCE IN A TEXT OF 100
SENTENCES.
(1) Roughly, among the twenty-six test sentences, the most common outcomes
are 4 and 6: that is, the most likely occurrences of determiners within a
sentence is four or six times. (2) However, if we increase the test sample, it
becomes more evident that the most probable frequency of determiners
within a sentence is 6 (mode). (3) Similarly, as we increase the sample, the
histogram gets more peaked and more bell-shaped. (4) In mathematical
terms, distributions like this are known as binomial distributions and the
resulting histograms would take the shape shown in Figure 1.5.
In (1) there are 4 articles; in (2), 3; in (3), 2; and in (4), 2.
So in this paragraph, there are 2 sentences with 2 articles, 1 sentence
with 3 articles, and only 1 sentence with 4 articles.
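The article count for the sample paragraph can be reproduced with a short sketch. It assumes that "article" means the words a, an and the, and it uses a simple whole-word regular expression, which is an illustration choice rather than the method used for the slides.

```python
import re
from collections import Counter

# The four numbered sentences of the sample paragraph above.
sentences = [
    "Roughly, among the twenty-six test sentences, the most common outcomes "
    "are 4 and 6: that is, the most likely occurrences of determiners within "
    "a sentence is four or six times.",
    "However, if we increase the test sample, it becomes more evident that "
    "the most probable frequency of determiners within a sentence is 6 (mode).",
    "Similarly, as we increase the sample, the histogram gets more peaked "
    "and more bell-shaped.",
    "In mathematical terms, distributions like this are known as binomial "
    "distributions and the resulting histograms would take the shape shown "
    "in Figure 1.5.",
]

# Assumption: an article is the whole word "a", "an" or "the".
ARTICLE = re.compile(r"\b(?:a|an|the)\b", re.IGNORECASE)

articles_per_sentence = [len(ARTICLE.findall(s)) for s in sentences]

# Frequency distribution: how many sentences contain each number of articles.
distribution = Counter(articles_per_sentence)

for n_articles in sorted(distribution):
    print(f"{n_articles}-article sentences: {distribution[n_articles]}")
```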

[Figure: Sentences with different numbers of articles. Column chart; categories shown: 1-article, 2-article; frequency axis from 0 to 12.]
When the text is longer, the distribution can
be as follows
[Figure: Frequencies of sentences with different numbers of articles. Column chart; categories 1-article to 9-article; frequency axis from 0 to 70.]
The column histogram can be changed into a
chart (almost a normal distribution)
[Figure: Frequencies of sentences with different numbers of articles. Categories 1-article to 9-article; frequency axis from 0 to 70.]
Normal Distribution: mean = mode = median;
the mean lies at 0 SD
[Figure: Imaginary normal distribution of sentences with different numbers of articles. Categories 1-article to 9-article; frequency axis from 0 to 70.]
Skewed distribution = asymmetrical
• In a class where most students are smart and
only some are not, the distribution might
be skewed, not symmetrical.
[Figure: Column chart with categories GPA <2, GPA 2-2.5, GPA 3-5, GPA 4; frequency axis from 0 to 16.]
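The contrast between a symmetrical and a skewed distribution can be made concrete by comparing mean, median and mode. The frequencies in this sketch are invented for illustration only and are not read from the slide's charts.

```python
from statistics import mean, median, mode

def expand(frequencies):
    """Turn {value: frequency} into a flat list of observations."""
    return [value for value, count in frequencies.items() for _ in range(count)]

# Hypothetical, perfectly symmetrical counts of sentences with 1..9 articles.
symmetric = expand({1: 5, 2: 10, 3: 20, 4: 40, 5: 60, 6: 40, 7: 20, 8: 10, 9: 5})

# Hypothetical class where most students score high and a few score low.
skewed = expand({1: 1, 2: 2, 3: 5, 4: 12})

for label, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(f"{label:<10} mean = {mean(data):.1f}  "
          f"median = {median(data):.1f}  mode = {mode(data)}")
```

In the symmetrical case the three measures coincide. In the skewed case the few low scores pull the mean below the median and mode.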
GO TO WORKSHEET 3
