
PSYC 315 - LESSON 2 - 11/09

Video 1: Distributions of Qualitative Variables

Distribution: conveys the relative frequency with which values of a variable
occur in a sample or population
● A distribution may also convey the relative standing of scores within a
sample or population

Reasons to examine distributions are to provide:

● A sense of order - from a strictly descriptive point of view, distributions can
provide some sense of the order in a large set of scores
● A link between statistics and parameters - from an inferential point of view,
distributions provide the crucial link between statistics and parameters

Frequency tables for qualitative variables:

● The table is a frequency table for a qualitative variable - it summarizes
the results of a small survey
● Area preference column: shows the five values of the variable in
arbitrary order
● f (frequency): shows the number of students (in this specific
frequency table) who chose the corresponding area as their preferred
area - we refer to these numbers as raw frequency counts or tallies
● p (proportions): the proportion of students that prefer each of the areas -
these are obtained by dividing the number of students preferring a
given area by the total number of students surveyed.
● proportions convey the relative frequency of occurrence of each
value of the variable
● A frequency table would usually show raw frequencies or proportions, but
not both
● We can say that the table below conveys the distribution of area
preferences because it shows the relative frequency of occurrence of each
value of the variable
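The f and p columns described above can be sketched in a few lines of Python. This is only an illustration: the `responses` list is made-up data (not the survey from the lecture), and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical survey responses (not the data from the lecture's table).
responses = ["Clinical", "Social", "Clinical", "Cognitive", "Social",
             "Clinical", "Developmental", "Cognitive", "Clinical", "Social"]

def frequency_table(values):
    """Return {value: (f, p)}: raw count and proportion for each value."""
    counts = Counter(values)
    n = len(values)
    return {value: (f, f / n) for value, f in counts.items()}

table = frequency_table(responses)
for area, (f, p) in table.items():
    # The row order is arbitrary because the variable is qualitative.
    print(f"{area:<15} f={f}  p={p:.2f}")
```

Because every response falls in exactly one category, the proportions in the p column sum to 1.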
Bar Graphs for Qualitative Variables:

● The information in a frequency table can be conveyed graphically using a
bar graph
● A bar graph shows categories on the x-axis and tallies or proportions
on the y-axis - and they convey the fact that we have arbitrary
unordered categories so the bars do not touch
● Graphical representations provide more immediate impressions of the
relationship between category values and frequency
● In psychology, we use bar graphs and frequency tables to show the
distribution of scores on qualitative variables - because there is no
natural ordering to the values of qualitative variables, the order they are
listed in a bar graph or frequency table is arbitrary.

Video 2: Distributions of Discrete Quantitative Variables

Frequency tables:

● This is an example of a frequency table for a discrete quantitative
variable:
● 1) We determine the maximum and minimum scores in the data set
● 2) We then list these values from the maximum to the minimum in the left
column - values are always listed from largest to smallest, and values in
this range are not skipped if they happen not to occur
● 3) We then count the number of instances of each value in our data set
(Column f) - the number of scores counted are shown at the bottom of
the column, and now we can get a clearer picture of how grades were
distributed on the quiz
● How does my grade compare to the class?
● How many people scored the same as or lower than you? (cumulative
frequencies - listed in the column “cumulative f”)
● The lowest score in the table above is 4 and there is only one instance of
a 4 in the frequency column - because only one person scored 4 or
lower, we enter a 1 in the cumulative f column. Therefore, there is one
score at or below 4 (it is the worst score because no one scored lower than 4)
● To find the number of scores at or below each subsequent value → we add
the number of scores having a particular value to the ones below it
● E.g: we compute the cumulative frequencies for values from 5 to 10:
● Starting with 5, we can see there are 4 instances of 5 in the frequency
column and one score less than 5 → so there are 4 + 1 = 5 scores at
or below 5
● The meaning of a cumulative frequency: it depends on how many scores
we are dealing with
● E.g: a cumulative frequency of 80 means one thing if there are 80
scores, but means another thing if there are 80,000 scores → because of
this, it is conventional to express frequencies and cumulative frequencies
as proportions by dividing them by the number of scores in the data set

● Column p: the proportion of scores occurring at each value of the variable


● Column P: the cumulative proportion at or below each value of the variable
● Multiplying a cumulative proportion by 100 gives you a cumulative
percentage - the percentile rank of the score
● In statistics, notation is critical - in this case it is absolutely essential to
recognize that p and P denote very different quantities:
● p: the proportion of scores having a particular value
● P: the proportion of scores at or below a given value
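As a rough sketch of how the f, cumulative f, p and P columns fit together, the Python below builds the table for a small set of quiz scores. The `scores` list is invented for illustration (it is not the lecture's data set) and `discrete_frequency_table` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical quiz scores (illustrative only, not the lecture's data set).
scores = [4, 5, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 10, 10, 10, 10]

def discrete_frequency_table(scores):
    """Build rows with f, cumulative f, p and P for every value from min to max."""
    counts = Counter(scores)
    n = len(scores)
    rows = []
    cum_f = 0
    # Accumulate from the minimum upward; values that never occur get f = 0.
    for value in range(min(scores), max(scores) + 1):
        f = counts[value]
        cum_f += f
        rows.append({"value": value, "f": f, "cum_f": cum_f,
                     "p": f / n, "P": cum_f / n})
    # List values from largest to smallest, as in the notes.
    return rows[::-1]

table = discrete_frequency_table(scores)
for row in table:
    # The percentile rank is the cumulative proportion P times 100.
    print(row["value"], row["f"], row["cum_f"],
          f"{row['p']:.2f}", f"{row['P']:.2f}", f"{row['P'] * 100:.0f}%")
```

Note that the value 9 still gets a row with f = 0, and that P for the maximum value is always 1.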

Histograms:

● The histogram plots the values of the variable between minimum and
maximum on the x-axis and the proportion of scores in the data set having
each of these values on the y-axis
● Histograms vs Bar Graphs:
● Because there’s a natural ordering to the values of a quantitative
variable, they must be placed in this natural order on the x-axis of the
histogram → in a bar graph, on the other hand, the order of values
on the x-axis is arbitrary
● There’s no space between the bars of a histogram as there is between the
bars of a bar graph → this convention was adopted to convey the fact
that the levels of the variable have a natural order (when the bars touch,
it conveys the continuity of adjacent values of the variable)

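Grouped frequency tables:

● If we had 50 scores ranging from 5 to 95, listing every value would make
a very long table:
● instead, we create intervals (e.g. [90,99]) and count the scores in each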
● we also take this approach when creating frequency tables for continuous
quantitative variables

Video 3: Distributions of Continuous Quantitative Variables

Intervals, score limits, real limits, and interval width:


● Interval: is a range of scores with lower and upper limits - these limits can
be expressed as score limits or real limits
● score limits: integers that define an interval in the units of measurement
of the variable
● E.g: 300 to 309 grams - because our measurements are real numbers
we have to decide what to do with scores like 299.7 grams → real
limits
● real limits: specify the real numbers that fall in an interval
● From half a unit below the lower score limit to half a unit above the
upper score limit
● E.g: in the case of an interval having lower and upper score
limits of 300 and 309 grams → the real limits range from 299.5 to
309.5
● Then, this interval contains 299.7, but not 299.35
● Interval width can be thought of in two ways:
● The interval width is the number of units in an interval → there are ten
units in the interval 300 to 309
● The interval width is the difference between the upper and lower real
limit → 309.5 - 299.5 = 10.
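The relationship between score limits, real limits and interval width can be checked with a small Python sketch. The function names and the `unit` parameter are illustrative assumptions, not something defined in the lecture.

```python
def real_limits(lower_score_limit, upper_score_limit, unit=1):
    """Real limits extend half a unit beyond each score limit."""
    half = unit / 2
    return lower_score_limit - half, upper_score_limit + half

def interval_width(lower_score_limit, upper_score_limit, unit=1):
    """Width is the upper real limit minus the lower real limit."""
    lower_rl, upper_rl = real_limits(lower_score_limit, upper_score_limit, unit)
    return upper_rl - lower_rl

lower_rl, upper_rl = real_limits(300, 309)
print(lower_rl, upper_rl)               # real limits of the 300-309 interval
print(interval_width(300, 309))         # width of the interval
print(lower_rl <= 299.7 < upper_rl)     # is 299.7 inside the interval?
print(lower_rl <= 299.35 < upper_rl)    # is 299.35 inside the interval?
```

Both ways of thinking about width agree: counting the ten units from 300 to 309 gives the same answer as 309.5 - 299.5.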

Example:

● The values of this variable are continuous → which means that they are
real numbers that may have infinitely many digits to the right of the
decimal place
● In this example, we have 60 scores and would like to have about 10
intervals
● The largest score we have in the example is 96.9 and the lowest score is
38.9 → range = 96.9 - 38.9 = 58
● When we divide the range by the approximate number of intervals we'd like
to have, we find that we'd be aiming for an interval width of approximately
58/10 = 5.8 → because our interval width should be an intuitive number, we
will have to choose one close to 5.8 → for this example, 5 and 10 would be
good choices

Defining intervals:
● We aim to have 5 to 20 intervals - a small number of intervals is appropriate
when we have a small number of scores and a large number of intervals is
appropriate when we have a large number of scores
● Interval width must be an intuitive integer value (integers like 5, 10, 20 and
100 are intuitive)
● Interval width depends on the number and range of scores in the dataset →
to choose an interval width we first determine the range, which is defined
as the maximum score minus the minimum score (range = max - min)
● Approximate width = range / approximate number of intervals we'd like
to have
● choose an intuitive number that is close to the approximate interval
width
● Lower score limits should be multiples of interval width
● E.g: if we chose an interval width of 10, then lower score limits should
be multiples of 10 (30,40,50,etc).

Making the table:

● Once we have chosen our interval width, we list our intervals from highest
to lowest, making sure that the real limits of the highest interval capture the
maximum score in our dataset and the real limits of the lowest interval
capture the minimum score in our dataset
● For our example, we’ve chosen an interval width of 10 and found that the
maximum score is 96.9 → therefore, the highest score limits are 90 - 99,
and because the minimum score is 38.9 the lowest score limits are 30 -
39
● Interval midpoint: is simply the point midway between the lower and
upper real limits
● Once the real limits have been defined we can count the number of scores
falling in each interval → these counts are listed in the frequency column
(f)
● Cumulative f, p and P are computed in exactly the same way as for discrete
quantitative variables
● When assigning scores to intervals, we face a small problem: what to do
with scores like 79.5 that fall simultaneously on the upper real limit of one
interval and the lower real limit of the next interval
● The short answer is that in the real world of data collection this
happens very rarely
● But the rule is: we always put a score that falls on a real limit
boundary in the higher interval
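The whole procedure (pick a width, make lower score limits multiples of the width, apply real limits, and send boundary scores to the higher interval) can be sketched in Python as follows. The `scores` list is a short made-up sample that merely reuses the example's minimum (38.9) and maximum (96.9), and `grouped_frequency_table` is a hypothetical helper that assumes an integer width and a measurement unit of 1.

```python
import math
from collections import Counter

# Hypothetical scores; only the min (38.9) and max (96.9) echo the notes.
scores = [38.9, 45.2, 52.0, 79.5, 79.4, 83.1, 96.9, 61.7, 68.3, 74.0]

# Range divided by the desired number of intervals suggests a width near 5.8.
approx_width = (max(scores) - min(scores)) / 10

def grouped_frequency_table(scores, width):
    """Rows (highest interval first) with score limits, real limits, midpoint, f, p, P."""
    n = len(scores)
    # Lower score limit of the interval whose real limits contain x.
    # A score exactly on a boundary (e.g. 79.5) lands in the higher interval.
    lower_limits = [math.floor((x + 0.5) / width) * width for x in scores]
    counts = Counter(lower_limits)
    rows, cum_f = [], 0
    for lower in range(min(lower_limits), max(lower_limits) + width, width):
        f = counts[lower]
        cum_f += f
        rows.append({
            "score_limits": (lower, lower + width - 1),
            "real_limits": (lower - 0.5, lower + width - 0.5),
            "midpoint": lower + (width - 1) / 2,
            "f": f, "p": f / n, "P": cum_f / n,
        })
    return rows[::-1]

table = grouped_frequency_table(scores, width=10)
for row in table:
    print(row)
```

With a width of 10 the top interval is 90 - 99 and the bottom interval is 30 - 39, matching the example, and the score 79.5 is counted in the 80 - 89 interval.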

Graphical Depiction of Frequencies:


[Figure: six histogram panels, numbered 1 to 6]
● The image above plots our sixty final exam scores as six different
histograms
● The histograms correspond to grouped frequency distributions having
interval widths of 1) 40, 2) 20, 3) 10, 4) 5, 5) 2 and 6) 1
● The x-axis in each panel shows the interval midpoints → those are used
because they require less space than the real limits or the score limits
● The y-axis shows the proportion of scores falling in each interval

Video 4: Probability
● Subjective probability: subjective judgment about the likelihood of a
specific outcome
● Objective probability: numerical expressions of the likelihood of some
event occurring
● Frequentist, or long-run, probability (defined in terms of outcomes,
events, proportions and sampling experiments)

Coin flipping:

● Heads and tails, the two sides of a coin, can be considered two values of a
qualitative variable → flipping a coin is a way of selecting one of these values
randomly
● Sampling experiment: the random selection of a value of the variable
● Outcome: the value of a variable obtained through random selection in a
sampling experiment
● Success: a specified outcome of interest
● If we define one of the outcomes as a success we can count the number
of successes in a given number of sampling experiments
● Then, we can calculate the proportion of successes in a given number
of sampling experiments by dividing the number of successes by the
number of sampling experiments
● E.g: p = 14/25 = 0.56

Rolling a die:

● The outcome of a die roll can be treated as a qualitative or quantitative
variable with 6 values → each
time you roll a die the 6 possible outcomes are the 6 values of the variable
(1-6)
● The act of rolling a die is a sampling experiment, which means it's a way to
select one of the six values randomly (random selection)
● If a success is defined as rolling a 1, then six “1”s in sixty sampling
experiments means that the proportion of successes is: p = 6/60 = 0.1
● Success: a specific outcome of a sampling experiment
● Event: one or more outcomes in a sampling experiment
● a single value of a variable can be an outcome or an event, but two
or more values can only be an event, not an outcome
● E.g: rolling a 3 or a 6 → we can compute the proportion of times an
event occurs in the same way we compute the proportion of times an
outcome occurs
● In 75 rolls of a die, there are 14 “3” and 16 “6”. 14 + 16 = 30
successes → then the proportion of successes is p = 30/75 = 0.4

Frequentist Definition of Probability:

● Our intuitions about probability are based on an infinite number of
sampling experiments
● Probability: the proportion of times the event would occur in an infinite
number of identical sampling experiments
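We cannot run infinitely many sampling experiments, but a simulation gives a feel for the frequentist definition. The sketch below uses Python's pseudo-random generator as a stand-in for a fair die; the function name, the seed and the numbers of rolls are arbitrary choices made for illustration.

```python
import random

def proportion_of_successes(n_rolls, success=1, seed=0):
    """Roll a simulated fair die n_rolls times; return the proportion of successes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == success)
    return hits / n_rolls

# As the number of sampling experiments grows, the proportion of successes
# should settle near the long-run probability 1/6 (about 0.1667).
for n in (60, 6_000, 100_000):
    print(n, proportion_of_successes(n))
```

With only 60 rolls the proportion can be noticeably off; with many more rolls it tends to sit close to 1/6.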

Mutually Exclusive and Independent Events:

● Mutually exclusive: if two events cannot co-occur → if one of the possible
events occurs in a sampling experiment, then none of the other events can
also occur
● Independent: if the occurrence of one does not affect the probability that
the other will occur
● Sampling with and without replacement: draws done with replacement
are independent events → this also applies when drawing scores from
populations
● Sampling WITH replacement: when an item is randomly selected
from a set (sample, population) it is returned to the set before the next
item is randomly selected
● Sampling WITHOUT replacement: when an item is randomly
selected from a set (sample, population) it is not returned to the set
before the next item is randomly selected
● Events are independent when sampling with replacement
● Events are dependent when sampling without replacement

Video 5: Probability distributions

Grouped Frequency Tables, Random Sampling and Probability


● The grouped frequency table can help answer the question: what is the
probability that a randomly selected score from this distribution will fall in
the interval 80 - 89?

● By posing the question this way we have defined the event as a score
falling in the interval
80 - 89 → the p column in the table shows that the answer is 0.20
● Drawing a single score from this distribution is a sampling experiment
→ if we repeat this sampling experiment infinitely many times with
replacement, then 20% of the time the score will fall in the interval 80 -
89
● The probability of a random score falling outside the interval 80 - 89
is 80%, i.e. the probability is 0.8 that a randomly chosen score will
fall outside the interval.
● This is the same as saying that the probability is 0.2 that a randomly
chosen score will fall in the interval 80 - 89
● We are connecting intervals in a frequency table with the events in a
sampling experiment
● When we state the probability that a randomly chosen score will fall in a
given interval, we mean the proportion of times a randomly chosen
score would fall in that interval in an infinite series of identical sampling
experiments

Probability distribution

● Given any distribution of scores for a quantitative variable, we can define
any number of events
● Events might be:
● score is greater than x
● score is less than x
● score is between x1 and x2
● score is outside x1 to x2
● Probability distribution: conveys the probability that a randomly selected
score will have a given value or fall in a given interval
● All relative frequency distributions are probability distributions when
we consider an infinite number of scores drawn from the distribution
with replacement

● This frequency table for a qualitative variable and the corresponding bar
graph are probability distributions
● If we repeatedly choose individuals at random from this
distribution with replacement, then the probability of drawing a
student whose preference is Clinical would be 0.29
● The frequency table for a discrete quantitative variable and a
corresponding histogram are also probability distributions
● If we repeatedly choose scores at random from this distribution,
with replacement, then the probability of drawing a score of 7
would be 0.40
● Each histogram is a probability distribution
● Each bar represents an interval, and the bar heights show the proportion
of individuals in each interval → in each panel, each bar shows the
probability that a score will fall in the corresponding interval with
repeated random sampling with replacement

Properties of Probability Distributions

1) The probability of any event is between 0 and 1

● A: to denote any event → such as, coin coming up heads, die coming up 4,
etc.
● p(A): to express the probability of event A occurring
● The constraint that the probability of event A must be between 0 and 1
can be expressed as:
● 0 ≤ p(A) ≤ 1
● E.g: Coin flips → A: the outcome is “heads”, so p(A) = 0.5 → the
probability of any event must be between 0 and 1

2) The sum of all probabilities is 1, when events are mutually exclusive

● n mutually exclusive events that can be denoted A1, A2,..., An


● E.g: rolling a die → then n would be 6
● Their probabilities can be expressed as p(A1), p(A2),...,p(An) → and these
probabilities sum to 1
● p(A1) + p(A2) + … + p(An) = 1
● In any distribution, there are many possible mutually exclusive events → in
the case of coin flips, the two possible events are heads and tails, each
having a probability of 0.5, and these probabilities sum to 1 [(pHead) +
(pTails) = 1]
● For qualitative variables, the events are categories → there is a probability
associated with each of these categories, and the sum of all these
probabilities is 1
● For discrete quantitative variables the events are values of the variable,
which might be whole numbers → the probability associated with each
value of the variable is a number between 0 and 1 - and the sum of all
probabilities is 1
● For grouped frequency distributions, it is the same → the probability of a
score falling in a given interval is between 0 and 1, and the sum of all such
probabilities is 1

Probability Density Functions

● This figure shows a hypothetical distribution of heights for Canadian
women measured in inches
● There are more scores around the center of the distribution, and fewer
away from the center → This figure is a probability density function
● Function: one y - value for every x - value → for every value of x there is
one and only one value of y
● Density: number of scores in an interval divided by interval width →
number of things per unit of measurement
● E.g: population density is the number of people per square kilometer
● If you were to think about the number of heights in a 1 inch interval for
the population depicted in the figure, then we can talk about the density
of scores per inch
● This is computed by dividing the number of scores in the interval by the
interval width
● Probability density: proportion of scores in an interval divided by
interval width
● As interval width gets smaller and smaller, eventually probability
density is defined at a point, so there is a probability density for each
value of x, giving us a probability density function
● A frequency table CANNOT be thought of as a probability density
function
● probability density = proportions of scores in an interval / interval
width = p / width

● The six histograms on the left of this figure depict the same distribution of
scores in a very large population
● each histogram has a very different interval width → as interval width
decreases the proportion of scores in each interval also decreases
● The bars in the six panels of the figure to the right show the probability
density associated with each interval → this is obtained by dividing the
proportion of scores in each interval by the interval width
● As interval width decreases toward 0, probability density converges to a
single fixed value, so probability density is defined at a point → this
yields a probability density function
● The smooth line in each panel of the figure to the right, shows this
probability density function
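The difference between a proportion and a probability density can be seen numerically. The sketch below uses a small invented list of heights (not the population in the figure) and, as a simplification, plain intervals of the form [lower, lower + width) rather than real limits; `densities` is a hypothetical helper.

```python
import math
from collections import Counter

# Hypothetical heights in inches (illustrative only).
heights = [60.2, 61.5, 62.1, 62.8, 63.3, 63.9, 64.0, 64.4, 64.6, 65.1,
           65.5, 65.9, 66.2, 66.8, 67.4, 68.0, 68.7, 69.9, 63.7, 64.9]

def densities(scores, width):
    """Return {lower: (p, density)} for intervals [lower, lower + width)."""
    n = len(scores)
    counts = Counter(math.floor(x / width) * width for x in scores)
    # density = proportion of scores in the interval / interval width
    return {lower: (f / n, f / n / width) for lower, f in sorted(counts.items())}

for width in (5, 2, 1):
    print(f"width = {width}")
    for lower, (p, density) in densities(heights, width).items():
        print(f"  [{lower}, {lower + width})  p={p:.2f}  density={density:.3f}")
```

As the width shrinks, the proportions p get smaller, but the densities stay on a comparable scale, and in every case the densities multiplied by the width still sum to 1.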

Video 6: Rules of Probability

1) The OR rule for mutually exclusive events


● When events A and B are mutually exclusive, then the probability of A or
B occurring is the sum of their separate probabilities
● p(A or B) = p(A) + p(B)
● The OR rule is the most important rule of probability

● The OR rule for events that are NOT mutually exclusive:

● p(A or B) = p(A) + p(B) - p(A and B) → when A and B are independent,
p(A and B) = p(A)*p(B)
● E.g: the probability of drawing a red card or a king (colour and rank
are independent):
● p(A or B) = ½ + 1/13 - ½*1/13 = 0.5385
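The red-card-or-king result can be verified by counting cards directly. The sketch below builds a standard 52-card deck and uses exact fractions; the helper `prob` and the variable names are illustrative. In general the term subtracted is p(A and B), which here equals p(A)*p(B) because a card's colour and rank are independent.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def prob(event):
    """Probability of an event = proportion of cards in the deck that satisfy it."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_king(card):
    return card[0] == "K"

p_red = prob(is_red)
p_king = prob(is_king)
p_both = prob(lambda c: is_red(c) and is_king(c))    # the two red kings
p_either = prob(lambda c: is_red(c) or is_king(c))   # counted directly

print(p_red, p_king, p_both, p_either, float(p_either))
```

Counting directly and applying p(A) + p(B) - p(A and B) give the same value, because the subtraction removes the two red kings that would otherwise be counted twice.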

2) The AND rule for independent events

● When events A and B are independent, then the probability of A and B
occurring is the product of their separate probabilities
● p(A and B) = p(A)*p(B) → coming up heads in two successive flips: p(A
and B) = (½)*(½) = 0.5 * 0.5 = 0.25

● The AND rule for dependent events:

● p(A and B) = p(A)*p(B|A) → p(B|A) is the probability of B given A
● p(B|A) is a conditional probability
● E.g: the probability of an ace on two consecutive draws, without
replacement:
● p(A and B) = p(A) * p(B|A) = (4/52)*(3/51) = 0.0045
● E.g: the probability of an ace on two consecutive draws, with
replacement:
● p(A and B) = p(A)*p(B) = (4/52)*(4/52) = 0.0059
● The probability of drawing two aces is greater when the draws are
independent than when the draws are dependent
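The two ace calculations can be reproduced with exact fractions. This is a minimal sketch; `p_two_aces` is a hypothetical helper that hard-codes a standard 52-card deck with 4 aces.

```python
from fractions import Fraction

def p_two_aces(with_replacement):
    """Probability of drawing an ace on each of two consecutive draws."""
    p_first = Fraction(4, 52)
    if with_replacement:
        # The first card is returned, so the second draw is independent.
        p_second = Fraction(4, 52)
    else:
        # One ace is gone and the deck has 51 cards: a conditional probability.
        p_second = Fraction(3, 51)
    return p_first * p_second

independent = p_two_aces(with_replacement=True)
dependent = p_two_aces(with_replacement=False)
print(independent, round(float(independent), 4))
print(dependent, round(float(dependent), 4))
```

The only difference between the two cases is the second factor: removing an ace lowers the chance that the second card is also an ace.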

EXCEL VIDEOS:

Video 1: Frequency Tables for Qualitative Variables


● Question: A pollster was interested in the social network preferences of
Americans. A random sample of 96 [18 to 23 year old] Americans was
chosen and asked about their social network preferences. Possible
responses: Facebook, Twitter or Other
● To count the numbers of ‘F’, ‘T’ and ‘O’ → COUNTIF (does the
counting for us): counts the number of cells in a specified range of the
worksheet that meet a specified criterion.
● Column “f”: will record the number of instances of each of these three
values
● =COUNTIF(all the data, the criterion) → in this example the
criterion is “F” [enter “=F”]
● Total: just use the function SUM and choose all the previous data
● Column “p”: will record the proportion of scores having each of these
three values
● Then, we divide each raw frequency count by this sum (f:
72/total)
● We want the formula to mean: divide the contents of the cell to my
left by the contents of Z9 → we accomplish this by making the
reference to cell Z9 an absolute reference rather than a relative
reference
● we double click on the reference to Z9 in cell AA9 to highlight
it, then type Command-t → $Z$9

Video 2: Frequency Tables for Discrete Quantitative Variables

● In this question, we consider a literature professor who was interested in
whether his students were familiar with the characters in the Lord of the
Rings (LOTR) → he made a list of 24 names: 12 from LOTR and 12 not;
each of the 48 students in his class answered YES or NO to each of
these 24 names
● the professor counted the number of LOTR characters to which each of
the students said YES (A1 to L4)

● Building the frequency table:


● we already have column headers for the LOTR scores, raw
frequencies (f), proportions (p), and cumulative proportions (P)
1) Find the maximum and minimum values in our data set
● we can do this with the MAX function → =MAX(all the data range)
● we can find the minimum score with the MIN function → =MIN(all
the data range)
● we then list these from max down to min in the column ‘LOTR
scores’ → enter 12 at the top of the column and in the cell below put
=(the cell above “12”) - 1

2) Now we use the COUNTIF function to count the number of scores in
the range (A1 to L4) having each of the values between 0 and
12
● modify the use of COUNTIF in two ways:
1) when we specify the range to search, we will make the range
A1 to L4 absolute references → Command-t
2) when we specify the criterion, we can point to the relevant
score in the column of LOTR scores → “=”&(value in
question)

3) We can next compute the proportion of scores having each value as we did
for qualitative variables
● we sum the number of scores having each value of the variable; then
divide each raw frequency by this sum to produce a proportion
● Column “p”: =(a value from column f)/(the sum [make the sum
an absolute reference]) → then drag to the other cells in the
column

4) We can compute cumulative proportions (Column “P”)


● We set the cumulative probability for the lowest value to its
probability of occurrence → the proportion of scores at or below the
lowest score in the set
● =(the lowest “p” in the set) → =(the next lowest “p” in the set) +
(the lowest “P”) → drag it up to compute the remaining
cumulative probabilities

Video 3: Grouped Frequency Tables


● The numbers in cells A1 to J6 of the worksheet represent 60 final grades in
a statistics course
● The grouped frequency table is already started with column headers for
interval midpoints (midpoint), the lower score limits (lower SL), upper
score limits (upper SL), and upper real limits (upper RL) | also included
raw frequency counts (f), proportions (p), and cumulative proportions
(P)

● Start using the MIN and Max functions to determine the range
● The interval width will be 5

● Because the maximum score in the data set is 96.9 and the interval width is
5, the score limits of the top interval are 95 to 99
● therefore, we can enter 95 as the lower score limit (Lower SL column)
of the top interval → the upper score limit will be the lower score limit
+ the interval width - 1 (=(lower score limit)+(interval width)-1)
● make the interval width an absolute reference (command-t)
● We can now determine the score limits of the second highest interval by
subtracting the interval width from the lower and upper limits of the
interval above
● =(the number in the Lower SL column that is above)-(interval width,
made an absolute reference) → drag the cell to the right to determine
the upper score limit
● drag everything down and calculate the score limits of the remaining
levels

● Calculating the real limits:


● The lower real limit is 0.5 below the lower score limit and the upper real
limit is 0.5 above the upper score limit
● Therefore, we subtract 0.5 from the contents in column Lower SL
(=(value in Lower SL)-0.5) and add 0.5 to the contents in column Upper
SL (=(value in Upper SL)+0.5) → drag all down

● Now, the main thing that remains is to compute the raw frequency counts
(COUNTIFS, which accepts more than one criterion)
● The first criterion is that the score must be greater than or equal to the
lower real limit of the corresponding interval
● The second criterion is that the score must be less than the upper real
limit of the corresponding interval
● =COUNTIFS((select the whole data range and make it absolute), “>=”&(Lower
RL of the top interval), (re-enter the data range and make it absolute),
“<”&(Upper RL of the top interval)) → drag it down
● Then, we sum the number of scores in each interval,
then divide each raw frequency by this sum to produce a proportion
● =(highest f value)/(sum → make it absolute)
● We start by putting 1 in the highest cell in the column “P” because the
proportion of scores at or below the highest interval is 1
● in the next cell, it will be the cumulative proportion of the interval
above, minus the proportion in the interval above
● =(the highest value in the column “P”)-(the highest value in the
“p” column) → then drag it all down

Video 4: Plotting proportions in a bar graph or histogram

● Midpoints: computed as the average of the real or score limits →
=AVERAGE(Lower SL, Upper SL)

1) Highlight the proportions we wish to plot

● Select the column and click on the Insert tab → click on the histogram
(top left) and choose the 2D option (first one)

2) Fix the x-values


● Click on the data menu and choose “chart source data” → produces a
dialog that shows the chart data range, which includes the cells we
selected
● To choose appropriate labels for the x-axis, we click on the icon to the
right of the Horizontal axis labels field (the small square to the right of
the white text box) → doing so will take us back to the
worksheet and we can select the numbers in column “midpoint”
● The midpoints run left to right from 97 to 37, but we want the opposite →
select all data in the table and then go to the data tab (the one in green)
and click on the sort icon (a big blue A)
● in the dialog that appears, we sort the values under the header
Midpoint from smallest to largest
● To change the bar graph to a histogram → double click on one of the
bars and the Format Data Series dialog will show up
● the field labeled Gap Width can be changed to 0 so that there are no
gaps between the bars
● To change how the bars are filled and how their borders are coloured →
in the Format Data Series dialog, we can click on the paint can to reveal
the Fill and Border sub-dialogs
● In the Fill option, choose Solid fill, which we select by clicking its
radio button
● A paint can button appears, which provides access to different colors
● For the bar borders, click on Format and then on the little blue pen on top,
right beside “shape fill” → choose black

3) Format the x and y axes


● To start, we click on the x-axis region to select it, then double-click to
access the Format Axis, which will appear to the right
● Then choose the icon that looks like a graph, click on the tick marks
option → the tick marks are not on top of the numbers
● To fix it, go to the Axis Option tab and under Axis position we
choose “On tick marks”
● Then click on one of the bars on the graph and the worksheet data
used by the graph will be highlighted → expand the range of data
plotted by positioning the mouse over one of the corners (one up and
one down on each of the columns selected)
● Now, we select and double click on the y-axis region
● Add tick marks for the major units → click on the graph icon and
select the tick marks option
● then under the major type option choose the outside option
● The number formatting can be changed under the option “Number” →
go to “category” and choose “number”
● We can then add axis labels via “Add chart element” → name the
horizontal axis “midpoints” and the vertical axis “proportion”
● Then delete the title and the gridlines
● Double click in the background region to bring up the Format Chart Area
dialog to remove the border around the figure
