Professional Documents
Culture Documents
AD8403 Data Analytics: DR S Palaniappan, Associate Professor, Department of ADS
AD8403 Data Analytics: DR S Palaniappan, Associate Professor, Department of ADS
DATA ANALYTICS
1
Dr S Palaniappan , Associate Professor , Department of ADS
UNIT I
INFERENTIAL STATISTICS I
2
Dr S Palaniappan , Associate Professor , Department of ADS
Lesson Plan
CO 1 To recognize the different types of data and gain insight into the data science process
CO 2 Understand the flow and learn the steps in a data science process
T1 David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016
• Real Populations
• A real population is one in which all potential observations are accessible at the
time of sampling.
• Hypothetical Populations
• A hypothetical population is one in which all potential observations are not
accessible at the time of sampling.
• The population in which whose unit is not available in solid form
• Identify all of the expressions from the above which involve a hypothetical
population.
Dr S Palaniappan , Associate Professor , Department of ADS 9
RANDOM SAMPLING
• Random sampling occurs if, at each stage of sampling, the selection process
guarantees that all potential observations in the population have an equal chance of
being included in the sample.
(a) the random sample of 10 cards accurately represents the important features of the
whole deck.
(b) each card in the deck has an equal chance of being selected.
(c) it is impossible to get 10 cards from the same suit (for example, 10 hearts).
(d) any outcome, however unlikely, is possible.
• The size of the population determines whether you deal with numbers having one,
two, three, or more digits.
• For example, if you were attempting to take a random sample from a population
consisting of 679 students at some college, you could use the 1000 three-digit
numbers ranging from 000 to 999
• Then as each row is used up, shift down to the start of the next row and repeat the
entire process.
• As a given number between 001 and 720 is encountered, the person identified with
that number is included in the random sample
• Describe how you would use the table of random numbers to take
(a) a random sample of five statistics students in a classroom where each of nine rows
consists of nine seats.
(b) A random sample of size 40 from a large directory consisting of 3041 pages, with 480
lines per page
Dr S Palaniappan , Associate Professor , Department of ADS 15
• Many pollsters use random digit dialing in an effort to give each telephone number
—whether landline or wireless—
• In the United States an equal chance of being called for an interview.
• Essentially, the first six digits of a 10-digit phone number, including the area code,
are randomly selected from tens of thousands of telephone exchanges, while the final
four digits are taken directly from random numbers.
• a flip of a coin can decide whether that subject should be assigned to the
treatment group (if heads turns up) or the control group (if tails turn up).
Dr S Palaniappan , Associate Professor , Department of ADS 18
• Creating Equal Groups
(a) Formulate an acceptable rule for single-digit random numbers. Incorporate into this rule a
procedure that will ensure equal numbers of subjects in the two groups. Check your answer in
Appendix B before proceeding.
(b) Reading from left to right in the top row of the random number page (Table H, Appendix
C), use the random digits of each random number in conjunction with your assignment rule to
determine whether the first subject is A or B, and so forth. List the assignment for each
subject.
Dr S Palaniappan , Associate Professor , Department of ADS 20
PROBABILITY
• The probability that it will rain this weekend (20 percent, or one in five) or that
you’ll win a state lottery (one in many millions, unfortunately).
• Coin is tossed, and therefore, the probability of heads should equal .50, or 1/2.
• What is the probability of getting a sum of 9 when two dice are thrown?
• The probability that a man, X, will stand 73 or more inches tall, symbolized as Pr(X
≥ 73),
• equals the sum of the probabilities in Table that a man will stand 73 or 74 or 75 or 76
inches or taller, that is,
Pr( x >= 73) = Pr(73)+Pr(74)+Pr(75)+Pr(76 )
= .05 +.03 +.02 +.02 = .12
• Probabilities are added - in order to find the probability that any one of these events
will occur
• In a university of 300 students, 170 are boys and 130 are girls. On a workshop
conducted by the university, 40 boys and 50 girls attended the workshop. If a
student is chosen at random from the university, what is the probability of
choosing a girl or an student who attended the workshop?
• What is the probability that two randomly selected men will be at least 73
inches tall?
• “What is the probability that the first man will stand at least 73 inches tall and
that the second man will stand at least 73 inches tall?”
Pr(X1 >= 73 and X2 >= 73) = [Pr(X1) >= 73][Pr(X2 ) >=73] = (.12)
(.12) .0144
(a) the husband being born during January and the wife being born during
February?
(b) both husband and wife being born during December?
(c) both husband and wife being born during the spring (April or May)? (Hint:
First, find the probability of just one person being born during April or May.)
• A math teacher gave her class two tests. 25% of the class passed both tests and
42% of the class passed the first test. What percent of those who passed the
first test also passed the second test?
• P(A ∪ B) / P(A)
• The probability that you will earn a grade of A in a course, given that you
have already gotten an A on the midterm,
• The probability that you’ll graduate from college, given that you’ve already
completed the first two years
• “What is the conditional probability of being drunk, given that the driver takes illegal
drugs?”
Dr S Palaniappan , Associate Professor , Department of ADS 35
• “What is the conditional probability of being drunk, given that the driver takes illegal
drugs?”
• Referring to Figure 8.1, divide the number of drivers who are drunk and take drugs,
12, by the number of drivers who take drugs, 20, that is, 12/20 = .60. (This
conditional probability of .60, given drivers who take drugs
• is twice that of .30, given drunk drivers, because the fixed number of drivers who are
drunk and take drugs, 12, represents proportionately more (.60) among the relatively
small number of drivers who take drugs, 20, and proportionately less (.30) among
the relatively large number of drunk drivers, 40.)
(a) What is the probability of randomly selecting a couple who described their relationship as improved?
(b) What is the probability of randomly selecting a couple with children?
(c) What is the conditional probability of randomly selecting a couple with children, given that their
relationship was described as improved?
(d) What is the conditional probability of randomly selecting a couple without children, given that their
relationship was described as not improved?
(e) What is the conditional probability of an improved relationship, given that a couple has children?
• Common Outcomes
• Rare Outcomes
• The sampling distribution of a statistic is the distribution of all possible values taken by the
statistic when all possible samples of a fixed size n are taken from the population. It is a
theoretical idea—we do not actually build it.
• For example suppose you sample 50 students from your class regarding their mean GPA. If you
obtained many different samples of 50 , you would compute a different mean for each sample.
We are interested in the distribution of the mean GPA from all possible samples of 50 students.
• Then calculate the mean value of each of these samples sizes 16 sample means
(b) Construct a relative frequency table showing the sampling distribution of the mean.
• Standard error of the mean decreases as the sample size increases from the given
formula(square root (n))
Dr S Palaniappan , Associate Professor , Department of ADS 51
Sampling Distribution of the Mean: If the
population is Normal
• If the population is normal with mean μ and standard deviation σ, the sampling
distribution of X̄ is also normally distributed
(a) roughly measures the average amount by which sample means deviate from the
population mean.
(b) measures variability in a particular sample.
(c) increases in value with larger sample sizes.
(d) equals 5, given that σ = 40 and n = 64.
• Sample means from the population will be approximately normal as long as sample
size is large enough
57
Dr S Palaniappan , Associate Professor , Department of ADS
Definitions
• Hypothesis testing is an act in statistics whereby an analyst tests an assumption
regarding a population parameter.
• Hypothesis testing is a form of statistical inference that uses data from a
sample to draw conclusions about a population parameter or a population
probability distribution
• An investigator at a university wishes to test the claim that, on the average, the SAT
math scores for local freshmen equals the national average of 500
• Assume that it is not possible to obtain scores for the entire freshman class.
• Instead, SAT math scores are obtained for a random sample of 100 students from the
local population of freshmen, and the mean score for this sample equals 533.
• The null hypothesis (H0) is a statistical hypothesis that usually asserts that
nothing special is happening with respect to some characteristic of the
underlying population.
• The alternative hypothesis (H1) asserts the opposite of the null hypothesis.
Retain or Reject
• Rare Outcomes
(a) z = 1.74
(b) z = 0.13
(c) z = –2.51