
★Data: the collection, or set, of individual values associated with a variable.
★Variable: a characteristic of an item or individual. Ex: name, height, weight, eye color,
marital status, adjusted gross income, and place of residence.
★Descriptive Statistics refers to methods that primarily help summarize and present data.
Counting physical objects in a kindergarten class may have been the first time you used a
descriptive method.
★Inferential Statistics refers to methods that use data collected from a small group to reach
conclusions about a larger group.
★Big Data: big data is a collection of data that cannot be easily browsed or analyzed
using traditional methods.
Chap 1
★Defining and classifying variables by type: supply an operational definition, a
universally accepted meaning that is clear to all associated with an analysis. Operational
definitions should also classify the variable by type.
★Classifying variables by type: classify the variable as being either categorical or
numerical. Categorical variables (also known as qualitative variables) take categories as
their values, such as yes or no, and are either ordinal (ranked) or nominal (unranked).
Numerical variables (also known as quantitative variables) have values that represent a
counted or measured quantity. Numerical variables are further classified as discrete or
continuous.
Discrete variables such as “number of items purchased” or “total amount paid” have
numerical values that arise from a counting process.
Continuous variables such as “time spent on checkout line” or “distance from home to
store” have numerical values that arise from a measuring process, and those values depend
on the precision of the measuring instrument used.
Examples (question, response, variable type):
• Do you have a Facebook profile? ❑ Yes ❑ No: categorical
• How many text messages have you sent in the past three days? ____: numerical (discrete)
• How long did the mobile app update take to download? ____ seconds: numerical (continuous)
★Categorical Variable: summary table, contingency table, bar chart, pie chart, Pareto
chart, side-by-side bar chart. Measurement scales: ordinal (ranked, with no fixed interval
between ranks) or nominal.
★Numerical Variable: ordered array, frequency distribution, relative frequency distribution,
percentage distribution, cumulative percentage distribution, stem-and-leaf display,
histogram, polygon, cumulative percentage polygon, mean, median, mode, quartiles, range,
interquartile range, standard deviation, variance, coefficient of variation, skewness,
kurtosis, boxplot, normal probability plot (QQ plot). Measurement scales: interval (e.g.,
temperature, where values can be added or subtracted but not meaningfully multiplied or
divided) or ratio.
★Collecting Data and Data Sources: data collection consists of identifying data sources,
deciding whether the data you collect will be from a population or a sample, cleaning your
data, and sometimes recoding variables. You are using a primary data source if you collect
your own data for analysis, and a secondary data source if the data for your analysis have
been collected by someone else.
You collect data by using any of the following:
• Data distributed by an organization or individual
• The outcomes of a designed experiment
• The responses from a survey
• The results of conducting an observational study
• Data collected by ongoing business activities
★Population & Samples: a population consists of all the items or individuals about which
you want to reach conclusions. Ex: sales transactions, enrolled students.
A sample is a portion of a population selected for analysis. The results of analyzing a sample
are used to estimate characteristics of the entire population. Ex: a sample of 50 students.
Collect data from a sample when any of the following applies:
• Selecting a sample is less time consuming than selecting every item in the population.
• Selecting a sample is less costly than selecting every item in the population.
• Analyzing a sample is less cumbersome and more practical than analyzing the entire
population.
★Data cleaning: if you spot an irregularity in the data you have collected, you may have to
“clean” the data.
★Recoding variables: define a recoded variable that supplements or replaces the
original variable in your analysis. When recoding variables, be sure that the category
definitions cause each data value to be placed in one and only one category, a property
known as being mutually exclusive. Also be sure that the set of categories you create for
the new, recoded variables includes all the data values being recoded, a property known as
being collectively exhaustive.
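A minimal sketch of a recoding rule that is both mutually exclusive and collectively exhaustive. The age boundaries and category labels are hypothetical and chosen only for illustration; they do not come from these notes.

```python
def recode_age(age):
    """Recode a numerical age into one of three hypothetical categories."""
    # The boundaries do not overlap, so each age lands in exactly one
    # category (mutually exclusive). The final branch catches every
    # remaining value, so no age is left out (collectively exhaustive).
    if age < 18:
        return "under 18"
    elif age < 65:
        return "18 to 64"
    else:
        return "65 and over"


ages = [12, 18, 40, 64, 65, 90]          # hypothetical data values
recoded = [recode_age(a) for a in ages]
print(recoded)
```

Using an if/elif/else chain makes both properties easy to check by eye: only one branch can run, and the else branch guarantees coverage.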
★Types of Sampling methods
★In a nonprobability sample, you select the items or individuals without knowing their
probabilities of selection. Nonprobability samples can have certain advantages, such as
convenience, speed, and low cost. Such samples are typically used to obtain informal
approximations or as small-scale initial or pilot analyses.
★A nonprobability sample can be either a convenience sample or a judgment sample.
★To collect a convenience sample, select items that are easy, inexpensive, or convenient
to sample. For example, in a warehouse of stacked items, you might sample only the items
that are easiest to reach.
★In a judgment sample, you collect the opinions of preselected experts in the subject matter.
Although the experts may be well informed, you cannot generalize their results to the
population.
★In a probability sample, you select items based on known probabilities.
★In a simple random sample, every item from a frame has the same chance of selection as
every other item. Sampling with replacement means that after you select an item, you return
it to the frame, where it has the same probability of being selected again. Sampling without
replacement means that once you select an item, you cannot select it again.
★In a systematic sample, you partition the N items in the frame into n groups of k items,
where k = N/n, rounded to the nearest integer. To select a systematic sample, you
choose the first item to be selected at random from the first k items in the frame. Then, you
select the remaining n - 1 items by taking every kth item thereafter from the entire
frame. Ex: to take a systematic sample of n = 40 from the population of N = 800 full-time
employees, you partition the frame of 800 into 40 groups, each of which contains 20
employees.
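The systematic sampling steps can be sketched in a few lines. This is an illustration under assumptions: the frame is a Python list, n is no larger than N, and the function name `systematic_sample` and the employee IDs are hypothetical. Note that Python's `round` sends exact halves to the nearest even integer.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Return a systematic sample of up to n items from frame (a list)."""
    rng = random.Random(seed)
    N = len(frame)
    k = round(N / n)              # group size k = N/n, rounded to an integer
    start = rng.randrange(k)      # random position among the first k items
    # Take every kth item from the random start; the slice caps the size at n.
    # If k was rounded up, fewer than n items may be returned.
    return frame[start::k][:n]


employees = list(range(1, 801))   # hypothetical frame of N = 800 employee IDs
sample = systematic_sample(employees, n=40, seed=1)
print(len(sample), sample[:3])
```

With N = 800 and n = 40, k is 20, so consecutive selected IDs are always 20 apart.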
★In a stratified sample, you first subdivide the N items in the frame into separate
subpopulations, or strata. A stratum is defined by some common characteristic, such as
gender or year in school.
★In a cluster sample, you divide the N items in the frame into clusters that contain several
items. Clusters are often naturally occurring groups, such as counties or election districts.
Chap 2
★For Two Numerical Variables: scatter plot, time-series plot
★Summary Tables: A summary table tallies the values as frequencies or percentages for
each category. A summary table helps you see the differences among the categories by
displaying the frequency, amount, or percentage of items in a set of categories in a separate
column.
★Contingency table: A contingency table cross-tabulates, or tallies jointly, the values of
two or more categorical variables, allowing you to study patterns that may exist between the
variables. Tallies can be shown as a frequency, a percentage of the overall total, a
percentage of the row total, or a percentage of the column total.
★The ordered array arranges the values of a numerical variable in rank order, from
the smallest value to the largest value. An ordered array helps you get a better sense of the
range of values in your data and is particularly useful when you have more than a few values.
For example, financial analysts reviewing travel costs might ask whether meal costs at city
restaurants differ from meal costs at suburban restaurants.
★The frequency distribution tallies the values of a numerical variable into a set of
numerically ordered classes. Interval width = (highest value - lowest value) / number of
classes. Each class groups a mutually exclusive range of values, called a class
interval. Each value can be assigned to only one class, and every value must be contained
in one of the class intervals.
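A rough sketch of building a frequency distribution with equal-width classes, using the interval-width formula above. The meal costs are made-up values, and the choice to place the largest value in the last class is an assumption of this sketch (in practice class boundaries are usually rounded to convenient numbers).

```python
def frequency_distribution(values, num_classes):
    """Tally values into equal-width classes; assumes max(values) > min(values)."""
    low, high = min(values), max(values)
    width = (high - low) / num_classes      # interval width
    counts = [0] * num_classes
    for v in values:
        # Each value is assigned to exactly one class; the maximum value
        # is placed in the last class instead of opening a new one.
        idx = min(int((v - low) / width), num_classes - 1)
        counts[idx] += 1
    classes = [(low + i * width, low + (i + 1) * width) for i in range(num_classes)]
    return list(zip(classes, counts))


meal_costs = [22, 25, 31, 38, 40, 44, 47, 52, 58, 62]   # hypothetical data
table = frequency_distribution(meal_costs, num_classes=4)
for (lower, upper), count in table:
    print(f"{lower:.0f} to {upper:.0f}: {count}")
```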
★The relative frequency distribution presents the relative frequency, or proportion, of the
total for each group that each class represents:
proportion = relative frequency = number of values in each class / total number of values
★The percentage distribution presents the percentage of the total for each group that
each class represents:
percentage = (number of values in each class / total number of values) × 100%
★The Cumulative distribution: The cumulative percentage distribution provides a way of
presenting information about the percentage of values that are less than a specific amount.
Use a percentage distribution as the basis to construct a cumulative percentage distribution.
For example, you might want to know what percentage of the city restaurant meals
cost less than $40 or what percentage cost less than $50.
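As a small illustration, the relative frequency, percentage, and cumulative percentage distributions can all be derived from the class frequencies. The counts below are hypothetical.

```python
counts = [3, 2, 2, 3]                 # hypothetical class frequencies
total = sum(counts)

relative = [c / total for c in counts]            # proportion per class
percentage = [100 * c / total for c in counts]    # proportion x 100%

# Running total of the percentages gives the cumulative percentage distribution.
cumulative = []
running = 0.0
for p in percentage:
    running += p
    cumulative.append(running)

print(relative, percentage, cumulative)
```

The last cumulative value should always come out to 100%, which is a quick sanity check on the tallies.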
★Visualizing categorical variables (bar chart, pie chart, Pareto chart)
A bar chart is a series of bars, with each bar representing a single category.
A pie chart uses parts of a circle to represent the tallies of each category.
In a Pareto chart, the tallies for each category are plotted as vertical bars in descending
order, according to their frequencies, and are combined with a cumulative percentage
line on the same chart.
The Pareto principle is the observation that in many data sets, a few categories of a
categorical variable represent the majority of the data, while many other categories
represent a relatively small, or trivial, amount of the data. Pareto charts help you to
visually identify the “vital few” categories from the “trivial many” categories so that
you can focus on the important categories.
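The numbers behind a Pareto chart can be sketched without any plotting: sort the categories by descending frequency and accumulate the percentages. The complaint categories are invented for illustration, and `pareto_table` is a hypothetical helper name.

```python
from collections import Counter


def pareto_table(observations):
    """Return (category, frequency, cumulative %) rows in descending frequency order."""
    counts = Counter(observations)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category, freq in counts.most_common():   # highest frequency first
        cumulative += 100 * freq / total
        rows.append((category, freq, round(cumulative, 1)))
    return rows


# Hypothetical tallies of customer complaints.
complaints = ["late"] * 6 + ["damaged"] * 3 + ["wrong item"] * 1
rows = pareto_table(complaints)
for row in rows:
    print(row)
```

Here the first category alone accounts for 60% of the complaints, which is the kind of "vital few" pattern a Pareto chart is meant to expose.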
★The Stem-and-Leaf Display: A stem-and-leaf display visualizes data by presenting the
data as one or more row-wise stems that represent a range of values. In turn, each stem has
one or more leaves that branch out to the right of their stem and represent the values
found in that stem. For stems with more than one leaf, the leaves are arranged in
ascending order. The display shows how the data are distributed and where concentrations
of data exist.
★A histogram visualizes data as a vertical bar chart in which each bar represents a class
interval from a frequency or percentage distribution.
Chap 3
Central tendency: Most variables show a distinct tendency to group around a central value.
When people talk about an “average value” or the “middle value” or the “most frequent
value,” they are talking informally about the mean, median, and mode, three measures of
central tendency.
Mean: The arithmetic mean (typically referred to as the mean) is the most common measure
of central tendency. The mean can suggest a typical or central value and serves as a
“balance point” in a set of data. The sample mean is the sum of the values in a sample
divided by the number of values in the sample:
$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
Median The median is the middle value in an ordered array of data that has been ranked
from smallest to largest.
Mode The mode is the value that appears most frequently. Like the median and unlike the
mean, extreme values do not affect the mode.
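A quick illustration of the three measures using Python's standard `statistics` module. The data values are made up; the second list replaces the largest value with an extreme one to show that the mean moves while the median and mode do not.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]            # hypothetical sample
with_outlier = [2, 3, 3, 5, 7, 100]     # same sample with one extreme value

mean = statistics.mean(values)           # sum of values / number of values
median = statistics.median(values)       # middle of the ordered array
mode = statistics.mode(values)           # most frequent value

mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)
mode_out = statistics.mode(with_outlier)

print(mean, median, mode)
print(mean_out, median_out, mode_out)
```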
Variation and Shape: A variable can be characterized by its variation and shape. Variation
measures the spread, or dispersion, of the values.
The Range: One simple measure of variation is the range, the difference between the largest
and smallest values.
Variance and Standard Deviation: Two commonly used measures of variation that account
for how all the values are distributed are the variance and the standard deviation.
The sample variance ($S^2$) is the sum of the squared differences around the sample mean
divided by the sample size minus 1:
$S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
The sample standard deviation (S) is the square root of the sample variance: $S = \sqrt{S^2}$.
The Coefficient of Variation: The coefficient of variation (CV) measures the scatter in the
data relative to the mean. It is equal to the standard deviation divided by the mean,
multiplied by 100%: $CV = \left(\dfrac{S}{\bar{X}}\right) \times 100\%$.
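The variance, standard deviation, and coefficient of variation can be computed directly from their definitions. This is a sketch on made-up data; the function name is hypothetical.

```python
import math


def sample_variance(values):
    """Sum of squared differences around the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


values = [2, 4, 4, 6, 9]                  # hypothetical sample, mean = 5
s2 = sample_variance(values)              # (9 + 1 + 1 + 1 + 16) / 4
s = math.sqrt(s2)                         # sample standard deviation
cv = s / (sum(values) / len(values)) * 100   # coefficient of variation, in percent

print(s2, round(s, 4), round(cv, 2))
```

The standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they can serve as a cross-check.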
Shape: Skewness measures the extent to which the data values are not symmetrical
around the mean.
The three possibilities are:
• Mean < median: negative, or left-skewed distribution
• Mean = median: symmetrical distribution (zero skewness)
• Mean > median: positive, or right-skewed distribution
Kurtosis measures the peakedness of the curve of the distribution, that is, how sharply the
curve rises approaching the center of the distribution. Kurtosis compares the shape of the
peak to the shape of the peak of a bell-shaped normal distribution.
• A distribution with a sharper-rising center peak than the peak of a normal distribution
has positive kurtosis, a kurtosis value that is greater than zero, and is called leptokurtic.
• A distribution with a slower-rising (flatter) center peak than the peak of a normal
distribution has negative kurtosis, a kurtosis value that is less than zero, and is called
platykurtic.
• A normal distribution has a kurtosis value that is equal to zero and is called mesokurtic.
Quartiles: Quartiles split the values into four equal parts. The first quartile, Q1, divides
the smallest 25.0% of the values from the other 75.0% that are larger. The second quartile,
Q2, is the median; 50.0% of the values are smaller than or equal to the median, and 50.0% are
larger than or equal to the median. The third quartile, Q3, divides the smallest 75.0% of the
values from the largest 25.0%.
Percentiles: Related to quartiles, percentiles split a variable into 100 equal parts. Q1 is
the 25th percentile, Q2 is the 50th percentile, and Q3 is the 75th percentile.
The interquartile range: Q3 - Q1.
The Five-Number Summary: X smallest, Q1, median, Q3, X largest.
The Boxplot The boxplot uses a five-number summary to visualize the shape of the
distribution for a variable.
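A sketch of the quartiles, interquartile range, and five-number summary using `statistics.quantiles`. The data are hypothetical. The `"exclusive"` method locates quartiles at the (n + 1)p ranked position and interpolates between neighbors, which may differ slightly from the rounding rules a particular textbook uses, so treat the exact values as one convention among several.

```python
import statistics

values = sorted([10, 12, 15, 18, 20, 22, 25, 28, 30, 35, 40])   # hypothetical data

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")

iqr = q3 - q1                                             # interquartile range
five_number_summary = (values[0], q1, q2, q3, values[-1])

print(five_number_summary, iqr)
```

With 11 values, the quartile positions (3rd, 6th, and 9th ranked values) are whole numbers, so no interpolation is needed in this example.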
The Population Mean The population mean is the sum of the values in the population
divided by the population size, N.
The population variance and standard deviation
The population variance is the sum of the squared differences around the population mean
divided by the population size, N, and the population standard deviation is the square root
of the population variance.
The empirical rule states that for population data that form a normal distribution, the
following are true:
• Approximately 68% of the values are within ±1 standard deviation from the mean.
• Approximately 95% of the values are within ±2 standard deviations from the mean.
• Approximately 99.7% of the values are within ±3 standard deviations from the mean.
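The three percentages can be checked against the standardized normal distribution with the standard library's `NormalDist`. This is only a numerical check of the rule, not part of the original notes.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


p1, p2, p3 = within(1), within(2), within(3)
print(round(p1, 4), round(p2, 4), round(p3, 4))
```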
The Covariance and the Coefficient of Correlation: two measures of the relationship
between two numerical variables are the covariance and the coefficient of correlation.
The covariance measures the strength of the linear relationship between two numerical
variables (X and Y).
The coefficient of correlation measures the relative strength of a linear relationship
between two numerical variables. Its values range from -1 (perfect negative correlation)
to +1 (perfect positive correlation).
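A minimal sketch of the sample covariance and the coefficient of correlation computed from their definitions. The X and Y values are hypothetical and chosen so that Y is exactly 2X, which should give a correlation of +1.

```python
import math


def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # cov(X, X) is the variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


xs = [1, 2, 3, 4, 5]        # hypothetical X values
ys = [2, 4, 6, 8, 10]       # hypothetical Y values (a perfect linear relationship)
cov = sample_covariance(xs, ys)
r = correlation(xs, ys)
print(cov, r)
```

The covariance depends on the units of X and Y, while the correlation is unit-free, which is why the correlation is the easier of the two to interpret.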

Relation of scatter plot with correlation coefficient


Chap 6
Continuous probability distributions
A probability density function is a mathematical expression that defines the distribution of
the values for a continuous variable.
The normal distribution is symmetrical and bell-shaped, implying that most observed values
tend to cluster around the mean, which, due to the distribution’s symmetrical shape, is equal
to the median.
In a uniform distribution, the values are equally distributed in the range between the
smallest value and the largest value.
The exponential distribution is skewed to the right, making the mean larger than the median.
The range for an exponential distribution is zero to positive infinity, but the distribution’s
shape makes it unlikely that extremely large values will occur.

Theoretical Properties of Normal Distribution


The normal distribution is vitally important in statistics for three main reasons:
• Numerous continuous variables common in business have distributions that closely
resemble the normal distribution.
• The normal distribution can be used to approximate various discrete probability
distributions.
• The normal distribution provides the basis for classical statistical inference because of
its relationship to the Central Limit Theorem.
theoretical properties:
• It is symmetrical, and its mean and median are therefore equal.
• It is bell-shaped in appearance.
• Its interquartile range is equal to 1.33 standard deviations. Thus, the middle 50% of the
values are contained within an interval of two-thirds of a standard deviation below
the mean and two-thirds of a standard deviation above the mean.
• It has an infinite range (-∞ < X < +∞).
Computing Normal probabilities

Z transformation formula: the Z value is equal to the difference between X and the mean, μ,
divided by the standard deviation, σ:
$Z = \dfrac{X - \mu}{\sigma}$
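A short sketch of the Z transformation followed by a cumulative probability lookup. The mean, standard deviation, and X value are hypothetical; `NormalDist().cdf` plays the role of a cumulative standardized normal table.

```python
from statistics import NormalDist


def z_value(x, mu, sigma):
    """Standardize X: distance from the mean in units of standard deviations."""
    return (x - mu) / sigma


mu, sigma = 7, 2                       # hypothetical population mean and std. dev.
z = z_value(9, mu, sigma)              # X = 9 is one standard deviation above the mean
p_less = NormalDist().cdf(z)           # P(Z < z) from the standardized normal

# The same probability can be obtained without standardizing first.
p_direct = NormalDist(mu, sigma).cdf(9)
print(z, round(p_less, 4), round(p_direct, 4))
```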

The cumulative standardized normal distribution Excel


Constructing the normal probability plot: A normal probability plot is a visual display that
helps you evaluate whether the data are normally distributed. One common plot is called the
quantile–quantile plot, or QQ plot.
Chap 7
Sampling Distributions A sampling distribution is the distribution of the results if you
actually selected all possible samples. The single result you obtain in practice is just one of
the results in the sampling distribution.
Sampling Distribution of the Mean The sampling distribution of the mean is the distribution
of all possible sample means if you select all possible samples of a given size.
The sample mean is unbiased because the mean of all the possible sample means (of
a given sample size, n) is equal to the population mean, μ.
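Standard Error of the Mean (Sampling from Normally Distributed Populations)
The standard deviation of all possible sample means, called the standard error
of the mean, is the population standard deviation divided by the square root of the sample
size:
$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
Finding Z for the Sampling Distribution of the Mean
$Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
Finding $\bar{X}$ for the Sampling Distribution of the Mean
$\bar{X} = \mu + Z \dfrac{\sigma}{\sqrt{n}}$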
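As a sketch, the standard error of the mean is the population standard deviation divided by the square root of n, the Z value for a sample mean divides its distance from the population mean by that standard error, and the relationship can be inverted to find the sample mean for a given Z. The population values below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(sigma, n):
    """Standard deviation of all possible sample means of size n."""
    return sigma / sqrt(n)


mu, sigma, n = 368, 15, 25            # hypothetical population mean, std. dev., sample size
se = standard_error(sigma, n)         # 15 / 5

# Finding Z for a given sample mean.
x_bar = 365
z = (x_bar - mu) / se

# Probability that a sample mean falls below 365 (normal population assumed).
p_below = NormalDist().cdf(z)

# Finding the sample mean that corresponds to a given Z.
x_bar_from_z = mu + z * se
print(se, z, round(p_below, 4), x_bar_from_z)
```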

The Central limit Theorem (IMPORTANT) As the sample size (the number of values in
each sample) gets large enough, the sampling distribution of the mean is approximately
normally distributed. This is true regardless of the shape of the distribution of the individual
values in the population.
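The theorem can be illustrated with a small simulation. This sketch draws many samples from a right-skewed exponential population with mean 1 and standard deviation 1, then looks at the sample means. The sample size of 30, the 2,000 repetitions, and the seed are arbitrary choices made for illustration.

```python
import random
import statistics
from math import sqrt

rng = random.Random(42)               # fixed seed so the run is repeatable


def sample_means(sample_size, num_samples):
    """Means of repeated samples from an exponential population with mean 1."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]


means = sample_means(sample_size=30, num_samples=2000)

center = statistics.fmean(means)      # should be close to the population mean, 1
spread = statistics.stdev(means)      # should be close to sigma / sqrt(n)
print(round(center, 3), round(spread, 3), round(1 / sqrt(30), 3))
```

Even though individual exponential values are strongly skewed, the sample means cluster symmetrically around 1 with a spread near the standard error of the mean.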
Sampling Distribution of the proportion
If repeated random samples of a given size n are taken from a population of values for a
categorical variable, where the proportion in the category of interest is p, then the mean of all
sample proportions (p-hat) is the population proportion (p).
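Sample proportion (X = number of items in the category of interest, n = sample size)
$\hat{p} = \dfrac{X}{n}$
Standard Error of the proportion
$\sigma_{\hat{p}} = \sqrt{\dfrac{p(1 - p)}{n}}$
Finding Z for the Sampling Distribution of the proportion
$Z = \dfrac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}}$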
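A sketch of the proportion calculations, using p for the population proportion and p-hat for the sample proportion as in the text above. The standard error used is the usual square root of p(1 - p)/n, and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

p = 0.40                  # hypothetical population proportion
n = 200                   # sample size
x = 90                    # hypothetical count of items in the category of interest

p_hat = x / n                               # sample proportion
se = sqrt(p * (1 - p) / n)                  # standard error of the proportion
z = (p_hat - p) / se                        # Z for the observed sample proportion

# Probability of a sample proportion below 0.45, using the normal approximation.
p_below = NormalDist().cdf(z)
print(p_hat, round(se, 4), round(z, 4), round(p_below, 4))
```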


Chap 8
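Confidence Interval Estimate for the Mean (σ Known)
$\bar{X} \pm Z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$
Finding the one-sample Z value for a 95% confidence interval estimate: α = 0.05, so
$Z_{\alpha/2} = 1.96$.
The sampling error is the half-width of the interval, $Z_{\alpha/2}\,\sigma/\sqrt{n}$; the level
of confidence is (1 - α) × 100%.
The Concept of Degrees of Freedom
Confidence Interval for the Mean (σ Unknown)
$\bar{X} \pm t_{\alpha/2} \dfrac{S}{\sqrt{n}}$, where $t_{\alpha/2}$ is the critical value of the
t distribution with n - 1 degrees of freedom.
Sample Size Determination for the Mean
$n = \dfrac{Z_{\alpha/2}^{2}\,\sigma^{2}}{e^{2}}$, where e is the acceptable sampling error.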
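A sketch of the σ-known confidence interval (sample mean plus or minus Z times σ over the square root of n) and of the sample size needed for a chosen sampling error e. The inputs are hypothetical, and rounding the sample size up to the next whole number is the usual convention assumed here.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * sigma / sqrt(n)                         # sampling error
    return x_bar - margin, x_bar + margin


def sample_size_for_mean(sigma, e, confidence=0.95):
    """Smallest n whose sampling error does not exceed e."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / e) ** 2)


lower, upper = z_interval(x_bar=362.3, sigma=15, n=25)   # hypothetical sample
needed_n = sample_size_for_mean(sigma=15, e=5)
print(round(lower, 2), round(upper, 2), needed_n)
```

The σ-unknown interval has the same shape but uses a t critical value; the standard library has no t distribution, so that case is left out of this sketch.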

Chap 9
★The Null Hypothesis: the parameter is equal to what it should be, e.g., H0: μ = 2
★The Alternative Hypothesis: the parameter is not what it should be, e.g., H1: μ ≠ 2
The null hypothesis, H0, represents the current belief in a situation.
The alternative hypothesis, H1, is the opposite of the null hypothesis and represents a
research claim or specific inference you would like to prove.
The null hypothesis, H0, always refers to a specified value of the population parameter
(such as μ), not a sample statistic (such as X̄).
★The Critical Value of the Test Statistic In hypothesis testing, a critical value is a point on
the test distribution that is compared to the test statistic to determine whether to reject the
null hypothesis.
★Regions of Rejection and Non-Rejection The sampling distribution of the test statistic is
divided into two regions, a region of rejection (sometimes called the critical region) and
a region of nonrejection
★TYPE I and TYPE II ERRORS
A Type I error occurs if you reject the null hypothesis, H0, when it is true and should not be
rejected. A Type I error is a “false alarm.” The probability of a Type I error occurring is α.
A Type II error occurs if you do not reject the null hypothesis, H0, when it is false
and should be rejected. A Type II error represents a “missed opportunity” to take some
corrective action. The probability of a Type II error occurring is β.
★Probability of Type I And Type II Errors The level of significance α of a statistical test is
the probability of committing a Type I error. The β risk is the probability of committing a
Type II error.
The complement of the probability of a Type I error, (1 - α), is called the confidence
coefficient. The confidence coefficient is the probability that you will not reject the null
hypothesis, H0, when it is true and should not be rejected.
The complement of the probability of a Type II error, (1 -β) , is called the power of a
statistical test. The power of a statistical test is the probability that you will reject the null
hypothesis when it is false and should be rejected.

★Complements of Type I and Type II errors: the complement of the probability of a Type I
error is (1 - α), the confidence coefficient; the complement of the probability of a Type II
error is (1 - β), the power of the test.
★Z Test for the mean (σ Known) When the standard deviation, σ, is known (which rarely
occurs), you use the Z test for the mean if the population is normally distributed:
$Z_{STAT} = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

★Hypothesis Testing using the Critical Value Approach The critical value approach
compares the value of the computed ZSTAT test statistic to critical values that
divide the normal distribution into regions of rejection and nonrejection. The critical values
are expressed as standardized Z values that are determined by the level of significance.
★Hypothesis Testing using the p-Value Approach The p-value is the probability of
getting a test statistic equal to or more extreme than the sample result, given that the null
hypothesis, H0, is true. The p-value is also known as the observed level of significance.
Using the p-value to determine rejection and nonrejection is another approach to hypothesis
testing.
The decision rules for rejecting H0 in the p-value approach are
• If the p-value is greater than or equal to α, do not reject the null hypothesis.
• If the p-value is less than α, reject the null hypothesis.
If the p-value is low, then H0 must go.
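Both decision approaches can be sketched for a two-tail Z test of the mean with σ known. The test statistic is the sample mean minus the hypothesized mean, divided by σ over the square root of n. The hypothesized mean, sample values, and α are hypothetical, and `z_test_mean` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(x_bar, mu0, sigma, n, alpha=0.05):
    """Two-tail Z test for H0: mu = mu0 against H1: mu != mu0 (sigma known)."""
    std_normal = NormalDist()
    z_stat = (x_bar - mu0) / (sigma / sqrt(n))
    # Critical value approach: reject if Z_STAT falls in either rejection region.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    reject_by_critical = abs(z_stat) > z_crit
    # p-value approach: probability of a result at least this extreme under H0.
    p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))
    reject_by_p = p_value < alpha
    return z_stat, z_crit, p_value, reject_by_critical, reject_by_p


z_stat, z_crit, p_value, by_crit, by_p = z_test_mean(x_bar=372.5, mu0=368, sigma=15, n=25)
print(z_stat, round(z_crit, 2), round(p_value, 4), by_crit, by_p)
```

The two approaches always reach the same decision; here the p-value is larger than α, so the null hypothesis is not rejected.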
