Day 2 - Data Management - Statistics
Data
Analyst Validations
“Poor quality customer data – just customer data alone – costs U.S.
businesses over $600 billion a year.”
Data Management
• Master Data Management
• Data Quality
• Data Integration
• Data Governance
• Ordinal/Rank: ordered, but the intervals between ranks are not equal (e.g., Likert scale)
• Categorical: names or labels
What type of statistical test do I
want to do?
Continuous Data (Equal increments)
Qualitative Quantitative
Data Preparation
Selection of Analytics Method/Models
Validation of Results
Tools for Analytics
Commercial Tools
• MS Excel
• SAS
• SPSS
• KXEN
• MATLAB
• Angoss
• Statistica
Open Source
• R
• Weka
• Python
• Zeppelin
Popular Tools
• R
• Python
• SAS
• SPSS
• MATLAB
• KXEN
• Zeppelin
Statistics Introduction
● Statistics is the science of collecting, organizing, interpreting
and visualizing data.
● We draw meaningful conclusions from the data we have by
applying statistical methods.
Statistics
Inferential Descriptive
Inferential Statistics
Inferential statistics use a random sample of data taken from a population to
describe and make inferences about the population. Inferential statistics are
valuable when examination of each member of an entire population is not
convenient or possible. For example, measuring the diameter of every nail
manufactured in a mill is impractical. Instead, you can measure the diameters of a
representative random sample of nails and use the information from the
sample to make generalizations about the diameters of all of the nails.
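The nail example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of diameters is simulated (the 3.0 mm mean, the 0.05 mm spread, and the sample size of 200 are hypothetical values, not figures from the slides), and the interval uses a simple normal approximation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: diameters (mm) of 100,000 nails from the mill.
population = [random.gauss(3.0, 0.05) for _ in range(100_000)]

# Measuring every nail is impractical, so measure a random sample instead.
sample = random.sample(population, 200)

# Use the sample to estimate the population mean.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)             # sample standard deviation
standard_error = sample_sd / len(sample) ** 0.5  # uncertainty of the estimate

# Approximate 95% interval for the population mean (normal approximation).
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

# Only possible here because the population is simulated.
population_mean = statistics.mean(population)

print(f"Sample mean:      {sample_mean:.4f} mm")
print(f"Approx. 95% CI:   {ci_low:.4f} to {ci_high:.4f} mm")
print(f"Population mean:  {population_mean:.4f} mm")
```

In practice the population mean is unknown; it is printed here only to show how close an estimate from 200 nails can come to the value for all 100,000.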
Descriptive Statistics
We use descriptive statistics simply to describe the data: what is there and
what it shows, without drawing conclusions beyond the data at hand.
1. The mean and median can only be used with numerical data. The mode can
be used with both numerical and nominal data, or data in the form of names
or labels.
2. Eye color, gender, and hair color are all examples of nominal data.
3. The mean is the preferred measure of central tendency since it considers all
of the numbers in a data set; however, the mean is extremely sensitive to
outliers, or extreme values that are much higher or lower than the rest of
the values in a data set.
4. The median is preferred in cases where there are outliers, since the median
depends only on the middle value or values.
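The four points above can be checked with Python's standard `statistics` module. The numbers and eye colors below are made-up illustrative data, not values from the slides.

```python
import statistics

# Hypothetical numerical data set
values = [30, 32, 35, 35, 38]

print(statistics.mean(values))    # 34
print(statistics.median(values))  # 35
print(statistics.mode(values))    # 35

# Adding one extreme value (an outlier) pulls the mean up sharply,
# while the median stays where it was.
with_outlier = values + [500]

print(statistics.mean(with_outlier))    # about 111.7
print(statistics.median(with_outlier))  # 35.0

# The mode also works on nominal data (names or labels).
eye_colors = ["brown", "blue", "brown", "green", "brown"]
print(statistics.mode(eye_colors))      # brown
```

The mean moves from 34 to about 111.7 after a single outlier is added, which is why the median is the safer summary for skewed data.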
Examples
Example Mean
Example Variance
Example Standard Deviation
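The worked figures for these example slides did not survive, so the following is a stand-in sketch rather than the original example. It uses a small made-up data set and computes the variance by hand (average squared deviation from the mean), then compares the result with the standard `statistics` module. Population formulas (divide by n) are assumed; the sample versions (divide by n - 1) are shown for contrast.

```python
import statistics

# Hypothetical data set chosen so the arithmetic is easy to follow
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Mean: sum of the values divided by how many there are
mean = sum(data) / len(data)                       # 40 / 8 = 5.0

# Variance: average of the squared deviations from the mean
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / len(data)     # 32 / 8 = 4.0

# Standard deviation: square root of the variance
std_dev = variance ** 0.5                          # 2.0

print(mean, variance, std_dev)

# The standard library gives the same population results
print(statistics.pvariance(data), statistics.pstdev(data))

# Sample versions divide by n - 1 instead of n, so they are slightly larger
print(statistics.variance(data), statistics.stdev(data))
```

The standard deviation is in the same units as the data, which makes it easier to interpret than the variance.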