Day 2 - Data Management - Statistics

DATA MANAGEMENT
Data
• Data is a set of values of qualitative or quantitative variables.
• Data is measured, collected, reported, and analyzed, after which it can be visualized using graphs, images, or other analysis tools. As a general concept, data refers to existing information or knowledge that is represented or coded in a form suitable for better use or processing.
Data Management
• Data management is the development, execution, and supervision of plans, policies, programs, and practices that control, protect, deliver, and enhance the value of data and information assets.
Data management
Analyst Validations
“Poor quality customer data – just customer data alone – costs U.S.
businesses over $600 billion a year.”
“Up to 40% of all failed business initiatives are a result of poor data quality.”
Due to poor data management, “83% of consumers are unlikely to do business again with a company when problem resolution falls below expectations.”
Challenge
Maximize The Value Of Your Data Ecosystem
• POOR QUALITY DATA: cannot be trusted
• TOO MUCH DATA: in too many places
• INCONSISTENT DATA: across multiple sources
The data strategy is not able to support the business strategy: compliance, increased revenue, operational efficiency.
Better, faster business decisions depend on getting data …
• IN THE RIGHT PLACE: efficiently move data between systems and architectures
• AT THE RIGHT TIME: support all data delivery latencies
• IN THE RIGHT FORM: structure and cleanse data for operational systems or analysis
• TO THE RIGHT PEOPLE: govern data use, apply business semantics, collaborate with data
Information Management
• Data Management
• Master Data Management
• Data Quality
• Data Integration
• Data Governance
• Enterprise Data Access

Data Quality Methodology
• Data quality issues can be addressed by following a simple methodology that involves three phases:
  • Analyze
  • Improve
  • Control
STATISTICS
Types of data
• Continuous
  • Equal increments
• Ordinal/Rank
  • In order but not equal (Likert)
• Categorical
  • Names
What type of statistical test do I want to do?
Continuous Data (Equal increments)
• If comparing 2 groups (treatment/control)
  • t-test
• If comparing > 2 groups
  • ANOVA (F-test)
• If measuring association between 2 variables
  • Pearson r correlation
• If trying to predict an outcome (crystal ball)
  • Regression or multiple regression
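Two of the tests listed above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a full analysis: the sample values are made up, and the choice of Welch's unequal-variance form of the t statistic is an assumption, since the slides do not say which t-test variant is meant. A real analysis would also convert the statistic to a p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    spread_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (spread_x * spread_y)


# Hypothetical measurements, invented for illustration only.
treatment = [5.1, 5.8, 6.2, 5.9, 6.4]
control = [4.8, 5.0, 5.3, 4.9, 5.2]
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 55, 61, 64, 70]

t_stat = welch_t(treatment, control)
r = pearson_r(hours_studied, test_scores)
print(f"t = {t_stat:.2f}, r = {r:.3f}")
```

A positive t statistic here means the treatment group's mean is higher than the control group's; r close to 1 indicates a strong positive linear association.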
Ordinal Data (In order but not equal; Likert)
Beyond the capability of Excel (just FYI)
• If comparing 2 groups
  • Mann-Whitney U (treatment vs. control)
  • Wilcoxon (matched pre vs. post)
• If comparing > 2 groups
  • Kruskal-Wallis (median test)
• If measuring association between 2 variables
  • Spearman rho (ρ)
• Likert-type scales are ordinal data
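Spearman's rho from the list above can be sketched as the Pearson correlation computed on ranks rather than on raw values, which is why it suits ordinal data. The Likert responses below are invented for illustration, and the helper names `ranks` and `spearman_rho` are hypothetical.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (spread_x * spread_y)


# Hypothetical 5-point Likert responses from six respondents.
satisfaction = [1, 2, 2, 3, 4, 5]
would_recommend = [1, 1, 2, 3, 5, 4]
rho = spearman_rho(satisfaction, would_recommend)
print(f"rho = {rho:.3f}")
```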
Categorical Data (Names)
• Called a test of frequency: how often something is observed (AKA Goodness of Fit Test, Test of Homogeneity)
  • Chi-Square (χ²)
• Examples of burning research questions:
  • Do negative ads change how people vote?
  • Is there a relationship between marital status and health insurance coverage?
  • Do blonds have more fun?
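As a rough sketch of how the chi-square statistic is computed for a question like the marital status example, the code below compares observed counts in a contingency table with the counts expected if the two variables were unrelated. The counts are made up, and `chi_square` is a hypothetical helper; turning the statistic into a p-value needs a chi-square distribution table or a statistics library.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table.

    `observed` is a list of rows, each a list of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts: rows = married / not married,
# columns = insured / uninsured.
table = [[20, 10],
         [10, 20]]
stat, dof = chi_square(table)
print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
```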
Words we use to describe statistics
Mean (μ)
• The arithmetic average (add all of the scores together, then divide by the number of scores)
• μ = ∑x / n
Median
• The middle number (just like the
median strip that divides a
highway down the middle; 50/50)
• Used when data is not normally
distributed
• Often hear about the median
price of housing
Mode
• The most frequently occurring
number (score, measurement,
value, cost)
• On a frequency distribution, it’s
the highest point (like the á la
mode on pie)
Types of variables
Variables
• Qualitative
  • Dichotomic: gender, marital status
  • Polynomic: brand of PC, hair color
• Quantitative
  • Discrete: children in family, strokes on a golf hole
  • Continuous: amount of income tax paid, weight of a student
Analytics : Types
❖Decision analytics: supports human decisions with visual analytics that the user models to reflect reasoning
❖Descriptive analytics: gains insight from historical data with reporting, scorecards, clustering, etc.
❖Predictive analytics: employs predictive modeling using statistical and machine learning techniques
❖Prescriptive analytics: recommends decisions using optimization, simulation, etc.
Lifecycle of Analytics
1. Identify the Business Problem
2. Design and Collect the Data
3. Data Preparation
4. Selection of Analytics Method/Models
5. Validation of Results
Tools for Analytics
• Popular Tools: R, Python, SAS, SPSS, MATLAB, KXEN, Zeppelin
• Commercial Tools: MS Excel, SAS, SPSS, KXEN, MATLAB, Angoss, Statistica
• Open Source: R, Weka, Python, Zeppelin
Statistics Introduction
● Statistics is the science of collecting, organizing, interpreting, and visualizing data.
● By applying statistical methods, we draw meaningful conclusions from the data we have.
Statistics
• Inferential
• Descriptive
Inferential Statistics
Inferential statistics use a random sample of data taken from a population to
describe and make inferences about the population. Inferential statistics are
valuable when examining each member of an entire population is not
convenient or possible. For example, measuring the diameter of every nail
manufactured in a mill is impractical. Instead, you can measure the diameters of a
representative random sample of nails and use the information from the
sample to make generalizations about the diameters of all of the nails.
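The nail example can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: the population is simulated as 100,000 diameters around 3.0 mm, the sample size is 200, and the interval uses the normal approximation with a critical value of 1.96 for roughly 95% confidence.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: diameters (mm) of every nail the mill produced.
population = [random.gauss(3.0, 0.05) for _ in range(100_000)]

# In practice we can only afford to measure a random sample.
sample = random.sample(population, 200)

sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval for the population mean.
low = sample_mean - 1.96 * standard_error
high = sample_mean + 1.96 * standard_error

print(f"sample mean: {sample_mean:.4f} mm")
print(f"approx. 95% CI: ({low:.4f}, {high:.4f}) mm")
print(f"population mean (normally unknown): {statistics.mean(population):.4f} mm")
```

The sample mean lands close to the population mean even though only a small fraction of the nails was measured, which is the point of inferential statistics.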
Descriptive Statistics
Descriptive statistics simply describe what the data shows.
There are two basic ways to describe the data:

1. Measures of central tendency
2. Measures of variability, or dispersion
Measure Of Central Tendency
You are probably familiar with the mean, but did you
know that it is a measure of central tendency?
Measures of central tendency use a single value to describe the
center of a data set. The mean, median, and mode are the three
most common measures of central tendency.
Measures of Central Tendency (continued)
The mean, or average, is calculated by summing the data values and
dividing the sum by the number of values.
The mode is the value that appears most frequently in the data set.
The median is the middle value in a data set. It is calculated by
first listing the data in numerical order and then locating the value in
the middle of the list.
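The three definitions above map directly onto Python's standard `statistics` module. The scores below are made up for illustration.

```python
import statistics

# Hypothetical test scores.
scores = [70, 85, 85, 90, 100]

mean_score = statistics.mean(scores)      # sum of the values / number of values
median_score = statistics.median(scores)  # middle value of the sorted list
mode_score = statistics.mode(scores)      # most frequently occurring value

print(mean_score, median_score, mode_score)
```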
Measures of Central Tendency (continued)
1. The mean and median can only be used with numerical data. The mode can
be used with both numerical and nominal data, that is, data in the form of names
or labels.
2. Eye color, gender, and hair color are all examples of nominal data.
3. The mean is often the preferred measure of central tendency since it considers all
of the numbers in a data set; however, the mean is extremely sensitive to
outliers, or extreme values that are much higher or lower than the rest of
the values in a data set.
4. The median is preferred in cases where there are outliers, since the median
depends only on the middle value or values.
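A short sketch makes points 1 to 4 concrete: adding one extreme value moves the mean a long way but barely moves the median, and the mode works on labels as well as numbers. The salary figures and eye colors are invented for illustration.

```python
import statistics

# Hypothetical salaries in thousands.
salaries = [30, 32, 35, 36, 38]
with_outlier = salaries + [500]  # one extreme value added

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")

# The mode also works on nominal data (names or labels).
eye_colors = ["brown", "blue", "brown", "green"]
most_common_color = statistics.mode(eye_colors)
print(f"most common eye color: {most_common_color}")
```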
Examples
When a data set has an odd number of values, the median is the single
middle value. For example, in a sorted set of 9 values, the median is
the number in the fifth place.
When a data set has an even number of values, the median is the average of the
two middle numbers. For example, in a sorted set of 10 values, you
would find the average of the numbers in the fifth and sixth
places.
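The odd and even cases can be written out by hand as follows. This is an illustrative sketch with a hypothetical helper name, `median_by_position`; in practice `statistics.median` does the same job.

```python
def median_by_position(values):
    """Median found by sorting and picking the middle position(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value (fifth place when n = 9).
        return ordered[middle]
    # Even count: average of the two middle values
    # (fifth and sixth places when n = 10).
    return (ordered[middle - 1] + ordered[middle]) / 2


nine_values = [9, 1, 8, 2, 7, 3, 6, 4, 5]
ten_values = [10, 1, 9, 2, 8, 3, 7, 4, 6, 5]

odd_median = median_by_position(nine_values)
even_median = median_by_position(ten_values)
print(odd_median, even_median)
```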
Measures of Dispersion
To see how spread out the data are around a central value such as the mean,
look at measures of dispersion, which include the range, variance, and
standard deviation.
The simplest measure of dispersion is the range. To calculate the range,
subtract the smallest number from the largest number. Just like the mean,
the range is very sensitive to outliers.
The variance is the average squared distance of the data values from
their mean. The variance is rarely reported on its own; it is typically used to
calculate other statistics, such as the standard deviation. The higher
the variance, the more spread out the data are.
The standard deviation is the square root of the variance, so it measures
spread in the same units as the data.
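The three measures can be sketched with Python's standard `statistics` module. The data values are made up, and the slides do not say whether the population or the sample form of the variance is intended, so both are shown here as an assumption: the population form divides by n, the sample form by n - 1.

```python
import statistics

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)               # largest minus smallest

population_variance = statistics.pvariance(data)  # squared deviations / n
population_std = statistics.pstdev(data)          # square root of the above

sample_variance = statistics.variance(data)       # squared deviations / (n - 1)
sample_std = statistics.stdev(data)

print(f"range: {data_range}")
print(f"population variance: {population_variance}, std dev: {population_std}")
print(f"sample variance: {sample_variance:.3f}, std dev: {sample_std:.3f}")
```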
Measures of Dispersion (continued)
[Figures: worked examples of the mean, variance, and standard deviation]
