[STATS] Module 3

1st Semester | MGT1103 | a.y. 2023-2024
[Module 1] Data and Data Preparation
Types of data

DATA
• Compilations of facts, figures, or other contents.
• Numerical and non-numerical.
o All types and formats are generated from multiple sources.
o Customers and businesses use data to help make decisions.

STATISTICS
• Language of data.
• Science that deals with the collection, preparation, analysis, interpretation, and presentation of data.
1. Find the right data and prepare it for the analysis.
2. Use the appropriate statistical tool, which depends on the data.
3. Clearly communicate information with actionable business insights.

two branches of statistics
Descriptive statistics
• Summary of important aspects of a data set.
o Collecting, organizing, and presenting data in the form of charts and tables.
o Often calculate numerical measures (typical value, variability).

Inferential statistics
• Drawing conclusions about a larger set of data (population) based on a smaller set of data (sample).
o POPULATION consists of all items/members of interest.
o SAMPLE is a subset of the population.
o We rely on sample data to make inferences about various characteristics of the population.

Sample data
• We analyze sample data and calculate a sample statistic to make inferences about the unknown population parameter.
• It is generally not feasible to obtain population data.
o Obtaining information on the entire population is expensive.
o It is often impossible to examine every member of the population.
• Sample data are generally collected in one of two ways: as cross-sectional data or as time series data.

Cross-sectional data
• Data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time.

Time series data
• Data collected over several time periods focusing on certain groups of people, specific events, or objects.
• Most common data collection procedure.
• Can include hourly, daily, weekly, monthly, quarterly, or annual observations.
• Ex: Homeownership rates (%) in the US.

STRUCTURED DATA
• Reside in a pre-defined, row-column format.
• Spreadsheet or database applications.
• Enter, store, query, and analyze.
• Numerical information that is objective and not open to interpretation.
• Account for only about 20% of all data used in business decisions.

Unstructured data
• Do not conform to a pre-defined, row-column format.
• Textual and multimedia content.
• Do not conform to database structures.
• These data may have some implied structure.
o Still considered unstructured.
• Do not conform to the row-column model required in most database systems.
• Ex: Social media data such as Twitter, YouTube, Facebook, and blogs.

Big data
• Businesses generate and gather more and more data at an increasing pace.
o Massive volume of structured and unstructured data.
o Extremely difficult to manage, process, and analyze using traditional data processing tools.
o Presents great opportunities to gain knowledge and game-changing intelligence.
• High-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
• Does not imply complete (population) data.
• May not be used even when available.
o Inconvenient and computationally burdensome.
o Benefits may not justify costs.

Three characteristics of big data:
1. VOLUME: immense amount of data compiled from a single source or multiple sources.
2. VELOCITY: data are generated at a rapid speed, so managing them is a critical issue.
3. VARIETY: all types, forms, and granularity; structured or unstructured.

Additional characteristics:
1. VERACITY: credibility, quality, reliability.
2. VALUE: methodological plan for formulating questions and curating the right data.

Internet data
• There is an abundance of data on the Internet.
• Many experts believe that 90% of the data in the world today was created in the last two years alone.
• It is easy to access and find data by using a search engine like Google.
• Several sources of data:
- Bureau of Economic Analysis
- Bureau of Labor Statistics
- Federal Reserve Economic Data (FRED)
- U.S. Census Bureau
- National Climatic Data Center
- Yahoo Finance, Google Finance
- Zillow
- ESPN

VARIABLE
• Characteristic of interest that differs in kind or degree among various observations (records).
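
As a rough illustration of the terms in this module, the Python sketch below treats each simulated customer as an observation and age as the variable of interest. The population, its size, the age range, and the sample size of 200 are all made-up assumptions. In practice the population parameter is unknown; it is computed here only because the data are simulated.

```python
import random
import statistics

# Hypothetical population: ages of 10,000 customers (simulated, not real data).
random.seed(42)
population = [random.randint(18, 80) for _ in range(10_000)]

# Population parameter: normally unknown; available here only because
# the whole population was simulated.
population_mean = statistics.mean(population)

# Sample: a random subset of the population (size 200 is an arbitrary choice).
sample = random.sample(population, 200)

# Sample statistic: the value used to make an inference about the parameter.
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

The two printed values should be close but not identical, which is the point of inferential statistics: the sample statistic estimates the parameter without examining every member of the population.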
San Luis, Jersey Anne | 1
• There are two types of variables: categorical and numerical.

categorical
• Qualitative
• Represents categories
• Labels or names identify distinguishing characteristics.
• Can be defined by two or more categories
• Coded into numbers for data processing
• Summarize the data with a frequency distribution.
• Ex: marital status, grade in a course

numerical
• Quantitative
• Use numbers to identify the distinguishing characteristics of each observation.
• Represent meaningful numbers
• Discrete or continuous
o DISCRETE assumes a countable number of values.
- Need not be whole numbers
- Ex: number of children in a family
o CONTINUOUS assumes an uncountable number of values within an interval.
- Often measured in discrete values
- Ex: weight of a newborn baby

NOTE!
In order to choose the appropriate techniques for summarizing and analyzing variables, we need to distinguish between the different measurement scales.

Scales of measurement
nominal
• Least sophisticated
• Represents categories or groups
• Values differ by labels or names
• Ex: marital status

ordinal
• Stronger level of measurement than nominal
• Categorize and rank data with respect to some characteristic.
• Cannot interpret the difference between the ranked values; the numbers are arbitrary.
• Ex: reviews from 1 star (poor) to 5 stars (outstanding).

NOTE!
Nominal and ordinal scales are used for CATEGORICAL VARIABLES.
• Typically expressed in words but coded into numbers for purposes of data processing.
• Typically count the number of observations that fall into each category (or find the %).
• Unable to perform meaningful arithmetic operations.

interval
• Categorize and rank; differences are meaningful.
• Zero value is arbitrary and does not reflect absence of the characteristic.
• Ratios are not meaningful
• Ex: temperature in degrees Celsius or Fahrenheit

ratio
• Strongest level of measurement
• A true zero point, which reflects absence of the characteristic
• Ratios are meaningful
• Ex: Profit

Data preparation
• We often spend a considerable amount of time inspecting and preparing the data for the subsequent analysis.
o Counting and sorting
o Handling missing values
o Subsetting

Counting and sorting
• Among the very first tasks analysts perform.
• Gain a better understanding of and insights into the data.
• Help to verify that the data set is complete or determine if there are missing values.
• Sorting allows us to review the range of values for each variable.
• Sort based on a single variable or multiple variables.

Dealing with missing values:
Omission strategy
• Observations with missing values are excluded from the subsequent analysis.

Imputation strategy
• Missing values are replaced with some reasonable imputed values.
o Numeric variables: replace with the average.
o Categorical variables: replace with the predominant category.

subsetting
• Process of extracting a portion of the data set that is relevant.
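
The data preparation tasks described in these notes (counting, sorting, handling missing values, and subsetting) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The records, the variable names (`status`, `income`), and the use of `None` to mark a missing value are hypothetical choices made for the example, not part of the course material.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "status": "Single",  "income": 40_000},
    {"name": "B", "status": "Married", "income": None},
    {"name": "C", "status": None,      "income": 55_000},
    {"name": "D", "status": "Married", "income": 61_000},
    {"name": "E", "status": "Married", "income": 48_000},
]

# Counting: frequency distribution of a categorical variable,
# plus a count of missing values for a numerical variable.
status_counts = Counter(r["status"] for r in records if r["status"] is not None)
missing_income = sum(r["income"] is None for r in records)

# Sorting: order the observations with a known income from highest to lowest.
by_income = sorted(
    (r for r in records if r["income"] is not None),
    key=lambda r: r["income"],
    reverse=True,
)

# Omission strategy: keep only observations with no missing values.
complete = [r for r in records if None not in r.values()]

# Imputation strategy: average for the numeric variable,
# predominant category for the categorical variable.
avg_income = mean(r["income"] for r in records if r["income"] is not None)
top_status = status_counts.most_common(1)[0][0]
imputed = [
    {
        **r,
        "income": avg_income if r["income"] is None else r["income"],
        "status": top_status if r["status"] is None else r["status"],
    }
    for r in records
]

# Subsetting: extract only the portion of the data that is relevant.
married = [r for r in imputed if r["status"] == "Married"]

print(status_counts, missing_income)
print([r["name"] for r in by_income])
print([r["name"] for r in complete])
print([r["name"] for r in married])
```

Note that omission shrinks the data set (three complete observations remain here), while imputation keeps all five at the cost of introducing estimated values. Which strategy is appropriate depends on how much data is missing.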
