
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES

Sta. Mesa, Manila


College of Business Administration
Department of Marketing Management

LEARNING MODULE
IN

FUNDAMENTALS
OF BUSINESS
ANALYTICS

Mecmack A. Nartea, MMBM


RAQUEL G. RAMOS, DBA

COURSE DESCRIPTION:

The course provides students with an overview of the current trends in business analytics that drive today's
businesses. It will provide an understanding of data management techniques that can help an organization
achieve its business goals and address operational challenges.

COURSE OBJECTIVES:

As a result of taking this course, the student should be able to:


1. Describe the various sources of data (structured, unstructured) and the concept of data management;
2. Describe the importance of data, how data can be used by an organization towards competitive advantage,
and how it enables organizations to make quicker and better business decisions;
3. Describe, understand and explain business modeling, the business modeling process and be able to apply it
in a variety of different situations;
4. Describe basics of business intelligence including data gathering, data storing, data analyzing and providing
access to data;
5. Describe how statistical analysis can help to better understand past events and predict future events;
6. Understand the fundamentals of project risk management, the various methods used for effort and cost
estimation, the various phases within a project, dependencies, and the critical path;
7. Describe various database models, such as the hierarchical database model and the network model; and
8. Develop an awareness of the ethical norms required under policies and applicable laws governing
confidentiality and non-disclosure of data/information/documents and proper conduct in the learning
process and application of business analytics.

TABLE OF CONTENTS


Chapter 1 The Process of Analytics
Evolution of Analytics: How Did Analytics Start?
The Quality Movement
The Second World War
Where Else Was Statistics Involved?
The Dawn of Business Intelligence
Chapter Exercise
Chapter 2 Analytics: A Comprehensive Study
Definition of Business Analytics
Types of Analytics
Basic Domains within Analytics
Definition of Analytics
Analytics vs. Analysis
Examples of Analytics
Software Analytics
Embedded Analytics
Learning Analytics
Differentiating Learning Analytics and Educational Data Mining
Chapter Exercise
Chapter 3 Descriptive Statistical Measures
Populations and Samples
Data Sets, Variables, and Observations
Types of Data
Descriptive Measures for Categorical Variables
Descriptive Measures for Numerical Variables
Measures of Central Tendency
Measures of Variability
Outliers and Missing Values
Chapter Exercise
Chapter 4 Analytics on Spreadsheets
Excel Tables for Filtering, Sorting, and Summarizing
Chapter Exercise
Chapter 5 Probability and Probability Distribution
Probability Essentials
Rule of Complements
Addition Rule
Conditional Probability and the Multiplication Rule
Probability Distribution of a Single Random Variable
Summary Measures of a Probability Distribution
Chapter Exercise
Chapter 6 Statistical Inference: Sampling and Estimation
Understanding Samples
Sampling Techniques
Determining Sample Size
Introduction to Estimation
Sources of Estimation Error
Key Terms in Sampling
Sample Size Selection

Confidence Intervals
What Is the P-Value?
Errors in Hypothesis Testing
Sampling Distributions
Parametric Tests
Nonparametric Tests
Chapter Exercise
Chapter 7 Data Mining
Introduction to Data Mining
Data Exploration and Visualization
Online Analytical Processing (OLAP)
PowerPivot and Power View in Excel 2013
Visualization Software
Microsoft Data Mining Add-Ins for Excel
Classification Methods
Logistic Regression
Classification Trees
Clustering
Chapter Activity

CHAPTER 1
THE PROCESS OF ANALYTICS

OVERVIEW

This chapter discusses how business analytics is used in daily life. It
further discusses the various software used in analytics. The history of
how and when analytics started is also tackled in this chapter.

OBJECTIVES

▪ Learn the evolution of analytics


▪ Learn where analytics has been involved throughout history
▪ Understand how business intelligence emerged

What Is Analytics? What Does a Data Analyst Do?

A casual search on the Internet for “data scientist” reveals that there is a
substantial shortage of manpower for this job. In addition, Harvard Business Review has
published an article called “Data Scientist: The Sexiest Job of the 21st Century.” So, what
does a data analyst actually do?

To put it simply, analytics is the use of numbers or business data to find solutions for
business problems. Thus, a data analyst looks at the data that has been collected across
huge enterprise resource planning (ERP) systems, Internet sites, and mobile applications.

In the “old days,” we just called upon an expert, who was someone with a lot of
experience. We would then take that person’s advice and decide on the solution. It’s much
like visiting the doctor today, who is a subject-matter expert.

As the complexity of business systems went up and we entered an era of continuous


change, people found it hard to deal with such complex systems that had never existed
before. The human brain is much better at working with fewer variables than many. Also,
people started using computers, which are relatively better and unbiased when it comes to
new forms and large volumes of data.

An Example

The next question often is, what do I mean by “use of numbers”? Will you have to do
math again?

The last decade has seen the advent of software as a service (SaaS) in all walks of
information gathering and manipulation. Thus, analytics systems now are button-driven
systems that do the calculations and provide the results. An analyst or data scientist has to
look at these results and make recommendations for the business to implement. For
example, say a bank wants to sell loans in the market. It has data on all the customers who
have taken loans from the bank over the last 20 years. The portfolio is of, say, 1 million
loans. Using this data, the bank wants to understand which customers it should give pre-
approved loan offers to.

The simplest answer may be as follows: all the customers who paid on time every
time in their earlier loans should get a pre-approved loan offer. Let’s call this set of
customers Segment A. But on analysis, you may find that customers who defaulted but paid

the loan after the default actually made more money for the bank because they paid interest
plus the late payment charges. Let’s call this set Segment B.

Hence, you can now say that you want to send out an offer letter to Segment A +
Segment B.

However, within Segment B there was a set of customers to whose homes you had to send
collections teams to collect the money. So, they paid interest plus the late
payment charges minus the collection cost. This set is Segment C.

So, you may then decide to target Segment A + Segment B – Segment C.

You could do this exercise using the decision tree technique that cuts your data into
segments (Figure 1-1).
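For readers who want to see the segmentation idea in code, here is a minimal sketch (not part of the original module) that expresses the A + B - C rule with pandas filters; the data and the column names paid_on_time, defaulted, and needed_collections are hypothetical. In practice a decision tree model, for example scikit-learn's DecisionTreeClassifier, would discover similar splits from profitability data automatically.

import pandas as pd

# Hypothetical closed-loan history (every loan below was eventually repaid).
loans = pd.DataFrame({
    "customer_id":        [1, 2, 3, 4, 5],
    "paid_on_time":       [True, False, False, True, False],
    "defaulted":          [False, True, True, False, True],
    "needed_collections": [False, False, True, False, False],
})

segment_a = loans[loans["paid_on_time"]]                             # never late
segment_b = loans[loans["defaulted"]]                                # late, but repaid with interest plus fees
segment_c = loans[loans["defaulted"] & loans["needed_collections"]]  # repaid only after costly field collections

# Target = Segment A + Segment B - Segment C
target = pd.concat([segment_a, segment_b])
target = target[~target["customer_id"].isin(segment_c["customer_id"])]
print(sorted(target["customer_id"]))  # customers who get the pre-approved offer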

A Typical Day

The last question to tackle is, what does the workday of an analytics professional
look like? It probably encompasses the following:

The data analyst will walk into the office and be told about the problem that the
business needs input on.

The data analyst will determine the best way to solve the problem.

The data analyst will then gather the relevant data from the large data sets stored in
the server.

Next, the data analyst will import the data into the analytics software.

The data analyst will run the technique through the software (SAS, R, SPSS,
XLSTAT, and so on).

The software will produce the relevant output.

The data analyst will study the output and prepare a report with recommendations.

The report will be discussed with the business.
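As a rough, invented illustration of that loop (not taken from the module), the few lines below compress the import, analyze, and report steps into Python with pandas; the data set and column names are made up for the sketch.

import pandas as pd

# 1. Gather / import: a small stand-in for the extract pulled from the server.
sales = pd.DataFrame({"region": ["N", "S", "N", "S", "E"],
                      "revenue": [120, 95, 150, 80, 110]})

# 2. Run the technique: here, a simple aggregation by region.
summary = sales.groupby("region")["revenue"].agg(["count", "mean", "sum"])

# 3. Study the output and prepare the recommendation for the business.
print(summary.sort_values("sum", ascending=False))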

Is Analytics for You?

So, is analytics the right career for you? Here are some points that will help you decide:

Do you believe that data should be the basis of all decisions? Take up analytics
only if your answer to this question is an unequivocal yes. Analytics is the process
of using and analyzing a large quantum of data (numbers, text, images, and so on)
by aggregating, visualizing/creating dashboards, checking repetitive trends, and
creating models on which decisions can be made. Only people who innately
believe in the power of data will excel in this field. If some prediction/analysis is
wrong, the attitude of a good analyst is that it is because the data was not
appropriate for the analysis or the technique used was incorrect. You will never
doubt that a correct decision will be made if the relevant data and appropriate
techniques are used.

Do you like to constantly learn new stuff? Take up analytics only if your answer to
this question is an unequivocal yes. Analytics is a new field. New avenues of data are
constantly appearing: Internet data, social networking information, mobile transaction data,
and near-field communication devices. There
are constant changes in technology to store, process, and analyze this data.
Hadoop, Google updates, and so on, have become increasingly important. Cloud
computing and data management are common now. Economic cycles have
shortened, and model building has become more frequent as older models get
redundant. Even the humble Excel has an Analysis ToolPak in Excel 2010 with
statistical functions. In other words, be ready for change.

Do you like to interpret outcomes and then track them to see whether your
recommendations were right? Take up analytics only if your answer to this
question is an unequivocal yes. A data analyst will work on a project, and the

implementation of the recommendations will generally be valid for a reasonably
long period of time, perhaps a year or even three to five years. A good analyst
should be interested to know how accurate the recommendations have been and
should want to track the performance periodically. You should ideally also be the
first person to be able to say when the analysis is not working and needs to be
reworked.

Are you ready to go back to a textbook and brush up on the concepts of math and
statistics? Take up analytics only if your answer to this question is an unequivocal
yes. To accurately handle data and interpret results, you will need to brush up on
the concepts of math and statistics. It becomes important to justify why you chose
a particular path during analysis versus others. Business users will not accept your
word blindly.

Do you like debating and logical thinking? Take up analytics only if your answer to
this question is an unequivocal yes. As there is no one solution to all problems, an
analyst has to choose the best way to handle the project/problem at hand. The
analyst has to be able to not only know the best way to analyze the data but also
give the best recommendation in the given time constraints and budget constraints.
This sector generally has a very open culture where the analyst working on a
project/problem will be required to give input irrespective of the analyst’s position
in the hierarchy.

Do check your answers to the previous questions. If you said yes for three out of these
five questions and an OK for two, then analytics is a viable career option for you. Welcome
to the world of analytics!

Evolution of Analytics: How Did Analytics Start?

As per the Oxford Dictionary, the definition of statistics is as follows:

The practice or science of collecting and analyzing numerical data in large quantities,
especially for the purpose of inferring proportions in a whole from those in a
representative sample.1

Most people start working with numbers, counting, and math by the time they are five
years old. Math includes addition, subtraction, theorems, rules, and so on. Statistics begins when
we start using math concepts to work on real-life data.

Statistics is derived from the Latin word status, the Italian word statista, or the
German word statistik, each of which means a political state. This word came into being
somewhere around 1780 to 1790.

In ancient times, the government collected information regarding the population,
property, and wealth of the country. This enabled the government to get an idea of the
manpower of the country and became the basis for introducing taxes and levies. Statistics
is the practical part of math.

The implementation of standards in industry and commerce became important with


the onset of the Industrial Revolution, where there arose a need for high-precision machine
tools and interchangeable parts. Standardization is the process of developing and
implementing technical standards. It helps in maximizing compatibility, interoperability,
safety, repeatability, and quality.

Nuts and bolts held the industrialization process together; in 1800, Henry Maudslay
developed the first practical screw-cutting lathe. This allowed for the standardization of
screw thread sizes and paved the way for the practical application of interchangeability for
nuts and bolts. Before this, screw threads were usually made by chipping and filing manually.

Maudslay standardized the screw threads used in his workshop and produced sets
of nuts and bolts to those standards so that any bolt of the appropriate size would fit any nut
of the same size.

Joseph Whitworth’s screw thread measurements were adopted as the first unofficial
national standard by companies in Britain in 1841 and came to be known as the British
standard Whitworth.

By the end of the 19th century, differences in standards between companies were
making trading increasingly difficult. The Engineering Standards Committee was established
in London in 1901. By the mid-to-late 19th century, efforts were already being made to
standardize electrical measurements. Many companies had entered the market in the 1890s,

and all chose their own settings for voltage, frequency, current, and even the symbols used
in circuit diagrams, making standardization necessary for electrical measurements.

The International Federation of the National Standardizing Associations was founded


in 1926 to enhance international cooperation for all technical standards and certifications.

The Quality Movement

Once manufacturing became an established industry, the emphasis shifted to


minimizing waste and therefore cost. This movement was led by engineers who were, by
training, adept at using math. This movement was called the quality movement. Some
practices that came from this movement are Six Sigma and just-in-time manufacturing in
supply chain management. The point is that all this started during the Industrial Revolution in the
1800s.

This was followed by the factory system, with its emphasis on product inspection.

After the United States entered World War II, quality became a critical
component since bullets from one state had to work with guns manufactured in another
state. At first, the U.S. Army had to inspect every piece of machinery manually, but
this was very time-consuming. Statistical techniques such as sampling started being used to
speed up the processes.

Japan around this time was also becoming conscious of quality. The quality initiative
started with a focus on defects and products and then moved on to look at the process used
for creating these products. Companies invested in training their workforce on Total Quality
Management (TQM) and statistical techniques.

This phase saw the emergence of seven “basic tools” of quality.

Statistical Process Control (SPC), which dates from the early 1920s, is a method of quality control that uses
statistical methods to monitor and control a process so that it operates at
its full potential. At its full potential, a process can churn out as much conforming,
standardized product as possible with a minimum of waste.

This is used extensively in manufacturing lines with a focus on continuous


improvement and is practiced in these two phases:

Initial establishment of the process

Regular production use of the process

The advantage of Statistical Process Control (SPC) over other methods of quality
control, such as inspection, is that it emphasizes early detection and prevention of problems
rather than correcting problems after they occur.
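To make the control-chart idea concrete, here is a simplified sketch (an assumption, not part of the module): it places the limits three sample standard deviations around the mean, whereas textbook SPC charts usually estimate the spread from moving ranges or subgroup ranges.

from statistics import mean, stdev

def control_limits(measurements):
    # Individuals-chart style limits: center line at the mean,
    # upper/lower control limits at plus/minus 3 standard deviations.
    center = mean(measurements)
    spread = stdev(measurements)
    return center - 3 * spread, center, center + 3 * spread

widths = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9]  # hypothetical part widths (mm)
lcl, cl, ucl = control_limits(widths)
out_of_control = [x for x in widths if x < lcl or x > ucl]  # points signalling special-cause variation
print(round(lcl, 3), round(cl, 3), round(ucl, 3), out_of_control)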

The following were the next steps:

Six Sigma: A process of measurement and improvement perfected by GE and


adopted by the world

Kaizen: A Japanese term for continuous improvement; a step-by-step


improvement of business processes

PDCA: Plan-Do-Check-Act, as defined by Deming

What was happening on the government front? The most data was being
captured and used by the military. A lot of the business terminologies and processes used
today have been copied from the military: sales campaigns, marketing strategy, business
tactics, business intelligence, and so on.

The Second World War

As mentioned, statistics made a big difference during World War II. For instance, the
Allied forces accurately estimated the production of German tanks using statistical methods.
They also used statistics and logical rules to decode German messages.
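The tank estimate is usually presented today as the "German tank problem"; the sketch below (not from the module, serial numbers invented) shows the standard minimum-variance unbiased estimator N ≈ m + m/k - 1, where m is the largest captured serial number and k the number of captured tanks.

def estimate_total_produced(serial_numbers):
    # Frequentist estimate of the total number produced from observed serials.
    k = len(serial_numbers)
    m = max(serial_numbers)
    return m + m / k - 1

print(estimate_total_produced([19, 40, 42, 60]))  # 74.0 for these invented serials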

The Kerrison Predictor was one of the first fully automated anti-aircraft fire control
systems; it could aim a gun at an aircraft based on simple inputs such as the angle to the target
and the observed speed. The British Army used it effectively in the early 1940s.

The Manhattan Project was a U.S. government research project in 1942–1945 that
produced the first atomic bomb. Under this, the first atomic bomb was exploded in July 1945
at a site in New Mexico. The following month, the other atomic bombs that were produced
by the project were dropped on Hiroshima and Nagasaki, Japan. This project used statistics
to run simulations and predict the behavior of nuclear chain reactions.

Where Else Was Statistics Involved?

Weather prediction, especially of rain, affected the world economy the most since
weather affects the agriculture industry. The first attempt to forecast the weather
numerically was made in 1922 by Lewis Fry Richardson.

The first successful numerical prediction was performed using the ENIAC digital
computer in 1950 by a team of American meteorologists and mathematicians.2

Then, 1956 saw analytics solve the shortest-path problem in travel and logistics,
radically changing these industries.

In 1956 FICO was founded by engineer Bill Fair and mathematician Earl Isaac on the
principle that data used intelligently can improve business decisions. In 1958 FICO built its
first credit scoring system for American investments, and in 1981 the FICO credit bureau risk
score was introduced.3

Historically, by the 1960s, most organizations had designed, developed, and


implemented centralized computing systems for inventory control. Material requirements
planning (MRP) systems were developed in the 1970s.

In 1973, the Black-Scholes model (or Black–Scholes–Merton model) was perfected.


It is a mathematical model of a financial market containing certain derivative investment
instruments. The model estimates the price of an option over time. The key idea
behind the model is to hedge the option by buying and selling the underlying asset in just the right way
and thereby eliminate risk. It is used by investment banks and hedge funds.
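For reference, the closed-form Black-Scholes price of a European call option can be computed in a few lines; the sketch below is illustrative only, with made-up inputs, and uses the standard parameterization (spot S, strike K, time to expiry T in years, risk-free rate r, volatility sigma).

from math import exp, log, sqrt
from statistics import NormalDist

def black_scholes_call(S, K, T, r, sigma):
    # d1 and d2 from the Black-Scholes formula; N is the standard normal CDF.
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    N = NormalDist().cdf
    return S * N(d1) - K * exp(-r * T) * N(d2)

# Example: spot 100, strike 105, one year to expiry, 5% rate, 20% volatility.
print(round(black_scholes_call(100, 105, 1.0, 0.05, 0.20), 2))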

By the 1980s, manufacturing resource planning systems were introduced with the
emphasis on optimizing manufacturing processes by synchronizing materials with
production requirements. Starting in the late 1980s, software systems known as enterprise
resource planning systems became the drivers of data accumulation in business. ERP
systems are software systems for business management, including modules supporting

functional areas such as planning, manufacturing, sales, marketing, distribution, accounting,
and so on. ERP systems were a leg up over MRP systems. They include modules not only
related to manufacturing but also to services and maintenance.

The Dawn of Business Intelligence

Typically, early business applications and ERP systems had their own databases
that supported their functions. This meant that data was in silos because no other system
had access to it. Businesses soon realized that the value of data can increase manyfold if
all the data is in one system together. This led to the concept of a data warehouse and then
an enterprise data warehouse (EDW) as a single system for the repository of all the
organization’s data. Thus, data could be acquired from a variety of incompatible systems
and brought together using extract, transform, load (ETL) processes. Once the data is
collected from the many diverse systems, the captured data needs to be converted into
information and knowledge in order to be useful. The business intelligence (BI) systems
could therefore give much more coherent intelligence to businesses and introduce the
concepts of one view of customers and customer lifetime value.
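A toy ETL sketch may help picture the flow (this is an assumption for illustration, not part of the module): two incompatible source extracts are renamed into one conformed schema and loaded into a single warehouse table, with SQLite standing in for the enterprise data warehouse and all table and column names invented.

import sqlite3
import pandas as pd

# Extract: stand-ins for exports from two incompatible source systems.
erp_orders = pd.DataFrame({"cust_no": [101, 102], "amt": [2500.0, 990.0]})
web_orders = pd.DataFrame({"user": [102, 103], "amount": [450.0, 1200.0]})

# Transform: reconcile column names into one conformed schema.
erp = erp_orders.rename(columns={"cust_no": "customer_id", "amt": "amount"})
web = web_orders.rename(columns={"user": "customer_id"})
combined = pd.concat([erp, web], ignore_index=True)

# Load: append the conformed rows into a warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("sales_fact", conn, if_exists="append", index=False)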

One advantage of an EDW is that business intelligence is now much more


exhaustive. Though business intelligence is a good way to use graphs and charts to get a
view of business progress, it does not use high-end statistical processes to derive greater
value from the data.

The next question that businesses wanted to answer by the 1990s–2000s was how
data could be used more effectively to understand embedded trends and predict future trends.
The business world was waking up to predictive analytics.

What are the types of analytics that exist now? The analytics journey generally starts
off with the following:

Descriptive statistics: This enables businesses to understand summaries, generally of the
numbers that management views as part of the business intelligence process.

Inferential statistics: This enables businesses to understand distributions and


variations and shapes in which the data occurs.

Differences statistics: This enables businesses to know how the data is changing or
if it’s the same.

Associative statistics: This enables businesses to know the strength and direction of
associations within data.

Predictive analytics: This enables businesses to make predictions related to trends


and probabilities.
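A compact, invented illustration of three of these flavors (not part of the module): a descriptive summary, an associative correlation, and a simple predictive trend line fitted with least squares.

import numpy as np
import pandas as pd

# Hypothetical monthly figures: advertising spend vs. units sold.
df = pd.DataFrame({"ad_spend": [10, 12, 9, 15, 14, 11],
                   "units_sold": [200, 230, 185, 280, 262, 214]})

print(df.describe())                          # descriptive: summaries of the numbers
print(df["ad_spend"].corr(df["units_sold"]))  # associative: strength and direction of the link

slope, intercept = np.polyfit(df["ad_spend"], df["units_sold"], 1)
print(intercept + slope * 16)                 # predictive: projected units at an ad spend of 16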

Fortunately, we live in an era of software, which can help us do the math, which
means analysts can focus on the following:

Understanding the business process

Understanding the deliverable or business problem that needs to be solved

Pinpointing the technique in statistics that will be used to reach the solution

Running the SaaS to implement the technique

Generating insights or conclusions to help the business

CHAPTER EXERCISES

Direction: Discuss the following questions. Write your answers on short bond paper.

1. How is analytics applicable in your daily life? Cite examples to substantiate your
answer.

2. Is there really a need to include analytics in the education curriculum? Justify your answer.

SUGGESTED READINGS

http://journals.ametsoc.org/doi/pdf/10.1175/BAMS-89-1-45

www.fico.com/en/about-us#our_history

www.oxforddictionaries.com/definition/english/statistics

CHAPTER 2
ANALYTICS: A COMPREHENSIVE STUDY

OVERVIEW

Analytics is the understanding and communication of significant patterns in data. Analytics is
applied in businesses to improve their performance. Some of the aspects explained in this
text are software analytics, embedded analytics, learning analytics and social media
analytics. The section on analytics offers an insightful focus, keeping in mind the complex
subject matter.

OBJECTIVES

▪ Define business analytics


▪ Know the different types of analytics
▪ Enumerate and understand the different domains in analytics
▪ Differentiate analytics and analysis
▪ Understand what software analytics is
▪ Understand how analytics is used in academe
▪ Differentiate Learning Analytics and Educational Data Mining

DEFINITION OF BUSINESS ANALYTICS

Business analytics (BA) refers to the skills, technologies, and practices for continuous
iterative exploration and investigation of past business performance to gain insight and drive
business planning. Business analytics focuses on developing new insights and understanding of
business performance based on data and statistical methods. In contrast, business intelligence
traditionally focuses on using a consistent set of metrics to both measure past performance and
guide business planning, which is also based on data and statistical methods.

Business analytics makes extensive use of statistical analysis, including explanatory and
predictive modeling, and fact-based management to drive decision making. It is therefore
closely related to management science. Analytics may be used as input for human decisions or
may drive fully automated decisions. Business intelligence is querying, reporting, online
analytical processing (OLAP), and “alerts.”

In other words, querying, reporting, OLAP, and alert tools can answer questions such as
what happened, how many, how often, where the problem is, and what actions are needed.
Business analytics can answer questions like why is this happening, what if these trends
continue, what will happen next (that is, predict), what is the best that can happen (that is,
optimize).

Examples of Application

Banks, such as Capital One, use data analysis (or analytics, as it is also called in the
business setting) to differentiate among customers based on credit risk, usage and other
characteristics and then to match customer characteristics with appropriate product offerings.
Harrah’s, the gaming firm, uses analytics in its customer loyalty programs. E & J Gallo Winery
quantitatively analyzes and predicts the appeal of its wines. Between 2002 and 2005, Deere &
Company saved more than $1 billion by employing a new analytical tool to better optimize
inventory. A telecoms company that pursues efficient call centre usage over customer service
may save money.

Types of Analytics

• Decision analytics: supports human decisions with visual analytics that the user models to
reflect reasoning.
• Descriptive analytics: gains insight from historical data with reporting, scorecards,
clustering, etc.
• Predictive analytics: employs predictive modeling using statistical and machine learning
techniques
• Prescriptive analytics: recommends decisions using optimization, simulation, etc.

Basic Domains within Analytics

• Behavioral analytics
• Cohort Analysis
• Collections analytics
• Contextual data modeling - supports the human reasoning that occurs after viewing “executive dashboards” or any other visual analytics
• Cyber analytics
• Enterprise Optimization
• Financial services analytics
• Fraud analytics
• Marketing analytics
• Pricing analytics
• Retail sales analytics
• Risk & Credit analytics
• Supply Chain analytics
• Talent analytics
• Telecommunications
• Transportation analytics

History

Analytics have been used in business since the management exercises were put into
place by Frederick Winslow Taylor in the late 19th century. Henry Ford measured the time of
each component in his newly established assembly line. But analytics began to command more
attention in the late 1960s when computers were used in decision support systems. Since then,
analytics have changed and formed with the development of enterprise resource planning
(ERP) systems, data warehouses, and a large number of other software tools and processes.

In later years, business analytics exploded with the introduction of computers.
This change has brought analytics to a whole new level and has made the possibilities endless.
Given how far analytics has come and what the current field of analytics is today, many
people would never think that analytics started in the early 1900s with Mr. Ford himself.

Business analytics depends on sufficient volumes of high quality data. The difficulty in
ensuring data quality is integrating and reconciling data across different systems, and then
deciding what subsets of data to make available.
Previously, analytics was considered a type of after-the-fact method of forecasting
consumer behavior by examining the number of units sold in the last quarter or the last year.
This type of data warehousing required a lot more storage space than it did speed. Now
business analytics is becoming a tool that can influence the outcome of customer interactions.
When a specific customer type is considering a purchase, an analytics-enabled enterprise can
modify the sales pitch to appeal to that consumer. This means the storage space for all that
data must react extremely fast to provide the necessary data in real-time.

Competing on Analytics

Thomas Davenport, professor of information technology and management at Babson


College argues that businesses can optimize a distinct business capability via analytics and
thus better compete. He identifies these characteristics of an organization that are apt to
compete on analytics:

• One or more senior executives who strongly advocate fact-based decision making and,
specifically, analytics

• Widespread use of not only descriptive statistics, but also predictive modeling and
complex optimization techniques

• Substantial use of analytics across multiple business functions or processes

• Movement toward an enterprise level approach to managing analytical tools, data, and
organizational skills and capabilities

DEFINITION OF ANALYTICS

Analytics is the discovery, interpretation, and communication of meaningful patterns in


data. Especially valuable in areas rich with recorded information, analytics relies on the
simultaneous application of statistics, computer programming and operations research to
quantify performance. Analytics often favors data visualization to communicate insight.

Organizations may apply analytics to business data to describe, predict, and improve
business performance. Specifically, areas within analytics include predictive analytics,
prescriptive analytics, enterprise decision management, retail analytics, store assortment and
stock-keeping unit optimization, marketing optimization and marketing mix modeling, web
analytics, sales force sizing and optimization, price and promotion modeling, predictive science,
credit risk analysis, and fraud analytics. Since analytics can require extensive computation, the
algorithms and software used for analytics harness the most current methods in computer
science, statistics, and mathematics.

Analytics vs. Analysis

Analytics is multidisciplinary. There is extensive use of mathematics and statistics, the


use of descriptive techniques and predictive models to gain valuable knowledge from data—
data analysis. The insights from data are used to recommend action or to guide decision making
rooted in business context. Thus, analytics is not so much concerned with individual analyses or
analysis steps, but with the entire methodology. There is a pronounced tendency to use the
term analytics in business settings (e.g., text analytics vs. the more generic text mining) to
emphasize this broader perspective. There is an increasing use of the term advanced analytics,
typically used to describe the technical aspects of analytics, especially in the emerging fields
such as the use of machine learning techniques like neural networks to do predictive modeling.

Examples of Analytics

Marketing Optimization

Marketing has evolved from a creative process into a highly data-driven process.
Marketing organizations use analytics to determine the outcomes of campaigns or efforts and to
guide decisions for investment and consumer targeting. Demographic studies, customer
segmentation, conjoint analysis and other techniques allow marketers to use large amounts of
consumer purchase, survey and panel data to understand and communicate marketing strategy.

Web analytics allows marketers to collect session-level information about interactions on


a website using an operation called sessionization. Google Analytics is an example of a popular
free analytics tool that marketers use for this purpose. Those interactions provide web analytics
information systems with the information necessary to track the referrer, search keywords,
identify the IP address, and track the activities of the visitor. With this information, a marketer can
improve marketing campaigns, website creative content, and information architecture.
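A bare-bones version of sessionization, sketched below with invented timestamps (not part of the module), groups one visitor's hits into sessions whenever the gap between consecutive hits exceeds a 30-minute inactivity timeout, which is a widely used convention.

from datetime import datetime, timedelta

hits = [datetime(2020, 1, 1, 9, 0), datetime(2020, 1, 1, 9, 10),
        datetime(2020, 1, 1, 9, 55), datetime(2020, 1, 1, 13, 0)]

def sessionize(timestamps, timeout=timedelta(minutes=30)):
    # A gap longer than `timeout` between consecutive hits starts a new session.
    sessions, current = [], [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > timeout:
            sessions.append(current)
            current = []
        current.append(curr)
    sessions.append(current)
    return sessions

print(len(sessionize(hits)))  # 3 sessions for the hits above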

Analysis techniques frequently used in marketing include marketing mix modeling,


pricing and promotion analyses, sales force optimization and customer analytics e.g.:
segmentation. Web analytics and optimization of web sites and online campaigns now
frequently work hand in hand with the more traditional marketing analysis techniques. A focus
on digital media has slightly changed the vocabulary so that marketing mix modeling is
commonly referred to as attribution modeling in the digital or marketing mix modeling context.

These tools and techniques support both strategic marketing decisions (such as how
much overall to spend on marketing, how to allocate budgets across a portfolio of brands and
the marketing mix) and more tactical campaign support, in terms of targeting the best potential
customer with the optimal message in the most cost effective medium at the ideal time.

Portfolio Analytics

A common application of business analytics is portfolio analysis. In this, a bank or


lending agency has a collection of accounts of varying value and risk. The accounts may differ
by the social status (wealthy, middle-class, poor, etc.) of the holder, the geographical location,
its net value, and many other factors. The lender must balance the return on the loan with the
risk of default for each loan. The question is then how to evaluate the portfolio as a whole.

The least risk loan may be to the very wealthy, but there are a very limited number of
wealthy people. On the other hand, there are many poor that can be lent to, but at greater risk.
Some balance must be struck that maximizes return and minimizes risk. The analytics solution
may combine time series analysis with many other issues in order to make decisions on when to
lend money to these different borrower segments, or decisions on the interest rate charged to
members of a portfolio segment to cover any losses among members in that segment.

Risk Analytics

Predictive models in the banking industry are developed to bring certainty across the risk
scores for individual customers. Credit scores are built to predict an individual’s delinquency
behavior and are widely used to evaluate the creditworthiness of each applicant. Furthermore, risk
analyses are carried out in the scientific world and the insurance industry. Risk analytics is also
extensively used in financial institutions like online payment gateway companies to analyse whether a
transaction was genuine or fraudulent, using the transaction history of the customer. This is most
commonly applied to credit card purchases: when there is a sudden spike in a customer’s
transaction volume, the customer gets a confirmation call to verify that the transaction was initiated by
him/her. This helps in reducing losses due to such circumstances.
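One very simple version of such a spike check, sketched with invented amounts (not part of the module), flags a transaction that sits far above the customer's own history using a z-score rule; production fraud systems use far richer models and many more signals.

from statistics import mean, stdev

def is_suspicious(history, new_amount, z_threshold=3.0):
    # Flag amounts that are an extreme outlier relative to this customer's past purchases.
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_amount > 2 * mu  # fallback for perfectly flat histories
    return (new_amount - mu) / sigma > z_threshold

past = [1200, 950, 1100, 1050, 980]   # hypothetical card purchases
print(is_suspicious(past, 9500))      # True: would trigger a confirmation call
print(is_suspicious(past, 1300))      # False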

Digital Analytics

Digital analytics is a set of business and technical activities that define, create, collect,
verify or transform digital data into reporting, research, analyses, recommendations,
optimizations, predictions, and automations. This also includes SEO (search engine
optimization), where keyword searches are tracked and that data is used for marketing purposes.
Even banner ads and clicks come under digital analytics. All marketing firms rely on digital
analytics for their digital marketing assignments, where MROI (Marketing Return on Investment)
is important.

Security Analytics

Security analytics refers to information technology (IT) solutions that gather and analyze
security events to bring situational awareness and enable IT staff to understand and analyze
events that pose the greatest risk. Solutions in this area include security information and event
management solutions and user behavior analytics solutions.

Software Analytics

Software analytics is the process of collecting information about the way a piece of
software is used and produced.

Challenges

In the industry of commercial analytics software, an emphasis has emerged on solving


the challenges of analyzing massive, complex data sets, often when such data is in a constant
state of change. Such data sets are commonly referred to as big data. Whereas once the
problems posed by big data were only found in the scientific community, today big data is a
problem for many businesses that operate transactional systems online and, as a result, amass
large volumes of data quickly.

The analysis of unstructured data types is another challenge getting attention in the
industry. Unstructured data differs from structured data in that its format varies widely and
cannot be stored in traditional relational databases without significant effort at data
transformation. Sources of unstructured data, such as email, the contents of word processor
documents, PDFs, geospatial data, etc., are rapidly becoming a relevant source of business
intelligence for businesses, governments and universities. For example, in Britain the discovery
that one company was illegally selling fraudulent doctor’s notes in order to assist people in
defrauding employers and insurance companies, is an opportunity for insurance firms to
increase the vigilance of their unstructured data analysis. The McKinsey Global Institute
estimates that big data analysis could save the American health care system $300 billion per
year and the European public sector €250 billion.

These challenges are the current inspiration for much of the innovation in modern
analytics information systems, giving birth to relatively new machine analysis concepts such as
complex event processing, full text search and analysis, and even new ideas in presentation.
One such innovation is the introduction of grid-like architecture in machine analysis, allowing
increases in the speed of massively parallel processing by distributing the workload to many
computers all with equal access to the complete data set.

Analytics is increasingly used in education, particularly at the district and government


office levels. However, the complexity of student performance measures presents challenges
when educators try to understand and use analytics to discern patterns in student performance,
predict graduation likelihood, improve chances of student success, etc. For example, in a study
involving districts known for strong data use, 48% of teachers had difficulty posing questions
prompted by data, 36% did not comprehend given data, and 52% incorrectly interpreted data.
To combat this, some analytics tools for educators adhere to an over-the-counter data format
(embedding labels, supplemental documentation, and a help system, and making key
package/display and content decisions) to improve educators’ understanding and use of the
analytics being displayed.

One more emerging challenge is dynamic regulatory needs. For example, in the banking
industry, Basel and future capital adequacy needs are likely to make even smaller banks adopt
internal risk models. In such cases, cloud computing and open source R (programming
language) can help smaller banks to adopt risk analytics and support branch level monitoring by
applying predictive analytics.

SOFTWARE ANALYTICS

Software Analytics refers to analytics specific to software systems and related software
development processes. It aims at describing, predicting, and improving development,
maintenance, and management of complex software systems. Methods and techniques of
software analytics typically rely on gathering, analyzing, and visualizing information found in the
manifold data sources in the scope of software systems and their software development
processes---software analytics “turns it into actionable insight to inform better decisions related
to software”.

Software analytics represents a base component of software diagnosis that generally


aims at generating findings, conclusions, and evaluations about software systems and their
implementation, composition, behavior, and evolution. Software analytics frequently uses and
combines approaches and techniques from statistics, prediction analysis, data mining, and
scientific visualization. For example, software analytics can map data by means of software
maps that allow for interactive exploration.

Data under exploration and analysis by software analytics exists across the software lifecycle,
including source code, software requirement specifications, bug reports, test cases, execution
traces/logs, and real-world user feedback. Data plays a critical role in modern software
development, because hidden in the data is the information and insight about the quality of
software and services, the experience that software users receive, as well as the dynamics of
software development.

Insightful information obtained by Software Analytics is information that conveys


meaningful and useful understanding or knowledge towards performing the target task. Typically
insightful information cannot be easily obtained by direct investigation on the raw data without
the aid of analytic technologies.

Actionable information obtained by Software Analytics is information upon which


software practitioners can come up with concrete solutions (better than existing solutions if any)
towards completing the target task.

Software analytics focuses on the trinity of software systems, software users, and the software
development process:

Software Systems. Depending on scale and complexity, the spectrum of software


systems can span from operating systems for devices to large networked systems that consist
of thousands of servers. System quality such as reliability, performance and security, etc., is the
key to success of modern software systems. As the system scale and complexity greatly
increase, larger amount of data, e.g., run-time traces and logs, is generated; and data becomes
a critical means to monitor, analyze, understand and improve system quality.

Software Users. Users are (almost) always right because ultimately they will use the
software and services in various ways. Therefore, it is important to continuously provide the best
experience to users. Usage data collected from the real world reveals how users interact with
software and services. The data is incredibly valuable for software practitioners to better
understand their customers and gain insights on how to improve user experience accordingly.

Software Development Process. Software development has evolved from its traditional
form to exhibiting different characteristics. The process is more agile and engineers are more
collaborative than in the past. Analytics on software development data provides a powerful
mechanism that software practitioners can leverage to achieve higher development productivity.

In general, the primary technologies employed by Software Analytics include analytical


technologies such as machine learning, data mining and pattern recognition, information
visualization, as well as large-scale data computing & processing.

Software Analytics Providers

CAST Software
IBM Cognos Business Intelligence
Kiuwan
Microsoft Azure Application Insights
Nalpeiron Software Analytics
New Relic
Squore
Tableau Software
Trackerbird Software Analytics

EMBEDDED ANALYTICS

Embedded analytics is technology designed to make data analysis and business
intelligence more accessible to any kind of application or user.

According to Gartner analyst Kurt Schlegel, traditional business intelligence was
suffering in 2008 from a lack of integration between the data and the business users. The intention
of this technology is to be more pervasive through real-time autonomy and self-service data
visualization and customization, while decision makers, business users or even customers carry out
their own daily workflow and tasks.
Tools

Actuate
Dundas Data Visualization
GoodData
IBM
icCube
Logi Analytics
Pentaho
Qlik
SAP
SAS
Sisense
Tableau
TIBCO

LEARNING ANALYTICS

Learning analytics is the measurement, collection, analysis and reporting of data about
learners and their contexts, for purposes of understanding and optimizing learning and the
environments in which it occurs. A related field is educational data mining. For general audience
introductions, see:

The Educause Learning Initiative Briefing

The Educause Review on Learning analytics

And the UNESCO “Learning Analytics Policy Brief” (2012)

What is Learning Analytics?

The definition and aims of Learning Analytics are contested. One earlier definition
discussed by the community suggested that “Learning analytics is the use of intelligent data,
learner-produced data, and analysis models to discover information and social connections for
predicting and advising people’s learning.”

But this definition has been criticised:

“I somewhat disagree with this definition - it serves well as an introductory


concept if we use analytics as a support structure for existing education models. I
think learning analytics - at an advanced and integrated implementation - can do
away with pre-fab curriculum models”. George Siemens, 2010.
“In the descriptions of learning analytics we talk about using data to
“predict success”. I’ve struggled with that as I pore over our databases. I’ve come
to realize there are different views/levels of success.” Mike Sharkey, 2010.

A more holistic view than a mere definition is provided by the framework of learning
analytics by Greller and Drachsler (2012). It uses a general morphological analysis (GMA) to
divide the domain into six “critical dimensions”.

A systematic overview on learning analytics and its key concepts is provided by Chatti et
al. (2012) and Chatti et al. (2014) through a reference model for learning analytics based on four
dimensions, namely data, environments, context (what?), stakeholders (who?), objectives
(why?), and methods (how?).

It has been pointed out that there is a broad awareness of analytics across educational
institutions for various stakeholders, but that the way ‘learning analytics’ is defined and
implemented may vary, including:

• for individual learners to reflect on their achievements and patterns of behaviour in


relation to others;
• as predictors of students requiring extra support and attention;
• to help teachers and support staff plan supporting interventions with individuals and
groups;
• for functional groups such as course team seeking to improve current courses or
develop new curriculum offerings; and
• for institutional administrators taking decisions on matters such as marketing and
recruitment or efficiency and effectiveness measures.

In that briefing paper, Powell and MacNeill go on to point out that some motivations and
implementations of analytics may come into conflict with others, for example highlighting
potential conflict between analytics for individual learners and organisational stakeholders.

Gašević, Dawson, and Siemens argue that computational aspects of learning analytics
need to be linked with the existing educational research if the field of learning analytics is to
deliver to its promise to understand and optimize learning.

Differentiating Learning Analytics and Educational Data Mining

Differentiating the fields of educational data mining (EDM) and learning analytics (LA)
has been a concern of several researchers. George Siemens takes the position that educational
data mining encompasses both learning analytics and academic analytics, the latter of which
is aimed at governments, funding agencies, and administrators instead of learners and faculty.
Baepler and Murdoch define academic analytics as an area that “...combines select institutional
data, statistical analysis, and predictive modeling to create intelligence upon which learners,
instructors, or administrators can change academic behavior”. They go on to attempt to
disambiguate educational data mining from academic analytics based on whether the process is
hypothesis driven or not, though Brooks questions whether this distinction exists in the literature.
Brooks instead proposes that a better distinction between the EDM and LA communities is in
the roots of where each community originated, with authorship at the EDM community being
dominated by researchers coming from intelligent tutoring paradigms, and learning analytics
researchers being more focused on enterprise learning systems (e.g. learning content
management systems).

Regardless of the differences between the LA and EDM communities, the two areas
have significant overlap both in the objectives of investigators as well as in the methods and
techniques that are used in the investigation. In the MS program offering in Learning Analytics
at Teachers College, Columbia University, students are taught both EDM and LA methods.

Learning Analytics in Higher Education

The first graduate program focused specifically on learning analytics was created by Dr.
Ryan Baker and launched in the Fall 2015 semester at Teachers College - Columbia University.
The program description states that “data about learning and learners are being generated
today on an unprecedented scale. The fields of learning analytics (LA) and educational data
mining (EDM) have emerged with the aim of transforming this data into new insights that can
benefit students, teachers, and administrators. As one of world’s leading teaching and research
institutions in education, psychology, and health, we are proud to offer an innovative graduate
curriculum dedicated to improving education through technology and data analysis.”

CHAPTER EXERCISES

Direction: Discuss the following. Use short bond paper for your answer.

1. Explain the importance of analytics in your program.

2. How can analytics be useful in the following sectors?

a. Health Sectors

b. Business sectors

c. Tourism

d. Agriculture

e. Economics

3. Identify the type of measurement scale— nominal, ordinal, interval, or ratio— suggested by
each statement:

a) John finished the math test in 35 minutes, whereas Jack finished the same test in 25
minutes.
b) Jack speaks French, but John does not.
c) Jack is taller than John.
d) John is 6 feet 2 inches tall.
e) John’s IQ is 120, whereas Jack’s IQ is 110.

4. Supermarket Sales

The supermarket data set contains over 14,000 transactions made by supermarket customers


over a period of approximately two years. (The data are not real, but real supermarket chains
have huge data sets just like this one.) A small sample of the data appears in the Figure
below. Column B contains the date of the purchase, column C is a unique identifier for each
customer, columns D–H contain information about the customer, columns I–K contain the
location of the store, columns L–N contain information about the product purchased (these
columns have been hidden to conserve space), and the last two columns indicate the number of
items purchased and the amount paid.

a. Determine which variables are categorical and numerical.

b. Summarize the variables using a bar graph.

SUGGESTED READINGS

Explore and read articles on Analytics in http://www.library.educause.edu/

CHAPTER 3: DESCRIPTIVE STATISTICAL MEASURES

OVERVIEW

The goal of this chapter is to make sense of data by constructing appropriate summary
measures, tables, and graphs. Our purpose here is to present the data in a form that makes
sense to people. This chapter also discusses the types of data, variables, measures of central
tendency, measures of variability, and outliers. Techniques and tips for using Microsoft Excel
are also included to guide you in using the application.

OBJECTIVES

▪ Differentiate and understand sample and population


▪ Define data sets, variables and observations
▪ Enumerate types of data
▪ Understand the process in descriptive measures for categorical variables
▪ Understand the process in descriptive measures for numerical variables
▪ Learn and understand the use of the measures of central tendency and variability
▪ Understand the use of outliers and missing values

We begin with a short discussion of several important concepts: populations and
samples, data sets, variables and observations, and types of data.

POPULATIONS AND SAMPLES

First, we distinguish between a population and a sample. A population includes all of the
entities of interest: people, households, machines, or whatever.

In these situations and many others, it is virtually impossible to obtain information about
all members of the population. For example, it is far too costly to ask all potential voters which
presidential candidates they prefer. Therefore, we often try to gain insights into the
characteristics of a population by examining a sample, or subset, of the population.

A population includes all of the entities of interest in a study. A sample is a subset of


the population, often randomly chosen and preferably representative of the population as a
whole. We use the terms population and sample a few times in this chapter, which is why we
have defined them here. However, the distinction is not really important until later chapters. Our
intent in this chapter is to focus entirely on the data in a given data set, not to generalize beyond
it. Therefore, the given data set could be a population or a sample from a population. For now,
the distinction is irrelevant.

DATA SETS, VARIABLES, AND OBSERVATIONS

A data set is generally a rectangular array of data where the columns contain variables,
such as height, gender, and income, and each row contains an observation. Each observation
includes the attributes of a particular member of the population: a person, a company, a city, a
machine, or whatever. This terminology is common, but other terms are often used. A variable
(column) is often called a field or an attribute, and an observation (row) is often called a case
or a record. Also, data sets are occasionally rearranged, so that the variables are in rows and
the observations are in columns. However, the most common arrangement by far is to have
variables in columns, with variable names in the top row, and observations in the remaining
rows.

A data set is usually a rectangular array of data, with variables in columns and
observations in rows. A variable (or field or attribute) is a characteristic of members of a
population, such as height, gender, or salary. An observation (or case or record) is a list of all
variable values for a single member of a population.
Table 1. Environmental Survey Data

Consider Table 1. Each observation lists the person’s age, gender, state of residence,
number of children, annual salary, and opinion of the president’s environmental policies. These
six pieces of information represent the variables. It is customary to include a row (row 1 in this
case) that lists variable names. These variable names should be concise but meaningful. Note
that an index of the observation is often included in column A. If you sort on other variables, you
can always sort on the index to get back to the original sort order.

TYPES OF DATA

There are several ways to categorize data. A basic distinction is between numerical and
categorical data. The distinction is whether you intend to do any arithmetic on the data. It
makes sense to do arithmetic on numerical data, but not on categorical data. (Actually, there is
a third data type, a date variable. As you may know, Excel stores dates as numbers, but for
obvious reasons, dates are treated differently from typical numbers.)

In the questionnaire data, Age, Children, and Salary are clearly numerical. For example,
it makes perfect sense to sum or average any of these. In contrast, Gender and State are
clearly categorical because they are expressed as text, not numbers.

The Opinion variable is less obvious. It is expressed numerically, on a 1-to-5 scale.


However, these numbers are really only codes for the categories “strongly disagree,” “disagree,”
“neutral,” “agree,” and “strongly agree.” There is never any intent to perform arithmetic on these
numbers; in fact, it is not really appropriate to do so. Therefore, it is most appropriate to treat the
Opinion variable as categorical. Note, too, that there is a definite ordering of its categories,
whereas there is no natural ordering of the categories for the Gender or State variables. When
there is a natural ordering of categories, the variable is classified as ordinal. If there is no natural
ordering, as with the Gender variable or the State variable, the variable is classified as nominal.
Remember, though, that both ordinal and nominal variables are categorical.

Excel Tip 1: Horizontal Alignment Conventions

Excel automatically right-aligns numbers and left-aligns text. We will use this
automatic formatting, but we will also add our own conventions. Specifically, we will right-align all numbers that are available for arithmetic; we will left-align all text such as Male, Female, Yes, and No; and we will center-align everything else, including dates, indexes such as the Person index, and numbers that serve only as category codes, such as the 1-to-5 Opinion values.

Excel Tip 2: Documenting with Cell Comments

How do you remember, for example, that “1” stands for “strongly disagree” in the
Opinion variable? You can enter a comment—a reminder to yourself and others—in
any cell. To do so, right-click a cell and select Insert Comment. A small red tag
appears in any cell with a comment. Moving the cursor over that cell causes the
comment to appear. You will see numerous comments in the files that accompany
the book.

A dummy variable is a 0–1 coded variable for a specific category. It is coded as 1 for all
observations in that category and 0 for all observations not in that category.

The method of categorizing a numerical variable is called binning (putting the data into
discrete bins), and it is also very common. (It is also called discretizing.) The purpose of the
study dictates whether age should be treated numerically or categorically; there is no absolute
right or wrong way.

Numerical variables can be classified as discrete or continuous. The basic distinction


is whether the data arise from counts or continuous measurements. The variable Children is
clearly a count (discrete), whereas the variable Salary is best treated as continuous. This
distinction between discrete and continuous variables is sometimes important because it
dictates the most natural type of analysis.

A numerical variable is discrete if it results from a count, such as the number of children.
A continuous variable is the result of an essentially continuous measurement, such as weight
or height.
Cross-sectional data are data on a cross section of a population at a distinct point in
time. Time series data are data collected over time.

DESCRIPTIVE MEASURES FOR CATEGORICAL VARIABLES

This section discusses methods for describing a categorical variable. Because it is not
appropriate to perform arithmetic on the values of the variable, there are only a few possibilities
for describing the variable, and these are all based on counting. First, you can count the number
of categories. Many categorical variables such as Gender have only two categories. Others
such as Region can have more than two categories. As you count the categories, you can also
give the categories names, such as Male and Female.

Once you know the number of categories and their names, you can count the number of
observations in each category (this is referred to as the count of categories). The resulting
counts can be reported as “raw counts” or they can be transformed into percentages of totals.
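For readers who want to reproduce these counts outside Excel, here is a minimal sketch in Python using the pandas library; the Gender values are invented purely for illustration.

    import pandas as pd

    # Hypothetical Gender column, for illustration only
    gender = pd.Series(["Male", "Female", "Female", "Male", "Female"])

    counts = gender.value_counts()                           # raw counts per category
    percentages = gender.value_counts(normalize=True) * 100  # counts as percentages of the total

    print(counts)
    print(percentages.round(1))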

DESCRIPTIVE MEASURES FOR NUMERICAL VARIABLES

There are many ways to summarize numerical variables, both with numerical summary
measures and with charts, and we discuss the most common ways in this section. But before
we get into details, it is important to understand the basic goal of this section. We begin with a
numerical variable such as Salary, where there is one observation for each per- son. Our basic
goal is to learn how these salaries are distributed across people. To do this, we can ask a
number of questions, including the following. (1) What are the most “typical” salaries? (2) How
spread out are the salaries? (3) What are the “extreme” salaries on either end? (4) Is a chart of
the salaries symmetric about some middle value, or is it skewed in some direction? (5) Does the
chart of salaries have any other peculiar features besides possible skewness? In the next
chapter, we explore methods for checking whether a variable such as Salary is related to other
variables, but for now we simply want to explore the distribution of values in the Salary column.

MEASURES OF CENTRAL TENDENCY

There are three common measures of central tendency, all of which try to answer the
basic question of which value is most “typical.” These are the mean, the median, and the
mode.

The MEAN

The mean is the average of all values. If the data set represents a sample from some
larger population, this measure is called the sample mean and is denoted by X̄ (pronounced “X-
bar”). If the data set represents the entire population, it is called the population mean and is
denoted by μ (the Greek letter mu). This distinction is not important in this chapter, but it will
become relevant in later chapters when we discuss statistical inference. In either case, the
formula for the mean is given below.

The most widely used measure of central tendency is the mean, or arithmetic average. It
is the sum of all the scores in a distribution divided by the number of cases. In terms of a
formula, it is

x̄ = ΣX / N

where
x̄ = mean
ΣX = sum of raw scores
N = number of cases

Suppose Anna’s IQ scores in 7 areas are:

IQ scores: 112 121 115 101 119 109 100


Applying the formula, x̄ = (112 + 121 + 115 + 101 + 119 + 109 + 100) / 7 = 111, which is Anna’s mean IQ score.

For Excel data sets, you can calculate the mean with the AVERAGE function.
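Outside Excel, the same calculation can be sketched in a few lines of Python, using Anna’s seven IQ scores from the example above (AVERAGE is the Excel equivalent):

    from statistics import mean

    iq_scores = [112, 121, 115, 101, 119, 109, 100]

    # Mean = sum of the raw scores divided by the number of cases
    print(sum(iq_scores) / len(iq_scores))  # 111.0
    print(mean(iq_scores))                  # same result using the statistics module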

The MEDIAN

The median is the middle observation when the data are sorted from smallest to largest.
If the number of observations is odd, the median is literally the middle observation. For example,
if there are nine observations, the median is the fifth smallest (or fifth largest). If the number of
observations is even, the median is usually defined as the average of the two middle
observations (although there are some slight variations of this definition). For example, if there
are 10 observations, the median is usually defined as the average of the fifth and sixth smallest
values.

Consider the following distribution of scores, where the median is 18:

14 15 16 17 18 19 20 21 22
In the following 10 scores we seek the point below which 5 scores fall:

14 16 16 17 18 19 20 20 21 22

The point below which 5 scores, or 50 percent of the cases, fall is halfway between 18
and 19. Thus, the median of this distribution is 18.5.

Consider the following scores:

18 20 22 25 25 30

Any point from 22.5 to 24.5 fits the definition of the median. By convention, in such cases the median is defined as halfway between these lowest and highest points, in this case (22.5 + 24.5)/2 = 23.5.

Table 2. Mr. Li’s Physics Exam Scores

(1)    (2)    (3)    (4)
 X      f     fX     cf
23      2     46     18
22      2     44     16
21      4     84     14
20      4     80     10
19      2     38      6
18      2     36      4
17      0      0      2
16      2     32      2

To find the median of Mr. Li’s physics exam scores, we need to find the point below
which 18/2 = 9 scores lie. We first create a cumulative frequency column (cf, column 4 in Table
2). The cumulative frequency for each interval is the number of scores in that interval plus the
total number of scores below it. Since the interval between 15.5 and 16.5 has no scores below it,
its cf is equal to its f, which is 2. Since there were no scores of 17, the cf for 17 is still 2. Then
adding the two scores of 18 yields a cumulative frequency of 4. Continuing up the frequency
column, we get cf ’s of 10, 14, 16, and, finally, 18, which is equal to the number of students.

The point separating the bottom nine scores from the top nine scores, the median, is
somewhere in the interval 19.5 to 20.5. Most statistics texts say to partition this interval to locate
the median. The cf column tells us that we have six scores below 19.5. We need to add three
scores to give us half the scores (9). Since there are four scores of 20, we go three-fourths of
the way from 19.5 to 20.5 to report a median of 20.25. Note that many computer programs,
including the Statistical Package for the Social Sciences (SPSS) and the Statistical Analysis
System (SAS), simply report the midpoint of the interval—in this case 20—as the median.

The median can be calculated in Excel with the MEDIAN function.
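A short Python sketch of the same definitions, using the two score distributions discussed above; like Excel’s MEDIAN function, it averages the two middle values when the number of observations is even:

    from statistics import median

    odd_set = [14, 15, 16, 17, 18, 19, 20, 21, 22]       # nine scores: the middle value is the median
    even_set = [14, 16, 16, 17, 18, 19, 20, 20, 21, 22]  # ten scores: average of the two middle values

    print(median(odd_set))   # 18
    print(median(even_set))  # 18.5

Note that, like SPSS and SAS, this simple function reports the midpoint rather than the interpolated value (20.25) computed from the grouped frequency table.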

The MODE

The mode is the value that appears most often, and it can be calculated in Excel with
the MODE function. In most cases where a variable is essentially continuous, the mode is not
very interesting because it is often the result of a few lucky ties.

The mode is the value in a distribution that occurs most frequently. It is the simplest to
find of the three measures of central tendency because it is determined by inspection rather
than by computation. Given the distribution of scores

14 16 16 17 18 19 19 19 21 22

you can readily see that the mode of this distribution is 19 because it is the most frequent score.
Sometimes there is more than one mode in a distribution. For example, if the scores had been

14 16 16 16 18 19 19 19 21 22

you would have two modes: 16 and 19. This kind of distribution with two modes is called bimodal. A distribution with exactly three modes is called trimodal, and distributions with several modes in general are called multimodal.

The mode is the least useful indicator of central value in a distribution for two reasons.
First, it is unstable. For example, two random samples drawn from the same population may
have quite different modes. Second, a distribution may have more than one mode. In published
research, the mode is seldom reported as an indicator of central tendency. Its use is largely
limited to inspectional purposes. A mode may be reported for any of the scales of measurement,
but it is the only measure of central tendency that may legitimately be used with nominal scales.

Excel Tip 3: Working with MODE Function

Two new versions of the MODE function were introduced in Excel 2010:
MODE.MULT and MODE.SNGL. The latter is the same as the older MODE
function. The MULT version returns multiple modes if there are multiple modes.
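The behaviour of MODE.SNGL and MODE.MULT can be mirrored with a short Python sketch (multimode requires Python 3.8 or later); the score lists are the ones used earlier in this section.

    from statistics import mode, multimode

    scores = [14, 16, 16, 17, 18, 19, 19, 19, 21, 22]
    print(mode(scores))        # 19: the single most frequent value

    bimodal = [14, 16, 16, 16, 18, 19, 19, 19, 21, 22]
    print(multimode(bimodal))  # [16, 19]: both modes of the bimodal distribution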

Shapes of Distributions

Frequency distributions can have a variety of shapes. A distribution is symmetrical when


the two halves are mirror images of each other. In a symmetrical distribution, the values of the
mean and the median coincide. If such a distribution has a single mode, rather than two or more
modes, the three indexes of central tendency will coincide, as shown in Figure 2.

Figure 2. Symmetrical Distribution

[Figure 2 depicts a symmetrical curve, with the number of scores on the vertical axis; the mean, median, and mode coincide at its center.]

If a distribution is not symmetrical, it is described as skewed, pulled out to one end or the
other by the presence of extreme scores. In skewed distributions, the values of the measures of
central tendency differ. In such distributions, the value of the mean, because it is influenced by
the size of extreme scores, is pulled toward the end of the distribution in which the extreme
scores lie, as shown in Figures 3 and 4.

Figure 3. Negatively Skewed Distribution

Figure 4. Positively Skewed Distribution

The effect of extreme values is less on the median because this index is influenced not
by the size of scores but by their position. Extreme values have no impact on the mode because
this index has no relation with either of the ends of the distribution. Skews are labeled according
to where the extreme scores lie. A way to remember this is “The tail names the beast.” Figure 3 shows a negatively skewed distribution, whereas Figure 4 shows a positively skewed distribution.

MEASURES OF VARIABILITY

Although indexes of central tendency help researchers describe data in terms of average
value or typical measure, they do not give the total picture of a distribution. The mean values of
two distributions may be identical, whereas the degree of dispersion, or variability, of their
scores might be different. In one distribution, the scores might cluster around the central value;
in the other, they might be scattered. For illustration, consider the following distributions of
scores:

(a) 24, 24, 25, 25, 25, 26, 26 𝑋̅= 175/7 = 25

(b) 16, 19, 22, 25, 28, 30, 35 𝑋̅= 175/7 = 25

The value of the mean in both these distributions is 25, but the degree of scattering of
the scores differs considerably. The scores in distribution (a) are obviously much more
homogeneous than those in distribution (b). There is clearly a need for indexes that can
describe distributions in terms of variation, spread, dispersion, heterogeneity, or scatter of
scores. Three indexes are commonly used for this purpose: range, variance, and standard
deviation.

a. Range

The simplest of all indexes of variability is the range. It is the difference between the
upper real limit of the highest score and the lower real limit of the lowest score. In statistics, any
score is thought of as representing an interval width from halfway between that score and the
next lowest score (lower real limit) up to halfway between that score and the next highest score
(upper real limit).

For example, if several children have a recorded score of 12 pull-ups on a physical


fitness test, their performances probably range from those who just barely got their chin over the
bar the twelfth time and were finished (lower real limit) to those who completed 12 pull-ups,
came up again, and almost got their chin over the bar, but did not quite make it for pull-up 13
(upper limit). Thus, a score of 12 is considered as representing an interval from halfway
between 11 and 12 (11.5) to halfway between 12 and 13 (12.5) or an interval of 1. For example,
given the following distribution of scores, you find the range by subtracting 1.5 (the lower limit of
the lowest score) from 16.5 (the upper limit of the highest score), which is equal to 15.

2 10 11 12 13 14 16

Formula R = ( Xh−Xl ) + I

where
R = range
Xh = highest value in a distribution
Xl = lowest value in a distribution
I = interval width
Applying the formula, subtract the lower value from the higher and add 1: 16 − 2 + 1 = 15. In a frequency distribution, 1 is the most common interval width.
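A minimal Python sketch of the range formula, using the same seven scores:

    scores = [2, 10, 11, 12, 13, 14, 16]
    interval_width = 1  # the most common interval width in a frequency distribution

    # R = (Xh - Xl) + I, the distance between the real limits of the highest and lowest scores
    r = (max(scores) - min(scores)) + interval_width
    print(r)  # 15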

The range is an unreliable index of variability because it is based on only two values, the
highest and the lowest. It is not a stable indicator of the spread of the scores. For this reason,
the use of the range is mainly limited to inspectional purposes. Some research reports refer to
the range of distributions, but such references are usually used in conjunction with other
measures of variability, such as variance and standard deviation.

B. Variance and Standard Deviation

Variance and standard deviation are the most frequently used indexes of variability.
They are both based on deviation scores—scores that show the difference between a raw score
and the mean of the distribution. The formula for a deviation score is

x = X − X̄
where
x = deviation score
X = raw score
X̄ = mean
Scores below the mean will have negative deviation scores, and scores above the mean
will have positive deviation scores.

By definition, the sum of the deviation scores in a distribution is always 0. Thus, to use
deviation scores in calculating measures of variability, you must find a way to get around the
fact that Σx = 0. The technique used is to square each deviation score so that they all become
positive numbers. If you then sum the squared deviations and divide by the number of scores,
you have the mean of the squared deviations from the mean, or the variance. In mathematical
form, variance is

σ² = Σx² / N
where
σ² = variance
Σ = sum of
x² = deviation of each score from the mean (X − X̄), squared; otherwise known as the deviation score squared
N = number of cases in the distribution

Table 3. Variance of Mr. Li’s Physics Exam Scores

(1)    (2)    (3)    (4)    (5)    (6)    (7)     (8)
 X      f     fX      x     x²    fx²     X²     fX²
23      2     46     +3      9     18    529    1058
22      2     44     +2      4      8    484     968
21      4     84     +1      1      4    441    1764
20      4     80      0      0      0    400    1600
19      2     38     −1      1      2    361     722
18      2     36     −2      4      8    324     648
17      0      0
16      2     32     −4     16     32    256     512
       N = 18        ΣX = 360      Σx² = 72      ΣX² = 7272

In column 4 of Table 3, we see the deviation scores, the differences between each score and the mean. Column 5 shows each deviation score squared (x²), and column 6 shows the
frequency of each score from column 2 multiplied by x² (column 5). Summing column 6 gives us the sum of the squared deviation scores, Σx² = 72. Dividing this by the number of scores gives us the mean of the squared deviation scores, the variance.

The formula above is convenient only when the mean is a whole number. To avoid the tedious task of working with squared mixed-number deviation scores such as 7.6667², we recommend that students always use the following raw-score formula when the computation must be done “by hand”:

σ² = [ΣX² − (ΣX)²/N] / N

where

σ² = variance
ΣX² = sum of the squares of each score (i.e., each score is first squared, and then these squares are summed)
(ΣX)² = sum of the scores squared (the scores are first summed, and then this total is squared)
N = number of cases

Column 7 in Table 3 shows the square of each raw score. Column 8 shows these raw-score squares multiplied by frequency. Summing this fX² column gives us the sum of the squared raw scores:

σ² = [ΣX² − (ΣX)²/N] / N = [7272 − (360)²/18] / 18 = (7272 − 7200) / 18 = 72 / 18 = 4

In most cases, educators prefer an index that summarizes the data in the same unit of
measurement as the original data. Standard deviation (σ), the positive square root of variance,
provides such an index. By definition, the standard deviation is the square root of the mean of
the squared deviation scores. Rewriting this symbol, we obtain

σ = √(Σx² / N)

For Mr. Li’s physics exam scores, the standard deviation is

σ = √(72/18) = √4 = 2
The standard deviation belongs to the same statistical family as the mean; that is, like
the mean, it is an interval or ratio statistic, and its computation is based on the size of individual
scores in the distribution. It is by far the most frequently used measure of variability and is used
in conjunction with the mean.

There is a fundamental problem with variance as a measure of variability: It is in


squared units. For example, if the observations are measured in dollars, the variance is in
squared dollars. A more natural measure is the square root of variance. This is called the
standard deviation. Again, there are two versions of standard deviation.

The population standard deviation, denoted by σ, is the square root of the population variance defined earlier. To calculate either standard deviation in Excel, you can first find the variance
with the VAR or VARP function and then take its square root. Alternatively, you can find it
directly with the STDEV (sample) or STDEVP (population) function.
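The worked example for Mr. Li’s physics exam can be checked with a short Python sketch; pvariance and pstdev are the population versions, matching the formulas above (VARP/VAR.P and STDEVP/STDEV.P in Excel).

    from statistics import pvariance, pstdev

    # Mr. Li's 18 physics exam scores, expanded from the frequency table (Table 3)
    scores = [23]*2 + [22]*2 + [21]*4 + [20]*4 + [19]*2 + [18]*2 + [16]*2

    print(pvariance(scores))  # 4, the population variance from the worked example
    print(pstdev(scores))     # 2.0, the population standard deviation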

OUTLIERS AND MISSING VALUES

Most textbooks on data analysis, including this one, tend to use example data sets that
are “cleaned up.” Unfortunately, the data sets you are likely to encounter in your job are often
not so clean. Two particular problems you will encounter are outliers and missing data, the
topics of this section. There are no easy answers for dealing with these problems, but you
should at least be aware of the issues.

Outliers

An outlier is literally a value or an entire observation (row) that lies well outside of the
norm. For the baseball data, Alex Rodriguez’s salary of $32 million is definitely an outlier. This is
indeed his correct salary—the number wasn’t entered incorrectly—but it is way beyond what
most players make. Actually, statisticians disagree on an exact definition of an outlier. Going by a common rule of thumb, you might define an outlier as any value more than three standard deviations from the mean. Let’s just agree to define outliers as
extreme values, and then for any particular data set, you can decide how extreme a value
needs to be to qualify as an outlier.

Sometimes an outlier is easy to detect and deal with. For example, this is often the case
with data entry errors. Suppose a data set includes a Height variable, a person’s height
measured in inches, and you see a value of 720. This is certainly an outlier—and it is certainly
an error. Once you spot it, you can go back and check this observation to see what the person’s
height should be. Maybe an extra 0 was accidentally appended and the true value is 72. In any
case, this type of outlier is usually easy to discover and fix.

It isn’t always easy to detect outliers, but an even more important issue is what to do
about them when they are detected. Of course, if they are due to data entry errors, they can be
fixed, but what if they are legitimate values like Alex Rodriguez’s salary? One or a few wild
outliers like this one can dominate a statistical analysis. For example, they can make a mean or
standard deviation much different than if the outliers were not present.

For this reason, some people argue, possibly naïvely, that outliers should be eliminated
before running statistical analyses. However, it is not appropriate to eliminate outliers simply to
produce “nicer” results. There has to be a legitimate reason for eliminating outliers, and such a
reason sometimes exists. For example, suppose you want to analyze salaries of “typical”
managers at your company. Then it is probably appropriate to eliminate the CEO and possibly other
high-ranking executives from the analysis, arguing that they aren’t really part of the population
of interest and would just skew the results. Or if you are interested in the selling prices of
“typical” homes in your community, it is probably appropriate to eliminate the few homes that
sell for over $2 million, again arguing that these are not the types of homes you are interested in.
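As a minimal sketch of the three-standard-deviation rule of thumb, the following Python snippet flags extreme values in a hypothetical salary column (values in thousands, invented for illustration):

    from statistics import mean, pstdev

    # Twenty "typical" salaries (in $000s) plus one extreme value
    salaries = list(range(60, 100, 2)) + [1000]

    m, s = mean(salaries), pstdev(salaries)
    # Rule of thumb: flag anything more than three standard deviations from the mean
    outliers = [x for x in salaries if abs(x - m) > 3 * s]
    print(outliers)  # [1000]

Note that a wild outlier inflates the standard deviation itself, which is one reason this is only a rule of thumb rather than a strict definition.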

Missing Values

There are no missing data in the baseball salary data set. All 843 observations have a
value for each of the four variables. For real data sets, however, this is probably the exception
rather than the rule. Unfortunately, most real data sets have gaps in the data. This could be
because a person didn’t want to provide all the requested personal information (what business
is it of yours how old I am or whether I drink alcohol?), it could be because data doesn’t exist
(stock prices in the 1990s for companies that went public after 2000), or it could be because
some values are simply unknown. Whatever the reason, you will undoubtedly encounter data
sets with varying degrees of missing values. As with outliers, there are two issues: how to detect
missing values and what to do about them. The first issue isn’t as simple as you might imagine.
For an Excel data set, you might expect missing data to be obvious from blank cells. This is
certainly one possibility, but there are others. Missing data are coded in a variety of strange
ways. One common method is to code missing values with an unusual number such as −9999 or 9999. Another method is to code missing values with a symbol such as − or *. If you know
the code (and it is often supplied in a footnote), then it is usually a good idea, at least in Excel,
to perform a global search and replace, replacing all of the missing value codes with blanks.

The more important issue is what to do about missing values. One option is to ignore
them. Then you will have to be aware of how the software deals with missing values. For
example, if you use Excel’s AVERAGE function on a column of data with missing values, it
reacts the way you would hope and expect—it adds all the non-missing values and divides by
the number of non-missing values. StatTools reacts in the same way for all of the measures
discussed in this chapter (after alerting you that there are indeed missing values). We will say
more about how StatTools deals with missing data for other analyses in later chapters. If you
are using other statistical software such as SPSS or SAS, you should read its online help to
learn how its various statistical analyses deal with missing data.

Because this is such an important topic in real-world data analysis, researchers have
studied many ways of filling in the gaps so that the missing data problem goes away (or is at
least disguised). One possibility is to fill in all of the missing values in a column with the average
of the non-missing values in that column. Indeed, this is an option in some software packages,
but we don’t believe it is usually a very good option. (Is there any reason to believe that missing
values would be average values if they were known? Probably not.) Another possibility is to
examine the non-missing values in the row of any missing value. It is possible that they provide
some clues on what the missing value should be. For example, if a person is male, is 55 years
old, has an MBA degree from Harvard, and has been a manager at an oil company for 25
years, this should probably help to predict his missing salary. (It probably isn’t below $100,000.)
We will not discuss this issue any further here because it is quite complex, and there are no
easy answers. But be aware that you will undoubtedly have to deal with missing data at some
point in your job, either by ignoring the missing values or by filling in the gaps in some way.
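For readers working in Python, a small pandas sketch illustrates the same ideas: converting a missing-value code to true blanks, averaging over the non-missing values only (as Excel’s AVERAGE does), and the often questionable option of filling gaps with the mean. The salary values and the −9999 code are invented for illustration.

    import pandas as pd
    import numpy as np

    # Hypothetical salary column in which -9999 was used as a missing-value code
    salary = pd.Series([55000, -9999, 62000, 48000, -9999, 71000])

    salary = salary.replace(-9999, np.nan)  # turn the code into a true missing value
    print(salary.mean())                    # 59000.0: ignores the missing values
    print(salary.fillna(salary.mean()))     # fill the gaps with the column mean (use with caution)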

CHAPTER EXERCISES

1. Provide answers as requested, given the following distribution: 15, 14, 14, 13, 11, 10, 10, 10,
8, 5.

a) Calculate the mean.


b) Determine the value of the median.
c) Determine the value of the mode.

2. Suppose that Ms. Llave’s English class has the following scores in two tests, as shown in the table below:

Student’s Name Test 1 Scores Test 2 Scores


Jude 27 26
Lara 28 31
Nicole 26 30
Christopher 32 31
Desserie 28 30
Lyra 27 29
Lance 25 24
Vince 24 23
Marc 23 18
Elyza 24 26
Philip 29 30
Jomarie 27 28
Earl 30 29
Budgett 19 22
Aron 25 27

Determine the following of each of the tests (1 and 2):

a. Mean

b. Median

c. Mode

d. Variance

e. Standard Deviation

SUGGESTED READINGS

Read articles on Measures of Central Tendency at http://statistics.alerd.com/statistical-guides/measures-central-tendency-mean-mode-meadian.php

Read articles on Measures of Variability at http://onlinestatbook.com/2/summarizing_distributions/variability.html
CHAPTER 4: ANALYTICS ON SPREADSHEETS

OVERVIEW

This section discusses a great tool that was introduced in Excel 2007: tables. Tables were
somewhat available in previous versions of Excel, but they were never called tables before,
and some of the really useful features of Excel 2007 tables were new at the time. This
chapter discusses how to filter, sort, and summarize data using spreadsheets.

OBJECTIVES

▪ Learn how to use Microsoft Excel


▪ Learn how to use a spreadsheet to sort, filter, and summarize data
▪ Learn to summarize data through graphs and tables


EXCEL TABLES FOR FILTERING, SORTING, AND SUMMARIZING

It is useful to begin with some terminology and history. In the previous chapter, we discussed data arranged in a rectangular range of rows and columns, where each row is an
observation and each column is a variable, with variable names at the top of each column.
Informally, we refer to such a range as a data set. In fact, this is the technical term used by
StatTools. In previous versions of Excel, data sets of this form were called lists, and Excel
provided several tools for dealing with lists. In Excel 2007, recognizing the importance of data
sets, Microsoft made them much more prominent and provided even better tools for analyzing
them. Specifically, you now have the ability to designate a rectangular data set as a table and
then employ a number of powerful tools for analyzing tables. These tools include filtering,
sorting, and summarizing.

Let’s consider the data in Table 4. The data set contains 1000 customers of HyTex, a (fictional) direct marketing company, for the current year. The definitions of the variables are
fairly straightforward, but details about several of them are listed in cell comments in row 1.
HyTex wants to find some useful and quick information about its customers by using an Excel
table. How can it proceed?

Table 4. HyTex Customer Data

The range A1:O1001 is in the form of a data set—it is a rectangular range bounded by
blank rows and columns, where each row is an observation, each column is a variable, and
variable names appear in the top row. Therefore, it is a candidate for an Excel table. However, it
doesn’t benefit from the new table tools until you actually designate it as a table. To do so,
select any cell in the data set, click the Table button in the left part of the Insert ribbon (see
Figure 5), and accept the default options. Two things happen. First, the data set is designated
as a table, it is formatted nicely, and a dropdown arrow appears next to each variable name, as
shown in Figure 7. Second, a new Table Tools Design ribbon becomes available (see Figure 6).
This ribbon is available any time the active cell is inside a table. Note that the table is named
Table1 by default (if this is the first table). However, you can change this to a more descriptive
name if you like.

Figure 5. Inserting Ribbon with Table Button

Figure 6. Table Tools Design Ribbon

One handy feature of Excel tables is that the variable names remain visible even when
you scroll down the screen. Try it to see how it works. When you scroll down far enough that the
variable names would disappear, the column headers, A, B, C, and so on, change to the
variable names. Therefore, you no longer need to freeze panes or split the screen to see the
variable names. However, this works only when the active cell is within the table. If you click
outside the table, the column headers revert back to A, B, C, and so on.

Figure 7. Table with Dropdown Arrows Next to Variable Names

Filtering

We now discuss ways of filtering data sets—that is, finding records that match particular
criteria. Before getting into details, there are two aspects of filtering you should be aware of.
First, this section is concerned with the types of filters called AutoFilter in pre-2007 versions of
Excel. The term AutoFilter implied that these were very simple filters, easy to learn and apply. If
you wanted to do any complex filtering, you had to move beyond AutoFilter to Excel’s Advanced
Filter tool. Starting in version 2007, Excel still has Advanced Filter. However, the term AutoFilter
has been changed to Filter to indicate that these “easy” filters are now more powerful than the
old AutoFilter. Fortunately, they are just as easy as AutoFilter.

Second, one way to filter is to create an Excel table, as indicated in the previous
subsection. This automatically provides the dropdown arrows next to the field names that allow
you to filter. Indeed, this is the way we will filter in this section: on an existing table. However, a
designated table is not required for filtering. You can filter on any rectangular data set with
variable names. There are actually three ways to do so. For each method, the active cell should
be a cell inside the data set.

■ Use the Filter button from the Sort & Filter dropdown list on the Home ribbon.

■ Use the Filter button from the Sort & Filter group on the Data ribbon.

■ Right-click any cell in the data set and select Filter.

You get several options, the most popular of which is Filter by Selected Cell’s Value.
For example, if the selected cell has value 1 and is in the Children column, then only customers
with a single child will remain visible. (This behavior should be familiar to Access users.) The
point is that Microsoft realizes how important filtering is to Excel users. Therefore, they have
made filtering a very prominent and powerful tool in all versions of Excel since 2007. As far as
we can tell, the two main advantages of filtering on a table, as opposed to the three options just
listed, are the nice formatting (banded rows, for example) provided by tables, and, more
importantly, the total row. If this total row is showing, it summarizes only the visible records; the
hidden rows are ignored.
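The same filter-sort-summarize workflow can also be sketched outside Excel with Python and pandas; the column names and values below are illustrative only and are not the actual HyTex file.

    import pandas as pd

    # Hypothetical customer data, loosely modeled on the HyTex example
    customers = pd.DataFrame({
        "Age": [34, 45, 29, 52, 41],
        "Children": [1, 3, 0, 1, 2],
        "AmountSpent": [1200, 3400, 800, 2100, 2900],
    })

    one_child = customers[customers["Children"] == 1]                    # filter: customers with one child
    by_spending = customers.sort_values("AmountSpent", ascending=False)  # sort: largest spenders first
    print(one_child["AmountSpent"].sum())  # 3300: summarizes only the filtered rows, like a table's total row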

CHAPTER EXERCISES

1. Obtain DOH data on COVID19 in the Philippines from March to August 2020. Write or print your answers on short bond paper.

Tasks:

▪ For students WITH COMPUTER

a. Perform Sorting of Data (using MS Excel)

b. Filter data according to (using MS Excel)

- Month

- Types of Data (confirmed case, death, recoveries)

▪ For students WITHOUT COMPUTER

a. Create a line graph of the COVID19 data

- Monthly data

- Data type (confirmed case, death, recoveries)

CHAPTER 5: PROBABILITY AND PROBABILITY DISTRIBUTIONS

OVERVIEW

The chance of an event occurring is probability. What is the chance that it will rain today?
What is the chance that you will reach the office in the next ten minutes? Given the existing
grades, what is the chance that a student will pass the exam? In this chapter, we will cover the concept of probability and how to calculate it. It will also cover the concept of a probability distribution, especially the normal distribution, and how to work with distributions.

OBJECTIVES

▪ Explain the basic concepts and tools necessary to work with probability
distributions and their summary measures.
▪ Examine the probability distribution of a single random variable.
▪ Understand the addition rule
▪ Understand and learn conditional probability and the multiplication rule
▪ Learn the summary measures of a probability distribution

A key aspect of solving real business problems is dealing appropriately with uncertainty.
This involves recognizing explicitly that uncertainty exists and using quantitative methods to
model uncertainty. If you want to develop realistic business models, you cannot simply act as if
uncertainty doesn’t exist. For example, if you don’t know next month’s demand, you shouldn’t
build a model that assumes next month’s demand is a sure 1500 units. This is only wishful
thinking. You should instead incorporate demand uncertainty explicitly into your model. To do
this, you need to know how to deal quantitatively with uncertainty. This involves probability and
probability distributions. We introduce these topics in this chapter and then use them in a
number of later chapters.

There are many sources of uncertainty. Demands for products are uncertain, times
between arrivals to a supermarket are uncertain, stock price returns are uncertain, changes in
interest rates are uncertain, and so on. In many situations, the uncertain quantity— demand,
time between arrivals, stock price return, change in interest rate—is a numerical quantity. In the
language of probability, it is called a random variable. More formally, a random variable
associates a numerical value with each possible random outcome.

Associated with each random variable is a probability distribution that lists all of the
possible values of the random variable and their corresponding probabilities. A probability
distribution provides very useful information. It not only indicates the possible values of the
random variable, but it also indicates how likely they are. For example, it is useful to know that
the possible demands for a product are, say, 100, 200, 300, and 400, but it is even more useful
to know that the probabilities of these four values are, say, 0.1, 0.2, 0.4, and 0.3. This implies,
for example, that there is a 70% chance that demand will be at least 300. It is often useful to
summarize the information from a probability distribution with numerical summary measures.
These include the mean, variance, and standard deviation. The summary measures in this
chapter are based on probability distributions, not an observed data set. We will use numerical
examples to explain the difference between the two—and how they are related.
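A short Python sketch of the demand distribution just described shows how the listed probabilities translate into the 70% figure and into a summary measure such as the mean:

    # Probability distribution of demand from the example above
    values = [100, 200, 300, 400]
    probs = [0.1, 0.2, 0.4, 0.3]

    p_at_least_300 = sum(p for v, p in zip(values, probs) if v >= 300)
    mean_demand = sum(v * p for v, p in zip(values, probs))

    print(round(p_at_least_300, 10))  # 0.7: a 70% chance that demand is at least 300
    print(round(mean_demand, 10))     # 290.0: the mean of the distribution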

We discuss two terms you often hear in the business world: uncertainty and risk. They
are sometimes used interchangeably, but they are not really the same. You typically have no
control over uncertainty; it is something that simply exists. A good example is the uncertainty in
exchange rates. You cannot be sure what the exchange rate between the U.S. dollar and the
euro will be a year from now. All you can try to do is measure this uncertainty with a probability
distribution.
Risk, in contrast, depends on how that uncertainty affects you and on the decisions you make in the face of it. By learning about probability, you will learn how to measure uncertainty, and you
will also learn how to measure the risks involved in various decisions. One important topic you
will not learn much about is risk mitigation by various types of hedging. For example, if you
know you have to purchase a large quantity of some product from Europe a year from now, you
face the risk that the value of the euro could increase dramatically, thus costing you a lot of
money. Fortunately, there are ways to hedge this risk, so that if the euro does increase relative
to the dollar, your hedge minimizes your losses.

PROBABILITY ESSENTIALS

A probability is a number between 0 and 1 that measures the likelihood that some
event will occur. An event with probability 0 cannot occur, whereas an event with probability 1 is
certain to occur. An event with probability greater than 0 and less than 1 involves uncertainty.
The closer its probability is to 1, the more likely it is to occur.

When a sports commentator states that the odds against the Miami Heat winning the
NBA Championship are 3 to 1, he or she is also making a probability statement. The concept of
probability is quite intuitive. However, the rules of probability are not always as intuitive or easy
to master. We examine the most important of these rules in this section.

As the examples in the preceding paragraph illustrate, probabilities are sometimes


expressed as percentages or odds. However, these can easily be converted to probabilities on
a 0-to-1 scale. If the chance of rain is 70%, then the probability of rain is 0.7. Similarly, if the
odds against the Heat winning are 3 to 1, then the probability of the Heat winning is 1/4 (or 0.25).

There are only a few probability rules you need to know, and they are discussed in the next few subsections. Probability is not an easy topic, and a more thorough discussion of it would lead to considerable mathematical complexity, well beyond the level of this book.

Rule of Complements

The simplest probability rule involves the complement of an event. If A is any event, then the complement of A, denoted by Ā (or in some books by Aᶜ), is the event that A does not occur.

For example, if A is the event that the Dow Jones Industrial Average will finish the year at or above the 14,000 mark, then the complement of A is the event that the Dow will finish the year below 14,000.

If the probability of A is P(A), then the probability of its complement, P(Ā), is given by the equation below. Equivalently, the probability of an event and the probability of its complement
sum to 1. For example, if you believe that the probability of the Dow finishing at or above 14,000
is 0.25, then the probability that it will finish the year below 14,000 is 1 - 0.25 = 0.75.

P(Ā) = 1 − P(A)

Addition Rule

Events are mutually exclusive if at most one of them can occur. That is, if one of them
occurs, then none of the others can occur.

For example, consider the following three events involving a company’s annual revenue
for the coming year: (1) revenue is less than $1 million, (2) revenue is at least $1 million but less
than $2 million, and (3) revenue is at least $2 million. Clearly, only one of these events can
occur. Therefore, they are mutually exclusive. They are also exhaustive events, which means
that they exhaust all possibilities—one of these three events must occur. Let A1 through An be
any n events. Then the addition rule of probability involves the probability that at least one of
these events will occur. In general, this probability is quite complex, but it simplifies considerably
when the events are mutually exclusive. In this case the probability that at least one of the
events will occur is the sum of their individual probabilities, as shown in the equation below. Of
course, when the events are mutually exclusive, “at least one” is equivalent to “exactly one.” In
addition, if the events A1 through An are exhaustive, then the probability is one because one of
the events is certain to occur.

P(at least one of A1 through An) = P(A1) + P(A2) + … + P(An)

For example, in terms of a company’s annual revenue, define A1 as “revenue is less than
$1 million,” A2 as “revenue is at least $1 million but less than $2 million,” and A3 as “revenue is
at least $2 million.” Then these three events are mutually exclusive and exhaustive. Therefore,
their probabilities must sum to 1. Suppose these probabilities are P(A1) = 0.5, P(A2) = 0.3, and
P(A3) = 0.2. (Note that these probabilities do sum to 1.) Then the additive rule enables you to

calculate other probabilities. For example, the event that revenue is at least $1 million is the
event that either A2 or A3 occurs. From the addition rule, its probability is

P(revenue is at least $1 million) = P(A2) + P(A3) = 0.5

Similarly,

P(revenue is less than $2 million) = P(A1) + P(A2) = 0.8

and

P(revenue is less than $1 million or at least $2 million) = P(A1) + P(A3) = 0.7

Again, the addition rule works only for mutually exclusive events. If the events overlap,
the situation is more complex.
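A minimal Python sketch of these calculations, using the three revenue events and their probabilities from the example:

    # Mutually exclusive, exhaustive revenue events and their probabilities
    p = {"A1": 0.5, "A2": 0.3, "A3": 0.2}

    print(p["A2"] + p["A3"])           # P(revenue at least $1 million) = 0.5
    print(p["A1"] + p["A2"])           # P(revenue less than $2 million) = 0.8
    print(round(sum(p.values()), 10))  # 1.0: exhaustive events, so the probabilities sum to 1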

Conditional Probability and the Multiplication Rule

Probabilities are always assessed relative to the information currently available. As new
information becomes available, probabilities can change. For example, if you read that LeBron
James suffered a season-ending injury, your assessment of the probability that the Heat will win
the NBA Championship would obviously change. A formal way to revise probabilities on the
basis of new information is to use conditional probabilities.

Let A and B be any events with probabilities P(A) and P(B). Typically, the probability
P(A) is assessed without knowledge of whether B occurs. However, if you are told that B has
occurred, then the probability of A might change. The new probability of A is called the
conditional probability of A given B, and it is denoted by P(A∣B). Note that there is still
uncertainty involving the event to the left of the vertical bar in this notation; you do not know
whether it will occur. However, there is no uncertainty involving the event to the right of the
vertical bar; you know that it has occurred. The conditional probability can be calculated with the following formula (Equation 4.3):

P(A∣B) = P(A and B) / P(B)

The numerator in this formula is the probability that both A and B occur. This probability
must be known to find P(A∣B). However, in some applications P(A∣B) and P(B) are known. Then
you can multiply both sides of Equation (4.3) by P(B) to obtain the following multiplication rule for P(A and B):

P(A and B) = P(A∣B)P(B)

Example:

Bender Company supplies contractors with materials for the construction of houses.
The company currently has a contract with one of its customers to fill an order by the end of July.
However, there is some uncertainty about whether this deadline can be met, due to uncertainty
about whether Bender will receive the materials it needs from one of its suppliers by the middle
of July. Right now it is July 1. How can the uncertainty in this situation be assessed?

Solution

Let A be the event that Bender meets its end-of-July deadline, and let B be the event
that Bender receives the materials from its supplier by the middle of July. The probabilities
Bender is best able to assess on July 1 are probably P(B) and P(A∣B). At the beginning of July,
Bender might estimate that the chances of getting the materials on time from its supplier are 2
out of 3, so that P(B) = 2/3. Also, thinking ahead, Bender estimates that if it receives the
required materials on time, the chances of meeting the end-of-July deadline are 3 out of 4. This
is a conditional probability statement, namely, that P(A∣B) = 3/4. Then the multiplication rule
implies that

P(A and B) = P(A∣B)P(B) = (3/4)(2/3) = 0.5

That is, there is a fifty-fifty chance that Bender will get its materials on time and meet its end-
of-July deadline.
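
As a quick check, the same multiplication-rule arithmetic can be reproduced in Python (the 2/3 and 3/4 probabilities are taken directly from the example; fractions avoid rounding):

from fractions import Fraction

p_B = Fraction(2, 3)          # P(B): materials arrive by mid-July
p_A_given_B = Fraction(3, 4)  # P(A|B): deadline met, given the materials arrive on time

# Multiplication rule: P(A and B) = P(A|B) * P(B)
p_A_and_B = p_A_given_B * p_B
print(p_A_and_B, float(p_A_and_B))   # 1/2, i.e., 0.5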

PROBABILITY DISTRIBUTION OF A SINGLE RANDOM VARIABLE

There are really two types of random variables: discrete and continuous. A discrete
random variable has only a finite number of possible values, whereas a continuous random
variable has a continuum of possible values. Usually a discrete distribution results from a count,
whereas a continuous distribution results from a measurement. For example, the number of
children in a family is clearly discrete, whereas the amount of rain this year in San Francisco is
clearly continuous.

This distinction between counts and measurements is not always clear-cut. For
example, what about the demand for televisions at a particular store next month? The number
of televisions demanded is clearly an integer (a count), but it probably has many possible values,
such as all integers from 0 to 100. In some cases like this, we often approximate in one of two
ways. First, we might use a discrete distribution with only a few possible values, such as all
multiples of 20 from 0 to 100. Second, we might approximate the possible demand as a
continuum from 0 to 100. The reason for such approximations is to simplify the mathematics,
and they are frequently used.

Mathematically, there is an important difference between discrete and continuous


probability distributions. Specifically, a proper treatment of continuous distributions, analogous
to the treatment we provide in this chapter, requires calculus—which we do not presume for this
book. Therefore, we discuss only discrete distributions in this chapter. In later chapters we often
use continuous distributions, particularly the bell-shaped normal distribution, but we simply state
their properties without deriving them mathematically.

The essential properties of a discrete random variable and its associated probability
distribution are quite simple. We discuss them in general and then analyze a numerical example.

Let X be a random variable. To specify the probability distribution of X, we need to


specify its possible values and their probabilities. We assume that there are k possible values,
denoted v1, v2, . . . , vk. The probability of a typical value vi is denoted in one of two ways, either
P(X = vi) or p(vi). The first is a reminder that this is a probability involving the random variable X,
whereas the second is a shorthand notation. Probability distributions must satisfy two criteria:
(1) the probabilities must be nonnegative, and (2) they must sum to 1. In symbols, we must have
p(vi) ≥ 0 for each i and p(v1) + p(v2) + … + p(vk) = 1.

Summary Measures of a Probability Distribution

It is often convenient to summarize a probability distribution with two or three well-


chosen numbers. The first of these is the mean, often denoted µ. It is also called the expected
value of X and denoted E(X) (for expected X). The mean is a weighted sum of the possible
values, weighted by their probabilities, as shown below. In much the same way that an average of a
set of numbers indicates “central location,” the mean indicates the “center” of the probability
distribution.

µ = E(X) = v1p(v1) + v2p(v2) + … + vkp(vk)

To measure the variability in a distribution, we calculate its variance or standard


deviation. The variance, denoted by σ² or Var(X), is a weighted sum of the squared deviations
of the possible values from the mean, where the weights are again the probabilities, as shown
below. As in Chapter 3, the variance is expressed in the square of the units of X, such as dollars
squared. Therefore, a more natural measure of variability is the standard deviation, denoted by σ
or Stdev(X). It is the square root of the variance.

σ² = Var(X) = (v1 − µ)²p(v1) + (v2 − µ)²p(v2) + … + (vk − µ)²p(vk)

σ = Stdev(X) = √Var(X)
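
A minimal Python sketch of these summary measures, assuming a generic discrete distribution supplied as lists of values and probabilities (the numbers in the example call are made up):

import math

def summary_measures(values, probs):
    """Mean, variance, and standard deviation of a discrete probability distribution."""
    assert abs(sum(probs) - 1.0) < 1e-6, "probabilities must sum to 1"
    mean = sum(v * p for v, p in zip(values, probs))                    # weighted sum of values
    variance = sum((v - mean) ** 2 * p for v, p in zip(values, probs))  # weighted squared deviations
    return mean, variance, math.sqrt(variance)

mean, var, std = summary_measures([0, 1, 2, 3], [0.1, 0.3, 0.4, 0.2])
print(mean, var, std)

The same function can be reused for distributions such as the market return distribution in the chapter exercise below.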

CHAPTER EXERCISES

An investor is concerned with the market return for the coming year, where the market
return is defined as the percentage gain (or loss, if negative) over the year. The investor
believes there are five possible scenarios for the national economy in the coming year: rapid
expansion, moderate expansion, no growth, moderate contraction, and serious contraction.
Furthermore, she has used all of the information available to her to estimate that the market
returns for these scenarios are, respectively, 23%, 18%, 15%, 9%, and 3%. That is, the possible
returns vary from a high of 23% to a low of 3%. Also, she has assessed that the probabilities of
these outcomes are 0.12, 0.40, 0.25, 0.15, and 0.08. Use this information to describe the
probability distribution of the market return.

Compute the following for the probability distribution of the market return for the coming
year:

1. Mean,

2. Variance,

3. Standard deviation

Show your solutions.

SUGGESTED READINGS

Read article on Probability Distribution at http://stattrek.com/probability-distributions/probability-distribution.aspx

CHAPTER 6
STATISTICAL INFERENCE: SAMPLING AND ESTIMATION

OVERVIEW

This chapter introduces the important problem of estimating an unknown population quantity
by randomly sampling from the population. Sampling is often expensive and/or time-
consuming, so a key step in any sampling plan is to determine the sample size that produces
a prescribed level of accuracy. This chapter also sets the stage for statistical inference, a
topic that is explored in the following few chapters. In a typical statistical inference problem,
you want to discover one or more characteristics of a given population.

OBJECTIVES

▪ Identify sample size and population
▪ Identify and understand different sampling techniques
▪ Discuss the sampling schemes that are generally used in real sampling applications
▪ See how the information from a sample of the population can be used to infer the
properties of the entire population
▪ Learn the different statistical treatments used for inferences
▪ Understand the use of confidence intervals and margins of error
▪ Understand estimation and its use

UNDERSTANDING SAMPLES

What is a population? Any group of people, transactions, products, and so on that shares at
least one common characteristic is called a population. You need to understand the population
for any project at the beginning of the project. In business, it is rare to have a population that has
only one characteristic; generally, it will have many variables in the data set. What is a sample?
A sample consists of a few observations, or a subset, of a population. Can a sample have the same
number of observations as a population? Yes, it can. Some of the differences between populations
and samples are in the computations and nomenclatures associated with them.

In statistics, population refers to a collection of data related to people or events for


which the analyst wants to make some inferences. It is not possible to examine every member
in the population. Thus, if you take a sample that is random and large enough, you can use the
information collected from the sample to make deductions about the population. For example,
you can look at 100 students from a school (picked randomly) and make a fairly accurate
judgment of the standard of English spoken in the school. Or you can look at the last 100
transactions on a web site and figure out fairly accurately the average time a customer spends
on the web site.

Before you can choose a sample from a given population, you typically need a list of all
members of the population. In sampling terminology, this list is called a frame, and the potential
sample members are called sampling units. Depending on the context, sampling units could be
individual people, households, companies, cities, or others.

There are two basic types of samples: probability samples and judgmental samples.
A probability sample is a sample in which the sampling units are chosen from the population
according to a random mechanism. In contrast, no formal random mechanism is used to select
a judgmental sample. In this case the sampling units are chosen according to the sampler’s
judgment.

SAMPLING TECHNIQUES

A sample is part of the population which is observed in order to make inferences about
the whole population (Manheim, 1977). You use sampling when your research design requires
that you collect information from or about a population, which is large or so widely scattered as
to make it impractical to observe all the individuals in the population. A sample reflects the
characteristics of the population.

Four factors that you should take into consideration when selecting your sample and the
size of your sample are the following:

1. Homogeneity. Take samples from a homogenous population. Samples taken


from a heterogeneous population will not be representative of the population,
and therefore, cannot be inferred from.

2. Size of population. If the population is large, you need a sample. However, you
do not need a sample if the population is small and can be handled if you
include all the individuals in the population. Including all the individuals in the
population is also called total enumeration.

3. Cost. Your choice of sampling method should be based also on the cost of
adopting such method without necessarily sacrificing representativeness of the
population being considered.

4. Precision. If you have to achieve precision, you will need a larger sample
because the larger the sample, the more precise the results will be.

There are two major types of sampling techniques: probability sampling and non-
probability sampling.

a. Probability sampling

According to Domingo (1954), probability sampling is a sampling process where each


individual is drawn or selected with known probability. Parel et al. (1966) consider a sample to
be a probability sample when every individual in the population is given a non-zero chance of
being chosen for the sample. There are six techniques under this sampling method.

1. Random sampling. Also called simple random sampling, this technique is a way of
selecting n individuals out of N such that everyone has an equal chance of being
selected. Sample individuals are selected at points entirely at random within the
population. This technique is suitable for homogeneous populations.

2. Systematic random sampling. This technique starts by numbering consecutively all


individuals in the population. The first sample is selected through a simple random
process, then the succeeding samples are chosen at pre-established intervals. To
determine the appropriate interval, divide N by the desired sample size (a short sketch
of simple random and systematic selection appears after this list).

3. Stratified sampling. This technique is applicable when the population is not


homogeneous wherein the random sample may not be representative of the
population. When you do stratified sampling, divide the population into homogeneous
groups called strata, then draw samples either by simple random sampling or
systematic random sampling from each of the formed strata. For precise results, the total
number of the desired sample may be allocated equally among the strata. This
technique prevents any chance concentration of sample units in one part of the field
because they are well distributed. For example, suppose that you would like to take a
sample of students at the University of the Philippines Los Baños using the stratified
sampling technique. The stratification of the student population has already been
made for you. The strata are: “freshmen,” “sophomore,” “junior,” and “senior.” What
do you do to select your sample from each of these groups of students to ensure that
you get a cross-section of the UPLB studentry? If you select your sample by simple
random selection, there is a chance that you will end up with a sample composed more
of seniors or juniors than of representative groups of students in all
classifications.

4. Simple cluster sampling. This is a one-stage sampling technique wherein the


population is grouped into clusters or small units composed of population elements.
A number of these population clusters is chosen either by simple random sampling
or by systematic random sampling.

5. Strip sampling. Under this technique, you divide the area to be sampled into narrow
strips. Then, select a number of strips at random either by complete randomization or
with some degree of stratification. Sometimes you may consider only a part of the
strip as a sample unit.

6. Multi-stage sampling. This technique is commonly used when there is no detailed


or actual listing of individuals. You do sampling in stages, which means that you group
the population elements into a hierarchy of individuals or units, and sampling is done
successively.
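
As mentioned in the systematic sampling item above, here is a brief sketch of simple random and systematic selection using Python's standard library (the frame is just a hypothetical list of 500 ID numbers):

import random

frame = list(range(1, 501))   # hypothetical sampling frame of 500 units
n = 50                        # desired sample size

# 1. Simple random sampling: every unit has an equal chance of being selected.
simple_sample = random.sample(frame, n)

# 2. Systematic random sampling: random start, then every k-th unit thereafter.
k = len(frame) // n               # sampling interval
start = random.randrange(k)       # random starting point within the first interval
systematic_sample = frame[start::k][:n]

print(sorted(simple_sample)[:10])
print(systematic_sample[:10])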

b. Non-probability sampling

According to Yamane (1967), this method is a process whereby probabilities cannot be


assigned objectively to individuals in the population. Simply, not all the individuals in the
population are given a non-zero chance of being included in the sample. In fact, some
individuals in the population may be deliberately ignored.

1. Judgment sampling. This is a process whereby you select a representative sample


according to your subjective judgment. Since personal bias is usually a factor in the
selection of sample, there is no objective way of evaluating the results of this
technique. This sampling technique may be appropriate when you have to make
judgment about an individual’s potential as a source of information.

2. Quota sampling. This is simply a variation of judgment sampling, which provides


more explicit instructions on who to select. A definite quota must be filled. The quota
is determined to a certain extent by the characteristics of the population so that the
quota sample will be representative of the population. This is commonly used in
opinion research, where interviewers are just given specific quotas or number of
respondents to interview. This technique is very economical and simple, but it must
be used with caution as it allows for a wide latitude of interviewer’s choices which
may result in biases. The assumption here, however, is that field investigators have
high integrity and they have undergone thorough training.

3. Accidental sampling. This technique is very simple in that whoever happens to be


there at the time of the interview is interviewed and becomes part of the sample. This
is normally done in spot surveys for audience studies, for example.

Why Random Sampling?

One reason for sampling randomly from a population is to avoid biases (such as
choosing mainly stay-at-home mothers because they are easier to contact). An equally
important reason is that random sampling allows you to use probability to make inferences
about unknown population parameters. If sampling were not random, there would be no
basis for using probability to make such inferences.

DETERMINING SAMPLE SIZE

On top of the basic sampling techniques that are commonly used, you can introduce a
system where you can ensure that the final sample of your study is really representative of the
population comprised of individuals that may come in clusters or groups. This is called
proportional sampling and there is a simple formula that would enable you to arrive at a
complete sample that is representative of the segments of the population.

For instance, you want to obtain a sample sufficiently representative of the barangays or
villages in a town. You know that the barangays differ in total number of individuals living in
them. So you decide that those with larger population should be represented by more
respondents. How then would you determine the number of respondents coming from each
village?

To determine the sample size, Slovin’s formula is commonly used for smaller populations.

n = N / (1 + Ne²)        where:  n = sample size
                                 N = population size
                                 e = margin of error

For example:

Suppose you wanted to determine the sample size for your study on households’ taste
preference on the new variety of ice cream. The study will be conducted in Sto. Nino,
Paranaque City with a total number of households of 4,921 (PSA Census on Population 2000
data).

Solution:

n = N / (1 + Ne²) = 4921 / (1 + 4921(.05)²) = 4921 / (1 + 4921(.0025)) = 4921 / (1 + 12.30) = 4921 / 13.30 ≈ 370

Hence, your sample size is only 370 households out of the 4,921. This represents the
number of respondents that you will survey for your study.
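
A small helper that mirrors the Slovin computation above (rounding up to the next whole respondent):

import math

def slovin(N, e=0.05):
    """Sample size by Slovin's formula: n = N / (1 + N * e**2)."""
    return math.ceil(N / (1 + N * e ** 2))

print(slovin(4921, 0.05))   # 370 households, matching the worked example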

INTRODUCTION TO ESTIMATION

The purpose of any random sample, simple or otherwise, is to estimate properties of a


population from the data observed in the sample. The following is a good example to keep in
mind. Suppose a government agency wants to know the average household income over the
population of all households in Indiana. Then this unknown average is the population parameter
of interest, and the government is likely to estimate it by sampling several representative
households in Indiana and reporting the average of their incomes.

The mathematical procedures appropriate for performing this estimation depend on


which properties of the population are of interest and which type of random sampling scheme is
used. Because the details are considerably more complex for more complex sampling schemes
such as multistage sampling, we will focus on simple random samples, where the mathematical
details are relatively straightforward. Details for other sampling schemes such as stratified
sampling can be found in Levy and Lemeshow (1999). However, even for more complex
sampling schemes, the concepts are the same as those we discuss here; only the details
change.

Sources of Estimation Error

There are two basic sources of errors that can occur when you sample randomly from a
population: sampling error and all other sources, usually lumped together as non-sampling
error. Sampling error results from “unlucky” samples. As such, the term error is somewhat
misleading.

For example, suppose that the mean household income in Indiana is $58,225. (We can only
assume that this is the true value. It wouldn’t actually be known without taking a census.) A
government agency wants to estimate this mean, so it randomly samples 500 Indiana
households and finds that their average household income is $60,495. If the agency then infers
that the mean of all Indiana household incomes is $60,495, the resulting sampling error is the
difference between the reported value and the true value: $60,495 – $58,225 = $2,270. Note that
the agency hasn’t done anything wrong. This sampling error is essentially due to bad luck.

Non-sampling error is quite different and can occur for a variety of reasons.

a. nonresponse bias. This occurs when a portion of the sample fails to respond to the survey.
Anyone who has ever conducted a questionnaire, whether by mail, by phone, or any other
method, knows that the percentage of non-respondents can be quite large. The question is
whether this introduces estimation error. If the non-respondents would have responded
similarly to the respondents, you don’t lose much by not hearing from them. However,
because the non-respondents don’t respond, you typically have no way of knowing whether
they differ in some important respect from the respondents. Therefore, unless you are able to
persuade the non-respondents to respond—through a follow-up email, for example—you
must guess at the amount of nonresponse bias.

b. non-truthful responses. This is particularly a problem when there are sensitive questions in
a questionnaire. For example, if the questions “Have you ever had an abortion?” or “Do you
regularly use cocaine?” are asked, most people will answer “no,” regardless of whether the
true answer is “yes” or “no.”

c. measurement error. This occurs when the responses to the questions do not reflect what
the investigator had in mind. It might result from poorly worded questions, questions the
respondents don’t fully understand, questions that require the respondents to supply
information they don’t have, and so on. Undoubtedly, there have been times when you were
filling out a questionnaire and said to yourself, “OK, I’ll answer this as well as I can, but I
know it’s not what they want to know.”

d. voluntary response bias. This occurs when the subset of people who respond to a survey
differ in some important respect from all potential respondents. For example, suppose a
population of students is surveyed to see how many hours they study per night. If the
students who respond are predominantly those who get the best grades, the resulting sample
mean number of hours could be biased on the high side.

Key Terms in Sampling

A point estimate is a single numeric value, a “best guess” of a population parameter, based on
the data in a random sample.

The sampling error (or estimation error) is the difference between the point estimate and the
true value of the population parameter being estimated.

The sampling distribution of any point estimate is the distribution of the point estimates from
all possible samples (of a given sample size) from the population.

A confidence interval is an interval around the point estimate, calculated from the sample data,
that is very likely to contain the true value of the population parameter.

An unbiased estimate is a point estimate such that the mean of its sampling distribution is
equal to the true value of the population parameter being estimated.

The standard error of an estimate is the standard deviation of the sampling distribution of the
estimate. It measures how much estimates vary from sample to sample.

Sample Size Selection

The problem of selecting the appropriate sample size in any sampling context is not an
easy one, but it must be faced in the planning stages,
before any sampling is done. We focus here on the relationship between sampling error and
sample size. As we discussed previously, the sampling error tends to decrease as the sample
size increases, so the desire to minimize sampling error encourages us to select larger sample
sizes. We should note, however, that several other factors encourage us to select smaller
sample sizes. The ultimate sample size selection must achieve a trade-off between these
opposing forces.

The determination of sample size is usually driven by sampling error considerations. If


you want to estimate a population mean with a sample mean, then the key is the standard error
of the mean, given by

SE(X̄) = σ/√n
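
The following sketch shows how the standard error shrinks as n grows and how to solve for the n needed to hit a target standard error; the population standard deviation used here is an assumed figure for illustration only:

import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def n_for_target_se(sigma, target_se):
    """Smallest n whose standard error does not exceed the target."""
    return math.ceil((sigma / target_se) ** 2)

sigma = 12000   # assumed population standard deviation (e.g., household income in dollars)
for n in (100, 400, 1600):
    print(n, round(standard_error(sigma, n), 1))   # quadrupling n halves the standard error

print(n_for_target_se(sigma, 500))   # n needed to bring the standard error down to about 500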

CONFIDENCE INTERVALS

In the world of statistics, you can look at what applies to the sample and try to determine
what applies to the population. You know that the sample cannot be a 100 percent replica of the
population. There will be minor differences, and perhaps major ones too. How do you figure out
whether the sample statistics are applicable to the population? To answer this, you look at the
confidence interval. Confidence intervals enable you to understand the accuracy that you can
expect when you take the sample statistics and apply them to the population.

In other words, a confidence interval gives you a range of values within which you can
expect the population statistics to be.

In statistics there is a term called the margin of error, which defines the maximum
expected difference between the population parameter and the sample statistic. It is often an
indicator of the random sampling error, and it is expressed as a likelihood or probability that the
result from the sample is close to the value that would have been calculated if you could have
calculated the statistic for the population. The margin of error is calculated when you observe
many samples instead of one sample. When you look at 50 people coming in for an interview
and find that 5 people do not arrive at the correct time, you can conclude that the margin of error
is 5 ÷ 50, which is equal to 10 percent. Therefore, the absolute margin of error, which is five
people, is converted to a relative margin of error, which is 10 percent.

Now, what is the chance that when you observe many samples of 50 people, you will
find that in each sample 5 people do not come at the designated time of interview? If you find
that, out of 100 samples, in 99 samples 5 people do not come in on time for an interview, you
can say with 99 percent confidence that the margin of error is 10 percent.

Why should there be any margin of error if the sample is a mirror image of the
population? The answer is that there is no sample that will be a 100 percent replica of the
population. But it can be very close. Thus, the margin of error can be caused by sampling
error or by non-sampling error.

You already know that the chance that the sample is off the mark will decrease as the
sample size increases. The more people/products that you have in your sample size, the more
likely you will get a statistic that is very close to the population statistic. Thus, as a rough rule of
thumb, the margin of error in a sample is approximately 1 divided by the square root of the number
of observations in the sample.

e = 1/√n
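
A minimal sketch that combines the rough 1/√n rule above with a conventional 95 percent confidence interval for a sample mean (the z-multiplier 1.96 is the usual normal-approximation value; the sample data are made up):

import math
import statistics

sample = [42, 38, 51, 47, 44, 39, 46, 50, 43, 45,
          41, 48, 44, 46, 40, 49, 45, 43, 47, 42]   # hypothetical observations

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)

rough_moe = 1 / math.sqrt(n)   # the rough rule of thumb from the text
moe = 1.96 * se                # margin of error for a 95% confidence interval
print(round(rough_moe, 3))
print(round(mean - moe, 2), round(mean + moe, 2))   # the confidence interval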

What Is the P-value?

You know that hypothesis testing is used to confirm or reject whether two samples
belong to the same population. The p-value is the probability that determines whether the two
samples come from the same population. This probability is a measure of evidence against the
hypothesis.

Remember the following:

▪ The null hypothesis always claims that Mean 1 = Mean 2.


▪ The aim of hypothesis testing is to reject the null.

Thus, a smaller p-value means that you can reject the null because the probability of
observing such a difference in means if the two samples actually came from the same population
is very small (for example, .05 = a 5 percent probability).

The small p-value corresponds to strong evidence, and if the p-value is below a
predefined limit (.05 is the default value in most software), then the result is said to be
statistically significant.

For example, if the hypothesis is that a new type of medicine is better than the old
version, then the first step is to show that the drugs are not similar (that any apparent similarity is
so small that it could be due to chance). The null hypothesis of the two drugs being the
same then needs to be rejected. A small p-value signifies that, if the null hypothesis were true,
a result like the one observed would be very unlikely to occur purely by chance.

Figure 11. Unlikely observations

This distribution is the distribution of the test statistic when the null hypothesis is true.
Thus, when the p-value (the probability of obtaining a result at least this extreme if the null
hypothesis were true) is less than .05 (or any other value set for the test), you reject the null and
conclude that the observed difference between Mean 1 and Mean 2 is unlikely to be due to
coincidence or chance alone.
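
A short sketch of the idea, assuming the SciPy library is available: a two-sample t-test returns a p-value, and the decision is simply a comparison with the chosen significance level (the two samples are invented for illustration):

from scipy import stats

old = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 12.4]   # e.g., response times under an old process
new = [11.2, 11.5, 11.0, 11.4, 11.3, 11.6, 11.1, 11.5]   # response times under a new process

t_stat, p_value = stats.ttest_ind(old, new)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null; the means appear to differ")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null")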

Errors in Hypothesis Testing

No hypothesis test is 100 percent certain. As you have noticed, tests are based on
probability, and therefore, there is always a chance of an incorrect conclusion. These incorrect
conclusions can be of two types:

▪ Type I error, alpha: This is when the null hypothesis is true but you reject the null. Alpha
is the level of significance that you have set for the test. At a significance of .05, you are
willing to accept a 5 percent chance that you will incorrectly reject the null hypothesis. To
lower the risk, you can choose a lower value of significance. The risk of a Type I error is
generally reported as the p-value.
▪ Type II error, beta: This is the error of incorrectly accepting the null. The probability of
making a type II error depends on the power of the test.

You can decrease your risk of committing a type II error by ensuring your sample size is
large enough to detect a practical difference when one truly exists. The confidence level is
equivalent to 1 minus the alpha level.

When the significance level is 0.05, the corresponding confidence level is 95 percent.

▪ If the p-value is less than the significance (alpha) level, the hypothesis test is statistically
significant.
▪ If the confidence interval does not contain the null hypothesis value between the upper
and lower confidence limits, the results are statistically significant (the null can be
rejected).
▪ If the p-value is less than the alpha, the confidence interval will not contain the null
hypothesis value.

1. Confidence level + alpha = 1.

2. If the p-value is low, the null must go.

3. The confidence interval and p-value will always lead to the same conclusion.

The most valuable use of hypothesis testing is in interpreting the robustness of other
statistics generated while solving the problem or doing the project.

▪ Correlation coefficient: If the p-value is less than or equal to .05, you can conclude that
the correlation is statistically significant and is well estimated by the correlation coefficient
value displayed/calculated. If the p-value is greater than .05, you have to conclude that the
observed correlation may be due to chance/coincidence.
▪ Linear regression coefficients: If the p-value is less than or equal to .05, you can
conclude that the coefficients are statistically significant and are well estimated by the values
displayed/calculated. If the p-value is greater than .05, you have to conclude that the estimated
coefficients may be due to chance/coincidence.

SAMPLING DISTRIBUTIONS

What will happen if you are able to draw out all possible samples of 30 or more
observations from a given population/sample frame? For each of these samples, you could
compute the descriptive statistics (mean, median, standard deviation, minimum, maximum).
Now if you were to create a probability distribution of this statistic, it would be called the
sampling distribution, and the standard deviation of this statistic would be called the standard
error.

It has been found that if an infinite number of samples is taken from the same sample
frame/population and a sample statistic (say, the mean of each sample) is plotted out, you will
find that a normal distribution emerges. Thus, most of the means will be clustered around the
mean of the sample means, which incidentally will coincide with or be very close to the
population/sample frame mean. This is as per the normal distribution rule, which states that
values are concentrated around the mean and few values will be far away from the mean (very
low or very high as compared to the mean).
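
This idea is easy to see in a short simulation, assuming NumPy is available (the skewed population below is artificial):

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=50, size=100_000)   # a skewed, clearly non-normal population

# Draw 5,000 samples of size 30 (with replacement, for simplicity) and take each sample's mean.
samples = population[rng.integers(0, population.size, size=(5_000, 30))]
sample_means = samples.mean(axis=1)

# The sample means cluster around the population mean, and their spread is close to sigma/sqrt(n).
print(round(population.mean(), 2), round(sample_means.mean(), 2))
print(round(sample_means.std(), 2), round(population.std() / np.sqrt(30), 2))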

Binomial Distribution

The basic building block of the binomial distribution is a Bernoulli random variable. This
is a variable for which there can be only two possible outcomes, and the probability of these
outcomes satisfies the conditions of a valid probability distribution function, which is that each
probability is between 0 and 1 and the total probabilities sum up to 1 or 100 percent.

Since a single observation of the outcome of a Bernoulli random variable is called a trial,
the sum of a series of such trials is distributed as a binomial distribution.

Thus, one such example is the probability of getting a tail on the toss of a coin, which is
50 percent or .5. If there are 100 such tosses, you will find that getting 0 tails and 100 heads is
very unlikely, getting 50 heads and 50 tails is the most likely, and getting 100 tails and 0 heads
is just as unlikely.

Now let’s look at a scenario where you have four possible outcomes and the probability
of getting outcome 1, 2, or 3 defines success, while getting an outcome of 4 defines failure.
Thus, the probability of success is 75 percent, and the probability of failure is 25 percent. Now if
you were to try 200 trials, you will find that a similar distribution occurs, but the
distribution will be more skewed or can be seen to be a bit shifted as compared to the earlier 50-
50 distribution.

Figure 8. Demonstration Of Binomial Distribution
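
Both scenarios can be examined with scipy.stats.binom, assuming SciPy is available:

from scipy.stats import binom

# 100 tosses of a fair coin: 50 tails is the most likely count, 0 or 100 tails is extremely unlikely.
print(binom.pmf(50, n=100, p=0.5))
print(binom.pmf(0, n=100, p=0.5), binom.pmf(100, n=100, p=0.5))

# 200 trials with a 75 percent success probability: the distribution is centered near 150,
# so it looks shifted compared with the 50-50 case.
print(binom.mean(n=200, p=0.75), binom.pmf(150, n=200, p=0.75))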

Continuous Uniform Distribution

What if you have no prior beliefs about the distribution of probability or if you believe that
every outcome is equally possible? This is easy to handle when the variable is discrete. When
the same condition applies to a continuous variable, the distribution that emerges is called
the continuous uniform distribution (Figure 9). It is often used for random number generation in
simulations.

Figure 9. A continuous uniform distribution
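
Generating continuous uniform random numbers, as is typically done in simulations, takes one line with NumPy (assumed available):

import numpy as np

rng = np.random.default_rng(42)
draws = rng.uniform(low=0.0, high=1.0, size=10_000)   # every value in [0, 1) is equally likely

# For a continuous uniform on [a, b], the mean is (a + b)/2 and the variance is (b - a)**2 / 12.
print(round(draws.mean(), 3), round(draws.var(), 3))   # close to 0.5 and 1/12 ≈ 0.083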

Poisson Distribution

Let’s look at some events that occur at a continuous rate, such as phone calls coming
into a call center. Let the rate of occurrence be r, or lambda. When the rate is small (that is,
there are only one or two calls in a day), the possibility that you will get zero calls on certain
days is high. However, say the number of calls in the call center is on average 100 per day.
Then the possibility that you will ever get zero calls in a day is very low. This distribution is
called the Poisson distribution (Figure 10).

Figure 10. Some Poisson distributions
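
The contrast between a low-rate and a high-rate call center can be checked with scipy.stats.poisson (SciPy assumed available):

from scipy.stats import poisson

print(poisson.pmf(0, mu=2))     # average of 2 calls per day: zero-call days are fairly common
print(poisson.pmf(0, mu=100))   # average of 100 calls per day: a zero-call day is essentially impossible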

PARAMETRIC TESTS

The following are some parametric tests:

▪ Student’s t-test: t-tests look at the difference between two groups on the
same variable of interest, or they look at two variables in the same sample. The
consideration is that there can be only two groups at maximum.
▪ An example is if you want to compare the grades in English for students of Class 1’s
Section A and Section B. Another example is if you want to compare the grades in Class
1’s Section A for math and for science. (A short SciPy sketch of the three t-test variants
appears after this subsection.)
o One sample t-test: When the null hypothesis states that the mean of a variable
is less than or equal to a specific value, then that test is a one-sample t-test.

o Paired sample t-test: When the null hypothesis assumes that the mean of
variable 1 is equal to the mean of variable 2, then that test is a paired sample t-
test.

o Independent sample t-test: This compares the mean difference between two
independent groups for a given variable. The null hypothesis is that the mean for
the variable in sample 1 is equal to the mean for the same variable in sample 2.
The assumption is that the variance or standard deviation across the samples is
nearly equal.

For example, if you want to compare the grades in English for students of Class 1’s
Section A and Section B, you can also use an analysis of variance (ANOVA) test as a substitute
for the Student’s t-test.
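
As referenced above, here is a brief SciPy sketch of the three t-test variants; the grade lists are invented for illustration:

from scipy import stats

section_a = [78, 85, 92, 88, 75, 81, 90, 84, 79, 86]    # hypothetical English grades, Section A
section_b = [72, 80, 88, 85, 70, 78, 84, 82, 75, 79]    # hypothetical English grades, Section B
math_grades = [81, 79, 90, 86, 77, 83, 88, 85, 80, 84]  # math grades of the same Section A students

# One-sample t-test: is the mean English grade of Section A equal to 80?
print(stats.ttest_1samp(section_a, popmean=80))

# Paired-sample t-test: English vs. math grades for the same students.
print(stats.ttest_rel(section_a, math_grades))

# Independent-sample t-test: Section A vs. Section B (equal variances assumed).
print(stats.ttest_ind(section_a, section_b, equal_var=True))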

▪ ANOVA test: This tests the significance of differences between two or more groups
across one or more categorical variables. Thus, you will be able to figure out whether there is a
significant difference between groups, but it will not tell you which group is
different.
▪ An example is if you want to compare the grades in English for students of Class 1’s
Section A, Section B, and Section C. Another example is if you want to compare the
grades in Class 1’s Section A for math, English, and science.
o One-way ANOVA: In this test, you compare the means of a number of groups
based on one independent variable. There are some assumptions, such as that the
dependent variable is normally distributed and that the groups of the independent
variable have equal variance on the dependent variable.
o Two-way ANOVA: Here you can look at multiple groups and two factors. Again, the
assumption is that there is homogeneity of variance, that is, the standard deviations of
the populations of all the groups are similar. (A one-way ANOVA sketch using SciPy
follows.)
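
A one-way ANOVA sketch using SciPy, with invented grades for three sections:

from scipy import stats

section_a = [78, 85, 92, 88, 75, 81, 90, 84]
section_b = [72, 80, 88, 85, 70, 78, 84, 82]
section_c = [68, 74, 79, 81, 66, 73, 77, 75]

# One-way ANOVA: are the three group means equal?
f_stat, p_value = stats.f_oneway(section_a, section_b, section_c)
print(f_stat, p_value)   # a small p-value says at least one group differs, but not which one
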
NONPARAMETRIC TESTS

Here the data is not normally distributed. Thus, if the data is better represented by the
median instead of the mean, it is better to use nonparametric tests. It is also better to use
nonparametric tests if the data sample is small, the data is ordinal or ranked, or the data has
some outliers that you do not want to remove.

Chi-squared tests compare observed frequencies to expected frequencies and are


used across categorical variables.

As discussed, chi-square tests will be used on data that has ordinal or nominal variables.
For example, say you want to understand the population of Indian males in cities who regularly
exercise, sporadically exercise, or have not exercised over the last 20 years. Thus, you have
three responses tracked over 20 years, and you need to figure out whether the population has
shifted between year 1 and year 20. The null hypothesis here would mean that there is no
change or no difference in the situation.

▪ Year 1 statistics: 60 percent regularly exercise, 20 percent sporadically exercise, and 20


percent have not exercised.
▪ Year 20 statistics: 68 percent regularly exercise, 16 percent sporadically exercise, and
16 percent have not exercised.

The test for both years was run on 500 people. Now you would compare the year 20
statistics with what could be the expected frequencies of these people in year 20 (if the year 1
trends are followed) as compared to the observed frequencies.

The test is based on a numerical measure of the difference between the two histograms.
Let C be the number of categories in the histogram, and let Oi be the observed number of
observations in category i. Also, let Ei be the expected number of observations in category i if
the population were normal with the same mean and standard deviation as in the sample. Then
the goodness-of-fit measure below is used as a test statistic:

χ² = Σ (Oi − Ei)² / Ei, where the sum is taken over the C categories.

If the null hypothesis of normality is true, this test statistic has (approximately) a chi-
square distribution
with C - 3 degrees of freedom. Because large values of the test statistic
indicate a poor fit (the Oi’s do not match up well with the Ei’s), the p-value for the test is the
probability to the right of the test statistic in the chi-square distribution with C - 3 degrees of
freedom.
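
A brief sketch of the exercise-habit example above using scipy.stats.chisquare; the counts are derived from the percentages and the 500 respondents given in the text:

from scipy.stats import chisquare

observed_year20 = [340, 80, 80]          # 68%, 16%, 16% of 500 respondents
expected_if_no_change = [300, 100, 100]  # 60%, 20%, 20% of 500, i.e., the year 1 pattern

stat, p_value = chisquare(f_obs=observed_year20, f_exp=expected_if_no_change)
print(stat, p_value)   # a small p-value means the exercise pattern has shifted since year 1

Note that chisquare uses C - 1 degrees of freedom by default; the C - 3 adjustment described above applies when the expected counts come from a normal distribution fitted to the sample (in that case pass ddof=2).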

CHAPTER EXERCISES

1. Differentiate a population from a sample. Cite examples of both.

2. When is a sample used, and when is the entire population used?

3. What is the purpose of setting a confidence interval and a margin of error?

4. What is the purpose of a hypothesis? Of hypothesis testing?

5. Suppose you are going to study the difference in the buying behavior of millennials across their
demographic profiles. Formulate three hypotheses and determine what statistical treatment will
be used for each hypothesis.

SUGGESTED READINGS

Read article on Hypothesis Testing at https://statrek.com/hypothesis-test/hypothesis-testing.aspx

CHAPTER 7
DATA MINING

OVERVIEW

The types of data analysis discussed throughout this book are crucial to the success of most
companies in today’s data-driven business world. However, the sheer volume of available
data often defies traditional methods of data analysis. Therefore, new methods— and
accompanying software—have recently been developed under the name of data mining.

OBJECTIVES

▪ Define Data Mining


▪ Learn the powerful tools for exploring and visualizing data.
▪ Learn classifications of data mining
▪ Learn and understand the process of clustering

INTRODUCTION TO DATA MINING

Data mining attempts to discover patterns, trends, and relationships among data,
especially nonobvious and unexpected patterns. For example, an analysis might discover that
people who purchase skim milk also tend to purchase whole wheat bread, or that cars built on
Mondays before 10 a.m. on production line #5 using parts from supplier ABC have significantly
more defects than average. This new knowledge can then be used for more effective
management of a business.

The place to start is with a data warehouse. Typically, a data warehouse is a huge
database that is designed specifically to study patterns in data. A data warehouse is not the
same as the databases companies use for their day-to-day operations. A data warehouse
should (1) combine data from multiple sources to discover as many relationships as possible,
(2) contain accurate and consistent data, (3) be structured to enable quick and accurate
responses to a variety of queries, and (4) allow follow-up responses to specific relevant
questions. In short, a data warehouse represents a relatively new type of database, one that is
specifically structured to enable data mining. Another term you might hear is data mart. A data
mart is essentially a scaled-down data warehouse, or part of an overall data warehouse, that is
structured specifically for one part of an organization, such as sales. Virtually all large
organizations, and many smaller ones, have developed data warehouses or data marts in the
past decade to enable them to better understand their business—their customers, their
suppliers, and their processes.

Once a data warehouse is in place, analysts can begin to mine the data with a collection
of methodologies and accompanying software. Some of the primary methodologies are
classification analysis, prediction, cluster analysis, market basket analysis, and forecasting.
Each of these is a large topic in itself, but some brief explanations follow.

▪ Classification analysis attempts to find variables that are related to a categorical (often
binary) variable. For example, credit card customers can be categorized as those who
pay their balances in a reasonable amount of time and those who don’t. Classification
analysis would attempt to find explanatory variables that help predict which of these two
categories a customer is in. Some variables, such as salary, are natural candidates for
explanatory variables, but an analysis might uncover others that are less obvious.
▪ Prediction is similar to classification analysis, except that it tries to find variables that help
explain a continuous variable, such as credit card balance, rather than a categorical
variable. Regression, the topic of Chapters 10 and 11, is one of the most popular
prediction tools, but there are others not covered in this book.
▪ Cluster analysis tries to group observations into clusters so that observations within a
cluster are alike, and observations in different clusters are not alike. For example, one
cluster for an automobile dealer’s customers might be middle-aged men who are not
married, earn over $150,000, and favor high-priced sports cars. Once natural clusters
are found, a company can then tailor its marketing to the individual clusters. (A brief
clustering sketch appears after this list.)
▪ Market basket analysis tries to find products that customers purchase together in the
same “market basket.” In a supermarket setting, this knowledge can help a manager
position or price various products in the store. In banking and other settings, it can help
managers to cross-sell (sell a product to a customer already purchasing a related
product) or up-sell (sell a more expensive product than a customer originally intended to
purchase).
▪ Forecasting is used to predict values of a time series variable by extrapolating patterns
seen in historical data into the future. (This topic is covered in some detail in Chapter
12.) This is clearly an important problem in all areas of business, including the
forecasting of future demand for products, forecasting future stock prices and commodity
prices, and many others.
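
None of these methodologies require commercial software to experiment with. For instance, the clustering sketch referenced above can be built with the open-source scikit-learn library and some made-up customer records:

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customer records: [age, annual income in $000s]
customers = np.array([
    [25, 40], [27, 45], [24, 38],      # younger, lower income
    [52, 160], [55, 170], [49, 155],   # middle-aged, high income
    [38, 85], [41, 90], [36, 80],      # in between
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # the average profile of each cluster

In practice the features would first be scaled so that age and income contribute comparably to the distance calculation.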

DATA EXPLORATION AND VISUALIZATION

Data mining is a relatively new field—or at least a new term—and not everyone agrees
with its definition. To many people, data mining is a collection of advanced algorithms that can
be used to find useful information and patterns in large data sets. Data mining does indeed
include a number of advanced algorithms, but we believe its definition should be broadened to
include relatively simple methods for exploring and visualizing data. This section discusses
some of the possibilities.

Online Analytical Processing (OLAP)

We introduced pivot tables in Chapter 4 as an amazingly easy and powerful way to break
data down by category in Excel®. However, the pivot table methodology is not limited to Excel or
even to Microsoft. This methodology is usually called online analytical processing, or OLAP.
This name was initially used to distinguish this type of data analysis from online transactional
processing, or OLTP.
When analysts began to realize that the typical OLTP databases are not well equipped
to answer these broader types of questions, OLAP was born. This led to much research into the
most appropriate database structure for answering OLAP questions. The consensus was that
the best structure is a star schema. In a star schema, there is at least one Facts table of data
that has many rows and only a few columns. For example, in a supermarket database, a Facts
table might have a row for each line item purchased, including the number of items of the
product purchased, the total amount paid for the product, and possibly the discount. Each row of
the Facts table would also list “lookup information” (or foreign keys, in database terminology)
about the purchase: the date, the store, the product, the customer, any promotion in effect, and
possibly others. Finally, the database would include a dimension table for each of these. For
example, there would be a Products table. Each row of this table would contain multiple pieces
of information about a particular product. Then if a customer purchases product 15, say,
information about product 15 could be looked up in the Products table.

Most data warehouses are built according to these basic ideas. By structuring corporate
databases in this way, facts can easily be broken down by dimensions, and—you guessed it—
the methodology for doing this is pivot tables. However, these pivot tables are not just the
“standard” Excel pivot tables. You might think of them as pivot tables on steroids. The OLAP
methodology and corresponding pivot tables have the following features that distinguish them
from standard Excel pivot tables.

PowerPivot and Power View in Excel 2013

The general approach to data analysis embodied in pivot tables is one of the most
powerful ways to explore data sets. You learned about basic Excel pivot tables in Chapter 3,
and you learned about the more general OLAP technology in the previous subsection. This
subsection describes new Microsoft tools of the pivot table variety, PowerPivot and Power View,
that were introduced in Excel 2013. Actually, PowerPivot was available as a free add-in for
Excel 2010, but two things have changed in the version that is described here. First, you no
longer need to download a separate PowerPivot add-in. In Excel 2013, you can simply add it in
by checking it in the add-ins list. Second, the details of PowerPivot have changed. Therefore, if
you find a tutorial for the older PowerPivot add-in on the Web and try to follow it for Excel 2013,
you will see that the new version doesn’t work in the same way as before. So be aware that the
instructions in this section are relevant only for PowerPivot for Excel 2013 and not for the older
version.
Among other things, the PowerPivot add-in allows you to do the following:

▪ Import millions of rows from multiple data sources


▪ Create relationships between data from different sources, and between multiple tables in
a pivot table
▪ Create implicit calculated fields (previously called measures) — calculations created
automatically when you add a numeric field to the Values area of the Field List
▪ Manage data connections

Interestingly, Microsoft refers to building a data model in Excel in its discussion of


PowerPivot. This is a somewhat new Microsoft term, and they have provided the following
definition.

Data Model: A collection of tables and their relationships that reflects the real-world
relationships between business functions and processes—for example, how Products relates to
Inventory and Sales.

If you have worked with relational databases, this definition is nothing new. It is
essentially the definition of a relational database, a concept that has existed for decades. The
difference is that the data model is now contained entirely in Excel, not in Access or some other
relational database package.

Visualization Software

As the Power View tool from the previous subsection illustrates, you can gain a lot of
insight by using charts to view your data in imaginative ways. This trend toward powerful
charting software for data visualization is the wave of the future and will certainly continue.
Although this book is primarily about Microsoft software—Excel—many other companies are
developing visualization software. To get a glimpse of what is currently possible, you can watch
the accompanying video about a free software package, Tableau Public, developed by
Tableau Software. Perhaps you will find other visualization software packages, free or
otherwise, that rival Tableau or Power View. Alternatively, you might see blogs with data
visualizations from ordinary users. In any case, the purpose of charting software is to portray
data graphically so that otherwise hidden trends or patterns can emerge clearly.

MICROSOFT DATA MINING ADD-INS FOR EXCEL

The methods discussed so far in this chapter, all of which basically revolve around pivot
tables, are extremely useful for data exploration, but they are not always included in discussions
of “data mining.” To many analysts, data mining refers only to the algorithms discussed in the
remainder of this chapter. These include, among others, algorithms for classification and for
clustering. (There are many other types of data mining algorithms not discussed in this book.)
Many powerful software packages have been developed by software companies such as SAS,
IBM SPSS, Oracle, Microsoft, and others to implement these data mining algorithms.
Unfortunately, this software not only takes time to master, but it is also quite expensive. The
only data mining algorithms discussed here that are included in the software that accompanies
the book are logistic regression and neural nets, two classification methods that are part of the
Palisade suite, and they are discussed in the next section.

To provide you with illustrations of other data mining methods, we will briefly discuss
Microsoft data mining add-ins for Excel. The good news is that these add-ins are free and easy
to use. You can find them by searching the Web for Microsoft Data Mining Add-ins.

The names of these add-ins provide a clue to their downside. These add-ins are really
only front ends—client tools—for the Microsoft engine that actually performs the data mining
algorithms. This engine is called Analysis Services and is part of Microsoft’s SQL Server
database package. (SQL Server Analysis Services is often abbreviated as SSAS.) In short,
Microsoft decided to implement data mining in SSAS. Therefore, to use its Excel data mining
add-ins, you must have a connection to an SSAS server. This might be possible in your
academic or corporate setting, but it can definitely be a hurdle.

Classification Methods

The previous section introduced one of the most important problems studied in data
mining, the classification problem. This is basically the same problem attacked by regression
analysis—using explanatory variables to predict a dependent variable—but now the dependent
variable is categorical. It usually has two categories, such as Yes and No, but it can have more
than two categories, such as Republican, Democrat, and Independent. This problem has been
analyzed with very different types of algorithms, some regression-like and others very different
from regression, and this section discusses some of the most popular classification methods.

But each of the methods has the same objective: to use data from the explanatory variables to
classify each record (person, company, or whatever) into one of the known categories.

Before proceeding, it is important to discuss the role of data partitioning in classification and in data mining in general. Data mining is usually used to explore very large data sets, with
many thousands or even millions of records. Therefore, it is very possible, and also very useful,
to partition the data set into two or even three distinct subsets before the algorithms are applied.
Each subset has a specified percentage of all records, and these subsets are typically chosen
randomly. The first subset, usually with about 70% to 80% of the records, is called the training
set. The second subset, called the testing set, usually contains the rest of the data. Each of
these sets should have known values of the dependent variable. Then the algorithm is trained
with the data in the training set. This results in a model that can be used for classification. The
next step is to test this model on the testing set. It is very possible that the model will work quite
well on the training set because this is, after all, the data set that was used to create the model.
The real question is whether the model is flexible enough to make accurate classifications in the
testing set.

Most data mining software packages have utilities for partitioning the data. (In the
following subsections, you will see that the logistic regression procedure in StatTools does not
yet have partitioning utilities, but the Palisade NeuralTools add-in for neural networks does have
them, and the Microsoft data mining add-in for classification trees also has them.) The various
software packages might use slightly different terms for the subsets, but the overall purpose is
always the same, as just described. They might also let you specify a third subset, often called a
prediction set, where the values of the dependent variable are unknown. Then you can use the
model to classify these unknown values. Of course, you won’t know whether the classifications
are accurate until you learn the actual values of the dependent variable in the prediction set.
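As a minimal, software-neutral sketch of partitioning (not tied to StatTools, NeuralTools, or the Microsoft add-ins), the following Python code splits a data set at random into a training set and a testing set; the 70/30 split, the column names, and the made-up data are assumptions for illustration only.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data set with a known 0/1 dependent variable (Trier).
data = pd.DataFrame({
    "Income": rng.normal(60000, 15000, n).round(0),
    "MallTrips": rng.integers(0, 12, n),
    "Trier": rng.integers(0, 2, n),
})

# Randomly assign roughly 70% of the records to the training set
# and the remaining 30% to the testing set.
is_train = rng.random(n) < 0.70
training_set = data[is_train]
testing_set = data[~is_train]

# A model would be fit on training_set and then evaluated on testing_set,
# whose known Trier values let you check classification accuracy.
print(len(training_set), len(testing_set))

A prediction set would simply be additional rows without known values of the dependent variable, which the fitted model would then classify.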

Logistic Regression

Logistic regression is a popular method for classifying individuals, given the values of a
set of explanatory variables. It estimates the probability that an individual is in a particular
category. As its name implies, logistic regression is somewhat similar to the usual regression
analysis, but its approach is quite different. It uses a nonlinear function of the explanatory
variables for classification.

Logistic regression is essentially regression with a dummy (0–1) dependent variable. For the two-category problem (the only version of logistic regression discussed here), the dummy variable indicates whether an observation is in category 0 or category 1. One approach to the classification problem, an approach that is sometimes actually used, is to run the usual multiple regression on the data, using the dummy variable as the dependent variable. However, this approach has two serious drawbacks. First, it violates the regression assumption that the error terms should be normally distributed. Second, the predicted values of the dependent variable are not guaranteed to lie between 0 and 1; they can be less than 0 or greater than 1. If a predicted value is supposed to estimate a probability, values outside the 0-to-1 range make no sense.

Therefore, logistic regression takes a slightly different approach. Let X1 through Xk be the potential explanatory variables, and create the linear function b0 + b1X1 + ⋯ + bkXk. Unfortunately, there is no guarantee that this linear function will be between 0 and 1, and hence that it will qualify as a probability. But the nonlinear function

1 / (1 + e^−(b0 + b1X1 + ⋯ + bkXk))

is always between 0 and 1. In fact, the function f(x) = 1/(1 + e^−x) is an “S-shaped” logistic curve, as shown in Figure 12. For large negative values of x, the function approaches 0, and for large positive values of x, it approaches 1.

Figure 12. S-shaped Logistic Curve

The logistic regression model uses this function to estimate the probability that any
observation is in category 1. Specifically, if p is the probability of being in category 1, the model

p = 1 / (1 + e^−(b0 + b1X1 + ⋯ + bkXk))

is estimated. This equation can be manipulated algebraically to obtain an equivalent form:


ln(p / (1 − p)) = b0 + b1X1 + ⋯ + bkXk

This equation says that the natural logarithm of p/(1 − p) is a linear function of the
explanatory variables. The ratio p/(1 − p) is called the odds ratio.

The odds ratio is a term frequently used in everyday language. Suppose, for example,
that the probability p of a company going bankrupt is 0.25. Then the odds that the company will
go bankrupt are p/(1 − p) = 0.25/0.75 = 1/3, or “1 to 3.” Odds ratios are probably most common
in sports. If you read that the odds against Indiana winning the NCAA basketball championship
are 4 to 1, this means that the probability of Indiana winning the championship is 1/5. Or if you
read that the odds against Purdue winning the championship are 99 to 1, then the probability
that Purdue will win is only 1/100.

The logarithm of the odds ratio, the quantity on the left side of the above equation, is
called the logit (or log odds). Therefore, the logistic regression model states that the logit is a
linear function of the explanatory variables. Although this is probably a bit mysterious and there
is no easy way to justify it intuitively, logistic regression has produced useful results in many
applications.
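As a quick numeric check of these relationships, here is a small sketch (an illustration only) that starts from the bankruptcy probability of 0.25 used above, computes the odds and the logit, and then recovers the probability by applying the logistic function to the logit.

import math

p = 0.25                      # probability of bankruptcy from the example
odds = p / (1 - p)            # 0.25/0.75 = 1/3, i.e., "1 to 3"
logit = math.log(odds)        # natural log of the odds

# Applying the logistic function to the logit recovers the probability,
# which is why the model works on the logit (log odds) scale.
p_back = 1 / (1 + math.exp(-logit))
print(odds, logit, p_back)    # 0.333..., -1.0986..., 0.25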

Although the numerical algorithm used to estimate the regression coefficients is complex,
the important goal for our purposes is to interpret the regression coefficients correctly. First, if a coefficient b is positive, then as its X increases, the log odds increase, and hence the probability of being in category 1 increases. The opposite is true for a negative b. So just by looking at the signs of the coefficients, you can see which Xs are positively associated with being in category 1 (the positive bs) and which are positively associated with being in category 0 (the negative bs). You
can also look at the magnitudes of the bs to try to see which of the Xs are “most important” in
explaining category membership. Unfortunately, you run into the same problem as in regular
regression. Some Xs are typically of completely different magnitudes than others, which makes
comparisons of the bs difficult. For example, if one X is income, with values in the thousands,
and another X is number of children, with values like 0, 1, and 2, the coefficient of income will
probably be much smaller than the coefficient of children, even though these two variables
might be equally important in explaining category membership.
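The following sketch shows how these interpretation issues play out in practice, using scikit-learn in Python rather than any of the packages that accompany this book; the data, the variable names (income and children), and the coefficient values used to generate the data are all invented for illustration. Standardizing the explanatory variables before fitting puts their coefficients on a comparable scale.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# Hypothetical explanatory variables on very different scales.
income = rng.normal(50000, 12000, n)      # tens of thousands of dollars
children = rng.integers(0, 4, n)          # small integer counts

# Hypothetical 0/1 dependent variable generated so it depends on both Xs.
score = -2 + 0.00005 * income + 0.8 * children
prob = 1 / (1 + np.exp(-score))
y = rng.random(n) < prob

X = np.column_stack([income, children])

# Raw coefficients: income's coefficient looks tiny only because of its scale.
raw = LogisticRegression(max_iter=1000).fit(X, y)
print("raw coefficients:", raw.coef_)

# Standardizing puts the Xs on a common scale, so magnitudes are comparable.
X_std = StandardScaler().fit_transform(X)
std = LogisticRegression(max_iter=1000).fit(X_std, y)
print("standardized coefficients:", std.coef_)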

Classification Trees

Classification methods such as logistic regression and neural networks use complex nonlinear functions to capture the relationship between explanatory variables and a categorical dependent variable. The method discussed in this subsection, classification trees,
is also capable of discovering nonlinear relationships, but it is much more intuitive. This method,
which has many variations, has existed for decades, and it has been implemented in a variety of
software packages. Unfortunately, it is not available in any of the software that accompanies this
book, but it is available in the free Microsoft Data Mining Add-Ins discussed earlier. The
essential features of the method are explained here, and the accompanying video, Decision
Trees with Microsoft Data Mining Add-In, illustrates the method.

The attractive aspect of this method is that the final result is a set of simple rules for
classification. As an example, the final tree might look like the one in Figure 13. Each box has a
bar that shows the purity of the corresponding box, where blue corresponds to Yes values and
red corresponds to No values. The first split, actually a three-way split, is on Mall Trips: fewer
than 4, 4 or 5, and at least 6. Each of these is then split in a different way. For example, when
Mall Trips is fewer than 4, the split is on Nbhd West versus Nbhd not West. The splits you see
here are the only ones made. They achieve sufficient purity, so the algorithm stops splitting after
these.

Predictions are then made by majority rule. As an example, suppose a person has made
3 mall trips and lives in the East. This person belongs in the second box down on the right,
which has a large majority of No values. Therefore, this person is classified as a No. In contrast,
a person with 10 mall trips belongs in one of the two bottom boxes on the right. This person is
classified as a Yes because both of these boxes have a large majority of Yes values. In fact, the
last split on Age is not really necessary.

This classification tree leads directly to the following rules.

▪ If the person makes fewer than 4 mall trips:
  o If the person lives in the West, classify as a trier.
  o If the person doesn’t live in the West, classify as a nontrier.
▪ If the person makes 4 or 5 mall trips:
  o If the person doesn’t live in the East, classify as a trier.
  o If the person lives in the East, classify as a nontrier.
▪ If the person makes at least 6 mall trips, classify as a trier.

The ability of classification trees to provide such simple rules, plus fairly accurate
classifications, has made this a very popular classification technique.
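Because the tree reduces to rules, it can be written down as a simple function. The sketch below encodes exactly the rule set listed above; the function and argument names are made up for illustration, and the two test cases are the examples discussed in the text.

def classify(mall_trips: int, neighborhood: str) -> str:
    """Classify a shopper as a 'trier' or 'nontrier' using the tree's rules."""
    if mall_trips < 4:
        return "trier" if neighborhood == "West" else "nontrier"
    elif mall_trips <= 5:                      # 4 or 5 mall trips
        return "nontrier" if neighborhood == "East" else "trier"
    else:                                      # at least 6 mall trips
        return "trier"

# The two examples from the text: 3 mall trips in the East, and 10 mall trips.
print(classify(3, "East"))    # nontrier
print(classify(10, "North"))  # trier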

Clustering

In data mining terminology, the classification methods in the previous section are called
supervised data mining techniques. This term indicates that there is a dependent variable the
method is trying to predict. In contrast, the clustering methods discussed briefly in this section
are called unsupervised data mining techniques. Unsupervised methods have no dependent
variable. Instead, they search for patterns and structure among all of the variables.

Clustering is probably the most common unsupervised method, and it is the only one
discussed here. However, another popular unsupervised method you might encounter is market
basket analysis (also called association analysis), where patterns of customer purchases are
examined to see which items customers tend to purchase together, in the same “market basket.”
This analysis can be the basis for product shelving arrangements, for example.

Clustering, known in marketing circles as segmentation, tries to group entities (customers, companies, cities, or whatever) into similar clusters, based on the values of their
variables. This method bears some relationship to classification, but the fundamental difference
is that in clustering, there are no fixed groups like the triers and nontriers in classification.
Instead, the purpose of clustering is to discover the number of groups and their characteristics,
based entirely on the data.

Clustering methods have existed for decades, and a wide variety of clustering methods
have been developed and implemented in software packages. The key to all of these is the
development of a dissimilarity measure. Specifically, to compare two rows in a data set, you
need a numeric measure of how dissimilar they are. Many such measures are used. For
example, if two customers have the same gender, they might get a dissimilarity score of 0,
whereas two customers of different genders might get a dissimilarity score of 1. Or if the
incomes of two customers are compared, they might get a dissimilarity score equal to the
squared difference between their incomes. The dissimilarity scores for different variables are
then combined in some way, such as normalizing and then summing, to get a single dissimilarity
score for the two rows as a whole.
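A minimal sketch of one possible dissimilarity score for two customers, following the gender and income examples just described; the particular normalization used here (dividing the income difference by an assumed income range before squaring) is only one of many reasonable choices.

def dissimilarity(cust_a, cust_b, income_range):
    """Combine a gender mismatch score and a normalized squared income difference."""
    gender_score = 0.0 if cust_a["Gender"] == cust_b["Gender"] else 1.0
    income_score = ((cust_a["Income"] - cust_b["Income"]) / income_range) ** 2
    return gender_score + income_score

a = {"Gender": "F", "Income": 42000}
b = {"Gender": "M", "Income": 58000}
print(dissimilarity(a, b, income_range=100000))   # 1.0 + 0.0256 = 1.0256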

Once a dissimilarity measure is developed, a clustering algorithm attempts to find clusters of rows so that rows within a cluster are similar and rows in different clusters are
dissimilar. Again, there are many ways to do this, and many variations appear in different
software packages. For example, the package might let you specify the number of clusters
ahead of time, or it might discover this number automatically.

In any case, once an algorithm has discovered, say, five clusters, your job is to understand (and possibly name) these clusters. You do this by exploring the distributions of variables
in different clusters. For example, you might find that one cluster is composed mostly of older
women who live alone and have modest incomes, whereas another cluster is composed mostly
of wealthy married men.
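As a rough sketch of this workflow, the Python code below uses k-means (one common clustering algorithm, with the number of clusters specified ahead of time) on invented customer data and then profiles each cluster by averaging its variables; none of this is tied to a particular commercial package.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 300

# Hypothetical customer data.
customers = pd.DataFrame({
    "Age": rng.integers(20, 80, n),
    "Income": rng.normal(55000, 20000, n).round(0),
    "HouseholdSize": rng.integers(1, 6, n),
})

# Standardize so no single variable dominates the dissimilarity measure,
# then ask for five clusters (the number is specified ahead of time here).
X = StandardScaler().fit_transform(customers)
customers["Cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Profile the clusters by looking at variable averages within each one;
# this is the step where you try to understand (and possibly name) them.
print(customers.groupby("Cluster").mean().round(1))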

CHAPTER EXERCISES

1. What is data mining used for?

2. How does the OLAP methodology allow you to drill down in a pivot table?

3. What is the main purpose of logistic regression? How does it differ from the regression
discussed in the previous chapter?

SUGGESTED READINGS

Read Data mining at http://sas.com/n_ph/insights/analytics/data-mining.html

REFERENCES

Albright, S. C., & Winston, W. (2015). Business Analytics: Data Analysis and Decision Making, Fifth Edition. Cengage Learning, USA.

Inmon, W. (2002). Building the Data Warehouse, 3rd ed. John Wiley & Sons, Inc., Canada.

Ragsdale, C. (2014). Spreadsheet Modeling and Decision Analysis: A Practical Introduction to Business Analytics, 5th Edition. Thomson South-Western, USA.

Tripathi, S. S. (2016). Learn Business Analytics in Six Steps Using SAS and R. Apress Media, LLC.
