
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES

Sta. Mesa, Manila


College of Business Administration
Department of Office Administration

LEARNING MODULE
IN

CUSTOMER ANALYTICS
BUMA 30093
Mecmack A. Nartea

COURSE DESCRIPTION:

The course provides students with an overview of the current trends in business/customer analytics that drive
today’s businesses. The course will provide an understanding of data management techniques that can help an
organization achieve its business goals and address operational challenges.

COURSE OBJECTIVES:

As a result of taking this course, the student should be able to:


1. Describe the various sources of data (structured, unstructured) and the concept of data management;
2. Describe the importance of data, how data can be used by an organization towards competitive advantage,
and how it enables organizations to make quicker and better business decisions;
3. Describe, understand and explain business modeling, the business modeling process and be able to apply it
in a variety of different situations;
4. Describe basics of business intelligence including data gathering, data storing, data analyzing and providing
access to data;
5. Describe how statistical analysis can help to better understand past events and predict future events;
6. Understand the fundamentals of project risk management, various methods used for effort and cost
estimation, various phases within a project, dependencies and critical path;
7. Describe various database models like the hierarchical database model and network model; and
8. Develop an awareness of the ethical norms as required under policies and applicable laws governing
confidentiality and non-disclosure of data/information/documents and proper conduct in the learning
process and application of business analytics.

TABLE OF CONTENTS

Page No.

Chapter 1 The Process of Analytics 3


Evolution of Analytics: How Did Analytics Start? 7
The Quality Movement 9
The Second World War 10
Where Else Was Statistics Involved? 11
The Dawn of Business Intelligence 12
Chapter Exercise 13
Chapter 2 Analytics: A Comprehensive Study 14
Definition of Business Analytics 15
Types of Analytics 15
Basic Domains within Analytics 16
Definition Of Analytics 17
Analytics vs. Analysis 18
Examples of Analytics 18
Software Analytics 21
Embedded Analytics 23
Learning Analytics 24
Differentiating Learning Analytics and Educational Data Mining 26
Chapter Exercise 27
Chapter 3 Descriptive Statistical Measures 29
Populations And Samples 30
Data Sets, Variables, And Observations 30
Types Of Data 31
Descriptive Measures For Categorical Variables 33
Descriptive Measures For Numerical Variables 33
Measures Of Central Tendency 33
Measures Of Variability 38
Outliers And Missing Values 42
Chapter Exercise 44
Chapter 4 Analytics on Spreadsheets 46
Excel Tables For Filtering, Sorting, And Summarizing 47
Chapter Exercise 50
Chapter 5 Probability and Probability Distribution 51
Probability Essentials 53
Rule of Complements 53
Addition Rule 54
Conditional Probability and the Multiplication Rule 55
Probability Distribution Of A Single Random Variable 56
Summary Measures of a Probability Distribution 57
Chapter Exercise 58
Chapter 6 Statistical Inference: Sampling and Estimation 60
Understanding Samples 61
Sampling Techniques 61
Determining Sample Size 65
Introduction To Estimation 65
Sources of Estimation Error 66

Key Terms in Sampling 67
Sample Size Selection 68
Confidence Intervals 68
What Is The P-Value? 69
Errors In Hypothesis Testing 70
Sampling Distributions 72
Parametric Tests 74
Nonparametric Tests 76
Chapter Exercise 77
Chapter 7 Data Mining 78
Introduction to Data Mining 79
Data Exploration and Visualization 80
Online Analytical Processing (OLAP) 80
PowerPivot and Power View in Excel 2013 81
Visualization Software 82
Microsoft Data Mining Add-Ins For Excel 83
Classification Methods 83
Logistic Regression 84
Classification Trees 87
Clustering 88
Chapter Activity 89

CHAPTER 1
THE PROCESS OF ANALYTICS

OVERVIEW

This chapter discusses how business analytics is used in daily life. It
further discusses the various software used in analytics. The history of
how and when analytics started is also tackled in this chapter.

OBJECTIVES

▪ Learn the evolution of analytics
▪ Learn where analytics was involved in history
▪ Understand how business intelligence emerged

What Is Analytics? What Does a Data Analyst Do?

A casual search on the Internet for data scientist offers up the fact that there is a
substantial shortage of manpower for this job. In addition, Harvard Business Review has
published an article called “Data Scientist: The Sexiest Job of the 21st Century.” So, what does
a data analyst actually do?

To put it simply, analytics is the use of numbers or business data to find solutions for
business problems. Thus, a data analyst looks at the data that has been collected across
huge enterprise resource planning (ERP) systems, Internet sites, and mobile applications.

In the “old days,” we just called upon an expert, who was someone with a lot of
experience. We would then take that person’s advice and decide on the solution. It’s much
like we visit the doctor today, who is a subject-matter expert.

As the complexity of business systems went up and we entered an era of continuous
change, people found it hard to deal with such complex systems that had never existed before.
The human brain is much better at working with fewer variables than many. Also, people
started using computers, which are relatively better and unbiased when it comes to new forms
and large volumes of data.

An Example

The next question often is, what do I mean by “use of numbers”? Will you have to do
math again?

The last decade has seen the advent of software as a service (SaaS) in all walks of
information gathering and manipulation. Thus, analytics systems now are button-driven
systems that do the calculations and provide the results. An analyst or data scientist has to
look at these results and make recommendations for the business to implement. For example,
say a bank wants to sell loans in the market. It has data of all the customers who have taken
loans from the bank over the last 20 years. The portfolio is of, say, 1 million loans. Using this
data, the bank wants to understand which customers it should give pre-approved loan offers
to.

The simplest answer may be as follows: all the customers who paid on time every time
in their earlier loans should get a pre-approved loan offer. Let’s call this set of customers
Segment A. But on analysis, you may find that customers who defaulted but paid the loan

after the default actually made more money for the bank because they paid interest plus the
late payment charges. Let’s call this set Segment B.

Hence, you can now say that you want to send out an offer letter to Segment A +
Segment B.

However, within Segment B there was a set of customers to whose homes you had to send
collections teams to collect the money. So, they paid interest plus the late payment charges
minus the collection cost. This set is Segment C.

So, you may then decide to target Segment A + Segment B – Segment C.

You could do this exercise using the decision tree technique that cuts your data into
segments (Figure 1-1).
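To make the segmentation concrete, here is a minimal sketch in Python (using pandas) of the Segment A + Segment B - Segment C logic described above. The column names (paid_on_time, repaid_after_default, needed_field_collection) and the tiny data set are hypothetical, purely for illustration; in practice the portfolio would come from the bank's systems, and the segments might instead be produced by a fitted decision tree.

import pandas as pd

# Hypothetical loan portfolio; in practice this would be read from the bank's data warehouse.
loans = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "paid_on_time": [True, False, False, False, True],
    "repaid_after_default": [False, True, True, False, False],
    "needed_field_collection": [False, False, True, False, False],
})

segment_a = loans["paid_on_time"]                         # always paid on time
segment_b = loans["repaid_after_default"]                 # defaulted but later repaid with charges
segment_c = segment_b & loans["needed_field_collection"]  # repaid only after costly collection visits

# Target list: Segment A + Segment B - Segment C
target = loans[(segment_a | segment_b) & ~segment_c]
print(target["customer_id"].tolist())                     # [1, 2, 5] for this toy data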

A Typical Day

The last question to tackle is, what does the workday of an analytics professional look
like? It probably encompasses the following:

The data analyst will walk into the office and be told about the problem that the
business needs input on.

The data analyst will determine the best way to solve the problem.

The data analyst will then gather the relevant data from the large data sets stored in
the server.

Next, the data analyst will import the data into the analytics software.

The data analyst will run the technique through the software (SAS, R, SPSS, XLSTAT,
and so on).

The software will produce the relevant output.

The data analyst will study the output and prepare a report with recommendations.

The report will be discussed with the business.
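The steps above can be compressed into a very small sketch. The file name, the column names, and the choice of pandas are all assumptions made for illustration; the same gather-analyze-report loop could equally run in SAS, R, or SPSS.

import pandas as pd

# Gather the relevant data (a hypothetical extract from the server).
data = pd.read_csv("loan_portfolio.csv")

# Run a simple technique: summarize balances by customer segment.
summary = data.groupby("segment")["balance"].agg(["count", "mean"])

# Study the output; the written recommendations would be based on this table.
print(summary)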

Is Analytics for You?

So, is analytics the right career for you? Here are some points that will help you decide:

Do you believe that data should be the basis of all decisions? Take up analytics only
if your answer to this question is an unequivocal yes. Analytics is the process of
using and analyzing a large quantum of data (numbers, text, images, and so on) by
aggregating, visualizing/creating dashboards, checking repetitive trends, and
creating models on which decisions can be made. Only people who innately believe
in the power of data will excel in this field. If some prediction/analysis is wrong, the
attitude of a good analyst is that it is because the data was not appropriate for the
analysis or the technique used was incorrect. You will never doubt that a correct
decision will be made if the relevant data and appropriate techniques are used.

Do you like to constantly learn new stuff? Take up analytics only if your answer to
this question is an unequivocal yes. Analytics is a new field. There is a constant
increase in the avenues of data, such as Internet data, social networking
information, mobile transaction data, and near-field communication devices. There
are constant changes in technology to store, process, and analyze this data. Hadoop,
Google updates, and so on, have become increasingly important. Cloud computing
and data management are common now. Economic cycles have shortened, and
model building has become more frequent as older models get redundant. Even the
humble Excel has an Analysis ToolPak in Excel 2010 with statistical functions. In
other words, be ready for change.

Do you like to interpret outcomes and then track them to see whether your
recommendations were right? Take up analytics only if your answer to this question
is an unequivocal yes. A data analyst will work on a project, and the implementation
of the recommendations will generally be valid for a reasonably long period of time,

perhaps a year or even three to five years. A good analyst should be interested to
know how accurate the recommendations have been and should want to track the
performance periodically. You should ideally also be the first person to be able to
say when the analysis is not working and needs to be reworked.

Are you ready to go back to a textbook and brush up on the concepts of math and
statistics? Take up analytics only if your answer to this question is an unequivocal
yes. To accurately handle data and interpret results, you will need to brush up on
the concepts of math and statistics. It becomes important to justify why you chose a
particular path during analysis versus others. Business users will not accept your
word blindly.

Do you like debating and logical thinking? Take up analytics only if your answer to
this question is an unequivocal yes. As there is no one solution to all problems, an
analyst has to choose the best way to handle the project/problem at hand. The
analyst has to be able to not only know the best way to analyze the data but also
give the best recommendation in the given time constraints and budget constraints.
This sector generally has a very open culture where the analyst working on a
project/problem will be required to give input irrespective of the analyst’s position in
the hierarchy.

Do check your answers to the previous questions. If you said yes for three out of these
five questions and an OK for two, then analytics is a viable career option for you. Welcome to
the world of analytics!

Evolution of Analytics: How Did Analytics Start?

As per the Oxford Dictionary, the definition of statistics is as follows:

The practice or science of collecting and analyzing numerical data in large quantities,
especially for the purpose of inferring proportions in a whole from those in a
representative sample.1

Most people start working with numbers, counting, and math by the time they are five
years old. Math includes addition, subtraction, theorems, rules, and so on. Statistics is when
we start using math concepts to work on real-life data.

Statistics is derived from the Latin word status, the Italian word statista, or the German
word statistik, each of which means a political state. This word came into being somewhere
around 1780 to 1790.

In ancient times, the government collected information regarding the population,
property, and wealth of the country. This enabled the government to get an idea of the
manpower of the country and became the basis for introducing taxes and levies. Statistics is
the practical part of math.

The implementation of standards in industry and commerce became important with
the onset of the Industrial Revolution, where there arose a need for high-precision machine
tools and interchangeable parts. Standardization is the process of developing and
tools and interchangeable parts. Standardization is the process of developing and
implementing technical standards. It helps in maximizing compatibility, interoperability, safety,
repeatability, and quality.

Nuts and bolts held the industrialization process together; in 1800, Henry Maudslay
developed the first practical screw-cutting lathe. This allowed for the standardization of screw
thread sizes and paved the way for the practical application of interchangeability for nuts and
bolts. Before this, screw threads were usually made by chipping and filing manually.

Maudslay standardized the screw threads used in his workshop and produced sets of
nuts and bolts to those standards so that any bolt of the appropriate size would fit any nut of
the same size.

Joseph Whitworth’s screw thread measurements were adopted as the first unofficial
national standard by companies in Britain in 1841 and came to be known as the British
standard Whitworth.

By the end of the 19th century, differences in standards between companies were
making trading increasingly difficult. The Engineering Standards Committee was established
in London in 1901. By the mid-to-late 19th century, efforts were being made to standardize
electrical measurements. Many companies had entered the market in the 1890s, and all chose

their own settings for voltage, frequency, current, and even the symbols used in circuit
diagrams, making standardization necessary for electrical measurements.

The International Federation of the National Standardizing Associations was founded
in 1926 to enhance international cooperation for all technical standards and certifications.

The Quality Movement

Once manufacturing became an established industry, the emphasis shifted to
minimizing waste and therefore cost. This movement was led by engineers who were, by
training, adept at using math. This movement was called the quality movement. Some
practices that came from this movement are Six Sigma and just-in-time manufacturing in
supply chain management. The point is that all this started with the Industrial Revolution in the 1800s.

This was followed by the factory system, with its emphasis on product inspection.

After the United States entered World War II, quality became a critical component
since bullets from one state had to work with guns manufactured in another state. For example,
the U.S. Army had to manually inspect every piece of machinery, but this was very time-
consuming. Statistical techniques such as sampling started being used to speed up the
processes.

Japan around this time was also becoming conscious of quality. The quality initiative
started with a focus on defects and products and then moved on to look at the process used
for creating these products. Companies invested in training their workforce on Total Quality
Management (TQM) and statistical techniques.

This phase saw the emergence of seven “basic tools” of quality.

Statistical Process Control, used since the early 1920s, is a method of quality control using
statistical methods, in which monitoring and controlling the process ensures that it operates at
its full potential. At its full potential, a process can churn out as much conforming, standardized
product as possible with a minimum of waste.

This is used extensively in manufacturing lines with a focus on continuous
improvement and is practiced in these two phases:

Initial establishment of the process

Regular production use of the process

The advantage of Statistical Process Control (SPC) over the methods of quality control
such as inspection is that it emphasizes early detection and prevention of problems rather
than correcting problems after they occur.
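As a rough illustration of the idea, the sketch below sets control limits from an in-control baseline run (mean plus or minus three standard deviations, a common rule of thumb in SPC) and flags later measurements that fall outside those limits. All the numbers are made up for illustration.

# Set control limits from an in-control baseline run, then flag later measurements.
baseline = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7]
new_measurements = [10.1, 9.9, 12.5, 10.2]

mean = sum(baseline) / len(baseline)
std = (sum((x - mean) ** 2 for x in baseline) / len(baseline)) ** 0.5
ucl, lcl = mean + 3 * std, mean - 3 * std          # upper and lower control limits

flagged = [x for x in new_measurements if not (lcl <= x <= ucl)]
print(f"LCL={lcl:.2f}, UCL={ucl:.2f}, flagged: {flagged}")   # flags the 12.5 reading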

The following were the next steps:

Six Sigma: A process of measurement and improvement perfected by GE and adopted by the world

Kaizen: A Japanese term for continuous improvement; a step-by-step improvement of business processes

PDCA: Plan-Do-Check-Act, as defined by Deming

What was happening on the government front? The most data was being
captured and used by the military. A lot of the business terminologies and processes used
today have been copied from the military: sales campaigns, marketing strategy, business
tactics, business intelligence, and so on.

The Second World War

As mentioned, statistics made a big difference during World War II. For instance, the
Allied forces accurately estimated the production of German tanks using statistical methods.
They also used statistics and logical rules to decode German messages.
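The tank estimate is usually described with the classic "German tank problem" estimator: if k captured serial numbers have maximum m, total production is estimated as m + m/k - 1. The serial numbers below are invented, only to show the arithmetic.

serials = [19, 40, 42, 60]          # hypothetical captured serial numbers
k, m = len(serials), max(serials)
estimate = m + m / k - 1            # 60 + 60/4 - 1 = 74
print(f"Estimated total production: {estimate:.0f}")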

The Kerrison Predictor was one of the first fully automated anti-aircraft fire control systems;
it could aim a gun at an aircraft based on simple inputs such as the angle to the target and the
observed speed. The British Army used it effectively in the early 1940s.

The Manhattan Project was a U.S. government research project in 1942–1945 that
produced the first atomic bomb. Under this, the first atomic bomb was exploded in July 1945
at a site in New Mexico. The following month, the other atomic bombs that were produced by
the project were dropped on Hiroshima and Nagasaki, Japan. This project used statistics to
run simulations and predict the behavior of nuclear chain reactions.

Where Else Was Statistics Involved?

Weather predictions, especially of rain, affected the world economy the most since
weather affected the agriculture industry. The first attempt to forecast the weather
numerically was made in 1922 by Lewis Fry Richardson.

The first successful numerical prediction was performed using the ENIAC digital
computer in 1950 by a team of American meteorologists and mathematicians.2

Then, 1956 saw analytics solve the shortest-path problem in travel and logistics,
radically changing these industries.

In 1956 FICO was founded by engineer Bill Fair and mathematician Earl Isaac on the
principle that data used intelligently can improve business decisions. In 1958 FICO built its
first credit scoring system for American investments, and in 1981 the FICO credit bureau risk
score was introduced.3

Historically, by the 1960s, most organizations had designed, developed, and
implemented centralized computing systems for inventory control. Material requirements
planning (MRP) systems were developed in the 1970s.

In 1973, the Black-Scholes model (or Black–Scholes–Merton model) was perfected. It
is a mathematical model of a financial market containing certain derivative investment
instruments. This model estimates the price of an option or stock over time. The key idea behind
the model is to hedge the option by buying and selling the asset in just the right way and
thereby eliminate risk. It is used by investment banks and hedge funds.
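For reference, here is a minimal sketch of the standard Black-Scholes price of a European call option, using only the Python standard library. The inputs (spot, strike, rate, volatility, maturity) are illustrative numbers, not market data.

from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(spot, strike, rate, vol, maturity):
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt(maturity))
    d2 = d1 - vol * sqrt(maturity)
    return spot * norm_cdf(d1) - strike * exp(-rate * maturity) * norm_cdf(d2)

# Illustrative inputs only.
print(round(black_scholes_call(spot=100, strike=105, rate=0.05, vol=0.2, maturity=1.0), 2))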

By the 1980s, manufacturing resource planning systems were introduced with the
emphasis on optimizing manufacturing processes by synchronizing materials with production
requirements. Starting in the late 1980s, software systems known as enterprise resource
planning systems became the drivers of data accumulation in business. ERP systems are
software systems for business management including modules supporting functional areas

such as planning, manufacturing, sales, marketing, distribution, accounting, and so on. ERP
systems were a leg up over MRP systems. They include modules not only related to
manufacturing but also to services and maintenance.

The Dawn of Business Intelligence

Typically, early business applications and ERP systems had their own databases that
supported their functions. This meant that data was in silos because no other system had
access to it. Businesses soon realized that the value of data can increase manyfold if all the
data is in one system together. This led to the concept of a data warehouse and then an
enterprise data warehouse (EDW) as a single system for the repository of all the
organization’s data. Thus, data could be acquired from a variety of incompatible systems and
brought together using extract, transform, load (ETL) processes. Once the data is collected
from the many diverse systems, the captured data needs to be converted into information and
knowledge in order to be useful. The business intelligence (BI) systems could therefore give
much more coherent intelligence to businesses and introduce the concepts of one view of
customers and customer lifetime value.

One advantage of an EDW is that business intelligence is now much more exhaustive.
Though business intelligence is a good way to use graphs and charts to get a view of business
progress, it does not use high-end statistical processes to derive greater value from the data.

The next question that businesses wanted to answer by the 1990s–2000s was how data
could be used more effectively to understand embedded trends and predict future trends.
The business world was waking up to predictive analytics.

What are the types of analytics that exist now? The analytics journey generally starts
off with the following:

Descriptive statistics: This enables businesses to understand summaries, generally about numbers that the management views as part of the business intelligence process.

Inferential statistics: This enables businesses to understand distributions and variations and the shapes in which the data occurs.

Differences statistics: This enables businesses to know how the data is changing or if it’s the same.

Associative statistics: This enables businesses to know the strength and direction of associations within data.

Predictive analytics: This enables businesses to make predictions related to trends and probabilities.

Fortunately, we live in an era of software, which can help us do the math, which means
analysts can focus on the following:

Understanding the business process

Understanding the deliverable or business problem that needs to be solved

Pinpointing the technique in statistics that will be used to reach the solution

Running the SaaS to implement the technique

Generating insights or conclusions to help the business

CHAPTER EXERCISES

Direction: Discuss the following questions. Write your answers on short bond paper.

1. How is analytics applicable in your daily life? Cite examples to substantiate your
answer.

2. Is there really a need to include analytics in the education curriculum? Justify your
answer.

SUGGESTED READINGS

http://journals.ametsoc.org/doi/pdf/10.1175/BAMS-89-1-45

www.fico.com/en/about-us#our_history

www.oxforddictionaries.com/definition/english/statistics

CHAPTER 2
ANALYTICS: A COMPREHENSIVE STUDY
OVERVIEW

Analytics is the understanding and communication of significant patterns in data. Analytics is
applied in businesses to improve their performance. Some of the aspects explained in this
chapter are software analytics, embedded analytics, learning analytics and social media
analytics. The section on analytics offers an insightful focus, keeping in mind the complex
subject matter.

OBJECTIVES

▪ Define business analytics
▪ Know the different types of analytics
▪ Enumerate and understand the different domains in analytics
▪ Differentiate analytics and analysis
▪ Understand what software analytics is
▪ Understand how analytics is used in academe
▪ Differentiate Learning Analytics and Educational Data Mining

DEFINITION OF BUSINESS ANALYTICS

Business analytics (BA) refers to the skills, technologies, and practices for continuous iterative
exploration and investigation of past business performance to gain insight and drive business
planning. Business analytics focuses on developing new insights and understanding of business
performance based on data and statistical methods. In contrast, business intelligence traditionally
focuses on using a consistent set of metrics to both measure past performance and guide
business planning, which is also based on data and statistical methods.

Business analytics makes extensive use of statistical analysis, including explanatory and
predictive modeling, and fact-based management to drive decision making. It is therefore closely
related to management science. Analytics may be used as input for human decisions or may drive
fully automated decisions. Business intelligence is querying, reporting, online analytical
processing (OLAP), and “alerts.”

In other words, querying, reporting, OLAP, and alert tools can answer questions such as
what happened, how many, how often, where the problem is, and what actions are needed.
Business analytics can answer questions like why is this happening, what if these trends continue,
what will happen next (that is, predict), what is the best that can happen (that is, optimize).

Examples of Application

Banks, such as Capital One, use data analysis (or analytics, as it is also called in the
business setting) to differentiate among customers based on credit risk, usage, and other
characteristics and then to match customer characteristics with appropriate product offerings.
Harrah’s, the gaming firm, uses analytics in its customer loyalty programs. E & J Gallo Winery
quantitatively analyzes and predicts the appeal of its wines. Between 2002 and 2005, Deere &
Company saved more than $1 billion by employing a new analytical tool to better optimize
inventory. A telecoms company that pursues efficient call centre usage over customer service
may save money.

Types of Analytics

• Decision analytics: supports human decisions with visual analytics that the user models to
reflect reasoning.
• Descriptive analytics: gains insight from historical data with reporting, scorecards, clustering, etc.
• Predictive analytics: employs predictive modeling using statistical and machine learning
techniques

• Prescriptive analytics: recommends decisions using optimization, simulation, etc.

Basic Domains within Analytics

• Behavioral analytics
• Cohort Analysis
• Collections analytics
• Contextual data modeling - supports the human reasoning that occurs after viewing “executive dashboards” or any other visual analytics
• Cyber analytics
• Enterprise Optimization
• Financial services analytics
• Fraud analytics
• Marketing analytics
• Pricing analytics
• Retail sales analytics
• Risk & Credit analytics
• Supply Chain analytics
• Talent analytics
• Telecommunications
• Transportation analytics

History

Analytics have been used in business since the management exercises were put into
place by Frederick Winslow Taylor in the late 19th century. Henry Ford measured the time of each
component in his newly established assembly line. But analytics began to command more
attention in the late 1960s when computers were used in decision support systems. Since then,
analytics have changed and formed with the development of enterprise resource planning (ERP)
systems, data warehouses, and a large number of other software tools and processes.

In later years, business analytics exploded with the introduction of computers.
This change has brought analytics to a whole new level and has made the possibilities endless.
Given how far analytics has come and what the field looks like today, many
people would never think that analytics started in the early 1900s with Mr. Ford himself.

Business analytics depends on sufficient volumes of high quality data. The difficulty in
ensuring data quality is integrating and reconciling data across different systems, and then
deciding what subsets of data to make available.

Previously, analytics was considered a type of after-the-fact method of forecasting
consumer behavior by examining the number of units sold in the last quarter or the last year. This
type of data warehousing required a lot more storage space than it did speed. Now business
analytics is becoming a tool that can influence the outcome of customer interactions. When a
specific customer type is considering a purchase, an analytics-enabled enterprise can modify the
sales pitch to appeal to that consumer. This means the storage space for all that data must react
extremely fast to provide the necessary data in real-time.

Competing on Analytics

Thomas Davenport, professor of information technology and management at Babson College,
argues that businesses can optimize a distinct business capability via analytics and thus better
compete. He identifies these characteristics of organizations that are apt to compete on
analytics:

• One or more senior executives who strongly advocate fact-based decision making and,
specifically, analytics

• Widespread use of not only descriptive statistics, but also predictive modeling and
complex optimization techniques

• Substantial use of analytics across multiple business functions or processes

• Movement toward an enterprise level approach to managing analytical tools, data, and
organizational skills and capabilities

DEFINITION OF ANALYTICS

Analytics is the discovery, interpretation, and communication of meaningful patterns in
data. Especially valuable in areas rich with recorded information, analytics relies on the
simultaneous application of statistics, computer programming, and operations research to quantify
performance. Analytics often favors data visualization to communicate insight.

Organizations may apply analytics to business data to describe, predict, and improve
business performance. Specifically, areas within analytics include predictive analytics,
prescriptive analytics, enterprise decision management, retail analytics, store assortment and
stock-keeping unit optimization, marketing optimization and marketing mix modeling, web
analytics, sales force sizing and optimization, price and promotion modeling, predictive science,

credit risk analysis, and fraud analytics. Since analytics can require extensive computation, the
algorithms and software used for analytics harness the most current methods in computer science,
statistics, and mathematics.

Analytics vs. Analysis

Analytics is multidisciplinary. There is extensive use of mathematics and statistics, the use
of descriptive techniques and predictive models to gain valuable knowledge from data—data
analysis. The insights from data are used to recommend action or to guide decision making rooted
in business context. Thus, analytics is not so much concerned with individual analyses or analysis
steps, but with the entire methodology. There is a pronounced tendency to use the term analytics
in business settings e.g. text analytics vs. the more generic text mining to emphasize this broader
perspective.. There is an increasing use of the term advanced analytics, typically used to describe
the technical aspects of analytics, especially in the emerging fields such as the use of machine
learning techniques like neural networks to do predictive modeling.

Examples of Analytics

Marketing Optimization

Marketing has evolved from a creative process into a highly data-driven process.
Marketing organizations use analytics to determine the outcomes of campaigns or efforts and to
guide decisions for investment and consumer targeting. Demographic studies, customer
segmentation, conjoint analysis and other techniques allow marketers to use large amounts of
consumer purchase, survey and panel data to understand and communicate marketing strategy.

Web analytics allows marketers to collect session-level information about interactions on
a website using an operation called sessionization. Google Analytics is an example of a popular
free analytics tool that marketers use for this purpose. Those interactions provide web analytics
information systems with the information necessary to track the referrer and search keywords, identify
the IP address, and track the activities of the visitor. With this information, a marketer can improve
marketing campaigns, website creative content, and information architecture.
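A minimal sketch of sessionization is shown below: a visitor's page hits are grouped into sessions, with a new session started after 30 minutes of inactivity (a common, though configurable, convention). The timestamps are invented for illustration.

from datetime import datetime, timedelta

# One visitor's page hits, already sorted by time (hypothetical data).
hits = [datetime(2020, 1, 1, 9, 0), datetime(2020, 1, 1, 9, 10),
        datetime(2020, 1, 1, 10, 5), datetime(2020, 1, 1, 10, 15)]
timeout = timedelta(minutes=30)

sessions, current = [], [hits[0]]
for prev, hit in zip(hits, hits[1:]):
    if hit - prev > timeout:        # long gap: close the current session
        sessions.append(current)
        current = []
    current.append(hit)
sessions.append(current)

print(f"{len(sessions)} sessions of sizes {[len(s) for s in sessions]}")   # 2 sessions of sizes [2, 2]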

Analysis techniques frequently used in marketing include marketing mix modeling, pricing
and promotion analyses, sales force optimization and customer analytics e.g.: segmentation. Web
analytics and optimization of web sites and online campaigns now frequently work hand in hand
with the more traditional marketing analysis techniques. A focus on digital media has slightly

changed the vocabulary so that marketing mix modeling is commonly referred to as attribution
modeling in the digital or marketing mix modeling context.

These tools and techniques support both strategic marketing decisions (such as how
much overall to spend on marketing, how to allocate budgets across a portfolio of brands and the
marketing mix) and more tactical campaign support, in terms of targeting the best potential
customer with the optimal message in the most cost effective medium at the ideal time.

Portfolio Analytics

A common application of business analytics is portfolio analysis. In this, a bank or lending
agency has a collection of accounts of varying value and risk. The accounts may differ by the
social status (wealthy, middle-class, poor, etc.) of the holder, the geographical location, the net
value, and many other factors. The lender must balance the return on the loan with the risk of
default for each loan. The question is then how to evaluate the portfolio as a whole.

The least risky loan may be to the very wealthy, but there are a very limited number of
wealthy people. On the other hand, there are many poor people who can be lent to, but at greater risk.
Some balance must be struck that maximizes return and minimizes risk. The analytics solution
may combine time series analysis with many other issues in order to make decisions on when to
lend money to these different borrower segments, or decisions on the interest rate charged to
members of a portfolio segment to cover any losses among members in that segment.
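A toy sketch of that balancing act: compute an expected return per unit lent for each borrower segment from an assumed interest rate, default probability, and loss given default. All figures are made up; a real portfolio model would also bring in time series behavior, correlations, and funding costs.

# name: (interest_rate, default_probability, loss_given_default) -- all hypothetical
segments = {
    "wealthy":      (0.06, 0.01, 0.50),
    "middle_class": (0.10, 0.05, 0.60),
    "lower_income": (0.18, 0.15, 0.70),
}

for name, (rate, p_default, lgd) in segments.items():
    # Interest earned when the loan performs, minus the expected loss on default.
    expected_return = (1 - p_default) * rate - p_default * lgd
    print(f"{name}: expected return per unit lent = {expected_return:.3f}")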

Risk Analytics

Predictive models in the banking industry are developed to bring certainty to the risk
scores of individual customers. Credit scores are built to predict an individual's delinquency behavior
and are widely used to evaluate the creditworthiness of each applicant. Furthermore, risk analyses
are carried out in the scientific world and the insurance industry. Risk analytics is also extensively used in
financial institutions such as online payment gateway companies to analyze whether a transaction was
genuine or fraudulent, using the transaction history of the customer. This is most
common in credit card purchases: when there is a sudden spike in the customer's
transaction volume, the customer gets a confirmation call to verify whether the transaction was initiated by
him or her. This helps in reducing losses due to such circumstances.
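In the spirit of the confirmation-call example, the sketch below flags a new card transaction when it is far above the customer's usual spend. The three-times-average rule and the amounts are hypothetical; production fraud systems use far richer models.

history = [120, 80, 150, 95, 110]    # customer's past transaction amounts (hypothetical)
new_transaction = 900

average = sum(history) / len(history)
if new_transaction > 3 * average:    # assumed spike threshold, for illustration only
    print("Spike detected: call the customer to confirm the transaction.")
else:
    print("Transaction within normal range.")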

Digital Analytics

Digital analytics is a set of business and technical activities that define, create, collect,
verify or transform digital data into reporting, research, analyses, recommendations, optimizations,
predictions, and automations. This also includes SEO (search engine optimization), where
keyword searches are tracked and that data is used for marketing purposes. Even banner ads
and clicks come under digital analytics. All marketing firms rely on digital analytics for their digital
marketing assignments, where MROI (marketing return on investment) is important.

Security Analytics

Security analytics refers to information technology (IT) solutions that gather and analyze
security events to bring situational awareness and enable IT staff to understand and analyze
events that pose the greatest risk. Solutions in this area include security information and event
management solutions and user behavior analytics solutions.

Software Analytics

Software analytics is the process of collecting information about the way a piece of
software is used and produced.

Challenges

In the industry of commercial analytics software, an emphasis has emerged on solving the
challenges of analyzing massive, complex data sets, often when such data is in a constant state
of change. Such data sets are commonly referred to as big data. Whereas once the problems
posed by big data were only found in the scientific community, today big data is a problem for
many businesses that operate transactional systems online and, as a result, amass large volumes
of data quickly.

The analysis of unstructured data types is another challenge getting attention in the
industry. Unstructured data differs from structured data in that its format varies widely and it cannot
be stored in traditional relational databases without significant effort at data transformation.
Sources of unstructured data, such as email, the contents of word processor documents, PDFs,
geospatial data, etc., are rapidly becoming a relevant source of business intelligence for
businesses, governments, and universities. For example, the discovery in Britain that one
company was illegally selling fraudulent doctor’s notes in order to assist people in defrauding
employers and insurance companies is an opportunity for insurance firms to increase the

vigilance of their unstructured data analysis. The McKinsey Global Institute estimates that big
data analysis could save the American health care system $300 billion per year and the European
public sector €250 billion.

These challenges are the current inspiration for much of the innovation in modern analytics
information systems, giving birth to relatively new machine analysis concepts such as complex
event processing, full text search and analysis, and even new ideas in presentation. One such
innovation is the introduction of grid-like architecture in machine analysis, allowing increases in
the speed of massively parallel processing by distributing the workload to many computers all
with equal access to the complete data set.

Analytics is increasingly used in education, particularly at the district and government
office levels. However, the complexity of student performance measures presents challenges
when educators try to understand and use analytics to discern patterns in student performance,
predict graduation likelihood, improve chances of student success, etc. For example, in a study
involving districts known for strong data use, 48% of teachers had difficulty posing questions
prompted by data, 36% did not comprehend given data, and 52% incorrectly interpreted data. To
combat this, some analytics tools for educators adhere to an over-the-counter data format
(embedding labels, supplemental documentation, and a help system, and making key
package/display and content decisions) to improve educators’ understanding and use of the
analytics being displayed.

One more emerging challenge is dynamic regulatory needs. For example, in the banking
industry, Basel and future capital adequacy needs are likely to make even smaller banks adopt
internal risk models. In such cases, cloud computing and open source R (programming language)
can help smaller banks to adopt risk analytics and support branch level monitoring by applying
predictive analytics.

SOFTWARE ANALYTICS

Software Analytics refers to analytics specific to software systems and related software
development processes. It aims at describing, predicting, and improving development,
maintenance, and management of complex software systems. Methods and techniques of
software analytics typically rely on gathering, analyzing, and visualizing information found in the
manifold data sources in the scope of software systems and their software development

processes---software analytics “turns it into actionable insight to inform better decisions related to
software”.

Software analytics represents a base component of software diagnosis that generally aims
at generating findings, conclusions, and evaluations about software systems and their
implementation, composition, behavior, and evolution. Software analytics frequently uses and
combines approaches and techniques from statistics, prediction analysis, data mining, and
scientific visualization. For example, software analytics can map data by means of software maps
that allow for interactive exploration.

Data under exploration and analysis by Software Analytics exists across the software lifecycle,
including source code, software requirement specifications, bug reports, test cases, execution
traces/logs, real-world user feedback, etc. Data plays a critical role in modern software
development, because hidden in the data are the information and insight about the quality of
software and services, the experience that software users receive, as well as the dynamics of
software development.

Insightful information obtained by Software Analytics is information that conveys
meaningful and useful understanding or knowledge towards performing the target task. Typically,
insightful information cannot be easily obtained by direct investigation of the raw data without the
aid of analytic technologies.

Actionable information obtained by Software Analytics is information upon which software
practitioners can come up with concrete solutions (better than existing solutions, if any) towards
completing the target task.

Software Analytics focuses on the trinity of software systems, software users, and the software
development process:

Software Systems. Depending on scale and complexity, the spectrum of software systems
can span from operating systems for devices to large networked systems that consist of
thousands of servers. System quality such as reliability, performance and security, etc., is the key
to success of modern software systems. As the system scale and complexity greatly increase,
larger amount of data, e.g., run-time traces and logs, is generated; and data becomes a critical
means to monitor, analyze, understand and improve system quality.

Software Users. Users are (almost) always right because ultimately they will use the
software and services in various ways. Therefore, it is important to continuously provide the best
experience to users. Usage data collected from the real world reveals how users interact with
software and services. The data is incredibly valuable for software practitioners to better
understand their customers and gain insights on how to improve user experience accordingly.

Software Development Process. Software development has evolved from its traditional
form to exhibiting different characteristics. The process is more agile and engineers are more
collaborative than in the past. Analytics on software development data provides a powerful
mechanism that software practitioners can leverage to achieve higher development productivity.

In general, the primary technologies employed by Software Analytics include analytical
technologies such as machine learning, data mining and pattern recognition, information
visualization, as well as large-scale data computing & processing.

Software Analytics Providers

CAST Software
IBM Cognos Business Intelligence
Kiuwan
Microsoft Azure Application Insights
Nalpeiron Software Analytics
New Relic
Squore
Tableau Software
Trackerbird Software Analytics

EMBEDDED ANALYTICS

Embedded analytics is technology designed to make data analysis and business
intelligence more accessible to any kind of application or user.

According to Gartner analyst Kurt Schlegel, traditional business intelligence was
suffering in 2008 from a lack of integration between the data and the business users. The
intention of this technology is to be more pervasive through real-time autonomy and self-service
data visualization or customization, while decision makers, business users, or even customers
go about their own daily workflows and tasks.

Tools

Actuate
Dundas Data Visualization
GoodData
IBM
icCube
Logi Analytics
Pentaho
Qlik
SAP
SAS
Sisense
Tableau
TIBCO

LEARNING ANALYTICS

Learning analytics is the measurement, collection, analysis and reporting of data about
learners and their contexts, for purposes of understanding and optimizing learning and the
environments in which it occurs. A related field is educational data mining. For general audience
introductions, see:

The Educause Learning Initiative Briefing

The Educause Review on Learning analytics

And the UNESCO “Learning Analytics Policy Brief” (2012)

What is Learning Analytics?

The definition and aims of Learning Analytics are contested. One earlier definition
discussed by the community suggested that “Learning analytics is the use of intelligent data,
learner-produced data, and analysis models to discover information and social connections for
predicting and advising people’s learning.”

But this definition has been criticised:

“I somewhat disagree with this definition - it serves well as an introductory
concept if we use analytics as a support structure for existing education models. I
think learning analytics - at an advanced and integrated implementation - can do
away with pre-fab curriculum models.” (George Siemens, 2010)

“In the descriptions of learning analytics we talk about using data to ‘predict
success’. I’ve struggled with that as I pore over our databases. I’ve come to realize
there are different views/levels of success.” (Mike Sharkey, 2010)

A more holistic view than a mere definition is provided by the framework of learning
analytics by Greller and Drachsler (2012). It uses a general morphological analysis (GMA) to
divide the domain into six “critical dimensions”.

A systematic overview on learning analytics and its key concepts is provided by Chatti et
al. (2012) and Chatti et al. (2014) through a reference model for learning analytics based on four
dimensions, namely data, environments, context (what?), stakeholders (who?), objectives (why?),
and methods (how?).

It has been pointed out that there is a broad awareness of analytics across educational
institutions for various stakeholders, but that the way ‘learning analytics’ is defined and
implemented may vary, including:

• for individual learners to reflect on their achievements and patterns of behaviour in


relation to others;
• as predictors of students requiring extra support and attention;
• to help teachers and support staff plan supporting interventions with individuals and
groups;
• for functional groups such as course team seeking to improve current courses or
develop new curriculum offerings; and
• for institutional administrators taking decisions on matters such as marketing and
recruitment or efficiency and effectiveness measures.

In that briefing paper, Powell and MacNeill go on to point out that some motivations and
implementations of analytics may come into conflict with others, for example highlighting potential
conflict between analytics for individual learners and organisational stakeholders.

Gašević, Dawson, and Siemens argue that the computational aspects of learning analytics
need to be linked with the existing educational research if the field of learning analytics is to deliver
on its promise to understand and optimize learning.

Differentiating Learning Analytics and Educational Data Mining

Differentiating the fields of educational data mining (EDM) and learning analytics (LA) has
been a concern of several researchers. George Siemens takes the position that educational data
mining encompasses both learning analytics and academic analytics, the former of which is aimed
at governments, funding agencies, and administrators instead of learners and faculty. Baepler
and Murdoch define academic analytics as an area that “...combines select institutional data,
statistical analysis, and predictive modeling to create intelligence upon which learners, instructors,
or administrators can change academic behavior”. They go on to attempt to disambiguate
educational data mining from academic analytics based on whether the process is hypothesis
driven or not, though Brooks questions whether this distinction exists in the literature. Brooks
instead proposes that a better distinction between the EDM and LA communities is in the roots
of where each community originated, with authorship in the EDM community being dominated by
researchers coming from intelligent tutoring paradigms, and learning analytics researchers being
more focused on enterprise learning systems (e.g., learning content management systems).

Regardless of the differences between the LA and EDM communities, the two areas have
significant overlap both in the objectives of investigators as well as in the methods and techniques
that are used in the investigation. In the MS program offering in Learning Analytics at Teachers
College, Columbia University, students are taught both EDM and LA methods.

Learning Analytics in Higher Education

The first graduate program focused specifically on learning analytics was created by Dr.
Ryan Baker and launched in the Fall 2015 semester at Teachers College - Columbia University.
The program description states that “data about learning and learners are being generated today
on an unprecedented scale. The fields of learning analytics (LA) and educational data mining
(EDM) have emerged with the aim of transforming this data into new insights that can benefit
students, teachers, and administrators. As one of the world’s leading teaching and research
institutions in education, psychology, and health, we are proud to offer an innovative graduate
curriculum dedicated to improving education through technology and data analysis.”

CHAPTER EXERCISES

Direction: Discuss the following. Use short bond paper for your answer.

1. Explain the importance of analytics in your program.

2. How can analytics be useful in the following sectors:

a. Health Sectors

b. Business sectors

c. Tourism

d. Agriculture

e. Economics

3. Identify the type of measurement scale— nominal, ordinal, interval, or ratio— suggested by
each statement:

a) John finished the math test in 35 minutes, whereas Jack finished the same test in 25
minutes.
b) Jack speaks French, but John does not.
c) Jack is taller than John.
d) John is 6 feet 2 inches tall.
e) John’s IQ is 120, whereas Jack’s IQ is 110.

4. Supermarket Sales

The Supermarket data set contains over 14,000 transactions made by supermarket customers
over a period of approximately two years. (The data are not real, but real supermarket chains
have huge data sets just like this one.) A small sample of the data appears in the figure below.
Column B contains the date of the purchase, column C is a unique identifier for each customer,
columns D–H contain information about the customer, columns I–K contain the location of the
store, columns L–N contain information about the product purchased (these columns have been
hidden to conserve space), and the last two columns indicate the number of items purchased and
the amount paid.

a. Determine which variables are categorical and numerical.

b. Summarize the variables using a bar graph.

SUGGESTED READINGS

Explore and read articles on analytics at http://www.library.educause.edu/

CHAPTER 3
DESCRIPTIVE STATISTICAL MEASURES

OVERVIEW

The goal of this chapter is to make sense of data by constructing appropriate summary
measures, tables, and graphs. Our purpose here is to present the data in a form that makes
sense to people. This chapter also discusses the types of data, variables, measures of central
tendency, measures of variability, and outliers. Techniques and tips in using Microsoft Excel
are also included to guide you in using the application.

OBJECTIVES

▪ Differentiate and understand sample and population
▪ Define data sets, variables and observations
▪ Enumerate types of data
▪ Understand the process in descriptive measures for categorical variables
▪ Understand the process in descriptive measures for numerical variables
▪ Learn and understand the use of the measures of central tendency and variability
▪ Understand the use of outliers and missing values

We begin with a short discussion of several important concepts: populations and samples,
data sets, variables and observations, and types of data.

POPULATIONS AND SAMPLES

First, we distinguish between a population and a sample. A population includes all of the
entities of interest: people, households, machines, or whatever.

In many situations, it is virtually impossible to obtain information about
all members of the population. For example, it is far too costly to ask all potential voters which
presidential candidates they prefer. Therefore, we often try to gain insights into the characteristics
of a population by examining a sample, or subset, of the population.

A population includes all of the entities of interest in a study. A sample is a subset of the
population, often randomly chosen and preferably representative of the population as a whole.
We use the terms population and sample a few times in this chapter, which is why we have defined
them here. However, the distinction is not really important until later chapters. Our intent in this
chapter is to focus entirely on the data in a given data set, not to generalize beyond it. Therefore,
the given data set could be a population or a sample from a population. For now, the distinction
is irrelevant.
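A tiny sketch of the idea: draw a random sample from a (simulated) population of voters and use the sample proportion to estimate the population proportion. The population here is generated artificially, purely for illustration.

import random

random.seed(1)
population = [random.choice(["A", "B"]) for _ in range(100_000)]   # all potential voters (simulated)
sample = random.sample(population, 1_000)                          # a random subset of the population

print("Sample share preferring A:", sample.count("A") / len(sample))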

DATA SETS, VARIABLES, AND OBSERVATIONS

A data set is generally a rectangular array of data where the columns contain variables,
such as height, gender, and income, and each row contains an observation. Each observation
includes the attributes of a particular member of the population: a person, a company, a city, a
machine, or whatever. This terminology is common, but other terms are often used. A variable
(column) is often called a field or an attribute, and an observation (row) is often called a case or
a record. Also, data sets are occasionally rearranged, so that the variables are in rows and the
observations are in columns. However, the most common arrangement by far is to have variables
in columns, with variable names in the top row, and observations in the remaining rows.

A data set is usually a rectangular array of data, with variables in columns and
observations in rows. A variable (or field or attribute) is a characteristic of members of a
population, such as height, gender, or salary. An observation (or case or record) is a list of all
variable values for a single member of a population.

Table 1. Environmental Survey Data

Consider Table 1. Each observation lists the person’s age, gender, state of residence,
number of children, annual salary, and opinion of the president’s environmental policies. These
six pieces of information represent the variables. It is customary to include a row (row 1 in this
case) that lists variable names. These variable names should be concise but meaningful. Note
that an index of the observation is often included in column A. If you sort on other variables, you
can always sort on the index to get back to the original sort order.

TYPES OF DATA

There are several ways to categorize data. A basic distinction is between numerical and
categorical data. The distinction is whether you intend to do any arithmetic on the data. It makes
sense to do arithmetic on numerical data, but not on categorical data. (Actually, there is a third
data type, a date variable. As you may know, Excel stores dates as numbers, but for obvious
reasons, dates are treated differently from typical numbers.)

In the questionnaire data, Age, Children, and Salary are clearly numerical. For example,
it makes perfect sense to sum or average any of these. In contrast, Gender and State are clearly
categorical because they are expressed as text, not numbers.

The Opinion variable is less obvious. It is expressed numerically, on a 1-to-5 scale.


However, these numbers are really only codes for the categories “strongly disagree,” “disagree,”
“neutral,” “agree,” and “strongly agree.” There is never any intent to perform arithmetic on these
numbers; in fact, it is not really appropriate to do so. Therefore, it is most appropriate to treat the
Opinion variable as categorical. Note, too, that there is a definite ordering of its categories,
whereas there is no natural ordering of the categories for the Gender or State variables. When

there is a natural ordering of categories, the variable is classified as ordinal. If there is no natural
ordering, as with the Gender variable or the State variable, the variable is classified as nominal.
Remember, though, that both ordinal and nominal variables are categorical.

Excel Tip 1: Horizontal Alignment Conventions

Excel automatically right-aligns numbers and left-aligns text. We will use this
automatic formatting, but we will also add our own. Specifically,
we will right-align all numbers that are available for arithmetic; we will left-align all text
such as Male, Female, Yes, and No; and we will center-align everything else,
including dates, indexes such as the Person index in column A, and numerically coded
categorical variables such as Opinion.

Excel Tip 2: Documenting with Cell Comments

How do you remember, for example, that “1” stands for “strongly disagree” in the
Opinion variable? You can enter a comment—a reminder to yourself and others—in
any cell. To do so, right-click a cell and select Insert Comment. A small red tag
appears in any cell with a comment. Moving the cursor over that cell causes the
comment to appear. You will see numerous comments in the files that accompany
the book.

A dummy variable is a 0-1 coded variable for a specific category. It is coded as 1 for all
observations in that category and 0 for all observations not in that category.

The method of categorizing a numerical variable is called binning (putting the data into
discrete bins), and it is also very common. (It is also called discretizing.) The purpose of the
study dictates whether a variable such as age should be treated numerically or categorically; there is no absolute
right or wrong way.

Numerical variables can be classified as discrete or continuous. The basic distinction is


whether the data arise from counts or continuous measurements. The variable Children is clearly
a count (discrete), whereas the variable Salary is best treated as continuous. This distinction
between discrete and continuous variables is sometimes important because it dictates the most
natural type of analysis.

A numerical variable is discrete if it results from a count, such as the number of children.
A continuous variable is the result of an essentially continuous measurement, such as weight
or height.

Cross-sectional data are data on a cross section of a population at a distinct point in time.
Time series data are data collected over time.

DESCRIPTIVE MEASURES FOR CATEGORICAL VARIABLES

This section discusses methods for describing a categorical variable. Because it is not
appropriate to perform arithmetic on the values of the variable, there are only a few possibilities
for describing the variable, and these are all based on counting. First, you can count the number
of categories. Many categorical variables such as Gender have only two categories. Others such
as Region can have more than two categories. As you count the categories, you can also give
the categories names, such as Male and Female.

Once you know the number of categories and their names, you can count the number of
observations in each category (this is referred to as the count of categories). The resulting
counts can be reported as “raw counts” or they can be transformed into percentages of totals.

DESCRIPTIVE MEASURES FOR NUMERICAL VARIABLES

There are many ways to summarize numerical variables, both with numerical summary
measures and with charts, and we discuss the most common ways in this section. But before we
get into details, it is important to understand the basic goal of this section. We begin with a
numerical variable such as Salary, where there is one observation for each person. Our basic
goal is to learn how these salaries are distributed across people. To do this, we can ask a number
of questions, including the following. (1) What are the most “typical” salaries? (2) How spread out
are the salaries? (3) What are the “extreme” salaries on either end? (4) Is a chart of the salaries
symmetric about some middle value, or is it skewed in some direction? (5) Does the chart of
salaries have any other peculiar features besides possible skewness? In the next chapter, we
explore methods for checking whether a variable such as Salary is related to other variables, but
for now we simply want to explore the distribution of values in the Salary column.

MEASURES OF CENTRAL TENDENCY

There are three common measures of central tendency, all of which try to answer the
basic question of which value is most “typical.” These are the mean, the median, and the mode.

The MEAN

The mean is the average of all values. If the data set represents a sample from some
larger population, this measure is called the sample mean and is denoted by X̄ (pronounced
“X-bar”). If the data set represents the entire population, it is called the population mean and is
denoted by μ (the Greek letter mu). This distinction is not important in this chapter, but it will
become relevant in later chapters when we discuss statistical inference. In either case, the
formula for the mean is given below.

The most widely used measure of central tendency is the mean, or arithmetic average. It
is the sum of all the scores in a distribution divided by the number of cases. In terms of a formula,
it is

x̄ = ΣX / N

Where, 𝑥̅ = Mean

∑ 𝑋 = Sum of raw scores

N = number of cases

Suppose Anna’s IQ scores in 7 areas are:

IQ scores: 112 121 115 101 119 109 100


Applying the formula, x̄ = (112 + 121 + 115 + 101 + 119 + 109 + 100) / 7 = 777 / 7 = 111.
Hence, Anna’s mean IQ score is 111.

For Excel data sets, you can calculate the mean with the AVERAGE function.
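
If you want to check the arithmetic outside Excel, the following is a minimal Python sketch of the same formula, using Anna’s seven IQ scores; it is an illustration only, not part of the module’s required tools.

    # Mean as x-bar = (sum of X) / N, using Anna's IQ scores from the example above.
    iq_scores = [112, 121, 115, 101, 119, 109, 100]
    mean = sum(iq_scores) / len(iq_scores)   # same result as Excel's AVERAGE
    print(mean)                              # 111.0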

The MEDIAN

The median is the middle observation when the data are sorted from smallest to largest.
If the number of observations is odd, the median is literally the middle observation. For example,
if there are nine observations, the median is the fifth smallest (or fifth largest). If the number of
observations is even, the median is usually defined as the average of the two middle observations
(although there are some slight variations of this definition). For example, if there are 10
observations, the median is usually defined as the average of the fifth and sixth smallest values.

Consider the following distribution of scores, where the median is 18:

14 15 16 17 18 19 20 21 22

In the following 10 scores we seek the point below which 5 scores fall:

14 16 16 17 18 19 20 20 21 22

The point below which 5 scores, or 50 percent of the cases, fall is halfway between 18
and 19. Thus, the median of this distribution is 18.5.

Consider the following scores:

18 20 22 25 25 30

Any point from 22.5 to 24.5 fits the definition of the median. By convention, in such cases
the median is defined as halfway between these lowest and highest points; in this case,
(22.5 + 24.5)/2 = 23.5.

Table 2. Mr. Li’s Physics Exam Scores

     X     f     fX     cf
    23     2     46     18
    22     2     44     16
    21     4     84     14
    20     4     80     10
    19     2     38      6
    18     2     36      4
    17     0      0      2
    16     2     32      2

(The cumulative frequency column, cf, is accumulated from the bottom up.)

To find the median of Mr. Li’s physics exam scores, we need to find the point below which
18/2 = 9 scores lie. We first create a cumulative frequency column (cf, column 4 in Table 2).
The cumulative frequency for each interval is the number of scores in that interval plus the total
number of scores below it. Since the interval between 15.5 and 16.5 has no scores below it, its cf
is equal to its f, which is 2. Since there were no scores of 17, the cf for 17 is still 2. Then adding
the two scores of 18 yields a cumulative frequency of 4. Continuing up the frequency column, we
get cf ’s of 10, 14, 16, and, finally, 18, which is equal to the number of students.

The point separating the bottom nine scores from the top nine scores, the median, is
somewhere in the interval 19.5 to 20.5. Most statistics texts say to partition this interval to locate
the median. The cf column tells us that we have six scores below 19.5. We need to add three
scores to give us half the scores (9). Since there are four scores of 20, we go three-fourths of the
way from 19.5 to 20.5 to report a median of 20.25. Note that many computer programs, including
the Statistical Package for the Social Sciences (SPSS) and the Statistical Analysis System (SAS),
simply report the midpoint of the interval—in this case 20—as the median.

The median can be calculated in Excel with the MEDIAN function.
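
For readers who prefer to see the rule spelled out in code, here is a short Python sketch of the simple definition of the median (middle value, or average of the two middle values). Applied to Mr. Li’s 18 scores it returns 20, the interval midpoint reported by SPSS and SAS, rather than the interpolated 20.25.

    # Median: sort the values, take the middle one (odd count) or the
    # average of the two middle ones (even count).
    def median(values):
        data = sorted(values)
        n = len(data)
        mid = n // 2
        if n % 2 == 1:
            return data[mid]
        return (data[mid - 1] + data[mid]) / 2

    print(median([14, 15, 16, 17, 18, 19, 20, 21, 22]))      # 18
    print(median([14, 16, 16, 17, 18, 19, 20, 20, 21, 22]))  # 18.5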

The MODE

The mode is the value that appears most often, and it can be calculated in Excel with the
MODE function. In most cases where a variable is essentially continuous, the mode is not very
interesting because it is often the result of a few lucky ties.

The mode is the value in a distribution that occurs most frequently. It is the simplest to find
of the three measures of central tendency because it is determined by inspection rather than by
computation. Given the distribution of scores

14 16 16 17 18 19 19 19 21 22

you can readily see that the mode of this distribution is 19 because it is the most frequent score.
Sometimes there is more than one mode in a distribution. For example, if the scores had been

14 16 16 16 18 19 19 19 21 22

you would have two modes: 16 and 19. This kind of distribution with two modes is called bimodal.
Distributions with three or more modes are called multimodal.

The mode is the least useful indicator of central value in a distribution for two reasons.
First, it is unstable. For example, two random samples drawn from the same population may have
quite different modes. Second, a distribution may have more than one mode. In published
research, the mode is seldom reported as an indicator of central tendency. Its use is largely limited
to inspectional purposes. A mode may be reported for any of the scales of measurement, but it is
the only measure of central tendency that may legitimately be used with nominal scales.

Excel Tip 3: Working with MODE Function

Two new versions of the MODE function were introduced in Excel 2010:
MODE.MULT and MODE.SNGL. The latter is the same as the older MODE function.
The MULT version returns multiple modes if there are multiple modes.
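
A comparable check can be done in Python 3.8 or later, whose statistics module offers mode and multimode functions that behave much like MODE.SNGL and MODE.MULT; the sketch below reuses the bimodal distribution from the text.

    from statistics import mode, multimode

    scores = [14, 16, 16, 16, 18, 19, 19, 19, 21, 22]
    print(mode(scores))       # 16 (the first mode encountered)
    print(multimode(scores))  # [16, 19] -- the bimodal case discussed above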

Shapes of Distributions

Frequency distributions can have a variety of shapes. A distribution is symmetrical when
the two halves are mirror images of each other. In a symmetrical distribution, the values of the
mean and the median coincide. If such a distribution has a single mode, rather than two or more
modes, the three indexes of central tendency will coincide, as shown in Figure 2.

Figure 2. Symmetrical Distribution (vertical axis: number of scores; the mean, median, and mode coincide at the center)

If a distribution is not symmetrical, it is described as skewed, pulled out to one end or the
other by the presence of extreme scores. In skewed distributions, the values of the measures of
central tendency differ. In such distributions, the value of the mean, because it is influenced by
the size of extreme scores, is pulled toward the end of the distribution in which the extreme scores
lie, as shown in Figures 3 and 4.

Figure 3. Negatively Skewed Distribution

Figure 4. Positively Skewed Distribution

The effect of extreme values is less on the median because this index is influenced not by
the size of scores but by their position. Extreme values have no impact on the mode because this
index has no relation with either of the ends of the distribution. Skews are labeled according to
where the extreme scores lie. A way to remember this is “The tail names the beast.” Figure 3
shows a negatively skewed distribution, whereas Figure 4 shows a positively skewed distribution.

MEASURES OF VARIABILITY

Although indexes of central tendency help researchers describe data in terms of average
value or typical measure, they do not give the total picture of a distribution. The mean values of
two distributions may be identical, whereas the degree of dispersion, or variability, of their scores
might be different. In one distribution, the scores might cluster around the central value; in the
other, they might be scattered. For illustration, consider the following distributions of scores:

(a) 24, 24, 25, 25, 25, 26, 26 𝑋̅= 175/7 = 25

(b) 16, 19, 22, 25, 28, 30, 35 𝑋̅= 175/7 = 25

The value of the mean in both these distributions is 25, but the degree of scattering of the
scores differs considerably. The scores in distribution (a) are obviously much more homogeneous
than those in distribution (b). There is clearly a need for indexes that can describe distributions in
terms of variation, spread, dispersion, heterogeneity, or scatter of scores. Three indexes are
commonly used for this purpose: range, variance, and standard deviation.

A. Range

The simplest of all indexes of variability is the range. It is the difference between the upper
real limit of the highest score and the lower real limit of the lowest score. In statistics, any score
is thought of as representing an interval width from halfway between that score and the next
lowest score (lower real limit) up to halfway between that score and the next highest score (upper
real limit).

For example, if several children have a recorded score of 12 pull-ups on a physical fitness
test, their performances probably range from those who just barely got their chin over the bar the
twelfth time and were finished (lower real limit) to those who completed 12 pull-ups, came up
again, and almost got their chin over the bar, but did not quite make it for pull-up 13 (upper limit).
Thus, a score of 12 is considered as representing an interval from halfway between 11 and 12
(11.5) to halfway between 12 and 13 (12.5) or an interval of 1. For example, given the following
distribution of scores, you find the range by subtracting 1.5 (the lower limit of the lowest score)
from 16.5 (the upper limit of the highest score), which is equal to 15.

2 10 11 12 13 14 16

Formula: R = (Xh − Xl) + I

where
R = range
Xh = highest value in a distribution
Xl = lowest value in a distribution
I = interval width

Applying the formula, subtract the lowest value from the highest and add 1: R = (16 − 2) + 1 =
15. In a frequency distribution, 1 is the most common interval width.

The range is an unreliable index of variability because it is based on only two values, the highest
and the lowest. It is not a stable indicator of the spread of the scores. For this reason, the use of
the range is mainly limited to inspectional purposes. Some research reports refer to the range of
distributions, but such references are usually used in conjunction with other measures of
variability, such as variance and standard deviation.

B. Variance and Standard Deviation

Variance and standard deviation are the most frequently used indexes of variability. They
are both based on deviation scores—scores that show the difference between a raw score and
the mean of the distribution. The formula for a deviation score is

x = X − X̄

where
x = deviation score
X = raw score
X̄ = mean
Scores below the mean will have negative deviation scores, and scores above the mean
will have positive deviation scores.

By definition, the sum of the deviation scores in a distribution is always 0. Thus, to use
deviation scores in calculating measures of variability, you must find a way to get around the fact
that Σx = 0. The technique used is to square each deviation score so that they all become positive
numbers. If you then sum the squared deviations and divide by the number of scores, you have
the mean of the squared deviations from the mean, or the variance. In mathematical form,
variance is

σ² = Σx² / N
where
σ² = variance
Σ = sum of
x² = deviation of each score from the mean (X − X̄), squared; otherwise known as the
squared deviation score
N = number of cases in the distribution

Table 3. Variance of Mr. Li’s Physics Exam Scores

     X     f     fX     x     x²    fx²     X²     fX²
    23     2     46    +3     9     18     529    1058
    22     2     44    +2     4      8     484     968
    21     4     84    +1     1      4     441    1764
    20     4     80     0     0      0     400    1600
    19     2     38    −1     1      2     361     722
    18     2     36    −2     4      8     324     648
    17     0      0    −3     9      0     289       0
    16     2     32    −4    16     32     256     512

          N = 18   ΣfX = 360        Σfx² = 72            ΣfX² = 7272

In column 4 of Table 3, we see the deviation scores, the differences between each score and
the mean. Column 5 shows each deviation score squared (x²), and column 6 shows the frequency
of each score from column 2 multiplied by x² (column 5). Summing column 6 gives us the sum of
the squared deviation scores, Σx² = 72. Dividing this by the number of scores gives us the mean
of the squared deviation scores, the variance.

The deviation-score formula for variance is convenient only when the mean is a whole
number. To avoid the tedious task of working with squared mixed-number deviation scores
such as 7.6667², we recommend that students use the following computational formula when the
variance must be computed “by hand”:

σ² = (ΣX² − (ΣX)² / N) / N

where

σ² = variance
ΣX² = sum of the squares of each score (each score is first
squared, and then these squares are summed)
(ΣX)² = sum of the scores, squared (the scores are first summed, and then
this total is squared)
N = number of cases

Column 7 in Table 3 shows the square of each raw score. Column 8 shows these raw-score
squares multiplied by frequency. Summing this fX² column gives us the sum of the squared raw
scores:

σ² = (ΣX² − (ΣX)²/N) / N = (7272 − 360²/18) / 18 = (7272 − 129600/18) / 18 = (7272 − 7200) / 18 = 72/18 = 4

In most cases, educators prefer an index that summarizes the data in the same unit of
measurement as the original data. Standard deviation (σ), the positive square root of variance,
provides such an index. By definition, the standard deviation is the square root of the mean of the
squared deviation scores. In symbols,

σ = √(Σx² / N)

For Mr. Li’s physics exam scores, the standard deviation is

σ = √(72/18) = √4 = 2

The standard deviation belongs to the same statistical family as the mean; that is, like the
mean, it is an interval or ratio statistic, and its computation is based on the size of individual scores
in the distribution. It is by far the most frequently used measure of variability and is used in
conjunction with the mean.

There is a fundamental problem with variance as a measure of variability: It is in squared
units. For example, if the observations are measured in dollars, the variance is in squared dollars.
A more natural measure is the square root of variance. This is called the standard deviation.
Again, there are two versions of standard deviation.

The population standard deviation, denoted by σ, is the square root of the population
variance defined above. To calculate either standard deviation in Excel, you can first find the
variance with the VAR (sample) or VARP (population) function and then take its square root.
Alternatively, you can find it directly with the STDEV (sample) or STDEVP (population) function.
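
The following Python sketch reproduces Mr. Li’s results (variance 4, standard deviation 2) and shows the population versus sample versions, which correspond to Excel’s VARP/STDEVP and VAR/STDEV. The frequency table is expanded into raw scores first.

    import statistics

    # Expand the frequency table into raw scores: each score repeated f times.
    freq = {23: 2, 22: 2, 21: 4, 20: 4, 19: 2, 18: 2, 17: 0, 16: 2}
    scores = [x for x, f in freq.items() for _ in range(f)]

    print(statistics.pvariance(scores))  # 4    (population variance, like VARP)
    print(statistics.pstdev(scores))     # 2.0  (population standard deviation, like STDEVP)
    print(statistics.variance(scores))   # about 4.24 (sample variance divides by N - 1, like VAR)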

OUTLIERS AND MISSING VALUES

Most textbooks on data analysis, including this one, tend to use example data sets that
are “cleaned up.” Unfortunately, the data sets you are likely to encounter in your job are often not
so clean. Two particular problems you will encounter are outliers and missing data, the topics of
this section. There are no easy answers for dealing with these problems, but you should at least
be aware of the issues.

Outliers

An outlier is literally a value or an entire observation (row) that lies well outside of the norm.
For the baseball data, Alex Rodriguez’s salary of $32 million is definitely an outlier. This is indeed
his correct salary—the number wasn’t entered incorrectly—but it is way beyond what most players
make. Actually, statisticians disagree on an exact definition of an outlier. Going by the third
empirical rule, you might define an outlier as any value more than three standard deviations from
the mean, but this is only a rule of thumb. Let’s just agree to define outliers as extreme values,
and then for any particular data set, you can decide how extreme a value needs to be to qualify
as an outlier.
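
As a rough illustration of that rule of thumb, the sketch below flags any value more than three standard deviations from the mean; the salary figures (in thousands) are hypothetical, and in practice you would still inspect each flagged value before deciding what to do with it.

    import statistics

    def possible_outliers(values, cutoff=3):
        # Flag values more than `cutoff` standard deviations from the mean.
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        return [v for v in values if abs(v - mean) > cutoff * sd]

    salaries = [70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 3200]  # hypothetical, in $000s
    print(possible_outliers(salaries))  # [3200]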

Sometimes an outlier is easy to detect and deal with. For example, this is often the case
with data entry errors. Suppose a data set includes a Height variable, a person’s height measured
in inches, and you see a value of 720. This is certainly an outlier—and it is certainly an error.

Once you spot it, you can go back and check this observation to see what the person’s height
should be. Maybe an extra 0 was accidentally appended and the true value is 72. In any case,
this type of outlier is usually easy to discover and fix.

It isn’t always easy to detect outliers, but an even more important issue is what to do about
them when they are detected. Of course, if they are due to data entry errors, they can be fixed,
but what if they are legitimate values like Alex Rodriguez’s salary? One or a few wild outliers like
this one can dominate a statistical analysis. For example, they can make a mean or standard
deviation much different than if the outliers were not present.

For this reason, some people argue, possibly naïvely, that outliers should be eliminated
before running statistical analyses. However, it is not appropriate to eliminate outliers simply to
produce “nicer” results. There has to be a legitimate reason for eliminating outliers, and such a
reason sometimes exists. For example, suppose you want to analyze salaries of “typical”
managers at your company. Then it is probably appropriate to eliminate the CEO and possibly other
high-ranking executives from the analysis, arguing that they aren’t really part of the population of
interest and would just skew the results. Or if you are interested in the selling prices of “typical”
homes in your community, it is probably appropriate to eliminate the few homes that sell for over
$2 million, again arguing that these are not the types of homes you are interested in.

Missing Values

There are no missing data in the baseball salary data set. All 843 observations have a
value for each of the four variables. For real data sets, however, this is probably the exception
rather than the rule. Unfortunately, most real data sets have gaps in the data. This could be
because a person didn’t want to provide all the requested personal information (what business is
it of yours how old I am or whether I drink alcohol?), it could be because data doesn’t exist (stock
prices in the 1990s for companies that went public after 2000), or it could be because some values
are simply unknown. Whatever the reason, you will undoubtedly encounter data sets with varying
degrees of missing values. As with outliers, there are two issues: how to detect missing values
and what to do about them. The first issue isn’t as simple as you might imagine. For an Excel data
set, you might expect missing data to be obvious from blank cells. This is certainly one possibility,
but there are others. Missing data are coded in a variety of strange ways. One common method
is to code missing values with an unusual number such as -9999 or 9999. Another method is to
code missing values with a symbol such as - or *. If you know the code (and it is often supplied
in a footnote), then it is usually a good idea, at least in Excel, to perform a global search and
replace, replacing all of the missing value codes with blanks.

The more important issue is what to do about missing values. One option is to ignore them.
Then you will have to be aware of how the software deals with missing values. For example, if
you use Excel’s AVERAGE function on a column of data with missing values, it reacts the way
you would hope and expect—it adds all the non-missing values and divides by the number of non-
missing values. StatTools reacts in the same way for all of the measures discussed in this chapter
(after alerting you that there are indeed missing values). We will say more about how StatTools
deals with missing data for other analyses in later chapters. If you are using other statistical
software such as SPSS or SAS, you should read its online help to learn how its various statistical
analyses deal with missing data.

Because this is such an important topic in real-world data analysis, researchers have
studied many ways of filling in the gaps so that the missing data problem goes away (or is at least
disguised). One possibility is to fill in all of the missing values in a column with the average of the
non-missing values in that column. Indeed, this is an option in some software packages, but we
don’t believe it is usually a very good option. (Is there any reason to believe that missing values
would be average values if they were known? Probably not.) Another possibility is to examine the
non-missing values in the row of any missing value. It is possible that they provide some clues on
what the missing value should be. For example, if a person is male, is 55 years old, has an MBA
degree from Harvard, and has been a manager at an oil company for 25 years, this should
probably help to predict his missing salary. (It probably isn’t below $100,000.) We will not discuss
this issue any further here because it is quite complex, and there are no easy answers. But be
aware that you will undoubtedly have to deal with missing data at some point in your job, either
by ignoring the missing values or by filling in the gaps in some way.
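
If you ever move such a data set into Python, the pandas sketch below (with made-up salary values) mirrors the two steps just described: recode the missing-value code as a true blank (NaN) and then summarize, with the mean computed from the non-missing values only, just as Excel’s AVERAGE does.

    import pandas as pd
    import numpy as np

    salaries = pd.Series([52000, -9999, 61000, 58000, -9999, 75000])

    salaries = salaries.replace(-9999, np.nan)  # the "global search and replace" step
    print(salaries.isna().sum())                # 2 missing values detected
    print(salaries.mean())                      # 61500.0, the average of the non-missing values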

CHAPTER EXERCISES

1. Provide answers as requested, given the following distribution: 15, 14, 14, 13, 11, 10, 10, 10,
8, 5.

a) Calculate the mean.


b) Determine the value of the median.
c) Determine the value of the mode.

2. Suppose that Ms. Llave’s English class has the following scores in two tests, as shown
below:

Student’s Name Test 1 Scores Test 2 Scores


Jude 27 26
Lara 28 31
Nicole 26 30
Christopher 32 31
Desserie 28 30
Lyra 27 29
Lance 25 24
Vince 24 23
Marc 23 18
Elyza 24 26
Philip 29 30
Jomarie 27 28
Earl 30 29
Budgett 19 22
Aron 25 27

Determine the following for each of the two tests (Test 1 and Test 2):

a. Mean

b. Median

c. Mode

d. Variance

e. Standard Deviation

SUGGESTED READINGS

Read articles on Measures of Central Tendency at http://statistics.alerd.com/statistical-guides/measures-central-tendency-mean-mode-meadian.php

Read articles on Measures of Variability at http://onlinestatbook.com/2/summarizing_distributions/variability.html

CHAPTER 4: ANALYTICS ON SPREADSHEETS
OVERVIEW

This chapter discusses a useful tool introduced in Excel 2007: tables. Tables were
somewhat available in previous versions of Excel, but they were never called tables before,
and some of the really useful features of Excel 2007 tables were new at the time. This
chapter discusses how to filter, sort, and summarize data using spreadsheet tables.

OBJECTIVES

▪ Learn how to use Microsoft Excel


▪ Learn how to use a spreadsheet for sorting, filtering, and summarizing data
▪ Learn to summarize data through graphs and tables

REFERENCES

EXCEL TABLES FOR FILTERING, SORTING, AND SUMMARIZING

It is useful to begin with some terminology and history. In the previous chapter, we discussed
data arranged in a rectangular range of rows and columns, where each row is an observation and
each column is a variable, with variable names at the top of each column. Informally, we refer to
such a range as a data set. In fact, this is the technical term used by StatTools. In previous
versions of Excel, data sets of this form were called lists, and Excel provided several tools for
dealing with lists. In Excel 2007, recognizing the importance of data sets, Microsoft made them
much more prominent and provided even better tools for analyzing them. Specifically, you now
have the ability to designate a rectangular data set as a table and then employ a number of
powerful tools for analyzing tables. These tools include filtering, sorting, and summarizing.

Let’s consider the data in Table 4. The data set contains 1000 customers of HyTex, a
(fictional) direct marketing company, for the current year. The definitions of the variables are fairly
straightforward, but details about several of them are listed in cell comments in row 1. HyTex
wants to find some useful and quick information about its customers by using an Excel table. How
can it proceed?

Table 4. HyTex Customer Data

The range A1:O1001 is in the form of a data set—it is a rectangular range bounded by
blank rows and columns, where each row is an observation, each column is a variable, and
variable names appear in the top row. Therefore, it is a candidate for an Excel table. However, it
doesn’t benefit from the new table tools until you actually designate it as a table. To do so, select
any cell in the data set, click the Table button in the left part of the Insert ribbon (see Figure 5),
and accept the default options. Two things happen. First, the data set is designated as a table, it
is formatted nicely, and a dropdown arrow appears next to each variable name, as shown in
Figure 7. Second, a new Table Tools Design ribbon becomes available (see Figure 6). This ribbon
is available any time the active cell is inside a table. Note that the table is named Table1 by default
(if this is the first table). However, you can change this to a more descriptive name if you like.

Figure 5. Inserting Ribbon with Table Button

Figure 6. Table Tools Design Ribbon

One handy feature of Excel tables is that the variable names remain visible even when
you scroll down the screen. Try it to see how it works. When you scroll down far enough that the
variable names would disappear, the column headers, A, B, C, and so on, change to the variable
names. Therefore, you no longer need to freeze panes or split the screen to see the variable
names. However, this works only when the active cell is within the table. If you click outside the
table, the column headers revert back to A, B, C, and so on.

Figure 7. Table with Dropdown Arrows Next to Variable Names

Filtering

We now discuss ways of filtering data sets—that is, finding records that match particular
criteria. Before getting into details, there are two aspects of filtering you should be aware of. First,
this section is concerned with the types of filters called AutoFilter in pre-2007 versions of Excel.
The term AutoFilter implied that these were very simple filters, easy to learn and apply. If you
wanted to do any complex filtering, you had to move beyond AutoFilter to Excel’s Advanced Filter
tool. Starting in version 2007, Excel still has Advanced Filter. However, the term AutoFilter has
been changed to Filter to indicate that these “easy” filters are now more powerful than the old
AutoFilter. Fortunately, they are just as easy as AutoFilter.

Second, one way to filter is to create an Excel table, as indicated in the previous
subsection. This automatically provides the dropdown arrows next to the field names that allow
you to filter. Indeed, this is the way we will filter in this section: on an existing table. However, a
designated table is not required for filtering. You can filter on any rectangular data set with variable
names. There are actually three ways to do so. For each method, the active cell should be a cell
inside the data set.

■ Use the Filter button from the Sort & Filter dropdown list on the Home ribbon.

■ Use the Filter button from the Sort & Filter group on the Data ribbon.

■ Right-click any cell in the data set and select Filter.

You get several options, the most popular of which is Filter by Selected Cell’s Value. For
example, if the selected cell has value 1 and is in the Children column, then only customers with
a single child will remain visible. (This behavior should be familiar to Access users.) The point is
that Microsoft realizes how important filtering is to Excel users. Therefore, they have made filtering
a very prominent and powerful tool in all versions of Excel since 2007. As far as we can tell, the
two main advantages of filtering on a table, as opposed to the three options just listed, are the
nice formatting (banded rows, for example) provided by tables, and, more importantly, the total
row. If this total row is showing, it summarizes only the visible records; the hidden rows are ignored.
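
The same filter-sort-summarize idea carries over to other tools. The pandas sketch below uses hypothetical HyTex-style columns: the filter keeps only the matching rows, and the summary, like a table’s total row, is based on the visible (filtered) rows only.

    import pandas as pd

    customers = pd.DataFrame({
        "Children": [0, 1, 2, 1, 3, 0, 1],
        "Salary":   [45000, 52000, 61000, 48000, 75000, 39000, 58000],
    })

    one_child = customers[customers["Children"] == 1]         # like "Filter by Selected Cell's Value"
    print(one_child.sort_values("Salary", ascending=False))   # sorting the filtered rows
    print(one_child["Salary"].mean())                         # summary of the filtered rows only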

CHAPTER EXERCISES

1. Obtain DOH data on COVID19 in the Philippines from March to August 2020. Write or print your
output on short bond paper.

Tasks:

▪ For students WITH COMPUTER

a. Perform Sorting of Data (using MS Excel)

b. Filter data according to (using MS Excel)

- Month

- Type of data (confirmed cases, deaths, recoveries)

▪ For students WITHOUT COMPUTER

a. Create a line graph of the COVID19 data

- Monthly data

- Data type (confirmed cases, deaths, recoveries)

CHAPTER 5: PROBABILITY AND PROBABILITY DISTRIBUTIONS
OBJECTIVES

▪ Explain the basic concepts and tools necessary to work with probability
distributions and their summary measures.
▪ Examine the probability distribution of a single random variable.
▪ Understand the concept of the addition rule
▪ Understand and learn conditional probability and the multiplication rule
▪ Learn the summary measures of a probability distribution

A key aspect of solving real business problems is dealing appropriately with uncertainty.
This involves recognizing explicitly that uncertainty exists and using quantitative methods to
model uncertainty. If you want to develop realistic business models, you cannot simply act as if
uncertainty doesn’t exist. For example, if you don’t know next month’s demand, you shouldn’t
build a model that assumes next month’s demand is a sure 1500 units. This is only wishful thinking.
You should instead incorporate demand uncertainty explicitly into your model. To do this, you
need to know how to deal quantitatively with uncertainty. This involves probability and probability
distributions. We introduce these topics in this chapter and then use them in a number of later
chapters.

There are many sources of uncertainty. Demands for products are uncertain, times
between arrivals to a supermarket are uncertain, stock price returns are uncertain, changes in
interest rates are uncertain, and so on. In many situations, the uncertain quantity— demand, time
between arrivals, stock price return, change in interest rate—is a numerical quantity. In the
language of probability, it is called a random variable. More formally, a random variable
associates a numerical value with each possible random outcome.

Associated with each random variable is a probability distribution that lists all of the
possible values of the random variable and their corresponding probabilities. A probability
distribution provides very useful information. It not only indicates the possible values of the
random variable, but it also indicates how likely they are. For example, it is useful to know that
the possible demands for a product are, say, 100, 200, 300, and 400, but it is even more useful
to know that the probabilities of these four values are, say, 0.1, 0.2, 0.4, and 0.3. This implies,
for example, that there is a 70% chance that demand will be at least 300. It is often useful to
summarize the information from a probability distribution with numerical summary measures.
These include the mean, variance, and standard deviation. The summary measures in this
chapter are based on probability distributions, not an observed data set. We will use numerical
examples to explain the difference between the two—and how they are related.

We discuss two terms you often hear in the business world: uncertainty and risk. They
are sometimes used interchangeably, but they are not really the same. You typically have no
control over uncertainty; it is something that simply exists. A good example is the uncertainty in
exchange rates. You cannot be sure what the exchange rate between the U.S. dollar and the euro
will be a year from now. All you can try to do is measure this uncertainty with a probability
distribution.

By learning about probability, you will learn how to measure uncertainty, and you
will also learn how to measure the risks involved in various decisions. One important topic you
will not learn much about is risk mitigation by various types of hedging. For example, if you know
you have to purchase a large quantity of some product from Europe a year from now, you face
the risk that the value of the euro could increase dramatically, thus costing you a lot of money.
Fortunately, there are ways to hedge this risk, so that if the euro does increase relative to the
dollar, your hedge minimizes your losses.

PROBABILITY ESSENTIALS

A probability is a number between 0 and 1 that measures the likelihood that some event
will occur. An event with probability 0 cannot occur, whereas an event with probability 1 is certain
to occur. An event with probability greater than 0 and less than 1 involves uncertainty. The closer
its probability is to 1, the more likely it is to occur.

When a sports commentator states that the odds against the Miami Heat winning the NBA
Championship are 3 to 1, he or she is also making a probability statement. The concept of
probability is quite intuitive. However, the rules of probability are not always as intuitive or easy to
master. We examine the most important of these rules in this section.

As the examples in the preceding paragraph illustrate, probabilities are sometimes


expressed as percentages or odds. However, these can easily be converted to probabilities on a
0-to-1 scale. If the chance of rain is 70%, then the probability of rain is 0.7. Similarly, if the odds
against the Heat winning are 3 to 1, then the probability of the Heat winning is 1/4 (or 0.25).

There are only a few probability rules you need to know, and they are discussed in the
next few subsections. Probability is not an easy topic, and a more thorough discussion of it would
lead to considerable mathematical complexity, well beyond the level of this book.

Rule of Complements

The simplest probability rule involves the complement of an event. If A is any event, then
the complement of A, denoted by A̅ (or in some books by Aᶜ), is the event that A does not occur.

For example, if A is the event that the Dow Jones Industrial Average will finish the year at
or above the 14,000 mark, then the complement of A is that the Dow will finish the year below
14,000.

If the probability of A is P(A), then the probability of its complement, P(A̅), is given by
the equation below. Equivalently, the probability of an event and the probability of its complement
sum to 1. For example, if you believe that the probability of the Dow finishing at or above 14,000
is 0.25, then the probability that it will finish the year below 14,000 is 1 − 0.25 = 0.75.

P(A̅) = 1 − P(A)

Addition Rule

Events are mutually exclusive if at most one of them can occur. That is, if one of them
occurs, then none of the others can occur.

For example, consider the following three events involving a company’s annual revenue
for the coming year: (1) revenue is less than $1 million, (2) revenue is at least $1 million but less
than $2 million, and (3) revenue is at least $2 million. Clearly, only one of these events can occur.
Therefore, they are mutually exclusive. They are also exhaustive events, which means that they
exhaust all possibilities—one of these three events must occur. Let A1 through An be any n events.
Then the addition rule of probability involves the probability that at least one of these events will
occur. In general, this probability is quite complex, but it simplifies considerably when the events
are mutually exclusive. In this case the probability that at least one of the events will occur is the
sum of their individual probabilities, as shown in the equation below. Of course, when the events are
mutually exclusive, “at least one” is equivalent to “exactly one.” In addition, if the events A1 through
An are exhaustive, then the probability is one because one of the events is certain to occur.

P(at least one of A1 through An) = P(A1) + P(A2) + … + P(An)

For example, in terms of a company’s annual revenue, define A1 as “revenue is less than
$1 million,” A2 as “revenue is at least $1 million but less than $2 million,” and A3 as “revenue is at
least $2 million.” Then these three events are mutually exclusive and exhaustive. Therefore, their
probabilities must sum to 1. Suppose these probabilities are P(A1) = 0.5, P(A2) = 0.3, and P(A3) =
0.2. (Note that these probabilities do sum to 1.) Then the additive rule enables you to calculate
other probabilities. For example, the event that revenue is at least $1 million is the event that
either A2 or A3 occurs. From the addition rule, its probability is

P(revenue is at least $1 million) = P(A2) + P(A3) = 0.5

Similarly,

P(revenue is less than $2 million) = P(A1) + P(A2) = 0.8

and

P(revenue is less than $1 million or at least $2 million) = P(A1) + P(A3) = 0.7

Again, the addition rule works only for mutually exclusive events. If the events overlap,
the situation is more complex.
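
A quick Python sketch of the rule of complements and the addition rule, using the revenue probabilities from the example above:

    p = {"A1": 0.5, "A2": 0.3, "A3": 0.2}   # mutually exclusive and exhaustive events

    p_at_least_1m = p["A2"] + p["A3"]       # addition rule: 0.5
    p_less_than_1m = 1 - p_at_least_1m      # rule of complements: equals P(A1) = 0.5
    print(p_at_least_1m, p_less_than_1m)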

Conditional Probability and the Multiplication Rule

Probabilities are always assessed relative to the information currently available. As new
information becomes available, probabilities can change. For example, if you read that LeBron
James suffered a season-ending injury, your assessment of the probability that the Heat will win
the NBA Championship would obviously change. A formal way to revise probabilities on the basis
of new information is to use conditional probabilities.

Let A and B be any events with probabilities P(A) and P(B). Typically, the probability P(A)
is assessed without knowledge of whether B occurs. However, if you are told that B has occurred,
then the probability of A might change. The new probability of A is called the conditional
probability of A given B, and it is denoted by P(A∣B). Note that there is still uncertainty involving
the event to the left of the vertical bar in this notation; you do not know whether it will occur.
However, there is no uncertainty involving the event to the right of the vertical bar; you know that
it has occurred. The conditional probability can be calculated with the following formula:

P(A∣B) = P(A and B) / P(B)

The numerator in this formula is the probability that both A and B occur. This probability
must be known to find P(A∣B). However, in some applications P(A∣B) and P(B) are known. Then
you can multiply both sides of the formula above by P(B) to obtain the following multiplication rule
for P(A and B):

P(A and B) = P(A∣B)P(B)

Example:

Bender Company supplies contractors with materials for the construction of houses. The
company currently has a contract with one of its customers to fill an order by the end of July.
However, there is some uncertainty about whether this deadline can be met, due to uncertainty
about whether Bender will receive the materials it needs from one of its suppliers by the middle
of July. Right now it is July 1. How can the uncertainty in this situation be assessed?

Solution

Let A be the event that Bender meets its end-of-July deadline, and let B be the event
that Bender receives the materials from its supplier by the middle of July. The probabilities Bender
is best able to assess on July 1 are probably P(B) and P(A∣B). At the beginning of July, Bender
might estimate that the chances of getting the materials on time from its supplier are 2 out of 3,
so that P(B) = 2/3. Also, thinking ahead, Bender estimates that if it receives the required materials
on time, the chances of meeting the end-of-July deadline are 3 out of 4. This is a conditional
probability statement, namely, that P(A∣B) = 3/4. Then the multiplication rule implies that

P(A and B) =P(A∣B)P(B) = (3/4)(2/3) = 0.5

That is, there is a fifty-fifty chance that Bender will get its materials on time and meet its end-of-
July deadline.
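
The same calculation in a few lines of Python, simply to show the multiplication rule at work on the Bender example:

    p_B = 2 / 3           # P(B): materials arrive by mid-July
    p_A_given_B = 3 / 4   # P(A|B): deadline met, given the materials arrive on time

    p_A_and_B = p_A_given_B * p_B   # multiplication rule
    print(p_A_and_B)                # 0.5, the fifty-fifty chance described above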

PROBABILITY DISTRIBUTION OF A SINGLE RANDOM VARIABLE

There are really two types of random variables: discrete and continuous. A discrete
random variable has only a finite number of possible values, whereas a continuous random
variable has a continuum of possible values. Usually a discrete distribution results from a count,
whereas a continuous distribution results from a measurement. For example, the number of
children in a family is clearly discrete, whereas the amount of rain this year in San Francisco is
clearly continuous.

This distinction between counts and measurements is not always clear-cut. For example,
what about the demand for televisions at a particular store next month? The number of televisions
demanded is clearly an integer (a count), but it probably has many possible values, such as all
integers from 0 to 100. In some cases like this, we often approximate in one of two ways. First,
we might use a discrete distribution with only a few possible values, such as all multiples of 20
from 0 to 100. Second, we might approximate the possible demand as a continuum from 0 to 100.
The reason for such approximations is to simplify the mathematics, and they are frequently used.

Mathematically, there is an important difference between discrete and continuous
probability distributions. Specifically, a proper treatment of continuous distributions, analogous to
the treatment we provide in this chapter, requires calculus—which we do not presume for this
book. Therefore, we discuss only discrete distributions in this chapter. In later chapters we often
use continuous distributions, particularly the bell-shaped normal distribution, but we simply state
their properties without deriving them mathematically.

The essential properties of a discrete random variable and its associated probability
distribution are quite simple. We discuss them in general and then analyze a numerical example.

Let X be a random variable. To specify the probability distribution of X, we need to specify


its possible values and their probabilities. We assume that there are k possible values, denoted
v1, v2, . . . , vk. The probability of a typical value vi is denoted in one of two ways, either P(X = vi)
or p(vi). The first is a reminder that this is a probability involving the random variable X, whereas
the second is a shorthand notation. Probability distributions must satisfy two criteria: (1) the
probabilities must be nonnegative, and (2) they must sum to 1. In symbols, we must have
p(vi) ≥ 0 for each i, and p(v1) + p(v2) + … + p(vk) = 1.

Summary Measures of a Probability Distribution

It is often convenient to summarize a probability distribution with two or three well-chosen


numbers. The first of these is the mean, often denoted µ. It is also called the expected value of X
and denoted E(X) (for expected X). The mean is a weighted sum of the possible values, weighted
by their probabilities, as shown in the equation below. In much the same way that an average of a
set of numbers indicates “central location,” the mean indicates the “center” of the probability
distribution.

µ = E(X) = v1p(v1) + v2p(v2) + … + vkp(vk)

To measure the variability in a distribution, we calculate its variance or standard deviation.
The variance, denoted by σ² or Var(X), is a weighted sum of the squared deviations of the possible
values from the mean, where the weights are again the probabilities. This is shown in the equation
below. As in Chapter 3, the variance is expressed in the square of the units of X, such as dollars
squared. Therefore, a more natural measure of variability is the standard deviation, denoted by σ
or Stdev(X). It is the square root of the variance.

σ² = Var(X) = (v1 − µ)²p(v1) + (v2 − µ)²p(v2) + … + (vk − µ)²p(vk)

σ = Stdev(X) = √Var(X)
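
The following Python sketch applies these summary-measure formulas to the demand distribution mentioned earlier in the chapter (values 100 to 400 with probabilities 0.1, 0.2, 0.4, and 0.3); the same pattern can be reused for the chapter exercise below.

    from math import sqrt

    values = [100, 200, 300, 400]
    probs  = [0.1, 0.2, 0.4, 0.3]

    mean = sum(v * p for v, p in zip(values, probs))                    # E(X)
    variance = sum((v - mean) ** 2 * p for v, p in zip(values, probs))  # Var(X)
    stdev = sqrt(variance)                                              # Stdev(X)

    print(round(mean, 2), round(variance, 2), round(stdev, 2))  # 290.0 8900.0 94.34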

CHAPTER EXERCISES

An investor is concerned with the market return for the coming year, where the market
return is defined as the percentage gain (or loss, if negative) over the year. The investor believes
there are five possible scenarios for the national economy in the coming year: rapid expansion,
moderate expansion, no growth, moderate contraction, and serious contraction. Furthermore, she
has used all of the information available to her to estimate that the market returns for these
scenarios are, respectively, 23%, 18%, 15%, 9%, and 3%. That is, the possible returns vary from
a high of 23% to a low of 3%. Also, she has assessed that the probabilities of these outcomes are
0.12, 0.40, 0.25, 0.15, and 0.08. Use this information to describe the probability distribution of the
market return.

Compute the following for the probability distribution of the market return for the coming
year:

1. Mean,

2. Variance,

3. Standard deviation

Show your solutions.

SUGGESTED READINGS

Read the article on Probability Distribution at http://stattrek.com/probability-distributions/probability-distribution.aspx

CHAPTER 6: STATISTICAL INFERENCE: SAMPLING AND ESTIMATION
OVERVIEW

This chapter introduces the important problem of estimating an unknown population quantity
by randomly sampling from the population. Sampling is often expensive and/or time-
consuming, so a key step in any sampling plan is to determine the sample size that produces
a prescribed level of accuracy. This chapter also sets the stage for statistical inference, a
topic that is explored in the following few chapters. In a typical statistical inference problem,
you want to discover one or more characteristics of a given population.
OBJECTIVES

▪ Identify sample size and population


▪ Identify and Understand different sampling techniques
▪ Discuss the sampling schemes that are generally used in real sampling applications.
▪ See how the information from a sample of the population can be used to infer the
properties of the entire population.
▪ Learn the different statistical treatments used for inference
▪ Understand the use of confidence intervals and margins of error
▪ Understand estimation and its use

UNDERSTANDING SAMPLES

What is a population? Any group made up of people, transactions, products, and so on
that shares at least one common characteristic is called a population. You need to understand the
population for any project at the beginning of the project. In business, it is rare to have a population
that has only one characteristic. Generally, it will have many variables in the data set. What is a
sample? A sample consists of a few observations or subset of a population. Can a sample have
the same number of observations as a population? Yes, it can. Some of the differences between
populations and samples are in the computations and nomenclatures associated with them.

In statistics, population refers to a collection of data related to people or events for which
the analyst wants to make some inferences. It is often not possible to examine every member in the
population. Thus, if you take a sample that is random and large enough, you can use the
information collected from the sample to make deductions about the population. For example, you
can look at 100 students from a school (picked randomly) and make a fairly accurate judgment of
the standard of English spoken in the school. Or you can look at the last 100 transactions on a
web site and figure out fairly accurately the average time a customer spends on the web site.

Before you can choose a sample from a given population, you typically need a list of all
members of the population. In sampling terminology, this list is called a frame, and the potential
sample members are called sampling units. Depending on the context, sampling units could be
individual people, households, companies, cities, or others.

There are two basic types of samples: probability samples and judgmental samples. A
probability sample is a sample in which the sampling units are chosen from the population
according to a random mechanism. In contrast, no formal random mechanism is used to select a
judgmental sample. In this case the sampling units are chosen according to the sampler’s
judgment.

SAMPLING TECHNIQUES

A sample is part of the population which is observed in order to make inferences about
the whole population (Manheim, 1977). You use sampling when your research design requires
that you collect information from or about a population, which is large or so widely scattered as to
make it impractical to observe all the individuals in the population. A sample reflects the
characteristics of the population.

Four factors that you should take into consideration when selecting your sample and the
size of your sample are the following:

1. Homogeneity. Take samples from a homogenous population. Samples taken


from a heterogeneous population will not be representative of the population,
and therefore, cannot be inferred from.

2. Size of population. If the population is large, you need a sample. However, you
do not need a sample if the population is small enough to be handled by including
all the individuals in the population. Including all the individuals in the population
is also called total enumeration.

3. Cost. Your choice of sampling method should be based also on the cost of
adopting such method without necessarily sacrificing representativeness of the
population being considered.

4. Precision. If you have to achieve precision, you will need a larger sample
because the larger the sample, the more precise the results will be.

There are two major types of sampling techniques: probability sampling and non-
probability sampling.

a. Probability sampling

According to Domingo (1954), probability sampling is a sampling process where each


individual is drawn or selected with known probability. Parel et al. (1966) consider a sample to be a
probability sample when every individual in the population is given a non-zero chance of being
chosen for the sample. There are six techniques under this sampling method.

1. Random sampling. Also called simple random sampling, this technique is a way of
selecting n individuals out of N such that everyone has an equal chance of being
selected. Sample individuals are selected at points entirely at random within the
population. This technique is suitable for homogeneous populations.

2. Systematic random sampling. This technique starts by numbering consecutively all


individuals in the population. The first sample is selected through a simple random
process, then the succeeding samples are chosen at pre-established intervals. To
determine the appropriate interval, divide N by the desired sample size. (A short sketch
of this and of simple random sampling appears after this list.)

3. Stratified sampling. This technique is applicable when the population is not


homogeneous wherein the random sample may not be representative of the
population. When you do stratified sampling, divide the population into homogeneous
groups called strata, then draw samples either by simple random sampling or stratified
sampling from each of the formed strata. For precise results, the total number of the
desired sample may be allocated equally among the strata. This technique prevents
any chance concentration of sample units in one part of the field because they are well
distributed. For example, suppose that you would like to take a sample of students at
the University of the Philippines Los Baños using the stratified sampling technique.
The stratification of the student population has already been made for you. The strata
are: “freshmen,” “sophomore,” “junior,” and “senior.” What do you do to select your
sample from each of these groups of students to ensure that you get a cross-section of
the UPLB studentry? If you select your sample by simple random selection, there is a
chance that you will end up with a sample composed more of seniors or juniors rather
than representative groups of students in all classifications.

4. Simple cluster sampling. This is a one-stage sampling technique wherein the


population is grouped into clusters or small units composed of population elements. A
number of these population clusters is chosen either by simple random sampling or by
systematic random sampling.

5. Strip sampling. Under this technique, you divide the area to be sampled into narrow
strips. Then, select a number of strips at random either by complete randomization or
with some degree of stratification. Sometimes you may consider only a part of the strip
as a sample unit.

6. Multi-stage sampling. This technique is commonly used when there is no detailed or


actual listing of individuals. You sample in stages, which means that you group the
population elements into a hierarchy of individuals or units, and sampling is done
successively at each stage.
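
As a small illustration of the first two techniques, the Python sketch below draws a simple random sample and a systematic random sample from a hypothetical frame of 100 numbered students; the frame and sample size are made up for the example.

    import random

    frame = list(range(1, 101))   # hypothetical sampling frame: students numbered 1 to 100
    n = 10                        # desired sample size

    simple_random = random.sample(frame, n)   # every unit has an equal chance of selection

    interval = len(frame) // n                # N / n = 10
    start = random.randint(0, interval - 1)   # random starting point within the first interval
    systematic = frame[start::interval]       # then every 10th unit thereafter

    print(simple_random)
    print(systematic)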

b. Non-probability sampling

According to Yamane (1967), this method is a process whereby probabilities cannot be


assigned objectively to individuals in the population. Simply, not all the individuals in the
population are given a non-zero chance of being included in the sample. In fact, some individuals
in the population may be deliberately ignored.

1. Judgment sampling. This is a process whereby you select a representative sample


according to your subjective judgment. Since personal bias is usually a factor in the
selection of sample, there is no objective way of evaluating the results of this
technique. This sampling technique may be appropriate when you have to make
judgment about an individual’s potential as a source of information.

2. Quota sampling. This is simply a variation of judgment sampling, which provides


more explicit instructions on who to select. A definite quota must be filled. The quota
is determined to a certain extent by the characteristics of the population so that the
quota sample will be representative of the population. This is commonly used in
opinion research, where interviewers are just given specific quotas or number of
respondents to interview. This technique is very economical and simple, but it must be
used with caution as it allows for a wide latitude of interviewer’s choices which may
result in biases. The assumption here, however, is that field investigators have high
integrity and they have undergone thorough training.

3. Accidental sampling. This technique is very simple in that whoever happens to be there at the
time of the interview is interviewed and becomes part of the sample. This is normally done in
spot surveys for audience studies, for example.

Why Random Sampling?

One reason for sampling randomly from a population is to avoid biases (such as
choosing mainly stay-at-home mothers because they are easier to contact). An equally
important reason is that random sampling allows you to use probability to make inferences
about unknown population parameters. If sampling were not random, there would be no
basis for using probability to make such inferences.

DETERMINING SAMPLE SIZE

On top of the basic sampling techniques that are commonly used, you can introduce a
system to ensure that the final sample of your study is really representative of a population
composed of individuals that may come in clusters or groups. This is called proportional
sampling, and there is a simple formula that will enable you to arrive at a complete sample
that is representative of the segments of the population.

For instance, you want to obtain a sample sufficiently representative of the barangays or
villages in a town. You know that the barangays differ in the total number of individuals living in
them, so you decide that those with larger populations should be represented by more
respondents. How then would you determine the number of respondents coming from each village?

To determine the sample size, Slovin’s formula is commonly used for smaller populations.

n = N / (1 + Ne²)     where n = sample size, N = population size, e = margin of error

For example:

Suppose you wanted to determine the sample size for your study on households’ taste
preference for the new variety of ice cream. The study will be conducted in Sto. Nino, Paranaque
City, which has a total of 4,921 households (PSA Census on Population 2000 data).

Solution:

n = N / (1 + Ne²) = 4921 / (1 + 4921(0.05)²) = 4921 / (1 + 4921(0.0025)) = 4921 / (1 + 12.30) = 4921 / 13.30 ≈ 370

Hence, your sample size is only 370 households from the 4921. This represents the
number of respondents that you will survey for your study.
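
To double-check this arithmetic, Slovin’s formula is easy to code. The following is a minimal
Python sketch (the function name and print-out are illustrative, not part of the module):

import math

def slovin(population_size, margin_of_error):
    # Slovin's formula: n = N / (1 + N * e^2)
    return population_size / (1 + population_size * margin_of_error ** 2)

n = slovin(4921, 0.05)
print(round(n))   # 370 households, matching the hand computation above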

INTRODUCTION TO ESTIMATION

The purpose of any random sample, simple or otherwise, is to estimate properties of a
population from the data observed in the sample. The following is a good example to keep in mind.
Suppose a government agency wants to know the average household income over the population
of all households in Indiana. Then this unknown average is the population parameter of interest,
and the government is likely to estimate it by sampling several representative households in
Indiana and reporting the average of their incomes.

The mathematical procedures appropriate for performing this estimation depend on which
properties of the population are of interest and which type of random sampling scheme is used.
Because the details are considerably more complex for more complex sampling schemes such
as multistage sampling, we will focus on simple random samples, where the mathematical details
are relatively straightforward. Details for other sampling schemes such as stratified sampling can
be found in Levy and Lemeshow (1999). However, even for more complex sampling schemes,
the concepts are the same as those we discuss here; only the details change.

Sources of Estimation Error

There are two basic sources of error that can occur when you sample randomly from a
population: sampling error and all other sources, usually lumped together as non-sampling
error. Sampling error results from “unlucky” samples; as such, the term error is somewhat
misleading.

For example, suppose that the mean household income in Indiana is $58,225. (We can only
assume that this is the true value; it wouldn’t actually be known without taking a census.) A
government agency wants to estimate this mean, so it randomly samples 500 Indiana households
and finds that their average household income is $60,495. If the agency then infers that the mean
of all Indiana household incomes is $60,495, the resulting sampling error is the difference
between the reported value and the true value: $60,495 – $58,225 = $2,270. Note that the agency
hasn’t done anything wrong. This sampling error is essentially due to bad luck.

Non-sampling error is quite different and can occur for a variety of reasons.

a. nonresponse bias. This occurs when a portion of the sample fails to respond to the survey.
Anyone who has ever conducted a questionnaire, whether by mail, by phone, or any other
method, knows that the percentage of non-respondents can be quite large. The question is
whether this introduces estimation error. If the non-respondents would have responded
similarly to the respondents, you don’t lose much by not hearing from them. However, because
the non-respondents don’t respond, you typically have no way of knowing whether they differ
in some important respect from the respondents. Therefore, unless you are able to persuade
the non-respondents to respond—through a follow-up email, for example—you must guess at
the amount of nonresponse bias.

b. non-truthful responses. This is particularly a problem when there are sensitive questions in
a questionnaire. For example, if the questions “Have you ever had an abortion?” or “Do you
regularly use cocaine?” are asked, most people will answer “no,” regardless of whether the
true answer is “yes” or “no.”

c. measurement error. This occurs when the responses to the questions do not reflect what the
investigator had in mind. It might result from poorly worded questions, questions the
respondents don’t fully understand, questions that require the respondents to supply
information they don’t have, and so on. Undoubtedly, there have been times when you were
filling out a questionnaire and said to yourself, “OK, I’ll answer this as well as I can, but I know
it’s not what they want to know.”

d. voluntary response bias. This occurs when the subset of people who respond to a survey
differ in some important respect from all potential respondents. For example, suppose a
population of students is surveyed to see how many hours they study per night. If the students
who respond are predominantly those who get the best grades, the resulting sample mean
number of hours could be biased on the high side.

Key Terms in Sampling

A point estimate is a single numeric value, a “best guess” of a population parameter, based on
the data in a random sample.

The sampling error (or estimation error) is the difference between the point estimate and the true
value of the population parameter being estimated.

The sampling distribution of any point estimate is the distribution of the point estimates from all
possible samples (of a given sample size) from the population.

A confidence interval is an interval around the point estimate, calculated from the sample data,
that is very likely to contain the true value of the population parameter.

An unbiased estimate is a point estimate such that the mean of its sampling distribution is equal
to the true value of the population parameter being estimated.

The standard error of an estimate is the standard deviation of the sampling distribution of the
estimate. It measures how much estimates vary from sample to sample.

Sample Size Selection

The problem of selecting the appropriate sample size in any sampling context is not an
easy one, but it must be faced in the planning stages, before any sampling is done. We focus
here on the relationship between sampling error and sample size.
As we discussed previously, the sampling error tends to decrease as the sample size increases,
so the desire to minimize sampling error encourages us to select larger sample sizes. We should
note, however, that several other factors encourage us to select smaller sample sizes. The
ultimate sample size selection must achieve a trade-off between these opposing forces.

The determination of sample size is usually driven by sampling error considerations. If you
want to estimate a population mean with a sample mean, then the key is the standard error of the
mean, given by

SE(X̄) = σ/√n
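
As a quick numeric illustration of how this standard error shrinks as n grows, here is a small
Python sketch; the population standard deviation of 20,000 is an assumed value, not a figure
from the module:

import math

sigma = 20_000                      # assumed population standard deviation
for n in (100, 400, 1600):
    se = sigma / math.sqrt(n)       # SE(X-bar) = sigma / sqrt(n)
    print(n, round(se, 1))
# Quadrupling the sample size cuts the standard error in half.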

CONFIDENCE INTERVALS

In the world of statistics, you can look at what applies to the sample and try to determine
the population. You know that the sample cannot be a 100 percent replica of the population. There
will be minor differences, and perhaps major ones too. How do you figure out whether the
sample statistics are applicable to the population? To answer this, you look at the confidence
interval. Confidence intervals enable you to understand the accuracy that you can expect when
you take the sample statistics and apply them to the population.

In other words, a confidence interval gives you a range of values within which you can
expect the population parameter to lie.

In statistics there is a term called the margin of error, which defines the maximum
expected difference between the population parameter and the sample statistic. It is often an
indicator of the random sampling error, and it is expressed as a likelihood or probability that the
result from the sample is close to the value that would have been calculated if you could have
calculated the statistic for the population. The margin of error is calculated when you observe
many samples instead of one sample. When you look at 50 people coming in for an interview and
find that 5 people do not arrive at the correct time, you can conclude that the margin of error is 5
÷ 50, which is equal to 10 percent. Therefore, the absolute margin of error, which is five people,
is converted to a relative margin of error, which is 10 percent.

Now, what is the chance that when you observe many samples of 50 people, you will find
that in each sample 5 people do not come at the designated time of interview? If you find that, out
of 100 samples, in 99 samples 5 people do not come in on time for an interview, you can say with
99 percent confidence that the margin of error is 10 percent.

Why should there be any margin of error if the sample is a mirror image of the population?
The answer is that there is no sample that will be a 100 percent replica of the population. But it
can be very close. Thus, the margin of error can be caused because of a sampling error or
because of a nonsampling error.

You already know that the chance that the sample is off the mark decreases as the
sample size increases. The more people or products you have in your sample, the more
likely you are to get a statistic that is very close to the population value. Thus, the margin of error
in a sample is approximately equal to 1 divided by the square root of the number of observations
in the sample.

e = 1/√n
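
A short Python sketch of the interview example and of this rough 1/√n rule (the sample sizes
in the loop are arbitrary illustrations):

import math

# Relative margin of error from the interview example: 5 late arrivals out of 50
print(5 / 50)                              # 0.10, i.e., 10 percent

# Rough margin of error as a function of sample size, e = 1 / sqrt(n)
for n in (50, 100, 400, 1000):
    print(n, round(1 / math.sqrt(n), 3))   # shrinks as the sample grows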

What Is the P-value?

You know that hypothesis testing is used to confirm or reject whether two samples belong
to the same population. The p-value is the probability that helps determine whether the two samples
come from the same population. This probability is a measure of evidence against the null hypothesis.

Remember the following:

▪ The null hypothesis always claims that Mean 1 = Mean 2.


▪ The aim of hypothesis testing is to reject the null.

Thus, a small p-value means that you can reject the null, because data this extreme would
be very unlikely if the two samples really came from the same population (a p-value of .05
corresponds to a 5 percent probability).

The small p-value corresponds to strong evidence, and if the p-value is below a predefined
limit (.05 is the default value in most software), then the result is said to be statistically significant.

For example, if the hypothesis is that a new type of medicine is better than the old version,
then the first attempt is to show that the drugs are not similar (that any observed similarity is so
small that it can be attributed to random coincidence). Then the null hypothesis that the two drugs
are the same needs to be rejected. A small p-value signifies that results like those observed would
be very unlikely if the null hypothesis were true, so any apparent similarity can be attributed to chance.

Figure 11. Unlikely observations

The distribution in Figure 11 shows which observations would be likely, and which unlikely, if the
null hypothesis were true. Thus, when the p-value is less than .05 (or any other value set for the
test), you have to reject the null and conclude that any apparent equality of Mean 1 and Mean 2
is due only to coincidence or chance.

Errors in Hypothesis Testing

No hypothesis test is 100 percent certain. As you have noticed, tests are based on
probability, and therefore, there is always a chance of an incorrect conclusion. These incorrect
conclusions can be of two types:

▪ Type 1 error, alpha: This is when the null hypothesis is true but you reject the null. Alpha
is the level of significance that you have set for the test. At a significance of .05, you are
willing to accept a 5 percent chance that you will incorrectly reject the null hypothesis. To
lower the risk, you can choose a lower value of significance. A type I error is generally
reported as the p-value.
▪ Type II error, beta: This is the error of incorrectly accepting the null. The probability of
making a type II error depends on the power of the test.

You can decrease your risk of committing a type II error by ensuring your sample size is
large enough to detect a practical difference when one truly exists. The confidence level is
equivalent to 1 minus the alpha level.

When the significance level is 0.05, the corresponding confidence level is 95 percent.

▪ If the p-value is less than the significance (alpha) level, the hypothesis test is statistically
significant.
▪ If the confidence interval does not contain the null hypothesis value between the upper
and lower confidence limits, the results are statistically significant (the null can be
rejected).

▪ If the p-value is less than the alpha, the confidence interval will not contain the null
hypothesis value.

1. Confidence level + alpha = 1.

2. If the p-value is low, the null must go.

3. The confidence interval and p-value will always lead to the same conclusion.

The most valuable usage of hypothesis testing is in interpreting the robustness of other
statistics generated while solving the problem/doing the project.

▪ Correlation coefficient: If the p-value is less than or equal to .05, you can conclude that
the calculated correlation coefficient reflects a real correlation. If the p-value is greater
than .05, you have to conclude that the apparent correlation is because of
chance/coincidence. (A short sketch follows this list.)
▪ Linear regression coefficients: If the p-value is less than or equal to .05, you can
conclude that the calculated coefficients reflect real effects. If the p-value is greater than
.05, you have to conclude that the coefficients are because of chance/coincidence.
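
Here is a minimal sketch of that interpretation using SciPy; the two data lists are invented for
illustration, and scipy.stats.pearsonr returns both the correlation coefficient and its p-value:

from scipy import stats

ad_spend = [10, 12, 15, 18, 22, 25, 28, 30]
sales    = [40, 44, 50, 55, 63, 66, 72, 75]

r, p_value = stats.pearsonr(ad_spend, sales)
if p_value <= 0.05:
    print(f"Correlation {r:.2f} is statistically significant (p = {p_value:.4f})")
else:
    print(f"Correlation {r:.2f} may be due to chance (p = {p_value:.4f})")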

SAMPLING DISTRIBUTIONS

What will happen if you are able to draw out all possible samples of 30 or more
observations from a given population/sample frame? For each of these samples, you could
compute the descriptive statistics (mean, median, standard deviation, minimum, maximum). Now
if you were to create a probability distribution of this statistic, it would be called the sampling
distribution, and the standard deviation of this statistic would be called the standard error.

It has been found that if a very large number of samples is taken from the same sample
frame/population and a sample statistic (say, the mean of each sample) is plotted out, a normal
distribution emerges; this is the essence of the central limit theorem. Thus, most of the sample
means will be clustered around the mean of the sample means, which incidentally will coincide
with or be very close to the population/sample frame mean. This follows the normal distribution
rule, which states that values are concentrated around the mean and few values will be far away
from the mean (very low or very high as compared to the mean).
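
This behavior is easy to see in a small simulation. The sketch below assumes a skewed
(exponential) population and samples of size 30; both choices are illustrative:

import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=100, size=100_000)    # a skewed population

# Draw many samples of size 30 and record each sample mean
sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print(round(population.mean(), 1))       # population mean
print(round(np.mean(sample_means), 1))   # centre of the sampling distribution (very close)
print(round(np.std(sample_means), 1))    # standard error, roughly sigma / sqrt(30)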

Binomial Distribution

The basic building block of the binomial distribution is a Bernoulli random variable. This is
a variable for which there can be only two possible outcomes, and the probability of these
outcomes satisfies the conditions of a valid probability distribution function: each probability is
between 0 and 1, and the probabilities sum to 1, or 100 percent.

A single observation of the outcome of a Bernoulli random variable is called a trial, and the
sum of a series of such trials follows a binomial distribution.

One such example is the probability of getting a tail on the toss of a coin, which is
50 percent or .5. If there are 100 such tosses, you will find that getting 0 tails and 100 heads is
very unlikely, getting 50 heads and 50 tails is the most likely, and getting 100 tails and 0 heads is
equally unlikely.

Now let’s look at a scenario where you have four possible outcomes and getting outcome 1,
2, or 3 defines success, while getting outcome 4 defines failure. Thus, the probability of success
is 75 percent, and the probability of failure is 25 percent. Now if you were to run 200 trials, you
will find that a similar distribution occurs, but the distribution will be more skewed, or can be seen
to be shifted, as compared to the earlier 50-50 distribution.

Figure 8. Demonstration Of Binomial Distribution
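
A sketch of the coin-toss example using SciPy’s binomial distribution (the 75 percent scenario
uses 200 trials as in the discussion above):

from scipy.stats import binom

n, p = 100, 0.5                    # 100 tosses, probability of a tail = 0.5
print(binom.pmf(0, n, p))          # 0 tails: essentially zero
print(binom.pmf(50, n, p))         # 50 tails: the single most likely outcome (~0.08)
print(binom.pmf(100, n, p))        # 100 tails: essentially zero

# The 75 percent success scenario: the peak shifts toward 150 successes out of 200
print(binom.pmf(150, 200, 0.75))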

Continuous Uniform Distribution

What if you have no prior beliefs about the distribution of probability, or you believe that
every outcome is equally possible? This is easy to handle when the variable is discrete. When
the same condition applies to a continuous variable, the distribution that emerges is called the
continuous uniform distribution (Figure 9). It is often used for random number generation in
simulations.

Figure 9. A continuous uniform distribution

Poisson Distribution

Let’s look at some events that occur at a continuous rate, such as phone calls coming into
a call center. Let the rate of occurrence be lambda (λ). When the rate is small (that is, there are
only one or two calls in a day), the possibility that you will get zero calls on certain days is
high. However, say the number of calls in the call center is on average 100 per day. Then the
possibility that you will ever get zero calls in a day is very low. This distribution is called the
Poisson distribution (Figure 10).

Figure 10. Some Poisson distributions
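
A sketch of the two call-center scenarios with SciPy; the low rate of 2 calls per day is an
assumed illustration of “one or two calls”:

from scipy.stats import poisson

# Low-volume call center: about 2 calls per day on average
print(poisson.pmf(0, mu=2))       # P(zero calls in a day) is about 0.135

# High-volume call center: about 100 calls per day on average
print(poisson.pmf(0, mu=100))     # P(zero calls in a day) is about 3.7e-44, effectively never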

PARAMETRIC TESTS

The following are some parametric tests:

▪ Student’s t-test: t-tests look at the differences between two groups across the
same variable of interest, or they look at two variables in the same sample. The
constraint is that there can be at most two groups.
▪ An example is if you want to compare the grades in English for students of Class 1’s
Section A and Section B. Another example is if you want to compare the grades in Class
1’s Section A for math and for science.
o One-sample t-test: When the null hypothesis states that the mean of a variable is
less than or equal to a specific value, that test is a one-sample t-test.
o Paired sample t-test: When the null hypothesis assumes that the mean of
variable 1 is equal to the mean of variable 2, then that test is a paired sample t-
test.

o Independent sample t-test: This compares the mean difference between two
independent groups for a given variable. The null hypothesis is that the mean for
the variable in sample 1 is equal to the mean for the same variable in sample 2.
The assumption is that the variance or standard deviation across the samples is
nearly equal.

For example, if you want to compare the grades in English for students of Class 1’s Section
A and Section B, you can use an analysis of variance (ANOVA) test as a substitute for the
Student’s t-test.
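
A minimal sketch of an independent-sample t-test in Python; the grade lists for Section A and
Section B are made up for illustration:

from scipy import stats

section_a = [85, 78, 90, 72, 88, 76, 81, 79]
section_b = [74, 70, 82, 68, 77, 73, 71, 69]

t_stat, p_value = stats.ttest_ind(section_a, section_b)
print(round(t_stat, 2), round(p_value, 4))
# If p_value <= .05, reject the null that the two sections have equal mean grades.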

▪ ANOVA test: This tests the significance of differences between two or more groups across
one or more categorical variables. Thus, you will be able to figure out whether there is a
significant difference between groups, but it will not tell you which group is different
(a short sketch follows this list).

▪ An example is if you want to compare the grades in English for students of Class 1’s
Section A, Section B, and Section C. Another example is if you want to compare the
grades in Class 1’s Section A for math, English, and science.
o One-way ANOVA: In this test, you compare the means of a number of groups
based on one independent variable. There are some assumptions, such as that the
dependent variable is normally distributed and that the groups defined by the
independent variable have equal variance on the dependent variable.
o Two-way ANOVA: Here you can look at multiple groups and two factor variables.
Again, the assumption is that there is homogeneity of variance, meaning the
standard deviations of the populations of all the groups are similar.
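
The sketch below runs a one-way ANOVA across three sections with SciPy; again, the grade
data is invented for illustration:

from scipy import stats

section_a = [85, 78, 90, 72, 88]
section_b = [74, 70, 82, 68, 77]
section_c = [80, 83, 79, 85, 81]

f_stat, p_value = stats.f_oneway(section_a, section_b, section_c)
print(round(f_stat, 2), round(p_value, 4))
# A small p-value says at least one section differs, but not which one.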

NONPARAMETRIC TESTS

Here the data is not normally distributed. Thus, if the data is better represented by the
median instead of the mean, it is better to use nonparametric tests. It is also better to use
nonparametric tests if the data sample is small, the data is ordinal or ranked, or the data has
some outliers that you do not want to remove.

Chi-squared tests compare observed frequencies to expected frequencies and are used
across categorical variables.

As discussed, chi-square tests are used on data that has ordinal or nominal variables. For
example, say you want to understand the population of Indian males in cities who regularly
exercise, sporadically exercise, or have not exercised over the last 20 years. Thus, you have
three responses tracked over 20 years, and you need to figure out whether the population has
shifted between year 1 and year 20. The null hypothesis here would mean that there is no change
or no difference in the situation.

▪ Year 1 statistics: 60 percent regularly exercise, 20 percent sporadically exercise, and 20
percent have not exercised.
▪ Year 20 statistics: 68 percent regularly exercise, 16 percent sporadically exercise, and 16
percent have not exercised.

The test for both years was run on 500 people. Now you would compare the year 20
statistics with what could be the expected frequencies of these people in year 20 (if the year 1
trends are followed) as compared to the observed frequencies.
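
A sketch of that comparison with SciPy: the year 1 proportions applied to 500 respondents give
the expected counts, and the year 20 survey gives the observed counts:

from scipy.stats import chisquare

observed = [340, 80, 80]     # year 20: 68%, 16%, 16% of 500 respondents
expected = [300, 100, 100]   # year 1 proportions (60%, 20%, 20%) applied to 500

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 2), round(p_value, 4))
# A small p-value indicates the exercise pattern shifted between year 1 and year 20.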

The test is based on a numerical measure of the difference between the two histograms.
Let C be the number of categories in the histogram, let Oi be the observed number of
observations in category i, and let Ei be the expected number of observations in category i under
the null hypothesis (for a normality test, Ei is computed from a normal distribution with the same
mean and standard deviation as the sample). Then the goodness-of-fit measure

χ² = Σ (Oi − Ei)² / Ei   (summed over the C categories)

is used as a test statistic. If the null hypothesis of normality is true, this test statistic has
(approximately) a chi-square distribution with C − 3 degrees of freedom. Because large values of
the test statistic indicate a poor fit—the Oi’s do not match up well with the Ei’s—the p-value for
the test is the probability to the right of the test statistic in the chi-square distribution with C − 3
degrees of freedom.

CHAPTER EXERCISES

1. Differentiate a population from a sample. Cite examples of both.

2. When is a sample used, and when is the entire population used?

3. What is the purpose of setting a confidence interval and a margin of error?

4. What is the purpose of a hypothesis? Of hypothesis testing?

5. Suppose you are going to study how the buying behavior of millennials differs across their
demographic profiles. Formulate three hypotheses and determine what statistical treatment will
be used for each hypothesis.

SUGGESTED READINGS

Read article on Hypothesis Testing at https://statrek.com/hypothesis-test/hypothesis-testing.aspx

CHAPTER 7
DATA MINING
OVERVIEW

The types of data analysis discussed throughout this book are crucial to the success of most
companies in today’s data-driven business world. However, the sheer volume of available
data often defies traditional methods of data analysis. Therefore, new methods— and
accompanying software—have recently been developed under the name of data mining.

OBJECTIVES

▪ Define Data Mining


▪ Learn the powerful tools for exploring and visualizing data.
▪ Learn classifications of data mining
▪ Learn and understand the process of clustering

INTRODUCTION TO DATA MINING

Data mining attempts to discover patterns, trends, and relationships among data,
especially nonobvious and unexpected patterns. For example, an analysis might discover that
people who purchase skim milk also tend to purchase whole wheat bread, or that cars built on
Mondays before 10 a.m. on production line #5 using parts from supplier ABC have significantly
more defects than average. This new knowledge can then be used for more effective
management of a business.

The place to start is with a data warehouse. Typically, a data warehouse is a huge
database that is designed specifically to study patterns in data. A data warehouse is not the same
as the databases companies use for their day-to-day operations. A data warehouse should (1)
combine data from multiple sources to discover as many relationships as possible, (2) contain
accurate and consistent data, (3) be structured to enable quick and accurate responses to a
variety of queries, and (4) allow follow-up responses to specific relevant questions. In short, a
data warehouse represents a relatively new type of database, one that is specifically structured
to enable data mining. Another term you might hear is data mart. A data mart is essentially a
scaled-down data warehouse, or part of an overall data warehouse, that is structured specifically
for one part of an organization, such as sales. Virtually all large organizations, and many smaller
ones, have developed data warehouses or data marts in the past decade to enable them to better
understand their business—their customers, their suppliers, and their processes.

Once a data warehouse is in place, analysts can begin to mine the data with a collection
of methodologies and accompanying software. Some of the primary methodologies are
classification analysis, prediction, cluster analysis, market basket analysis, and forecasting. Each
of these is a large topic in itself, but some brief explanations follow.

▪ Classification analysis attempts to find variables that are related to a categorical (often
binary) variable. For example, credit card customers can be categorized as those who pay
their balances in a reasonable amount of time and those who don’t. Classification analysis
would attempt to find explanatory variables that help predict which of these two categories
a customer is in. Some variables, such as salary, are natural candidates for explanatory
variables, but an analysis might uncover others that are less obvious.
▪ Prediction is similar to classification analysis, except that it tries to find variables that help
explain a continuous variable, such as credit card balance, rather than a categorical
variable. Regression is one of the most popular prediction tools, but there are others not
covered in this module.
▪ Cluster analysis tries to group observations into clusters so that observations within a
cluster are alike, and observations in different clusters are not alike. For example, one
cluster for an automobile dealer’s customers might be middle-aged men who are not
married, earn over $150,000, and favor high-priced sports cars. Once natural clusters are
found, a company can then tailor its marketing to the individual clusters.
▪ Market basket analysis tries to find products that customers purchase together in the
same “market basket.” In a supermarket setting, this knowledge can help a manager
position or price various products in the store. In banking and other settings, it can help
managers to cross-sell (sell a product to a customer already purchasing a related product)
or up-sell (sell a more expensive product than a customer originally intended to purchase).
▪ Forecasting is used to predict values of a time series variable by extrapolating patterns
seen in historical data into the future. This is clearly an important problem in all areas of
business, including the forecasting of future demand for products, forecasting future stock
prices and commodity prices, and many others.

DATA EXPLORATION AND VISUALIZATION

Data mining is a relatively new field—or at least a new term—and not everyone agrees
with its definition. To many people, data mining is a collection of advanced algorithms that can be
used to find useful information and patterns in large data sets. Data mining does indeed include
a number of advanced algorithms, but we believe its definition should be broadened to include
relatively simple methods for exploring and visualizing data. This section discusses some of the
possibilities.

Online Analytical Processing (OLAP)

We introduced pivot tables in Chapter 4 as an amazingly easy and powerful way to break data
down by category in Excel®. However, the pivot table methodology is not limited to Excel or even
to Microsoft. This methodology is usually called online analytical processing, or OLAP. This
name was initially used to distinguish this type of data analysis from online transactional
processing, or OLTP.

When analysts began to realize that the typical OLTP databases are not well equipped to
answer these broader types of questions, OLAP was born. This led to much research into the
most appropriate database structure for answering OLAP questions. The consensus was that the
best structure is a star schema. In a star schema, there is at least one Facts table of data that has
many rows and only a few columns. For example, in a supermarket database, a Facts table might
have a row for each line item purchased, including the number of items of the product purchased,
the total amount paid for the product, and possibly the discount. Each row of the Facts table would
also list “lookup information” (or foreign keys, in database terminology) about the purchase: the
date, the store, the product, the customer, any promotion in effect, and possibly others. Finally,
the database would include a dimension table for each of these. For example, there would be a
Products table. Each row of this table would contain multiple pieces of information about a
particular product. Then if a customer purchases product 15, say, information about product 15
could be looked up in the Products table.

Most data warehouses are built according to these basic ideas. By structuring corporate
databases in this way, facts can easily be broken down by dimensions, and—you guessed it—
the methodology for doing this is pivot tables. However, these pivot tables are not just the
“standard” Excel pivot tables. You might think of them as pivot tables on steroids. The OLAP
methodology and corresponding pivot tables have several features that distinguish them
from standard Excel pivot tables.
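
The star-schema idea can be sketched in a few lines of Python with pandas; the tiny Facts and
Products tables below are invented, whereas a real warehouse would hold millions of fact rows:

import pandas as pd

# Facts table: one row per line item, with foreign keys to the dimension tables
facts = pd.DataFrame({
    "product_id": [15, 15, 22, 22, 15],
    "store_id":   [1, 2, 1, 2, 2],
    "quantity":   [2, 1, 3, 1, 4],
    "amount":     [5.0, 2.5, 9.0, 3.0, 10.0],
})

# Dimension table: lookup information about each product
products = pd.DataFrame({
    "product_id": [15, 22],
    "category":   ["Dairy", "Bakery"],
})

# Join facts to the dimension, then break the facts down by category (a pivot)
joined = facts.merge(products, on="product_id")
print(joined.pivot_table(values="amount", index="category", aggfunc="sum"))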

PowerPivot and Power View in Excel 2013

The general approach to data analysis embodied in pivot tables is one of the most powerful
ways to explore data sets. You learned about basic Excel pivot tables in Chapter 4, and you
learned about the more general OLAP technology in the previous subsection. This subsection
describes new Microsoft tools of the pivot table variety, PowerPivot and Power View, that were
introduced in Excel 2013. Actually, PowerPivot was available as a free add-in for Excel 2010, but
two things have changed in the version that is described here. First, you no longer need to
download a separate PowerPivot add-in. In Excel 2013, you can simply add it in by checking it in
the add-ins list. Second, the details of PowerPivot have changed. Therefore, if you find a tutorial
for the older PowerPivot add-in on the Web and try to follow it for Excel 2013, you will see that
the new version doesn’t work in the same way as before. So be aware that the instructions in this
section are relevant only for PowerPivot for Excel 2013 and not for the older version.

Among other things, the PowerPivot add-in allows you to do the following:

▪ Import millions of rows from multiple data sources
▪ Create relationships between data from different sources, and between multiple tables in
a pivot table
▪ Create implicit calculated fields (previously called measures) — calculations created
automatically when you add a numeric field to the Values area of the Field List
▪ Manage data connections

Interestingly, Microsoft refers to building a data model in Excel in its discussion of PowerPivot.
This is a somewhat new Microsoft term, and they have provided the following definition.

Data Model: A collection of tables and their relationships that reflects the real-world
relationships between business functions and processes—for example, how Products relates to
Inventory and Sales.

If you have worked with relational databases, this definition is nothing new. It is essentially
the definition of a relational database, a concept that has existed for decades. The difference is
that the data model is now contained entirely in Excel, not in Access or some other relational
database package.

Visualization Software

As the Power View tool from the previous subsection illustrates, you can gain a lot of
insight by using charts to view your data in imaginative ways. This trend toward powerful charting
software for data visualization is the wave of the future and will certainly continue. Although this
book is primarily about Microsoft software—Excel—many other companies are developing
visualization software. To get a glimpse of what is currently possible, you can watch the
accompanying video about a free software package, Tableau Public, developed by Tableau
Software. Perhaps you will find other visualization software packages, free or otherwise, that rival
Tableau or Power View. Alternatively, you might see blogs with data visualizations from ordinary
users. In any case, the purpose of charting software is to portray data graphically so that otherwise
hidden trends or patterns can emerge clearly.

MICROSOFT DATA MINING ADD-INS FOR EXCEL

The methods discussed so far in this chapter, all of which basically revolve around pivot
tables, are extremely useful for data exploration, but they are not always included in discussions
of “data mining.” To many analysts, data mining refers only to the algorithms discussed in the
remainder of this chapter. These include, among others, algorithms for classification and for
clustering. (There are many other types of data mining algorithms not discussed in this book.)
Many powerful software packages have been developed by software companies such as SAS,
IBM SPSS, Oracle, Microsoft, and others to implement these data mining algorithms.
Unfortunately, this software not only takes time to master, but it is also quite expensive. The only
data mining algorithms discussed here that are included in the software that accompanies the
book are logistic regression and neural nets, two classification methods that are part of the
Palisade suite, and they are discussed in the next section.

To provide you with illustrations of other data mining methods, we will briefly discuss
Microsoft data mining add-ins for Excel. The good news is that these add-ins are free and easy
to use. You can find them by searching the Web for Microsoft Data Mining Add-ins.

The names of these add-ins provide a clue to their downside. These add-ins are really
only front ends—client tools—for the Microsoft engine that actually performs the data mining
algorithms. This engine is called Analysis Services and is part of Microsoft’s SQL Server database
package. (SQL Server Analysis Services is often abbreviated as SSAS.) In short, Microsoft
decided to implement data mining in SSAS. Therefore, to use its Excel data mining add-ins, you
must have a connection to an SSAS server. This might be possible in your academic or corporate
setting, but it can definitely be a hurdle.

Classification Methods

The previous section introduced one of the most important problems studied in data
mining, the classification problem. This is basically the same problem attacked by regression
analysis—using explanatory variables to predict a dependent variable—but now the dependent
variable is categorical. It usually has two categories, such as Yes and No, but it can have more
than two categories, such as Republican, Democrat, and Independent. This problem has been
analyzed with very different types of algorithms, some regression-like and others very different
from regression, and this section discusses three of the most popular classification methods. But
each of the methods has the same objective: to use data from the explanatory variables to classify
each record (person, company, or whatever) into one of the known categories.

Before proceeding, it is important to discuss the role of data partitioning in classification
and in data mining in general. Data mining is usually used to explore very large data sets, with
many thousands or even millions of records. Therefore, it is very possible, and also very useful,
to partition the data set into two or even three distinct subsets before the algorithms are applied.
Each subset has a specified percentage of all records, and these subsets are typically chosen
randomly. The first subset, usually with about 70% to 80% of the records, is called the training
set. The second subset, called the testing set, usually contains the rest of the data. Each of these
sets should have known values of the dependent variable. Then the algorithm is trained with the
data in the training set. This results in a model that can be used for classification. The next step
is to test this model on the testing set. It is very possible that the model will work quite well on the
training set because this is, after all, the data set that was used to create the model. The real
question is whether the model is flexible enough to make accurate classifications in the testing
set.
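
A minimal sketch of the partitioning step with scikit-learn’s train_test_split; the random data
stands in for a real customer file, and the 70/30 split follows the guideline above:

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 records, 3 explanatory variables, a binary dependent variable
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out 30% of the records as the testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))   # 70 training records, 30 testing records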

Most data mining software packages have utilities for partitioning the data. (In the
following subsections, you will see that the logistic regression procedure in StatTools does not
yet have partitioning utilities, but the Palisade NeuralTools add-in for neural networks does have
them, and the Microsoft data mining add-in for classification trees also has them.) The various
software packages might use slightly different terms for the subsets, but the overall purpose is
always the same, as just described. They might also let you specify a third subset, often called a
prediction set, where the values of the dependent variable are unknown. Then you can use the
model to classify these unknown values. Of course, you won’t know whether the classifications
are accurate until you learn the actual values of the dependent variable in the prediction set.

Logistic Regression

Logistic regression is a popular method for classifying individuals, given the values of a
set of explanatory variables. It estimates the probability that an individual is in a particular
category. As its name implies, logistic regression is somewhat similar to the usual regression
analysis, but its approach is quite different. It uses a nonlinear function of the explanatory
variables for classification.

Logistic regression is essentially regression with a dummy (0–1) dependent variable.


For the two-category problem (the only version of logistic regression discussed here), the dummy
variable indicates whether an observation is in category 0 or category 1. One approach to the
classification problem, an approach that is sometimes actually used, is to run the usual multiple

regressions on the data, using the dummy variable as the dependent variable. However, this
approach has two serious drawbacks. First, it violates the regression assumption that the error
terms should be normally distributed. Second, the predicted values of the dependent variable can
be between 0 and 1, less than 0, or greater than 1. If you want a predicted value to estimate a
probability, then values less than 0 or greater than 1 make no sense.

Therefore, logistic regression takes a slightly different approach. Let X1 through Xk be the
potential explanatory variables, and create the linear function b0+ b1 X1 + ⋯ + bkXk. Unfortunately,
there is no guarantee that this linear function will be between 0 and 1, and hence that it will qualify
as a probability. But the nonlinear function

1 / (1 + e^−(b0 + b1X1 + ⋯ + bkXk))

is always between 0 and 1. In fact, the function f(x) = 1/(1 + e^−x) is an “S-shaped logistic” curve,
as shown in Figure 12. For large negative values of x, the function approaches 0, and for large
positive values of x, it approaches 1.

Figure 12. S-shaped Logistics Curve

The logistic regression model uses this function to estimate the probability that any
observation is in category 1. Specifically, if p is the probability of being in category 1, the model

p = 1 / (1 + e^−(b0 + b1X1 + ⋯ + bkXk))

is estimated. This equation can be manipulated algebraically to obtain an equivalent form:

ln(p / (1 − p)) = b0 + b1X1 + ⋯ + bkXk

This equation says that the natural logarithm of p/(1 − p) is a linear function of the
explanatory variables. The ratio p/(1 − p) is called the odds ratio.

The odds ratio is a term frequently used in everyday language. Suppose, for example, that
the probability p of a company going bankrupt is 0.25. Then the odds that the company will go
bankrupt are p/(1 − p) = 0.25/0.75 = 1/3, or “1 to 3.” Odds ratios are probably most common in
sports. If you read that the odds against Indiana winning the NCAA basketball championship are
4 to 1, this means that the probability of Indiana winning the championship is 1/5. Or if you read
that the odds against Purdue winning the championship are 99 to 1, then the probability that
Purdue will win is only 1/100.

The logarithm of the odds ratio, the quantity on the left side of the above equation, is called
the logit (or log odds). Therefore, the logistic regression model states that the logit is a linear
function of the explanatory variables. Although this is probably a bit mysterious and there is no
easy way to justify it intuitively, logistic regression has produced useful results in many
applications.

Although the numerical algorithm used to estimate the regression coefficients is complex,
the important goal for our purposes is to interpret the regression coefficients correctly. First, if a
coefficient b is positive, then if its X increases, the log odds increases, so the probability of being
in category 1 increases. The opposite is true for a negative b. So just by looking at the signs of
the coefficients, you can see which Xs are positively correlated with being in category 1 (the
positive bs) and which are positively correlated with being in group 0 (the negative bs). You can
also look at the magnitudes of the bs to try to see which of the Xs are “most important” in
explaining category membership. Unfortunately, you run into the same problem as in regular
regression. Some Xs are typically of completely different magnitudes than others, which makes
comparisons of the bs difficult. For example, if one X is income, with values in the thousands, and
another X is number of children, with values like 0, 1, and 2, the coefficient of income will probably
be much smaller than the coefficient of children, even though these two variables might be equally
important in explaining category membership.
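
A minimal sketch of fitting and reading a two-category logistic regression with scikit-learn; the
income-and-children data is invented, and the point is simply how the signs of the coefficients
and the predicted probability are obtained:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Explanatory variables: income (in thousands) and number of children; y = 1 means category 1
X = np.array([[35, 0], [42, 1], [55, 2], [61, 0], [72, 3], [80, 1], [95, 2], [110, 0]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.coef_, model.intercept_)        # signs show the direction of each effect

# Estimated probability of category 1 for a new record, i.e., 1 / (1 + e^(-logit))
print(model.predict_proba([[65, 1]])[0, 1])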

Classification Trees

The two classification methods discussed so far, logistic regression and neural networks,
use complex nonlinear functions to capture the relationship between explanatory variables and a
categorical dependent variable. The method discussed in this subsection, classification trees, is

also capable of discovering nonlinear relationships, but it is much more intuitive. This method,
which has many variations, has existed for decades, and it has been implemented in a variety of
software packages. Unfortunately, it is not available in any of the software that accompanies this
book, but it is available in the free Microsoft Data Mining Add-Ins discussed earlier. The essential
features of the method are explained here, and the accompanying video, Decision Trees with
Microsoft Data Mining Add-In, illustrates the method.

The attractive aspect of this method is that the final result is a set of simple rules for
classification. As an example, the final tree might look like the one in Figure 13. Each box has a
bar that shows the purity of the corresponding box, where blue corresponds to Yes values and
red corresponds to No values. The first split, actually a three-way split, is on Mall Trips: fewer
than 4, 4 or 5, and at least 6. Each of these is then split in a different way. For example, when
Mall Trips is fewer than 4, the split is on Nbhd West versus Nbhd not West. The splits you see
here are the only ones made. They achieve sufficient purity, so the algorithm stops splitting after
these.

Predictions are then made by majority rule. As an example, suppose a person has made
3 mall trips and lives in the East. This person belongs in the second box down on the right, which
has a large majority of No values. Therefore, this person is classified as a No. In contrast, a person
with 10 mall trips belongs in one of the two bottom boxes on the right. This person is classified as
a Yes because both of these boxes have a large majority of Yes values. In fact, the last split on
Age is not really necessary.

This classification tree leads directly to the following rules.

▪ If the person makes fewer than 4 mall trips:
o If the person lives in the West, classify as a trier.
o If the person doesn’t live in the West, classify as a nontrier.
▪ If the person makes 4 or 5 mall trips:
o If the person doesn’t live in the East, classify as a trier.
o If the person lives in the East, classify as a nontrier.
▪ If the person makes at least 6 mall trips, classify as a trier.

The ability of classification trees to provide such simple rules, plus fairly accurate
classifications, has made this a very popular classification technique.
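
The rules above translate directly into code. A sketch follows; the field names mall_trips and
neighborhood are placeholders for whatever the actual data set uses:

def classify(mall_trips, neighborhood):
    # Rules read directly off the classification tree described above
    if mall_trips < 4:
        return "trier" if neighborhood == "West" else "nontrier"
    if mall_trips <= 5:                      # 4 or 5 mall trips
        return "nontrier" if neighborhood == "East" else "trier"
    return "trier"                           # at least 6 mall trips

print(classify(3, "East"))    # nontrier, like the 3-trip East-side shopper above
print(classify(10, "West"))   # trier, like the 10-trip shopper above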

Clustering

In data mining terminology, the classification methods in the previous section are called
supervised data mining techniques. This term indicates that there is a dependent variable the
method is trying to predict. In contrast, the clustering methods discussed briefly in this section are
called unsupervised data mining techniques. Unsupervised methods have no dependent
variable. Instead, they search for patterns and structure among all of the variables.

Clustering is probably the most common unsupervised method, and it is the only one
discussed here. However, another popular unsupervised method you might encounter is market
basket analysis (also called association analysis), where patterns of customer purchases are
examined to see which items customers tend to purchase together, in the same “market basket.”
This analysis can be the basis for product shelving arrangements, for example.

Clustering, known in marketing circles as segmentation, tries to group entities
(customers, companies, cities, or whatever) into similar clusters, based on the values of their
variables. This method bears some relationship to classification, but the fundamental difference
is that in clustering, there are no fixed groups like the triers and nontriers in classification. Instead,
the purpose of clustering is to discover the number of groups and their characteristics, based
entirely on the data.

Clustering methods have existed for decades, and a wide variety of clustering methods
have been developed and implemented in software packages. The key to all of these is the
development of a dissimilarity measure. Specifically, to compare two rows in a data set, you need
a numeric measure of how dissimilar they are. Many such measures are used. For example, if
two customers have the same gender, they might get a dissimilarity score of 0, whereas two
customers of different genders might get a dissimilarity score of 1. Or if the incomes of two
customers are compared, they might get a dissimilarity score equal to the squared difference
between their incomes. The dissimilarity scores for different variables are then combined in some

way, such as normalizing and then summing, to get a single dissimilarity score for the two rows
as a whole.

Once a dissimilarity measure is developed, a clustering algorithm attempts to find clusters
of rows so that rows within a cluster are similar and rows in different clusters are dissimilar. Again,
there are many ways to do this, and many variations appear in different software packages. For
example, the package might let you specify the number of clusters ahead of time, or it might
discover this number automatically.

In any case, once an algorithm has discovered, say, five clusters, your job is to understand
(and possibly name) these clusters. You do this by exploring the distributions of variables
in different clusters. For example, you might find that one cluster is composed mostly of older
women who live alone and have modest incomes, whereas another cluster is composed mostly
of wealthy married men.
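
A minimal sketch of clustering with scikit-learn’s k-means; the two-variable customer data and
the choice of three clusters are illustrative assumptions, and real work would also involve
validating the number of clusters:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative customer data: age and income (in thousands)
customers = np.array([[25, 30], [27, 35], [24, 28],
                      [45, 90], [48, 95], [50, 88],
                      [70, 40], [72, 38], [68, 42]])

# Normalize so that one variable does not dominate the dissimilarity measure
scaled = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
print(kmeans.labels_)                        # cluster membership for each customer

# Profile each cluster by averaging the original variables within it
for c in range(3):
    print(c, customers[kmeans.labels_ == c].mean(axis=0))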

CHAPTER EXERCISES

1. What is data mining used for?

2. How does the OLAP methodology allow you to drill down in a pivot table?

3. What is the main purpose of logistic regression? How does it differ from ordinary linear
regression?

SUGGESTED READINGS

Read Data mining at http://sas.com/n_ph/insights/analytics/data-mining.html

REFERENCES

Albright, S.C & Winston, W. (2015). Business Analytics: Data Analysis and Decision Making,
Fifth Edition. Cengage Learning, USA.

Inmon, W. (2002). Building the Data Warehouse, 3rd ed. John Wiley & Sons,Inc., Canada.

Ragsdale, C. (2014). Spreadsheet Modeling and Decision Analysis: A Practical Introduction to
Business Analytics, 5th Edition. Thompson South-Western, USA.

Tripathi, S.S. (2016). Learn Business Analytics in Six Steps Using SAS and R. Apress Media ,
LLC.

