LEARNING MODULE
IN
CUSTOMER ANALYTICS
BUMA 30093
Mecmack A. Nartea
COURSE DESCRIPTION:
The course provides students with an overview of the current trends in business/customer analytics that drive
today’s business. The course will provide an understanding of data management techniques that can help an
organization achieve its business goals and address operational challenges.
COURSE OBJECTIVES:
TABLE OF CONTENTS
Page No.
Key Terms in Sampling 67
Sample Size Selection 68
Confidence Intervals 68
What Is The P-Value? 69
Errors In Hypothesis Testing 70
Sampling Distributions 72
Parametric Tests 74
Nonparametric Tests 76
Chapter Exercise 77
Chapter 7 Data Mining 78
Introduction to Data Mining 79
Data Exploration and Visualization 80
Online Analytical Processing (OLAP) 80
PowerPivot and Power View in Excel 2013 81
Visualization Software 82
Microsoft Data Mining Add-Ins For Excel 83
Classification Methods 83
Logistic Regression 84
Classification Trees 87
Clustering 88
Chapter Activity 89
CHAPTER 1
THE PROCESS OF ANALYTICS
OVERVIEW
This chapter discusses how business analytics is used in daily life. It
further discusses the various software used in analytics. The history of
how and when analytics started is also tackled in this chapter.
OBJECTIVES
What Is Analytics? What Does a Data Analyst Do?
A casual search on the Internet for data scientist turns up the fact that there is a
substantial shortage of manpower for this job. In addition, Harvard Business Review has
published an article called “Data Scientist: The Sexiest Job of the 21st Century.” So, what does
a data analyst actually do?
To put it simply, analytics is the use of numbers or business data to find solutions for
business problems. Thus, a data analyst looks at the data that has been collected across
huge enterprise resource planning (ERP) systems, Internet sites, and mobile applications.
In the “old days,” we just called upon an expert, who was someone with a lot of
experience. We would then take that person’s advice and decide on the solution. It’s much
like we visit the doctor today, who is a subject-matter expert.
An Example
The next question often is, what do I mean by “use of numbers”? Will you have to do
math again?
The last decade has seen the advent of software as a service (SaaS) in all walks of
information gathering and manipulation. Thus, analytics systems now are button-driven
systems that do the calculations and provide the results. An analyst or data scientist has to
look at these results and make recommendations for the business to implement. For example,
say a bank wants to sell loans in the market. It has data of all the customers who have taken
loans from the bank over the last 20 years. The portfolio is of, say, 1 million loans. Using this
data, the bank wants to understand which customers it should give pre-approved loan offers
to.
The simplest answer may be as follows: all the customers who paid on time every time
in their earlier loans should get a pre-approved loan offer. Let’s call this set of customers
Segment A. But on analysis, you may find that customers who defaulted but paid the loan
after the default actually made more money for the bank because they paid interest plus the
late payment charges. Let’s call this set Segment B.
Hence, you can now say that you want to send out an offer letter to Segment A +
Segment B.
However, within Segment B there was a set of customers to whose homes you had to send
collections teams to collect the money. So, they paid interest plus the late
payment charges minus the collection cost. This set is Segment C.
You could do this exercise using the decision tree technique that cuts your data into
segments (Figure 1-1).
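The segment logic above can be sketched as a simple rule-based split; a real analysis would fit a decision tree on historical loan data, but a minimal Python sketch (with invented records and field names) looks like this:

```python
# Hypothetical loan records; the field names are invented for illustration.
loans = [
    {"id": 1, "defaulted": False, "collections_visit": False, "profit": 120},
    {"id": 2, "defaulted": True,  "collections_visit": False, "profit": 180},
    {"id": 3, "defaulted": True,  "collections_visit": True,  "profit": 40},
    {"id": 4, "defaulted": False, "collections_visit": False, "profit": 110},
]

def segment(loan):
    """Assign each loan to one of the segments described in the text."""
    if not loan["defaulted"]:
        return "A"              # paid on time, every time
    if not loan["collections_visit"]:
        return "B"              # defaulted, then paid interest + late fees
    return "C"                  # needed a collections visit

# Segments A and B get the pre-approved loan offer.
offer = [l["id"] for l in loans if segment(l) in ("A", "B")]
print(offer)
```

A real decision tree would learn these split points from the data rather than having them hand-coded, but the segment boundaries it produces have this same if/else shape.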
A Typical Day
The last question to tackle is, what does the workday of an analytics professional look
like? It probably encompasses the following:
• The data analyst will walk into the office and be told about the problem that the
business needs input on.
• The data analyst will determine the best way to solve the problem.
• The data analyst will then gather the relevant data from the large data sets stored in
the server.
• Next, the data analyst will import the data into the analytics software.
• The data analyst will run the technique through the software (SAS, R, SPSS, XLSTAT,
and so on).
• The data analyst will study the output and prepare a report with recommendations.
So, is analytics the right career for you? Here are some points that will help you decide:
Do you believe that data should be the basis of all decisions? Take up analytics only
if your answer to this question is an unequivocal yes. Analytics is the process of
using and analyzing a large quantum of data (numbers, text, images, and so on) by
aggregating, visualizing/creating dashboards, checking repetitive trends, and
creating models on which decisions can be made. Only people who innately believe
in the power of data will excel in this field. If some prediction/analysis is wrong, the
attitude of a good analyst is that it is because the data was not appropriate for the
analysis or the technique used was incorrect. You will never doubt that a correct
decision will be made if the relevant data and appropriate techniques are used.
Do you like to constantly learn new stuff? Take up analytics only if your answer to
this question is an unequivocal yes. Analytics is a new field. There is a constant
increase in the avenues of data: Internet data, social networking
information, mobile transaction data, and near-field communication devices. There
are constant changes in technology to store, process, and analyze this data. Hadoop,
Google updates, and so on, have become increasingly important. Cloud computing
and data management are common now. Economic cycles have shortened, and
model building has become more frequent as older models get redundant. Even the
humble Excel has an Analysis ToolPak in Excel 2010 with statistical functions. In
other words, be ready for change.
Do you like to interpret outcomes and then track them to see whether your
recommendations were right? Take up analytics only if your answer to this question
is an unequivocal yes. A data analyst will work on a project, and the implementation
of the recommendations will generally be valid for a reasonably long period of time,
perhaps a year or even three to five years. A good analyst should be interested to
know how accurate the recommendations have been and should want to track the
performance periodically. You should ideally also be the first person to be able to
say when the analysis is not working and needs to be reworked.
Are you ready to go back to a textbook and brush up on the concepts of math and
statistics? Take up analytics only if your answer to this question is an unequivocal
yes. To accurately handle data and interpret results, you will need to brush up on
the concepts of math and statistics. It becomes important to justify why you chose a
particular path during analysis versus others. Business users will not accept your
word blindly.
Do you like debating and logical thinking? Take up analytics only if your answer to
this question is an unequivocal yes. As there is no one solution to all problems, an
analyst has to choose the best way to handle the project/problem at hand. The
analyst has to be able to not only know the best way to analyze the data but also
give the best recommendation in the given time constraints and budget constraints.
This sector generally has a very open culture where the analyst working on a
project/problem will be required to give input irrespective of the analyst’s position in
the hierarchy.
Do check your answers to the previous questions. If you said yes for three out of these
five questions and an OK for two, then analytics is a viable career option for you. Welcome to
the world of analytics!
Statistics is defined as “the practice or science of collecting and analyzing numerical data in large quantities,
especially for the purpose of inferring proportions in a whole from those in a
representative sample.”1
Most of us start working with numbers, counting, and math by the time we are five
years old. Math includes addition, subtraction, theorems, rules, and so on. Statistics is when
we start using math concepts to work on real-life data.
Statistics is derived from the Latin word status, the Italian word statista, or the German
word statistik, each of which means a political state. This word came into being somewhere
around 1780 to 1790.
In ancient times, the government collected the information regarding the population,
property, and wealth of the country. This enabled the government to get an idea of the
manpower of the country and became the basis for introducing taxes and levies. Statistics is
the practical part of math.
Nuts and bolts held the industrialization process together; in 1800, Henry Maudslay
developed the first practical screw-cutting lathe. This allowed for the standardization of screw
thread sizes and paved the way for the practical application of interchangeability for nuts and
bolts. Before this, screw threads were usually made by chipping and filing manually.
Maudslay standardized the screw threads used in his workshop and produced sets of
nuts and bolts to those standards so that any bolt of the appropriate size would fit any nut of
the same size.
Joseph Whitworth’s screw thread measurements were adopted as the first unofficial
national standard by companies in Britain in 1841 and came to be known as the British
Standard Whitworth.
By the end of the 19th century, differences in standards between companies were
making trading increasingly difficult. The Engineering Standards Committee was established
in London in 1901. By the mid-to-late 19th century, efforts had already been made to standardize
electrical measurements. Many companies had entered the market in the 1890s, and all chose
their own settings for voltage, frequency, current, and even the symbols used in circuit
diagrams, making standardization necessary for electrical measurements.
This was followed by the factory system, with its emphasis on product inspection.
After the United States entered World War II, quality became a critical component
since bullets manufactured in one state had to work with guns manufactured in another state.
Initially, the U.S. Army had to inspect every piece of machinery manually, but this was very time-
consuming. Statistical techniques such as sampling started being used to speed up the
process.
Japan around this time was also becoming conscious of quality. The quality initiative
started with a focus on defects and products and then moved on to look at the process used
for creating these products. Companies invested in training their workforce on Total Quality
Management (TQM) and statistical techniques.
Statistical Process Control, dating from the early 1920s, is a method of quality control using
statistical methods, in which monitoring and controlling the process ensures that it operates at
its full potential. At its full potential, a process can churn out as much conforming,
standardized product as possible with a minimum of waste.
The advantage of Statistical Process Control (SPC) over the methods of quality control
such as inspection is that it emphasizes early detection and prevention of problems rather
than correcting problems after they occur.
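A minimal sketch of the SPC idea, assuming a Shewhart-style chart with mean ± 3 sigma control limits; the sample measurements below are invented:

```python
import statistics

# Measurements of some process characteristic (invented sample data).
measurements = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7, 10.1, 10.0]

mean = statistics.mean(measurements)
sigma = statistics.pstdev(measurements)
ucl = mean + 3 * sigma   # upper control limit
lcl = mean - 3 * sigma   # lower control limit

# Any point outside the limits signals the process may be out of control,
# prompting early investigation rather than after-the-fact correction.
out_of_control = [x for x in measurements if x > ucl or x < lcl]
print(round(mean, 2), round(ucl, 2), round(lcl, 2), out_of_control)
```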
What was happening on the government front? The most data was being
captured and used by the military. A lot of the business terminologies and processes used
today have been copied from the military: sales campaigns, marketing strategy, business
tactics, business intelligence, and so on.
As mentioned, statistics made a big difference during World War II. For instance, the
Allied forces accurately estimated the production of German tanks using statistical methods.
They also used statistics and logical rules to decode German messages.
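The tank estimate mentioned above is the classic “German tank problem”: given k captured serial numbers with maximum m, the standard minimum-variance unbiased estimate of the total produced is m + m/k - 1. A short sketch:

```python
def estimate_total(serials):
    """German tank problem: estimate total production from observed serials."""
    k = len(serials)            # number of captured serial numbers
    m = max(serials)            # largest serial number seen
    return m + m / k - 1

# Four captured tanks with serial numbers up to 60 (invented data).
print(estimate_total([19, 40, 42, 60]))  # -> 74.0
```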
The Kerrison Predictor was one of the first fully automated anti-aircraft fire control systems;
it could aim a gun at an aircraft based on simple inputs such as the angle to the target and the
observed speed. The British Army used it effectively in the early 1940s.
The Manhattan Project was a U.S. government research project in 1942–1945 that
produced the first atomic bomb. Under this project, the first atomic bomb was exploded in July 1945
at a site in New Mexico. The following month, the other atomic bombs produced by
the project were dropped on Hiroshima and Nagasaki, Japan. The project used statistics to
run simulations and predict the behavior of nuclear chain reactions.
Weather predictions, especially of rain, affected the world economy the most since
weather affected the agriculture industry. The first attempt to forecast the weather
numerically was made in 1922 by Lewis Fry Richardson.
The first successful numerical prediction was performed using the ENIAC digital
computer in 1950 by a team of American meteorologists and mathematicians.2
Then, 1956 saw analytics solve the shortest-path problem in travel and logistics,
radically changing these industries.
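The shortest-path computation behind such routing can be sketched with Dijkstra's algorithm, published in this period; the route graph below is invented:

```python
import heapq

def shortest_path_cost(graph, start, goal):
    """Dijkstra's algorithm: cheapest total cost from start to goal.

    graph maps each node to a list of (neighbor, cost) edges.
    Returns None if the goal is unreachable.
    """
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue                     # stale heap entry, skip
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None

# A tiny logistics network: cities A-D with travel costs.
routes = {"A": [("B", 5), ("C", 2)], "C": [("B", 1), ("D", 7)], "B": [("D", 3)]}
print(shortest_path_cost(routes, "A", "D"))  # A -> C -> B -> D = 6
```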
In 1956 FICO was founded by engineer Bill Fair and mathematician Earl Isaac on the
principle that data used intelligently can improve business decisions. In 1958 FICO built its
first credit scoring system for American investments, and in 1981 the FICO credit bureau risk
score was introduced.3
By the 1980s, manufacturing resource planning (MRP) systems were introduced with an
emphasis on optimizing manufacturing processes by synchronizing materials with production
requirements. Starting in the late 1980s, software systems known as enterprise resource
planning systems became the drivers of data accumulation in business. ERP systems are
software systems for business management, including modules supporting functional areas
such as planning, manufacturing, sales, marketing, distribution, accounting, and so on. ERP
systems were a leg up over MRP systems: they include modules related not only to
manufacturing but also to services and maintenance.
Typically, early business applications and ERP systems had their own databases that
supported their functions. This meant that data was in silos because no other system had
access to it. Businesses soon realized that the value of data can increase manyfold if all the
data is in one system together. This led to the concept of a data warehouse and then an
enterprise data warehouse (EDW) as a single system for the repository of all the
organization’s data. Thus, data could be acquired from a variety of incompatible systems and
brought together using extract, transform, load (ETL) processes. Once the data is collected
from the many diverse systems, the captured data needs to be converted into information and
knowledge in order to be useful. The business intelligence (BI) systems could therefore give
much more coherent intelligence to businesses and introduce the concepts of one view of
customers and customer lifetime value.
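A toy sketch of the extract, transform, load (ETL) idea described above: records from two incompatible source systems are normalized onto one key and one set of types, then loaded into a single warehouse table. All system names and fields here are invented for illustration:

```python
# Extract: rows pulled from two hypothetical silos that disagree on
# field names and data types.
crm_rows = [{"CustID": "7", "Name": "Dana Cruz"}]
billing_rows = [{"customer_id": 7, "total_spend": "1500.00"}]

def transform(crm, billing):
    """Reconcile both sources into one consistent warehouse shape."""
    by_id = {int(r["CustID"]): r for r in crm}   # normalize the key to int
    warehouse = []
    for b in billing:
        cust = by_id.get(b["customer_id"], {})
        warehouse.append({
            "customer_id": b["customer_id"],          # one consistent key
            "name": cust.get("Name"),                 # joined from the CRM silo
            "total_spend": float(b["total_spend"]),   # typed, not a string
        })
    return warehouse

# Load: in a real pipeline this would be written to the EDW.
print(transform(crm_rows, billing_rows))
```

Once both silos land in the same table with the same key, a BI layer can finally produce the “one view of the customer” the text describes.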
One advantage of an EDW is that business intelligence is now much more exhaustive.
Though business intelligence is a good way to use graphs and charts to get a view of business
progress, it does not use high-end statistical processes to derive greater value from the data.
The next question that businesses wanted to answer by the 1990s–2000s was how data
could be used more effectively to understand embedded trends and predict future trends.
The business world was waking up to predictive analytics.
What are the types of analytics that exist now? The analytics journey generally starts
off with the following:
Differences statistics: This enables businesses to know how the data is changing or if
it’s the same.
Associative statistics: This enables businesses to know the strength and direction of
associations within data.
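The “strength and direction of associations” can be illustrated with the Pearson correlation coefficient, a standard associative statistic; the data below is invented:

```python
import statistics

# Invented paired observations: ad spend vs. sales.
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 25, 31, 44, 52]

def pearson(x, y):
    """Pearson correlation: +1 perfect positive, -1 perfect negative, 0 none."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(ad_spend, sales)
print(round(r, 3))  # close to +1: a strong positive association
```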
Fortunately, we live in an era of software that can help us do the math, which means
analysts can focus on the following:
Pinpointing the technique in statistics that will be used to reach the solution
CHAPTER EXERCISES
Direction: Discuss the following questions. Write your answers on short bond paper.
1. How is analytics applicable in your daily life? Cite examples to substantiate your
answer.
2. Is there really a need to include analytics in the education curriculum? Justify your
answer.
SUGGESTED READINGS
http://journals.ametsoc.org/doi/pdf/10.1175/BAMS-89-1-45
www.fico.com/en/about-us#our_history
www.oxforddictionaries.com/definition/english/statistics
CHAPTER 2
ANALYTICS: A COMPREHENSIVE STUDY
OVERVIEW
OBJECTIVES
Business analytics (BA) refers to the skills, technologies, and practices for continuous iterative
exploration and investigation of past business performance to gain insight and drive business
planning. Business analytics focuses on developing new insights and an understanding of business
performance based on data and statistical methods. In contrast, business intelligence traditionally
focuses on using a consistent set of metrics to both measure past performance and guide
business planning, which is also based on data and statistical methods.
Business analytics makes extensive use of statistical analysis, including explanatory and
predictive modeling, and fact-based management to drive decision making. It is therefore closely
related to management science. Analytics may be used as input for human decisions or may drive
fully automated decisions. Business intelligence is querying, reporting, online analytical
processing (OLAP), and “alerts.”
In other words, querying, reporting, OLAP, and alert tools can answer questions such as
what happened, how many, how often, where the problem is, and what actions are needed.
Business analytics can answer questions like why is this happening, what if these trends continue,
what will happen next (that is, predict), what is the best that can happen (that is, optimize).
Examples of Application
Banks, such as Capital One, use data analysis (or analytics, as it is also called in the
business setting) to differentiate among customers based on credit risk, usage, and other
characteristics and then to match customer characteristics with appropriate product offerings.
Harrah’s, the gaming firm, uses analytics in its customer loyalty programs. E & J Gallo Winery
quantitatively analyzes and predicts the appeal of its wines. Between 2002 and 2005, Deere &
Company saved more than $1 billion by employing a new analytical tool to better optimize
inventory. A telecoms company that pursues efficient call centre usage over customer service
may save money.
Types of Analytics
• Decision analytics: supports human decisions with visual analytics that the user
models to reflect reasoning.
• Descriptive analytics: gains insight from historical data with reporting, scorecards,
clustering, etc.
• Predictive analytics: employs predictive modeling using statistical and machine learning
techniques.
• Prescriptive analytics: recommends decisions using optimization, simulation, etc.
History
Analytics has been used in business since the time management exercises put into
place by Frederick Winslow Taylor in the late 19th century. Henry Ford measured the time of each
component in his newly established assembly line. But analytics began to command more
attention in the late 1960s when computers were used in decision support systems. Since then,
analytics has changed and developed with the growth of enterprise resource planning (ERP)
systems, data warehouses, and a large number of other software tools and processes.
In later years, business analytics exploded with the introduction of computers.
This change has brought analytics to a whole new level and has made the possibilities endless.
Given how far analytics has come and what the field looks like today, many
people would never think that analytics started in the early 1900s with Mr. Ford himself.
Business analytics depends on sufficient volumes of high quality data. The difficulty in
ensuring data quality is integrating and reconciling data across different systems, and then
deciding what subsets of data to make available.
Previously, analytics was considered a type of after-the-fact method of forecasting
consumer behavior by examining the number of units sold in the last quarter or the last year. This
type of data warehousing required a lot more storage space than it did speed. Now business
analytics is becoming a tool that can influence the outcome of customer interactions. When a
specific customer type is considering a purchase, an analytics-enabled enterprise can modify the
sales pitch to appeal to that consumer. This means the storage space for all that data must react
extremely fast to provide the necessary data in real-time.
Competing on Analytics
Organizations that compete on analytics typically exhibit the following characteristics:
• One or more senior executives who strongly advocate fact-based decision making and,
specifically, analytics
• Widespread use of not only descriptive statistics but also predictive modeling and
complex optimization techniques
• Movement toward an enterprise-level approach to managing analytical tools, data, and
organizational skills and capabilities
DEFINITION OF ANALYTICS
Organizations may apply analytics to business data to describe, predict, and improve
business performance. Specifically, areas within analytics include predictive analytics,
prescriptive analytics, enterprise decision management, retail analytics, store assortment and
stock-keeping unit optimization, marketing optimization and marketing mix modeling, web
analytics, sales force sizing and optimization, price and promotion modeling, predictive science,
credit risk analysis, and fraud analytics. Since analytics can require extensive computation, the
algorithms and software used for analytics harness the most current methods in computer science,
statistics, and mathematics.
Analytics is multidisciplinary. There is extensive use of mathematics and statistics, and the use
of descriptive techniques and predictive models to gain valuable knowledge from data (data
analysis). The insights from data are used to recommend action or to guide decision making rooted
in business context. Thus, analytics is not so much concerned with individual analyses or analysis
steps, but with the entire methodology. There is a pronounced tendency to use the term analytics
in business settings (e.g., text analytics vs. the more generic text mining) to emphasize this broader
perspective. There is an increasing use of the term advanced analytics, typically used to describe
the technical aspects of analytics, especially in emerging fields such as the use of machine
learning techniques like neural networks to do predictive modeling.
Examples of Analytics
Marketing Optimization
Marketing has evolved from a creative process into a highly data-driven process.
Marketing organizations use analytics to determine the outcomes of campaigns or efforts and to
guide decisions for investment and consumer targeting. Demographic studies, customer
segmentation, conjoint analysis and other techniques allow marketers to use large amounts of
consumer purchase, survey and panel data to understand and communicate marketing strategy.
Analysis techniques frequently used in marketing include marketing mix modeling, pricing
and promotion analyses, sales force optimization, and customer analytics (e.g., segmentation). Web
analytics and optimization of web sites and online campaigns now frequently work hand in hand
with the more traditional marketing analysis techniques. A focus on digital media has slightly
changed the vocabulary, so that marketing mix modeling is commonly referred to as attribution
modeling in the digital context.
These tools and techniques support both strategic marketing decisions (such as how
much overall to spend on marketing, how to allocate budgets across a portfolio of brands and the
marketing mix) and more tactical campaign support, in terms of targeting the best potential
customer with the optimal message in the most cost effective medium at the ideal time.
Portfolio Analytics
The least risky loan may be to the very wealthy, but there are a very limited number of
wealthy people. On the other hand, there are many poorer borrowers who can be lent to, but at
greater risk. Some balance must be struck that maximizes return and minimizes risk. The analytics
solution may combine time series analysis with many other issues in order to make decisions on
when to lend money to these different borrower segments, or decisions on the interest rate charged
to members of a portfolio segment to cover any losses among members in that segment.
Risk Analytics
Predictive models in the banking industry are developed to bring certainty to the risk
scores of individual customers. Credit scores are built to predict an individual’s delinquency behavior
and are widely used to evaluate the creditworthiness of each applicant. Furthermore, risk analyses
are carried out in the scientific world and the insurance industry. Risk analytics is also extensively
used by financial institutions, such as online payment gateway companies, to analyze whether a
transaction is genuine or fraudulent, using the transaction history of the customer. This is most
common in credit card purchases: when there is a sudden spike in a customer’s transaction
volume, the customer gets a confirmation call to verify that the transaction was initiated by
him/her. This helps reduce losses due to such circumstances.
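The spike check described above can be sketched as a simple rule: flag a transaction that is far above the customer's recent history. The threshold and data are invented for illustration; production systems use far richer models:

```python
import statistics

def is_spike(history, amount, k=3.0):
    """Flag a transaction amount far above the customer's recent mean."""
    mean = statistics.mean(history)
    sd = statistics.pstdev(history)
    # The max(sd, 1.0) floor avoids flagging everything when history is flat.
    return amount > mean + k * max(sd, 1.0)

# A customer's recent transaction amounts (invented).
past = [40, 55, 35, 60, 50, 45]
print(is_spike(past, 52))    # ordinary purchase -> False
print(is_spike(past, 900))   # sudden spike -> True, trigger a confirmation call
```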
Digital Analytics
Digital analytics is a set of business and technical activities that define, create, collect,
verify, or transform digital data into reporting, research, analyses, recommendations, optimizations,
predictions, and automations. This also includes SEO (search engine optimization), where
keyword searches are tracked and that data is used for marketing purposes. Even banner ads
and clicks come under digital analytics. Marketing firms rely on digital analytics for their digital
marketing assignments, where MROI (marketing return on investment) is important.
Security Analytics
Security analytics refers to information technology (IT) solutions that gather and analyze
security events to bring situational awareness and enable IT staff to understand and analyze
events that pose the greatest risk. Solutions in this area include security information and event
management solutions and user behavior analytics solutions.
Software Analytics
Software analytics is the process of collecting information about the way a piece of
software is used and produced.
Challenges
In the industry of commercial analytics software, an emphasis has emerged on solving the
challenges of analyzing massive, complex data sets, often when such data is in a constant state
of change. Such data sets are commonly referred to as big data. Whereas once the problems
posed by big data were only found in the scientific community, today big data is a problem for
many businesses that operate transactional systems online and, as a result, amass large volumes
of data quickly.
The analysis of unstructured data types is another challenge getting attention in the
industry. Unstructured data differs from structured data in that its format varies widely and it cannot
be stored in traditional relational databases without significant effort at data transformation.
Sources of unstructured data, such as email, the contents of word processor documents, PDFs,
geospatial data, etc., are rapidly becoming a relevant source of business intelligence for
businesses, governments, and universities. For example, the discovery in Britain that a
company was illegally selling fraudulent doctor’s notes to assist people in defrauding
employers and insurance companies is an opportunity for insurance firms to increase the
vigilance of their unstructured data analysis. The McKinsey Global Institute estimates that big
data analysis could save the American health care system $300 billion per year and the European
public sector €250 billion.
These challenges are the current inspiration for much of the innovation in modern analytics
information systems, giving birth to relatively new machine analysis concepts such as complex
event processing, full text search and analysis, and even new ideas in presentation. One such
innovation is the introduction of grid-like architecture in machine analysis, allowing increases in
the speed of massively parallel processing by distributing the workload to many computers all
with equal access to the complete data set.
One more emerging challenge is dynamic regulatory needs. For example, in the banking
industry, Basel and future capital adequacy needs are likely to make even smaller banks adopt
internal risk models. In such cases, cloud computing and open source R (programming language)
can help smaller banks to adopt risk analytics and support branch level monitoring by applying
predictive analytics.
SOFTWARE ANALYTICS
Software analytics refers to analytics specific to software systems and their related software
development processes. It aims at describing, predicting, and improving the development,
maintenance, and management of complex software systems. Methods and techniques of
software analytics typically rely on gathering, analyzing, and visualizing information found in the
manifold data sources within the scope of software systems and their software development
processes; software analytics “turns it into actionable insight to inform better decisions related to
software”.
Software analytics represents a base component of software diagnosis, which generally aims
at generating findings, conclusions, and evaluations about software systems and their
implementation, composition, behavior, and evolution. Software analytics frequently uses and
combines approaches and techniques from statistics, prediction analysis, data mining, and
scientific visualization. For example, software analytics can map data by means of software maps
that allow for interactive exploration.
Data under exploration and analysis by software analytics exists across the software lifecycle,
including source code, software requirement specifications, bug reports, test cases, execution
traces/logs, real-world user feedback, etc. Data plays a critical role in modern software
development, because hidden in the data is the information and insight about the quality of
software and services, the experience that software users receive, as well as the dynamics of
software development.
Software analytics focuses on the trinity of software systems, software users, and the software
development process:
Software Systems. Depending on scale and complexity, the spectrum of software systems
can span from operating systems for devices to large networked systems that consist of
thousands of servers. System qualities such as reliability, performance, and security are key
to the success of modern software systems. As system scale and complexity greatly increase,
larger amounts of data, e.g., run-time traces and logs, are generated, and data becomes a critical
means to monitor, analyze, understand, and improve system quality.
Software Users. Users are (almost) always right because ultimately they will use the
software and services in various ways. Therefore, it is important to continuously provide the best
experience to users. Usage data collected from the real world reveals how users interact with
software and services. The data is incredibly valuable for software practitioners to better
understand their customers and gain insights on how to improve user experience accordingly.
Software Development Process. Software development has evolved from its traditional
form and now exhibits different characteristics. The process is more agile, and engineers are more
collaborative than in the past. Analytics on software development data provides a powerful
mechanism that software practitioners can leverage to achieve higher development productivity.
EMBEDDED ANALYTICS
Embedded analytics is technology designed to make data analysis and business
intelligence more accessible to all kinds of applications and users.
Tools
Actuate
Qlik
GoodData
SAS
IBM
Tableau
icCube
Pentaho
Sisense
LEARNING ANALYTICS
Learning analytics is the measurement, collection, analysis and reporting of data about
learners and their contexts, for purposes of understanding and optimizing learning and the
environments in which it occurs. A related field is educational data mining.
The definition and aims of Learning Analytics are contested. One earlier definition
discussed by the community suggested that “Learning analytics is the use of intelligent data,
learner-produced data, and analysis models to discover information and social connections for
predicting and advising people’s learning.”
think learning analytics - at an advanced and integrated implementation - can do
away with pre-fab curriculum models”. George Siemens, 2010.
“In the descriptions of learning analytics we talk about using data to “predict
success”. I’ve struggled with that as I pore over our databases. I’ve come to realize
there are different views/levels of success.” Mike Sharkey, 2010.
A more holistic view than a mere definition is provided by the framework of learning
analytics by Greller and Drachsler (2012). It uses a general morphological analysis (GMA) to
divide the domain into six “critical dimensions”.
A systematic overview on learning analytics and its key concepts is provided by Chatti et
al. (2012) and Chatti et al. (2014) through a reference model for learning analytics based on four
dimensions, namely: data, environments, and context (what?); stakeholders (who?); objectives
(why?); and methods (how?).
It has been pointed out that there is a broad awareness of analytics across educational
institutions for various stakeholders, but that the way ‘learning analytics’ is defined and
implemented may vary, including:
In that briefing paper, Powell and MacNeill go on to point out that some motivations and
implementations of analytics may come into conflict with others, for example highlighting potential
conflict between analytics for individual learners and organisational stakeholders.
Gašević, Dawson, and Siemens argue that computational aspects of learning analytics
need to be linked with the existing educational research if the field of learning analytics is to deliver
on its promise to understand and optimize learning.
Differentiating Learning Analytics and Educational Data Mining
Differentiating the fields of educational data mining (EDM) and learning analytics (LA) has
been a concern of several researchers. George Siemens takes the position that educational data
mining encompasses both learning analytics and academic analytics, the former of which is aimed
at governments, funding agencies, and administrators instead of learners and faculty. Baepler
and Murdoch define academic analytics as an area that “...combines select institutional data,
statistical analysis, and predictive modeling to create intelligence upon which learners, instructors,
or administrators can change academic behavior”. They go on to attempt to disambiguate
educational data mining from academic analytics based on whether the process is hypothesis
driven or not, though Brooks questions whether this distinction exists in the literature. Brooks
instead proposes that a better distinction between the EDM and LA communities is in the roots
of where each community originated, with authorship in the EDM community being dominated by
researchers coming from intelligent tutoring paradigms, and learning analytics researchers being
more focused on enterprise learning systems (e.g. learning content management systems).
Regardless of the differences between the LA and EDM communities, the two areas have
significant overlap both in the objectives of investigators as well as in the methods and techniques
that are used in the investigation. In the MS program offering in Learning Analytics at Teachers
College, Columbia University, students are taught both EDM and LA methods.
The first graduate program focused specifically on learning analytics was created by Dr.
Ryan Baker and launched in the Fall 2015 semester at Teachers College - Columbia University.
The program description states that “data about learning and learners are being generated today
on an unprecedented scale. The fields of learning analytics (LA) and educational data mining
(EDM) have emerged with the aim of transforming this data into new insights that can benefit
students, teachers, and administrators. As one of the world’s leading teaching and research
institutions in education, psychology, and health, we are proud to offer an innovative graduate
curriculum dedicated to improving education through technology and data analysis.”
CHAPTER EXERCISES
Direction: Discuss the following. Use short bond paper for your answer.
a. Health Sectors
b. Business sectors
c. Tourism
d. Agriculture
e. Economics
3. Identify the type of measurement scale—nominal, ordinal, interval, or ratio—suggested by
each statement:
a) John finished the math test in 35 minutes, whereas Jack finished the same test in 25
minutes.
b) Jack speaks French, but John does not.
c) Jack is taller than John.
d) John is 6 feet 2 inches tall.
e) John’s IQ is 120, whereas Jack’s IQ is 110.
4. Supermarket Sales
a. Determine which variables are categorical and numerical.
SUGGESTED READINGS
CHAPTER
DESCRIPTIVE STATISTICAL
MEASURES
3
OVERVIEW
The goal of this chapter is to make sense of data by constructing appropriate summary
measures, tables, and graphs. Our purpose here is to present the data in a form that makes
sense to people. This chapter also discusses the types of data and variables, measures of central
tendency, measures of variability, and outliers. Techniques and tips for using Microsoft Excel
are also included to guide you in using the application.
OBJECTIVES
We begin with a short discussion of several important concepts: populations and samples,
data sets, variables and observations, and types of data.
First, we distinguish between a population and a sample. A population includes all of the
entities of interest: people, households, machines, or whatever.
In many situations, it is virtually impossible to obtain information about
all members of the population. For example, it is far too costly to ask all potential voters which
presidential candidates they prefer. Therefore, we often try to gain insights into the characteristics
of a population by examining a sample, or subset, of the population.
A population includes all of the entities of interest in a study. A sample is a subset of the
population, often randomly chosen and preferably representative of the population as a whole.
We use the terms population and sample a few times in this chapter, which is why we have defined
them here. However, the distinction is not really important until later chapters. Our intent in this
chapter is to focus entirely on the data in a given data set, not to generalize beyond it. Therefore,
the given data set could be a population or a sample from a population. For now, the distinction
is irrelevant.
A data set is generally a rectangular array of data where the columns contain variables,
such as height, gender, and income, and each row contains an observation. Each observation
includes the attributes of a particular member of the population: a person, a company, a city, a
machine, or whatever. This terminology is common, but other terms are often used. A variable
(column) is often called a field or an attribute, and an observation (row) is often called a case or
a record. Also, data sets are occasionally rearranged, so that the variables are in rows and the
observations are in columns. However, the most common arrangement by far is to have variables
in columns, with variable names in the top row, and observations in the remaining rows.
A data set is usually a rectangular array of data, with variables in columns and
observations in rows. A variable (or field or attribute) is a characteristic of members of a
population, such as height, gender, or salary. An observation (or case or record) is a list of all
variable values for a single member of a population.
Table 1. Environmental Survey Data
Consider Table 1. Each observation lists the person’s age, gender, state of residence,
number of children, annual salary, and opinion of the president’s environmental policies. These
six pieces of information represent the variables. It is customary to include a row (row 1 in this
case) that lists variable names. These variable names should be concise but meaningful. Note
that an index of the observation is often included in column A. If you sort on other variables, you
can always sort on the index to get back to the original sort order.
TYPES OF DATA
There are several ways to categorize data. A basic distinction is between numerical and
categorical data. The distinction is whether you intend to do any arithmetic on the data. It makes
sense to do arithmetic on numerical data, but not on categorical data. (Actually, there is a third
data type, a date variable. As you may know, Excel stores dates as numbers, but for obvious
reasons, dates are treated differently from typical numbers.)
In the questionnaire data, Age, Children, and Salary are clearly numerical. For example,
it makes perfect sense to sum or average any of these. In contrast, Gender and State are clearly
categorical because they are expressed as text, not numbers.
there is a natural ordering of categories, the variable is classified as ordinal. If there is no natural
ordering, as with the Gender variable or the State variable, the variable is classified as nominal.
Remember, though, that both ordinal and nominal variables are categorical.
Excel automatically right-aligns numbers and left-aligns text. We will use this
automatic formatting, but starting in this edition, we will add our own. Specifically,
we will right-align all numbers that are available for arithmetic; we will left-align all text
such as Male, Female, Yes, and No; and we will center-align everything else,
including dates, indexes such as the Person
How do you remember, for example, that “1” stands for “strongly disagree” in the
Opinion variable? You can enter a comment—a reminder to yourself and others—in
any cell. To do so, right-click a cell and select Insert Comment. A small red tag
appears in any cell with a comment. Moving the cursor over that cell causes the
comment to appear. You will see numerous comments in the files that accompany
the book.
A dummy variable is a 0–1 coded variable for a specific category. It is coded as 1 for all
observations in that category and 0 for all observations not in that category.
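This 0–1 coding is easy to sketch in Python; the module's own examples live in Excel, so the Gender column below is hypothetical:

```python
def dummy(values, category):
    """0-1 code a column: 1 for observations in the category, 0 otherwise."""
    return [1 if v == category else 0 for v in values]

# Hypothetical Gender column
genders = ["Male", "Female", "Female", "Male"]
print(dummy(genders, "Female"))  # [0, 1, 1, 0]
```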
The method of categorizing a numerical variable is called binning (putting the data into
discrete bins), and it is also very common. (It is also called discretizing.) The purpose of the
study dictates whether age should be treated numerically or categorically; there is no absolute
right or wrong way.
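Binning can be sketched in a few lines of Python; the age cutoffs below are illustrative assumptions, not fixed rules:

```python
def bin_age(age):
    """Discretize a numerical Age variable into categorical bins."""
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "elderly"

ages = [28, 42, 67]
print([bin_age(a) for a in ages])  # ['young', 'middle-aged', 'elderly']
```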
A numerical variable is discrete if it results from a count, such as the number of children.
A continuous variable is the result of an essentially continuous measurement, such as weight
or height.
Cross-sectional data are data on a cross section of a population at a distinct point in time.
Time series data are data collected over time.
This section discusses methods for describing a categorical variable. Because it is not
appropriate to perform arithmetic on the values of the variable, there are only a few possibilities
for describing the variable, and these are all based on counting. First, you can count the number
of categories. Many categorical variables such as Gender have only two categories. Others such
as Region can have more than two categories. As you count the categories, you can also give
the categories names, such as Male and Female.
Once you know the number of categories and their names, you can count the number of
observations in each category (this is referred to as the count of categories). The resulting
counts can be reported as “raw counts” or they can be transformed into percentages of totals.
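A minimal Python sketch of counting categories and converting raw counts to percentages (the Gender values are hypothetical):

```python
from collections import Counter

# Hypothetical Gender column from a survey
genders = ["Male", "Female", "Female", "Male", "Female"]

counts = Counter(genders)                                 # raw counts per category
total = sum(counts.values())
percents = {cat: 100 * n / total for cat, n in counts.items()}

print(dict(counts))   # {'Male': 2, 'Female': 3}
print(percents)       # {'Male': 40.0, 'Female': 60.0}
```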
There are many ways to summarize numerical variables, both with numerical summary
measures and with charts, and we discuss the most common ways in this section. But before we
get into details, it is important to understand the basic goal of this section. We begin with a
numerical variable such as Salary, where there is one observation for each person. Our basic
goal is to learn how these salaries are distributed across people. To do this, we can ask a number
of questions, including the following. (1) What are the most “typical” salaries? (2) How spread out
are the salaries? (3) What are the “extreme” salaries on either end? (4) Is a chart of the salaries
symmetric about some middle value, or is it skewed in some direction? (5) Does the chart of
salaries have any other peculiar features besides possible skewness? In the next chapter, we
explore methods for checking whether a variable such as Salary is related to other variables, but
for now we simply want to explore the distribution of values in the Salary column.
There are three common measures of central tendency, all of which try to answer the
basic question of which value is most “typical.” These are the mean, the median, and the mode.
The MEAN
The mean is the average of all values. If the data set represents a sample from some
larger population, this measure is called the sample mean and is denoted by X (pronounced “X-
bar”). If the data set represents the entire population, it is called the population mean and is
denoted by μ (the Greek letter mu). This distinction is not important in this chapter, but it will
become relevant in later chapters when we discuss statistical inference. In either case, the
formula for the mean is given by Equation (2.1).
The most widely used measure of central tendency is the mean, or arithmetic average. It
is the sum of all the scores in a distribution divided by the number of cases. In terms of a formula,
it is
x̄ = ΣX / N

where
x̄ = mean
ΣX = sum of all the scores
N = number of cases
For Excel data sets, you can calculate the mean with the AVERAGE function.
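The same computation as Excel's AVERAGE can be written in one line of Python, using an illustrative set of scores:

```python
scores = [14, 16, 16, 17, 18, 19, 20, 20, 21, 22]
mean = sum(scores) / len(scores)   # sum of all scores divided by the number of cases
print(mean)  # 18.3
```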
The MEDIAN
The median is the middle observation when the data are sorted from smallest to largest.
If the number of observations is odd, the median is literally the middle observation. For example,
if there are nine observations, the median is the fifth smallest (or fifth largest). If the number of
observations is even, the median is usually defined as the average of the two middle observations
(although there are some slight variations of this definition). For example, if there are 10
observations, the median is usually defined as the average of the fifth and sixth smallest values.
14 15 16 17 18 19 20 21 22
In the following 10 scores we seek the point below which 5 scores fall:
14 16 16 17 18 19 20 20 21 22
The point below which 5 scores, or 50 percent of the cases, fall is halfway between 18
and 19. Thus, the median of this distribution is 18.5.
18 20 22 25 25 30
Any point from 22.5 to 24.5 fits the definition of the median. By convention in such cases
the median is defined as halfway between these lowest and highest points, in this case
(22.5 + 24.5)/2 = 23.5.
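The odd/even rule for ungrouped scores can be sketched in Python (Python's built-in statistics.median follows the same convention):

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                       # odd count: the middle observation
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2     # even count: average the two middle values

print(median([14, 16, 16, 17, 18, 19, 20, 20, 21, 22]))  # 18.5
print(median([14, 15, 16, 17, 18, 19, 20, 21, 22]))      # 18
```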
Score   f    fX    cf
23      2    46    18
22      2    44    16
21      4    84    14
20      4    80    10
19      2    38     6
18      2    36     4
17      0     0     2
16      2    32     2
To find the median of Mr. Li’s physics exam scores, we need to find the point below which
18/2 = 9 scores lie. We first create a cumulative frequency column (cf, column 4 in Table 6.2).
The cumulative frequency for each interval is the number of scores in that interval plus the total
number of scores below it. Since the interval between 15.5 and 16.5 has no scores below it, its cf
is equal to its f, which is 2. Since there were no scores of 17, the cf for 17 is still 2. Then adding
the two scores of 18 yields a cumulative frequency of 4. Continuing up the frequency column, we
get cf ’s of 10, 14, 16, and, finally, 18, which is equal to the number of students.
The point separating the bottom nine scores from the top nine scores, the median, is
somewhere in the interval 19.5 to 20.5. Most statistics texts say to partition this interval to locate
the median. The cf column tells us that we have six scores below 19.5. We need to add three
scores to give us half the scores (9). Since there are four scores of 20, we go three-fourths of the
way from 19.5 to 20.5 to report a median of 20.25. Note that many computer programs, including
the Statistical Package for the Social Sciences (SPSS) and the Statistical Analysis System (SAS),
simply report the midpoint of the interval—in this case 20—as the median.
The median can be calculated in Excel with the MEDIAN function.
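The partitioning procedure described above can also be sketched in Python, assuming scores arrive as (score, frequency) pairs and each score represents an interval of width 1:

```python
def grouped_median(freqs):
    """Median by interpolation within the interval containing the N/2-th score.

    freqs: list of (score, frequency) pairs sorted by score, each score
    treated as an interval from score - 0.5 to score + 0.5.
    """
    n = sum(f for _, f in freqs)
    half = n / 2
    cum = 0
    for score, f in freqs:
        if cum + f >= half:
            lower = score - 0.5                  # lower real limit of the interval
            return lower + (half - cum) / f      # go (half - cum)/f of the way up
        cum += f

# Mr. Li's physics exam scores from the frequency table
freqs = [(16, 2), (17, 0), (18, 2), (19, 2), (20, 4), (21, 4), (22, 2), (23, 2)]
print(grouped_median(freqs))  # 20.25
```

Note that, as the text points out, packages such as SPSS and SAS would simply report the interval midpoint, 20, instead.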
The MODE
The mode is the value that appears most often, and it can be calculated in Excel with the
MODE function. In most cases where a variable is essentially continuous, the mode is not very
interesting because it is often the result of a few lucky ties.
The mode is the value in a distribution that occurs most frequently. It is the simplest to find
of the three measures of central tendency because it is determined by inspection rather than by
computation. Given the distribution of scores
14 16 16 17 18 19 19 19 21 22
you can readily see that the mode of this distribution is 19 because it is the most frequent score.
Sometimes there is more than one mode in a distribution. For example, if the scores had been
14 16 16 16 18 19 19 19 21 22
you would have two modes: 16 and 19. This kind of distribution with two modes is called bimodal.
Distributions with three or more modes are called multimodal.
The mode is the least useful indicator of central value in a distribution for two reasons.
First, it is unstable. For example, two random samples drawn from the same population may have
quite different modes. Second, a distribution may have more than one mode. In published
research, the mode is seldom reported as an indicator of central tendency. Its use is largely limited
to inspectional purposes. A mode may be reported for any of the scales of measurement, but it is
the only measure of central tendency that may legitimately be used with nominal scales.
Two new versions of the MODE function were introduced in Excel 2010:
MODE.MULT and MODE.SNGL. The latter is the same as the older MODE function.
The MULT version returns multiple modes if there are multiple modes.
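Python's statistics module mirrors this behavior: multimode returns every mode, much like Excel's MODE.MULT, using the two score distributions above:

```python
from statistics import multimode

print(multimode([14, 16, 16, 17, 18, 19, 19, 19, 21, 22]))  # [19]
print(multimode([14, 16, 16, 16, 18, 19, 19, 19, 21, 22]))  # [16, 19]
```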
Shapes of Distributions
Frequency distributions can have a variety of shapes. A distribution is symmetrical when
the two halves are mirror images of each other. In a symmetrical distribution, the values of the
mean and the median coincide. If such a distribution has a single mode, rather than two or more
modes, the three indexes of central tendency will coincide, as shown in Figure 6.
If a distribution is not symmetrical, it is described as skewed, pulled out to one end or the
other by the presence of extreme scores. In skewed distributions, the values of the measures of
central tendency differ. In such distributions, the value of the mean, because it is influenced by
the size of extreme scores, is pulled toward the end of the distribution in which the extreme scores
lie, as shown in Figures 7 and 8.
Figure 4. Positively Skewed Distribution
The effect of extreme values is less on the median because this index is influenced not by
the size of scores but by their position. Extreme values have no impact on the mode because this
index has no relation with either of the ends of the distribution. Skews are labeled according to
where the extreme scores lie. A way to remember this is “The tail names the beast.” Figure 4
shows a negatively skewed distribution, whereas Figure 8 shows a positively skewed distribution.
MEASURES OF VARIABILITY
Although indexes of central tendency help researchers describe data in terms of average
value or typical measure, they do not give the total picture of a distribution. The mean values of
two distributions may be identical, whereas the degree of dispersion, or variability, of their scores
might be different. In one distribution, the scores might cluster around the central value; in the
other, they might be scattered. For illustration, consider the following distributions of scores:
The value of the mean in both these distributions is 25, but the degree of scattering of the
scores differs considerably. The scores in distribution (a) are obviously much more homogeneous
than those in distribution (b). There is clearly a need for indexes that can describe distributions in
terms of variation, spread, dispersion, heterogeneity, or scatter of scores. Three indexes are
commonly used for this purpose: range, variance, and standard deviation.
a. Range
The simplest of all indexes of variability is the range. It is the difference between the upper
real limit of the highest score and the lower real limit of the lowest score. In statistics, any score
is thought of as representing an interval width from halfway between that score and the next
lowest score (lower real limit) up to halfway between that score and the next highest score (upper
real limit).
For example, if several children have a recorded score of 12 pull-ups on a physical fitness
test, their performances probably range from those who just barely got their chin over the bar the
twelfth time and were finished (lower real limit) to those who completed 12 pull-ups, came up
again, and almost got their chin over the bar, but did not quite make it for pull-up 13 (upper limit).
Thus, a score of 12 is considered as representing an interval from halfway between 11 and 12
(11.5) to halfway between 12 and 13 (12.5) or an interval of 1. For example, given the following
distribution of scores, you find the range by subtracting 1.5 (the lower limit of the lowest score)
from 16.5 (the upper limit of the highest score), which is equal to 15.
2 10 11 12 13 14 16
Formula R = ( Xh−Xl ) + I
where
R = range
Xh = highest value in a distribution
Xl = lowest value in a distribution
I = interval width
Applying the formula, subtract the lower number from the higher and add 1 (16 − 2 + 1 =
15). In frequency distributions, 1 is the most common interval width.
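The range formula is a one-liner in Python, using the distribution above:

```python
scores = [2, 10, 11, 12, 13, 14, 16]
interval_width = 1                                    # the most common interval width
r = (max(scores) - min(scores)) + interval_width      # R = (Xh - Xl) + I
print(r)  # 15
```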
The range is an unreliable index of variability because it is based on only two values, the highest
and the lowest. It is not a stable indicator of the spread of the scores. For this reason, the use of
the range is mainly limited to inspectional purposes. Some research reports refer to the range of
distributions, but such references are usually used in conjunction with other measures of
variability, such as variance and standard deviation.
Variance and standard deviation are the most frequently used indexes of variability. They
are both based on deviation scores—scores that show the difference between a raw score and
the mean of the distribution. The formula for a deviation score is
x = X − X̄

where
x = deviation score
X = raw score
X̄ = mean
Scores below the mean will have negative deviation scores, and scores above the mean
will have positive deviation scores.
By definition, the sum of the deviation scores in a distribution is always 0. Thus, to use
deviation scores in calculating measures of variability, you must find a way to get around the fact
that Σx = 0. The technique used is to square each deviation score so that they all become positive
numbers. If you then sum the squared deviations and divide by the number of scores, you have
the mean of the squared deviations from the mean, or the variance. In mathematical form,
variance is
σ² = Σx² / N

where
σ² = variance
Σ = sum of
x² = deviation of each score from the mean (X − X̄) squared, otherwise known as the
squared deviation score
N = number of cases in the distribution
X     f    fX     x    x²   fx²    X²     fX²
23    2    46    +3     9    18   529    1058
22    2    44    +2     4     8   484     968
21    4    84    +1     1     4   441    1764
20    4    80     0     0     0   400    1600
19    2    38    −1     1     2   361     722
18    2    36    −2     4     8   324     648
17    0     0    −3     9     0   289       0
16    2    32    −4    16    32   256     512

N = 18    ΣfX = 360    Σfx² = 72    ΣfX² = 7272
In column 4 of Table 4, we see the deviation scores, the differences between each score and
the mean. Column 5 shows each deviation score squared (x²), and column 6 shows the frequency
of each score from column 2 multiplied by x² (column 5). Summing column 6 gives us the sum of
the squared deviation scores, Σx² = 72. Dividing this by the number of scores gives us the mean
of the squared deviation scores, the variance.
The common formula used in computing variance is convenient only when the mean is a
whole number. To avoid the tedious task of working with squared mixed-number deviation scores
such as 7.6667², we recommend that students always use this formula for computing the
variance if the computation must be done “by hand”:
σ² = (ΣX² − (ΣX)²/N) / N

where
σ² = variance
ΣX² = sum of the squares of each score (i.e., each score is first
squared, and then these squares are summed)
(ΣX)² = sum of the scores, squared (the scores are first summed, and then
this total is squared)
N = number of cases
Column 7 in Table 4 shows the square of each raw score. Column 8 shows these squared
raw scores multiplied by frequency. Summing this fX² column gives us the sum of the squared
raw scores:

σ² = (ΣX² − (ΣX)²/N) / N = (7272 − 360²/18) / 18 = (7272 − 129600/18) / 18 = (7272 − 7200) / 18 = 72/18 = 4
In most cases, educators prefer an index that summarizes the data in the same unit of
measurement as the original data. Standard deviation (σ), the positive square root of variance,
provides such an index. By definition, the standard deviation is the square root of the mean of the
squared deviation scores. Rewriting this symbol, we obtain
σ = √(Σx²/N)

σ = √(72/18) = √4 = 2
The standard deviation belongs to the same statistical family as the mean; that is, like the
mean, it is an interval or ratio statistic, and its computation is based on the size of individual scores
in the distribution. It is by far the most frequently used measure of variability and is used in
conjunction with the mean.
The population standard deviation, denoted by σ, is the square root of the quantity in
Equation (2.3). To calculate either standard deviation in Excel, you can first find the variance with
the VAR or VARP function and then take its square root. Alternatively, you can find it directly with
the STDEV (sample) or STDEVP (population) function.
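As a check on the computation for Mr. Li's scores, here is a short Python sketch that expands the frequency table into raw scores and computes the population variance and standard deviation directly:

```python
# Mr. Li's physics scores, expanded from the frequency table
scores = [16]*2 + [18]*2 + [19]*2 + [20]*4 + [21]*4 + [22]*2 + [23]*2

n = len(scores)
mean = sum(scores) / n
variance = sum((x - mean) ** 2 for x in scores) / n   # population variance, Σx²/N
std_dev = variance ** 0.5                             # positive square root of variance

print(n, mean, variance, std_dev)  # 18 20.0 4.0 2.0
```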
Most textbooks on data analysis, including this one, tend to use example data sets that
are “cleaned up.” Unfortunately, the data sets you are likely to encounter in your job are often not
so clean. Two particular problems you will encounter are outliers and missing data, the topics of
this section. There are no easy answers for dealing with these problems, but you should at least
be aware of the issues.
Outliers
An outlier is literally a value or an entire observation (row) that lies well outside of the norm.
For the baseball data, Alex Rodriguez’s salary of $32 million is definitely an outlier. This is indeed
his correct salary—the number wasn’t entered incorrectly—but it is way beyond what most players
make. Actually, statisticians disagree on an exact definition of an outlier. Going by the third
empirical rule, you might define an outlier as any value more than three standard deviations from
the mean, but this is only a rule of thumb. Let’s just agree to define outliers as extreme values,
and then for any particular data set, you can decide how extreme a value needs to be to qualify
as an outlier.
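The three-standard-deviations rule of thumb can be sketched in Python; the salary figures below are hypothetical, not the actual baseball data:

```python
def outliers(values, k=3):
    """Flag values more than k standard deviations from the mean
    (the rule-of-thumb definition; k = 3 by convention)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return [x for x in values if abs(x - mean) > k * std]

# Hypothetical salaries in millions, with one extreme value
salaries = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3,
            1.4, 1.5, 1.6, 1.8, 2.0, 2.2, 2.5, 3.0, 3.5, 32.0]
print(outliers(salaries))  # [32.0]
```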
Sometimes an outlier is easy to detect and deal with. For example, this is often the case
with data entry errors. Suppose a data set includes a Height variable, a person’s height measured
in inches, and you see a value of 720. This is certainly an outlier—and it is certainly an error.
Once you spot it, you can go back and check this observation to see what the person’s height
should be. Maybe an extra 0 was accidentally appended and the true value is 72. In any case,
this type of outlier is usually easy to discover and fix.
It isn’t always easy to detect outliers, but an even more important issue is what to do about
them when they are detected. Of course, if they are due to data entry errors, they can be fixed,
but what if they are legitimate values like Alex Rodriguez’s salary? One or a few wild outliers like
this one can dominate a statistical analysis. For example, they can make a mean or standard
deviation much different than if the outliers were not present.
For this reason, some people argue, possibly naïvely, that outliers should be eliminated
before running statistical analyses. However, it is not appropriate to eliminate outliers simply to
produce “nicer” results. There has to be a legitimate reason for eliminating outliers, and such a
reason sometimes exists. For example, suppose you want to analyze salaries of “typical”
managers at your company. Then it is probably appropriate to eliminate the CEO and possibly other
high-ranking executives from the analysis, arguing that they aren’t really part of the population of
interest and would just skew the results. Or if you are interested in the selling prices of “typical”
homes in your community, it is probably appropriate to eliminate the few homes that sell for over
$2 million, again arguing that these are not the types of homes you are interested in.
Missing Values
There are no missing data in the baseball salary data set. All 843 observations have a
value for each of the four variables. For real data sets, however, this is probably the exception
rather than the rule. Unfortunately, most real data sets have gaps in the data. This could be
because a person didn’t want to provide all the requested personal information (what business is
it of yours how old I am or whether I drink alcohol?), it could be because data doesn’t exist (stock
prices in the 1990s for companies that went public after 2000), or it could be because some values
are simply unknown. Whatever the reason, you will undoubtedly encounter data sets with varying
degrees of missing values. As with outliers, there are two issues: how to detect missing values
and what to do about them. The first issue isn’t as simple as you might imagine. For an Excel data
set, you might expect missing data to be obvious from blank cells. This is certainly one possibility,
but there are others. Missing data are coded in a variety of strange ways. One common method
is to code missing values with an unusual number such as −9999 or 9999. Another method is to
code missing values with a symbol such as “−” or “*”. If you know the code (and it is often supplied
in a footnote), then it is usually a good idea, at least in Excel, to perform a global search and
replace, replacing all of the missing value codes with blanks.
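Outside of Excel, the same recoding can be sketched in Python, assuming a hypothetical set of missing-value codes:

```python
# Hypothetical missing-value codes sometimes seen in raw data exports
MISSING_CODES = {-9999, 9999, "-", "*", ""}

def clean(value):
    """Map coded missing values to None; keep everything else."""
    return None if value in MISSING_CODES else value

raw = [35, -9999, 42, "*", 51]
print([clean(v) for v in raw])  # [35, None, 42, None, 51]
```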
The more important issue is what to do about missing values. One option is to ignore them.
Then you will have to be aware of how the software deals with missing values. For example, if
you use Excel’s AVERAGE function on a column of data with missing values, it reacts the way
you would hope and expect—it adds all the non-missing values and divides by the number of non-
missing values. StatTools reacts in the same way for all of the measures discussed in this chapter
(after alerting you that there are indeed missing values). We will say more about how StatTools
deals with missing data for other analyses in later chapters. If you are using other statistical
software such as SPSS or SAS, you should read its online help to learn how its various statistical
analyses deal with missing data.
Because this is such an important topic in real-world data analysis, researchers have
studied many ways of filling in the gaps so that the missing data problem goes away (or is at least
disguised). One possibility is to fill in all of the missing values in a column with the average of the
non-missing values in that column. Indeed, this is an option in some software packages, but we
don’t believe it is usually a very good option. (Is there any reason to believe that missing values
would be average values if they were known? Probably not.) Another possibility is to examine the
non-missing values in the row of any missing value. It is possible that they provide some clues on
what the missing value should be. For example, if a person is male, is 55 years old, has an MBA
degree from Harvard, and has been a manager at an oil company for 25 years, this should
probably help to predict his missing salary. (It probably isn’t below $100,000.) We will not discuss
this issue any further here because it is quite complex, and there are no easy answers. But be
aware that you will undoubtedly have to deal with missing data at some point in your job, either
by ignoring the missing values or by filling in the gaps in some way.
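Both options (ignoring missing values or filling them in) can be sketched in pandas; the columns and data here are hypothetical:

```python
# Sketch of the two options discussed above; the data are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [55, 41, np.nan, 38],
                   "Salary": [120000, 95000, np.nan, 88000]})

# Option 1: ignore missing values; pandas summaries already skip them,
# just as Excel's AVERAGE function does.
mean_salary = df["Salary"].mean()  # averages the three non-missing salaries

# Option 2: fill each gap with the column's non-missing average (an option
# in some packages, though rarely a very good one, as noted above).
filled = df.fillna(df.mean(numeric_only=True))
```

Option 2 makes the missing-data problem disappear cosmetically, but as the text warns, there is usually no reason to believe a missing value would have been average had it been known.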
CHAPTER EXERCISES
1. Provide answers as requested, given the following distribution: 15, 14, 14, 13, 11, 10, 10, 10,
8, 5.
2. Suppose that Ms. Llave's English class has the following scores in two tests, as shown in
Table 1. Compute the following:
a. Mean
b. Median
c. Mode
d. Variance
e. Standard Deviation
SUGGESTED READINGS
CHAPTER
ANALYTICS ON SPREADSHEETS
4
OVERVIEW
This section discusses a great tool that was introduced in Excel 2007: tables. Tables were
somewhat available in previous versions of Excel, but they were never called tables before,
and some of the really useful features of Excel 2007 tables were new at the time. This
chapter discusses how to filter, sort, and summarize data using spreadsheets.
OBJECTIVES
REFERENCES
EXCEL TABLES FOR FILTERING, SORTING, AND SUMMARIZING
It is useful to begin with some terminology and history. Earlier in this chapter, we discussed
data arranged in a rectangular range of rows and columns, where each row is an observation and
each column is a variable, with variable names at the top of each column. Informally, we refer to
such a range as a data set. In fact, this is the technical term used by StatTools. In previous
versions of Excel, data sets of this form were called lists, and Excel provided several tools for
dealing with lists. In Excel 2007, recognizing the importance of data sets, Microsoft made them
much more prominent and provided even better tools for analyzing them. Specifically, you now
have the ability to designate a rectangular data set as a table and then employ a number of
powerful tools for analyzing tables. These tools include filtering, sorting, and summarizing.
Let’s consider data in Table 4. The data contains 1000 customers of HyTex, a
(fictional) direct marketing company, for the current year. The definitions of the variables are fairly
straightforward, but details about several of them are listed in cell comments in row 1. HyTex
wants to find some useful and quick information about its customers by using an Excel table. How
can it proceed?
The range A1:O1001 is in the form of a data set—it is a rectangular range bounded by
blank rows and columns, where each row is an observation, each column is a variable, and
variable names appear in the top row. Therefore, it is a candidate for an Excel table. However, it
doesn’t benefit from the new table tools until you actually designate it as a table. To do so, select
any cell in the data set, click the Table button in the left part of the Insert ribbon (see Figure 4),
and accept the default options. Two things happen. First, the data set is designated as a table, it
is formatted nicely, and a dropdown arrow appears next to each variable name, as shown in
Figure 5. Second, a new Table Tools Design ribbon becomes available (see Figure 6). This ribbon
is available any time the active cell is inside a table. Note that the table is named Table1 by default
(if this is the first table). However, you can change this to a more descriptive name if you like.
One handy feature of Excel tables is that the variable names remain visible even when
you scroll down the screen. Try it to see how it works. When you scroll down far enough that the
variable names would disappear, the column headers, A, B, C, and so on, change to the variable
names. Therefore, you no longer need to freeze panes or split the screen to see the variable
names. However, this works only when the active cell is within the table. If you click outside the
table, the column headers revert back to A, B, C, and so on.
Filtering
We now discuss ways of filtering data sets—that is, finding records that match particular
criteria. Before getting into details, there are two aspects of filtering you should be aware of. First,
this section is concerned with the types of filters called AutoFilter in pre-2007 versions of Excel.
The term AutoFilter implied that these were very simple filters, easy to learn and apply. If you
wanted to do any complex filtering, you had to move beyond AutoFilter to Excel’s Advanced Filter
tool. Starting in version 2007, Excel still has Advanced Filter. However, the term AutoFilter has
been changed to Filter to indicate that these “easy” filters are now more powerful than the old
AutoFilter. Fortunately, they are just as easy as AutoFilter.
Second, one way to filter is to create an Excel table, as indicated in the previous
subsection. This automatically provides the dropdown arrows next to the field names that allow
you to filter. Indeed, this is the way we will filter in this section: on an existing table. However, a
designated table is not required for filtering. You can filter on any rectangular data set with variable
names. There are actually three ways to do so. For each method, the active cell should be a cell
inside the data set.
■ Use the Filter button from the Sort & Filter dropdown list on the Home ribbon.
■ Use the Filter button from the Sort & Filter group on the Data ribbon.
■ Right-click any cell in the data set and choose Filter from the shortcut menu.
With the last option, you get several choices, the most popular of which is Filter by Selected Cell's Value. For
example, if the selected cell has value 1 and is in the Children column, then only customers with
a single child will remain visible. (This behavior should be familiar to Access users.) The point is
that Microsoft realizes how important filtering is to Excel users. Therefore, they have made filtering
a very prominent and powerful tool in all versions of Excel since 2007. As far as we can tell, the
two main advantages of filtering on a table, as opposed to the three options just listed, are the
nice formatting (banded rows, for example) provided by tables, and, more importantly, the total
row. If this total row is showing, it summarizes only the visible records; the hidden rows are ignored.
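Excel's Filter by Selected Cell's Value has a direct analogue in pandas boolean indexing. A sketch, using illustrative HyTex-style column names rather than the actual file's:

```python
# Sketch: filter records matching a criterion and summarize only the
# visible (matching) rows, as a table's total row does. Columns are illustrative.
import pandas as pd

customers = pd.DataFrame({
    "Children": [0, 1, 2, 1, 3],
    "AmountSpent": [500, 1200, 800, 950, 300],
})

# Keep only customers with exactly one child.
one_child = customers[customers["Children"] == 1]

# The summary reflects only the filtered records; hidden rows are ignored.
print(one_child["AmountSpent"].sum())  # 2150
```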
CHAPTER EXERCISES
1. Obtain DOH data on COVID-19 in the Philippines from March to August 2020. Write or print
on short bond paper.
Tasks:
- Month
- Monthly data
▪ Explain the basic concepts and tools necessary to work with probability
distributions and their summary measures.
▪ Examine the probability distribution of a single random variable.
▪ Understand the concept of addition rule
▪ Understand and learn conditional probability and the multiplication rule
▪ Learn summarizing measure of probability distribution
A key aspect of solving real business problems is dealing appropriately with uncertainty.
This involves recognizing explicitly that uncertainty exists and using quantitative methods to
model uncertainty. If you want to develop realistic business models, you cannot simply act as if
uncertainty doesn’t exist. For example, if you don’t know next month’s demand, you shouldn’t
build a model that assumes next month’s demand is a sure 1500 units. This is only wishful thinking.
You should instead incorporate demand uncertainty explicitly into your model. To do this, you
need to know how to deal quantitatively with uncertainty. This involves probability and probability
distributions. We introduce these topics in this chapter and then use them in a number of later
chapters.
There are many sources of uncertainty. Demands for products are uncertain, times
between arrivals to a supermarket are uncertain, stock price returns are uncertain, changes in
interest rates are uncertain, and so on. In many situations, the uncertain quantity— demand, time
between arrivals, stock price return, change in interest rate—is a numerical quantity. In the
language of probability, it is called a random variable. More formally, a random variable
associates a numerical value with each possible random outcome.
Associated with each random variable is a probability distribution that lists all of the
possible values of the random variable and their corresponding probabilities. A probability
distribution provides very useful information. It not only indicates the possible values of the
random variable, but it also indicates how likely they are. For example, it is useful to know that
the possible demands for a product are, say, 100, 200, 300, and 400, but it is even more useful
to know that the probabilities of these four values are, say, 0.1, 0.2, 0.4, and 0.3. This implies,
for example, that there is a 70% chance that demand will be at least 300. It is often useful to
summarize the information from a probability distribution with numerical summary measures.
These include the mean, variance, and standard deviation. The summary measures in this
chapter are based on probability distributions, not an observed data set. We will use numerical
examples to explain the difference between the two—and how they are related.
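The demand example above can be worked directly in code:

```python
# The demand example from the text: possible demands 100-400 with
# probabilities 0.1, 0.2, 0.4, and 0.3.
values = [100, 200, 300, 400]
probs = [0.1, 0.2, 0.4, 0.3]

# P(demand >= 300) adds the probabilities of the qualifying values.
p_at_least_300 = sum(p for v, p in zip(values, probs) if v >= 300)

# The mean is a probability-weighted average, not an average of observed data.
mean = sum(v * p for v, p in zip(values, probs))

print(round(p_at_least_300, 4), round(mean, 4))  # 0.7 290.0
```

Note that the mean (290) is computed from the distribution itself; no data set is involved, which is exactly the distinction the text draws between distribution-based and data-based summary measures.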
We discuss two terms you often hear in the business world: uncertainty and risk. They
are sometimes used interchangeably, but they are not really the same. You typically have no
control over uncertainty; it is something that simply exists. A good example is the uncertainty in
exchange rates. You cannot be sure what the exchange rate between the U.S. dollar and the euro
will be a year from now. All you can try to do is measure this uncertainty with a probability
distribution.
By learning about probability, you will learn how to measure uncertainty, and you
will also learn how to measure the risks involved in various decisions. One important topic you
will not learn much about is risk mitigation by various types of hedging. For example, if you know
you have to purchase a large quantity of some product from Europe a year from now, you face
the risk that the value of the euro could increase dramatically, thus costing you a lot of money.
Fortunately, there are ways to hedge this risk, so that if the euro does increase relative to the
dollar, your hedge minimizes your losses.
PROBABILITY ESSENTIALS
A probability is a number between 0 and 1 that measures the likelihood that some event
will occur. An event with probability 0 cannot occur, whereas an event with probability 1 is certain
to occur. An event with probability greater than 0 and less than 1 involves uncertainty. The closer
its probability is to 1, the more likely it is to occur.
When a sports commentator states that the odds against the Miami Heat winning the NBA
Championship are 3 to 1, he or she is also making a probability statement. The concept of
probability is quite intuitive. However, the rules of probability are not always as intuitive or easy to
master. We examine the most important of these rules in this section.
There are only a few probability rules you need to know, and they are discussed in the
next few subsections. Surprisingly, these are the only rules you need to know. Probability is not
an easy topic, and a more thorough discussion of it would lead to considerable mathematical
complexity, well beyond the level of this book.
Rule of Complements
The simplest probability rule involves the complement of an event. If A is any event, then
the complement of A, denoted by Ā (or in some books by Aᶜ), is the event that A does not occur.
For example, if A is the event that the Dow Jones Industrial Average will finish the year at
or above the 14,000 mark, then the complement of A is that the Dow will finish the year below
14,000.
If the probability of A is P(A), then the probability of its complement, P(Ā), is given by the
equation below. Equivalently, the probability of an event and the probability of its complement
sum to 1. For example, if you believe that the probability of the Dow finishing at or above 14,000
is 0.25, then the probability that it will finish the year below 14,000 is 1 - 0.25 = 0.75.

P(Ā) = 1 - P(A)
Addition Rule
Events are mutually exclusive if at most one of them can occur. That is, if one of them
occurs, then none of the others can occur.
For example, consider the following three events involving a company’s annual revenue
for the coming year: (1) revenue is less than $1 million, (2) revenue is at least $1 million but less
than $2 million, and (3) revenue is at least $2 million. Clearly, only one of these events can occur.
Therefore, they are mutually exclusive. They are also exhaustive events, which means that they
exhaust all possibilities—one of these three events must occur. Let A1 through An be any n events.
Then the addition rule of probability involves the probability that at least one of these events will
occur. In general, this probability is quite complex, but it simplifies considerably when the events
are mutually exclusive. In this case the probability that at least one of the events will occur is the
sum of their individual probabilities, as shown in the equation below.

P(A1 or A2 or ... or An) = P(A1) + P(A2) + ... + P(An)

Of course, when the events are mutually exclusive, "at least one" is equivalent to "exactly one."
In addition, if the events A1 through An are exhaustive, then the probability is 1 because one of
the events is certain to occur.
For example, in terms of a company’s annual revenue, define A1 as “revenue is less than
$1 million,” A2 as “revenue is at least $1 million but less than $2 million,” and A3 as “revenue is at
least $2 million.” Then these three events are mutually exclusive and exhaustive. Therefore, their
probabilities must sum to 1. Suppose these probabilities are P(A1) = 0.5, P(A2) = 0.3, and P(A3) =
0.2. (Note that these probabilities do sum to 1.) Then the additive rule enables you to calculate
other probabilities. For example, the event that revenue is at least $1 million is the event that
either A2 or A3 occurs. From the addition rule, its probability is

P(A2 or A3) = P(A2) + P(A3) = 0.3 + 0.2 = 0.5

Similarly, the probability that revenue is less than $2 million is

P(A1 or A2) = P(A1) + P(A2) = 0.5 + 0.3 = 0.8

and the probability that revenue is either less than $1 million or at least $2 million is

P(A1 or A3) = P(A1) + P(A3) = 0.5 + 0.2 = 0.7
Again, the addition rule works only for mutually exclusive events. If the events overlap,
the situation is more complex.
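The revenue example can be checked in a few lines of code:

```python
# The three mutually exclusive, exhaustive revenue events from the text:
# revenue < $1M, $1M <= revenue < $2M, and revenue >= $2M.
p = {"A1": 0.5, "A2": 0.3, "A3": 0.2}

# Exhaustive events: the probabilities sum to 1.
total = sum(p.values())

# Addition rule: P(revenue at least $1 million) = P(A2) + P(A3).
p_at_least_1m = p["A2"] + p["A3"]
print(p_at_least_1m)  # 0.5
```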
Probabilities are always assessed relative to the information currently available. As new
information becomes available, probabilities can change. For example, if you read that LeBron
James suffered a season-ending injury, your assessment of the probability that the Heat will win
the NBA Championship would obviously change. A formal way to revise probabilities on the basis
of new information is to use conditional probabilities.
Let A and B be any events with probabilities P(A) and P(B). Typically, the probability P(A)
is assessed without knowledge of whether B occurs. However, if you are told that B has occurred,
then the probability of A might change. The new probability of A is called the conditional
probability of A given B, and it is denoted by P(A∣B). Note that there is still uncertainty involving
the event to the left of the vertical bar in this notation; you do not know whether it will occur.
However, there is no uncertainty involving the event to the right of the vertical bar; you know that
it has occurred. The conditional probability can be calculated with the following formula.

P(A∣B) = P(A and B) / P(B)

The numerator in this formula is the probability that both A and B occur. This probability
must be known to find P(A∣B). However, in some applications P(A∣B) and P(B) are known. Then
you can multiply both sides of this equation by P(B) to obtain the following multiplication rule for
P(A and B).

P(A and B) = P(A∣B)P(B)
Example:
Bender Company supplies contractors with materials for the construction of houses. The
company currently has a contract with one of its customers to fill an order by the end of July.
However, there is some uncertainty about whether this deadline can be met, due to uncertainty
about whether Bender will receive the materials it needs from one of its suppliers by the middle
of July. Right now it is July 1. How can the uncertainty in this situation be assessed?
Solution
Let A be the event that Bender meets its end-of-July deadline, and let B be the event
that Bender receives the materials from its supplier by the middle of July. The probabilities Bender
is best able to assess on July 1 are probably P(B) and P(A∣B). At the beginning of July, Bender
might estimate that the chances of getting the materials on time from its supplier are 2 out of 3,
so that P(B) = 2/3. Also, thinking ahead, Bender estimates that if it receives the required materials
on time, the chances of meeting the end-of-July deadline are 3 out of 4. This is a conditional
probability statement, namely, that P(A∣B) = 3/4. Then the multiplication rule implies that

P(A and B) = P(A∣B)P(B) = (3/4)(2/3) = 1/2

That is, there is a fifty-fifty chance that Bender will get its materials on time and meet its
end-of-July deadline.
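The Bender calculation is easy to verify with exact fractions:

```python
# The Bender example in code: P(A and B) = P(A|B) * P(B).
from fractions import Fraction

p_b = Fraction(2, 3)          # materials arrive by mid-July
p_a_given_b = Fraction(3, 4)  # deadline met, given materials arrive on time

p_a_and_b = p_a_given_b * p_b
print(p_a_and_b)  # 1/2, the "fifty-fifty chance" in the text
```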
There are really two types of random variables: discrete and continuous. A discrete
random variable has only a finite number of possible values, whereas a continuous random
variable has a continuum of possible values. Usually a discrete distribution results from a count,
whereas a continuous distribution results from a measurement. For example, the number of
children in a family is clearly discrete, whereas the amount of rain this year in San Francisco is
clearly continuous.
This distinction between counts and measurements is not always clear-cut. For example,
what about the demand for televisions at a particular store next month? The number of televisions
demanded is clearly an integer (a count), but it probably has many possible values, such as all
integers from 0 to 100. In some cases like this, we often approximate in one of two ways. First,
we might use a discrete distribution with only a few possible values, such as all multiples of 20
from 0 to 100. Second, we might approximate the possible demand as a continuum from 0 to 100.
The reason for such approximations is to simplify the mathematics, and they are frequently used.
Mathematically, there is an important difference between discrete and continuous
probability distributions. Specifically, a proper treatment of continuous distributions, analogous to
the treatment we provide in this chapter, requires calculus—which we do not presume for this
book. Therefore, we discuss only discrete distributions in this chapter. In later chapters we often
use continuous distributions, particularly the bell-shaped normal distribution, but we simply state
their properties without deriving them mathematically.
The essential properties of a discrete random variable and its associated probability
distribution are quite simple. We discuss them in general and then analyze a numerical example.
(The variance is measured in squared units of the variable.) Therefore, a more natural measure
of variability is the standard deviation, denoted by σ or Stdev(X). It is the square root of the
variance, as indicated by the equation below.

Stdev(X) = σ = √Var(X)
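These summary measures can be sketched for an illustrative discrete distribution (the values and probabilities below are made up, not taken from the text):

```python
# Sketch: mean, variance, and standard deviation of a discrete distribution,
# computed from the definitions (probability-weighted, not data-based).
import math

values = [0, 1, 2, 3]
probs = [0.2, 0.4, 0.3, 0.1]

# Mean: probability-weighted average of the possible values.
mean = sum(v * p for v, p in zip(values, probs))

# Variance: probability-weighted average of squared deviations from the mean.
variance = sum((v - mean) ** 2 * p for v, p in zip(values, probs))

# Standard deviation: the square root of the variance, as in the text.
stdev = math.sqrt(variance)
```

For this distribution the mean is 1.3, the variance is 0.81, and the standard deviation is 0.9, which is back in the original (unsquared) units of the variable.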
CHAPTER EXERCISES
An investor is concerned with the market return for the coming year, where the market
return is defined as the percentage gain (or loss, if negative) over the year. The investor believes
there are five possible scenarios for the national economy in the coming year: rapid expansion,
moderate expansion, no growth, moderate contraction, and serious contraction. Furthermore, she
has used all of the information available to her to estimate that the market returns for these
scenarios are, respectively, 23%, 18%, 15%, 9%, and 3%. That is, the possible returns vary from
a high of 23% to a low of 3%. Also, she has assessed that the probabilities of these outcomes are
0.12, 0.40, 0.25, 0.15, and 0.08. Use this information to describe the probability distribution of the
market return.
Compute the following for the probability distribution of the market return for the coming
year:
1. Mean,
2. Variance,
3. Standard deviation
SUGGESTED READINGS
CHAPTER
STATISTICAL INFERENCE: SAMPLING AND ESTIMATION
6
OVERVIEW
This chapter introduces the important problem of estimating an unknown population quantity
by randomly sampling from the population. Sampling is often expensive and/or time-
consuming, so a key step in any sampling plan is to determine the sample size that produces
a prescribed level of accuracy. This chapter also sets the stage for statistical inference, a
topic that is explored in the following few chapters. In a typical statistical inference problem,
you want to discover one or more characteristics of a given population.
OBJECTIVES
UNDERSTANDING SAMPLES
What is a population? Any group of people, transactions, products, and so on, with at least
one common characteristic is called a population. You need to understand the
population for any project at the beginning of the project. In business, it is rare to have a population
that has only one characteristic. Generally, it will have many variables in the data set. What is a
sample? A sample consists of a few observations or subset of a population. Can a sample have
the same number of observations as a population? Yes, it can. Some of the differences between
populations and samples are in the computations and nomenclatures associated with them.
In statistics, population refers to a collection of data related to people or events for which
the analyst wants to make some inferences. It is not possible to examine every member in the
population. Thus, if you take a sample that is random and large enough, you can use the
information collected from the sample to make deductions about the population. For example, you
can look at 100 students from a school (picked randomly) and make a fairly accurate judgment of
the standard of English spoken in the school. Or you can look at the last 100 transactions on a
web site and figure out fairly accurately the average time a customer spends on the web site.
Before you can choose a sample from a given population, you typically need a list of all
members of the population. In sampling terminology, this list is called a frame, and the potential
sample members are called sampling units. Depending on the context, sampling units could be
individual people, households, companies, cities, or others.
There are two basic types of samples: probability samples and judgmental samples. A
probability sample is a sample in which the sampling units are chosen from the population
according to a random mechanism. In contrast, no formal random mechanism is used to select a
judgmental sample. In this case the sampling units are chosen according to the sampler’s
judgment.
SAMPLING TECHNIQUES
A sample is part of the population which is observed in order to make inferences about
the whole population (Manheim, 1977). You use sampling when your research design requires
that you collect information from or about a population, which is large or so widely scattered as to
make it impractical to observe all the individuals in the population. A sample reflects the
characteristics of the population.
Four factors that you should take into consideration when selecting your sample and the
size of your sample are the following:
2. Size of population. If the population is large, you need a sample. However, you
do not need a sample if the population is small and can be handled if you include
all the individuals in the population. Including all the individuals in the population
is also called total enumeration.
3. Cost. Your choice of sampling method should be based also on the cost of
adopting such method without necessarily sacrificing representativeness of the
population being considered.
4. Precision. If you have to achieve precision, you will need a larger sample
because the larger the sample, the more precise the results will be.
There are two major types of sampling techniques: probability sampling and non-
probability sampling.
a. Probability sampling
1. Random sampling. Also called simple random sampling, this technique is a way of
selecting n individuals out of N such that everyone has an equal chance of being
selected. Sample individuals are selected at points entirely at random within the
population. This technique is suitable for homogeneous populations.
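Simple random sampling is easy to sketch in code; the frame of 1,000 numbered units and the sample size of 50 below are illustrative:

```python
# Sketch of simple random sampling: each of the N frame members has an
# equal chance of entering the sample of size n.
import random

random.seed(1)                      # fixed seed, for reproducibility
frame = list(range(1, 1001))        # sampling units numbered 1..N
sample = random.sample(frame, 50)   # n = 50, chosen without replacement

print(len(sample), len(set(sample)))  # 50 50 (no unit chosen twice)
```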
are: “freshmen,” “sophomore,” “junior,” and “senior.” What do you do to select your
sample from each of these groups of students to ensure that you get a cross-section of
the UPLB studentry? If you select your sample by simple random selection, there is a
chance that you will end up with a sample composed more of seniors or juniors rather
than representative groups of students in all classifications.
5. Strip sampling. Under this technique, you divide the area to be sampled into narrow
strips. Then, select a number of strips at random either by complete randomization or
with some degree of stratification. Sometimes you may consider only a part of the strip
as a sample unit.
b. Non-probability sampling
is determined to a certain extent by the characteristics of the population so that the
quota sample will be representative of the population. This is commonly used in
opinion research, where interviewers are just given specific quotas or number of
respondents to interview. This technique is very economical and simple, but it must be
used with caution as it allows for a wide latitude of interviewer’s choices which may
result in biases. The assumption here, however, is that field investigators have high
integrity and they have undergone thorough training.
One reason for sampling randomly from a population is to avoid biases (such as
choosing mainly stay-at-home mothers because they are easier to contact). An equally
important reason is that random sampling allows you to use probability to make inferences
about unknown population parameters. If sampling were not random, there would be no
basis for using probability to make such inferences.
On top of the basic sampling techniques that are commonly used, you can introduce a
system where you can ensure that the final sample of your study is really representative of the
population comprised of individuals that may come in clusters or groups. This is called
proportional sampling and there is a simple formula that would enable you to arrive at a complete
sample that is representative of the segments of the population.
For instance, you want to obtain a sample sufficiently representative of the barangays or
villages in a town. You know that the barangays differ in total number of individuals living in them.
So you decide that those with larger population should be represented by more respondents. How
then would you determine the number of respondents coming from each village?
To determine the sample size, Slovin's formula is commonly used for smaller populations:

n = N / (1 + Ne²)

where n = sample size, N = population size, and e = margin of error.
For example:
Suppose you wanted to determine the sample size for your study on households’ taste
preference for a new variety of ice cream. The study will be conducted in Sto. Nino, Paranaque
City, which has a total of 4,921 households (PSA Census on Population 2000 data).

Solution:

Using a 5% margin of error (e = 0.05), n = 4921 / (1 + 4921(0.05)²) = 4921 / 13.3025 ≈ 370.

Hence, your sample size is only 370 households out of the 4,921. This represents the
number of respondents that you will survey for your study.
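The calculation can be checked in code. The 5% margin of error is an assumption; the text does not state e, but e = 0.05 is the customary choice and reproduces the 370-household answer:

```python
# Slovin's formula n = N / (1 + N * e^2), using the example's figures.
import math

N = 4921   # households in Sto. Nino, Paranaque City
e = 0.05   # assumed margin of error
n = N / (1 + N * e ** 2)

print(math.ceil(n))  # 370 (round up so accuracy is not overstated)
```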
INTRODUCTION TO ESTIMATION
The mathematical procedures appropriate for performing this estimation depend on which
properties of the population are of interest and which type of random sampling scheme is used.
Because the details are considerably more complex for more complex sampling schemes such
as multistage sampling, we will focus on simple random samples, where the mathematical details
are relatively straightforward. Details for other sampling schemes such as stratified sampling can
be found in Levy and Lemeshow (1999). However, even for more complex sampling schemes,
the concepts are the same as those we discuss here; only the details change.
There are two basic sources of errors that can occur when you sample randomly from a
population: sampling error and all other sources, usually lumped together as non-sampling
error. Sampling error results from “unlucky” samples. As such, the term error is somewhat
misleading.
Suppose, for example, that the mean household income in Indiana is $58,225. (We can only
assume that this is the true value. It wouldn’t actually be known without taking a census.) A
government agency wants to estimate this mean, so it randomly samples 500 Indiana households
and finds that their average household income is $60,495. If the agency then infers that the mean
of all Indiana household incomes is $60,495, the resulting sampling error is the difference
between the reported value and the true value: $60,495 - $58,225 = $2,270. Note that the agency
hasn’t done anything wrong. This sampling error is essentially due to bad luck.
Non-sampling error is quite different and can occur for a variety of reasons.
a. nonresponse bias. This occurs when a portion of the sample fails to respond to the survey.
Anyone who has ever conducted a questionnaire, whether by mail, by phone, or any other
method, knows that the percentage of non-respondents can be quite large. The question is
whether this introduces estimation error. If the non-respondents would have responded
similarly to the respondents, you don’t lose much by not hearing from them. However, because
the non-respondents don’t respond, you typically have no way of knowing whether they differ
in some important respect from the respondents. Therefore, unless you are able to persuade
the non-respondents to respond—through a follow-up email, for example—you must guess at
the amount of nonresponse bias.
b. non-truthful responses. This is particularly a problem when there are sensitive questions in
a questionnaire. For example, if the questions “Have you ever had an abortion?” or “Do you
regularly use cocaine?” are asked, most people will answer “no,” regardless of whether the
true answer is “yes” or “no.”
c. measurement error. This occurs when the responses to the questions do not reflect what the
investigator had in mind. It might result from poorly worded questions, questions the
respondents don’t fully understand, questions that require the respondents to supply
information they don’t have, and so on. Undoubtedly, there have been times when you were
filling out a questionnaire and said to yourself, “OK, I’ll answer this as well as I can, but I know
it’s not what they want to know.”
d. voluntary response bias. This occurs when the subset of people who respond to a survey
differ in some important respect from all potential respondents. For example, suppose a
population of students is surveyed to see how many hours they study per night. If the students
who respond are predominantly those who get the best grades, the resulting sample mean
number of hours could be biased on the high side.
A point estimate is a single numeric value, a “best guess” of a population parameter, based on
the data in a random sample.
The sampling error (or estimation error) is the difference between the point estimate and the true
value of the population parameter being estimated.
The sampling distribution of any point estimate is the distribution of the point estimates from all
possible samples (of a given sample size) from the population.
A confidence interval is an interval around the point estimate, calculated from the sample data,
that is very likely to contain the true value of the population parameter.
An unbiased estimate is a point estimate such that the mean of its sampling distribution is equal
to the true value of the population parameter being estimated.
The standard error of an estimate is the standard deviation of the sampling distribution of the
estimate. It measures how much estimates vary from sample to sample.
SAMPLE SIZE SELECTION
The problem of selecting the appropriate sample size in any sampling context is not an
easy one (as illustrated in the chapter opener), but it must be faced in the planning stages, before
any sampling is done. We focus here on the relationship between sampling error and sample size.
As we discussed previously, the sampling error tends to decrease as the sample size increases,
so the desire to minimize sampling error encourages us to select larger sample sizes. We should
note, however, that several other factors encourage us to select smaller sample sizes. The
ultimate sample size selection must achieve a trade-off between these opposing forces.
The determination of sample size is usually driven by sampling error considerations. If you
want to estimate a population mean with a sample mean, then the key is the standard error of the
mean, given by
SE(X̄) = σ/√n
where σ is the population standard deviation and n is the sample size.
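As a quick sketch of this relationship (Python is used here purely for illustration; the population standard deviation of 20 is a made-up figure), notice how the standard error shrinks only with the square root of the sample size:

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: SE = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Hypothetical population standard deviation
sigma = 20.0

# The standard error shrinks as the sample size grows
se_25 = standard_error(sigma, 25)    # 20 / 5  = 4.0
se_100 = standard_error(sigma, 100)  # 20 / 10 = 2.0
se_400 = standard_error(sigma, 400)  # 20 / 20 = 1.0

# Quadrupling n only halves the standard error, which is why very large
# samples offer diminishing returns on precision (the trade-off above).
```

This is the numerical face of the trade-off: each halving of the sampling error costs four times as many observations.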
CONFIDENCE INTERVALS
In the world of statistics, you can look at what applies to the sample and try to determine
the population. You know that the sample cannot be a 100 percent replica of the population. There
will be minor differences, and perhaps major ones too. How do you figure out whether the
sample statistics are applicable to the population? To answer this, you look at the confidence
interval. Confidence intervals enable you to understand the accuracy that you can expect when
you take the sample statistics and apply them to the population.
In other words, a confidence interval gives you a range of values within which you can
expect the population statistics to be.
In statistics there is a term called the margin of error, which defines the maximum
expected difference between the population parameter and the sample statistic. It is often an
indicator of the random sampling error, and it is expressed as a likelihood or probability that the
result from the sample is close to the value that would have been calculated if you could have
calculated the statistic for the population. The margin of error is calculated when you observe
many samples instead of one sample. When you look at 50 people coming in for an interview and
find that 5 people do not arrive at the correct time, you can conclude that the margin of error is 5
÷ 50, which is equal to 10 percent. Therefore, the absolute margin of error, which is five people,
is converted to a relative margin of error, which is 10 percent.
Now, what is the chance that when you observe many samples of 50 people, you will find
that in each sample 5 people do not come at the designated time of interview? If you find that, out
of 100 samples, in 99 samples 5 people do not come in on time for an interview, you can say
with 99 percent confidence that the margin of error is 10 percent.
Why should there be any margin of error if the sample is a mirror image of the population?
The answer is that there is no sample that will be a 100 percent replica of the population. But it
can be very close. Thus, the margin of error can be caused because of a sampling error or
because of a nonsampling error.
You already know that the chance that the sample is off the mark will decrease as the
sample size increases. The more people/products that you have in your sample size, the more
likely you will get a statistic that is very close to the population statistic. Thus, the margin of error
in a sample is equal to 1 divided by the square root of the number of observations in the sample.
𝒆 = 𝟏/√𝒏
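The interview example and the 1/√n rule of thumb can be sketched as follows (Python for illustration only; the sample sizes are hypothetical):

```python
import math

def relative_margin_of_error(errors, sample_size):
    """Absolute count of errors converted to a relative margin of error."""
    return errors / sample_size

def approx_margin_of_error(n):
    """Rough margin of error for a sample of n observations: e = 1/sqrt(n)."""
    return 1 / math.sqrt(n)

# The interview example from the text: 5 late arrivals out of 50 people
moe = relative_margin_of_error(5, 50)  # 0.10, i.e., 10 percent

# The 1/sqrt(n) approximation for some common sample sizes
print(approx_margin_of_error(100))   # 0.1 -> about plus or minus 10% with n = 100
print(approx_margin_of_error(1000))  # about 0.032 -> roughly 3% with n = 1,000
```

Note how going from 100 to 1,000 observations cuts the margin of error only by a factor of about three, not ten.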
WHAT IS THE P-VALUE?
You know that hypothesis testing is used to confirm or reject whether two samples belong
to the same population. The p-value is the probability that helps determine whether the two
samples come from the same population. This probability is a measure of evidence against the
null hypothesis.
Thus, a smaller p-value means that you can reject the null hypothesis because the
probability of the two samples having similar means (which would point to the two samples
coming from the same population) is very low (a p-value of .05 corresponds to a 5 percent
probability).
The small p-value corresponds to strong evidence, and if the p-value is below a predefined
limit (.05 is the default value in most software), then the result is said to be statistically significant.
For example, if the hypothesis is that a new type of medicine is better than the old version,
then the first attempt is to prove that the drugs are not similar (that any apparent similarity is so
small that it can be random coincidence). Then the null hypothesis of the two drugs being the
same needs to be rejected. A small p-value signifies that the observed result would be very
unlikely if the null hypothesis were true.
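A minimal sketch of this decision rule, using a large-sample two-sample z-test as a stand-in for the t-test (with reasonably large samples the two give nearly identical p-values); the drug measurements below are entirely hypothetical:

```python
import math
from statistics import NormalDist, mean, stdev

def two_sample_p_value(sample1, sample2):
    """Two-sided large-sample z-test for equal means (a stand-in for a
    t-test; the p-value is the area in both tails beyond the statistic)."""
    n1, n2 = len(sample1), len(sample2)
    se = math.sqrt(stdev(sample1) ** 2 / n1 + stdev(sample2) ** 2 / n2)
    z = (mean(sample1) - mean(sample2)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical response times (hours) under the old and new drug
old_drug = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 12.0]
new_drug = [10.2, 10.5, 10.1, 10.4, 10.3, 10.6, 10.2, 10.4]

p = two_sample_p_value(old_drug, new_drug)
if p < 0.05:
    print("Reject the null: the samples likely come from different populations")
```

Here the difference in means is large relative to the variability, so the p-value falls well below .05 and the null hypothesis of equal drugs is rejected.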
This distribution is the distribution of the test statistic under the assumption that the null
hypothesis is true. Thus, when the p-value (the probability of observing such a result if the null
hypothesis were true) is less than .05 (or any other value set for the test), you have to reject the
null and conclude that any apparent equality of Mean 1 and Mean 2 is due only to coincidence or
chance.
ERRORS IN HYPOTHESIS TESTING
No hypothesis test is 100 percent certain. As you have noticed, tests are based on
probability, and therefore there is always a chance of an incorrect conclusion. These incorrect
conclusions can be of two types:
▪ Type I error, alpha: This is when the null hypothesis is true but you reject the null. Alpha
is the level of significance that you have set for the test. At a significance of .05, you are
willing to accept a 5 percent chance that you will incorrectly reject the null hypothesis. To
lower the risk, you can choose a lower value of significance. A type I error is generally
reported as the p-value.
▪ Type II error, beta: This is the error of incorrectly accepting the null. The probability of
making a type II error depends on the power of the test.
You can decrease your risk of committing a type II error by ensuring your sample size is
large enough to detect a practical difference when one truly exists. The confidence level is
equivalent to 1 minus the alpha level.
When the significance level is 0.05, the corresponding confidence level is 95 percent.
▪ If the p-value is less than the significance (alpha) level, the hypothesis test is statistically
significant.
▪ If the confidence interval does not contain the null hypothesis value between the upper
and lower confidence limits, the results are statistically significant (the null can be
rejected).
▪ If the p-value is less than the alpha, the confidence interval will not contain the null
hypothesis value.
▪ The confidence interval and p-value will always lead to the same conclusion.
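The agreement between the p-value and the confidence interval can be checked directly. The sketch below uses a z-based test and interval (an approximation for small samples) on hypothetical data with a null-hypothesis mean of 50:

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample; null hypothesis: population mean = 50
sample = [52.1, 53.4, 51.8, 52.9, 53.0, 52.5, 51.9, 52.7, 53.2, 52.4]
null_mean = 50.0

n = len(sample)
se = stdev(sample) / math.sqrt(n)
z = (mean(sample) - null_mean) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval around the sample mean (z-based approximation)
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96
lower = mean(sample) - z_crit * se
upper = mean(sample) + z_crit * se

# The two criteria agree: p < .05 exactly when the interval excludes 50
assert (p_value < 0.05) == (not (lower <= null_mean <= upper))
```

With this sample the interval sits well above 50 and the p-value is tiny, so both criteria reject the null, as the text states they always will together.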
The most valuable usage of hypothesis testing is in interpreting the robustness of other
statistics generated while solving the problem/doing the project.
▪ Correlation coefficient: If the p-value is less than or equal to .05, you can conclude that
the correlation coefficient displayed/calculated is statistically significant. If the p-value is
greater than .05, you have to conclude that the observed correlation could be because of
chance/coincidence.
▪ Linear regression coefficients: If the p-value is less than or equal to .05, you can
conclude that the coefficients displayed/calculated are statistically significant. If the
p-value is greater than .05, you have to conclude that the observed coefficients could be
because of chance/coincidence.
SAMPLING DISTRIBUTIONS
What will happen if you are able to draw out all possible samples of 30 or more
observations from a given population/sample frame? For each of these samples, you could
compute the descriptive statistics (mean, median, standard deviation, minimum, maximum). Now
if you were to create a probability distribution of this statistic, it would be called the sampling
distribution, and the standard deviation of this statistic would be called the standard error.
It has been found that if infinite numbers of samples are taken from the same sample
frame/population and a sample statistic (say, the mean of the samples) is plotted out, you will find
that a normal distribution emerges. Thus, most of the means will be clustered around the mean
of the sample means, which incidentally will coincide with or be very close to the population/sample
frame mean. This is as per the normal distribution rule, which states that values are concentrated
around the mean and few values will be far away from the mean (very low or very high as
compared to the mean).
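This emergence of a normal shape can be seen in a small simulation (a sketch only; the uniform "population" and sample counts are arbitrary choices):

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the simulation is reproducible

# A decidedly non-normal "population": uniform between 0 and 100,
# whose true mean is 50 and standard deviation is about 28.9
sample_means = [
    mean(random.uniform(0, 100) for _ in range(30))  # one sample of 30
    for _ in range(2000)                             # repeated 2,000 times
]

# The sample means cluster around the population mean ...
print(round(mean(sample_means), 1))   # close to 50

# ... and their spread (the standard error) is much smaller than the
# population spread: roughly 28.9 / sqrt(30), or about 5.3
print(round(stdev(sample_means), 1))
```

Even though each individual draw is uniform, the distribution of the 2,000 sample means is approximately normal and centered on the population mean, exactly as described above.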
Binomial Distribution
The basic building block of the binomial distribution is a Bernoulli random variable. This is
a variable for which there can be only two possible outcomes, and the probability of these
outcomes satisfies the conditions of a valid probability distribution function, which is that each
probability is between 0 and 1 and the total probabilities sum up to 1 or 100 percent.
Since a single observation of the outcome of a Bernoulli random variable is called a trial,
the sum of a series of such trials is distributed as a binomial distribution.
Thus, one such example is the probability of getting a tail on the toss of a coin, which is
50 percent or .5. If there are 100 such tosses, you will find that getting 0 heads and 100 tails is
very unlikely, getting 50 heads and 50 tails is the most likely, and getting 100 heads and 0 tails is
equally unlikely.
Now let’s look at a scenario where you have four possible outcomes and the probability of
getting outcome 1, 2, or 3 defines success, while getting an outcome of 4 defines failure. Thus,
the probability of success is 75 percent, and the probability of failure is 25 percent. Now if you
were to try 200 tosses again, you will find that a similar distribution occurs, but the distribution will
be more skewed or can be seen to be a bit shifted as compared to the earlier 50-50 distribution.
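The coin-toss probabilities above can be computed directly from the binomial probability formula (a sketch; the trial counts mirror the examples in the text):

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.5  # 100 fair coin tosses

# 0 tails is astronomically unlikely; 50 tails is the single most likely count
print(binomial_pmf(0, n, p))   # about 7.9e-31
print(binomial_pmf(50, n, p))  # about 0.0796

# With a 75% success probability over 200 trials, the distribution
# shifts toward higher counts: 150 successes is the most likely outcome
print(binomial_pmf(150, 200, 0.75))
```

Summing the probabilities over all possible counts gives 1, which is the "valid probability distribution" condition mentioned above.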
What if you have no prior beliefs about the distribution of probability or if you believe that
every outcome is equally possible? It’s easier when the value is discrete for a variable. When this
same condition is seen over a continuous variable, the distribution that emerges is called the
continuous uniform distribution (Figure 9). It is often used for random number generation in
simulations.
Figure 9. A continuous uniform distribution
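Drawing from a continuous uniform distribution, as simulation software does for random number generation, can be sketched in a few lines (the interval [0, 10) is an arbitrary choice):

```python
import random
from statistics import mean

random.seed(7)  # fixed seed for reproducibility

# Draw from a continuous uniform distribution on [0, 10)
draws = [random.uniform(0, 10) for _ in range(10000)]

# Every value in the interval is equally likely, so all draws stay
# inside it and the sample mean sits near the midpoint of the interval
assert all(0 <= x < 10 for x in draws)
print(round(mean(draws), 1))  # close to 5.0
```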
Poisson Distribution
Let’s look at some events that occur at a continuous rate, such as phone calls coming into
a call center. Let the rate of occurrence be lambda (λ). When the rate is small (that is, there are
only one or two calls in a day), the possibility that you will get zero calls on certain days is
high. However, say the number of calls in the call center is on average 100 per day. Then the
possibility that you will ever get zero calls in a day is very low. This distribution is called the
Poisson distribution (Figure 10).
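The call-center contrast follows directly from the Poisson probability formula, where the chance of zero events is e raised to the power of negative λ (a sketch; the daily rates are the ones used in the text):

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when events arrive at average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# A quiet call center averaging 2 calls a day: zero-call days are common
print(poisson_pmf(0, 2))    # about 0.135, roughly one day in seven

# A busy center averaging 100 calls a day: a zero-call day is essentially
# impossible, matching the intuition in the paragraph above
print(poisson_pmf(0, 100))  # about 3.7e-44
```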
PARAMETRIC TESTS
The following are some parametric tests:
▪ Student’s t-test: Student’s t-tests look at the differences between two groups across the
same variable of interest, or they look at two variables in the same sample. The
restriction is that there can be only two groups at maximum.
▪ An example is if you want to compare the grades in English for students of Class 1’s
Section A and Section B. Another example is if you want to compare the grades in Class
1’s Section A for math and for science.
o One sample t-test: When the null hypothesis reads that the mean of a variable is
less than or equal to a specific value, then that test is a one-sample t-test.
o Paired sample t-test: When the null hypothesis assumes that the mean of
variable 1 is equal to the mean of variable 2, then that test is a paired sample t-
test.
o Independent sample t-test: This compares the mean difference between two
independent groups for a given variable. The null hypothesis is that the mean for
the variable in sample 1 is equal to the mean for the same variable in sample 2.
The assumption is that the variance or standard deviation across the samples is
nearly equal.
For example, if you want to compare the grades in English for students of Class 1’s Section
A and Section B, you can use an analysis of variance (ANOVA) test as a substitute for the
Student’s t-test.
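The independent two-sample comparison can be sketched by computing the t statistic with a pooled variance (the section grades below are hypothetical; in practice a package such as SciPy's ttest_ind would also supply the p-value):

```python
import math
from statistics import mean, stdev

def pooled_t_statistic(sample1, sample2):
    """Independent two-sample t statistic assuming roughly equal variances."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * stdev(sample1) ** 2 +
           (n2 - 1) * stdev(sample2) ** 2) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical English grades for Section A and Section B of Class 1
section_a = [78, 82, 85, 74, 80, 79, 83, 77]
section_b = [72, 75, 70, 74, 73, 71, 76, 69]

t = pooled_t_statistic(section_a, section_b)
# A large |t|, judged against the t distribution with n1 + n2 - 2 degrees
# of freedom, is evidence against the null of equal section means.
print(round(t, 2))
```

Here t is close to 4.8, far into the tail of the t distribution with 14 degrees of freedom, so the two sections would be judged to differ in mean English grade.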
▪ ANOVA test: This tests the significance of differences between two or more groups across
one or more categorical variables. Thus, you will be able to figure out whether there is a
significant difference between groups, but it will not tell you which group is
different.
▪ An example is if you want to compare the grades in English for students of Class 1’s
Section A, Section B, and Section C. Another example is if you want to compare the
grades in Class 1’s Section A for math, English, and science.
o One-way ANOVA: In this test, you compare the means of a number of groups
based on one independent variable. There are some assumptions, such as that the
dependent variable is normally distributed and that the groups of the independent
variable have equal variance on the dependent variable.
o Two-way ANOVA: Here you can look at multiple groups and two variables or
factors. Again, the assumption is that there is homogeneity of variance and that the
standard deviations of the populations of all the groups are similar.
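A one-way ANOVA boils down to comparing variation between the group means with variation within the groups. The sketch below computes the F statistic by hand for hypothetical grades from three sections (statistical software would also report the p-value from the F distribution):

```python
from statistics import mean

def one_way_anova_f(*groups):
    """F statistic for one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    k = len(groups)                     # number of groups
    n = sum(len(g) for g in groups)     # total observations
    grand = mean(x for g in groups for x in g)

    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical English grades for Sections A, B, and C of Class 1
f = one_way_anova_f([78, 82, 85, 74, 80],
                    [72, 75, 70, 74, 73],
                    [81, 86, 84, 88, 83])

# A large F says at least one section mean differs, but, as noted in the
# text, it does not identify which section is the different one.
print(round(f, 2))
```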
NONPARAMETRIC TESTS
Here the data is not normally distributed. Thus, if the data is better represented by the
median instead of the mean, it is better to use nonparametric tests. It is also better to use
nonparametric tests if the data sample is small, the data is ordinal or ranked, or the data has some
outliers that you do not want to remove.
Chi-squared tests compare observed frequencies to expected frequencies and are used
across categorical variables.
As discussed, chi-square tests will be used on data that has ordinal or nominal variables. For
example, say you want to understand the population of Indian males in cities who regularly
exercise, sporadically exercise, or have not exercised over the last 20 years. Thus, you have
three responses tracked over 20 years, and you need to figure out whether the population has
shifted between year 1 and year 20. The null hypothesis here would mean that there is no change
or no difference in the situation.
The test for both years was run on 500 people. Now you would compare the year 20
statistics with what could be the expected frequencies of these people in year 20 (if the year 1
trends are followed) as compared to the observed frequencies.
The test is based on a numerical measure of the difference between the two histograms.
Let C be the number of categories in the histogram, and let Oi be the observed number of
observations in category i. Also, let Ei be the expected number of observations in category i if the
population were normal with the same mean and standard deviation as in the sample. Then the
goodness-of-fit measure in the equation below is used as a test statistic.
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, summed over the C categories
If the null hypothesis of normality is true, this test statistic has (approximately) a chi-square
distribution with C − 3 degrees of freedom. Because large values of the test statistic indicate a
poor fit (the Oi’s do not match up well with the Ei’s), the p-value for the test is the probability to
the right of the test statistic in the chi-square distribution with C − 3 degrees of freedom.
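The chi-square statistic itself is simple to compute. The sketch below uses made-up counts in the spirit of the exercise-habit survey described earlier (observed year-20 counts versus the counts expected if year-1 proportions still held):

```python
def chi_square_statistic(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for 500 men: regular, sporadic, and no exercise
observed = [180, 190, 130]   # what was actually seen in year 20
expected = [150, 200, 150]   # what year-1 proportions would predict

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 2))  # compare against a chi-square critical value

# If the two histograms were identical, the statistic would be exactly 0
assert chi_square_statistic([10, 20], [10, 20]) == 0
```

A large value, relative to the chi-square distribution with the appropriate degrees of freedom, would lead to rejecting the null hypothesis that the population has not shifted.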
CHAPTER EXERCISES
5. Suppose you are going to study the difference in buying behavior of millennials across their
demographic profiles. Formulate three hypotheses and determine what statistical treatment will be
used on each hypothesis.
SUGGESTED READINGS
CHAPTER 7
DATA MINING
OVERVIEW
The types of data analysis discussed throughout this book are crucial to the success of most
companies in today’s data-driven business world. However, the sheer volume of available
data often defies traditional methods of data analysis. Therefore, new methods— and
accompanying software—have recently been developed under the name of data mining.
OBJECTIVES
INTRODUCTION TO DATA MINING
Data mining attempts to discover patterns, trends, and relationships among data,
especially nonobvious and unexpected patterns. For example, an analysis might discover that
people who purchase skim milk also tend to purchase whole wheat bread, or that cars built on
Mondays before 10 a.m. on production line #5 using parts from supplier ABC have significantly
more defects than average. This new knowledge can then be used for more effective
management of a business.
The place to start is with a data warehouse. Typically, a data warehouse is a huge
database that is designed specifically to study patterns in data. A data warehouse is not the same
as the databases companies use for their day-to-day operations. A data warehouse should (1)
combine data from multiple sources to discover as many relationships as possible, (2) contain
accurate and consistent data, (3) be structured to enable quick and accurate responses to a
variety of queries, and (4) allow follow-up responses to specific relevant questions. In short, a
data warehouse represents a relatively new type of database, one that is specifically structured
to enable data mining. Another term you might hear is data mart. A data mart is essentially a
scaled-down data warehouse, or part of an overall data warehouse, that is structured specifically
for one part of an organization, such as sales. Virtually all large organizations, and many smaller
ones, have developed data warehouses or data marts in the past decade to enable them to better
understand their business—their customers, their suppliers, and their processes.
Once a data warehouse is in place, analysts can begin to mine the data with a collection
of methodologies and accompanying software. Some of the primary methodologies are
classification analysis, prediction, cluster analysis, market basket analysis, and forecasting. Each
of these is a large topic in itself, but some brief explanations follow.
▪ Classification analysis attempts to find variables that are related to a categorical (often
binary) variable. For example, credit card customers can be categorized as those who pay
their balances in a reasonable amount of time and those who don’t. Classification analysis
would attempt to find explanatory variables that help predict which of these two categories
a customer is in. Some variables, such as salary, are natural candidates for explanatory
variables, but an analysis might uncover others that are less obvious.
▪ Prediction is similar to classification analysis, except that it tries to find variables that help
explain a continuous variable, such as credit card balance, rather than a categorical
variable. Regression, the topic of Chapters 10 and 11, is one of the most popular prediction
tools, but there are others not covered in this book.
▪ Cluster analysis tries to group observations into clusters so that observations within a
cluster are alike, and observations in different clusters are not alike. For example, one
cluster for an automobile dealer’s customers might be middle-aged men who are not
married, earn over $150,000, and favor high-priced sports cars. Once natural clusters are
found, a company can then tailor its marketing to the individual clusters.
▪ Market basket analysis tries to find products that customers purchase together in the
same “market basket.” In a supermarket setting, this knowledge can help a manager
position or price various products in the store. In banking and other settings, it can help
managers to cross-sell (sell a product to a customer already purchasing a related product)
or up-sell (sell a more expensive product than a customer originally intended to purchase).
▪ Forecasting is used to predict values of a time series variable by extrapolating patterns
seen in historical data into the future. (This topic is covered in some detail in Chapter 12.)
This is clearly an important problem in all areas of business, including the forecasting of
future demand for products, forecasting future stock prices and commodity prices, and
many others.
Data mining is a relatively new field—or at least a new term—and not everyone agrees
with its definition. To many people, data mining is a collection of advanced algorithms that can be
used to find useful information and patterns in large data sets. Data mining does indeed include
a number of advanced algorithms, but we believe its definition should be broadened to include
relatively simple methods for exploring and visualizing data. This section discusses some of the
possibilities.
We introduced pivot tables in Chapter 4 as an amazingly easy and powerful way to break data
down by category in Excel®. However, the pivot table methodology is not limited to Excel or even
to Microsoft. This methodology is usually called online analytical processing, or OLAP. This
name was initially used to distinguish this type of data analysis from online transactional
processing, or OLTP.
When analysts began to realize that the typical OLTP databases are not well equipped to
answer these broader types of questions, OLAP was born. This led to much research into the
most appropriate database structure for answering OLAP questions. The consensus was that the
best structure is a star schema. In a star schema, there is at least one Facts table of data that has
many rows and only a few columns. For example, in a supermarket database, a Facts table might
have a row for each line item purchased, including the number of items of the product purchased,
the total amount paid for the product, and possibly the discount. Each row of the Facts table would
also list “lookup information” (or foreign keys, in database terminology) about the purchase: the
date, the store, the product, the customer, any promotion in effect, and possibly others. Finally,
the database would include a dimension table for each of these. For example, there would be a
Products table. Each row of this table would contain multiple pieces of information about a
particular product. Then if a customer purchases product 15, say, information about product 15
could be looked up in the Products table.
Most data warehouses are built according to these basic ideas. By structuring corporate
databases in this way, facts can easily be broken down by dimensions, and—you guessed it—
the methodology for doing this is pivot tables. However, these pivot tables are not just the
“standard” Excel pivot tables. You might think of them as pivot tables on steroids. The OLAP
methodology and corresponding pivot tables have several features that distinguish them
from standard Excel pivot tables.
The general approach to data analysis embodied in pivot tables is one of the most powerful
ways to explore data sets. You learned about basic Excel pivot tables in Chapter 3, and you
learned about the more general OLAP technology in the previous subsection. This subsection
describes new Microsoft tools of the pivot table variety, PowerPivot and Power View, that were
introduced in Excel 2013. Actually, PowerPivot was available as a free add-in for Excel 2010, but
two things have changed in the version that is described here. First, you no longer need to
download a separate PowerPivot add-in. In Excel 2013, you can simply add it in by checking it in
the add-ins list. Second, the details of PowerPivot have changed. Therefore, if you find a tutorial
for the older PowerPivot add-in on the Web and try to follow it for Excel 2013, you will see that
the new version doesn’t work in the same way as before. So be aware that the instructions in this
section are relevant only for PowerPivot for Excel 2013 and not for the older version.
Among other things, the PowerPivot add-in allows you to do the following:
▪ Import millions of rows from multiple data sources
▪ Create relationships between data from different sources, and between multiple tables in
a pivot table
▪ Create implicit calculated fields (previously called measures) — calculations created
automatically when you add a numeric field to the Values area of the Field List
▪ Manage data connections
Interestingly, Microsoft refers to building a data model in Excel in its discussion of PowerPivot.
This is a somewhat new Microsoft term, and they have provided the following definition.
Data Model: A collection of tables and their relationships that reflects the real-world
relationships between business functions and processes—for example, how Products relates to
Inventory and Sales.
If you have worked with relational databases, this definition is nothing new. It is essentially
the definition of a relational database, a concept that has existed for decades. The difference is
that the data model is now contained entirely in Excel, not in Access or some other relational
database package.
Visualization Software
As the Power View tool from the previous subsection illustrates, you can gain a lot of
insight by using charts to view your data in imaginative ways. This trend toward powerful charting
software for data visualization is the wave of the future and will certainly continue. Although this
book is primarily about Microsoft software—Excel—many other companies are developing
visualization software. To get a glimpse of what is currently possible, you can watch the
accompanying video about a free software package, Tableau Public, developed by Tableau
Software. Perhaps you will find other visualization software packages, free or otherwise, that rival
Tableau or Power View. Alternatively, you might see blogs with data visualizations from ordinary
users. In any case, the purpose of charting software is to portray data graphically so that otherwise
hidden trends or patterns can emerge clearly.
The methods discussed so far in this chapter, all of which basically revolve around pivot
tables, are extremely useful for data exploration, but they are not always included in discussions
of “data mining.” To many analysts, data mining refers only to the algorithms discussed in the
remainder of this chapter. These include, among others, algorithms for classification and for
clustering. (There are many other types of data mining algorithms not discussed in this book.)
Many powerful software packages have been developed by software companies such as SAS,
IBM SPSS, Oracle, Microsoft, and others to implement these data mining algorithms.
Unfortunately, this software not only takes time to master, but it is also quite expensive. The only
data mining algorithms discussed here that are included in the software that accompanies the
book are logistic regression and neural nets, two classification methods that are part of the
Palisade suite, and they are discussed in the next section.
To provide you with illustrations of other data mining methods, we will briefly discuss
Microsoft data mining add-ins for Excel. The good news is that these add-ins are free and easy
to use. You can find them by searching the Web for Microsoft Data Mining Add-ins.
The names of these add-ins provide a clue to their downside. These add-ins are really
only front ends—client tools—for the Microsoft engine that actually performs the data mining
algorithms. This engine is called Analysis Services and is part of Microsoft’s SQL Server database
package. (SQL Server Analysis Services is often abbreviated as SSAS.) In short, Microsoft
decided to implement data mining in SSAS. Therefore, to use its Excel data mining add-ins, you
must have a connection to an SSAS server. This might be possible in your academic or corporate
setting, but it can definitely be a hurdle.
Classification Methods
The previous section introduced one of the most important problems studied in data
mining, the classification problem. This is basically the same problem attacked by regression
analysis—using explanatory variables to predict a dependent variable—but now the dependent
variable is categorical. It usually has two categories, such as Yes and No, but it can have more
than two categories, such as Republican, Democrat, and Independent. This problem has been
analyzed with very different types of algorithms, some regression-like and others very different
from regression, and this section discusses three of the most popular classification methods. But
each of the methods has the same objective: to use data from the explanatory variables to classify
each record (person, company, or whatever) into one of the known categories.
Before proceeding, it is important to discuss the role of data partitioning in classification
and in data mining in general. Data mining is usually used to explore very large data sets, with
many thousands or even millions of records. Therefore, it is very possible, and also very useful,
to partition the data set into two or even three distinct subsets before the algorithms are applied.
Each subset has a specified percentage of all records, and these subsets are typically chosen
randomly. The first subset, usually with about 70% to 80% of the records, is called the training
set. The second subset, called the testing set, usually contains the rest of the data. Each of these
sets should have known values of the dependent variable. Then the algorithm is trained with the
data in the training set. This results in a model that can be used for classification. The next step
is to test this model on the testing set. It is very possible that the model will work quite well on the
training set because this is, after all, the data set that was used to create the model. The real
question is whether the model is flexible enough to make accurate classifications in the testing
set.
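The 70/30 random partitioning step described above can be sketched in a few lines (a minimal stand-in for the partitioning utilities in real packages, such as scikit-learn's train_test_split; the record IDs are placeholders for a customer data set):

```python
import random

def partition(records, train_fraction=0.7, seed=1):
    """Randomly split records into a training set and a testing set."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = records[:]           # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical record IDs standing in for a customer data set
records = list(range(1000))
training, testing = partition(records)

print(len(training), len(testing))  # 700 300
# The model is fit on `training`; its real quality is judged on `testing`,
# which the algorithm never saw during training.
```

Keeping the testing set untouched during training is what makes it a fair check on whether the model generalizes rather than merely memorizes.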
Most data mining software packages have utilities for partitioning the data. (In the
following subsections, you will see that the logistic regression procedure in StatTools does not
yet have partitioning utilities, but the Palisade NeuralTools add-in for neural networks does have
them, and the Microsoft data mining add-in for classification trees also has them.) The various
software packages might use slightly different terms for the subsets, but the overall purpose is
always the same, as just described. They might also let you specify a third subset, often called a
prediction set, where the values of the dependent variable are unknown. Then you can use the
model to classify these unknown values. Of course, you won’t know whether the classifications
are accurate until you learn the actual values of the dependent variable in the prediction set.
Logistic Regression
Logistic regression is a popular method for classifying individuals, given the values of a
set of explanatory variables. It estimates the probability that an individual is in a particular
category. As its name implies, logistic regression is somewhat similar to the usual regression
analysis, but its approach is quite different. It uses a nonlinear function of the explanatory
variables for classification.
One simple approach would be to create a 0-1 dummy variable for the two categories and run
ordinary regressions on the data, using the dummy variable as the dependent variable. However,
this approach has two serious drawbacks. First, it violates the regression assumption that the error
terms should be normally distributed. Second, the predicted values of the dependent variable can
be between 0 and 1, less than 0, or greater than 1. If you want a predicted value to estimate a
probability, then values less than 0 or greater than 1 make no sense.
Therefore, logistic regression takes a slightly different approach. Let X1 through Xk be the
potential explanatory variables, and create the linear function b0 + b1X1 + ⋯ + bkXk. Unfortunately,
there is no guarantee that this linear function will be between 0 and 1, and hence that it will qualify
as a probability. But the nonlinear function
1/(1 + e^−(b0 + b1X1 + ⋯ + bkXk))
is always between 0 and 1. In fact, the function f(x) = 1/(1 + e^−x) is an “S-shaped logistic” curve,
as shown in Figure 17.16. For large negative values of x, the function approaches 0, and for large
positive values of x, it approaches 1.
The logistic regression model uses this function to estimate the probability that any
observation is in category 1. Specifically, if p is the probability of being in category 1, the model is
p = 1/(1 + e^−(b0 + b1X1 + ⋯ + bkXk))
or, equivalently,
ln(p/(1 − p)) = b0 + b1X1 + ⋯ + bkXk
This equation says that the natural logarithm of p/(1 − p) is a linear function of the
explanatory variables. The ratio p/(1 − p) is called the odds ratio.
The odds ratio is a term frequently used in everyday language. Suppose, for example, that
the probability p of a company going bankrupt is 0.25. Then the odds that the company will go
bankrupt are p/(1 − p) = 0.25/0.75 = 1/3, or “1 to 3.” Odds ratios are probably most common in
sports. If you read that the odds against Indiana winning the NCAA basketball championship are
4 to 1, this means that the probability of Indiana winning the championship is 1/5. Or if you read
that the odds against Purdue winning the championship are 99 to 1, then the probability that
Purdue will win is only 1/100.
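Converting between probabilities and odds in these examples is simple arithmetic. A small sketch (the function names are mine, not from the text):

```python
def prob_from_odds_against(a, b):
    """Odds of 'a to b' *against* an event correspond to probability b/(a + b)."""
    return b / (a + b)

def odds_against(p):
    """Probability p corresponds to odds of (1 - p)/p to 1 against the event."""
    return (1 - p) / p

# The examples from the text:
print(prob_from_odds_against(4, 1))    # Indiana: 4-to-1 against -> 0.2, i.e. 1/5
print(prob_from_odds_against(99, 1))   # Purdue: 99-to-1 against -> 0.01, i.e. 1/100
print(odds_against(0.25))              # bankruptcy p = 0.25 -> 3.0, i.e. 3 to 1 against
```

The bankruptcy example states the odds *for* the event as p/(1 − p) = 1/3 (“1 to 3”); stated the other way around, as sportsbooks do, that is 3 to 1 against.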
The logarithm of the odds ratio, the quantity on the left side of the above equation, is called
the logit (or log odds). Therefore, the logistic regression model states that the logit is a linear
function of the explanatory variables. Although this is probably a bit mysterious and there is no
easy way to justify it intuitively, logistic regression has produced useful results in many
applications.
Although the numerical algorithm used to estimate the regression coefficients is complex,
the important goal for our purposes is to interpret the regression coefficients correctly. First, if a
coefficient b is positive, then if its X increases, the log odds increases, so the probability of being
in category 1 increases. The opposite is true for a negative b. So just by looking at the signs of
the coefficients, you can see which Xs are positively correlated with being in category 1 (the
positive bs) and which are positively correlated with being in group 0 (the negative bs). You can
also look at the magnitudes of the bs to try to see which of the Xs are “most important” in
explaining category membership. Unfortunately, you run into the same problem as in regular
regression. Some Xs are typically of completely different magnitudes than others, which makes
comparisons of the bs difficult. For example, if one X is income, with values in the thousands, and
another X is number of children, with values like 0, 1, and 2, the coefficient of income will probably
be much smaller than the coefficient of children, even though these two variables might be equally
important in explaining category membership.
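One common workaround for this scale problem, sketched below with entirely made-up data and coefficients, is to compare each coefficient multiplied by its variable's standard deviation, which puts every variable on the scale of a one-standard-deviation change.

```python
import statistics

# Hypothetical data and hypothetical fitted logistic regression coefficients
incomes  = [30000, 55000, 80000, 45000, 62000]
children = [0, 1, 2, 3, 1]
b_income, b_children = 0.00002, 0.9   # made-up coefficients for illustration

# A raw coefficient measures the effect of a one-unit change, but one unit of
# income is tiny while one extra child is large. Comparing b * stdev(X)
# measures the effect of a one-standard-deviation change in each variable.
effect_income   = b_income   * statistics.stdev(incomes)
effect_children = b_children * statistics.stdev(children)
print(effect_income, effect_children)
```

With these invented numbers the raw income coefficient looks negligible, yet the standardized effects are of comparable size, which is the point of the comparison.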
Classification Trees
The classification method discussed so far, logistic regression, uses a complex nonlinear
function to capture the relationship between the explanatory variables and a categorical
dependent variable. The method discussed in this subsection, classification trees, is
also capable of discovering nonlinear relationships, but it is much more intuitive. This method,
which has many variations, has existed for decades, and it has been implemented in a variety of
software packages. Unfortunately, it is not available in any of the software that accompanies this
book, but it is available in the free Microsoft Data Mining Add-Ins discussed earlier. The essential
features of the method are explained here, and the accompanying video, Decision Trees with
Microsoft Data Mining Add-In, illustrates the method.
The attractive aspect of this method is that the final result is a set of simple rules for
classification. As an example, the final tree might look like the one in Figure 13. Each box has a
bar that shows the purity of the corresponding box, where blue corresponds to Yes values and
red corresponds to No values. The first split, actually a three-way split, is on Mall Trips: fewer
than 4, 4 or 5, and at least 6. Each of these is then split in a different way. For example, when
Mall Trips is fewer than 4, the split is on Nbhd West versus Nbhd not West. The splits you see
here are the only ones made. They achieve sufficient purity, so the algorithm stops splitting after
these.
Predictions are then made by majority rule. As an example, suppose a person has made
3 mall trips and lives in the East. This person belongs in the second box down on the right, which
has a large majority of No values. Therefore, this person is classified as a No. In contrast, a person
with 10 mall trips belongs in one of the two bottom boxes on the right. This person is classified as
a Yes because both of these boxes have a large majority of Yes values. In fact, the last split on
Age is not really necessary.
In words, the tree is equivalent to the following classification rules:
▪ If the person makes fewer than 4 mall trips:
o If the person lives in the West, classify as a trier.
o If the person doesn’t live in the West, classify as a nontrier.
▪ If the person makes 4 or 5 mall trips:
o If the person doesn’t live in the East, classify as a trier.
o If the person lives in the East, classify as a nontrier.
▪ If the person makes at least 6 mall trips, classify as a trier.
The ability of classification trees to provide such simple rules, plus fairly accurate
classifications, has made this a very popular classification technique.
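Rules like these translate directly into code. The sketch below assumes the fewer-than-4 branch splits on West versus not West, with the West side classified as triers; that direction is an inference from the text's East example, not something shown explicitly.

```python
def classify(mall_trips, nbhd):
    """Apply classification rules read off a fitted classification tree."""
    if mall_trips < 4:
        # Split on neighborhood: West vs. not West (direction inferred from the text)
        return "trier" if nbhd == "West" else "nontrier"
    if mall_trips <= 5:          # 4 or 5 mall trips
        return "nontrier" if nbhd == "East" else "trier"
    return "trier"               # at least 6 mall trips

print(classify(3, "East"))   # -> nontrier (the example person from the text)
print(classify(10, "West"))  # -> trier (10 mall trips, regardless of neighborhood)
```

This is exactly why the method is popular: the fitted tree is a handful of if-then rules anyone can apply by hand.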
Clustering
In data mining terminology, the classification methods in the previous section are called
supervised data mining techniques. This term indicates that there is a dependent variable the
method is trying to predict. In contrast, the clustering methods discussed briefly in this section are
called unsupervised data mining techniques. Unsupervised methods have no dependent
variable. Instead, they search for patterns and structure among all of the variables.
Clustering is probably the most common unsupervised method, and it is the only one
discussed here. However, another popular unsupervised method you might encounter is market
basket analysis (also called association analysis), where patterns of customer purchases are
examined to see which items customers tend to purchase together, in the same “market basket.”
This analysis can be the basis for product shelving arrangements, for example.
Clustering methods have existed for decades, and a wide variety of clustering methods
have been developed and implemented in software packages. The key to all of these is the
development of a dissimilarity measure. Specifically, to compare two rows in a data set, you need
a numeric measure of how dissimilar they are. Many such measures are used. For example, if
two customers have the same gender, they might get a dissimilarity score of 0, whereas two
customers of different genders might get a dissimilarity score of 1. Or if the incomes of two
customers are compared, they might get a dissimilarity score equal to the squared difference
between their incomes. The dissimilarity scores for different variables are then combined in some
way, such as normalizing and then summing, to get a single dissimilarity score for the two rows
as a whole.
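A minimal sketch of such a combined dissimilarity score, for hypothetical rows with just a gender and an income field (the normalizing income scale is an assumption, chosen only so the two pieces are roughly comparable):

```python
def dissimilarity(a, b, income_scale=50000.0):
    """Combine per-variable dissimilarities into one score for two rows.

    Gender contributes 0 (same) or 1 (different); income contributes the
    squared difference, divided by a typical income scale so that the two
    pieces are on roughly the same footing before summing.
    """
    gender_d = 0.0 if a["gender"] == b["gender"] else 1.0
    income_d = ((a["income"] - b["income"]) / income_scale) ** 2
    return gender_d + income_d

row1 = {"gender": "F", "income": 40000}
row2 = {"gender": "M", "income": 65000}
print(dissimilarity(row1, row2))   # -> 1.25 (gender differs: 1.0; income: 0.25)
```

A clustering algorithm then groups rows so that dissimilarity is small within clusters and large between them.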
In any case, once an algorithm has discovered, say, five clusters, your job is to understand
(and possibly name) these clusters. You do this by exploring the distributions of variables
in different clusters. For example, you might find that one cluster is composed mostly of older
women who live alone and have modest incomes, whereas another cluster is composed mostly
of wealthy married men.
CHAPTER EXERCISES
2. How does the OLAP methodology allow you to drill down in a pivot table?
3. What is the main purpose of logistic regression? How does it differ from the regression
discussed in the previous chapter?
SUGGESTED READINGS
REFERENCES
Albright, S.C. & Winston, W. (2015). Business Analytics: Data Analysis and Decision Making,
Fifth Edition. Cengage Learning, USA.
Inmon, W. (2002). Building the Data Warehouse, 3rd ed. John Wiley & Sons, Inc., Canada.
Tripathi, S.S. (2016). Learn Business Analytics in Six Steps Using SAS and R. Apress Media, LLC.