
CCW331

BUSINESS ANALYTICS
(Unit 1- Unit 3)
CCW331 BUSINESS ANALYTICS (L T P C: 2 0 2 3)

Syllabus

UNIT 1 Introduction to Business Analytics 6

Analytics and data science-Analytics Life cycle-Types of analytics-Business Problem Definition-Data

Collection-Data Preparation-Hypothesis Generation- Modeling – Validation and Evaluation –

Interpretation – Deployment and Iteration.

UNIT 2 Business Intelligence 6

Data Warehouses and Data Mart - Knowledge Management –Types of Decisions - Decision Making Process

- Decision Support Systems – Business Intelligence –OLAP – Analytic functions.

UNIT 3 Business Forecasting 6

Introduction to Business Forecasting and Predictive analytics - Logic and Data Driven Models – Data

Mining and Predictive Analysis Modeling – Machine Learning for Predictive analytics.

UNIT 4 HR & supply chain Analytics 6

Human Resources – Planning and Recruitment – Training and Development - Supply chain network -

Planning Demand, Inventory and Supply – Logistics – Analytics applications in HR & Supply Chain -

Applying HR Analytics to make a prediction of the demand for hourly employees for a year.

UNIT 5 Marketing & Sales Analytics 6

Marketing Strategy, Marketing Mix, Customer Behavior –selling Process – Sales Planning –

Analytics applications in Marketing and Sales - predictive analytics for customers' behavior in marketing

and sales.
TOTAL(L):30 PERIODS

LIST OF EXPERIMENTS:

Use MS-Excel and Power-BI to perform the following experiments using a business data set, and make

presentations.

Students may be encouraged to bring their own real-time socially relevant data set.

I Cycle – MS Excel

1. Explore the features of Ms-Excel.

2. (i) Get the input from user and perform numerical operations (MAX, MIN, AVG, SUM,

SQRT, ROUND)

ii) Perform data import/export operations for different file formats.

3. Perform statistical operations - Mean, Median, Mode and Standard deviation, Variance,

Skewness, Kurtosis

4. Perform Z-test, T-test & ANOVA

5. Perform data pre-processing operations i) Handling Missing data ii) Normalization

6. Perform dimensionality reduction operation using PCA, KPCA & SVD

7. Perform bivariate and multivariate analysis on the dataset.

8. Apply and explore various plotting functions on the data set.

II Cycle – Power BI Desktop

9. Explore the features of Power BI Desktop

10. Prepare & Load data


11. Develop the data model

12. Perform DAX calculations

13. Design a report

14. Create a dashboard and perform data analysis

15. Presentation of a case study

30 PERIODS

UNIT 1
INTRODUCTION TO BUSINESS ANALYTICS

ANALYTICS AND DATA SCIENCE:


Definition:
Analytics generally refers to the science of manipulating data by applying different models
and statistical formulae on it to find insights.
These insights are the key factors that help us solve various problems. These problems may
be of many types, and when we work with data to find insights and solve business-related
problems, we are actually doing Business Analytics.
The tools used for analytics may range from spreadsheets to predictive analytics for complex
business problems. The process includes using these tools to draw out patterns and identify
relationships. Next, new questions are asked and the iterative process starts again and
continues until the business goal is achieved.
Business analytics refers to a subset of several methodologies, such as data mining, statistical
analysis, and predictive analytics, to analyze and transform data into useful information.
Business analytics is also used to identify and anticipate trends and outcomes. With the help
of these results, it becomes easier to make data-driven business decisions.
The use of business analytics is very popular in some industries such as healthcare,
hospitality, and any other business that has to track or closely monitor its customers. Many
high-end business analytics software solutions and platforms have been developed to ingest
and process large data sets.
Business Analytics Examples
Some of the examples of Business Analytics are:
● A simple example of Business Analytics would be working with data to find out what would
be the optimal price point for a product that a company is about to launch. While doing this
research, there are a lot of factors that it would have to take into consideration before arriving
at a solution.
● Another example would be applying Business Analytics techniques to identify and figure out
how many and which customers are likely to cancel the subscription
● One of the highly appreciated examples of Business Analytics is working with available data
to figure out and assess how and why the tastes and preferences of customers who regularly
visit a particular restaurant change over time.
Components of Business Analytics
Modern world business strategies are centered around data. Business Analytics, Machine
Learning, Data Science, etc. are used to arrive at solutions for complex and specific business
problems. Even though all of these have various components, the core components still
remain similar. Following are the core components of Business Analytics:
● Data Storage– Data is stored by computers in a way that allows it to be used again in the
future. Keeping and managing this data on storage devices is known as data storage. Object
storage, block storage, etc. are some of the storage products and services.
● Data Visualization– It is the process of graphically representing the information or insights
drawn through the analysis of data. Data visualization makes the communication of outputs
to the management easier in simple terms.
● Insights– Insights are the outputs and inferences drawn from the analysis of data by
implementing business analytics techniques and tools.
● Data Security– One of the most important components of Business Analytics is Data
Security. It involves monitoring and identifying malicious activities in the security networks.
Real-time data and predictive modelling techniques are used to identify vulnerabilities in the
system.

TYPES OF BUSINESS ANALYTICS


There are various types of analytics that are performed on a daily basis across many
companies. Let’s understand each one of them in this section.
Descriptive Analytics
Whenever we are trying to answer questions such as “What were the sales figures last year?”
or “What has occurred before?”, we are basically doing descriptive analysis. In descriptive
analysis, we describe or summarize the past data and transform it into easily comprehensible
forms, such as charts or graphs.
Example –
Take DMart as an example: by looking at the sales history, we can find out which products
have sold the most or are in the greatest demand, and based on that analysis decide to stock
those items in larger quantities for the coming year.
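To make this concrete, here is a minimal descriptive-analytics sketch in pandas; the products and sales figures are hypothetical and not taken from DMart.

import pandas as pd

# Hypothetical past sales records for a retail store
sales = pd.DataFrame({
    "product": ["rice", "rice", "oil", "oil", "soap", "soap", "rice"],
    "month":   ["Jan", "Feb", "Jan", "Feb", "Jan", "Feb", "Mar"],
    "units":   [120, 135, 80, 75, 60, 90, 150],
})

# Descriptive analytics: summarize what has already happened
summary = (sales.groupby("product")["units"]
                .agg(["sum", "mean"])
                .sort_values("sum", ascending=False))
print(summary)  # total and average units sold per product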
Predictive Analytics
Predictive analytics is exactly what it sounds like. It is that side of business analytics where
predictions about a future event are made. An example of predictive analytics is calculating
the expected sales figures for the upcoming fiscal year. Predictive analytics is majorly used to
set up expectations and follow proper processes and measures to meet those expectations.
Example –
The best examples are the Amazon and Netflix recommender systems. You might have
noticed that whenever you buy a product on Amazon, the checkout page shows a
recommendation along the lines of “customers who purchased this item also purchased these
products”. That recommendation is based on customers’ past purchase behaviour: by
analysing it, analysts build associations between products, which is why related items are
suggested whenever you buy something.
Netflix works in a similar way. When you watch a movie or web series, Netflix recommends
many other titles; that recommendation is based on past data and trends, identifying which
movies or series have gained a lot of public interest and building recommendations on that
basis.
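To make the “customers who bought this also bought that” idea concrete, here is a minimal, hypothetical sketch of turning past purchase records into product associations; real Amazon or Netflix recommenders are far more sophisticated than this simple co-occurrence count.

import pandas as pd

# Hypothetical purchase history: one row per (customer, product) purchase
purchases = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c", "c", "c"],
    "product":  ["phone", "case", "phone", "charger", "phone", "case", "charger"],
})

basket = pd.crosstab(purchases["customer"], purchases["product"])  # customer x product matrix
co_bought = basket.T.dot(basket)  # how often two products share a customer

# Products most often bought together with "phone"
print(co_bought["phone"].drop("phone").sort_values(ascending=False))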

Prescriptive Analytics
In the case of prescriptive analytics, we make use of simulation, data modelling, and
optimization algorithms to answer questions such as “What needs to be done?”. It is used to
propose solutions and to identify the potential results of those solutions. This field of
business analytics has emerged relatively recently and is growing quickly because it offers
multiple solutions, with their possible effectiveness, to the problems faced by businesses. If
Plan A fails or there aren’t enough resources to execute it, there is still a Plan B, Plan C, and
so on.
Example –
A good example is a self-driving car such as Google’s: by combining past trends with
forecast data, it decides when to turn or when to slow down, working much like a human
driver.
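Prescriptive analytics often reduces to an optimization problem. The sketch below is a simplified illustration (the profits and capacity limits are invented) that uses SciPy’s linear programming routine to “prescribe” a production plan.

from scipy.optimize import linprog

# Hypothetical problem: maximise profit 40*x1 + 30*x2 subject to
# machine hours: 2*x1 + 1*x2 <= 100 and labour hours: 1*x1 + 2*x2 <= 80
result = linprog(
    c=[-40, -30],                      # linprog minimises, so profits are negated
    A_ub=[[2, 1], [1, 2]],
    b_ub=[100, 80],
    bounds=[(0, None), (0, None)],
)
print("units to produce:", result.x)   # the prescribed plan
print("maximum profit:", -result.fun)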

ANALYTICS LIFE CYCLE


In the early 1990s, data mining was still maturing, and as a community we spent a lot of time
getting data ready for fairly limited tools and computing power. The CRISP-DM process
model that emerged from that period is still valid today in the era of Big Data and stream
analytics.

Business Understanding
Focuses on understanding the project objectives and requirements from a business
perspective. The analyst formulates this knowledge as a data mining problem and develops
a preliminary plan.
Data Understanding
Starting with initial data collection, the analyst proceeds with activities to get familiar with
the data, identify data quality problems & discover first insights into the data. In this phase,
the analyst might also detect interesting subsets to form hypotheses for hidden information
Data Preparation
The data preparation phase covers all activities to construct the final dataset from the initial
raw data
Modelling
The analyst evaluates, selects and applies the appropriate modelling techniques. Since some
techniques, like neural nets, have specific requirements regarding the form of the data, there
can be a loop back here to data preparation.
Evaluation
The analyst builds & chooses models that appear to have high quality based on loss functions
that were selected. The analyst them tests them to ensure that they can generalise the models
against unseen data. Subsequently, the analyst also validates that the models sufficiently
cover all key business issues. The end result is the selection of the champion model(s)
Deployment
Generally, this will mean deploying a code representation of the model into an operational
system. This also includes mechanisms to score or categorize new unseen data as it arises.
The mechanism should use the new information in the solution of the original business
problem. Importantly, the code representation must also include all the data prep steps
leading up to modelling. This ensures that the model will treat new raw data in the same
manner as during model development
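The requirement that the deployed code include all the data-prep steps can be illustrated with a scikit-learn Pipeline; the tiny data set and the choice of scaler and classifier below are illustrative assumptions, not part of CRISP-DM itself.

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = [[25, 40000], [47, 82000], [35, 61000], [52, 95000]]  # toy features (age, income)
y = [0, 1, 0, 1]                                           # toy labels (e.g. churn / no churn)

# The data preparation (scaling) and the model travel together as one object
champion = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression())])
champion.fit(X, y)

joblib.dump(champion, "champion_model.joblib")  # deployed artifact: new raw data gets identical prep
print(champion.predict([[30, 50000]]))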

BUSINESS PROBLEM DEFINITION


The purpose of the Business Understanding phase is to understand what the business wants
to solve. The important tasks within this phase, according to Data Science Project
Management practice, include:
1. Determine the business question and objective: Establish what to solve from the business
perspective, what the customer wants, and the business success criteria (Key Performance
Indicators, or KPIs). If you are a fresher, research the kinds of situations a company might
face and try to build your project around one of them.
2. Situation Assessment: You need to assess resource availability, project requirements,
risks, and the cost-benefit of the project. While you might not know the situation within the
company if you are not hired yet, you can assess it based on your research and explain what
your assessment is based on.
3. Determine the project goals: Define the success criteria from the technical, data-mining
perspective. You can set them based on model metrics, turnaround time or anything else, as
long as you can explain them; what matters is that they are logically sound.
4. Project plan: Try to create a detailed plan for each project phase and what kind of tools you
would use.

Determine the business question and objective:


The first thing you must do in any project is to find out exactly what you’re trying to
accomplish! That’s less obvious than it sounds. Many data miners have invested time on data
analysis, only to find that their management wasn’t particularly interested in the issue they
were investigating. You must start with a clear understanding of
● A problem that your management wants to address
● The business goals
● Constraints (limitations on what you may do, the kinds of solutions that can be used, when
the work must be completed, and so on)
● Impact (how the problem and possible solutions fit in with the business)
Deliverables for this task include three items (usually brief reports focusing on just the main
points):
● Background: Explain the business situation that drives the project. This item, like many that
follow, amounts only to a few paragraphs.
● Business goals: Define what your organization intends to accomplish with the project. This is
usually a broader goal than you, as a data miner, can accomplish independently. For example,
the business goal might be to increase sales from a holiday ad campaign by 10 percent year
over year.
● Business success criteria: Define how the results will be measured. Try to get clearly
defined quantitative success criteria. If you must use subjective criteria (hint: terms like gain
insight or get a handle on imply subjective criteria), at least get agreement on exactly who
will judge whether or not those criteria have been fulfilled.

Assessing your situation


This is where you get into more detail on the issues associated with your business goals. Now
you will go deeper into fact-finding, building out a much fleshier explanation of the issues
outlined in the business goals task.
Deliverables for this task include five in-depth reports:
● Inventory of resources: A list of all resources available for the project. These may include
people (not just data miners, but also those with expert knowledge of the business problem,
data managers, technical support, and others), data, hardware, and software.
● Requirements, assumptions, and constraints: Requirements will include a schedule for
completion, legal and security obligations, and requirements for acceptable finished work.
This is the point to verify that you’ll have access to appropriate data!
● Risks and contingencies: Identify causes that could delay completion of the project, and
prepare a contingency plan for each of them. For example, if an Internet outage in your office
could pose a problem, perhaps your contingency could be to work at another office until the
outage has ended.
● Terminology: Create a list of business terms and data-mining terms that are relevant to your
project and write them down in a glossary with definitions (and perhaps examples), so that
everyone involved in the project can have a common understanding of those terms.
● Costs and benefits: Prepare a cost-benefit analysis for the project. Try to state all costs and
benefits in dollar (euro, pound, yen, and so on) terms. If the benefits don’t significantly
exceed the costs, stop and reconsider this analysis and your project.

Defining your project goals


Reaching the business goal often requires action from many people, not just the data miner.
So now, you must define your little part within the bigger picture. If the business goal is to
reduce customer attrition, for example, your data-mining goals might be to identify attrition
rates for several customer segments, and develop models to predict which customers are at
greatest risk. Deliverables for this task include two reports:
● Project goals: Define project deliverables, such as models, reports, presentations, and
processed datasets.
● Project success criteria: Define the project technical criteria necessary to support the
business success criteria. Try to define these in quantitative terms (such as model accuracy or
predictive improvement compared to an existing method). If the criteria must be qualitative,
identify the person who makes the assessment.

Project plan
Now you specify every step that you, the data miner, intend to take until the project is
completed and the results are presented and reviewed.
Deliverables for this task include two reports:
● Project plan: Outline your step-by-step action plan for the project. Expand the outline with a
schedule for completion of each step, required resources, inputs (such as data or a meeting
with a subject matter expert), and outputs (such as cleaned data, a model, or a report) for each
step, and dependencies (steps that can’t begin until this step is completed). Explicitly state
that certain steps must be repeated (for example, modeling and evaluation usually call for
several back-and-forth repetitions).
● Initial assessment of tools and techniques: Identify the required capabilities for meeting
your data-mining goals and assess the tools and resources that you have. If something is
missing, you have to address that concern very early in the process.
DATA COLLECTION
Data is a collection of facts, figures, objects, symbols, and events gathered from different
sources. Organizations collect data to make better decisions. Without data, it would be
difficult for organizations to make appropriate decisions, and so data is collected at various
points in time from different audiences.
For instance, before launching a new product, an organization needs to collect data on
product demand, customer preferences, competitors, etc. In case data is not collected
beforehand, the organization’s newly launched product may lead to failure for many reasons,
such as less demand and inability to meet customer needs.
Although data is a valuable asset for every organization, it does not serve any purpose until
analyzed or processed to get the desired results.
Information recorded directly from observation, as numerical facts and before any
processing, is known as raw data. Data can be classified into two types, primary data and
secondary data, described below.
1. Primary Data
When an investigator collects data personally, with a definite plan or design, the data is
known as primary data. Generally, the results derived from primary data are accurate because
the researcher gathers the information first-hand. However, one of the disadvantages of
primary data collection is the expense associated with it: primary data research is very
time-consuming and costly.
2. Secondary Data
Data that the investigator does not initially collect but instead obtains from published or
unpublished sources are secondary data. Secondary data is collected by an individual or an
institution for some purpose and are used by someone else in another context. It is worth
noting that although secondary data is cheaper to obtain, it raises concerns about accuracy.
As the data is second-hand, one cannot fully rely on the information to be authentic.
Data Collection: Methods
Data collection is the process of gathering and analysing data, using appropriate techniques,
to support research and validate findings. It is done to diagnose a problem and to understand
its outcomes and future trends. When a question needs to be answered, data collection
methods help in estimating the likely result.
We must collect reliable data from the correct sources to make the calculations and analysis
easier. There are two types of data collection methods. This is dependent on the kind of data
that is being collected. They are:
1. Primary Data Collection Methods
2. Secondary Data Collection Methods
Types of Data Collection
Students require primary or secondary data while doing their research. Both primary and
secondary data have their own advantages and disadvantages. Both the methods come into
the picture in different scenarios. One can use secondary data to save time and primary data
to get accurate results.
Primary Data Collection Method
Primary or raw data is obtained directly from the first-hand source through experiments,
surveys, or observations. The primary data collection method is further classified into two
types, and they are given below:
1. Quantitative Data Collection Methods
2. Qualitative Data Collection Methods
Quantitative Data Collection Methods
The term ‘quantitative’ refers to data expressed as specific numbers. Quantitative data
collection methods capture data in numerical form using traditional or online data collection
methods. Once this data is collected, the results can be calculated using statistical methods
and mathematical tools.
Some of the quantitative data collection methods include
Time Series Analysis
The term time series refers to a sequential order of values of a variable, known as a trend, at
equal time intervals. Using patterns, an organization can predict the demand for its products
and services for the projected time.
Smoothing Techniques
In cases where the time series lacks significant trends, smoothing techniques can be used.
They eliminate a random variation from the historical demand. It helps in identifying patterns
and demand levels to estimate future demand. The most common methods used in smoothing
demand forecasting techniques are the simple moving average method and the weighted
moving average method.
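A minimal pandas sketch of the two smoothing techniques just described, using hypothetical monthly demand figures and an assumed three-month window:

import numpy as np
import pandas as pd

demand = pd.Series([120, 132, 101, 134, 190, 170, 140, 160],
                   index=pd.period_range("2023-01", periods=8, freq="M"))

sma = demand.rolling(window=3).mean()                    # simple moving average
weights = np.array([0.2, 0.3, 0.5])                      # heavier weight on the most recent month
wma = demand.rolling(window=3).apply(lambda x: np.dot(x, weights))  # weighted moving average

print(pd.DataFrame({"demand": demand, "SMA(3)": sma, "WMA(3)": wma}))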
Barometric Method
Also known as the leading indicators approach, researchers use this method to speculate
future trends based on current developments. When the past events are considered to predict
future events, they act as leading indicators.
Qualitative Data Collection Methods
The qualitative method does not involve any mathematical calculations. This method is
closely connected with elements that are not quantifiable. The qualitative data collection
method includes several ways to collect this type of data, and they are given below:
Interview Method
As the name suggests, data collection is done through the verbal conversation of interviewing
the people in person or on a telephone or by using any computer-aided model. This is one of
the most often used methods by researchers. A brief description of each of these methods is
shown below:
Personal or Face-to-Face Interview: In this type of interview, questions are asked directly to
the respondent in person, and the researcher records the answers (sometimes using an online
survey form to take notes).
Telephonic Interview: This method is done by asking questions on a telephonic call. Data is
collected from the people directly by collecting their views or opinions.
Computer-Assisted Interview: The computer-assisted type of interview is the same as a
personal interview, except that the interviewer and the person being interviewed will be doing
it on a desktop or laptop. Also, the data collected is directly updated in a database to make the
process quicker and easier. In addition, it eliminates a lot of paperwork to be done in updating
the collection of data.
Questionnaire Method of Collecting Data
The questionnaire method involves conducting surveys with a set of quantitative research
questions, typically created using online survey-creation software. Using such software also
helps make the survey appear legitimate and trustworthy to respondents. Some types of
questionnaire methods are given below:
Web-Based Questionnaire: The interviewer can send a survey link to the selected
respondents. Then the respondents click on the link, which takes them to the survey
questionnaire. This method is very cost-efficient and quick, which people can do at their own
convenient time. Moreover, the survey has the flexibility of being done on any device. So, it
is reliable and flexible.
Mail-Based Questionnaire: Questionnaires are sent to the selected audience via email. At
times, some incentives are also given to complete this survey which is the main attraction.
The advantage of this method is that the respondent’s name remains confidential to the
researchers, and there is the flexibility of time to complete this survey.
Observation Method
As the word ‘observation’ suggests, in this method data is collected through direct
observation. This can involve counting the number of people or the number of events in a
particular time frame. Generally, it is effective in small-scale scenarios. The primary skill needed here is
observing and arriving at the numbers correctly. Structured observation is the type of
observation method in which a researcher detects certain specific behaviours.
Document Review Method
The document review method is a data aggregation method used to collect data from existing
documents with data about the past. There are two types of documents from which we can
collect data. They are given below:
Public Records: Data maintained by an organisation, such as annual reports and sales
information from past months, is used for future analysis.
Personal Records: As the name suggests, the documents about an individual such as type of
job, designation, and interests are taken into account.
Secondary Data Collection Method
The data collected by another person other than the researcher is secondary data. Secondary
data is readily available and does not require any particular collection methods. It is available
in the form of historical archives, government data, organisational records etc. This data can
be obtained directly from the company or the organization where the research is being
organised or from outside sources.
The internal sources of secondary data gathering include company documents, financial
statements, annual reports, team member information, and reports got from customers or
dealers. Now, the external data sources include information from books, journals, magazines,
the census taken by the government, and the information available on the internet about
research. The main advantage of this data aggregation method is that the data is easy to
collect, since the sources are readily accessible.

The secondary data collection methods, too, can involve both quantitative and qualitative
techniques. Secondary data is easily available and hence, less time-consuming and expensive
as compared to the primary data. However, with the secondary data collection methods, the
authenticity of the data gathered cannot be verified.
Collection of Data in Statistics
There are various ways to represent data after it has been gathered, but the most popular
method is to tabulate it using tally marks and then present it in a frequency distribution table.
Tally marks are a simple numerical system used for counting: vertical lines record individual
counts, and a diagonal line is drawn across each group of four vertical lines so that every
completed group represents a total of 5.

Example:
Consider a jar containing beads of different colours (the figure is not reproduced here).
Construct a frequency distribution table for the data.
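Since the figure and the worked answer are not reproduced in this copy, the following sketch builds a frequency distribution from hypothetical colour observations using Python's collections.Counter; the counts are invented for illustration.

from collections import Counter

# Hypothetical observations of bead colours drawn from the jar
observations = ["red", "blue", "red", "green", "blue", "red",
                "green", "green", "blue", "red", "blue", "red"]

frequency = Counter(observations)
print("Colour   Frequency")
for colour, count in frequency.most_common():
    print(f"{colour:<8} {count}")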

DATA PREPARATION
Data preparation is the process of gathering, combining, structuring and organizing data so it
can be used in business intelligence (BI), analytics and data visualization applications. The
components of data preparation include data preprocessing, profiling, cleansing, validation
and transformation; it often also involves pulling together data from different internal
systems and external sources.
Data preparation work is done by information technology (IT), BI and data management
teams as they integrate data sets to load into a data warehouse, NoSQL database or data lake
repository, and then when new analytics applications are developed with those data sets. In
addition, data scientists, data engineers, other data analysts and business users increasingly
use self-service data preparation tools to collect and prepare data themselves.
Data preparation is often referred to informally as data prep. It's also known as data
wrangling, although some practitioners use that term in a narrower sense to refer to cleansing,
structuring and transforming data; that usage distinguishes data wrangling from the data pre-
processing stage.
Purposes of data preparation
One of the primary purposes of data preparation is to ensure that raw data being readied for
processing and analysis is accurate and consistent so the results of BI and analytics
applications will be valid. Data is commonly created with missing values, inaccuracies or
other errors, and separate data sets often have different formats that need to be reconciled
when they're combined. Correcting data errors, validating data quality and consolidating data
sets are big parts of data preparation projects.
Data preparation also involves finding relevant data to ensure that analytics applications
deliver meaningful information and actionable insights for business decision-making. The
data often is enriched and optimized to make it more informative and useful -- for example,
by blending internal and external data sets, creating new data fields, eliminating outlier values
and addressing imbalanced data sets that could skew analytics results.
In addition, BI and data management teams use the data preparation process to curate data
sets for business users to analyse. Doing so helps streamline and guide self-service BI
applications for business analysts, executives and workers.
What are the benefits of data preparation?
Data scientists often complain that they spend most of their time gathering, cleansing and
structuring data instead of analysing it. A big benefit of an effective data preparation process
is that they and other end users can focus more on data mining and data analysis -- the parts
of their job that generate business value. For example, data preparation can be done more
quickly, and prepared data can automatically be fed to users for recurring analytics
applications.
Done properly, data preparation also helps an organization do the following:
● ensure the data used in analytics applications produces reliable results;
● identify and fix data issues that otherwise might not be detected;
● enable more informed decision-making by business executives and operational workers;
● reduce data management and analytics costs;
● avoid duplication of effort in preparing data for use in multiple applications; and
● get a higher ROI from BI and analytics initiatives.
Effective data preparation is particularly beneficial in big data environments that store a
combination of structured, semi structured and unstructured data, often in raw form until it's
needed for specific analytics uses. Those uses include predictive analytics, machine learning
(ML) and other forms of advanced analytics that typically involve large amounts of data to
prepare. For example, in an article on preparing data for machine learning, Felix Wick,
corporate vice president of data science at supply chain software vendor Blue Yonder, is
quoted as saying that data preparation "is at the heart of ML."
Steps in the data preparation process
Data preparation is done in a series of steps. There's some variation in the data preparation
steps listed by different data professionals and software vendors, but the process typically
involves the following tasks:
1. Data discovery and profiling. The first step is to explore the collected data to better
understand what it contains and what needs to be done to prepare it for the intended uses. To
help with that, data profiling identifies patterns, relationships and other attributes in the data,
as well as inconsistencies, anomalies, missing values and other issues so they can be
addressed.
What is data profiling?
Data profiling refers to the process of examining, analyzing, reviewing and summarizing data
sets to gain insight into the quality of data. Data quality is a measure of the condition of data
based on factors such as its accuracy, completeness, consistency, timeliness and accessibility.
Additionally, data profiling involves a review of source data to understand the data's
structure, content and interrelationships.
This review process delivers two high-level values to the organization: one, it provides a
high-level view of the quality of its data sets; and two, it helps the organization identify
potential data projects.
Given those benefits, data profiling is an important component of data preparation programs.
Because it helps organizations identify quality data, it is an important precursor to data
processing and data analytics activities.
Moreover, an organization can use data profiling and the insights it produces to continuously
improve the quality of its data and measure the results of that effort.
Data profiling may also be known as data archaeology, data assessment, data discovery or
data quality analysis.
Organizations use data profiling at the beginning of a project to determine if enough data has
been gathered, if any data can be reused or if the project is worth pursuing. The process of
data profiling itself can be based on specific business rules that will uncover how the data set
aligns with business standards and goals.
Types of data profiling
There are three types of data profiling.
● Structure discovery. This focuses on the formatting of the data, making sure everything is
uniform and consistent. It uses basic statistical analysis to return information about the
validity of the data.
● Content discovery. This process assesses the quality of individual pieces of data. For
example, ambiguous, incomplete and null values are identified.
● Relationship discovery. This detects connections, similarities, differences and associations
among data sources.
What are the steps in the data profiling process?
Data profiling helps organizations identify and fix data quality problems before the data is
analyzed, so data professionals aren't dealing with inconsistencies, null values or incoherent
schema designs as they process data to make decisions.
Data profiling statistically examines and analyzes data at its source and when loaded. It also
analyzes the metadata to check for accuracy and completeness.
It typically involves either writing queries or using data
profiling tools. A high-level breakdown of the process is as follows:
1. The first step of data profiling is gathering one or multiple data sources and the associated
metadata for analysis.
2. The data is then cleaned to unify structure, eliminate duplications, identify interrelationships
and find anomalies.
3. Once the data is cleaned, data profiling tools will return various statistics to describe the data
set. This could include the mean, minimum/maximum value, frequency, recurring patterns,
dependencies or data quality risks.
For example, by examining the frequency distribution of different values for each column in a
table, a data analyst could gain insight into the type and use of each column. Cross-column
analysis can be used to expose embedded value dependencies; inter-table analysis allows the
analyst to discover overlapping value sets that represent foreign key relationships between
entities.
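The column-level statistics described above can be produced quickly with pandas; the small customer table below is invented purely for illustration.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "age":         [34, None, 29, 41, 41],
    "city":        ["Chennai", "Madurai", "Chennai", None, None],
})

print(customers.describe(include="all"))  # counts, means, min/max per column
print(customers.isna().sum())             # missing values per column
print(customers["city"].value_counts())   # frequency distribution of one column
print(customers.duplicated().sum())       # possible duplicate records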
Benefits of data profiling
Data profiling returns a high-level overview of data that can result in the following benefits:
● leads to higher-quality, more credible data;
● helps with more accurate predictive analytics and decision-making;
● makes better sense of the relationships between different data sets and sources;
● keeps company information centralized and organized;
● eliminates errors, such as missing values or outliers, that add costs to data-driven projects;
● highlights areas within a system that experience the most data quality issues, such as data
corruption or user input errors; and
● produces insights surrounding risks, opportunities and trends.
Data profiling challenges
Although the objectives of data profiling are straightforward, the actual work involved is
quite complex, with multiple tasks occurring from the ingestion of data through its
warehousing.
That complexity is one of the challenges organizations encounter when trying to implement
and run a successful data profiling program.
The sheer volume of data being collected by a typical organization is another challenge, as is
the range of sources -- from cloud-based systems to endpoint devices deployed as part of an
internet-of-things ecosystem -- that produce data.
The speed at which data enters an organization creates further challenges to having a
successful data profiling program.
These data prep challenges are even more significant in organizations that have not adopted
modern data profiling tools and still rely on manual processes for large parts of this work.
On a similar note, organizations that don't have adequate resources -- including trained data
professionals, tools and the funding for them -- will have a harder time overcoming these
challenges.
However, those same elements make data profiling more critical than ever to ensure that the
organization has the quality data it needs to fuel intelligent systems, customer
personalization, productivity-boosting automation projects and more.
Examples of data profiling
Data profiling can be implemented in a variety of use cases where data quality is important.
For example, projects that involve data warehousing or business intelligence may require
gathering data from multiple disparate systems or databases for one report or analysis.
Applying data profiling to these projects can help identify potential issues and corrections that
need to be made in extract, transform and load (ETL) jobs and other data integration
processes before moving forward.
Additionally, data profiling is crucial in data conversion or data migration initiatives that
involve moving data from one system to another. Data profiling can help identify data quality
issues that may get lost in translation or adaptions that must be made to the new system prior
to migration.
The following four methods, or techniques, are used in data profiling:
● column profiling, which assesses tables and quantifies entries in each column;
● cross-column profiling, which features both key analysis and dependency analysis;
● cross-table profiling, which uses key analysis to identify stray data as well as semantic and
syntactic discrepancies; and
● data rule validation, which assesses data sets against established rules and standards to
validate that they're being followed.
Data profiling tools
Data profiling tools replace much, if not all, of the manual effort of this function by
discovering and investigating issues that affect data quality, such as duplication, inaccuracies,
inconsistencies and lack of completeness.
These technologies work by analyzing data sources and linking sources to their metadata to
allow for further investigation into errors.
Furthermore, they offer data professionals quantitative information and statistics around data
quality, typically in tabular and graph formats.
Data management applications, for example, can manage the profiling process through tools
that eliminate errors and apply consistency to data extracted from multiple sources without
the need for hand coding.
Such tools are essential for many, if not most, organizations today as the volume of data they
use for their business activities significantly outpaces even a large team's ability to perform
this function through mostly manual efforts.
Data profile tools also generally include data wrangling, data gap and metadata discovery
capabilities as well as the ability to detect and merge duplicates, check for data similarities
and customize data assessments.
Commercial vendors that provide data profiling capabilities include Datameer, Informatica,
Oracle and SAS. Open source solutions include Aggregate Profiler, Apache Griffin, Quadient
DataCleaner and Talend.
2. Data cleansing. Next, the identified data errors and issues are corrected to create complete
and accurate data sets. For example, as part of cleansing data sets, faulty data is removed or
fixed, missing values are filled in and inconsistent entries are harmonized.
What is data cleansing?
Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing
incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves
identifying data errors and then changing, updating or removing data to correct them. Data
cleansing improves data quality and helps provide more accurate, consistent and reliable
information for decision-making in an organization.
Data cleansing is a key part of the overall data management process and one of the core
components of data preparation work that readies data sets for use in business intelligence
(BI) and data science applications. It's typically done by data quality analysts and engineers
or other data management professionals. But data scientists, BI analysts and business users
may also clean data or take part in the data cleansing process for their own applications.
Data cleansing vs. data cleaning vs. data scrubbing
Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most
part, they're considered to be the same thing. In some cases, though, data scrubbing is viewed
as an element of data cleansing that specifically involves removing duplicate, bad, unneeded
or old data from data sets.
Data scrubbing also has a different meaning in connection with data storage. In that context,
it's an automated function that checks disk drives and storage systems to make sure the data
they contain can be read and to identify any bad sectors or blocks.

Why is clean data important?


Business operations and decision-making are increasingly data-driven, as organizations look
to use data analytics to help improve business performance and gain competitive advantages
over rivals. As a result, clean data is a must for BI and data science teams, business
executives, marketing managers, sales reps and operational workers. That's particularly true
in retail, financial services and other data-intensive industries, but it applies to organizations
across the board, both large and small.
If data isn't properly cleansed, customer records and other business data may not be accurate
and analytics applications may provide faulty information. That can lead to flawed business
decisions, misguided strategies, missed opportunities and operational problems, which
ultimately may increase costs and reduce revenue and profits. IBM estimated that data quality
issues cost organizations in the U.S. a total of $3.1 trillion in 2016, a figure that's still widely
cited.
What kind of data errors does data scrubbing fix?
Data cleansing addresses a range of errors and issues in data sets, including inaccurate,
invalid, incompatible and corrupt data. Some of those problems are caused by human error
during the data entry process, while others result from the use of different data structures,
formats and terminology in separate systems throughout an organization.
The types of issues that are commonly fixed as part of data cleansing projects include the
following:
● Typos and invalid or missing data. Data cleansing corrects various structural errors in data
sets. For example, that includes misspellings and other typographical errors, wrong numerical
entries, syntax errors and missing values, such as blank or null fields that should contain data.
● Inconsistent data. Names, addresses and other attributes are often formatted differently from
system to system. For example, one data set might include a customer's middle initial, while
another doesn't. Data elements such as terms and identifiers may also vary. Data cleansing
helps ensure that data is consistent so it can be analyzed accurately.
● Duplicate data. Data cleansing identifies duplicate records in data sets and either removes or
merges them through the use of deduplication measures. For example, when data from two
systems is combined, duplicate data entries can be reconciled to create single records.
● Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be
relevant to analytics applications and could skew their results. Data cleansing removes
redundant data from data sets, which streamlines data preparation and reduces the required
amount of data processing and storage resources.
What are the steps in the data cleansing process?
The scope of data cleansing work varies depending on the data set and analytics
requirements. For example, a data scientist doing fraud detection analysis on credit card
transaction data may want to retain outlier values because they could be a sign of fraudulent
purchases. But the data scrubbing process typically includes the following actions:
1. Inspection and profiling. First, data is inspected and audited to assess its quality level and
identify issues that need to be fixed. This step usually involves data profiling, which
documents relationships between data elements, checks data quality and gathers statistics on
data sets to help find errors, discrepancies and other problems.
2. Cleaning. This is the heart of the cleansing process, when data errors are corrected and
inconsistent, duplicate and redundant data is addressed.
3. Verification. After the cleaning step is completed, the person or team that did the work
should inspect the data again to verify its cleanliness and make sure it conforms to internal
data quality rules and standards.
4. Reporting. The results of the data cleansing work should then be reported to IT and business
executives to highlight data quality trends and progress. The report could include the number
of issues found and corrected, plus updated metrics on the data's quality levels.
The cleansed data can then be moved into the remaining stages of data preparation, starting
with data structuring and data transformation, to continue readying it for analytics uses.
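As a minimal, hypothetical sketch of the cleaning step in pandas: fixing missing values, harmonizing inconsistent entries and removing duplicates (the customer records are invented).

import pandas as pd

raw = pd.DataFrame({
    "name":  ["Asha", "asha ", "Ravi", "Meena"],
    "city":  ["chennai", "Chennai", "MADURAI", None],
    "sales": [1200, 1200, None, 900],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()            # harmonize inconsistent formatting
clean["city"] = clean["city"].str.title().fillna("Unknown")      # fill missing categorical values
clean["sales"] = clean["sales"].fillna(clean["sales"].median())  # fill missing numeric values
clean = clean.drop_duplicates()                                  # remove duplicate records
print(clean)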
Characteristics of clean data
Various data characteristics and attributes are used to measure the cleanliness and overall
quality of data sets, including the following:
● accuracy
● completeness
● consistency
● integrity
● timeliness
● uniformity
● validity
Data management teams create data quality metrics to track those characteristics, as well as
things like error rates and the overall number of errors in data sets. Many also try to calculate
the business impact of data quality problems and the potential business value of fixing them,
partly through surveys and interviews with business executives.
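As a hypothetical illustration, simple quality metrics such as completeness and duplicate rates can be computed directly from a data set:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
})

completeness = 1 - df.isna().sum().sum() / df.size  # share of non-missing cells
duplicate_rate = df.duplicated().mean()             # share of fully duplicated rows
print(f"completeness: {completeness:.0%}, duplicate rows: {duplicate_rate:.0%}")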
The benefits of effective data cleansing
Done well, data cleansing provides the following business and data management benefits:
● Improved decision-making. With more accurate data, analytics applications can produce
better results. That enables organizations to make more informed decisions on business
strategies and operations, as well as things like patient care and government programs.
● More effective marketing and sales. Customer data is often wrong, inconsistent or out of
date. Cleaning up the data in customer relationship management and sales systems helps
improve the effectiveness of marketing campaigns and sales efforts.
● Better operational performance. Clean, high-quality data helps organizations avoid
inventory shortages, delivery snafus and other business problems that can result in higher
costs, lower revenues and damaged relationships with customers.
● Increased use of data. Data has become a key corporate asset, but it can't generate business
value if it isn't used. By making data more trustworthy, data cleansing helps convince
business managers and workers to rely on it as part of their jobs.
● Reduced data costs. Data cleansing stops data errors and issues from further propagating in
systems and analytics applications. In the long term, that saves time and money, because IT
and data management teams don't have to continue fixing the same errors in data sets.
Data cleansing and other data quality methods are also a key part of data governance
programs, which aim to ensure that the data in enterprise systems is consistent and gets used
properly. Clean data is one of the hallmarks of a successful data governance initiative.
Data cleansing challenges
Data cleansing doesn't lack for challenges. One of the biggest is that it's often
time-consuming, due to the number of issues that need to be addressed in many data sets and
the difficulty of pinpointing the causes of some errors. Other common challenges include the
following:
● deciding how to resolve missing data values so they don't affect analytics applications;
● fixing inconsistent data in systems controlled by different business units;
● cleaning up data quality issues in big data systems that contain a mix of structured, semi
structured and unstructured data;
● getting sufficient resources and organizational support; and
● dealing with data silos that complicate the data cleansing process.
Data cleansing tools and vendors
Numerous tools can be used to automate data cleansing tasks, including both commercial
software and open source technologies. Typically, the tools include a variety of functions for
correcting data errors and issues, such as adding missing values, replacing null ones, fixing
punctuation, standardizing fields and combining duplicate records. Many also do data
matching to find duplicate or related records.
Tools that help cleanse data are available in a variety of products and platforms, including the
following:
● specialized data cleaning tools from vendors such as Data Ladder and WinPure;
● data quality software from vendors such as Datactics, Experian, Innovative Systems, Melissa,
Microsoft and Precisely;
● data preparation tools from vendors such as Altair, DataRobot, Tableau, Tibco Software and
Trifacta;
● data management platforms from vendors such as Alteryx, Ataccama, IBM, Informatica,
SAP, SAS, Syniti and Talend;
● customer and contact data management software from vendors such as Redpoint Global,
RingLead, Synthio and Tye;
● tools for cleansing data in Salesforce systems from vendors such as Cloudingo and Plauti; and
● open-source tools, such as DataCleaner and OpenRefine.
3. Data structuring. At this point, the data needs to be modeled and organized to meet the
analytics requirements. For example, data stored in comma-separated values (CSV) files or
other file formats has to be converted into tables to make it accessible to BI and analytics
tools.
4. Data transformation and enrichment. In addition to being structured, the data typically
must be transformed into a unified and usable format. For example, data transformation may
involve creating new fields or columns that aggregate values from existing ones. Data
enrichment further enhances and optimizes data sets as needed, through measures such as
augmenting and adding data.
What is data transformation?
Data transformation is the process of converting data from one format, such as a database file,
XML document or Excel spreadsheet, into another.
Transformations typically involve converting a raw data source into a cleansed, validated and
ready-to-use format. Data transformation is crucial to data management processes that
include data integration, data migration, data warehousing and data preparation.
The process of data transformation can also be referred to as extract/transform/load (ETL).
The extraction phase involves identifying and pulling data from the various source systems
that create data and then moving the data to a single repository. Next, the raw data is
cleansed, if needed. It's then transformed into a target format that can be fed into operational
systems or into a data warehouse, a data lake or another repository for use in business
intelligence and analytics applications. The transformation may involve converting data
types, removing duplicate data and enriching the source data.
Data transformation is crucial to processes that include data integration, data management,
data migration, data warehousing and data wrangling.
It is also a critical component for any organization seeking to leverage its data to generate
timely business insights. As the volume of data has proliferated, organizations must have an
efficient way to harness data to effectively put it to business use. Data transformation is one
element of harnessing this data, because -- when done properly -- it ensures data is easy to
access, consistent, secure and ultimately trusted by the intended business users.
What are the key steps in data transformation?
The process of data transformation, as noted, involves identifying data sources and types;
determining the structure of transformations that need to occur; and defining how fields will
be changed or aggregated. It includes extracting data from its original source, transforming it
and sending it to the target destination, such as a database or data warehouse. Extractions can
come from many locations, including structured sources, streaming sources or log files from
web applications.
Data analysts, data engineers and data scientists are typically in charge of data transformation
within an organization. They identify the source data, determine the required data formats
and perform data mapping, as well as execute the actual transformation process before
moving the data into appropriate databases for storage and use.
Their work involves five main steps:
1. data discovery, in which data professionals use data profiling tools or profiling
scripts to understand the structure and characteristics of the data and also to
determine how it should be transformed;
2. data mapping, during which data professionals connect, or match, data fields
from one source to data fields in another;
3. code generation, a part of the process where the software code required to
transform the data is created (either by data transformation tools or the data
professionals themselves writing script);
4. execution of the code, where the data undergoes the transformation; and
5. review, during which data professionals or the business/end users confirm that the
output data meets the established transformation requirements and, if not, address
and correct any anomalies and errors.
These steps fall in the middle of the ETL process for organizations that use on-premises
warehouses. However, scalable cloud-based data warehouses have given rise to a slightly
different process called ELT for extract, load, transform; in this process, organizations can
load raw data into data warehouses and then transform data at the time of use.
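A toy extract-transform-load sketch using pandas and SQLite; the file name, column names and table name are assumptions made for illustration.

import sqlite3
import pandas as pd

# Extract: pull raw data from a source file (hypothetical path and columns)
raw = pd.read_csv("raw_orders.csv")

# Transform: cleanse and reshape into the target format
raw["order_date"] = pd.to_datetime(raw["order_date"])
daily = (raw.groupby(raw["order_date"].dt.date)["amount"]
            .sum()
            .reset_index(name="revenue"))

# Load: write the transformed data into the target repository
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_revenue", conn, if_exists="replace", index=False)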
What are the benefits and challenges of data transformation?
Organizations across the board need to analyze their data for a host of business operations,
from customer service to supply chain management. They also need data to feed the
increasing number of automated and intelligent systems within their enterprise.
To gain insight into and improve these operations, organizations need high-quality data in
formats compatible with the systems consuming the data.
Thus, data transformation is a critical component of an enterprise data program because it
delivers the following benefits:
● higher data quality;
● reduced number of mistakes, such as missing values;
● faster queries and retrieval times;
● fewer resources needed to manipulate data;
● better data organization and management; and
● more usable data, especially for advanced business intelligence or analytics.
The data transformation process, however, can be complex and complicated. The challenges
organizations face include the following:
● high cost of transformation tools and professional expertise;
● significant compute resource requirements, since intensive on-premises
transformation processes can slow down other operations;
● difficulty recruiting and retaining the skilled data professionals required for this
work, with data professionals some of the most in-demand workers today; and
● difficulty of properly aligning data transformation activities to the business's data-
related priorities and requirements.
Reasons to do data transformation
Organizations must be able to mine their data for insights in order to successfully compete in
the digital marketplace, optimize operations, cut costs and boost productivity. They also
require data to feed systems that use artificial intelligence, machine learning, natural language
processing and other advanced technologies.
To gain accurate insights and to ensure accurate operations of intelligent systems,
organizations must collect data and merge it from multiple sources and ensure that integrated
data is high quality.
This is where data transformation plays the star role, by ensuring that data collected from one
system is compatible with data from other systems and that the combined data is ultimately
compatible for use in the systems that require it. For example, databases might need to be
combined following a corporate acquisition, transferred to a cloud data warehouse or merged
for analysis.
Examples of data transformation
There are various data transformation methods, including the following:
● aggregation, in which data is collected from multiple sources and stored in a
single format;
● attribute construction, in which new attributes are added or created from
existing attributes;
● discretization, which involves converting continuous data values into sets of data
intervals with specific values to make the data more manageable for analysis;
● generalization, where low-level data attributes are converted into high-level data
attributes (for example, converting data from multiple brackets broken up by ages
into the more general "young" and "old" attributes) to gain a more comprehensive
view of the data;
● integration, a step that involves combining data from different sources into a
single view;
● manipulation, where the data is changed or altered to make it more readable and
organized;
● normalization, a process that converts source data into another format to limit the
occurrence of duplicated data; and
● smoothing, which uses algorithms to reduce "noise" in data sets, thereby helping
to more efficiently and effectively identify trends in the data.
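To make a few of these methods concrete, here is a small, hedged pandas sketch covering discretization/generalization (age bands), rescaling, and smoothing with a rolling mean. The column names and bin edges are made up for the example, and "normalization" is shown here in its common rescaling sense rather than the de-duplication sense described above.

import pandas as pd

df = pd.DataFrame({"age": [21, 35, 47, 62, 70],
                   "income": [28000, 52000, 61000, 75000, 43000]})

# discretization / generalization: continuous ages become "young"/"middle"/"old" bands
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                        labels=["young", "middle", "old"])

# rescaling (min-max) so income values fall between 0 and 1
df["income_scaled"] = (df["income"] - df["income"].min()) / \
                      (df["income"].max() - df["income"].min())

# smoothing: a rolling mean reduces noise in the income series
df["income_smooth"] = df["income"].rolling(window=3, min_periods=1).mean()

print(df)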
Data transformation tools
Data professionals have a number of tools at their disposal to support the ETL process. These
technologies automate many of the steps within data transformation, replacing much, if not
all, of the manual scripting and hand coding that had been a major part of the data
transformation process.
Both commercial and open-source data transformation tools are available, with some options
designed for on-premises transformation processes and others catering to cloud-based
transformation activities.
Moreover, some data transformation tools are focused on the data transformation process
itself, handling the string of actions required to transform data. However, other ETL tools on
the market are part of platforms that offer a broad range of capabilities for managing
enterprise data.
Options include IBM InfoSphere DataStage, Matillion, SAP Data Services and Talend.
5. Data validation and publishing. In this last step, automated routines are run against the data
to validate its consistency, completeness and accuracy. The prepared data is then stored in a
data warehouse, a data lake or another repository and either used directly by whoever
prepared it or made available for other users to access.
What is data validation?
Data validation is the practice of checking the integrity, accuracy and structure of data before
it is used for a business operation. The results of data validation can feed data analytics,
business intelligence or the training of a machine learning model. Validation can also be used
to ensure the integrity of data for financial accounting or regulatory compliance.
Data can be examined as part of a validation process in a variety of ways, including data
type, constraint, structured, consistency and code validation. Each type of data validation is
designed to make sure the data meets the requirements to be useful.
Data validation is related to data quality. Data validation can be a component to measure data
quality, which ensures that a given data set is supplied with information sources that are of
the highest quality, authoritative and accurate.
Data validation is also used as part of application workflows, including spell checking and
rules for strong password creation.
Why validate data?
For data scientists, data analysts and others working with data, validating it is very important.
The output of any given system can only be as good as the data the operation is based on.
These operations can include machine learning or artificial intelligence models, data analytics
reports and business intelligence dashboards. Validating the data ensures that the data is
accurate, which means that any system relying on the validated data set will be accurate as well.
Data validation is also important for data to be useful for an organization or for a specific
application operation. For example, if data is not in the right format to be consumed by a
system, then the data can't be used easily, if at all.
As data moves from one location to another, different needs for the data arise based on the
context in which the data is being used. Data validation ensures that the data is correct for specific
contexts. The right type of data validation makes the data useful.
What are the different types of data validation?
Multiple types of data validation are available to ensure that the right data is being used. The
most common types of data validation include the following:
● Data type validation is common and confirms that the data in each field, column,
list, range or file matches a specified data type and format.
● Constraint validation checks to see if a given data field input fits a specified
requirement within certain ranges. For example, it verifies that a data field has a
minimum or maximum number of characters.
● Structured validation ensures that data is compliant with a specified data format,
structure or schema.
● Consistency validation makes sure data styles are consistent. For example, it
confirms that all values are listed to two decimal points.
● Code validation is similar to a consistency check and confirms that codes used
for different data inputs are correct. For example, it checks a country code or
North American Industry Classification System (NAICS) codes.
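As an illustration of these checks outside any particular tool, the following sketch applies data type, constraint, consistency and code validation to a single hypothetical record; the field names and rules are assumptions made for the example.

record = {"country_code": "US", "price": "19.99", "quantity": 5}

# data type validation: quantity must be an integer
assert isinstance(record["quantity"], int), "quantity must be an integer"

# constraint validation: country code must be exactly 2 characters
assert len(record["country_code"]) == 2, "country code must have 2 characters"

# consistency validation: price must be listed to two decimal places
assert len(record["price"].split(".")[-1]) == 2, "price must use 2 decimals"

# code validation: country code must come from an approved list
assert record["country_code"] in {"US", "IN", "GB"}, "unknown country code"

print("record passed all validation checks")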
How to perform data validation
Among the most basic and common ways that data is used is within a spreadsheet program
such as Microsoft Excel or Google Sheets. In both Excel and Sheets, the data validation
process is a straightforward, integrated feature. Excel and Sheets both have a menu item
listed as Data > Data Validation. By selecting the Data Validation menu, a user can choose the
specific data type or constraint validation required for a given file or data range.
ETL (Extract, Transform and Load) and data integration tools typically integrate data
validation policies to be executed as data is extracted from one source and then loaded into
another. Popular open source tools, such as dbt, also include data validation options and are
commonly used for data transformation.
Data validation can also be done programmatically in an application context for an input
value. For example, as an input variable is sent, such as a password, it can be checked by a
script to make sure it meets constraint validation for the right length.
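A minimal sketch of such a programmatic check is shown below; the 8-character minimum and the digit requirement are assumed rules, not a prescribed standard.

def is_valid_password(password: str, min_length: int = 8) -> bool:
    # constraint validation: minimum length plus at least one digit (assumed rules)
    return len(password) >= min_length and any(ch.isdigit() for ch in password)

print(is_valid_password("pass1"))      # False - too short
print(is_valid_password("pass12345"))  # True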
Data preparation can also incorporate or feed into data curation work that creates and
oversees ready-to-use data sets for BI and analytics. Data curation involves tasks such as
indexing, cataloging and maintaining data sets and their associated metadata to help users
find and access the data. In some organizations, data curator is a formal role that works
collaboratively with data scientists, business analysts, other users and the IT and data
management teams. In others, data may be curated by data stewards, data engineers, database
administrators or data scientists and business users themselves.
What are the challenges of data preparation?
Data preparation is inherently complicated. Data sets pulled together from different source
systems are highly likely to have numerous data quality, accuracy and consistency issues to
resolve. The data also must be manipulated to make it usable, and irrelevant data needs to be
weeded out. As noted above, it's a time-consuming process: The 80/20 rule is often applied to
analytics applications, with about 80% of the work said to be devoted to collecting and
preparing data and only 20% to analyzing it.
In an article on common data preparation challenges, Rick Sherman, managing partner of
consulting firm Athena IT Solutions, detailed the following seven challenges along with
advice on how to overcome each of them:
● Inadequate or non-existent data profiling. If data isn't properly profiled, errors, anomalies
and other problems might not be identified, which can result in flawed analytics.
● Missing or incomplete data. Data sets often have missing values and other forms of
incomplete data; such issues need to be assessed as possible errors and addressed if so.
● Invalid data values. Misspellings, other typos and wrong numbers are examples of invalid
entries that frequently occur in data and must be fixed to ensure analytics accuracy.
● Name and address standardization. Names and addresses may be inconsistent in data from
different systems, with variations that can affect views of customers and other entities.
● Inconsistent data across enterprise systems. Other inconsistencies in data sets drawn from
multiple source systems, such as different terminology and unique identifiers, are also a
pervasive issue in data preparation efforts.
● Data enrichment. Deciding how to enrich a data set -- for example, what to add to it -- is a
complex task that requires a strong understanding of business needs and analytics goals.
● Maintaining and expanding data prep processes. Data preparation work often becomes a
recurring process that needs to be sustained and enhanced on an ongoing basis.
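As a small illustration of tackling two of these challenges (missing values and invalid data values) programmatically, here is a hedged pandas sketch with made-up customer data; a real data preparation pipeline would of course be far more involved.

import pandas as pd

customers = pd.DataFrame({
    "name": ["Asha", "Ravi", None],     # one missing name
    "age": [34, -5, 41],                # -5 is an invalid value
})

# profile the data: count missing values per column
print(customers.isna().sum())

# handle missing data and weed out invalid entries
customers["name"] = customers["name"].fillna("UNKNOWN")
customers = customers[customers["age"].between(0, 120)]

print(customers)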

HYPOTHESIS GENERATION
Data scientists work with data sets small and large, and they are tellers of stories. These stories
have entities, properties and relationships, all described by data. Their apparatus and methods
give data scientists opportunities to identify, consolidate and validate hypotheses with
data, and to use these hypotheses as starting points for their data narratives. Hypothesis
generation is a key challenge for data scientists; hypothesis generation, and by extension
hypothesis refinement, constitute the very purpose of data analysis and data science.
Hypothesis generation for a data scientist can take numerous forms, such as:
1. They may be interested in the properties of a certain stream of data or a certain
measurement. These properties and their default or exceptional values may form a
certain hypothesis.
2. They may be keen on understanding how a certain measure has evolved over time. In
trying to understand this evolution of a system’s metric, or a person’s behaviour, they
could rely on a mathematical model as a hypothesis.
3. They could consider the impact of some properties on the states of systems,
interactions and people. In trying to understand such relationships between different
measures and properties, they could construct machine learning models of different
kinds.
Ultimately, the purpose of such hypothesis generation is to simplify some aspect of system
behaviour and represent such behaviour in a manner that’s tangible and tractable based on
simple, explicable rules. This makes story-telling easier for data scientists when they become
new-age raconteurs, straddling data visualisations, dashboards with data summaries and
machine learning models.

Understanding Hypothesis Generation:


The importance of hypothesis generation in data science teams is manifold:
1. Hypothesis generation allows the team to experiment with theories about
the data.
2. Hypothesis generation can allow the team to take a systems-thinking
approach to the problem to be solved.
3. Hypothesis generation allows us to build more sophisticated models
based on prior hypotheses and understanding.
When data science teams approach complex projects, some of them may be
prone to diving right into building complex systems based on the available
resources, libraries and software. By taking a hypothesis-centred view of the
data science problem, they can build up complexity and understanding in a
very natural way, developing hypotheses and ideas in the process.
What is Hypothesis Generation?
Hypothesis generation is an educated “guess” about the various factors that impact the
business problem to be solved using machine learning. When framing a hypothesis, the
data scientist must not already know its outcome, and the guess should not be based on
any evidence from the data.
“A hypothesis may be simply defined as a guess. A scientific hypothesis is an
intelligent guess.” – Isaac Asimov
Hypothesis generation is a crucial step in any data science project. If you skip this
or skim through this, the likelihood of the project failing increases exponentially.

Hypothesis Generation vs. Hypothesis Testing


Hypothesis generation is a process that begins with an educated guess,
whereas hypothesis testing is the process of concluding whether that educated guess is
true or false, i.e., whether the relationship between the variables is statistically significant or
not.
This latter part can be used for further research using statistical proof.
A hypothesis is accepted or rejected based on the significance level and the test
score of the test used for testing the hypothesis; a sketch of such a test is given below.
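For context, the sketch below shows what the testing side can look like: a two-sample t-test with SciPy on made-up samples, where the 0.05 significance level is an assumed convention rather than a fixed rule.

from scipy import stats

group_a = [12.1, 11.8, 12.4, 12.0, 11.9]
group_b = [12.9, 13.1, 12.7, 13.3, 12.8]

# two-sample t-test: is the difference between the group means significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:                      # assumed significance level
    print("reject the null hypothesis - the difference is significant")
else:
    print("fail to reject the null hypothesis")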

How Does Hypothesis Generation Help?


Here are 5 key reasons why hypothesis generation is so important in data science:
● Hypothesis generation helps in comprehending the business problem as
we dive deep in inferring the various factors affecting our target variable
● You will get a much better idea of what are the major factors that are
responsible to solve the problem
● It identifies the data that needs to be collected from various sources, which is key in
converting your business problem into a data science-based problem
● Improves your domain knowledge if you are new to the domain as you
spend time understanding the problem
● Helps to approach the problem in a structured manner
When Should you Perform Hypothesis Generation?
The million-dollar question – when in the world should you perform hypothesis
generation?
● Hypothesis generation should be done before looking at the dataset
or collecting the data
● You will notice that if you have done your hypothesis generation
adequately, you would have included all the variables present in the
dataset in your hypothesis generation
● You might also have included variables that are not present in the dataset

Case Study: Hypothesis Generation on “RED Taxi Trip Duration Prediction”
Let us now look at the “RED TAXI TRIP DURATION PREDICTION” problem
statement and generate a few hypotheses that would affect our taxi trip duration to
understand hypothesis generation.
Here’s the problem statement:
To predict the duration of a trip so that the company can assign the cabs that are
free for the next trip. This will help in reducing the wait time for customers and
will also help in earning customer trust.
Let’s begin!
Hypothesis Generation Based On Various Factors
1. Distance/Speed based Features
Let us try to come up with a formula that would have a relation with trip duration
and would help us in generating various hypotheses for the problem:
TIME=DISTANCE/SPEED
Distance and speed play an important role in predicting the trip duration.
We can notice that the trip duration is directly proportional to the distance travelled
and inversely proportional to the speed of the taxi. Using this we can come up with
a hypothesis based on distance and speed.
● Distance: More the distance travelled by the taxi, the more will be the trip
duration.
● Interior drop point: Drop points to congested or interior lanes could result
in an increase in trip duration
● Speed: Higher the speed, the lower the trip duration
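A tiny sketch of how this formula can become a derived feature during analysis is shown below; the column names and numbers are made up for illustration.

import pandas as pd

trips = pd.DataFrame({"distance_km": [2.5, 12.0, 7.3],
                      "avg_speed_kmph": [18.0, 40.0, 25.0]})

# TIME = DISTANCE / SPEED, expressed here in minutes
trips["expected_duration_min"] = trips["distance_km"] / trips["avg_speed_kmph"] * 60
print(trips)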

2. Features based on Car


Cars come in various types, sizes and brands, and these features of the car could be vital
for the commute, not only in terms of passenger safety but also for the
trip duration. Let us now generate a few hypotheses based on the features of the
car.
● Condition of the car: Good conditioned cars are unlikely to have
breakdown issues and could have a lower trip duration
● Car Size: Small-sized cars (Hatchback) may have a lower trip duration and
larger-sized cars (XUV) may have higher trip duration based on the size of
the car and congestion in the city

3. Type of the Trip


Trip types can be different based on trip vendors – it could be an outstation trip,
single or pool rides. Let us now define a hypothesis based on the type of trip used.
● Pool Car: Trips with pooling can lead to higher trip duration as the car
reaches multiple places before reaching your assigned destination

4. Features based on Driver Details


A driver is an important person when it comes to commute time. Various factors
about the driver can help in understanding the reasons behind trip duration, and here
are a few hypotheses based on this.
● Age of driver: Older drivers could be more careful and could contribute to
higher trip duration
● Gender: Female drivers are likely to drive slowly and could contribute to
higher trip duration
● Driver experience: Drivers with very less driving experience can cause
higher trip duration
● Medical condition: Drivers with a medical condition can contribute to
higher trip duration

5. Passenger details
Passengers can influence the trip duration knowingly or unknowingly. We usually
come across passengers requesting drivers to increase the speed as they are getting
late and there could be other factors to hypothesize which we can look at.
● Age of passengers: Senior citizens as passengers may contribute to higher
trip duration as drivers tend to go slow in trips involving senior citizens
● Medical conditions or pregnancy: Passengers with medical conditions
contribute to a longer trip duration
● Emergency: Passengers with an emergency could contribute to a shorter
trip duration
● Passenger count: Higher passenger count leads to shorter duration trips due
to congestion in seating
6. Date-Time Features

The day and time of the week are important as New York is a busy city and could
be highly congested during office hours or weekdays. Let us now generate a few
hypotheses on the date and time-based features.
Pickup Day:
● Weekends could contribute to more outstation trips and could have a higher
trip duration
● Weekdays tend to have higher trip duration due to high traffic

● If the pickup day falls on a holiday then the trip duration may be shorter
● If the pickup day falls on a festive week then the trip duration could be
lower due to lesser traffic
Time:

● Early morning trips have a lesser trip duration due to lesser traffic
● Evening trips have a higher trip duration due to peak hours
7. Road-based Features
Roads are of different types and the condition of the road or obstructions in the
road are factors that can’t be ignored. Let’s form some hypotheses based on these
factors.
● Condition of the road: The duration of the trip is more if the condition of
the road is bad
● Road type: Trips in concrete roads tend to have a lower trip duration

● Strike on the road: Strikes carried out on roads in the direction of the trip
causes the trip duration to increase
8. Weather Based Features

Weather can change at any time and could possibly impact the commute if the
weather turns bad. Hence, this is an important feature to consider in our
hypothesis.
● Weather at the start of the trip: Rainy weather condition contributes to a
higher trip duration
After writing down your hypotheses and then looking at the dataset, you will notice
that you have covered hypotheses for most of the features present in the data set.
There is also a possibility that you might have to work with fewer features, because
some features on which you have generated hypotheses are not currently being
captured/stored by the business and are not available.
Always go ahead and capture data from external sources if you think that the
data is relevant for your prediction. Ex: Getting weather information
It is also important to note that since hypothesis generation is an educated
guess, the hypotheses generated could turn out to be true or false once exploratory
data analysis and hypothesis testing are performed on the data.
MODELING:
After all the cleaning, formatting and feature selection, we will now feed
the data to the chosen model. But how does one select a model to use?
How to choose a model?
IT DEPENDS. It all depends on what the goal of your task or project is, and this should
already have been identified in the Business Understanding phase.
Steps in choosing a model
1. Determine size of training data — if you have a small dataset, with few
observations and a high number of features, you can choose high bias/low variance
algorithms (Linear Regression, Naïve Bayes, Linear SVM). If your dataset is large and
has a high number of observations compared to the number of features, you can choose
low bias/high variance algorithms (KNN, Decision trees).
2. Accuracy and/or interpretability of the output — if your goal is inference, choose
restrictive models as it is more interpretable (Linear Regression, Least Squares). If
your goal is higher accuracy, then choose flexible models (Bagging, Boosting, SVM).
3. Speed or training time — always remember that higher accuracy as well as large
datasets means higher training time. Examples of easy to run and to implement
algorithms are: Naïve Bayes, Linear and Logistic Regression. Some examples of
algorithms that need more time to train are: SVM, Neural Networks, and Random
Forests.
4. Linearity —try checking first the linearity of your data by fitting a linear line or by
trying to run a logistic regression, you can also check their residual errors. Higher
errors mean that the data is not linear and needs complex algorithms to fit. If data is
Linear, you can choose: Linear Regression, Logistic Regression, Support Vector
Machines. If Non-linear: Kernel SVM, Random Forest, Neural Nets.
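A rough sketch of the linearity check in step 4 is shown below, using scikit-learn to fit a linear model and inspect the residual errors; the toy data and any cut-off you choose for "high" residuals are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.1, 2.0, 3.2, 3.9, 5.1])

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
print("mean absolute residual:", np.abs(residuals).mean())

# small residuals suggest a linear model is adequate; large residuals point
# towards flexible algorithms such as kernel SVMs, random forests or neural nets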
Parametric vs. Non-Parametric Machine Learning Models
Parametric Machine Learning Algorithms
Parametric ML Algorithms are algorithms that simplify the mapping function to a known form. They
are often called the “Linear ML Algorithms”.
Parametric ML Algorithms
● Logistic Regression
● Linear Discriminant Analysis
● Perceptron
● Naïve Bayes
● Simple Neural Networks
Benefits of Parametric ML Algorithms
● Simpler — easy to understand methods and easy to interpret results
● Speed — very fast to learn from the data provided
● Less data — it does not require as much training data
Limitations of Parametric ML Algorithms
● Limited Complexity —suited only to simpler problems
● Poor Fit — the methods are unlikely to match the underlying mapping function
Non-Parametric Machine Learning Algorithms
Non-Parametric ML Algorithms are algorithms that do not make assumptions about the
form of the mapping function. They are a good choice when you have a lot of data and no prior
knowledge, and you don’t want to worry too much about choosing the right features.
Non-Parametric ML Algorithms
● K-Nearest Neighbors (KNN)
● Decision Trees like CART
● Support Vector Machines (SVM)
Benefits of Non-Parametric ML Algorithms
● Flexibility— it is capable of fitting a large number of functional forms
● Power — do not assume about the underlying function
● Performance — able to give a higher performance model for predictions
Limitations of Non-Parametric ML Algorithms
● Needs more data — requires a large training dataset
● Slower processing — they often have more parameters which means that training time
is much longer
● Overfitting — higher risk of overfitting the training data, and it is harder to
explain why specific predictions were made
In the process flow above, Data Modeling is broken down into four tasks,
each with its projected outcome or output described in detail.
Simply put, the Data Modeling phase consists of the following tasks:

1. Selecting modeling techniques


The wonderful world of data mining offers lots of modeling techniques,
but not all of them will suit your needs. Narrow the list based on the kinds
of variables involved, the selection of techniques available in your tools,
and any business considerations that are important to you.
For example, many organizations favour methods with output that’s easy
to interpret, so decision trees or logistic regression might be acceptable,
but neural networks would probably not be accepted.
Deliverables for this task include two reports:
● Modeling technique: Specify the technique(s) that you will use.
● Modeling assumptions: Many modeling techniques are based on
certain assumptions. For example, a model type may be intended for
use with data that has a specific type of distribution. Document these
assumptions in this report.

2. Designing tests
The test in this task is the test that you’ll use to determine how well your model works. It
may be as simple as splitting your data into a group of cases for model training and another
group for model testing.
Training data is used to fit mathematical forms to the data model, and test data is used during
the model-training process to avoid overfitting: making a model that’s perfect for one dataset,
but no other. You may also use holdout data, data that is not used during the model-training
process, for an additional test.
The deliverable for this task is your test design. It need not be elaborate, but you should at
least take care that your training and test data are similar and that you avoid introducing any
bias into the data.
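A minimal sketch of such a test design with scikit-learn is shown below; the 80/20 split ratio and the bundled iris dataset are assumptions used purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# hold back 20% of the cases as a test set; the model never sees these during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), "training cases,", len(X_test), "test cases")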

3. Building model(s)
Modeling is what many people imagine to be the whole job of the data miner, but it’s just one
task of dozens! Nonetheless, modeling to address specific business goals is the heart of the
data-mining profession.
Deliverables for this task include three items:
● Parameter settings: When building models, most tools give you the option of
adjusting a variety of settings, and these settings have an impact on the structure of
the final model. Document these settings in a report.
● Model descriptions: Describe your models. State the type of model (such as linear
regression or neural network) and the variables used. Explain how the model is
interpreted. Document any difficulties encountered in the modeling process.
● Models: This deliverable is the models themselves. Some model types can be easily
defined with a simple equation; others are far too complex and must be transmitted in
a more sophisticated format.
4. Assessing model(s)
Now you will review the models that you’ve created, from a technical standpoint and also
from a business standpoint (often with input from business experts on your project team).
Deliverables for this task include two reports:
● Model assessment: Summarizes the information developed in your model review. If
you have created several models, you may rank them based on your assessment of
their value for a specific application.
● Revised parameter settings: You may choose to fine-tune settings that were used to
build the model and conduct another round of modeling and try to improve your
results.

VALIDATION:
Why data validation?
Data validation happens immediately after data preparation/wrangling and before
modeling. This is because during data preparation there is a high possibility of things going
wrong, especially in complex scenarios.
Data validation ensures that modeling happens on the right data; faulty data as input
to the model would generate faulty insights!
How is data validation done?
Data validation should be done by involving at least one external person who has a
proper understanding of the data and the business.
It is usually the client who is technically good enough to check the data. Once we go
through data preparation and just before data modeling, we usually create data visualizations
and hand the newly prepared data over to the client.
The client, with the help of SQL queries or any other tools, tries to validate that the output
contains no errors.
Combining CRISP-DM/ASUM-DM with the agile methodology, steps can be taken in
parallel, meaning you do not have to wait for the green light from data validation to do the
modeling. But once you get feedback from the domain expert that there are faults in the data,
you need to correct them by re-doing the data preparation and re-modeling the data.
What are the common causes leading to a faulty output from data preparation?
Common causes are:
1. Lack of proper understanding of the data, therefore, the logic of the data
preparation is not correct.
2. Common bugs in programming/data preparation pipeline that led to a faulty output.
EVALUATION:
The evaluation phase includes three tasks. These are
● Evaluating results
● Reviewing the process
● Determining the next steps

Task: Evaluating results


At this stage, you’ll assess the value of your models for meeting the business goals that
started the data-mining process. You’ll look for any reasons why the model would not be
satisfactory for business use. If possible, you’ll test the model in a practical application, to
determine whether it works as well in the workplace as it did in your tests.
Deliverables for this task include two items:
● Assessment of results (for business goals): Summarize the results with respect to the
business success criteria that you established in the business-understanding phase.
Explicitly state whether you have reached the business goals defined at the start of the
project.
● Approved models: These include any models that meet the business success criteria.

Task: Reviewing the process


Now that you have explored data and developed models, take time to review your process.
This is an opportunity to spot issues that you might have overlooked and that might draw
your attention to flaws in the work that you’ve done while you still have time to correct the
problem before deployment. Also consider ways that you might improve your process for
future projects.
The deliverable for this task is the review of process report. In it, you should outline your
review process and findings and highlight any concerns that require immediate attention,
such as steps that were overlooked or that should be revisited.
Task: Determining the next steps
The evaluation phase concludes with your recommendations for the next move. The model
may be ready to deploy, or you may judge that it would be better to repeat some steps and try
to improve it. Your findings may inspire new data-mining projects.
Deliverables for this task include two items:
● List of possible actions: Describe each alternative action, along with the strongest
reasons for and against it.
● Decision: State the final decision on each possible action, along with the reasoning
behind the decision.

Data Interpretation Examples


Data interpretation is the final step of data analysis. This is where you turn results into
actionable items. To better understand it, here is an instance of interpreting data:
Let's say a company has segmented its user base into four age groups. The company can then
notice which age group is most engaged with its content or product. Based on bar charts or pie
charts, it can either develop a marketing strategy to make the product more appealing to
less-engaged groups or develop an outreach strategy that expands on its core user base.
Steps Of Data Interpretation
Data interpretation is conducted in 4 steps:
● Assembling the information you need (like bar graphs and pie charts);
● Developing findings or isolating the most relevant inputs;
● Developing conclusions;
● Coming up with recommendations or actionable solutions.
Considering how these findings dictate the course of action, data analysts must be accurate
with their conclusions and examine the raw data from multiple angles. Different variables
may allude to various problems, so having the ability to backtrack data and repeat the analysis
using different templates is an integral part of a successful business strategy.
What Should Users Question During Data Interpretation?
To interpret data accurately, users should be aware of potential pitfalls present within this
process. You need to ask yourself if you are mistaking correlation for causation. If two things
occur together, it does not indicate that one caused the other.

Data interpretation is the process of assigning meaning to the collected information
and determining the conclusions, significance, and implications of the findings.
The 2nd thing you need to be aware of is your own confirmation bias. This occurs when you
try to prove a point or a theory and focus only on the patterns or findings that support that
theory while discarding those that do not.
The 3rd problem is data relevance. To be specific, you need to make sure that the data you
have collected and analyzed is relevant to the problem you are trying to solve.

Data analysts help people make sense of the numerical data that has been
aggregated, transformed, and displayed. There are two main methods for data interpretation:
quantitative and qualitative.
Qualitative Data Interpretation Method
This is a method for breaking down or analyzing so-called qualitative data, also known as
categorical data. It is important to note that no bar graphs or line charts are used in this
method. Instead, they rely on text. Because qualitative data is collected through
person-to-person techniques, it isn't easy to present using a numerical approach.
Surveys are often used to collect this data because they allow you to assign numerical values
to answers, making them easier to analyze. If we relied solely on the text, it would be a
time-consuming and error-prone process. This is why it must be transformed.

Quantitative Data Interpretation Method
This method is applied when we are dealing with quantitative or numerical data.
Since we are dealing with numbers, the values can be displayed in a bar chart or pie chart.
There are two main types: Discrete and Continuous. Moreover, numbers are easier to analyze
since they involve statistical modeling techniques like mean and standard deviation.
Mean is an average value of a particular data set, obtained by dividing the sum
of the values within that data set by the number of values within that same set.
Standard deviation is a technique used to ascertain how responses align with or deviate
from the average value or mean. It relies on the mean to describe the consistency of the
replies within a particular data set. You can use this when calculating the average pay for a
certain profession and then displaying the upper and lower values in the data set.
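A small worked example of both measures, using made-up salary figures for one profession, is given below.

salaries = [42000, 45000, 48000, 51000, 60000]

mean = sum(salaries) / len(salaries)                              # the average value
variance = sum((s - mean) ** 2 for s in salaries) / len(salaries)
std_dev = variance ** 0.5                                         # standard deviation

print(f"mean: {mean:.2f}, standard deviation: {std_dev:.2f}")
# values roughly one standard deviation above and below the mean give the
# upper and lower band mentioned in the text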
As stated, some tools can do this automatically, especially when it comes to quantitative data.
Whatagraph is one such tool, as it can aggregate data from multiple sources using different
system integrations. It will also automatically organize and analyze the data, which will later be
displayed in pie charts, line charts, or bar charts, however you wish.
Benefits Of Data Interpretation
Multiple data interpretation benefits explain its significance within the corporate world,
medical industry, and financial industry:

Informed decision-making. The managing board must examine the data to take action and
implement new methods. This emphasizes the significance of well-analyzed data as well as a
well-structured data collection process.

Anticipating needs and identifying trends. Data analysis provides users with relevant
insights that they can use to forecast trends based on customer concerns and
expectations.
For example, a large number of people are concerned about privacy and the leakage of
personal information. Products that provide greater protection and anonymity are more likely
to become popular.

Clear foresight. Companies that analyze and aggregate data better understand their own
performance and how consumers perceive them. This provides them with a better
understanding of their shortcomings, allowing them to work on solutions that will
significantly improve their performance.

DEPLOYMENT AND ITERATIONS:


The deployment phase includes four tasks. These are
● Planning deployment (your methods for integrating data-mining discoveries into use)
● Planning monitoring and maintenance
● Reporting final results
● Reviewing final results

Task: Planning deployment


When your model is ready to use, you will need a strategy for putting it to work in your
business.
The deliverable for this task is the deployment plan. This is a summary of your strategy for
deployment, the steps required, and the instructions for carrying out those steps.
Task: Planning monitoring and maintenance
Data-mining work is a cycle, so expect to stay actively involved with your models as they are
integrated into everyday use.
The deliverable for this task is the monitoring and maintenance plan. This is a summary of
your strategy for ongoing review of the model’s performance. You’ll need to ensure that it is
being used properly on an ongoing basis, and that any decline in model performance will be
detected.

Task: Reporting final results


Deliverables for this task include two items:
● Final report: The final report summarizes the entire project by assembling all the
reports created up to this point, and adding an overview summarizing the entire
project and its results.
● Final presentation: A summary of the final report is presented in a meeting with
management. This is also an opportunity to address any open questions.

Task: Review project


Finally, the data-mining team meets to discuss what worked and what didn’t, what would be
good to do again, and what should be avoided!
This step, too, has a deliverable, although it is only for the use of the data-mining team, not
the manager (or client). It’s the experience documentation report.
This is where you should outline any work methods that worked particularly well, so that
they are documented to use again in the future, and any improvements that might be made to
your process. It’s also the place to document problems and bad experiences, with your
recommendations for avoiding similar problems in the future.
Iterations are done to upgrade the performance of the system.
The outcomes of the decisions, the actions taken and the conclusions drawn from the model are
documented and updated into the database. This helps in changing and upgrading the
performance of the existing system.

Some queries are updated in the database, such as “Were the decision and action impactful?”,
“What was the return on investment?”, and “How did the analysis group compare with the
control group?”. The performance-based database is continuously updated once a new
insight or piece of knowledge is extracted.
UNIT II
BUSINESS INTELLIGENCE

DATA WAREHOUSES AND DATA MART:


What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from single
and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modelling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular
group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."

Characteristics of Data Warehouse


Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore,
data warehouses typically provide a concise and straightforward view around a particular
subject, such as customer, product, or sales, instead of the global organization's ongoing
operations. This is done by excluding data that are not useful concerning the subject and
including all data needed by the users to understand the subject.

Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attributes types, etc., among different
data sources.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even older periods from a data warehouse. This contrasts
with a transaction system, where often only the most current file is kept.

Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-Volatile means that once data has entered the warehouse, it
should not change.

History of Data Warehouse


The idea of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and
Paul Murphy established the "Business Data Warehouse."
In essence, the data warehousing idea was planned to support an architectural model for the flow
of information from the operational systems to decision support environments. The concept
attempted to address the various problems associated with this flow, mainly the high costs
associated with it.
In the absence of a data warehousing architecture, a vast amount of space was required to support
multiple decision support environments. In large corporations, it was common for various
decision support environments to operate independently.
Goals of Data Warehousing
o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.
Need for Data Warehouse
Data Warehouse is needed for the following reasons:
1. Business User: Business users require a data warehouse to view summarized data from
the past. Since these people are non-technical, the data may be presented to them in an
elementary form.
2. Store historical data: Data Warehouse is required to store the time variable data from the
past. This input is made to be used for various purposes.
3. Make strategic decisions: Some strategies may be depending upon the data in the data
warehouse. So, data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing data from different sources to a common
place, the user can effectively bring uniformity and consistency to the data.
5. High response time: Data warehouse has to be ready for somewhat unexpected loads and
types of queries, which demands a significant degree of flexibility and quick response
time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand,
and query.
4. Queries that would be complex in many normalized databases could be easier to build and
maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from
lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.

DATA MART
A Data Mart is focused on a single functional area of an organization and contains a
subset of data stored in a Data Warehouse. A Data Mart is a condensed version of Data
Warehouse and is designed for use by a specific department, unit or set of users in an
organization. E.g., Marketing, Sales, HR or finance. It is often controlled by a single department
in an organization.
Data Mart usually draws data from only a few sources compared to a Data warehouse. Data
marts are small in size and are more flexible compared to a Datawarehouse.
Why do we need Data Mart?
● Data Mart helps to improve end users’ response time due to the reduction in the volume of data
● It provides easy access to frequently requested data.
● Data mart are simpler to implement when compared to corporate Datawarehouse. At the
same time, the cost of implementing Data Mart is certainly lower compared with
implementing a full data warehouse.
● Compared to Data Warehouse, a DataMart is agile. In case of change in model, DataMart
can be built quicker due to a smaller size.
● A Datamart is defined by a single Subject Matter Expert. On the contrary, a data warehouse
is defined by interdisciplinary SMEs from a variety of domains. Hence, a Data mart is more
open to change compared to a Datawarehouse.
● Data is partitioned and allows very granular access control privileges.
● Data can be segmented and stored on different hardware/software platforms.
Types of Data Mart
There are three main types of data mart:
1. Dependent: Dependent data marts are created by drawing data from an existing central data
warehouse (which itself gathers data from operational, external or both sources).
2. Independent: Independent data mart is created without the use of a central data
warehouse.
3. Hybrid: This type of data marts can take data from data warehouses or operational
systems.
Dependent Data Mart
A dependent data mart allows sourcing organization’s data from a single Data Warehouse. It is
one of the data marts examples which offers the benefit of centralization. If you need to develop
one or more physical data marts, then you need to configure them as dependent data marts.
A Dependent Data Mart in a data warehouse can be built in two different ways: either where a user
can access both the data mart and the data warehouse, depending on need, or where access is limited
only to the data mart. The second approach is not optimal as it produces what is sometimes referred to
as a data junkyard, where all data begins from a common source but is scrapped, and mostly junked.
Dependent Data Mart
Independent Data Mart
An independent data mart is created without the use of central Data warehouse. This kind of
Data Mart is an ideal option for smaller groups within an organization.
An independent data mart has neither a relationship with the enterprise data warehouse nor with
any other data mart. In Independent data mart, the data is input separately, and its analyses are
also performed autonomously.
Implementation of independent data marts is antithetical to the motivation for building a data
warehouse. First of all, you need a consistent, centralized store of enterprise data which can be
analysed by multiple users with different interests who want widely varying information.
Independent Data Mart
Hybrid Data Mart:
A hybrid data mart combines input from sources apart from Data warehouse. This could be
helpful when you want ad-hoc integration, like after a new group or product is added to the
organization.
It is the data mart example best suited for multiple database environments and fast
implementation turnaround for any organization. It also requires the least data cleansing effort.
A Hybrid Data mart also supports large storage structures, and it is best suited for flexible,
smaller data-centric applications.
Hybrid Data Mart

Steps in Implementing a Datamart

Implementing a Data Mart is a rewarding but complex procedure. Here are the detailed steps to
implement a Data Mart:
Designing
Designing is the first phase of Data Mart implementation. It covers all the tasks from
initiating the request for a data mart through gathering information about the requirements. Finally, we
create the logical and physical Data Mart design.
The design step involves the following tasks:
● Gathering the business & technical requirements and Identifying data sources.
● Selecting the appropriate subset of data.
● Designing the logical and physical structure of the data mart.
Data could be partitioned based on following criteria:
● Date
● Business or Functional Unit
● Geography
● Any combination of above
Data could be partitioned at the application or DBMS level. Though it is recommended to
partition at the Application level as it allows different data models each year with the change in
business environment.
What Products and Technologies Do You Need?
A simple pen and paper would suffice, though tools that help you create UML or ER
diagrams would also append metadata into your logical and physical designs.
Constructing
This is the second phase of implementation. It involves creating the physical database and the
logical structures.
This step involves the following tasks:
● Implementing the physical database designed in the earlier phase. For instance, database
schema objects like table, indexes, views, etc. are created.
What Products and Technologies Do You Need?
You need a relational database management system to construct a data mart. RDBMS have
several features that are required for the success of a Data Mart.
● Storage management: An RDBMS stores and manages the data to create, add, and delete
data.
● Fast data access: With a SQL query you can easily access data based on certain
conditions/filters.
● Data protection: The RDBMS system also offers a way to recover from system failures
such as power failures. It also allows restoring data from backups in case the disk
fails.
● Multiuser support: The data management system offers concurrent access, the ability for
multiple users to access and modify data without interfering or overwriting changes made
by another user.
● Security: The RDMS system also provides a way to regulate access by users to objects
and certain types of operations.
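Purely as an illustration of creating the physical schema objects mentioned above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a full RDBMS; the table and index names are made up for the example.

import sqlite3

conn = sqlite3.connect("sales_mart.db")

# create a fact table and an index - the kind of schema objects built in this phase
conn.execute("""
    CREATE TABLE IF NOT EXISTS sales_fact (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER,
        sale_date   TEXT,
        amount      REAL
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON sales_fact (sale_date)")

conn.commit()
conn.close()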
Populating:
In the third phase, data is populated in the data mart.
The populating step involves the following tasks:
● Source data to target data Mapping
● Extraction of source data
● Cleaning and transformation operations on the data
● Loading data into the data mart
● Creating and storing metadata

What Products and Technologies Do You Need?


You accomplish these population tasks using an ETL (Extract Transform Load) Tool. This tool
allows you to look at the data sources, perform source-to-target mapping, extract the data,
transform, cleanse it, and load it back into the data mart.
In the process, the tool also creates some metadata relating to things like where the data came
from, how recent it is, what type of changes were made to the data, and what level of
summarization was done.
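The sketch below gives a hedged feel for this population step in plain Python/pandas: mapping source fields to target fields, doing light cleaning and recording simple load metadata. The file and field names are assumptions, and a real ETL tool would handle far more than this.

import pandas as pd
from datetime import datetime, timezone

source = pd.read_csv("crm_export.csv")                       # extraction of source data

mapping = {"CustID": "customer_id", "Amt": "order_amount"}   # source-to-target mapping
target = source.rename(columns=mapping)[list(mapping.values())]
target["order_amount"] = target["order_amount"].fillna(0)    # cleaning/transformation

target.to_csv("sales_mart_orders.csv", index=False)          # loading into the data mart

# simple metadata about the load
metadata = {"source": "crm_export.csv",
            "loaded_at": datetime.now(timezone.utc).isoformat(),
            "row_count": len(target)}
print(metadata)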
Accessing
Accessing is the fourth step, which involves putting the data to use: querying the data, creating
reports and charts, and publishing them. End-users submit queries to the database and display the
results of the queries.
The accessing step needs to perform the following tasks:
● Set up a meta layer that translates database structures and objects names into business
terms. This helps non-technical users to access the Data mart easily.
● Set up and maintain database structures.
● Set up API and interfaces if required
What Products and Technologies Do You Need?
You can access the data mart using the command line or GUI. GUI is preferred as it can easily
generate graphs and is user-friendly compared to the command line.
Managing
This is the last step of Data Mart Implementation process. This step covers management tasks
such as-
● Ongoing user access management.
● System optimizations and fine-tuning to achieve enhanced performance.
● Adding and managing fresh data into the data mart.
● Planning recovery scenarios and ensuring system availability in case the system
fails.
What Products and Technologies Do You Need?
You could use the GUI or command line for data mart management.
Advantages and Disadvantages of a Data Mart
Advantages
● Data marts contain a subset of organization-wide data. This Data is valuable to a specific
group of people in an organization.
● It is a cost-effective alternative to a data warehouse, which can be very costly to build.
● Data Mart allows faster access of Data.
● Data Mart is easy to use as it is specifically designed for the needs of its users. Thus a
data mart can accelerate business processes.
● Data Marts need less implementation time compared to Data Warehouse systems. It is
faster to implement a Data Mart as you only need to concentrate on a subset of the data.
● It contains historical data which enables the analyst to determine data trends.
Disadvantages
● Many times, enterprises create too many disparate and unrelated data marts without
much benefit. These can become a big hurdle to maintain.
● Data Mart cannot provide company-wide data analysis as their data set is limited.
KNOWLEDGE MANAGEMENT:
What is Knowledge Management?
Knowledge management is the conscious process of defining, structuring, retaining, and
sharing the knowledge and experience of employees within an organization.
Knowledge Management is a systematic process that involves handling, overseeing, and
understanding all kinds of knowledge databases within an organization, including
the intangible skills that their employees possess.
The process of knowledge management is a multi-disciplinary approach that makes use of
many departments in an organization. The primary objective is to understand, modify, improve,
and maintain an organization’s efficiency in terms of knowledge. Retaining, growing, and
bringing in new talent and new knowledge is also a part of knowledge management.
It also includes managing, understanding, scheduling, and executing training programs for
the organization’s employees.
Importance of Knowledge Management
Knowledge is an essential aspect of an organization. The organization runs on the knowledge of
the employees. The more knowledgeable they are, the better it is in terms of business.
Every organization has understood the importance of knowledge management and is
implementing it accordingly. Knowledge management is one of the best ways to stay ahead in
the market and have the edge over competitors. Building a hard-working workforce is essential,
but having a knowledgeable and hard-working workforce is an asset.
Since knowledge management primarily focuses on the organization’s vision and mission,
the employees are trained in sync with the organizational objectives. Most of the companies
have seen a high increase in productivity with knowledge management.
It is vital for organizations that are on a growth phase and need access to a
reliable database of knowledge.
If knowledge management is not present in your organization, then your employees will
have to re-learn various information and processes. This will be not only inefficient but also
expensive.
Areas of Knowledge Management
There are primarily two areas of knowledge management, which are crucial for an organization.
These two areas are:
1. Explicit knowledge
The skills that are understood and transferred from one to the other are called explicit
knowledge. It is also known as codified or formal knowledge. It can also be written down in the
manual or as instructions and passed on from one person to the other.
2. Tacit knowledge
Tacit knowledge is precisely the opposite of explicit knowledge. It is challenging to articulate
and transfer it to others. Things such as aesthetic senses, body language, innovative thinking,
which are difficult to teach someone, are part of Tacit knowledge.
Both types are further categorized into four different categories.
1. Factual knowledge is verifiable, observable, and measurable.
2. Conceptual knowledge, which is in regards to systems and perspectives
3. Methodological knowledge which deals with problem-solving and decision-making
skills
4. And Exceptional knowledge which is related to judgments, hypothesis, and
expectations.
Knowledge management supports lifelong learning in an organization, an area in which companies are
experts and invested. With the help of knowledge management, companies gather knowledge of
the production processes involved in preparing a product or a service.
The objective of organizations to accumulate knowledge is not only to maintain it in the
company but also to disperse it to their employees, existing, and new ones.
Strategies of Knowledge Management
Knowledge can be accessed at three stages: before, during, or after knowledge
management and its related activities. Knowledge management has become a part
of performance improvement or performance measurement plans in organizations.
One such strategy of knowledge management involves active management of knowledge.
Individuals share their knowledge in the database in coded form, and decoding is required while
retrieving the knowledge. This process is called codification.
Another strategy of knowledge management is requesting expert associates on a requirement
basis. This is also known as a pull strategy.
The following are a few of the other strategies of knowledge management that are incorporated
by different organizations.
1. Knowledge sharing
This concept encourages sharing the available information based on the fact that knowledge is
not exclusive and should be updated and shared. It is one of the critical roles in the job
description of the employee.
Different types of knowledge sharing are expected, like inter-project knowledge transfer, and inter-
and intra-organizational knowledge sharing.
2. Knowledge retention
Whenever an employee leaves the organization, there is a loss of knowledge along with the
employee. This creates a challenge for other people in the organization because the knowledge
of the person who has left will be unique, and not everybody will know about it.
Companies, in this case, try to retain the knowledge instead of losing it. This also involves
mapping knowledge competences and predicting current as well as future gaps in the
knowledge. Knowledge transfer is conducted in the organization before the departure of the
employee.
This involves sharing the relevant documents, mentoring, shadowing, and implementing every
appropriate step by which the employee’s existing knowledge is transferred to his replacement
employee in the organization before he quits.
3. Storytelling
Storytelling is one of the best ways of transferring tacit knowledge. As we already discussed,
tacit knowledge is not something that can be easily transferred. Still, with the help of
storytelling, different analogies are used to transfer the knowledge from one person to the other.
Storytelling makes knowledge transfer more natural, more understandable, and less technical
and complicated.
4. Mentoring
This is a knowledge management strategy that is used in almost every organization. In
mentoring, a senior employee is allocated as a mentor to the new employee.
Ideally, the mentor is someone who has already worked in a role similar to that of the new
employee. The mentor's primary job is to train the new candidate in the practicalities of
the new job and pass on his experience, knowledge, and database to the new employee.
The process of mentoring is very successful in knowledge management because the seasoned
employee, or mentor, who knows everything about the organization will selectively pass on
the knowledge to his new colleague.
The new employee can ask about any practical difficulties he will face in his work.
Even individual queries can be handled, which may not be possible in a group training session.
5. Document Management systems
These are document systems, such as cloud drives, that help organizations store data about
their company. New employees get access and permission to view and update their knowledge
regarding the company; however, access to certain material is restricted to selected employees.
Other strategies of knowledge management are after-action reviews, communities of practice,
knowledge mapping, best practice transfer, expert directories, expert systems, collaborative
software, knowledge brokers, etc.
Many organizations have even developed different strategies of their own to manage their
knowledge.
Working of knowledge management
All employees in the organization should be equipped to make the best possible business
decisions. Continuous learning and improvement is one way to do that.
Implementation of knowledge management helps make such continuous improvement possible.
This way, current employees can keep practicing their existing skills, and future employees will
also be able to grasp the same knowledge and catch up with current employees.
Following are three common ways by which knowledge management can be approached:
1. People-centric
This approach is based on people and relationships, and on how people from different learning
communities share knowledge.
2. Technology centric
This primarily facilitates knowledge transfer with the help of technology, and it achieves a
system that encourages the sharing of knowledge in the organization and amongst different
employees.
3. Process centric
This approach looks at how organizational processes accommodate knowledge sharing. It tries to
understand how knowledge sharing works within the existing organizational structure.
Type of information to be captured in knowledge management
Different types of information are captured as a part of the knowledge management process.
This includes but is not limited to:
1. Documents
Frequently asked questions about products, product templates, brochures, company handbooks, press
release notes.
2. Data
Competitor information, presentation tips, product strategies and product development plans,
successful practices.
3. Organizational data
Information about employee turnover, procurement sheets, financial information, office
location, brand information, etc.
4. Organizational news
Information regarding future strategies, important dates, important updates, and promotions.
Advantages of knowledge management
Knowledge management is an essential process that every organization has to include. Apart
from benefiting the organization, it saves future costs and can be a useful resource for new
employees.
Following are a few of the advantages of knowledge management:
1. Pass the baton
Consider that an employee has joined the organization in its initial phase and has learned many
things. After a few years, the employee decides to quit the organization, and for the next six
months, the organization does not find a suitable candidate.
After six months, the new candidate who joins will have two options: either start from where his
predecessor started or start from where his predecessor stopped. The latter way will only be
possible if the organization has a knowledge management system.
This saves time for the organization and the employee, and it also speeds up the business. This
way, the company can focus on other important matters.
2. Easier access to information
Knowledge management is a systematic arrangement of available knowledge in the database of
the organization. Since the arrangement is orderly, it is effortless to access required pieces of
knowledge or information instantly.
Any new person who is not aware of the database can access it easily.
3. Avoid mistakes
With the help of knowledge management, mistakes can be avoided. All the information related
to the past is stored in the database. History should not repeat itself, at least when it comes to
past mistakes.
Alternatively, good things and best practices which are stored can be repeated with the help of
knowledge management.
Newbies can access the old database and understand how they did things. Past lessons can be
learned without living through the lesson and wasting time on it.
4. Decision making
Different employees have different experiences. While making important decisions, one can
access the earlier database in the knowledge system to check past decisions and their
repercussions.
One can review the information present in multiple pieces of data, which helps in making an
informed decision. Sometimes similar decisions work across generations.
A CEO who presided over the company in the 1920s will not be available in 2020, and this is why
the knowledge database collected in the 1920s will be useful in 2020.
5. Better service
When you have an adequate knowledge management system in your organization, it helps the
employees and customer support team solve customers’ problems efficiently and quickly.
You can even form a frequently asked questions database to help your customer service team
answer the queries faster. This way, the customer service team and other employees will be able
to stay satisfied and happy, along with your customers.
6. Standardization
When a process is documented, or a piece of information is stored in your database, it will help
you to standardize things more efficiently. Everyone will follow a single way of understanding,
and following the procedures will be easy.
Challenges of Knowledge Management
Now that we have seen the benefits let us understand the challenges of knowledge management.
1. Security
While maintaining knowledge management, a database is essential, and securing it is an entirely
different and challenging task. External threats are always prying on the organization’s
knowledge management database, and the organization must secure its database.
If it is in the wrong hands, sensitive and private information and knowledge can destroy the
entire company.
2. Collaboration and flexibility
Once knowledge management is in place, it is not easy to implement new policies because people
develop a comfort zone around the existing policies. Since everyone is learning from the old
database, this mindset gets embedded at every level of the organization.
Employees find it very difficult to implement new policies and procedures. Learning new skills
that are not in the old database can be difficult for the employees.
3. Knowledge measurement
Tacit knowledge is something that cannot be quantified or passed on easily, like explicit
knowledge.
Measuring tacit knowledge in your employees is difficult because, unlike explicit knowledge, it
cannot be quantified easily. Yet an employee has to have both tacit and explicit knowledge.
4. Knowledge storage
A process is usually established in the organization through which different team members can
access the knowledge database. While this is very convenient for passing on knowledge to
other employees, it is not easy in practice to store so much knowledge.
It is equally difficult to access and go through such a massive database.
5. Knowledge manager
Managing the entire database of knowledge requires a dedicated person who is an expert in
handling every type of knowledge. This person has to have knowledge of the processes, a basic
grounding in knowledge management, and familiarity with the methods of the organization.
Finding such a person is difficult and expensive, and resting the entire database with one person
concentrates too much power in a single individual.
6. Management of documents and storage
Documentation is complicated when it comes to storage and management. Every day, an
organization generates thousands of documents from a single transaction, and it is essential to
store all of them in one single place.
Thousands of GB of storage is required to store and maintain such a vast database.
7. Continuous improvement and upgrading
Having a knowledge management database is one thing, and maintaining that database is a
different thing. The security measures that protect the database should be updated every six
months to a year. Periodic review of the relevant database, discarding the unnecessary material
and keeping the necessary material, is a humongous task, especially if the data is too large.
Continuously updating and improving the database is a big challenge for large organizations.
For smaller organizations, it could still be manageable, but eventually, it gets tedious.

TYPES OF DECISIONS
Decision-making is one of the core functions of management. And it is actually a very scientific
function with a well-defined decision-making process. There are various types of decisions the
managers have to take in the day-to-day functioning of the firm. Let us take a look at some of
the types of decisions.
1. Programmed and Non-Programmed Decisions
2. Major and Minor Decisions
3. Routine and Strategic Decisions
4. Organizational and Personal Decisions
5. Individual and Group Decisions
6. Policy and Operative Decisions and
7. Long-Term Departmental and Non-Economic Decisions.
Type # 1. Programmed and Non-Programmed Decisions:
(a) Programmed decisions are those made in accordance with some habit, rule or procedure.
Every organisation has written or unwritten policies that simplify decision making in recurring
situations by limiting or excluding alternatives.
For example, we would not usually have to worry about what to pay to a newly hired employee;
organizations generally have an established salary scale for all positions. Routine procedures
exist for dealing with routine problems.
Routine problems are not necessarily simple ones; programmed decisions are used for dealing
with complex as well as with uncomplicated issues. To some extent, of course, programmed
decisions limit our freedom, because the organization rather than the individual decides what to
do.
However, the policies, rules or procedures by which we make decisions free us of the time
needed to work out new solutions to old problems, thus allowing us to devote attention to other,
more important activities in the organization.
(b) Non-programmed decisions are those that deal with unusual or exceptional problems. If a
problem has not come up often enough to be covered by a policy or is so important that it
deserves special treatment, it must be handled by a non-programmed decision.
Such problems as:
(1) How to allocate an organisation’s resources
(2) What to do about a failing product line,
(3) How community relations should be improved will usually require non-programmed
decisions.
As one moves up in the organizational hierarchy, the ability to make non-programmed decisions
becomes more important because progressively more of the decisions made are
non-programmed.
Type # 2. Major and Minor Decisions:
A decision related to the purchase of a CNC machine costing several lakhs is a major decision,
whereas the purchase of a few reams of typing paper is a minor decision.
Type # 3. Routine and Strategic Decisions:
Routine decisions are of repetitive nature, do not require much analysis and evaluation, are in
the context of day-to-day operations of the enterprise and can be made quickly at middle
management level. An example is, sending samples of a food product to the Government
investigation centre.
Strategic decisions relate to policy matter, are taken at higher levels of management after careful
analysis and evaluation of various alternatives, involve large expenditure of funds and a slight
mistake in decision making is injurious to the enterprise. Examples of strategic decisions are-
capital expenditure decisions, decisions related to pricing, expansion and change in product line
etc.
Type # 4. Organizational and Personal Decisions:
A manager makes organizational decisions in the capacity of a company officer. Such decisions
reflect the basic policy of the company. They can be delegated to others. Personal decisions
relate to the manager as an individual and not as a member of the organization. Such decisions
cannot be delegated.
Type # 5. Individual and Group Decisions:
Individual decisions are taken by a single individual in context of routine decisions where
guidelines are already provided. Group decisions are taken by a committee constituted for this
specific purpose. Such decisions are very important for the organisation.
Type # 6. Policy and Operative Decisions:
Policy decisions are very important, they are taken by top management, they have a long-term
impact and mostly relate to basic policies. Operative decisions relate to day-to-day operations of
the enterprise and are taken at lower or middle management level. Whether to give bonus to
employees is a policy decision but calculating bonus for each employee is an operative decision.
Type # 7. Long-Term Departmental and Non-Economic Decisions:
In case of long-term decisions, the time period covered is long and the risk involved is more.
Departmental decisions relate to a particular department only and are taken by departmental
head. Non-economic decisions relate to factors such as technical values, moral behaviour etc.

DECISION MAKING PROCESS


In general, the decision-making process helps managers and other business professionals solve
problems by examining choices and deciding on the best route to take. Using a step-by-step
approach is an efficient way to make thoughtful, informed decisions that positively impact your
organization’s short- and long-term goals.
The business decision making process is commonly divided into seven steps. Managers may
utilize many of these steps without realizing it, but gaining a clearer understanding of best
practices can improve the effectiveness of your decisions.
Steps of the Decision-Making Process
The following are the seven key steps of the decision-making process.
Step 1: Identify the decision that needs to be made
When you're identifying the decision, ask yourself a few questions:
● What is the problem that needs to be solved?
● What is the goal you plan to achieve by implementing this decision?
● How will you measure success?
These questions are all common goal setting techniques that will ultimately help you come up
with possible solutions. When the problem is clearly defined, you then have more information to
come up with the best decision to solve the problem.
Step 2: Gather relevant information
​Gathering information related to the decision being made is an important step to making an
informed decision. Does your team have any historical data as it relates to this issue? Has
anybody attempted to solve this problem before?
It's also important to look for information outside of your team or company. Effective decision
making requires information from many different sources. Find external resources, whether it’s
doing market research, working with a consultant, or talking with colleagues at a different
company who have relevant experience. Gathering information helps your team identify
different solutions to your problem.
Step 3: Identify alternative solutions
This step requires you to look for many different solutions for the problem at hand. Finding
more than one possible alternative is important when it comes to business decision-making,
because different stakeholders may have different needs depending on their role. For example, if
a company is looking for a work management tool, the design team may have different needs
than a development team. Choosing only one solution right off the bat might not be the right
course of action.
Step 4: Weigh the evidence
This is when you take all of the different solutions you’ve come up with and analyze how they
would address your initial problem. Your team begins identifying the pros and cons of each
option, and eliminating alternatives from those choices.
There are a few common ways your team can analyze and weigh the evidence of options:
● Pros and cons list
● SWOT analysis
● Decision matrix (see the brief sketch below)
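To make the decision matrix idea concrete, here is a minimal Python sketch of a weighted scoring matrix. The criteria, weights, and ratings are invented purely for illustration and would be replaced by your own team's figures.

# Hypothetical weighted decision matrix; criteria, weights and ratings are illustrative only.
criteria = {"cost": 0.4, "ease_of_use": 0.35, "scalability": 0.25}   # weights sum to 1

# Each alternative is rated 1-5 against every criterion (assumed ratings).
alternatives = {
    "Tool A": {"cost": 4, "ease_of_use": 3, "scalability": 5},
    "Tool B": {"cost": 3, "ease_of_use": 5, "scalability": 3},
    "Tool C": {"cost": 5, "ease_of_use": 2, "scalability": 4},
}

def weighted_score(ratings, weights):
    """Return the weighted sum of ratings for one alternative."""
    return sum(weights[c] * ratings[c] for c in weights)

scores = {name: weighted_score(r, criteria) for name, r in alternatives.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")

The alternative with the highest weighted score becomes a strong candidate for Step 5, although the scores should still be sanity-checked against stakeholder needs rather than followed blindly.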
Step 5: Choose among the alternatives
The next step is to make your final decision. Consider all of the information you've collected
and how this decision may affect each stakeholder.
Sometimes the right decision is not one of the alternatives, but a blend of a few different
alternatives. Effective decision-making involves creative problem solving and thinking outside
the box, so don't limit yourself or your team to clear-cut options. One of the key values at Asana
is to reject false tradeoffs. Choosing just one decision can mean losing benefits in the others. If
you can, try to find options that go beyond just the alternatives presented.
Step 6: Take action
Once the final decision maker gives the green light, it's time to put the solution into action. Take
the time to create an implementation plan so that your team is on the same page for next steps.
Then it’s time to put your plan into action and monitor progress to determine whether or not this
decision was a good one.
Step 7: Review your decision and its impact (both good and bad)
Once you’ve made a decision, you can monitor the success metrics you outlined in step 1. This
is how you determine whether or not this solution meets your team's criteria of success.
Here are a few questions to consider when reviewing your decision:
● Did it solve the problem your team identified in step 1?
● Did this decision impact your team in a positive or negative way?
● Which stakeholders benefited from this decision? Which stakeholders were impacted
negatively?
If this solution was not the best alternative, your team might benefit from using an iterative form
of project management. This enables your team to quickly adapt to changes, and make the best
decisions with the resources they have.
Types of decision-making models
While most decision-making models revolve around the same seven steps, here are a few
different methodologies to help you make a good decision.
​Rational decision-making models
This type of decision-making model is the most common type that you'll see. It's logical and
sequential. The seven steps listed above are an example of the rational decision-making model.
When your decision has a big impact on your team and you need to maximize outcomes, this is
the type of decision-making process you should use. It requires you to consider a wide range of
viewpoints with little bias so you can make the best decision possible.
Intuitive decision-making models
This type of decision-making model is dictated not by information or data, but by gut instincts.
This form of decision making requires previous experience and pattern recognition to form
strong instincts.
This type of decision making is often made by decision makers who have a lot of experience
with similar kinds of problems. They have already had proven success with the solution they're
looking to implement.
Creative decision-making model
The creative decision-making model involves collecting information and insights about a
problem and coming up with potential ideas for a solution, similar to the rational
decision-making model.
The difference here is that instead of identifying the pros and cons of each alternative, the
decision maker enters a period in which they try not to actively think about the solution at all.
The goal is to have their subconscious take over and lead them to the right decision, similar to
the intuitive decision-making model.
This situation is best used in an iterative process so that teams can test their solutions and adapt
as things change.
Common Challenges of Decision Making
Although following the steps outlined above will help you make more effective decisions, there
are some pitfalls to look out for. Here are common challenges you may face and best practices to
help you avoid them.
● Having too much or not enough information. Gathering relevant information is key
when approaching the decision-making process, but it’s important to identify how much
background information is truly required. “An overload of information can leave you
confused and misguided, and prevents you from following your intuition,” according
to Corporate Wellness Magazine.
In addition, relying on one single source of information can lead to bias and misinformation,
which can have disastrous effects down the line.
● Misidentifying the problem. In many cases, the issues surrounding your decision will be
obvious. However, there will be times when the decision is complex and you aren’t sure
where the main issue lies. Conduct thorough research and speak with internal experts who
experience the problem first hand to mitigate this. Corporate Wellness Magazine says it
will save you time and resources in the long run.
● Overconfidence in the outcome. Even if you follow the steps of the decision-making
process, there is still a chance that the outcome won’t be exactly what you had in mind.
That’s why it’s so important to identify a valid option that is plausible and achievable.
Being overconfident in an unlikely outcome can lead to adverse results.
Decision making is a vital skill in the business workplace, particularly for managers and those in
leadership positions. Following a logical procedure like the one outlined here and being aware
of common challenges can help ensure both thoughtful decision making and positive results.
Advantages of Decision-Making Process
Gives More Information
Before taking some action, a good decision-making process gathers enough details. A
significant number of individuals are interested in decision-making. It is carried out by the
whole community rather than by a single person. Each person expresses his or her point of view
on how to deal with a specific situation.
Increase in Available Options
In group decision-making, businesses may obtain a variety of options for a given situation. For
proper decisions, a community of people is working together. Each individual uniquely
approaches a problem. Many of the options are thoroughly examined in light of the handling
situation. To achieve a better outcome, the best one is chosen.
Strengthening The Organization
It fosters a sense of teamwork and solidarity among those who work there. They all work
together to achieve the company’s objectives. This improves the organization’s overall
competitiveness and enhances its overall structure.
Disadvantages of Decision-Making Process
Time Consuming
Decisions are useless if they are not made promptly. Making a decision entails a set of steps that
must be taken to reach a certain conclusion. Different individuals are involved in the
decision-making process. Planning, organizing, and coordinating various people to meet and
have quality conversations requires time and effort.
Costly
The first and most significant drawback to decision-making is that it is prohibitively costly to
carry out. Different individuals are involved in decision-making in organizations to take
appropriate action. Bringing multiple people together in one place necessitates a significant
amount of work. Furthermore, analyzing the various viewpoints offered by members of a
community is too difficult.
Uncertain Responsibility
Another downside of decision-making in groups is the lack of clarity about who is responsible.
Individual decision-making places the burden of proof on a single person. However, when a group
makes a decision, the whole group is involved and liability is unclear, so responsibility cannot be
narrowed down to a single person.
Decision Support Systems (DSS)

Decision Support Systems (DSS) help executives make better decisions by


using historical and current data from internal Information Systems and external
sources. By combining massive amounts of data with sophisticated analytical
models and tools, and by making the system easy to use, they provide a much
better source of information to use in the decision-making process.

Decision Support Systems (DSS) are a class of computerized information


systems that support decision-making activities. DSS are interactive
computer-based systems and subsystems intended to help decision makers use
communications technologies, data, documents, knowledge and/or models to
successfully complete decision process tasks.

While many people think of decision support systems as a specialized part of a


business, most companies have actually integrated this system into their day to
day operating activities. For instance, many companies constantly download
and analyze sales data, budget sheets and forecasts and they update their
strategy once they analyze and evaluate the current results. Decision support
systems have a definite structure in businesses, but in reality, the data and
decisions that are based on it are fluid and constantly changing.

Types of Decision Support Systems (DSS)

1. Data-Driven DSS take the massive amounts of data available through the
company’s TPS and MIS systems and cull from it useful information
which executives can use to make more informed decisions. They don’t
have to have a theory or model but can “free-flow” the data. The first
generic type of Decision Support System is a Data-Driven DSS. These
systems include file drawer and management reporting systems, data
warehousing and analysis systems, Executive Information Systems (EIS)
and Spatial Decision Support Systems. Business Intelligence Systems are
also examples of Data- Driven DSS. Data-Driven DSS emphasize access
to and manipulation of large databases of structured data and especially a
time-series of internal company data and sometimes external data. Simple
file systems accessed by query and retrieval tools provide the most
elementary level of functionality. Data warehouse systems that allow the
manipulation of data by computerized tools tailored to a specific task and
setting or by more general tools and operators provide additional
functionality. Data-Driven DSS with Online Analytical Processing
(OLAP) provide the highest level of functionality and decision support
that is linked to analysis of large collections of historical data.
2. Model-Driven DSS A second category, Model-Driven DSS, includes
systems that use accounting and financial models, representational
models, and optimization models. Model-Driven DSS emphasize access
to and manipulation of a model. Simple statistical and analytical tools
provide the most elementary level of functionality. Some OLAP
systems that allow complex analysis of data may be classified as hybrid
DSS systems providing modeling, data retrieval and data summarization
functionality. Model-Driven DSS use data and parameters provided by
decision-makers to aid them in analyzing a situation, but they are not
usually data intensive. Very large databases are usually not needed for
Model-Driven DSS. Model-Driven DSS were isolated from the main
Information Systems of the organization and were primarily used for the
typical “what-if” analysis. That is, “What if we increase production of our
products and decrease the shipment time?” These systems rely heavily on
models to help executives understand the impact of their decisions on the
organization, its suppliers, and its customers.
3. Knowledge-Driven DSS The terminology for this third generic type of
DSS is still evolving. Currently, the best term seems to be
Knowledge-Driven DSS. Adding the modifier “driven” to the word
knowledge maintains a parallelism in the framework and focuses on the
dominant knowledge base component. Knowledge-Driven DSS can
suggest or recommend actions to managers. These DSS are personal
computer systems with specialized problem-solving expertise. The
“expertise” consists of knowledge about a particular domain,
understanding of problems within that domain, and “skill” at solving
some of these problems. A related concept is Data Mining. It refers to a
class of analytical applications that search for hidden patterns in a
database. Data mining is the process of sifting through large amounts of
data to produce data content relationships.
4. Document-Driven DSS A new type of DSS, a Document-Driven DSS or
Knowledge Management System, is evolving to help managers retrieve
and manage unstructured documents and Web pages. A Document-Driven
DSS integrates a variety of storage and processing technologies to
provide complete document retrieval and analysis. The Web provides
access to large document databases including databases of hypertext
documents, images, sounds and video. Examples of documents that
would be accessed by a Document-Based DSS are policies and
procedures, product specifications, catalogs, and corporate historical
documents, including minutes of meetings, corporate records, and
important correspondence. A search engine is a powerful decision aiding
tool associated with a Document-Driven DSS.
5. Communications-Driven and Group DSS Group Decision Support
Systems (GDSS) came first, but now a broader category of
Communications-Driven DSS or groupware can be identified. This fifth
generic type of Decision Support System includes communication,
collaboration and decision support technologies that do not fit within
those DSS types identified. Therefore, we need to identify these systems
as a specific category of DSS. A Group DSS is a hybrid Decision Support
System that emphasizes both the use of communications and decision
models. A Group Decision Support System is an interactive
computer-based system intended to facilitate the solution of problems by
decision-makers working together as a group. Groupware supports
electronic communication, scheduling, document sharing, and other
group productivity and decision support enhancing activities. We have a
number of technologies and capabilities in this category in the framework
— Group DSS, two-way interactive video, whiteboards, bulletin
boards, and email.
Components of Decision Support Systems (DSS)

Traditionally, academics and MIS staffs have discussed building Decision


Support Systems in terms of four major components:

● The user interface


● The database
● The models and analytical tools and
● The DSS architecture and network

This traditional list of components remains useful because it identifies


similarities and differences between categories or types of DSS. The DSS
framework is primarily based on the different emphases placed on DSS
components when systems are actually constructed.
Data-Driven, Document-Driven and Knowledge-Driven DSS need specialized
database components. A Model-Driven DSS may use a simple flat-file database
with fewer than 1,000 records, but the model component is very important.
Experience and some empirical evidence indicate that design and
implementation issues vary for Data-Driven, Document-Driven, Model-Driven
and Knowledge-Driven DSS.

Multi-participant systems like Group and Inter-Organizational DSS also create


complex implementation issues. For instance, when implementing a
Data-Driven DSS a designer should be especially concerned about the user’s
interest in applying the DSS in unanticipated or novel situations. Despite the
significant differences created by the
specific task and scope of a DSS, all Decision Support Systems have similar
technical components and share a common purpose, supporting
decision-making.

A Data-Driven DSS database is a collection of current and historical structured


data from a number of sources that have been organized for easy access and
analysis. We are expanding the data component to include unstructured
documents in Document- Driven DSS and “knowledge” in the form of rules or
frames in Knowledge-Driven DSS. Supporting management decision-making
means that computerized tools are used to make sense of the structured data or
documents in a database.

Mathematical and analytical models are the major component of a


Model-Driven DSS. Each Model-Driven DSS has a specific set of purposes and
hence different models are needed and used. Choosing appropriate models is a
key design issue. Also, the software used for creating specific models needs to
manage needed data and the user interface. In Model-Driven DSS the values of
key variables or parameters are changed, often repeatedly, to reflect potential
changes in supply, production, the economy, sales, the marketplace, costs,
and/or other environmental and internal factors. Information from the models is
then analyzed and evaluated by the decision-maker.
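As a concrete illustration of this kind of "what-if" analysis, the short Python sketch below varies two assumed parameters, production volume and shipment time, in a toy profit model. The model structure and every figure in it are hypothetical and serve only to show how a Model-Driven DSS lets a decision-maker change key variables and compare outcomes.

# Toy "what-if" model in the spirit of a Model-Driven DSS; all figures are hypothetical.
def monthly_profit(units_produced, shipment_days,
                   unit_price=50.0, unit_cost=32.0, daily_holding_cost=0.15):
    """Profit after subtracting production cost and a holding cost tied to shipment time."""
    revenue = units_produced * unit_price
    production_cost = units_produced * unit_cost
    holding_cost = units_produced * daily_holding_cost * shipment_days
    return revenue - production_cost - holding_cost

baseline = monthly_profit(units_produced=10_000, shipment_days=12)

# "What if we increase production of our products and decrease the shipment time?"
scenario = monthly_profit(units_produced=12_000, shipment_days=8)

print(f"Baseline profit : {baseline:,.0f}")
print(f"Scenario profit : {scenario:,.0f}")
print(f"Change          : {scenario - baseline:,.0f}")

In practice the model would be far richer, but the pattern is the same: the decision-maker repeatedly changes parameter values and evaluates the resulting outputs, as described above.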

Knowledge-Driven DSS use special models for processing rules or identifying


relationships in data. The DSS architecture and networking design component
refers to how hardware is organized, how software and data are distributed in
the system, and how components of the system are integrated and connected. A
major issue today is whether DSS should be available using a Web browser on a
company intranet and also available on the Global Internet. Networking is the
key driver of Communications- Driven DSS.

Advantages of Decision Support Systems (DSS)

● Time savings. For all categories of decision support systems, research


has demonstrated and substantiated reduced decision cycle time,
increased employee productivity and more timely information for
decision making. The time savings that have been documented from using
computerized decision support are often substantial. Researchers,
however, have not always demonstrated that decision quality remained
the same or actually improved.
● Enhance effectiveness. A second category of advantage that has been
widely discussed and examined is improved decision making
effectiveness and better decisions. Decision quality and decision making
effectiveness are, however, hard to document and measure. Most
research has examined soft measures like perceived decision quality
rather than objective measures. Advocates of building data warehouses
identify the possibility of more and better analysis that can improve
decision making.
● Improve interpersonal communication. DSS can improve
communication and collaboration among decision makers. In appropriate
circumstances, communications- driven and group DSS have had this
impact. Model-driven DSS provides a means for sharing facts and
assumptions. Data-driven DSS make “one version of the truth” about
company operations available to managers and hence can encourage
fact-based decision making. Improved data accessibility is often a major
motivation for building a data-driven DSS. This advantage has not been
adequately demonstrated for most types of DSS.
● Competitive advantage. Vendors frequently cite this advantage for
business intelligence systems, performance management systems, and
web-based DSS. Although it is possible to gain a competitive advantage
from computerized decision support, this is not a likely outcome. Vendors
routinely sell the same product to competitors and even help with the
installation. Organizations are most likely to gain this advantage from
novel, high risk, enterprise-wide, inward facing decision support systems.
Measuring this is and will continue to be difficult.
● Cost reduction. Some research, and especially case studies, has
documented DSS cost savings from labor savings in making decisions and
from lower infrastructure or technology costs. This is not always a goal of
building a DSS.
● Increase decision maker satisfaction. The novelty of using computers
has and may continue to confound analysis of this outcome. DSS may
reduce frustrations of decision makers, create perceptions that better
information is being used and/or creates perceptions that the individual is
a “better” decision maker. Satisfaction is a complex measure and
researchers often measure satisfaction with the DSS rather than
satisfaction with using a DSS in decision making. Some studies have
compared satisfaction with and without computerized decision aids.
Those studies suggest the complexity and “love/hate” tension of using
computers for decision support.
● Promote learning. Learning can occur as a by-product of initial and
ongoing use of a DSS. Two types of learning seem to occur: learning of
new concepts and the development of a better factual understanding of
the business and decision making environment. Some DSS serve as “de
facto” training tools for new employees. This potential advantage has not
been adequately examined.
● Increase organizational control. Data-driven DSS often make business
transaction data available for performance monitoring and ad hoc
querying. Such systems can enhance management understanding of
business operations and managers perceive that this is useful. What is not
always evident is the financial benefit from increasingly detailed data.

Regulations like Sarbanes-Oxley often dictate reporting requirements and hence


heavily influence the control information that is made available to managers. On
a more ominous note, some DSS provide summary data about decisions made,
usage of the systems, and recommendations of the system. Managers need to be
very careful about how decision-related information is collected and then used
for organizational control purposes. If employees feel threatened or spied upon
when using a DSS, the benefits of the DSS can be reduced. More research is
needed on these questions.
Disadvantages of Decision Support Systems (DSS)

Decision Support Systems can create advantages for organizations and can have
positive benefits, however building and using DSS can create negative
outcomes in some situations.

● Monetary cost. A decision support system requires investing in an
information system to collect data from many sources and analyze them
to support decision making. Some analyses for a Decision Support
System require advanced data analysis, statistics, econometrics and
information systems skills, so hiring specialists to set up the system is
costly.
● Overemphasize decision making. Clearly the focus of those of us
interested in computerized decision support is on decisions and decision
making. Implementing Decision Support System may reinforce the
rational perspective and overemphasize decision processes and decision
making. It is important to educate managers about the broader context of
decision making and the social, political and emotional factors that
impact organizational success. It is especially important to continue
examining when and under what circumstances Decision Support System
should be built and used. We must continue asking if the decision
situation is appropriate for using any type of Decision Support System
and if a specific Decision Support System is or remains appropriate to use
for making or informing a specific decision.
● Assumption of relevance. According to Winograd and Flores (1986),
“Once a computer system has been installed it is difficult to avoid the
assumption that the things it can deal with are the most relevant things for
the manager’s concern.” The danger is that once DSS become common in
organizations, that managers will use them inappropriately. There is
limited evidence that this occurs. Again training is the only way to avoid
this potential problem.
● Transfer of power. Building Decision Support Systems, especially
knowledge-driven Decision Support System, may be perceived as
transferring decision authority to a software program. This is more a
concern with decision automation systems than with DSS. We advocate
building computerized decision support systems because we want to
improve decision making while keeping a human decision maker in the
“decision loop”. In general, we value the “need for human discretion and
innovation” in the decision making process.
● Unanticipated effects. Implementing decision support technologies may
have unanticipated consequences. It is conceivable and it has been
demonstrated that some DSS reduce the skill needed to perform a
decision task. Some Decision Support System overload decision makers
with information and actually reduce decision making effectiveness.
● Obscuring responsibility. The computer does not make a “bad” decision,
people do. Unfortunately some people may deflect personal responsibility
to a DSS. Managers need to be continually reminded that the
computerized decision support system is an intermediary between the
people who built the system and the people who use the system. The
entire responsibility associated with making a decision using a DSS
resides with people who built and use the system.
● False belief in objectivity. Managers who use Decision Support Systems
may or may not be more objective in their decision making. Computer
software can encourage more rational action, but managers can also use
decision support technologies to rationalize their actions. It is an
overstatement to suggest that people using a DSS are more objective and
rational than managers who are not using computerized decision support.
● Status reduction. Some managers argue using a Decision Support
System will diminish their status and force them to do clerical work. This
perceptual problem can be a disadvantage of implementing a DSS.
Managers and IS staff who advocate building and using computerized
decision support need to deal with any status issues that may arise. This
perception may or should be less common now that computer usage is
common and accepted in organizations.
● Information overload. Too much information is a major problem for
people and many DSS increase the information load. Although this can be
a problem, Decision Support System can help managers organize and use
information. Decision Support System can actually reduce and manage
the information load of a user. Decision Support System developers need
to try to measure the information load created by the system and Decision
Support System users need to monitor their perceptions of how much
information they are receiving. The increasing ubiquity of handheld,
wireless computing devices may exacerbate this problem and
disadvantage.

In conclusion, before firms invest in Decision Support Systems, they
must weigh the advantages and disadvantages of the decision support system
to ensure that the investment is worthwhile.
BUSINESS INTELLIGENCE

BI is an umbrella term for data analysis techniques, applications and practices used to
support the decision-making processes in business. The term was proposed by Howard
Dresner in 1989 and became widespread in the late 1990s. Business intelligence assists
business owners in making important decisions based on their business data.
Rather than directly telling business owners what to do, business intelligence allows
them to analyze the data they have to understand trends and get insights, thus
scaffolding the decision-making process. BI includes a wide variety of techniques and
tools for data analytics, including tools for ad-hoc analytics and reporting, OLAP tools,
real-time business intelligence, SaaS BI, etc. Another important area of BI is data
visualization software, dashboards, and scorecards.

THE ROLE OF OLAP IN BUSINESS INTELLIGENCE

OLAP (online analytical processing) is sometimes used as a synonym of business intelligence.


However, this is not correct: OLAP is better described as a function of BI software that enables a
user to extract and view data from different viewpoints.

There are several reasons why OLAP is popular in BI:

∙ It represents data in a multidimensional form, which makes it convenient for analysts and other
business users to analyze numeric values from different perspectives.

∙ OLAP is good for storing, extracting and analyzing large amounts of data. Business
intelligence specialists are able to analyze data accumulated over a long period of time, which
enables more precise results and better forecasting. The architecture of OLAP systems allows
fast access to the data as they typically pre-aggregate data.

∙ OLAP provides wide opportunities for data slicing and dicing and drill down/up/through, which
helps analysts narrow down the data used for BI analysis and reporting (a brief illustration
follows this list).

∙ OLAP systems usually have an intuitive and easy-to-use interface, which allows nontechnical
users to analyze data and generate reports without involving the IT department. Also, OLAP
dimensions use familiar business terms, so employees do not have to receive any additional
training.
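A full OLAP server is beyond the scope of these notes, but the flavour of slicing, dicing and drill-down can be conveyed with a small pandas pivot table. The sales data below is invented purely for illustration, and pandas is used only as a rough stand-in for a real multidimensional OLAP engine.

# Illustrative OLAP-style aggregation with pandas; the data set is made up.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2023, 2023, 2023],
    "region":  ["North", "South", "North", "North", "South", "South"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "revenue": [120, 95, 80, 140, 110, 90],
})

# "Cube"-like view: revenue summarised by region against year and product.
cube = pd.pivot_table(sales, values="revenue", index="region",
                      columns=["year", "product"], aggfunc="sum", fill_value=0)
print(cube)

# Slice: keep only one member of the year dimension.
print(sales[sales["year"] == 2023])

# Drill-down: from region totals to region-by-product detail.
print(sales.groupby("region")["revenue"].sum())
print(sales.groupby(["region", "product"])["revenue"].sum())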
Business Forecasting

3.1 Introduction
The growing competition, rapidity of change in circumstances and the trend
towards automation demand that decisions in business are based on a careful
analysis of data concerning the future course of events and not purely on
guesses and hunches. The future is unknown to us and yet every day we are
forced to make decisions involving the future and therefore, there is uncertainty.
Great risk is associated with business affairs. All businessmen are forced to
make forecasts regarding business activities.
Success in business depends upon successful forecasts of business events. In
recent times, considerable research has been conducted in this field. Attempts
are being made to make forecasting as scientific as possible.
Business forecasting is not a new development. Every businessman must
forecast; even if the entire product is sold before production. Forecasting has
always been necessary. What is new in the attempt to put forecasting on a
scientific basis is to forecast by reference to past history and statistics rather
than by pure intuition and guess-work.
One of the most important tasks before businessmen and economists these days
is to make estimates for the future. For example, a businessman is interested in
finding out his likely sales next year or as long term planning in next five or ten
years so that he adjusts his production accordingly and avoid the possibility of
either inadequate production to meet the demand or unsold stocks.
Similarly, an economist is interested in estimating the likely population in the
coming years so that proper planning can be carried out with regard to jobs for
the people, food supply, etc. First step in making estimates for the future
consists of gathering information from the past. In this connection we usually
deal with statistical data which is collected, observed or recorded at successive
intervals of time. Such data is generally referred to as time series. Thus, when
we observe numerical data at different points of time the set of observations is
known as time series.
Objectives:
After studying this unit, you should be able to:
● describe the meaning of business forecasting
● distinguish between prediction, projection and forecast
● describe the forecasting methods available
● apply the forecasting theories in taking effective business decisions
3.2 Business Forecasting
Business forecasting refers to the analysis of past and present economic
conditions with the object of drawing inferences about probable future business
conditions. The process of making definite estimates of future course of events
is referred to as forecasting and the figure or statements obtained from the
process is known as ‘forecast’; future course of events is rarely known. In order
to be assured of the coming course of events, an organised system of forecasting
helps. The following are two aspects of scientific business forecasting:
1. Analysis of past economic conditions
For this purpose, the components of time series are to be studied. The secular
trend shows how the series has been moving in the past and what its future
course is likely to be over a long period of time. The cyclic fluctuations would
reveal whether the business activity is subject to a boom or a depression. The
seasonal fluctuations would indicate the seasonal changes in the business
activity (a brief illustrative sketch of these components appears after point 2 below).
2. Analysis of present economic conditions
The object of analysing present economic conditions is to study those factors
which affect the sequential changes expected on the basis of the past conditions.
Such factors are new inventions, changes in fashion, changes in economic and
political spheres, economic and monetary policies of the government, war, etc.
These factors may affect and alter the duration of trade cycle. Therefore, it is
essential to keep in mind the present economic conditions since they have an
important bearing on the probable future tendency.
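For readers who want to see the time-series components mentioned under point 1 in concrete form, the sketch below decomposes an invented monthly series into trend, seasonal and residual parts with statsmodels. The data, the additive model, and the 12-month period are assumptions made purely for illustration.

# Illustrative decomposition of an invented monthly series into trend,
# seasonal and residual (irregular) components; an additive model is assumed.
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

index = pd.date_range("2021-01-01", periods=36, freq="MS")
values = [100 + 2 * i + (10 if i % 12 in (10, 11) else 0) for i in range(36)]
series = pd.Series(values, index=index)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())    # estimate of the secular trend
print(result.seasonal.head(12))        # repeating seasonal pattern
print(result.resid.dropna().head())    # what is left over (irregular component)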
3.2.1 Objectives of forecasting in business
Forecasting is a part of human nature. Businessmen also need to look to the
future. Success in business depends on correct predictions. In fact when a man
enters business, he automatically takes with it the responsibility for attempting
to forecast the future.
To a very large extent, success or failure would depend upon the ability to
successfully forecast the future course of events. Without some element of
continuity between past, present and future, there would be little possibility of
successful prediction. But history is not likely to repeat itself and we would
hardly expect economic conditions next year or over the next 10 years to follow
a clear cut prediction. Yet, past patterns prevail sufficiently to justify using the
past as a basis for predicting the future.
A businessman cannot afford to base his decisions on guesses. Forecasting
helps a businessman in reducing the areas of uncertainty that surround
management decision
making with respect to costs, sales, production, profits, capital investment,
pricing, expansion of production, extension of credit, development of markets,
increase of inventories and curtailment of loans. These decisions are to be based
on present indications of future conditions.
However, we know that it is impossible to forecast the future precisely. There is
a possibility of occurrence of some range of error in the forecast. Statistical
forecasts are the methods in which we can use the mathematical theory of
probability to measure the risks of errors in predictions.
3.2.1.1 Prediction, Projection and Forecasting
A great amount of confusion seems to have grown up in the use of words
‘forecast’, ‘prediction’ and ‘projection’.

Forecasts are made by estimating future values of the external factors by means
of prediction, projection or forecast and from these values calculating the
estimate of the dependent variable.
3.2.2 Characteristics of Business Forecasting
● Based on past and present conditions
Business forecasting is based on the past and present economic condition of the
business. To forecast the future, various data, information and facts concerning
the past and present economic condition of the business are analysed.
● Based on mathematical and statistical methods
The process of forecasting includes the use of statistical and mathematical
methods. By using these methods, the actual trend which may take place in
future can be forecasted.
● Period
The forecasting can be made for long term, short term, medium term or any
specific period.
● Estimation of future
Business forecasting is to forecast the future regarding probable economic conditions.
● Scope
Forecasting can be physical as well as financial.
3.2.3 Steps in forecasting
Forecasting of business fluctuations consists of the following steps:
1. Understanding why changes in the past have occurred
One of the basic principles of statistical forecasting is that the forecaster should
use past performance data. The current rate and changes in the rate constitute
the basis of forecasting. Once they are known, various mathematical techniques
can develop projections from them. If an attempt is made to forecast business
fluctuations without understanding why past changes have taken place, the
forecast will be purely mechanical, based solely upon the application of
mathematical formulae, and subject to serious error.
2. Determining which phases of business activity must be measured
After understanding the reasons of occurrence of business fluctuations, it is
necessary to measure certain phases of business activity in order to predict
what changes will probably follow the present level of activity.

3. Selecting and compiling data to be used as measuring devices


There is an interdependent relationship between the selection of statistical data and
the determination of why business fluctuations occur. Statistical data cannot be
collected and analysed in an intelligent manner unless there is sufficient
understanding of business fluctuations. It is important that reasons for business
fluctuations be stated in such a manner that it is possible to secure data that is
related to the reasons.
4. Analysing the data
Lastly, the data is analysed to understanding the reason why change occurs. For
example, if it is reasoned that a certain combination of forces will result in a
given change, the statistical part of the problem is to measure these forces, from
the data available, to draw conclusions on the future course of action. The
methods of drawing conclusions may be called forecasting techniques.

3.2.4 Methods of Business Forecasting


Almost all businessmen forecast about the conditions related to their business.
In recent years scientific methods of forecasting have been developed. The base
of scientific forecasting is statistics. To handle the increasing variety of
managerial forecasting problems, several forecasting techniques have been
developed in recent
years. Forecasting techniques vary from simple expert guesses to complex
analysis of mass data. Each technique has its special use, and care must be taken
to select the correct technique for a particular situation.
Before applying a method of forecasting, the following questions should be
answered:
● What is the purpose of the forecast and how is it to be used?
● What are the dynamics and components of the system for which the
forecast will be made?
● How important is the past, in estimating the future?
The following are the two main types of business forecasting methods:
quantitative and qualitative. While both have unique approaches, they’re similar
in their goals and the information used to make predictions – company data and
market knowledge.

Quantitative forecasting
The quantitative forecasting method relies on historical data to predict future needs
and trends. The data can be from your own company, market activity, or both. It
focuses on cold, hard numbers that can show clear courses of change and action. This
method is beneficial for companies that have an extensive amount of data at their
disposal.

There are four quantitative forecasting methods:


1. Trend series method: Also referred to as time series analysis, this is the most common
forecasting method. Trend series collects as much historical data as possible to identify
common shifts over time. This method is useful if your company has a lot of past data
that already shows reliable trends (a brief code sketch follows this list).
2. The average approach: This method is also based on repetitive trends. The average
approach assumes that the average of past metrics will predict future events.
Companies most commonly use the average approach for inventory forecasting.
3. Indicator approach: This approach follows different sets of indicator data that help
predict potential influences on the general economic conditions, specific target
markets, and supply chain. Some examples of indicators include changes in Gross
Domestic Product (GDP), unemployment rate, and Consumer Price Index (CPI). By
monitoring the applicable indicators, companies can easily predict how these changes
may affect their own business needs and profitability by observing how they interact
with each other. This approach would be the most effective for companies whose sales
are heavily affected by specific economic factors.
4. Econometric modeling: This method takes a mathematical approach using regression
analysis to measure the consistency in company data over time. Regression analysis
uses statistical equations to predict how variables of interest interact and affect a
company. The data used in this analysis can be internal datasets or external factors that
can affect a business,
such as market trends, weather, GDP growth, political changes, and more. Econometric
modeling observes the consistency in those datasets and factors to identify the potential
for repeat scenarios in the future.
For example, a company that sells hurricane impact windows may use econometric
modeling to measure how hurricane season has affected their sales in the past and create
forecasts for future hurricane seasons.
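As promised above, here is a minimal, illustrative sketch of the trend series and average approaches. It assumes Python with the pandas and NumPy libraries (the syllabus tools are Excel and Power BI, so this is only an optional aside), and the monthly sales figures are invented for illustration.

import numpy as np
import pandas as pd

# Hypothetical monthly sales figures (illustrative only)
sales = pd.Series([112, 118, 121, 127, 131, 138, 142, 149], name="monthly_sales")

# Average approach: the mean of past observations serves as the forecast.
average_forecast = sales.mean()

# Trend series approach: fit a simple linear trend over time and
# project it one period ahead.
t = np.arange(len(sales))
slope, intercept = np.polyfit(t, sales.values, deg=1)
trend_forecast = slope * len(sales) + intercept

print(f"Average-approach forecast: {average_forecast:.1f}")
print(f"Trend-based forecast: {trend_forecast:.1f}")

In Excel, the same ideas roughly correspond to the AVERAGE function and to the TREND or FORECAST.LINEAR functions.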

Qualitative forecasting
The qualitative forecasting method relies on the input of those who influence your
company’s success. This includes your target customer base and even your leadership
team. This method is beneficial for companies that don’t have enough complex data to
conduct a quantitative forecast.
There are two approaches to qualitative forecasting:
1. Market research: The process of collecting data points through direct correspondence
with the market community. This includes conducting surveys, polls, and focus groups
to gather real-time feedback and opinions from the target market. Market research
looks at competitors to see how they adjust to market fluctuations and adapt to
changing supply and demand. Companies commonly utilize market research to
forecast expected sales for new product launches.
2. Delphi method: This method collects forecasting data from company professionals.
The company’s foreseeable needs are presented to a panel of experts, who then work
together to forecast the expectations and business decisions that can be made with the
derived insights. This method is used to create long-term business predictions and can
also be applied to sales forecasts.
3.3 Utility of Business Forecasting
Business forecasting occupies an important place in every field of the economy.
It helps businessmen and industrialists to form the policies and plans related
to their activities. On the basis of forecasting, businessmen can estimate the
demand for the product, the price of the product, the condition of the market,
and so on. Business decisions can also be reviewed on the basis of business
forecasting.
3.3.1 Advantages of business forecasting
● Helpful in increasing profit and reducing losses
Every business is carried out with the purpose of earning maximum profits. So,
by forecasting the future price of the product and its demand, the businessman
can predetermine the production cost, the production level and the stock to be
held. Thus, business forecasting is regarded as the key to the success of a
business.
● Helpful in taking management decisions
Business forecasting provides the basis for management decisions, because in
present times the management has to take the decision in the atmosphere of
uncertainties. Also, business forecasting explains the future conditions and
enables the management to select the best alternative.
● Useful to administration
On the basis of forecasting, the government can control the circulation of
money. It can also modify the economic, fiscal and monetary policies to avoid
adverse effects of trade cycles. So, with the help of forecasting, the government
can control the expected fluctuations in future.
● Basis for capital market
Business forecasting helps in estimating the requirement of capital, position of
stock exchange and the nature of investors.
● Useful in controlling the business cycles
The trade cycles cause various depressions in business such as sudden change in
price level, increase in the risk of business, increase in unemployment, etc. By
adopting a systematic business forecasting, businessmen and government can
handle and control the depression of trade cycles.
● Helpful in achieving the goals
Business forecasting helps to achieve the objective of business goals through
proper planning of business improvement activities.
● Facilitates control
By business forecasting, the tendency of black marketing, speculation,
uneconomic activities and corruption can be controlled.
● Utility to society
With the help of business forecasting the entire society is also benefited because
the adverse effects of fluctuations in the conditions of business are kept under
control.
3.3.2 Limitations of business forecasting
Business forecasting cannot be accurate due to various limitations which are
mentioned below.
● Forecasting cannot be accurate, because it is largely based on future events
and there is no guarantee that they will happen.
● Business forecasting is generally made by using statistical and
mathematical methods. However, these methods cannot claim to make an
uncertain future a definite one.
● The underlying assumptions of business forecasting cannot be satisfied
simultaneously. In such a case, the results of forecasting will be misleading.
● The forecasting cannot guarantee the elimination of errors and mistakes.
The managerial decision will be wrong if the forecasting is done in a wrong
way.
● Factors responsible for economic changes are often difficult to discover and
measure. Hence, business forecasting can become an unreliable exercise.
● Business forecasting does not evaluate risks.
● The forecasting is made on the basis of past information and data and relies
on the assumption that economic events are repeated under the same
conditions. But there may be circumstances where these conditions are not
repeated.
● Forecasting is often not carried out as a continuous process, although, in
order to be effective, it requires continuous attention.
Predictive Analytics
Predictive Analytics is a statistical method that utilizes algorithms and machine
learning to identify trends in data and predict future behaviors.
With increasing pressure to show a return on investment (ROI) for implementing
learning analytics, it is no longer enough for a business to simply show how learners
performed or how they interacted with learning content. It is now desirable to go
beyond descriptive analytics and gain insight into whether training initiatives are
working and how they can be improved.
Predictive Analytics can take both past and current data and offer predictions of what
could happen in the future. This identification of possible risks or opportunities enables
businesses to take actionable intervention in order to improve future learning
initiatives.

How does Predictive Analytics work?

The software for predictive analytics has moved beyond the realm of statisticians and is
becoming more affordable and accessible for different markets and industries,
including the field of learning & development.
For online learning specifically, predictive analytics is often found incorporated in the
Learning Management System (LMS), but can also be purchased separately as
specialized software.
For the learner, predictive forecasting could be as simple as a dashboard located on the
main screen after logging in to access a course. Analyzing data from past and current
progress, visual indicators in the dashboard could be provided to signal whether the
employee was on track with training requirements.
At the business level, an LMS system with predictive analytic capability can help
improve decision-making by offering in-depth insight to strategic questions and
concerns. This could
range from course enrolment, to course completion rates, to employee
performance.
Predictive analytic models
Because predictive analytics goes beyond sorting and describing data, it relies heavily
on complex models designed to make inferences about the data it encounters. These
models utilize algorithms and machine learning to analyze past and present data in
order to provide future trends.
Each model differs depending on the specific needs of those employing predictive
analytics. Some common basic models that are utilized at a broad level include:
● Decision trees use branching to show possibilities stemming from each
outcome or choice (a brief sketch follows this list).
● Regression techniques assist with understanding relationships between
variables.
● Neural networks utilize algorithms to figure out possible relationships
within data sets.
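To make the decision-tree idea above concrete, below is a minimal sketch assuming Python with scikit-learn (a tool not prescribed by this text); the learner records and labels are invented for illustration.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical learner records: [hours_logged, modules_completed]
X = [[2, 1], [5, 3], [8, 6], [1, 0], [9, 7], [4, 2]]
# 1 = completed required training on time, 0 = did not
y = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Predict for a new learner with 6 hours logged and 4 modules completed
print(model.predict([[6, 4]]))

The tree's branching mirrors the if/then splits described above; regression and neural-network models would be fitted with the same fit/predict pattern.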

What does a business need to know before using predictive analytics?


For businesses who want to incorporate predictive analytics into their learning
analytics strategy, the following steps should be considered:
● Establish a clear direction: Predictive analytics rely on specifically
programmed algorithms and machine learning to track and analyze data,
all of which depend on the unique questions being asked. For example,
wanting to know whether employees will complete a course is a specific
question; the software would need to analyze the relevant data in order
to formulate possible trends on completion rates. It is important that
businesses know what their needs are.

● Be actively involved: Predictive analytics require active input and
involvement from those utilizing the technique. This means deciding
and understanding what data is being collected and why. The quality of
data should also be monitored. Without human involvement, the data
collected and models used for analysis may provide no beneficial
meaning.
What are the benefits of using predictive analytics?
Here are a few key benefits that businesses can expect to find when incorporating
predictive analytics into their overall learning analytics strategy:
● Personalize the training needs of employees by identifying their gaps,
strengths, and weaknesses; specific learning resources and training can
be offered to support individual needs.
● Retain Talent by tracking and understanding employee career
progression and forecasting what skills and learning resources would
best benefit their career paths. Knowing what skills employees need
also benefits the design of future training.
● Support employees who may be falling behind or not reaching their
potential by offering intervention support before their performance puts
them at risk.
● Simplified reporting and visuals that keep everyone updated when
predictive forecasting is required.
Examples of how Predictive Analytics are being used in online learning
Many businesses are beginning to incorporate predictive analytics into their learning
analytics strategy by utilizing the predictive forecasting features offered in Learning
Management Systems and specialized software.
Here are a few examples:
1. Training targets: Some systems monitor and collect data on how
employees interact within the learning environment, such as tracking
how often courses or resources are accessed and whether they are
completed. Achievement level can also be analyzed, including
assessment performance, length of time to complete training, and
outstanding training requirements. An analysis of these aggregated data
patterns can reveal how employees may continue to perform in the
future. This makes it easier to identify employees who are not on track
to fulfilling ongoing training requirements.

2. Talent management: Predictive reporting can also forecast how
employees are developing in their role and within the company; this
involves tracking and forecasting on individual employee learning
paths, training, and upskilling activity. This is important for Human
Resources (HR) who may need to manage the talent pool for a large
number of employees or training departments wanting to know what
resources will be effective for individual skill development.

Predictive Modeling
Predictive modeling means developing models that can be used to forecast or predict
future events. In business analytics, models can be developed based on logic or data.
Logic-Driven Models
A logic-driven model is one based on experience, knowledge, and logical relationships
of variables and constants connected to the desired business performance outcome
situation.
The question here is how to put variables and constants together to create a model that
can predict the future. Doing this requires business experience. Model building requires
an understanding of business systems and the relationships of variables and constants
that seek to generate a desirable business performance outcome. To help conceptualize
the relationships inherent in a business system, diagramming methods can be helpful.
For example, the cause-and-effect diagram is a visual aid diagram that permits a user to
hypothesize relationships between potential causes of an outcome (see Figure). This
diagram lists potential causes in terms of human, technology, policy, and process
resources in an effort to establish some basic relationships that impact business
performance. The diagram is used by tracing contributing and relational factors from
the desired business performance goal back to possible causes, thus allowing the user to
better picture sources of potential causes that could affect the performance. This
diagram is sometimes referred to as a fishbone diagram because of its appearance.
Fig. Cause-and-effect diagram

Another useful diagram to conceptualize potential relationships with business
performance variables is called the influence diagram. According to Evans, influence
diagrams can be useful to conceptualize the relationships of variables in the
development of models. An example of an influence diagram is presented in Figure 6.2.
It maps the relationship of variables and a constant to the desired business performance
outcome of profit. From such a diagram, it is easy to convert the information into a
quantitative model with constants and variables that define profit in this situation:

Profit = Revenue − Cost,
or Profit = (Unit Price × Quantity Sold) − [Fixed Cost + (Variable Cost × Quantity Sold)],
or P = (UP × QS) − [FC + (VC × QS)]


Figure 6.2 An influence diagram
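As a small worked illustration of the profit model above (a sketch only; the input figures are invented), the same relationship can be expressed in a few lines of Python:

# Profit model from the influence diagram: P = (UP x QS) - [FC + (VC x QS)]
def profit(unit_price, quantity_sold, fixed_cost, variable_cost):
    revenue = unit_price * quantity_sold
    total_cost = fixed_cost + variable_cost * quantity_sold
    return revenue - total_cost

# e.g. 1,000 units at $25 each, $8,000 fixed cost, $12 variable cost per unit
print(profit(unit_price=25, quantity_sold=1000, fixed_cost=8000, variable_cost=12))
# 25,000 revenue - 20,000 total cost = 5,000 profit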

The relationships in this simple example are based on fundamental business
knowledge. Consider, however, how complex cost functions might become without
some idea of how they are mapped together. It is necessary to be knowledgeable about
the business systems being modeled in order to capture the relevant business behavior.
Cause-and-effect diagrams and influence diagrams provide tools to conceptualize
relationships, variables, and constants, but it often takes many other methodologies to
explore and develop predictive models.

6.2.2 Data-Driven Models

Logic-driven modeling is often used as a first step to establish
relationships through data-driven models (using data collected from many
sources to quantitatively establish model relationships). To avoid
duplication of content and focus on conceptual material in the chapters, we
have relegated most of the computational aspects and some computer
usage content to the appendixes. In addition, some of the methodologies
are illustrated in the case problems presented in this book. Please refer to
the Additional Information column in Table 6.1 to obtain further
information on the use and application of the data- driven models.
Table 6.1 Data-Driven Models

6.3 Data Mining


As mentioned in Chapter 3, data mining is a discovery-driven software
application process that provides insights into business data by finding
hidden patterns and relationships in big or small data and inferring rules
from them to predict future behavior. These observed patterns and rules
guide decision- making. This is not just numbers, but text and social media
information from the Web. For example, Abrahams et al. (2013) developed
a set of text-mining rules that automobile manufacturers could use to distill
or mine specific vehicle component issues that emerge on the Web but take
months to show up in complaints or other damaging media. These rules cut
through the mountainous data that exists on the Web and are reported to
provide marketing and competitive intelligence to manufacturers,
distributors, service centers, and suppliers. Identifying a product’s defects
and quickly recalling or correcting the problem before customers
experience a failure reduce customer dissatisfaction when problems occur.

6.3.1 A Simple Illustration of Data Mining

Suppose a grocery store has collected a big data file on what customers
put into their baskets at the market (the collection of grocery items a
customer purchases at one time). The grocery store would like to know if
there are any associated items in a typical market basket. (For example, if a
customer purchases product A, she will most often associate it or purchase
it with product
B.) If the customer generally purchases product A and B together, the store
might only need to advertise product A to gain both product A’s and B’s
sales. The value of knowing this association of products can improve the
performance of the store by reducing the need to spend money on
advertising both products. The benefit is real if the association holds true.

Finding the association and proving it to be valid require some
analysis. From the descriptive analytics analysis, some possible
associations may have been uncovered, such as product A’s and B’s
association. With any size data file, the normal procedure in data mining
would be to divide the file into two parts. One is referred to as a training
data set, and the other as a validation data set. The training data set
develops the association rules, and the validation data set tests and proves
that the rules work. Starting with the training data set, a common data
mining methodology is what-if analysis using logic-based
software. SAS has a what-if logic-based software application, and so do a
number of other software vendors (see Chapter 3). These software
applications allow logic expressions. (For example, if product A is present,
then is product B present?) The systems can also provide frequency and
probability information to show the strength of the association. These
software systems have differing capabilities, which permit users to
deterministically simulate different scenarios to identify complex
combinations of associations between product purchases in a market
basket.

Once a collection of possible associations is identified and their
probabilities are computed, the same logic associations (now considered
association rules) are rerun using the validation data set. A new set of
probabilities can be computed, and those can be statistically compared
using hypothesis testing methods to determine their similarity. Other
software systems compute correlations for testing purposes to judge the
strength and the direction of the relationship. In other words, if the
consumer buys product A first, it could be referred to as the Head and
product B as the Body of the association (Nisbet et al., 2009, p. 128). If the
same basic probabilities are statistically significant, it lends validity to the
association rules and their use for predicting market basket item purchases
based on groupings of products.
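A minimal sketch of the kind of "if product A, then product B" check described above is given below, assuming plain Python (not the logic-based SAS tooling named in the text); the baskets are invented for illustration.

# Check a candidate association rule "if A then B" on a set of baskets
baskets = [
    {"A", "B", "milk"},
    {"A", "B"},
    {"A", "eggs"},
    {"B", "bread"},
    {"A", "B", "bread"},
]

n = len(baskets)
count_a = sum(1 for b in baskets if "A" in b)
count_ab = sum(1 for b in baskets if {"A", "B"} <= b)

support = count_ab / n            # how often A and B appear together
confidence = count_ab / count_a   # how often B appears, given A is present

print(f"support(A,B) = {support:.2f}, confidence(A -> B) = {confidence:.2f}")

In practice the rule would be derived on the training data set and the same support and confidence figures recomputed on the validation data set, as the text describes.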

6.3.2 Data Mining Methodologies

Data mining is an ideal predictive analytics tool used in the BA
process. We mentioned in Chapter 3 different types of information that
data mining can glean, and Table 6.2 lists a small sampling of data mining
methodologies to acquire different types of information. Some of the same
tools used in the descriptive analytics step are used in the predictive step
but are employed to establish a model (either based on logical connections
or quantitative formulas) that may be useful in predicting the future.
Table 6.2 Types of Information and Data Mining Methodologies

Several computer-based methodologies listed in Table 6.2 are briefly
introduced here. Neural networks are used to find associations where
connections between words or numbers can be determined. Specifically,
neural networks can take large volumes of data and potential variables and
explore variable associations to express a beginning variable (referred to as
an input layer), through middle layers of interacting variables, and finally
to an ending variable (referred to as an output). More than just identifying
simple one-on-one associations, neural networks link multiple association
pathways through big data like a collection of nodes in a network. These
nodal relationships constitute a form of classifying groupings of variables
as related to one another, but even more, related in complex paths with
multiple associations (Nisbet et al., 2009, pp. 128–138). Different software
packages have a variety of association network function capabilities. SAS offers a
series of search engines that can identify associations. SPSS has two
versions of neural network software functions: Multilayer Perceptron
(MLP) and Radial Basis Function (RBF). Both procedures produce a
predictive model for one or more dependent
variables based on the values of the predictive variables. Both allow a
decision maker to develop, train, and use the software to identify particular
traits (such as bad loan risks for a bank) based on characteristics from data
collected on past customers.
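The SPSS procedures above are not reproduced here, but the same develop-train-predict pattern can be sketched with a multilayer perceptron in Python's scikit-learn (an assumed tool); the loan-risk records below are invented.

from sklearn.neural_network import MLPClassifier

# Hypothetical past customers: [income score, number of past defaults]
X = [[2, 3], [9, 0], [3, 2], [8, 1], [1, 4], [7, 0]]
# 1 = bad loan risk, 0 = good loan risk
y = [1, 0, 1, 0, 1, 0]

net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
net.fit(X, y)

# Classify a new applicant
print(net.predict([[5, 1]]))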

Discriminant analysis is similar to a multiple regression model except
that it permits continuous independent variables and a categorical
dependent variable. The analysis generates a regression function whereby
values of the independent variables can be incorporated to generate a
predicted value for the dependent variable. Similarly, logistic regression is
like multiple regression. Like discriminant analysis, its dependent variable
can be categorical. The independent variables in logistic regression can be
either continuous or categorical. For example, in predicting potential
outsource providers, a firm might use a logistic regression, in which the
dependent variable would be to classify an outsource provider as either
rejected (represented by the value of the dependent variable being zero) or
acceptable (represented by the value of one for the dependent variable).
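A minimal sketch of this outsource-provider example as a logistic regression follows, assuming scikit-learn; the provider scores and labels are invented for illustration.

from sklearn.linear_model import LogisticRegression

# Hypothetical providers: [cost score, quality score]
X = [[3, 2], [8, 7], [4, 3], [9, 8], [2, 4], [7, 9]]
# 0 = rejected, 1 = acceptable
y = [0, 1, 0, 1, 0, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[6, 6]]))        # predicted class for a new provider
print(clf.predict_proba([[6, 6]]))  # probabilities of rejection vs. acceptance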

Hierarchical clustering is a methodology that establishes a hierarchy of
clusters that can be grouped by the hierarchy. Two strategies are suggested
for this methodology: agglomerative and divisive. The agglomerative
strategy is a bottom-up approach, in which one starts with each item in the
data and begins to group them. The divisive strategy is a top-down
approach, in which one starts with all the items in one group and divides
the group into clusters. How the clustering takes place can involve many
different types of algorithms and differing software applications. One
method commonly used is to employ a Euclidean distance formula that
looks at the square root of the sum of distances between two variables,
their differences squared. Basically, the formula seeks to match up variable
candidates that have the least squared error differences. (In other words,
they’re closer together.)
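A hedged sketch of the agglomerative (bottom-up) strategy using SciPy is shown below (an assumed library; the two-dimensional points are invented). The Euclidean distance underlying it is d(x, y) = sqrt(sum of (xi − yi)²).

from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D observations
points = [[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.8, 8.3], [4.0, 4.2]]

# Ward linkage merges, at each step, the pair of clusters whose merge adds
# the least squared Euclidean distance.
tree = linkage(points, method="ward")

# Cut the hierarchy into two clusters and show the group of each point
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)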

K-mean clustering is a classification methodology that permits a set of
data to be reclassified into K groups, where K can be set as the number of
groups desired. The algorithmic process identifies initial candidates for the
K groups and then iteratively searches other candidates in the data set to
be averaged into a mean value that represents a particular K group. The
process of selection
is based on maximizing the distance from the initial K candidates selected
in the initial run through the list. Each run or iteration through the data set
allows the software to select further candidates for each group.

The K-mean clustering process provides a quick way to classify data
into differentiated groups. To illustrate this process, use the sales data in
Figure 6.3 and assume these are sales from individual customers. Suppose
a company wants to classify the sales customers into high and low sales
groups.

Figure 6.3 Sales data for cluster classification problem

The SAS K-Mean cluster software can be found in Proc Cluster. Any
integer value can designate the K number of clusters desired. In this
problem set, K=2. The SAS printout of this classification process is shown
in Table 6.3.
The Initial Cluster Centers table lists the initial high (20167) and low
(12369) values from the data set as the clustering process begins. As it turns
out, the software divided the customers into 9 high sales customers and 11
low sales customers.

Table 6.3 SAS K-Mean Cluster Solution

Consider how large big data sets can be. Then realize this kind of
classification capability can be a useful tool for identifying and predicting
sales based on the mean values.
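The SAS Proc Cluster run itself is not reproduced here, but an analogous K = 2 run can be sketched in Python with scikit-learn (an assumption); the sales values below are illustrative stand-ins, not the Figure 6.3 data.

from sklearn.cluster import KMeans

# Illustrative customer sales values (one feature per customer)
sales = [[20167], [19850], [12369], [13100], [19500], [12800], [12950], [19990]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sales)
print(km.labels_)           # cluster (high or low sales) assigned to each customer
print(km.cluster_centers_)  # mean sales value of each cluster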

There are so many BA methodologies that no single section, chapter, or
even book can explain or contain them all. The analytic treatment and
computer usage in this chapter have been focused mainly on conceptual
use. For a more applied use of some of these methodologies, note the case
study that follows and some of the content in the appendixes.

6.4 Continuation of Marketing/Planning Case Study


Example: Predictive Analytics Step in the BA Process

In the last sections, an ongoing marketing/planning case study of the
relevant BA step discussed in those chapters is presented to illustrate some
of the tools and strategies used in a BA problem analysis. This is the second
installment of the case study dealing with the predictive analytics analysis
step in BA.

6.4.1 Case Study Background Review

The case study firm had collected a random sample of monthly sales
information presented in Figure 6.4 listed in thousands of dollars. What the
firm wants to know is, given a fixed budget of $350,000 for promoting this
service product, when it is offered again, how best should the company
allocate budget dollars in hopes of maximizing the future estimated
month’s product sales? Before the firm makes any allocation of budget,
there is a need to understand how to estimate future product sales. This
requires understanding the behavior of product sales relative to sales
promotion efforts using radio, paper, TV, and point-of-sale (POS) ads.
Figure 6.4 Data for marketing/planning case study

The previous descriptive analytics analysis in Chapter 5 revealed a
potentially strong relationship between radio and TV commercials that
might be useful in predicting future product sales. The analysis also
revealed little regarding the relationship of newspaper and POS ads to
product sales. So although radio and TV commercials are most promising,
a more in-depth predictive analytics analysis is called for to accurately
measure and document the degree of relationship that may exist in the
variables to determine the best predictors of product sales.

6.4.2 Predictive Analytics Analysis

An ideal multiple variable modeling approach that can be used in this
situation to explore variable importance in this case study and eventually
lead to the development of a predictive model for product sales is
correlation and multiple regression. We will use SAS’s statistical package
to compute the statistics in this step of the BA process.

First, we must consider the four independent variables—radio, TV,
newspaper, POS—before developing the model. One way to see the
statistical direction of the relationship (which is better than just comparing
graphic charts) is to compute the Pearson correlation coefficients r
between each of the independent variables with the dependent variable
(product sales). The SAS correlation coefficients and their levels of
significance are presented in Table
6.4. The larger the Pearson correlation (regardless of the sign) and the
smaller the Significance test values (these are t-tests measuring the
significance of the Pearson r value; see Appendix A), the more significant
the relationship. Both radio and TV are statistically significant
correlations, whereas at a 0.05 level of significance, paper and POS are not
statistically significant.
Table 6.4 SAS Pearson Correlation Coefficients: Marketing/Planning Case
Study
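The SAS output is not reproduced here, but the same screening step can be sketched with pandas (an assumed tool); the advertising and sales figures below are invented, not the case-study sample.

import pandas as pd

df = pd.DataFrame({
    "radio": [10, 15, 20, 25, 30, 35],
    "tv":    [12, 14, 19, 24, 28, 33],
    "pos":   [5, 9, 4, 8, 6, 7],
    "sales": [110, 150, 190, 240, 290, 340],
})

# Pearson r of each candidate predictor with product sales
print(df.corr(method="pearson")["sales"])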

Although it can be argued that the positive or negative correlation
coefficients should not automatically discount any variable from what will
be a predictive model, the negative correlation of newspapers suggests that
as a firm increases investment in newspaper ads, it will decrease product
sales. This does not make sense in this case study. Given the illogic of such
a relationship, its potential use as an independent variable in a model is
questionable. Also, this negative correlation poses several questions that
should be considered. Was the data set correctly collected? Is the data set
accurate? Was the sample large enough to have included enough data for
this variable to show a positive relationship? Should it be included for
further analysis? Although it is possible that a negative relationship can
statistically show up like this, it does not make sense in this case. Based on
this reasoning and the fact that the correlation is not statistically
significant, this variable (newspaper ads) will be removed from further
consideration in this exploratory analysis to develop a predictive model.
Some researchers might also exclude POS based on the insignificance
(p=0.479) of its relationship with product sales. However, for purposes of
illustration, continue to consider it a candidate for model inclusion. Also,
the other two independent variables (radio and TV) were found to be
significantly related to product sales, as reflected in the correlation
coefficients in the tables.

At this point, there is a dependent variable (product sales) and three
candidate independent variables (POS, TV, and Radio) with which to
establish a predictive model that can show the relationship between
product sales and those independent variables. Just as a line chart was
employed to reveal the behavior of product sales and the other variables in
the descriptive analytic step, a statistical method can establish a linear
model that combines the three predictive variables. We will use multiple
regression, which can incorporate any of the multiple independent
variables, to establish a relational model for product sales in this case
study. Multiple regression also can be used to continue our exploration of
the candidacy of the three independent variables.

The procedure by which multiple regression can be used to evaluate
which independent variables are best to include or exclude in a linear
model is called step-wise multiple regression. It is based on an evaluation
of regression models and their validation statistics—specifically, the
multiple correlation coefficients and the F-ratio from an ANOVA. SAS
software and many other statistical systems build in the step-wise process.
Some are called backward selection or backward step-wise regression, and some are
called forward selection or forward step-wise regression. The backward step-wise
regression starts with all the independent variables placed in the model,
and the step-wise process removes them one at a time based on worst
predictors first until a statistically significant model emerges. The forward
step-wise regression starts with the best related variable (using correlation
analysis as a guide), and then step-wise adds other variables until adding
more will no longer improve the accuracy of the model. The forward
step-wise regression process will be illustrated here manually. The first
step is to generate individual regression models and statistics for each
independent variable with the dependent variable one at a time. These
three SAS models are presented in Tables 6.5, 6.6, and 6.7 for the POS,
radio, and TV variables, respectively.
Table 6.5 SAS POS Regression Model: Marketing/Planning Case Study
Table 6.6 SAS Radio Regression Model: Marketing/Planning Case Study
Table 6.7 SAS TV Regression Model: Marketing/Planning Case Study
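A hedged sketch of the manual forward step-wise comparison is shown below using scikit-learn and NumPy (assumed tools); the generated data only mimic the situation and are not the case-study sample.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
radio = rng.uniform(10, 50, 30)
tv = rng.uniform(10, 50, 30)
pos = rng.uniform(1, 10, 30)
# Sales driven mainly by radio and TV, plus noise; POS is unrelated
sales = 275 * radio + 48 * tv + rng.normal(0, 200, 30)

def r_square(features):
    X = np.column_stack(features)
    return LinearRegression().fit(X, sales).score(X, sales)

print("radio only      :", round(r_square([radio]), 3))
print("radio + tv      :", round(r_square([radio, tv]), 3))
print("radio + tv + pos:", round(r_square([radio, tv, pos]), 3))

Variables are added one at a time, stopping when an additional variable no longer improves the fit, which mirrors the forward step-wise logic described above.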

The computer printouts in the tables provide a variety of statistics for
comparative purposes. Discussion will be limited here to just a few. The
R-Square statistics are a precise proportional measure of the variation that is
explained by the independent variable’s behavior with the dependent
variable. The closer the R-Square is to 1.00, the more of the variation is
explained, and the better the predictive variable. The three variables’
R-Squares are 0.0002 (POS), 0.9548 (radio), and 0.9177 (TV). Clearly,
radio is the best predictor variable of the three, followed by TV and,
with almost no relationship, POS. This latter result was expected based
on the prior Pearson correlation. Note that only 8.23 percent
(1.000 − 0.9177) of the variation in product sales is left unexplained by
TV commercials.

From ANOVA, the F-ratio statistic is useful in actually comparing the
regression model’s capability to predict the dependent variable. As
R-Square increases, so does the F-ratio because of the way in which they
are computed
and what is measured by both. The larger the F-ratio (like the R-Square
statistic), the greater the statistical significance in explaining the variable’s
relationships. The three variables’ F-ratios from the ANOVA tables are
0.00 (POS), 380.22 (radio), and 200.73 (TV). Both radio and TV are
statistically significant, but POS has an insignificant relationship. To give
some idea of how significant the relationships are, assuming a level of
significance where α=0.01, one would only need a cut-off value for the
F-ratio of 8.10 to designate it as being significant. Not exceeding that
F-ratio (as in the case of POS at 0.00) is the same as saying that the
coefficient in the regression model for POS is no different from a value of
zero (no contribution to Product Sales). Clearly, the independent variables
radio and TV appear to have strong relationships with the dependent
variable. The question is whether the two combined or even three variables
might provide a more accurate forecasting model than just using the one
best variable like radio.

Continuing with the step-wise multiple regression procedure, we next
determine the possible combinations of variables to see if a particular
combination is better than the single variable models computed previously.
To measure this, we have to determine the possible combinations for the
variables and compute their regression models. The combinations are (1)
POS and radio;
(2) POS and TV; (3) POS, radio, and TV; and (4) radio and TV.

The resulting regression model statistics are summarized and presented
in Table 6.8. If one is to base the selection decision solely on the R-Square
statistic, there is a tie between the POS/radio/TV and the radio/TV
combination (0.979 R-Square values). If the decision is based solely on the
F-ratio value from ANOVA, one would select just the radio/TV
combination, which one might expect of the two most significantly
correlated variables.
Table 6.8 SAS Variable Combinations and Regression Model Statistics:
Marketing/Planning Case Study

To aid in supporting a final decision and to ensure these analytics are
the best possible estimates, we can consider an additional statistic. That tie
breaker is the R-Squared (Adjusted) statistic, which is commonly used in
multiple regression models.

The R-Square Adjusted statistic does not have the same interpretation
as R-Square (a precise, proportional measure of variation in the
relationship). It is instead a comparative measure of suitability of
alternative independent variables. It is ideal for selection between
independent variables in a multiple regression model. The R-Square
adjusted seeks to take into account the phenomenon of the R-Square
automatically increasing when additional independent variables are added
to the model. This phenomenon is like a painter putting paint on a canvas,
where more paint additively increases the value of the painting. Yet by
continually adding paint, there comes a point at which some paint covers
other paint, diminishing the value of the original. Similarly, statistically
adding more variables should increase the ability of the model to capture
what it seeks to model. On the other hand, putting in too many variables,
some of which may be poor predictors, might bring down the total
predictive ability of the model. The R-Square adjusted statistic provides
some information to aid in revealing this behavior.

The value of the R-Square adjusted statistic can be negative, but it will
always be less than or equal to that of the R-Square in which it is related.
Unlike R-Square, the R-Square adjusted increases when a new independent
variable is included only if the new variable improves the R-Square more
than would be expected in the absence of any independent value being
added. If a set of independent variables is introduced into a regression
model one at a time in forward step-wise regression using the highest
correlations ordered first, the R-Square adjusted statistic will end up being
equal to or less than the R-Square value of the original model. By
systematic experimentation with the R-Square adjusted recomputed for
each added variable or combination, the value of the R-Square adjusted
will reach a maximum and then decrease. The multiple regression model
with the largest R-Square adjusted statistic will be the most
accurate combination of having the best fit without excessive or
unnecessary independent variables. Again, just putting all the variables
into a model may add unneeded variability, which can decrease its
accuracy. Thinning out the variables is important.
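The chapter describes R-Square adjusted only verbally; the standard formula is R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. A minimal Python sketch follows (the sample size of 20 is assumed purely for illustration):

def adjusted_r_square(r_square, n_obs, n_predictors):
    # Penalizes R-Square for each additional independent variable
    return 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_predictors - 1)

# e.g. the 0.979 R-Square reported above with 2 predictors (radio, TV)
print(round(adjusted_r_square(0.979, n_obs=20, n_predictors=2), 4))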

Finally, in the step-wise multiple regression procedure, a final decision
on the variables to be included in the model is needed. Basing the decision
on the R-Square adjusted, the best combination is radio/TV. The SAS
multiple regression model and support statistics are presented in Table 6.9.

Table 6.9 SAS Best Variable Combination Regression Model and Statistics:
Marketing/Planning Case Study

Although there are many other additional analyses that could be
performed to validate this model, we will use the SAS multiple regression
model in Table 6.9 for the firm in this case study. The forecasting model can be expressed
as follows:

Yp = −17150 + 275.69065 X1 + 48.34057 X2

where:

Yp = the estimated number of dollars of product sales

X1 = the number of dollars to invest in radio commercials

X2 = the number of dollars to invest in TV commercials

Because all the data used in the model is expressed as dollars, the
interpretation of the model is made easier than using more complex data.
The interpretation of the multiple regression model suggests that for every
dollar allocated to radio commercials (represented by X1), the firm will
receive
$275.69 in product sales (represented by Yp in the model). Likewise, for every
dollar allocated to TV commercials (represented by X2), the firm will receive
$48.34 in product sales.
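As a minimal sketch of using the fitted model, the equation above can be evaluated for any candidate budget split (the split below is invented for illustration, not a recommendation):

def predicted_sales(radio_dollars, tv_dollars):
    # Yp = -17150 + 275.69065 * X1 + 48.34057 * X2
    return -17150 + 275.69065 * radio_dollars + 48.34057 * tv_dollars

# e.g. an illustrative split of the $350,000 promotion budget
print(round(predicted_sales(200_000, 150_000), 2))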

A caution should be mentioned on the results of this case study. Many
factors might challenge a result, particularly those derived from using
powerful and complex methodologies like multiple regression. As such,
the results may not occur as estimated, because the model only reflects
past performance, which may not repeat. What is being suggested here is that more analysis can
always be performed in questionable situations. Also, additional analysis
to confirm a result should be undertaken to strengthen the trust that others
must have in the results to achieve the predicted higher levels of business
performance.

In summary, for this case study, the predictive analytics analysis has
revealed a more detailed, quantifiable relationship between the generation
of product sales and the sources of promotion that best predict sales. The
best way
to allocate the $350,000 budget to maximize product sales might involve
placing the entire budget into radio commercials because they give the best
return per dollar of budget. Unfortunately, there are constraints and
limitations regarding what can be allocated to the different types of
promotional methods. Optimizing the allocation of a resource and
maximizing business performance necessitate the use of special business
analytic methods designed to accomplish this task. This requires the
additional step of prescriptive analytics analysis in the BA process, which
will be presented in the last section of Chapter 7.

Summary

This chapter dealt with the predictive analytics step in the BA process.
Specifically, it discussed logic-driven models based on experience and
aided by methodologies like the cause-and-effect and the influence
diagrams. This chapter also defined data-driven models useful in the
predictive step of the BA analysis. A further discussion of data mining was
presented. Data mining methodologies such as neural networks, discriminant
analysis, logistic regression, and hierarchical clustering were described. An
illustration of K-mean clustering using SAS was presented. Finally, this
chapter discussed the second installment of a case study illustrating the
predictive analytics step of the BA process. The remaining installment of
the case study will be presented in Chapter 7.

Once again, several of this book’s appendixes are designed to augment
the chapter material by including technical, mathematical, and statistical
tools. For both a greater understanding of the methodologies discussed in
this chapter and a basic review of statistical and other quantitative
methods, a review of the appendixes is recommended.

As previously stated, the goal of using predictive analytics is to
generate a forecast or path for future improved business performance.
Given this predicted path, the question now is how to exploit it as fully as
possible. The purpose of the prescriptive analytics step in the BA process
is to serve as a guide to fully maximize the outcome in using the
information provided by the predictive
analytics step. The subject of Chapter 7 is the prescriptive analytics step in
the BA process.

Discussion Questions

1. Why is predictive analytics analysis the next logical step in any
business analytics (BA) process?

2. Why would one use logic-driven models to aid in developing data-driven
models?

3. How are neural networks helpful in determining both associations
and classification tasks required in some BA analyses?

4. Why is establishing clusters important in BA?

5. Why is establishing associations important in BA?

6. How can F-tests from the ANOVA be useful in BA?

Problems

1. Using a similar equation to the one developed in this chapter for
predicting dollar product sales (note below), what is the forecast for dollar
product sales if the firm could invest $70,000 in radio commercials and
$250,000 in TV commercials?

Yp = –17150.455 + 275.691 X1 + 48.341 X2

where:

Yp = the estimated number of dollars of product sales


X1 = the number of dollars to invest in radio commercials

X2 = the number of dollars to invest in TV commercials

2. Using the same formula as in Question 1, but now using an investment of
$100,000 in radio commercials and $300,000 in TV commercials, what is the
prediction on dollar product sales?

3. Assume for this problem the following table would have held true
for the resulting marketing/planning case study problem. Which
combination of variables is estimated here to be the best predictor set?
Explain why.

4. Assume for this problem that the following table would have held
true for the resulting marketing/planning case study problem. Which of the
variables is estimated here to be the best predictor? Explain why.
