AUDIT COURSE ON

Business Intelligence
AUDIT COURSE (COMPUTER ENGINEERING)
PRESENTED BY: -

SAMIKSHA RAHUL BANKAR

UNDER GUIDANCE OF

Prof. N.R. Talhar

DEPARTMENT OF COMPUTER ENGINEERING


AISSMS COLLEGE OF ENGINEERING
KENNEDY ROAD, PUNE 411001
UNIVERSITY OF PUNE
2020-2021


Certificate
This is to certify that the audit course entitled

Business Intelligence

Submitted by:

SAMIKSHA RAHUL BANKAR (ROLL NO. 17CO006)

is a bona fide work carried out by her under the supervision of Prof. N.R. Talhar and is approved in partial fulfillment of the requirements of Savitribai Phule Pune University for the award of the degree of SE (Computer Engineering) in the academic year 2020-2021.

Prof. N.R. Talhar                                   Dr. D.P. Gaikwad
Coordinator                                         Head, Department of Computer Engineering

Place : Pune

Date :
ACKNOWLEDGEMENT

I would like to extend my sincere gratitude and thanks to my guide, Prof. N.R. Talhar, for his valuable guidance and for giving me useful inputs and encouragement time and again, which inspired me to work harder. Due to his forethought, appreciation of the work involved and continuous imparting of useful tips, this report has been successfully completed.

I am extremely grateful to Dr. D.P. Gaikwad, Head of the Department of Computer Engineering, for his encouragement during the course of this work.

I also extend my heartfelt gratitude to the staff of the Computer Engineering Department for their cooperation and support.

I also take this opportunity to thank all my classmates, friends and all those who have directly or indirectly provided their overwhelming support during the audit course work and the development of this report.

Samiksha Rahul Bankar

CONTENTS

• Chapter 1: Concepts with Mathematical Treatment

• Chapter 2: Decision Making Concepts

• Chapter 3: Data Warehouse

• Chapter 4: Data Pre-processing and Outliers

• Chapter 5: Designing and Managing BI Systems

Chapter 1
Concepts with Mathematical Treatment

• Introduction to BI:
The purpose of Business Intelligence (BI) is to provide decision makers with the information necessary to make informed decisions. The information is delivered by the Business Intelligence system via reports. This report focuses on the architecture and infrastructure needed to deliver the information. Architecture is a set of rules or structures providing a framework for the overall design of a system or product (Poe et al. 1998). The BI system includes the following parts:
• Interested parties and their respective information needs
• Input of data
• Storage of data
• Analysis of data
• Automatic and selective dissemination of information
A BI system includes the rules (architecture) providing a framework for the organization of the technologies, platforms, databases, gateways, people and processes. To implement an architecture, the Business Intelligence architect must implement an infrastructure. Technical infrastructures are the technologies, platforms, databases, gateways, people and processes necessary to make the architecture functional within the corporation (Poe et al. 1998).
In sum, decision makers need reports that deliver the information that allows them to understand their organization and the world in order to make better decisions.
Given these functions of the BI system, the most enduring definition focuses on the system, tools, technology, processes, and techniques that compose these elements and help decision makers understand their world. BI augments the ability of decision makers to turn data into information by aiding in extracting data from data sources, organizing the data based on established business knowledge, and then presenting the information in a way that is useful to the decision maker. It merges technology with knowledge in order to provide useful information to management as quickly as possible.
In sum, a Business Intelligence system includes the rules (architecture) that outline how to organize the parts of the system (infrastructure) to deliver the information needed to thrive in a competitive market (business) or to provide the best service to the people (government) (Poe et al. 1998). Regardless of the technology, which is simply a set of tools, the core of a business intelligence system has not changed and will not change.
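The flow described above (input of data, storage of data, analysis, and dissemination of information via reports) can be pictured as a small pipeline. The following is a minimal, illustrative Python sketch of that flow; the data source, table names and report format are assumptions made for the example and are not part of the original text.

import sqlite3
import pandas as pd

def load_sales(conn):
    """Input and storage: pull raw transactional data into the analysis layer."""
    return pd.read_sql_query("SELECT region, product, amount FROM sales", conn)

def analyze(sales: pd.DataFrame) -> pd.DataFrame:
    """Analysis: organize the data using business knowledge (here, revenue by region)."""
    return sales.groupby("region", as_index=False)["amount"].sum()

def disseminate(report: pd.DataFrame) -> str:
    """Dissemination: present the information as a simple report for decision makers."""
    return report.to_string(index=False)

# Hypothetical in-memory source standing in for an operational system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("North", "A", 120.0), ("North", "B", 80.0), ("South", "A", 60.0)])
print(disseminate(analyze(load_sales(conn))))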

• Determining the BI Cycle:

Business intelligence is not just a set of tools to analyze raw data to help make strategic and operational decisions. It is a framework that offers guidance in understanding what to look for in the volumes of disparate data. As a framework, BI is a continuous cycle of analysis, insight, action and measurement.
Analysing a business is based on what we know and feel to be important, while filtering out the aspects of the business not considered mission critical or detrimental to the growth of the organization. Deciding what is important is based on our understanding and assumptions of what is important to customers, suppliers, competitors and employees. All of this knowledge is unique to a business and is an incredible resource when formulating a BI strategy.
However, having such granular grassroots knowledge of the business can subconsciously limit the ability to see patterns obvious to others. One of the benefits of BI reporting is performing ad hoc reporting by drilling down layers of data and pivoting on the rows and columns. Such flexibility opens up humans' inquisitive nature to ask questions that would not necessarily be asked if such access to data were not available. Effective analysis helps us understand the business better, so as to challenge conventional wisdom and assumptions as well as what is considered to be the right analysis.
Insight comes in many forms. There are operational insights, such as determining the effect on production costs of installing new, more energy-efficient machines that have slightly lower production yields per unit of measure. There are strategic insights, such as analysing new market opportunities by conducting research on the barriers to entry. Insight is the intangible product of analysis, developed from asking questions that only humans can ask. Computers can be used for the identification of patterns, but only humans can recognize which patterns are useful.
The issue with having insight is convincing others to believe or support the new perspective in order for the insight to be useful. As in life, anything new or different is slow to gain acceptance or credibility. Well-organized business intelligence that supports the insight with clear data, patterns, logic, presentation (i.e., graphs, reports) and calculations is what helps sell the new insight.
Once the analysis is done and the insight has been sold, the next step in the BI cycle is performing the action, or decision making. Well-thought-out decisions backed up by good analysis and insight give confidence and courage to the proposed action. Otherwise, decisions not supported by quality analytics are made with overbearing safety measures or with less dedication and commitment from the stakeholders. In addition, quality business intelligence delivered quickly improves the speed to action. Today's organizations need to react more quickly, develop new approaches faster, conduct more agile R&D and get products and services to market faster than ever before. BI-based decision making, with faster access to information and feedback, provides more opportunity for quicker prototyping and testing.

• BI Environment and Architecture:


Business intelligence architecture refers to the infrastructure that organizations
use to define their data collection streams, information administration, and all
the technology that supports their business intelligence.

Creating a sustainable architecture depends on understanding the different
components that are involved with developing successful business intelligence
tools. The process is broadly divided into three areas: data collection,
information management, and business intelligence.
The first area refers to the different channels and methods of collecting data
from activities carried out within your organization. This includes
understanding which data different users need to meet their requirements, as
well as a clear idea of the quality, type, and currency of the data. This step is
vital for adding value as the right data produces the best BI insights. The second
major component is data management. This covers various aspects of
integrating data, scrubbing datasets, and fashioning the overall structures that
will house and administer data.
Finally, business intelligence is the part of an organization's architecture that analyses properly organized data sets to produce insights. This area involves using real-time analytics, data visualizations, and other BI tools.

• Benefits of BI:
• Faster reporting, analysis or planning
• More accurate reporting, analysis or planning
• Better business decisions
• Improved data quality
• Improved employee satisfaction
• Improved operational efficiency

• Improved customer satisfaction
• Increased competitive advantage
• Reduced costs
• Increased revenues
• Saved headcount

• Factors Responsible for a Successful BI Project:
• Don't boil the ocean.
• Focus on ease of use.
• Performance is key.
• Pick your technology carefully.
• Understand where the cost is.
• Nominate advocates.
• Reconcile, reconcile, reconcile.
• Ensure executive involvement.

• Obstacles to Business Intelligence in an Organization:
• Lack of BI Strategy
• Business Intelligence When You Don't Know How to Code
• Lack of Training & Execution
• Lack of BI Impact (Low Utilization)
• Business Intelligence with Unstructured Data
• Installation and Deployment

Chapter 2
Decision Making Concepts
• Concept of Decision Making:
Decision-making is the act of making a choice among available alternatives. There are innumerable decisions that are taken by human beings in day-to-day life. In business undertakings, decisions are taken at every step. It is also regarded as one of the important functions of management. Managerial functions like planning, organizing, staffing, directing, coordinating and controlling are carried out through decisions.
Decision making is possible only when there are two or more alternatives to solve a single problem or difficulty. If there is only one alternative, then there is no question of decision making. It is said that management without decision making is like a man without a backbone. Therefore, decision making is a problem-solving approach that chooses a specific course of action from among various alternatives. In conclusion, we can say that decision making is the process of choosing a specific course of action from various alternatives to solve organizational problems or difficulties.

• Importance of Decision Making:

Decision making is considered the backbone of business management because without taking the right decision at the right time, nothing can be accomplished. The importance of decision making can be discussed under the following points:

• Proper utilization of resources

• Selecting the best alternative

• Evaluation of managerial performance

• Employees' motivation

• Indispensable element/component

• Achievement of goals/objectives

• Pervasive function

• Steps of the Decision-Making Process:

For the rationality, reliability, and enforceability of decisions, managers should follow a sequential set of steps. A decision is said to be rational if appropriate means are chosen to reach desired ends. In this regard, various management authorities have recognized and described different steps in the process of decision-making. Ricky W. Griffin, among others, has suggested a sequence of steps in the process of decision making. The main steps include the following (a small scoring sketch after the list illustrates how alternatives can be evaluated):

• Identification of the problem

• Analysis of the problem

• Developing alternative courses of action

• Evaluating the alternative courses of action
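As a minimal illustration of evaluating alternative courses of action, the sketch below scores hypothetical alternatives against weighted criteria. The alternatives, criteria and weights are illustrative assumptions, not a method prescribed by any particular author.

# Weighted-scoring sketch for comparing alternatives (illustrative values only).
criteria_weights = {"cost": 0.4, "time_to_implement": 0.3, "expected_benefit": 0.3}

# Each alternative is rated 1-10 per criterion (10 = best).
alternatives = {
    "Upgrade existing system": {"cost": 7, "time_to_implement": 8, "expected_benefit": 5},
    "Buy new BI tool":         {"cost": 4, "time_to_implement": 6, "expected_benefit": 9},
    "Build in-house":          {"cost": 3, "time_to_implement": 3, "expected_benefit": 8},
}

def weighted_score(ratings, weights):
    """Combine criterion ratings into a single weighted score."""
    return sum(ratings[c] * w for c, w in weights.items())

scores = {name: weighted_score(r, criteria_weights) for name, r in alternatives.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
print("Selected alternative:", max(scores, key=scores.get))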

• Development of a Decision Support System (DSS):


Similar to other systems, DSS development requires a structured approach. Such a framework includes people, technology, and the development approach. The early framework of decision support consists of four phases:

• Intelligence: searching for conditions that call for a decision
• Design: developing and analysing possible alternative courses of action
• Choice: selecting a course of action from among those developed
• Implementation: adopting the selected course of action

DSS technology levels (hardware and software) may include:

• Specific DSS: the actual application that will be used by the end user. This is the part of the application that allows the decision maker to make decisions in a particular problem area and act upon that problem.

• DSS generator: a hardware/software environment that allows people to easily develop specific DSS applications. This level makes use of CASE tools or systems such as Crystal.

• DSS tools: lower-level hardware/software, including special languages, function libraries and linking modules.

An iterative development approach allows the DSS to be changed and redesigned at various intervals. Once the system is designed, it will need to be tested and revised where necessary to achieve the desired outcome.

• Applications of DSS:

• Medical diagnosis

• Business and Management

• Agricultural Production

Chapter 3
Data Warehouse

• Introduction: Data Warehouse Modeling


Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
Data warehousing is a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components that allows the strategic use of data. A data warehouse is electronic storage of a large amount of information by a business.
A data warehouse is a relational database that aggregates structured data from across an entire organization. It pulls together data from multiple sources, much of it typically online transaction processing (OLTP) data. The data warehouse selects, organizes and aggregates data for efficient comparison and analysis. Data warehousing may be defined as a collection of corporate information and data derived from operational systems and external data sources. A data warehouse is designed with the purpose of supporting business decisions by allowing data consolidation, analysis, and reporting at different aggregate levels. Data is populated into the DW by extraction, transformation, and loading.
Data warehousing incorporates data stores and conceptual, logical, and physical models to support business goals and end-user information needs. Creating a DW requires mapping data between sources and targets, then capturing the details of the transformation in a metadata repository. The data warehouse provides a single, comprehensive source of current and historical information.

• How It Works:
Data warehousing combines information collected from multiple sources into one comprehensive database. To cite an example from the business world, a data warehouse might incorporate customer information from a company's point-of-sale systems (the cash registers), its website, its mailing lists, and its comment cards. Data warehousing may also cover confidential information such as employee details and salary information.
Companies use this information to analyze their customers. Data warehousing is also related to data mining, which means looking for meaningful patterns in huge data volumes and devising new strategies for higher sales and profits.

• Introduction to Data Warehouse Types:

The three main types of data warehouses are:

1. Enterprise Data Warehouse:
A centralized warehouse which provides decision support services across the enterprise. It offers a unified approach to organizing and representing data. It also provides the ability to classify data according to subject and gives access according to those divisions.

2. Operational Data Store:
An Operational Data Store, also called an ODS, is a data store that is required when neither the data warehouse nor the OLTP systems support an organization's reporting needs. An ODS is refreshed in real time. Hence, it is widely preferred for routine activities such as storing records of employees.

3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for specific business segments such as sales or finance. In an independent data mart, data can be collected directly from the sources.
Data warehouses and data marts are mostly built on dimensional data modeling, where fact tables relate to dimension tables. This is useful for users to access data, since a database can be visualized as a cube of several dimensions. A data warehouse allows a user to slice the cube along each of its dimensions.
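To make the fact/dimension idea concrete, the sketch below builds a tiny, hypothetical star schema in pandas (a fact table of sales keyed to a product dimension) and slices the resulting "cube" along its dimensions. The table and column names are illustrative assumptions.

import pandas as pd

# Dimension table: descriptive attributes of each product.
dim_product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Laptop", "Phone", "Tablet"],
    "category": ["Computing", "Mobile", "Mobile"],
})

# Fact table: one row per sale, holding keys and measures.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 2, 3, 1],
    "region": ["North", "North", "South", "South", "South"],
    "amount": [1200.0, 650.0, 700.0, 400.0, 1100.0],
})

# Join the fact table to its dimension, then view the cube by category and region.
cube = fact_sales.merge(dim_product, on="product_id")
print(cube.pivot_table(values="amount", index="category", columns="region", aggfunc="sum"))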

• Data Warehouse Design:

The design approach is a very important aspect of building a data warehouse. Selecting the right design can save a lot of project time and cost.
There are two different data warehouse design approaches normally followed when designing a data warehouse solution, and based on the requirements of your project you can choose which one suits your particular scenario. These methodologies are the result of research from Bill Inmon and Ralph Kimball.
1. Bottom-up design:
In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. These data marts can then be integrated to create a comprehensive data warehouse. The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions and conformed facts, which are dimensions that are shared (in a specific way) between facts in two or more data marts. Ralph Kimball is a renowned author on the subject of data warehousing. His data warehouse design approach is called dimensional modelling or the Kimball methodology, and it follows the bottom-up approach.
In this method, data marts are first created to provide reporting and analytics capability for specific business processes; later, the enterprise data warehouse is created from these data marts.

Below are the steps involved in the bottom-up approach (a brief code sketch of this flow follows the steps):

1. The data flow in the bottom-up approach starts with the extraction of data from the various source systems into the staging area, where it is processed and loaded into the data marts that handle specific business processes.
2. After the data marts are refreshed, the current data is once again extracted into the staging area and transformations are applied so that the data fits the enterprise warehouse structure. The data extracted from the data marts into the staging area is aggregated and summarized, loaded into the EDW, and then made available to end users for analysis, enabling critical business decisions.
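The extract, transform and load (ETL) flow described in these steps can be sketched briefly in code. The following is a minimal, illustrative pandas version; the source data, staging rule and mart/EDW tables are hypothetical assumptions rather than part of any specific methodology.

import pandas as pd

# Extract: pull raw rows from two hypothetical source systems.
pos_sales = pd.DataFrame({"store": ["S1", "S2"], "product": ["A", "B"], "amount": [100.0, 250.0]})
web_sales = pd.DataFrame({"store": ["WEB", "WEB"], "product": ["A", "C"], "amount": [80.0, 40.0]})

# Stage and transform: combine the sources and apply a simple business rule.
staging = pd.concat([pos_sales, web_sales], ignore_index=True)
staging["channel"] = staging["store"].apply(lambda s: "online" if s == "WEB" else "in-store")

# Load a data mart for the sales business process.
sales_mart = staging[["channel", "product", "amount"]]

# Aggregate and summarize the mart, then load the result into the enterprise warehouse.
edw_summary = sales_mart.groupby(["channel", "product"], as_index=False)["amount"].sum()
print(edw_summary)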

2. Top-down design:
The top-down approach is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, is stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. Bill Inmon, sometimes referred to as the "father of data warehousing", based his design methodology on the top-down approach. In the top-down approach, the data warehouse is designed first and then the data marts are built on top of the data warehouse.

Below are the steps involved in the top-down approach:

1. Data is extracted from the various source systems. The extracts are loaded and validated in the staging area. Validation is required to make sure the extracted data is accurate and correct. You can use ETL tools or a similar approach to extract the data and push it to the data warehouse.
2. Data is extracted from the data warehouse on a regular basis into the staging area. At this step, various aggregation and summarization techniques are applied to the extracted data, which is then loaded back into the data warehouse.
3. Once the aggregation and summarization are completed, the various data marts extract that data and apply some more transformations to shape the data into the structure defined by each data mart.

3. Hybrid design:
Data warehouses (DW) often resemble a hub-and-spoke architecture. Legacy systems feeding the warehouse often include customer relationship management and enterprise resource planning systems, generating large amounts of data. To consolidate these various data models and facilitate the extract, transform and load process, data warehouses often make use of an operational data store, the information from which is parsed into the actual DW. To reduce data redundancy, larger systems often store the data in a normalized way. Data marts for specific reports can then be built on top of the DW.

• Distributed Data Warehouse:

Most organizations build and maintain a single centralized data warehouse environment. This setup makes sense for many reasons, chiefly that the volume of data in the data warehouse is such that a single centralized repository makes sense.

Data warehousing is one of the major research topics for applied database researchers. Most of the work to date has focused on building large centralized systems that are integrated repositories, founded on pre-existing systems, upon which all corporate-wide data are based. Unfortunately, this approach is very expensive and tends to ignore the advantages realized during the past decade in the areas of distribution and support for data localization in a geographically dispersed corporate structure. Research on distributed data warehouses therefore places particular emphasis on distribution design for the data warehouse environment. Contributions in this area include an architectural model for a distributed data warehouse, a formal definition of the relational data model for the data warehouse, a methodology for distributed data warehouse design along with two approaches to horizontally fragmenting the huge fact relation in the data warehouse, and an analytical technique that provides insight into which approach is most suited to a particular distributed data warehouse environment.
• Data Warehousing with Materialized Views:

An enterprise data warehouse contains historical detailed data about the
organization. Typically, data flows from one or more online transaction
processing (OLTP) databases into the data warehouse on a monthly, weekly, or
daily basis. The data is usually processed in a staging file before being added to
the data warehouse. Data warehouses typically range in size from tens of
gigabytes to a few terabytes, usually with the vast majority of the data stored in
a few very large fact tables.
A data mart contains a subset of corporate data that is of value to a specific
business unit, department, or set of users. Typically, a data mart is derived from
an enterprise data warehouse.
One of the techniques employed in data warehouses to improve performance is
the creation of summaries, or aggregates. They are a special kind of aggregate
view which improves query execution times by pre-calculating expensive joins
and aggregation operations prior to execution, and storing the results in a table
in the database. For example, a table may be created which would contain the
sum of sales by region and by product.
Today, organizations using summaries spend a significant amount of time
manually creating summaries, identifying which ones to create, indexing the
summaries, updating them, and advising their users on which ones to use. The
introduction of summary management in the Oracle server changes the
workload of the DBA dramatically and means the end-user no longer has to be
aware of which summaries have been defined.

The DBA creates one or more materialized views, which are the equivalent of a
summary. The end-user queries the tables and views in the database and the
query rewrite mechanism in the Oracle server automatically rewrites the SQL
query to use the summary tables. This results in a significant improvement in
response time for returning results from the query and eliminates the need for
the end-user or database application to be aware of the summaries that exist
within the data warehouse.

Although summaries are usually accessed indirectly via the query rewrite
mechanism, an end-user or database application can construct queries which
directly access the summaries. However, serious consideration should be given
to whether users should be allowed to do this because, once the summaries are
directly referenced in queries, the DBA will not be free to drop and create
summaries without affecting applications.

The summaries or aggregates referred to here and in the literature on data warehousing are created in Oracle using a schema object called a materialized view. Materialized views can be used to perform a number of roles, such as improving query performance or providing replicated data, as described below.

• Materialized Views for Data Warehouses:


In data warehouses, materialized views can be used to precompute and store aggregated data, such as the sum of sales. Materialized views in these environments are typically referred to as summaries, since they store summarized data. They can also be used to precompute joins, with or without aggregations. A materialized view is thus used to eliminate the overhead associated with expensive joins or aggregations for a large or important class of queries.
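As a rough illustration of the summary idea, the sketch below precomputes a "sum of sales by region and product" table and stores it for reuse, which is what a materialized view holds inside the database engine. It uses SQLite purely for convenience; SQLite has no materialized views, so the summary is stored as an ordinary table, and the table and column names are assumptions made for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("North", "A", 100.0), ("North", "A", 150.0),
    ("North", "B", 80.0),  ("South", "A", 60.0),
])

# Precompute the expensive aggregation once and store the result,
# mimicking what a materialized view would contain.
conn.execute("""
    CREATE TABLE sales_summary AS
    SELECT region, product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region, product
""")

# Later queries read the small summary table instead of re-aggregating the fact table.
for row in conn.execute("SELECT * FROM sales_summary ORDER BY region, product"):
    print(row)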

Chapter 4
Data Pre-Processing and Outliers

• Data Analytics Life Cycle:


Phase 1—Discovery: In Phase 1, the team learns the business domain, including
relevant history such as whether the organization or business unit has attempted
similar projects in the past from which they can learn. The team assesses the
resources available to support the project in terms of people, technology, time,
and data. Important activities in this phase include framing the business
problem as an analytics challenge that can be addressed in subsequent phases
and formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2 — Data preparation: Phase 2 requires the presence of an analytic
sandbox, in which the team can work with data and perform analytics for the
duration of the project. The team needs to execute extract, load, and transform
(ELT) or extract, transform and load (ETL) to get data into the sandbox. The
ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed
in the ETLT process so the team can work with it and analyze it. In this phase,
the team also needs to familiarize itself with the data thoroughly and take steps
to condition the data.
Phase 3 — Model planning: Phase 3 is model planning, where the team
determines the methods, techniques, and workflow it intends to follow for the
subsequent model building phase. The team explores the data to learn about the
relationships between variables and subsequently selects key variables and the
most suitable models.
Phase 4 — Model building: In Phase 4, the team develops datasets for testing,
training, and production purposes. In addition, in this phase the team builds and
executes models based on the work done in the model planning phase. The team
also considers whether its existing tools will suffice for running the models, or
if it will need a more robust environment for executing models and workflows
(for example, fast hardware and parallel processing, if applicable).
Phase 5 — Communicate results: In Phase 5, the team, in collaboration with
major stakeholders, determines if the results of the project are a success or a
failure based on the criteria developed in Phase 1. The team should identify key

findings, quantify the business value, and develop a narrative to summarize and
convey findings to stakeholders.
Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings,
code, and technical documents. In addition, the team may run a pilot project to
implement the models in a production environment.

• Data Preprocessing:
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues and prepares raw data for further processing. Data preprocessing is used in database-driven applications such as customer relationship management and in rule-based applications (like neural networks).
Steps in data preprocessing:
1. Data Cleaning: Data cleaning, also called data cleansing or scrubbing, fills in missing values, smooths noisy data, identifies or removes outliers, and resolves inconsistencies. Data cleaning is required because source systems contain "dirty data" that must be cleaned. The steps in data cleaning are listed below; a small illustrative sketch follows the list.

• Parsing:
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Example: parsing a full name into first, middle and last name.
• Correcting:
Correcting fixes parsed individual data components using sophisticated data algorithms and secondary data sources.
Example: replacing a vanity address and adding a zip code.

• Standardizing:
Standardizing applies conversion routines to transform data into its preferred and consistent format using both standard and custom business rules.
Examples: adding a prename, replacing a nickname with a standard name.

• Matching:
Matching searches for and matches records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplication.
Example: identifying similar names and addresses.

• Consolidating:
Consolidating analyses and identifies relationships between matched records and consolidates/merges them into one representation.

• Data cleansing:
Data cleansing must deal with many types of possible errors, such as missing data and incorrect data at a source.

• Data Staging:
Data staging accumulates data from asynchronous sources. At a predefined cut-off time, data in the staging file is transformed and loaded to the warehouse. There is usually no end-user access to the staging file. An operational data store may be used for data staging.
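The following minimal pandas sketch illustrates the parsing, standardizing and matching steps on a tiny, made-up customer list; the columns, nickname map and matching rule are illustrative assumptions, not a prescribed procedure.

import pandas as pd

raw = pd.DataFrame({
    "full_name": ["Bob Smith", "Robert Smith", "alice   jones"],
    "zip": ["411001", "411001", "411045"],
})

# Parsing: isolate individual data elements (first and last name).
parts = raw["full_name"].str.strip().str.split()
raw["first_name"] = parts.str[0].str.title()
raw["last_name"] = parts.str[-1].str.title()

# Standardizing: apply a business rule that maps nicknames to a standard form.
nickname_map = {"Bob": "Robert"}
raw["first_name"] = raw["first_name"].replace(nickname_map)

# Matching: records with the same standardized name and zip are treated as duplicates.
deduped = raw.drop_duplicates(subset=["first_name", "last_name", "zip"])
print(deduped[["first_name", "last_name", "zip"]])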

2. Data Integration and Transformation:

Data integration combines data from multiple sources into a coherent data store, e.g., a data warehouse. Sources may include multiple databases, data cubes or data files.

Issues in data integration:

Schema integration: integrate metadata from different sources.

Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id = B.cust#.

Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons include different representations and different scales. Redundant data often occur when integrating multiple databases; the same attribute may have different names in different databases.

Data Transformation:
The transformation process deals with rectifying any inconsistencies (if any).
One of the most common transformation issues is attribute naming inconsistency. It is common for a given data element to be referred to by different data names in different databases, e.g., Employee Name may be EMP_NAME in one database and ENAME in another.
Thus, one set of data names is picked and used consistently in the data warehouse. Once all the data elements have the right names, they must be converted to common formats.
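A small, hypothetical pandas example of resolving attribute naming inconsistency and converting values to a common format is shown below; the database extracts, column names and date formats are assumptions for illustration.

import pandas as pd

# Extracts from two source databases that name the same element differently.
db_a = pd.DataFrame({"EMP_NAME": ["Asha Rao"], "DOJ": ["01/04/2019"]})
db_b = pd.DataFrame({"ENAME": ["Ravi Kumar"], "JOIN_DT": ["2020-07-15"]})

# Pick one set of data names and use it consistently in the warehouse.
canonical = {"EMP_NAME": "employee_name", "ENAME": "employee_name",
             "DOJ": "join_date", "JOIN_DT": "join_date"}
db_a = db_a.rename(columns=canonical)
db_b = db_b.rename(columns=canonical)

# Convert the data elements to a common format (a single date representation).
db_a["join_date"] = pd.to_datetime(db_a["join_date"], dayfirst=True)  # source stores dd/mm/yyyy
db_b["join_date"] = pd.to_datetime(db_b["join_date"])                 # source stores ISO dates

combined = pd.concat([db_a, db_b], ignore_index=True)
print(combined)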

3. Data Reduction:

Data reduction obtains a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results. Data reduction can be achieved by the following means (a brief sketch follows the list):

• Reducing the number of attributes

• Reducing the number of attribute values

• Reducing the number of tuples
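The sketch below illustrates the three reductions on a toy pandas table: dropping an attribute, discretizing attribute values, and sampling tuples. The table, bands and sampling fraction are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 11),
    "age": [23, 25, 31, 38, 41, 47, 52, 58, 63, 70],
    "city": ["Pune"] * 10,           # constant column carries no information
    "spend": [200, 250, 300, 320, 400, 410, 500, 520, 600, 650],
})

# Reduce the number of attributes: drop a column with a single constant value.
reduced_attrs = df.drop(columns=["city"])

# Reduce the number of attribute values: discretize age into three bands.
reduced_attrs["age_band"] = pd.cut(reduced_attrs["age"], bins=[0, 30, 50, 100],
                                   labels=["young", "middle", "senior"])

# Reduce the number of tuples: keep a 50% random sample of the rows.
reduced_rows = reduced_attrs.sample(frac=0.5, random_state=42)
print(reduced_rows)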

• Introduction to OLAP:
OLAP allows business users to slice and dice data at will. Normally, data in an organization is distributed across multiple data sources that are incompatible with each other. A retail example: point-of-sale data and sales made via the call centre or the web are stored in different locations and formats. It would be a time-consuming process for an executive to obtain OLAP reports such as: what are the most popular products purchased by customers between the ages of 15 and 30?
Part of the OLAP implementation process involves extracting data from the various data repositories and making them compatible. Making data compatible involves ensuring that the meaning of the data in one repository matches all other repositories. An example of incompatible data: customer ages can be stored as a birth date for purchases made over the web and stored as age categories (i.e., between 15 and 30) for in-store sales.

It is not always necessary to create a data warehouse for OLAP analysis. Data stored by operational systems, such as point-of-sale systems, is kept in databases called OLTP databases. OLTP (Online Transaction Processing) databases are not structurally different from other databases; the main, and only, difference is the way in which data is stored.

Examples of OLTP systems include ERP, CRM, SCM, point-of-sale and call-centre applications.

OLTP systems are designed for optimal transaction speed. When a consumer makes a purchase online, they expect the transaction to occur instantaneously. With a database design (called a data model) optimized for transactions, the record 'Consumer name, Address, Telephone, Order Number, Order Name, Price, Payment Method' is created quickly in the database and the results can be recalled by managers equally quickly if needed.
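A quick pandas illustration of the "slice and dice" idea follows; the transactions, age bands and product names are made up for the example, and a dedicated OLAP engine or cube would normally do this at much larger scale.

import pandas as pd

transactions = pd.DataFrame({
    "channel": ["store", "web", "web", "call-center", "store", "web"],
    "age": [17, 22, 45, 29, 33, 19],
    "product": ["Headphones", "Phone", "Laptop", "Phone", "Headphones", "Phone"],
    "amount": [30.0, 400.0, 900.0, 380.0, 35.0, 410.0],
})

# Dice: bucket customers into age groups so different sources become comparable.
transactions["age_group"] = pd.cut(transactions["age"], bins=[0, 14, 30, 120],
                                   labels=["under 15", "15-30", "over 30"])

# Slice: look only at the 15-30 age group and rank products by revenue.
young = transactions[transactions["age_group"] == "15-30"]
print(young.groupby("product")["amount"].sum().sort_values(ascending=False))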

Types of Outliers:

Type 1: Global Outliers:
A data point is considered a global outlier if its value is far outside the entirety of the data set in which it is found (similar to how "global variables" in a computer program can be accessed by any function in the program). Such a point is also called a global anomaly.

Type 2: Contextual (Conditional) Outliers:
A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context. Note that this means the same value may not be considered an outlier if it occurs in a different context. If we limit our discussion to time series data, the "context" is almost always temporal, because time series data are records of a specific quantity over time. It is no surprise, then, that contextual outliers are common in time series data.

Type 3: Collective Outliers:
A subset of data points within a data set is considered anomalous if those values as a collection deviate significantly from the entire data set, even though the values of the individual data points are not themselves anomalous in either a contextual or global sense. In time series data, one way this can manifest is as normal peaks and valleys occurring outside of the time frame when that seasonal sequence is normal, or as a combination of time series that is in an outlier state as a group.

Outlier Detection Methods:

Z-Score:
The z-score or standard score of an observation is a metric that indicates how many standard deviations a data point is from the sample's mean, assuming a Gaussian distribution. This makes the z-score a parametric method. Very frequently, data points are not well described by a Gaussian distribution; this problem can often be mitigated by applying transformations to the data, i.e., scaling it.
Some Python libraries such as SciPy and scikit-learn have easy-to-use functions and classes for implementation, along with pandas and NumPy. After making the appropriate transformations to the selected feature space of the dataset, the z-score of any data point can be calculated with the following expression:

z = (x - mean) / standard deviation

When computing the z-score for each sample in the data set, a threshold must be specified. Some good rule-of-thumb thresholds are 2.5, 3, or 3.5 standard deviations.

By tagging or removing the data points that lie beyond a given threshold, we are classifying the data into outliers and non-outliers.

The z-score is a simple yet powerful method to get rid of outliers in data if you are dealing with parametric distributions in a low-dimensional feature space. For nonparametric problems, DBSCAN and Isolation Forests can be good solutions.
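As a concrete illustration, the sketch below computes z-scores with pandas and tags values beyond a chosen threshold as outliers; the sample data and the 2.5 threshold are arbitrary assumptions.

import pandas as pd

def zscore_outliers(values, threshold=2.5):
    """Tag points whose absolute z-score exceeds the threshold."""
    s = pd.Series(values, dtype="float64")
    z = (s - s.mean()) / s.std(ddof=0)   # z = (x - mean) / standard deviation
    return pd.DataFrame({"value": s, "z_score": z, "outlier": z.abs() > threshold})

data = [10, 12, 11, 13, 12, 11, 10, 95]   # 95 is an obvious outlier in this toy sample
print(zscore_outliers(data, threshold=2.5))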

Chapter 5
Designing and Managing BI Systems

1. End-User Experience
These are the core capabilities for all the end users of your application. Business
intelligence requirements in this category may include dashboards and reports as
well as the interactive and analytical functions users can perform. Ideally, such
self-service capabilities let users answer their own questions without having to
involve IT. That frees up IT to work on more strategic projects rather than
answering ad hoc requests. It also empowers users to be more self-sufficient.

During your evaluation, make sure the capabilities important to your project are
demonstrated and understand how you will deliver and iterate upon these
capabilities inside your application.

These common self-service capabilities may affect your business intelligence requirements:

• User personas: Increase the adoption of analytics by providing a broad range of users with a tailored experience that matches their needs and skills.

• Presentation and information delivery: These requirements affect how you present data in visualizations, dashboards, and reports, as well as the compatibility of your BI solution across different devices and formats.

• Interactivity and automation: Do users need to be able to interact with your dashboards? Consider whether you need to personalize visualizations, let users kick off workflows, or drill down into information.

• Analysis and authoring: Empowering your users to query their own data, create visualizations and reports, and share their findings with colleagues can add value to your analytics application.

2. Data Environment
The BI solutions you evaluate should be compatible with your current data environment, while at the same time having enough flexibility to meet future demands as your data architecture evolves. These are the diverse data requirements commonly evaluated by application providers:

• Data sources: Make sure your primary data source is supported by your BI solution. Also look for a vendor that supports generic connectors and has flexibility through APIs or plug-ins.

• Data management: Consider what business intelligence requirements you have for directly querying the data source for real-time reporting, caching data, and blending data from multiple sources. Also ask yourself if your users need to transform or enrich data for analysis.

3. Embeddability and Customization

Depending on your organization's need for embedding analytics into an application or existing software, a primary consideration may be the level of integration with your application's environment. How much emphasis do you place on customization and integration capabilities compared to standard business intelligence implementations? Do you want to offer a seamless user experience within the context of an existing application? Have you struggled with adopting analytics in the past?

One way to look at embeddability is to focus on driving adoption. The deeper you integrate analytics into the fabric of the applications your users leverage every day, the higher your user adoption will be. For instance, if you are a salesperson and you spend much of your time using Salesforce, you don't want to go into applications that are siloed from your daily work. Organizations need to infuse that analytic workflow into users' daily activities.

Consider these requirements around embeddability and customization:

• Security: It should be easy to apply the security from your application to the analytics content. Scrutinize vendors on the flexibility of their security models, implementation of single sign-on, and whether data needs to be synchronized or replicated between applications.

• Multi-tenancy: With built-in multi-tenancy, you can create a report once and deploy it for multiple customers.

• User experience: Many analytics applications need to be white-labelled to match your brand. Look for embedding APIs to ensure visualizations are rendered in the correct context.

• Workflow: Create an efficient experience where users can immediately take action on what they see in any visualization or report.

• Extensibility: Choose a BI solution that supports custom coding and third-party charting libraries to extend your application further.

4. Development and Deployment

Since time-to-value is so critical to the success of the project, having a development environment where you can create, style, embed, deploy, and iterate on analytics will enable your team to deliver the functionality your business demands. These common capabilities may be part of your business intelligence requirements:

• Development: A rich set of out-of-the-box functionality is just the beginning. Also look for sample applications and rapid development tools to make both small functionality changes and mass changes that affect the entire application.

• Deployment: Quickly deploy and scale an implementation that is aligned with your current technology stack. The best solution fits into your web architecture, minimizes the need to deploy proprietary technology, and utilizes well-known techniques to scale the implementation.

5. Licensing, Services, and Company Expertise

Choosing the right partner is not just about the technology. It is also about finding the level of expertise you require for training, support, and services, as well as agreeing on the business terms that ensure shared success, to get you to the finish line and beyond. These factors may affect your business intelligence requirements:

• Licensing: Terms of the license can depend on a variety of factors, including the number of users/customers, servers, usage, and whether you are embedding into a commercial product. Be sure the terms make business sense over the short and long runs.

• Services: Completing your project on time may require technical support from your BI vendor, along with professional services to get you off the ground and ongoing training.

• Customer success: Your BI vendor needs to be dedicated to your success. Look for one that supports onboarding, provides dedicated account management, and offers plenty of documentation and support.

• Expertise: The best analytics vendors have experience in your specific use case or industry.

6. Planning for Scalability and Availability

Scalability is the capability to make a software application available to all those who need and want it. Scalable applications can help increase user adoption across an organization, and may even increase the time users spend engaged in self-service activities.

Scalability makes insights accessible to everyone. Dashboards and reports are available to any number of users, not just restricted to the data analysts or executives in an organization. In fact, many companies are embracing the concept of the extended enterprise: sharing insights with field operations, customers, and external partners in order to promote a common understanding of how to drive the business forward.
