
Data Mining:-

Database –
A database is an organized collection of structured information, or data, typically stored
electronically in a computer system. A database holds current data.
Examples are MySQL, etc.

Data warehouse –
A data warehouse is electronic storage of a large amount of information by a
business. It is a technique for collecting and managing data from varied sources
to provide meaningful business insights.

Data Mining:
Data mining is the process of extracting useful information from large data
sets; it involves analyzing data to discover previously unknown patterns.
OR
Data mining is the process of extracting and discovering patterns in
large data sets involving methods of machine learning, statistics,
and database systems.

MCQs:
1960: Master files & reports (on tape)

1965: Lots of master files (on tape)

1970: Direct access storage (DASD) & DBMS

1975: Online high-performance transaction processing

1980: PCs and 4GL technology (MIS/DSS)

1985 & 1990: Extract programs, extract processing, the legacy systems' "spider web"
Difference between Database and Data Warehouse:

Parameter | Database | Data Warehouse
Use | Recording data | Analyzing data
Processing Methods | OLTP | OLAP
Concurrent Users | Thousands | Limited number
Use Cases | Small transactions | Complex analysis
Downtime | Always available | Some scheduled downtime
Optimization | For CRUD (create, read, update, delete) operations | For complex analysis
Data Type | Real-time detailed data | Summarized historical data

Why is data warehousing important?

Make better business decisions.
Ensure consistency.
Data warehousing improves the speed and efficiency of accessing different data sets.
Is a large amount of data the same as a data warehouse?

Big Data basically refers to data which is in large volume and has complex
data sets. This large amount of data can be structured, semi-structured, or
unstructured and cannot be processed by traditional data processing software and
databases.
A data warehouse, on the other hand, is a collection of data from various
heterogeneous sources. It involves the process of extraction, loading, and
transformation for providing the data for analysis.
Transaction System:-
In computer programming, a transaction usually means a sequence of information
exchange and related work (such as database updates) that is treated as a single
unit in order to ensure database integrity.
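As an illustration, a minimal sketch in Python (using the standard sqlite3 module; the accounts table and amounts are illustrative assumptions, not part of the source) of how a transaction keeps related updates atomic:

```python
# Sketch: either both updates are committed together, or neither is applied.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 100.0)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    conn.commit()          # both updates become visible together
except sqlite3.Error:
    conn.rollback()        # on any failure the database is left unchanged
```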

Ad-Hoc access:-
Ad hoc analysis is a business intelligence (BI) process designed to answer a specific
business question by using company data from various sources.

MCQs:

Total hardware and software cost to store and manage 1 Mbyte of data
1990: ~ $15
2002: ~ ¢15 (Down 100 times)
By 2007: < ¢1 (Down 150 times)
A Few Examples
WalMart: 24 TB
France Telecom: ~ 100 TB
CERN: Up to 20 PB by 2006
Stanford Linear Accelerator Center (SLAC): 500TB

Telecom call records are far more numerous than bank transactions, so call detail is typically
retained for about 18 months.

Retailers interested in analyzing yearly seasonal patterns keep about 65 weeks of data.

Insurance companies want to do actuarial analysis, using historical data to predict risk, so they
keep about 7 years of data.

Rate of update depends on:


volume of data,
nature of business,
cost of keeping historical data,
benefit of keeping historical data

DWH SDLC (CLDS):-

The warehouse development life cycle (CLDS, the classical SDLC reversed) is data-driven rather
than requirements-driven:

Implement warehouse
Integrate data
Test for bias
Program with respect to data
Design DSS system
Analyze results
Understand requirements

OLTP:
OLTP (Online Transactional Processing) is a type of data processing that executes
transaction-focused tasks. It involves inserting, deleting, or updating small
quantities of database data

Data Warehousing (DWH) | Online Transaction Processing (OLTP)

It is a technique that gathers or collects data from different sources into a central repository. | It is a technique used for detailed day-to-day transaction data, which keeps changing every day.
It is designed for the decision-making process. | It is designed for the business transaction process.
It stores a large amount of data or historical data. | It holds current data.
It is used for analyzing the business. | It is used for running the business.
In data warehousing, the size of the database is around 100 GB - 2 TB. | In online transaction processing, the size of the database is around 10 MB - 100 GB.
In data warehousing, denormalized data is present. | In online transaction processing, normalized data is present.
It uses query processing. | It uses transaction processing.
It is subject-oriented. | It is application-oriented.
In data warehousing, data redundancy is present. | In online transaction processing, there is no data redundancy.

Comparison of Response Times:

On-line analytical processing (OLAP) queries must be executed in a small number of
seconds. This often requires denormalization and/or sampling.

Complex query scripts and large list selections can generally be executed in a small
number of minutes.

Sophisticated clustering algorithms (e.g., data mining) can generally be executed in
a small number of hours (even for hundreds of thousands of customers).

Difference between OLTP and OLAP

Below is the difference between OLTP and OLAP in a data warehouse:

Parameters | OLTP | OLAP
Process | It is an online transactional system. It manages database modification. | OLAP is an online analysis and data retrieving process.
Characteristic | It is characterized by large numbers of short online transactions. | It is characterized by a large volume of data.
Functionality | OLTP is an online database modifying system. | OLAP is an online database query management system.
Method | OLTP uses the traditional DBMS. | OLAP uses the data warehouse.
Query | Insert, Update, and Delete information from the database. | Mostly select operations.
Table | Tables in the OLTP database are normalized. | Tables in the OLAP database are not normalized.
Source | OLTP and its transactions are the sources of data. | Different OLTP databases become the source of data for OLAP.
Data Integrity | The OLTP database must maintain data integrity constraints. | The OLAP database does not get frequently modified, hence data integrity is not an issue.
Response time | Its response time is in milliseconds. | Response time is in seconds to minutes.
Data quality | The data in the OLTP database is always detailed and organized. | The data in the OLAP process might not be organized.
Usefulness | It helps to control and run fundamental business tasks. | It helps with planning, problem-solving, and decision support.
Operation | Allows read/write operations. | Only read and rarely write.

Types of data warehouse:-


Financial:
First data warehouse that an organization builds. This is appealing because:

Nerve center, easy to get attention.

In most organizations, smallest data set.

Touches all aspects of an organization, with a common value i.e. money.

Inherent structure of data directly affected by the day-to-day activities of financial processing.
Telecommunication:
Dominated by the sheer volume of call-level data.

Many ways to store call-level detail:

Only a few months of call-level detail,

Storing lots of call-level detail scattered over different storage media,

Storing only selective call-level detail, etc.

Unfortunately, for many kinds of processing, working at an aggregate level is simply
not possible.

Insurance:
Insurance data warehouses are similar to other data warehouses BUT with a few
differences.

Stored data is very, very old and is used for actuarial processing.

A typical business may change dramatically over 40-50 years, but insurance does not.

In retailing or telecom there are a few important dates, but in the insurance
environment there are many dates of many kinds.

Long operational business cycles, in years, and processing times in months; thus the
operating speed is different.

Transactions are not gathered and processed immediately, but are, in a sense, "frozen".

Thus a unique approach to design and implementation is required.

Typical Applications of Data warehouse:


Fraud detection:
By observing data usage patterns.
People have typical purchase patterns.
Deviation from patterns.
Certain cities notorious for fraud.
Certain items bought by stolen cards.
Similar behavior for stolen phone cards

Profitability Analysis:
Banks know if they are profitable or not.
They don't know which customers are profitable.
Typically more than 50% are NOT profitable, but which ones?
Balance is not enough; transactional behavior is the key.
Restructure products and pricing strategies.
Life-time profitability models (next 3-5 years).

Direct mail marketing


Targeted marketing.
Offering high bandwidth package NOT to all users.
Know from call detail records.
Know from data packages.
Saves marketing expense, saving pennies.
Knowing your customers better.

Credit risk prediction

Who should get a loan?
Customer separation, i.e. stable vs. unstable.
Quantitative decision making, NOT subjective.
Different interest rates for different customers.
Do not subsidize bad customers at the expense of good ones.

Profit Management
Works for fixed-inventory businesses.
The value of an unsold item suddenly drops to zero (e.g., once the flight departs).
Item prices vary for different customers.
Example: Airlines, Hotels, etc.
The price of (say) an air ticket depends on:
How far in advance the ticket was bought?
How many vacant seats are present?
How profitable is the customer?
Is the ticket one-way or return?

Agriculture Systems
Agricultural and related data collected for decades.
Meteorological data consists of 50+ attributes.
Decision making based on expert judgment.
Lack of integration results in underutilization.
What is required, in which amount, and when?

Normalization: Explanation, Types, and Techniques

Normalization is a method for decomposing tables to remove data redundancy
(repetition) and standardize the information for better data workflows.
Normalization techniques in data mining are used for rescaling the range
of an attribute's values.

Why do you need normalization techniques in data mining?

When dealing with huge data sets in which many attributes exist but their values span very
different ranges, building models may result in inaccurate predictions. Consequently, the
attributes are normalized to put them all on the same scale.
There are several reasons for using normalization techniques in data mining:

 Mining on normalized data becomes more effective and efficient.

 The data is translated into a format that everyone can understand; the data can be
pulled from databases more quickly, and the data can be analyzed in a consistent way.

Techniques:
Min-Max normalization
The first technique is min-max normalization. It is a linear transformation of the
original data that rescales an attribute's values to a fixed range, typically 0 to 1:
v' = (v - min) / (max - min)
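A minimal sketch of min-max scaling in Python (assuming NumPy is available; the sample values are illustrative):

```python
# Min-max normalization sketch: rescale an attribute's values to [0, 1].
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # illustrative attribute values
v_min, v_max = values.min(), values.max()
scaled = (values - v_min) / (v_max - v_min)                # min maps to 0, max maps to 1
print(scaled)                                              # [0.    0.125 0.25  0.5   1.  ]
```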
Z-score normalization
The next technique is z-score normalization, also called zero-mean normalization. The
essence of this technique is to convert the values to a common scale where the average
equals zero and the standard deviation is one. A value v is normalized to v' under the formula:
v' = (v - mean) / standard deviation
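A corresponding sketch in Python (NumPy assumed; the sample values are illustrative):

```python
# Z-score normalization sketch: centre on the mean and scale by the standard deviation.
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])    # illustrative attribute values
z = (values - values.mean()) / values.std()                # resulting mean ~0, std 1
print(round(z.mean(), 6), round(z.std(), 6))               # 0.0 1.0
```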

Data normalization by decimal scaling

Finally, the decimal scaling normalization technique transforms the data by moving the
decimal point of the values of feature F. How far the decimal point is moved depends on the
maximum absolute value of F. A value v of feature F is transformed to v' by calculating:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
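A small sketch of decimal scaling in Python (NumPy; the values are illustrative):

```python
# Decimal-scaling normalization sketch: divide by 10^j so every |value| falls below 1.
import numpy as np

values = np.array([-120.0, 45.0, 987.0])                      # illustrative attribute values
j = int(np.floor(np.log10(np.abs(values).max()))) + 1         # smallest j with max|v| / 10^j < 1
scaled = values / (10 ** j)
print(j, scaled)                                              # 3 [-0.12   0.045  0.987]
```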

Advantages:

 Reduces redundant data.


 Provides data consistency within the database.
 More flexible database design.
 Higher database security.
 Better and quicker execution.
 Greater overall database organization.

Disadvantages

Requires more joins to get the desired result.

Maintenance overhead.

More tables to join.

The data model is difficult to query against.

Tables hold codes rather than the actual information.

De-normalization

Denormalization is the process of adding precomputed redundant data to an
otherwise normalized relational database to improve the read performance of the
database.

Why is de-normalization used in DSS?

Query performance in DSS is significantly dependent on the physical data model.

The level of de-normalization should be carefully considered.

Denormalization is a technique used by database administrators to optimize the
efficiency of their database infrastructure.
How De-normalization improves performance?
De-normalization specifically improves performance by:

Reducing the number of tables, and hence the reliance on joins, which consequently
speeds up performance;

Reducing the number of joins required during query execution; or

Reducing the number of rows to be retrieved from the primary data table.
4 Guidelines for De-normalization?
1. Carefully do a cost-benefit analysis (frequency of use, additional storage, join
time).

2. Do a data requirement and storage analysis.

3. Evaluate against the maintenance issue of the redundant data (triggers used).

4. When in doubt, don’t de-normalize.

Areas for Applying De-Normalization Techniques?

Dealing with a large number of star schemas (a central table linked with other tables).
Hierarchical model.
Many star schemas in the data warehouse.
Fast access of time-series data for analysis.

Fast aggregate (sum, average, etc.) results and complicated calculations.

Multidimensional analysis (e.g. geography) in a complex hierarchy.

Dealing with few updates but many join queries.

Principles of Denormalization techniques:
1. Collapsing Tables (joining):
- Two entities with a One-to-One relationship.
- Two entities with a Many-to-Many relationship.

2. Splitting Tables (Horizontal/Vertical Splitting).

3. Pre-Joining (one-to-many relationship).

4. Adding Redundant Columns (Reference Data).

5. Derived Attributes (Summary, Total, Balance, etc.)

Collapsing Tables

Figure: two tables (ColA, ColB) and (ColA, ColC) are collapsed (denormalized) into a single
table (ColA, ColB, ColC).

Reduced storage space.

Reduced update time.

Does not change the business view.

Reduced foreign keys.

Reduced indexing.
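A minimal illustrative sketch in Python/pandas (the table names, columns, and data are assumptions, not from the source) of collapsing two one-to-one tables into one denormalized table:

```python
# Sketch: collapse two tables with a 1:1 relationship into a single denormalized table.
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ali", "Sara"]})
profiles  = pd.DataFrame({"cust_id": [1, 2], "segment": ["gold", "silver"]})

# After collapsing, queries read one table instead of joining two at query time.
collapsed = customers.merge(profiles, on="cust_id", how="inner")
print(collapsed)
```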

Splitting Tables

Figure: a table (ColA, ColB, ColC) is split either horizontally into subsets of rows or
vertically into subsets of columns.

Horizontal splitting
Breaks a table into multiple tables based upon common column values. Example:
campus-specific queries.

GOAL

Spreading rows to exploit parallelism.

Grouping data to avoid an unnecessary query load in the WHERE clause.


Vertical Splitting
The table is split into subsets of columns (each subset keeping the key), so the original rows
can be reconstructed with an INNER JOIN on that key.
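An illustrative sketch in Python/pandas (the students table, campuses, and columns are assumptions) of both kinds of splitting:

```python
# Sketch: horizontal and vertical splitting of a table.
import pandas as pd

students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "campus":     ["Lahore", "Karachi", "Lahore", "Karachi"],
    "name":       ["Ali", "Sara", "Omar", "Hina"],
    "gpa":        [3.1, 3.7, 2.9, 3.4],
})

# Horizontal splitting: one table per campus, so campus-specific queries scan fewer rows.
lahore_students  = students[students["campus"] == "Lahore"]
karachi_students = students[students["campus"] == "Karachi"]

# Vertical splitting: column subsets that both keep the key (student_id),
# so the full row can be rebuilt later with an inner join on that key.
ids_names = students[["student_id", "name"]]
ids_gpas  = students[["student_id", "gpa"]]
rebuilt   = ids_names.merge(ids_gpas, on="student_id", how="inner")
print(rebuilt)
```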
Pre-joining Tables

Pre-joined tables store the frequently used pieces of information together into one
table. The process comes in handy when:

 Queries frequently execute on the tables together.


 The join operation is costly.
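A small sketch in Python/pandas (the orders/customers tables and their columns are illustrative assumptions) of pre-joining a one-to-many pair of tables into a single stored table:

```python
# Sketch: pay the costly join once and store the result for frequent queries.
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "city": ["Lahore", "Karachi"]})
orders    = pd.DataFrame({"order_id": [10, 11, 12], "cust_id": [1, 1, 2],
                          "amount": [250.0, 80.0, 400.0]})

prejoined = orders.merge(customers, on="cust_id", how="inner")
# e.g. total sales per city can now be answered without a join at query time:
print(prejoined.groupby("city")["amount"].sum())
```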

Adding Redundant columns

In this method, only the redundant column which is frequently used in the joins is
added to the main table. The other table is retained as it is.

Derived Attributes

Age is also a derived attribute, calculated as Current_Date - DoB (recomputed periodically).

The GP (Grade Point) column in the data warehouse data model is included as a derived
value. The formula for calculating this field is Grade * Credits.
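A minimal sketch in Python/pandas (column names and data are illustrative assumptions) of storing such derived attributes alongside the base columns:

```python
# Sketch: derived attributes (age, grade points) computed once and stored.
import pandas as pd

students = pd.DataFrame({
    "dob":     pd.to_datetime(["2000-05-10", "1998-11-02"]),
    "grade":   [3.5, 2.8],      # grade obtained in a course
    "credits": [3, 4],          # credit hours of that course
})

today = pd.Timestamp.today()
students["age"] = (today - students["dob"]).dt.days // 365   # approximate; recomputed periodically
students["gp"]  = students["grade"] * students["credits"]    # Grade * Credits
print(students)
```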
