Koneru Lakshmaiah College of Engineering: S.indira Priya Darsini V.chaitanya

You might also like

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 9

Presented by,

s.indira priya darsini v.chaitanya


¾ C.S.E. ¾ I.S.T
priyadarshni_sikhinam@yahoo.com chaitu_8918@ yahoo.c om

FROM

Koneru Lakshmaiah College Of Engineering

PDF created with pdfFactory Pro trial version www.software-partners.co.uk


DATA WAREHOUSING
AND DATA
MINING
ABSTRACT
Data mining, the extraction of hidden predictive information from
large databases, is a powerful new technology with great potential to help
companies focus on the most important information in their data
warehouses. Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions. The automated,
prospective analyses offered by data mining move beyond the analyses of
past events provided by retrospective tools typical of decision support
systems.
Data warehouse is a computer system designed to give business
decision-makers instant access to information. The warehouse copies its
data from existing systems like order entry, general ledger, and human
resources and stores it for use by executives rather than programmers.
Data warehouse users use special software that enables them to create
and access information when they need it, as opposed to a reporting
schedule defined by the information systems(IS) department.

This paper describes about the basic architecture of data


warehousing, it’s software and process of data warehousing. It also
presents different techniques followed in data mining.

PDF created with pdfFactory Pro trial version www.software-partners.co.uk


INTRODUCTION
Databases today can range in size into terabytes.
Within these masses of data lies hidden information of
strategic importance. Data mining is the process of
uncovering that information. Innovative organisations
worldwide are already using data mining to locate and
appeal to higher-value customers, to reconfigure their
product offerings to increase sales, and to minimize
losses due to error or fraud. Data mining is a relatively
unique process which extracts information from a
database that the user did not know existed.

There's the more difficult way to use the results of data mining: getting the user to actually
understand what is going on so that they can take action directly.
1) Visualization of the data mining output in a meaningful way, and
2) Allowing the user to interact with the visualization so that simple questions can be answered.

Stag es of the data mining process

The phases depicted start with the raw data and finish with the extracted knowledge which
was acquired as a result of the following stages:

DATA WAREHOUSING

Data mining potential can be enhanced if the appropriate data has been collected and stored in a data
warehouse. Data warehousing is technique making it possible to extract archived operational data and
overcome inconsistencies between different legacy data formats.
The logical link between what the managers see in their decision support
EIS applications and the company's operational activities

Characteristics of a data warehouse

PDF created with pdfFactory Pro trial version www.software-partners.co.uk


Subject-oriented: data is organized according to subject instead of application
Integrated: When data resides in many separate applications in the operational
environment, encoding of data is often inconsistent.
Time-variant: The data warehouse contains data for comparisions.
Non-volatile: Data are not updated or changed in any way once they enter the data
warehouse, but are only loaded and accessed.

ARCHITECTURE OF DATA WAREHOUSE SYSTEM:

The Data Warehouse architecture is a method by which the overall structure of data,
communication, processing and presentation for end-user computing with in the enterprise can be
represented. The architecture of the data warehouse is implemented layer-wise so as to achieve
a higher degree of abstraction. The layered architecture of a data warehouse is illustrated in
the
following diagram.

APPLICATION
Information

Data Access

Data

Data Staging

Operational Data

Data Directory
Data
Process Direct

Layered Architecture of Data Wa rehouse


Infor mation Access La yer:
Data Access Layer:
Process Management Layer:
Data W arehous e La yer:
Data Stagi ng La yer:
Application Messaging Layer:
WAREHOUSE SOFTWARE

4
PDF created with pdfFactory Pro trial version www.software-partners.co.uk
A warehousing team will require several different types of tools during the course of a
warehousing project. These software products generally fall into one or more of the categories:

Extraction and Ware house Data access and retrieval


Transformation Technology
Source Data OLAP

DATA MARTS
middleware
Report
writers

extraction

DATA
EIS/DSS WAREHOUSE
Transformation

Data
Mining
Quality assurance

Extraction and t ransformation.:


Warehouse storage:
Data access and retrieval:
Extraction tools
There are two primary methods for extracting data from source systems

Change-based Replication
-------
Ware
House
Source
System ----
--

lk Bulk Extraction

-----
--- Ware
Source House
-- -----
System
---

5
Bulk extractions:
Change-based replication:
Transfo rma tion tools.
Transformation tools are aptly named; they transform extracted data into the appropriate
format, data structure, and values that are required by the data warehouse.

Data transformations

Field splitting and consolidation:


Standardization:.
Data quality too ls:
Data access and retrieval tools
Data warehouse users derive and obtain information through these types of tools. Data
access and retrieval tools are currently classified into the subcategories below.
Online analytical processing (OLAP) tools.
Executive information systems(EIS)

PROCESSES IN DATA WAREHOUSING

The first phase in data warehousing is to "insulate" your current operational


information, i.e., to preserve the security and integrity of mission-critical OLTP applications,
while giving you access to the broadest possible base of data.

DATA MINING ISSUES

1. One of the key issues raised by data mining technology is not a business or technological
one, but a social one. It is the issue of individual privacy.
2. Another issue is that of data integrity. Clearly, data analysis can only be as good as the
data that is being analyzed. A key implementation challenge is integrating conflicting or
redundant data from different sources.
3. A hotly debated technical issue is whether it is better to set up a relational database
structure or a multidimensional one.
4. Finally, there is the issue of cost. While system hardware costs have dropped dramatically
within the past five years, data mining and data warehousing tend to be self-reinforcing.

DATA MIN IN G TEC HN IQ UES

These provide a description of some of the most common data mining algorithms in use today.
We have broken the discussion into two sections, each with a specific theme:

Classical Techniques: Statistics, Neighborhoods and Clustering

PDF created with pdfFactory Pro trial version www.software-partners.co.uk


Next Generation Techniques: Trees, Networks and Rules

Classical Techniques

Statistics
By strict definition "statistics" or statistical techniques are not data mining. They were
being used long before the term data mining was coined to apply to business applications.

Nearest Neighbor

Clustering and the Nearest Neighbor prediction technique are among the oldest
techniques used in data mining. Nearest neighbor is a prediction technique that is quite similar to
clustering -

Clustering

Clustering is the method by which like records are grouped together. Usually this is done
to give the end user a high level view of what is going on in the database. Clustering is sometimes
used
to mean segmentation -

Name Income Age Education Vendor


Blue Blood Estates Wealthy 35-54 College Claritas Prizm™
Shotguns and Pickups Middle 35-64 High School Claritas Prizm™
Southside City Poor Mix Grade School Claritas Prizm™
Living Off the Land Middle-Poor School Age Low Equifax MicroVision™
Families
University USA Very low Young - Mix Medium to Equifax MicroVision™
High
Sunset Years Medium Seniors Medium Equifax MicroVision™

Table Some Commercially Available Cluster Tags

This clustering information is then used by the end user to tag the customers in their
database. Once this is done the business user can get a quick high level view of what is
happening within the cluster. Once the business user has worked with these codes for some time
they also begin to build intuitions about how these different customers clusters will react to the
marketing offers particular to their business.

Next Generation Techniques


7

PDF created with pdfFactory Pro trial version www.software-partners.co.uk


The data mining techniques in this section represent the most often used techniques that have
been developed over the last two decades of research. These techniques can be used for
either discovering new information within large databases or for building predictive models.

Decision Trees

A decision tree is a predictive model that, as its name implies, can be viewed as a tree.
Specifically each branch of the tree is a classification question and the leaves of the tree are partitions
of the dataset with their classification. For instance if we were going to classify customers who churn
(don’t renew their phone contracts) in the Cellular Telephone Industry a decision tree might
look something like that found in Figure

Figure A decision tree is a predictive model that makes a prediction on the basis of a series of
decision much like the game of 20 questions.

We may notice some interesting things about the tree:

It divides up the data on each branch point without losing any of the data (the number of total
records in a given parent node is equal to the sum of the records contained in its two
children).
The number of churners and non-churners is conserved as you move up or down the
tree

Neural Networks

Foremost among the advantages of neural networks is their highly accurate predictive
models that can be applied across a large number of different types of problems. True neural
networks are biological systems that detect patterns, make predictions and learn.

PDF created with pdfFactory Pro trial version www.software-partners.co.uk


The node - This loosely corresponds to the neuron in the human brain.

The link - This loosely corresponds to the connections between neurons (axons, dendrites and
synapses) in the human brain.

Which Technique and When?

Some of the criteria that are important in determining the technique to be used are determined by
trial and error. There are definite differences in the types of problems that are most conducive to
each technique but the reality of real world data and the dynamic way in which markets,
customers and hence the data that represents them is formed means that the data is constantly
changing. These dynamics mean that it no longer makes sense to build the "perfect" model on the
historical data since whatever was known in the past cannot adequatel y predict the future because
the future is so unlike what has gone before.

POTENTIAL APPLICATIONS

Data mining has many and varied fields of application some of which are listed below.

Retail/Marketing

Banking

Insurance and Health Care

Transportation

Medicine

CONCLUSION

Data Mining is not a new phenomenon. All large organizations already have data
warehouses, but they are just not managing them. The Data Warehousing solution
should enhance intelligence in decision-making process of an enterpris. Over the next few
years, the growth of data mining is going to be enormous with new products and technologies
coming out frequently. In order to get the most out of this period, it is going to be important that data
warehousing and mining planners and developers have a clear idea of what they are looking for and
then choose strategies and methods that will provide them with performance today and flexibility for
tomorrow.

PDF created with pdfFactory Pro trial version www.software-partners.co.uk

You might also like