
1

DATA MINING

1.1 Definition of Data Mining


Definition: In simple words, data mining is defined as a process used to extract usable data from
a larger set of raw data. It implies analyzing data patterns in large batches of data using one
or more software tools. Data mining has applications in multiple fields, such as science and research. As
an application of data mining, businesses can learn more about their customers, develop more
effective strategies related to various business functions, and in turn leverage resources in a more
optimal and insightful manner. This helps businesses get closer to their objectives and make better
decisions. Data mining involves effective data collection and warehousing as well as computer
processing. For segmenting the data and evaluating the probability of future events, data mining
uses sophisticated mathematical algorithms. Data mining is also known as Knowledge Discovery
in Databases (KDD).

Description:
Key features of data mining:

• Automatic pattern predictions based on trend and behavior analysis.

• Prediction based on likely outcomes.

• Creation of decision-oriented information.

• Focus on large data sets and databases for analysis.

• Clustering based on finding and visually documenting groups of facts not previously known.

Student Note:
1.2 Specific uses of data mining
• Market segmentation
o Data mining helps to identify the common characteristics of customers who buy
the same products from your company.
• Customer churn (attrition)
o It helps to predict which customers may leave your company and go to a
competitor.
• Fraud detection
o It identifies which transactions are most likely to be fraudulent.
• Direct marketing
o Direct marketing identifies which prospects should be included to obtain the
highest response rate.
• Interactive marketing
o It is useful for predicting what each user on a web site is most likely interested in
seeing.
• Market basket analysis
o It helps to understand what products or services are commonly purchased together.
• Trend analysis
o Trend analysis identifies the difference between a typical customer this month and
a typical customer last month.

Student Note:
1.3 Challenges of Data Mining:

• Scalability: Scalable techniques are needed to handle the massive datasets that are
now being created.
• Poor efficiency: Such large datasets require efficient methods for storing,
indexing and retrieving data from secondary or even tertiary storage systems.
• Complexity: New techniques can dramatically increase the size of the datasets that can
be handled, and this requires new designs and algorithms.
• Dimensionality: Some domains, for example bioinformatics, have a very large number of
dimensions, which makes the data difficult to analyze. This is called the curse of
dimensionality.
• Poor quality: Poor data quality such as noisy data, dirty data, missing values, and inexact
or incorrect data.

Student Note:
1.4 Knowledge Discovery in Databases (KDD)

Fig: An Outline of the Steps of the KDD Process

The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:

1. Developing an understanding of
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: (SELECTION)
o Selecting a data set, or
o Focusing on a subset of variables or data samples (a data sample is a set
of data collected and/or selected from a statistical population by a defined procedure. The
elements of a sample are known as sample points, sampling units or observations. ... The
sample usually represents a subset of manageable size.), on which discovery is to be
performed.
3. Data cleaning and preprocessing. (PREPROCESSING)
o Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a
proven method of resolving such issues.
o Removal of noise or outliers.
1. Fill in missing values
2. Data not entered due to misunderstanding

How to handle missing data:

o Strategies for handling missing data fields.


1. Filling in the missing value manually
2. Use of a global constant
3. Imputation (use of the attribute mean to fill in the missing value)
o Accounting for time sequence information and known changes.
4. Data reduction and projection. (TRANSFORMATION)
o Finding useful features to represent the data depending on the goal of the
task.
o Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
5. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of the
KDD process.
6. Data mining.
o Searching for patterns of interest in a particular representational form or
a set of such representations, such as classification rules or trees, regression models,
clustering, and so forth (a toy end-to-end sketch of the whole process follows this list).
7. Interpreting mined patterns.
8. Consolidating discovered knowledge.
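As a concrete illustration of these steps, the minimal sketch below walks through a toy KDD-style pipeline in Python with pandas and scikit-learn. The file name customers.csv, the column names (age, income, churned) and the choice of a decision tree as the mining step are illustrative assumptions, not something prescribed by these notes.

# A minimal KDD-style pipeline (assumes customers.csv with columns
# 'age', 'income', 'churned'; decision tree chosen as the mining step).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# 1-2. Selection: create the target data set (subset of variables).
data = pd.read_csv("customers.csv")
target = data[["age", "income", "churned"]]

# 3. Preprocessing: fill missing numeric values with the attribute mean.
target = target.fillna(target[["age", "income"]].mean())

# 4. Transformation: scale numeric attributes to [0, 1].
X = MinMaxScaler().fit_transform(target[["age", "income"]])
y = target["churned"]

# 5-6. Data mining: choose and apply an algorithm (here a decision tree).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# 7. Interpretation / evaluation of the mined patterns.
print("Accuracy on held-out data:", model.score(X_test, y_test))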

The terms knowledge discovery and data mining are distinct.

KDD refers to the overall process of discovering useful knowledge from data. It involves the
evaluation and possibly interpretation of the patterns to make the decision of what qualifies
as knowledge. It also includes the choice of encoding schemes, preprocessing, sampling, and
projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data without
the additional steps of the KDD process.

Student Note:
1.5 Data pre-processing

Why pre-process the data?

We pre-process the data because real world data are generally


1.1. Incomplete:
Lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
1.2. Noisy:
Containing errors or outliers
1.3. Inconsistent:
Containing discrepancies in codes or names

Because of this reason we need to pre-process the data.

Forms of data pre-processing

Student Note:
1. Data cleaning

Real-world data tend to be incomplete, noisy and inconsistent. Data cleaning routines
attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data. The basic methods of data cleaning are:

1. Fill in missing values

If the values are missed then following steps can be taken.


1.1. Ignore the tuple:
Usually done when class label is missing. This method is not very effective
unless the tuples contain several attributes with missing values.
1.2. Use the attribute mean (or majority nominal value) to fill in the missing value.
Suppose that the average income of the company's customers is 2000. Use this
value to replace a missing value of income.
1.3. Use the attribute mean (or majority nominal value) for all samples belonging to the
same class.
If classifying customers according to credit risk, replace the missing value with
the average income value for customers in the same credit risk category as that
of the given tuple.
1.4. Predict the missing value by using a learning algorithm:
Consider the attribute with the missing value as a dependent (class) variable and
run a learning algorithm (usually a Bayesian classifier or a decision tree) to predict
the missing value.

2. Identify outliers and smooth out noisy data:


2.1. Binning
Sort the attribute values and partition them into bins (see "Unsupervised
discretization" below);
Then smooth by bin means, bin median, or bin boundaries.
2.2. Clustering:
Outliers may be detected by clustering, where similar values are organised into
groups or clusters. Intuitively, values that fall outside of the set of clusters may be
considered outliers.
2.3. Regression:
Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the "best" line to fit two variables, so that one variable can
be used to predict the other.

2.4. Correct inconsistent data:

There may be inconsistencies in the data recorded for some transactions. Some data
inconsistencies may be corrected manually using external references; for example, errors
made at data entry may be corrected by performing a paper trace. Knowledge
engineering tools may also be used to detect the violation of known data constraints.
There may also be inconsistencies due to data integration, where a given attribute can
have different names in different databases. (A minimal sketch of missing-value filling
and binning-based smoothing follows.)
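The sketch below is a minimal illustration, not a prescribed tool: it uses pandas to fill a missing income value with the attribute mean and to smooth a sorted attribute by bin means. The column name and the sample values are invented.

import pandas as pd

df = pd.DataFrame({"income": [2000, 2400, None, 1800, 9000, 2100]})

# 1. Fill in the missing value with the attribute mean (imputation).
df["income_filled"] = df["income"].fillna(df["income"].mean())

# 2. Smooth noisy data by binning: sort the values, partition them into
#    equal-depth bins, then replace each value by its bin mean.
sorted_vals = df["income_filled"].sort_values().reset_index(drop=True)
bins = pd.qcut(sorted_vals, q=3)                  # 3 equal-depth bins
smoothed = sorted_vals.groupby(bins).transform("mean")
print(smoothed)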

2. Data transformation
a) Normalization:
a. Scaling attribute values to fall within a specified range. Example: to transform V in
[min, max] to V' in [0, 1], apply V' = (V - Min)/(Max - Min).
b. Scaling by using the mean and standard deviation (useful when min and max are
unknown or when there are outliers): V' = (V - Mean)/StDev. (See the sketch after this list.)
b) Aggregation: moving up in the concept hierarchy on numeric attributes.
c) Generalization: moving up in the concept hierarchy on nominal attributes.
d) Attribute construction: replacing or adding new attributes inferred by existing attributes.
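A small numeric illustration of the two normalization formulas above (min-max and z-score scaling), using numpy; the sample values are arbitrary.

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: V' = (V - Min) / (Max - Min), result falls in [0, 1].
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: V' = (V - Mean) / StDev, useful when min/max are
# unknown or when outliers are present.
v_zscore = (v - v.mean()) / v.std()

print(v_minmax)   # [0.    0.125 0.25  0.5   1.   ]
print(v_zscore)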

3. Data reduction
1. Reducing the number of attributes
o Data cube aggregation: applying roll-up, slice or dice operations.
o Removing irrelevant attributes: attribute selection (filter and wrapper
methods), searching the attribute space.
o Principal component analysis (numeric attributes only): searching for a lower-
dimensional space that can best represent the data.
2. Reducing the number of attribute values
o Binning (histograms): reducing the number of attribute values by grouping them into
intervals (bins).
o Clustering: grouping values in clusters.
o Aggregation or generalization
3. Reducing the number of tuples
o Sampling
(A short PCA and sampling sketch follows.)
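The reduction techniques above can be sketched in a few lines: principal component analysis projects numeric data onto fewer dimensions, and simple random sampling reduces the number of tuples. The use of scikit-learn and the toy shape (100 rows, 5 attributes) are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))          # 100 tuples, 5 numeric attributes

# Reducing the number of attributes: project onto 2 principal components.
reduced = PCA(n_components=2).fit_transform(data)
print(reduced.shape)                      # (100, 2)

# Reducing the number of tuples: simple random sampling without replacement.
sample_idx = rng.choice(len(data), size=20, replace=False)
sample = data[sample_idx]
print(sample.shape)                       # (20, 5)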

SUMMARY PRE-PROCESSING

1.1. Data cleaning:


fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
1.2. Data integration:
Using multiple databases, data cubes, or files.
1.3. Data transformation:
Normalization and aggregation.
1.4. Data reduction:
Reducing the volume but producing the same or similar analytical results.
1.5. Data discretization:
part of data reduction, replacing numerical attributes with nominal ones.
1.6 Applications of Data Mining

1. E-Commerce:
• For business intelligence
o Upsell offers, for example at Amazon.com.
• Fraud detection:
o A problem faced by all e-commerce companies is misuse of our systems
and, in some cases, fraud. For example, sellers may deliberately list a
product in the wrong category to attract user attention, or the item sold is
not as the seller described it. On the buy side, all retailers face problems
with users using stolen credit cards to make purchases or register new
user accounts.

o Fraud detection involves constant monitoring of online activities, and


automatic triggering of internal alarms. Data mining uses statistical
analysis and machine learning for the technique of “anomaly detection”,
that is, detecting abnormal patterns in a data sequence.

o Detecting seller fraud requires mining data on seller profile, item


category, listing price and auction activities. By combining all of this
data, we can have a complete picture and fast detection in real time.

• Product Search:

o When the user searches for a product, how do we find the best results for
the user? Typically, a user query of a few keywords can match many
products. For example, “Verizon Cell phones” is a popular query at eBay,
and it matches more than 34,000 listed items.

o One factor we can use in product ranking is user click-through rates or


product sell-through rate. Both indicate a facet of the popularity of a
product page. In addition, user behavioral data gives us the link from a
query, to a product page view, and all the way to the purchase event.
Through large-scale data analysis of query logs, we can create graphs
between queries and products, and between different products. For
example, the user who searches for “Verizon cell phones” might click on
the Samsung SCH U940 Glyde product, and the LG VX10000 Voyager.
We now know the query is related to those two products, and the two
products have a relationship to each other since a user viewed (and
perhaps considered buying) both.

• Product recommendation

o Recommending similar products is an important part of eBay. A good


product recommendation can save hours of search time and delight our
users.

o Typical recommendation systems are built upon the principle of


“collaborative filtering”, where the aggregated choices of similar, past
users can be used to provide insights for the current user. We do this in
our new product based experience. Try viewing our Apple iPod touch 2nd
generation page and scroll down — you’ll see that users who viewed this
product also viewed other generations of the iPod touch and the iPod
classic.

o Discovering item similarity requires understanding product attributes,


price ranges, user purchase patterns, and product categories. Given the
hundreds of millions of items sold on eBay, and the diversity of
merchandise on our website, this is a challenging computational task.
Data mining provides possible tools to tackle this problem, and we are
always actively improving our approach to the problem. (A minimal item-similarity
sketch follows at the end of this section.)
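As a toy illustration of the collaborative-filtering idea described above ("users who viewed this product also viewed ..."), the sketch below counts how often pairs of products are viewed by the same user and recommends the most frequently co-viewed items. The view log is invented; real systems at this scale use far more sophisticated methods.

from collections import Counter
from itertools import combinations

# (user, product) view events - invented sample data.
views = [
    ("u1", "glyde"), ("u1", "voyager"),
    ("u2", "glyde"), ("u2", "voyager"), ("u2", "ipod_touch"),
    ("u3", "ipod_touch"), ("u3", "ipod_classic"),
]

# Group viewed products by user, then count co-viewed product pairs.
by_user = {}
for user, product in views:
    by_user.setdefault(user, set()).add(product)

co_views = Counter()
for products in by_user.values():
    for a, b in combinations(sorted(products), 2):
        co_views[(a, b)] += 1

def recommend(product, k=2):
    """Return up to k products most often co-viewed with `product`."""
    scores = Counter()
    for (a, b), count in co_views.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return [p for p, _ in scores.most_common(k)]

print(recommend("glyde"))   # e.g. ['voyager', 'ipod_touch']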

2. Crime Agencies:
• Used to spot trends across the data, helping with everything from deciding where to deploy
police manpower (where crime is most likely to happen)
• to deciding whom to search at a border crossing (based on the age/type of the vehicle and
the age of its occupants).
• Data mining and criminal intelligence techniques
o Entity extraction: Commonly used to automatically identify people,
organizations, vehicles and personal details in unstructured data such as
police reports. Even if entity extraction provides only basic information, it
can accelerate the investigation by rapidly providing precise details from
large amounts of unstructured data.
o Clustering techniques: Clustering techniques are used to group similar
characteristics together in classes in order to gain intelligence by
maximizing or minimizing similarities; for example, to identify suspects or
criminal groups conducting crimes in similar ways. Clustering techniques
could be effectively applied through conceptual space algorithms to
discover criminal relations by cross referencing entities in criminal
records.
o Association rules: This data mining technique has been used to discover
recurring items in databases in order to create pattern rules and detect
potential future events. This technique has been effective in preventing
network intrusions and attacks, such as distributed denial-of-service (DDoS)
attacks. (See the support/confidence sketch after this list.)
o Sequential pattern mining: Like association rules, it is useful for identifying
sequences of recurring items in order to define patterns and prevent
attacks in network security.
o Classification: This technique is useful for analyzing unstructured data to
discover common properties among criminal entities. Classification has
been used together with inferential statistics techniques to predict crime
trends. This technique can dramatically narrow down different criminal
entities and organize them into predefined classes.
o String comparison: This technique is used to reveal deceptive information
in criminal records by comparing structured text fields. This requires
highly intensive computational capabilities.
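The association-rule idea mentioned above (and in market basket analysis earlier) boils down to counting support and confidence for candidate rules. A minimal sketch with invented transactions follows; production systems use algorithms such as Apriori or FP-growth rather than this brute-force count.

# Minimal support/confidence computation for a candidate rule A -> B.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))          # 0.5
print(confidence({"bread"}, {"milk"}))     # 2/3, about 0.67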

3. Telecommunication
o Telecommunication companies maintain data about the phone calls
that traverse their networks in the form of call detail records, which
contain descriptive information for each phone call. In 2001, AT&T
long distance customers generated over 300 million call detail records
per day (Cortes & Pregibon, 2001) and, because call detail records
are kept online for several months, this meant that billions of call
detail records were readily available for data mining. Call detail data
is useful for marketing and fraud detection applications.
o Telecommunication companies also maintain extensive customer
information, such as billing information, as well as information
obtained from outside parties, such as credit score information. This
information can be quite useful and often is combined with
telecommunication-specific data to improve the results of data mining.
For example, while call detail data can be used to identify suspicious
calling patterns, a customer’s credit score is often incorporated into
the analysis before determining the likelihood that fraud is actually
taking place.
o Telecommunications companies also generate and store an extensive
amount of data related to the operation of their networks. This is
because the network elements in these large telecommunication
networks have some self-diagnostic capabilities that permit them to
generate both status and alarm messages. These streams of messages
can be mined in order to support network management functions,
namely fault isolation and prediction.

4. Biological Data Analysis:


• Thousands of genes (~25K in human DNA) function in a
complicated and orchestrated way that creates the mystery of life.

• Genomics studies the functionality of specific genes, their relations
to diseases, their associated proteins and their participation in
biological processes.

• Proteins (~1M in the human organism) are responsible for many
regulatory functions in cells, tissues and organisms.

• The proteome, the collection of proteins produced, evolves dynamically
over time depending on environmental signals.
• Proteomics studies the sequences of proteins and their
functionalities.
• (e.g. the stem cell case of Haruko Obokata)
1.7 Data mining issues:
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data sources.
These factors also create some issues. Here in this tutorial, we will discuss the major issues
regarding −

• Mining Methodology and User Interaction

• Performance Issues

• Diverse Data Types Issues

The following diagram describes the major issues.


1. Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −

• Mining different kinds of knowledge in databases − Different users may be interested
in different kinds of knowledge. Therefore it is necessary for data mining to cover a
broad range of knowledge discovery tasks.

• Interactive mining of knowledge at multiple levels of abstraction − The data mining


process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.

• Incorporation of background knowledge − To guide discovery process and to express


the discovered patterns, the background knowledge can be used. Background knowledge
may be used to express the discovered patterns not only in concise terms but at multiple
levels of abstraction.

• Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.

• Presentation and visualization of data mining results − Once the patterns are
discovered it needs to be expressed in high level languages, and visual representations.
These representations should be easily understandable.

• Handling noisy or incomplete data − Data cleaning methods are required to handle
noise and incomplete objects while mining the data regularities. If data cleaning
methods are not available, the accuracy of the discovered patterns will be poor.

• Pattern evaluation − The patterns discovered may be uninteresting if they merely
represent common knowledge or lack novelty, so interestingness measures are needed
to evaluate them.
2. Performance Issues
There can be performance-related issues such as follows −

• Efficiency and scalability of data mining algorithms− In order to effectively extract


the information from huge amount of data in databases, data mining algorithm must be
efficient and scalable.

• Parallel, distributed, and incremental mining algorithms − Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions which are processed in a parallel fashion;
the results from the partitions are then merged. Incremental algorithms update the
existing models as new data arrives, without mining the data again from scratch.

3. Diverse Data Types Issues


• Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible
for one system to mine all these kinds of data.

• Mining information from heterogeneous databases and global information


systems − The data is available at different data sources on a LAN or WAN. These data
sources may be structured, semi-structured or unstructured. Therefore mining
knowledge from them adds challenges to data mining.
1.8 Disadvantages of data mining
• Privacy Issues
o Concerns about personal privacy have been increasing enormously in recent
years, especially as the internet booms with social networks, e-commerce,
forums, blogs and so on. Because of privacy issues, people are afraid that their
personal information will be collected and used in an unethical way that could
potentially cause them a lot of trouble. Businesses collect information about their
customers in many ways in order to understand their purchasing behaviour and trends.
However, businesses don't last forever; some day they may be acquired or go out of
business, and at that time the personal information they own may be sold on or leaked.

• Security issues
o Security is a big issue. Businesses own information about their employees and
customers, including social security numbers, birthdays, payroll details and so on.
However, how properly this information is taken care of is still in question. There have
been many cases in which hackers accessed and stole large amounts of customer data
from big corporations such as Ford Motor Credit Company and Sony. With so much
personal and financial information available, stolen credit cards and identity theft
become a big problem.
• Misuse of information/inaccurate information
o Information collected through data mining for ethical purposes can
be misused. This information may be exploited by unethical people or businesses
to take advantage of vulnerable people or to discriminate against a group of people.

In addition,

Data mining techniques are not perfectly accurate. Therefore, if inaccurate information is used for
decision-making, it can cause serious consequences.

1.9 Problems of Data Mining


The amount of data being generated and stored every day is growing exponentially. A recent study
estimated that every minute, Google receives over 2 million queries, e-mail users send over 200
million messages, YouTube users upload 48 hours of video, Facebook users share over 680,000
pieces of content, and Twitter users generate 100,000 tweets. Besides these, media sharing sites, stock
trading sites and news sources continually pile up more new data throughout the day.

The common problems in Data Mining.

1. Poor data quality such as noisy data, dirty data, missing values, inexact or incorrect values,
inadequate data size and poor representation in data sampling.
2. Integrating conflicting or redundant data from different sources and forms: multimedia files
(audio, video and images), geo data, text, social, numeric, etc…
3. Proliferation of security and privacy concerns by individuals, organizations and governments.
4. Unavailability of data or difficult access to data.
5. Efficiency and scalability of data mining algorithms to effectively extract the information from
huge amount of data in databases.
6. Dealing with huge datasets that require distributed approaches.
7. Dealing with non-static, unbalanced and cost-sensitive data.
8. Mining information from heterogeneous databases and global information systems.
9. Constant updating of models to handle data velocity or new incoming data.
10. High cost of buying and maintaining the powerful software, servers and storage hardware
needed to handle large amounts of data.
11. Processing of large, complex and unstructured data into a structured format.
12. Sheer quantity of output from many data mining methods.
Difference between Database and Data Warehouse

Definition
Database: Any collection of data organized for storage, accessibility, and retrieval.
Data Warehouse: A type of database that integrates copies of transaction data from disparate source systems and provisions them for analytical use.

Types
Database: There are different types of databases, but the term usually applies to an OLTP application database, XML, CSV files, flat text, and even Excel spreadsheets. We've actually found that many healthcare organizations use Excel spreadsheets to perform analytics (a solution that is not scalable).
Data Warehouse: A data warehouse is an OLAP database. An OLAP database layers on top of OLTPs or other databases to perform analytics. An important side note about this type of database: data warehouses differ according to how the data is modeled, employing either an enterprise or a dimensional data model.

Data Organization
Database: An OLTP database structure features very complex tables and joins because the data is normalized (it is structured in such a way that no data is duplicated). Making data relational in this way is what delivers storage and processing efficiencies, and allows those sub-second response times.
Data Warehouse: In an OLAP database structure, data is organized specifically to facilitate reporting and analysis, not for quick-hitting transactional needs. The data is denormalized to enhance analytical query response times and provide ease of use for business users. Fewer tables and a simpler structure result in easier reporting and analysis.

Reporting/Analyzing
Database: Reporting is typically limited to more static needs. We can actually get reports running from an OLTP database, but these reports are static, one-time lists in PDF format. For example, we might generate a monthly report of heart failure readmissions or a list of all patients with a central line inserted.
Data Warehouse: When it comes to analyzing data and reporting, a static list is insufficient. There is an intrinsic need for aggregating, summarizing, and drilling down into the data. A data warehouse enables many types of analysis and reporting: descriptive (what has happened), diagnostic (why it happened), predictive (what will happen), and prescriptive (what to do about it).

Service Level
Database: OLTP databases must typically meet 99.99% uptime. System failure can result in chaos and lawsuits. The database is directly linked to the front-end application, and data is available in real time to serve the here-and-now needs of the organization. In healthcare, this data contributes to clinicians delivering precise, timely bedside care.
Data Warehouse: With OLAP databases, Service Level Agreements are more flexible because occasional downtime for data loads is expected. The OLAP database is separated from front-end applications, which allows it to be scalable. Data is refreshed from source systems as needed (typically every 24 hours). It serves historical trend analysis and business decisions.
2
DATA WAREHOUSE

2.1 What is a Data Warehouse?

(DEFINITIONS)

• Data warehouse is a central location where consolidated data from multiple


locations are stored .

• A data warehouse is the data (meta/fact/dimension/aggregation)
and the process managers (query/tools) that make information
available, enabling people to make informed decisions.

• Data warehousing is a technology that aggregates


structured data from one or more sources so that it can be
compared and analyzed for greater business intelligence.

• A Data Warehouse is a repository of information
collected from multiple sources, stored under a
unified schema, and usually residing at a
single site. A data warehouse is constructed via a
process of data cleaning, data transformation, data
integration, data loading and periodic data
refreshing.
2.2 Introduction
Data Warehouse (DW) stores corporate information and data from operational
systems and a wide range of other data resources. Data Warehouses are
designed to support the decision-making process through data collection,
consolidation, analytics, and research. They can be used in analyzing a
specific subject area, such as “sales,” and are an important part of
modern Business Intelligence.

The architecture for Data Warehouses was developed in the 1980s to assist in
transforming data from operational systems to decision-making support
systems. Normally, a Data Warehouse is part of a business’s mainframe
server or in the Cloud.

In a Data Warehouse, data from many different sources is brought to a single


location and then translated into a format the Data Warehouse can process
and store. For example, a business stores data about its customer’s
information, products, employees and their salaries, sales, and invoices. The
boss may ask about the latest cost-reduction measures, and getting answers
will require an analysis of all of the previously mentioned data. Unlike basic
operational data storage, Data Warehouses contain aggregated historical data
(highly useful data taken from a variety of sources).

Punch cards were the first solution for storing computer generated data. By
the 1950s, punch cards were an important part of the American government
and businesses. The warning “Do not fold, spindle, or mutilate” originally came
from punch cards. Punch cards continued to be used regularly until the mid-
1980s. They are still used to record the results of voting ballots and
standardized tests. “Magnetic storage” slowly replaced punch cards starting in
the 1960s. Disk storage came as the next evolutionary step for data storage.
Disk storage (hard drives and floppies) started becoming popular in 1964 and
allowed data to be accessed directly, which was a significant improvement
over the clumsier magnetic tapes. IBM was primarily responsible for the early
evolution of disk storage. They invented the floppy disk drive as well as the
hard disk drive. They are also credited with several of the improvements now
supporting their products. IBM began developing and manufacturing disk
storage devices in 1956. In 2003, they sold their “hard disk” business to
Hitachi.

The Need for Data Warehouses

During the 1990s major cultural and technological changes were taking place.
The internet was surging in popularity. Competition had increased due to new
free trade agreements, computerization, globalization, and networking. This
new reality required greater business intelligence, resulting in the need for true
data warehousing. During this time, the use of application systems exploded.

By the year 2000, many businesses discovered that, with the expansion of
databases and application systems, their systems had been badly
integrated and that their data was inconsistent. They discovered they were
receiving and storing lots of fragmented data. Somehow, the data needed to
be integrated to provide the critical “Business Information” needed for
decision-making in a competitive, constantly-changing global economy.

Data Warehouses were developed by businesses to consolidate the data


they were taking from a variety of databases, and to help support their
strategic decision-making efforts
Benefits of Data Warehouse
1. High return on investment:
a. Implementation of data warehousing by an organisation requires a huge
investment, typically from Rs 10-15 lakhs; however, a study by the
International Data Corporation (IDC) in 1996 reported that average 3-year
returns on investment (ROI) in data warehousing reached 401%.
2. More Cost-effective decision making:
a. Data Warehousing helps to reduce the overall cost of product by
reducing the number of channels.
3. Competitive advantage:
a. The competitive advantage is gained by allowing decision-makers
access to data that can reveal previously unavailable, unknown and
untapped information. For e.g. Customers, trends, and demand.
4. Increased productivity of corporate decision-makers:
a. Data warehousing improves the productivity of corporate decision-
makers by creating an integrated database of consistent, subject-
oriented, historical data.
5. Better enterprise intelligence:
a. It helps to provide better enterprise intelligence.

Student Note:
Application of Data Warehouse
Data Warehouses owing to their potential have deep-rooted applications in every
industry which use historical data for prediction, statistical analysis, and decision
making. Listed below are the applications of Data warehouses across innumerable
industry backgrounds.

1. Banking Industry
In the banking industry, concentration is given to
• risk management
• analyzing consumer data, market trends,
• government regulations and reports,
• Financial decision making.
• Most banks also use warehouses to manage the resources available on deck
in an effective manner. Certain banking sectors utilize them for market
research, performance analysis of each product, interchange and exchange
rates, and to develop marketing programs.
• Analysis of card holder’s transactions, spending patterns and merchant
classification, all of which provide the bank with an opportunity to introduce
special offers and lucrative deals based on cardholder activity. Apart from all
these, there is also scope for co-branding.
2. Finance Industry
Applications in the finance industry are similar to those seen in banking; they mainly
revolve around evaluating trends in customer expenses, which aids in maximizing the
profits earned by their clients.

3. Consumer Goods Industry


They are used for
• prediction of consumer trends,
• inventory management,
• Market and advertising research.
• In-depth analysis of sales and production is also carried out.
• Apart from these, information is exchanged with business partners and clientele.

4. Government
The federal government utilizes the warehouses for
• Research in compliance, whereas the state government uses it for services
related to human resources like recruitment, and accounting like payroll
management.
• to maintain and analyze tax records,
• analyse health policy records and their respective providers,
• Analyse the entire criminal law database. Criminal activity is predicted from
patterns and trends resulting from the analysis of historical data associated with
past criminals.

5. Education
Universities use warehouses for
• extracting of information used for the proposal of research grants,
• understanding their student demographics, and human resource
management.
• The entire financial department of most universities depends on data
warehouses, inclusive of the Financial Aid department.

6. Healthcare
One of the most important sector which utilizes data warehouses is the Healthcare
sector. All of their financial, clinical, and employee records are fed to warehouses as
it helps them
• to strategize and predict outcomes,
• track and analyse their service feedback,
• generate patient reports,
• share data with tie-in insurance companies,
• Medical aid services, etc.
7. Hospitality Industry
A major proportion of this industry is dominated by hotel and restaurant services, car
rental services, and holiday home services. They utilize warehouse services to
• Design and evaluate their advertising and promotion campaigns where they
target customers based on their feedback and travel patterns.

8. Insurance
As the saying goes in the insurance services sector, “Insurance can never be
bought, it can only be sold”, the warehouses are primarily used to

• Analyze data patterns and customer trends, apart from maintaining records of
already existing participants.

9. Manufacturing and Distribution Industry


This industry is one of the most important sources of income for any state. A
manufacturing organization has to take several make-or-buy decisions which can
influence the future of the sector, which is why they utilize high-end OLAP tools as a
part of data warehouses to:
• predict market changes,
• analyze current business trends,
• detect warning conditions,
• view marketing developments
• Take better decisions.
They also use them for product shipment records, records of product portfolios,
identify profitable product lines, analyze previous data and customer feedback to
evaluate the weaker product lines and eliminate them.
For the distributions, the supply chain management of products operates through
data warehouses.

10. The Retailers


Retailers serve as middlemen between producers and consumers. It is important for
them to maintain records of both the parties to ensure their existence in the market.
They use warehouses to
• track items,
• advertising promotions,
• Consumers buying trends.
• They also analyze sales to determine fast selling and slow selling product
lines and determine their shelf space through a process of elimination.
11. Services Sector
Data warehouses find themselves to be of use in the service sector for maintenance
of financial records, revenue patterns, customer profiling, resource management,
and human resources.

12. Telephone Industry & Transportation Industry


The telephone industry operates over both offline and online data burdening them
with a lot of historical data which has to be consolidated and integrated.
Apart from those operations,
• analysis of customers' calling patterns (used by sales representatives to push
advertising campaigns) and
• tracking of customer queries
also require the facilities of a data warehouse.
In the transportation industry, data warehouses record customer data enabling
traders to experiment with target marketing where the marketing campaigns are
designed by keeping customer requirements in mind.
The internal environment of the industry uses them to analyze customer feedback,
performance, manage crews on board as well as analyze customer financial reports
for pricing strategies.

Student Note:
Data Warehouse Models
From the architecture point of view, there are three models.

Problems of Data Warehousing


1. Underestimation of resources for data loading
a. Sometimes we underestimate the time required to extract, clean and
load the data into the warehouse.
2. Hidden problems with source systems
a. Sometimes hidden problems associated with the source systems feeding
the data warehouse may be identified only after years of going
undetected, e.g. when entering the details of new properties, certain fields
may allow nulls, which may result in staff entering incomplete property
data even when it is available and applicable.
3. Required data not captured
a. In some cases the required data is not captured by the source system,
even though it may be very important for the data warehouse. E.g.
the date of registration for a property may not be used in the source
system, but it may be very important for analysis purposes.
4. Data homogenization:
a. The concept of a data warehouse requires similar data formats
across different data sources; homogenizing the data in this way can
result in the loss of some important detail in the data.
5. High demand for resources:
a. The data warehouse requires large amounts of data, and hence large
amounts of storage and processing resources.
6. High maintenance cost
a. Data warehouses are high-maintenance systems. Any reorganization of the
business processes or of the source systems may affect the data
warehouse, which results in a high maintenance cost.
7. Long duration
a. The building of a warehouse can take up to three years, which is why
some organisations are reluctant to invest in a data warehouse.
8. Data ownership
a. Data warehousing may change the attitude of end-users towards the
ownership of data. Sensitive data owned by one department has to be
loaded into the data warehouse for decision-making purposes, but sometimes
the owning department is reluctant because it hesitates to
share the data with others.
Data Mart
• A data mart is a repository of data that is designed to serve a particular community of
knowledge workers.
• A data mart is a simple form of a data warehouse that is focused on a single subject
(or functional area), such as Sales or Finance or Marketing. Data marts are often
built and controlled by a single department within an organization.
Types of Data Marts

Dependent, Independent, and Hybrid Data Marts

Three basic types of data marts are dependent, independent, and hybrid. The categorization
is based primarily on the data source that feeds the data mart. Dependent data marts draw
data from a central data warehouse that has already been created. Independent data marts,
in contrast, are standalone systems built by drawing data directly from operational or
external sources of data or both. Hybrid data marts can draw data from operational systems
or data warehouses.

Dependent Data Marts

A dependent data mart allows you to unite your organization's data in one data warehouse.
This gives you the usual advantages of centralization. Figure below illustrates a dependent
data mart.

Independent Data Marts

An independent data mart is created without the use of a central data warehouse. This could
be desirable for smaller groups within an organization. It is not, however, the focus of this
Guide. See the Data Mart Suites documentation for further details regarding this
architecture. Figure below illustrates an independent data mart.

Hybrid Data Marts

A hybrid data mart allows you to combine input from sources other than a data warehouse.
This could be useful for many situations, especially when you need ad hoc integration, such
as after a new group or product is added to the organization. Figure below illustrates a
hybrid data mart.

Figure Hybrid Data Mart


Difference Between Data Warehouse and Data Mart

Parameter: Data Warehouse vs. Data Mart

Definition: A big central repository of historical data vs. a subset of the data warehouse.

Focus: Multiple subject areas vs. a specific subject area.

Control: Central organization unit vs. generally a single department.

Scope: Corporate vs. line of business.

Data sources: Multiple vs. few selected.

Size: 100 GB to TB+ vs. less than 100 GB.

Designing: Comparatively difficult vs. easy.

Implementation: Months to years vs. months.

Decision: Strategic vs. tactical.


3
SOFTWARE & HARDWARE DESIGN

3. Introduction
3.1. Multidimensional structure
3.1.1.Fact table
3.1.2.Dimension table
3.1.3.Difference between fact table and dimension table

3.2. Data Warehouse Schema


3.2.1. Star schema
3.2.2. Snowflake schema
3.2.3. Fact Constellation
3.3. Concept Hierarchy
3.4. Starnet Query Model
3.5. Overview of hardware and I/O considerations
3.6. Index
3.7. Materialized view

3. Introduction:
3.1. Multidimensional Structure
Data Warehouses and OLAP tools are based on a multidimensional data model.
This model views data in the form of a data cube.

What is a data Cube?


A data cube allows data to be modeled and viewed in multiple dimensions. It is
defined by dimensions and facts.

Fact Table
 The large central table, containing the bulk of the data with no
redundancy.
 Usually the fact table in the schema is in third normal form
 A fact table can contain facts at the detail or aggregate level
 Fig : fact table.

Dimension table:
Dimensions are the perspectives or entities with respect to which an organization wants
to keep records. For example, a store may create a sales data warehouse in order to keep
records; this allows the store to keep track of things like the monthly sales of items, and
the branches and locations at which the items were sold. Examples of dimensions are time,
branch, location and item. Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension.

For example, the dimension table for item may contain the attributes item_key,
item_name, brand, type and supplier.

- Dimension tables can be specified by users or experts.
- Dimension tables are de-normalized.
- A dimension table is composed of one or more hierarchies that categorize
data. If the dimension has no hierarchies and levels, it is called a flat
dimension or list.
- A dimension table is a table in a star schema
of a data warehouse.
- Dimension tables are generally smaller in size than fact tables.
- Dimensions categorize and describe data
warehouse facts and measures in ways that
support meaningful answers to business questions.

A foreign key is a key that establishes a relation between two tables.

Difference between Fact Table and Dimension Table

Definition
Fact Table: Contains measurements or facts about a business process.
Dimension Table: A companion table to the fact table; contains descriptive attributes to be used as query constraints.

Characteristic
Fact Table: Located at the center of a star or snowflake schema and surrounded by dimensions.
Dimension Table: Connected to the fact table and located at the edges of the star or snowflake schema.

Design
Fact Table: Defined by its grain or its most atomic level.
Dimension Table: Should be wordy, descriptive, complete, and quality assured.

Type of Data
Fact Table: Contains information like sales against a set of dimensions like Product and Date.
Dimension Table: Every dimension table contains attributes which describe the details of the dimension, e.g. the Product dimension can contain Product ID, Product Category, etc.

Key
Fact Table: The primary keys of the dimensions appear in the fact table as foreign keys.
Dimension Table: Its primary key is referenced as a foreign key in the fact table.

Hierarchy
Fact Table: Does not contain hierarchies.
Dimension Table: Contains hierarchies. For example, Location could contain country, pin code, state, city, etc.

3.2. Data warehouse Schema


Schema is a logical description of the entire database. It includes the name and description
of records of all record types including all associated data-items and aggregates. Much like
a database, a data warehouse also requires a schema to be maintained. A database uses
a relational model, while a data warehouse uses one of the following schemas:

 Star,
 Snowflake, and
 Fact Constellation schema.

3.2.1. Star Schema


The star schema architecture is the simplest data warehouse schema. It is called a star
schema because the diagram resembles a star, with points radiating from the center.

 Each dimension in a star schema is represented with only one dimension table.

 This dimension table contains the set of attributes.

 There is a fact table at the center. It contains the keys to each of the four
dimensions.

 The fact table also contains the measures, namely dollars sold and units sold.
 A large central table (called the fact table) contains the bulk of the data with no
redundancy.
 Usually the fact table in the star schema is in third normal form (3NF).
 Fact tables typically have two types of columns: foreign keys to dimension tables, and
measures that contain numeric facts.
 A fact table can contain facts at the detail or aggregate level.

 The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location. (A minimal sketch of querying
such a star schema with pandas follows the figure.)
Student Note:

fig:
star schema
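To make the star schema concrete, the sketch below builds a tiny sales fact table plus item and location dimension tables in pandas and answers "dollars sold by brand and city" by joining on the dimension keys. The table and column names follow the dimensions named above but are otherwise invented.

import pandas as pd

# Dimension tables (de-normalized, one table per dimension).
item = pd.DataFrame({"item_key": [1, 2],
                     "item_name": ["phone", "tv"],
                     "brand": ["Acme", "Zenith"]})
location = pd.DataFrame({"location_key": [10, 20],
                         "city": ["Vancouver", "Chicago"]})

# Fact table: foreign keys to the dimensions plus numeric measures.
sales = pd.DataFrame({"item_key": [1, 1, 2, 2],
                      "location_key": [10, 20, 10, 20],
                      "dollars_sold": [500, 300, 900, 700],
                      "units_sold": [5, 3, 2, 1]})

# Star join: attach the descriptive attributes, then aggregate the measures.
cube = (sales.merge(item, on="item_key")
             .merge(location, on="location_key"))
report = cube.groupby(["brand", "city"])["dollars_sold"].sum()
print(report)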

3.2.2. Snow flake schema


A snowflake schema is a logical arrangement of tables in a multidimensional database such that
the entity relationship diagram resembles a snowflake shape. The snowflake schema is represented
by centralized fact tables which are connected to multiple dimensions. "Snow flaking" is a method of
normalizing the dimension tables in a star schema. When it is completely normalized along all the
dimension tables, the resultant structure resembles a snowflake with the fact table in the middle. The
principle behind snow flaking is normalization of the dimension tables by removing low cardinality
attributes and forming separate tables.

The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions
are normalized into multiple related tables, whereas the star schema's dimensions are de normalized
with each dimension represented by a single table. A complex snowflake shape emerges when the
dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the
child tables have multiple parent tables ("forks in the road").

The snowflake schema is a variant of the star schema model, where some of the
dimension tables are normalized, thereby further splitting the data into additional
tables. (A small normalization sketch follows.)
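A small sketch of what "snowflaking" does to a dimension: the de-normalized item dimension below repeats supplier details, so the low-cardinality supplier attributes are split into their own table referenced by a key. The data and column names are invented for illustration.

import pandas as pd

# De-normalized (star) item dimension: supplier details are repeated.
item_star = pd.DataFrame({
    "item_key":      [1, 2, 3],
    "item_name":     ["phone", "tv", "radio"],
    "supplier_name": ["S1", "S1", "S2"],
    "supplier_city": ["Pune", "Pune", "Delhi"],
})

# Snowflaking: move supplier attributes into a separate, normalized table.
supplier = (item_star[["supplier_name", "supplier_city"]]
            .drop_duplicates()
            .reset_index(drop=True))
supplier["supplier_key"] = supplier.index + 1

# The item dimension now keeps only a foreign key to the supplier table.
item_snow = (item_star.merge(supplier, on=["supplier_name", "supplier_city"])
             [["item_key", "item_name", "supplier_key"]])
print(supplier)
print(item_snow)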

Student Note:

Diagram of snowflake schema


Disadvantages:

 The primary disadvantage of the snowflake schema is that the additional levels of
attribute normalization adds complexity to source query joins, when compared to
the star schema.

 Snowflake schemas, in contrast to flat single-table dimensions, have been heavily
criticized. Their goal is assumed to be efficient and compact storage of normalized
data, but this comes at the significant cost of poor performance when browsing the
joins required in the dimension. This disadvantage may have been reduced in the years
since it was first recognized, owing to better query performance within the browsing tools.

 When compared to a highly normalized transactional schema, the snowflake
schema's denormalization removes the data integrity assurances provided by
normalized schemas. Data loads into the snowflake schema must be highly controlled
and managed to avoid update and insert anomalies.

Student Note:

3.2.3. Fact Constellation


 The fact constellation architecture contains multiple fact tables that share
many dimension tables.
 This schema is also called a galaxy schema (a collection of stars).
 This schema is more complex than the star or snowflake schema architecture
because it contains multiple fact tables.
 This schema is flexible; however, it may be hard to manage and support.

Advantage of Fact Constellation Schema Data Warehouses

 Provides a flexible schema


 Different fact tables are explicitly assigned to the
dimensions
Disadvantage of Fact Constellation Schema Data Warehouses

 Fact Constellation solution is difficult to maintain


 Complexity of the schema involved due to the number of aggregations

Diagram of fact constellation (galaxy)

3.3. Concept Hierarchy


Concept hierarchies organize data or concepts in hierarchical form or in a certain
partial order. They are used for expressing knowledge in a concise, high-level form and
for facilitating mining of knowledge at multiple levels of abstraction.
A concept hierarchy facilitates drilling and rolling in a data warehouse to view data at
multiple granularities.
It defines a sequence of mappings from a set of low-level concepts to higher-level
concepts.
Let us take a concept hierarchy for the dimension location. City values for location
include Vancouver, Toronto, New York and Chicago; each city is then mapped to the
province or state to which it belongs. Each province or state is in turn mapped to the
country to which it belongs, such as Canada or the USA.
The concept hierarchy can be illustrated by the following figure.
Fig: A concept hierarchy for the dimension location.

These mappings form a concept hierarchy for the dimension 'location', mapping a
set of low-level concepts (e.g. cities) to higher-level concepts (e.g. countries). These attributes
are related by a total order, forming a concept hierarchy such as street < city <
province or state < country.

Fig: hierarchy for location (street < city < province or state < country).

Lattice
The attributes of a dimension may be organized in a partial order, forming a lattice. An
example of a partial order for the time dimension based on the attributes day, month,
quarter, week and year is: day < {month < quarter; week} < year. (A small roll-up sketch
over the location hierarchy follows.)
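A tiny sketch of rolling up along the location concept hierarchy (city < province or state < country): city-level sales are joined to their province and country and re-aggregated at the coarser level. The mapping and the figures are invented.

import pandas as pd

# City-level sales and the concept hierarchy city -> province/state -> country.
sales = pd.DataFrame({"city": ["Vancouver", "Toronto", "New York", "Chicago"],
                      "dollars_sold": [400, 600, 900, 700]})
hierarchy = pd.DataFrame({
    "city":     ["Vancouver", "Toronto", "New York", "Chicago"],
    "province": ["British Columbia", "Ontario", "New York", "Illinois"],
    "country":  ["Canada", "Canada", "USA", "USA"],
})

detailed = sales.merge(hierarchy, on="city")

# Roll up: city -> country (coarser granularity, measures are summed).
by_country = detailed.groupby("country")["dollars_sold"].sum()
print(by_country)        # Canada 1000, USA 1600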
3.4. Starnet Query Model
The starnet query model for querying multidimensional databases consists of radial lines
emanating from a central point. Each line represents a concept hierarchy for a dimension. Each
abstraction level in a hierarchy is called a footprint. These footprints represent the
granularities available for use by OLAP operations such as drill-down and roll-up.
Fig: Modeling business queries with a starnet model.

In the figure there are four radial lines. These represent the concept hierarchies of the dimensions
location, customer, item and time. Each line consists of footprints representing the
abstraction levels of the dimension.

E.g. the footprints of the time dimension are day, month, quarter and year. A concept
hierarchy may involve a single attribute or several attributes.

In order to examine item sales, a user can roll up along the time dimension from
month to quarter, or drill down along the location dimension from country to city.


3.5. Overview of hardware and I/O considerations
I/O performance should always be a key consideration for data warehouse designers
and administrators.

The typical workload in a data warehouse is especially I/O intensive, with operations
such as large data loads, index builds, creation of materialized views, and queries
over large volumes of data. The underlying I/O system for the data warehouse should be
designed to meet these heavy requirements.

Five high-level guidelines for data warehouse I/O configuration

The I/O configuration used by a data warehouse will depend on the characteristics of
the specific storage and server capabilities, so the material in this chapter is only
intended to provide guidelines for designing and tuning I/O systems.

Configure I/O bandwidth not capacity

 Storage configurations for the data warehouse should be chosen based on the I/O
bandwidth they can provide, and not necessarily on their overall storage capacity.
 Buying storage based solely on capacity has the potential for making a mistake,
especially for systems less than 500 GB in total size. The capacity of individual
disk drives is growing faster than the I/O throughput rates provided by those disks,
leading to a situation in which a small number of disks can store a large volume
of data. E.g. consider a 200 GB data mart using 72 GB drives: this data mart could
be built with as few as six drives in a fully mirrored environment. However, six
drives might not provide enough I/O bandwidth to handle a minimum number of
concurrent users on a 4-CPU server. Thus, even though six drives provide
sufficient storage, a larger number of drives may be required to provide acceptable
performance for this system.

Stripe far and wide

The guiding principle in configuring an I/O system for a data warehouse is to
maximize I/O bandwidth by having multiple disks and channels access each
database object.
We can do this by striping the data files of the Oracle database.
A striped file is a file distributed across multiple disks. This striping can be
managed by software, or within the storage hardware.

Use Redundancy
Because data warehouses are often the largest database systems in a company,
they have the most disks and thus are also the most susceptible to the failure of
a single disk. Therefore disk redundancy is a requirement for a data warehouse to
protect against hardware failure. Like disk striping, redundancy can be achieved
in many ways using software or hardware.

Test the I/O system before building the database

The most important time to examine and tune the I/O system is before the
database is even created. Once the database files are created it is more difficult
to reconfigure them. While some logical volume managers may support dynamic
reconfiguration of files, other storage configurations may require that files be
rebuilt in order to reconfigure their I/O layout; in both cases considerable system
resources must be devoted to this reconfiguration.

Plan for Growth:

The data warehouse designer should plan for the future growth of the data warehouse.
There are many approaches to handling growth in the system, and the key
consideration is to be able to grow the I/O system without compromising on the I/O
bandwidth.

3.6. Index
A database index is a data structure that improves the speed of data retrieval
operations on a database table at the cost of additional writes and storage space to
maintain the index data structure.

Indexes are used to quickly locate data without having to search every row in a database
table every time the table is accessed.

Indexes can be created using one or more columns of a database table, providing the
basis for both rapid random lookups and efficient access to ordered records.

An index is a copy of selected columns of data from a table that can be searched very
efficiently, and that also includes a low-level disk block address or a direct link to the
complete row of data it was copied from.

Types of Index:
Bitmap index

A bitmap index is a special kind of indexing that stores the bulk of its data as bit
arrays (bitmaps) and answers most queries by performing bitwise logical
operations on these bitmaps. The most commonly used indexes, such as B+ trees,
are most efficient if the values they index do not repeat or repeat a small number of
times. In contrast, the bitmap index is designed for cases where the values of a
variable repeat very frequently. For example, the sex field in a customer database
usually contains at most three distinct values: male, female or unknown (not
recorded). For such variables, the bitmap index can have a significant performance
advantage over the commonly used trees.
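To make the bitwise idea concrete, here is a minimal Python sketch (not any particular database's implementation) that builds one bitmap per distinct value of a low-cardinality column over a small, made-up customer table and answers an AND query with a single bitwise operation.

# Minimal bitmap-index sketch: one bit array per distinct value of a low-cardinality column.
# Hypothetical data; Python integers serve as arbitrary-length bit arrays.
rows = [
    {"id": 0, "sex": "F", "region": "east"},
    {"id": 1, "sex": "M", "region": "west"},
    {"id": 2, "sex": "F", "region": "west"},
    {"id": 3, "sex": "F", "region": "east"},
]

def build_bitmap(rows, column):
    """Return {value: bitmap} where bit i is set if row i holds that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps

sex_idx = build_bitmap(rows, "sex")
region_idx = build_bitmap(rows, "region")

# Query: sex = 'F' AND region = 'east' is answered with one bitwise AND over the bitmaps.
hits = sex_idx["F"] & region_idx["east"]
matching_ids = [i for i in range(len(rows)) if hits >> i & 1]
print(matching_ids)  # [0, 3]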

Dense index

A dense index in databases is a file with pairs of keys and pointers for
every record in the data file. Every key in this file is associated with a particular
pointer to a record in the sorted data file. In clustered indices with duplicate keys,
the dense index points to the first record with that key.

Sparse index

A sparse index in databases is a file with pairs of keys and pointers for
every block in the data file. Every key in this file is associated with a particular
pointer to the block in the sorted data file. In clustered indices with duplicate keys,
the sparse index points to the lowest search key in each block.

Reverse index

A reverse key index reverses the key value before entering it in the index. E.g., the
value 24538 becomes 83542 in the index. Reversing the key value is particularly
useful for indexing data such as sequence numbers, where new key values
monotonically increase.
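As a tiny illustration of the key-reversal idea (the index structure itself is omitted, and the helper name is ours), the function below simply reverses the digits of a key, as in the 24538 to 83542 example:

def reverse_key(value: int) -> int:
    """Reverse the digits of a key, e.g. 24538 -> 83542, so that monotonically
    increasing keys are spread across different index blocks."""
    return int(str(value)[::-1])

print(reverse_key(24538))  # 83542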

Student Note:

In most cases, an index is used to quickly locate the data record(s) from which
the required data is read. In other words, the index is only used to locate data
records in the table and not to return data.

Index architecture / indexing methods

Non-clustered index:
 The physical order of the rows is not the same as the index order.
 The indexed columns are typically non-primary-key columns used in JOIN, WHERE and ORDER BY clauses.
 There can be more than one non-clustered index on a database table.

Clustered:
Clustering alters the data blocks into a certain distinct order to match the index,
resulting in the row data being stored in order; therefore, only one clustered index
can be created on a given database table. Clustered indices can greatly increase the
overall speed of retrieval, but usually only where the data is accessed sequentially in
the same or reverse order of the clustered index, or when a range of items is selected.
Cluster:
When multiple databases and multiple tables are joined, it is referred to as a cluster.
The records for the tables sharing the value of a cluster key are stored together in the
same or nearby data blocks. This may improve the joins of these tables on the cluster key,
since the matching records are stored together and less I/O is required to locate them.
A cluster can be keyed with a B-tree index or a hash table. The data block where the
table record is stored is defined by the value of the cluster key.

3.7. Materialized view
Typically, data flows from one or more OLTP databases into a data warehouse on a
monthly, weekly or daily basis. The data is normally processed in a staging file
before being added to the data warehouse. Data warehouses commonly range in size
from tens of gigabytes to a few terabytes. Usually, the vast majority of the data is
stored in a few very large fact tables.
One technology employed in data warehouses to improve performance is
the creation of summaries. Summaries are a special kind of aggregate view that
improve query execution time by precalculating expensive join and aggregation
operations prior to execution and storing the results in a table in the database.
For example, we can create a table to contain the sum of sales by region and by product.
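As an illustration of the idea (not Oracle's materialized view mechanism), the following sketch uses pandas and a hypothetical sales table to precompute the sum of sales by region and product once, so that later queries read the small summary instead of re-aggregating the detail rows.

import pandas as pd

# Hypothetical detail-level fact data.
sales = pd.DataFrame({
    "region":  ["east", "east", "west", "west", "east"],
    "product": ["tv",   "radio", "tv",  "tv",   "radio"],
    "amount":  [100.0,  40.0,    250.0, 80.0,   60.0],
})

# "Summary": the expensive aggregation is computed once and stored...
sales_summary = sales.groupby(["region", "product"], as_index=False)["amount"].sum()

# ...so later queries read the small precomputed table instead of scanning the fact table.
east_tv = sales_summary.query("region == 'east' and product == 'tv'")
print(east_tv)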

The summaries or aggregates that are referred to in books and literature on data
warehousing are created in Oracle using a schema object called a materialized view.

Materialized views can perform a number of roles, such as improving query
performance or providing replicated data.

In a data warehouse we can use materialized views to precompute and store aggregated
data, such as the sum of sales. Materialized views in these environments are often
referred to as summaries, because they summarize data.

They can also be used to precompute joins, with or without aggregation.

The need for materialized views

 To increase the speed of queries on very large databases. Queries on large databases
often involve joins between tables, aggregations such as SUM, or both. These
operations are expensive in terms of time and processing power.
 The type of materialized view we create determines how the materialized view is
refreshed and how it is used by query rewrite.
 We can use almost identical syntax to perform a number of roles. For example, a
materialized view can replicate data, a process formerly achieved by using the
CREATE SNAPSHOT statement (CREATE MATERIALIZED VIEW is a synonym for CREATE SNAPSHOT).
 Materialized views improve query performance by precalculating expensive join and
aggregation operations on the database prior to execution and storing the results
in the database.

Types of materialized views

- Materialized views with aggregates
- Materialized views containing only joins
- Nested materialized views
4

Data warehouse Technologies and Implementation

4.1 Introduction
Before we create an architecture for a data warehouse, we must first understand the major
processes that constitute a data warehouse.

These processes can be explained by the following figure.

Fig: process flow within the data warehouse.

4.2 Staging area


A staging area, or landing zone, is an intermediate storage area used for data processing during
the extract, transform and load (ETL) process. The data staging area sits between the data source(s)
and the data target(s), which are often data warehouses, data marts, or other data repositories. Data
staging areas are often transient in nature, with their contents being erased prior to running an ETL
process or immediately following successful completion of an ETL process.
Staging areas can be designed to provide many benefits, but the primary motivations for their use
are to increase efficiency of ETL processes, ensure data integrity and support data quality
operations.

Functions of a staging area:

1. Aggregate precalculation (complex calculations and application of complex business logic may be done in a staging area)
2. Consolidation (combining and mixing in one storage)
3. Independent scheduling (data collected from different sources at different times are processed at a single time)
4. Cleaning data
4.3 Data Extraction
Data extraction is where data is analyzed and crawled through to retrieve
relevant information from data sources (like a database) in a specific pattern. Further
data processing is done, which involves adding metadata and other data integration;
another process in the data workflow.

The majority of data extraction comes from unstructured data sources and different data
formats. This unstructured data can be in any form, such as tables, indexes, and
analytics.

During extraction, the desired data is identified and extracted from many different
sources, including database systems and applications. Very often it is not possible to
identify the specific subset of interest, so more data than necessary has to be
extracted.

The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending
on the source system and the business situation.

How to control the extract process

The mechanisms that determine when to start extracting the data, run the transformations
and consistency checks, and so on are very important.

For example, it may be inappropriate to start the process that extracts EPOS transactions for a
retail sales analysis data warehouse until all EPOS transactions have been received
from all stores.

When to initiate the extract

Data should be in a consistent state when it is extracted from the source system.
Source data should be extracted only at a point where it represents the same instance
of time as the extracts from the other data sources.

Ways to perform an extract:

Update notification: if the source system is able to provide a notification that a record
has been changed and to describe the change, this is the easiest way to get the data.

Incremental extract: some systems may not be able to provide notification that an
update has occurred, but they are able to identify which records have been modified
and provide an extract of those records. During further ETL processing, the system needs
to identify the changes and propagate them down.

Full extract: some systems are not able to identify which data has been changed at all,
so a full extract is the only way to get the data out of the system. A full extract
requires keeping a copy of the last extract in the same format in order to be able to
identify changes. A full extract handles deletions as well.
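The following sketch illustrates, under the assumption of a simple in-memory source keyed by id, how a full extract differs from deriving incremental changes by comparing the new extract with a kept copy of the last one; the data and function names are hypothetical.

# Hypothetical source rows keyed by id; values are the row contents.
last_extract = {1: ("alice", 100), 2: ("bob", 200), 3: ("carol", 300)}
current_source = {1: ("alice", 100), 2: ("bob", 250), 4: ("dave", 400)}

def full_extract(source):
    """Full extract: pull everything; changes are derived later by comparison."""
    return dict(source)

def incremental_extract(source, previous):
    """Derive inserts/updates/deletes by comparing the full extract with a kept
    copy of the last extract, as the text describes."""
    inserted = {k: v for k, v in source.items() if k not in previous}
    updated  = {k: v for k, v in source.items() if k in previous and previous[k] != v}
    deleted  = [k for k in previous if k not in source]
    return inserted, updated, deleted

ins, upd, dele = incremental_extract(full_extract(current_source), last_extract)
print(ins)   # {4: ('dave', 400)}
print(upd)   # {2: ('bob', 250)}
print(dele)  # [3]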

4.4 Data Transportation


Transportation is the operation of moving data from one system to another. In a
data warehouse environment the most common requirements for transportation are
moving data from:

 A source system to the data warehouse database

 A staging database to the data warehouse database

 A data warehouse to a data mart

Three basic choices for transporting data in a data warehouse:

1. Transportation using flat files

The most common method for transporting data is by the transfer of flat files, using
mechanisms such as FTP or other remote file access protocols. Data is exported from
the source system into flat files and is then transported to the target platform using
FTP or a similar mechanism.

2. Transportation through distributed operation

Distributed queries can be effective mechanisms for extracting data. These mechanisms
also transport the data directly to the target systems, thus providing both extraction and
transformation in a single step. As opposed to flat file transportation, the success or failure
of the transportation is recognized immediately with the result of the distributed query or
transaction.

3. Transportation using transportable tablespaces

Oracle transportable tablespaces are the fastest way to move large volumes of data
between two Oracle databases.
Prior to the introduction of transportable tablespaces, the most scalable data
transportation mechanisms relied on moving flat files containing raw data. This
mechanism required that data be unloaded or exported into files from the source database;
after transportation, these files were loaded or imported into the target database.

Transportable tablespaces entirely bypass the unload and reload steps.

Using transportable tablespaces, Oracle data files can be directly transported from one
database to another.

4.5 Data Transformation


Data transformation is often the most complex and most costly part (in terms of
processing time) of the ETL process. Transformations can range from simple data
conversions to extremely complex data scrubbing techniques.

It can also be described as the process that takes the loaded data and structures it for
query performance and for minimizing operational cost.

Before the transformation of data takes place, the data needs to be cleaned and checked
in the following ways.

A) CLEANING:

I. Make sure data is consistent within itself: when we take a row of data and
examine it, the content of the row must make sense. Errors at this point are
mainly to do with errors in the source system. Typical checks are for nonsensical
phone numbers, addresses, and so on.

II. Make sure data is consistent with other data within the same source: when
we examine the data against other tables within the same source, the data
must make sense, e.g. checking for the existence of the stock keeping unit /
customer in a transaction by comparing it with the list of valid SKUs /
customers.

III. Make sure data is consistent with other data in other source systems:
this is when we examine a record and compare it with a similar record in a
different source system, e.g. reconciling a customer record with a copy in a
customer database and a copy in a customer event database. These checks
are the most complex and are likely to result in the application of complex
business rules to resolve any discrepancies (inconsistencies, differences).

IV. Make sure data is consistent with the information already in the warehouse:
this is when we ensure that any data being loaded does not contradict the
information already within the data warehouse, e.g. updating information
about the product hierarchy, where the changes need to be controlled
carefully so as not to render meaningless any of the existing information
already in the data warehouse.

B) Filtering: selecting only certain columns to load.

C) Splitting: splitting a column into multiple columns, or vice versa.

D) Rollup & drill-down: joining data together from multiple sources.

Data transformation can be done in the following ways:

 Smoothing:

o Smoothing is the process which works to remove noise from the data;
such techniques include clustering and regression.

 Aggregation:

o Aggregation is a process where summary or aggregation operations are
applied to the data, e.g. daily sales may be aggregated to compute
monthly or annual totals.

 Generalization of the data:

o In this process low-level or primitive (raw) data are replaced by higher-level
concepts through the use of concept hierarchies, e.g. a categorical attribute
like street can be generalized to a higher-level concept like city or country.

 Normalization: where the attribute data are scaled so as to fall within a small
specified range, such as

-1.0 to 1.0

0.0 to 1.0

Min-max normalization

Min-max normalization performs a linear transformation on the original data.

Suppose that minA and maxA are the minimum and maximum values of attribute A.

Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing

v' = ((v - minA) / (maxA - minA)) x (new_maxA - new_minA) + new_minA

Example: suppose that the minimum and maximum values of the attribute income are
$12,000 and $98,000, and we would like to map income to the range [0.0, 1.0].

By min-max normalization, a value of $73,600 for income is transformed to

v' = ((73,600 - 12,000) / (98,000 - 12,000)) x (1.0 - 0) + 0 = 0.716
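A small Python sketch of the min-max formula above, reproducing the income example (the helper name min_max_normalize is ours):

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map a value v of attribute A into the range [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the text: min $12,000, max $98,000, value $73,600.
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716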

4.6 Loading
Before any transformation can occur within the database, the raw data
must become accessible to the database. This step is called loading.

Fig: loading

SQL*Loader is used to move data from flat files into an Oracle data
warehouse.

OCI (Oracle Call Interface) and the direct path API (application program
interface) are frequently used when the transformation and computation
are done outside the database and there is no need for flat file staging.

A load manager is used in the loading mechanism. The load manager
performs all the operations necessary to support the extract and load
process. The size and complexity of the load manager will vary between
specific solutions from data warehouse to data warehouse: the larger the
degree of overlap between the sources, the larger the load manager will
be.

The load manager is typically constructed using a combination of shell
scripts and C programs.
The architecture of a load manager performs the following operations:

1. Extract the data from the source system.

2. Validate the data for accuracy.

3. Clean the data by eliminating meaningless values and make it usable.

4. Fast load the extracted data into a temporary data store.

5. Perform simple transformations into a structure similar to the one in the data warehouse.

Fig: load manager
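As a rough illustration of these five steps (the functions, cleaning rules and sample rows are all hypothetical, not a real load manager), a load pipeline might be sketched like this:

def extract(source_rows):
    return list(source_rows)                                 # 1. extract from the source system

def validate(rows):
    return [r for r in rows if r.get("id") is not None]      # 2. check accuracy / completeness

def clean(rows):
    return [r for r in rows if r.get("amount", 0) >= 0]      # 3. drop meaningless values

def fast_load(rows, staging):
    staging.extend(rows)                                     # 4. fast load into a temporary store
    return staging

def simple_transform(rows):
    # 5. shape the rows like the warehouse structure
    return [{"id": r["id"], "amount": round(r["amount"], 2)} for r in rows]

source = [{"id": 1, "amount": 10.456}, {"id": None, "amount": 5}, {"id": 2, "amount": -3}]
staging = fast_load(clean(validate(extract(source))), [])
print(simple_transform(staging))  # [{'id': 1, 'amount': 10.46}]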


4.7 Refreshing
We must update our data warehouse on a regular basis to ensure that the
information derived from it is current. This process of updating the
data is called the refresh process.

ETL is done on a scheduled basis to reflect changes made to the original
source system. During this step we physically insert the new, updated data
and take all the other steps necessary to make this new data available to
the users. Once all of this data has been loaded into the data warehouse,
the materialized views must be updated to reflect the latest data.

The partitioning scheme of the data warehouse is often crucial in
determining the efficiency of refresh operations in the data warehouse
loading process.

Most data warehouses are loaded on a regular schedule: every night, every
week, or every month, new data is brought into the data warehouse. The
data being loaded at the end of the week/month typically corresponds to
the transactions for the week/month. In this scenario the data warehouse is
being loaded by time, which suggests that the data warehouse tables
should be partitioned on a date column. In the data warehouse example,
suppose the new data is loaded into the sales table every month and that
the sales table has been partitioned by month. These steps show how the
load process will proceed to add the data for a new month to the sales table.
5

Data warehouse To Data mining


5. Introduction
5.1. Data mining Architecture
5.2. Design of Data Mining
5.3. Data Warehouse Architecture
5.4. Data warehouse Model
5.4.1.Enterprise data warehouse
5.4.2.Data mart
5.4.3.Virtual warehouse
5.5. OLAP
5.5.1.Architecture OLAP
5.5.2.OLAP operations on multidimensional data model
5.5.3.Types of OLAP server
5.5.3.1. ROLAP
5.5.3.2. MOLAP
5.5.3.3. HOLAP
5.5.3.4. Comparison chart ROLAP and MOLAP
5.6. OLAP TO OLAM
5.6.1.Architecture of integrated OLAP TO OLAM

5 Introduction

5.1 Data Mining Architecture


Data mining is a very important process where potentially useful and previously
unknown information is extracted from large volumes of data. There are a number of
components involved in the data mining process. These components constitute the
architecture of a data mining system.
Data Mining Architecture

The major components of any data mining system are data source, data warehouse
server, data mining engine, pattern evaluation module, graphical user interface and
knowledge base

Fig: data mining architecture


Data Sources

Database, data warehouse, World Wide Web (WWW), text files and other documents
are the actual sources of data. You need large volumes of historical data for data mining
to be successful. Organizations usually store data in databases or data warehouses.
Data warehouses may contain one or more databases, text files, spreadsheets or other
kinds of information repositories. Sometimes, data may reside even in plain text files or
spreadsheets. World Wide Web or the Internet is another big source of data.

Different Processes

The data needs to be cleaned, integrated and selected before passing it to the database
or data warehouse server. As the data is from different sources and in different formats,
it cannot be used directly for the data mining process because the data might not be
complete and reliable. So, first data needs to be cleaned and integrated. Again, more
data than required will be collected from different data sources and only the data of
interest needs to be selected and passed to the server. These processes are not as
simple as we think. A number of techniques may be performed on the data as part of
cleaning, integration and selection.

Data cleaning tasks-

Filling in missing values: data is not always available. Missing data may occur
due to equipment malfunction, data being inconsistent with other data and thus deleted,
or data not entered due to misunderstanding. So we need to handle missing data.
Missing data is handled in the following ways:

 Filling in the data manually

 Imputation: use the attribute mean to fill in missing values (see the sketch below).
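A minimal sketch of mean imputation on a hypothetical income attribute, where None marks a missing value:

# Filling missing values with the attribute mean (imputation).
incomes = [52000, None, 61000, None, 47000]

known = [v for v in incomes if v is not None]
mean_income = sum(known) / len(known)          # mean of the observed values

imputed = [v if v is not None else mean_income for v in incomes]
print(imputed)  # [52000, 53333.33..., 61000, 53333.33..., 47000]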

Integration:

Integration combines data from multiple sources into a coherent store. It merges the data
from multiple data sources.

b) Database or Data Warehouse Server

The database or data warehouse server contains the actual data that is ready to be
processed. Hence, the server is responsible for retrieving the relevant data based on
the data mining request of the user.

c) Data Mining Engine

The data mining engine is the core component of any data mining system. It consists of
a number of modules for performing data mining tasks including association,
classification, characterization, clustering, prediction, time-series analysis etc.
This basically involves the following tasks:

Anomaly detection (outlier/change/deviation detection): the identification of unusual data records
that might be interesting, or data errors that require further investigation.

Association rule mining: searching for relationships among variables. For example, a
supermarket might gather data on customer purchasing habits; using association rule
learning, the supermarket can determine which products are frequently bought together
and use this information for marketing purposes.

Clustering: the process of grouping abstract objects into classes of similar objects.
A cluster of objects can be treated as one group.

Deviation analysis: a reality-based technology that gets a machine or process back
online quickly when a deviation occurs.

d) Pattern Evaluation Modules

The pattern evaluation module is mainly responsible for the measure of interestingness
of the pattern by using a threshold value. It interacts with the data mining engine to
focus the search towards interesting patterns.

e) Graphical User Interface

The graphical user interface module communicates between the user and the data
mining system. This module helps the user use the system easily and efficiently without
knowing the real complexity behind the process. When the user specifies a query or a
task, this module interacts with the data mining system and displays the result in an
easily understandable manner.

f) Knowledge Base

The knowledge base is helpful in the whole data mining process. It is useful for guiding
the search or evaluating the interestingness of the result patterns. The knowledge base
might even contain user beliefs and data from user experiences that can be useful in
the process of data mining. The data mining engine might get inputs from the
knowledge base to make the result more accurate and reliable. The pattern evaluation
module interacts with the knowledge base on a regular basis to get inputs and also to
update it.
Summary

Each and every component of data mining system has its own role and importance in
completing data mining efficiently. These different modules need to interact correctly
with each other in order to complete the complex process of data mining successfully.

5.2 Design of data mining


The architecture and design of a data mining system is critically important. Because of
its popularity and diverse applications, it is expected that a good variety of data mining
systems will be designed and developed in the coming years.

A good system architecture will enable the data mining system to make the best use of the
software environment and accomplish data mining tasks in an efficient and timely manner.

Based on how tightly the data mining system is coupled with a database or data warehouse
system, the possible designs are:

1. No coupling
2. Loose coupling
3. Semi-tight coupling
4. Tight coupling
5.3 Data Warehouse Architecture
The three tier architecture of data warehouse can be explained using Bottom tier,
middle tier and top tier.

Bottom tier-
The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities or Gateways are used to feed data into
the bottom tier from operational databases or other external sources (such as customer
profile information provided by external consultants). These tools and utilities perform
data extraction, cleaning, and transformation. The data are extracted using application
program interfaces known as gateways.

A gateway is supported by the underlying DBMS and allows client programs to generate
SQL code to be executed at the server.

[Examples of gateways: ODBC (Open Database Connectivity), OLE DB (Object Linking and
Embedding for Databases) and JDBC (Java Database Connectivity).]

Data Warehouse Back-End Tools and Utilities

Data extraction: gets data from multiple, heterogeneous, and external sources

Data cleaning: detects errors in the data and rectifies them when possible

Data transformation: converts data from legacy or host format to warehouse format

Load: sorts, summarizes, consolidates, computes views, checks integrity, and builds indexes
and partitions

Refresh: propagates the updates from the data sources to the warehouse

Metadata Repository
Metadata is the data defining warehouse objects. It includes the following kinds:
 Description of the structure of the warehouse: schema, views, dimensions, hierarchies,
derived data definitions, data mart locations and contents
 Operational metadata: data lineage (history of migrated data and transformation
paths), currency of data (active, archived, or purged), monitoring information
(warehouse usage statistics, error reports, audit trails)
 The algorithms used for summarization
 The mapping from the operational environment to the data warehouse
 Data related to system performance: warehouse schema, view and derived data
definitions

Diagram: A three-tier data warehousing architecture

Middle tier-
The middle tier is an OLAP server that is typically implemented using either

(i) A relational OLAP (ROLAP) model that is an extended relational DBMS that
maps operations on multidimensional data to standard relational operations.

(ii) A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that
directly implements multidimensional data and operations.

Top Tier-
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

5.4 From an architectural point of view, there are three data warehouse models.

5.4.1 Enterprise Data Warehouse

An enterprise warehouse collects all of the information about subjects spanning the entire
organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope.

It typically contains detailed data as well as summarized data, and can range in size from
a few gigabytes to hundreds of gigabytes, terabytes, or beyond. It requires extensive
business modeling and may take years to design and build.

An enterprise data warehouse may be implemented on traditional mainframes, UNIX


super servers or parallel architecture platforms.

5.4.2 Data Mart

A data mart contains a subset of corporate-wide data that is of value to a specific group
of users……..

5.4.3 Virtual Warehouse.

A virtual warehouse is a set of views over operational data bases. For efficient query
processing only some of the possible summary views may be materialized. A virtual
warehouse is easy to build but requires excess capacity on operational database server.

5.5 OLAP Architecture


As in Data Warehouse architecture the components are similar in OLAP. Data are
collected from operational source and these data are preprocessed and passed to the
data warehouse server through ETL. Data warehouse is responsible to collect the data
and store before it is passed to the OLAP servers.

The next layer is an OLAP server that is typically implemented using either

(i) A relational OLAP (ROLAP) model that is an extended relational DBMS that
maps operations on multidimensional data to standard relational operations.

(ii) A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that
directly implements multidimensional data and operations.

The topmost layer is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
5.5.1 Types of OLAP Servers
Logically, OLAP servers present business users with multidimensional data from data
warehouses or data marts, without concerns regarding how or where the data are stored.
However, the physical architecture and implementation of OLAP servers must consider
data storage issues.

Implementations of warehouse servers for OLAP processing include the following:

5.5.1.1 ROLAP

These are intermediate servers that stand between a relational back-end server and
client front-end tools. They use a relational or extended-relational DBMS to store and
manage warehouse data.

ROLAP includes optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.

ROLAP technology tends to have greater scalability than MOLAP technology.

The DSS server of MicroStrategy and Metacube of Informix adopt ROLAP technology.

5.5.1.2 MOLAP
These servers support multidimensional views of data through array-based
multidimensional storage engines.

If the data is stored in a relational database, it can be viewed multidimensionally, but only
by successively accessing and processing a table for each dimension or aspect of the
data. MOLAP, in contrast, processes data that is already stored in multidimensional arrays
in which all possible combinations are reflected.

The advantage of using a data cube is that it allows fast indexing to precomputed
summarized data. Notice that with multidimensional data stores, the storage utilization
may be low if the data set is sparse.

Many MOLAP servers adopt a two-level storage representation to handle sparse and
dense data sets: the dense subcubes are identified and stored as array structures,
while the sparse subcubes employ compression technology for efficient storage utilization.

Example of a MOLAP server: Essbase of Arbor Software.

5.5.1.3 HOLAP

The hybrid OLAP server approach combines ROLAP and MOLAP technology,
benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
For example, Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.
5.5.1.4 COMPARISON CHART

BASIS FOR COMPARISON | ROLAP | MOLAP
Full form | ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing.
Storage & fetching | Data is stored and fetched from the main data warehouse. | Data is stored and fetched from proprietary multidimensional databases (MDDBs).
Data form | Data is stored in the form of relational tables. | Data is stored in large multidimensional arrays made of data cubes.
Data volumes | Large data volumes. | Limited summary data is kept in MDDBs.
Technology | Uses complex SQL queries to fetch data from the main warehouse. | The MOLAP engine creates precalculated and prefabricated data cubes for multidimensional data views; sparse matrix technology is used to manage data sparsity.
View | ROLAP creates a multidimensional view of data dynamically. | MOLAP already stores the static multidimensional view of data in MDDBs.
Access | Slow access. | Faster access.


5.5.1.5 OLAP operations on a multidimensional cube

Roll-up: aggregates data by climbing up a concept hierarchy or by reducing a dimension.
Drill-down: the reverse of roll-up; navigates from less detailed data to more detailed data.
Slice: selects one particular value of a single dimension, giving a sub-cube.
Dice: selects values on two or more dimensions, giving a sub-cube.
Pivot (rotate): rotates the data axes to provide an alternative presentation of the data.

Fig: OLAP operations on a multidimensional cube
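To illustrate these operations conceptually (using pandas on a small hypothetical sales cube rather than a real OLAP server), roll-up, drill-down, slice, dice and pivot can be sketched as follows:

import pandas as pd

# Hypothetical sales cube: dimensions (year, quarter, city), measure (sales).
cube = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "city":    ["Kathmandu", "Kathmandu", "Pokhara", "Pokhara", "Kathmandu", "Pokhara"],
    "sales":   [100, 120, 80, 90, 130, 95],
})

# Roll-up: climb the hierarchy quarter -> year (aggregate away quarter).
rollup = cube.groupby("year")["sales"].sum()

# Drill-down: go back to the finer (year, quarter) level.
drilldown = cube.groupby(["year", "quarter"])["sales"].sum()

# Slice: fix one dimension (quarter = 'Q1').
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = cube[(cube["quarter"].isin(["Q1", "Q2"])) & (cube["city"] == "Kathmandu")]

# Pivot: rotate the axes to view city against quarter.
pivot = cube.pivot_table(index="city", columns="quarter", values="sales", aggfunc="sum")
print(rollup, pivot, sep="\n")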

5.6 OLAP to OLAM


WHY OLAP TO OLAM?

In the field of data mining, substantial research has been performed on data mining over
various platforms, including transaction databases, relational databases, spatial
databases, text databases, time-series databases, flat files, data warehouses and so on.

The integration of OLAP with data mining is called OLAP mining, or OLAM. The architecture of
OLAM is particularly important for the following reasons.
 High quality of data in data warehouse:
o Most data mining tools need to work on integrated, consistent and cleaned
data, which requires costly data cleaning, data transformation and data
integration as preprocessing steps. A data warehouse constructed by such
preprocessing serves as valuable sources of high quality data for OLAP
as well as for data mining.
 Available information processing infrastructure surrounding data warehouses:
o Comprehensive information processing and data analysis infrastructure
has been, or will be, systematically constructed surrounding data
warehouses, which includes accessing, integration, consolidation and
transformation of multiple heterogeneous databases.
 OLAP-based exploratory data analysis:
o Effective data mining needs exploratory data analysis. A user will often
want to traverse through a database, select portions of relevant data,
analyze them at different granularities, and present knowledge/results in
different forms. On-line analytical mining provides facilities for data mining
on different subsets of data and at different levels of abstraction, by drilling,
pivoting, filtering, dicing and slicing on a data cube and on some
intermediate data mining results.

5.6.1 Architecture for OLAM

An OLAM server performs analytical mining on data cubes in a similar manner as an
OLAP server performs on-line analytical processing.

An integrated OLAM and OLAP architecture is shown in figure.


OLAP and OLAM servers both accept user online queries via a graphical user
interface and work with the data cube in the data analysis via a cube API. A metadata
directory is used to guide the access of the data cube. The data cube can be
constructed by accessing and/or integrating multiple databases via an MDDB API,
and/or by filtering a data warehouse via a database API.
Since an OLAM server may perform multiple data mining tasks, such as concept
description, association, classification, prediction and clustering, it is more
sophisticated than an OLAP server.

6
Data Mining Approaches and
Methods
6. Introduction
6.1. Data mining techniques
6.2. Data mining tasks
6.3. Classification
6.4. Prediction
6.5. Decision tree
6.6. Rule based classification
6.7. Back propagation
6.8. Genetic algorithm
6.9. Regression
6.9.1. Linear regression
6.9.2. Non-Linear regression
6.10. Association rules and mining frequent patterns
6.11. Clustering
6.11.1. Partitioning method
6.11.1.1. K mean
6.11.1.2. K medoids
6.11.2. Hierarchical method
6.11.2.1. Agglomerative
6.11.3. Divisive

6. Introduction
6.1. Data mining techniques
Association – Association is one of the widely-known data mining techniques. Under
this, a pattern is deciphered based on a relationship between items in the same
transaction. Hence, it is also known as relation technique. Big brand retailers rely on this
technique to research customer’s buying habits/preferences. For example, when
tracking people’s buying habits, retailers might identify that a customer always buys
cream when they buy chocolates, and therefore suggest that the next time that they buy
chocolates they might also want to buy cream.

Data mining is not so much a single technique as the idea that there is more
knowledge hidden in the data than shows itself on the surface. Any technique that helps
extract more out of data is useful, so data mining techniques form quite a
heterogeneous group.

Association and correlation analysis is usually used to find frequent itemsets in large data
sets. This type of finding helps businesses make certain decisions, such as catalogue
design, cross-marketing and customer shopping behavior analysis.

Association rule algorithms need to be able to generate rules with confidence values less
than one.

Classification – This data mining technique differs from the above in a way that it is
based on machine learning and uses mathematical techniques such as Linear
programming, Decision trees, Neural network. In classification, companies try to build
software that can learn how to classify the data items into groups. For instance, a
company can define a classification in the application that “given all records of
employees who offered to resign from the company, predict the number of individuals
who are likely to resign from the company in future.” Under such a scenario, the
company can classify the records of employees into two groups that namely “leave” and
“stay”. It can then use its data mining software to classify the employees into separate
groups created earlier.

Classification is the most commonly applied data mining techniques, which employs a
set of pre-classified examples to develop a model that can classify the population of
record at large.

Fraud detection and credit risk application are particularly well suited to this type of
analysis. This approach frequently employs decision tree or neural network based
classification algorithm.
Clustering – Different objects exhibiting similar characteristics are grouped together in
a single cluster via automation. Many such clusters are created as classes and objects
(with similar characteristics) are placed in it accordingly. To understand this better, let us
consider an example of book management in the library. In a library, the vast collection
of books is fully cataloged. Items of the same type are listed together. This makes it
easier for us to find a book of our interest. Similarly, by using the clustering technique,
we can keep books that have some kinds of similarities in one cluster and assign it a
suitable name. So, if a reader is looking to grab a book relevant to his interest, he only
has to go to that shelf instead of searching the entire library. Thus, clustering technique
defines the classes and puts objects in each class, while in the classification
techniques, objects are assigned into predefined classes.

By using clustering techniques we can further identify dense and sparse regions in
object space and can discover overall distribution pattern and correlation among data
attributes.

Types of clustering

 Partitioning method
 Hierarchical
 Agglomerative method

Prediction – The prediction is a data mining technique that is often used in combination
with the other data mining techniques. It involves analyzing trends, classification,
pattern matching, and relation. By analyzing past events or instances in a proper
sequence one can safely predict a future event. For instance, the prediction analysis
technique can be used in the sale to predict future profit if the sale is chosen as an
independent variable and profit as a variable dependent on sale. Then, based on the
historical sale and profit data, one can draw a fitted regression curve that is used for
profit prediction.

Decision trees – Within the decision tree, we start with a simple question that has
multiple answers. Each answer leads to a further question to help classify or identify the
data so that it can be categorized, or so that a prediction can be made based on each
answer. For example, We use the following decision tree to determine whether or not to
play cricket ODI: Data Mining Decision Tree: Starting at the root node, if the weather
forecast predicts rain then, we should avoid the match for the day. Alternatively, if the
weather forecast is clear, we should play the match.

Data Mining is at the heart of analytics efforts across a variety of industries and
disciplines like communications, Insurance, Education, Manufacturing, Banking and
Retail and more. Therefore, having correct information about it is essential before applying
the different techniques.
Regression: Regression Techniques can be adapted for prediction. Regression
analysis can be used to model the relationship between one or more independent
variables and a dependent variable (independent variables are attributes already known
and response variables are what we want to predict). Unfortunately, many real-world
problems are not simple prediction tasks; for instance, sales volumes, stock prices and product
failure rates are all difficult to predict because they may depend on complex interactions
of multiple predictor variables.

Neural Network: A Neural Network is an information processing paradigm


that is inspired by the way biological nervous systems, such as the brain, process
information. The key element of this paradigm is the novel structure of the information
processing system. It is composed of a large number of highly interconnected
processing elements (neurons) working in unison to solve specific problems. Neural
Networks, like people, learn by example. A neural network is configured for a specific
application, such as pattern recognition or data classification, through a learning
process. Learning in biological systems involves adjustments to the synaptic
connections that exist between the neurons.

Neural network is a set of connected input output units and each connection has a
weight present with it. During the learning phase network learns by adjusting weight so
as to be able to predict the correct class of the input tuples.
It has remarkable ability to derive meaning from complicated or imprecise data and can
be used to extract patterns and detect trends that are too complex to be noticed by
either human or other computer techniques.
E.g. handwriting character recognition, training a computer to pronounce English
text.

Fig: Components of a neuron          Fig: From a human neuron to an artificial neuron
Feed Forward Network:
Feed-forward ANNs allow signals to travel one way only; from input
to output. There is no feedback (loops) i.e. the output of any layer
does not affect that same layer. Feed-forward ANNs tend to be
straight forward networks that associate inputs with outputs. They
are extensively used in pattern recognition. This type of organization
is also referred to as bottom-up or top-down

6.2. Data mining tasks


Data mining tasks are as follows:
6.2.1. Classification
This data mining technique differs from the above in a way that it is based on machine
learning and uses mathematical techniques such as Linear programming, Decision
trees, Neural network. In classification, companies try to build software that can learn
how to classify the data items into groups. For instance, a company can define a
classification in the application that “given all records of employees who offered to
resign from the company, predict the number of individuals who are likely to resign from
the company in future.” Under such a scenario, the company can classify the records of
employees into two groups that namely “leave” and “stay”. It can then use its data
mining software to classify the employees into separate groups created earlier.
Example:
Suppose that we have a database of customers on the Electronics company mailing
list. The mailing list is used to send out promotional literature describing new products
and upcoming price discounts. The database describes attributes of the customers,
such as their name, age, income, occupation and credit rating. The customers can be
classified as to whether or not they have purchased a computer at this electronics
company. Suppose that new customers are added to the database and that you would
like to notify these customers of an upcoming computer sale. To send out promotional
literature to every new customer in the database can be quite costly. A more cost-efficient
method would be to target only those new customers who are likely to
purchase a new computer. A classification model can be constructed and used for this
purpose.

Classification can be done by the following

6.2.1.1. Decision tree


A decision tree is a flow-chart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and leaf nodes
represent classes or class distributions. The topmost node in the tree is the root node.
A typical decision tree is shown in the figure.

It represents the concept buys_computer; that is, it predicts whether or not a customer at
the Electronics company is likely to purchase a computer. Internal nodes are
represented by rectangles and leaf nodes are denoted by ovals.

In order to classify an unknown sample, the attribute values of the sample are tested
against the decision tree. A path is traced from the root to a leaf node that holds the
class prediction for that sample. Decision trees can easily be converted to classification
rules.
Decision Tree Induction

The basic algorithm for decision tree induction is a greedy algorithm that constructs
decision trees in a top-down, recursive, divide-and-conquer manner.

The algorithm for creating a decision tree is summarized below.

Create a node N;
If samples are all of the same class C then
    return N as a leaf node labeled with class C;
If attribute-list is empty then
    return N as a leaf node labeled with the most common class in samples;
Select test-attribute, the attribute among attribute-list with the highest information gain;
Label node N with test-attribute;
For each known value ai of test-attribute:
    grow a branch from node N for the condition test-attribute = ai;
    let si be the set of samples in samples for which test-attribute = ai;
    if si is empty then
        attach a leaf labeled with the most common class in samples;
    else
        attach the node returned by Generate_decision_tree(si, attribute-list minus test-attribute);

Information Gain, Entropy, Gain

The information gain measure is used to select the test attribute at each node in the tree. Such a
measure is referred to as an attribute selection measure, or a measure of the goodness of
split. The attribute with the highest information gain is chosen as the test attribute for the
current node.

Let S be a set of s data samples. Suppose the class label attribute has m distinct
values, defining m distinct classes Ci (for i = 1, ..., m). Let si be the number of samples of S
in class Ci. The expected information needed to classify a given sample is given by

I(s1, s2, ..., sm) = - Σ (i = 1 to m) pi log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s. Note
that a log function to the base 2 is used since the information is encoded in bits.

Exercise:

From the above training set, answer the following:

1. Find the information gain of the class buys_computer.
2. Find the information gain of the attribute age (youth).
3. Find the information gain of the attribute age (middle_aged).
4. Find the information gain of the attribute age (senior).
5. Find the entropy of age, E(age).

Solution 1: information gain of the class buys_computer

The class label attribute, buys_computer, has two distinct values (namely {yes, no});
therefore, there are two distinct classes (m = 2).

Yes samples = 9
No samples = 5

To compute the expected information, we use the following formula:

I(s1, s2) = -(yes/total) log2(yes/total) - (no/total) log2(no/total)

I(9, 5) = -9/14 log2(9/14) - 5/14 log2(5/14)

= 0.940. Solved.

Solution 2: information gain of attribute age (youth)

Yes samples = 2
No samples = 3

I(2, 3) = -2/5 log2(2/5) - 3/5 log2(3/5)

= 0.971. Solved.

Solution 3: information gain of attribute age (middle_aged)

Yes samples = 4
No samples = 0

I(4, 0) = -4/4 log2(4/4) - 0/4 log2(0/4)

= 0. Solved. (By convention, 0 log2(0) is taken as 0.)

Solution 4: information gain of attribute age (senior) (student task)

= solved

Solution 5: entropy of age

The entropy of age is denoted by E(age).

E(age) = 5/14 × (information gain of age(youth)) + 4/14 × (information gain of age(middle_aged)) + 5/14 × (information gain of age(senior))
       = 5/14 × I(2,3) + 4/14 × I(4,0) + 5/14 × I(3,2)
       = 0.694
Solved.
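The following Python sketch reproduces the calculations above, I(9,5) and E(age), and also derives Gain(age) = I(9,5) − E(age); the helper name info is ours.

from math import log2

def info(*counts):
    """Expected information I(s1,...,sm) = -sum(pi * log2(pi))."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Class buys_computer: 9 "yes" samples, 5 "no" samples.
print(f"{info(9, 5):.3f}")                 # 0.940

# Partitions of the attribute age: youth (2 yes, 3 no),
# middle_aged (4 yes, 0 no), senior (3 yes, 2 no).
partitions = [(2, 3), (4, 0), (3, 2)]
e_age = sum((sum(p) / 14) * info(*p) for p in partitions)
print(f"{e_age:.3f}")                      # 0.694

# Gain(age) = I(9,5) - E(age)
print(f"{info(9, 5) - e_age:.3f}")         # 0.247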

6.2.1.2.Rule based classification


6.2.1.3.Back propagation
Back propagation is a method used in artificial neural networks to calculate
a gradient that is needed in the calculation of the weights to be used in the
network. It is commonly used to train deep neural networks, a term referring to
neural networks with more than one hidden layer.

Back propagation is a special case of an older and more general technique


called automatic differentiation. In the context of learning, back propagation is
commonly used by the gradient descent optimization algorithm to adjust the weight
of neurons by calculating the gradient of the loss function. This technique is also
sometimes called backward propagation of errors, because the error is calculated at
the output and distributed back through the network layers.

The back propagation algorithm has been repeatedly rediscovered and is equivalent
to automatic differentiation in reverse accumulation mode.

Steps of Back Propagation

 Initialize the weights: the weights of the network are initialized to small random
numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated
with it, as explained below. The biases are similarly initialized to small random
numbers.

 Propagate the inputs forward: in this step, the net input and output of each unit
in the hidden and output layers are computed. First, the training sample is fed
to the input layer of the network. Note that for unit j in the input layer, its
output is equal to its input, that is, Oj = Ij for input unit j. The net input to each
unit in the hidden and output layers is computed as a linear combination of its
inputs.

The inputs to the unit are, in fact, the output of the units connected to it in the
previous layer. To compute the net input to the unit, each input connected to
the unit is multiplied by its corresponding weight, and this is summed.

Given a unit j in a hidden or output layer, the net input Ij to unit j is

Ij = Σi (wij × Oi) + θj

where

wij is the weight of the connection from unit i in the previous layer to unit j,

Oi is the output of unit i from the previous layer, and

θj is the bias of the unit (the bias acts as a threshold, in that it serves to
vary the activity of the unit).
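A minimal sketch of computing the net input Ij and a typical logistic (sigmoid) activation for one unit, with made-up weights, outputs and bias:

import math

def net_input(weights, outputs, bias):
    """Ij = sum_i(w_ij * O_i) + theta_j : weighted sum of the inputs plus the bias."""
    return sum(w * o for w, o in zip(weights, outputs)) + bias

def sigmoid(ij):
    """Logistic activation squashing the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-ij))

# Hypothetical unit j with three incoming connections.
w_ij = [0.2, -0.3, 0.4]
o_i  = [1.0, 0.0, 1.0]
theta_j = -0.4

ij = net_input(w_ij, o_i, theta_j)
print(round(ij, 3), round(sigmoid(ij), 3))  # 0.2 0.55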

6.2.1.4.Genetic algorithm
Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic
learning starts as follows.
An initial population is created consisting of randomly generated rules. Each rule can be
represented by a string of bits.

Example: suppose that samples in a given training set are described by two Boolean
attributes, A1 and A2, and that there are two classes, C1 and C2.

The rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string 100, where the
two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit
represents the class.

Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as 001.

If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's
values. Classes can be encoded in a similar fashion.

6.2.1.5.Regression
Introduction
Regression is a data mining function that predicts a number. Profit, sales, mortgage
rates, house values, square footage, temperature, or distance could all be predicted
using regression techniques. For example, a regression model could be used to predict
the value of a house based on location, number of rooms, lot size, and other factors.

A regression task begins with a data set in which the target values are known. For
example, a regression model that predicts house values could be developed based on
observed data for many houses over a period of time. In addition to the value, the data
might track the age of the house, square footage, number of rooms, taxes, school
district, proximity to shopping centers, and so on. House value would be the target, the
other attributes would be the predictors, and the data for each house would constitute a
case.

In the model build (training) process, a regression algorithm estimates the value of the
target as a function of the predictors for each case in the build data. These relationships
between predictors and target are summarized in a model, which can then be applied to
a different data set in which the target values are unknown.

Regression models are tested by computing various statistics that measure the
difference between the predicted values and the expected values. The historical data for
a regression project is typically divided into two data sets: one for building the model,
the other for testing the model.

Regression modeling has many applications in trend analysis, business planning,


marketing, financial forecasting, time series prediction, biomedical and drug response
modeling, and environmental modeling.
6.2.1.6.Linear regression
A linear regression technique can be used if the relationship between the predictors and the target
can be approximated with a straight line.

Regression with a single predictor is the easiest to visualize. Simple linear regression with a single
predictor is shown in Figure.

Figure Linear Regression With a Single Predictor

Linear regression with a single predictor can be expressed with the following equation.

Y = α + βX

In this equation:

 Y is the response (target) variable and X is the predictor variable;

 α is the Y-intercept (a constant);

 β is the regression coefficient (the slope of the line).

These coefficients can be solved for by the method of least squares, which minimizes the
error between the actual data and the estimated line. Given s samples or data points
of the form (x1, y1), (x2, y2), ..., (xs, ys), the regression coefficients can be
estimated using this method with the following equations:

β = Σ (i = 1 to s) (Xi − X̄)(Yi − Ȳ) / Σ (i = 1 to s) (Xi − X̄)²

α = Ȳ − βX̄

where X̄ is the average of X1, X2, ..., Xs

and Ȳ is the average of Y1, Y2, ..., Ys.

Exercise:

Predict the salary of the graduates after 10 years after observing following data sets.

Salary data
X(years of experience) Y (Salary in 1000)
3 30
8 57
9 64
13 72
3 36
6 43
11 59
21 90
1 20
16 83

Solution:

X = number of years of work experience of college graduates

Y = corresponding salary of the graduates

Let us first get Xmean and Ymean……
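A sketch of the full least-squares solution for this exercise in Python (the variable names are ours); with this data it gives β ≈ 3.54, α ≈ 23.2, and a predicted salary of roughly 58.6 (in $1000) after 10 years:

# Least-squares fit of Y = alpha + beta * X on the salary data above,
# then a prediction for a graduate with 10 years of experience.
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_mean = sum(x) / len(x)            # 9.1
y_mean = sum(y) / len(y)            # 55.4

beta = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
       / sum((xi - x_mean) ** 2 for xi in x)
alpha = y_mean - beta * x_mean

print(round(beta, 2), round(alpha, 2))   # ~3.54  ~23.21
print(round(alpha + beta * 10, 1))       # ~58.6  (salary in $1000 after 10 years)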

6.2.1.7.Non-Linear regression
While the linear equation has one basic form, non-linear equations can take many
different forms. The easiest way to determine whether an equation is non-linear is to
look at its terms: if any term is not linear (for example, a variable raised to a power),
the equation is non-linear.

If the given response variable and predictor variable have a relationship that may be
modeled by a polynomial function, polynomial regression can be modeled by adding
polynomial terms to the basic linear model.

By applying transformations to the variables, we can convert the non-linear model into
a linear one that can then be solved by the method of least squares.

Transformation of a polynomial regression model to a linear regression model:

Consider a cubic polynomial relationship given by

Y = α + β1X + β2X² + β3X³

To convert this equation to linear form, we define new variables

X1 = X, X2 = X², X3 = X³

Now the equation becomes

Y = α + β1X1 + β2X2 + β3X3

which can be solved by the method of least squares.
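As a sketch of this transformation in practice (with synthetic data and NumPy's least-squares solver, not part of the text), the cubic model is fitted by constructing the derived columns X, X², X³ and solving the resulting linear system:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2 + 0.5 * x + 0.1 * x**2 - 0.02 * x**3   # synthetic cubic relationship

# Design matrix with a column of ones (intercept) and the transformed variables X, X^2, X^3.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares on the linearized model.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coeffs, 3))   # ~[ 2.    0.5   0.1  -0.02]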

6.2.2. Association rules


6.2.2.1.Market Basket Analysis
Market Basket Analysis

Introduction

Market Basket Analysis is a modeling technique based upon the theory that if you buy a
certain group of items, you are more (or less) likely to buy another group of items. For
example, if you are in a pub and you buy a pint of beer and don't buy a meal, you are
more likely to buy crisps at the same time than somebody who didn't buy beer.

The set of items a customer buys is referred to as an itemset and market basket
analysis seeks to find relationships between these item sets.

Typically the relationship will be in the form of a rule:

IF {beer, no bar meal} THEN {crisps}.

It studies customers' buying patterns and preferences to predict what they will prefer to
purchase along with the existing items in their cart

The algorithms for performing market basket analysis are fairly straightforward. The
complexities mainly arise in dealing with the large amounts of transaction data that
may be available.

A major difficulty is that a large number of the rules found may be trivial for anyone
familiar with the business. Although the volume of data has been reduced, we are still
asking the user to find a needle in a haystack. Requiring rules to have a high minimum
support level and high confidence level risks missing any exploitable result we might
have found. One partial solution to this problem is differential market basket analysis, as
described below.

How is it used?

In retailing, most purchases are bought on impulse. Market basket analysis gives clues
as to what a customer might have bought if the idea had occurred to them. As a first
step, therefore, market basket analysis can be used in deciding the location and
promotion of goods inside a store. If, as has been observed, purchasers of Barbie dolls
are more likely to buy candy, then high-margin candy can be placed near the
Barbie doll display. Customers who would have bought candy with their Barbie dolls had
they thought of it will now be suitably tempted.

But this is only the first level of analysis. Differential market basket analysis can find
interesting results and can also eliminate the problem of a potentially high volume of
trivial results.

In differential analysis, we compare results between different stores, between customers


in different demographic groups, between different days of the week, different seasons
of the year, etc.

If we observe that a rule holds in one store, but not in any other (or does not hold in one
store, but holds in all others), then we know that there is something interesting about
that store. Perhaps its clientele are different, or perhaps it has organized its displays in
a novel and more lucrative way. Investigating such differences may yield useful insights
which will improve company sales.

6.1 Benefits of Market Basket Analysis:

6.1.1 1. Store Layout:

Based on the insights from market basket analysis you can organize your store to
increase revenues. Items that go along with each other should be placed near each
other to help consumers notice them. This will guide the way a store should be
organized to shoot for best revenues. With the help of this data you can eliminate the
guesswork while determining the optimal store layout.

6.1.2 2. Marketing Messages:

Whether it is email, phone, social media or an offer by a direct salesman, market basket
analysis can improve the efficiency of all of them. By using data from MBA you can
suggest the next best product which a customer is likely to buy. Hence you will help
your customers with fruitful suggestions instead of annoying them with marketing blasts.

6.1.3 Maintain Inventory:

Based on the inputs from MBA you can also predict future purchases of customers over
a period of time. Using your initial sales data, you can predict which items would probably
fall short and maintain stocks at the optimal quantity. This will help you improve the
allocation of resources to the different items in the inventory.

6.1.4 Content Placement:

In the case of e-commerce businesses, website content placement is very important. If


goods are displayed in the right order then it can help boost conversions. MBA can also be
used by online publishers and bloggers to display the content a consumer is most likely
to read next. This will reduce bounce rate, improve engagement and result in better
performance in search results.
6.1.5 Recommendation Engines:

Recommendation engines are already used by some popular companies like Netflix,
Amazon, Facebook, etc. If you want to create an effective recommendation system for
your company then you will also need market basket analysis to maintain it efficiently.
MBA can be considered the basis for creating a recommendation engine.

As we have seen, market basket analysis can help companies, especially retailers, to
analyze buying behavior and predict their next purchase. If used effectively this can
significantly improve cross-selling and, in turn, help you increase your customers'
lifetime value.

6.2.2.2. Apriori Algorithm
Introduction:
The Apriori algorithm is an influential algorithm for mining frequent itemsets for
Boolean association rules. It uses a bottom-up approach where frequent subsets are
extended one item at a time, i.e. the steps are candidate generation followed by testing
the groups of candidates against the data. It is designed to operate on databases
containing transactions, for example the collection of items bought by a customer or the
details of website visits.
Support and Confidence
Transaction Item sets
T1 X,Y,Z
T2 X,Z
T3 W
T4 X
T5 Y,Z
T6 A,B,Z
T7 X,Z,B
T8 X,Z,W
T9 A,X,Z
T10 Z,Y

Support:

Support shows the frequency of the patterns in the rule; it is the percentage of
transactions that contain both X and Z, i.e.

Support = Probability (X and Z)

Support = (# of transactions involving X and Z) / (total number of transactions).

Ques: From the above data set, find the support of X => Z.


Solution:

Support(X => Z) = (number of transactions in which X and Z are purchased together) /
                  (total number of transactions)

                = 5 / 10

                = 0.5

Confidence

Confidence is the strength of implication of a rule; it is the percentage of transactions
that contain Z given that they contain X, i.e.

Confidence = Probability (Z given X) = P(Z | X)

Confidence = (# of transactions involving both X and Z) / (total number of transactions
that contain X).

Ques: Find the confidence of X => Z.

Solution:

Confidence(X => Z) = (number of transactions in which X and Z are purchased together) /
                     (number of transactions in which X is purchased)

                   = 5 / 6

                   ≈ 0.83 (83%)
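These counts are easy to check with a few lines of code; the following is only an illustrative sketch over the ten transactions listed above:

```python
# Recompute support and confidence of the rule X => Z for the table above.
transactions = [
    {"X", "Y", "Z"}, {"X", "Z"}, {"W"}, {"X"}, {"Y", "Z"},
    {"A", "B", "Z"}, {"X", "Z", "B"}, {"X", "Z", "W"}, {"A", "X", "Z"}, {"Z", "Y"},
]

both = sum(1 for t in transactions if {"X", "Z"} <= t)   # transactions with X and Z: 5
only_x = sum(1 for t in transactions if "X" in t)        # transactions with X: 6

support = both / len(transactions)    # 5 / 10 = 0.5
confidence = both / only_x            # 5 / 6  = 0.83 (approximately)
print(support, confidence)
```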

Exercise
A database has four transactions. Let min_support = 60% and min_conf = 80%.
Find all frequent itemsets using the Apriori algorithm.

TID     DATE          ITEMS BOUGHT

T100    10/15/2018    K, A, D, B
T200    10/15/2018    D, A, C, E, B
T300    10/19/2018    C, A, B, E
T400    10/19/2019    B, A, D
SOLUTION:
Min_support = 60%
Min_conf = 80%
Minimum support count = (support percentage / 100) X total transactions
                      = (60 / 100) X 4
                      = 2.4
Candidate generation: list every item with its support count.
C1:
Item set    Support count
A           4
B           4
C           2
D           3
E           2
K           1
Compare each item set's support count with the minimum support count of 2.4.
Here we have listed only those item sets whose support count is greater than 2.4.
L1:
Item set    Support count
A           4
B           4
D           3
Generate pairs from the item sets of L1 to form the candidate set C2.
C2:
Item Set    Support count
A,B         4
A,D         3
B,D         3

Compare each item set's support count with the minimum support count of 2.4.
Here we have listed only those item sets whose support count is greater than 2.4.

L2:
Item Set    Support count
A,B         4
A,D         3
B,D         3

Join L2 with itself to generate C3.


C3:
Item Set    Support count
A,B,D       3
Now create the association rules with support and confidence for {A, B, D}:
Association Rule    Support count    Confidence      Confidence %
A^B=>D              3                3/4 = 0.75      75
A^D=>B              3                3/3 = 1         100
B^D=>A              3                3/3 = 1         100
D=>A^B              3                3/3 = 1         100
B=>A^D              3                3/4 = 0.75      75
A=>B^D              3                3/4 = 0.75      75
How to find the confidence?
Confidence = (support count of the whole item set) / (support count of the left-hand side)
For A^B=>D   [Note: the denominator counts only A^B, not D]
Support count of {A,B,D} = 3
Support count of {A,B} = 4
Confidence = 3 / 4
           = 0.75

For A^D=>B   [Note: the denominator counts only A^D, not B]


Support count of {A,B,D} = 3
Support count of {A,D} = 3
Confidence = 3 / 3 = 1
The remaining confidences are found in the same way.
Now,
since min_conf = 80%, we select only those rules whose confidence is at least 80%:
Rule        Support count    Confidence %
A^D=>B      3                100
B^D=>A      3                100
D=>A^B      3                100

Solved.
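For readers who want to reproduce the whole exercise programmatically, the following is a minimal Apriori sketch (one possible illustration, not the only way to implement it); note that it also reports the extra high-confidence rules that come from the frequent 2-itemsets, such as A => B, which the hand-worked solution above did not list.

```python
# Minimal Apriori sketch for the four-transaction exercise above.
from itertools import combinations

transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
min_count = 0.60 * len(transactions)   # minimum support count = 2.4
min_conf = 0.80                        # minimum confidence = 80%

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Level-wise generation of the frequent itemsets L1, L2, L3, ...
frequent = {}
k_sets = [frozenset([i]) for t in transactions for i in t]
while k_sets:
    level = {s: count(s) for s in set(k_sets) if count(s) >= min_count}
    frequent.update(level)
    keys = list(level)
    k_sets = [a | b for a in keys for b in keys if len(a | b) == len(a) + 1]

# Association rules from every frequent itemset with at least two items
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            conf = sup / count(frozenset(lhs))
            if conf >= min_conf:
                rhs = itemset - set(lhs)
                print(sorted(lhs), "=>", sorted(rhs), "support", sup, "confidence", conf)
```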
6.2.2.3. FP-Growth
Introduction

The FP-growth algorithm is an efficient and scalable method for mining the complete set
of frequent patterns by pattern-fragment growth, using an extended prefix-tree structure
that stores compressed, crucial information about frequent patterns, called the
frequent-pattern (FP) tree.

The FP-tree is constructed as follows. Create the root of the tree and scan the
database a second time. The items in each transaction are processed in the order of the
frequent-items list and a branch is created for each transaction. When adding the
branch for a transaction, the count of each node along a common prefix is
incremented by 1. After constructing the tree, mining proceeds as follows. Starting from
each frequent length-1 pattern, construct its conditional pattern base, then construct its
conditional FP-tree and perform mining recursively on that tree. The support of a
candidate (conditional) itemset is counted by traversing the tree: the sum of the count
values at the nodes of the least frequent item gives the support value.

This approach is based on a divide-and-conquer strategy. The first step is to compress


the whole database into a frequent-pattern tree that preserves the association
information of the itemsets. The next step is to divide this compressed database into a
set of conditional databases, where each conditional database is associated with one
frequent item, and to mine these databases separately. For each frequent item only its
associated data sets need to be examined, so this approach is beneficial as it reduces
the size of the data sets to be searched.

The FP-Growth algorithm works as follows:

1. Scan the transaction database once, as in the Apriori algorithm, to find all the
frequent items and their support.

2. Sort the frequent items in descending order of their support.

3. Initially, begin making the FP-tree with a root labeled "null".

4. Get the first transaction from the transaction database. Remove all non-frequent
items and list the remaining items according to the order of the sorted frequent
items.

5. Use the transaction to construct the first branch of the tree, with each node
corresponding to a frequent item and showing that item's frequency, which is 1 for
the first transaction.

6. Get the next transaction from the transaction database. Remove all non-frequent
items and list the remaining items according to the order of the sorted frequent
items.

7. Insert the transaction into the tree, reusing any common prefix that may appear,
and increase the item counts along that prefix.

8. Continue with Step 6 until all transactions in the database are processed. (A short
code sketch of these steps is given below.)
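A compact sketch of steps 1-8 (my own illustration, not code from the notes) that builds the FP-tree with nested Python dictionaries:

```python
# Build an FP-tree: each child maps item -> [count, children-dict].
from collections import Counter

def build_fp_tree(transactions, min_support):
    # Steps 1-2: count items and keep the frequent ones, sorted by descending support.
    counts = Counter(item for t in transactions for item in t)
    frequent = [i for i, c in counts.most_common() if c >= min_support]

    # Step 3: the root of the tree is a "null" node, represented here as an empty dict.
    root = {}

    # Steps 4-8: insert each transaction, reordered, sharing common prefixes.
    for t in transactions:
        ordered = [i for i in frequent if i in t]
        node = root
        for item in ordered:
            child = node.setdefault(item, [0, {}])
            child[0] += 1            # increment the count along the shared prefix
            node = child[1]
    return root, frequent
```

With min_support = 3, the frequent items of the exercise below are F, C, A, B, M and P; ties in support may be ordered differently by this sketch, which can change the shape of the tree but not the frequent patterns it encodes.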

Exercise:
Find the conditional pattern base and conditional FP-tree for each item in the following
data set, where min_support = 3.
T-ID ITEM SET
1 F,A,C,D,G,M,P
2 A,B,C,F,L,M,O
3 B,F,H,O
4 B,K,C,P
5 A,F,C,L,P,M,N
Solution:
Step 1:
List every item in a column with its support count.
Item set    Support count
A           3
B           3
C           4
D           1
F           4
G           1
H           1
K           1
L           2
M           3
N           1
O           2
P           3

Step 2:
Now choose only those items whose support count is at least 3 (the minimum support)
and sort them in descending order of support.
Item set    Support count
F           4
C           4
A           3
B           3
M           3
P           3
Step 3:
Reorder the items of each transaction in the question according to the frequency-ordered
list obtained in Step 2 (i.e. F, C, A, B, M, P):
T-ID ITEM SET ORDERED ITEMS
1 F,A,C,D,G,M,P F,C,A,M,P
2 A,B,C,F,L,M,O F,C,A,B,M
3 B,F,H,O F,B
4 B,K,C,P C,B,P
5 A,F,C,L,P,M,N F,C,A,M,P
Count each item from the ordered item lists, i.e. F = 4, C = 4, A = 3, B = 3, M = 3, P = 3.
Item Conditional pattern base Conditional fp-tree
P (FCAM:2) (CB:1) (C:3)|P
M (FCA:2)(FCAB:1) (F:3)(C:3)(A:3)|M
B (FCA:1)(F:1)(C:1) EMPTY
A (FC:3) (F:3)(C:3)|A
C (F:3) (F:3)|C
F EMPTY EMPTY

LOGIC: in the conditional FP-tree keep only those items whose total count across the
conditional pattern base reaches min_support. For P, the patterns (FCAM:2) and (CB:1)
give C a total count of 3, so only (C:3) is chosen.
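The table above can also be reproduced without drawing the tree, because each ordered transaction is a single path; the sketch below (my own simplification, not taken from the notes) aggregates the prefixes directly:

```python
# Conditional pattern bases and (flattened) conditional FP-trees for the exercise above.
from collections import Counter

ordered = [
    ["F", "C", "A", "M", "P"],
    ["F", "C", "A", "B", "M"],
    ["F", "B"],
    ["C", "B", "P"],
    ["F", "C", "A", "M", "P"],
]
min_support = 3

for item in ["P", "M", "B", "A", "C", "F"]:
    # Conditional pattern base: the prefix before `item` in every path containing it.
    base = Counter()
    for path in ordered:
        if item in path:
            prefix = tuple(path[: path.index(item)])
            if prefix:
                base[prefix] += 1
    # Conditional FP-tree: items in the base whose summed count reaches min_support.
    totals = Counter()
    for prefix, c in base.items():
        for i in prefix:
            totals[i] += c
    cond_tree = {i: c for i, c in totals.items() if c >= min_support}
    print(item, dict(base), cond_tree)
```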
6.2.3. Clustering
Background
When answering this, it is important to understand that data mining is part of data science.
Data mining focuses on using machine learning, pattern recognition and statistics to
discover patterns in data.

Clustering falls into the machine learning / pattern recognition realm.
It is important to remember there are 2 types of machine learning algorithms:
1. Supervised Learning - These include machine learning algorithms that have variables
used as predictors and a variable to predict. The predictors are tied to the prediction
variable and are trained against that variable to make future predictions. Within
supervised learning there are two major types of algorithms:
a. Regression – These algorithms use the predictors to predict a quantitative variable,
as with a linear regression model.
b. Classification – These algorithms typically look to label data into categories. A
classic example would be sick or not sick in a medical study, but there can be numerous
category labels. Examples include logistic regression and random forest classification
models.
2. Unsupervised Learning – These algorithms have no prediction variable tied to the
data. Instead of having an output, the data only has an input, which would be multiple
variables that describe the data. This is where clustering comes in.
Clustering is an unsupervised machine learning method that attempts to uncover the
natural groupings and statistical distributions of data. There are multiple clustering
methods, such as K-means or hierarchical clustering. Often a measure of distance
from point to point is used to find which category a point should belong to, as with
K-means. Hierarchical clustering seeks to build up or break down sets of clusters based
on the input information. This allows the user to use the set of clusters that best
accomplishes their purpose. The algorithm will not name the groups it creates for you,
but it will show you where they are, and they can then be named anything. A really
simple example would be clustering the data into 3 groups.

So how is clustering useful?


Clustering is useful when it is not feasible to manually classify every data point to find
patterns in the data. This could mean the user does not know how many clusters there are
or should be, there is too much data to classify by hand, or the relationships between
variables and observations are not understood.
What is Clustering?

Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group are more similar to each other than to those in other
groups.
It is the process by which objects are classified into a number of groups so that they
are as dissimilar as possible from one group to another, but as similar as possible
within each group. The attributes of the objects are allowed to determine which objects
should be grouped together.

How does a pizza shop use clustering?

Fig 1: Let us suppose the following are the delivery locations for pizza.
Fig 2: Let us locate the cluster centres randomly.

Fig 3: Find the distance of each point from the centres.

Why is clustering needed?

1. Organizing data into clusters shows the internal structure of the data –


   Ex. Clusty and the clustering of genes above
2. Sometimes the partitioning itself is the goal –
   Ex. market segmentation
3. It prepares the data for other AI techniques –
   Ex. summarizing news (cluster and then find the centroid)
4. Clustering techniques are useful in knowledge discovery in data –
   Ex. underlying rules, recurring patterns, topics

Application of Clustering

Medicine
On PET scans, cluster analysis can be used to differentiate between different types
of tissue in a three-dimensional image for many different purposes
Analysis of antimicrobial activity
Cluster analysis can be used to analyse patterns of antibiotic resistance, to classify
antimicrobial compounds according to their mechanism of action, to classify antibiotics
according to their antibacterial activity.

Business and marketing


Market research
Cluster analysis is widely used in market research when working with multivariate data
from surveys and test panels. Market researchers use cluster analysis to partition the
general population of consumers into market segments and to better understand the
relationships between different groups of consumers/potential customers, and for use
in market segmentation, product positioning, new product development and selecting test
markets.
Grouping of shopping items
Clustering can be used to group all the shopping items available on the web into a set of
unique products. For example, all the items on eBay can be grouped into unique products.

World Wide Web


Social network analysis
In the study of social networks, clustering may be used to recognize communities within
large groups of people.
Search result grouping
In the process of intelligent grouping of the files and websites, clustering may be used to
create a more relevant set of search results compared to normal search engines like Google.
There are currently a number of web-based clustering tools such as Clusty. It also may be
used to return a more comprehensive set of results in cases where a search term could refer
to vastly different things. Each distinct use of the term corresponds to a unique cluster of
results, allowing a ranking algorithm to return comprehensive results by picking the top result
from each cluster.

Computer science
Image segmentation
Clustering can be used to divide a digital image into distinct regions for border
detection or object recognition
Recommender systems
Recommender systems are designed to recommend new items based on a user's tastes.
They sometimes use clustering algorithms to predict a user's preferences based on the
preferences of other users in the user's cluster.
Anomaly detection
Anomalies/outliers are typically – be it explicitly or implicitly – defined with respect to
clustering structure in data.
Natural language processing
Clustering can be used to resolve lexical ambiguity

Social science
Crime analysis
Cluster analysis can be used to identify areas where there are greater incidences of
particular types of crime. By identifying these distinct areas or "hot spots" where a similar
crime has happened over a period of time, it is possible to manage law enforcement
resources more effectively.
Educational data mining
Cluster analysis is for example used to identify groups of schools or students with similar
properties.

6.2.3.1. Partitioning method


6.2.3.1.1. K-means

K-Means Clustering

K-means is one of the simplest unsupervised learning algorithms that solve the well-
known clustering problem. The procedure follows a simple and easy way to classify a
given data set through a certain number of clusters (assume K clusters) fixed a priori.

The main idea is to define K centres, one for each cluster. These centres should be
placed in a cunning way, because different locations cause different results, so the
better choice is to place them as far away from each other as possible. The next step is
to take each point belonging to the given data set and associate it with the nearest
centre. When no point is pending, the first step is completed and an early grouping is
done. At this point we need to recalculate K new centroids as the barycentres of the
clusters resulting from the previous step. After we have these K new centroids, a new
binding has to be done between the same data-set points and the nearest new centre. A
loop has been generated. As a result of this loop we may notice that the K centres
change their location step by step until no more changes are made, or in other words
the centres do not move any more.

Algorithmic steps for k-means clustering


Let X = {x1, x2, x3, ..., xn} be the set of data points.

1) Randomly select 'k' cluster centres.

2) Calculate the distance between each data point and each cluster centre, e.g. the
squared Euclidean distance (xi - cj)^2.

3) Assign each data point to the cluster centre whose distance from it is the minimum of
all the cluster centres.

4) Recalculate each new cluster centre as the mean of the data points assigned to it.

5) Recalculate the distance between each data point and the newly obtained cluster centres.

6) If no data point was reassigned then stop; otherwise repeat from step 3). (A minimal
code sketch of these steps is given below.)
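Here is a minimal sketch of these steps (the standard batch version, assuming NumPy; the worked examples below instead update a centre immediately after each point, a sequential variant of the same idea):

```python
# Standard batch k-means (Lloyd's algorithm) following steps 1-6 above.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]   # step 1
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # steps 2-3: assign each point to its nearest centre
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centre as the mean of its assigned points
        # (a full implementation would also handle clusters that become empty)
        new_centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):   # steps 5-6: stop when nothing moves
            break
        centres = new_centres
    return centres, labels

# The six (X, Y) samples from the first exercise below:
data = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77]], float)
print(kmeans(data, k=2))
```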

Advantages

1. Fast, robust and easier to understand.

2. Relatively efficient.

3. Gives best result when data set are distinct or well separated from each other.
Disadvantages

1. If two clusters overlap heavily, k-means will not be able to
resolve that there are two clusters.

2. Euclidean distance measures can unequally weight underlying factors.

3. A poor random choice of the initial cluster centres may not lead to a fruitful result.

4. Applicable only when mean is defined i.e. fails for categorical data.

5. Unable to handle noisy data and outliers.

Exercise:

Ques: Apply K-means clustering to the following data set to form two clusters.

Sample X,Y
1 185,72
2 170,56
3 168,60
4 179,68
5 182,72
6 188,77
Solution:

Given: number of clusters K = 2
Let us choose the data points K1 = (185, 72) and K2 = (170, 56) as the initial centres.
Step1:
Using the Euclidean distance measure = sqrt((X2 - X1)^2 + (Y2 - Y1)^2)
For the value (168, 60):
Euclidean distance between (168, 60) and K1 (185, 72)
= 20.81
Euclidean distance between (168, 60) and K2 (170, 56)
= 4.47
Here the distance from (168, 60) to K1 (185, 72) is greater than the distance to
K2 (170, 56), so the data point is assigned to K2.
Now the cluster point K2 changes: take the mean of K2 and (168, 60),
i.e. new K2 = ((170 + 168) / 2, (56 + 60) / 2)
new K2 = (169, 58)

For the value (179, 68):


Euclidean distance between (179, 68) and K1 (185, 72)
= 7.21
Euclidean distance between (179, 68) and K2 (169, 58)
= 14.14
Here the distance from (179, 68) to K1 (185, 72) is less than the distance to
K2 (169, 58), so the data point is assigned to K1.
Now the cluster point K1 changes: take the mean of K1 and (179, 68),
i.e. new K1 = ((179 + 185) / 2, (68 + 72) / 2)
new K1 = (182, 70)

For the value (182, 72):


Euclidean distance between (182, 72) and K1 (182, 70)
= 2
Euclidean distance between (182, 72) and K2 (169, 58)
= 19.10
Here the distance from (182, 72) to K1 (182, 70) is less than the distance to
K2 (169, 58), so the data point is assigned to K1.
Now the cluster point K1 changes: take the mean of K1 and (182, 72),
i.e. new K1 = ((182 + 182) / 2, (70 + 72) / 2)
new K1 = (182, 71)

For the value (188, 77):


Euclidean distance between (188, 77) and K1 (182, 71)
= 8.49
Euclidean distance between (188, 77) and K2 (169, 58)
= 26.87
Here the distance from (188, 77) to K1 (182, 71) is less than the distance to
K2 (169, 58), so the data point is assigned to K1.

Conclusion:

K1: (185, 72), (179, 68), (182, 72), (188, 77)
K2: (170, 56), (168, 60)

Exercise

Apply K-means clustering to partition the following data set into 2 clusters.

Data set: {2, 4, 10, 12, 3, 20, 30, 11, 25}

Solution:

Step 1
Let us assume two cluster centres K1 = 4 and K2 = 11.

Data point         d1   d2   d3   d4   d5   d6   d7   d8   d9
Value               2    4   10   12    3   20   30   11   25
Dist(D1) from K1    2    0    6    8    1   16   26    7   21
Dist(D2) from K2    9    7    1    1    8    9   19    0   14
Cluster assigned   K1   K1   K2   K2   K1   K2   K2   K2   K2

Here: D1 = distance between K1 and each of the data points (d1, d2, ..., d9)
D2 = distance between K2 and each of the data points (d1, d2, ..., d9)
Cluster assigned = K2 if D1 > D2, otherwise K1
From the above calculation,
Data which belong to cluster K1 = {2, 4, 3}
Data which belong to cluster K2 = {10, 12, 20, 11, 25, 30}
Now calculate the new means:
New K1 = (2 + 4 + 3) / 3 = 3
New K2 = (10 + 12 + 20 + 11 + 25 + 30) / 6 = 18

Step 2
From Step 1 the new cluster centres are K1 = 3 and K2 = 18.

Data point         d1   d2   d3   d4   d5   d6   d7   d8   d9
Value               2    4   10   12    3   20   30   11   25
Dist(D1) from K1    1    1    7    9    0   17   27    8   22
Dist(D2) from K2   16   14    8    6   15    2   12    7    7
Cluster assigned   K1   K1   K1   K2   K1   K2   K2   K2   K2

Here: D1 = distance between K1 and each of the data points (d1, d2, ..., d9)
D2 = distance between K2 and each of the data points (d1, d2, ..., d9)
Cluster assigned = K2 if D1 > D2, otherwise K1
From the above calculation,
Data which belong to cluster K1 = {2, 4, 10, 3}
Data which belong to cluster K2 = {12, 20, 11, 25, 30}
Now calculate the new means:
New K1 = (2 + 4 + 3 + 10) / 4 = 4.75
New K2 = (12 + 20 + 11 + 25 + 30) / 5 = 19.6

Now continue this process until new K1 = old K1 and new K2 = old K2. (The short sketch
below finishes the iteration.)
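```python
# My own continuation of the exercise: repeat the two steps above until the means
# stop changing, for the 1-D data set {2, 4, 10, 12, 3, 20, 30, 11, 25}.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
k1, k2 = 4, 11                        # initial cluster centres from Step 1

while True:
    c1 = [x for x in data if abs(x - k1) <= abs(x - k2)]   # points nearer to K1
    c2 = [x for x in data if abs(x - k1) > abs(x - k2)]    # points nearer to K2
    new_k1, new_k2 = sum(c1) / len(c1), sum(c2) / len(c2)
    if (new_k1, new_k2) == (k1, k2):   # the means stopped changing, so stop
        break
    k1, k2 = new_k1, new_k2

print(k1, c1)   # final centre and members of cluster K1
print(k2, c2)   # final centre and members of cluster K2
```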

6.2.3.1.2. K-medoids

The K-Medoids Method

Partitioning Around Medoids, or the K-medoids algorithm, is a partitional clustering
algorithm which is a slight modification of the K-means algorithm. Both attempt
to minimize the squared error, but the K-medoids algorithm is more robust to noise
than the K-means algorithm. In K-means the means are chosen as the centroids,
but in K-medoids actual data points are chosen to be the medoids.

A medoid can be defined as that object of a cluster whose average dissimilarity to
all the objects in the cluster is minimal.

The difference between k-means and k-medoids is analogous to the difference
between mean and median: the mean indicates the average value of all the data
items collected, while the median indicates the value around which all the data items
are evenly distributed. The basic idea of this algorithm is to first compute the
K representative objects, which are called medoids. After finding the set of
medoids, each object of the data set is assigned to the nearest medoid. That is,
object i is put into cluster vi when medoid mvi is nearer than any other medoid mw.
The algorithm proceeds in two steps:

 BUILD-step: This step sequentially selects k "centrally located" objects, to be used
as the initial medoids.

 SWAP-step: If the objective function can be reduced by interchanging (swapping)
a selected object with an unselected object, then the swap is carried out. This is
continued until the objective function can no longer be decreased.

The basic strategy of the k-medoids clustering algorithm is to find k clusters in n
objects by first arbitrarily choosing a representative object (the medoid) for each
cluster. Each remaining object is clustered with the medoid to which it is the most
similar. The strategy then iteratively replaces one of the medoids by one of the
non-medoids as long as the quality of the resulting clustering improves. This quality
is estimated using a cost function that measures the average dissimilarity between an
object and the medoid of its cluster.
To determine whether a non-medoid object, O_random, is a good replacement for a current
medoid, Oj, four cases are examined for each of the non-medoid objects, p.
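A rough sketch of this idea (illustrative only, using the absolute difference as the dissimilarity on 1-D data and a simplified BUILD step that just takes the first k points):

```python
# K-medoids: keep a swap only if it lowers the total dissimilarity to the nearest medoid.
def total_cost(data, medoids):
    return sum(min(abs(x - m) for m in medoids) for x in data)

def k_medoids(data, k):
    medoids = list(data[:k])                       # simplified BUILD step
    improved = True
    while improved:                                # SWAP step
        improved = False
        for m in list(medoids):
            for o in data:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(data, candidate) < total_cost(data, medoids):
                    medoids, improved = candidate, True
    return medoids

print(k_medoids([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))
```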

6.2.3.2. Hierarchical method


In data mining, hierarchical clustering is a method of cluster analysis which seeks to
build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into
two types. It is a stepwise algorithm which merges (or splits) two clusters at each step,
the two which have the least dissimilarity.
Difficulties of hierarchical clustering:
The hierarchical clustering method, though simple, often encounters difficulties
regarding the selection of merge or split points. Such a decision is critical, because
once a group of objects is merged or split, the process at the next step will operate on
the newly generated clusters. It will neither undo what was done previously nor swap
objects between clusters, so a merge or split decision that is not well chosen at some
step may lead to low-quality clusters.
In general there are two types of hierarchical clustering methods.
6.2.3.2.1. Agglomerative
(AGNES) - Agglomerative Nesting - This is a "bottom up" approach: each observation
starts in its own cluster, and pairs of clusters are merged as one moves up the
hierarchy. This clustering method starts by placing each object in its own cluster
and then merges these atomic clusters into larger and larger clusters, until all of the
objects are in a single cluster or certain termination conditions are satisfied.

FIG: The above shows the agglomerative clustering method.

In the above diagram a, b, c, d, e are clusters.

Cluster a and cluster b combine to form ab.

Cluster d and cluster e combine to form de.

c and de combine to form cde.

cde and ab combine to form abcde.

abcde is the final cluster and the algorithm terminates. (A tiny sketch of this
bottom-up merging is given below.)
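A tiny single-linkage, AGNES-style sketch (illustrative only, on 1-D points):

```python
# Bottom-up merging: start with singleton clusters and repeatedly merge the closest pair.
def agnes(points):
    clusters = [[p] for p in points]               # each object starts in its own cluster
    while len(clusters) > 1:
        # find the pair of clusters with the least dissimilarity (single linkage)
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        print("merged into:", merged)              # shows the bottom-up hierarchy
    return clusters[0]

agnes([1, 2, 9, 10, 25])
```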

6.2.3.2.2. Divisive
DIANA (Divisive ANAlysis) - This is a "top down" approach, and this strategy does
the reverse of agglomerative hierarchical clustering by starting with all objects in one
cluster. It subdivides the cluster into smaller and smaller pieces until each object
forms a cluster on its own or until it satisfies certain termination conditions, such as a
desired number of clusters being obtained or the distance between the two closest
clusters being above a certain threshold distance.

In DIANA, all of the objects are used to form one initial cluster. The cluster is split
according to some principle, such as the maximum Euclidean distance between the
closest neighboring objects in the cluster. The cluster-splitting process repeats until,
eventually, each new cluster contains only a single object.



Chapter 7: Mining Complex types of data
7.1 Multimedia data mining

 Multimedia data mining refers to the analysis of large amounts of multimedia information in
order to find patterns or statistical relationships. Once data is collected, computer programs are
used to analyze it and look for meaningful connections. This information is often used by
governments to improve social systems. It can also be used in marketing to discover consumer
habits.

 Multimedia data mining requires the collection of huge amounts of data. The sample size is
important when analyzing data because predicted trends and patterns are more likely to be
inaccurate with a smaller sample. This data can be collected from a number of different media,
including videos, sound files, and images. Some experts also consider spatial data and text to be
multimedia. Information from one or more of these media is the focus of data collection.

 Whereas an analysis of numerical data can be straightforward, multimedia data analysis requires
sophisticated computer programs which can turn it into useful numerical data. There are a number
of computer programs available that make sense of the information gathered from multimedia
data mining. These computer programs are used to search for relationships that may not be
apparent or logically obvious.

 When multimedia is mined for information, one of the most common uses for this information
is to anticipate behavior patterns or trends. Information can be divided into classes as well,
which allows different groups, such as men and women or Sundays and Mondays, to be
analyzed separately. Data can be clustered, or grouped by logical relationship, which can
help track consumer affinity for a certain brand over another, for example.

 Multimedia data mining has a number of uses in today’s society. An example of this would be
the use of traffic camera footage to analyze traffic flow. This information can be used when
planning new streets, expanding existing streets, or diverting traffic. Government
organizations and city planners can use the information to help traffic flow more smoothly
and quickly.

 While the term data mining is relatively new, the practice of mining data has been around for
a long time. Grocery stores, for example, have long used data mining to track consumer
behavior by collecting data from their registers. The numerical data relating to sales
information can be used by a computer program to learn what people are buying and when
they are likely to buy certain products. This information is often used to determine where to
place certain products and when to put certain products on sale.

7.2 Text Mining



 Text mining can help an organization derive potentially valuable business insights from text-
based content such as word documents, email and postings on social media streams like
Facebook, Twitter and LinkedIn. Mining unstructured data with natural language processing
(NLP), statistical modeling and machine learning techniques can be challenging, however,
because natural language text is often inconsistent. It contains ambiguities caused by inconsistent
syntax and semantics, including slang, language specific to vertical industries and age groups,
double entendres and sarcasm.

 Text analytics software can help by transposing words and phrases in unstructured data into
numerical values which can then be linked with structured data in a database and analyzed with
traditional data mining techniques. With an iterative approach, an organization can successfully
use text analytics to gain insight into content-specific values such as sentiment, emotion, intensity
and relevance. Because text analytics technology is still considered to be an emerging technology,
however, results and depth of analysis can vary wildly from vendor to vendor.
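As a small illustration of "transposing words and phrases into numerical values", the sketch below (assuming a reasonably recent version of scikit-learn) turns three short documents into TF-IDF vectors that ordinary data mining techniques can then work with:

```python
# Turn free text into a numerical matrix with TF-IDF weighting.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "customer happy with fast delivery",
    "delivery was late and customer unhappy",
    "great product, happy customer",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # one row of TF-IDF weights per document
print(vectorizer.get_feature_names_out())   # the vocabulary that became the columns
print(X.toarray())
```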

7.3 Web Mining

Web mining is the use of data mining techniques to automatically discover and extract information from
Web documents and services.

There are three general classes of information that can be discovered by web mining:

 Web activity, from server logs and Web browser activity tracking.

 Web graph, from links between pages, people and other data.

 Web content, for the data found on Web pages and inside of documents.

At Scale Unlimited we focus on the last one – extracting value from web pages and other documents
found on the web.

Note that there’s no explicit reference to “search” in the above description. While search is the biggest
web miner by far, and generates the most revenue, there are many other valuable end uses for web mining
results. A partial list includes:

 Business intelligence

 Competitive intelligence



 Pricing analysis

 Events

 Product data

 Popularity

 Reputation

Four Steps in Content Web Mining

When extracting Web content information using web mining, there are four typical steps.

1. Collect – fetch the content from the Web

2. Parse – extract usable data from formatted data (HTML, PDF, etc)

3. Analyze – tokenize, rate, classify, cluster, filter, sort, etc.

4. Produce – turn the results of analysis into something useful (report, search index, etc)
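A toy end-to-end sketch of these four steps (illustrative only: the URL is a placeholder, the `requests` package is assumed to be installed, and a real crawler would also respect robots.txt and crawl politely, as discussed below):

```python
# Collect, parse, analyze, produce - the simplest possible content web mining pipeline.
import re
from collections import Counter
import requests

url = "https://example.com/"                       # placeholder, not a real target
html = requests.get(url, timeout=10).text          # 1. Collect: fetch the page
text = re.sub(r"<[^>]+>", " ", html)               # 2. Parse: crudely strip HTML tags
tokens = re.findall(r"[a-z]{3,}", text.lower())    # 3. Analyze: tokenize the text
report = Counter(tokens).most_common(10)           # 4. Produce: a tiny frequency report
print(report)
```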

Web Mining versus Data Mining

When comparing web mining with traditional data mining, there are three main differences to consider:

1. Scale – In traditional data mining, processing 1 million records from a database would be a
large job. In web mining, even 10 million pages wouldn't be a big number.

2. Access – When doing data mining of corporate information, the data is private and often requires
access rights to read. For web mining, the data is public and rarely requires access rights. But web
mining has additional constraints, due to the implicit agreement with webmasters regarding
automated (non-user) access to this data. This implicit agreement is that a webmaster allows
crawlers access to useful data on the website, and in return the crawler (a) promises not to
overload the site, and (b) has the potential to drive more traffic to the website once the search
index is published. With web mining, there often is no such index, which means the crawler has
to be extra careful/polite during the crawling process, to avoid causing any problems for the
webmaster.

3. Structure – A traditional data mining task gets information from a database, which provides
some level of explicit structure. A typical web mining task is processing unstructured or semi-
structured data from web pages. Even when the underlying information for web pages comes
from a database, this often is obscured by HTML markup.

7.4 Types of Web Mining



1) Web Content Mining

 Web content mining, also known as text mining, is generally the second step in Web data mining.
Content mining is the scanning and mining of text, pictures and graphs of a Web page to
determine the relevance of the content to the search query. This scanning is completed after the
clustering of web pages through structure mining and provides the results based upon the level of
relevance to the suggested query. With the massive amount of information that is available on the
World Wide Web, content mining provides the results lists to search engines in order of highest
relevance to the keywords in the query.

 Text mining is directed toward specific information provided by the customer search information
in search engines. This allows for the scanning of the entire Web to retrieve the cluster content
triggering the scanning of specific Web pages within those clusters. The results are pages relayed
to the search engines through the highest level of relevance to the lowest. Though, the search
engines have the ability to provide links to Web pages by the thousands in relation to the search
content, this type of web mining enables the reduction of irrelevant information.
 Web text mining is very effective when used in relation to a content database dealing with
specific topics. For example, online universities use a library system to recall articles related to
their general areas of study. This specific content database makes it possible to pull only the
information within those subjects, providing the most specific results of search queries in search
engines. Providing only the most relevant information in this way gives a higher quality of results.
This increase in productivity is due directly to the use of content mining of text and visuals.
 The main uses for this type of data mining are to gather, categorize, organize and provide the best
possible information available on the WWW to the user requesting the information. This tool is
imperative to scanning the many HTML documents, images, and text provided on Web pages.
The resulting information is provided to the search engines in order of relevance giving more
productive results of each search.
 Web content categorization with a content database is the most important tool for the efficient use
of search engines. A customer requesting information on a particular subject or item would
otherwise have to search through thousands of results to find the most relevant information to his
query. Mining the text reduces those thousands of results at this step, which eliminates
the frustration and improves the navigation of information on the Web.
 Business uses of content mining allow for the information provided on their sites to be structured
in a relevance-order site map. This allows for a customer of the Web site to access specific
information without having to search the entire site. With the use of this type of mining, data
remains available through order of relativity to the query, thus providing productive marketing.



Used as a marketing tool this provides additional traffic to the Web pages of a company’s site
based on the amount of keyword relevance the pages offer to general searches.
As the second section of data mining, text mining is useful to improve the productive uses of
mining for businesses, Web designers, and search engines operations. Organization,
categorization, and gathering of the information provided by the WWW becomes easier and
produces results that are more productive through the use of this type of mining.
 In short, the ability to conduct Web content mining allows results of search engines to maximize
the flow of customer clicks to a Web site, or particular Web pages of the site, to be accessed
numerous times in relevance to search queries. The clustering and organization of Web content in
a content database enables effective navigation of the pages by the customer and search engines.
Images, content, formats and Web structure are examined to produce a higher quality of
information to the user based upon the requests made. Businesses can maximize the use of this
text mining to improve marketing of their sites as well as the products they offer.

2) Web Usage Mining

 Web usage mining is the third category in web mining. This type of web mining allows for the
collection of Web access information for Web pages. This usage data provides the paths leading
to accessed Web pages. This information is often gathered automatically into access logs via the
Web server. CGI scripts offer other useful information such as referrer logs, user subscription
information and survey logs. This category is important to the overall use of data mining for
companies and their internet/ intranet based applications and information access.
 Usage mining allows companies to produce productive information pertaining to the future of
their business function ability. Some of this information can be derived from the collective
information of lifetime user value, product cross marketing strategies and promotional campaign
effectiveness. The usage data that is gathered provides the companies with the ability to produce
results more effective to their businesses and increasing of sales. Usage data can also be useful
for developing marketing skills that will out-sell the competitors and promote the company’s
services or product on a higher level.
 Usage mining is valuable not only to businesses using online marketing, but also to e-businesses
whose business is based solely on the traffic provided through search engines. The use of this
type of web mining helps to gather the important information from customers visiting the site.
This enables an in-depth log to complete analysis of a company’s productivity flow. E-businesses



depend on this information to direct the company to the most effective Web server for promotion
of their product or service.
 This web mining also enables Web based businesses to provide the best access routes to services
or other advertisements. When a company advertises for services provided by other companies,
the usage mining data allows for the most effective access paths to these portals. In addition, there
are typically three main uses for mining in this fashion.
 The first is usage processing, used to complete pattern discovery. This first use is also the most
difficult because only bits of information like IP addresses, user information, and site clicks are
available. With this minimal amount of information available, it is harder to track the user
through a site, being that it does not follow the user throughout the pages of the site.
 The second use is content processing, consisting of the conversion of Web information like text,
images, scripts and others into useful forms. This helps with the clustering and categorization of
Web page information based on the titles, specific content and images available.
 Finally, the third use is structure processing. This consists of analysis of the structure of each page
contained in a Web site. This structure process can prove to be difficult if resulting in a new
structure having to be performed for each page.
 Analysis of this usage data will provide the companies with the information needed to provide an
effective presence to their customers. This collection of information may include user
registration, access logs and information leading to better Web site structure, proving to be most
valuable to company online marketing. These present some of the benefits for external marketing
of the company’s products, services and overall management.
 Internally, usage mining effectively provides information to improvement of communication
through intranet communications. Developing strategies through this type of mining will allow
for intranet based company databases to be more effective through the provision of easier access
paths. The projection of these paths helps to log the user registration information giving
commonly used paths the forefront to its access.
 Therefore, it is easily determined that usage mining has valuable uses to the marketing of
businesses and a direct impact to the success of their promotional strategies and internet traffic.
This information is gathered on a daily basis and continues to be analyzed consistently. Analysis
of this pertinent information will help companies to develop promotions that are more effective,
internet accessibility, inter-company communication and structure, and productive marketing
skills through web usage mining.

3) Web structure mining

 Web structure mining, one of three categories of web mining for data, is a tool used to identify the
relationship between Web pages linked by information or direct link connection. This structure



data is discoverable by the provision of web structure schema through database techniques for
Web pages. This connection allows a search engine to pull data relating to a search query directly
to the linking Web page from the Web site the content rests upon. This completion takes place
through use of spiders scanning the Web sites, retrieving the home page, then, linking the
information through reference links to bring forth the specific page containing the desired
information.
 Structure mining helps minimize two main problems of the World Wide Web that arise from its vast
amount of information. The first of these problems is irrelevant search results: the relevance of
search information becomes misconstrued because search engines often only allow
for low-precision criteria. The second of these problems is the inability to index the vast amount of
information provided on the Web, which causes a low amount of recall with content mining. This
minimization comes in part from the function of discovering the model underlying the Web
hyperlink structure provided by Web structure mining.
 The main purpose for structure mining is to extract previously unknown relationships between
Web pages. This structure data mining provides use for a business to link the information of its
own Web site to enable navigation and cluster information into site maps. This allows its users the
ability to access the desired information through keyword association and content mining.
Hyperlink hierarchy is also determined to path the related information within the sites to the
relationship of competitor links and connection through search engines and third party co-links.
This enables clustering of connected Web pages to establish the relationship of these pages.
On the WWW, the use of structure mining enables the determination of similar structure of Web
pages by clustering through the identification of underlying structure. This information can be
used to project the similarities of web content. The known similarities then provide ability to
maintain or improve the information of a site to enable access of web spiders in a higher ratio.
The larger the amount of Web crawlers, the more beneficial to the site because of related content
to searches.
 In the business world, structure mining can be quite useful in determining the connection between
two or more business Web sites. The determined connection brings forth a useful tool for
mapping competing companies through third party links such as resellers and customers. This
cluster map allows for the content of the business pages placing upon the search engine results
through connection of keywords and co-links throughout the relationship of the Web pages. This
determined information will provide the proper path through structure mining to improve
navigation of these pages through their relationships and link hierarchy of the Web sites.
 With improved navigation of Web pages on business Web sites, connecting the requested
information to a search engine becomes more effective. This stronger connection allows



generating traffic to a business site to provide results that are more productive. The more links
provided within the relationship of the web pages enable the navigation to yield the link hierarchy
allowing navigation ease. This improved navigation attracts the spiders to the correct locations
providing the requested information, proving more beneficial in clicks to a particular site.
 Therefore, Web mining and the use of structure mining can provide strategic results for marketing
of a Web site for production of sale. The more traffic directed to the Web pages of a particular site
increases the level of return visitation to the site and recall by search engines relating to the
information or product provided by the company. This also enables marketing strategies to
provide results that are more productive through navigation of the pages linking to the homepage
of the site itself.
To truly utilize your website as a business tool web structure mining is a must.



Chapter 8: Application and trends
in data warehousing and data
mining
8.0 Data Mining tools
 Six of the best open source data mining tools
o RapidMiner (formerly known as YALE)
o WEKA
o R programming
o Orange (based on Python)
o KNIME (covers all three processes: extraction, transformation and loading)
o NLTK

8.1 Integration of data mining tools with database system


There are two types of integration:

a) Loose Integration

In a loosely integrated system the data mining tools sit outside the database management
system and simply fetch the data they need from the database.

Fig: Loose integration between database and data mining

b) Tight Integration

In a tightly integrated system the data mining tools are built into the database management
system itself, so mining functions become part of normal database processing.

Fig: Tight integration between database and data mining

8.2 Heterogeneous database System


 The functions/components in the whole database is of different types:
o Network used to interconnect the nodes
o Data description language (DDL) and data manipulation language (DML)
o DBMS and the functions they ensure
o Data models (Relational, Network, Hierarchical)
o Operating systems in the nodes which run database management system
o Computer hardware

Fig: Heterogeneous database system – a shared database accessed by the banking and
financial sector, marketing, and universities.

8.3 Importance of data mining in Marketing


 Basket Analysis
 Sales forecast
 Database Marketing
 Card Marketing
 Call detail record analysis
 Customer Loyalty
 Market segmentation
 Product Production
 Warranties

8.4 Importance in E-commerce


 Marketing
 Recommendation engine
 Pricing
 Supply Chain management
 Fraud detection

8.5 Importance in CRM (Customer Relationship Management)


 An important means of gaining and maintaining customer information.
 Improving customer value.

 CRM is about acquiring and retaining customers, improving customer loyalty, gaining customer
insight, and implementing customer-focused strategies. A true customer-centric enterprise helps
your company drive new growth, maintain competitive agility, and attain operational excellence.”
SAP

 Customer Relationship Management (CRM) is a business philosophy involving identifying,


understanding and better providing for your customers while building a relationship with each
customer to improve customer satisfaction and maximize profits. It’s about understanding,
anticipating and responding to customers’ needs.

 To manage the relationship with the customer a business needs to collect the right information
about its customers and organize that information for proper analysis and action. It needs to keep
that information up-to-date, make it accessible to employees, and provide the know-how for
employees to convert that data into products better matched to customers’ needs.

 The secret to an effective CRM package is not just in what data is collected but in the organizing
and interpretation of that data. Computers can’t, of course, transform the relationship you have
with your customer. That does take a cross-department, top to bottom, corporate desire to build
better relationships. But computers and a good computer based CRM solution, can increase sales
by as much as 40-50% – as some studies have shown.
Fig: Customer Relationship Management – a diagram relating customer satisfaction, loyalty,
perceived quality, customized information, service strategy, customer value and
relationship value.

8.6 Social Impact of data mining


Data mining is one of the most rapidly changing disciplines, with new technologies and concepts
continually under development, so academicians, researchers and professionals of the discipline need
access to the most current information about the concepts, issues, trends and technologies in this
emerging field. Social Implications of Data Mining and Information Privacy: Interdisciplinary
Frameworks and Solutions serves as a critical source of information on emerging issues and solutions
in data mining and on the influence of political and socioeconomic factors. This reference provides
concise coverage of emerging issues and technological solutions in data mining, and covers problems
with the applicable laws governing such issues.
8.7 Trends in Data Mining

Data mining concepts are still evolving, and here are the latest trends that we get to see in this field:

 Application Exploration.

 Scalable and interactive data mining methods.

 Integration of data mining with database systems, data warehouse systems and web database
systems.

 Standardization of data mining query language.

 Visual data mining.


 New methods for mining complex types of data.

 Biological data mining.

 Data mining and software engineering.

 Web mining.

 Distributed data mining.

 Real time data mining.

 Multi database data mining.

 Privacy protection and information security in data mining.
