DATA MINING
Description:
Key features of data mining:
• Clustering: based on finding and visually documenting groups of facts not previously known.
Student Note:
1.2 Specific use of data mining
• Market segmentation
o Data mining helps to identify the common characteristics of customers who buy
the same products from your company
• Customer churn anticipation
o It helps to predict which customers may leave your company and go to a
competitor
• Fraud detection: it identifies which transactions are most likely to be fraudulent.
• Direct marketing
o Direct marketing identifies which prospects should be included to obtain the
highest response rate.
• Interactive marketing
o It is useful for predicting what each user of a web site is most likely interested in
seeing.
• Market basket analysis
o It helps to understand what product or services are commonly purchased together.
• Trend analysis
o Trend analysis identifies the difference between a typical customer this month and
last month.
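As a concrete sketch of market basket analysis, the co-purchase counts behind "commonly purchased together" can be computed directly. The transactions below are made-up illustration data:

```python
from itertools import combinations
from collections import Counter

# Made-up transaction data: each basket is a set of items bought together.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "diapers"},
    {"bread", "milk", "butter", "diapers"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair suggests products to shelve or promote together.
print(pair_counts.most_common(1))  # [(('bread', 'butter'), 3)]
```

Real systems mine much larger transaction logs with algorithms such as Apriori, but the underlying co-occurrence counting is the same.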
Student Note:
1.3 Challenges of Data Mining:
• Scalability: scalable techniques are needed for handling the massive datasets that are
now created.
• Efficiency: such large datasets require efficient methods for storing, indexing and
retrieving data from secondary or even tertiary storage systems.
• Complexity: techniques that dramatically increase the size of the datasets that can be
handled require new designs and algorithms.
• Dimensionality: some domains, for example bioinformatics, have a very large number of
dimensions, which makes analyzing the data difficult; this is called the curse of
dimensionality.
• Poor quality: poor data quality such as noisy data, dirty data, missing values, and
inexact or incorrect data.
Student Note:
1.4 Knowledge Discovery in Databases (KDD)
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
1. Developing an understanding of
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set (SELECTION):
o selecting a data set, or
o focusing on a subset of variables or data samples (a data sample is a set
of data collected and/or selected from a statistical population by a defined procedure; the
elements of a sample are known as sample points, sampling units or observations, and the
sample usually represents a subset of manageable size), on which discovery is to be
performed.
3. Data cleaning and preprocessing (PREPROCESSING):
o Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a
proven method of resolving such issues.
o Removal of noise or outliers.
o Filling in missing values (e.g. data not entered due to misunderstanding).
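A minimal sketch of the missing-value step, filling gaps in a numeric attribute with the attribute's mean (the ages list is made-up illustration data):

```python
# A made-up numeric attribute with missing values marked as None.
ages = [25, None, 31, None, 40]

# Compute the mean of the known values: (25 + 31 + 40) / 3 = 32.0
known = [v for v in ages if v is not None]
mean_age = sum(known) / len(known)

# Replace each missing value with the mean.
cleaned = [v if v is not None else mean_age for v in ages]
print(cleaned)  # [25, 32.0, 31, 32.0, 40]
```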
KDD refers to the overall process of discovering useful knowledge from data. It involves the
evaluation and possibly interpretation of the patterns to make the decision of what qualifies
as knowledge. It also includes the choice of encoding schemes, preprocessing, sampling, and
projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data without
the additional steps of the KDD process.
Student Note:
1.5 Data pre-processing
Student Note:
1. Data cleaning
Real-world data tend to be incomplete, noisy and inconsistent. Data cleaning routines
attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
2. Data transformation
a) Normalization:
a. Scaling attribute values to fall within a specified range.
i. Example: to transform V in [Min, Max] to V' in [0, 1], apply
V' = (V - Min) / (Max - Min)
b. Scaling by using mean and standard deviation (useful when Min and Max are
unknown or when there are outliers): V' = (V - Mean) / StDev
b) Aggregation: moving up in the concept hierarchy on numeric attributes.
c) Generalization: moving up in the concept hierarchy on nominal attributes.
d) Attribute construction: replacing or adding new attributes inferred by existing attributes.
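The two normalization methods under (a) can be sketched in Python; the values list is made-up illustration data:

```python
values = [10.0, 20.0, 30.0, 50.0]  # made-up attribute values

# a) Min-max scaling to [0, 1]: V' = (V - Min) / (Max - Min)
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]
print(minmax)  # [0.0, 0.25, 0.5, 1.0]

# b) Z-score scaling: V' = (V - Mean) / StDev
mean = sum(values) / len(values)
stdev = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
zscore = [(v - mean) / stdev for v in values]
```

Z-score scaling leaves the transformed values with mean 0, which is why it is preferred when outliers would distort the observed Min and Max.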
3. Data reduction
1. Reducing the number of attributes
o Data cube aggregation: applying roll-up, slice or dice operations.
o Removing irrelevant attributes: attribute selection (filtering and wrapper
methods), searching the attribute space .
o Principal component analysis (numeric attributes only): searching for a lower
dimensional space that can best represent the data.
2. Reducing the number of attribute values
o Binning (histograms): reducing the number of attribute values by grouping them into
intervals (bins).
o Clustering: grouping values in clusters.
o Aggregation or generalization
3. Reducing the number of tuples
o Sampling
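Two of the reduction techniques above, equal-width binning and sampling, can be sketched briefly (the values are made-up):

```python
import random

# Equal-width binning: replace each value by the label of its interval.
values = [3, 7, 12, 18, 25, 31, 44, 59]
n_bins = 3
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins  # (59 - 3) / 3

def bin_of(v):
    # Clamp the maximum value into the last bin.
    return min(int((v - lo) / width), n_bins - 1)

binned = [bin_of(v) for v in values]
print(binned)  # [0, 0, 0, 0, 1, 1, 2, 2]

# Sampling: reduce the number of tuples by keeping a random subset.
random.seed(0)
sample = random.sample(values, k=4)
```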
SUMMARY PRE-PROCESSING
1. E-Commerce:
• For Business Intelligence
o Upsell offers, e.g. Amazon.com product recommendations.
• Fraud detection:
o A problem faced by all e-commerce companies is misuse of their systems
and, in some cases, fraud. For example, sellers may deliberately list a
product in the wrong category to attract user attention, or the item sold is
not as the seller described it. On the buy side, all retailers face problems
with users using stolen credit cards to make purchases or register new
user accounts.
• Product Search:
o When the user searches for a product, how do we find the best results for
the user? Typically, a user query of a few keywords can match many
products. For example, “Verizon Cell phones” is a popular query at eBay,
and it matches more than 34,000 listed items.
• Product recommendation
2. Crime Agencies:
• Used to spot trends across the data, helping with everything from where to deploy
police manpower (where crime is most likely to happen)
• to whom to search at a border crossing (based on the age and type of the vehicle
and the age of the occupants).
• Data mining and criminal intelligence techniques
o Entity extraction: Commonly used to automatically identify people,
organizations, vehicles and personal details in unstructured data such as
police reports. Even if entity extraction provides only basic information, it
can accelerate the investigation by rapidly providing precise details from
large amounts of unstructured data.
o Clustering techniques: Clustering techniques are used to group similar
characteristics together in classes in order to gain intelligence by
maximizing or minimizing similarities; for example, to identify suspects or
criminal groups conducting crimes in similar ways. Clustering techniques
could be effectively applied through conceptual space algorithms to
discover criminal relations by cross referencing entities in criminal
records.
o Association rules: This data mining technique has been used to discover
recurring items in databases in order to create pattern rules and detect
potential future events. This technique has been effective in preventing
network intrusions and attacks, such as denial-of-service (DoS) attacks.
o Sequential pattern mining: like association rules, it is useful for identifying
sequences of recurring items in order to define patterns and prevent
attacks in network security.
o Classification: This technique is useful for analyzing unstructured data to
discover common properties among criminal entities. Classification has
been used together with inferential statistics techniques to predict crime
trends. This technique can dramatically narrow down different criminal
entities and organize them into predefined classes.
o String comparison: This technique is used to reveal deceptive information
in criminal records by comparing structured text fields. This requires
highly intensive computational capabilities.
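The string-comparison technique above is typically based on an edit-distance measure. A minimal Levenshtein-distance sketch (the names are invented):

```python
# Edit distance (Levenshtein): the minimum number of single-character
# insertions, deletions and substitutions turning one string into another.
def edit_distance(a: str, b: str) -> int:
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

# Records that differ by one character are likely the same person.
print(edit_distance("Jon Smith", "John Smith"))  # 1
```

Comparing every pair of records this way is quadratic in the number of records, which is why the text notes that string comparison demands heavy computation.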
3. Telecommunication
o Telecommunication companies maintain data about the phone calls
that traverse their networks in the form of call detail records, which
contain descriptive information for each phone call. In 2001, AT&T
long distance customers generated over 300 million call detail records
per day (Cortes & Pregibon, 2001) and, because call detail records
are kept online for several months, this meant that billions of call
detail records were readily available for data mining. Call detail data
is useful for marketing and fraud detection applications.
o Telecommunication companies also maintain extensive customer
information, such as billing information, as well as information
obtained from outside parties, such as credit score information. This
information can be quite useful and often is combined with
telecommunication-specific data to improve the results of data mining.
For example, while call detail data can be used to identify suspicious
calling patterns, a customer’s credit score is often incorporated into
the analysis before determining the likelihood that fraud is actually
taking place.
o Telecommunications companies also generate and store an extensive
amount of data related to the operation of their networks. This is
because the network elements in these large telecommunication
networks have some self-diagnostic capabilities that permit them to
generate both status and alarm messages. These streams of messages
can be mined in order to support network management functions,
namely fault isolation and prediction.
1. Mining Methodology and User Interaction Issues
• Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are
discovered it needs to be expressed in high level languages, and visual representations.
These representations should be easily understandable.
• Handling noisy or incomplete data − the data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities. If the data cleaning
methods are not there then the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered may not be interesting because they
represent common knowledge or lack novelty; measures of pattern interestingness are needed.
2. Performance Issues
There can be performance-related issues such as follows −
• Parallel, distributed, and incremental mining algorithms − The factors such as huge
size of databases, wide distribution of data, and complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel; the results
from the partitions are then merged. Incremental algorithms update the mining results as
the database changes, without mining the data again from scratch.
• Security issues
o Security is a big issue. Businesses hold information about their employees and
customers, including social security numbers, birthdays, and payroll data. However,
how well this information is protected is still in question. There have been a
lot of cases in which hackers accessed and stole customer data from big
corporations such as Ford Motor Credit Company and Sony. With so much personal
and financial information available, stolen credit cards and identity theft
become a big problem.
• Misuse of information/inaccurate information
o Information collected through data mining for ethical purposes can
be misused. This information may be exploited by unethical people or businesses
to take advantage of vulnerable people or discriminate against a group of people.
In addition, data mining techniques are not perfectly accurate. If inaccurate information
is used for decision-making, it can cause serious consequences.
1. Poor data quality such as noisy data, dirty data, missing values, inexact or incorrect values,
inadequate data size and poor representation in data sampling.
2. Integrating conflicting or redundant data from different sources and forms: multimedia files
(audio, video and images), geo data, text, social, numeric, etc…
3. Proliferation of security and privacy concerns by individuals, organizations and governments.
4. Unavailability of data or difficult access to data.
5. Efficiency and scalability of data mining algorithms to effectively extract the information from
huge amount of data in databases.
6. Dealing with huge datasets that require distributed approaches.
7. Dealing with non-static, unbalanced and cost-sensitive data.
8. Mining information from heterogeneous databases and global information systems.
9. Constant updating of models to handle data velocity and new incoming data.
10. High cost of buying and maintaining the powerful software, servers and storage hardware
needed to handle large amounts of data.
11. Processing of large, complex and unstructured data into a structured format.
12. Sheer quantity of output from many data mining methods.
Difference between Database and Data Warehouse
The architecture for Data Warehouses was developed in the 1980s to assist in
transforming data from operational systems to decision-making support
systems. Normally, a Data Warehouse is part of a business’s mainframe
server or in the Cloud.
Punch cards were the first solution for storing computer generated data. By
the 1950s, punch cards were an important part of the American government
and businesses. The warning “Do not fold, spindle, or mutilate” originally came
from punch cards. Punch cards continued to be used regularly until the mid-
1980s. They are still used to record the results of voting ballots and
standardized tests. “Magnetic storage” slowly replaced punch cards starting in
the 1960s. Disk storage came as the next evolutionary step for data storage.
Disk storage (hard drives and floppies) started becoming popular in 1964 and
allowed data to be accessed directly, which was a significant improvement
over the clumsier magnetic tapes. IBM was primarily responsible for the early
evolution of disk storage. They invented the floppy disk drive as well as the
hard disk drive. They are also credited with several of the improvements now
supporting their products. IBM began developing and manufacturing disk
storage devices in 1956. In 2003, they sold their “hard disk” business to
Hitachi.
During the 1990s major cultural and technological changes were taking place.
The internet was surging in popularity. Competition had increased due to new
free trade agreements, computerization, globalization, and networking. This
new reality required greater business intelligence, resulting in the need for true
data warehousing. During this time, the use of application systems exploded.
By the year 2000, many businesses discovered that, with the expansion of
databases and application systems, their systems had been badly
integrated and that their data was inconsistent. They discovered they were
receiving and storing lots of fragmented data. Somehow, the data needed to
be integrated to provide the critical “Business Information” needed for
decision-making in a competitive, constantly-changing global economy.
Student Note:
Application of Data Warehouse
Data Warehouses owing to their potential have deep-rooted applications in every
industry which use historical data for prediction, statistical analysis, and decision
making. Listed below are the applications of Data warehouses across innumerable
industry backgrounds.
1. Banking Industry
In the banking industry, concentration is given to
• risk management
• analyzing consumer data, market trends,
• government regulations and reports,
• Financial decision making.
• Most banks also use warehouses to manage the resources available on deck
in an effective manner. Certain banking sectors utilize them for market
research, performance analysis of each product, interchange and exchange
rates, and to develop marketing programs.
• Analysis of card holder’s transactions, spending patterns and merchant
classification, all of which provide the bank with an opportunity to introduce
special offers and lucrative deals based on cardholder activity. Apart from all
these, there is also scope for co-branding.
2. Finance Industry
Similar to the applications seen in banking, the uses here mainly revolve around the
evaluation and trends of customer expenses, which aids in maximizing the profits earned
by their clients.
4. Government
The federal government utilizes the warehouses for
• Research in compliance, whereas the state government uses it for services
related to human resources like recruitment, and accounting like payroll
management.
• to maintain and analyze tax records,
• analyse health policy records and their respective providers,
• Analyse the entire criminal law database. Criminal activity is predicted from the
patterns and trends found by analysing the historical data associated with
past criminals.
5. Education
Universities use warehouses for
• extracting information used for the proposal of research grants,
• understanding their student demographics, and human resource
management.
• The entire financial department of most universities depends on data
warehouses, inclusive of the Financial Aid department.
6. Healthcare
One of the most important sectors which utilizes data warehouses is the Healthcare
sector. All of their financial, clinical, and employee records are fed to warehouses as
it helps them
• to strategize and predict outcomes,
• track and analyse their service feedback,
• generate patient reports,
• share data with tie-in insurance companies,
• Medical aid services, etc.
7. Hospitality Industry
A major proportion of this industry is dominated by hotel and restaurant services, car
rental services, and holiday home services. They utilize warehouse services to
• Design and evaluate their advertising and promotion campaigns where they
target customers based on their feedback and travel patterns.
8. Insurance
As the saying goes in the insurance services sector, “Insurance can never be
bought, it can only be sold”; the warehouses are primarily used to
• Analyze data patterns and customer trends, apart from maintaining records of
already existing participants.
Student Note:
Datawarehouse Model
From the architecture point of view, there are three basic types of data marts:
dependent, independent, and hybrid. The categorization
is based primarily on the data source that feeds the data mart. Dependent data marts draw
data from a central data warehouse that has already been created. Independent data marts,
in contrast, are standalone systems built by drawing data directly from operational or
external sources of data or both. Hybrid data marts can draw data from operational systems
or data warehouses.
A dependent data mart allows you to unite your organization's data in one data warehouse.
This gives you the usual advantages of centralization. Figure below illustrates a dependent
data mart.
An independent data mart is created without the use of a central data warehouse. This could
be desirable for smaller groups within an organization. Figure below illustrates an
independent data mart.
A hybrid data mart allows you to combine input from sources other than a data warehouse.
This could be useful for many situations, especially when you need ad hoc integration, such
as after a new group or product is added to the organization. Figure below illustrates a
hybrid data mart.
Comparison (data warehouse vs. data mart):
Data sources: multiple (data warehouse) vs. few selected (data mart)
Implementation: months to years (data warehouse) vs. months (data mart)
3. Introduction
3.1. Multidimensional structure
3.1.1. Fact table
3.1.2. Dimension table
3.1.3. Difference between fact table and dimension table
3. Introduction:
3.1. Multidimensional Structure
Data Warehouses and OLAP tools are based on a multidimensional data model.
This model views data in the form of data cube.
Fact Table
The large central table, containing the bulk of the data with no
redundancy.
Usually the fact table in the schema is in third normal form.
A fact table can contain fact data at detail or aggregate level.
Fig : fact table.
Dimension table:
Dimensions are the perspectives or entities with respect to which an organization
wants to keep records. For example, a store may create a sales data warehouse in
order to keep track of things like the monthly sales of items, and the branches and
locations at which the items were sold. Examples of dimensions are time, branch,
location, and item. Each dimension may have a table associated with it that further
describes the dimension, called a dimension table.
Fig: item dimension table (item key, item name, brand, type, supplier).
-Dimension tables can be specified by users or experts.
-Dimension tables are de-normalized.
-A dimension table is composed of one or more hierarchies that categorize
data; if the dimension has no hierarchies and levels, it is called a flat
dimension or list.
-A dimension table is a table in a star schema
of a data warehouse.
-Dimension tables are generally smaller in size than fact tables.
-Dimensions categorize and describe data
warehouse facts and measures in ways that
support meaningful answers to business questions.
A foreign key is a key that establishes a relationship between two tables.
Type of data: fact tables contain information like sales against a set of dimensions
like Product and Date, while every dimension table contains attributes which describe
the details of the dimension (e.g. a Product dimension can contain Product ID,
Product Category, etc.).
The multidimensional model can exist in the form of a Star schema, a
Snowflake schema, or a
Fact Constellation schema.
There is a fact table at the center. It contains the keys to each of four
dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
A large central table (called fact table) containing the bulk of the data with no
redundancy.
Usually the fact table in the star schema is in third normal form (3NF).
Fact tables typically have two types of columns: foreign keys to dimension tables, and
measures, i.e. columns that contain numeric facts.
A fact table can contain fact data at detail or aggregate level.
The following diagram shows the sales data of a company with respect to the four
dimensions: time, item, branch, and location.
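To illustrate how a fact table's foreign keys resolve against a dimension table, here is a minimal sketch with made-up item and sales data, using plain Python structures to stand in for database tables:

```python
# Hypothetical star-schema fragment: an item dimension table keyed by
# item key, and a fact table holding keys and measures.
item_dim = {
    1: {"item_name": "Phone X", "brand": "Acme"},
    2: {"item_name": "Tablet Y", "brand": "Acme"},
    3: {"item_name": "Laptop Z", "brand": "Orbit"},
}

sales_fact = [  # rows of (item_key, dollars_sold, units_sold)
    (1, 500.0, 2),
    (2, 300.0, 1),
    (3, 900.0, 1),
    (1, 250.0, 1),
]

# Join the fact table to the dimension table and aggregate by brand.
dollars_by_brand = {}
for item_key, dollars, units in sales_fact:
    brand = item_dim[item_key]["brand"]
    dollars_by_brand[brand] = dollars_by_brand.get(brand, 0.0) + dollars

print(dollars_by_brand)  # {'Acme': 1050.0, 'Orbit': 900.0}
```

In a real warehouse this join-and-aggregate would be a SQL query over the star schema; the point here is only how the measures live in the fact table and the descriptive attributes in the dimension table.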
Student Note:
Fig: star schema
The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions
are normalized into multiple related tables, whereas the star schema's dimensions are denormalized
with each dimension represented by a single table. A complex snowflake shape emerges when the
dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the
child tables have multiple parent tables ("forks in the road").
The snowflake schema is a variant of the star schema model, where some of the
dimension tables are normalized, thereby further splitting the data into additional
tables.
Student Note:
The primary disadvantage of the snowflake schema is that the additional levels of
attribute normalization add complexity to source query joins, when compared to
the star schema.
Snowflake schemas, in contrast to flat single-table dimensions, have been heavily
criticized. Their goal is an efficient and compact storage of normalized
data, but this comes at the significant cost of poor performance when browsing the
joins required by the dimension. This disadvantage may have been reduced in the years
since it was first recognized, owing to better query performance within the browsing tools.
Student Note:
This mapping forms a concept hierarchy for the dimension ‘location’, mapping a
set of low-level concepts (e.g. cities) to higher-level concepts (e.g. countries).
These attributes are related by a total order, forming a concept hierarchy such as
street < city < province or state < country.
Fig: concept hierarchy for location (country at the top, then province or state,
city, and street).
Lattice
The attributes of a dimension may instead be organized in a partial order, forming a
lattice. For the (time) dimension, based on the attributes day, week, month, quarter
and year: day < {month < quarter; week} < year.
3.4. Starnet Query Model
A starnet query model for querying multidimensional databases consists of radial lines
emanating from a central point. Each line represents a concept hierarchy for a dimension.
Each abstraction level in the hierarchy is called a footprint. These footprints represent
the granularities available for use by OLAP operations such as drill-down and roll-up.
Fig: Modeling business queries: a starnet model.
In the figure there are four radial lines, representing the concept hierarchies of the
dimensions location, customer, item, and time. Each line consists of footprints
representing the abstraction levels of the dimension.
E.g. the footprints of the time dimension are day, month, quarter, and year. A concept
hierarchy may involve a single attribute or several attributes.
In order to examine item sales, a user can roll up along the “time” dimension from
month to quarter.
The typical workload in a data warehouse is especially I/O intensive, with operations
such as large data loads and index builds, creation of materialized views, and queries
over large volumes of data. The underlying I/O system for a data warehouse should be
designed to meet these heavy requirements.
The I/O configuration used by a data warehouse will depend on the characteristics of
the specific storage and server capabilities, so the material in this chapter is only
intended to provide guidelines for designing and tuning an I/O system.
Storage configurations for a data warehouse should be chosen based on the I/O
bandwidth they can provide, and not necessarily on their overall storage capacity.
Buying storage based solely on capacity has the potential for making a mistake,
especially for systems less than 500 GB in total size. The capacity of individual
disk drives is growing faster than the I/O throughput rates provided by those disks,
leading to a situation in which a small number of disks can store a large volume
of data. E.g. consider a 200 GB data mart using 72 GB drives: this data mart could
be built with as few as six drives in a fully mirrored environment. However, six
drives might not provide enough I/O bandwidth to handle a minimum number of
concurrent users on a 4-CPU server. Thus, even though six drives provide
sufficient storage, a larger number of drives may be required to provide acceptable
performance on this system.
Use Redundancy
Because data warehouses are often the largest database systems in a company,
they have the most disks and thus are also the most susceptible to the failure of
a single disk. Therefore, disk redundancy is a requirement for data warehouses to
protect against hardware failure. Like disk striping, redundancy can be achieved
in many ways using software or hardware.
The most important time to examine and tune the I/O system is before the
database is even created. Once the database files are created, it is more difficult
to reconfigure them. Some logical volume managers may support dynamic
reconfiguration of files, while other storage configurations may require that files
be entirely rebuilt in order to reconfigure the I/O layout; in both cases,
considerable system resources must be devoted to this reconfiguration.
The data warehouse designer should also plan for future growth of the data warehouse.
There are many approaches to handling growth in the system, and the key consideration
is to be able to grow the I/O system without compromising on I/O bandwidth.
3.6. Index
A database index is a data structure that improves the speed of data retrieval
operations on a database table, at the cost of additional writes and storage space to
maintain the index data structure.
Indexes are used to quickly locate data without having to search every row in a database
table every time the table is accessed.
Indexes can be created using one or more columns of a database table, providing the
basis for both rapid random lookups and efficient access of ordered records.
An index is a copy of selected columns of data from a table that can be searched very
efficiently, and that also includes a low-level disk block address or direct link to the
complete row of data it was copied from.
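A dictionary-based sketch of the idea: the index copies one column's values and maps them to row positions, so a lookup avoids scanning every row (the table rows below are made-up):

```python
# A tiny "table" as a list of rows; the row position stands in for the
# low-level row address a real index would store.
table = [
    {"id": 10, "city": "Pokhara"},
    {"id": 11, "city": "Kathmandu"},
    {"id": 12, "city": "Pokhara"},
]

# Build an index on the "city" column: value -> list of row positions.
city_index = {}
for pos, row in enumerate(table):
    city_index.setdefault(row["city"], []).append(pos)

# A lookup is one dictionary probe instead of a full table scan.
rows = [table[pos] for pos in city_index.get("Pokhara", [])]
print([r["id"] for r in rows])  # [10, 12]
```

The trade-off the text describes is visible even here: every insert into the table must also update city_index (extra writes and storage) so that reads stay fast.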
Types of Index:
Bitmap index
A bitmap index is a special kind of indexing that stores the bulk of its data as bit
arrays (bitmaps) and answers most queries by performing bitwise logical
operations on these bitmaps. The most commonly used indexes, such as B+ trees,
are most efficient if the values they index do not repeat or repeat a small number of
times. In contrast, the bitmap index is designed for cases where the values of a
variable repeat very frequently. For example, the sex field in a customer database
usually contains at most three distinct values: male, female or unknown (not
recorded). For such variables, the bitmap index can have a significant performance
advantage over the commonly used trees.
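A minimal bitmap-index sketch, using Python integers to stand in for the bit arrays (the column data is made-up):

```python
# One bitmap per distinct value: bit r is set if row r holds that value.
column = ["male", "female", "female", "male", "unknown", "female"]

bitmaps = {}
for row, value in enumerate(column):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row)

# Query "female OR unknown" with a single bitwise operation on the bitmaps.
mask = bitmaps["female"] | bitmaps["unknown"]
matching_rows = [row for row in range(len(column)) if mask >> row & 1]
print(matching_rows)  # [1, 2, 4, 5]
```

With only a handful of distinct values, each query touches a few compact bitmaps rather than a tree per value, which is where the performance advantage over B+ trees comes from.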
Dense index
A dense index in databases is a file with pairs of keys and pointers for
every record in the data file. Every key in this file is associated with a particular
pointer to a record in the sorted data file. In clustered indices with duplicate keys,
the dense index points to the first record with that key.[3]
Sparse index
A sparse index in databases is a file with pairs of keys and pointers for
every block in the data file. Every key in this file is associated with a particular
pointer to the block in the sorted data file. In clustered indices with duplicate keys,
the sparse index points to the lowest search key in each block.
Reverse index
A reverse key index reverses the key value before entering it in the index. E.g., the
value 24538 becomes 83542 in the index. Reversing the key value is particularly
useful for indexing data such as sequence numbers, where new key values
monotonically increase.
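A quick sketch of key reversal (the key values are arbitrary):

```python
# Reverse the digits of a key before indexing it. Consecutive keys such
# as 1001, 1002, 1003 then differ in their leading character, spreading
# the inserts across the index instead of packing them into one block.
def reverse_key(key: int) -> str:
    return str(key)[::-1]

print(reverse_key(24538))  # 83542
print([reverse_key(k) for k in (1001, 1002, 1003)])
```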
Student Note:
In most cases, an index is used to quickly locate the data record(s) from which
the required data is read. In other words, the index is only used to locate data
records in the table and not to return data.
Index architecture / indexing methods

Non-clustered:
The physical order of the rows is not the same as the index order.
The indexed columns are typically non-primary-key columns used in JOIN,
WHERE and ORDER BY clauses.
There can be more than one non-clustered index on a database table.
Clustered:
Clustering alters the data blocks into a certain distinct order to match the index,
resulting in the row data being stored in order. Therefore, only one clustered index
can be created on a given database table. Clustered indices can greatly increase
overall speed of retrieval, but usually only where the data is accessed sequentially in
the same or reverse order of the clustered index, or when a range of items is selected.
Cluster:
When multiple databases and multiple tables are joined, it is referred to as a
cluster. The records for the tables sharing the value of a cluster key are stored
together in the same or nearby data blocks. This may improve the joins of these
tables on the cluster key, since the matching records are stored together and less
I/O is required to locate them. A cluster can be keyed with a B-tree index or a
hash table. The data block in which a table record is stored is determined by the
value of its cluster key.
3.7. Materialized view
Typically, data flows from one or more OLTP databases into a data warehouse on a
monthly, weekly or daily basis. The data is normally processed in a staging file
before being added to the data warehouse. Data warehouses commonly range in size
from tens of gigabytes to a few terabytes. Usually, the vast majority of the data is
stored in a few very large fact tables.
One technique employed in data warehouses to improve performance is
the creation of summaries. Summaries are a special kind of aggregate view that
improves query execution time by precalculating expensive joins and aggregation
operations prior to execution and storing the results in a table in the database.
For example, we can create a summary table to contain the sum of sales by region and by product.
The summaries or aggregates that are referred to in books and literature on data
warehousing are created in Oracle using a schema object called a materialized view.
In a data warehouse we can use materialized views to precompute and store aggregated
data, such as the sum of sales. Materialized views in these environments are often
referred to as summaries, because they store summarized data.
They can also be used to precompute joins, with or without aggregation,
to increase the speed of queries on very large databases. Queries against large
databases often involve joins between tables, aggregations such as SUM, or both.
These operations are expensive in terms of time and processing power.
The type of materialized view we create determines how the materialized view is
refreshed and how it is used by query rewrite.
We can use almost identical syntax to perform a number of roles.
E.g., a materialized view can replicate data, a process formerly achieved by using
CREATE SNAPSHOT (CREATE MATERIALIZED VIEW is a synonym for CREATE SNAPSHOT).
Materialized views improve query performance by precalculating expensive join and
aggregation operations on the database prior to execution and storing the results
in the database.
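A minimal sketch of the summary idea using Python's sqlite3. SQLite has no CREATE MATERIALIZED VIEW, so a summary table built with CREATE TABLE ... AS SELECT stands in for one here; the sales schema and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", "tv", 100.0), ("east", "tv", 150.0),
    ("west", "tv", 200.0), ("east", "radio", 50.0),
])

# Precompute the expensive aggregation once and store the result,
# the way a summary/materialized view would.
conn.execute("""CREATE TABLE sales_summary AS
                SELECT region, product, SUM(amount) AS total
                FROM sales GROUP BY region, product""")

total = conn.execute(
    "SELECT total FROM sales_summary WHERE region = 'east' AND product = 'tv'"
).fetchone()[0]
print(total)
```

Unlike a real materialized view, this summary table is not refreshed automatically when the base table changes; in Oracle the refresh policy is part of the materialized view definition.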
4.1 Introduction
Before we create an architecture for a data warehouse, we must first understand the
major processes that constitute a data warehouse.
The majority of data extraction comes from unstructured data sources and different
data formats. This data can be in many forms, such as tables, indexes, and
analytics.
During extraction, the desired data is identified and extracted from many different
sources, including database systems and applications. Very often it is not possible to
identify the specific subset of interest, so more data than necessary has to be
extracted.
The size of the extracted data varies from hundreds of kilobytes up to gigabytes,
depending on the source system and the business situation.
The mechanisms that determine when to start extracting the data, run the
transformations and consistency checks, and so on, are very important.
E.g., it may be inappropriate to start the process that extracts EPOS transactions for
a retail sales analysis data warehouse until all EPOS transactions have been received
from all stores.
Data should be in a consistent state when it is extracted from the source system.
Source data should be extracted only at a point where it represents the same instance
of time as the extracts from the other data source.
Update notification: if the source system is able to provide a notification that a
record has been changed and describe the change, this is the easiest way to get the data.
Incremental extract: some systems may not be able to provide notification that an
update has occurred, but they are able to identify which records have been modified
and provide an extract of just those records. During further ETL processing, the
system needs to identify the changes and propagate them down.
Full extract: some systems are not able to identify which data has been changed at
all, so a full extract is the only way to get the data out of the system. A full
extract requires keeping a copy of the last extract in the same format in order to be
able to identify changes. A full extract handles deletions as well.
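The bookkeeping behind a full extract, diffing the copy of the last extract against the current one to recover inserts, updates, and deletions, can be sketched as follows; the record keys and values are hypothetical.

```python
# Snapshot kept from the previous run, and the freshly pulled extract.
last_extract = {1: "alice", 2: "bob", 3: "carol"}
current_extract = {1: "alice", 2: "robert", 4: "dave"}

# Keys present only in the new extract were inserted at the source.
inserted = sorted(current_extract.keys() - last_extract.keys())
# Keys present only in the old extract were deleted at the source.
deleted = sorted(last_extract.keys() - current_extract.keys())
# Keys in both, with differing values, were updated.
updated = sorted(k for k in current_extract.keys() & last_extract.keys()
                 if current_extract[k] != last_extract[k])

print(inserted, deleted, updated)
```

This is why a full extract, despite its cost, is the only strategy that can detect deletions: an incremental extract never sees the rows that vanished.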
The most common method for transporting data is the transfer of flat files: data is
exported from the source system into flat files and then transported to the target
platform using FTP or a similar mechanism.
Distributed queries can be an effective mechanism for extracting data. They also
transport the data directly to the target system, thus providing both extraction and
transportation in a single step. As opposed to flat-file transportation, the success
or failure of the transportation is recognized immediately with the result of the
distributed query or transaction.
Oracle transportable tablespaces are the fastest way to move large volumes of data
between two Oracle databases.
Prior to the introduction of transportable tablespaces, the most scalable data
transportation mechanisms relied on moving flat files containing raw data. This
mechanism required that data be unloaded or exported into files from the source
database; after transportation, these files were loaded or imported into the target
database. Transportable tablespaces entirely bypass the unload and reload steps:
Oracle data files can be transported directly from one database to another.
Transformation can also be described as the process that takes the loaded data and
structures it for query performance and for minimizing operational cost.
Before the transformation of data takes place, the data needs to be cleaned and
checked in the following ways.
A) CLEANING:
I. Make sure data is consistent within itself: when we take a row of data and
examine it, the content of the row must make sense. Errors at this point are
mainly to do with errors in the source system. Typical checks are for
nonsensical phone numbers, addresses, and so on.
II. Make sure data is consistent with other data within the same source: when
we examine the data against other tables within the same source, the data
must make sense, e.g. checking for the existence of the stock-keeping units /
customers in a transaction by comparing them with the list of valid SKUs /
customers.
III. Make sure data is consistent with the same data in other source systems:
this is when we examine a record and compare it with the similar record in a
different source system, e.g. reconciling a customer record with a copy in a
customer database and a copy in a customer event database. These checks
are the most complex and are likely to result in the application of complex
business rules to resolve any discrepancies (inconsistencies, differences).
IV. Make sure data is consistent with the information already in the
warehouse: this is when we ensure that any data being loaded does not
contradict the information already within the data warehouse, e.g. updated
info about the product hierarchy; the changes need to be controlled
carefully, so as not to render meaningless any of the existing information
already in the data warehouse.
Smoothing:
o Smoothing is the process of removing noise from the data; such
techniques include clustering and regression.
Aggregation:
o In this process, low-level or primitive (raw) data are replaced by higher-
level concepts through the use of a concept hierarchy, e.g. a categorical
attribute like street can be generalized to a higher-level concept like city or country.
Normalization: the attribute data are scaled so as to fall within a small specified
range, such as
-1.0 to 1.0
0.0 to 1.0
Min-max normalization: suppose that minA and maxA are the minimum and maximum values
of attribute A. Min-max normalization maps a value v of A to v' in the range
[new_minA, new_maxA] by computing
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
Example: suppose that the minimum and maximum values of the attribute income are
$12,000 and $98,000, and we would like to map income to the range [0.0, 1.0]. A value
v of income is then transformed to (v − 12000) / (98000 − 12000).
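A short Python sketch of min-max normalization, using the income bounds $12,000 and $98,000 from the example; the sample value 73,600 is illustrative, not from the notes.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Scale a value v of attribute A into the range [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Map a hypothetical income of $73,600 from [12000, 98000] into [0.0, 1.0].
result = round(min_max_normalize(73600, 12000, 98000), 3)
print(result)
```

Because the formula subtracts the minimum and divides by the spread, the smallest value always maps to new_min and the largest to new_max.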
4.6 Loading
Before any transformation can occur within the database, the raw data
must become accessible to the database. This step is called loading.
Fig: loading
SQL*Loader is used to move data from flat files into an Oracle data
warehouse.
OCI (Oracle Call Interface) and the direct path API are frequently used when the
transformation and computation are done outside the database and there is no
need for flat-file staging.
Most data warehouses are loaded on a regular schedule: every night, every
week, or every month, new data is brought into the data warehouse. The
data being loaded at the end of the week or month typically corresponds to
the transactions for that week or month. In this scenario the data warehouse is
being loaded by time, which suggests that the data warehouse tables
should be partitioned on a date column. In the data warehouse example,
suppose the new data is loaded into the sales table every month, and the
sales table has been partitioned by month. The following steps show how the load
process proceeds to add the data for a new month to the sales table.
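The flat-file load step can be sketched with the Python standard library (csv plus sqlite3) standing in for SQL*Loader; the file contents, table, and column names below are invented for illustration.

```python
import csv
import io
import sqlite3

# A stand-in for a staged flat file of monthly sales rows.
flat_file = io.StringIO("2024-01,tv,100\n2024-01,radio,40\n2024-02,tv,90\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, product TEXT, amount REAL)")

# Bulk-load the parsed rows, as a loader utility would.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 ((m, p, float(a)) for m, p, a in csv.reader(flat_file)))

n = conn.execute("SELECT COUNT(*) FROM sales WHERE month = '2024-01'").fetchone()[0]
print(n)
```

Keying every row by a month column is what makes the partition-by-date scheme described above workable: each monthly load touches only one partition.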
5 Introduction
The major components of any data mining system are data source, data warehouse
server, data mining engine, pattern evaluation module, graphical user interface and
knowledge base
Databases, data warehouses, the World Wide Web (WWW), text files, and other documents
are the actual sources of data. You need large volumes of historical data for data mining
to be successful. Organizations usually store data in databases or data warehouses.
Data warehouses may contain one or more databases, text files, spreadsheets or other
kinds of information repositories. Sometimes, data may reside even in plain text files or
spreadsheets. World Wide Web or the Internet is another big source of data.
Different Processes
The data needs to be cleaned, integrated and selected before passing it to the database
or data warehouse server. As the data is from different sources and in different formats,
it cannot be used directly for the data mining process because the data might not be
complete and reliable. So, first data needs to be cleaned and integrated. Again, more
data than required will be collected from different data sources and only the data of
interest needs to be selected and passed to the server. These processes are not as
simple as we think. A number of techniques may be performed on the data as part of
cleaning, integration and selection.
Filling in missing values: data is not always available. Missing data may be caused
by equipment malfunction, may have been deleted because it was inconsistent with
other data, or may not have been entered due to misunderstanding. So we need to
handle missing data; it is handled in the following ways.
Integration-
Integration combines data from multiple sources into a coherent store. It merges the
data from multiple data sources.
The database or data warehouse server contains the actual data that is ready to be
processed. Hence, the server is responsible for retrieving the relevant data based on
the data mining request of the user.
The data mining engine is the core component of any data mining system. It consists of
a number of modules for performing data mining tasks including association,
classification, characterization, clustering, prediction, time-series analysis etc.
This basically involves following tasks:
Clustering: Clustering is the process of grouping abstract objects into classes
of similar objects. A cluster of objects can be treated as one group.
Deviation analysis: Deviation analysis is a technique that detects data departing
from expected or normal behavior, so that a machine or process can be brought back
online quickly when a deviation occurs.
The pattern evaluation module is mainly responsible for the measure of interestingness
of the pattern by using a threshold value. It interacts with the data mining engine to
focus the search towards interesting patterns.
The graphical user interface module communicates between the user and the data
mining system. This module helps the user use the system easily and efficiently without
knowing the real complexity behind the process. When the user specifies a query or a
task, this module interacts with the data mining system and displays the result in an
easily understandable manner.
f) Knowledge Base
The knowledge base is helpful in the whole data mining process. It is useful for guiding
the search or evaluating the interestingness of the result patterns. The knowledge base
might even contain user beliefs and data from user experiences that can be useful in
the process of data mining. The data mining engine might get inputs from the
knowledge base to make the result more accurate and reliable. The pattern evaluation
module interacts with the knowledge base on a regular basis to get inputs and also to
update it.
Summary
Each and every component of data mining system has its own role and importance in
completing data mining efficiently. These different modules need to interact correctly
with each other in order to complete the complex process of data mining successfully.
A good system architecture will enable the system to make the best use of the software
environment and accomplish data mining tasks in an efficient and timely manner.
1. No-coupling
2. Loose coupling
3. Semi tight coupling
4. Tight coupling
5.3 Data Warehouse Architecture
The three tier architecture of data warehouse can be explained using Bottom tier,
middle tier and top tier.
Bottom tier-
The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities or Gateways are used to feed data into
the bottom tier from operational databases or other external sources (such as customer
profile information provided by external consultants). These tools and utilities perform
data extraction, cleaning, and transformation. The data are extracted using application
program interfaces known as gateways. A gateway is supported by the underlying DBMS
and allows client programs to generate SQL code to be executed at the server.
Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indexes
and partitions
Refresh: propagate the updates from the data sources to the warehouse
Metadata Repository
Metadata is the data defining warehouse objects. It includes the following kinds:
Description of the structure of the warehouse: schema, views, dimensions, hierarchies,
derived data definitions, data mart locations and contents
Operational metadata: data lineage (history of migrated data and transformation
paths), currency of data (active, archived, or purged), monitoring information
(warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from the operational environment to the data warehouse
Data related to system performance: warehouse schema, view and derived data
definitions
Middle tier-
The middle tier is an OLAP server that is typically implemented using either
(i) A relational OLAP (ROLAP) model that is an extended relational DBMS that
maps operations on multidimensional data to standard relational operations.
(ii) A multidimensional OLAP (MOLAP) model, that is, a special-purpose server that
directly implements multidimensional data and operations.
Top Tier-
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
An enterprise warehouse collects all the information about subjects spanning the entire
organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. It requires
extensive modeling and may take years to design and build.
A data mart contains a subset of corporate-wide data that is of value to a specific
group of users.
A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual
warehouse is easy to build but requires excess capacity on operational database servers.
5.5.1 Types of OLAP Servers
Logically, OLAP servers present business users with multidimensional data from data
warehouses or data marts, without concerns regarding how or where the data are stored.
However, the physical architecture and implementation of OLAP servers must consider
data storage issues.
5.5.1.1 ROLAP
These are intermediate servers that stand between a relational back-end server and
client front-end tools. They use a relational or extended-relational DBMS to store and
manage warehouse data. For example, the DSS server of Microstrategy and Metacube of
Informix adopt ROLAP technology.
5.5.1.2 MOLAP
These servers support multidimensional views of data through array-based
multidimensional storage engines.
If the data is stored in a relational database, it can be viewed multidimensionally,
but only by successively accessing and processing a table for each dimension and
aspect of the data. MOLAP instead processes data that is already stored in a
multidimensional array, in which all possible combinations are reflected.
The advantage of using a data cube is that it allows fast indexing to precomputed
summarized data. Notice that with multidimensional data stores, the storage
utilization may be low if the data set is sparse.
Many MOLAP servers adopt a two-level storage representation to handle sparse and
dense data sets. The dense subcubes are identified and stored as array structures,
while the sparse subcubes employ compression technology for efficient storage utilization.
5.5.1.3 HOLAP
The hybrid OLAP server approach combines ROLAP and MOLAP technology, benefiting
from the greater scalability of ROLAP and the faster computation of MOLAP.
E.g., the Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.
5.5.1.4 COMPARISON CHART

BASIS FOR COMPARISON   ROLAP                                MOLAP
Storage & fetching     Data is stored in and fetched from   Data is stored in and fetched from
                       the main data warehouse.             a proprietary multidimensional
                                                            database (MDDB).
Data form              Data is stored in the form of        Data is stored in large
                       relational tables.                   multidimensional arrays made of
                                                            data cubes.
Roll-up: aggregates data by climbing up a concept hierarchy or by dimension reduction.
Drill-down: the reverse of roll-up; navigates from less detailed data to more detailed data.
Dice: selects a subcube by choosing values on two or more dimensions.
Slice: selects a single value on one dimension, giving a subcube with one dimension fewer.
Pivot: rotates the data axes to provide an alternative presentation of the data.
Fig: OLAP operation on multidimensional cube
In the field of data mining, substantial research has been performed on mining at
various platforms, including transaction databases, relational databases, spatial
databases, text databases, time-series databases, flat files, data warehouses, and
so on.
The integration of OLAP with data mining is called OLAP mining, or OLAM. The
architecture of OLAM is particularly important for the following reasons.
High quality of data in data warehouse:
o Most data mining tools need to work on integrated, consistent and cleaned
data, which requires costly data cleaning, data transformation and data
integration as preprocessing steps. A data warehouse constructed by such
preprocessing serves as valuable sources of high quality data for OLAP
as well as for data mining.
Available information processing infrastructure surrounding data warehouses:
o Comprehensive information processing and data analysis infrastructures
have been or will be systematically constructed surrounding data
warehouses, including the accessing, integration, consolidation and
transformation of multiple heterogeneous databases.
OLAP-based exploratory data analysis:
o Effective data mining needs exploratory data analysis. A user will often
want to traverse through a database, select portions of relevant data,
analyze them at different granularities, and present knowledge/results in
different forms. On-line analytical mining provides facilities for data mining
on different subsets of data and at different levels of abstraction, by drilling,
pivoting, filtering, dicing and slicing on a data cube and on some
intermediate data mining results.
6
Data Mining Approaches and Methods
6. Introduction
6.1. Data mining techniques
6.2. Data mining tasks
6.3. Classification
6.4. Prediction
6.5. Decision tree
6.6. Rule based classification
6.7. Back propagation
6.8. Genetic algorithm
6.9. Regression
6.9.1. Linear regression
6.9.2. Non-Linear regression
6.10. Association rules and mining frequent patterns
6.11. Clustering
6.11.1. Partitioning method
6.11.1.1. K mean
6.11.1.2. K medoids
6.11.2. Hierarchical method
6.11.2.1. Agglomerative
6.11.3. Divisive
6. Introduction
6.1. Data mining techniques
Association – Association is one of the widely-known data mining techniques. Under
this, a pattern is deciphered based on a relationship between items in the same
transaction. Hence, it is also known as relation technique. Big brand retailers rely on this
technique to research customer’s buying habits/preferences. For example, when
tracking people’s buying habits, retailers might identify that a customer always buys
cream when they buy chocolates, and therefore suggest that the next time that they buy
chocolates they might also want to buy cream.
Data mining is not so much a single technique as the idea that there is more
knowledge hidden in the data than shows itself on the surface. Any technique that
helps extract more out of data is useful, so data mining techniques form quite a
heterogeneous group.
Association and correlation are usually used to find frequent itemsets among large
data sets. This type of finding helps businesses make certain decisions, such as
catalogue design, cross-marketing, and customer shopping behavior analysis.
Association rule algorithms need to be able to generate rules with confidence values
less than one.
Classification – This data mining technique differs from the above in a way that it is
based on machine learning and uses mathematical techniques such as Linear
programming, Decision trees, Neural network. In classification, companies try to build
software that can learn how to classify the data items into groups. For instance, a
company can define a classification in the application that “given all records of
employees who offered to resign from the company, predict the number of individuals
who are likely to resign from the company in future.” Under such a scenario, the
company can classify the records of employees into two groups, namely “leave” and
“stay”. It can then use its data mining software to classify the employees into the
separate groups created earlier.
Classification is the most commonly applied data mining technique; it employs a
set of pre-classified examples to develop a model that can classify the population of
records at large.
Fraud detection and credit-risk applications are particularly well suited to this type
of analysis. This approach frequently employs decision-tree or neural-network-based
classification algorithms.
Clustering – Different objects exhibiting similar characteristics are grouped together in
a single cluster via automation. Many such clusters are created as classes and objects
(with similar characteristics) are placed in it accordingly. To understand this better, let us
consider an example of book management in the library. In a library, the vast collection
of books is fully cataloged. Items of the same type are listed together. This makes it
easier for us to find a book of our interest. Similarly, by using the clustering technique,
we can keep books that have some kinds of similarities in one cluster and assign it a
suitable name. So, if a reader is looking to grab a book relevant to his interest, he only
has to go to that shelf instead of searching the entire library. Thus, clustering technique
defines the classes and puts objects in each class, while in the classification
techniques, objects are assigned into predefined classes.
By using clustering techniques we can further identify dense and sparse regions in
object space and can discover overall distribution pattern and correlation among data
attributes.
Types of clustering:
Partitioning methods
Hierarchical methods (agglomerative, divisive)
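As a sketch of the partitioning approach, here is a minimal one-dimensional k-means loop (k = 2); the data points and starting centers are made up for illustration.

```python
# Two visibly separated groups of one-dimensional points.
data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centers = [1.0, 9.0]  # initial cluster centers

for _ in range(10):  # a few refinement passes suffice for this tiny data set
    clusters = [[], []]
    for x in data:
        # Assign each point to its nearest center.
        nearest = min(range(2), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    # Recompute each center as the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print(centers)
```

The alternation of assignment and mean-update is the core of all partitioning methods; k-medoids differs only in choosing an actual data point as each cluster's representative.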
Prediction – The prediction is a data mining technique that is often used in combination
with the other data mining techniques. It involves analyzing trends, classification,
pattern matching, and relation. By analyzing past events or instances in a proper
sequence, one can safely predict a future event. For instance, the prediction analysis
technique can be used in sales to predict future profit, if sales are chosen as the
independent variable and profit as the variable dependent on sales. Then, based on the
historical sale and profit data, one can draw a fitted regression curve that is used for
profit prediction.
Decision trees – Within the decision tree, we start with a simple question that has
multiple answers. Each answer leads to a further question to help classify or identify the
data so that it can be categorized, or so that a prediction can be made based on each
answer. For example, we can use the following decision tree to determine whether or
not to play a cricket ODI: starting at the root node, if the weather forecast predicts
rain, we should avoid the match for the day. Alternatively, if the weather forecast is
clear, we should play the match.
Data Mining is at the heart of analytics efforts across a variety of industries and
disciplines like communications, Insurance, Education, Manufacturing, Banking and
Retail and more. Therefore, having correct information about it is essential before
applying the different techniques.
Regression: Regression techniques can be adapted for prediction. Regression
analysis can be used to model the relationship between one or more independent
variables and a dependent variable (independent variables are attributes already known,
and the response variable is what we want to predict). Unfortunately, many real-world
problems are not simple prediction; for instance, sales volumes, stock prices, and
product failure rates are all difficult to predict because they may depend on complex
interactions among multiple predictor variables.
A neural network is a set of connected input/output units in which each connection has
a weight associated with it. During the learning phase, the network learns by adjusting
the weights so as to be able to predict the correct class of the input tuples.
It has remarkable ability to derive meaning from complicated or imprecise data and can
be used to extract patterns and detect trends that are too complex to be noticed by
either human or other computer techniques.
E.g. handwriting character recognition, training a computer to pronounce English
text.
The tree represents the concept buys_computer; that is, it predicts whether or not a
customer at the electronics company is likely to purchase a computer. Internal nodes
are represented by rectangles and leaf nodes are denoted by ovals.
In order to classify an unknown sample, the attributes values of the sample are tested
against the decision tree. A path is traced from the root to a leaf node that holds the
class prediction for that sample. Decision tree can easily be converted to classification
rules.
Decision Tree Induction
The basic algorithm for decision tree induction is a greedy algorithm that constructs
decision trees in a top-down, recursive, divide-and-conquer manner.
Create a node N;
If samples are all of the same class C, then
    Return N as a leaf node labeled with class C;
If attribute-list is empty then
    Return N as a leaf node labeled with the most common class in samples;
Select test-attribute, the attribute among attribute-list with the highest information
gain;
Label node N with test-attribute;
For each known value ai of test-attribute:
    Grow a branch from node N for the condition test-attribute = ai;
    Let si be the set of samples in samples for which test-attribute = ai;
    If si is empty then
        Attach a leaf labeled with the most common class in samples;
    Else
        Attach the node returned by Generate-decision-tree(si, attribute-list minus
        test-attribute);
Information Gain and Entropy
The information gain is used to select the test attribute at each node in the tree.
Such a measure is referred to as an attribute selection measure, or a measure of the
goodness of split. The attribute with the highest information gain is chosen as the
test attribute for the current node.
Let S be a set of s data samples. Suppose the class label attribute has m distinct
values, defining m distinct classes Ci (for i = 1, ..., m). Let si be the number of
samples of S in class Ci. The expected information needed to classify a given sample
is given by
I(s1, s2, ..., sm) = −Σ (i = 1 to m) pi log2(pi),
where pi = si/s is the probability that an arbitrary sample belongs to class Ci.
Exercise:
The class label attribute, buys_computer, has two distinct values (namely {yes, no});
therefore, there are two distinct classes (m = 2).
For the whole set: yes samples = 9, no samples = 5, so
I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940.
For a subset with yes samples = 2, no samples = 3:
I(2, 3) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971.
For a subset with yes samples = 4, no samples = 0, the subset is pure:
I(4, 0) = 0.
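The exercise values can be checked with a small Python helper implementing I(s1, ..., sm) = −Σ pi log2 pi over the class counts:

```python
import math

def info(*counts):
    """Expected information I(s1, ..., sm) = -sum(p_i * log2(p_i)), p_i = s_i / s."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

print(round(info(9, 5), 3))  # whole set: 9 yes, 5 no -> approx. 0.940
print(round(info(2, 3), 3))  # subset: 2 yes, 3 no -> approx. 0.971
print(info(4, 0))            # a pure subset contributes zero information
```

The `if c` guard skips empty classes, since the limit of p log2 p as p goes to 0 is 0 and log2(0) would otherwise raise an error.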
The back-propagation algorithm has been repeatedly rediscovered and is equivalent
to automatic differentiation in reverse accumulation mode.
Initialize the weights: the weights of the network are initialized to small random
numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias
associated with it, as explained below. The biases are similarly initialized to small
random numbers.
Propagate the inputs forward: in this step, the net input and output of each unit
in the hidden and output layers are computed. First, the training sample is fed
to the input layer of the network. Note that for unit j in the input layer, its
output is equal to its input, that is, Oj = Ij for input unit j. The net input to each
unit in the hidden and output layers is computed as a linear combination of its
inputs.
The inputs to the unit are, in fact, the outputs of the units connected to it in the
previous layer. To compute the net input to the unit, each input connected to
the unit is multiplied by its corresponding weight, and this is summed.
Given a unit j in a hidden or output layer, the net input Ij to unit j is
Ij = Σi wij Oi + θj,
where wij is the weight of the connection from unit i in the previous layer to unit j,
Oi is the output of unit i from the previous layer, and θj is the bias of unit j.
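The net-input formula, followed by the usual logistic squashing of the result, can be sketched for one unit; the weights, previous-layer outputs, and bias below are arbitrary illustrative numbers.

```python
import math

outputs_prev = [1.0, 0.5]   # O_i: outputs of units in the previous layer
weights = [0.2, -0.4]       # w_ij: weights of connections into unit j
bias = 0.1                  # theta_j: bias of unit j

# Net input: I_j = sum_i(w_ij * O_i) + theta_j
net_input = sum(w * o for w, o in zip(weights, outputs_prev)) + bias

# Output of unit j via the logistic (sigmoid) squashing function.
output = 1.0 / (1.0 + math.exp(-net_input))
print(round(net_input, 2), round(output, 3))
```

Back-propagation then compares such outputs against the target class and propagates the error backwards to adjust each wij and θj.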
6.2.1.4. Genetic algorithm
Genetic algorithms attempt to incorporate ideas of natural evolution. In general,
genetic learning starts as follows:
an initial population is created consisting of randomly generated rules. Each rule can
be represented by a string of bits.
Example: suppose that samples in a given training set are described by two Boolean
attributes, A1 and A2, and that there are two classes, C1 and C2.
The rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string 100, where the
two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit
represents the class.
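The bit-string encoding can be sketched as follows; the matching helper is a hypothetical illustration of how a rule string is checked against a sample (fitness evaluation, crossover, and mutation are omitted).

```python
# "IF A1 AND NOT A2 THEN C2" encoded as the bit string 100:
# leftmost two bits are the attribute conditions A1, A2; rightmost bit is the class.
rule = "100"

def matches(sample_a1, sample_a2, rule):
    """A sample satisfies the rule when its attribute bits equal the condition bits."""
    return rule[0] == str(sample_a1) and rule[1] == str(sample_a2)

print(matches(1, 0, rule))  # A1 true, A2 false: the rule fires
print(matches(1, 1, rule))  # A2 true: the rule does not fire
```

In a full genetic learner, a rule's fitness would be the fraction of training samples it classifies correctly, and fitter bit strings would be recombined via crossover and mutation.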
6.2.1.5.Regression
Introduction
Regression is a data mining function that predicts a number. Profit, sales, mortgage
rates, house values, square footage, temperature, or distance could all be predicted
using regression techniques. For example, a regression model could be used to predict
the value of a house based on location, number of rooms, lot size, and other factors.
A regression task begins with a data set in which the target values are known. For
example, a regression model that predicts house values could be developed based on
observed data for many houses over a period of time. In addition to the value, the data
might track the age of the house, square footage, number of rooms, taxes, school
district, proximity to shopping centers, and so on. House value would be the target, the
other attributes would be the predictors, and the data for each house would constitute a
case.
In the model build (training) process, a regression algorithm estimates the value of the
target as a function of the predictors for each case in the build data. These relationships
between predictors and target are summarized in a model, which can then be applied to
a different data set in which the target values are unknown.
Regression models are tested by computing various statistics that measure the
difference between the predicted values and the expected values. The historical data for
a regression project is typically divided into two data sets: one for building the model,
the other for testing the model.
Regression with a single predictor is the easiest to visualize. Simple linear regression with a single
predictor is shown in the figure.
Linear regression with a single predictor can be expressed with the following equation:
Y = α + βX
where α is a constant (the Y-intercept) and β is the regression coefficient (the slope).
These coefficients can be solved by the method of least squares, which minimizes the
error between the actual data and the estimated line. Given S samples or data points
of the form (x1, y1), (x2, y2), …, (xS, yS), the regression coefficients can be
estimated with the following equations:
β = Σ(i=1 to S) (xi − x̄)(yi − ȳ) / Σ(i=1 to S) (xi − x̄)²
α = ȳ − β x̄
where x̄ and ȳ are the means of the x and y values respectively.
Exercise:
Predict the salary of the graduates after 10 years after observing following data sets.
Salary data
X(years of experience) Y (Salary in 1000)
3 30
8 57
9 64
13 72
3 36
6 43
11 59
21 90
1 20
16 83
Solution:
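A minimal sketch of the least-squares solution for this exercise, in plain Python with no libraries. The numbers in the comments are approximate (rounded).

```python
def simple_linear_regression(points):
    # beta = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    # alpha = y_mean - beta * x_mean
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    beta = (sum((x - x_mean) * (y - y_mean) for x, y in points)
            / sum((x - x_mean) ** 2 for x, _ in points))
    alpha = y_mean - beta * x_mean
    return alpha, beta

# (years of experience, salary in $1000s) from the table above
salary = [(3, 30), (8, 57), (9, 64), (13, 72), (3, 36),
          (6, 43), (11, 59), (21, 90), (1, 20), (16, 83)]
alpha, beta = simple_linear_regression(salary)
prediction = alpha + beta * 10   # salary after 10 years, about 58.6 (in $1000s)
```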
6.2.1.7. Non-linear regression
While the linear equation has one basic form, non-linear equations can take many
different forms. An equation is non-linear if it contains terms other than a constant
and first-order terms of the predictors.
If the given response variable and predictor variable have a relationship that may be
modeled by a polynomial function, polynomial regression can be performed by adding
polynomial terms to the basic linear model:
Y = α + β1X + β2X² + β3X³
By applying transformations to the variables (X1 = X, X2 = X², X3 = X³), we can convert
the non-linear model into a linear one, which can then be solved by the method of least
squares.
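The transformation trick can be sketched as follows: build the substituted variables x, x², x³ and solve the resulting linear model by least squares via the normal equations. The sample data here is hypothetical, generated from a known cubic so the fit can be checked.

```python
def polynomial_design(xs, degree):
    # Variable substitution: x1 = x, x2 = x^2, x3 = x^3, ... turning the
    # polynomial model into a model that is linear in (x1, x2, x3).
    return [[x ** d for d in range(degree + 1)] for x in xs]

def least_squares(X, y):
    # Solve the normal equations (X^T X) c = X^T y with Gaussian elimination.
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for col in range(k):                       # forward elimination w/ pivoting
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, k):
            f = A[row][col] / A[col][col]
            A[row] = [a - f * c for a, c in zip(A[row], A[col])]
            b[row] -= f * b[col]
    coeffs = [0.0] * k                          # back substitution
    for row in range(k - 1, -1, -1):
        s = sum(A[row][j] * coeffs[j] for j in range(row + 1, k))
        coeffs[row] = (b[row] - s) / A[row][row]
    return coeffs

# Hypothetical data generated from y = 1 + 2x + 3x^2 + 0.5x^3
xs = [0, 1, 2, 3, 4, 5]
ys = [1 + 2 * x + 3 * x ** 2 + 0.5 * x ** 3 for x in xs]
coeffs = least_squares(polynomial_design(xs, 3), ys)   # recovers [1, 2, 3, 0.5]
```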
6.2.2.1. Market Basket Analysis
Introduction
Market Basket Analysis is a modeling technique based upon the theory that if you buy a
certain group of items, you are more (or less) likely to buy another group of items. For
example, if you are in a pub and you buy a pint of beer and don't buy a meal, you are
more likely to buy crisps at the same time than somebody who didn't buy beer.
The set of items a customer buys is referred to as an itemset and market basket
analysis seeks to find relationships between these item sets.
It studies customers' buying patterns and preferences to predict what they will prefer to
purchase along with the existing items in their cart
The algorithms for performing market basket analysis are fairly straightforward. The
complexities mainly arise in dealing with the large amounts of transaction data that
may be available.
A major difficulty is that a large number of the rules found may be trivial for anyone
familiar with the business. Although the volume of data has been reduced, we are still
asking the user to find a needle in a haystack. Requiring rules to have a high minimum
support level and high confidence level risks missing any exploitable result we might
have found. One partial solution to this problem is differential market basket analysis, as
described below.
How is it used?
In retailing, most purchases are bought on impulse. Market basket analysis gives clues
as to what a customer might have bought if the idea had occurred to them. As a first
step, therefore, market basket analysis can be used in deciding the location and
promotion of goods inside a store. If, as has been observed, purchasers of Barbie dolls
are more likely to buy candy, then high-margin candy can be placed near the
Barbie doll display. Customers who would have bought candy with their Barbie dolls had
they thought of it will now be suitably tempted.
But this is only the first level of analysis. Differential market basket analysis can find
interesting results and can also eliminate the problem of a potentially high volume of
trivial results.
If we observe that a rule holds in one store, but not in any other (or does not hold in one
store, but holds in all others), then we know that there is something interesting about
that store. Perhaps its clientele are different, or perhaps it has organized its displays in
a novel and more lucrative way. Investigating such differences may yield useful insights
which will improve company sales.
Based on the insights from market basket analysis you can organize your store to
increase revenues. Items that go along with each other should be placed near each
other to help consumers notice them. This will guide the way a store should be
organized to shoot for best revenues. With the help of this data you can eliminate the
guesswork while determining the optimal store layout.
Whether it is email, phone, social media or an offer by a direct salesman, market basket
analysis can improve the efficiency of all of them. By using data from MBA you can
suggest the next best product which a customer is likely to buy. Hence you will help
your customers with fruitful suggestions instead of annoying them with marketing blasts.
Based on the inputs from MBA you can also predict future purchases of customers over
a period of time. Using your initial sales data, you can predict which item would probably
fall short and maintain stocks in optimal quality. This will help you improve the
allocations of resources to different items of the inventory.
Recommendation engines are already used by some popular companies like Netflix,
Amazon, Facebook, etc. If you want to create an effective recommendation system for
your company then you will also need market basket analysis to efficiently maintain one.
MBA can be considered as the basis for creating a recommendation engine.
As we have seen, market basket analysis can help companies especially retailers, to
analyze buying behavior and predict their next purchase. If used effectively this can
significantly improve cross-selling and in turn, help you increase your customer’s
lifetime value.
6.2.2.2. Apriori Algorithm
Introduction:
The Apriori algorithm is an influential algorithm for mining frequent itemsets for
Boolean association rules. It uses a bottom-up approach in which frequent subsets are
extended one item at a time, i.e. the steps are candidate generation and testing the
groups of candidates against the data. It is designed to operate on databases
containing transactions, for example the collection of items bought by a customer or
the details of website visits.
Support and Confidence
Transaction Item sets
T1 X,Y,Z
T2 X,Z
T3 W
T4 X
T5 Y,Z
T6 A,B,Z
T7 X,Z,B
T8 X,Z,W
T9 A,X,Z
T10 Z,Y
Support:
Support shows the frequency of the pattern in the rule; for the rule X ⇒ Z it is the
percentage of transactions that contain both X and Z, i.e.
Support(X ⇒ Z) = (transactions containing both X and Z) / (total transactions)
= 5/10
= 0.5
Confidence:
Confidence shows the strength of the rule; it is the percentage of transactions
containing X that also contain Z, i.e.
Confidence(X ⇒ Z) = (transactions containing both X and Z) / (transactions containing X)
= 5/6
≈ 0.83
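These two measures can be computed directly from the transaction table above; a minimal sketch:

```python
# Transaction database from the table above
transactions = {
    'T1': {'X', 'Y', 'Z'}, 'T2': {'X', 'Z'}, 'T3': {'W'},
    'T4': {'X'},           'T5': {'Y', 'Z'}, 'T6': {'A', 'B', 'Z'},
    'T7': {'X', 'Z', 'B'}, 'T8': {'X', 'Z', 'W'},
    'T9': {'A', 'X', 'Z'}, 'T10': {'Z', 'Y'},
}

def support(itemset, db):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(antecedent, consequent, db):
    # support(antecedent U consequent) / support(antecedent)
    return support(antecedent | consequent, db) / support(antecedent, db)

sup_xz = support({'X', 'Z'}, transactions)         # 5/10 = 0.5
conf_x_z = confidence({'X'}, {'Z'}, transactions)  # 5/6, about 0.83
```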
Exercise
A database has four transactions. Let min_support = 60% and min_confidence = 80%.
Find all frequent itemsets using the Apriori algorithm.
Compare each itemset's support count with the minimum support count, which is
60% of 4 = 2.4.
Here we have listed only those itemsets whose support count ≥ 2.4 (i.e. at least 3):
L2
Item Set Support count
A,B 3
A,D 3
B,D 3
Solved.
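A minimal sketch of the level-wise Apriori search. The four transactions below are hypothetical, made up for illustration, since the exercise's own database is not reproduced above.

```python
def apriori(transactions, min_support):
    # Level-wise search: count candidate itemsets against the database, keep
    # those meeting min_support, and join survivors into larger candidates.
    # (Full Apriori also prunes candidates with an infrequent subset;
    # omitted here for brevity.)
    min_count = min_support * len(transactions)
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        keys = list(survivors)
        # Join step: unions of surviving k-itemsets form (k+1)-candidates
        candidates = list({a | b for a in keys for b in keys
                           if len(a | b) == len(a) + 1})
    return frequent

# Hypothetical four-transaction database, min_support = 60% (count >= 2.4)
db = [frozenset('ABD'), frozenset('ABD'), frozenset('ABCD'), frozenset('BC')]
freq = apriori(db, 0.6)   # e.g. {A,B}, {A,D}, {B,D} and {A,B,D} are frequent
```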
6.2.2.3.FP growth
Introduction
The FP-growth algorithm is an efficient and scalable method for mining the complete set
of frequent patterns by pattern-fragment growth, using an extended prefix-tree structure
for storing compressed and crucial information about frequent patterns, called a
frequent-pattern tree (FP-tree).
The FP-tree is constructed as follows. Create the root of the tree and scan the
database a second time. The items in each transaction are processed in the order of the
frequent-items list, and a branch is created for each transaction. When adding the
branch for a transaction, the count of each node along a common prefix is
incremented by 1. After constructing the tree, the mining proceeds as follows. Starting
from each frequent length-1 pattern, construct its conditional pattern base, then construct
its conditional FP-tree and perform mining recursively on that tree. The support of a
candidate (conditional) itemset is counted by traversing the tree: the sum of the count
values at the least frequent item's nodes gives the support value.
1. Scan the transaction database once, as in the Apriori algorithm, to find all the
frequent items and their support.
2. Sort the frequent items in descending order of their support, and create the root
of the tree.
3. Get the first transaction from the transaction database. Remove all non-frequent
items and list the remaining items according to the order of the sorted
frequent-items list.
4. Use the transaction to construct the first branch of the tree, with each node
corresponding to a frequent item and showing that item's frequency, which is 1 for
the first transaction.
5. Get the next transaction from the transaction database. Remove all non-frequent
items and list the remaining items according to the order of the sorted
frequent-items list.
6. Insert the transaction into the tree, reusing any common prefix that may appear
and increasing the item counts.
7. Continue with Step 5 until all transactions in the database are processed.
Exercise:
Find the conditional pattern base and conditional fp-tree from the following datasets.
Where min-support=3.
T-ID ITEM SET
1 F,A,C,D,G,M,P
2 A,B,C,F,L,M,O
3 B,F,H,O
4 B,K,C,P
5 A,F,C,L,P,M,N
Solution:
Step 1:
Separate all the items in column.
Item set Support count
A 3
B 3
C 4
D 1
F 4
G 1
K 1
L 2
M 3
N 1
O 2
P 3
Step 2:
Now, choose only those items whose support count is at least the min-support of 3.
Item set Support count
F 4
C 4
A 3
B 3
M 3
P 3
Step 3:
Create a pattern by comparing the item sets in the question with the items obtained in
Step 2 (i.e. F, C, A, B, M, P).
T-ID ITEM SET ORDERED ITEMS
1 F,A,C,D,G,M,P F,C,A,M,P
2 A,B,C,F,L,M,O F,C,A,B,M
3 B,F,H,O F,B
4 B,K,C,P C,B,P
5 A,F,C,L,P,M,N F,C,A,M,P
Count each item from the ordered item lists, i.e. F=4, C=4, A=3, B=3, M=3, P=3.
Item Conditional pattern base Conditional fp-tree
P (FCAM:2) (CB:1) (C:3)|P
M (FCA:2)(FCAB:1) (F:3)(C:3)(A:3)|M
B (FCA:1)(F:1)(C:1) EMPTY
A (FC:3) (F:3)(C:3)|A
C (F:3) (F:3)|C
F EMPTY EMPTY
LOGIC: in the conditional FP-tree, keep only those items which are common across the
conditional pattern base and meet min-support. (In (FCAM:2) and (CB:1), only C is
common, so only (C:3) is chosen.)
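The first two database scans of FP-growth, counting item frequencies and rewriting each transaction in frequent-items order, can be sketched for this dataset as follows. The tie order between equally frequent items (F before C, B before M) is taken directly from the table above rather than derived.

```python
from collections import Counter

transactions = [
    ['F', 'A', 'C', 'D', 'G', 'M', 'P'],
    ['A', 'B', 'C', 'F', 'L', 'M', 'O'],
    ['B', 'F', 'H', 'O'],
    ['B', 'K', 'C', 'P'],
    ['A', 'F', 'C', 'L', 'P', 'M', 'N'],
]
MIN_SUPPORT = 3

# First scan: count item frequencies and keep items meeting min-support
counts = Counter(item for t in transactions for item in t)
frequent = {i for i, n in counts.items() if n >= MIN_SUPPORT}

# The document's frequent-items order (descending support): F, C, A, B, M, P
order = ['F', 'C', 'A', 'B', 'M', 'P']
rank = {item: i for i, item in enumerate(order)}

# Second scan: drop infrequent items, sort the rest by the frequent-items order
ordered = [sorted((i for i in t if i in frequent), key=rank.get)
           for t in transactions]
```

Each list in `ordered` is one row of the "ORDERED ITEMS" column; inserting these rows into a prefix tree, in order, yields the FP-tree.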
6.2.3. Clustering
Background
When answering this, it is important to understand that data mining is part of data science.
Data mining focuses on using machine learning, pattern recognition and statistics to
discover patterns in data.
Clustering would fall into the machine learning / pattern recognition realm.
It is important to remember there are 2 types of machine learning algorithms:
1. Supervised Learning - These include machine learning algorithms that have variables
used as predictors and a variable to predict. The predictors are tied to the prediction
variable and are trained against that variable to make future predictions. Within
supervised learning there are two major types of algorithms:
a. Regression – These algorithms use the predictors to predict a quantitative variable
such as with a regression model.
b. Classification – These algorithms typically look to label data into categories. A
classic example would be sick or not sick in a medical study, but there can be numerous
category labels. Examples include logistic regression and random forest classification
models.
2. Unsupervised Learning – These algorithms have no variable to predict tied to the
data. Instead of having an output, the data only has an input which would be multiple
variables that describe the data. This is where clustering comes in.
Clustering is an unsupervised machine learning method that attempts to uncover the
natural groupings and statistical distributions of data. There are multiple clustering
methods such as K-means or Hierarchical Clustering. Often, a measure of distance
from point to point is used to find which category a point should belong to as with K-
means. Hierarchical clustering seeks to build up or break down sets of clusters based
on the input information. This allows the user to use the sets of clusters that best
accomplish their purpose. The algorithm will not name the groups it creates for you, but
it will show you where they are and then they can be named anything. Below is a really
simple example of clustering of 3 groups:
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group are more similar to each other than to those in other
groups.
The process by which objects are classified into a number of groups so that they are
as dissimilar as possible from one group to another, but as similar as possible within
each group. The attributes of the objects are allowed to determine which objects should
be grouped together.
Fig 1: Let us suppose the following are the delivery locations for pizza.
Fig 2: Let us locate the cluster centers randomly.
Application of Clustering
Medicine
On PET scans, cluster analysis can be used to differentiate between different types
of tissue in a three-dimensional image for many different purposes
Analysis of antimicrobial activity
Cluster analysis can be used to analyse patterns of antibiotic resistance, to classify
antimicrobial compounds according to their mechanism of action, to classify antibiotics
according to their antibacterial activity.
Computer science
Image segmentation
Clustering can be used to divide a digital image into distinct regions for border
detection or object recognition
Recommender systems
Recommender systems are designed to recommend new items based on a user's tastes.
They sometimes use clustering algorithms to predict a user's preferences based on the
preferences of other users in the user's cluster.
Anomaly detection
Anomalies/outliers are typically – be it explicitly or implicitly – defined with respect to
clustering structure in data.
Natural language processing
Clustering can be used to resolve lexical ambiguity
Social science
Crime analysis
Cluster analysis can be used to identify areas where there are greater incidences of
particular types of crime. By identifying these distinct areas or "hot spots" where a similar
crime has happened over a period of time, it is possible to manage law enforcement
resources more effectively.
Educational data mining
Cluster analysis is for example used to identify groups of schools or students with similar
properties.
K-Mean Clustering
K-means is one of the simplest unsupervised learning algorithms that solves the well-
known clustering problem. The procedure follows a simple and easy way to classify a
given data set through a certain number of clusters (assume K clusters) fixed a priori.
The main idea is to define K centers, one for each cluster. These centers should be
placed in a cunning way, because different locations cause different results; the better
choice is to place them as far away from each other as possible. The next step is to take
each point belonging to the given data set and associate it with the nearest center.
When no point is pending, the first step is completed and an early grouping is done. At
this point we need to recalculate K new centroids as barycenters of the clusters
resulting from the previous step. After we have these K new centroids, a new binding
has to be done between the same data set points and the nearest new center. A loop
has thus been generated. As a result of this loop, the K centers change their location
step by step until no more changes are made, or in other words, the centers do not
move any more.
1) Choose K initial cluster centres.
2) Calculate the distance between each data point and the cluster centres, e.g. (x − c)².
3) Assign each data point to the cluster centre whose distance from the data point is
the minimum of all the cluster centres.
4) Recalculate the new cluster centres using the mean of the data points.
5) Recalculate the distance between each data point and the newly obtained cluster centres.
6) If no data point was reassigned then stop; otherwise repeat from step 3.
Advantages
1. Relatively efficient.
2. Gives the best results when the data sets are distinct or well separated from each other.
Disadvantages
1. If the data contains two highly overlapping clusters, k-means will not be able to
resolve that there are two clusters.
2. Randomly choosing the cluster centres may not lead to a fruitful result.
3. Applicable only when the mean is defined, i.e. it fails for categorical data.
Exercise:
Ques: Apply K-mean clustering for the following data sets for two clusters
Sample X,Y
1 185,72
2 170,56
3 168,60
4 179,68
5 182,72
6 188,77
Solution:
Given Cluster=2
Let us choose data points as K1=(185,72) and K2=(170,56)
Step1:
Using the Euclidean distance measure = SQRT((X2 − X1)² + (Y2 − Y1)²)
For value: (168, 60)
Euclidean distance between (168, 60) and K1 (185, 72)
=20.80
Euclidean distance between (168, 60) and K2 (170, 56)
=4.472
Here distance((168,60), K1(185,72)) > distance((168,60), K2(170,56)), so the
data point lies in cluster K2.
Now the cluster centre K2 changes: take the mean of K2 and (168, 60),
i.e. new K2 = ((170+168)/2, (56+60)/2)
new K2 = (169, 58)
Conclusion:
K1: (185,72), (179,68), (182,72), (188,77)
K2: (170,56), (168,60)
Exercise
Apply K-mean clustering for the following data set into 2 cluster.
Datasets {2,4,10,12,3,20,30,11,25}
Solution:
Step 1
Let us assume two cluster K1=4, K2=11
K1=4, K2=11       d1   d2   d3   d4   d5   d6   d7   d8   d9
Data point         2    4   10   12    3   20   30   11   25
Dist(D1)           2    0    6    8    1   16   26    7   21
Dist(D2)           9    7    1    1    8    9   19    0   14
Cluster assigned  K1   K1   K2   K2   K1   K2   K2   K2   K2
Here: D1 = distance between K1 and Data Points (d1, d2, d3, d4, d5, d6, d7,
d8) simultaneously
D2= distance between K2 and Data Points (d1, d2, d3, d4, d5, d6, d7,
d8) simultaneously
Cluster Assign= if D1>D2 then cluster assign=K2 else K1
From above calculation,
Data which belongs with cluster K1= {2, 4, 3}
Data which belongs with cluster K2= {10, 12, 20, 11, 25, 30}
Now
Calculate new mean
K1 = (2+4+3)/3
New K1 = 3
K2 = (10+12+20+11+25+30)/6
New K2 = 18
Step 2
From step 1 New Cluster K1=3,K2=18
K1=3, K2=18       d1   d2   d3   d4   d5   d6   d7   d8   d9
Data point         2    4   10   12    3   20   30   11   25
Dist(D1)           1    1    7    9    0   17   27    8   22
Dist(D2)          16   14    8    6   15    2   12    7    7
Cluster assigned  K1   K1   K1   K2   K1   K2   K2   K2   K2
Here: D1 = distance between K1 and Data Points (d1, d2, d3, d4, d5, d6, d7,
d8) simultaneously
D2= distance between K2 and Data Points (d1, d2, d3, d4, d5, d6, d7,
d8) simultaneously
Cluster Assign= if D1>D2 then cluster assign=K2 else K1
From above calculation,
Data which belongs with cluster K1= {2, 4,10, 3}
Data which belongs with cluster K2= {12, 20, 11, 25, 30}
Now
Calculate new mean
K1 = (2+4+3+10)/4
New K1 = 4.75
K2 = (12+20+11+25+30)/5
New K2 = 19.6
Now continue this process until New K1=Old K1 and New K2=Old
K2……….contd….
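The whole iteration can be run to convergence with a short sketch: one-dimensional k-means on the same data set, starting from the same initial centres K1=4 and K2=11. Continuing the hand calculation above, the centres settle at K1 = 7 and K2 = 25.

```python
def kmeans_1d(points, centers):
    # Iterate: assign each point to the nearest centre, then move each centre
    # to the mean of its cluster, until the centres stop changing.
    while True:
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        new_centers = [sum(c) / len(c) for c in clusters]
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centers, clusters = kmeans_1d(data, [4, 11])
# Converges to centres 7 and 25, with clusters
# {2, 3, 4, 10, 11, 12} and {20, 25, 30}
```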
6.2.3.1.2. K medoids
6.2.3.2.2. Divisive
7. DIANA (DIvisive ANAlysis) – This is a "top-down" approach: this strategy does
the reverse of agglomerative hierarchical clustering by starting with all objects in one
cluster. It subdivides the cluster into smaller and smaller pieces until each object
forms a cluster on its own, or until it satisfies certain termination conditions, such as
a desired number of clusters being obtained or the distance between the two closest
clusters being above a certain threshold distance.
In DIANA, all of the objects are used to form one initial cluster. The cluster is split
according to some principle, such as the maximum Euclidean distance between the
closest neighbor objects in the cluster. The cluster-splitting process repeats until,
eventually, each new cluster contains only a single object.
Multimedia data mining refers to the analysis of large amounts of multimedia information in
order to find patterns or statistical relationships. Once data is collected, computer programs are
used to analyze it and look for meaningful connections. This information is often used by
governments to improve social systems. It can also be used in marketing to discover consumer
habits.
Multimedia data mining requires the collection of huge amounts of data. The sample size is
important when analyzing data because predicted trends and patterns are more likely to be
inaccurate with a smaller sample. This data can be collected from a number of different media,
including videos, sound files, and images. Some experts also consider spatial data and text to be
multimedia. Information from one or more of these media is the focus of data collection.
Whereas an analysis of numerical data can be straightforward, multimedia data analysis requires
sophisticated computer programs which can turn it into useful numerical data. There are a number
of computer programs available that make sense of the information gathered from multimedia
data mining. These computer programs are used to search for relationships that may not be
apparent or logically obvious.
When multimedia is mined for information, one of the most common uses for this information
is to anticipate behavior patterns or trends. Information can be divided into classes as well,
which allows different groups, such as men and women or Sundays and Mondays, to be
analyzed separately. Data can be clustered, or grouped by logical relationship, which can
help track consumer affinity for a certain brand over another, for example.
Multimedia data mining has a number of uses in today’s society. An example of this would be
the use of traffic camera footage to analyze traffic flow. This information can be used when
planning new streets, expanding existing streets, or diverting traffic. Government
organizations and city planners can use the information to help traffic flow more smoothly
and quickly.
While the term data mining is relatively new, the practice of mining data has been around for
a long time. Grocery stores, for example, have long used data mining to track consumer
behavior by collecting data from their registers. The numerical data relating to sales
information can be used by a computer program to learn what people are buying and when
they are likely to buy certain products. This information is often used to determine where to
place certain products and when to put certain products on sale.
Text analytics software can help by transposing words and phrases in unstructured data into
numerical values which can then be linked with structured data in a database and analyzed with
traditional data mining techniques. With an iterative approach, an organization can successfully
use text analytics to gain insight into content-specific values such as sentiment, emotion, intensity
and relevance. Because text analytics technology is still considered to be an emerging technology,
however, results and depth of analysis can vary wildly from vendor to vendor.
Web mining is the use of data mining techniques to automatically discover and extract information from
Web documents and services.
There are three general classes of information that can be discovered by web mining:
Web activity, from server logs and Web browser activity tracking.
Web graph, from links between pages, people and other data.
Web content, for the data found on Web pages and inside of documents.
At Scale Unlimited we focus on the last one – extracting value from web pages and other documents
found on the web.
Note that there’s no explicit reference to “search” in the above description. While search is the biggest
web miner by far, and generates the most revenue, there are many other valuable end uses for web mining
results. A partial list includes:
Business intelligence
Competitive intelligence
Events
Product data
Popularity
Reputation
When extracting Web content information using web mining, there are four typical steps:
1. Collect – fetch the content from the Web
2. Parse – extract usable data from formatted data (HTML, PDF, etc)
3. Analyze – tokenize, rate, classify or cluster the parsed data
4. Produce – turn the results of analysis into something useful (report, search index, etc)
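The Parse step can be illustrated with Python's standard html.parser on an inline HTML snippet. This is a toy sketch: the page content is hypothetical, and a real web miner would fetch pages and handle far messier markup.

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    # A toy "Parse" step: pull hyperlinks and visible text out of raw HTML
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []
        self._skip = 0          # depth inside <script>/<style>, whose text we ignore

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style'):
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())

# Hypothetical fetched page
page = '<html><body><h1>Web Mining</h1><a href="/docs">Docs</a></body></html>'
parser = LinkAndTextExtractor()
parser.feed(page)
# parser.links -> ['/docs']; parser.text -> ['Web Mining', 'Docs']
```

The extracted links feed the crawler's frontier (Collect), while the text goes on to the Analyze step.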
When comparing web mining with traditional data mining, there are three main differences to consider:
1. Scale – In traditional data mining, processing 1 million records from a database would be a large
job. In web mining, even 10 million pages wouldn't be a big number.
2. Access – When doing data mining of corporate information, the data is private and often requires
access rights to read. For web mining, the data is public and rarely requires access rights. But web
mining has additional constraints, due to the implicit agreement with webmasters regarding
automated (non-user) access to this data. This implicit agreement is that a webmaster allows
crawlers access to useful data on the website, and in return the crawler (a) promises not to
overload the site, and (b) has the potential to drive more traffic to the website once the search
index is published. With web mining, there often is no such index, which means the crawler has
to be extra careful/polite during the crawling process, to avoid causing any problems for the
webmaster.
3. Structure – A traditional data mining task gets information from a database, which provides
some level of explicit structure. A typical web mining task is processing unstructured or semi-
structured data from web pages. Even when the underlying information for web pages comes
from a database, this often is obscured by HTML markup.
Web content mining, also known as text mining, is generally the second step in Web data mining.
Content mining is the scanning and mining of text, pictures and graphs of a Web page to
determine the relevance of the content to the search query. This scanning is completed after the
clustering of web pages through structure mining and provides the results based upon the level of
relevance to the suggested query. With the massive amount of information that is available on the
World Wide Web, content mining provides the results lists to search engines in order of highest
relevance to the keywords in the query.
Text mining is directed toward specific information provided by the customer search information
in search engines. This allows for the scanning of the entire Web to retrieve the cluster content
triggering the scanning of specific Web pages within those clusters. The results are pages relayed
to the search engines through the highest level of relevance to the lowest. Though the search
engines have the ability to provide links to Web pages by the thousands in relation to the search
content, this type of web mining enables the reduction of irrelevant information.
Web text mining is very effective when used in relation to a content database dealing with
specific topics. For example, online universities use a library system to recall articles related to
their general areas of study. This specific content database enables pulling only the information
within those subjects, providing the most specific results of search queries in search engines. This
allowance of only the most relevant information being provided gives a higher quality of results.
This increase of productivity is due directly to use of content mining of text and visuals.
The main uses for this type of data mining are to gather, categorize, organize and provide the best
possible information available on the WWW to the user requesting the information. This tool is
imperative to scanning the many HTML documents, images, and text provided on Web pages.
The resulting information is provided to the search engines in order of relevance giving more
productive results of each search.
Web content categorization with a content database is the most important tool to the efficient use
of search engines. A customer requesting information on a particular subject or item would
otherwise have to search through thousands of results to find the most relevant information to his
query. Through the use of text mining, those thousands of results are reduced by this step. This
eliminates the frustration and improves the navigation of information on the Web.
Business uses of content mining allow for the information provided on their sites to be structured
in a relevance-order site map. This allows for a customer of the Web site to access specific
information without having to search the entire site. With the use of this type of mining, data
remains available in order of relevance to the query, thus providing productive marketing.
Web usage mining is the third category in web mining. This type of web mining allows for the
collection of Web access information for Web pages. This usage data provides the paths leading
to accessed Web pages. This information is often gathered automatically into access logs via the
Web server. CGI scripts offer other useful information such as referrer logs, user subscription
information and survey logs. This category is important to the overall use of data mining for
companies and their internet/ intranet based applications and information access.
Usage mining allows companies to produce productive information pertaining to the future of
their business function ability. Some of this information can be derived from the collective
information of lifetime user value, product cross marketing strategies and promotional campaign
effectiveness. The usage data that is gathered provides the companies with the ability to produce
results more effective to their businesses and increasing of sales. Usage data can also be useful
for developing marketing skills that will out-sell the competitors and promote the company’s
services or product on a higher level.
Usage mining is valuable not only to businesses using online marketing, but also to e-businesses
whose business is based solely on the traffic provided through search engines. The use of this
type of web mining helps to gather the important information from customers visiting the site.
This enables an in-depth log to complete the analysis of a company's productivity flow.
Web structure mining, one of three categories of web mining for data, is a tool used to identify the
relationship between Web pages linked by information or direct link connection.
[Figure: b) Tight integration – the database management system and the data mining tools are combined into a single system]
[Figure: application areas of data mining – banking and financial sector, marketing, universities]
CRM is about acquiring and retaining customers, improving customer loyalty, gaining customer
insight, and implementing customer-focused strategies. A true customer-centric enterprise helps
your company drive new growth, maintain competitive agility, and attain operational excellence.”
SAP
To manage the relationship with the customer a business needs to collect the right information
about its customers and organize that information for proper analysis and action. It needs to keep
that information up-to-date, make it accessible to employees, and provide the know-how for
employees to convert that data into products better matched to customers’ needs.
The secret to an effective CRM package is not just in what data is collected but in the organizing
and interpretation of that data. Computers can’t, of course, transform the relationship you have
with your customer. That does take a cross-department, top to bottom, corporate desire to build
better relationships. But computers and a good computer based CRM solution, can increase sales
by as much as 40-50% – as some studies have shown.
[Figure: CRM value cycle – improving customer satisfaction and the current quality of the relationship improves loyalty and customer relationship value, maximizing the value of customer information]
Data mining concepts are still evolving, and here are the latest trends in this field:
Application Exploration.
Integration of data mining with database systems, data warehouse systems and web database
systems.
Web mining.