
Chapter-1

What Is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer; a more appropriate name would be knowledge mining, which emphasizes
the discovery of knowledge from large amounts of data.

It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.

The key properties of data mining are:

Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large data sets and databases

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-
on analysis can now be answered directly from the data — quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.

Tasks of Data Mining

Data mining involves six common classes of tasks:

Anomaly detection (outlier/change/deviation detection) – the identification of unusual data
records that might be interesting, or data errors that require further investigation.

Association rule learning (dependency modelling) – searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are frequently bought
together and use this information for marketing purposes. This is sometimes referred to as
market basket analysis.

Clustering – the task of discovering groups and structures in the data that are in some way or
another "similar", without using known structures in the data.

Classification – the task of generalizing known structure to apply to new data. For example, an
e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – attempts to find a function which models the data with the least error.

Summarization – providing a more compact representation of the data set, including
visualization and report generation.

Architecture of Data Mining

A typical data mining system may have the following major components.

1. Knowledge Base:

This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern's interestingness
based on its unexpectedness, may also be included. Other examples of domain knowledge
are additional interestingness constraints or thresholds, and metadata (e.g., describing
data from multiple heterogeneous sources).

2. Data Mining Engine:

This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:

This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into
the mining process so as to confine the search to only the interesting patterns.

4. User Interface:

This module communicates between users and the data mining system, allowing the user
to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on
the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined patterns,
and visualize the patterns in different forms.
Data Mining Process:

Data mining is a process of discovering various models, summaries, and derived values from a
given collection of data.

The general experimental procedure adapted to data-mining problems involves the following steps:
1. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain.
Hence, domain-specific knowledge and experience are usually necessary in order to come
up with a meaningful problem statement. Unfortunately, many application studies tend to
focus on the data-mining technique at the expense of a clear problem statement. In this
step, a modeler usually specifies a set of variables for the unknown dependency and, if
possible, a general form of this dependency as an initial hypothesis. There may be several
hypotheses formulated for a single problem at this stage. The first step requires the
combined expertise of an application domain expert and a data-mining expert. In practice,
it usually means a close interaction between the data-mining expert and the application
expert. In successful data-mining applications, this cooperation does not stop in the initial
phase; it continues during the entire data-mining process.

2. Collect the data

This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the control
of an expert (modeler): this approach is known as a designed experiment. The second
possibility is when the expert cannot influence the data-generation process: this is known
as the observational approach. An observational setting, namely, random data generation,
is assumed in most data-mining applications. Typically, the sampling distribution is
completely unknown after data are collected, or it is partially and implicitly given in the
data-collection procedure. It is very important, however, to understand how data
collection affects its theoretical distribution, since such a priori knowledge can be very
useful for modeling and, later, for the final interpretation of results. Also, it is important
to make sure that the data used for estimating a model and the data used later for testing
and applying a model come from the same, unknown, sampling distribution. If this is not
the case, the estimated model cannot be successfully used in a final application of the
results.

3. Preprocessing the data

In the observational setting, data are usually "collected" from existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:

a. Detect and eventually remove outliers as a part of the preprocessing phase, or

b. Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data preprocessing includes several steps
such as variable scaling and different types of encoding. For example, one feature with
the range [0, 1] and another with the range [−100, 1000] will not have the same weight
in the applied technique; they will also influence the final data-mining results differently.
Therefore, it is recommended to scale them and bring both features to the same weight
for further analysis. Also, application-specific encoding methods usually achieve
dimensionality reduction by providing a smaller number of informative features for
subsequent data modeling.

These two classes of preprocessing tasks are only illustrative examples of a large
spectrum of preprocessing activities in a data-mining process.
Data-preprocessing steps should not be considered completely independent from other
data-mining phases. In every iteration of the data-mining process, all activities, together,
could define new and improved data sets for subsequent iterations. Generally, a good
preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and encoding.
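As a rough illustration of these two preprocessing tasks, the sketch below (in Python, with invented values) removes outliers using a simple z-score rule and rescales the remaining values to the range [0, 1]. The z-score threshold is a judgment call; with very few data points a looser threshold is needed.

```python
# Minimal preprocessing sketch: z-score outlier removal + min-max scaling.
# The income values below are made up purely for illustration.

def z_score_filter(values, threshold=3.0):
    """Keep only values whose z-score is within the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return list(values)
    return [v for v in values if abs((v - mean) / std) <= threshold]

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly to the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

if __name__ == "__main__":
    incomes = [42000, 48000, 45000, 51000, 39000, 950000]  # 950000 looks like an outlier
    # With only six points the maximum possible z-score is small, so use a looser threshold.
    cleaned = z_score_filter(incomes, threshold=2.0)
    scaled = min_max_scale(cleaned)
    print(cleaned)                         # the extreme value is removed
    print([round(s, 2) for s in scaled])   # remaining values scaled to [0, 1]
```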

4. Estimate the model

The selection and implementation of the appropriate data-mining technique is the main
task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task. The basic principles of learning and discovery from data are given in Chapter 4 of
this book. Later, Chapters 5 through 13 explain and analyze specific techniques that are
applied to perform a successful learning process from data and to develop an appropriate
model.

5. Interpret the model and draw conclusions

In most cases, data-mining models should help in decision making. Hence, such models
need to be interpretable in order to be useful, because humans are not likely to base their
decisions on complex "black-box" models. Note that the goals of accuracy of the model
and accuracy of its interpretation are somewhat contradictory. Usually, simple models are
more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high-dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific
techniques to validate the results. A user does not want hundreds of pages of numeric
results. He does not understand them; he cannot summarize, interpret, and use them for
successful decision making.

The Data Mining Process

Classification of Data Mining Systems:

A data mining system can be classified according to the following criteria:

Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines

Some Other Classification Criteria:

Classification according to the kind of databases mined
Classification according to the kind of knowledge mined
Classification according to the kinds of techniques utilized
Classification according to the applications adapted

Classification according to the kind of databases mined

We can classify a data mining system according to the kind of databases mined. Database
systems can be classified according to different criteria such as data models, types of data, and
so on, and the data mining system can be classified accordingly. For example, if we classify the
database according to the data model, then we may have a relational, transactional,
object-relational, or data warehouse mining system.

Classification according to the kind of knowledge mined

We can classify a data mining system according to the kind of knowledge mined. This means
data mining systems are classified on the basis of functionalities such as:

Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Classification according to the kinds of techniques utilized

We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.

Classification according to the applications adapted

We can classify a data mining system according to the application adapted. These applications
are as follows:

Finance
Telecommunications
DNA
Stock Markets
E-mail

Major Issues in Data Mining:

Mining different kinds of knowledge in databases. - The needs of different users are not the
same, and different users may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.

Incorporation of background knowledge. - Background knowledge can be used to guide the
discovery process and to express the discovered patterns. Background knowledge may be used
to express the discovered patterns not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining. - A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once the patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations
should be easily understandable by the users.

Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise
and incomplete objects while mining the data regularities. If data cleaning methods are not
available, the accuracy of the discovered patterns will be poor.

Pattern evaluation. - This refers to the interestingness of the discovered patterns. Patterns may
be uninteresting because they represent common knowledge or lack novelty.

Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be efficient
and scalable.

Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel; the results from the partitions are then merged.
Incremental algorithms incorporate database updates without having to mine the data again
from scratch.

Knowledge Discovery in Databases (KDD)

Some people treat data mining as a synonym for knowledge discovery, while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process:

Data Cleaning - In this step, noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation - In this step, data are transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation - In this step, data patterns are evaluated.
Knowledge Presentation - In this step, knowledge is represented.

The following diagram shows the knowledge discovery process:

Architecture of KDD

Data Warehouse:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold the most recent address of a customer, whereas a data warehouse
can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.

Data Warehouse Design Process:

A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.

The top-down approach starts with the overall design and planning. It is useful in cases where
the technology is mature and well known, and where the business problems that must be
solved are clear and well understood.

The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.

In the combined approach, an organization can exploit the planned and strategic nature of the
top-down approach while retaining the rapid implementation and opportunistic application
of the bottom-up approach.
The warehouse design process consists of the following steps:

Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger. If the business process is organizational
and involves multiple complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the analysis of one kind of
business process, a data mart model should be chosen.

Choose the grain of the business process. The grain is the fundamental, atomic level of data to
be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.

Choose the dimensions that will apply to each fact table record. Typical dimensions are time,
item, customer, supplier, warehouse, transaction type, and status.

Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.
A Three-Tier Data Warehouse Architecture:

Tier-1:

The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (such as customer profile information
provided by external consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data warehouse. The
data are extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code to
be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).

This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.

Tier-2:

The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.

A ROLAP model is an extended relational DBMS that maps operations on multidimensional
data to standard relational operations.

A MOLAP model is a special-purpose server that directly implements multidimensional
data and operations.

Tier-3:

The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models:

There are three data warehouse models.

1. Enterprise warehouse:

An enterprise warehouse collects all of the information about subjects spanning the entire
organization.

It provides corporate-wide data integration, usually from one or more operational systems
or external information providers, and is cross-functional in scope.

It typically contains detailed data as well as summarized data, and can range in size from a
few gigabytes to hundreds of gigabytes, terabytes, or beyond.

An enterprise data warehouse may be implemented on traditional mainframes, computer
superservers, or parallel architecture platforms. It requires extensive business modeling
and may take years to design and build.

2. Data mart:

A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales. The data contained in data
marts tend to be summarized.

Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or years. However, it may involve
complex integration in the long run if its design and planning were not enterprise-wide.

Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally
within a particular department or geographic area. Dependent data marts are sourced
directly from enterprise data warehouses.

3. Virtual warehouse:

A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized.

A virtual warehouse is easy to build but requires excess capacity on operational database
servers.

Metadata Repository:

Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for timestamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.

A metadata repository should contain the following:

A description of the structure of the data warehouse, which includes the warehouse
schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.

Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails).

The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization,
and predefined queries and reports.

The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data extraction,
cleaning, transformation rules and defaults, data refresh and purging rules, and security
(user authorization and access control).

Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of
refresh, update, and replication cycles.

Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

OLAP (Online Analytical Processing):

OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.

OLAP is part of the broader category of business intelligence, which also encompasses
relational databases, report writing, and data mining.

OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.

OLAP consists of three basic analytical operations:

 Consolidation (Roll-Up)
 Drill-Down
 Slicing and Dicing

Consolidation involves the aggregation of data that can be accumulated and computed in
one or more dimensions. For example, all sales offices are rolled up to the sales
department or sales division to anticipate sales trends.

Drill-down is a technique that allows users to navigate through the details. For instance,
users can view the sales by the individual products that make up a region's sales.

Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of
the OLAP cube and view (dicing) the slices from different viewpoints.
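These three operations can be imitated on a small fact table with a general-purpose data library. The sketch below is a rough illustration (assuming pandas is available, with invented sales figures): it rolls sales up from office to division, drills back down to office-by-quarter detail, and slices out a single quarter before dicing it by office and division.

```python
import pandas as pd

# Invented sales fact table: one row per (division, office, quarter).
sales = pd.DataFrame({
    "division": ["East", "East", "East", "West", "West", "West"],
    "office":   ["Boston", "Boston", "NYC", "LA", "LA", "Seattle"],
    "quarter":  ["Q1", "Q2", "Q1", "Q1", "Q2", "Q1"],
    "amount":   [100, 120, 200, 150, 160, 90],
})

# Roll-up (consolidation): aggregate offices up to the division level.
rollup = sales.groupby("division")["amount"].sum()

# Drill-down: return to the more detailed office-by-quarter level.
drilldown = sales.groupby(["division", "office", "quarter"])["amount"].sum()

# Slice: restrict the cube to one quarter; dice: view that slice office x division.
q1_slice = sales[sales["quarter"] == "Q1"]
dice = q1_slice.pivot_table(index="office", columns="division",
                            values="amount", aggfunc="sum")

print(rollup)
print(drilldown)
print(dice)
```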

Types of OLAP:

1. Relational OLAP (ROLAP):

ROLAP works directly with relational databases. The base data and the dimension
tables are stored as relational tables, and new tables are created to hold the aggregated
information. It depends on a specialized schema design.

This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL
statement.

ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.

ROLAP tools feature the ability to ask any question because the methodology is not
limited to the contents of a cube. ROLAP also has the ability to drill down to the lowest
level of detail in the database.
2. Multidimensional OLAP (MOLAP):

MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.

MOLAP stores data in an optimized multi-dimensional array storage, rather than in
a relational database. Therefore it requires the pre-computation and storage of
information in the cube - the operation known as processing.

MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
The data cube contains all the possible answers to a given range of questions.

MOLAP tools have a very fast response time and the ability to quickly write back
data into the data set.

3. Hybrid OLAP (HOLAP):

There is no clear agreement across the industry as to what constitutes Hybrid OLAP,
except that a database will divide data between relational and specialized storage.

For example, for some vendors, a HOLAP database will use relational tables to hold
the larger quantities of detailed data, and use specialized storage for at least some
aspects of the smaller quantities of more-aggregate or less-detailed data.

HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the
capabilities of both approaches.

HOLAP tools can utilize both pre-calculated cubes and relational data sources.
Data Preprocessing:

Data Integration:

Data integration combines data from multiple sources into a coherent data store, as in data
warehousing. These sources may include multiple databases, data cubes, or flat files.

A data integration system is formally defined as a triple <G, S, M>,

where G: the global schema,

S: the set of heterogeneous source schemas,

M: the mapping between queries on the source and global schemas.


Issues in Data Integration:

1. Schema integration and object matching:

How can the data analyst or the computer be sure that customer_id in one database and
customer_number in another refer to the same attribute?

2. Redundancy:

An attribute (such as annual revenue, for instance) may be redundant if it can be derived
from another attribute or set of attributes. Inconsistencies in attribute or dimension naming
can also cause redundancies in the resulting data set.

3. Detection and resolution of data value conflicts:

For the same real-world entity, attribute values from different sources may differ.

Data Transformation:

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Data transformation can involve the following:

Smoothing, which works to remove noise from the data. Such techniques include binning,
regression, and clustering.

Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of the
data at multiple granularities.

Generalization of the data, where low-level or "primitive" (raw) data are replaced by
higher-level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher-level concepts, like city or country.

Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0.0 to 1.0.

Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
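A rough sketch of two of these transformations, aggregation and smoothing by bin means, is shown below; the daily sales values are invented purely for illustration.

```python
from collections import defaultdict

# Invented daily sales records: (month, day, amount).
daily_sales = [("Jan", 1, 200), ("Jan", 2, 240), ("Jan", 3, 1000),  # 1000 looks noisy
               ("Feb", 1, 210), ("Feb", 2, 260), ("Feb", 3, 250)]

# Aggregation: roll the daily amounts up to monthly totals.
monthly_totals = defaultdict(int)
for month, _day, amount in daily_sales:
    monthly_totals[month] += amount
print(dict(monthly_totals))          # {'Jan': 1440, 'Feb': 720}

# Smoothing by bin means: sort the values, split them into equal-size bins,
# and replace every value in a bin by the bin's mean.
def smooth_by_bin_means(values, bin_size):
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_vals = ordered[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

amounts = [amount for _m, _d, amount in daily_sales]
print(smooth_by_bin_means(amounts, bin_size=3))
```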

Data Reduction:

Data reduction techniques can be applied to obtain a reduced representation of the data set that
is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining
on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results.

Strategies for data reduction include the following:

Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.

Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.

Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size.

Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters
instead of the actual data) or nonparametric methods such as clustering, sampling, and the
use of histograms.

Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity
reduction that is very useful for the automatic generation of concept hierarchies.
Discretization and concept hierarchy generation are powerful tools for data mining, in that
they allow the mining of data at multiple levels of abstraction.
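As a hedged illustration of two of these strategies, the sketch below draws a simple random sample (one form of numerosity reduction) and discretizes a numeric attribute into equal-width bins labelled low/medium/high; the ages are invented.

```python
import random

# Invented customer ages.
ages = [23, 35, 47, 52, 29, 61, 44, 38, 70, 19, 55, 33]

# Numerosity reduction: simple random sampling without replacement.
random.seed(42)                       # fixed seed so the sketch is repeatable
sample = random.sample(ages, k=4)

# Discretization: equal-width binning into three labelled ranges.
def equal_width_bins(values, labels):
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    binned = []
    for v in values:
        index = min(int((v - lo) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned

print(sample)
print(equal_width_bins(ages, ["low", "medium", "high"]))
```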
Chapter-2

Association Rule Mining:

Association rule mining is a popular and well-researched method for discovering interesting
relations between variables in large databases.

It is intended to identify strong rules discovered in databases using different measures of
interestingness.

Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
Problem Definition:

The problem of association rule mining is defined as follows:

Let I = {i1, i2, ..., in} be a set of n binary attributes called items.

Let D = {t1, t2, ..., tm} be a set of transactions called the database.

Each transaction in D has a unique transaction ID and contains a subset of the items in I.

A rule is defined as an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.

The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS)
and consequent (right-hand side or RHS) of the rule, respectively.

Example:

To illustrate the concepts, we use a small example from the supermarket domain. The set of items
is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence
and 0 absence of an item in a transaction) is shown in the table.

An example rule for the supermarket could be {butter, bread} => {milk}, meaning that if
butter and bread are bought, customers also buy milk.
Example database with 4 items and 5 transactions

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

Important Concepts of Association Rule Mining:

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set
which contain the itemset. In the example database, the itemset {milk, bread, butter} has a
support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).

The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X).

For example, the rule {butter, bread} => {milk} has a confidence of 0.2 / 0.2 = 1.0 in the
database, which means that for 100% of the transactions containing butter and bread the rule is
correct (100% of the times a customer buys butter and bread, milk is bought as well).
Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of
finding the RHS of the rule in transactions under the condition that these transactions also
contain the LHS.

The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),

or the ratio of the observed support to that expected if X and Y were independent. The rule
{butter, bread} => {milk} has a lift of 0.2 / (0.2 × 0.4) = 2.5.

The conviction of a rule is defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)).

The rule {milk, bread} => {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2,
and can be interpreted as the ratio of the expected frequency that X occurs without Y (that
is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent divided by the observed frequency of incorrect predictions.
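The sketch below recomputes these measures directly from the five example transactions; it is a plain illustration of the definitions above, not a library implementation.

```python
# Each transaction is the set of items it contains (taken from the example table).
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    return support(set(lhs) | set(rhs)) / support(lhs)

def lift(lhs, rhs):
    return support(set(lhs) | set(rhs)) / (support(lhs) * support(rhs))

def conviction(lhs, rhs):
    conf = confidence(lhs, rhs)
    if conf == 1.0:
        return float("inf")          # the rule is never wrong in this data
    return (1 - support(rhs)) / (1 - conf)

print(support({"milk", "bread", "butter"}))          # 0.2
print(confidence({"butter", "bread"}, {"milk"}))     # 1.0
print(lift({"butter", "bread"}, {"milk"}))           # 2.5
print(conviction({"milk", "bread"}, {"butter"}))     # 1.2
```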

Market Basket Analysis:

This process analyzes customer buying habits by finding associations between the different items
that customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently
purchased together by customers. For instance, if customers are buying milk, how likely are they
to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information
can lead to increased sales by helping retailers do selective marketing and plan their shelf space.

Example:

If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items. In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software, and may decide to purchase
a home security system as well. Market basket analysis can also help retailers plan which items
to put on sale at reduced prices. If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers as well as computers.

Frequent Pattern Mining:

Frequent pattern mining can be classified in various ways, based on the following criteria:

1. Based on the completeness of patterns to be mined:

We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the
maximal frequent itemsets, given a minimum support threshold.

We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match
frequent itemsets, top-k frequent itemsets, and so on.

2. Based on the levels of abstraction involved in the rule set:

Some methods for association rule mining can find rules at differing levels of abstraction.

For example, suppose that a set of association rules mined includes the following rules,
where X is a variable representing a customer:

buys(X, "computer") => buys(X, "HP printer")            (1)

buys(X, "laptop computer") => buys(X, "HP printer")     (2)

In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g.,
"computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule:

If the items or attributes in an association rule reference only one dimension, then it is a
single-dimensional association rule.

buys(X, "computer") => buys(X, "antivirus software")

If a rule references two or more dimensions, such as the dimensions age, income, and buys,
then it is a multidimensional association rule. The following rule is an example of a
multidimensional rule:

age(X, "30...39") ^ income(X, "42K...48K") => buys(X, "high resolution TV")
4. Based on the types of values handled in the rule:

If a rule involves associations between the presence or absence of items, it is a Boolean
association rule.

If a rule describes associations between quantitative items or attributes, then it is a
quantitative association rule.

5. Based on the kinds of rules to be mined:

Frequent pattern analysis can generate various kinds of rules and other interesting
relationships.

Association rule mining can generate a large number of rules, many of which are
redundant or do not indicate a correlation relationship among itemsets.

The discovered associations can be further analyzed to uncover statistical correlations,
leading to correlation rules.

6. Based on the kinds of patterns to be mined:

Many kinds of frequent patterns can be mined from different kinds of data sets.

Sequential pattern mining searches for frequent subsequences in a sequence data set, where
a sequence records an ordering of events.

For example, with sequential pattern mining, we can study the order in which items are
frequently purchased. For instance, customers may tend to first buy a PC, followed by a
digital camera, and then a memory card.

Structured pattern mining searches for frequent substructures in a structured data set.
Single items are the simplest form of structure.
Each element of an itemset may contain a subsequence, a subtree, and so on.

Therefore, structured pattern mining can be considered as the most general form of frequent
pattern mining.
Efficient Frequent Itemset Mining Methods:

Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.

The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.

Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.

First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.

The finding of each Lk requires one full scan of the database.

A two-step process is followed in Apriori, consisting of a join and a prune action.
Example:

TID     List of item IDs

T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

There are nine transactions in this database, that is, |D| = 9.
Steps:

1. In the first iteration of the algorithm, each item is a member of the set of candidate
1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number
of occurrences of each item.

2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of
frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum
support.

3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to
generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the
prune step because each subset of the candidates is also frequent.

4. Next, the transactions in D are scanned and the support count of each candidate itemset in
C2 is accumulated.

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate
2-itemsets in C2 having minimum support.

6. For the generation of the set of candidate 3-itemsets, C3, from the join step we first get
C3 = L2 ⋈ L2 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we
can determine that the four latter candidates cannot possibly be frequent.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support.

8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
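A compact, hedged sketch of this level-wise search is given below; it reproduces the nine-transaction example with min_sup = 2 and keeps only the join and prune logic, omitting the many optimizations of a production implementation.

```python
from itertools import combinations

# The nine transactions from the example, |D| = 9.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2  # minimum support count

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def apriori():
    # L1: frequent 1-itemsets found by counting each item.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support_count(frozenset([i])) >= MIN_SUP}
    all_frequent = set(level)
    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets to build candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Count supports and keep the frequent candidates as Lk.
        level = {c for c in candidates if support_count(c) >= MIN_SUP}
        all_frequent |= level
        k += 1
    return all_frequent

if __name__ == "__main__":
    for itemset in sorted(apriori(), key=lambda s: (len(s), sorted(s))):
        print(set(itemset), support_count(itemset))
```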
Generating Association Rules from Frequent Itemsets:

Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them.
Example:
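A minimal sketch of this generation step, continuing the Apriori sketch above (and therefore assuming its apriori() and support_count() helpers), is shown below: for every nonempty proper subset S of a frequent itemset L, the rule S => (L − S) is reported if its confidence meets a chosen threshold.

```python
from itertools import combinations

def generate_rules(frequent_itemsets, min_conf=0.7):
    """Return (lhs, rhs, confidence) for strong rules from the frequent itemsets."""
    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = support_count(itemset) / support_count(lhs)
                if conf >= min_conf:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

# Reuses apriori() and support_count() from the previous sketch.
for lhs, rhs, conf in generate_rules(apriori(), min_conf=0.7):
    print(f"{lhs} => {rhs}  (confidence {conf:.2f})")
```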

Mining Multilevel Association Rules:

For many applications, it is difficult to find strong associations among data items at low or
primitive levels of abstraction due to the sparsity of data at those levels.

Strong associations discovered at high levels of abstraction may represent commonsense
knowledge.

Therefore, data mining systems should provide capabilities for mining association rules at
multiple levels of abstraction, with sufficient flexibility for easy traversal among different
abstraction spaces.

Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.

Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.

In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at concept level 1 and
working downward in the hierarchy toward the more specific concept levels, until no more
frequent itemsets can be found.

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts. Data can be generalized by replacing low-level
concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.

The concept hierarchy in the example has five levels, referred to as levels 0 to 4, starting with
level 0 at the root node, all.

Here, Level 1 includes computer, software, printer & camera, and computer accessory.

Level 2 includes laptop computer, desktop computer, office software, antivirus software, and so on.

Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.

Level 4 is the most specific abstraction level of this hierarchy.

Approaches for Mining Multilevel Association Rules:

1. Uniform Minimum Support:

The same minimum support threshold is used when mining at each level of abstraction.
When a uniform minimum support threshold is used, the search procedure is simplified.
The method is also simple in that users are required to specify only one minimum support
threshold.

The uniform support approach, however, has some difficulties. It is unlikely that items at
lower levels of abstraction will occur as frequently as those at higher levels of abstraction.
If the minimum support threshold is set too high, it could miss some meaningful associations
occurring at low abstraction levels. If the threshold is set too low, it may generate many
uninteresting associations occurring at high abstraction levels.

2. Reduced Minimum Support:

Each level of abstraction has its own minimum support threshold.
The deeper the level of abstraction, the smaller the corresponding threshold is.
For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively.
In this way, "computer," "laptop computer," and "desktop computer" are all considered frequent.

3. Group-Based Minimum Support:

Because users or experts often have insight as to which groups are more important than
others, it is sometimes more desirable to set up user-specific, item-based, or group-based
minimum support thresholds when mining multilevel rules.

For example, a user could set up the minimum support thresholds based on product price, or
on items of interest, such as by setting particularly low support thresholds for laptop
computers and flash drives in order to pay particular attention to the association patterns
containing items in these categories.

Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:

A single-dimensional or intradimensional association rule contains a single distinct predicate
(e.g., buys) with multiple occurrences, i.e., the predicate occurs more than once within the rule.

buys(X, "digital camera") => buys(X, "HP printer")

Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules.

age(X, "20...29") ^ occupation(X, "student") => buys(X, "laptop")

The above rule contains three predicates (age, occupation, and buys), each of which occurs
only once in the rule. Hence, we say that it has no repeated predicates.

Multidimensional association rules with no repeated predicates are called interdimensional
association rules.

We can also mine multidimensional association rules with repeated predicates, which
contain multiple occurrences of some predicates. These rules are called hybrid-dimensional
association rules. An example of such a rule is the following, where the predicate buys is
repeated:

age(X, "20...29") ^ buys(X, "laptop") => buys(X, "HP printer")

Mining Quantitative Association Rules:

Quantitative association rules are multidimensional association rules in which the numeric
attributes are dynamically discretized during the mining process so as to satisfy some mining
criteria, such as maximizing the confidence or compactness of the rules mined.

In this section, we focus specifically on how to mine quantitative association rules having two
quantitative attributes on the left-hand side of the rule and one categorical attribute on the
right-hand side of the rule. That is,

Aquan1 ^ Aquan2 => Acat

where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and Acat tests a
categorical attribute from the task-relevant data.

Such rules have been referred to as two-dimensional quantitative association rules, because
they contain two quantitative dimensions.

For instance, suppose you are curious about the association relationship between pairs of
quantitative attributes, like customer age and income, and the type of television (such as
high-definition TV, i.e., HDTV) that customers like to buy.

An example of such a 2-D quantitative association rule is

age(X, "30...39") ^ income(X, "42K...48K") => buys(X, "HDTV")
From Association Mining to Correlation Analysis:

A correlation measure can be used to augment the support-confidence framework for
association rules. This leads to correlation rules of the form

A => B [support, confidence, correlation]

That is, a correlation rule is measured not only by its support and confidence but also by the
correlation between itemsets A and B. There are many different correlation measures from
which to choose. In this section, we study various correlation measures to determine which
would be good for mining large data sets.

Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is
independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A
and B are dependent and correlated as events. This definition can easily be extended to more
than two itemsets.

The lift between the occurrence of A and B can be measured by computing

lift(A, B) = P(A ∪ B) / (P(A) P(B))

If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the
occurrence of B.

If the resulting value is greater than 1, then A and B are positively correlated, meaning that
the occurrence of one implies the occurrence of the other.

If the resulting value is equal to 1, then A and B are independent and there is no correlation
between them.
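A small, hedged numeric check of this measure is shown below; the transaction counts are invented, and lift(A, B) is computed from relative frequencies exactly as in the formula above.

```python
# Invented counts over 1000 transactions: how many contain itemset A,
# how many contain itemset B, and how many contain both.
total = 1000
count_a = 600      # transactions containing A
count_b = 750      # transactions containing B
count_ab = 400     # transactions containing both A and B

p_a = count_a / total
p_b = count_b / total
p_ab = count_ab / total

lift = p_ab / (p_a * p_b)
print(f"lift(A, B) = {lift:.3f}")   # about 0.889 < 1: A and B are negatively correlated
```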
Chapter-3

Classification and Prediction:

Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.

Classification predicts categorical (discrete, unordered) labels, while prediction models
continuous-valued functions.

For example, we can build a classification model to categorize bank loan applications as either
safe or risky, or a prediction model to predict the expenditures of potential customers on
computer equipment given their income and occupation.

A predictor is constructed that predicts a continuous-valued function, or ordered value, as
opposed to a categorical label.

Regression analysis is a statistical methodology that is most often used for numeric prediction.

Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.

Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.

Issues Regarding Classification and Prediction:

1. Preparing the Data for Classification and Prediction:

The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.

(i) Data cleaning:

This refers to the preprocessing of data in order to remove or reduce noise (by applying
smoothing techniques) and the treatment of missing values (e.g., by replacing a missing
value with the most commonly occurring value for that attribute, or with the most probable
value based on statistics).

Although most classification algorithms have some mechanisms for handling noisy or
missing data, this step can help reduce confusion during learning.
(ii) Relevance analysis:

Many of the attributes in the data may be redundant.

Correlation analysis can be used to identify whether any two given attributes are
statistically related.

For example, a strong correlation between attributes A1 and A2 would suggest that one of
the two could be removed from further analysis.

A database may also contain irrelevant attributes. Attribute subset selection can be used
in these cases to find a reduced set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution obtained
using all attributes.

Hence, relevance analysis, in the form of correlation analysis and attribute subset
selection, can be used to detect attributes that do not contribute to the classification or
prediction task.

Such analysis can help improve classification efficiency and scalability.
(iii) Data Transformation and Reduction:

The data may be transformed by normalization, particularly when neural networks or
methods involving distance measurements are used in the learning step.

Normalization involves scaling all values for a given attribute so that they fall within a
small specified range, such as -1 to +1 or 0 to 1.

The data can also be transformed by generalizing it to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for continuous-valued
attributes.

For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can
be generalized to higher-level concepts, like city.

Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such as
binning, histogram analysis, and clustering.

Comparing Classification and Prediction Methods:

 Accuracy:

The accuracy of a classifier refers to the ability of a given classifier to correctly predict
the class label of new or previously unseen data (i.e., tuples without class label
information).

The accuracy of a predictor refers to how well a given predictor can guess the value of
the predicted attribute for new or previously unseen data.

 Speed:

This refers to the computational costs involved in generating and using the given
classifier or predictor.

 Robustness:

This is the ability of the classifier or predictor to make correct predictions given
noisy data or data with missing values.

 Scalability:

This refers to the ability to construct the classifier or predictor efficiently given
large amounts of data.

 Interpretability:

This refers to the level of understanding and insight that is provided by the classifier or
predictor.

Interpretability is subjective and therefore more difficult to assess.
Classification by Decision Tree Induction:

Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where

 Each internal node denotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in a tree is the root node.

The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and is therefore appropriate for exploratory knowledge discovery.

Decision trees can handle high-dimensional data.

Their representation of acquired knowledge in tree form is intuitive and generally easy for
humans to assimilate.

The learning and classification steps of decision tree induction are simple and fast. In
general, decision tree classifiers have good accuracy.

Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy, and
molecular biology.
Algorithm for Decision Tree Induction:

The algorithm is called with three parameters:

 Data partition D
 Attribute list
 Attribute selection method

The parameter attribute list is a list of attributes describing the tuples.

Attribute selection method specifies a heuristic procedure for selecting the attribute that
"best" discriminates the given tuples according to class.

The tree starts as a single node, N, representing the training tuples in D.

If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with
that class.

All of the terminating conditions are explained at the end of the algorithm.

Otherwise, the algorithm calls Attribute selection method to determine the splitting criterion.

The splitting criterion tells us which attribute to test at node N by determining the "best"
way to separate or partition the tuples in D into individual classes.

There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, …, av}, based on the training data.

1. A is discrete-valued:

In this case, the outcomes of the test at node N correspond directly to the known values of A.
A branch is created for each known value, aj, of A and labeled with that value. A need not be
considered in any future partitioning of the tuples.

2. A is continuous-valued:

In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split_point and A > split_point, respectively, where split_point is the split-point
returned by Attribute selection method as part of the splitting criterion.

3. A is discrete-valued and a binary tree must be produced:

The test at node N is of the form "A ∈ SA?".

SA is the splitting subset for A, returned by Attribute selection method as part of the splitting
criterion. It is a subset of the known values of A.

The three partitioning scenarios: (a) A is discrete-valued; (b) A is continuous-valued; (c) A is
discrete-valued and a binary tree must be produced.
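A hedged sketch of the attribute-selection step is given below: it computes information gain (the measure used by ID3-style decision trees) for each candidate attribute over a tiny invented training set and picks the best splitting attribute; building the full recursive tree is omitted.

```python
from collections import Counter
from math import log2

# Tiny invented training set: each row is (attribute values, class label).
rows = [
    ({"age": "youth",  "income": "high"}, "no"),
    ({"age": "youth",  "income": "low"},  "no"),
    ({"age": "middle", "income": "high"}, "yes"),
    ({"age": "senior", "income": "low"},  "yes"),
    ({"age": "senior", "income": "high"}, "yes"),
    ({"age": "middle", "income": "low"},  "yes"),
]

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(data, attribute):
    """Reduction in entropy obtained by splitting the data on the attribute."""
    labels = [label for _, label in data]
    base = entropy(labels)
    remainder = 0.0
    for value in {attrs[attribute] for attrs, _ in data}:
        subset = [label for attrs, label in data if attrs[attribute] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return base - remainder

best = max(["age", "income"], key=lambda a: info_gain(rows, a))
print({a: round(info_gain(rows, a), 3) for a in ["age", "income"]})
print("best splitting attribute:", best)   # "age" separates the classes perfectly here
```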

Bayesian Classification:

Bayesian classifiers are statistical classifiers.

They can predict class membership probabilities, such as the probability that a given tuple
belongs to a particular class.

Bayesian classification is based on Bayes' theorem.

Bayes' Theorem:

Let X be a data tuple. In Bayesian terms, X is considered "evidence," and it is described by
measurements made on a set of n attributes.

Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.

For classification problems, we want to determine P(H|X), the probability that the hypothesis
H holds given the "evidence" or observed data tuple X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.

Bayes' theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X):

P(H|X) = P(X|H) P(H) / P(X)

Naïve Bayesian Classification:

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.

2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 <= j <= m, j ≠ i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes' theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = … = P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple. Thus,

P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training tuples. For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci), we consider the following:
 If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
 If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward.
A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x − μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μCi, σCi)

5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j ≠ i.
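A minimal Gaussian naïve Bayes sketch, assuming continuous-valued attributes modeled by the Gaussian distribution above; the variable names and toy data are illustrative, not from the text.

import numpy as np

def fit_naive_bayes(X, y):
    """Estimate P(Ci) and per-attribute Gaussian parameters (mu, sigma) for each class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {"prior": len(Xc) / len(X),          # P(Ci)
                    "mu": Xc.mean(axis=0),              # per-attribute mean
                    "sigma": Xc.std(axis=0) + 1e-9}     # per-attribute std (avoid /0)
    return model

def predict(model, x):
    """Pick the class maximizing log P(Ci) + sum_k log P(xk|Ci)."""
    best_class, best_score = None, -np.inf
    for c, p in model.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["sigma"] ** 2)
                                + ((x - p["mu"]) / p["sigma"]) ** 2)
        score = np.log(p["prior"]) + log_lik
        if score > best_score:
            best_class, best_score = c, score
    return best_class

X = np.array([[1.0, 2.1], [0.9, 1.9], [3.2, 3.0], [3.0, 3.3]])
y = np.array(["C1", "C1", "C2", "C2"])
print(predict(fit_naive_bayes(X, y), np.array([3.1, 3.1])))   # likely "C2"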

A Multilayer Feed-Forward Neural Network:
The backpropagation algorithm performs learning on a multilayer feed-forward neural network.
It iteratively learns a set of weights for prediction of the class label of tuples.
A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer.
Example:

The inputs to the network correspond to the attributes measured for each training tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass through the input layer and are then weighted and fed simultaneously to a second layer known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given tuples.

Classification by Backpropagation:
Backpropagation is a neural network learning algorithm.
A neural network is a set of connected input/output units in which each connection has a weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples.
Neural network learning is also referred to as connectionist learning due to the connections between units.
Neural networks involve long training times and are therefore more suitable for applications where this is feasible.

Backpropagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual known target value.

The target value may be the known class label of the training tuple (for classification problems) or a continuous value (for prediction).

For each training tuple, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer down to the first hidden layer; hence the name backpropagation.

Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops.
Advantages:

Advantages include a high tolerance of noisy data as well as the ability to classify patterns on which the network has not been trained.

Neural networks can be used when you may have little knowledge of the relationships between attributes and classes.

They are well-suited for continuous-valued inputs and outputs, unlike most decision tree algorithms.

They have been successful on a wide array of real-world data, including handwritten
character recognition, pathology and laboratory medicine, and training a computer to
pronounce English text.

Neural network algorithms are inherently parallel; parallelization techniques can be used to
speed up the computation process.
Process:
Initialize the weights:

The weights in the network are initialized to small random numbers ranging from -1.0 to 1.0, or -0.5 to 0.5. Each unit has a bias associated with it. The biases are similarly initialized to small random numbers.
Each training tuple, X, is processed by the following steps.

Propagate the inputs forward:

First, the training tuple is fed to the input layer of the network. The inputs pass through the input units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value, Ij. Next, the net input and output of each unit in the hidden and output layers are computed. The net input to a unit in the hidden or output layers is computed as a linear combination of its inputs.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in the previous layer. Each connection has a weight. To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and this is summed:

Ij = Σi wij·Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit, which acts as a threshold in that it serves to vary the activity of the unit.

Each unit in the hidden and output layers takes its net input and then applies an activation function to it, typically the logistic (sigmoid) function: Oj = 1 / (1 + e^(−Ij)).
Backpropagate the error:

The error is propagated backward by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error Errj is computed by

Errj = Oj (1 − Oj)(Tj − Oj)

where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple.
The error of a hidden layer unit j is

Errj = Oj (1 − Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k.
Weights are updated by the following equations, where Δwij is the change in weight wij and l is the learning rate:

Δwij = (l) Errj Oi,  wij = wij + Δwij

Biases are updated by the equations below:

Δθj = (l) Errj,  θj = θj + Δθj
Algorithm:
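Below is a compact, illustrative sketch of the steps above for a network with one hidden layer; it is not the textbook's pseudocode. The learning rate l, layer sizes, and XOR-style toy data are assumptions for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_epoch(X, T, W1, b1, W2, b2, l=0.5):
    """One pass over the training tuples, updating weights after each tuple."""
    for x, t in zip(X, T):
        # Propagate the inputs forward
        h = sigmoid(W1 @ x + b1)                    # hidden layer outputs
        o = sigmoid(W2 @ h + b2)                    # output layer outputs
        # Backpropagate the error
        err_o = o * (1 - o) * (t - o)               # output-layer error
        err_h = h * (1 - h) * (W2.T @ err_o)        # hidden-layer error
        # Update weights and biases
        W2 += l * np.outer(err_o, h); b2 += l * err_o
        W1 += l * np.outer(err_h, x); b1 += l * err_h
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
W1, b1 = rng.uniform(-0.5, 0.5, (3, 2)), np.zeros(3)    # small random initial weights
W2, b2 = rng.uniform(-0.5, 0.5, (1, 3)), np.zeros(1)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])              # XOR-style targets
for _ in range(2000):
    W1, b1, W2, b2 = train_epoch(X, T, W1, b1, W2, b2)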

k-Nearest-Neighbor Classifier:
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it.
The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple.
Closeness is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is

dist(X1, X2) = √( Σ_{i=1..n} (x1i − x2i)² )

In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance count.
Min-max normalization can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing

v' = (v − minA) / (maxA − minA)

where minA and maxA are the minimum and maximum values of attribute A.

For k-nearest-neighbor classification, the unknown tuple is assigned the most common class among its k nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
Nearest-neighbor classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown tuple.
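A short k-nearest-neighbor sketch using Euclidean distance, min-max normalization, and majority voting among the k nearest neighbors; the toy data and the value of k are illustrative.

import numpy as np
from collections import Counter

def min_max_normalize(X):
    """Scale each numeric attribute to [0, 1] as described above."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / np.where(mx - mn == 0, 1, mx - mn)

def knn_predict(X_train, y_train, x_unknown, k=3):
    dists = np.linalg.norm(X_train - x_unknown, axis=1)     # Euclidean distances
    nearest = np.argsort(dists)[:k]                         # indices of k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]   # most common class

X = min_max_normalize(np.array([[1.0, 5.0], [2.0, 4.5], [8.0, 1.0], [9.0, 0.5]]))
y = np.array(["A", "A", "B", "B"])
print(knn_predict(X, y, X[0], k=3))    # expected "A"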

Other Classification Methods:

Genetic Algorithms:

Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning starts as follows.

An initial population is created consisting of randomly generated rules. Each rule can be represented by a string of bits. As a simple example, suppose that samples in a given training set are described by two Boolean attributes, A1 and A2, and that there are two classes, C1 and C2.

The rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100", where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class.

Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001".

If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's values.
Classes can be encoded in a similar fashion.

Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules in the current population, as well as offspring of these rules.

Typically, the fitness of a rule is assessed by its classification accuracy on a set of training samples.

Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule's string are inverted.
The process of generating new populations based on prior populations of rules continues until a population, P, evolves where each rule in P satisfies a prespecified fitness threshold.

Genetic algorithms are easily parallelizable and have been used for classification as well as other optimization problems. In data mining, they may be used to evaluate the fitness of other algorithms.
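An illustrative sketch of the genetic operators described above applied to bit-string rules such as "100"; the toy fitness function (bit agreement with a target string, standing in for classification accuracy), population size, and thresholds are assumptions.

import random

def crossover(rule1, rule2, point=None):
    """Swap substrings of two bit-string rules at a crossover point."""
    point = point if point is not None else random.randrange(1, len(rule1))
    return rule1[:point] + rule2[point:], rule2[:point] + rule1[point:]

def mutate(rule, p=0.1):
    """Invert randomly selected bits in the rule's string."""
    return "".join(b if random.random() > p else ("1" if b == "0" else "0") for b in rule)

def evolve(population, fitness, generations=50, threshold=0.9):
    """Keep the fittest rules and their offspring until all rules pass the threshold."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: len(population) // 2]        # survival of the fittest
        children = []
        for r1, r2 in zip(parents[::2], parents[1::2]):
            c1, c2 = crossover(r1, r2)
            children += [mutate(c1), mutate(c2)]
        population = parents + children
        if all(fitness(r) >= threshold for r in population):
            break
    return population

# Toy fitness: fraction of bits matching a "target" rule (stands in for accuracy).
target = "101"
fitness = lambda rule: sum(a == b for a, b in zip(rule, target)) / len(target)
print(evolve(["000", "111", "010", "100"], fitness))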

Fuzzy Set Approaches:
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership that a certain value has in a given category. Each category then represents a fuzzy set.

Fuzzy logic systems typically provide graphical tools to assist users in converting attribute values to fuzzy truth values.
Fuzzy set theory is also known as possibility theory.
It was proposed by Lotfi Zadeh in 1965 as an alternative to traditional two-value logic and probability theory.

It lets us work at a high level of abstraction and offers a means for dealing with imprecise measurement of data.

Most important, fuzzy set theory allows us to deal with vague or inexact facts.

Unlike the notion of traditional "crisp" sets, where an element either belongs to a set S or its complement, in fuzzy set theory, elements can belong to more than one fuzzy set.

Fuzzy set theory is useful for data mining systems performing rule-based classification. It provides operations for combining fuzzy measurements.
Several procedures exist for translating the resulting fuzzy output into a defuzzified or crisp value that is returned by the system.

Fuzzy logic systems have been used in numerous areas for classification, including market research, finance, health care, and environmental engineering.

Example:
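As an assumed illustration (not taken from the text), the sketch below shows how a numeric value can belong to more than one fuzzy set with different degrees of membership, using triangular membership functions with arbitrary breakpoints.

def triangular(x, a, b, c):
    """Triangular membership function rising from a to b and falling from b to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

income = 47_000
memberships = {
    "low":    triangular(income, 0, 20_000, 40_000),
    "medium": triangular(income, 30_000, 50_000, 70_000),
    "high":   triangular(income, 60_000, 90_000, 120_000),
}
print(memberships)   # the income belongs partly to "medium" and not at all to "low" or "high"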

Regression Analysis:
Regression analysis can be used to model the relationship between one or more independent or predictor variables and a dependent or response variable, which is continuous-valued.

In the context of data mining, the predictor variables are the attributes of interest describing the tuple (i.e., making up the attribute vector).
In general, the values of the predictor variables are known.
The response variable is what we want to predict.

Linear Regression:
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x.

It is the simplest form of regression, and models y as a linear function of x.

That is, y = b + wx
where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the Y-intercept and slope of the line.
The regression coefficients, w and b, can also be thought of as weights, so that we can equivalently write y = w0 + w1x.
These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
Let D be a training set consisting of values of predictor variable, x, for some population and their associated values for response variable, y. The training set contains |D| data points of the form (x1, y1), (x2, y2), …, (x|D|, y|D|).
The regression coefficients can be estimated using this method with the following equations:

w1 = Σ_{i=1..|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1..|D|} (xi − x̄)²
w0 = ȳ − w1·x̄

where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|. The coefficients w0 and w1 often provide good approximations to otherwise complicated regression equations.
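A brief sketch of the least-squares estimates for w0 and w1 above; the data points are made-up values for illustration.

import numpy as np

def fit_straight_line(x, y):
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])
w0, w1 = fit_straight_line(x, y)
print(f"y = {w0:.1f} + {w1:.1f} x")    # the fitted straight line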
Multiple Linear Regression:
It is an extension of straight-line regression so as to involve more than one predictor variable.
It allows response variable y to be modeled as a linear function of, say, n predictor variables or attributes, A1, A2, …, An, describing a tuple, X.

An example of a multiple linear regression model based on two predictor attributes or variables, A1 and A2, is y = w0 + w1x1 + w2x2,
where x1 and x2 are the values of attributes A1 and A2, respectively, in X.

Multiple regression problems are instead commonly solved with the use of statistical software packages, such as SAS, SPSS, and S-Plus.
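A short sketch, assuming NumPy's least-squares solver as a stand-in for the statistical packages mentioned above, fitting y = w0 + w1x1 + w2x2 to toy data.

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # x1, x2 per tuple
y = np.array([8.0, 7.0, 17.0, 16.0])
A = np.column_stack([np.ones(len(X)), X])                        # prepend intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)                                                         # [w0, w1, w2]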

Nonlinear Regression:
Nonlinear relationships can be modeled by adding polynomial terms to the basic linear model.

By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares.

Polynomial regression is a special case of multiple regression. That is, the addition of high-order terms like x², x³, and so on, which are simple functions of the single variable, x, can be considered equivalent to adding new independent variables.
Transformation of a polynomial regression model to a linear regression model:
Consider a cubic polynomial relationship given by
y = w0 + w1x + w2x² + w3x³
To convert this equation to linear form, we define new variables:
x1 = x,  x2 = x²,  x3 = x³
It can then be converted to linear form by applying the above assignments, resulting in the equation
y = w0 + w1x1 + w2x2 + w3x3
which is easily solved by the method of least squares using software for regression analysis.
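A brief sketch of the transformation above: the new variables x1 = x, x2 = x², x3 = x³ are built as columns and the resulting multiple linear regression is solved by least squares; the data values are illustrative.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 + 1.5 * x - 0.5 * x**2 + 0.1 * x**3               # toy cubic relationship

A = np.column_stack([np.ones_like(x), x, x**2, x**3])    # columns: 1, x1, x2, x3
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(w, 3))                                    # recovers [2, 1.5, -0.5, 0.1]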

Classifier Accuracy:
The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.
In the pattern recognition literature, this is also referred to as the overall recognition rate of the classifier; that is, it reflects how well the classifier recognizes tuples of the various classes.
The error rate or misclassification rate of a classifier, M, is simply 1 − Acc(M), where Acc(M) is the accuracy of M.

The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of different classes.

True positives refer to the positive tuples that were correctly labeled by the classifier.
True negatives are the negative tuples that were correctly labeled by the classifier.
False positives are the negative tuples that were incorrectly labeled.

To assess how well the classifier recognizes tuples of each class, the sensitivity and specificity measures can be used. Accuracy is a function of sensitivity and specificity:

sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity · pos / (pos + neg) + specificity · neg / (pos + neg)

where t_pos is the number of true positives, pos is the number of positive tuples, t_neg is the number of true negatives, neg is the number of negative tuples, and f_pos is the number of false positives.
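A minimal sketch computing the measures above from assumed counts of true/false positives and negatives; the example counts are made up.

def evaluate(t_pos, f_neg, t_neg, f_pos):
    pos, neg = t_pos + f_neg, t_neg + f_pos
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision = t_pos / (t_pos + f_pos)
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy}

print(evaluate(t_pos=90, f_neg=10, t_neg=180, f_pos=20))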
Chapter-4

Cluster Analysis:
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression.

Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS.

Applications:
Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing.

In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.

In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.

Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.

Typical Requirements of Clustering in Data Mining:
 Scalability:
Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results.
Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density.
However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
 Ability to deal with noisy data:
Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
 Incremental clustering and insensitivity to the order of input records:
Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects.
It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
 High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
 Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.
 Interpretability and usability:
Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

Major Clustering Methods:
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods

Partitioning Methods:
A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements:

Each group must contain at least one object, and
Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.

Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering:

 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.
Density-based methods:
 Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a grid structure.
 All of the clustering operations are performed on the grid structure, i.e., on the quantized space. The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space.
 STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based.

Model-Based Methods:
 Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.

Tasks in Data Mining:
 Clustering High-Dimensional Data
 Constraint-Based Clustering
Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications require the analysis of objects containing a large number of features or dimensions.

For example, text documents may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of thousands of genes under hundreds of conditions.

Clustering high-dimensional data is challenging due to the curse of dimensionality.

Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data.
CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space.

Frequent pattern–based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.

Constraint-Based Clustering:
It is a clustering approach that performs clustering by incorporation of user-specified or application-oriented constraints.

A constraint expresses a user's expectation or describes properties of the desired clustering results, and provides an effective means for communicating with the clustering process.

Various kinds of constraints can be specified, either by a user or as per application requirements.

Spatial clustering in the presence of obstacles and clustering under user-specified constraints are typical examples. In addition, semi-supervised clustering employs pairwise constraints in order to improve the quality of the resulting clustering.

Classical Partitioning Methods:
The most well-known and commonly used partitioning methods are
 The k-Means Method
 k-Medoids Method
Centroid-Based Technique: The k-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.
The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster mean or center.

For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean.

It then computes the new mean for each cluster.

This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

E = Σ_{i=1..k} Σ_{p∈Ci} |p − mi|²

where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.

The k-means partitioning algorithm:
The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Clustering of a set of objects based on the k-means method.
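A compact k-means sketch following the steps above: select k initial means at random, assign each object to the nearest mean, recompute the means, and iterate until the assignments converge; the data and k are illustrative.

import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    means = X[rng.choice(len(X), size=k, replace=False)]       # initial cluster centers
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)                          # assign to nearest mean
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):                      # criterion has converged
            break
        means = new_means
    return means, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
means, labels = k_means(X, k=2, rng=0)
print(means, labels)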

The k-Medoids Method:

The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. This effect is particularly exacerbated due to the use of the square-error function.

Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar.

The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

E = Σ_{j=1..k} Σ_{p∈Cj} |p − oj|

where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.
The initial representative objects are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved.

This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.

To determine whether a nonrepresentative object, o_random, is a good replacement for a current representative object, oj, the following four cases are examined for each of the nonrepresentative objects.

Case 1:

p currently belongs to representative object, oj. If oj is replaced by o_random as a representative object and p is closest to one of the other representative objects, oi, i ≠ j, then p is reassigned to oi.

Case 2:

p currently belongs to representative object, oj. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.

Case 3:

p currently belongs to representative object, oi, i ≠ j. If oj is replaced by o_random as a representative object and p is still closest to oi, then the assignment does not change.

Case 4:

p currently belongs to representative object, oi, i ≠ j. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.

Four cases of the cost function for k-medoids clustering.

The k-Medoids Algorithm:

The k-medoids algorithm for partitioning based on medoid or central objects.
The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
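A naive, illustrative k-medoids sketch (a simplified PAM-style search, not the textbook's exact algorithm): a medoid is swapped with a nonrepresentative object whenever the swap reduces the total absolute error E; the data and k are assumptions.

import numpy as np

def k_medoids(X, k, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)

    def total_cost(meds):
        return D[:, meds].min(axis=1).sum()       # sum of distances to nearest medoid

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h                  # try swapping medoid i with object h
                c = total_cost(candidate)
                if c < cost:
                    medoids, cost, improved = candidate, c, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9], [20.0, 1.0]])
medoids, labels = k_medoids(X, k=2, rng=1)
print(medoids, labels)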

Hierarchical Clustering Methods:

A hierarchical clustering method works by grouping data objects into a tree of clusters.

The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.

Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.

Agglomerative hierarchical clustering:
This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.

Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.

Divisive hierarchical clustering:
This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster.

It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.
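A short agglomerative (bottom-up) clustering sketch using SciPy's hierarchical clustering routines, one common implementation choice; the toy data, the single-link merge criterion, and the cut into three clusters are assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
Z = linkage(X, method="single")                    # merge the closest clusters step by step
labels = fcluster(Z, t=3, criterion="maxclust")    # terminate by cutting the tree into 3 clusters
print(labels)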
Constraint-Based Cluster Analysis:
Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches.
There are a few categories of constraints.
 Constraints on individual objects:

We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing, after which the problem reduces to an instance of unconstrained clustering.

 Constraints on the selection of clustering parameters:

A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε (the radius) and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.
 Constraints on distance or similarity functions:

We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process.
 User-specified constraints on the properties of individual clusters:
A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
 Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different cluster). Such a constrained clustering process is called semi-supervised clustering.

Outlier Analysis:

There exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information because one person's noise could be another person's signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.

It can be used in fraud detection, for example, by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments.

Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two subproblems:

Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
Types of outlier detection:
 Statistical Distribution-Based Outlier Detection
 Distance-Based Outlier Detection
 Density-Based Local Outlier Detection
 Deviation-Based Outlier Detection

Statistical Distribution-Based Outlier Detection:
The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters such as the mean and variance, and the expected number of outliers.

A statistical discordancy test examines two hypotheses:
A working hypothesis
An alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H : oi ∈ F, where i = 1, 2, …, n

The hypothesis is retained if there is no statistically significant evidence supporting its rejection. A discordancy test verifies whether an object, oi, is significantly large (or small) in relation to the distribution F. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data. Assuming that some statistic, T, has been chosen for discordancy testing, and the value of the statistic for object oi is vi, then the distribution of T is constructed. The significance probability, SP(vi) = Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and the working hypothesis is rejected.
An alternative hypothesis, H̄, which states that oi comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen because oi may be an outlier under one model and a perfectly valid value under another. The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when oi is really an outlier.
There are different kinds of alternative distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G:

H̄ : oi ∈ G, where i = 1, 2, …, n

F and G may be different distributions or differ only in parameters of the same distribution.
There are constraints on the form of the G distribution in that it must have potential to produce outliers. For example, it may have a different mean or dispersion, or a longer tail.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is

Slippage alternative distribution:
This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, F, with its given parameters, whereas the remaining objects are independent observations from a modified version of F in which the parameters have been shifted.
There are two basic types of procedures for detecting outliers:
Block procedures:
In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.
Consecutive procedures:
An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
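An illustrative, simplified discordancy-style check under an assumed normal working hypothesis: objects whose values lie unusually far from the mean are flagged. The 3-standard-deviation cutoff is a common convention, not prescribed by the text.

import numpy as np

def zscore_outliers(values, threshold=3.0):
    mu, sigma = values.mean(), values.std()
    z = np.abs(values - mu) / sigma
    return values[z > threshold]          # objects discordant with respect to the model

data = np.concatenate([np.random.default_rng(1).normal(50, 5, size=200), [120.0]])
print(zscore_outliers(data))              # the injected value 120.0 is flagged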

Distance-Based Outlier Detection:
The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have enough neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and in selecting discordancy tests.
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin. For example, if objects that lie three or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13σ)-outlier.
Several efficient algorithms for mining distance-based outliers have been developed.
Index-based algorithm:
Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object. Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of O(n²k), where n is the number of objects in the data set and k is the dimensionality. The index-based algorithm scales well as k increases. However, this complexity evaluation takes only the search time into account, even though the task of building an index in itself can be computationally intensive.
Nested-loop algorithm:
The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
Cell-based algorithm:
To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.

In this method, the data space is partitioned into cells with a side length equal to dmin / (2·√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is 2·√k − 1 cells thick, rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts — the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together. Let's refer to these counts as cell count, cell + 1 layer count, and cell + 2 layers count, respectively.

Let M be the maximum number of outliers that can exist in the dmin-neighborhood of an outlier.

An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation as they cannot be outliers.

If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer.
A variation to the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.
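A naive nested-loop sketch of the DB(pct, dmin) definition above: an object is flagged if at least a fraction pct of the other objects lie farther than dmin from it; the parameters and data are illustrative.

import numpy as np

def db_outliers(X, pct, dmin):
    n = len(X)
    outliers = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        far = np.sum(dists > dmin) / (n - 1)       # fraction of objects beyond dmin
        if far >= pct:
            outliers.append(i)
    return outliers

X = np.vstack([np.random.default_rng(2).normal(0, 1, size=(50, 2)), [[10.0, 10.0]]])
print(db_outliers(X, pct=0.95, dmin=5.0))           # index of the far-away point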

Density-Based Local Outlier Detection:
Statistical and distance-based outlier detection both depend on the overall or global distribution of the given set of data points, D. However, data are usually not uniformly distributed. These methods encounter difficulties when analyzing data with rather different density distributions.
To define the local outlier factor of an object, we need to introduce the concepts of k-distance, k-distance neighborhood, reachability distance, and local reachability density. These are defined as follows:
The k-distance of an object p is the maximal distance that p gets from its k-nearest neighbors. This distance is denoted as k-distance(p). It is defined as the distance, d(p, o), between p and an object o ∈ D, such that for at least k objects, o′ ∈ D, it holds that d(p, o′) <= d(p, o). That is, there are at least k objects in D that are as close as or closer to p than o, and for at most k − 1 objects, o″ ∈ D, it holds that d(p, o″) < d(p, o).

That is, there are at most k − 1 objects that are closer to p than o. You may be wondering at this point how k is determined. The LOF method links to density-based clustering in that it sets k to the parameter MinPts, which specifies the minimum number of points for use in identifying clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_k-distance(p)(p), or Nk(p) for short. By setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it contains every object whose distance is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is within the MinPts-nearest neighbors of p) is defined as

reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}

Intuitively, if an object p is far away from o, then the reachability distance between the two is simply their actual distance. However, if they are sufficiently close (i.e., where p is within the MinPts-distance neighborhood of o), then the actual distance is replaced by the MinPts-distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o.
The higher the value of MinPts is, the more similar the reachability distances are for objects within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p. It is defined as

lrd_MinPts(p) = |N_MinPts(p)| / Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o)

The local outlier factor (LOF) of p captures the degree to which we call p an outlier. It is defined as

LOF(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|

It is the average of the ratio of the local reachability density of p and those of p's MinPts-nearest neighbors. It is easy to see that the lower p's local reachability density is, and the higher the local reachability densities of p's MinPts-nearest neighbors are, the higher LOF(p) is.
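A brief illustration using scikit-learn's LocalOutlierFactor, one widely used implementation of the LOF idea above; n_neighbors plays the role of MinPts and the data points are toy values.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0]])
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)                      # -1 marks local outliers
scores = -lof.negative_outlier_factor_           # approximate LOF values
print(labels, np.round(scores, 2))               # the point (5, 5) has a high LOF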
Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In this section, we study two techniques for deviation-based outlier detection. The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.

Sequential Exception Technique:
The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects. It uses implicit redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, …, Dm}, of these objects with 2 <= m <= n such that

Dj−1 ⊂ Dj, where Dj ⊆ D

Dissimilarities are assessed between subsets in the sequence. The technique introduces the following key terms.
Exception set:
This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that, if given a set of objects, returns a low value if the objects are similar to one another. The greater the dissimilarity among the objects, the higher the value returned by the function. The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence. Given a subset of n numbers, {x1, …, xn}, a possible dissimilarity function is the variance of the numbers in the set, that is,

(1/n) Σ_{i=1..n} (xi − x̄)²

where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the strings in Dj−1 does not cover any string in Dj that is not in Dj−1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the dissimilarity can be reduced by removing the subset from the original set of objects.
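An illustrative, simplified sketch (an assumption, not the full technique) of the variance dissimilarity function and a smoothing-factor-style check: how much does removing a candidate exception set reduce the dissimilarity of the remaining objects?

import numpy as np

def dissimilarity(values):
    """Variance of the numbers in the set (low when the objects are similar)."""
    values = np.asarray(values, dtype=float)
    return values.var() if len(values) else 0.0

def reduction_from_removal(values, candidate_indices):
    """Drop the candidate subset and report the reduction in dissimilarity."""
    remaining = [v for i, v in enumerate(values) if i not in set(candidate_indices)]
    return dissimilarity(values) - dissimilarity(remaining)

data = [20, 21, 19, 22, 20, 95]                # the last value looks deviant
print(reduction_from_removal(data, [5]))        # large reduction -> likely exception set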
