
INDEX

S.N.  TABLE OF CONTENTS

1.  Case Study of Data Warehousing
2.  Case Study of Data Warehousing Schemas
3.  Case Study of OLAP Servers
4.  Case Study of MOLAP Server
5.  Case Study of Data Mining Techniques
6.  Case Study of Data Mining Related Challenges
7.  Case Study of Naive Bayesian Classification
8.  Implementation of Decision Tree Algorithm
9.  Implementation of K-Means Algorithm
10. Case Study of Clustering and its Techniques

Case Study - 1

Explain Data Warehouse Design?
A data warehouse is a single data repository in which records from multiple data sources are integrated for online analytical processing (OLAP). This implies that a data warehouse needs to meet the requirements of all the business stages within the entire organization. Thus, data warehouse design is a hugely complex, lengthy, and hence error-prone process. Furthermore, business analytical functions change over time, which results in changing requirements for the system. Therefore, data warehouse and OLAP systems are dynamic, and the design process is continuous.

Data warehouse design takes a method different from view materialization in industry. It sees data warehouses as database systems with particular needs, such as answering management-related queries. The target of the design becomes how the records from multiple data sources should be extracted, transformed, and loaded (ETL) to be organized in a database as the data warehouse.

There are three approaches -

1. "Top-down" approach
2. "Bottom-up" approach
3. A combination of both approaches

Whichever approach is chosen, the steps involved in the design and construction of a data warehouse are planning, requirements study, problem analysis, warehouse design, data integration, testing, and finally deployment of the data warehouse.
Two methodologies are commonly used for developing large software systems, namely the waterfall method and the spiral method. The waterfall method performs a structured and systematic analysis at each step before proceeding to the next. The spiral method involves the rapid generation of increasingly functional systems, with short intervals between successive releases. The spiral model is a better choice for data warehouse development, especially for data marts, since the turnaround time is short, modifications can be performed easily, and new designs and technologies can be adopted in a timely manner.

1. A data mart model should be selected if the process is departmental and emphasizes the analysis of one kind of business process.
2. Select the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table, for example, individual transactions or individual daily snapshots.
3. Select the dimensions that will apply to each fact table record. Customer, supplier, status, time, item, warehouse, and transaction type are typical dimensions.

TOP-DOWN APPROACH:
In the "top-down" design approach, a data warehouse is described as a subject-oriented, time-variant, non-volatile, and integrated data repository for the entire enterprise. Data from different sources are validated, reformatted, and saved in a normalized (up to 3NF) database as the data warehouse. The data warehouse stores "atomic" information, the data at the lowest level of granularity, from which dimensional data marts can be built by selecting the data required for specific business subjects or particular departments. This is a data-driven approach, as the information is gathered and integrated first, and then the business requirements by subjects for building data marts are formulated. The advantage of this method is that it supports a single integrated data source. Thus, data marts built from it will be consistent where they overlap.
Advantages:

• Data marts are loaded from the data warehouse.
• Developing a new data mart from the data warehouse is very easy.

Disadvantages:

• This technique is inflexible to changing departmental needs.
• The cost of implementing the project is high.

BOTTOM-UP DESIGN APPROACH:

In the "bottom-up" approach, a data warehouse is described as "a copy of transaction data specifically architected for query and analysis," termed the star schema. In this approach, a data mart is created first to provide the necessary reporting and analytical capabilities for particular business processes (or subjects). Thus it is a business-driven approach, in contrast to Inmon's data-driven approach. Instead of a normalized database for the data warehouse, a denormalized dimensional database is adopted to meet the data delivery requirements of data warehouses. Using this method, to use the set of data marts as the enterprise data warehouse, the data marts should be built with conformed dimensions in mind, meaning that common objects are represented the same way in different data marts. The conformed dimensions connect the data marts to form a data warehouse, which is generally called a virtual data warehouse.

Advantages of bottom-up design:

• Documents can be generated quickly.
• The data warehouse can be extended to accommodate new business units.
• It is just a matter of developing new data marts and then integrating them with other data marts.

Disadvantages of bottom-up design:

• The locations of the data warehouse and the data marts are reversed in the bottom-up design approach.

Case Study - 2

Explain Data Warehousing Schemas?

Data warehousing (DW) is a process for collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system, which is built for data analysis and reporting. It is a blend of technologies and components which aids the strategic use of data. It is electronic storage of a large amount of information by a business, designed for query and analysis instead of transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference.

The only information shown via the conceptual data model is the entities that describe the data and the relationships between those entities. There may be more than one concept hierarchy for a given attribute or dimension, based on different users' viewpoints. No other information is shown through the conceptual data model.

The three common data warehouse schemas are:

• Star schema
• Snowflake schema
• Fact constellation

Star Schema:
A star schema in a data warehouse is a schema in which the center of the star can have one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema data model is the simplest type of data warehouse schema. It is also known as the star join schema and is optimized for querying large data sets.

• It consists of sets of relations known as dimension tables (DT) and fact tables (FT).
• There is a single large central fact table and one table for each dimension.
• The fact table's primary key is a composition of the set of foreign keys referencing the dimension tables.
• Every dimension table is related to one or more fact tables.
• Every fact points to one tuple in each of the dimensions and has additional attributes (the measures).
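To make the structure concrete, here is a minimal sketch of a star-schema join in Python with pandas; the table and column names (sales_fact, item_dim, time_dim) are illustrative, not taken from any particular warehouse.

import pandas as pd

# Dimension tables: one row per dimension member, keyed by a surrogate key.
item_dim = pd.DataFrame({"item_key": [1, 2], "item_name": ["TV", "Radio"], "brand": ["A", "B"]})
time_dim = pd.DataFrame({"time_key": [10, 11], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})

# Fact table: its key is a composition of foreign keys into the dimensions,
# plus numeric measures (here, units_sold and revenue).
sales_fact = pd.DataFrame({
    "item_key":   [1, 1, 2],
    "time_key":   [10, 11, 10],
    "units_sold": [100, 80, 50],
    "revenue":    [50000, 40000, 5000],
})

# A typical star join: connect facts to dimensions, then aggregate a measure.
report = (sales_fact
          .merge(item_dim, on="item_key")
          .merge(time_dim, on="time_key")
          .groupby(["brand", "quarter"])["revenue"].sum())
print(report)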

Characteristics:

• It creates a de-normalized database that can quickly provide query responses.
• It provides a flexible design that can be changed easily or added to throughout the development cycle, and as the database grows.
• It provides a design parallel to how end-users typically think of and use the data.
• It reduces the complexity of metadata for both developers and end-users.

Advantages:

• Query performance
• Load performance and administration
• Built-in referential integrity
• Easy to understand

Snowflake Schema:
A snowflake schema in a data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema that adds further, normalized dimension tables. The dimension tables are normalized, which splits the data into additional tables.

Characteristics:

• The main benefit of the snowflake schema is that it uses smaller disk space.
• It is easier to add a dimension to the schema.
• Due to the multiple tables, query performance may be reduced.
• The primary challenge you will face while using the snowflake schema is that you need to perform more maintenance because of the additional lookup tables.

Advantages:

• The snowflake schema's primary advantage is its ability to reduce disk storage requirements by joining smaller lookup tables, which can improve query performance.
• It provides greater scalability in the interrelationship between components and dimension levels.
• There is no redundancy, so it is easier to maintain.

Disadvantages:

• A significant disadvantage of the snowflake schema is the increased maintenance effort required.
• Complex queries are challenging to understand.

Fact Constellation Schema or Galaxy Schema:
A fact constellation can consist of multiple fact tables: two or more fact tables that share the same dimension tables. This schema is also known as a galaxy schema, since it can be viewed as a collection of stars. The shared dimensions in this schema are known as conformed dimensions. The dimensions in this schema are separated into segregated dimensions having different levels of hierarchy. As an example, consider a four-level hierarchy on geography: region, country, state, and city; such a galaxy schema has four dimension levels. Another way of creating a galaxy schema is by splitting one star schema into more star schemas; the dimensions created are large and built on the basis of hierarchy. This schema is useful when aggregation of fact tables is necessary. Fact constellations are considered to be more complex than star and snowflake schemas; they are more flexible but harder to implement and maintain. This type of schema is usually used for sophisticated applications. The multiple tables present in this schema make it difficult and complex to implement, so the architecture is more complex compared to the star and snowflake schemas.

Case Study - 3

Explain Types of OLAP Servers?

There are 3 main types of OLAP servers, as follows:

• Relational OLAP (ROLAP) - Star schema based -
ROLAP is based on the premise that data need not be stored multidimensionally in order to be viewed multidimensionally, and that it is possible to exploit the well-proven relational database technology to handle the multidimensionality of data. In ROLAP, data is stored in a relational database. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. ROLAP can handle large amounts of data and can leverage functionalities inherent in the relational database.
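As a rough illustration of the "slice and dice as WHERE clause" idea, the following Python sketch builds such a SQL statement; the table and column names are hypothetical, and a real ROLAP engine would of course generate far more elaborate SQL.

# Each (dimension, value) filter chosen in the OLAP front end becomes a
# predicate in the WHERE clause of the generated query.
def slice_and_dice(measures, fact_table, filters):
    select = ", ".join(measures)
    where = " AND ".join(f"{dim} = '{val}'" for dim, val in filters.items())
    return f"SELECT {select} FROM {fact_table} WHERE {where};"

# Slicing on quarter and dicing on region simply adds WHERE conditions:
print(slice_and_dice(["SUM(revenue)"], "sales_fact",
                     {"quarter": "Q1", "region": "Asia"}))
# SELECT SUM(revenue) FROM sales_fact WHERE quarter = 'Q1' AND region = 'Asia';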

• Multidimensional OLAP (MOLAP) - Cube based -

MOLAP stores data on disk in a specialized multidimensional array structure. OLAP is performed on it relying on the random-access capability of the arrays. An array element is determined by the dimension instances, and the fact data or measured value associated with each cell is usually stored in the corresponding array element. In MOLAP, the multidimensional array is usually stored in a linear allocation according to a nested traversal of the axes in some predetermined order.

Unlike ROLAP, where only records with non-zero facts are stored, all array elements are defined in MOLAP, and as a result the arrays generally tend to be sparse, with empty elements occupying the greater part of them. Since both storage and retrieval costs are important when assessing online performance efficiency, MOLAP systems typically include provisions such as advanced indexing and hashing to locate data while performing queries and to handle sparse arrays. MOLAP cubes are fast at data retrieval, are optimal for slicing and dicing, and can perform complex calculations. All calculations are pre-generated when the cube is created.
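A small NumPy sketch of this array-based storage follows; the dimension sizes and cell values are invented for illustration, and a real MOLAP server would add compression and indexing for the sparse cells.

import numpy as np

n_items, n_stores, n_quarters = 1000, 50, 4
cube = np.zeros((n_items, n_stores, n_quarters))   # every cell is allocated

# Fact data is written at the array element addressed by its dimension keys.
cube[17, 3, 0] = 250.0      # sales of item 17 in store 3 during Q1
cube[17, 3, 1] = 310.0

# Slicing and aggregation become fast array operations:
q1_slice = cube[:, :, 0]               # slice: fix the time dimension at Q1
sales_by_item = cube.sum(axis=(1, 2))  # roll up over stores and quarters

# The array is linearized in a predetermined (row-major) order and tends to
# be sparse: here only 2 of 200,000 cells are non-empty.
print(np.count_nonzero(cube), "non-empty cells of", cube.size)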
• Hybrid OLAP (HOLAP) -

HOLAP is a combination of ROLAP and MOLAP. HOLAP servers allow storing large volumes of detailed data. On the one hand, HOLAP leverages the greater scalability of ROLAP; on the other hand, it leverages cube technology for faster performance and for summary-type information. The cubes are smaller than in MOLAP since the detailed data is kept in the relational database, and the relational database is used to store the data in the most functional way possible.

Some other types of OLAP:

• Web OLAP (WOLAP) -
WOLAP is a web browser-based technology. A traditional OLAP application is accessed via client/server, but a WOLAP application is accessible through a web browser. It is a three-tier architecture consisting of client, middleware, and database server. The most appealing features of this style of OLAP were (past tense intended, since few products categorize themselves this way) the considerably lower investment involved on the client side ("all that's needed is a browser") and the enhanced accessibility to connect to the data. A web-based application requires no deployment on the client machine; all that is required is a web browser and a network connection to the intranet or Internet.

• Desktop OLAP (DOLAP) -

DOLAP stands for desktop online analytical processing. The user can download the data from the source and work with the dataset on their desktop. Functionality is limited compared to other OLAP applications, but it has a lower cost.

• Mobile OLAP (MOLAP) -

Mobile OLAP brings OLAP functionality to wireless or mobile devices. The user can work with and access the data through mobile devices.

• Spatial OLAP (SOLAP) -
SOLAP emerged to merge the capabilities of both Geographic Information Systems (GIS) and OLAP into a single user interface. SOLAP was created because data can come in alphanumeric, image, and vector form. It provides easy and quick exploration of data that resides in a spatial database.

Case Study - 4

Explain Architecture of MOLAP?

Architecture of MOLAP:
The MOLAP architecture consists of the following components:

• Database server
• MOLAP server
• Front-end tool

The Architecture in Detail:
Using the front-end tool, which is a component in the MOLAP architecture, the user can send a request for reports through the interface. Once the information request is sent, it reaches the application layer. This layer contains the MOLAP server, which takes responsibility for metadata request processing: the application logic layer retrieves the stored data from the metadata database. Once this is retrieved, the application layer forwards the result to the client or user.

The MOLAP architecture mainly focuses on precompiled data. This architecture provides a limited number of categories. You can dynamically create aggregations that help you obtain results, and these can be done for all customers. The metadata database retrieves the data of the accounts along with their user details, and once the results are retrieved it displays them to the user.
Implementation of MOLAP:
When implementing MOLAP, it is very important to consider that both the maintenance and storage implications have their strategies built in. There must be a proper cube-building setup so that everything happens in a systematic way. The languages used in the implementation of MOLAP are proprietary languages used to build queries. MOLAP tools must have extensive click-and-drag support, provided, for example, by Microsoft's MDX. When the cubes are created it is difficult to scale the number and size of the cubes, yet these should be scalable as and when the dimensions change or increase.

This also requires that the person doing this has a different skill set, and the tools used should also be different and should maintain the database. MOLAP directly supports multidimensional data and operations. The data is by default stored in a multidimensional array. This provides the user different perspectives of the data, which can aggregate the sales by time, geography, or product. The storage being used here is not a relational database: if the data were stored in a relational database, it would be difficult to access and process a table for every dimension. MOLAP processes data that can be easily stored in a multidimensional array, which helps in processing all the different combinations so that they can be reflected easily.

MOLAP thus needs less storage compared to ROLAP, where compression techniques create an issue. The data cubes discussed here are created by extracting data from operational databases or from a data warehouse. Data cubes cannot be created using ad hoc queries on the go; hence it is said that they work best with pre-defined queries. Data cubes are thus critical and require detailed front-end and design work.

Case Study - 5

Explain Various Data Mining Techniques?

Data mining tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates analysis and prediction. Many data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.

• Classification:

This technique is used to obtain important and relevant information about data and metadata. It helps to classify data into different classes.

• Clustering:

Clustering is a division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details, but accomplishes simplification. Data modeling puts clustering in a historical perspective rooted in mathematics and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept.

• Regression:

Regression analysis is the data mining process used to identify and analyze the relationship between variables. It is primarily a form of planning and modeling. For example, we might use it to project costs depending on other factors such as availability, consumer demand, and competition.

• Association Rules:

This data mining technique helps to discover a link between two or more items. It finds hidden patterns in the data set. Association rule mining has several applications and is commonly used to find sales correlations in retail or medical data sets.
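As a tiny, self-contained illustration, the Python sketch below computes the support and confidence of one candidate rule over invented basket data; real association rule miners (e.g. Apriori) enumerate many such rules efficiently.

# Four invented market baskets, each a set of purchased items.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of baskets containing every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

# Candidate rule: {bread} -> {butter}
antecedent, consequent = {"bread"}, {"butter"}
supp = support(antecedent | consequent)   # 2/4 = 0.50
conf = supp / support(antecedent)         # 0.50 / 0.75 = 0.67
print(f"support={supp:.2f}, confidence={conf:.2f}")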

• Outlier detection:

This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. An outlier is a data point that diverges too much from the rest of the data set.
• Sequential Patterns:

Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences.

• Prediction:

Prediction uses a combination of the other data mining techniques, such as trends, clustering, classification, etc.

Case Study - 6

Explain Data Mining techniques, various issues and challenges of Data Mining?

Here we will discuss the major issues regarding:

1) Mining Methodology and User Interaction
2) Performance Issues
3) Diverse Data Types Issues

1. Mining Methodology and User Interaction Issues:
This refers to the following kinds of issues:

• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not applied, the accuracy of the discovered patterns will be poor.

• Pattern evaluation − The patterns discovered should be interesting; patterns that represent common knowledge or lack novelty are of little value.

2. Performance Issues:
There can be performance-related issues such as the following:

• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel, and then the results from the partitions are merged. Incremental algorithms update databases without mining the data again from scratch.

3. Diverse Data Types Issues:

• Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Challenges of Data Mining:
Some of the data mining challenges are given as under:

1) Security and Social Challenges
2) Distributed Data
3) Complex Data
4) Scalability and Efficiency of the Algorithms
5) Data Visualization
6) Data Privacy and Security
7) User Interface

1. Security and Social Challenges:
Decision-making techniques rely on data collection and sharing, which require considerable security. Private information about individuals and sensitive information is gathered for customer profiles and for understanding user behavior patterns, so illicit access to the information and the confidential nature of the information become significant issues.

2. Distributed Data:

Real-world data is normally stored on various platforms in distributed computing environments. It may be on the Internet, on individual systems, or in databases. It is practically hard to bring all the data to a unified data repository, mainly for technical and organizational reasons.

3. Complex Data:
Real-world data is heterogeneous, and it may be media data, including natural language text, time series, spatial data, temporal data, audio or video, images, etc. It is truly hard to deal with these various types of data and extract the necessary information. More often than not, new tools and systems need to be developed to extract the important information.

4. Scalability and Efficiency of the Algorithms:
Data mining algorithms should be scalable and efficient in order to extract information from the tremendous amounts of data in the data set. The difficulty of data mining approaches, the enormous size of the database, and the entire data flow inspire the creation of distributed and parallel data mining algorithms.

5. Data Visualization:
Data visualization is a vital step in data mining, since it is the primary interaction that shows the output in a presentable way to the client. The information extracted should convey the meaning it intends to convey. However, it is often truly hard to present the information precisely and straightforwardly to the end user. For the output information and input data to be effective and successful, complex data visualization methods must be applied.

6. Data Privacy and Security:
Data mining typically prompts significant governance, privacy, and data security issues. For instance, when a retailer analyzes purchase details, it uncovers information about the buying habits and choices of customers without their authorization.

7. User Interface:
The knowledge determined using data mining tools is useful only if it is interesting and, above all, understandable to the user. With good visualization, the interpretation of data mining results can be facilitated, and users can better understand their requirements.

Case Study - 7

Explain Naïve Bayesian Classification?

Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probabilities:

• Posterior probability, P(H|X)
• Prior probability, P(H)

where X is a data tuple and H is some hypothesis. According to Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)
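A minimal numeric illustration of the theorem in Python follows; the prior, likelihood, and evidence values are invented for the example.

p_h = 0.3          # prior P(H), e.g. "the customer will buy"
p_x_given_h = 0.8  # likelihood P(X|H): evidence X is seen when H holds
p_x = 0.5          # evidence probability P(X)

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(f"Posterior P(H|X) = {p_h_given_x:.2f}")   # 0.48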

Bayesian Belief Network

Bayesian belief networks specify joint conditional probability distributions. They are also known as belief networks, Bayesian networks, or probabilistic networks.

• A belief network allows class conditional independencies to be defined between subsets of variables.
• It provides a graphical model of causal relationships on which learning can be performed.
• We can use a trained Bayesian network for classification.

There are two components that define a Bayesian belief network:

• A directed acyclic graph
• A set of conditional probability tables

Directed Acyclic Graph

• Each node in a directed acyclic graph represents a random variable.
• These variables may be discrete or continuous valued.
• These variables may correspond to actual attributes given in the data.

Directed Acyclic Graph Representation
Consider a directed acyclic graph for six Boolean variables. An arc in the graph allows representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as by whether or not the person is a smoker. It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.

Conditional Probability Table

The conditional probability table for the values of the variable LungCancer (LC) shows each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S).

Case Study - 8

Discuss classification method using decision tree and write its algorithm.

A decision tree is a supervised learning method used in data mining for classification and regression. It is a tree that helps us in decision-making. The decision tree creates classification or regression models as a tree structure. It separates a data set into smaller subsets, and at the same time the decision tree is steadily developed. The final tree is a tree with decision nodes and leaf nodes. A decision node has at least two branches. The leaf nodes show a classification or decision; we cannot accomplish more splits on leaf nodes. The uppermost decision node in a tree, which relates to the best predictor, is called the root node. Decision trees can deal with both categorical and numerical data.
Key Factors:

Entropy:
Entropy refers to a common way to measure impurity. In a decision tree, it measures the randomness or impurity in the data set.

Information Gain:
Information gain refers to the decline in entropy after the data set is split. It is also called entropy reduction. Building a decision tree is all about discovering the attributes that return the highest information gain.
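The following Python sketch computes both key factors on a toy label set; the labels and the candidate split are invented for illustration.

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, partitions):
    # Gain = entropy before the split minus the weighted entropy after.
    total = len(labels)
    after = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(labels) - after

labels = ["yes", "yes", "no", "no", "yes", "no"]
# A candidate attribute splits the records into two pure partitions:
split = [["yes", "yes", "yes"], ["no", "no", "no"]]
print(entropy(labels))                  # 1.0 (maximally impure)
print(information_gain(labels, split))  # 1.0 (a perfect split)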

In short, a decision tree is just like a flow-chart diagram with the terminal nodes showing decisions. Starting with the data set, we can measure the entropy to find a way to segment the set until all the data in each segment belongs to the same class.

Why are decision trees useful?

• They enable us to analyze the possible consequences of a decision thoroughly.
• They provide us a framework to measure the values of outcomes and the probability of accomplishing them.
• They help us make the best decisions based on existing data and best speculations.

In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an extensive collection of records into smaller sets of classes by applying a sequence of simple decision rules. A decision tree model comprises a set of rules for partitioning a huge heterogeneous population into smaller, more homogeneous, mutually exclusive classes. The attributes of the records can be any type of variable: nominal, ordinal, binary, or quantitative; the classes, in contrast, must be of a qualitative type, such as categorical, ordinal, or binary. In brief, given data of attributes together with their class, a decision tree produces a set of rules that can be used to identify the class. One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called the tree, and each segment is called a node. With each progressive division, the members of the resulting sets become more and more similar to each other. Hence, the algorithm used to build a decision tree is referred to as recursive partitioning. One well-known such algorithm is CART (Classification and Regression Trees).
Consider the given example of a factory:

Expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%), which leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which leads to $6 million profit. Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which leads to $4 million profit, and the probability of a bad economy is 0.4, which leads to $2 million profit. The management team needs to take a data-driven decision to expand or not based on the given data: the expected value of expanding is 0.6 × $8M + 0.4 × $6M − $3M = $4.2M, while the expected value of not expanding is 0.6 × $4M + 0.4 × $2M = $3.2M, so the data favors expansion.

Decision Tree Induction Algorithm
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.

Generating a decision tree from the training tuples of data partition D

Input:

• Data partition D, which is a set of training tuples and their associated class labels.
• attribute_list, the set of candidate attributes.
• Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting attribute and either a splitting point or a splitting subset.

Output:
A decision tree

Algorithm:

create a node N;
if tuples in D are all of the same class C then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute; // remove splitting attribute
for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
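For comparison, here is a hedged sketch of the same idea using scikit-learn's DecisionTreeClassifier (a CART implementation) with entropy as the splitting criterion; the tiny weather-style data set is invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded attributes: [outlook (0=sunny, 1=rain), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = ["no", "yes", "no", "yes"]     # class label: play tennis?

tree = DecisionTreeClassifier(criterion="entropy")  # split on information gain
tree.fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity"]))
print(tree.predict([[0, 0]]))      # -> ['yes']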

Case Study - 9

Explain K-Means Algorithm?

This case study demonstrates how to use the K-Means clustering algorithm to perform segmentation of the companies in the Standard & Poor's (S&P) 500 index, based on the text of the Wikipedia articles about each company.
Data
The articles from Wikipedia were pre-processed outside Azure ML Studio to extract and partially clean the text content related to each company. The processing included:

• Removing wiki formatting
• Removing non-alphanumeric characters
• Converting all text to lowercase
• Adding company categories, where known

For some companies, articles could not be found; therefore, the number of records is less than 500.
Model
Step 1) First, the contents of each wiki article were passed to the Feature Hashing module, which tokenizes the text string and then transforms the data into a series of numbers based on the hash value of each token.

Feature hashing works by converting unique tokens into integers. It operates on the exact strings provided as input and does not perform any linguistic analysis or pre-processing.

For example, take a set of simple sentences, each followed by a sentiment score, and assume that you want to use this text to build a model. Internally, the Feature Hashing module creates a dictionary of n-grams; for example, the list of bigrams for such a dataset would contain one entry per distinct pair of adjacent tokens.

You can control the size of the n-grams by using the N-grams property. If you choose bigrams, unigrams are also computed, so the dictionary also includes single terms. After the dictionary has been built, the Feature Hashing module converts the dictionary terms into hash values and computes whether a feature was used in each case. For each row of text data, the module outputs a set of columns, one column for each hashed feature.
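The same idea can be sketched with scikit-learn's HashingVectorizer, which, like the Feature Hashing module, maps tokens (and optionally bigrams) to a fixed number of hashed columns; the example sentences are invented.

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the service was great", "the food was terrible"]

# ngram_range=(1, 2) builds unigrams and bigrams; n_features fixes the size
# of the hashed output space regardless of vocabulary size.
hasher = HashingVectorizer(n_features=2**10, ngram_range=(1, 2))
X = hasher.transform(docs)
print(X.shape)   # (2, 1024): one row per document, one column per hash bucket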

Step 2) Even with this transformation, the dimensionality of the data is too high and too sparse to be used by the K-Means clustering algorithm directly. Therefore, Principal Component Analysis (PCA) was applied, using a custom R script in the Execute R Script module, to reduce the dimensionality to 10 variables.

Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset.
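A rough Python sketch of this dimensionality-reduction step is shown below; random dense data stands in for the hashed text features (the original experiment used an R script, so this is only an analogous illustration).

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
hashed_features = rng.random((450, 1024))   # ~450 companies x 1024 hashed columns

pca = PCA(n_components=10)                  # keep the 10 strongest components
reduced = pca.fit_transform(hashed_features)
print(reduced.shape)                        # (450, 10)
print(pca.explained_variance_ratio_[:3])    # variance captured per component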

Step 3) By trial and error, we learned that the first variable (in the PCA-transformed data) had the highest variance and appeared to have a detrimental effect on clustering. Therefore, we removed it from the feature set using Project Columns.

Step 4) Once the data was prepared, we created several different instances of the K-Means Clustering module and trained models on the text data. By trial and error, we found that the best results were obtained with 3 clusters, but models using 4 and 5 clusters were also tried.

K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters).

The algorithm is composed of the following steps (a compact code sketch follows the list):

1. Place k points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
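Here is a compact NumPy sketch of those steps; the two-dimensional sample points are invented for illustration.

import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids from the objects being clustered.
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the group with the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the positions of the k centroids.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.array([[1, 1], [1.2, 0.8], [5, 5], [5.2, 4.9], [9, 1], [8.8, 1.1]])
labels, centroids = k_means(points, k=3)
print(labels)   # e.g. [0 0 1 1 2 2] - three well-separated groups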

Step 5) Finally, we used the Metadata Editor to change the cluster labels into categorical values and saved the results in CSV format for downloading, using the Convert to CSV module.

Results

To view the results from the sample experiment:

1. Right-click the output of the Metadata Editor and select Visualize.
2. Plot the Category column (a known feature from the Wikipedia data) against the Assignments column.

The three clusters that we obtained correspond roughly to three plausible categories. Note that the clusters are not clearly delineated.

Case Study - 10

Explain clustering and its major classification?

Clustering is a type of unsupervised learning method in machine learning. In an unsupervised learning method, the inferences are drawn from data sets which do not contain a labelled output variable. It is an exploratory data analysis technique that allows us to analyze multivariate data sets.

Clustering is the task of dividing a data set into a certain number of clusters in such a manner that the data points belonging to a cluster have similar characteristics. Clusters are nothing but groupings of data points such that the distance between the data points within a cluster is minimal. Clustering is done to segregate groups with similar traits.

In other words, clusters are regions where the density of similar data points is high. Clustering is generally used for the analysis of a data set, to find insights among huge data sets and draw inferences from them. Generally, the clusters are seen in a spherical shape, but this is not necessary, as the clusters can be of any shape. The type of algorithm we use decides how the clusters will be created. The inferences that need to be drawn from the data sets also depend upon the user, as there is no universal criterion for good clustering.

Classification of clustering
Clustering itself can be categorized into two types, viz. hard clustering and soft clustering. In hard clustering, one data point can belong to one cluster only. In soft clustering, the output provided is a probability or likelihood of a data point belonging to each of the pre-defined number of clusters.

Density-Based Clustering
In this method, the clusters are created based upon the density of the data points as represented in the data space. The regions that become dense due to the huge number of data points residing in them are considered clusters.

The data points in sparse regions (regions where the data points are very few) are considered noise or outliers. The clusters created by these methods can be of arbitrary shape. The following are examples of density-based clustering algorithms:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN groups data points together based on a distance metric and a criterion for a minimum number of data points. It can discover clusters of different shapes and sizes from a large amount of data containing noise and outliers. It takes two parameters: eps and minimum points. Eps indicates how close the data points should be to be considered neighbors; the criterion for minimum points must be met for a region to be considered dense.
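A short sketch of DBSCAN via scikit-learn follows; eps and min_samples play the roles of eps and minimum points described above, and the sample points are invented.

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1, 1], [1.1, 0.9], [0.9, 1.0],    # dense region A
                   [5, 5], [5.1, 4.9], [4.9, 5.1],    # dense region B
                   [9, 0]])                           # isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; -1 marks noise/outliers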

OPTICS (Ordering Points to Identify Clustering Structure)
OPTICS follows a process similar to DBSCAN but overcomes one of its drawbacks, i.e. the inability to form clusters from data of varying density. It considers two more parameters: core distance and reachability distance. The core distance indicates whether the data point being considered is a core point, by setting a minimum value for it. The reachability distance is the maximum of the core distance and the value of the distance metric used for calculating the distance between two data points. One thing to note about the reachability distance is that its value is undefined when one of the data points involved is not a core point.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)
HDBSCAN is a density-based clustering method that extends the DBSCAN methodology by converting it into a hierarchical clustering algorithm.

Hierarchical Clustering
Hierarchical clustering groups (agglomerative, also called the bottom-up approach) or divides (divisive, also called the top-down approach) the clusters based on distance metrics.

In agglomerative clustering, initially each data point acts as a cluster, and then the clusters are merged one by one. This is one of the most sought-after clustering methods. Divisive is the opposite of agglomerative: it starts off with all the points in one cluster and divides them to create more clusters. These algorithms create a distance matrix of all the existing clusters and perform linkage between the clusters depending on the linkage criterion. The clustering of the data points is represented using a dendrogram. There are different types of linkages (a short code sketch follows the list):
• Single linkage: In single linkage, the distance between two clusters is the shortest distance between points in those two clusters.
• Complete linkage: In complete linkage, the distance between two clusters is the farthest distance between points in those two clusters.
• Average linkage: In average linkage, the distance between two clusters is the average distance of every point in one cluster to every point in the other cluster.
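A brief SciPy sketch of agglomerative clustering follows; the method argument ("single", "complete", or "average") selects one of the linkage criteria above, and the sample points are invented.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])

Z = linkage(points, method="average")            # build the merge hierarchy
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)   # e.g. [1 1 2 2 3]
# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram.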

Fuzzy Clustering
In fuzzy clustering, the assignment of data points to clusters is not decisive: one data point can belong to more than one cluster. The outcome is provided as the probability of a data point belonging to each of the clusters. One of the algorithms used in fuzzy clustering is fuzzy c-means clustering.

This algorithm is similar in approach to K-Means clustering. It differs in the parameters involved in the computation, such as the fuzzifier and the membership values. This clustering technique allocates membership values to each data point relative to each cluster center, based on the distance between the cluster center and the data point.
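The membership computation at the heart of fuzzy c-means can be sketched as follows; the fuzzifier m and the cluster centers are assumed values for illustration, not the result of a full FCM run.

import numpy as np

def memberships(points, centers, m=2.0):
    # u[i, j]: membership of point i in cluster j; each row sums to 1.
    # Standard FCM rule: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)   # avoid division by zero for a point on a center
    power = 2.0 / (m - 1.0)
    return 1.0 / ((d[:, :, None] / d[:, None, :]) ** power).sum(axis=2)

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]])
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
print(memberships(points, centers))   # the middle point belongs ~50/50 to both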

Partitioning Clustering
This method is one of the most popular choices for analysts to create clusters. In partitioning clustering, the clusters are partitioned based upon the characteristics of the data points. We need to specify the number of clusters to be created for this clustering method. These clustering algorithms follow an iterative process to reassign the data points between clusters based upon the distance. The algorithms that fall into this category are as follows:

• K-Means Clustering: K-Means clustering is one of the most widely used algorithms. It partitions the data points into k clusters based upon the distance metric used for the clustering. The value of 'k' is to be defined by the user. The distance is calculated between the data points and the centroids of the clusters.

• K-means clustering is a type of unsupervised learning used when you have unlabeled data (i.e., data without defined categories or groups). The algorithm aims to find groups in the data, with the number of groups represented by the variable k.

• The data point which is closest to the centroid of a cluster gets assigned to that cluster. After an iteration, it computes the centroids of those clusters again, and the process continues until a pre-defined number of iterations is completed or the centroids of the clusters do not change after an iteration.

• It is a computationally expensive algorithm, as it computes the distance of every data point to the centroids of all the clusters at each iteration. This makes it difficult to implement for huge data sets.

PAM (Partitioning Around Medoids)
This algorithm is also called the k-medoids algorithm. It is similar in process to the K-means clustering algorithm, with the difference being in the assignment of the center of the cluster. In PAM, the medoid of the cluster has to be an input data point, while this is not true for K-means clustering, where the average of all the data points in a cluster may not correspond to an input data point.

• CLARA (Clustering Large Applications): CLARA is an extension of the PAM algorithm in which the computation time is reduced to make it perform better for large data sets. It arbitrarily selects a portion of the data from the whole data set as a representative of the actual data. It applies the PAM algorithm to multiple samples of the data and chooses the best clusters from a number of iterations. Because it uses only random samples of the input data (instead of the entire data set) and computes the best medoids from those samples, it works better than PAM for crowded data sets and is intended to reduce the computation time for large data sets.

Grid-Based Clustering
In grid-based clustering, the data set is represented in a grid structure which comprises grids (also called cells). The overall approach in the algorithms of this method differs from the rest of the algorithms: they are more concerned with the value space surrounding the data points rather than the data points themselves.

One of the greatest advantages of these algorithms is the reduction in computational complexity, which makes them appropriate for dealing with humongous data sets. After partitioning the data set into cells, the density of the cells is computed, which helps in identifying the clusters. A few algorithms based on grid-based clustering are as follows:

• STING (Statistical Information Grid Approach): In STING, the data set is divided recursively in a hierarchical manner, with each cell further sub-divided into a different number of cells. The statistical measures of each cell are collected ahead of time, which helps answer queries as quickly as possible.

• Wave Cluster: In this algorithm, the data space is represented in the form of wavelets. The data space composes an n-dimensional signal which helps in identifying the clusters. The parts of the signal with a lower frequency and high amplitude indicate that the data points are concentrated; these regions are identified as clusters by the algorithm. The parts of the signal where the frequency is high represent the boundaries of the clusters. Wavelet transformation can be used to change the original feature space and find dense domains in the transformed space.

• CLIQUE (Clustering in Quest): CLIQUE is a combination of density-based and grid-based clustering algorithms. It partitions the data space and identifies the sub-spaces using the Apriori principle. It identifies the clusters by calculating the densities of the cells. It can find clusters of any shape and is able to find any number of clusters in any number of dimensions, where the number is not predetermined by a parameter. It outperforms K-means, DBSCAN, and Farthest First in both execution time and accuracy.
