DM Unit 1

Q. Define the following terms:
(i) Data (ii) Data set (iii) Attribute (iv) Measurement scale

Answer :

(i) Data
Data refers to the information which is produced with the help of research for the purpose of making an appropriate decision. Morgenstern, quoted in Jan. 19.., observed that data is silent; it is the use to which it is put, in terms of inferring and decision making, that is important. Data is beneficial only if it provides good and reliable information. A researcher must use an appropriate strategy or method to give meaning to the collected data.

(ii) Data Set
A data set is a collection of data objects, also called records, events, patterns, entities or vectors. Data objects are defined using several fundamental characteristics, such as the time of the event.
Example: [Table: employee records with columns such as Dept and Designation and date fields]

(iii) Attribute
The property of an object that changes from time to time or from one object to another is called an attribute. An attribute is not itself a symbol or a number; symbols and numbers are assigned to an attribute in order to analyze its features.
Example: A colour trait of humans, which may vary from one person to another.

(iv) Measurement Scale
A predetermined rule that associates either a symbol or a numerical value with an object's attribute is called a measurement scale.
Example: Measurement of height and weight.

SPECTRUM ALL-IN-ONE JOURNAL FOR ENGINEERING STUDENTS

Q. Explain the different types of attributes.

Answer :

The different types of attributes are,
1. Nominal
2. Ordinal
3. Interval
4. Ratio.

1. Nominal
Nominal attribute values only distinguish one object from another object. It uses the = and ≠ symbols, and the operations that can be performed on this attribute include the χ² test, correlation, entropy, contingency and mode.
Example: Student IDs, gender and major.

2. Ordinal
Ordinal attribute values offer enough information about an object for one object to be compared against other objects. It uses the < and > symbols, and the operations that can be performed on this attribute include the sign test, percentiles, run test and rank correlation.
Example: Grades of students.

3. Interval
The interval attribute provides meaningful differences among various objects; in other words, a unit of measurement exists. It uses the + and - signs, and the operations that can be performed on this attribute include mean, t and F tests, standard deviation and Pearson's correlation.
Example: Difference between calendar dates.

4. Ratio
The ratio attribute provides meaningful differences and ratios between two object values. It uses the symbols * and /, and the operations that can be performed on this attribute include percent variation, harmonic mean and geometric mean.
Example: Mass and length.

All four attribute types have a similar kind of characteristic. However, an operation applicable to one attribute type may or may not be applicable to the other attribute types. For instance, though an employee ID is numerical, several operations cannot be performed on it, for example the average of IDs.

For interval and ratio attributes, the permissible transformations are,

3. Interval
It is expressed as,
Current value = P * Previous value + Q
Here P and Q are constants.
Example: The Celsius and Fahrenheit temperature scales differ in their zero point and unit size.

4. Ratio
The meaning of these values does not change even when the units are changed. It is expressed as,
Current value = P * Previous value
Example: Measurement of length gives the same result whether it is measured in feet or meters.

Q. What is data mining? List and describe the motivating challenges of data mining.

Answer :

Data Mining
Data mining is the process of extracting knowledge from massive volumes of data. It refers to the extraction of significant and useful information from an organization's database. The knowledge which is extracted includes association rules and different trends, among others. Data mining is not confined to a particular organization; instead, it can be used to explore the knowledge hidden in any data. The techniques used for digging out data are artificial intelligence, statistical and mathematical techniques, and pattern recognition techniques.

WARNING: Xerox/Photocopying of this book is a CRIMINAL act. Anyone found guilty is LIABLE to face LEGAL proceedings.

Motivating Challenges of Data Mining

1. Complex Heterogeneous Data
Data growth in various fields such as science, medicine and finance produces complex, heterogeneous and non-traditional data, such as semi-structured text, unstructured text, hyperlinks, multimedia, stream data, time series and high-dimensional DNA data. This type of data cannot be handled by classical data analysis techniques, which are capable of handling only homogeneous data sets. Therefore, new techniques are needed that can handle graph connectivity, correlations and relations among XML documents, and temporal and spatial data.

2. Distributed Data
The data needed for analysis does not, in certain circumstances, belong to a single location. Mining such distributed data requires techniques that face several challenges,
(i) Techniques to minimize the communication needed for distributed computing
(ii) Integration of data mining results from multiple sources
(iii) Handling data security.

3. Scalability
Data mining algorithms must be capable of handling huge volumes of data, as advanced data generation and collection techniques produce data sets in excess of gigabytes. These algorithms can be made more scalable by,
(a) Sampling data
(b) Implementing new data structures
(c) Developing distributed and parallel algorithms.

4. Non-traditional Analysis
The generation and evaluation of hypotheses for such data is laborious. Therefore, hypothesis generation and evaluation need to be automated.

5. High Dimensionality
Present-day data sets have several hundreds or thousands of attributes. For example, microarray technology produces data with thousands of dimensions. The number of dimensions grows with technology. These data sets cannot be handled by traditional techniques.

Q18. Explain the significance of data mining in database technology advancement.

Answer :

The following are the reasons for using data mining,
1. Knowledge discovery
2. Data visualization
3. Data correction.

1. Knowledge Discovery
The objective of knowledge discovery is to identify the invisible correlations in the data.

2. Data Visualization
The objective of data visualization is to make sense of the data so as to find a sensible way to display it.
3. Data Correction
Data correction deals with inconsistent data.

The advancement is seen in the following features of database system technology,
(i) Data gathering and database creation
(ii) Data management
(iii) Advanced data evaluation.

Data gathering and database creation mechanisms were developed first. They led to the development of data management, which includes data storage and transaction processing techniques, which in turn led to the development of advanced data analysis.

Data gathering and database creation was based on the conventional file processing system. Later, in the 70's, databases evolved from file processing systems to highly developed and more complex database systems, starting from hierarchical and network database systems and moving to relational database systems. These databases also include different data modeling tools and data accessing approaches through the use of query languages and user interfaces. They also include an OLTP tool, which is responsible for storing and managing massive amounts of data efficiently.

In the 1980's, DBMS evolved towards more effective database systems. These database systems promoted the development of different data models, which include the object-relational model, the object-oriented model etc. Different application-dependent database systems, such as spatial databases, multimedia databases and knowledge databases, have been developed. The constant development in hardware technology strengthened the database industry and resulted in huge development of data collection and storage devices. With this technology, multiple databases and repositories are available for transaction management and for data analysis.

The data warehouse, constructed as a repository in the 1980's, can be used to store data collected from different data sources. The functionalities of a data warehouse include data cleaning, data integration and, most importantly, online analytical processing.
OLAP refers to an analysis technique that performs operations like summarization, transformation and visualization. Though OLAP provides tools for data analysis, additional tools are needed to support in-depth analysis, which includes outlier analysis, cluster analysis and evolution analysis. It is possible to archive a large amount of data that crosses the storage limit of a data warehouse.

In the 1990's, different web-based databases such as XML databases and various data integration and retrieval techniques were developed. The WWW is an internet-based information system which plays a crucial role in the database industry.

When huge volumes of data are collected without effective analysis tools, the situation is described as data rich but information poor. Without powerful tools it is very difficult to understand the data in huge databases. Because of this, data repositories become data tombs, where important strategic decisions are made on the insight of the decision maker and not on the data stored in the repositories. The data tombs can be turned into nuggets of knowledge by using data mining tools to reveal the essential data patterns.

Q19. Discuss in detail the process of knowledge discovery.
OR
Explain data mining as a step in the process of knowledge discovery.

Answer :

Knowledge Discovery (KDD) Process
There are seven different stages in the KDD process. This process takes raw data as input and provides the useful information requested by users as the output. The objective of the KDD process is to gain a good understanding of the organization's data.

(i) Data Cleaning and Preprocessing Stage
Real world data in the databases is often noisy or has missing values. Such data is cleaned before mining.

(ii) Data Integration Stage
The data required for the data mining process may come from multiple sources, such as databases.

(iii) Data Selection Stage
The data relevant to the analysis is selected for processing.

(iv) Data Transformation and Reduction Stage
In the transformation stage, the data extracted from the data sources is prepared for the data mining process. It may also need to be converted into a usable format. Data reduction is used to decrease the amount of data being considered without affecting the integrity of the data.

(v) Data Mining (Discovery) Stage
Data mining is an important process in which techniques are applied so as to extract the hidden patterns for evaluation. It does this by applying an algorithm to the data generated after the transformation stage. The output of this stage is a set of correlations or patterns. The relationship may be between different attributes within the same dimension or between the attributes of the same class. Data mining is a process of discovering potentially useful knowledge from large volumes of data stored in different repositories.

(vi) Pattern Evaluation Stage
The patterns produced by the data mining stage are evaluated.

(vii) Knowledge Visualization Stage
Visualization tools are used to represent the knowledge to the user in the form of charts and graphs. Presentation also stores the knowledge in the knowledge base for later use. Visualization techniques include geometric, icon-based, pixel-based techniques etc. Using visualization, users can summarize, extract and grasp difficult results in an easy and understandable manner.

Q20. Discuss data mining as a step in knowledge discovery process and various challenges.

Answer :

Data Mining as a Step in Knowledge Discovery
For answer refer Unit-I, Q19.

Challenges of Data Mining
For answer refer Unit-I, Q17, Topic: Motivating Challenges of Data Mining.
Q21. Write short notes on different types of databases.

Answer :

Types of Databases
A database consists of a structured collection of data and allows a user to enter queries so as to extract the desired information. A database management system consists of software that includes methods for defining the database structure and data storage, and methods for ensuring the consistency and security of information.

1. Relational Database
The relational database is considered one of the traditional databases; it consists of abundant information stored in the form of tables. An individual table contains information about a particular object, represented in rows and columns. A column specifies one of the attributes of the object and a row (also called a tuple) specifies an actual instance of the object. The relations in a relational database are the different tables in the database. It is mandatory that every row in a table has a unique value, i.e., every tuple contains a unique key.

Relational databases are based on Entity Relationship (ER) diagrams that represent the database entities and their relationships. Let us consider a relational database of students that contains the following relations,
Student, College

The Student table consists of the following fields,
Student ID, Student Name, Address, Age, DOJ

The College table consists of the following fields,
College ID, College Name, Address

These fields describe the properties of the entities. Tables can also be used to show the relation between two or more tables. For example, the relation between the student table and the college table is "join".

In order to access the data, database queries written in a relational query language such as SQL can be used. When a query is entered by the user, it is converted into a group of relational operations like union, intersection, selection etc. With a query, only the small specified portion of the data is fetched. Aggregate functions like sum, avg, max and min can be included in the relational query language.

When data mining techniques are applied to relational databases, patterns can be discovered. The method of accessing data is easy to understand.
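The Student and College relations described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the sample rows are invented, and the college_id column on the student table is a hypothetical foreign key added so that the "join" between the two tables can be shown (the notes do not list it among the Student fields).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding hypothetical Student and College rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE college (
            college_id   INTEGER PRIMARY KEY,
            college_name TEXT,
            address      TEXT
        );
        CREATE TABLE student (
            student_id   INTEGER PRIMARY KEY,
            student_name TEXT,
            address      TEXT,
            age          INTEGER,
            doj          TEXT,
            -- hypothetical foreign key, assumed here to make the join possible
            college_id   INTEGER REFERENCES college(college_id)
        );
        INSERT INTO college VALUES
            (1, 'City College', 'Main Road'),
            (2, 'Hill College', 'Lake View');
        INSERT INTO student VALUES
            (101, 'Asha',  'North Street', 19, '2021-07-01', 1),
            (102, 'Ravi',  'South Street', 21, '2020-07-01', 1),
            (103, 'Meena', 'East Street',  22, '2019-07-01', 2);
        """
    )
    return conn


def students_per_college(conn):
    """Join the two tables and apply the aggregate functions COUNT and AVG."""
    query = """
        SELECT c.college_name, COUNT(*), AVG(s.age)
        FROM student AS s
        JOIN college AS c ON s.college_id = c.college_id
        GROUP BY c.college_name
        ORDER BY c.college_name
    """
    return conn.execute(query).fetchall()
```

Calling students_per_college(build_demo_db()) returns one tuple per college with the number of students and their average age, which shows how a query fetches only a small, summarized portion of the stored data.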
[Tables: Student table and College table]

2. Data Warehouse
A data warehouse is a system (database) where large volumes of data are stored so that they are easily accessible. A database system stores day-to-day information and answers routine queries; a data warehouse, on the other hand, provides solutions to sophisticated queries.

A data warehouse can be defined as a subject-oriented (but not application-oriented), consolidated, time-dependent and non-erasable collection of data which mainly supports decision making and management. From this definition, the following features follow,
(a) Subject-oriented (not application-oriented)
(b) Consolidated data
(c) Time-dependent
(d) Non-erasable data.

The data in a data warehouse is archived over time, which enables the analysis of historic data. The data warehouse contains data which is transferred from operational systems. Two operations are executed by a data warehouse for accessing data,
(a) Data loading method
(b) Data access method.

3. Object-relational Databases
These databases are based on objects. Each object is associated with a set of variables, a set of messages that provide the communication protocol among the objects, and a set of methods. Objects that share all the properties are grouped into classes.

4. Transactional Databases
These databases are operational-oriented and store day-to-day transactions. A unique ID is assigned to each individual transaction, which is associated with a list of product IDs. Apart from information about the transaction, these databases also include additional tables.

5. Temporal, Sequence, Time-series Databases

(a) Temporal Databases
A temporal database usually consists of relational data which includes attributes related to time. The timestamps of these attributes have distinguishable meanings.

(b) Sequence Databases
These databases consist of event sequences, which may or may not have a concrete notion of time.
Example: Biological sequences, customer shopping sequences etc.

(c) Time-series Databases
This database consists of values that are acquired by repeatedly measuring over time, at intervals like quarterly or half yearly, such as inventory and stock exchange data. Data mining methods can be applied to extract the behaviour of objects or trends, which assists decision support systems in planning. Such analysis requires different levels of time abstraction.

6. Spatial Databases
These databases consist of information which is spatially related. They are generally used to store and query data associated with the objects in a given area or space. Spatial databases handle massive volumes of data which is sharable between geographical information system applications, like geographic databases, CAD databases etc.

Conventional databases handle numeric and character types of data. However, extra functionality is required to process spatial types of data. The raster format, consisting of n-dimensional bit maps and pixel maps, can be used to represent spatial data. The vector format can be used to represent objects such as bridges, roads and buildings using basic geometric constructs.

There are many applications where geographic databases are used, such as,
(a) Ecology planning
(b) Providing service information associated with the cables of a telephone system, electric cables, sewage lines etc.

Data mining techniques can be applied in order to reveal patterns that describe,
(a) The features of houses situated near a particular location
(b) The climatic conditions of regions situated at different locations
(c) The trend of change in metropolitan poverty rates depending on the distance of cities from highways.

It is also possible to create spatial data cubes to arrange data in a multidimensional format, backed up by relational databases. These data cubes support different OLAP operations like slicing, drilling, selection etc. The changes of spatial objects can be monitored, and spatial cluster analysis can be done to detect the clusters present in the data.

7. Text Databases
The objective of these databases is to describe an object in words. Text databases can be,
(a) Highly unstructured
(b) Semi-structured
(c) Well structured.

8. Multimedia Databases
Multimedia databases store image, audio and video data.

9. Heterogeneous Databases and Legacy Databases
(a) Heterogeneous Databases
A heterogeneous database consists of component databases situated at different locations.
(b) Legacy Databases
Legacy databases are a group of heterogeneous databases that integrate various kinds of data systems, such as hierarchical and network databases.
Since different databases use different representation formats, it is difficult to exchange information between them.
Data mining helps to resolve this issue by transforming the data into a higher, more generalized level.

2. Object-relational Databases
3. Temporal, Sequence, Time-series Databases
4. Spatial Databases

5. Spatiotemporal Databases
Spatiotemporal databases contain spatial objects that change with respect to time. Models can be built based on the behavior of such objects. For example, several moving objects can be grouped by their identical behavior, and an unusual event can be distinguished from a normal one.

6. Text Databases
For answer refer Unit-I, Q21, Topic: Text Databases.

7. Multimedia Databases
For answer refer Unit-I, Q21, Topic: Multimedia Databases.

8. Heterogeneous Databases and Legacy Databases
For answer refer Unit-I, Q21, Topic: Heterogeneous Databases and Legacy Databases.

9. Stream Data
Stream data refers to a dynamic flow of data into and out of an observation platform or window. It is a type of data that is generated and analyzed in several different applications. Online analysis and mining must be performed on the stream data. Some examples are stock exchange data etc.

10. World Wide Web (WWW)
The WWW, or simply the web, is a repository of information that is spread across the world by distributed information services like Google, Yahoo, Alta Vista and America On-line. The WWW is basically a collection of web pages in which the data objects provide interactive access to the user. By traversing from one page to another via the hyperlinks, the user can access the required information. Thus, the WWW contributes a rich source for data mining.

If the patterns of accessing the web in a distributed information environment are mined, then such type of mining is known as web usage mining or Weblog mining.
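As a rough illustration of web usage mining, the sketch below counts how often users move from one page to another. It is a minimal example under stated assumptions: the log is a hypothetical, time-ordered list of (user, page) pairs, and the function name page_transitions is invented for this illustration; real web server logs need parsing and session handling that are not shown here.

```python
from collections import Counter


def page_transitions(log):
    """Count page-to-page moves per user from a time-ordered access log.

    log is a list of (user, page) pairs; this format is an assumption
    made for the sketch, not a real server log layout.
    """
    last_page = {}            # most recent page seen for each user
    transitions = Counter()   # (from_page, to_page) -> number of moves
    for user, page in log:
        previous = last_page.get(user)
        if previous is not None and previous != page:
            transitions[(previous, page)] += 1
        last_page[user] = page
    return transitions


# Hypothetical access log used only to exercise the function.
sample_log = [
    ("u1", "/home"), ("u2", "/home"),
    ("u1", "/courses"), ("u2", "/courses"),
    ("u1", "/fees"), ("u2", "/home"),
]
```

Calling page_transitions(sample_log).most_common(1) gives the most frequent navigation path, which is the kind of access pattern that web usage mining looks for.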
When such access patterns are uncovered, the design of the web site can be improved so that it serves its users more efficiently.

1.2 DATA MINING FUNCTIONALITIES

Q. Explain data mining as a step in the process of knowledge discovery. Mention the functionalities of data mining.
OR
Discuss various kinds of patterns to be mined.
OR
Explain various data mining tasks.
OR
Explain the functionalities of data mining.

Answer :

Data Mining as a Step in the Process of Knowledge Discovery
For answer refer Unit-I, Q19.

Functionalities of Data Mining

Association Analysis
Association analysis discovers rules of the following form,
Purchase(X, "Computer") => Purchase(X, "USB devices") [support, confidence]
The confidence implies how often USB devices are purchased along with a computer, and the support implies how often both a computer and USB devices are purchased together. As this rule consists of a single predicate, it is referred to as a single-dimensional association rule. The above rule can be simplified by deleting the predicate and writing it as,
Computer => USB devices [support, confidence]
On the other hand, if a rule consists of more than one predicate, it is referred to as a multidimensional association rule. Association rules are eliminated if they do not satisfy the minimum support and certainty thresholds. Further analysis must be performed to reveal the desired statistical correlations between the associated pairs of attribute and value.

Data Prediction
Regression analysis is commonly used for data prediction. Data prediction also includes the recognition of distribution trends based on the available data.

Cluster Evaluation
Clustering groups objects into classes of similar objects on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. A cluster is defined as a group of data objects that are highly similar to the other objects in the same cluster and are dissimilar to the objects belonging to other clusters.

[Figure: Clusters]

An individual cluster can be viewed as a class of objects, using which clustering rules can be derived. Classification distinguishes data objects by analysing class-labeled data, whereas clustering evaluates the data objects without such labels. The advantage of clustering is that it adapts to modification. Cluster analysis is used in a variety of applications like data analysis, pattern recognition, image processing etc. Clustering is also referred to as data segmentation, since clustering divides huge sets of data into groups. The requirements of clustering include scalability, high dimensionality, interpretability and usability.

Outlier Evaluation
Outliers refer to those data objects that do not satisfy the generic behavior or structure of the data. These objects do not match the other available objects and are treated as noise or exceptions by most data mining methods. Some applications like fraud detection, however, require the mining of outliers, because the rarely occurring events are the ones of interest.

1.3 INTERESTINGNESS PATTERNS

Q. Explain the various interestingness measures.

Answer :

The measures include,
1. Support
2. Confidence (certainty)
3. Novelty.

A threshold value is set for these measures. Association rules below the threshold are considered uninteresting.

1.4 CLASSIFICATION OF DATA MINING SYSTEMS

Data mining is considered an intersection of several disciplines, with approaches drawn from neural networks, statistical techniques, computer science etc. Because of the different varieties of mining systems, it is necessary to classify them. The classification of data mining systems is done based on,
1. Variety of databases extracted
2. Variety of knowledge extracted
3. Variety of techniques used
4. Variety of application adapted.

1. Variety of Databases Extracted
The classification of mining systems is done based on the databases that are extracted or mined. This classification depends on various standards. If the data model is used as the standard, then the system may be a relational, object-relational or transactional data mining system. If the type of data is used as the standard, then the system may be a spatial, time-series or text data mining system. These standards need separate methods of data mining.

2. Variety of Knowledge Extracted
A data mining system can also be categorized as one that extracts data regularities or one that extracts data irregularities.
3. Variety of Techniques Used
The classification can be done based on the kinds of methods employed. The description of these methods is done according to,
(i) The user interaction involved, i.e., interactive exploratory systems, query-driven systems etc.
(ii) The data analysis techniques used, such as data visualization, pattern recognition, neural networks etc.

4. Variety of Application Adapted
The classification of data mining systems also depends on the applications they adapt to. A single data mining system may not fit all domains, because some applications require application-specific techniques.

1.5 DATA MINING TASK PRIMITIVES

Q. List and describe the primitives for specifying a data mining task.

Answer :

Primitives for Specifying Data Mining Task
Each task is specified using a data mining query, which is the input to the data mining system. A data mining query is defined in terms of task primitives. The primitives allow the user to communicate with the data mining system and examine the result from different perspectives. The primitives include,
(i) Task-relevant data
(ii) Kind of knowledge to be mined
(iii) Concept hierarchies.

Q. Write the syntax for the following data mining primitives,
(a) Task-relevant data
(b) Concept hierarchy.

Answer :

Task-relevant Data Primitive
Example:
... where sal > 15,000 and sal < 20,000
order by ... ASC
group by emp_id

Concept Hierarchies
A concept hierarchy provides useful background knowledge for mining as well as for displaying results. As the name specifies, it describes data at concept level. The specification of a concept hierarchy is as follows.

Schema Hierarchy
At schema level, concept hierarchies can be defined by stating the attribute relationship.
Syntax:
define hierarchy ... for ...
Example: (street, city) < (district, country)
An address can be shown as in the figure below.

[Figure (2): Concept Hierarchy for Attribute Address]

Set Grouping Hierarchy
Some hierarchical information is specified by concept grouping, which explicitly shows that one group of concepts is at a lower level than another.
Syntax:
define hierarchy ...

[Figure (3)]

Modification of Concept Hierarchies
The user can insert a subordinate concept in the hierarchy or delete a subordinate concept.
Syntax:
insert <concept> under <concept> [for hierarchy <hier_name>]
delete <concept> under <concept> from hierarchy <hier_name>

Q28. Write the syntax for the following data mining primitives,
(a) The kind of knowledge to be mined
(b) Measures of pattern interestingness.

Answer :

Syntax for Kind of Knowledge to be Mined
The kind of knowledge to be mined is based on different data mining functions. Some of them are,
(i) Characterization
(ii) Discrimination
(iii) Association
(iv) Classification
(v) Prediction.

It can be seen below that the syntax of all the data mining functions has one common statement, M_K_S (Mine Knowledge Specification). This statement specifies the kind of knowledge to be mined.

Syntax for Characterization
Characterization is specified with the statement given below,
M_K_S ::= mine characteristics [as pt_name] analyze measure(s)
The characterization syntax specifies the mining of characteristic descriptions. Here, the "analyze" clause specifies aggregate measures like sum, count etc.
Example:
mine characteristics as ... analyze count%
This example provides the characteristic description of the buying habits of a vendor, together with the percentage of task-relevant tuples that satisfy the characteristic.

Syntax for Discrimination
The syntax for this data mining function is given by the following statement,
M_K_S ::= mine comparison [as pt_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
where the target and contrast classes are data collection descriptions. The discrimination syntax specifies the comparison of the target class with different contrasting classes.
Example:
mine comparison as ... for impulsive customers where ... versus ... where ... analyze ...

Syntax for Association
DMQL specifies association with the statement,
M_K_S ::= mine associations [as pt_name]
Example: A mined rule may indicate that customers whose income is between 25 and 45 thousand are the most likely buyers of a given item.

Syntax for Classification
To specify classification,
M_K_S ::= mine classification [as pt_name] analyze classifying_attribute_or_dimension
The classification syntax specifies the mining of patterns for classification on the basis of the values of the classifying attribute or dimension.
Example:
mine classification as classifyCustomerDebitRating analyze debit_info
Here, debit_info is an attribute which determines the debit rating of the customer.

Syntax for Prediction
Prediction is specified by,
M_K_S ::= mine prediction [as pt_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}
The prediction syntax specifies the mining of the missing or unknown data values of the attribute or dimension in the "analyze" clause. The "set" clause is used to fix the data values that are specified for any attribute or dimension.
Example:
mine prediction as ... analyze cost set product = "Handycam" and brand = "SONY"
This example predicts the cost of the handycam. The set clause specifies that the predictive model regarding the cost is built on the basis of task-relevant data associated with the SONY brand.

Syntax for Measures of Pattern Interestingness
Support and confidence are the commonly used measures of pattern interestingness. Both the support and the confidence measure have a threshold.

Syntax for Support
Support is specified by the following statement,
with support threshold = threshold_value
Examples:
1. with support threshold = 0.03
2. with support threshold = 0.04
3. with support threshold = 0.05

Syntax for Confidence
Confidence is defined by the same statement given above. A few examples are,
1. with confidence threshold = 0.5
2. with confidence threshold = 0.1
3. with confidence threshold = ...
Q29. Discuss briefly about data mining query languages and list out the advantages and drawbacks of using query language in data mining.

Answer :

Data Mining Query Languages
A data mining query language is similar to a database query language. It is used to retrieve useful information and represent it as an understandable pattern. A data mining query language enables flexible interaction with data mining systems by providing a user-friendly interface.

Some of the most widely used data mining query languages are,
(i) DMQL (Data Mining Query Language)
(ii) Microsoft OLE DB (Object Linking and Embedding Database) for Data Mining
(iii) PMML (Predictive Model Markup Language)
(iv) CRISP-DM (CRoss Industry Standard Process for Data Mining).

DMQL is the easiest query language, as it uses SQL-like syntax. DMQL was designed at Simon Fraser University.
Example:
use database ...
mine associations [as ...]
...
group by student_ID

Advantages of Data Mining Query Language
The advantages are as follows,
1. Generalized sets of rules can be generated using SQL queries.

Drawbacks of Data Mining Query Language
The drawbacks of a data mining query language are,
1. It does not provide much support for the operations associated with pre-processing and post-processing techniques.
2. It is individual and ad hoc in nature.

1.6 INTEGRATION OF DATA MINING SYSTEM WITH DATABASE OR DATA WAREHOUSE SYSTEM

(a) No Coupling
In this approach, the data is initially fetched from a file, the data mining algorithms are applied to it, and the results are stored in another file.

Disadvantages
The disadvantages of no coupling are as follows,
1. It is difficult to search for task-relevant and high quality data, since the data in the DM system is not organized.
2. A huge amount of time is consumed in collecting, cleaning and transforming data, since it is not possible to apply any flexible or scalable algorithm.
3. DM systems cannot be integrated into an information processing environment.

(b) Loose Coupling
In this approach, some of the functionalities of the database or data warehouse system are used to fetch data for data mining. The retrieved data is held in main memory and is analyzed by applying data mining algorithms. The results generated are then stored either in a file or in a specific location in a database or data warehouse. The loose coupling approach is better than no coupling, because the data managed by the DB or DW system can be accessed through its facilities. However, it fails to achieve high scalability.

(c) Semi-tight Coupling
Besides linking the DM system with a database or data warehouse system, this coupling approach provides efficient implementations of certain data mining primitives. These mining primitives include sorting, indexing and the precomputation of some statistical measures (sum, min, max, count, standard deviation). Some intermediate mining results are also efficiently computed and stored.

(d) Tight Coupling
In this approach, the data mining system is integrated with the DB or DW system, and data mining is done using techniques of that system like query processing and different indexing schemes.

Advantages
The advantages of tight coupling are as follows,
1. Data mining functions can be implemented efficiently.
2. The performance of data mining is enhanced by the database or data warehouse system.
An integrated information processing environment is created.

The factors that make the data incomplete include non-availability of attributes when required, exclusion of relevant data, and data lost due to inconsistencies and modifications. Therefore, to avoid these types of errors, data cleaning is employed.

Timeliness
If the updating of data is not done in a timely manner, it affects the quality of data. For instance, consider marketing executives whose incentives are decided at the end of the month. If their data is not submitted by the end of the month, it is treated as incorrect.

Believability and Interpretability
Believability is the part of data which users consider as trusted data, and interpretability is a factor which depends on the user's understanding of the data. For example, if an erroneous database is corrected and updated with new codes, users could still consider it as erroneous.

The issues to be considered during data integration are,
(i) Object matching and schema integration
(ii) Redundancy and inconsistency
(iii) Identifying and resolving the conflicts between data values.

(i) Object Matching and Schema Integration
Object matching is a process used to identify and match up the objects that possess different names in different databases. It is also known as the object identification issue. In schema integration, errors can be avoided by using metadata, which can also be used to perform data transformation.

(ii) Redundancy and Inconsistency
Redundancy refers to unnecessary repetition of data. If an attribute is extracted from another attribute set, then the attribute may become redundant. Redundancy is also caused by inconsistencies in the name and size of attributes. The other reason for redundancy is when the table is not normalized. Inconsistency occurs due to duplication, inaccuracy and updation. It is necessary to detect and delete the duplicate attribute values at tuple level in order to obtain consistent data. Correlation analysis can be used to detect redundancy by measuring the level of dependency (correlation) between attributes.

(iii) Identifying and Resolving the Conflicts between Data Values
Conflicts may also occur when the same objects collected from different sources have different attribute values. This happens because of differences in internal representation. For example, a cost attribute may be stored in one currency in one source and in dollar form in another. Conflicts also occur because an attribute may be at a more detailed level in one source while the same attribute is at a more general level in another.

Q. What are the various data pre-processing techniques? How does data reduction help in data pre-processing?
OR
Write a note on subset selection in attributes for data reduction. (Refer only topic: Selecting Subset of Attributes)
Answer :
Data Preprocessing
A database consists of a massive volume of data which is collected from heterogeneous sources. Due to this heterogeneity, real world data tends to be inconsistent and noisy. If data is of low quality, the results of mining are also of low quality. Data preprocessing improves the quality of data, which in turn enhances the quality of the mining process. Several techniques are used for performing data preprocessing.

Data Reduction
The objective of data reduction is to compress the massive volume of data into a limited size without losing data integrity. The ways of performing data reduction are,
(a) Reducing data dimensionality
(b) Reducing data numerosity.

Aggregation of Data Cube
This is a technique in which aggregate functions are applied to the data.

Generalization of Data
Data values that are present at a lower conceptual level are substituted by data values at a higher conceptual level. This is generally used while extracting data present at different granularity levels.

Normalization of Data
Normalization is a process of scaling the attribute values so that they fall within a specified, smaller range. Normalization is used in,
(a) Neural network classification algorithms such as back propagation, where it enhances the speed of learning.
(b) Distance-based methods such as k-nearest neighbour classification, where it prevents the larger range attribute values from outweighing the smaller range attribute values.
The different data normalization techniques are,
1. Min-max normalization
2. Z-score (zero-mean) normalization
3. Normalization using decimal scaling.

Construction of Attributes
In this technique, new attributes are derived from the existing attributes.
These new attributes provide information regarding the correlation that exists between the attributes. This correlation is important while performing knowledge discovery.

Q34. Explain concept hierarchy generation for the nominal data.
Answer :
Generation of Concept Hierarchy for Nominal Data
Nominal data is the basic discrete form, i.e., nominal attributes have a finite number of independent values that do not have any order. The different techniques used for generating a concept hierarchy for nominal data are,
(i) Specifying attribute's partial ordering directly at schema level
(ii) Specifying a portion of the concept hierarchy by performing data grouping
(iii) Specifying an attribute set instead of the attributes' partial ordering
(iv) Specifying only a partial set of attributes.

(i) Specifying Attribute's Partial Ordering Directly at Schema Level
Concept hierarchies consist of sets of attributes. By specifying either a total or a partial ordering of attributes at schema level, a user or domain expert can define the hierarchy without any difficulty. For example, the attributes associated with the time dimension are week, month, quarter, half year and year. The attributes can be represented in the following way,
week < month < quarter < half year < year

1. Smoothing Noisy Data using Binning Method
(i) Smoothing by Calculating the Mean of Bucket
The sorted data is divided into buckets of equivalent frequency.
Data : 2, 3, 5, 6, 8, 9, 9, 10, 12, 15, 17, 20
Bucket A : 2, 3, 5, 6
Bucket B : 8, 9, 9, 10
Bucket C : 12, 15, 17, 20
The averages of buckets A, B, C are 4, 9, 16 respectively. The existing values in the buckets are substituted by these averages in the following way,
Bucket A : 4, 4, 4, 4
Bucket B : 9, 9, 9, 9
Bucket C : 16, 16, 16, 16

(ii) Smoothing by Calculating the Median of Bucket
This method is similar to the above method except that the median is calculated for individual buckets. The medians of the above list for buckets A, B, C are 4, 9, 16 respectively, which are substituted in the following way,
Bucket A : 4, 4, 4, 4
Bucket B : 9, 9, 9, 9
Bucket C : 16, 16, 16, 16

(iii) Smoothing by Considering the Boundaries of Bucket
In this method, the highest and lowest values in individual buckets are considered. These are treated as the boundaries of a bucket. Every value in a bucket is substituted by the boundary value that is nearest to it. The highest and lowest values remain unchanged. The lowest and highest values of buckets A, B, C are 2 and 6, 8 and 10, and 12 and 20 respectively. The values in the buckets are substituted in the following way,
Bucket A : 2, 2, 6, 6
Bucket B : 8, 8, 8, 10
Bucket C : 12, 12, 20, 20
In bucket B, the value 9 is equally distant from both boundaries and is assigned here to the lower boundary.

2. Smoothing Noisy Data using Regression Method
Regression smooths data by fitting it to a function.
(i) Simple Linear Regression Function
This function is responsible for determining the line that is most suitable to model the correlation between two attribute values. This is done in such a way that one attribute can be used to estimate the value of the other attribute.
(ii) Multiple Linear Regression Function
This function is an enhancement of the simple linear regression function. In this, at least two attribute values are considered and the data associated with this function is fit to a multidimensional surface.

3. Smoothing Noisy Data using Clustering Method
Clustering is a method of combining objects into classes of similar objects. A cluster is defined as a collection of data objects that have high similarity with objects belonging to the same cluster and are dissimilar to objects belonging to other clusters. Clustering can be used to detect and remove the outliers, which are the data values that exhibit a dissimilar nature in a particular cluster.

Q38. What are the steps for performing data cleaning as a process?
Answer :
The steps for performing data cleaning as a process are,
1. Identifying discrepancies
2. Transformation of data.

1. Identifying Discrepancies
The reasons for the occurrence of discrepancies are,
(i) Errors made manually while entering data.
(ii) Errors made intentionally due to the disinterest of users.
(iii) Errors due to data degradation.
(iv) Errors because of inconsistency while representing data or while using inconsistent codes.
(v) Errors due to utilization of data for purposes apart from the one intended.
(vi) Errors due to inconsistency while performing data integration.
Discrepancies can be identified with the help of metadata, i.e., information about the data such as data types, domain and range.

2. Transformation of Data
New discrepancies may be introduced after performing the transformations. The tools used for transformation are,
Data Migration Tools
These tools allow simple transformations to be specified, such as replacing the string "code" with "zip code".
ETL Tools (Extraction/Transformation/Loading)
These tools enable users to specify transformations via a GUI.
The major disadvantage of this data cleaning process is that much interactivity is not provided. New approaches to the cleaning process, such as Potter's Wheel and declarative languages, are being developed that focus on increasing the interactivity.

(i) Discrete Wavelet Transformation (DWT)
DWT is a technique of transforming a data vector X into a numerically different vector Y of wavelet coefficients. The length of both vectors is the same. Although the transformed data vector Y has the same length as X, DWT can still be used to perform data reduction. This is done by truncating the wavelet-transformed data. Data reduction can be performed by storing only a small fraction of the strongest wavelet coefficients, i.e., those above a threshold value specified by the user. All the remaining coefficients are set to 0.
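The truncation step described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Haar wavelet computed with pairwise averages and half-differences (the text does not name a particular wavelet), the function names haar_dwt, truncate and haar_idwt are hypothetical, and the sample vector and threshold are made up for the example.

```python
def haar_dwt(x):
    """Haar transform of x (length must be a power of two).

    Each pass halves the data: pairwise averages carry the smoothed
    signal forward, pairwise half-differences become coefficients.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in x]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Coarser details go in front of the finer ones found earlier.
        details = [(a - b) / 2 for a, b in pairs] + details
        approx = [(a + b) / 2 for a, b in pairs]
    return approx + details


def truncate(coeffs, threshold):
    """Keep coefficients whose magnitude reaches the threshold; zero the rest."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


def haar_idwt(coeffs):
    """Inverse of haar_dwt: rebuild the data from the coefficients."""
    approx = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        rebuilt = []
        for s, d in zip(approx, detail):
            rebuilt.extend((s + d, s - d))
        approx = rebuilt
    return approx


data = [2, 2, 0, 2, 3, 5, 4, 4]            # illustrative values
coeffs = haar_dwt(data)                    # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
kept = truncate(coeffs, threshold=1.0)     # [2.75, -1.25, 0.0, 0.0, 0.0, -1.0, -1.0, 0.0]
approximation = haar_idwt(kept)            # [1.5, 1.5, 0.5, 2.5, 3.0, 5.0, 4.0, 4.0]
```

With this threshold only four of the eight coefficients need to be stored, and the inverse transform still returns a vector close to the original data. A real application would normally rely on a tested wavelet library rather than hand-written code.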
Since most of the coefficients in the resulting data set are zero, it is referred to as sparse data. Operations on sparse data can be performed with greater speed when they are invoked in wavelet space. It is also possible to obtain an approximation of the actual data from the stored coefficients. This is done by performing the inverse transformation. DWT is similar to DFT (Discrete Fourier Transform), but the former is more capable of attaining a good approximation of the actual data. The other differences between the two transformation approaches are,
(i) For an equivalent approximation of the data, DWT occupies less space than DFT.
(ii) DFT is incapable of localization, because of which it cannot hold the local detail of data. On the other hand, DWT can hold local details.
(iii) There are numerous DWTs whereas there is only one DFT.

DWT can be applied by using the hierarchical pyramid algorithm. The principal theme of this algorithm is that it divides the data into halves at each consecutive step. This division leads to faster processing speed. The data values obtained from these iterations are marked as wavelet coefficients.

Wavelet coefficients can also be obtained by performing a matrix multiplication on the input data, where the matrix depends on the given discrete wavelet transform. The matrix must be orthonormal, i.e., the columns of the matrix must be unit vectors and must be mutually orthogonal. The inverse is then obtained by just transposing the matrix. The matrix can be factored into a few sparse matrices to obtain a fast DWT algorithm with complexity O(n).

Advantages of Wavelet Transformation
1. It is capable of handling sparse data as well as data with ordered attributes.

(ii) Principal Component Analysis (PCA)
PCA is an alternative method for performing dimensionality reduction. It is a method of searching for orthogonal data vectors, which are used for data representation. The search is made in such a way that the number of selected vectors is less than or equal to the number of attributes of the data vectors.
This results in reducing the dimensionality of data, because the whole attribute range of the data vectors is represented on a space of limited size, i.e., less than the space occupied by the entire data vectors.

Let us consider a data set of vectors having k dimensions. We search for n orthogonal vectors of k dimensions such that n ≤ k. This approach is different from attribute subset selection because PCA merges the characteristics of the attributes by developing a different set of variables of smaller size. The attribute subset selection approach is not capable of disclosing the correlation that exists between the attributes, because of which it is not possible to easily understand the resulting data. By using the PCA approach, these difficulties are solved, as it is capable of revealing correlations, which allows the results to be interpreted without any difficulty.

When PCA is performed, the input data is first normalized so that attributes with large domains do not dominate. Orthonormal vectors are then evaluated on the normalized input. These vectors are referred to as principal components, and each of them is perpendicular to the other vectors.

Advantages of Principal Component Analysis
1. PCA handles sparse data better than wavelet transformation.

Q40. Describe the feature subset selection.
OR
Discuss attribute subset selection for dimensionality reduction.
Answer :
Dimensions can also be reduced through feature subset selection. (Refer topic: Selecting Subset of Attributes)

Q41. Write short notes on data discretization.
Answer :
Data discretization replaces continuous values with intervals. The advantage of discretization is a concise representation of mining results.
There are different methods which are used for performing data discretization. These methods rely on the information associated with the way of performing discretization. This information indicates either whether class information is used or the direction in which discretization proceeds. The methods used for data discretization are,
(a) Supervised discretization
(b) Unsupervised discretization.

(i) Supervised Discretization
If the data is discretized by using class information, it is referred to as supervised or organized discretization.

(ii) Unsupervised Discretization
If data values are reduced by substituting them with interval labels but without using class information, then it is referred to as unsupervised discretization.

There is another method which can be used for performing discretization. In this method, discretization is carried out repeatedly on an individual numerical attribute in order to construct a hierarchy of attribute values, referred to as a concept hierarchy. The principal purpose of a concept hierarchy is to extract data present at different granularity levels. It is also used to simplify the data and to substitute detailed level or specialized concepts with more general level concepts.

Q42. Explain the following with examples,
(a) Aggregation
(b) Dimensionality reduction
(c) Feature subset selection.
Answer :
(a) Aggregation
For answer, refer to the discussion of aggregation under data preprocessing in this unit.
(b) Dimensionality Reduction
For answer, refer to the discussion of dimensionality reduction (DWT and PCA) in this unit.
(c) Feature Subset Selection
For answer, refer to Q40 in this unit.

Q43. What steps would you follow to identify a fraud for a credit card company?
Answer :
Credit Card Fraud
Credit card fraud refers to the unauthorized use of a credit card for performing purchases and removing funds. The various steps to identify fraud for a credit card company are as follows,

Check the Bank Statements
The user should frequently check the bank statements. In case of any unauthorized transaction, the user should be alert.

Set up Alert from Bank
The user should set up an alert from the bank that is sent whenever a transaction is made.

Check the Credit Reports
The user should check the credit reports frequently, so that the user will get to know if any unknown person has applied for credit cards in the user's name.

Alert When the Card is Declined
The user should be careful when the card is declined.
