SYLLABUS
Data Analytics [CS853PE/EM8310E]

UNIT - I
Data Management : Design Data Architecture and manage the data for analysis, understand various sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality (noise, outliers, missing values, duplicate data) and Data Pre-processing and Processing. (Chapter - 1)

UNIT - II
Data Analytics : Introduction to Analytics, Introduction to Tools and Environment, Application of Modeling in Business, Databases and Types of Data and Variables, Data Modeling Techniques, Missing Imputations etc. Need for Business Modeling. (Chapter - 2)

UNIT - III
Regression - Concepts, Blue property assumptions, Least Square Estimation, Variable Rationalization and Model Building etc. Logistic Regression : Model Theory, Model fit Statistics, Model Construction, Analytics applications to various Business Domains etc. (Chapter - 3)

UNIT - IV
Object Segmentation : Regression vs Segmentation - Supervised and Unsupervised Learning, Tree Building - Regression, Classification, Overfitting, Pruning and Complexity, Multiple Decision Trees etc. Time Series Methods : ARIMA, Measures of Forecast Accuracy, STL approach, Extract features from generated model as Height, Average Energy etc. and Analyze for prediction. (Chapter - 4)

UNIT - V
Data Visualization : Pixel-Oriented Visualization Techniques, Geometric Projection Visualization Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques, Visualizing Complex Data and Relations. (Chapter - 5)

TABLE OF CONTENTS

Chapter - 1  Data Management  (1-1) to (1-14)
1.1 Design Data Architecture ... 1-1
1.2 Various Sources of Data ... 1-4
1.3 Data Quality ... 1-6
Fill in the Blanks with Answers for Mid Term Exam ... 1-11
Multiple Choice Questions with Answers for Mid Term Exam ... 1-11

Chapter - 2  Introduction to Data Analytics  (2-1) to (2-10)
2.1 Introduction to Analytics ... 2-1
2.2 Introduction to Tools and Environment ... 2-6
2.3 Application of Modelling in Business ... 2-6
2.4 Databases and Types of Data and Variables ... 2-7
2.5 Data Modelling Techniques ... 2-8
Fill in the Blanks with Answers for Mid Term Exam ... 2-10
Multiple Choice Questions with Answers for Mid Term Exam ... 2-10

Chapter - 4  Object Segmentation  (4-1) to (4-17)
4.1 Supervised and Unsupervised Learning ... 4-1
4.2 Tree Building, Pruning and Complexity, Multiple Decision Trees etc. ... 4-4
4.3 Time Series Methods ... 4-12
Fill in the Blanks with Answers for Mid Term Exam ... 4-16
Multiple Choice Questions with Answers for Mid Term Exam ... 4-16

Chapter - 5  Data Visualization  (5-1) to (5-10)
5.1 Introduction to Data Visualization ... 5-1
5.2 Icon-Based Visualization Techniques, Hierarchical Visualization Techniques ... 5-5
5.3 Visualizing Complex Data and Relations ... 5-6
Fill in the Blanks with Answers for Mid Term Exam ... 5-9
Multiple Choice Questions with Answers for Mid Term Exam ... 5-10

Solved Model Question Paper  (M-1) to (M-2)

UNIT - I
DATA MANAGEMENT

1.1 : Design Data Architecture

Q.1 What is data architecture ?
Ans. : Data architecture is a set of rules, policies, standards and models that govern and define the type of data collected and how it is used, stored, managed and integrated within an organization and its database systems. It provides a formal approach to creating and managing the flow of data and how it is processed across an organization's IT systems and applications.

Q.2 What is data ?
Ans. : Data is a collection of data objects and their attributes.

Q.3 Explain the difference between data and information.
Ans. :
+ Data is raw; information is processed data.
+ Data is not specific; information is specific.
+ Data does not depend on information; information depends on data.
+ Data is input to the computer; information is output of the computer.
+ Data refers to facts, measurements, characteristics or traits of an item of interest; information refers to the knowledge of value obtained through the collection, interpretation and analysis of data.

Q.4 What is an attribute ?
Ans. : An attribute is a property or characteristic of an object. Examples : eye colour of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, dimension or feature.

Q.5 What is an object ?
Ans. : A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity or instance.

Q.6 List the different types of attributes.
Ans. : Different types of attributes are as follows :
1. Nominal : Examples - ID numbers, eye color, zip codes.
2. Ordinal : Examples - rankings, grades, height.
3. Interval : Examples - temperatures in Celsius or Fahrenheit, calendar dates.
4. Ratio : Examples - temperature in Kelvin, length, time, counts.

Q.7 Define an attribute. Explain types of attributes.
Ans. : An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.
+ Example : Eye color varies from person to person and the temperature of an object varies over time.
+ Attributes can be categorical or quantitative. Categorical attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, brand, color).
+ Categorical attributes are also called nominal attributes, because their values are "names of things". Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, price).
+ Techniques for mining multidimensional association rules can be categorized into two basic approaches regarding the treatment of quantitative attributes.

Types of Attributes :
1. Nominal : Nominal means "relating to names". The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code or state, and so nominal attributes are also referred to as categorical. The values do not have any meaningful order. Examples : ID numbers, eye color, zip codes.
2. Binary Attributes : A binary attribute is a nominal attribute with only two categories or states : 0 or 1.
Binary attributes are referred to as Boolean if the two states correspond to true and false. Examples : gender, outcome of a medical test (positive, negative).
3. Ordinal Attributes : An ordinal attribute is an attribute whose possible values have a meaningful order or ranking among them, but the magnitude between successive values is not known. Examples : rankings (e.g., taste of potato chips on a scale from 1 - 10), grades, height in (tall, medium, short).
4. Numeric Attributes : A numeric attribute is quantitative, that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
+ Interval-scaled attributes are measured on a scale of equal-sized units. The values of interval-scaled attributes have order and can be positive, 0 or negative. Temperature is an interval-scaled attribute.
+ A ratio-scaled attribute is a numeric attribute with an inherent zero-point.

Q.8 List the types of data sets.
Ans. : Types of data sets are as follows :
1. Record : Data Matrix, Document Data and Transaction Data.
2. Graph : World Wide Web and Molecular Structures.
3. Ordered : Spatial Data, Temporal Data, Sequential Data, Genetic Sequence Data.

Q.9 Define various data types.
Ans. : Types of data are record data, data matrix, document data, transaction data, graph data and ordered data.

Q.10 What is continuous data ?
Ans. : Continuous data take values in an interval of numbers. These are also known as scale data, interval data or measurement data. Examples include : height, weight, length, time, etc. Continuous data are often characterized by fractions or decimals : 3.82, 7.0001.

Q.11 What are the constraints and influences that have effects on data architecture design ?
Ans. : Various constraints and influences will have an effect on data architecture design. These include enterprise requirements, technology drivers, economics, business policies and data processing needs.

Q.12 Define discrete data.
Ans. :
Discrete data take values in a finite or countably infinite set of numbers, that is, all possible values could be written down in an ordered list. Examples include : counts, number of arrivals, or number of successes. They are often represented by integers, say, 0, 1, 2, etc.

Q.13 Compare discrete versus continuous attributes.
Ans. :
+ Discrete attributes have a finite or countably infinite set of values; continuous attributes are ones whose values are real numbers.
+ Examples of discrete attributes : zip codes, counts of the set of words in a collection of documents, binary data. Examples of continuous attributes : temperature, height or weight.
+ Discrete attributes are often represented as integer variables; continuous attributes are typically represented as floating-point variables.
+ Discrete data is countable; continuous data is measurable.
+ Qualitative attributes are always discrete; quantitative attributes can be either discrete or continuous.
+ A bar graph is used to present discrete data; a histogram is used to present continuous data.

TECHNICAL PUBLICATIONS® - an up-thrust for knowledge

Q.14 What is a data frame ?
Ans. : A data frame is two-dimensional and different columns may contain different data types, though all values within a column must be of the same data type and all columns must have the same length.
+ A data frame is an object with rows and columns. The rows contain different observations from your study, or measurements from your experiment.
+ The columns contain the values of different variables. All the values of the same variable must go in the same column.
+ A data frame is used for storing data tables. It is a list of vectors of equal length.
+ A data frame is a list of variables of the same number of rows with unique row names, given class "data.frame". If no variables are included, the row names determine the number of rows.
+ The column names should be non-empty, and attempts to use empty names have unsupported results.
+ Duplicate column names are allowed, but you need to use check.names = FALSE for data.frame to generate such a data frame.
+ However, not all operations on data frames will preserve duplicated column names : for example, matrix-like subsetting will force column names in the result to be unique.
+ Example :

    d <- c(1, 2, 3, 4)
    e <- c("red", "white", "red", NA)
    f <- c(TRUE, TRUE, TRUE, FALSE)
    mydata <- data.frame(d, e, f)
    names(mydata) <- c("ID", "Color", "Passed")   # variable names

Q.15 What are structure arrays ? Give examples.
Ans. : A structure array is a particular instance of a structure, where each of the fields of the structure is represented by a cell array. Each of these cell arrays has the same dimensions. Conceptually, a structure array can also be seen as an array of structures with identical fields.

Q.16 Explain vectors, factors, matrices, data frames and lists.
Ans. :
Vectors : A collection of values that all have the same data type. The elements of a vector are all numbers, giving a numeric vector, or all character values, giving a character vector. A vector can be used to represent a single variable in a data set.
Factors : A collection of values that all come from a fixed set of possible values. A factor is similar to a vector, except that the values within a factor are limited to a fixed set of possible values. A factor can be used to represent a categorical variable in a data set.
Matrices : A two-dimensional collection of values that all have the same type. The values are arranged in rows and columns. There is also an array data structure that extends this idea to more than two dimensions.
Data frames : A collection of vectors that all have the same length. This is like a matrix, except that each column can contain a different data type. A data frame can be used to represent an entire data set.
Lists : A collection of data structures.
The components of a list can be simply vectors - similar to a data frame, but with each column allowed to have a different length. However, a list can also be a much more complicated structure. This is a very flexible data structure. Lists can be used to store any combination of data values together.

1.2 : Various Sources of Data

Q.17 What is data management ?
Ans. : Data management is the development and execution of architectures, policies, practices and procedures in order to manage the information lifecycle needs of an enterprise in an effective manner.

Q.18 What is sensor data ?
Ans. : Sensor data is the output of a device that detects and responds to some type of input from the physical environment. The output may be used to provide information or input to another system or to guide a process.

Q.19 Why is sampling used ?
Ans. : Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Q.20 What is the purpose of dimensionality reduction ?
Ans. : Purpose :
1. Avoid the curse of dimensionality.
2. Reduce the amount of time and memory required by data mining algorithms.
3. Allow data to be more easily visualized.
4. May help to eliminate irrelevant features or reduce noise.

Q.21 What is dimensionality reduction ?
Ans. : In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless.

Q.22 Define discretization.
Ans. : Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Discretization techniques can be categorized based on how the discretization is performed, such as whether it uses class information or in which direction it proceeds (i.e., top-down vs. bottom-up).

Q.23 Discuss subset selection.
Ans. :
+ Finding the best subset of the set of features is the main aim of subset selection. The best subset contains the least number of dimensions that most contribute to accuracy. Subset selection is of two types : forward and backward selection.

Q.24 Write a note on subset selection in attributes for data reduction.
Ans. : Finding the best subset of the set of features is the main aim of subset selection. The best subset contains the least number of dimensions that most contribute to accuracy.
+ Using a suitable error function, this can be used in both regression and classification problems. There are 2^d possible subsets of d variables, but we cannot test all of them unless d is small, so we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.
+ Subset selection is of two types : forward and backward selection.
1. Forward selection : It starts without variables and adds them one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error.
2. Backward selection : It starts with all variables and removes them one by one, at each step removing the one that decreases the error the most, until any further removal increases the error significantly.
+ Sequential forward selection (SFS) : SFS is the simplest greedy search algorithm. It starts from the empty set and sequentially adds features. SFS performs best when the optimal subset is small.
+ The main disadvantage of SFS is that it is unable to remove features that become obsolete after the addition of other features.
+ Sequential backward selection (SBS) : It works in the opposite direction of SFS. Starting from the full set, it sequentially removes the feature that least reduces the value of the objective function.
+ Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes.
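The greedy forward-selection loop described above can be sketched in a few lines. This is an illustrative sketch only - the book gives no code here - and the score function below is an invented toy stand-in for a real model-accuracy measure evaluated on held-out data.

```python
def forward_selection(features, score):
    """Sequential forward selection (SFS): start from the empty set and
    repeatedly add the single feature that improves the score the most,
    stopping when no addition gives a strict improvement."""
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for f in [x for x in features if x not in selected]:
            s = score(selected + [f])
            if s > best:
                best, best_f = s, f
                improved = True
        if improved:
            selected.append(best_f)
    return selected, best

# Toy "accuracy" function (hypothetical): features 'a' and 'c' help,
# 'b' is useless and 'd' actively hurts.
weights = {"a": 3, "b": 0, "c": 2, "d": -1}
toy_score = lambda subset: 50 + sum(weights[f] for f in subset)

subset, acc = forward_selection(list("abcd"), toy_score)
# subset is ['a', 'c'] and acc is 55: 'b' and 'd' are never added.
```

Backward elimination is the mirror image: start from the full list and drop, at each step, the feature whose removal hurts the score least.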
+ Forward selection : The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
+ Backward elimination : The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

Q.25 Describe feature subset selection.
Ans. :
+ Feature selection is a process that chooses a subset of features from the original features so that the feature space is optimally reduced according to a certain evaluation criterion.
+ Feature selection is a critical step in the feature construction process. In text categorization problems, some words simply do not appear very often.
+ Perhaps the word "groovy" appears in exactly one training document, which is positive. Is it really worth keeping this word around as a feature ? It is a dangerous endeavor because it is hard to tell, with just one training example, if it is really correlated with the positive class or if it is just noise. You could hope that your learning algorithm is smart enough to figure it out, or you could just remove it.
+ There are three general classes of feature selection algorithms : filter methods, wrapper methods and embedded methods.
+ The role of feature selection in machine learning is :
1. To reduce the dimensionality of feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
+ Feature selection algorithms are as follows :
1. Instance based approaches : There is no explicit procedure for feature subset generation. Many small data samples are sampled from the data. Features are weighted according to their roles in differentiating instances of different classes for a data sample. Features with higher weights can be selected.
2.
Nondeterministic approaches : Genetic algorithms and simulated annealing are also used in feature selection.
3. Exhaustive / complete approaches : Branch and Bound evaluates estimated accuracy, and ABB checks an inconsistency measure that is monotonic. Both start with a full feature set until the preset bound cannot be maintained.

Q.26 What is the need of dimensionality reduction ? Explain any two techniques for dimensionality reduction.
Ans. :
+ Most machine learning and data mining techniques may not be effective for high-dimensional data. Query accuracy and efficiency degrade rapidly as the dimension increases.
+ The "dimensionality" simply refers to the number of features (i.e. input variables) in your dataset. When the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models. This is called the "Curse of Dimensionality", and it is especially relevant for clustering algorithms that rely on distance calculations.
+ Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
+ It reduces the time and storage space required. Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model.
+ There are many methods to perform dimension reduction :
1. Missing Values : While exploring data, if we encounter missing values, what do we do ? Our first step should be to identify the reason, then impute missing values or drop variables using appropriate methods. But what if we have too many missing values ? Should we impute missing values or drop the variables ?
2.
Low Variance : Let's think of a scenario where we have a constant variable in our dataset. Such a variable has zero variance and carries no information, so it can be dropped.
3. Decision Trees : They can be used as an ultimate solution to tackle multiple challenges like missing values, outliers and identifying significant variables.
4. Random Forest : Similar to the decision tree is Random Forest.
5. High Correlation : Dimensions exhibiting higher correlation can lower the performance of the model. Moreover, it is not good to have multiple variables with similar information or variation, also known as "multicollinearity".
6. Backward Feature Elimination : In this method, we start with all n dimensions. Compute the sum of squared errors (SSR) after eliminating each variable (n times). Then identify the variable whose removal has produced the smallest increase in the SSR and remove it, finally leaving us with n - 1 input features.

Q.27 List the advantages and disadvantages of ...

+ Does our data have any outliers ?
+ Outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in the data set :
1. They increase the error variance and reduce the power of statistical tests.
2. If the outliers are non-randomly distributed, they can decrease normality.
3. They can bias or influence estimates that may be of substantive interest.
4. They can also impact the basic assumptions of regression, ANOVA and other statistical models.

Q.32 What are outliers ? Explain with an example.
Ans. : A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
+ There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
+ Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers.
+ Rather than using statistical or distance measures, deviation-based methods identify outliers by examining differences in the main characteristics of objects in a group.
+ A potential outlier is any observation that falls beyond 1.5 times the width of the box on either side, that is, any observation less than Q1 - 1.5 x IQR or greater than Q3 + 1.5 x IQR.
+ A suspected outlier is any observation that falls beyond 3 times the width of the box on either side. In R, both potential and suspected outliers (if present) are denoted by open circles; there is no distinction between the two.
+ Outlier detection and analysis are very useful for fraud detection, customized marketing, medical analysis, etc.
+ Computer-based outlier analysis methods typically follow either a statistical distribution-based approach, a distance-based approach, a density-based local outlier detection approach, or a deviation-based approach.

Q.33 For the given attribute marks values : 35, 45, 50, 55, 60, 65, 75, compute mean, median and mode. Also compute the five number summary of the above data.
Ans. :
Mean = (35 + 45 + 50 + 55 + 60 + 65 + 75) / 7 = 385 / 7 = 55
Median = 55 (The median is the middle number. First you arrange the numbers in order from lowest to highest, then you find the middle number by crossing off the numbers until you reach the middle.)
Mode : The mode is the number that is repeated most often, but all the numbers in this list appear only once, so there is no mode.
Five number summary :
Minimum = 35, Maximum = 75, Median = 55
+ Place parentheses around the numbers above and below the median : (35, 45, 50) 55 (60, 65, 75).
+ Find Q1 and Q3.
Q1 can be thought of as a median for the lower half of the data and Q3 can be thought of as a median for the upper half of the data.
Minimum = 35, Q1 = 45, Median = 55, Q3 = 65 and Maximum = 75.

Q.34 Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. Normalize the income value $73,600 to the range [0.0, 1.0] using the min-max normalization method.
Ans. : Given data : min_A = 12000, max_A = 98000, new_min_A = 0.0, new_max_A = 1.0, v = 73600.

v' = ((v - min_A) / (max_A - min_A)) x (new_max_A - new_min_A) + new_min_A
v' = ((73600 - 12000) / (98000 - 12000)) x (1.0 - 0.0) + 0.0
v' = 0.716

Q.35 Define data preprocessing.
Ans. : Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. The aim is to reduce the data size, find the relations between data and normalize them. It is a proven method of resolving such issues. Data preprocessing prepares data for further processing.

Q.36 What is data cleaning ?
Ans. : Data cleaning means removing the inconsistent data or noise and collecting necessary information from a collection of interrelated data.

Q.37 What do you mean by data preprocessing ? Why is it needed ?
Ans. : Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent and/or lacking in certain behaviors or trends, and is likely to contain many errors. The aim is to reduce the data size, find the relations between data and normalize them.
+ Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.
+ Data preprocessing is used in database-driven applications such as customer relationship management and rule-based applications.
+ Steps during preprocessing :
1. Data cleaning : Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or resolving the inconsistencies in the data.
2. Data integration : Data with different representations are put together and conflicts within the data are resolved.
3. Data transformation : Data is normalized, aggregated and generalized.
4. Data reduction : This step aims to present a reduced representation of the data in a data warehouse.
5. Data discretization : Involves the reduction of the number of values of a continuous attribute by dividing the range of the attribute into intervals.

Q.38 Briefly explain : i) Data cleaning ii) Data integration
Ans. :
i) Data cleaning : Raw data may have missing values, noise, outliers and incomplete or inconsistent records. These dirty data will affect the mining procedure and lead to unreliable and poor output; therefore it is important to run some data cleaning routines. Data cleaning is a first step in data preprocessing, used to fill in missing values, smooth noisy data, recognize outliers and correct inconsistencies.
How to handle noisy data in data mining ?
+ The following methods are used for handling noisy data :
1. Binning method : First sort data and partition it into (equi-depth) bins; then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
2. Clustering : Detect and remove outliers.
3. Combined computer and human inspection : Detect suspicious values and check by human.
4. Regression : Smooth by fitting the data into regression functions.
Fig. Q.38.1
ii) Data integration :
+ Data integration is the process of integrating data from multiple sources and probably has a single view over all these sources.
+ These sources may include multiple databases, data cubes, or flat files.
One of the most well-known implementations of data integration is building an enterprise's data warehouse. The benefit of a data warehouse is that it enables a business to perform analyses based on the data in the data warehouse.
+ There are mainly 2 major approaches for data integration :
1. Tight Coupling : In tight coupling, data is combined from different sources into a single physical location through the process of ETL - Extraction, Transformation and Loading.
2. Loose Coupling : In loose coupling, data only remains in the actual source databases. In this approach, an interface is provided that takes a query from the user, transforms it in a way the source database can understand, and then sends the query directly to the source databases to obtain the result.

Q.39 Describe the issues to be considered during data integration.
Ans. :
+ The issues to be considered during data integration are schema integration and object matching.
+ Examples of metadata for each attribute include the name, meaning, data type and range of values permitted for the attribute, and null rules for handling blank, zero or null values.
+ Such metadata can be used to help avoid errors in schema integration. The metadata may also be used to help transform the data.
+ Redundancy is another important issue. An attribute may be redundant if it can be "derived" from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
+ Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.

Q.40 Explain various methods for handling missing data values.
Ans. :
+ Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
+ The various methods for handling the problem of missing values in data tuples are as follows :
1. Ignoring the tuple : This is usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Manually filling in the missing value : This approach is time-consuming and may not be a reasonable task for large data sets with many missing values, especially when the value to be filled in is not easily determined.
3. Using a global constant to fill in the missing value : Replace all missing attribute values by the same constant, such as a label like "Unknown". If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common - that of "Unknown". Hence, although this method is simple, it is not recommended.
4. Using a measure of central tendency for the attribute, such as the mean (for symmetric numeric data), the median (for asymmetric numeric data) or the mode (for nominal data).
5. Using the attribute mean for numeric values or attribute mode for nominal values, for all samples belonging to the same class as the given tuple.
6. Using the most probable value to fill in the missing value : This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.

Q.41 What is duplicate data ?
Ans. : A data set may include data objects that are duplicates, or almost duplicates, of one another. This is a major issue when merging data from heterogeneous sources. An example is the same person with multiple email addresses.
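The same-person-with-multiple-emails situation can be sketched in a few lines. This is an invented illustration, not from the book: the records and the choice of a normalized name as the matching key are assumptions, and real duplicate detection usually needs fuzzier matching than this.

```python
def deduplicate(records):
    """Merge records that share the same normalized name,
    collecting every email address seen for that person."""
    merged = {}
    for rec in records:
        key = rec["name"].strip().lower()   # normalized match key (assumed)
        entry = merged.setdefault(key, {"name": rec["name"], "emails": set()})
        entry["emails"].add(rec["email"])
    return merged

# Hypothetical records: one person appears under two email addresses.
people = [
    {"name": "Asha Rao", "email": "asha@work.example"},
    {"name": "asha rao", "email": "asha@home.example"},
    {"name": "Vikram N", "email": "vikram@example.com"},
]
clean = deduplicate(people)
# clean has 2 entries; "asha rao" carries both email addresses.
```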
Data cleaning is the process of dealing with duplicate data issues.

Q.42 Explain sampling methods for data reduction.
Ans. :
+ Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data.
+ Different methods of sampling are Simple Random Sampling (SRS), Stratified Sampling, Cluster Sampling, Systematic Sampling and Multistage Sampling.
1. Simple random sample without replacement (SRSWOR) of size s : This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
2. Simple random sample with replacement (SRSWR) of size s : This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample : If the tuples in D are grouped into M mutually disjoint "clusters", then an SRS of s clusters can be obtained, where s < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. In a spatial database, we may choose to define clusters geographically based on how closely different areas are located.
4. Stratified sample : If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
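The SRSWOR, SRSWR and stratified schemes above can be sketched with Python's standard random module. This is a toy illustration only - the customer data and the age-group strata below are made up.

```python
import random

def srswor(data, s):
    """SRSWOR: draw s distinct tuples, each equally likely."""
    return random.sample(data, s)

def srswr(data, s):
    """SRSWR: draw s tuples, replacing after each draw,
    so the same tuple may be drawn again."""
    return [random.choice(data) for _ in range(s)]

def stratified(data, stratum_of, per_stratum):
    """Stratified sampling: partition the data into strata, then take an
    SRS from each stratum so even the smallest group is represented."""
    strata = {}
    for rec in data:
        strata.setdefault(stratum_of(rec), []).append(rec)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical customer data: 8 "young" customers and only 2 "senior" ones.
customers = [("c%d" % i, "young" if i < 8 else "senior") for i in range(10)]
s1 = srswor(customers, 4)                         # 4 distinct tuples
s2 = srswr(customers, 4)                          # may repeat tuples
s3 = stratified(customers, lambda rec: rec[1], 2) # 2 from each age group
```

Note how the stratified sample always contains both seniors, even though a plain SRS of size 4 could easily miss the small "senior" stratum - exactly the skewed-data point made above.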
2.43 Explain the terms numerosity reduction, data integration and data transformation.

Ans. : • Numerosity reduction : The data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.

• Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection, and the resolution of semantic heterogeneity contribute toward smooth data integration.

Data transformation
• In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following :

1. Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.

2. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.

3. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.

4. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.

2.44 What is the importance of data to an organization ?

Ans. : • Data is the raw material for business.
• Leading organizations have realized that business data and content are not to be managed separately from the rest of information management.
• CEOs and managers recognize that data-driven decision making
is essential, and that a "data-oriented" mindset can be a competitive advantage.
• That omnipresence of data pushes managers to equip their business with self-service BI tools. Such tools allow employees to extract relevant, important information from their company's immense collection.
• They enable employees to perform sophisticated analyses and glean insights even without a strong technical background.
• BI is certainly not just hype, as it addresses a real need from companies looking to relieve managers from time-consuming, complex data extraction and formatting, allowing them instead to focus on more value-added tasks.

Fill in the Blanks for Mid Term Exam

Q.1 Data cleaning routines work to "clean" the data by filling in ______, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Q.2 The process of converting the integrated data into correct format is called ______.
Q.3 An attribute is a property or characteristic of ______.
Q.4 Subset selection is of two types : ______ and ______ selection.
Q.5 Smoothing works to remove ______ from the data.
Q.6 Data preprocessing is a data mining technique that involves transforming ______ into an understandable format.
Q.7 Data ______ means removing the inconsistent data or noise and collecting necessary information of a collection of interrelated data.

Multiple Choice Questions for Mid Term Exam

Q.1 Data is a collection of data objects and their ______.
[a] attributes  [b] none  [c] information  [d] characteristics

Q.2 A data frame is used for storing data ______.
[a] values  [b] numbers  [c] tables  [d] all of these

Q.3 A data frame is a collection of vectors that all have the ______ length.
[a] same  [b] variable  [c] short  [d] all of these

Q.4 Which of the following methods is NOT used for handling missing values ?
[a] Eliminate data objects  [b] Estimate missing values  [c] Ignore the missing value during analysis  [d] Replace with all error values

Q.5 Data ______ means removing the inconsistent data or noise and collecting necessary information of a collection of interrelated data.
[a] preprocessing  [b] cleaning  [c] transforming  [d] none

Q.6 The process of converting the integrated data into correct format is called ______.
[a] data cleaning  [b] data preprocessing  [c] data transformation  [d] data handling

Answer Key for Fill in the Blanks
Q.1 missing values   Q.2 data transformation   Q.3 an object   Q.4 forward, backward
Q.5 noise   Q.6 raw data   Q.7 cleaning

Answer Key for Multiple Choice Questions
Q.1 (a)   Q.2 (c)   Q.3 (a)   Q.4 (d)   Q.5 (b)   Q.6 (c)
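To close the chapter with a worked example, the min-max normalization described under data transformation in 2.43 (scaling attribute values into a small specified range such as 0.0 to 1.0) can be sketched as follows. The function name and sample ages are illustrative:

```python
# Min-max normalization: rescale attribute values into [new_min, new_max].
# Assumes at least two distinct values, so old_max > old_min.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

ages = [20, 35, 50, 65]
print(min_max_normalize(ages))  # ages rescaled into [0.0, 1.0]
```

The smallest value always maps to `new_min` and the largest to `new_max`; values in between keep their relative spacing.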
