DMBI ASSIGNMENT 3
Ans-1
The term "data mining" is a misnomer, because the goal is the extraction of patterns and
knowledge from large amounts of data, not the extraction (mining) of data itself. It is also
a buzzword and is frequently applied to any form of large-scale data or information
processing (collection, extraction, warehousing, analysis, and statistics) as well as any
application of computer decision support systems, including artificial intelligence (e.g.,
machine learning) and business intelligence. The book Data mining: Practical machine
learning tools and techniques with Java (which covers mostly machine learning material)
was originally to be named just Practical machine learning, and the term data mining was
only added for marketing reasons.
Ans-2
The volume of information we must handle from business transactions, scientific data,
sensor data, pictures, videos, etc. is increasing every day. So we need a system capable of
extracting the essence of the available information and automatically generating reports,
views, or summaries of the data for better decision-making.
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from
the collection.
o Cleaning in the case of missing values.
o Cleaning noisy data, where noise is a random or variance error.
o Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from
multiple sources into a common source (data warehouse).
o Data integration using data migration tools.
o Data integration using data synchronization tools.
o Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided upon and retrieved from the data collection.
o Data selection using neural networks.
o Data selection using decision trees.
o Data selection using Naive Bayes.
o Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming
data into the appropriate form required by the mining procedure.
Data transformation is a two-step process:
o Data mapping: assigning elements from the source to the destination.
o Code generation: creating the actual transformation program.
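The cleaning and transformation steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the sensor readings, the use of None as the missing-value marker, and the outlier threshold of 100 are all hypothetical.

```python
import statistics

# Hypothetical raw sensor readings; None marks a missing value,
# and 999.0 stands in for a noisy outlier.
raw = [12.0, None, 14.5, 999.0, 13.2, 12.8]

# Data cleaning: keep only plausible readings, then impute missing
# or noisy entries with the mean of the valid readings.
valid = [v for v in raw if v is not None and v < 100]
mean = statistics.mean(valid)
cleaned = [v if (v is not None and v < 100) else mean for v in raw]

# Data transformation: min-max normalisation into [0, 1], one common
# "appropriate form" expected by mining procedures.
lo, hi = min(cleaned), max(cleaned)
normalised = [(v - lo) / (hi - lo) for v in cleaned]

print(cleaned)
print(normalised)
```

In a real KDD pipeline the same two stages would typically be done with dedicated tools (e.g. a dataframe library), but the logic is the same: repair or drop bad values first, then rescale into the form the mining step expects.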
Ans-3
o It is a quick process that makes it easy for new users to analyse enormous amounts
of data in a short time.
Ans-4
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase; instead it uses all of the training data during classification.
The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of
new data points, which means that a new data point is assigned a value based on how
closely it matches the points in the training set. We can understand its working with the
help of the following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of
KNN, we must load the training as well as test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to
consider. K can be any integer.
Step 3 − For each point in the test data, do the following −
3.1 − Calculate the distance between the test point and each row of the training data
using any of the methods, namely Euclidean, Manhattan, or Hamming distance. The
most commonly used method is Euclidean distance.
3.2 − Now, based on the distance values, sort the training rows in ascending order.
3.3 − Next, choose the top K rows from the sorted array.
3.4 − Now, assign a class to the test point based on the most frequent class among
these rows.
Step 4 − End
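The steps above can be sketched as a small from-scratch classifier in Python. The function name, the toy dataset, and the choice of Euclidean distance are illustrative assumptions, not part of the assignment text.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, test_point, k=3):
    """Classify test_point by majority vote among its k nearest
    training points, mirroring Steps 1-4 above."""
    # Step 3.1: Euclidean distance from the test point to every training row
    distances = [
        (math.dist(test_point, x), label)
        for x, label in zip(train_X, train_y)
    ]
    # Step 3.2: sort by distance, ascending
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: take the top K rows
    top_k = [label for _, label in distances[:k]]
    # Step 3.4: assign the most frequent class among them
    return Counter(top_k).most_common(1)[0][0]

# Step 1: a toy dataset with two well-separated classes
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]

# Step 2: choose K = 3, then classify two test points
print(knn_predict(train_X, train_y, (1.5, 1.5), k=3))  # A
print(knn_predict(train_X, train_y, (8.5, 8.5), k=3))  # B
```

Note that there is no separate training step, which is exactly what "lazy learning" means: all the work (distance computation, sorting, voting) happens at classification time.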