
Kanishk Gupta 417 CSE-7B

DMBI ASSIGNMENT 3

Q-1 What is Data Mining?


Ans-1
Data mining is a process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems.[1] Data mining is
an interdisciplinary subfield of computer science and statistics with an overall goal to
extract information (with intelligent methods) from a data set and transform the
information into a comprehensible structure for further use. Data mining is the analysis
step of the "knowledge discovery in databases" process, or KDD. Aside from the raw
analysis step, it also involves database and data management aspects, data pre-processing,
model and inference considerations, interestingness metrics, complexity considerations,
post-processing of discovered structures, visualization, and online updating.

The term "data mining" is a misnomer, because the goal is the extraction of patterns and
knowledge from large amounts of data, not the extraction (mining) of data itself. It also is
a buzzword and is frequently applied to any form of large-scale data or information
processing (collection, extraction, warehousing, analysis, and statistics) as well as any
application of computer decision support systems, including artificial intelligence (e.g.,
machine learning) and business intelligence. The book Data mining: Practical machine
learning tools and techniques with Java (which covers mostly machine learning material)
was originally to be named just Practical machine learning, and the term data mining was
only added for marketing reasons.

Q-2 Explain knowledge discovery process (KDD Process).

Ans-2

The volume of information we have to handle is increasing every day, coming from business
transactions, scientific data, sensor data, pictures, videos, etc. We therefore need a system
capable of extracting the essence of the available information and automatically generating
reports, views, or summaries of the data for better decision-making.

Data mining is used in business to make better managerial decisions by:

• Automatically summarizing data.
• Extracting the essence of the stored information.
• Discovering patterns in raw data.

Data mining, a term often used interchangeably with Knowledge Discovery in Databases (KDD),
refers to the nontrivial extraction of implicit, previously unknown, and potentially useful
information from data stored in databases. The KDD process consists of the following steps:

1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from
the collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random error or variance in a measured variable.
• Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from
multiple sources into a common source (data warehouse).
• Data integration using data migration tools.
• Data integration using data synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided and retrieved from the data collection.
• Data selection using neural networks.
• Data selection using decision trees.
• Data selection using Naive Bayes.
• Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming
data into the appropriate form required by the mining procedure.
Data transformation is a two-step process:
• Data mapping: Assigning elements from the source base to the destination to capture
transformations.
• Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques to
extract potentially useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting
patterns representing knowledge, based on given interestingness measures.
• Finds the interestingness score of each pattern.
• Uses summarization and visualization to make the data understandable to the user.
7. Knowledge Representation: Knowledge representation is defined as a technique that
uses visualization tools to represent data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.
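
The steps above can be sketched end to end on a toy example. The following is only a minimal illustration, not a standard implementation: the records, the field names (`id`, `items`, `amount`), the function names, and the 0.5 support threshold are all assumptions made up for this sketch. Mining is reduced to counting item pairs that occur together, and support (the fraction of baskets containing the pair) stands in for the interestingness score.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources (illustrative values only).
store_sales = [
    {"id": 1, "items": ["bread", "milk"], "amount": 5.0},
    {"id": 2, "items": ["bread", "butter"], "amount": None},  # missing value
    {"id": 3, "items": ["milk", "butter", "bread"], "amount": 9.0},
]
web_sales = [
    {"id": 4, "items": ["milk", "bread"], "amount": 6.0},
    {"id": 5, "items": [], "amount": 0.0},  # irrelevant record: empty basket
]

def clean(records):
    """Step 1: drop empty baskets and fill missing amounts with the mean."""
    kept = [dict(r) for r in records if r["items"]]
    known = [r["amount"] for r in kept if r["amount"] is not None]
    mean = sum(known) / len(known) if known else 0.0
    for r in kept:
        if r["amount"] is None:
            r["amount"] = mean
    return kept

def integrate(*sources):
    """Step 2: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Step 3: keep only the attribute relevant to this analysis."""
    return [r["items"] for r in records]

def transform(baskets):
    """Step 4: map each basket to a sorted tuple of unique items."""
    return [tuple(sorted(set(b))) for b in baskets]

def mine(baskets):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(counts, n_baskets, min_support=0.5):
    """Step 6: keep pairs whose support reaches the (assumed) threshold."""
    return {pair: c / n_baskets for pair, c in counts.items()
            if c / n_baskets >= min_support}

cleaned = integrate(clean(store_sales), clean(web_sales))
baskets = transform(select(cleaned))
patterns = evaluate(mine(baskets), len(baskets))

# Step 7: present the result as a small report.
for pair, support in sorted(patterns.items()):
    print(pair, support)
```

Each function corresponds to one numbered step, so a single stage (for example, a different cleaning rule) can be swapped out without touching the others.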

Q-3 Explain Benefits of data mining.

Ans-3

Advantages of Data Mining

o The Data Mining technique enables organizations to obtain knowledge-based data.

o Data mining enables organizations to make lucrative modifications in operation and
production.

o Compared with other statistical data applications, data mining is cost-efficient.

o Data Mining helps the decision-making process of an organization.

o It facilitates the automated discovery of hidden patterns as well as the prediction of
trends and behaviours.

o It can be introduced into new systems as well as existing platforms.

o It is a quick process that makes it easy for new users to analyse enormous amounts
of data in a short time.

Q-4 Explain KNN Algorithm.

Ans-4

The K-nearest neighbours (KNN) algorithm is a supervised machine learning algorithm that can
be used for both classification and regression problems. However, in industry it is mainly
used for classification. The following two properties define KNN well −

• Lazy learning algorithm − KNN is a lazy learning algorithm because it has no specialized
training phase; it stores the training data and uses all of it at classification time.

• Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm
because it does not assume anything about the underlying data distribution.

Working of KNN Algorithm

The K-nearest neighbours (KNN) algorithm uses ‘feature similarity’ to predict the values of
new data points, which means that a new data point is assigned a value based on how closely
it matches the points in the training set. Its working can be understood through the
following steps −

Step 1 − To implement any algorithm, we need a dataset. So in the first step of KNN, we
must load the training data as well as the test data.

Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to
consider. K can be any positive integer.

Step 3 − For each point in the test data do the following −

• 3.1 − Calculate the distance between the test point and each row of the training data
using one of the following methods: Euclidean, Manhattan, or Hamming distance. The most
commonly used method is Euclidean distance.

• 3.2 − Sort the training rows in ascending order of distance.

• 3.3 − Choose the top K rows from the sorted array.

• 3.4 − Assign a class to the test point based on the most frequent class among these
rows.

Step 4 − End
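
The steps above can be written out as a short Python sketch using only the standard library. This is a minimal illustration under assumptions, not a reference implementation: the function names (`euclidean`, `manhattan`, `knn_predict`), the toy training points, and the choice K = 3 are all made up for the example. Because KNN is a lazy learner, there is no separate training function; the training data is simply passed in at prediction time.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, point, k=3, distance=euclidean):
    """Classify `point` by majority vote among its k nearest training rows."""
    # Step 3.1: distance from the test point to every training row.
    distances = [(distance(row, point), label)
                 for row, label in zip(train_X, train_y)]
    # Step 3.2: sort in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the top K rows.
    top_k = distances[:k]
    # Step 3.4: the most frequent class wins; on a tie, Counter returns
    # the class that was encountered first among the nearest rows.
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Step 1: load training and test data (hypothetical toy values).
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
           (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_y = ["A", "A", "A", "B", "B", "B"]
test_X = [(1.2, 1.4), (6.8, 7.0)]

# Step 2: choose K. Step 3: classify each test point.
predictions = [knn_predict(train_X, train_y, p, k=3) for p in test_X]
print(predictions)  # ['A', 'B']
```

Passing `distance=manhattan` switches the metric without changing the rest of the procedure. Choosing an odd K for a two-class problem avoids tied votes.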
