2018 & 2019 Data Mining Answers
1MARKS
1) What is OLTP?
It is a database used to store day-to-day transactions.
(or)
It is used to store the current transaction and it receives input from the source applications.
2) Define Data Mining?
It is the process of discovering (extracting, or mining) knowledge from large amounts of data.
3) Expand KDD
Knowledge Discovery in Database.
4) Mention any two applications of data mining.
- Education
- Banking
- Government
- Manufacturing
5) State Bayes' Theorem
It is the basis of one of the classification models (the Naive Bayes classifier), used to predict the target class based on training data.
Formula:
P(A|B) = P(B|A) * P(A) / P(B)
P(A) = prior probability of the proposition.
P(B) = probability of the evidence.
Support(S): percentage (%) of transactions (T) that contain both 'A' and 'B'.
Confidence(C): in a transaction set 'T', the percentage of transactions containing 'A' in which 'B' is also present.
Example: problem
For the following given transaction data set, generate rules using the Apriori algorithm. Consider
support = 50% and confidence = 75%.
Answer:
In the next step we need to find the frequency of combinations of 2 items out of the 4 items: Bread,
Cheese, Juice and Milk.
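Since the transaction table itself is not reproduced here, the following sketch uses a hypothetical set of transactions over the four items named above to show how Apriori finds frequent itemsets and generates rules:

```python
from itertools import combinations

# Hypothetical transactions over the four items named above
transactions = [
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Cheese"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
min_support, min_conf = 0.5, 0.75
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

# Frequent 1-itemsets, then candidate 2-itemsets (the Apriori join step)
items = sorted({i for t in transactions for i in t})
f1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
f2 = [frozenset(p)
      for p in combinations(sorted({i for s in f1 for i in s}), 2)
      if support(frozenset(p)) >= min_support]

# Rules A -> B with confidence = support(A u B) / support(A)
rules = []
for s in f2:
    for a in s:
        conf = support(s) / support(frozenset([a]))
        if conf >= min_conf:
            rules.append((a, s - {a}, conf))
print(rules)
```

With these hypothetical transactions only the rule Juice -> Cheese survives both thresholds.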
16) Explain the popular classification software
There are many classification software tools in data mining.
Some of them are:
➢ MSBN:
o Microsoft Belief Network tools, for creation, assessment and evaluation of
Bayesian belief networks.
o It is free for non-commercial research users.
➢ YADT:
o Yet Another Decision Tree builder, it is a new implementation of the C4.5
decision tree algorithm.
➢ C5.0:
o It constructs classifiers in the form of decision trees and rules etc.
17) Explain the different smoothing techniques.
The binning technique is used to handle noisy data by “data smoothing”.
Step 1: sort the data if it is not sorted.
Step 2: divide the data into equal-depth partitions.
Step 3: apply data smoothing in 3 ways:
1) Bin means
2) Bin medians
3) Bin boundaries
Example: Data=8,9,15,24,21,25,29,34,4,21,26,28
B1- 4,8,9,15
B2-21,21,24,25
B3-26,28,29,34
1. Bin means:
B1- 9, 9, 9, 9
B2- 22.75, 22.75, 22.75, 22.75
B3- 29.25, 29.25, 29.25, 29.25
2. Bin medians:
B1- 8.5, 8.5, 8.5, 8.5
B2- 22.5, 22.5, 22.5, 22.5
B3- 28.5, 28.5, 28.5, 28.5
3. Bin boundaries:
B1- 4, 4, 4, 15
B2- 21, 21, 25, 25
B3- 26, 26, 26, 34
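The three smoothing methods above can be sketched as:

```python
# Smoothing by bin mean, median and boundaries for the data above
data = sorted([8, 9, 15, 24, 21, 25, 29, 34, 4, 21, 26, 28])
bins = [data[i:i + 4] for i in range(0, len(data), 4)]  # equal depth, 4 per bin

def by_mean(b):
    """Replace every value in the bin by the bin mean."""
    m = sum(b) / len(b)
    return [m] * len(b)

def by_median(b):
    """Replace every value by the bin median (bins here have even size)."""
    mid = (b[len(b) // 2 - 1] + b[len(b) // 2]) / 2
    return [mid] * len(b)

def by_boundaries(b):
    """Replace every value by the nearer of the two bin boundaries."""
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

for b in bins:
    print(by_mean(b), by_median(b), by_boundaries(b))
```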
15. Mention the different classifier used to solve the classification problem. Explain any one
16. Explain core, border and outlier points in Density based clustering with neat diagram
MinPts: the minimum number of points required within the eps-neighbourhood of a point.
Core point: a point that satisfies the minimum-points condition, i.e. it has at least MinPts points
in its neighbourhood.
Border point: a point that is not a core point but is a neighbour of a core point.
Noise / outlier point: a point that is neither a core point nor a border point.
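The three point types can be checked with a small pure-Python sketch (the eps, MinPts and sample points are hypothetical, and this is only the point-labelling step of DBSCAN, not the full clustering):

```python
from math import dist  # Euclidean distance, Python 3.8+

def classify_points(points, eps, min_pts):
    """Label each point core / border / noise, as in DBSCAN."""
    neighbors = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}  # incl. itself
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"
        else:
            labels[p] = "noise"
    return labels

# Hypothetical 2-D points: a dense blob, one fringe point, one far outlier
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
print(classify_points(pts, eps=1.5, min_pts=4))
```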
Repeated 32
Web content mining is the process of extracting useful information from the content of web
documents.
- Content may consist of text, images, audio, video, lists and tables.
- Using all of these we can extract useful information from web documents.
- Page rank method and web document clustering are the two methods to mine the
content.
19. Give an example for structured data, semi structured data and Unstructured data
* Examples of structured data: relational database tables (rows and columns), e.g. an employee table in an SQL database.
* Examples of unstructured data: text documents, PDFs, images, videos etc.
* Examples of semi-structured data: spreadsheet files, XML, JSON documents, NoSQL
databases etc.
5MARKS
20) Explain the OLAP operation.
• OLAP stands for Online Analytical Processing.
• Applications of OLAP are,
o Finance and accounting.
o Sales and marketing.
o Production.
• Roll-up
• Drill down
• Slice
• Dice
• Pivot
➢ Roll up: When roll-up operation is performed on data cube one or more dimensions
from that cube are removed.
➢ Drill-down:
• It is the reverse operation of roll-up.
• When we perform drill-down on a data cube, dimensions are added to the cube.
• It means moving from a higher-level summary to a lower-level, more detailed summary.
➢ Slice:
• It performs selection on one dimension from given cube and provides a new subcube.
➢ Pivot: -
• It is also called as rotation.
• It is a technique of changing from one-dimension orientation to another with value
also.
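The roll-up and slice operations above can be sketched on a tiny in-memory cube (all figures are hypothetical):

```python
# A tiny cube: (year, city, item) -> sales. All figures hypothetical.
cube = {
    (2018, "Delhi",  "TV"):     10,
    (2018, "Delhi",  "Mobile"): 20,
    (2018, "Mumbai", "TV"):     15,
    (2019, "Delhi",  "TV"):     12,
    (2019, "Mumbai", "Mobile"): 25,
}

def roll_up(cube, drop_dim):
    """Remove one dimension (0=year, 1=city, 2=item), summing the measure."""
    out = {}
    for key, v in cube.items():
        k = tuple(x for i, x in enumerate(key) if i != drop_dim)
        out[k] = out.get(k, 0) + v
    return out

def slice_(cube, dim, value):
    """Select a single value on one dimension, yielding a sub-cube."""
    return {k: v for k, v in cube.items() if k[dim] == value}

print(roll_up(cube, 2))       # sales per (year, city), item dimension removed
print(slice_(cube, 0, 2018))  # the 2018 sub-cube
```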
3-tier architecture: • The data warehouse connects data from different OLTP systems and
external sources.
1. Bottom tier (data warehouse tier): it receives the data from different sources after applying ETL.
2. Middle tier (OLAP server): it acts as an interface between the end users and the data
warehouse.
3. Top tier (front-end tier): this is where the user is present; it is used to display the output as per user
requirements.
22)Explain the rule based classifier used to solve the classification problem
It is a type of classifier which classifies records using a collection of “if-then” rules.
Rules can be extracted in two ways:
1) Direct method.
2) Indirect method.
Direct method:
Sequential covering algorithm: a direct method to extract rules from the data set. Steps:
Rule growing.
Instance elimination.
Indirect method: rules cannot be extracted directly from the data set; techniques like
decision trees, neural network algorithms etc. must be applied first.
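A minimal sketch of an ordered if-then rule list with a default class (the rules, attributes and records are hypothetical):

```python
# A minimal if-then rule classifier; rules and records are hypothetical.
rules = [
    (lambda r: r["outlook"] == "sunny" and r["humidity"] == "high", "no"),
    (lambda r: r["outlook"] == "overcast", "yes"),
    (lambda r: r["outlook"] == "rain" and r["wind"] == "strong", "no"),
]
default_class = "yes"  # used when no rule covers the record

def classify(record):
    for condition, label in rules:  # ordered rule list, first match wins
        if condition(record):
            return label
    return default_class

print(classify({"outlook": "sunny", "humidity": "high", "wind": "weak"}))  # no
```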
24) Given data set K={2,3,4,10,11,12,20,25,30}, use the k-means algorithm to divide the data set into 2
clusters.
Step 4: now compute the centroids m1 and m2 of the two clusters.
- Divisive clustering
1) Agglomerative hierarchical clustering :-
It keeps merging the objects that are close to each other. Agglomerative clustering can
be classified into 2 methods; they are
Example:
26) List and explain the factors affecting the search engine ranking
1) Volume: big data involves high volumes of data (possibly terabytes or more) generated from
various sources.
When the volume of data is so large, we cannot store and analyse it using traditional
database technology.
2) Variety: it refers to heterogeneous data types (structured, unstructured and semi-
structured) from various sources, in the form of emails, images, videos, PDFs, audio etc.
3) Velocity: The term velocity refers to the speed at which the data is generated from
various sources like social media, mobile devices, sensors etc. The flow of data is
continuous.
4) Veracity: it refers to inconsistency, noise and abnormality in data; in other words, the
trustworthiness and quality of the data.
5) Value: value refers to the worth of the data being extracted. The analysis of big data must
add value, i.e. yield useful insight.
20. Write the differences between data warehouse and data mining.
Repeated 14
• Roll-up
• Drill down
• Slice
• Dice
• pivot
Repeated 21
Repeated 22
24. Given data set K={2,3,4,10,11,12,20,25,30}, use the k-means algorithm to divide the data set into 2
clusters.
Repeated 24
- Then the cluster is subdivided into smaller and smaller pieces until each object forms a cluster on
its own.
Algorithm:
Step 1: compute the minimum spanning tree for the given matrix using Prim's/Kruskal's algorithm.
Step2: repeat.
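The MST-based divisive procedure above can be sketched on the 1-D data set from question 24, splitting it into two clusters by cutting the largest MST edge:

```python
# Divisive clustering via a minimum spanning tree (Prim's algorithm):
# build the MST, then cut the largest edge to split into two clusters.
points = [2, 3, 4, 10, 11, 12, 20, 25, 30]

def prim_mst(pts):
    """Prim's algorithm on the complete graph of 1-D points."""
    in_tree, edges = {pts[0]}, []
    while len(in_tree) < len(pts):
        u, v = min(((a, b) for a in in_tree for b in pts if b not in in_tree),
                   key=lambda e: abs(e[0] - e[1]))
        edges.append((u, v))
        in_tree.add(v)
    return edges

edges = prim_mst(points)
cut = max(edges, key=lambda e: abs(e[0] - e[1]))  # largest edge -> remove it
kept = [e for e in edges if e != cut]

# Collect the two connected components left after the cut
clusters = {p: {p} for p in points}
for u, v in kept:
    merged = clusters[u] | clusters[v]
    for p in merged:
        clusters[p] = merged
print(sorted(map(sorted, {frozenset(c) for c in clusters.values()})))
```

Cutting the largest edge (between 12 and 20) yields the clusters {2,3,4,10,11,12} and {20,25,30}.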
Repeated 18
Repeated
7MARKS
Data cleaning: it handles noisy data and missing values using data smoothing, regression and
clustering techniques.
Data integration: when data is integrated into a single source some issues arise; we can
handle them here.
Data selection: in this phase relevant data is selected from the huge amount of available data.
Data warehouse server: the server is the supplying machine where we connect and write
queries like DML and DCL and process them.
Data mining engine: it is the heart of the data mining architecture; it performs all the tasks like
extraction, clustering etc.
Pattern evaluation: in this phase interestingness measures are calculated and the evaluated
patterns are stored in the knowledge base.
User interface: it communicates data between users and the data mining engine/system
through queries like DCL/DML.
(a) Phases of KDD: There are six steps behind the KDD process.
1. Data integration.
2. Data selection.
3. Data transformation.
4. Data mining.
5. Pattern evaluation.
6. Knowledge representation.
1. Data integration: in this phase data is collected from different sources and
integrated into a single source (data warehouse).
2. Data selection: in this phase, relevant data is first selected from the data
warehouse for the retrieval purpose.
3. Data transformation: after selecting the data, it is transformed into
other forms as per requirements.
Data cleaning: it involves removal of noisy and irrelevant data from the database.
4. Data mining: Apply various techniques like association rule, classification,
clustering, regression etc., to extract the data patterns.
5. Pattern evaluation: The different data patterns generated by data mining are
evaluated using metrics.
6. Knowledge representation: The final step of KDD, which represents the
knowledge extracted in the user required forms.
(b):
Definition: Machine learning is a subset of artificial intelligence which gives machines the ability
to learn automatically from experience without being explicitly programmed.
➢ Supervised learning
➢ Unsupervised learning
➢ Reinforcement learning
➢ Semi supervised learning
30)
Classify x = (yellow, sweet, long) using Naive Bayes:
P(class|x) ∝ P(yellow|class) × P(sweet|class) × P(long|class) × P(class)
1. Mango:
P(yellow|mango) = 350 / 650 = 0.538
P(sweet|mango) = 450 / 650 = 0.692
P(long|mango) = 0
P(mango|x) ∝ 0.538 × 0.692 × 0 = 0
2. Banana:
P(yellow|banana) = 1
P(sweet|banana) = 0.75
P(long|banana) = 0.875
P(banana|x) ∝ 1 × 0.75 × 0.875 × 0.33 = 0.2185
3. Others:
P(yellow|others) = 0.33
P(sweet|others) = 0.66
P(long|others) = 0.33
P(others|x) ∝ 0.33 × 0.66 × 0.33 × 0.125 = 0.0089
Result:
Mango = 0
Banana = 0.2185
Others = 0.0089
Since Banana has the highest score, the fruit is classified as Banana.
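The computation can be reproduced with class counts consistent with the figures above. The counts below are inferred (assuming a training set of 1200 fruits in total); exact arithmetic gives slightly different rounding than the hand computation:

```python
# Counts inferred to match the figures above: class -> (total, yellow, sweet, long)
counts = {
    "mango":  (650, 350, 450,   0),
    "banana": (400, 400, 300, 350),
    "others": (150,  50, 100,  50),
}
grand_total = sum(c[0] for c in counts.values())  # 1200 (assumed)

def score(cls):
    """Naive Bayes score: P(yellow|cls) * P(sweet|cls) * P(long|cls) * P(cls)."""
    total, yellow, sweet, long_ = counts[cls]
    return (yellow / total) * (sweet / total) * (long_ / total) * (total / grand_total)

scores = {c: round(score(c), 4) for c in counts}
print(scores)  # mango scores 0.0; banana has the highest score
```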
31) What is agglomerative clustering? Cluster the following data using the agglomerative approach and
represent it through a dendrogram.
32) Mention any three cluster analysis method. Explain any one in detail
Clustering methods:
There are 5 methods, they are
i. Partitioning method
1) Partitioning method:
- Each partition must include at least one object; it means none of the clusters will be
empty.
Problem: apply k-means clustering to the data set K={2,3,4,10,11,12,20,25,30} for 2 clusters,
given 2 initial random centroids m1=4 and m2=12.
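The iterations can be sketched as:

```python
# K-means on the 1-D data set with k=2, initial centroids m1=4 and m2=12
data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
m1, m2 = 4.0, 12.0

while True:
    # Assign step: each point goes to the nearer centroid
    c1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    c2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    # Update step: recompute centroids as cluster means
    new1, new2 = sum(c1) / len(c1), sum(c2) / len(c2)
    if (new1, new2) == (m1, m2):  # converged: centroids unchanged
        break
    m1, m2 = new1, new2

print(c1, c2, m1, m2)  # [2, 3, 4, 10, 11, 12] [20, 25, 30] 7.0 25.0
```

The algorithm converges to the clusters {2,3,4,10,11,12} and {20,25,30} with centroids 7 and 25.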
- Crawling
- Indexing
- Retrieving
Once the user's required information/pages are found by the crawler, the next step is to organise, sort
and store this information in the search engine database. This process is called indexing.
3) Retrieval and ranking: The last step for search engine is to decide which pages to be retrieved
from database and give back the result to user.
Definition: MapReduce is designed to process large amounts of data in parallel by dividing the work
into smaller, independent tasks.
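The idea can be sketched with the classic word-count example, written in the map/shuffle/reduce style (the input lines are hypothetical):

```python
from collections import defaultdict

# Word count in the map/shuffle/reduce style; input lines are hypothetical
lines = ["big data big ideas", "data mining big data"]

# Map: each line -> (word, 1) pairs; each line could be handled in parallel
mapped = [(w, 1) for line in lines for w in line.split()]

# Shuffle: group the pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each key
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'big': 3, 'data': 3, 'ideas': 1, 'mining': 1}
```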
Repeated 28
Repeated 31
32.
Repeated 31
33. Explain the briefly the architecture of search engine and its working
Repeated 33
Repeated 27
Repeated 12