2018 & 2019 DATA MINING ANSWERS

1 MARK QUESTIONS

1) What is OLTP?
It is a database used to store day-to-day transactions.
(or)
It stores the current transactions and receives input from the source applications.
2) Define Data Mining.
It is the process of discovering or extracting or mining knowledge from large amounts of data.
3) Expand KDD.
Knowledge Discovery in Databases.
4) Mention any two applications of data mining.
- Education
- Banking
- Government
- Manufacturing
5) State Bayes' Theorem.
It is one of the classification models; it is used to predict the target data based on
training data.
Formula:

P(A|B) = P(B|A) * P(A) / P(B)

Where P(A|B) = probability of A when B is true (the posterior),

P(B|A) = probability of B when A is true (the likelihood),

P(A) = prior probability of the proposition A,

P(B) = probability of the evidence B.
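As a quick numeric sketch (the probability values below are invented, purely for illustration):

```python
# Minimal sketch of Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
def bayes(p_b_given_a, p_a, p_b):
    """Return the posterior probability P(A|B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical values: likelihood P(B|A)=0.8, prior P(A)=0.3, evidence P(B)=0.5
print(bayes(0.8, 0.3, 0.5))  # 0.48
```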

6) Mention any two classification methods.

• Decision tree classifier
• Naïve Bayes classifier
7) What is clustering?
- Clustering means forming groups of similar objects.
- Two objects are said to be similar if the distance between them is small.
8) What is a Dendrogram?
A dendrogram is a tree diagram that shows how objects are allocated to clusters at each level of a
hierarchical clustering.
9) What is Web Mining?
- Web mining is an application of data mining techniques used to discover
patterns or knowledge from the WWW.
- The WWW is the largest publicly accessible data source in the world.
10) What is a Search engine?
A search engine is software that searches the WWW in response to a user's request.
11) Define Big Data.
Big data is a term used to refer to datasets that are so big and complex that traditional tools
cannot store and analyse them.
12) Mention any two tools used in Big Data.
Hadoop and Apache Spark.
1. What is OLTP?
Repeated (see Q1).
2. Define data cube.
A cube is a multidimensional data set, which is used to store data in a multidimensional
format for reporting purposes.
3. Define metadata.
It is data about data, which gives us detailed information about the data in the data
warehouse.
4. What is data mining?
Repeated
5. Mention any two applications of data mining.
Repeated.
6. What is an attribute?
It is defined as a field for storing data that represents a characteristic of an object.
7. What is supervised learning?
It is a method in which we teach the machine using labelled data.
8. What is clustering?
repeated
9. Write the formula for Euclidean distance.

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
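A minimal Python sketch of this formula (the sample points are invented):

```python
import math

# Euclidean distance between two points with the same number of dimensions.
def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean((1, 2), (4, 6)))  # 5.0
```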
10. Mention any two cluster analysis software.

CLUTO, Cluster Graphics 3, CViz (cluster visualization), AutoClass C.
11. What is web usage mining?
It is the application of data mining techniques to discover interesting usage patterns
from web usage data.
12. What is big data?
Big data is a term used to refer to datasets that are so big and complex that traditional tools
cannot store and analyse them.
3 MARK QUESTIONS
13) Define missing value, noisy data and duplicate data.
Missing value: data may be missing for various reasons, such as deletion, intentional omission,
or incomplete values; it can be handled during the data cleaning process.
Noisy data: noisy data is erroneous or corrupted data.
Duplicate data: a dataset may include data objects that are duplicates, or almost duplicates, of one
another.
14) List any three differences between data mining and data warehouse.

- A data warehouse is a repository that stores integrated, historical data; data mining is the
process of extracting patterns and knowledge from that data.
- Data warehousing is concerned with collecting and managing the data; data mining is
concerned with analysing it.
- The data warehouse is built first; data mining is then applied on the warehouse data.
15) Explain the associative classification with an example.

- In association rule mining, the rules generated specify the association or relationship
between the attributes.
- ARM is also called "Market Basket Analysis" (MBA).
- A rule has the form: if 'A' then 'B' {A => B}, where A is called the antecedent and B is called
the consequent.

Two main terms are used in ARM:

Support (S): the percentage (%) of transactions (T) that contain both 'A' and 'B'.
It can be represented as P(A n B); it measures how frequently the combination of A and B occurs.

Confidence (C): in a transaction set T, C is the % of transactions containing 'A' in which 'B' is
also present. C = P(B|A) = P(A n B) / P(A); it shows the strength of the combination.
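The two measures can be sketched in a few lines of Python; the toy transactions below are invented, since the exam's transaction table is not reproduced here:

```python
# Support and confidence for a rule A => B over a toy transaction list.
transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    # P(B|A) = P(A n B) / P(A)
    return support(a | b) / support(a)

print(support({"Bread", "Cheese"}))       # 0.5
print(confidence({"Bread"}, {"Cheese"}))  # 0.666...
```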

Example problem:

For the following transaction data set, generate rules using the Apriori algorithm. Take
support = 50% and confidence = 75%.

Answer:

First find the frequency of each single item; the next step is to find the frequency of each
combination of 2 items out of the 4 items Bread, Cheese, Juice and Milk.
16) Explain the popular classification software.
There are many classification software packages in data mining.
Some of them are:
➢ MSBN:
o Microsoft Belief Network tools, for the creation, assessment and evaluation of
belief networks.
o It is free for non-commercial research use.
➢ YaDT:
o Yet Another Decision Tree builder; it is a new implementation of the C4.5
decision tree algorithm.
➢ C5.0:
o It constructs classifiers in the form of decision trees, rule sets, etc.
17) Explain the different smoothing techniques.
The binning technique is used to handle noisy data by "data smoothing".
Step 1: sort the data if it is not sorted.
Step 2: divide the data into equal-depth partitions (bins).
Step 3: apply data smoothing in one of 3 ways:
1) Bin means
2) Bin medians
3) Bin boundaries

Example: Data=8,9,15,24,21,25,29,34,4,21,26,28

Step 1: sorted data: 4,8,9,15,21,21,24,25,26,28,29,34

Step 2: we divide the data into 3 bins.

B1: 4,8,9,15

B2: 21,21,24,25

B3: 26,28,29,34

Step 3: apply the data smoothing techniques:

1. Bin mean:

B1: (4+8+9+15)/4 = 9 → 9,9,9,9

B2: (21+21+24+25)/4 ≈ 23 → 23,23,23,23

B3: (26+28+29+34)/4 ≈ 29 → 29,29,29,29

2. Bin median:

B1: (8+9)/2 = 8.5 ≈ 9 → 9,9,9,9

B2: (21+24)/2 = 22.5 ≈ 23 → 23,23,23,23

B3: (28+29)/2 = 28.5 ≈ 29 → 29,29,29,29

3. Bin boundaries:

In bin boundaries the first and last elements (the bin's minimum and maximum) remain the same,
and every other value is replaced by the nearer boundary.

B1: 4,4,4,15 [here 8 is nearer to 4, and 9 is nearer to 4]

B2: 21,21,25,25

B3: 26,26,26,34

This is how we can handle noisy data.
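A minimal Python sketch of the three smoothing techniques on the same data (rounding half up, matching the worked answer above):

```python
data = sorted([8, 9, 15, 24, 21, 25, 29, 34, 4, 21, 26, 28])
bins = [data[i:i + 4] for i in range(0, len(data), 4)]  # 3 equal-depth bins

def rnd(x):
    return int(x + 0.5)  # round half up, as used in the worked answer

for b in bins:
    by_mean = [rnd(sum(b) / len(b))] * len(b)      # smoothing by bin mean
    by_median = [rnd((b[1] + b[2]) / 2)] * len(b)  # median of 4 sorted values
    # smoothing by bin boundaries: keep min/max, move others to the nearer one
    by_bound = [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    print(by_mean, by_median, by_bound)
```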

18) Explain the web terminologies used in web mining.

- Web page: a document on the WWW, usually written in HTML.
- Web site: a collection of related web pages.
- Web server: the software/machine that delivers web pages on request.
- Web browser: the client software used to access web pages.
- URL: Uniform Resource Locator, the address of a resource on the web.
- Hyperlink: a link from one web page or resource to another.

19) Write any three applications of big data.


1) Weather forecasting:
Data collected every minute of every day from land, sea and space-based sensors is used to analyse
and forecast the weather.
2) Crime prediction and prevention:
Police departments use real-time analytics to provide intelligence that can be used to
understand criminal behaviour, identify crime, etc.
3) Healthcare sector:
- Improves healthcare by helping to provide the right medicine.
- Helps to see which treatments are more effective for a patient's condition.
4) Manufacturing sector:
- Product quality and defect tracking.
- Supply planning.
5) Media and entertainment:
- Predicting what the audience wants.
- Scheduling optimization.
- Ad targeting (advertising).

13. Explain the functions of ETL.

ETL stands for Extract, Transform and Load, the three functions used to feed the data warehouse:
- Extract: read the data from the different source systems (OLTPs and external sources).
- Transform: clean the extracted data and convert it into the required warehouse format.
- Load: write the transformed data into the data warehouse.

14. Write a short note on,

(a) Missing values : repeated

(b) Noisy data: repeated

(c) Duplicate data: repeated

15. Mention the different classifiers used to solve the classification problem. Explain any one.

Classifiers include the decision tree classifier, rule-based classifier and Naïve Bayes classifier;
the rule-based classifier is explained under Q22.
16. Explain core, border and outlier points in density-based clustering with a neat diagram.

MinPts: the minimum number of points required within the neighbourhood (circle) of a point.

Core point: a point that satisfies the minimum-points condition, i.e. a point is a core point if
it has at least the specified minimum number of points in its neighbourhood.

Border point: a point which is a neighbour of a core point, but is not itself a core point, is called
a border point.

Noise/outlier point: a point which is neither a core point nor a border point is called a noise
point.
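As an illustration, scikit-learn's DBSCAN implements this idea; the points, eps and min_samples below are invented values, so this is only a sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],  # dense region: core points
              [8.0, 8.0]])                          # isolated point: noise
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # [0 0 0 -1]; a label of -1 marks a noise/outlier point
```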

17. Explain the partitioning method used in clustering.

Repeated 32

18. What is web content mining? Explain.

Web content mining is the process of extracting useful information from the content of web
documents.

- Content may consist of text, images, audio, video, lists and tables.
- Using all of these, we can extract useful information from web documents.
- The PageRank method and web document clustering are two methods used to mine the
content.

19. Give an example for structured data, semi-structured data and unstructured data.

* An example of structured data is a database.

* Examples of unstructured data are text documents, PDFs, images, videos, etc.

* Examples of semi-structured data are spreadsheet files, XML, JSON documents, NoSQL
databases, etc.

5 MARK QUESTIONS
20) Explain the OLAP operations.
• OLAP stands for Online Analytical Processing.
• Applications of OLAP are,
o Finance and accounting.
o Sales and marketing.
o Production.

There are 5 operations performed on a data cube:

• Roll-up

• Drill-down

• Slice

• Dice

• Pivot

➢ Roll-up: when the roll-up operation is performed on a data cube, one or more dimensions
of that cube are removed, giving a more summarized view.

➢ Drill-down:
• It is the reverse operation of roll-up.
• When we perform drill-down on a data cube, dimensions are added to the cube.
• It means moving from a higher-level summary to lower-level detail.

➢ Slice:
• It performs selection on one dimension of a given cube and provides a new sub-cube.
• We select only one particular dimension.


➢ Dice: this operation selects two or more dimensions from a given cube and provides a
new sub-cube.

➢ Pivot:
• It is also called rotation.
• It is a technique of changing from one dimensional orientation to another, carrying the
values along.
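These operations can be imitated on an invented pandas DataFrame; this is only a sketch of the idea, not a real OLAP server:

```python
import pandas as pd

df = pd.DataFrame({
    "year":    [2018, 2018, 2019, 2019],
    "region":  ["East", "West", "East", "West"],
    "product": ["A", "B", "A", "B"],
    "sales":   [100, 150, 120, 180],
})

rollup = df.groupby("year")["sales"].sum()       # roll-up: remove dimensions
slice_ = df[df["year"] == 2018]                  # slice: select one dimension
dice = df[(df["year"] == 2018) & (df["region"] == "East")]  # dice: two dims
pivot = df.pivot_table(values="sales", index="region", columns="year")  # pivot
print(rollup, slice_, dice, pivot, sep="\n\n")
```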

21) Explain the architecture of a Data warehouse with a neat diagram.

Disadvantages (of the 2-tier architecture):
• It doesn't support a greater number of end users compared to 3-tier.
• If the connection fails, then the communication between the two tiers also fails.

3-tier architecture:
• The data warehouse is used to connect the data from the different OLTPs and
external sources.

There are three tiers:

1. Bottom tier (Data warehouse tier): it receives the data from the different sources after applying ETL.

2. Middle tier (OLAP server):
• It acts as an interface between the end users and the data warehouse.
• Data is stored or represented in different formats by the OLAP server.

3. Top tier (Front-end client tier):
• It is the topmost tier.
• In this tier the user presence is important; it is used to display the output as per the user's
requirements.
22) Explain the rule-based classifier used to solve the classification problem.

It is a type of classifier which classifies records by using a collection of "if-then" rules.

There are two types of methods:

1) Direct method.

2) Indirect method.

Direct method:

Sequential covering algorithm: it is a direct method to extract rules from the data set; its steps are
listed below.

Indirect method: rules cannot be extracted directly from the data set; we have to apply techniques
like decision trees, neural network algorithms, etc.

Example classifier problem


Direct method steps:

- Rule growing.

- Instance elimination.
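A minimal sketch of an if-then rule-based classifier in Python; the rules and record fields are hypothetical, since the original example table is not shown:

```python
# Ordered rule list: (condition, class label).
rules = [
    (lambda r: r["blood_type"] == "warm" and r["gives_birth"], "mammal"),
    (lambda r: r["blood_type"] == "cold", "reptile"),
]

def classify(record, default="unknown"):
    for condition, label in rules:  # rules are tried in order
        if condition(record):
            return label
    return default                  # default class when no rule fires

print(classify({"blood_type": "warm", "gives_birth": True}))  # mammal
```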

24) Given the data set K={2,3,4,10,11,12,20,25,30}, use the k-means algorithm to divide the data set
into 2 clusters.
Step 4: now compute the centroids of m1 and m2. (The same problem is worked under Q32 below.)

25) Explain the hierarchical clustering technique with an example.

i. These methods create a hierarchical, level-by-level decomposition of a given
set of data objects in the form of a tree/dendrogram.

ii. These methods are either bottom-up or top-down.

iii. There are 2 classifications:

- Agglomerative hierarchical clustering

- Divisive clustering

1) Agglomerative hierarchical clustering:

It keeps on merging objects if they are close to each other. Agglomerative clustering can
be classified into 2 methods:

* Min-distance/single linkage: this method computes the minimum distance between the
clusters to update the distance matrix.

* Max-distance/complete linkage: this method computes the maximum distance between
the clusters to update the matrix.

2) Divisive hierarchical clustering:

> In this, data objects are grouped in a top-down manner.

> Initially all objects are in one cluster.

Example:
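The original worked example is not shown; as an illustrative sketch (the 1-D points are invented), SciPy can perform single- and complete-linkage clustering:

```python
from scipy.cluster.hierarchy import linkage, fcluster

points = [[2], [3], [4], [10], [11], [12]]
Z = linkage(points, method="single")   # min-distance; "complete" = max-distance
print(fcluster(Z, t=2, criterion="maxclust"))  # [1 1 1 2 2 2]
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree (matplotlib)
```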
26) List and explain the factors affecting the search engine ranking.

- Keyword relevance: how well the page content matches the query keywords.
- Backlinks/PageRank: the number and quality of links pointing to the page.
- Content quality and freshness: original, regularly updated content ranks better.
- Site speed: faster-loading pages are ranked higher.
- Mobile friendliness: how well the page displays on mobile devices.
27) Explain any five characteristics of big data.

Characteristics of big data, or the 5 V's of big data:

1) Volume: Big data involves high volumes of data (terabytes or more) being generated from
various sources. When the volume of data is this large, we cannot store and analyse it using
traditional database technology.

2) Variety: It refers to heterogeneous data types (structured, unstructured and semi-structured)
from various sources, in the form of emails, images, videos, PDFs, audio, etc.

3) Velocity: Velocity refers to the speed at which the data is generated from
various sources like social media, mobile devices, sensors, etc. The flow of data is
continuous.

4) Veracity: It refers to the inconsistency, noise and abnormality in data. In other words,
veracity is the quality of the data.

5) Value: Value refers to the worth of the data being extracted. The analysis of big data must
result in valuable information.

20. Differentiate between the data warehouse and data mining.

Repeated 14

21. What are the characteristics of OLAP?

There are 5 operations performed on a data cube:

• Roll-up

• Drill-down

• Slice

• Dice

• Pivot

22. Explain architecture of data warehouse with neat diagram.

Repeated 21

23. Explain the rule-based classifier with an example

Repeated 22

24. Given the data set K={2,3,4,10,11,12,20,25,30}, use the k-means algorithm to divide the data set
into 2 clusters.

Repeated 24

25. With an algorithm, explain divisive hierarchical clustering.

- In this, data objects are grouped in a top-down manner.

- Initially all objects are in one cluster.

- Then the cluster is subdivided into smaller and smaller pieces until each object forms a cluster
on its own.

Algorithm:

Step 1: compute the minimum spanning tree for the given distance matrix using Prim's/Kruskal's
algorithm.

Step 2: repeat:

Step 3: create a new cluster by breaking the link corresponding to the largest distance,

Step 4: until only singleton clusters remain.

For an example, refer to Q25.
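A minimal sketch of this MST-based divisive step, using an invented distance matrix:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

dist = np.array([[0, 2, 6, 10],
                 [2, 0, 5, 9],
                 [6, 5, 0, 4],
                 [10, 9, 4, 0]], dtype=float)

mst = minimum_spanning_tree(dist).toarray()          # Step 1: build the MST
mst[np.unravel_index(mst.argmax(), mst.shape)] = 0   # Step 3: break longest link
n, labels = connected_components(mst, directed=False)
print(n, labels)  # 2 clusters: [0 0 1 1]
```

Repeating the breaking step until every point stands alone yields the full top-down hierarchy.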

26. Explain any five web terminologies in web mining

Repeated 18

27. What are the applications of Big data? Explain.

Repeated (see Q19).

7 MARK QUESTIONS

28) Explain the architecture of data mining with a neat diagram.

Data cleaning: it handles noisy data and missing values with data smoothing, regression and
clustering techniques.

Data integration: when data is integrated into a single source, some issues arise; they are
handled here.

Data selection: in this phase, relevant data is selected from the huge amount of available data.

Data warehouse server: the server is the supplying machine to which we connect and on which
queries (DML, DCL) are written and processed.

Data mining engine: it is the heart of the data mining architecture, which performs all the mining
tasks such as extraction, clustering, etc.

Pattern evaluation: in this phase, interestingness measures are calculated for the discovered
patterns, and these patterns are stored in the knowledge base.
User interface: it communicates data between the users and the data mining engine/system
through queries (DCL/DML).

29) (a) Explain the steps involved in the KDD process

(b) Define machine learning. List its categories.

(a) Phases of KDD: There are six steps in the KDD process.
1. Data integration.
2. Data selection.
3. Data transformation.
4. Data mining.
5. Pattern evaluation.
6. Knowledge representation.

1. Data integration: In this phase, data is collected from different sources and
integrated into a single source (DWH).
2. Data selection: In this phase, relevant data is first selected from the
DWH for retrieval.
3. Data transformation: After selecting the data, it is transformed into
other forms as per requirements.
Data cleaning: It involves the removal of noisy and irrelevant data from the database.
4. Data mining: Apply various techniques like association rules, classification,
clustering, regression, etc., to extract the data patterns.
5. Pattern evaluation: The different data patterns generated by data mining are
evaluated using interestingness metrics.
6. Knowledge representation: The final step of KDD, which represents the
extracted knowledge in the forms required by the user.

(b):

Definition: Machine learning is a subset of artificial intelligence which gives machines the ability
to learn automatically from experience without being explicitly programmed.

There are 4 categories,

➢ Supervised learning
➢ Unsupervised learning
➢ Reinforcement learning
➢ Semi supervised learning

30) Using the Naïve Bayes classifier, classify the fruit x = {yellow, sweet, long} from the given
training data of 1200 fruits (mango, banana, others).

1. P(x|mango), where x = yellow, sweet, long:

P(yellow|mango) = P(mango|yellow) * P(yellow) / P(mango)
= (350/800) * (800/1200) / (650/1200)
= 350/650
= 0.538

P(sweet|mango) = P(mango|sweet) * P(sweet) / P(mango)
= (450/850) * (850/1200) / (650/1200)
= 450/650
= 0.692

P(long|mango) = P(mango|long) * P(long) / P(mango)
= (0/400) * (400/1200) / (650/1200)
= 0

P(x|mango) score = (650/1200) * {0.538 * 0.692 * 0}
= 0

2. P(x|banana), where x = yellow, sweet, long:

P(yellow|banana) = P(banana|yellow) * P(yellow) / P(banana)
= (400/800) * (800/1200) / (400/1200)
= 1

P(sweet|banana) = P(banana|sweet) * P(sweet) / P(banana)
= (300/850) * (850/1200) / (400/1200)
= 0.75

P(long|banana) = P(banana|long) * P(long) / P(banana)
= (350/400) * (400/1200) / (400/1200)
= 0.875

P(x|banana) score = (400/1200) * {1 * 0.75 * 0.875}
= 0.2188

3. P(x|others), where x = yellow, sweet, long:

P(yellow|others) = P(others|yellow) * P(yellow) / P(others)
= (50/800) * (800/1200) / (150/1200)
= 0.33

P(sweet|others) = P(others|sweet) * P(sweet) / P(others)
= (100/850) * (850/1200) / (150/1200)
= 0.66

P(long|others) = P(others|long) * P(long) / P(others)
= (50/400) * (400/1200) / (150/1200)
= 0.33

P(x|others) score = (150/1200) * {0.33 * 0.66 * 0.33}
= 0.0089

Mango = 0

Banana = 0.2188

Others = 0.0089

Banana > mango and others.

∴ The type of fruit is "banana".
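The scores above can be re-checked with a short script; the counts are taken from the worked answer (with unrounded intermediates, 'others' comes to about 0.0093 rather than 0.0089, which does not change the conclusion):

```python
# {class: (yellow count, sweet count, long count, class total)}
counts = {
    "mango":  (350, 450, 0, 650),
    "banana": (400, 300, 350, 400),
    "others": (50, 100, 50, 150),
}
total = 1200

for fruit, (y, s, l, n) in counts.items():
    # score = P(class) * P(yellow|class) * P(sweet|class) * P(long|class)
    score = (n / total) * (y / n) * (s / n) * (l / n)
    print(fruit, round(score, 4))  # mango 0.0, banana 0.2188, others 0.0093
```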

31) What is agglomerative clustering? Cluster the following data using the agglomerative approach
and represent it through a dendrogram.

32) Mention any three cluster analysis method. Explain any one in detail

Clustering methods:
There are 5 methods, they are

i. Partitioning method

ii. Hierarchical method

iii. Density based method.

iv. Grid based method

v. Model based method

1) Partitioning method:

- It must satisfy the following 2 conditions:

- Each partition must include at least one object; it means none of the clusters will be
empty.

- Every object must belong to exactly one partition/group/cluster.

Example: K-means problem. The k-means algorithm is a partition-based clustering
algorithm.

Problem: Apply k-means clustering to the following data set to form 2 clusters:
K={2,3,4,10,11,12,20,25,30}, given 2 initial random centroids m1=4 and m2=12.

Given k=2 clusters, centroids m1=4; m2=12.

Step 4: now compute the centroids of m1 and m2.
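A minimal pure-Python sketch of k-means on this data with the given initial centroids:

```python
data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
m1, m2 = 4.0, 12.0  # given initial centroids

while True:
    # assign each point to the nearer centroid
    c1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    c2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    new1, new2 = sum(c1) / len(c1), sum(c2) / len(c2)
    if (new1, new2) == (m1, m2):  # stop when the centroids no longer move
        break
    m1, m2 = new1, new2

print(c1, c2, m1, m2)  # [2, 3, 4, 10, 11, 12] [20, 25, 30] 7.0 25.0
```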

33) Explain in detail the architecture of a search engine and its working.

Definition: A search engine is software that searches the WWW in response to a user's request.

- Google is one of the most common search engines.
- Bing, Yahoo, Ask, AOL, etc. are other search engines.
- A search engine always receives requests in the form of keywords and sends responses to
the user.

Search engine architecture

Every search engine has 3 main functions, they are:

- Crawling
- Indexing
- Retrieving

1) Crawling (discover content):

- It involves the collection of information based on the user's request.
- When a web crawler visits a page, it collects every link on the page and adds them to its
list of pages to visit next.
2) Indexing (track and store content):

Once the required information/pages have been found by the crawler, the next step is to organise,
sort and store this information in the search engine's database. This process is called indexing.

3) Retrieval and ranking: the last step for the search engine is to decide which pages are to be
retrieved from the database and give the results back to the user.
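A minimal, standard-library-only sketch of the crawling step (the URL is a placeholder; a real crawler would loop over the collected links):

```python
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":  # record the href of every <a> tag on the page
            self.links += [value for name, value in attrs if name == "href"]

html = urlopen("https://example.com").read().decode("utf-8", "ignore")
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # these links join the crawler's list of pages to visit
```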

34) Explain the MapReduce technique with an example.

MapReduce is one of the main components of the Hadoop ecosystem.

Definition: MapReduce is designed to process a large amount of data in parallel by dividing the
work into smaller, independent tasks.

Working of MapReduce:

The data goes through the following phases.

• Input: the big data given by the user.
• Input splits: the input to a MapReduce job is divided into fixed-size pieces called splits.
• Mapping: the second phase in the execution of a MapReduce program; each split is turned
into intermediate (key, value) pairs.
• Shuffling: this phase consumes/receives the output of the mapping phase and groups the
values by key.
• Reducing: the final phase, in which the output values from the shuffling phase are
aggregated.
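The classic MapReduce example is word counting; the plain-Python sketch below mimics the split, map, shuffle and reduce phases (the input lines are invented):

```python
from collections import defaultdict

splits = ["deer bear river", "car car river", "deer car bear"]  # input splits

mapped = [(w, 1) for s in splits for w in s.split()]  # map: emit (word, 1)

shuffled = defaultdict(list)                          # shuffle: group by key
for word, count in mapped:
    shuffled[word].append(count)

reduced = {w: sum(c) for w, c in shuffled.items()}    # reduce: sum per word
print(reduced)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```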
28. Explain the architecture of data mining with a neat diagram

Repeated 28

29. How does the decision tree classifier work? Explain with an example

30. Explain Naïve Bayes classifier with an example

Repeated 2019 Q5 and Q19

31. What is the MapReduce technique? Explain with an example

Repeated (see Q34).

32.

Repeated 31

33. Explain briefly the architecture of a search engine and its working

Repeated 33

34. (a) Explain five characteristics of Big data

Repeated 27

(b) Mention any two tools used in Big data

Repeated 12
