
Q1.

Briefly outline how to compute the dissimilarity between objects described by the
following:

(A) Nominal attributes


A nominal (categorical) attribute is a generalization of the binary attribute in that it can take on more than two states.
The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of attributes on which i and j are in the same state), and p is the total number of attributes.
Alternatively, we can encode each nominal attribute with a larger number of binary variables, creating one new binary variable for each of its M nominal states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0.
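The mismatch-ratio formula above can be sketched in a few lines of Python (the objects and attribute values here are made-up examples):

```python
def nominal_dissimilarity(i, j):
    """Dissimilarity for nominal attributes: d(i, j) = (p - m) / p,
    where m = number of matching states and p = total number of attributes."""
    p = len(i)                              # total number of attributes
    m = sum(a == b for a, b in zip(i, j))   # number of matches
    return (p - m) / p

# Two objects described by three nominal attributes; one attribute mismatches.
print(nominal_dissimilarity(["red", "circle", "small"],
                            ["red", "square", "small"]))  # 1/3
```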

(B) Asymmetric binary attributes


An asymmetric binary attribute is one in which the two outcomes are not of equal importance, such as the positive and negative outcomes of an HIV test. By convention, the more important (usually rarer) outcome is coded 1 and the other 0.
In computing the dissimilarity between objects described by asymmetric binary attributes, the number of negative matches, t, is considered unimportant and is therefore ignored in the computation:

d(i, j) = (r + s) / (q + r + s)

where q is the number of attributes equal to 1 for both objects, r is the number equal to 1 for i but 0 for j, and s is the number equal to 0 for i but 1 for j.
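This measure (the complement of the Jaccard coefficient) can be sketched as follows, with two made-up binary profiles:

```python
def asymmetric_binary_dissimilarity(i, j):
    """d(i, j) = (r + s) / (q + r + s); negative matches (t) are ignored."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # positive matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))  # 1 in i, 0 in j
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))  # 0 in i, 1 in j
    return (r + s) / (q + r + s)

# q = 1, r = 1, s = 1; the two negative matches do not affect the result.
print(asymmetric_binary_dissimilarity([1, 0, 1, 0, 0],
                                      [1, 1, 0, 0, 0]))  # 2/3
```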

(C) Numeric attributes


A numeric (continuous) attribute is one that can take on any value within a finite or infinite interval (e.g., height, weight, temperature, blood sugar). There are two kinds of numeric attributes: interval-scaled and ratio-scaled. An interval attribute has values whose differences are interpretable, but it does not have a true zero. A good example is temperature in degrees Celsius. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided; for instance, we cannot say that one day is twice as hot as another. In contrast, a ratio attribute has a true zero and can be added, subtracted, multiplied, or divided (e.g., weight). Dissimilarity between objects with numeric attributes is typically computed with a distance measure such as the Euclidean or Manhattan distance (special cases of the Minkowski distance), often after normalizing the data.
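The Minkowski distance of order h covers both common cases (h = 1 gives Manhattan, h = 2 gives Euclidean); the two objects below are made-up examples:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two numeric vectors.
    h = 1 is the Manhattan distance, h = 2 the Euclidean distance."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x, y = [22, 1], [20, 0]
print(minkowski(x, y, 1))  # Manhattan: |22-20| + |1-0| = 3
print(minkowski(x, y, 2))  # Euclidean: sqrt(4 + 1) = sqrt(5)
```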
(D) Term-frequency vectors
Simply counting term frequencies, as in a document-term matrix, suffers from a critical problem: all terms are considered equally important when assessing relevance to a query.

For example, in a collection of documents on the auto industry, the term auto is likely to appear in almost every document. To address this, a mechanism can be introduced for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance determination. This is done by scaling down the weights of terms with high collection frequency.

Term frequency (TF) is the ratio of the number of times a term occurs in a document to the total number of words in that document. The higher the frequency of a term in a document, the more likely that document is relevant to a query containing that term.

TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency and is a very common way to transform text into a meaningful numeric representation; it combines TF with the inverse document frequency, which down-weights terms that appear in many documents. The technique is widely used to extract features across various NLP applications. The dissimilarity between the resulting term-weight vectors is then typically measured with the cosine similarity (or the corresponding cosine distance).
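A minimal sketch of TF-IDF weighting followed by cosine similarity, using a tiny made-up document collection (the formula IDF = log(N / df) is one common convention; libraries often add smoothing terms):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by TF (count / doc length) times IDF = log(N / df).
    A term appearing in every document gets IDF = 0, damping over-frequent terms."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    return [{t: (c / len(d)) * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["auto", "insurance", "auto"],
        ["auto", "car", "best"],
        ["insurance", "best"]]
w = tf_idf(docs)
print(cosine_similarity(w[0], w[1]))  # similarity of the first two documents
```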

Q2. Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15,
16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(A) Use smoothing by bin means to smooth these data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.

Bin #   Values          Smoothed values
1       13, 15, 16      14.67, 14.67, 14.67
2       16, 19, 20      18.33, 18.33, 18.33
3       20, 21, 22      21, 21, 21
4       22, 25, 25      24, 24, 24
5       25, 25, 30      26.67, 26.67, 26.67
6       33, 33, 35      33.67, 33.67, 33.67
7       35, 35, 35      35, 35, 35
8       36, 40, 45      40.33, 40.33, 40.33
9       46, 52, 70      56, 56, 56

Each value is replaced by the mean of its bin, so this technique smooths out noise (small random fluctuations) in the data, at the cost of losing some detail within each bin.
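The steps above can be reproduced with a short helper (rounding to two decimals as in the table):

```python
def smooth_by_bin_means(values, depth):
    """Partition sorted values into equal-depth bins and replace each
    value with the mean of its bin."""
    out = []
    for i in range(0, len(values), depth):
        bin_ = values[i:i + depth]
        mean = round(sum(bin_) / len(bin_), 2)
        out.extend([mean] * len(bin_))
    return out

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
print(smooth_by_bin_means(age, 3))
```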
(B) How might you determine outliers in the data?
Outliers can be identified in several ways.
By dividing the data into equal-width histograms and identifying the outlying bins. By clustering the data into groups; any points that do not fall into a cluster can be taken as outliers.
More generally, by fitting a model to the data; any points that deviate significantly (beyond some threshold) from the model can be considered outliers.
Outliers are data points that are abnormal relative to the rest of the data. They can also be found with a box-and-whisker plot (box plot), where the interquartile range is used to flag outliers.
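The box-plot rule mentioned above can be sketched with the standard library (note that different quartile conventions give slightly different fences; this uses Python's default exclusive method):

```python
from statistics import quantiles

def iqr_outliers(values):
    """Box-plot rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
print(iqr_outliers(age))  # [70]
```

For the age data, Q1 = 20 and Q3 = 35, so the upper fence is 35 + 1.5 * 15 = 57.5 and only the value 70 is flagged.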

(C) What other methods are there for data smoothing?


Some other methods of smoothing data are given below:

A) Smoothing by bin medians or bin boundaries
B) Regression (fitting the data to a function, such as a line)
C) Exponential smoothing and moving averages
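As a sketch of the bin-boundary variant, each value in an equal-depth bin is replaced by the nearer of the bin's minimum and maximum (ties here go to the minimum; shown on the first two bins of the age data):

```python
def smooth_by_bin_boundaries(values, depth):
    """Replace each value in an equal-depth bin by the closer bin
    boundary (the minimum or maximum value of that bin)."""
    out = []
    for i in range(0, len(values), depth):
        bin_ = values[i:i + depth]
        lo, hi = bin_[0], bin_[-1]     # values are assumed sorted
        out.extend(lo if v - lo <= hi - v else hi for v in bin_)
    return out

print(smooth_by_bin_boundaries([13, 15, 16, 16, 19, 20], 3))  # [13, 16, 16, 16, 20, 20]
```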

Q3. Consider the information given below; design and train a perceptron
using the given parameters.

A   B   T
1   1   1
0   1   1
1   0   1
0   0   0

Initial weights: W1 = 0.6, W2 = -0.4
Learning rate = 0.5, Threshold = 2
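A minimal sketch of the standard perceptron training rule under these parameters, assuming a step activation that outputs 1 when the weighted sum reaches the threshold (the source gives no worked answer, so the stopping criterion and tie convention here are assumptions):

```python
def train_perceptron(samples, w, lr=0.5, threshold=2, max_epochs=20):
    """Classic perceptron rule: y = 1 if w . x >= threshold else 0;
    on a misclassification, update w_i += lr * (t - y) * x_i.
    Repeats over the samples until an error-free epoch (or max_epochs)."""
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0
            if y != t:
                errors += 1
                w = [wi + lr * (t - y) * xi for wi, xi in zip(w, x)]
        if errors == 0:
            break
    return w

# Truth table from the question (logical OR), initial weights 0.6 and -0.4.
samples = [((1, 1), 1), ((0, 1), 1), ((1, 0), 1), ((0, 0), 0)]
w = train_perceptron(samples, [0.6, -0.4])
print(w)
```

With these settings the weights grow in steps of 0.5 until every pattern is classified correctly, ending near W1 = 2.1, W2 = 2.1.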

You might also like