Rule Ch2

Data Pre-Processing (Cleaning)
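1- Mean = sum of the values / number of values
- Sample mean: x̄ = (Σ xᵢ) / n; population mean: μ = (Σ x) / N
- n is the sample size and N is the population size.
- Weighted arithmetic mean: x̄ = Σ(wᵢ · xᵢ) / Σ wᵢ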
2- Median
- Grouped data (estimated by interpolation): median = L₁ + ((N/2 − (Σfreq)l) / freq_median) × width
Where:
L₁ is the lower boundary of the median interval, N is the number of values in the data set.
(Σfreq)l is the sum of the frequencies of all the intervals that are lower than the median interval.
freq_median is the frequency of the median interval.
width is the width of the median interval.
- Sort first
- If the number of values n is even: median = (V1 + V2)/2
V1 is the value at index n/2
V2 is the value at index (n/2) + 1
- If the number of values n is odd: median is the value at index (n + 1)/2
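The two median rules above can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the function names are made up, and the grouped-data numbers (lower boundary 20, N = 100, and so on) are hypothetical values chosen only to show the arithmetic.

```python
def median_of_values(values):
    """Median of raw values using the even/odd index rule (1-based indices)."""
    ordered = sorted(values)              # sort first
    n = len(ordered)
    if n % 2 == 0:
        v1 = ordered[n // 2 - 1]          # value at 1-based index n/2
        v2 = ordered[n // 2]              # value at 1-based index (n/2) + 1
        return (v1 + v2) / 2
    return ordered[(n + 1) // 2 - 1]      # value at 1-based index (n + 1)/2


def grouped_median(lower_boundary, n_total, freq_below, freq_median, width):
    """Interpolated median for grouped data: L1 + ((N/2 - sum_freq_below) / freq_median) * width."""
    return lower_boundary + ((n_total / 2 - freq_below) / freq_median) * width


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(median_of_values(data))               # (52 + 56) / 2
print(grouped_median(20, 100, 30, 40, 10))  # hypothetical interval data
```

For the 12 values used in these notes, the even rule averages the 6th and 7th values (52 and 56).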
3- Mode
- Value that occurs most frequently
- Unimodal = one mode, bimodal = two modes, trimodal = three modes
4- Mid-range
- Sort first
- It is the average of the largest and smallest values in the set => (min + max)/2
- EX: order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
- Midrange = (30 + 110) / 2 = 70
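The mean, weighted mean, mode, and mid-range described in items 1, 3, and 4 can be checked with a short Python sketch. The helper names and the weights in the weighted-mean call are assumptions made for illustration only.

```python
from collections import Counter


def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def modes(values):
    """All values that occur most frequently (one value = unimodal, two = bimodal, ...)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(data))                       # 696 / 12
print(modes(data))                      # 52 and 70 both occur twice
print(midrange(data))                   # (30 + 110) / 2
print(weighted_mean([10, 20], [1, 3]))  # hypothetical values and weights
```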
5- Quartiles, outliers, and boxplots

- Sort first
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 – Q1
- Five number summary: min, Q1, median, Q3, max
- Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot
outliers individually
- Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- EX: order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Sol. Min = 30, Median = 54, Max = 110
Q1 = 47, Q3 = 63 ➔ IQR = 63 – 47 = 16
[Q1 – IQR * 1.5 , Q3 + IQR * 1.5] = [23, 87]
Outlier = 110 (outliers are values less than 23 or greater than 87)
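The worked example above can be reproduced with a small Python sketch. One assumption matters here: quartiles are taken as the value at 1-based position ceil(n × p) in the sorted data, which matches the answers in these notes (Q1 = 47, Q3 = 63). Other quartile conventions, such as interpolating between neighbours, give slightly different numbers. The function names are illustrative.

```python
import math


def quartile(ordered, p):
    """Value at 1-based position ceil(n * p) of already-sorted data (assumed convention)."""
    position = math.ceil(len(ordered) * p)
    return ordered[position - 1]


def iqr_outliers(values):
    """Return Q1, Q3, IQR, the 1.5 * IQR fences, and the values outside the fences."""
    ordered = sorted(values)                       # sort first
    q1 = quartile(ordered, 0.25)
    q3 = quartile(ordered, 0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # outlier fences
    outliers = [v for v in ordered if v < low or v > high]
    return q1, q3, iqr, (low, high), outliers


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q3, iqr, fences, outliers = iqr_outliers(data)
print(q1, q3, iqr, fences, outliers)
```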
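6- Variance: (algebraic, scalable computation)
- Sample: s² = (1/(n − 1)) Σ(xᵢ − x̄)² = (1/(n − 1)) [Σxᵢ² − (1/n)(Σxᵢ)²]
- Population: σ² = (1/N) Σ(xᵢ − μ)² = (1/N) Σxᵢ² − μ²
7- Standard deviation
- s (or σ) is the square root of the variance s² (or σ²)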
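As a rough illustration of why variance is called algebraic and scalable, the sketch below computes it from running totals (count, sum, and sum of squares) and compares the result with Python's `statistics` module. It assumes the usual definitions: division by n − 1 for a sample and by N for a population. The function names are made up for this example.

```python
import math
import statistics


def sample_variance_algebraic(values):
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1), using only running totals."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total * total / n) / (n - 1)


def population_variance_algebraic(values):
    """sigma^2 = sum(x^2) / N - mu^2, using only running totals."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu * mu


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
s2 = sample_variance_algebraic(data)
sigma2 = population_variance_algebraic(data)
print(s2, math.sqrt(s2))          # sample variance and standard deviation
print(sigma2, math.sqrt(sigma2))  # population variance and standard deviation
```

Because only the totals are needed, partial totals from separate chunks of data can be added together. The shortcut can lose precision when the values are very large relative to their spread.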
Boxplot
• Data is represented with a box.
• The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR.
• The median is marked by a line within the box.
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually
Similarity and Dissimilarity
Similarity
• Numerical measure of how alike two data objects are.
• Value is higher when objects are more alike.
• Often falls in the range [0,1]
Dissimilarity
• Numerical measure of how different two data objects are.
• Lower when objects are more alike.
• Minimum dissimilarity is often 0.
• Upper limit varies
Proximity
• Refers to either a similarity or a dissimilarity.

Proximity Measure for Nominal Attributes
- Method 1: Simple matching: d(i, j) = (p − m) / p
m: number of matches, p: total number of variables.
- Method 2: Use a large number of binary attributes,
creating a new binary attribute for each of the M nominal states.
• Videos: Proximity Measure for Nominal Attribute
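Both methods for nominal attributes can be sketched in a few lines of Python. The sketch assumes the usual simple-matching definition d(i, j) = (p − m) / p, and the attribute values (colours, shapes, sizes) are hypothetical examples.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Method 1, simple matching: d = (p - m) / p, with m matches out of p attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary attribute per nominal state (1 only for the state that applies)."""
    return [1 if value == state else 0 for state in states]


a = ("red", "round", "small")      # hypothetical objects with three nominal attributes
b = ("red", "square", "small")
print(nominal_dissimilarity(a, b))                 # one mismatch out of three attributes
print(one_hot("green", ["red", "green", "blue"]))  # binary encoding of a single attribute
```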
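Proximity Measure for Binary Attributes
- Counts for objects i and j: q = attributes where both are 1, r = i is 1 and j is 0, s = i is 0 and j is 1, t = both are 0
- Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
- Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
- Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)
- Note: Jaccard coefficient is the same as “coherence”: coherence(i, j) = sup(i, j) / (sup(i) + sup(j) − sup(i, j)) = q / ((q + r) + (q + s) − q)
• Videos:
Part 1-Proximity Measure for Binary Attributes
Part 2-Proximity Measure for Binary Attributes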
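The binary measures can be illustrated with a short Python sketch. It assumes the usual contingency counts (q = both 1, r = first is 1 and second is 0, s = first is 0 and second is 1, t = both 0), and the two vectors are hypothetical.

```python
def binary_counts(x, y):
    """Count the four agreement/disagreement cases between two 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_distance(x, y):
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(x, y):
    q, r, s, _ = binary_counts(x, y)   # negative matches (t) are ignored
    return (r + s) / (q + r + s)       # undefined if both vectors are all zeros


def jaccard_similarity(x, y):
    q, r, s, _ = binary_counts(x, y)
    return q / (q + r + s)             # equals 1 - asymmetric_distance


x = [1, 0, 1, 0, 0, 0]   # hypothetical objects, 1 = positive, 0 = negative
y = [1, 0, 1, 0, 1, 0]
print(binary_counts(x, y))
print(symmetric_distance(x, y), asymmetric_distance(x, y), jaccard_similarity(x, y))
```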
Proximity Measure for Ordinal Attributes
- An ordinal variable can be discrete or continuous.
- Order is important, e.g., rank.
• Videos: Proximity Measure for Ordinal Attribute
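The notes do not spell out the computation for ordinal attributes. A common treatment, assumed here, is to replace each value by its rank r in 1..M, map the rank onto [0, 1] with z = (r − 1) / (M − 1), and then use a numeric distance on z. The states and function names below are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to z = (rank - 1) / (M - 1), where rank is its 1-based position."""
    rank = ordered_states.index(value) + 1
    m = len(ordered_states)
    return (rank - 1) / (m - 1)


def ordinal_distance(value_i, value_j, ordered_states):
    """Absolute difference of the normalized ranks (single ordinal attribute)."""
    z_i = ordinal_to_unit_interval(value_i, ordered_states)
    z_j = ordinal_to_unit_interval(value_j, ordered_states)
    return abs(z_i - z_j)


states = ["fair", "good", "excellent"]   # hypothetical ordered states, lowest first
print([ordinal_to_unit_interval(s, states) for s in states])
print(ordinal_distance("fair", "excellent", states))
print(ordinal_distance("good", "excellent", states))
```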
Euclidean & Manhattan & Minkowski Distance

• Videos:
Euclidean distance Proximity Matrix
• Minkowski distance: d(i, j) = (Σ_f |x_if − x_jf|^h)^(1/h), summed over all attributes f
o A popular distance measure.
o Properties:
1. d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
2. d(i, j) = d(j, i) (Symmetry)
3. d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
o A distance that satisfies these properties is a metric.
• h = 1: Manhattan distance
o For binary vectors this is the Hamming distance: the number of bits that are different between two binary vectors.
• h = 2: Euclidean distance
• h = ∞: Supremum distance
o Supremum distance: the maximum difference between any component (attribute) of the vectors.
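The three special cases can be compared with a small Python sketch. It assumes the standard Minkowski form d = (Σ |x_f − y_f|^h)^(1/h). The points and bit vectors are hypothetical.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h (h = 1 Manhattan, h = 2 Euclidean)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Limit h -> infinity: the largest difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of positions where two binary vectors differ (Manhattan distance on bits)."""
    return sum(1 for a, b in zip(x, y) if a != b)


p1, p2 = (1, 2), (3, 5)          # hypothetical two-attribute objects
print(minkowski(p1, p2, 1))      # Manhattan: |1 - 3| + |2 - 5|
print(minkowski(p1, p2, 2))      # Euclidean: sqrt(2^2 + 3^2)
print(minkowski(p1, p2, 3))      # Minkowski with h = 3
print(supremum(p1, p2))          # max(2, 3)
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))
```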
Proximity Measure for Mixed Attributes
- A database may contain all attribute types:
nominal, symmetric binary, asymmetric binary, numeric, ordinal
• Videos: Proximity Measure for Mixed Attributes
Cosine Similarity
- A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
- Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
- Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||)
where • indicates vector dot product, ||d||: the length of vector d.
• Videos: Cosine Similarity
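A minimal Python sketch of the cosine measure, assuming the standard definition cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||). The two term-frequency vectors are hypothetical.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of the vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    length1 = math.sqrt(sum(a * a for a in d1))
    length2 = math.sqrt(sum(b * b for b in d2))
    return dot / (length1 * length2)   # undefined if either vector is all zeros


# Hypothetical term-frequency vectors for two documents over the same ten words.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 2))
```

Here d1 • d2 = 25, ||d1|| = sqrt(42), and ||d2|| = sqrt(17), so the similarity is about 0.94.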

Questions:
Sol.
Find:
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
4) Supremum distance
5) Cosine Similarity
Sol. [answers 1) to 5) were given as a figure; not recovered]
Find:
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
4) Supremum distance
Sol.
Find: Dissimilarity between Binary Variables
Sol.
A (+, +) = q
B (+, -) = r
C (-, +) = s
Find: Dissimilarity between nominal attributes
Sol.
p = # of attributes = one (the nominal test attribute)

m = # of matches
Find:
1) Mean
2) Median
3) Mode
4) Mid-Range
5) Quartiles
6) Inter-quartile range
7) Five number summaries
8) Outlier
Sol.
1) Mean = sum of values / number of values = 696 / 12 = 58
2) Number of values is even (n = 12) ==>

n/2 = 6 => value1 = 52
(n/2) + 1 = 7 => value2 = 56
Median = (52 + 56)/2 = 54
3) Mode = bimodal ==> 52 & 70 (each is repeated)
4) Mid-Range = (Max + Min) / 2 = (30 + 110) / 2 = 70
5) Quartiles => Q1 = 47, Q3 = 63
6) Inter-quartile range => IQR = 63 – 47 = 16
7) Five number summary = min, Q1, median, Q3, max
= 30, 47, 54, 63, 110
8) Outlier = any value outside [Q1 – IQR * 1.5, Q3 + IQR * 1.5] = [23, 87] ==> 110
Made by: Abrar
