
Data Pre-Processing (Cleaning)

1- Mean = sum of the values / number of values


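Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{\sum x}{N}$
n is sample size and N is population size.

Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$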

2- Median
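- Approximate median for grouped data:
$\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
Where:
L₁ is the lower boundary of the median interval.
N is the number of values in the data set.
(Σfreq)l is the sum of the frequencies of all the intervals that are lower than the median interval.
freq median is the frequency of the median interval.
width is the width of the median interval.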

- Sort first
- If the number of values n is odd: median = the value at position (n+1)/2
- If the number of values n is even: median = (V1 + V2)/2
V1 is the value at position n/2
V2 is the value at position (n/2)+1

3- Mode
- Value that occurs most frequently
- Unimodal = 1 mode, bimodal = 2 modes, trimodal = 3 modes
4- Mid-range
- Sort first
- It is the average of the largest and smallest values in the set => (min + max)/2
- EX: order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
- Midrange = (30 + 110) / 2 = 70
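
The four measures above can be checked on the example data with a short Python sketch. It relies only on the standard `statistics` module; the variable names are illustrative, not part of the notes.

```python
from statistics import mean, median, multimode

# Example data from the notes.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
values = sorted(data)  # median and mid-range assume sorted values

data_mean = mean(values)        # sum of the values / number of values
data_median = median(values)    # n is even here, so the two middle values are averaged
data_modes = multimode(values)  # every value that occurs most frequently
midrange = (values[0] + values[-1]) / 2  # (min + max) / 2

print(data_mean, data_median, data_modes, midrange)
```

`multimode` returns a list, so a bimodal data set yields two entries.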

5- Quartiles, outliers, and boxplots


- Sort first
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 – Q1
- Five number summary: min, Q1, median, Q3, max
- Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot
outliers individually
- Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- EX: order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Sol.

Q1 and Q3 are taken here as the 3rd and 9th of the 12 sorted values.

Q1 = 47, Q3 = 63 ➔ IQR = 63 – 47 = 16

[Q1 – IQR * 1.5 , Q3 + IQR * 1.5] = [47 – 24, 63 + 24] = [23, 87]

Outlier = 110 (greater than 87; any value less than 23 would also be an outlier)
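
A minimal sketch of the same calculation in Python. The helper `quartiles_by_position` is hypothetical and assumes the positional convention that reproduces the example (Q1 at position n/4, Q3 at position 3n/4, rounded up). Statistical libraries often interpolate instead and may return slightly different quartiles.

```python
import math

def quartiles_by_position(values):
    """Return (Q1, Q3) as the values at 1-based positions ceil(n/4) and ceil(3n/4).

    Assumed convention, chosen to match the worked example in the notes.
    """
    s = sorted(values)
    n = len(s)
    q1 = s[math.ceil(n / 4) - 1]
    q3 = s[math.ceil(3 * n / 4) - 1]
    return q1, q3

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q3 = quartiles_by_position(data)
iqr = q3 - q1

# Fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(q1, q3, iqr, (lower, upper), outliers)
```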

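6- Variance: (algebraic, scalable computation)
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$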

7- Standard deviation
- s (or σ) is the square root of variance s² (or σ²)
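
A short sketch of why variance is called an algebraic, scalable measure: the population variance can be computed from running sums (count, sum, sum of squares) without a second pass over the data. The variable names are illustrative, and the data is the example set used earlier in these notes.

```python
import math
from statistics import pvariance, variance

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(data)
mu = sum(data) / n

# Definition: average squared deviation from the mean (population form).
var_definition = sum((x - mu) ** 2 for x in data) / n

# Algebraic form: mean of the squares minus the square of the mean.
var_algebraic = sum(x * x for x in data) / n - mu ** 2

sigma = math.sqrt(var_definition)  # standard deviation = square root of variance
sample_var = variance(data)        # sample form, divides by n - 1

print(var_definition, var_algebraic, sigma, sample_var)
```

The algebraic form can lose precision in floating point when the mean is large relative to the spread, so the definitional form is the safer default for small data sets.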

Boxplot
- Data is represented in a box.
- The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR.
- The median is marked by a line within the box.
- Whiskers: two lines outside the box extended to Minimum and Maximum
- Outliers: points beyond a specified outlier threshold, plotted individually

Similarity and Dissimilarity

Similarity
• Numerical measure of how alike two data objects are.
• Value is higher when objects are more alike.
• Often falls in the range [0,1]

Dissimilarity
• Numerical measure of how different two data objects are.
• Lower when objects are more alike.
• Minimum dissimilarity is often 0.
• Upper limit varies

Proximity

- Refers to a similarity or dissimilarity.


Proximity Measure for Nominal Attributes
- Method 1: Simple matching
$d(i, j) = \frac{p - m}{p}$
m: number of matches, p: total number of variables.
- Method 2: Use a large number of binary attributes,
creating a new binary attribute for each of the M nominal states.
- Videos: Proximity Measure for Nominal Attribute
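
A minimal sketch of Method 1 (simple matching), where the dissimilarity is the fraction of attributes that do not match, (p - m) / p. The function name and the two sample objects are hypothetical.

```python
def nominal_dissimilarity(a, b):
    """Simple matching dissimilarity (p - m) / p for two objects with nominal attributes."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matching attributes
    return (p - m) / p

# Hypothetical objects described by three nominal attributes.
obj1 = ("red", "round", "small")
obj2 = ("red", "square", "small")

d = nominal_dissimilarity(obj1, obj2)
print(d)  # two of three attributes match, so d = 1/3
```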

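Proximity Measure for Binary Attributes

For two objects i and j described by binary attributes:
q = number of attributes that are 1 for both i and j
r = number of attributes that are 1 for i and 0 for j
s = number of attributes that are 0 for i and 1 for j
t = number of attributes that are 0 for both i and j

- Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$

- Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$

- Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

- Note: Jaccard coefficient is the same as “coherence”: $coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$

- Videos:
Part 1-Proximity Measure for Binary Attributes
Part 2-Proximity Measure for Binary Attributes
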
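
A sketch of the binary measures in Python, assuming the usual contingency counts: q = both 1, r = 1 in the first object and 0 in the second, s = 0 in the first and 1 in the second, t = both 0. The function names and sample vectors are hypothetical.

```python
def binary_counts(x, y):
    """Return the contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_dissimilarity(x, y):
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(x, y):
    # Negative matches (t) are ignored; undefined when q + r + s == 0.
    q, r, s, _ = binary_counts(x, y)
    return (r + s) / (q + r + s)

def jaccard_similarity(x, y):
    q, r, s, _ = binary_counts(x, y)
    return q / (q + r + s)

# Hypothetical objects with six binary attributes.
x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]
print(binary_counts(x, y), symmetric_dissimilarity(x, y),
      asymmetric_dissimilarity(x, y), jaccard_similarity(x, y))
```

The asymmetric dissimilarity and the Jaccard coefficient always sum to 1 for the same pair of objects.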
Proximity Measure for Ordinal Attributes
- An ordinal variable can be discrete or continuous.
- Order is important, e.g., rank.

- Videos: Proximity Measure for Ordinal Attribute
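
The notes do not spell out the computation for ordinal attributes. A common textbook approach, assumed here, is to replace each value by its rank r in 1..M, map the rank onto [0, 1] with z = (r - 1) / (M - 1), and then treat z as a numeric attribute. The function name, the states, and the values below are hypothetical.

```python
def normalize_ranks(values, ordered_states):
    """Map ordinal values to [0, 1] using z = (rank - 1) / (M - 1).

    ordered_states lists the M states from lowest to highest (M >= 2).
    """
    rank = {state: i + 1 for i, state in enumerate(ordered_states)}
    m = len(ordered_states)
    return [(rank[v] - 1) / (m - 1) for v in values]

# Hypothetical ordinal attribute with three ordered states.
states = ["fair", "good", "excellent"]
values = ["excellent", "fair", "good", "excellent"]

z = normalize_ranks(values, states)
print(z)

# Dissimilarity between the first two objects on this attribute.
d_01 = abs(z[0] - z[1])
print(d_01)
```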

Euclidean & Manhattan & Minkowski Distance


- Videos:
Euclidean distance Proximity Matrix

- Minkowski distance
o A popular distance measure:
$d(i, j) = \left(\sum_{f=1}^{p} |x_{if} - x_{jf}|^h\right)^{1/h}$
where i and j are two p-dimensional data objects and h is the order.

o Properties:
1. d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
2. d(i, j) = d(j, i) (Symmetry)
3. d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
o A distance that satisfies these properties is a metric.

- h = 1: Manhattan distance
$d(i, j) = \sum_{f=1}^{p} |x_{if} - x_{jf}|$
o For binary vectors this equals the Hamming distance: the number of bits that are
different between the two vectors.

- h = 2: Euclidean distance
$d(i, j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}$

- h = ∞: Supremum distance
$d(i, j) = \max_{f} |x_{if} - x_{jf}|$
o Supremum distance: the maximum difference between any component
(attribute) of the vectors.

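
The three special cases can be covered by one small function. This is a sketch with a hypothetical `minkowski` helper and made-up points; passing `math.inf` as h selects the supremum distance.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)  # supremum distance: largest per-attribute difference
    return sum(d ** h for d in diffs) ** (1 / h)

# Hypothetical 2-D points.
x = (1, 2)
y = (3, 5)

manhattan = minkowski(x, y, 1)         # |1-3| + |2-5|
euclidean = minkowski(x, y, 2)         # sqrt(2^2 + 3^2)
supremum = minkowski(x, y, math.inf)   # max(2, 3)
print(manhattan, euclidean, supremum)
```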
Proximity Measure for Mixed Attributes
- A database may contain all attribute types:
nominal, symmetric binary, asymmetric binary, numeric, ordinal

- Videos: Proximity Measure for Mixed Attributes

Cosine Similarity
- A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
- Applications: information retrieval, biological taxonomy, gene feature mapping, ...
- Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \times \|d_2\|}$
where • indicates vector dot product, ||d||: the length of vector d.

- Videos: Cosine Similarity
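
A minimal sketch of the cosine measure: the dot product of the two vectors divided by the product of their lengths. The function name and the two term-frequency vectors are illustrative assumptions.

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Illustrative term-frequency vectors for two documents.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

sim = cosine_similarity(d1, d2)
print(round(sim, 2))
```

A value close to 1 means the two documents use words in similar proportions, regardless of document length.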


Questions:

Sol.

Find:
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
4) Supremum distance
5) Cosine Similarity

Sol.
1)
2)

3)

4)
5)

Find:
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
4) Supremum distance

Sol.
Find: Dissimilarity between Binary Variables

Sol.
A(+, +) = q B(+, -) = r

C(-, +) = s
Find: Dissimilarity between nominal attributes

Sol.

p = # of attributes = one (test nominal)


m = # of matches

Find:
1) Mean
2) Median
3) Mode
4) Mid-Range
5) Quartiles
6) Inter-quartile range
7) Five number summaries
8) Outlier
Sol.
1) Mean = sum of values / number of values = 696 / 12 = 58 (for the 12 values 30, 36, ..., 110)

2) Number of values is even (n = 12) ==>
n/2 = 6 => value1 = 52
(n/2) + 1 = 7 => value2 = 56
Median = (52 + 56)/2 = 54
3) Mode = bimodal ==> 52 & 70 (each occurs twice)
4) Mid-Range = (Max + Min) / 2 = (30 + 110) / 2 = 70
5) Quartiles => Q1 = 47, Q3 = 63
6) Inter-quartile range => IQR = 63 – 47 = 16
7) Five number summary = min, Q1, median, Q3, max
= 30, 47, 54, 63, 110
8) Outlier: any value outside [Q1 – IQR * 1.5, Q3 + IQR * 1.5] = [23, 87] ==> 110

Made by: Abrar
