Rule Ch2
2- Median
Median (grouped data) = L₁ + ((N/2 − (Σfreq)l) / freq median) × width
Where:
L₁ is the lower boundary of the median interval.
N is the number of values in the data set.
(Σfreq)l is the sum of the frequencies of all the intervals that are lower than the median interval.
Freq median is the frequency of the median interval.
Width is the width of the median interval.
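The grouped-data median can be sketched in Python as below. This assumes the usual interpolation formula, median = L₁ + ((N/2 − (Σfreq)l) / freq median) × width. The function name, parameter names, and the sample numbers are illustrative only and do not come from the notes.

```python
def grouped_median(lower_bound, n, cum_freq_below, freq_median, width):
    """Approximate the median of grouped data by interpolating
    inside the median interval.

    lower_bound    -- L1, lower boundary of the median interval
    n              -- N, number of values in the data set
    cum_freq_below -- sum of frequencies of all intervals below the median interval
    freq_median    -- frequency of the median interval
    width          -- width of the median interval
    """
    return lower_bound + ((n / 2 - cum_freq_below) / freq_median) * width


# Hypothetical grouped data: 100 values, median interval starts at 20,
# 40 values fall in lower intervals, 25 values in the median interval,
# and the interval is 10 units wide.
print(grouped_median(20, 100, 40, 25, 10))  # 24.0
```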
- Sort first
- If the number of values n is odd: median = the value at position (n + 1)/2
- If the number of values n is even: median = (V1 + V2)/2
V1 is the value at position n/2
V2 is the value at position (n/2) + 1
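A minimal Python sketch of the odd/even rule above. Positions in the notes are 1-based while Python lists are 0-based, hence the `- 1` offsets. The function name and the sample lists are illustrative.

```python
def median(values):
    """Median of ungrouped data: sort, then pick the middle value (odd n)
    or average the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # 1-based position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    v1 = ordered[n // 2 - 1]  # 1-based position n / 2
    v2 = ordered[n // 2]      # 1-based position (n / 2) + 1
    return (v1 + v2) / 2


print(median([7, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```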
3- Mode
- Value that occurs most frequently
- Unimodal = one mode, bimodal = two modes, trimodal = three modes
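One way to find the mode(s) in Python is to count occurrences and keep every value that reaches the highest count. This is a sketch for a non-empty list; the function name and sample data are illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most frequently, in sorted order.
    One result means unimodal, two bimodal, three trimodal."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 2, 2, 3, 3]))  # [2, 3]
```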
4- Mid-range
- Sort first
- It is the average of the largest and smallest values in the set => (min + max)/2
- EX: order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
- Midrange = (30 + 110) / 2 = 70
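The same calculation as a short Python sketch, using the data set from the example. Sorting is not strictly needed in code because `min` and `max` scan the whole list. The function name is illustrative.

```python
def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(midrange(data))  # 70.0
```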
Q1 = 47, Q3 = 63 ➔ IQR = 63 – 47 = 16
7- Standard deviation
- s (or σ) is the square root of the variance s² (or σ²)
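A minimal Python sketch of the variance and standard deviation relationship. The notes do not say which divisor is used, so this sketch assumes the population form (divide by n) by default and offers the sample form (divide by n − 1) through a `sample` flag. The names and sample data are illustrative.

```python
import math


def variance(values, sample=False):
    """Mean squared deviation from the mean.
    sample=False divides by n (population, σ²);
    sample=True divides by n - 1 (sample, s²)."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((x - mean) ** 2 for x in values)
    return squared_dev / (n - 1 if sample else n)


def std_dev(values, sample=False):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))  # 4.0
print(std_dev(data))   # 2.0
```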
Boxplot
Data is represented in a box.
The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR.
The median is marked by a line within the box.
Whiskers: two lines outside the box, extended to the minimum and maximum.
Outliers: points beyond a specified outlier threshold, plotted individually.
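The boxplot quantities can be sketched in Python using the data set and the quartiles given earlier (Q1 = 47, Q3 = 63). The notes only mention "a specified outlier threshold"; this sketch assumes the common convention of 1.5 × IQR beyond the quartiles, exposed as the parameter `k`. The function name is illustrative.

```python
def find_outliers(values, q1, q3, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3.
    k = 1.5 is an assumed convention, not stated in the notes."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q3 = 47, 63  # quartiles as given in the notes

# IQR = 16, so the fences are 47 - 24 = 23 and 63 + 24 = 87.
print(find_outliers(data, q1, q3))  # [110]
```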
Similarity and Dissimilarity
Similarity
• Numerical measure of how alike two data objects are.
• Value is higher when objects are more alike.
• Often falls in the range [0,1].
Dissimilarity
• Numerical measure of how different two data objects are.
• Lower when objects are more alike.
• Minimum dissimilarity is often 0.
• Upper limit varies.
Proximity
Videos:
Part 1-Proximity Measure for Binary Attributes
Part 2-Proximity Measure for Binary Attributes
Proximity Measure for Ordinal Attributes
- An ordinal variable can be discrete or continuous.
- Order is important, e.g., rank.
Minkowski distance
o A popular distance measure.
o Properties:
1. d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
2. d(i, j) = d(j, i) (Symmetry)
3. d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
o A distance that satisfies these properties is a metric.
h = 1: Manhattan distance
o For binary vectors, this is the Hamming distance: the number of bits that differ between the two vectors.
h = 2: Euclidean distance
h = ∞: Supremum distance
o Supremum distance: the maximum difference between any component (attribute) of the vectors.
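The three special cases can be sketched in Python as below, assuming the standard Minkowski definition d(i, j) = (Σ |x_if − x_jf|^h)^(1/h). The function names and the sample points are illustrative.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between two equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def manhattan(x, y):
    """h = 1: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """h = 2: square root of the sum of squared differences."""
    return minkowski(x, y, 2)


def supremum(x, y):
    """h = infinity: largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (1, 2), (3, 5)
print(manhattan(x, y))  # 5
print(euclidean(x, y))  # about 3.61
print(supremum(x, y))   # 3

# On binary vectors, the Manhattan distance counts differing bits (Hamming distance).
print(manhattan([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```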
Proximity Measure for Mixed Attributes
- A database may contain all attribute types.
Nominal, symmetric binary, asymmetric binary, numeric, ordinal
Cosine Similarity
- A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
- Applications: information retrieval, biological taxonomy, gene feature mapping, ...
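- Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||), where · is the vector dot product and ||d|| is the length of vector d.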
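A minimal Python sketch of the cosine measure, assuming the standard definition cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||). The function name and the two term-frequency vectors are illustrative, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# Hypothetical term-frequency vectors for two documents.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

# d1 · d2 = 25, ||d1|| = sqrt(42), ||d2|| = sqrt(17)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```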
Sol.
Find:
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
4) Supremum distance
5) Cosine Similarity
Sol.
Find:
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
4) Supremum distance
Sol.
Find: Dissimilarity between Binary Variables
Sol.
A(+, +) = q
B(+, -) = r
C(-, +) = s
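The counts q, r, and s can be turned into a dissimilarity as sketched below. This assumes the usual textbook definitions: symmetric binary d = (r + s) / (q + r + s + t) and asymmetric binary d = (r + s) / (q + r + s), where t is the number of (-, -) matches. The name t, the function names, and the sample vectors (1 for +, 0 for -) are assumptions added for illustration.

```python
def contingency_counts(i, j):
    """Count attribute pairs for two binary vectors (1 = +, 0 = -)."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # (+, +)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # (+, -)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # (-, +)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # (-, -), assumed name
    return q, r, s, t


def symmetric_dissimilarity(i, j):
    """Both outcomes matter equally: mismatches over all attributes."""
    q, r, s, t = contingency_counts(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(i, j):
    """(-, -) matches are ignored as uninformative."""
    q, r, s, _ = contingency_counts(i, j)
    return (r + s) / (q + r + s)


# Hypothetical objects described by six binary attributes.
i = [1, 0, 1, 0, 0, 0]
j = [1, 1, 0, 0, 0, 0]
print(contingency_counts(i, j))                  # (1, 1, 1, 3)
print(round(symmetric_dissimilarity(i, j), 2))   # 0.33
print(round(asymmetric_dissimilarity(i, j), 2))  # 0.67
```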
Find: Dissimilarity between nominal attributes
Sol.
Find:
1) Mean
2) Median
3) Mode
4) Mid-Range
5) Quartiles
6) Inter-quartile range
7) Five number summaries
8) Outlier
Sol.
1) Mean = sum of values / number of values
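As a quick check of the mean formula, here is a short Python sketch. The exercise data is not reproduced in these notes, so the sketch reuses the data set from the mid-range example; the function name is illustrative.

```python
def mean(values):
    """Sum of values divided by the number of values."""
    return sum(values) / len(values)


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(data))  # 58.0 (696 / 12)
```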