
Data Mining:

Getting to Know and
Understanding Data

1
Getting to know and understanding data

Data objects and attribute types

Descriptive data statistics

Data visualization

Measuring data similarity and dissimilarity

2
Types of Data Sets
Record
  Relational records
  Data matrix, e.g., a numerical matrix
  Document data: text documents as term-frequency vectors
  Transaction data
Graph and network
  World Wide Web
  Social or information networks
Ordered
  Video data: sequence of images
  Temporal data: time-series
  Spatial, image and multimedia:
    Spatial data: maps
    Image data
    Video data

Example term-frequency vectors (documents x terms):

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1    3     0      5     0     2      6     0     2      0       2
  Document 2    0     7      0     2     1      0     0     3      0       0
  Document 3    0     1      0     0     1      2     2     0      3       0

Example transaction data:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

3
Data Objects

A data object represents an entity

Examples:
  Sales database: customers, items for sale, sales
  Medical database: patients, treatments
  University database: students, professors, courses
Data objects are described by attributes.
Database rows -> data objects;
database columns -> attributes.

4
Attributes
An attribute (dimension, feature, variable) represents a
characteristic or feature of a data object
  E.g., customer_ID, name, address

Types:
  Nominal

  Ordinal

  Binary

  Numeric:

    Interval-scaled

    Ratio-scaled

5
Attribute Types
Nominal: categories, states, or names of things
  Hair color
  Marital status, zip code, student ID, etc.
Binary: a nominal attribute with only 2 states (0 and 1)
  Symmetric binary: both outcomes equally important
    E.g., gender
  Asymmetric binary: outcomes not equally important
    E.g., a medical test (positive or negative)
    Convention: assign 1 to the more important outcome
    (e.g., HIV positive)
Ordinal
  Values have a meaningful order (ranking), but the magnitude
  between successive values is not expressed numerically.
  Size = {small, medium, large}, grades, ranks

6
Numeric Attributes
A quantity (integer or real-valued)
Interval
  Measured on a scale of equal-sized units
  Values have order
    E.g., calendar dates
  No true zero-point
Ratio
  Inherent zero-point
  Examples: length, body weight, etc.
  We can speak of a value as a multiple of another
  object's value
    E.g., road A is twice as long as road B

7
Discrete and Continuous Attributes
Discrete attribute
  Has a finite, or countably infinite, set of values

  Zip codes, words in a collection of documents

  Sometimes represented as integer variables

  Note: binary attributes are a special case of discrete attributes

Continuous attribute
  Has real-number values

  E.g., temperature, height, weight

  Continuous attributes are typically represented as floating-point variables

8
Basic Statistical Descriptions of Data
Goal
  To understand the data: central tendency, variation and
  spread
Characteristics of data dispersion
  median, max, min, quantiles, outliers, variance, etc.

9
Measuring Central Tendency
Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    $\mu = \frac{\sum x}{N}$
  Note: n is the sample size and N the population size.
Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: the mean after chopping off the extreme values
Median: the middle value of the ordered data
  Estimated by interpolation (for grouped data):
  $\mathrm{median} = L_1 + \left(\frac{n/2 - (\sum \mathrm{freq})_l}{\mathrm{freq}_{\mathrm{median}}}\right) \times \mathrm{width}$
Mode
  Value that occurs most frequently in the data

  Unimodal, bimodal, trimodal

  Empirical formula:
  $\mathrm{mean} - \mathrm{mode} \approx 3 \times (\mathrm{mean} - \mathrm{median})$
10
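As an illustration (not part of the original slides), the central-tendency measures above can be sketched with Python's standard library; the data and weights here are made-up example values:

```python
# Central-tendency measures: mean, weighted mean, trimmed mean,
# median, and mode (example data is hypothetical).
import statistics

data = [1, 2, 2, 3, 4, 7, 9]

mean = statistics.mean(data)      # arithmetic mean
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most frequent value

# Weighted mean: sum(w_i * x_i) / sum(w_i)
weights = [1, 1, 2, 1, 1, 1, 1]
weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

# Trimmed mean: drop the k smallest and k largest values, then average
def trimmed_mean(xs, k):
    xs = sorted(xs)[k:len(xs) - k]
    return sum(xs) / len(xs)
```

Note how the trimmed mean (3.6 here) sits closer to the median than the plain mean does, since the extreme value 9 is discarded.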
Symmetric vs. Skewed Data
[Figure: median, mean and mode of symmetric, positively skewed
and negatively skewed data]

November 19, 2017   Data Mining: Concepts and Techniques   11


Measuring the Dispersion of Data
Quartiles, outliers and boxplots
  Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  Inter-quartile range: IQR = Q3 - Q1
  Five number summary: min, Q1, median, Q3, max
  Outlier: usually a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: $\sigma$)
  Variance (algebraic, scalable computation):
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2 - \mu^2$
  The standard deviation s (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)

12
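A minimal sketch (not from the slides) of these dispersion measures using Python's standard library; the data values are hypothetical:

```python
# Sample variance/standard deviation, quartiles, IQR, and the
# 1.5 * IQR outlier rule described above.
import statistics

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

s2 = statistics.variance(data)  # sample variance, divisor n - 1
s = statistics.stdev(data)      # sample standard deviation

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1
# outlier rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

Note that `statistics.quantiles` defaults to the "exclusive" method; other quartile conventions can give slightly different Q1/Q3 values on small samples.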
Boxplot Analysis

Five-number summary of the data


  Minimum, Q1, Median, Q3, Maximum
Boxplot
  The data is represented with a box
  The ends of the box are the first and third quartiles,
  so the box height is the IQR
  The median is marked by a line within the box
  Outliers: plotted individually beyond the whiskers

13
Properties of the Normal Distribution Curve

The normal curve
  From $\mu-\sigma$ to $\mu+\sigma$: contains about 68% of the measurements
  ($\mu$: mean, $\sigma$: standard deviation)
  From $\mu-2\sigma$ to $\mu+2\sigma$: contains about 95% of the measurements

  From $\mu-3\sigma$ to $\mu+3\sigma$: contains about 99.7% of the measurements

14
Histogram Analysis
Histogram: a graph displaying a tabulation of data frequencies
[Figure: histogram with bins centered at 10000, 30000, 50000,
70000, 90000]

15
Histograms Often Tell More than Boxplots

Two histograms can
produce the same
boxplot
  Same values: min,
  Q1, median, Q3, max
But their data
distributions differ

16
Scatter Plot
Displays bivariate data, e.g., to spot clusters, outliers,
etc.
Each point shows the coordinate pair of one data object

17
Positively and Negatively Correlated Data

Top left: positive correlation


Top right: negative correlation

18
Uncorrelated Data

19
Data Visualization
Why data visualization?
  Gain insight into an information space by mapping data onto graphical primitives
  Provide a qualitative overview of large data sets
  Search for patterns, trends, structure, irregularities, and relationships among data
  Help find interesting regions and suitable parameters for further quantitative analysis

20
Geometric Projection Visualization Techniques

Visualization of geometric transformations and projections of the data

Methods
  Scatterplot and scatterplot matrices

21
Scatterplot Matrices

Used by permission of M. Ward, Worcester Polytechnic Institute

Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of k(k-1)/2 distinct scatterplots]

22
Similarity and Dissimilarity
Similarity
  A numerical measure of how alike two data objects are
  Higher when the objects are more alike
  Often falls in the range [0,1]
Dissimilarity (e.g., distance)
  A numerical measure of how different two data objects are
  Lower when the objects are more alike
  The minimum dissimilarity is often 0

23
Data Matrix and Dissimilarity Matrix
Data matrix
  n data points with p dimensions
  Two modes

  $\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$

Dissimilarity matrix
  n data points, but it records only the distance between them
  A triangular matrix
  Single mode

  $\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$

24
Proximity Measure for Nominal Attributes

An attribute may take 2 or more states, e.g., red, yellow,
blue, green (a generalization of the binary attribute)

Simple matching method
  m: # of matches, p: total # of variables
  $d(i, j) = \frac{p - m}{p}$

25
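The simple-matching formula above can be sketched in a few lines of Python (not from the slides; the objects and their attribute values are hypothetical):

```python
# Simple-matching dissimilarity for nominal attributes:
# d(i, j) = (p - m) / p, where m is the number of matching
# attributes and p the total number of attributes.
def nominal_dissim(obj_i, obj_j):
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# hypothetical objects with three nominal attributes
x = ("red", "A", "north")
y = ("red", "B", "south")
```

Here one of the three attributes matches, so d(x, y) = (3 - 1)/3 = 2/3.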
Proximity Measure for Binary Attributes

A contingency table for binary data (object i as rows, object j as columns):

           j = 1   j = 0
  i = 1      q       r
  i = 0      s       t

Distance measure for symmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s}$
Jaccard coefficient (similarity measure for asymmetric binary variables):
  $\mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{q}{q + r + s}$
Note: the Jaccard coefficient is the same as coherence.

26
Dissimilarity between Binary Variables
Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      P      N       N       N       N

Gender is a symmetric attribute

The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0

  $d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
  $d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
  $d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
27
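A short Python sketch (not from the slides) that reproduces the three dissimilarities above, encoding Y/P as 1 and N as 0 and excluding the symmetric Gender attribute as the example does:

```python
# Asymmetric binary dissimilarity d = (r + s) / (q + r + s),
# where q = #(1,1), r = #(1,0), s = #(0,1); matches on (0,0)
# are ignored, as they carry no information here.
def asym_binary_dissim(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y/P -> 1, N -> 0)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
```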
Standardizing Numeric Data

Z-score:
  $z = \frac{x - \mu}{\sigma}$
  X: raw score to be standardized, $\mu$: mean of the population, $\sigma$: standard deviation
  the distance between the raw score and the population mean in units of the standard deviation
  negative when the raw score is below the mean, positive when above
An alternative way: calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
  where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$

  standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$

Using the mean absolute deviation is more robust than using the standard deviation

28
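A minimal sketch (not from the slides) of the alternative standardization above, dividing by the mean absolute deviation instead of the standard deviation; the example values are made up:

```python
# Standardize one numeric attribute f with the mean absolute
# deviation s_f: z_if = (x_if - m_f) / s_f.
def standardize(values):
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]     # z-scores z_if
```

For values [2, 4, 6, 8], m_f = 5 and s_f = 2, so the z-scores are [-1.5, -0.5, 0.5, 1.5].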
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix

  point  attribute1  attribute2
  x1         1           2
  x2         3           5
  x3         2           0
  x4         4           5

Dissimilarity Matrix
(with Euclidean Distance)

        x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0

29
Distance on Numeric Data: Minkowski Distance
Minkowski distance: a popular distance measure
  $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
  d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  d(i, j) = d(j, i) (symmetry)
  d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

30
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
  E.g., the Hamming distance: the number of bits that are different between two binary vectors

  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

h = 2: (L2 norm) Euclidean distance

  $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

h → ∞: supremum (Lmax norm, L∞ norm) distance

  This is the maximum difference between any component (attribute) of the vectors

  $d(i, j) = \max_f |x_{if} - x_{jf}|$
31
Example: Minkowski Distance
Dissimilarity Matrices

  point  attribute 1  attribute 2
  x1          1            2
  x2          3            5
  x3          2            0
  x4          4            5

Manhattan (L1)
  L1    x1   x2   x3   x4
  x1    0
  x2    5    0
  x3    3    6    0
  x4    6    1    7    0

Euclidean (L2)
  L2    x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0

Supremum (L∞)
  L∞    x1   x2   x3   x4
  x1    0
  x2    3    0
  x3    2    5    0
  x4    3    1    5    0
32
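The three special cases can be sketched in Python (not from the slides) and checked against the example points x1..x4 above:

```python
# Minkowski distance for h = 1 (Manhattan), h = 2 (Euclidean),
# and h -> infinity (supremum / L_max norm).
def minkowski(a, b, h):
    if h == float("inf"):
        # supremum norm: largest per-attribute difference
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

x1, x2, x3, x4 = (1, 2), (3, 5), (2, 0), (4, 5)
```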
Ordinal Variables

An ordinal variable can be discrete or continuous

Order is important, e.g., rank
Can be treated like interval-scaled
  replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  compute the dissimilarity using methods for interval-scaled variables
33
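A tiny Python sketch (not from the slides) of the rank-based mapping above, using the size levels from the earlier example:

```python
# Map an ordinal value to [0, 1] via z_if = (r_if - 1) / (M_f - 1),
# where r_if is the value's rank among the ordered levels.
def ordinal_to_unit(value, ordered_levels):
    r = ordered_levels.index(value) + 1  # rank r_if in {1, ..., M_f}
    M = len(ordered_levels)
    return (r - 1) / (M - 1)

levels = ["small", "medium", "large"]
```

The mapped values can then be compared with any interval-scaled distance.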
Attributes of Mixed Type
A database may contain all attribute types
  Nominal, symmetric binary, asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
  f is binary or nominal:
    $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  f is numeric: use the normalized distance
  f is ordinal
    Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    Treat $z_{if}$ as interval-scaled
34
Cosine Similarity
A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as a keyword) or phrase in the document.

Other vector objects: gene features in micro-arrays, ...

Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors),
then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
  where · indicates the vector dot product and ||d|| is the length of vector d

35
Example: Cosine Similarity
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 = 4.123
cos(d1, d2) = 0.94

36
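The worked example above can be reproduced with a few lines of Python (not from the slides):

```python
# Cosine similarity between two term-frequency vectors:
# cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||).
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
```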
KL Divergence: Comparing Two
Probability Distributions
The Kullback-Leibler (KL) divergence measures the difference between two
probability distributions over the same variable x
From information theory, closely related to relative entropy, information
divergence, and information for discrimination
DKL(p(x) || q(x)): divergence of q(x) from p(x), measuring the information lost
when q(x) is used to approximate p(x)
Discrete form:
  $D_{KL}(p(x) \,\|\, q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$

The KL divergence measures the expected number of extra bits required to
code samples from p(x) (the true distribution) when using a code based on q(x),
which represents a theory, model, description, or approximation of p(x)
Its continuous form:
  $D_{KL}(p(x) \,\|\, q(x)) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$

The KL divergence is not a distance measure, not a metric: it is asymmetric and
does not satisfy the triangle inequality
37
How to Compute the
KL Divergence?
Based on the formula, DKL(P || Q) ≥ 0, and DKL(P || Q) = 0 if and only if P = Q.
How about when p = 0 or q = 0?
  $\lim_{p \to 0} p \log p = 0$

  when p ≠ 0 but q = 0, DKL(p || q) is defined as ∞, i.e., if one event e is
  possible (i.e., p(e) > 0) and the other predicts it is absolutely impossible
  (i.e., q(e) = 0), then the two distributions are absolutely different
However, in practice, P and Q are derived from frequency distributions, which do
not account for unseen events. Thus smoothing is needed
Example: P : (a : 3/5, b : 1/5, c : 1/5), Q : (a : 5/9, b : 3/9, d : 1/9)
  we need to introduce a small constant ε, e.g., ε = 10⁻³

  The sample set observed in P is SP = {a, b, c}, SQ = {a, b, d}, SU = {a, b, c, d}

  Smoothing: add the missing symbols to each distribution with probability ε,
  subtracting ε/3 from each of the three observed symbols

  P' : (a : 3/5 − ε/3, b : 1/5 − ε/3, c : 1/5 − ε/3, d : ε)

  Q' : (a : 5/9 − ε/3, b : 3/9 − ε/3, c : ε, d : 1/9 − ε/3)

  DKL(P' || Q') can then be computed easily

38
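A minimal Python sketch (not from the slides) of this smoothing scheme and the KL computation, using ε = 10⁻³ and log base 2 to match the "extra bits" interpretation:

```python
# KL divergence with epsilon-smoothing: missing symbols get
# probability eps, and eps/3 is subtracted from each of the
# three observed symbols so each distribution still sums to 1.
import math

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x))
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)

eps = 1e-3
P = {"a": 3/5 - eps/3, "b": 1/5 - eps/3, "c": 1/5 - eps/3, "d": eps}
Q = {"a": 5/9 - eps/3, "b": 3/9 - eps/3, "c": eps, "d": 1/9 - eps/3}
```

Without the smoothing step, the pair (c, d) would force a division by zero or an infinite divergence.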
