
Data Mining:

Getting to Know and
Understanding Data

1
Getting to know and understanding data

Data objects and attribute types

Descriptive data statistics

Data visualization

Measuring data similarity and dissimilarity

2
Types of Data Sets
Record
  Relational records
  Data matrix, e.g., a numerical matrix
  Document data: text documents as term-frequency vectors
  Transaction data
Graph and network
  World Wide Web
  Social or information networks
Ordered
  Video data: sequence of images
  Temporal data: time-series
  Spatial, image and multimedia:
    Spatial data: maps
    Image data
    Video data

Example term-frequency vectors (documents x terms):

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1    3     0      5     0     2      6     0     2      0       2
  Document 2    0     7      0     2     1      0     0     3      0       0
  Document 3    0     1      0     0     1      2     2     0      3       0

Example transaction data:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

3
Data Objects

A data object represents an entity

Examples:
  Sales database: customers, items for sale, sales
  Medical database: patients, treatments
  University database: students, professors, courses
Data objects are described by attributes.
Database rows -> data objects;
database columns -> attributes.

4
Attributes
An attribute (dimension, feature, variable) represents a
characteristic or feature of a data object
  E.g., customer_ID, name, address

Types:
  Nominal

  Ordinal

  Binary

  Numeric:

    Interval-scaled

    Ratio-scaled

5
Attribute Types
Nominal: categories, states, or names of things
  Hair color
  Marital status, zip code, student ID, etc.
Binary: a nominal attribute with only 2 states (0 and 1)
  Symmetric binary: both outcomes equally important
    E.g., gender
  Asymmetric binary: outcomes not equally important
    E.g., a medical test (positive or negative)
    Convention: assign 1 to the more important outcome
    (e.g., HIV positive)
Ordinal
  Values have a meaningful order (ranking), but the magnitude
  between successive values is not expressed numerically.
  Size = {small, medium, large}, grades, ranks

6
Numeric Attributes
A quantity (integer or real-valued)
Interval
  Measured on a scale of equal-sized units
  Values have order
    E.g., calendar dates
  No true zero-point
Ratio
  Inherent zero-point
  Examples: length, body weight, etc.
  We can speak of a value as a multiple of another
  object's value
    E.g., road A is twice as long as road B

7
Discrete and Continuous Attributes
Discrete attribute
  Has a finite, or countably infinite, set of values

  Zip codes, words in a collection of documents

  Sometimes represented as integer variables

  Note: binary attributes are a special case of discrete attributes

Continuous attribute
  Has real-number values

  E.g., temperature, height, weight

  Continuous attributes are typically represented as floating-point variables

8
Basic Statistical Descriptions of Data
Goal
  To understand the data: central tendency, variation and
  spread
Characteristics of data dispersion
  median, max, min, quantiles, outliers, variance, etc.

9
Measuring Central Tendency
Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    $\mu = \frac{\sum x}{N}$
  Note: n is the sample size and N the population size.
Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: the mean after chopping off the extreme values
Median: the middle value of the ordered data
  Estimated by interpolation (for grouped data):
  $\mathrm{median} = L_1 + \left(\frac{n/2 - (\sum \mathrm{freq})_l}{\mathrm{freq}_{\mathrm{median}}}\right) \times \mathrm{width}$
Mode
  Value that occurs most frequently in the data

  Unimodal, bimodal, trimodal

  Empirical formula:
  $\mathrm{mean} - \mathrm{mode} \approx 3 \times (\mathrm{mean} - \mathrm{median})$
10
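As an illustration (not part of the original slides), the central-tendency measures above can be sketched with Python's standard library; the data and weights here are made-up example values:

```python
# Central-tendency measures: mean, weighted mean, trimmed mean,
# median, and mode (example data is hypothetical).
import statistics

data = [1, 2, 2, 3, 4, 7, 9]

mean = statistics.mean(data)      # arithmetic mean
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most frequent value

# Weighted mean: sum(w_i * x_i) / sum(w_i)
weights = [1, 1, 2, 1, 1, 1, 1]
weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

# Trimmed mean: drop the k smallest and k largest values, then average
def trimmed_mean(xs, k):
    xs = sorted(xs)[k:len(xs) - k]
    return sum(xs) / len(xs)
```

Note how the trimmed mean (3.6 here) sits closer to the median than the plain mean does, since the extreme value 9 is discarded.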
Symmetric vs. Skewed Data
[Figure: median, mean and mode of symmetric, positively skewed
and negatively skewed data]

November 19, 2017   Data Mining: Concepts and Techniques   11


Measuring the Dispersion of Data
Quartiles, outliers and boxplots
  Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  Inter-quartile range: IQR = Q3 - Q1
  Five number summary: min, Q1, median, Q3, max
  Outlier: usually a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: $\sigma$)
  Variance (algebraic, scalable computation):
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2 - \mu^2$
  The standard deviation s (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)

12
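A minimal sketch (not from the slides) of these dispersion measures using Python's standard library; the data values are hypothetical:

```python
# Sample variance/standard deviation, quartiles, IQR, and the
# 1.5 * IQR outlier rule described above.
import statistics

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

s2 = statistics.variance(data)  # sample variance, divisor n - 1
s = statistics.stdev(data)      # sample standard deviation

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1
# outlier rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

Note that `statistics.quantiles` defaults to the "exclusive" method; other quartile conventions can give slightly different Q1/Q3 values on small samples.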
Boxplot Analysis

Five-number summary of the data


  Minimum, Q1, Median, Q3, Maximum
Boxplot
  The data is represented with a box
  The ends of the box are the first and third quartiles,
  so the box height is the IQR
  The median is marked by a line within the box
  Outliers: plotted individually beyond the whiskers

13
Properties of the Normal Distribution Curve

The normal curve
  From $\mu-\sigma$ to $\mu+\sigma$: contains about 68% of the measurements
  ($\mu$: mean, $\sigma$: standard deviation)
  From $\mu-2\sigma$ to $\mu+2\sigma$: contains about 95% of the measurements

  From $\mu-3\sigma$ to $\mu+3\sigma$: contains about 99.7% of the measurements

14
Histogram Analysis
Histogram: a graph displaying a tabulation of data frequencies
[Figure: histogram with bins centered at 10000, 30000, 50000,
70000, 90000]

15
Histograms Often Tell More than Boxplots

Two histograms can
produce the same
boxplot
  Same values: min,
  Q1, median, Q3, max
But their data
distributions differ

16
Scatter Plot
Displays bivariate data, e.g., to spot clusters, outliers,
etc.
Each point shows the coordinate pair of one data object

17
Positively and Negatively Correlated Data

Top left: positive correlation


Top right: negative correlation

18
Uncorrelated Data

19
Data Visualization
Why data visualization?
  Gain insight into an information space by mapping data onto graphical primitives
  Provide a qualitative overview of large data sets
  Search for patterns, trends, structure, irregularities, and relationships among data
  Help find interesting regions and suitable parameters for further quantitative analysis

20
Geometric Projection Visualization Techniques

Visualization of geometric transformations and projections of the data

Methods
  Scatterplot and scatterplot matrices

21
Scatterplot Matrices

Used by permission of M. Ward, Worcester Polytechnic Institute

Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of k(k-1)/2 distinct scatterplots]

22
Similarity and Dissimilarity
Similarity
  A numerical measure of how alike two data objects are
  Higher when the objects are more alike
  Often falls in the range [0,1]
Dissimilarity (e.g., distance)
  A numerical measure of how different two data objects are
  Lower when the objects are more alike
  The minimum dissimilarity is often 0

23
Data Matrix and Dissimilarity Matrix
Data matrix
  n data points with p dimensions
  Two modes

  $\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$

Dissimilarity matrix
  n data points, but it records only the distance between them
  A triangular matrix
  Single mode

  $\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$

24
Proximity Measure for Nominal Attributes

An attribute may take 2 or more states, e.g., red, yellow,
blue, green (a generalization of the binary attribute)

Simple matching method
  m: # of matches, p: total # of variables
  $d(i, j) = \frac{p - m}{p}$

25
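The simple-matching formula above can be sketched in a few lines of Python (not from the slides; the objects and their attribute values are hypothetical):

```python
# Simple-matching dissimilarity for nominal attributes:
# d(i, j) = (p - m) / p, where m is the number of matching
# attributes and p the total number of attributes.
def nominal_dissim(obj_i, obj_j):
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# hypothetical objects with three nominal attributes
x = ("red", "A", "north")
y = ("red", "B", "south")
```

Here one of the three attributes matches, so d(x, y) = (3 - 1)/3 = 2/3.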
Proximity Measure for Binary Attributes

A contingency table for binary data (object i as rows, object j as columns):

           j = 1   j = 0
  i = 1      q       r
  i = 0      s       t

Distance measure for symmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s}$
Jaccard coefficient (similarity measure for asymmetric binary variables):
  $\mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{q}{q + r + s}$
Note: the Jaccard coefficient is the same as coherence.

26
Dissimilarity between Binary Variables
Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      P      N       N       N       N

Gender is a symmetric attribute

The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0

  $d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
  $d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
  $d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
27
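A short Python sketch (not from the slides) that reproduces the three dissimilarities above, encoding Y/P as 1 and N as 0 and excluding the symmetric Gender attribute as the example does:

```python
# Asymmetric binary dissimilarity d = (r + s) / (q + r + s),
# where q = #(1,1), r = #(1,0), s = #(0,1); matches on (0,0)
# are ignored, as they carry no information here.
def asym_binary_dissim(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y/P -> 1, N -> 0)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
```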
Standardizing Numeric Data

Z-score:
  $z = \frac{x - \mu}{\sigma}$
  X: raw score to be standardized, $\mu$: mean of the population, $\sigma$: standard deviation
  the distance between the raw score and the population mean in units of the standard deviation
  negative when the raw score is below the mean, positive when above
An alternative way: calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
  where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$

  standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$

Using the mean absolute deviation is more robust than using the standard deviation

28
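A minimal sketch (not from the slides) of the alternative standardization above, dividing by the mean absolute deviation instead of the standard deviation; the example values are made up:

```python
# Standardize one numeric attribute f with the mean absolute
# deviation s_f: z_if = (x_if - m_f) / s_f.
def standardize(values):
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]     # z-scores z_if
```

For values [2, 4, 6, 8], m_f = 5 and s_f = 2, so the z-scores are [-1.5, -0.5, 0.5, 1.5].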
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix

  point  attribute1  attribute2
  x1         1           2
  x2         3           5
  x3         2           0
  x4         4           5

Dissimilarity Matrix
(with Euclidean Distance)

        x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0

29
Distance on Numeric Data: Minkowski Distance
Minkowski distance: a popular distance measure
  $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
  d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  d(i, j) = d(j, i) (symmetry)
  d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

30
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
  E.g., the Hamming distance: the number of bits that are different between two binary vectors

  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

h = 2: (L2 norm) Euclidean distance

  $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

h → ∞: supremum (Lmax norm, L∞ norm) distance

  This is the maximum difference between any component (attribute) of the vectors

  $d(i, j) = \max_f |x_{if} - x_{jf}|$
31
Example: Minkowski Distance
Dissimilarity Matrices

  point  attribute 1  attribute 2
  x1          1            2
  x2          3            5
  x3          2            0
  x4          4            5

Manhattan (L1)
  L1    x1   x2   x3   x4
  x1    0
  x2    5    0
  x3    3    6    0
  x4    6    1    7    0

Euclidean (L2)
  L2    x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0

Supremum (L∞)
  L∞    x1   x2   x3   x4
  x1    0
  x2    3    0
  x3    2    5    0
  x4    3    1    5    0
32
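The three special cases can be sketched in Python (not from the slides) and checked against the example points x1..x4 above:

```python
# Minkowski distance for h = 1 (Manhattan), h = 2 (Euclidean),
# and h -> infinity (supremum / L_max norm).
def minkowski(a, b, h):
    if h == float("inf"):
        # supremum norm: largest per-attribute difference
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

x1, x2, x3, x4 = (1, 2), (3, 5), (2, 0), (4, 5)
```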
Ordinal Variables

An ordinal variable can be discrete or continuous

Order is important, e.g., rank
Can be treated like interval-scaled
  replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  compute the dissimilarity using methods for interval-scaled variables
33
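A tiny Python sketch (not from the slides) of the rank-based mapping above, using the size levels from the earlier example:

```python
# Map an ordinal value to [0, 1] via z_if = (r_if - 1) / (M_f - 1),
# where r_if is the value's rank among the ordered levels.
def ordinal_to_unit(value, ordered_levels):
    r = ordered_levels.index(value) + 1  # rank r_if in {1, ..., M_f}
    M = len(ordered_levels)
    return (r - 1) / (M - 1)

levels = ["small", "medium", "large"]
```

The mapped values can then be compared with any interval-scaled distance.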
Attributes of Mixed Type
A database may contain all attribute types
  Nominal, symmetric binary, asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
  f is binary or nominal:
    $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  f is numeric: use the normalized distance
  f is ordinal
    Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    Treat $z_{if}$ as interval-scaled
34
Cosine Similarity
A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as a keyword) or phrase in the document.

Other vector objects: gene features in micro-arrays, ...

Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors),
then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
  where · indicates the vector dot product and ||d|| is the length of vector d

35
Example: Cosine Similarity
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 = 4.123
cos(d1, d2) = 0.94

36
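The worked example above can be reproduced with a few lines of Python (not from the slides):

```python
# Cosine similarity between two term-frequency vectors:
# cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||).
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
```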
KL Divergence: Comparing Two
Probability Distributions
The Kullback-Leibler (KL) divergence measures the difference between two
probability distributions over the same variable x
From information theory, closely related to relative entropy, information
divergence, and information for discrimination
DKL(p(x) || q(x)): divergence of q(x) from p(x), measuring the information lost
when q(x) is used to approximate p(x)
Discrete form:
  $D_{KL}(p(x) \,\|\, q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$

The KL divergence measures the expected number of extra bits required to
code samples from p(x) (the true distribution) when using a code based on q(x),
which represents a theory, model, description, or approximation of p(x)
Its continuous form:
  $D_{KL}(p(x) \,\|\, q(x)) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$

The KL divergence is not a distance measure, not a metric: it is asymmetric and
does not satisfy the triangle inequality
37
How to Compute the
KL Divergence?
Based on the formula, DKL(P || Q) ≥ 0, and DKL(P || Q) = 0 if and only if P = Q.
How about when p = 0 or q = 0?
  $\lim_{p \to 0} p \log p = 0$

  when p ≠ 0 but q = 0, DKL(p || q) is defined as ∞, i.e., if one event e is
  possible (i.e., p(e) > 0) and the other predicts it is absolutely impossible
  (i.e., q(e) = 0), then the two distributions are absolutely different
However, in practice, P and Q are derived from frequency distributions, which do
not account for unseen events. Thus smoothing is needed
Example: P : (a : 3/5, b : 1/5, c : 1/5), Q : (a : 5/9, b : 3/9, d : 1/9)
  we need to introduce a small constant ε, e.g., ε = 10⁻³

  The sample set observed in P is SP = {a, b, c}, SQ = {a, b, d}, SU = {a, b, c, d}

  Smoothing: add the missing symbols to each distribution with probability ε,
  subtracting ε/3 from each of the three observed symbols

  P' : (a : 3/5 − ε/3, b : 1/5 − ε/3, c : 1/5 − ε/3, d : ε)

  Q' : (a : 5/9 − ε/3, b : 3/9 − ε/3, c : ε, d : 1/9 − ε/3)

  DKL(P' || Q') can then be computed easily

38
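A minimal Python sketch (not from the slides) of this smoothing scheme and the KL computation, using ε = 10⁻³ and log base 2 to match the "extra bits" interpretation:

```python
# KL divergence with epsilon-smoothing: missing symbols get
# probability eps, and eps/3 is subtracted from each of the
# three observed symbols so each distribution still sums to 1.
import math

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x))
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)

eps = 1e-3
P = {"a": 3/5 - eps/3, "b": 1/5 - eps/3, "c": 1/5 - eps/3, "d": eps}
Q = {"a": 5/9 - eps/3, "b": 3/9 - eps/3, "c": eps, "d": 1/9 - eps/3}
```

Without the smoothing step, the pair (c, d) would force a division by zero or an infinite divergence.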
