ADII10 Analisa Outlier

Minggu ke-10:
Analisa Outlier
Dr.Eng. Fitra Abdurrachman Bachtiar, S.T., M.Eng

Dr. Diva Kurnianingtyas, S.Kom
Outline
1
Outlier and Outlier Analysis
2
Outlier Detection Method
3
-
4
-
Note:
This slides are based on the additional material provided with the textbook that we use: J. Han, M. Kamber and J. Pei, “Data Mining:
Concepts and Techniques” and P. Tan, M. Steinbach, and V. Kumar "Introduction to Data Mining“.
Outlier and Outlier Analysis
What Are Outliers?
● Outlier: A data object that deviates significantly from the normal

objects as if it were generated by a different mechanism
○ Ex.: Unusual credit card purchase, sports: Michael Jordon,
Wayne Gretzky, ...

● Outliers are different from the noise data
○ Noise is random error or variance in a measured variable
○ Noise should be removed before outlier detection
● Outliers are interesting:

○ It violates the mechanism that generates the normal data
5
What Are Outliers?
● Outlier detection vs. novelty detection: early stage, outlier; but later
merged into the model
○ In "novelty detection", you have a data set that contains only
good data, and you're trying to determine whether new

observations fit within the existing data set.
○ In "outlier detection", the data may contain outliers, which you
want to identify
● Various applications of outliers detection
6
Application of Outlier Detection
Credit Card Fraud Detection Telecom Fraud Detection

Application of Outlier Detection
Medical Analysis Customer Segmentation

Distribution of
outlier data
Distribution of
normal data
Types of Outliers (I)
Three kinds types of outlier: global, contextual and collective outliers

Global outlier Contextual outlier Collective Outliers
(Point anomaly) (Conditional outlier)
12
Global Outlier
● Definition:
○ Global outliers are data points that deviate significantly from
the overall distribution of a dataset.

○ Object is Og if it significantly deviates from the rest of the data
set
● Example:
○ Intrusion detection in computer networks
● Causes:
○ Errors in data collection, measurement errors, or truly unusual
events can result in global outliers.

Global Outlier
● Impact:
○ Global outliers can distort data analysis results and affect
machine learning model performance.

● Issue:
○ Find an appropriate measurement of deviation
● Detection:
○ Techniques include statistical methods (e.g., z-score,
Mahalanobis distance), machine learning algorithms (e.g.,

isolation forest, one-class SVM), and data visualization
techniques.
Global Outlier
● Handling:
○ Options may include removing or correcting outliers,
transforming data, or using robust methods.
● Considerations:
○ Carefully considering the impact of global outliers is crucial
for accurate data analysis and machine learning model

outcomes.
Global Outlier
Contextual Outlier
● Definition:
○ Contextual outliers are data points that deviate significantly from the
expected behavior within a specific context or subgroup.
● Characteristics:
○ Contextual outliers may not be outliers when considered in the entire
dataset, but they exhibit unusual behavior within a specific context or
subgroup.
● Detection:
○ Techniques for detecting contextual outliers include contextual
clustering, contextual anomaly detection, and context-aware machine
learning approaches.
Contextual Outlier
● Detection:
○ Techniques for detecting contextual outliers include
contextual clustering, contextual anomaly detection, and

context-aware machine learning approaches.
● Contextual Information:
○ Contextual information such as time, location, or other
relevant factors are crucial in identifying contextual outliers.

● Impact:
○ Contextual outliers can represent unusual or anomalous
behavior within a specific context, which may require further

investigation or attention.
Contextual Outlier
● Handling:
○ Handling contextual outliers may involve considering the
contextual information, contextual normalization or

transformation of data, or using context-specific models or
algorithms.
● Considerations:
○ Proper understanding of the context and domain-specific
knowledge is crucial for accurate detection and interpretation

of contextual outliers, as they may vary based on the specific
context or subgroup being considered.
Contextual Outlier
Collective Outlier
● Definition:
○ A subset of data objects collectively deviate significantly from
the whole data set, even if the individual data objects may not
be outliers
● Applications:
○ e.g., intrusion detection: When a number of computers keep
sending denial-of-service packages to each other

Collective Outlier
● Detection of collective outliers:

○ Consider not only behavior of individual objects, but also that
of groups of objects
○ Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure

on objects.
● A data set may have multiple types of outlier

● One object may belong to more than one type of outlier
Collective Outlier
● Definition:
○ Collective outliers are groups of data points that collectively
deviate significantly from the overall distribution of a dataset.

● Characteristics:
○ Collective outliers may not be outliers when considered
individually, but as a group, they exhibit unusual behavior.

● Detection:
○ Techniques for detecting collective outliers include clustering
algorithms, density-based methods, and subspace-based

approaches.
Collective Outlier
● Impact:
○ Collective outliers can represent interesting patterns or
anomalies in data that may require special attention or further

investigation.
● Handling:
○ Handling collective outliers depends on the specific use case
and may involve further analysis of the group behavior,

identification of contributing factors, or considering
contextual information.
Collective Outlier
● Considerations:
○ Detecting and interpreting collective outliers can be more
complex than individual outliers, as the focus is on group

behavior rather than individual data points.
○ Proper understanding of the data context and domain
knowledge is crucial for effective handling of collective

outliers.
Collective Outlier
Challenges of Outlier Detection
● Modeling normal objects and outliers properly

○ Hard to enumerate all possible normal behaviors in an
application
○ The border between normal and outlier objects is often a gray
area
● Application-specific outlier detection
○ Choice of distance measure among objects and the model of
relationship among objects are often application-dependent

○ E.g., clinic data: a small deviation could be an outlier; while in
marketing analysis, larger fluctuations

Challenges of Outlier Detection
● Handling noise in outlier detection

○ Noise may distort the normal objects and blur the distinction
between normal objects and outliers.

○ It may help hide outliers and reduce the effectiveness of
outlier detection
● Understandability
○ Understand why these are outliers: Justification of the
detection
○ Specify the degree of an outlier: the unlikelihood of the object
being generated by a normal mechanism

Outlier Detection Method
Outlier Detection I: Supervised Methods
● Two ways to categorize outlier detection methods:

○ Based on whether user-labeled examples of outliers can be
obtained:
■ Supervised, semi-supervised vs. unsupervised methods
○ Based on assumptions about normal data and outliers:
■ Statistical, proximity-based, and clustering-based
methods
● Outlier Detection I: Supervised Methods

○ Modeling outlier detection as a classification problem
■ Samples examined by domain experts used for training &
testing
○ Methods for Learning a classifier for outlier detection
effectively:
■ Model normal objects & report those not matching the
model as outliers, or
■ Model outliers and treat those not matching the model as
normal
● Challenges
○ Imbalanced classes, i.e., outliers are rare: Boost the outlier
class and make up some artificial outliers
○ Catch as many outliers as possible, i.e., recall is more
important than accuracy (i.e., not mislabeling normal objects
as outliers)
Outlier Detection II: Unsupervised Methods
● Assume the normal objects are somewhat ``clustered'‘ into

multiple groups, each having some distinct features
● An outlier is expected to be far away from any groups of normal
objects
● Weakness: Cannot detect collective outlier effectively
○ Normal objects may not share any strong patterns, but the
collective outliers may share high similarity in a small area
● Ex. In some intrusion or virus detection, normal activities are
diverse
○ Unsupervised methods may have a high false positive rate but
still miss many real outliers.
○ Supervised methods can be more effective, e.g., identify
attacking some key resources
Outlier Detection II: Unsupervised Methods
● Many clustering methods can be adapted for unsupervised

methods
○ Find clusters, then outliers: not belonging to any cluster
○ Problem 1: Hard to distinguish noise from outliers
○ Problem 2: Costly since first clustering: but far less outliers
than normal objects
■ Newer methods: tackle outliers directly
Outlier Detection III: Semi-Supervised Methods
● Situation: In many applications, the number of labeled data is

often small: Labels could be on outliers only, normal objects only,
or both
● Semi-supervised outlier detection: Regarded as applications of
semi-supervised learning
● If some labeled normal objects are available
○ Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
○ Those not fitting the model of normal objects are detected as
outliers
Outlier Detection III: Semi-Supervised Methods
● If only some labeled outliers are available, a small number of

labeled outliers many not cover the possible outliers well
○ To improve the quality of outlier detection, one can get help
from models for normal objects learned from unsupervised
methods
Terimakasih

ADII10 Analisa Outlier

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ADII10 Analisa Outlier

Uploaded by

Copyright:

Available Formats

Minggu ke-10:

Dr.Eng. Fitra Abdurrachman Bachtiar, S.T., M.Eng

● Outlier: A data object that deviates significantly from the normal

Wayne Gretzky, ...

○ Noise should be removed before outlier detection

● Outliers are interesting:

good data, and you're trying to determine whether new

● Various applications of outliers detection

Credit Card Fraud Detection Telecom Fraud Detection

Medical Analysis Customer Segmentation

Three kinds types of outlier: global, contextual and collective outliers

the overall distribution of a dataset.

events can result in global outliers.

machine learning model performance.

Mahalanobis distance), machine learning algorithms (e.g.,

transforming data, or using robust methods.

for accurate data analysis and machine learning model

contextual clustering, contextual anomaly detection, and

relevant factors are crucial in identifying contextual outliers.

behavior within a specific context, which may require further

contextual information, contextual normalization or

knowledge is crucial for accurate detection and interpretation

sending denial-of-service packages to each other

● Detection of collective outliers:

among data objects, such as a distance or similarity measure

● A data set may have multiple types of outlier

deviate significantly from the overall distribution of a dataset.

individually, but as a group, they exhibit unusual behavior.

algorithms, density-based methods, and subspace-based

anomalies in data that may require special attention or further

and may involve further analysis of the group behavior,

complex than individual outliers, as the focus is on group

knowledge is crucial for effective handling of collective

● Modeling normal objects and outliers properly

relationship among objects are often application-dependent

marketing analysis, larger fluctuations

● Handling noise in outlier detection

between normal objects and outliers.

being generated by a normal mechanism

● Two ways to categorize outlier detection methods:

○ Based on assumptions about normal data and outliers:

■ Statistical, proximity-based, and clustering-based

● Outlier Detection I: Supervised Methods

● Assume the normal objects are somewhat ``clustered'‘ into

● Many clustering methods can be adapted for unsupervised

● Situation: In many applications, the number of labeled data is

● If only some labeled outliers are available, a small number of

You might also like