Data Mining Techniques (DMT) by Kushal Anjaria Session-1 (Lecture Note)

Nowadays, we are witnessing enormous growth of data, from terabytes to
petabytes. We have multiple tools and sources for collecting data. The
devices include various types of disks, servers, hardware, and
processing units. The sources can be classified into three categories:
1. Business: transactions, stocks, Web, and e-commerce
2. Science: remote sensing, bioinformatics, and simulations
3. Society: news, digital cameras, social networking sites, and so on
“We are drowning in data but starving for knowledge.” - Prof. Pabitra
Mitra. In this situation, data mining comes into the picture.

Definition of data mining: The extraction of interesting, nontrivial,
implicit, previously unknown, and potentially useful patterns or
knowledge from a vast amount of data is known as data mining.

An alternative name for data mining is Knowledge Discovery from Data
(KDD). One should be careful about what counts as data mining. For
example, a simple search in a search engine or a query in a database is
not a data mining procedure.

Conventional data analytics procedures cannot handle the following:
• Data streams (from sensors), time-series data, temporal data,
sequential data
• Graphs, graphical data, multi-linked data, social network data
• Heterogeneous databases and legacy databases
• Multimedia, large text, and web data
• Simulation and forecasting data

Procedure for knowledge discovery from data

Fig-1: The KDD Process

• Fig-1 describes the entire KDD process.
• From the vast amount of data, it is crucial to search for attributes
that fulfill our requirements. This process is known as the selection
process.
• Once the data is selected, we check it to verify whether any data
point is missing. The task of scanning the data is known as data
pre-processing. In the pre-processing stage, we may also fill in the
missing data points using statistical functions.
• In the transformation phase, we combine our data into meaningful
repositories. We create a data warehouse, where relational databases
may be given formal meanings and interpretations.
• In the data mining phase, we apply mathematical models and data
mining algorithms to the transformed data. This stage helps us identify
underlying patterns in the data, using various statistical analysis and
learning algorithms.
• The final stage of the KDD process is interpretation and evaluation.
In this phase, we convert the patterns obtained in the data mining
phase into a human-understandable form. Only proper interpretation and
visualization of the data lead to knowledge generation.
• The entire KDD process is iterative: once knowledge is generated, one
can go back to the data selection and pre-processing stages.

We will start our discussion on data mining by understanding the
meaning of data.

In this course, we consider data in tabular form. Suppose a bank has
provided us with its historical data. From the patterns available in
the data, we intend to evaluate new loan applications. We aim to
identify whether a new applicant is fraudulent or legitimate.

Fig-2: Data Objects, Attributes, Vectors and dimensions

The data has specific attributes and objects. In the example of Fig-2,
the columns of the table are the attributes, and the rows of the table
are the records (objects). In data mining literature, attributes or
columns are also known as features, variables, or inputs.
One important thing to note is that in this representation we can think
of each row as a vector whose components are the individual attribute
values. These vectors are sometimes known as object vectors or feature
vectors. Mathematically, each vector has a dimension associated with
it, and the number of attributes determines that dimension. In the
present example, we have five attributes, so each row is a
five-dimensional vector. In data mining, each vector is considered as a
point in a coordinate system. For the present
example, ten objects can be represented as data points in a
five-dimensional coordinate system.
Furthermore, a bank may have a collection of one lakh (100,000) loan
applications over the past year. These loans can be thought of as one
lakh points in a five-dimensional coordinate system. Once you do this
plotting exercise, it helps you visualize the nature of the data.

There are four types of attributes used in the data mining process:
1. Nominal attributes: e.g., ID numbers, eye color, zip codes, etc.
The nominal attributes are just symbols. For example, the ID number of
a bank account is only a number; it has no other meaning. The same
holds for eye color (for example, black or blue) and for a zip code,
the PIN code of a place. These are just numbers, values, or symbols
that act as identifiers. Why are they only symbols? Suppose one person
has bank account number 1001 and another person has bank account number
1002. You cannot say that the person with 1002 is greater than the
person with 1001. You cannot compare these values; they are just
symbolic values. Consider another example: one person is named Kushal
Anjaria and another person is named, say, Ram Kumar. A name tells us
nothing beyond identity.
2. Ordinal attributes: e.g., rankings, grades, ratings
Ordinal values can be compared and ranked, but the differences between
them carry no meaning. For example, you rate a movie or potato chips on
a scale of 1 to 10 according to how good or bad it is.
3. Interval attributes: e.g., date, temperature range
The values of interval attributes lie on a scale where intervals and
differences are meaningful. Take a calendar date: it lets you tell
whether another date, say the date of a loan application, falls in some
time interval or not. From interval attribute values, you can say that
one value belongs to an interval and another does not.
4. Ratio attributes: e.g., time, temperature on an absolute scale.
Ratios of the values are meaningful, and a ratio can still be obtained
after the unit is changed.

Properties of the attributes:
The type of an attribute depends upon which of the following properties
it possesses:
1. Distinctness
2. Order
3. Addition or subtraction
4. Multiplication or division
Nominal attribute: distinctness
Ordinal attribute: distinctness and order
Interval attribute: distinctness, order, and addition
Ratio attribute: all four properties above

Each data mining algorithm has to be adapted to the attribute types we
use and the properties those attributes possess. The attribute types
and operations are summarized in Fig-3.
In data mining, whatever operation you perform has to be compatible
with the data type.
In data mining, an attribute can also be represented as a discrete or a
continuous attribute:
Discrete attributes:
• Have a finite or countably infinite set of values
• Examples: zip codes, the set of words in a document, counts of any
entity, the number of accounts in a bank, the number of products in a
warehouse
• Often represented as integer values
• Binary attributes are a special case of discrete attributes
Continuous attributes:
• Have real numbers as attribute values
• Examples: temperature, height, weight
• In practice, real numbers are represented using a finite number of
digits
• Continuous variables are represented as floating-point variables

Fig-3: Attribute types and operations, explained with examples

Before going for data transformation, it is crucial to check the
quality of data.

The data can be considered of bad quality if
• Some values of the attributes are missing
• The data domain is not satisfied
• Incorrect data is inserted
• Duplication or redundancy of data exists
• The data has some noise or distortion
• The data has outliers (e.g., percentage data with values above 100%,
or a decimal value with the decimal point missing)

Data pre-processing increases the value of the data. Moreover, it also
decreases the computational load.
Data pre-processing tasks can be completed in the following ways:
1. Aggregation: Aggregation means that a bunch of data is considered
together. After that, the cumulative information of all these data is
used.
2. Sampling: In the sampling technique, only a few representative data
points are kept, and the rest are thrown away. The idea is that the
sample alone is enough for the processing.
3. Dimensionality reduction: We pick only the required characteristics
of the data. For example, if you go to a doctor with lots of symptoms
and lots of measurements, the doctor would not look at all of them. The
doctor will select a few of them and complete the diagnosis.
4. Discretization and binarization: Sometimes, we have to convert
continuous data into a discrete form or a binary form.
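
The four pre-processing tasks named in this note (aggregation, sampling, dimensionality reduction, and discretization and binarization) can be illustrated with a minimal Python sketch. The loan records, field layout, bin edges, and function names below are all hypothetical and invented for illustration; the note does not prescribe them. Dimensionality reduction is shown only in its simplest form, keeping a chosen subset of columns.

```python
import random
from statistics import mean

# Hypothetical loan records: (branch, loan_amount, applicant_income).
# Field layout and values are invented for illustration only.
records = [
    ("A", 120.0, 30.0),
    ("A", 80.0, 45.0),
    ("B", 200.0, 60.0),
    ("B", 150.0, 25.0),
    ("B", 100.0, 50.0),
    ("C", 90.0, 40.0),
]


def aggregate_by_branch(rows):
    """Aggregation: replace each group of rows with one cumulative value."""
    totals = {}
    for branch, amount, _income in rows:
        totals[branch] = totals.get(branch, 0.0) + amount
    return totals


def sample_rows(rows, k, seed=0):
    """Sampling: keep k randomly chosen rows and discard the rest."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(rows, k)


def select_features(rows, keep):
    """Dimensionality reduction (simplest form): keep only chosen columns."""
    return [tuple(row[i] for i in keep) for row in rows]


def discretize(value, edges, labels):
    """Discretization: return the label of the first bin whose upper edge
    is not exceeded; values above every edge get the last label."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]


def binarize(value, threshold):
    """Binarization: map a continuous value to 1 (above threshold) or 0."""
    return 1 if value > threshold else 0


totals = aggregate_by_branch(records)
sample = sample_rows(records, 3)
reduced = select_features(records, keep=(1, 2))  # drop the branch column
amount_levels = [
    discretize(amount, [100.0, 150.0], ["low", "medium", "high"])
    for _branch, amount, _income in records
]
mean_income = mean(income for _branch, _amount, income in records)
income_flags = [binarize(income, mean_income) for _b, _a, income in records]
```

Each function returns new data rather than modifying `records`, so the tasks can be tried separately or chained in any order.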

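The rule stated earlier in this note, that an operation has to be compatible with the attribute type, can also be written down as a small lookup. The sketch below encodes the property list from the note (nominal: distinctness; ordinal: plus order; interval: plus addition; ratio: all four). The mapping from operators to the property they rely on, and the name `is_compatible`, are assumptions made for this illustration.

```python
# Properties possessed by each attribute type, as listed in the note.
PROPERTIES = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}

# Property that each operator relies on (assumed mapping for illustration).
REQUIRED = {
    "==": "distinctness",    # equal / not equal
    "<": "order",            # less than / greater than
    "-": "addition",         # addition or subtraction
    "/": "multiplication",   # multiplication or division
}


def is_compatible(attribute_type, operation):
    """Return True if the operation is meaningful for the attribute type."""
    return REQUIRED[operation] in PROPERTIES[attribute_type]


# Account numbers are nominal: they can be tested for equality, not ordered.
account_can_be_ordered = is_compatible("nominal", "<")
# Dates are interval: differences are meaningful, ratios are not.
date_can_be_subtracted = is_compatible("interval", "-")
date_can_be_divided = is_compatible("interval", "/")
```

A check like this could run before an algorithm is applied, to flag columns whose type does not support the operations the algorithm needs.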