

Group members (name and ID)
1. Abel Belete 0093/13
2. Iyassu Theodros 1405/13
3. Samson Workineh 2139/13
4. Tekalegn Tadewos 2365/13
Data Mining
Project 1
Introduction
What is data mining?
 Data mining is the process of extracting useful
information from large sets of data.
 It is the mining of knowledge from data.
 Data mining is also called Knowledge
Discovery from Data (KDD).
What types of data can be mined?
 Database data (RDBMS)
 Data warehouse data
 Transactional data
Database data (RDBMS)
A relational database is a set of tables with rows and columns.
Data is stored in the form of tables; columns represent the
attributes and rows represent the records.
While mining databases we can search
for trends or data patterns.
Data warehouse:
A collection of data integrated from
different sources, used for querying and
decision making.
Data is stored in a multidimensional
structure (a data cube), where each
dimension corresponds to an attribute.
Data cube (figure)
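A data cube can be sketched with a pandas pivot table; the region/year/sales columns below are invented for illustration:

```python
# Minimal data-cube sketch: each dimension (region, year) becomes an axis,
# and the cells hold the aggregated measure (total sales).
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year":   [2021, 2022, 2021, 2022],
    "sales":  [100, 120, 80, 95],
})

cube = pd.pivot_table(sales, values="sales", index="region",
                      columns="year", aggfunc="sum")
print(cube)
```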
Transactional database
Each record is called a transaction.
 Examples of transactions:
 customer sales
 flight ticket bookings
 user clicks on a web page
A transaction has:
Transaction ID
Name of the transaction
Time of beginning
Time of end
 Transaction location
 Transaction date
Other types of data that can be mined:
Sequence data: stocks and stock market data
Data streams: data that is continuously being
transmitted
Spatial data: maps
Engineering & design data: integrated
circuits
Multimedia: audio, video
Web data: webpage-related data
Hypertext: text with links to other text
Data pre-processing
The process of transforming raw data
into an understandable format.

 Major tasks:
1) Data cleaning
2) Data integration
3) Data reduction
4) Data transformation
Data cleaning:
 The process of removing incorrect, incomplete, or
inaccurate data; it also replaces missing values.
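A minimal cleaning sketch with pandas, assuming an age/income table (the column names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "age":    [25, np.nan, 31, 25, 200],        # one missing value, one impossible age
    "income": [30000, 42000, None, 30000, 55000],
})

clean = raw.drop_duplicates()                                      # remove exact duplicate rows
clean = clean[clean["age"].isna() | (clean["age"] <= 120)].copy()  # drop clearly incorrect ages
clean["age"] = clean["age"].fillna(clean["age"].median())          # replace missing ages
clean["income"] = clean["income"].fillna(clean["income"].mean())   # replace missing incomes
print(clean)
```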
Data integration:
 Multiple heterogeneous sources of
data are combined into a single dataset.
Two approaches to data integration:
 tight coupling
 loose coupling
Tight coupling vs loose coupling
Tight coupling: the data from the different sources is
combined in advance into a single physical location,
such as a data warehouse.
Loose coupling: the data remains in the source databases;
an interface maps each query to the sources and combines
the results on demand, so the sources stay independent.
Tight coupling vs loose coupling
Tight coupling        | Loose coupling
More interdependency  | Less interdependency
More coordination     | Less coordination
More information flow | Less information flow
Less testability      | More testability
Data reduction:
The volume of data is reduced to make analysis
easier.
Methods for data reduction (a dimensionality-reduction sketch follows this list):
1. Dimensionality reduction:
 Reduces the number of input variables in the dataset.
 Too many input variables -> poor model performance.
2. Data cube aggregation:
 Data is combined and aggregated to construct a data cube.
 Redundant and noisy data is removed.
3. Attribute subset selection:
Only highly relevant attributes should be kept.
The others are removed.
4. Numerosity reduction:
 Storing only a model or a sample of the
data instead of the entire data.
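A minimal dimensionality-reduction sketch using scikit-learn's PCA, with random data standing in for a real dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # 100 records, 10 input variables

pca = PCA(n_components=3)           # keep only 3 derived dimensions
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)    # (100, 10) -> (100, 3)
print(pca.explained_variance_ratio_)     # variance captured by each kept component
```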
Data transformation
Data is transformed into a form
suitable for the mining process.
Methods (a normalization sketch follows this list):
1. Normalization
2. Attribute selection
3. Discretization
4. Concept hierarchy generation
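A minimal sketch of one common normalization method, min-max scaling; the values are made up for illustration:

```python
import numpy as np

ages = np.array([18.0, 25.0, 40.0, 60.0, 90.0])

# Rescale every value into the range [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())
print(normalized)
```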
DATA MINING FUNCTIONALITIES:
1. Class/concept description:
Data can be associated with classes or concepts.
 Descriptions can be produced in two ways:
 Data characterization:
a summary of the general features of the class/concept.
Output -> a general overview.
 Data discrimination:
compares the features of the target class with those of contrasting classes.
Output -> bar charts, curves, graphs.
2. Mining frequent patterns,
associations and correlations
Frequent patterns:
Patterns that occur most commonly
in the data.
 Frequent itemsets (sets of items)
 Frequent subsequences
 Frequent substructures
Association analysis
The process of identifying relationships
among various items.
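A small pure-Python sketch of the standard support and confidence measures for the hypothetical rule {bread} -> {butter}, over a few invented transactions:

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
both  = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support    = both / n       # how often bread and butter are bought together
confidence = both / bread   # how often butter appears when bread is bought
print(f"support={support:.2f}, confidence={confidence:.2f}")
```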
Correlation analysis
A statistical method that shows how strongly
a pair of attributes is related.
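A minimal correlation-analysis sketch: the Pearson correlation coefficient between two invented attributes (a value near +1 or -1 means the attributes are strongly related):

```python
import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5])
exam_score    = np.array([52, 58, 65, 70, 78])

r = np.corrcoef(hours_studied, exam_score)[0, 1]   # Pearson correlation coefficient
print(f"correlation = {r:.2f}")
```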
3. Classification and regression for
predictive analysis
Classification:
 The process of finding a model that
distinguishes data classes or concepts.
 Decision trees are commonly used for classification.
Regression: a statistical methodology used for
the numeric prediction of missing or unavailable values.
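A minimal classification sketch with a decision tree in scikit-learn; the built-in iris dataset stands in for real project data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3)    # the model that distinguishes the classes
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```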
4. Cluster analysis
Data items are clustered based on the
principle of maximising the intraclass similarity
and minimising the interclass similarity.
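A minimal clustering sketch with k-means in scikit-learn; the synthetic blobs stand in for real data items:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # items with the same label are similar to each other
print(labels[:10])
```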
5. Outlier analysis (anomaly mining)
Among the data items in a database, there
may be some items which do not follow the
general behaviour of the data.
These are sometimes called:
 Noise
 Exceptions
 Outliers
Finding them is called outlier detection.
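A minimal outlier-detection sketch using z-scores on made-up values; anything more than 2 standard deviations from the mean is flagged:

```python
import numpy as np

values = np.array([10, 11, 9, 10, 12, 11, 10, 95])   # 95 does not follow the general behaviour

z = (values - values.mean()) / values.std()   # z-score of each value
outliers = values[np.abs(z) > 2]
print(outliers)                               # -> [95]
```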
Data Mining Applications
Market-basket analysis
Market-basket analysis is a data
mining technique used by retailers to
increase sales by better understanding
customer purchasing patterns.
It involves analysing large data sets, such
as purchase histories, to reveal product
groupings, as well as products that are
likely to be purchased together.
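A small market-basket sketch: counting which product pairs appear together most often in a few invented baskets:

```python
from collections import Counter
from itertools import combinations

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["beer", "chips"],
    ["bread", "milk"],
    ["beer", "chips", "bread"],
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):   # every product pair in the basket
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # product pairs most likely to be bought together
```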
Interestingness of patterns
In a data mining system, millions of patterns
are generated every day.
• Among all these generated patterns,
how many are really interesting?
Only a small fraction of the generated patterns
will be interesting to any given user.
What makes a pattern interesting?
A pattern is interesting if it is:
 easily understood by humans,
 valid on new/test data,
 potentially useful.
Can a data mining system generate all of the
interesting patterns?
This question refers to the completeness of a data
mining system.
In reality it is not possible for a data mining
system to generate all interesting patterns.
Classification in data mining
Bayesian classification
Bayesian classifiers are statistical
classifiers.
 They can predict class membership probabilities.
 They give the probability that a given item
belongs to a particular class or not.
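A minimal Bayesian-classification sketch with scikit-learn's naive Bayes classifier; it prints the class-membership probabilities for a few test items (the iris data is a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))   # probability that each item belongs to each class
```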
Classification based on the
applications adapted:
Finance
Telecommunications
DNA
Stock markets
Email
General classification
Conclusion:
 Data mining is a large area of data science
which aims to discover patterns and features
in data, often in large data sets. It includes
regression, classification, clustering, anomaly
detection, and other techniques. It also includes
pre-processing, validation, summarization, and
ultimately making sense of the data sets.
 It primarily turns raw data into useful
information.