
College of Computing and Informatics

Department Of Information Science


Data Mining and Data Warehousing
Group Assignment Presentation
GROUP LIST ID
1. TEMESGEN FIKAD………………………………RU/1231/12
2. MUSA ENDRIS …………………………………...RU//12
3. MUHAMED MOLA…………………………………..RU//12
4. HLINA GETACHEW…………………………….. RU//12
Discuss and write brief explanations about:
1. OLAP, OLTP
2. Data Preprocessing (Data Cleaning, Integration, Transformation, Reduction and Discretization)
3. Data Mining algorithms: Classification and Prediction, Clustering
4. Text Mining, Web Mining
1, OLAP and OLTP
OLAP
Online analytical processing (OLAP) is a system for
performing multi-dimensional analysis at high speeds
on large volumes of data. Typically, this data is from a 
data warehouse, data mart or some other centralized
data store. OLAP is ideal for data mining, business
intelligence and complex analytical calculations, as
well as business reporting functions like financial
analysis, budgeting and sales forecasting.
 OLTP
Online transactional processing (OLTP) enables the
real-time execution of large numbers of database
transactions by large numbers of people, typically over
the Internet. OLTP systems are behind many of our
everyday transactions, from ATMs to in-store
purchases to hotel reservations. OLTP can also drive
non-financial transactions, including password
changes and text messages.
The main difference between OLAP
and OLTP
Online Analytical Processing (OLAP) is a category of software
tools that analyze data stored in a database, whereas Online
transaction processing (OLTP) supports transaction-oriented
applications in a 3-tier architecture.
OLAP creates a single platform for all types of business analysis
needs which includes planning, budgeting, forecasting, and
analysis, while OLTP is useful for administering day-to-day
transactions of an organization.
OLAP is characterized by a large volume of data, while OLTP is
characterized by large numbers of short online transactions.
 In OLAP, a data warehouse is built specifically to integrate different data sources into a consolidated database, whereas OLTP uses a traditional DBMS.
 Examples: any data warehouse system is an OLAP system. Uses of OLAP include Spotify analyzing users' listening history to build a personalized homepage of songs and playlists, and Netflix's movie recommendation system.
 Examples: uses of OLTP include an ATM center, online banking, online airline ticket booking, sending a text message, and adding a book to a shopping cart. OLTP handles the ACID properties during data transactions via the application.
Parameters: OLTP vs. OLAP

Process: OLTP is an online transactional system and manages database modification. OLAP is an online analysis and data retrieving process.
Characteristic: OLTP is characterized by large numbers of short online transactions. OLAP is characterized by a large volume of data.
Functionality: OLTP is an online database modifying system. OLAP is an online database query management system.
Method: OLTP uses a traditional DBMS. OLAP uses a data warehouse.
Query: OLTP inserts, updates, and deletes information in the database. OLAP mostly uses select operations.
Table: Tables in an OLTP database are normalized. Tables in an OLAP database are not normalized.
Source: OLTP and its transactions are the source of data. Different OLTP databases become the source of data for OLAP.
Data integrity: An OLTP database must maintain data integrity constraints. An OLAP database is not modified frequently, so data integrity is not an issue.
Response time: OLTP response time is in milliseconds. OLAP response time ranges from seconds to minutes.
Data quality: The data in an OLTP database is always detailed and organized. The data in an OLAP process might not be organized.
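As a rough illustration of the Query and Process rows above, the following Python sketch (a minimal example assuming only the built-in sqlite3 module and an invented sales table) contrasts short OLTP-style transactions with an OLAP-style aggregate query.

import sqlite3

# In-memory database with a small, invented "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")

# OLTP-style work: many short insert/update transactions.
conn.execute("INSERT INTO sales VALUES ('East', 'Book', 12.5)")
conn.execute("INSERT INTO sales VALUES ('West', 'Book', 20.0)")
conn.execute("UPDATE sales SET amount = 15.0 WHERE region = 'East'")
conn.commit()

# OLAP-style work: a read-mostly aggregate query over the whole table.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)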
2, Data Preprocessing (Data Cleaning,
Integration, Transformation, Reduction and
Discretization)
1. Data cleaning
Data cleaning or cleansing is the process of cleaning
datasets by accounting for missing values, removing outliers,
correcting inconsistent data points, and smoothing noisy
data. In essence, the motive behind data cleaning is to offer
complete and accurate samples for machine learning models.
The techniques used in data cleaning are specific to the data
scientist’s preferences and the problem they’re trying to solve.
Here’s a quick look at the issues that are solved during
data cleaning and the techniques involved.
 Missing values
 Noisy data
Missing values
The problem of missing data values is quite common. It may happen
during data collection or due to some specific data validation rule.
Here are some ways to account for missing data:
 Manually fill in the missing values. This can be a tedious and time-
consuming approach and is not recommended for large datasets.
 Make use of a standard value to replace the missing data
value. You can use a global constant like “unknown” or “N/A” to
replace the missing value. Although a straightforward approach, it isn’t
foolproof.
 Fill the missing value with the most probable value. To predict the
probable value, you can use algorithms like logistic regression or
decision trees.
 Use a central tendency to replace the missing value. Central
tendency is the tendency of a value to cluster around its mean, mode,
or median.
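A minimal sketch of two of these options, assuming the pandas library and a small invented dataset: replacing missing values with a global constant and with a central tendency (the column median).

import pandas as pd

# Invented example data with missing values (None becomes NaN).
df = pd.DataFrame({
    "age":  [25, None, 31, 40, None],
    "city": ["Addis Ababa", None, "Hawassa", "Adama", "Bahir Dar"],
})

# Standard/global constant for a categorical attribute.
df["city"] = df["city"].fillna("unknown")

# Central tendency (median) for a numeric attribute.
df["age"] = df["age"].fillna(df["age"].median())

print(df)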
Noisy data

A large amount of meaningless data is called noise. More precisely, it is random variance in a measured variable, or data with incorrect attribute values. Noise includes duplicate or near-duplicate data points, data segments of no value for a specific research process, and unwanted information fields.

The following are some methods used to solve the problem of noise:

Regression: Regression analysis can help determine the variables
that have an impact. This will enable you to work with only the
essential features instead of analyzing large volumes of data. Both
linear regression and multiple linear regression can be used for
smoothing the data.

Binning: Binning methods can be used for a collection of sorted data. The sorted values are divided into "bins," smaller segments of the same size, and each value is smoothed by looking at the values around it in its bin (a short sketch follows this list).
There are different techniques for binning,
including
smoothing by bin means and
 smoothing by bin medians.
Clustering: Clustering algorithms such as k-means clustering can be used to group data and detect outliers in the process.
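Here is a small sketch of equal-size binning with smoothing by bin means, using plain Python on an invented list of already-sorted prices.

# Sorted data, as binning assumes; values are invented.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

bin_size = 4  # equal-frequency bins of 4 values each

smoothed = []
for start in range(0, len(prices), bin_size):
    bin_values = prices[start:start + bin_size]
    mean = sum(bin_values) / len(bin_values)      # smoothing by bin means
    smoothed.extend([round(mean, 2)] * len(bin_values))

print(smoothed)   # each value is replaced by the mean of its bin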
2. Data integration
 Data integration, combining data from multiple sources, is a crucial part of data preparation. If done poorly, integration may introduce inconsistent and redundant data points, ultimately leading to models with inferior accuracy.
Here are some approaches to integrate data:
 Data consolidation: Data is physically brought together and stored in a single place. Having all data in one place increases efficiency and productivity. This step typically involves using data warehouse software (a rough sketch follows this list).
 Data virtualization: In this approach, an interface provides a
unified and real-time view of data from multiple sources. In
other words, data can be viewed from a single point of view.
 Data propagation: Involves copying data from one location to
another with the help of specific applications. This process can
be synchronous or asynchronous and is usually event-driven.
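As a very rough sketch of the data consolidation approach (not of full data warehouse software), the snippet below assumes two invented pandas DataFrames standing in for separate sources and physically combines them into a single table.

import pandas as pd

# Two invented sources, e.g. exports from two departmental systems.
sales_east = pd.DataFrame({"customer": ["A", "B"], "amount": [100, 250]})
sales_west = pd.DataFrame({"customer": ["C"], "amount": [300]})

# Consolidation: bring the records together in one place.
consolidated = pd.concat([sales_east, sales_west], ignore_index=True)
print(consolidated)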
3. Data reduction
Data reduction is used to reduce the amount of data and thereby reduce the costs associated with data mining or data analysis.
It offers a condensed representation of the dataset. Although this step
reduces the volume, it maintains the integrity of the original data. This data
preprocessing step is especially crucial when working with big data as the
amount of data involved would be gigantic.
The following are some techniques used for data reduction.
Dimensionality reduction
Dimensionality reduction, also known as dimension reduction, reduces
the number of features or input variables in a dataset.
The number of features or input variables of a dataset is called its
dimensionality. The higher the number of features, the more troublesome it
is to visualize the training dataset and create a predictive model.
The following are some ways to perform dimensionality reduction:
Principal component analysis (PCA): A statistical technique used to extract a new set of variables from a large set of variables. The newly extracted variables are called principal components. This method works only for features with numerical values (a minimal sketch follows this list).
High correlation filter: A technique used to find highly correlated
features and remove them; otherwise, a pair of highly correlated
variables can increase the multicollinearity in the dataset. 
Missing values ratio: This method removes attributes having
missing values more than a specified threshold.
Low variance filter: Involves removing normalized attributes
having variance less than a threshold value as minor changes in data
translate to less information.
Random forest: This technique is used to assess the importance of
each feature in a dataset, allowing us to keep just the top most
important features.
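A minimal PCA sketch, assuming scikit-learn and invented numeric data, reducing four features to two principal components.

import numpy as np
from sklearn.decomposition import PCA

# Invented dataset: 6 samples, 4 numeric features.
X = np.array([
    [2.5, 2.4, 0.5, 1.1],
    [0.5, 0.7, 2.2, 0.9],
    [2.2, 2.9, 0.7, 1.0],
    [1.9, 2.2, 1.1, 0.8],
    [3.1, 3.0, 0.3, 1.2],
    [2.3, 2.7, 0.9, 1.1],
])

pca = PCA(n_components=2)        # keep two principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component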
Feature subset selection
Feature subset selection is the process of selecting a subset of
features or attributes that contribute the most or are the most
important.
This data reduction approach can help create faster and more
cost-efficient machine learning models. Attribute subset
selection can also be performed in the data transformation step.
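One possible sketch of feature subset selection, assuming scikit-learn's SelectKBest as the scorer and a synthetic labelled dataset; other selectors follow the same pattern.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 100 samples, 10 features, only some informative.
X, y = make_classification(n_samples=100, n_features=10,
                           n_informative=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=4)  # keep the 4 best features
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (100, 4)
print(selector.get_support(indices=True))  # indices of the kept features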
Numerosity reduction
Numerosity reduction is the process of replacing the original data
with a smaller form of data representation. There are two ways to
perform this: parametric and non-parametric methods.
Parametric methods 
Parametric methods use models for data representation. Log-linear
and regression methods are used to create such models. In
contrast, non-parametric methods store reduced data
representations using clustering, histograms, data cube aggregation,
and data sampling.
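A short sketch of two non-parametric forms of numerosity reduction, random sampling and a histogram summary, assuming numpy and invented measurements.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=10_000)   # invented measurements

# Non-parametric reduction 1: keep only a random sample of the data.
sample = rng.choice(data, size=500, replace=False)

# Non-parametric reduction 2: summarize the data as a histogram.
counts, bin_edges = np.histogram(data, bins=20)

print(sample.shape)   # (500,)
print(counts.sum())   # 10000 values summarized by 20 bin counts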
4. Data transformation
Data transformation is the process of converting data
from one format to another. In essence, it involves
methods for transforming data into appropriate formats
that the computer can learn efficiently from.
The following are some strategies for data transformation.
 Smoothing
This statistical approach is used to remove noise from the
data with the help of algorithms. It helps highlight the
most valuable features in a dataset and predict patterns.
Aggregation
Aggregation refers to pooling data from multiple sources and
presenting it in a unified format for data mining or analysis.
Aggregating data from various sources to increase the number of
data points is essential as only then the ML model will have enough
examples to learn from.
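A minimal aggregation sketch, assuming pandas and invented records from two branch systems that are pooled and then summarized per product.

import pandas as pd

# Invented records from two different branch systems.
branch_a = pd.DataFrame({"product": ["pen", "book"], "sales": [30, 12]})
branch_b = pd.DataFrame({"product": ["pen", "book"], "sales": [25, 40]})

# Pool the sources, then aggregate to one row per product.
pooled = pd.concat([branch_a, branch_b], ignore_index=True)
totals = pooled.groupby("product", as_index=False)["sales"].sum()

print(totals)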
Discretization
Discretization involves converting continuous data into sets of
smaller intervals.
Generalization
Generalization involves converting low-level data features into high-level data features.
Normalization
Normalization refers to the process of converting all data variables
into a specific range. In other words, it’s used to scale the values of
an attribute so that it falls within a smaller range, for example, 0 to 1.
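A minimal min-max normalization sketch, assuming scikit-learn's MinMaxScaler and invented income values, scaling the attribute into the 0 to 1 range.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented income values with very different magnitudes.
income = np.array([[12_000.0], [30_000.0], [58_000.0], [99_000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
income_scaled = scaler.fit_transform(income)

print(income_scaled.ravel())   # all values now fall between 0 and 1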
5. Discretization
Discretization involves reducing the number of values
for a continuous attribute by partitioning the attribute
range into intervals to replace actual data values.
Discretization can be done by binning, histogram
analysis, clustering, decision tree analysis, and
correlation analysis.
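A short discretization sketch using pandas' cut function to replace invented continuous ages with hand-chosen intervals; histogram- or clustering-based discretization follows the same idea with different cut points.

import pandas as pd

ages = pd.Series([3, 17, 25, 34, 51, 63, 78])   # invented continuous values

# Replace actual values with interval labels.
age_groups = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                    labels=["child", "young", "middle-aged", "senior"])

print(age_groups.tolist())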
3, Data Mining algorithms: Classification and Prediction, Clustering
1, Classification
 Classification is a technique that allows data to be parsed in accordance with predetermined outputs. Because the outputs are known in advance, classification learns from the data set in a supervised manner.
Classification identifies the category, or class label, of a new observation. First, a set of data is used as training data: the input data and the corresponding outputs are given to the algorithm.
Classification is a supervised machine learning approach that predicts the class of categorical data. It learns from the training dataset and the class labels in a classifying attribute, and uses what it has learned to classify new data.
 Classification algorithms work in two steps:
model construction and
model usage.
 Model Construction
In the training dataset, each record belongs to a predefined class, as determined by the class label attribute. A classifier model is then constructed using techniques such as classification rules, decision trees, or mathematical formulae.
 Model Usage
Classification makes use of two types of datasets: a training dataset and a testing dataset. The training dataset is used in classifier model construction, whereas the testing dataset is used for predicting the class labels of previously unseen records. The accuracy of the classifier model is calculated by comparing the known labels of the test samples with the results produced by the classifier.
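The two steps above can be sketched with scikit-learn, assuming a decision tree classifier and the bundled iris dataset: the training split is used for model construction and the test split for model usage and accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Model construction: learn a classifier from the training dataset.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Model usage: predict class labels of previously unseen records.
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))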
2, Prediction
Prediction is used to find a numerical output. As in classification, the training dataset contains the inputs and the corresponding numerical output values. The algorithm derives a model, or predictor, from the training dataset, and the model produces a numerical output when new data is given. The model predicts a continuous-valued function or ordered value.
Prediction is the process of identifying the missing or unavailable numerical data for a new observation.
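A minimal prediction sketch, assuming scikit-learn's LinearRegression and an invented numeric relationship between one input and the output.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training data: the output is roughly 3*x + 5.
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([8.1, 10.9, 14.2, 16.8, 20.1])

predictor = LinearRegression().fit(X_train, y_train)

# Predict a continuous value for a new, unseen input.
print(predictor.predict(np.array([[6]])))   # roughly 23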
3, Clustering
As the amount of data collected in databases grows, cluster analysis has become an active topic in data mining research.
Clustering is one of the most popular data mining approaches in practice because it automatically detects "natural" groups or communities in big data. These clusters can be the end result, or they can be used to improve other data mining steps by customizing those steps according to the cluster membership of an object of interest.
A cluster is a collection of similar data items grouped together; dissimilar data items are scattered into different groups.
Clustering is the process of grouping a set of data items into classes of similar data items. It is an example of unsupervised learning, which does not require predefined class labels to identify the cluster of a data item.
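A small clustering sketch, assuming scikit-learn's KMeans and invented 2-D data items; no class labels are given, and the algorithm discovers the two groups on its own.

import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D data items forming two "natural" groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.3, 7.9], [7.8, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster membership of each data item
print(kmeans.cluster_centers_)  # the two discovered group centers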
4, Text Mining, Web Mining
 Web Mining
Web mining is a process that uses various data mining techniques to extract knowledge from web content, structure, and usage. It can be used to discover previously unknown, useful information.
Web mining can be classified based on the following categories: 
 Web Content 
 Web Structure 
 Web usage
 Text Mining
Text mining is the transformation and interpretation (often mathematical) of unstructured texts into structured data for purposes such as pattern identification. The idea behind text mining is to find patterns and associations in documents, which can be used for a variety of purposes.
Web Mining Technologies
 Web Content Mining Web content mining is the process
of converting raw data into useful information using the
content on web pages from a specified website.
 Web Structure Mining The web graph is a structure that consists of nodes (pages) and hyperlinks; a hyperlink between two pages forms an edge. Document-level analysis looks at links within a single document, while hyperlink-level analysis assesses relationships among different documents or websites (a tiny sketch follows this list).
 Web Usage Mining The Web is a collection of interrelated
files housed on one or more servers. Leveraging the client-
server transactions, patterns of meaningful data are
discovered.
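As a rough sketch of the web-graph idea from web structure mining, the snippet below builds a tiny invented link graph as an adjacency list and counts incoming hyperlinks per page; real systems work on crawled data at a much larger scale.

# Invented web graph: each page maps to the pages it links to.
links = {
    "index.html":    ["about.html", "products.html"],
    "about.html":    ["index.html"],
    "products.html": ["index.html", "about.html"],
}

# Count incoming hyperlinks (edges) per page, a simple structural measure.
in_degree = {}
for page, targets in links.items():
    for target in targets:
        in_degree[target] = in_degree.get(target, 0) + 1

print(in_degree)   # e.g. index.html has 2 incoming links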
Text Mining Technologies
 Summarization: summarizing a large amount of data while maintaining the main idea.
 Information Extraction: using pattern matching to extract information.
 Categorization: a supervised learning technique that categorizes documents according to their content.
 Visualization: using computer graphics to represent information and to visualize relationships.
 Clustering: grouping documents according to textual similarity with an unsupervised technique.
 Question Answering: using a list of patterns to answer a natural language question.
 Sentiment Analysis: also known as opinion mining, it gathers people's moods about a service or product.
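A minimal text mining sketch, assuming scikit-learn and a few invented documents: the unstructured text is turned into a structured TF-IDF matrix and then grouped by textual similarity, as in the clustering technique listed above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented unstructured documents.
docs = [
    "bank loan interest rate",
    "loan approval at the bank",
    "football match final score",
    "goal scored in the football final",
]

# Transform unstructured text into a structured numeric matrix.
tfidf = TfidfVectorizer().fit_transform(docs)

# Group documents by textual similarity (unsupervised clustering).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)   # documents about the same topic share a cluster label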
Differences

The basic difference between web mining and text mining arises from the different nature of the data each works on.
Text mining imposes a structure on unstructured forms of data, such as Word documents, PDF files, and XML files, so that valuable information can be mined from them.
Web mining deals with data from the web: page content, the hyperlink structure, and usage data.
Data Mining vs. Text Mining

Data mining is the statistical technique of processing raw data in a structured form, whereas text mining is the part of data mining that involves processing text from documents.
Data mining gathers information from pre-existing databases and spreadsheets, whereas text mining gathers high-quality information from text.
In data mining, processing of data is done directly; in text mining, processing is done linguistically.
Data mining uses statistical techniques to evaluate data; text mining uses computational linguistic principles to evaluate text.
In data mining, data is stored in a structured format; in text mining, data is stored in an unstructured format.
Data mining data is homogeneous and easy to retrieve; text mining data is heterogeneous and not as easy to retrieve.
Data mining supports mining of mixed data; text mining mines text only.
THE END
THANK YOU TEACHER AND STUDENTS
