DM and DW Notes-Module2


Module-2
Data Warehouse Implementation and Data Mining

Efficient Data Cube Computation: An Overview


Multidimensional data analysis requires computing aggregate functions over many combinations of dimensions. In SQL, such aggregations are expressed with the GROUP BY clause. Each group-by can be represented by a cuboid, and the set of all group-bys forms a lattice of cuboids defining a data cube.
Example:
A data cube is a lattice of cuboids. Suppose we want to create a data cube for AllElectronics sales that contains the following:
city
item
year
sales_in_dollars
To analyze the AllElectronics sales data, the following operations may be needed.
1. Compute the sum of sales, grouping by city and item.
2. Compute the sum of sales, grouping by city.
3. Compute the sum of sales, grouping by item.
Here we consider city, item, and year as dimensions and sales_in_dollars as the measure.
What is the total number of cuboids in our data cube (AllElectronics sales)?
We have 3 attributes (city, item, and year), so the total number of cuboids is 2^3 = 8. These cuboids (group-bys) are listed below:
1. (city, item, year)
2. (city, item)
3. (city, year)
4. (item, year)
5. (city)
6. (item)
7. (year)
8. ()
Here, () means that the group-by is empty, i.e., the dimensions are not grouped. These group-bys (cuboids) form a lattice of cuboids for the data cube.


Figure2.1: Lattice of cuboids making up a 3D data cube. Each cuboid represents a different group-by.

The base cuboid can return the total sales for any combination of the three dimensions. The apex cuboid contains the total sum of all sales; its group-by is empty. Starting at the apex cuboid and exploring downward in the lattice is called drilling down within the data cube; starting at the base cuboid and exploring upward is called rolling up. The cube operator on n dimensions is thus equivalent to a collection of group-by statements, one for each subset of the n dimensions.
This data cube can be defined in an SQL-like data mining query language as:
define cube sales_cube [city, item, year]: sum(sales_in_dollars)
A cube with n dimensions contains 2^n cuboids. The computation of all of these cuboids can be requested with the statement:
compute cube sales_cube
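
The effect of compute cube can be sketched in ordinary code. The following is a minimal Python illustration (pandas is assumed, and the toy sales table and its values are hypothetical) that enumerates all 2^3 = 8 group-bys of the three dimensions and aggregates the measure for each, from the base cuboid down to the apex cuboid:

    from itertools import combinations

    import pandas as pd

    # Hypothetical AllElectronics-style sales records
    sales = pd.DataFrame({
        "city": ["Vancouver", "Vancouver", "Toronto"],
        "item": ["TV", "phone", "TV"],
        "year": [2010, 2010, 2011],
        "sales_in_dollars": [400, 200, 350],
    })

    dims = ["city", "item", "year"]

    # One group-by per subset of the dimensions: 2^3 = 8 cuboids in total
    for k in range(len(dims), -1, -1):
        for cuboid in combinations(dims, k):
            if cuboid:  # ordinary cuboid: group by this subset of the dimensions
                agg = sales.groupby(list(cuboid))["sales_in_dollars"].sum()
            else:       # apex cuboid (): the empty group-by, i.e. total sales
                agg = sales["sales_in_dollars"].sum()
            print(tuple(cuboid))
            print(agg, "\n")

Materializing the cube amounts to storing each of these eight aggregation results instead of recomputing them for every query.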
Online analytical processing may need to access different cuboids for different queries, so it is a good idea to compute all, or at least some, of the cuboids in a data cube in advance. Precomputation leads to fast response times and avoids redundant computation. The major challenge of precomputation is that it requires a large amount of storage space; this problem is referred to as the curse of dimensionality.
For an n-dimensional data cube where each dimension has a concept hierarchy, the total number of cuboids that can be generated is

$$\text{Total number of cuboids} = \prod_{i=1}^{n} (L_i + 1)$$

where $L_i$ is the number of levels associated with dimension i. The 1 is added to $L_i$ to include the virtual top level, all.
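
For example, using the dimension hierarchies that appear in the query-processing example later in this module (day < month < quarter < year for time, item_name < brand < type for item, and street < city < state < country for location), the cube would have $(4+1) \times (3+1) \times (4+1) = 100$ cuboids.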
Partial materialization: Selected computation of cuboids

There are 3 choices for data cube materialization:


1. No Materialization:
Do not precompute any of the “nonbase” cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.
2. Full Materialization:


Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as
full cube.
3. Partial Materialization:
Selectively compute a proper subset of the whole set of possible cuboids.

Indexing OLAP Data: Bitmap Index and Join Index


To facilitate efficient data access, most data warehouse systems support index structures and materialized views. Two indexing techniques are discussed here:
1. Bitmap Indexing
2. Join Indexing
1. Bitmap Indexing:
Bitmap indexing is a popular method because it allows quick searching in data cubes. A bitmap index is an alternative representation of the record ID (RID) list. In the bitmap index for a given attribute, there is a distinct bit vector (BV) for each value v of the attribute. If the attribute of a given record has value v, then the bit representing v is set to 1 in that record’s row of the bitmap index, and all other bits of the row are set to 0.

Figure2.2: Indexing OLAP data using bitmap indices
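
The idea can be sketched in a few lines of Python. This is a toy illustration of bitmap construction (the city column below is hypothetical), not how a production OLAP server stores its bitmaps:

    from collections import defaultdict

    def build_bitmap_index(column):
        # One bit vector per distinct value v: bit i of v's vector is 1
        # exactly when the record with RID i holds the value v.
        index = defaultdict(lambda: [0] * len(column))
        for rid, value in enumerate(column):
            index[value][rid] = 1
        return dict(index)

    # Example: indexing a small 'city' dimension column
    cities = ["Vancouver", "Toronto", "Vancouver", "Chicago"]
    for value, bits in build_bitmap_index(cities).items():
        print(value, bits)

Once the bit vectors exist, equality selections and their combinations reduce to fast bitwise AND/OR operations on the vectors.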


2. Join Indexing:
Join indexing identifies joinable tuples of relations without performing a join operation. Join indexing is useful for maintaining the relationship between a foreign key and its matching primary key.

Figure2.3: Linkages between a sales fact table and the location and item dimension tables


In Figure 2.3, the “Main Street” value in the location dimension table joins with tuples T57, T238, and T884 of the sales fact table.
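
Conceptually, a join index is a precomputed mapping from a dimension value (or key) to the fact-table RIDs it joins with. A minimal Python sketch reusing the Main Street linkage from Figure 2.3:

    # Hypothetical join index from location values to sales fact table RIDs
    location_join_index = {
        "Main Street": ["T57", "T238", "T884"],
    }

    # Finding all sales at Main Street then requires no join operation:
    print(location_join_index["Main Street"])  # ['T57', 'T238', 'T884']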

Efficient Processing of OLAP Queries


The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing should proceed as follows:
1. Determine which operations should be performed on the available cuboids:
This involves transforming any selection, projection, roll-up (group-by) and drill-down
operations specified in the query into corresponding SQL and/or OLAP operations. For example,
slicing and dicing a data cube may correspond to selection and projection operations on a
materialized cuboid.
2. Determine to which materialized cuboid(s) the relevant operations should be applied:
This involves identifying all of the materialized cuboids that may potentially be used to answer the query, pruning this set using dominance relationships among the cuboids, estimating the cost of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
Example:
Consider an AllElectronics data cube of the form “sales_cube [time, item, location]: sum(sales_in_dollars)”. The dimension hierarchies used are
“day < month < quarter < year” for time,
“item_name < brand < type” for item, and
“street < city < state < country” for location.
Suppose that the query to be processed is on {brand, state} with the selection constant “year = 2010”. Also suppose that there are four materialized cuboids available, as follows:
 cuboid1: {year, item_name, city}
 cuboid2: {year, brand, country}
 cuboid3: {year, brand, state}
 cuboid4: {item_name, state} where year=2010.

Which of the above cuboids can be selected to process the query?

Cuboid 2 cannot be used, because country is a more general concept than state. Cuboids 1, 3, and 4 can be used to process the query, because:
1. They have the same set, or a superset, of the dimensions in the query.
2. The selection clause in the query can imply the selection in the cuboid.
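
The dominance test in step 2 can be sketched as follows. This is a simplified Python illustration: the finer_or_equal table hard-codes the hierarchies given above, and the sketch checks dimension compatibility only, ignoring cost estimation and the year = 2010 selection clause:

    # For each query level, the dimension levels that can be rolled up to it
    finer_or_equal = {
        "state": {"street", "city", "state"},
        "brand": {"item_name", "brand"},
    }

    cuboids = {
        "cuboid1": {"year", "item_name", "city"},
        "cuboid2": {"year", "brand", "country"},
        "cuboid3": {"year", "brand", "state"},
        "cuboid4": {"item_name", "state"},
    }

    def can_answer(cuboid_dims, query_dims):
        # Usable iff every query dimension appears at the same level or finer
        return all(any(d in finer_or_equal[q] for d in cuboid_dims)
                   for q in query_dims)

    query = {"brand", "state"}
    usable = [name for name, dims in cuboids.items() if can_answer(dims, query)]
    print(usable)  # ['cuboid1', 'cuboid3', 'cuboid4']; cuboid2 is too general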

OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP


OLAP servers present business users with multidimensional data from data warehouses or data
marts, without concerns regarding how or where the data are stored. The implementations of a
warehouse server for OLAP processing include the following:
1. Relational OLAP (ROLAP) servers
2. Multidimensional OLAP (MOLAP) servers
3. Hybrid OLAP (HOLAP) servers


What is Data Mining?


 Data mining is the process of automatically discovering useful information in large data
repositories.
 Data mining techniques are deployed on large databases in order to find novel and useful patterns that might otherwise remain unknown.
 They also provide capabilities to predict the outcome of a future observation, such as
predicting whether a newly arrived customer will spend more than Rs1000 at a department
store.
Data Mining and Knowledge Discovery
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall
process of converting raw data into useful information, as shown in Figure 2.4.

Figure 2.4: The process of knowledge discovery in databases (KDD)

The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables)
and may reside in a centralized data repository or be distributed across multiple sites.
The purpose of data preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. Data preprocessing is the most laborious and time-consuming step in the overall knowledge discovery process. The steps involved in data preprocessing are:
1. Fusing data from multiple resources.
2. Cleaning data to remove noise and duplicate observations
3. Selecting records and features that are relevant to the data mining task
Postprocessing ensures that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization, which allows analysts to explore the data and the data mining results from a variety of viewpoints.

Challenges in Data Mining


There are 5 major challenges in data mining:
1. Scalability
2. High Dimensionality
3. Heterogeneous and Complex data
4. Data Ownership and Distribution
5. Non-traditional Analysis.


1. Scalability
Data sets may be gigabytes, terabytes, or even petabytes in size. To handle such volumes, data mining algorithms must be scalable. Data mining algorithms use special search strategies to handle exponential search problems. Scalability can also be improved by sampling or by developing parallel and distributed algorithms.
2. High Dimensionality
Nowadays, data sets may contain hundreds or thousands of attributes. In bioinformatics, for example, progress in microarray technology has produced gene expression data involving thousands of features.
3. Heterogeneous and Complex data
Traditional data analysis methods deal with data sets that contain attributes of the same type. Data mining techniques, by contrast, can handle data sets that contain heterogeneous attributes.
Examples:
Web pages containing semi-structured text and hyperlinks.
DNA data with sequential and three-dimensional structure.
4. Data Ownership and Distribution
Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed across resources belonging to multiple organizations. This requires
the development of distributed data mining techniques. The key challenges faced by distributed
data mining algorithms include
1. How to reduce the amount of communication needed to perform the distributed
computation
2. How to effectively consolidate the data mining results obtained from multiple sources
3. How to address data security issue
5. Non-traditional Analysis
In traditional analysis, a hypothesis is proposed, an experiment is designed to gather the data, and
then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor intensive. Data mining therefore relies on non-traditional analysis, because the data sets it analyzes are typically not the result of carefully designed experiments and often represent opportunistic samples of the data rather than random samples.

Data Mining Tasks


Data mining tasks are divided into two major categories:
1. Predictive tasks
2. Descriptive tasks

1. Predictive tasks
 The objective of these tasks is to predict the value of a particular attribute based on the
values of other attributes.
 The attribute to be predicted is commonly known as the target or dependent variable,
while the attributes used for making the prediction are known as the explanatory or
independent variables.
2. Descriptive tasks
 The objective of descriptive tasks is to derive patterns (correlations, trends, clusters,
trajectories, and anomalies) that summarize the underlying relationships in data.
 Descriptive data mining tasks are exploratory in nature and frequently require
postprocessing techniques to validate and explain the results.


Data mining comprises four core tasks:


1. Predictive Modeling
2. Association analysis
3. Cluster analysis
4. Anomaly Detection

Figure2.5: Four of the core data mining tasks

1. Predictive Modeling
 Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks:
a. Classification, which is used for discrete target variables.
b. Regression, which is used for continuous target variables.
 Example1: predicting whether a Web user will make a purchase at an online bookstore is a
classification task because the target variable is binary-valued. On the other hand,
forecasting the future price of a stock is a regression task because price is a continuous-
valued attribute.

 Example2: Predicting the type of Flower:


Petal width is broken into the categories low, medium, and high, which correspond to the
intervals (0, 0.75), (0.75, 1.75), (1.75, ∞), respectively. Also, petal length is broken into
categories low, medium and high, which correspond to the intervals (0, 2.5), (2.5, 5), (5,
∞) respectively. Figure 2.6 shows a plot of petal width versus petal length for the 150
flowers in the Iris data set.
 Based on these categories of petal width and length, the following rules can be derived:
Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
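
These three rules translate directly into a small classifier. A minimal Python sketch: the interval boundaries follow the text, and a flower whose width and length categories disagree is simply left unclassified here:

    def classify_iris(petal_length, petal_width):
        # Category boundaries from the intervals given above
        width = "low" if petal_width < 0.75 else "medium" if petal_width < 1.75 else "high"
        length = "low" if petal_length < 2.5 else "medium" if petal_length < 5 else "high"
        rules = {
            ("low", "low"): "Setosa",
            ("medium", "medium"): "Versicolour",
            ("high", "high"): "Virginica",
        }
        return rules.get((width, length), "unclassified")

    print(classify_iris(petal_length=1.4, petal_width=0.2))  # Setosa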


Figure2.6: Petal width versus petal length for 150 iris flowers

2. Association analysis
 Association analysis is used to discover patterns that describe associated features in the
data.
 The discovered patterns are typically represented in the form of implication rules or
feature subsets. Because of the exponential size of its search space, the goal of association
analysis is to extract the most interesting patterns in an efficient manner.
 Example1:
Finding groups of genes that have related functionality
Identifying the Web pages that are accessed together
 Example2: Market Basket Analysis
The transactions shown in Table 2.1 illustrate point-of-sale data collected at the checkout
counters of a grocery store. Association analysis can be applied to find items that are
frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.


Table2.1: Market basket data
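
The strength of a rule such as {Diapers} → {Milk} is commonly measured by its support and confidence. A minimal Python sketch on a few hypothetical baskets (not the actual contents of Table 2.1):

    def rule_metrics(transactions, antecedent, consequent):
        # Support and confidence of the rule antecedent -> consequent
        n = len(transactions)
        both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
        ante = sum(1 for t in transactions if antecedent <= t)
        return both / n, (both / ante if ante else 0.0)

    baskets = [
        {"bread", "diapers", "milk"},
        {"diapers", "milk", "beer"},
        {"bread", "butter"},
        {"diapers", "milk"},
    ]
    support, confidence = rule_metrics(baskets, {"diapers"}, {"milk"})
    print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.75, 1.00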


3. Cluster Analysis
 Cluster analysis is used to find groups of closely related observations, so that observations that belong to the same cluster are more similar to each other than to observations that belong to other clusters.
 Clustering has been used, for example, to group sets of related customers.
4. Anomaly detection
 Anomaly detection is the task of identifying observations whose characteristics are
significantly different from the rest of the data. Such observations are known as anomalies
or outliers.
 A good anomaly detector must have a high detection rate and a low false alarm rate.
 Example: Credit Card Fraud Detection, network intrusions, unusual patterns of disease.
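
The notion of “significantly different from the rest” can be made concrete with z-scores. A minimal Python sketch on hypothetical daily card-spend values; real fraud detectors are considerably more elaborate:

    import statistics

    def zscore_outliers(values, threshold=2.0):
        # Return the values whose z-score magnitude exceeds the threshold
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)
        return [v for v in values if abs(v - mu) / sigma > threshold]

    spends = [120, 95, 130, 110, 105, 98, 3000]  # the last value is anomalous
    print(zscore_outliers(spends))  # [3000]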

Types of Data
 A data set can be viewed as a collection of data objects. A data object may also be called a record, point, vector, pattern, event, case, sample, observation, or entity.
 Data objects are described by a number of attributes that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred. An attribute may also be called a variable, characteristic, field, feature, or dimension.
 Example (student information): such a data set is often a file, in which the objects are records (rows) in the file and the attributes are fields (columns) in the file.

Table2.2: A sample data set containing student information

1. Attributes and Measurement


2. Types of Data Sets

1. Attributes and Measurement


This topic addresses the issue of describing data by considering the types of attributes that are used to describe data objects:
a. Attribute
b. Type of an attribute
c. The different types of attributes
d. Describing attributes by number of values.
a. Attribute
 An attribute is a property or characteristic of an object that may vary either from one
object to another or from one time to another.
 Example: Eye color varies from person to person, while the temperature of an object
varies over time.
 A measurement scale is a rule (function) that associates a numerical or symbolic value
with an attribute of an object.
 Example: we count the number of chairs in a room to see if there will be enough to seat all
the people coming to a meeting. In this case the "physical value" of an attribute of an
object is mapped to a numerical or symbolic value.
b. Type of an attribute
 The properties of an attribute need not be the same as the properties of the values used to
measure it.
c. The different types of attributes
 A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute.
 For example, an attribute such as length has many of the properties of numbers.
 The following properties (operations) of numbers are typically used to describe attributes.
1. Distinctness = and ≠
2. Order <, ≤, > and ≥
3. Addition + and -
4. Multiplication * and /
 Based on these properties, we can define four types of attributes: nominal, ordinal, interval,
and ratio.


Table2.3: The different attribute types


d. Describing attributes by number of values
i) Discrete Attribute:
 A discrete attribute has a finite or countably infinite set of values.
 Examples: zip codes, ID numbers, or counts.
 Discrete attributes are often represented using integer variables.
 Binary attributes are a special case of discrete attributes and assume only two values.
 Example: true/false, yes/no, male/female, or 0/1.
 Binary attributes are often represented as Boolean variable.
ii) Continuous Attribute:
 A continuous attribute is one whose values are real numbers.
 Example: Attributes such as temperature, height, or weight.
 Continuous attributes are typically represented as floating-point variables.
iii) Asymmetric Attribute:
 For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important.
 Example: For a specific student, an attribute has a value of 1 if the student took the course
associated with that attribute and a value of 0 otherwise.
 Binary attributes where only non-zero values are important are called asymmetric binary
attributes.
2. Types of Data Sets
i) Dimensionality:


 The dimensionality of a data set is the number of attributes that the objects in the data set
possess.
 Data with a small number of dimensions tends to be qualitatively different than moderate
or high- dimensional data.
 The difficulties associated with analyzing high-dimensional data are sometimes referred to
as the curse of dimensionality.
ii) Sparsity
 A data set is sparse if most attribute values of its objects are 0; often, fewer than 1% of the entries are non-zero.
 In practical terms, sparsity is an advantage because usually only the non-zero values need
to be stored and manipulated.
iii) Resolution
 It is possible to obtain data at different levels of resolution, and often the properties of the
data are different at different resolutions.
 Example: The surface of the Earth seems very uneven at a resolution of a few meters, but
is relatively smooth at a resolution of tens of kilometers.
 The patterns in the data also depend on the level of resolution.
iv) Record Data
 The data set is a collection of records (data objects), each of which consists of a fixed set
of data fields (attributes).
 In the most basic form of record data, there is no explicit relationship among records or data fields,
and every record (object) has the same set of attributes.
 Record data is usually stored either in flat files or in relational databases.
v) Transaction or Market Basket Data
 Transaction data is a special type of record data, where each record (transaction) involves
a set of items.
 Example: Consider a grocery store. The set of products purchased by a customer during
one shopping trip constitutes a transaction, while the individual products that were
purchased are the items. This type of data is called market basket data.
vi) The Data Matrix:
 If the data objects in a collection of data have the same fixed set of numeric attributes,
then the data objects can be thought of as points (vectors) in a multidimensional space,
where each dimension represents a distinct attribute describing the object.
 A set of such data objects can be interpreted as ‘m’ by ‘n’ matrix, where there are m rows,
one for each object, and n columns, one for each attribute.
vii) The Sparse Data Matrix:
 A sparse data matrix is a special case of a data matrix in which the attributes are of the
same type and are asymmetric; i.e., only non-zero values are important.


 An example is document data. In particular, if the order of the terms (words) in a document is ignored, then the document can be represented as a term vector, where each term is a component (attribute) of the vector and the value of each component is the number of times the corresponding term occurs in the document. This representation of a collection of documents is often called a document-term matrix.
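
A document-term matrix is straightforward to build. In the minimal Python sketch below, the three-term vocabulary and the two documents are hypothetical; each row counts how often each vocabulary term occurs in one document:

    from collections import Counter

    def term_vector(document, vocabulary):
        # One row of the document-term matrix: per-term occurrence counts
        counts = Counter(document.lower().split())
        return [counts.get(term, 0) for term in vocabulary]

    vocab = ["data", "mining", "warehouse"]
    docs = ["data mining finds patterns in data",
            "the data warehouse stores data"]
    matrix = [term_vector(d, vocab) for d in docs]  # m x n: m documents, n terms
    print(matrix)  # [[2, 1, 0], [2, 0, 1]]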
viii) Graph-based data
 A graph can be a convenient and powerful representation for data.
 There are two specific cases:
a. Data with Relationships among Objects
b. Data with Objects That Are Graphs
a. Data with Relationships among Objects

Data Quality
i. Measurement and Data Collection Issues
ii. Issues Related to Applications

i. Measurement and Data Collection Issues


a. Measurement and Data Collection Errors
b. Noise and Artifacts
c. Precision, Bias, and Accuracy
d. Outliers
e. Missing Values
1. Eliminate Data Objects or Attributes
2. Estimate Missing Values
3. Ignore the Missing Value during Analysis
f. Inconsistent Values
g. Duplicate Data
ii. Issues Related to Applications
a. Timeliness
b. Relevance
c. Knowledge about the Data

i. Measurement and Data Collection Issues


a. Measurement and Data Collection Errors:
 Measurement error refers to any problem resulting from the measurement process.
 The numerical difference between the measured value and the true value is called the error.
 The term data collection error refers to errors such as omitting data objects or attribute
values, or inappropriately including a data object.
 Both measurement errors and data collection errors can be either systematic or random.
b. Noise and Artifacts:
 Noise is the random component of a measurement error. It may involve the distortion of a
value or the addition of spurious objects.
 Figure 2.7 shows a time series before and after it has been disrupted by random noise.

Figure2.7: Noise in a Time series Context


 Figure 2.8 shows a set of data points before and after some noise points have been added.


Figure2.8: Noise in a Spatial Context

 Techniques from signal or image processing can be used to reduce noise and thus, help
to discover patterns (signals) that might be "lost in the noise."
 Data mining focuses on robust algorithms that produce acceptable results even when
noise is present.
 Deterministic distortions of the data are referred to as artifacts.

c. Precision, Bias, and Accuracy:


 In statistics and experimental science, the quality of the measurement process and the
resulting data are measured by precision and bias.
 Precision: The closeness of repeated measurements (of the same quantity) to one another.
Precision is often measured by the standard deviation of a set of values.
 Bias: A systematic variation of measurements from the quantity being measured. Bias is
measured by taking the difference between the mean of the set of values and the known
value of the quantity being measured.
 Example: Suppose that we have a standard laboratory weight with a mass of 1 g and want to assess the precision and bias of our new laboratory scale. We weigh the mass five times and obtain the following five values: {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is 1.001, and hence the bias is 0.001. The precision, as measured by the standard deviation, is 0.013.
 Accuracy: The closeness of measurements to the true value of the quantity being
measured. Accuracy depends on precision and bias. There is no specific formula for
accuracy and it is a general concept.
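
The numbers in the example can be reproduced directly; a minimal Python check:

    import statistics

    measurements = [1.015, 0.990, 1.013, 1.001, 0.986]  # five weighings of the 1 g standard
    true_value = 1.0

    bias = statistics.mean(measurements) - true_value  # systematic deviation from 1 g
    precision = statistics.stdev(measurements)         # spread of the repeated measurements
    print(f"bias = {bias:.3f} g, precision = {precision:.3f} g")
    # bias = 0.001 g, precision = 0.013 g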
d. Outliers:
 Outliers are either (1) data objects that have characteristics different from most of the other data objects in the data set, or (2) values of an attribute that are unusual with respect to the typical values for that attribute.

e. Missing Values:
 It is not unusual for an object to be missing one or more attribute values.


 In some cases, the information was not collected. For example, some people decline to give their age or weight.
 There are several strategies for dealing with missing data:
1. Eliminate Data Objects or Attributes
2. Estimate Missing Values
3. Ignore the Missing Value during Analysis

1. Eliminate Data Objects or Attributes


 A simple and effective strategy is to eliminate objects with missing values.
 If a data set has only a few objects with missing values, those objects can simply be omitted.
 Similarly, an attribute that has many missing values (e.g., weight) can be eliminated, unless it is critical to the analysis.
2. Estimate Missing Values
 Sometimes missing data can be reliably estimated. For example, in a time series, missing values can be estimated (interpolated) from the remaining values.
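
For instance, a gap in a time series can be filled by linear interpolation between its neighbours. A minimal pandas sketch with a hypothetical daily temperature series:

    import pandas as pd

    # Hypothetical daily readings with one missing value
    series = pd.Series([21.0, None, 23.0, 24.0],
                       index=pd.date_range("2024-01-01", periods=4))
    print(series.interpolate())  # the missing reading is estimated as 22.0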
3. Ignore the Missing Value during Analysis
 Many data mining approaches can be modified to ignore missing values.
 For example, the similarity between two objects can be computed using only the attributes that have no missing values.
f. Inconsistent Values:
 Data can contain inconsistent values. For example, consider an address field where both a zip code and a city are listed, but the specified zip code area is not contained in that city.
 Some types of inconsistencies are easy to detect; for example, a person's height should not be negative.
g. Duplicate Data:
 A data set may include data objects that are duplicates, or almost duplicates of one
another.
 To detect and eliminate such duplicates, there are two main issues must be resolved.
o First, if there are two objects that actually represent a single object, then the values
of corresponding attributes may differ, and these inconsistent values must be
resolved.
o Second, care needs to be taken to avoid accidentally combining data objects that are similar but not actually duplicates.
ii. Issues Related to Applications
a. Timeliness:
 Some data starts to age as soon as it has been collected.
 For example, if the data provides a snapshot of an ongoing process, such as the purchasing behavior of customers or Web browsing patterns, then the snapshot represents reality for only a limited time.
b. Relevance:
 The available data must contain the information necessary for the application.
 For example, consider the task of building a model that predicts the accident rate for
drivers.


 If information about the age and gender of the driver is omitted, then this model will have
limited accuracy unless this information is indirectly available through other attributes.
c. Knowledge about the Data
 Normally data sets are accompanied by documentation that describes different aspects of
the data.
 The quality of this documentation can greatly help the subsequent analysis; if the documentation is poor, proper analysis is difficult.
 For example, if it is not documented that the missing values of a particular field are indicated as “999”, then our analysis of the data may be faulty.

Data Preprocessing
 In this topic, we discuss the issue of which preprocessing steps should be applied to make
the data more suitable for data mining. The data preprocessing techniques are shown
below:
1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Discretization and binarization
7. Variable transformation

1. Aggregation:
 Sometimes "less is more" and this concept is applied with aggregation.
 Aggregation is nothing but combining of two or more objects into a single object.
 Example: Consider the following data set,

Table2.4: Data set containing information about customer purchases


 This table records individual transactions across all the stores.
 The data can be aggregated store-wise (i.e., each store’s total sales per day); here the item attribute is combined away.
 Aggregation can also reduce the number of values of a particular attribute, e.g., reducing the possible values of date from 365 days to 12 months; here the date attribute is combined.
 This type of aggregation is commonly used in Online Analytical Processing (OLAP).
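
Both forms of aggregation are ordinary group-by operations. A minimal pandas sketch on hypothetical purchase records in the spirit of Table 2.4:

    import pandas as pd

    purchases = pd.DataFrame({
        "store": ["S1", "S1", "S2", "S2"],
        "date": pd.to_datetime(["2024-01-05", "2024-02-10",
                                "2024-01-07", "2024-01-20"]),
        "price": [23.00, 14.50, 9.00, 31.00],
    })

    # Store-wise monthly totals: the item detail is combined away and the
    # date attribute is reduced from 365 possible days to 12 months.
    monthly = purchases.groupby(
        ["store", purchases["date"].dt.to_period("M")])["price"].sum()
    print(monthly)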


2. Sampling:
 Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed.
 Sampling is used to reduce cost and time, because it is usually too expensive or too time-consuming to process all the data.
 A sample is representative if it has approximately the same property (of interest) as the original set of data.
 Simple Random Sampling: In simple random sampling, there is an equal probability of selecting any particular item. There are two variations of random sampling:
o Sampling without replacement: as each item is selected, it is removed from the population, so no item can be chosen more than once.
o Sampling with replacement: items are not removed from the population as they are selected, so the same item can be picked more than once.
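
Both variations map directly onto Python’s standard library; a minimal sketch on a hypothetical population of object IDs:

    import random

    population = list(range(1, 101))  # hypothetical data-object IDs
    random.seed(42)                   # fixed seed for a reproducible illustration

    no_replacement = random.sample(population, k=10)     # no ID can be drawn twice
    with_replacement = random.choices(population, k=10)  # the same ID may repeat
    print(no_replacement)
    print(with_replacement)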
