
Chapter Two

DATA AND DATA SET


Outlines

 Data and its types

 Data set
 Perspectives on Data

 Standard stages in a data science project


Data
As its name suggests, data science is fundamentally dependent
on data.

In its most basic form, a datum or a piece of information is an


abstraction of a real-world entity (person, object, or event).

What is data?

“Do you remember our first chapter?”

Data is an organized collection of data objects (facts) and their attributes. Look at the following data, shown in tabular form.

Data is organized facts at a certain level of conceptualization.


Data set
Each entity (object) is typically described by a number of attributes.
E.g., a book is an entity (object):

This object might have attributes such as: author, title, topic, genre,
publisher, date published, number of pages, edition, ISBN, and so on.

A data set consists of the data relating to a collection of entities, with each entity described in terms of a set of attributes.
Data set
In its most basic form, a data set is organized in an n * m data
matrix called the analytics record, where n is the number of
entities (rows) and m is the number of attributes (columns).

 Data sets are made up of data objects

 Data objects are described by attributes

The terms instance, example, entity, object, case, individual, and


record are used in data science literature to refer to a row.
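As a small, hedged sketch (not part of the slides), the n * m analytics record can be represented as a pandas DataFrame, with one row per entity and one column per attribute; the attribute names and values below are invented for illustration.

```python
import pandas as pd

# A tiny 3 x 4 analytics record: n = 3 entities (rows), m = 4 attributes (columns).
# Attribute names and values are made up for the example.
books = pd.DataFrame([
    {"title": "Book A", "author": "Author X", "year": 2012, "price": 19.99},
    {"title": "Book B", "author": "Author Y", "year": 2015, "price": 24.50},
    {"title": "Book C", "author": "Author X", "year": 2020, "price": 32.00},
])

n, m = books.shape                 # n entities, m attributes
print(n, m)                        # 3 4
print(books.columns.tolist())      # the attribute (column) names
```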
Data set
Example of dataset (book dataset)

Excluding the ID attribute, which is simply a label for each row and hence is not useful for analysis, each book is described using six attributes: title, author, year, cover, edition, and price.
Data set
We could have included many more attributes for each book, but,
as is typical of data science projects, we needed to make a choice
when we were designing the data set.
Data attributes
Data attributes are also called dimensions, features, or variables.

An attribute is a data field representing a characteristic or feature of a data object.

Example: customer_ID, name, address

Types of attributes

There are many different types of attributes, and for each attribute type
different sorts of analysis are appropriate.
Types of attributes
So understanding and recognizing different attribute types is a
fundamental skill for a data scientist. The standard types are numeric,
nominal, and ordinal.

1. Nominal Attribute

Nominal (also known as categorical) attributes take values from a


finite set. These values are names (hence nominal) for categories,
classes, or states of things.
Types of attributes
E.g., marital status = {single, married, divorced}

E.g. Hair color = {auburn, black, blond, brown, grey, red, white}

Types of Nominal attributes

Binary attribute: a special case of a nominal attribute where the set of possible values is restricted to just two values.

It is a nominal attribute with only two states (e.g., 0 and 1).

Example, we might have the binary attribute “spam,” which describes


whether an email is spam (true) or not spam (false).
Types of attributes
Binary attributes may be classified into two types:
Symmetric binary: both outcomes equally important
e.g. gender (F or M)
Asymmetric binary: outcomes not equally important
e.g., medical test (positive vs. negative). Why?
Convention: assign 1 to most important outcome (e.g., HIV positive)

Generally, Nominal attributes cannot have ordering or arithmetic


operations applied to them.

Nominal attributes may be sorted alphabetically, but alphabetizing is a distinct operation from ordering.
Types of attributes
2. Ordinal attributes
 Ordinal attributes are similar to nominal attributes, with
the difference that it is possible to apply a rank order over
the categories of ordinal attributes
 E.g. an attribute describing the response to a survey
question might take values from the domain “strongly
dislike, dislike, neutral, like, and strongly like”.
 Here, there is a natural ordering over these values from
“strongly dislike” to “strongly like” (or vice versa
depending on the convention being used).
Types of attributes
Note: in Ordinal attributes, Values have a meaningful order
(ranking) but magnitude between successive values is not known.

Example 1: size = {small, medium, large}

However, an important feature of ordinal data is that there is no notion


of equal distance between these values

For example, the cognitive distance between “dislike” and “neutral”


may be different from the distance between “like” and “strongly like.”

As a result, it is not appropriate to apply arithmetic operations (such as


averaging) on ordinal attributes.
Types of attributes
3. Numeric attributes

Numeric attributes are those that describe measurable quantities, represented using integer or real values.

Numeric attributes can be measured on either an interval scale or a ratio scale. Interval attributes are measured on a scale with a fixed but arbitrary interval and an arbitrary origin.

 Interval-scaled numeric attributes: measured on a scale of equal-sized units; values have an order.
Types of attributes
Example 1: temperature in °C or °F, calendar dates, etc.

Note: No true zero-point in interval scaled attribute.

Generally, the data type of an attribute (numeric, ordinal, nominal)


affects the methods we can use to analyze and understand the data,
including both the basic statistics we can use to describe the distribution
of values that an attribute takes and the more complex algorithms we
use to identify the patterns of relationships between attributes
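As a hedged sketch (not from the slides), the three attribute types can be represented in pandas: nominal values as unordered categoricals, ordinal values as ordered categoricals, and numeric values as plain numbers. The column names and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "hair_color":    ["black", "blond", "brown"],   # nominal
    "size":          ["small", "large", "medium"],  # ordinal
    "temperature_c": [21.5, 18.0, 25.3],            # numeric (interval-scaled)
})

# Nominal: named categories with no order; only equality comparisons make sense.
df["hair_color"] = pd.Categorical(df["hair_color"])

# Ordinal: categories with a rank order, but no notion of equal distance between them.
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)

print(df["size"].min())             # rank order is meaningful: 'small'
print(df["temperature_c"].mean())   # arithmetic is appropriate only for numeric data
```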
Perspectives on Data

Apart from the above defined types of data (numerical, nominal and ordinal),
a number of other useful distinctions can be made regarding data. One such
distinction is between structured and unstructured data

Structured data

Structured data are data that can be stored in a table, and every instance in the
table has the same structure (i.e., set of attributes).

It can be easily stored, organized, searched, reordered, and merged with other
structured data.

It is relatively easy to apply data science to structured data because, by


definition, it is already in a format that is suitable for integration into an
analytics record.
Perspectives on Data:
Unstructured data
 are data where each instance in the data set may have its own internal
structure, and this structure is not necessarily the same in every
instance.

 It is much more common than structured data.

 For example, collections of human text (emails, tweets, text messages,


posts, novels, etc.) can be considered unstructured data, as can
collections of sound, image, music, video, and multimedia files.
Perspectives on Data
The variation in the structure between the different elements means
that it is difficult to analyze unstructured data in its raw form.

We can often extract structured data from unstructured data using
techniques from artificial intelligence (such as natural-language
processing and ML), digital signal processing, and computer vision.
Stages in a Data Science project
Data Science pyramid
Stages in a Data Science project

Many people and companies regularly put forward suggestions on the best process to follow to climb the data science pyramid.

The most commonly used process is the Cross-Industry Standard Process for Data Mining (CRISP-DM).
The CRISP-DM life cycle
Data Science project life cycle
The primary advantage of CRISP-DM is that it is designed to
be independent of any software, vendor, or data-analysis technique.

The CRISP-DM life cycle consists of six stages:


1) Business understanding,

2) Data understanding,

3) Data preparation

4) Modeling

5) Evaluation, and

6) Deployment
Data Science project life cycle: Data
 Data are at the center of all data science activities, and that is why
the CRISP-DM diagram has data at its center.


The process is semi-structured, which means that a data scientist


doesn’t always move through these six stages in a linear fashion.

Depending on the outcome of a particular stage, a data scientist


may go back to one of the previous stages, redo the current stage,
or move on to the next stage.
Data Science project life cycle
In the first two stages, business understanding and data
understanding, the data scientist is trying to define the goals of
the project by understanding the business needs and the data that
the business has available to it.
In the early stages of a project, a data scientist will often
iterate between focusing on the business and exploring
what data are available.

This iteration typically involves identifying a business problem


and then exploring if the appropriate data are available to develop
a data-driven solution to the problem.
Data Science project life cycle
If the data are available, the project can proceed; if not, the data
scientist will have to identify an alternative problem to tackle.

Here, the data scientist identifies what data are available.

Once the data scientist has clearly defined a business problem


and is happy that the appropriate data are available, she moves on
to the next phase of CRISP-DM: data preparation.
2. Data preparation
This stage helps us gain a better understanding of the data and
prepares it for further evaluation.
The focus of the data-preparation stage is the creation of a data
set that can be used for the data analysis.
Creating this data set involves integrating data from a number of sources, such as databases.
Generally, data preparation involves the following tasks:
• Data cleansing
• Data integration
• Data reduction
• Data transformation
2. Data preparation…
• Data cleansing: to get rid of bad data
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
* Integration of data from multiple sources, such as databases,
data warehouses, or files
• Data reduction
* obtains a reduced representation of the data set that is much
smaller in volume, yet produces almost the same results.
o Dimensionality reduction
o Numerosity/size reduction
o Data compression
• Data transformation
* Normalization
* Discretization and/or Concept hierarchy generation
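As a hedged sketch of the normalization and discretization tasks just listed (the slides do not prescribe specific methods), the snippet below applies min-max scaling and simple binning to an invented numeric column.

```python
import pandas as pd

df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 90_000]})

# Normalization: rescale values into the [0, 1] range (min-max scaling).
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

# Discretization / concept hierarchy: replace raw values with coarser interval labels.
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

print(df)
```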
2. Data preparation: Data cleansing
Raw data is hardly usable in its pure form.

In practice, the data we collect is unstructured, irrelevant, and unfiltered.

After collecting the appropriate dataset, we need to adequately clean and process the data before proceeding to the next step.

Data cleansing includes solving problems such as outliers, inconsistencies, missing values, and incorrect or skewed values and trends.

E.g., duplicate or redundant data is a data problem that requires data cleaning.
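As one small, hedged example of the duplicate/redundant-data problem mentioned above, exact duplicate rows can be dropped with pandas; the records here are invented.

```python
import pandas as pd

records = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name":        ["Abebe", "Sara", "Sara", "Lena"],
})

# Rows 1 and 2 are exact duplicates; data cleansing keeps only one copy.
clean = records.drop_duplicates()
print(clean)
```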
2. Data preparation: Data cleansing
Incomplete Data
It is another problem that needs to be solved here.
• The dataset may lack certain attributes of interest.
• E.g. Is that enough if you have patient demographic profile and
address of region to predict the vulnerability (or exposure) of a
given region to Malaria outbreak?

The dataset may contain only aggregate data. E.g., a traffic police car accident report: “this many accidents occurred on this day in this sub-city.”
2. Data preparation: Data cleansing
Missing value
• Data is not always available, lacking attribute values. E.g.,
Occupation=“ ”

many tuples have no recorded value for several attributes,


such as customer income in sales data.

What’s wrong here? A missing required field


2. Data preparation: Data cleansing
So, how do we handle missing data?
 Ignore the missing value: not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Fill in automatically: calculate the most probable value, e.g., using the Expectation Maximization (EM) algorithm.
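The slide above mentions filling in values automatically (e.g., with the EM algorithm). As a simpler, hedged stand-in, the sketch below drops or mean-imputes missing values with pandas and scikit-learn; the columns are invented and this is not EM itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

sales = pd.DataFrame({
    "age":    [25, np.nan, 47, 31],
    "income": [30_000, 52_000, np.nan, 41_000],
})

# Option 1: ignore (drop) rows with missing values -- only sensible when few are missing.
dropped = sales.dropna()

# Option 2: fill in automatically with a column statistic (here the mean),
# a much simpler substitute for model-based methods such as EM.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(sales), columns=sales.columns)
print(filled)
```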
2. Data preparation: Data cleansing
Noisy Data
• Noisy: containing noise, errors, or outliers e.g., Salary=“−10” (an
error)
Typographical errors are errors that corrupt data.
Say ‘green’ is written as ‘rgeen’, and ‘cloud’ as ‘could’.
•Incorrect attribute values may be due to:
Faulty data collection instruments (e.g.: OCR)
Data entry problems
Data transmission problems
Technology limitation
Inconsistency in naming convention
2. Data preparation: Data cleansing
Noisy Data
How to overcome Noisy Data?
Manually check all data: tedious + infeasible?
Sort data by frequency
 ‘green’ is more frequent than ‘rgeen’
 Works well for categorical data
 Use, say Numerical constraints to Catch Corrupt Data
 Weight can’t be negative
 People can’t have more than 2 parents
 Salary can’t be less than Birr 300
Use statistical techniques to Catch Corrupt Data
 Check for outliers (the case of the 8-meter man)
 Check for correlated outliers using n-gram (“pregnant male”)
People can be male
People can be pregnant
People can’t be male AND pregnant
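A hedged sketch of two of the checks listed above: numerical constraints that catch impossible values, and a simple statistical look at outliers. The thresholds and column names are assumptions made for the example.

```python
import pandas as pd

people = pd.DataFrame({
    "weight_kg": [70, -5, 82, 65],          # -5 violates 'weight can't be negative'
    "salary":    [5000, 250, 7200, 6100],   # 250 violates 'salary can't be less than Birr 300'
    "height_m":  [1.7, 1.6, 8.0, 1.8],      # 8.0 is the '8-meter man'
})

# Numerical constraints: domain rules the values must satisfy.
bad_weight = people["weight_kg"] < 0
bad_salary = people["salary"] < 300
print(people[bad_weight | bad_salary])

# Statistical check: z-scores flag values far from the mean
# (|z| > 3 is a common rule on real data; on this tiny toy sample we just inspect the scores).
z = (people["height_m"] - people["height_m"].mean()) / people["height_m"].std()
print(z)
```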
2. Data preparation: Data Integration
Data integration combines data from multiple sources (database,
data warehouse, files & sometimes from non-electronic sources)
into a coherent store

Because of the use of different sources, data that is fine on its own may become problematic when we want to integrate it.

 Some of the issues are:


Different formats and structures
Conflicting and redundant data
Data at different levels
2. Data preparation: Data Integration

Data Integration: Formats


Not everyone uses the same format. Do you agree?
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
Dates are especially problematic:
 12/19/97
 19/12/97
 19/12/1997
 19-12-97
 Dec 19, 1997
 19 December 1997
 19th Dec. 1997
Are you frequently writing money as: Birr 200, Br. 200, 200 Birr, …
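As a hedged illustration of the date-format problem, pandas can normalize several of the formats listed above into one canonical date type; the dayfirst convention is our assumption, since forms like 12/19/97 vs. 19/12/97 are ambiguous without one.

```python
import pandas as pd

raw_dates = ["19/12/1997", "19-12-97", "Dec 19, 1997", "19 December 1997"]

# Parse each string into a single canonical datetime representation.
# dayfirst=True encodes our assumed convention for the purely numeric formats.
parsed = [pd.to_datetime(d, dayfirst=True) for d in raw_dates]
for original, normalized in zip(raw_dates, parsed):
    print(f"{original!r:20} -> {normalized.date()}")
```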
2. Data preparation: Data Integration

Data Integration: Inconsistent


Inconsistent data: containing discrepancies in codes or names, which often reflects a lack of standardization or naming conventions. e.g., Age=“26” vs. Birthday=“03/07/1986”
Some use “1,2,3” for rating; others “A, B, C”

Discrepancy between duplicate records


2. Data preparation: Data Integration

Data Integration: Data that Moves


Be careful of taking snapshots of a moving target
Example: Let’s say you want to store the price of a shoe in
France, and the price of a shoe in Italy. Can we use same currency
(say, US$) or country’s currency?
 You can’t store it all in the same currency (say, US$) because the
exchange rate changes frequently
 Price in foreign currency stays the same
 Must keep the data in foreign currency and use the current exchange
rate to convert

The same needs to be done for ‘Age’


 It is better to store ‘Date of Birth’ than ‘Age’
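A small, hedged sketch of the “store date of birth, derive age” advice above; the reference dates and function name are invented for the example.

```python
from datetime import date

def age_on(date_of_birth: date, reference: date) -> int:
    """Derive age from a stored date of birth at any reference date."""
    years = reference.year - date_of_birth.year
    had_birthday = (reference.month, reference.day) >= (date_of_birth.month, date_of_birth.day)
    return years if had_birthday else years - 1

# Storing the date of birth keeps the data correct as time moves on;
# 'age' is computed only when it is needed.
dob = date(1986, 7, 3)
print(age_on(dob, date(2024, 1, 1)))   # 37
print(age_on(dob, date(2026, 8, 1)))   # 40
```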
2. Data preparation: Data Integration

Handling Redundancy in Data Integration


Redundant data occur often when integration of multiple
databases
 Object identification: The same attribute or object may have
different names in different databases
 Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
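A hedged sketch of detecting a derivable (redundant) attribute via correlation analysis; the data is invented so that annual_revenue is exactly 12 times monthly_revenue.

```python
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 20, 30, 40],
    "annual_revenue":  [120, 240, 360, 480],   # derivable: 12 * monthly_revenue
    "num_employees":   [3, 9, 5, 7],
})

# A correlation close to 1.0 between two attributes suggests one of them is redundant.
print(df.corr())
```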
2. Data preparation: Data Reduction

Data Reduction Strategies

Data reduction: Obtain a reduced representation of the data set


that is much smaller in volume but yet produces the same (or
almost the same) analytical results

Why data reduction? A database/data warehouse may store


terabytes of data. Complex data analysis may take a very long
time to run on the complete data set.
2. Data preparation: Data Reduction

Data Reduction Strategies …


Data reduction strategies
 Dimensionality reduction: Select best attributes or remove
unimportant attributes
 Numerosity reduction: Reduce data volume by choosing
alternative, smaller forms of data representation
 Data compression: a technique that reduces the size of large files so that they take less storage space and are faster to transfer over a network or the Internet.
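A hedged sketch of the dimensionality-reduction strategy using PCA from scikit-learn (one common choice; the slides do not prescribe a specific technique). The data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # 100 instances described by 10 attributes

# Project onto 3 components: a much smaller representation that keeps
# most of the variance of the original attributes.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                          # (100, 3)
print(pca.explained_variance_ratio_.sum())      # fraction of variance retained
```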
3. Modeling

The next stage of CRISP-DM is the modeling stage

This is the stage where automatic algorithms are used to extract


useful patterns from the data and to create models that encode
these patterns

Machine learning is the field of computer science that focuses


on the design of these algorithms.

In the modeling stage, we use a number of different ML


algorithms to train a number of different models on the data set.
3. Modeling…

A model is trained on a data set by running an ML algorithm on the

data set so as to identify useful patterns in the data and to return a


model that encodes these patterns. The figure below shows a typical supervised ML model.
3. Modeling: Data sets preparation for learning

 A standard machine learning technique is to divide the dataset into a training set and a test set.

 The training dataset is used for learning the parameters of the model in order to produce hypotheses.

 A training set is a set of problem instances (described as a set of properties and their values), together with a classification of the instance.

 The test dataset, which is never seen during the hypothesis-forming stage, is used to get a final, unbiased estimate of how well the model works.

 The test set evaluates the accuracy of the model/hypothesis in predicting the categorization of unseen examples.

 A set of instances and their classifications used to test the accuracy of a learned hypothesis.
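A hedged sketch of the train/test procedure described above, using scikit-learn; the dataset (Iris) and the model (a decision tree) are arbitrary choices for illustration, not something the slides prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set that the learner never sees while forming its hypothesis.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train (fit) the model on the training set only.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Evaluate on the unseen test set for an unbiased estimate of accuracy.
print(accuracy_score(y_test, model.predict(X_test)))
```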
4. Evaluation & Deployment

The last two stages of the CRISP-DM process, evaluation and


deployment, are focused on how the models fit the business and its
processes.

The tests run during the modeling stage are focused purely on the
accuracy of the models for the data set.

The evaluation phase involves assessing the models in the


broader context defined by the business needs.

Does a model meet the business objectives of the process?


4. Evaluation & Deployment

The main decision made during the evaluation phase is whether


any of the models should be deployed in the business or another
iteration of the CRISP-DM process is required to create adequate
models.

Assuming the evaluation process approves a model or models,


the project moves into the final stage of the process: deployment.

Deployment involves examining how to deploy the selected


models into the business environment.
Thanks!
