CH 2
Data set
Perspectives on Data
What is data?
Objects
A book, for example, is an object that might have attributes such as author,
title, topic, genre, publisher, date published, number of pages, edition, ISBN,
and so on.
Excluding the ID attribute, which is simply a label for each row and
hence is not useful for analysis, each book is described using six
attributes: title, author, year, cover, edition, and price.
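As a rough sketch of this idea, the snippet below represents a small book data set as a list of rows and drops the ID before analysis. The book values and the names `books`, `ANALYSIS_ATTRIBUTES`, and `drop_id` are invented for illustration and are not taken from the original table.

```python
# Hypothetical book records; every value here is made up for illustration.
books = [
    {"id": 1, "title": "Book A", "author": "Author X", "year": 2001,
     "cover": "paperback", "edition": 2, "price": 12.50},
    {"id": 2, "title": "Book B", "author": "Author Y", "year": 1995,
     "cover": "hardback", "edition": 1, "price": 8.00},
]

# The six attributes that describe each book. The ID is left out because
# it only labels a row and carries no information about the book itself.
ANALYSIS_ATTRIBUTES = ["title", "author", "year", "cover", "edition", "price"]


def drop_id(records):
    """Return copies of the records restricted to the analysis attributes."""
    return [{name: record[name] for name in ANALYSIS_ATTRIBUTES}
            for record in records]


analysis_rows = drop_id(books)
for row in analysis_rows:
    print(row)
```

Each dictionary plays the role of one row (one object), and each key plays the role of one attribute.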
Data set
We could have included many more attributes for each book, but,
as is typical of data science projects, we needed to make a choice
when we were designing the data set.
Data attributes
Data attributes are also called dimensions, features, or variables.
Types of attributes
There are many different types of attributes, and for each attribute type
different sorts of analysis are appropriate.
So understanding and recognizing different attribute types is a
fundamental skill for a data scientist. The standard types are numeric,
nominal, and ordinal.
1. Nominal Attribute
E.g. Hair color = {auburn, black, blond, brown, grey, red, white}
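To make the point about type-appropriate analysis concrete, here is a minimal sketch that summarizes one attribute of each standard type. The hair color values come from the example above; the shirt sizes and heights are assumed examples of ordinal and numeric attributes, and all sample values are invented.

```python
from statistics import mean, mode

# Nominal: categories with no natural order (values from the example set).
hair_color = ["brown", "black", "brown", "red", "blond"]

# Ordinal: categories with an order but no fixed distance between them.
# (Shirt size is an assumed example, not from the source.)
SIZE_ORDER = ["small", "medium", "large"]
shirt_size = ["small", "large", "medium", "medium", "small"]

# Numeric: quantities on which arithmetic is meaningful (assumed example).
height_cm = [172.0, 180.5, 165.0, 158.5, 190.0]

# Nominal values can only be compared for equality, so the most frequent
# value (the mode) is a sensible summary; a mean would be meaningless.
most_common_hair = mode(hair_color)

# Ordinal values can be ranked, so a median makes sense: convert to ranks,
# sort, and take the middle element (the sample has an odd length).
ranks = sorted(SIZE_ORDER.index(size) for size in shirt_size)
median_size = SIZE_ORDER[ranks[len(ranks) // 2]]

# Numeric values support arithmetic, so the mean is well defined.
mean_height = mean(height_cm)

print(most_common_hair, median_size, mean_height)
```

Averaging hair colors has no meaning, while averaging heights does, which is why the attribute type has to be recognized before choosing an analysis.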
Apart from the types of data defined above (numeric, nominal, and ordinal),
a number of other useful distinctions can be made regarding data. One such
distinction is between structured and unstructured data.
Structured data
Structured data are data that can be stored in a table, and every instance in the
table has the same structure (i.e., set of attributes).
Such data can be easily stored, organized, searched, reordered, and merged with
other structured data.
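The operations listed here can be sketched with two small tables that share a key. The tables, their values, and the choice of `isbn` as the shared key are assumptions made for illustration.

```python
# Two hypothetical structured tables; every row in a table has the same
# set of attributes. All values are invented.
books = [
    {"isbn": "A1", "title": "Book A", "year": 2001},
    {"isbn": "B2", "title": "Book B", "year": 1995},
]
prices = [
    {"isbn": "A1", "price": 12.50},
    {"isbn": "B2", "price": 8.00},
]

# Search: keep only the rows that satisfy a condition.
published_after_2000 = [book for book in books if book["year"] > 2000]

# Reorder: sort the rows by one attribute.
books_by_year = sorted(books, key=lambda book: book["year"])

# Merge: combine rows from both tables that share the same ISBN.
price_by_isbn = {row["isbn"]: row["price"] for row in prices}
merged = [{**book, "price": price_by_isbn[book["isbn"]]} for book in books]

print(published_after_2000)
print(books_by_year)
print(merged)
```

Because every row has the same attributes, each operation is a one-line expression; nothing comparable is available for free text or images until structure has been extracted from them.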
We can often extract structured data from unstructured data using
techniques from artificial intelligence (such as natural-language
processing and ML), digital signal processing, and computer vision.
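As a deliberately simplified stand-in for those techniques, the sketch below uses a regular expression to pull structured rows out of free text. A regular expression is far cruder than natural-language processing, and the sentence pattern, the sample text, and the names `PATTERN` and `rows` are assumptions for illustration only.

```python
import re

# Unstructured input: free text with no fixed set of attributes (invented).
text = (
    "Book A was published in 2001 and costs $12.50. "
    "Book B was published in 1995 and costs $8.00."
)

# Assumed sentence pattern: "<title> was published in <year> and costs $<price>".
PATTERN = re.compile(
    r"(Book \w+) was published in (\d{4}) and costs \$(\d+\.\d{2})"
)

# Each match becomes one structured row with the same three attributes.
rows = [
    {"title": title, "year": int(year), "price": float(price)}
    for title, year, price in PATTERN.findall(text)
]

for row in rows:
    print(row)
```

The result is a table-like list in which every row has the same attributes, which is what makes the data structured.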
Stage in Data Science project
Data Science pyramid
Stages in a Data Science project
1) Business understanding,
2) Data understanding,
3) Data preparation,
4) Modeling,
5) Evaluation, and
6) Deployment
Data Science project life cycle: Data
Data are at the center of all data science activities, and that is why
the CRISP-DM diagram has data at its center.
Because of the use of different sources, data that are fine on their
own may become problematic when we want to integrate them.
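One common way this happens is that two sources record the same attribute in different formats. The sketch below assumes two hypothetical sources that store a date differently; each is consistent on its own, but the values only become comparable after normalization. The source names, fields, and formats are invented for illustration.

```python
from datetime import datetime

# Source A stores dates as year-month-day; source B as day/month/year.
# Each source is internally consistent, but the raw strings do not match.
source_a = [{"customer": "c1", "joined": "2021-03-04"}]
source_b = [{"customer": "c2", "joined": "04/03/2021"}]


def normalize(records, date_format):
    """Return copies of the records with 'joined' parsed into a date object."""
    return [
        {**record,
         "joined": datetime.strptime(record["joined"], date_format).date()}
        for record in records
    ]


# Integrate the two sources only after both use the same representation.
combined = (normalize(source_a, "%Y-%m-%d")
            + normalize(source_b, "%d/%m/%Y"))

for record in combined:
    print(record["customer"], record["joined"].isoformat())
```

Without the normalization step, the two customers would appear to have joined on different dates even though both strings describe 4 March 2021.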
The tests run during the modeling stage are focused purely on the
accuracy of the models for the data set.
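As a minimal sketch of such a test, the function below computes classification accuracy as the fraction of predictions that match the known labels. The labels, the predictions, and the function name `accuracy` are assumptions for illustration; a real project may use other measures.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that equal the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty data set")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels from a data set and the outputs of some model.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]

model_accuracy = accuracy(actual, predicted)
print(f"accuracy = {model_accuracy:.2f}")
```

This measures only how well the model fits the data set at hand; it says nothing about whether the model meets the wider goals of the project.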