Chapter 2 Preparing To Model
STEPS IN DEVELOPING A MACHINE LEARNING APPLICATION
[Figure: workflow diagram running from Collect Data through to Periodic Revisit]
COLLECT DATA
Machine learning projects require a huge amount of data, because without data one cannot train ML/AI models. Collecting and preparing the dataset is one of the most crucial steps in creating an ML/AI project.
During development of an ML project, the developers rely completely on the datasets. In building ML applications, datasets are divided into two parts:
•Training dataset
•Test dataset
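The split into training and test sets can be sketched in plain Python; the 80/20 ratio, the random seed, and the integer stand-in data below are illustrative choices, not fixed rules:

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into a training and a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]  # (train, test)

data = list(range(100))          # stand-in for 100 labelled examples
train, test = train_test_split(data)
print(len(train), len(test))     # 80 20
```

The model is fitted on the training portion only; the held-out test portion is used to estimate how the model performs on unseen data.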
POPULAR SOURCES FOR MACHINE
LEARNING DATASETS
1. Kaggle Datasets (https://www.kaggle.com/datasets)
Kaggle is one of the best sources of datasets for data scientists and machine learning practitioners. It allows users to find, download, and publish datasets easily. It also provides the opportunity to collaborate with other machine learning engineers and solve difficult data science tasks.
2. Microsoft Datasets (https://msropendata.com/)
Microsoft has launched the "Microsoft Research Open Data" repository, a collection of free datasets in areas such as natural language processing, computer vision, and domain-specific sciences.
3. Awesome Public Dataset Collection (https://github.com/awesomedata/awesome-public-datasets)
The Awesome Public Datasets collection provides high-quality datasets arranged in a well-organized list by topic, such as Agriculture, Biology, Climate, Complex Networks, etc. Most of the datasets are free, but some may not be, so it is better to check the license before downloading a dataset.
4. Government Datasets:
•Indian Government dataset
•US Government Dataset
•Northern Ireland Public Sector Datasets
•European Union Open Data Portal
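Once downloaded from any of these sources, a dataset very often arrives as a CSV file. A minimal sketch of reading one with Python's standard `csv` module; the column names and values here are made up for illustration, and an in-memory string stands in for the downloaded file:

```python
import csv
import io

# A tiny in-memory stand-in for a downloaded dataset file
# (in practice you would open the CSV fetched from e.g. Kaggle).
raw = io.StringIO(
    "age,gender,outcome\n"
    "34,F,benign\n"
    "52,M,malignant\n"
)

# DictReader maps each row to a dict keyed by the header line.
reader = csv.DictReader(raw)
rows = list(reader)
print(rows[0]["age"], rows[1]["outcome"])   # 34 malignant
```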
DATA FORMATS
•Structured data
•Semi-structured data
•Unstructured data
STRUCTURED DATA
Structured data is data that is expected to conform to a predefined structure before being written to storage. This structure is often referred to as schema-on-write.
This type of data can be either human or machine generated. The simplest example of structured data is a spreadsheet created by an analyst (human generated). Machine-generated structured data includes weblogs, or data entries created upon events (e.g., when a product is purchased by a customer, a new sale entry is created containing the price, quantity, and potentially many more fields related to that specific purchase).
Other examples of systems generating structured data include reservation or
inventory control systems.
Structured data usually resides in Relational Database Management Systems (RDBMS). A relational database usually consists of many tables, where each table has a predefined schema that every record must satisfy. For example, each field is associated with an expected data type (e.g., a name field is expected to be a string of a certain length). Queries can then be executed over the stored data to retrieve records matching specified conditions.
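Schema-on-write can be demonstrated with SQLite, which ships in Python's standard library; the `sales` table and its columns below are invented for illustration:

```python
import sqlite3

# Schema-on-write: the table structure (and each column's type) is
# declared before any record is stored, and every record must satisfy it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "  id       INTEGER PRIMARY KEY,"
    "  product  TEXT NOT NULL,"
    "  price    REAL NOT NULL,"
    "  quantity INTEGER NOT NULL)"
)
conn.execute("INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)",
             ("notebook", 2.50, 4))
conn.execute("INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)",
             ("pen", 1.20, 10))

# Queries retrieve the records that match the specified conditions.
rows = conn.execute(
    "SELECT product, quantity FROM sales WHERE quantity > 5"
).fetchall()
print(rows)   # [('pen', 10)]
```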
SEMI-STRUCTURED DATA
•Semi-structured data refers to data that is technically unstructured but also contains some form of metadata that enables users to determine a partial structure or hierarchy.
•Semi-structured data is information that does not fit the structured (relational database) model but still has some structure to it.
•Semi-structured data includes documents held in JavaScript Object Notation (JSON) format, as well as key-value stores and graph databases.
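A small sketch of semi-structured JSON: the two records below (names and fields invented) share only part of their structure, so fields have to be probed at read time rather than guaranteed by a schema:

```python
import json

# Two records with partially overlapping structure: no fixed schema,
# but the key names act as metadata describing each field.
doc = '''
[
  {"name": "Asha", "age": 29, "address": {"city": "Pune"}},
  {"name": "Ben", "email": "ben@example.com"}
]
'''
people = json.loads(doc)

# Not every record has an address, so access must tolerate absence.
cities = [p.get("address", {}).get("city") for p in people]
print(cities)   # ['Pune', None]
```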
UNSTRUCTURED DATA
Unstructured data refers to data that cannot be stored in relational databases because it lacks a predefined data model. Such data is not processed until the moment it is actually used; this concept is known as schema-on-read.
Types of data considered unstructured include video and audio files, text, websites, presentations, data collected from various sensors, and even satellite imagery.
Unstructured information is typically text-heavy, but it may also contain data such as numbers, dates, and facts.
FOR MORE DETAIL ON DATA FORMATS
Visit:
https://k21academy.com/microsoft-azure/dp-900/structured-data-vs-unstructured-data-vs-semi-structured-data/
https://levelup.gitconnected.com/structured-unstructured-semistructured-data-259b869725ca
DIKW PYRAMID
DATA
This is the bottom level of the DIKW pyramid: the raw data collected from the various events, records, and transactions around you. The data could be generated by machines or by humans.
The data itself does not have much value until it is enriched with additional attributes that can be used for further analysis.
For example, you could have a dataset listing millions of people with their demographic details and how they died. On its own, this data may not give you anything actionable.
INFORMATION
At the next level, the collected data is enriched with context to give information. You start to build a perception of the data, which gives you hindsight.
This hindsight reflects or acknowledges what is contained in the data.
For example, you could begin to see that the list of people is actually a list of cancer patients and their life patterns.
KNOWLEDGE
When you add meaning to the information, you start to gain knowledge. This is where you analyse the information at hand and make it more useful and meaningful. You can gain deep insights into the information and answer high-level questions.
For example, you can analyse the information on cancer patients and build patterns of life expectancy after cancer detection with or without chemotherapy. You could further analyse the effect of various chemotherapy medicines to understand their dosage and effectiveness. A company could then invest in building more effective chemotherapy medicines to improve the life expectancy of cancer patients after diagnosis. Note that the objectives of data analysis must be clear in order to derive meaningful knowledge from the information at hand.
WISDOM
The final level, wisdom, is achieved when you add understanding to the derived knowledge. Note that wisdom is not achieved through a technical algorithm or formula; it rests on human understanding of the data analysis that was carried out.
For example, after analysing data on cancer patients, you could understand what lifestyle to follow in terms of diet, sleep, and exercise to avoid or delay the occurrence of cancer. This understanding could be shared with the world as foresight.
CATEGORIES OF DATA
ANALYTICS
TYPES OF DATA IN MACHINE
LEARNING
Types of Data
•Qualitative (Categorical): Nominal Data, Ordinal Data
•Quantitative (Numerical): Discrete Data, Continuous Data
QUALITATIVE OR
CATEGORICAL DATA
Qualitative data, also known as categorical data, describes data that fits into categories. Qualitative data is not numerical. Categorical information involves categorical variables that describe features such as a person's gender, home town, etc. Categorical measures are defined in terms of natural-language specifications, not in terms of numbers.
For example, the quality of students' performance expressed as 'Good', 'Average', or 'Poor' falls under qualitative data. Likewise, the names or roll numbers of students are information that cannot be measured on a scale of measurement.
They are divided into two types: Nominal data and Ordinal data.
1. Nominal Data:
Nominal data is a type of qualitative information that labels variables without providing any numerical value. Nominal data is also called the nominal scale. It cannot be ordered or measured.
Examples of nominal data are letters, symbols, words, gender, etc.
Nominal data is examined using the grouping method: the data is grouped into categories, and then the frequency or percentage of each category is calculated. Such data is often visualized using pie charts.
Examples of nominal data are
1. Blood group: A, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female, Other
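The grouping method described above can be sketched with `collections.Counter`; the blood-group values below are illustrative:

```python
from collections import Counter

blood_groups = ["A", "O", "B", "O", "AB", "O", "A", "B"]

# Group the nominal values into categories and count each one.
counts = Counter(blood_groups)
total = len(blood_groups)
percentages = {g: 100 * c / total for g, c in counts.items()}

print(counts["O"])          # 3
print(percentages["A"])     # 25.0
```

These per-category frequencies and percentages are exactly what a pie chart of nominal data would display.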
2. Ordinal Data:
Ordinal data is a type of data that follows a natural order. The order of the qualitative information matters more than the difference between the categories. This kind of variable is mostly found in surveys, finance, economics, questionnaires, and so on.
Ordinal data is commonly represented using a bar chart, and it can be investigated and interpreted through many visualization tools.
Ordinal data also assigns named values to attributes, but unlike nominal data, the values can be arranged in a sequence of increasing or decreasing value, so we can say whether one value is better than or greater than another. Examples of ordinal data are
1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
2. Grades: A, B, C, etc.
3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc
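Because ordinal values carry an order, they can be mapped to integers that preserve it; the satisfaction mapping below is an illustrative choice, not a standard encoding:

```python
# Map each ordinal category to an integer that preserves its rank.
satisfaction_order = {"Unhappy": 0, "Happy": 1, "Very Happy": 2}

responses = ["Happy", "Very Happy", "Unhappy", "Happy"]
encoded = [satisfaction_order[r] for r in responses]
print(encoded)                       # [1, 2, 0, 1]

# Since the encoding preserves order, comparisons are meaningful:
print(encoded[1] > encoded[2])       # True: 'Very Happy' ranks above 'Unhappy'
```

No such order-preserving mapping exists for nominal data, which is exactly the distinction between the two types.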
QUANTITATIVE OR
NUMERICAL DATA
Quantitative data, also known as numerical data, represents numerical values (i.e., how much, how often, how many). Numerical data gives information about the quantities of a specific thing. Some examples of numerical data are height, length, size, weight, and so on.
They are divided into two types: Discrete data and Continuous data.
1. Discrete Data:
Discrete data can take only discrete values: a finite, countable number of possible values that cannot be meaningfully subdivided. Things here can be counted in whole numbers.
Example: the number of students in a class.
2. Continuous data:
Continuous data is data that can be measured. It can take an infinite number of possible values within a given range.
Example: my brother is 114 inches tall; the elephant weighs 3 tons; the temperature of the sun's surface is about 10,000 degrees Fahrenheit.
COMMON DATA QUALITY
ISSUES
The success of machine learning depends largely on the quality of the data. Data of the right quality helps achieve better prediction accuracy in the case of supervised learning. However, it is not realistic to expect the data to be flawless.
We have already come across at least two types of problems:
1. Data elements without a value, i.e., with missing values.
2. Data elements whose values are surprisingly different from the other elements, which we term outliers.
Multiple factors lead to these data quality issues. The following are some of them:
1. Data that is fit for use: if you are working on a cancer project, you need data that covers cancer patients and their health details.
2. Data that meets your analytics requirements: if you are trying to relate consumption of meat to the probability of having cancer, you would need the eating habits of the patients in the dataset.
3. Relevance and timeliness: if you are building a model for the 21st century, you cannot use a dataset from the 18th century.
4. Completeness, correctness, and formatting of data: check whether the important fields of the dataset are fully populated and only a few rows have missing values, and check that after feature extraction the dataset still has enough data to build the model.
5. Data integrity: is the data biased, or was it not collected through the right sampling method? If biased data is used, you may develop an incorrect model.
6. Other types of errors:
(a) Spelling mistakes: there may be spelling mistakes in the names of countries, people, things, etc. There may also be inconsistent abbreviations: US, United States of America, and America all refer to the same country.
(b) Date formatting: the Asian format is dd-mm-yyyy, while the American format is mm-dd-yyyy.
(c) Incorrect labels: sometimes age is labelled as year born, so the machine may interpret an age of 56 as a person born in the year 1956.
(d) Scaling and units: sometimes the units are inconsistent; the weight of a person may be recorded in kilograms or in pounds.
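The date-format ambiguity can be seen by parsing the same string under both conventions with Python's `datetime`:

```python
from datetime import datetime

raw = "04-05-2021"   # ambiguous: 4 May or April 5?

as_ddmmyyyy = datetime.strptime(raw, "%d-%m-%Y")   # Asian convention
as_mmddyyyy = datetime.strptime(raw, "%m-%d-%Y")   # American convention

print(as_ddmmyyyy.month)   # 5  (4 May 2021)
print(as_mmddyyyy.month)   # 4  (April 5, 2021)
```

The same raw string yields two different dates, which is why the source format of every date column must be known before the data is used.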
REMEDIATING (FIXING) DATA QUALITY ISSUES
The data quality issues mentioned above need to be remediated if the learning activity is to be efficient.
Some of the common measures to clean the data are:
1. Delete rows with missing values.
2. Fix any formatting issues.
3. Fix labelling issues.
4. Fix Spelling Mistakes and abbreviations.
5. Insert new columns based on other columns.
6. Delete rows with skewed values.
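A few of these measures can be sketched in plain Python; the rows, field names, and alias table below are invented for illustration:

```python
# A minimal cleaning pass over rows stored as dictionaries.
rows = [
    {"name": "Asha",  "country": "India", "age": 29},
    {"name": "Ben",   "country": None,    "age": 41},   # missing value
    {"name": "chris", "country": "US",    "age": 35},   # abbreviation, casing
]

# Measure 1: delete rows with missing values.
rows = [r for r in rows if all(v is not None for v in r.values())]

# Measure 4: fix abbreviations so each country is spelt a single way.
aliases = {"US": "United States", "America": "United States"}
for r in rows:
    r["country"] = aliases.get(r["country"], r["country"])
    r["name"] = r["name"].title()    # measure 2: fix name formatting

print([r["country"] for r in rows])   # ['India', 'United States']
```

Real projects would apply such passes with a data-frame library, but the logic is the same.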
DATA PRE-PROCESSING (DIMENSIONALITY REDUCTION TECHNIQUES)
Definition: Dimensionality reduction techniques reduce the number of dimensions, keeping only the important dimensions of the data and discarding the rest.
Example: your dataset could have hundreds of features (or dimensions). Practically, not all dimensions are equally important for the analysis or classification of the data. With hundreds of dimensions, it also becomes computationally intensive, and visually difficult, to work out which dimensions have the most influence on the dataset.
In most learning algorithms, the complexity depends on the number of input dimensions, d, as well as on the size of the data sample, N. As you increase the number of dimensions, you also need to collect an increasing number of samples to support them (in order to ensure that every combination of features is well represented in the dataset). As the number of dimensions increases, working with the data becomes increasingly harder. This problem is often referred to as "the curse of dimensionality".
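One crude illustration of discarding unimportant dimensions (simpler than techniques such as PCA) is to drop near-constant, low-variance features; the sample values and the variance threshold below are invented for illustration:

```python
import statistics

# Each row is one sample; each column is one dimension/feature.
samples = [
    [1.0, 5.2, 0.0],
    [2.0, 4.9, 0.0],
    [3.0, 5.1, 0.0],
    [4.0, 5.0, 0.0],
]

columns = list(zip(*samples))
variances = [statistics.pvariance(col) for col in columns]

# Keep only dimensions whose variance exceeds a threshold; near-constant
# columns (like the third one) carry almost no information.
threshold = 0.01
keep = [i for i, v in enumerate(variances) if v > threshold]
reduced = [[row[i] for i in keep] for row in samples]

print(keep)           # [0, 1]
print(reduced[0])     # [1.0, 5.2]
```

The dataset shrinks from three dimensions to two while losing essentially nothing, which is the goal of any dimensionality reduction technique.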
Let's take a simple example: