Chapter 2 Preparing To Model
STEPS IN DEVELOPING A MACHINE LEARNING APPLICATION
[Figure: workflow diagram running from Collect Data through to Periodic Revisit]
COLLECT DATA
Machine learning projects require a huge amount of data, because without data one cannot train ML/AI models. Collecting and preparing the dataset is one of the most crucial steps in creating an ML/AI project.
During development of an ML project, the developers rely completely on the datasets. In building ML applications, datasets are divided into two parts:
•Training dataset
•Test dataset
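The split into training and test sets can be sketched in plain Python; the 80/20 ratio, the random seed, and the integer stand-in data below are illustrative choices, not fixed rules:

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into a training and a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]  # (train, test)

data = list(range(100))          # stand-in for 100 labelled examples
train, test = train_test_split(data)
print(len(train), len(test))     # 80 20
```

The model is fitted on the training portion only; the held-out test portion is used to estimate how the model performs on unseen data.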
POPULAR SOURCES FOR MACHINE
LEARNING DATASETS
1. Kaggle Datasets (https://www.kaggle.com/datasets)
Kaggle is one of the best sources of datasets for data scientists and machine learning practitioners. It allows users to find, download, and publish datasets easily. It also provides the opportunity to collaborate with other machine learning engineers and solve difficult data science tasks.
2. Microsoft Datasets (https://msropendata.com/)
Microsoft has launched the "Microsoft Research Open Data" repository, a collection of free datasets in areas such as natural language processing, computer vision, and domain-specific sciences.
3. Awesome Public Dataset Collection (https://github.com/awesomedata/awesome-public-datasets)
The Awesome Public Datasets collection provides high-quality datasets arranged in a well-organized list by topic, such as Agriculture, Biology, Climate, Complex Networks, etc. Most of the datasets are free, but some may not be, so it is better to check the license before downloading a dataset.
4. Government Datasets:
•Indian Government dataset
•US Government Dataset
•Northern Ireland Public Sector Datasets
•European Union Open Data Portal
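Once downloaded from any of these sources, a dataset very often arrives as a CSV file. A minimal sketch of reading one with Python's standard `csv` module; the column names and values here are made up for illustration, and an in-memory string stands in for the downloaded file:

```python
import csv
import io

# A tiny in-memory stand-in for a downloaded dataset file
# (in practice you would open the CSV fetched from e.g. Kaggle).
raw = io.StringIO(
    "age,gender,outcome\n"
    "34,F,benign\n"
    "52,M,malignant\n"
)

# DictReader maps each row to a dict keyed by the header line.
reader = csv.DictReader(raw)
rows = list(reader)
print(rows[0]["age"], rows[1]["outcome"])   # 34 malignant
```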
DATA FORMATS
•Structured data
•Semi-structured data
•Unstructured data
STRUCTURED DATA
Structured data is data that is expected to conform to a predefined structure before being written to storage. This structure is often referred to as schema-on-write.
This type of data can be either human or machine generated. The simplest example of structured data is a spreadsheet created by an analyst (human generated). Machine-generated structured data includes weblogs, or data entries created upon events (e.g., when a product is purchased by a customer, a new sale entry is created containing the price, quantity, and potentially many more fields related to that specific purchase).
Other examples of systems generating structured data include reservation or
inventory control systems.
Structured data usually resides in Relational Database Management Systems (RDBMS). A relational database usually consists of many tables, where each table has a predefined schema that every record must satisfy. For example, each field is associated with an expected data type (e.g., a name field is expected to be a string of a certain length). Queries can then be executed over the stored data to retrieve records matching specified conditions.
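Schema-on-write can be demonstrated with SQLite, which ships in Python's standard library; the `sales` table and its columns below are invented for illustration:

```python
import sqlite3

# Schema-on-write: the table structure (and each column's type) is
# declared before any record is stored, and every record must satisfy it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "  id       INTEGER PRIMARY KEY,"
    "  product  TEXT NOT NULL,"
    "  price    REAL NOT NULL,"
    "  quantity INTEGER NOT NULL)"
)
conn.execute("INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)",
             ("notebook", 2.50, 4))
conn.execute("INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)",
             ("pen", 1.20, 10))

# Queries retrieve the records that match the specified conditions.
rows = conn.execute(
    "SELECT product, quantity FROM sales WHERE quantity > 5"
).fetchall()
print(rows)   # [('pen', 10)]
```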
SEMI-STRUCTURED DATA
•Semi-structured data refers to data that is technically unstructured but also contains some form of metadata that enables users to determine a partial structure or hierarchy.
•Semi-structured data is information that does not fit the structured (relational database) model but still has some structure to it.
•Semi-structured data includes documents held in JavaScript Object Notation (JSON) format, as well as key-value stores and graph databases.
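A small sketch of semi-structured JSON: the two records below (names and fields invented) share only part of their structure, so fields have to be probed at read time rather than guaranteed by a schema:

```python
import json

# Two records with partially overlapping structure: no fixed schema,
# but the key names act as metadata describing each field.
doc = '''
[
  {"name": "Asha", "age": 29, "address": {"city": "Pune"}},
  {"name": "Ben", "email": "ben@example.com"}
]
'''
people = json.loads(doc)

# Not every record has an address, so access must tolerate absence.
cities = [p.get("address", {}).get("city") for p in people]
print(cities)   # ['Pune', None]
```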
UNSTRUCTURED DATA
Unstructured data refers to data that cannot be stored in relational databases because it lacks a predefined data model. Such data is not processed until the moment it is actually used; this concept is known as schema-on-read.
Types of data considered unstructured include video and audio files, text, websites, presentations, data collected from various sensors, and even satellite imagery.
Unstructured information is typically text-heavy, but it may also contain data such as numbers, dates, and facts.
FOR MORE DETAIL ON DATA FORMATS
Visit:
https://k21academy.com/microsoft-azure/dp-900/structured-data-vs-unstructured-data-vs-semi-structured-data/
https://levelup.gitconnected.com/structured-unstructured-semistructured-data-259b869725ca
DIKW PYRAMID
DATA
This is the bottom level of the DIKW pyramid: the raw data collected from the various events, records, and transactions around you. The data could be generated by machines or by humans.
The data itself does not have much value until it is enriched with additional attributes that can be used for further analysis.
For example, you could have a dataset listing millions of people with their demographic details and how they died. On its own, this data may not give you anything actionable.
INFORMATION
At the next level, the collected data is enriched with context to give information. You start to build a perception of the data, which gives you hindsight.
This hindsight reflects or acknowledges what is contained in the data.
For example, you could begin to see that the list of people is actually a list of cancer patients and their life patterns.
KNOWLEDGE
When you add meaning to the information, you start to gain knowledge. This is where you analyse the information at hand and make it more useful and meaningful. You can gain deep insights into the information and answer high-level questions.
For example, you can analyse the information on cancer patients and build patterns of life expectancy after cancer detection with or without chemotherapy. You could further analyse the effect of various chemotherapy medicines to understand their dosage and effectiveness. A company could then invest in building more effective chemotherapy medicines to improve the life expectancy of cancer patients after diagnosis. Note that the objectives of data analysis must be clear in order to derive meaningful knowledge from the information at hand.
WISDOM
The final level, wisdom, is achieved when you add understanding to the derived knowledge. Note that wisdom is not achieved through a technical algorithm or formula; it rests on human understanding of the data analysis that was carried out.
For example, after analysing data on cancer patients, you could understand what lifestyle to follow in terms of diet, sleep, and exercise to avoid or delay the occurrence of cancer. This understanding could be shared with the world as foresight.
CATEGORIES OF DATA
ANALYTICS
TYPES OF DATA IN MACHINE
LEARNING
Types of Data
•Qualitative (Categorical): Nominal Data, Ordinal Data
•Quantitative (Numerical): Discrete Data, Continuous Data
QUALITATIVE OR
CATEGORICAL DATA
Qualitative data, also known as categorical data, describes data that fits into categories. Qualitative data is not numerical. Categorical information involves categorical variables that describe features such as a person's gender, home town, etc. Categorical measures are defined in terms of natural-language specifications, not in terms of numbers.
For example, the quality of students' performance expressed as 'Good', 'Average', or 'Poor' falls under qualitative data. Likewise, the names or roll numbers of students are information that cannot be measured on a scale of measurement.
They are divided into two types: Nominal data and Ordinal data.
1. Nominal Data:
Nominal data is a type of qualitative information that labels variables without providing any numerical value. Nominal data is also called the nominal scale. It cannot be ordered or measured.
Examples of nominal data are letters, symbols, words, gender, etc.
Nominal data is examined using the grouping method: the data is grouped into categories, and then the frequency or percentage of each category is calculated. Such data is often visualized using pie charts.
Examples of nominal data are
1. Blood group: A, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female, Other
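The grouping method described above can be sketched with `collections.Counter`; the blood-group values below are illustrative:

```python
from collections import Counter

blood_groups = ["A", "O", "B", "O", "AB", "O", "A", "B"]

# Group the nominal values into categories and count each one.
counts = Counter(blood_groups)
total = len(blood_groups)
percentages = {g: 100 * c / total for g, c in counts.items()}

print(counts["O"])          # 3
print(percentages["A"])     # 25.0
```

These per-category frequencies and percentages are exactly what a pie chart of nominal data would display.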
2. Ordinal Data:
Ordinal data is a type of data that follows a natural order. The order of the qualitative information matters more than the difference between the categories. This kind of variable is mostly found in surveys, finance, economics, questionnaires, and so on.
Ordinal data is commonly represented using a bar chart, and it can be investigated and interpreted through many visualization tools.
Ordinal data also assigns named values to attributes, but unlike nominal data, the values can be arranged in a sequence of increasing or decreasing value, so we can say whether one value is better than or greater than another. Examples of ordinal data are
1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
2. Grades: A, B, C, etc.
3. Hardness of Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc
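Because ordinal values carry an order, they can be mapped to integers that preserve it; the satisfaction mapping below is an illustrative choice, not a standard encoding:

```python
# Map each ordinal category to an integer that preserves its rank.
satisfaction_order = {"Unhappy": 0, "Happy": 1, "Very Happy": 2}

responses = ["Happy", "Very Happy", "Unhappy", "Happy"]
encoded = [satisfaction_order[r] for r in responses]
print(encoded)                       # [1, 2, 0, 1]

# Since the encoding preserves order, comparisons are meaningful:
print(encoded[1] > encoded[2])       # True: 'Very Happy' ranks above 'Unhappy'
```

No such order-preserving mapping exists for nominal data, which is exactly the distinction between the two types.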
QUANTITATIVE OR
NUMERICAL DATA
Quantitative data, also known as numerical data, represents numerical values (i.e., how much, how often, how many). Numerical data gives information about the quantities of a specific thing. Some examples of numerical data are height, length, size, weight, and so on.
They are divided into two types: Discrete data and Continuous data.
1. Discrete Data:
Discrete data can take only discrete values: a finite, countable number of possible values that cannot be meaningfully subdivided. Things here can be counted in whole numbers.
Example: the number of students in a class.
2. Continuous data:
Continuous data is data that can be measured. It can take an infinite number of possible values within a given range.
Example: my brother is 114 inches tall; the elephant weighs 3 tons; the temperature of the sun's surface is about 10,000 degrees Fahrenheit.
COMMON DATA QUALITY
ISSUES
The success of machine learning depends largely on the quality of the data. Data of the right quality helps achieve better prediction accuracy in the case of supervised learning. However, it is not realistic to expect the data to be flawless.
We have already come across at least two types of problems:
1. Data elements without a value, i.e., with missing values.
2. Data elements whose values are surprisingly different from the other elements, which we term outliers.
Multiple factors lead to these data quality issues. The following are some of them:
1. Data that is fit for use: if you are working on a cancer project, you need data that covers cancer patients and their health details.
2. Data that meets your analytics requirements: if you are trying to relate consumption of meat to the probability of having cancer, you would need the eating habits of the patients in the dataset.
3. Relevance and timeliness: if you are building a model for the 21st century, you cannot use a dataset from the 18th century.
4. Completeness, correctness, and formatting of data: check whether the important fields of the dataset are fully populated and only a few rows have missing values, and check that after feature extraction the dataset still has enough data to build the model.
5. Data integrity: is the data biased, or was it not collected through the right sampling method? If biased data is used, you may develop an incorrect model.
6. Other types of errors:
(a) Spelling mistakes: there may be spelling mistakes in the names of countries, people, things, etc. There may also be inconsistent abbreviations: US, United States of America, and America all refer to the same country.
(b) Date formatting: the Asian format is dd-mm-yyyy, while the American format is mm-dd-yyyy.
(c) Incorrect labels: sometimes age is labelled as year born, so the machine may interpret an age of 56 as a person born in the year 1956.
(d) Scaling and units: sometimes the units are inconsistent; the weight of a person may be recorded in kilograms or in pounds.
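The date-format ambiguity can be seen by parsing the same string under both conventions with Python's `datetime`:

```python
from datetime import datetime

raw = "04-05-2021"   # ambiguous: 4 May or April 5?

as_ddmmyyyy = datetime.strptime(raw, "%d-%m-%Y")   # Asian convention
as_mmddyyyy = datetime.strptime(raw, "%m-%d-%Y")   # American convention

print(as_ddmmyyyy.month)   # 5  (4 May 2021)
print(as_mmddyyyy.month)   # 4  (April 5, 2021)
```

The same raw string yields two different dates, which is why the source format of every date column must be known before the data is used.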
REMEDIATING (FIXING) DATA QUALITY ISSUES
The data quality issues mentioned above need to be remediated if the learning activity is to be efficient.
Some of the common measures to clean the data are:
1. Delete rows with missing values.
2. Fix any formatting issues.
3. Fix labelling issues.
4. Fix Spelling Mistakes and abbreviations.
5. Insert new columns based on other columns.
6. Delete rows with skewed values.
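A few of these measures can be sketched in plain Python; the rows, field names, and alias table below are invented for illustration:

```python
# A minimal cleaning pass over rows stored as dictionaries.
rows = [
    {"name": "Asha",  "country": "India", "age": 29},
    {"name": "Ben",   "country": None,    "age": 41},   # missing value
    {"name": "chris", "country": "US",    "age": 35},   # abbreviation, casing
]

# Measure 1: delete rows with missing values.
rows = [r for r in rows if all(v is not None for v in r.values())]

# Measure 4: fix abbreviations so each country is spelt a single way.
aliases = {"US": "United States", "America": "United States"}
for r in rows:
    r["country"] = aliases.get(r["country"], r["country"])
    r["name"] = r["name"].title()    # measure 2: fix name formatting

print([r["country"] for r in rows])   # ['India', 'United States']
```

Real projects would apply such passes with a data-frame library, but the logic is the same.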
DATA PRE-PROCESSING (DIMENSIONALITY REDUCTION TECHNIQUES)
Definition: Dimensionality reduction techniques reduce the number of dimensions, keeping only the important dimensions of the data and discarding the rest.
Example: your dataset could have hundreds of features (or dimensions). Practically, not all dimensions are equally important for the analysis or classification of the data. With hundreds of dimensions, it also becomes computationally intensive, and visually difficult, to work out which dimensions have the most influence on the dataset.
In most learning algorithms, the complexity depends on the number of input dimensions, d, as well as on the size of the data sample, N. As you increase the number of dimensions, you also need to collect an increasing number of samples to support them (in order to ensure that every combination of features is well represented in the dataset). As the number of dimensions increases, working with the data becomes increasingly harder. This problem is often referred to as "the curse of dimensionality".
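One crude illustration of discarding unimportant dimensions (simpler than techniques such as PCA) is to drop near-constant, low-variance features; the sample values and the variance threshold below are invented for illustration:

```python
import statistics

# Each row is one sample; each column is one dimension/feature.
samples = [
    [1.0, 5.2, 0.0],
    [2.0, 4.9, 0.0],
    [3.0, 5.1, 0.0],
    [4.0, 5.0, 0.0],
]

columns = list(zip(*samples))
variances = [statistics.pvariance(col) for col in columns]

# Keep only dimensions whose variance exceeds a threshold; near-constant
# columns (like the third one) carry almost no information.
threshold = 0.01
keep = [i for i, v in enumerate(variances) if v > threshold]
reduced = [[row[i] for i in keep] for row in samples]

print(keep)           # [0, 1]
print(reduced[0])     # [1.0, 5.2]
```

The dataset shrinks from three dimensions to two while losing essentially nothing, which is the goal of any dimensionality reduction technique.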
Let's take a simple example: