
Q.1 Take a real-life dataset. Describe the characteristics of the dataset.

Dataset: ..\Downloads\Compressed\archive\WHO COVID-19 global table data August 11th 2021 at 10.41.34 AM.csv

The dataset has the following characteristics:


Context

Worldwide data on covid-19 cases and the stats related to the pandemic

Content

A 239 × 12 table containing, for every country, statistics on newly reported and cumulative cases and deaths.

Acknowledgements

https://covid19.who.int/info/

Details By Content
1. Country Name
2. WHO Region
3. Cases - cumulative total
4. Cases - cumulative total per 100000 population
5. Cases - newly reported in last 7 days
6. Cases - newly reported in last 7 days per 100000 population
7. Cases - newly reported in last 24 hours
8. Deaths - cumulative total
9. Deaths - cumulative total per 100000 population
10. Deaths - newly reported in last 7 days
11. Deaths - newly reported in last 7 days per 100000 population
12. Deaths - newly reported in last 24 hours
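
One simple way to verify these characteristics is to load the CSV with pandas. A minimal sketch, assuming the file has been renamed to a shorter illustrative name (adjust the path to the actual export):

```python
import pandas as pd

# Load the WHO COVID-19 global table export (file name shortened for readability).
df = pd.read_csv("WHO-COVID-19-global-table-data.csv")

# Basic characteristics of the dataset.
print(df.shape)         # expected (239, 12): 239 rows, 12 columns
print(df.columns)       # the 12 attributes listed above
print(df.dtypes)        # data type of each attribute
print(df.describe())    # summary statistics for the numeric columns
print(df.isna().sum())  # missing values per column
```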

Q.2 Explain different steps of data preprocessing.

Steps of data preprocessing:

Any database is a collection of data objects. You can also call them data samples, events,
observations, or records. Each of them is described by a set of characteristics, which in
data science lingo are called attributes or features.
Data preprocessing is a necessary step before building a model with these features.

1. Data Cleaning
i. Missing Data
a. Ignore the Tuple
b. Fill in the Missing Values with the Mean or Median

ii. Noisy Data
a) Binning Method
b) Regression
c) Clustering

2. Data Transformation
i. Normalization
ii. Attribute Selection
iii. Discretization
iv. Concept Hierarchy Generation
v. Generalization
3. Data Reduction
i. Data Cube Aggregation
ii. Attribute Subset Selection
iii. Numerosity Reduction
iv. Dimensionality Reduction

Preprocessing usually happens in stages. Let us have a closer look at each of them.

 Data quality assessment
 Data cleaning
 Data transformation
 Data reduction

Data quality assessment

First of all, you need to have a good look at your database and perform a data quality
assessment. A random collection of data often has irrelevant bits. Here are some examples.

Mismatching in data types

Quite often, you might mix together datasets that use different data formats. Hence the
mismatches: integer vs. float, or UTF-8 vs. ASCII.

Different dimensions of data arrays


When you aggregate data from different datasets, for example, from five different arrays of
data for voice recognition, three fields that are present in one of them can be missing in the
four other arrays.

Mixture of data values


Let’s imagine that you have data, collected from two independent sources. As a result, the
gender field has two different values for women: woman and female.

To clean this dataset, you have to make sure that the same name is used as the descriptor
within the dataset (it can be female in our case).
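
A minimal sketch of that clean-up in pandas, assuming a small illustrative gender column:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["female", "woman", "female", "woman", "male"]})

# Map every descriptor used for women to the single value "female".
df["gender"] = df["gender"].replace({"woman": "female"})

print(df["gender"].value_counts())  # female: 4, male: 1
```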

Outliers in the dataset


Within 200 years of daily temperature observations for New York, there were several days
with very low temperatures in summer.

Outliers are very dangerous. They can strongly influence the output of a machine learning
model. Usually, the researchers evaluate the outliers to identify whether each particular
record is the result of an error in the data collection or a unique phenomenon which should
be taken into consideration for data processing.

Missing data
You may also notice that some important values are missing. These problems arise due to
the human factor, program errors, or other reasons. They will affect the accuracy of the
predictions, so before going any further with your database, you need to do data cleaning.

Why do we need to preprocess data?

By preprocessing data, we:


 Make our database more accurate. We eliminate incorrect or missing values
that appear as a result of human error or bugs.
 Boost consistency. When there are inconsistencies in data or duplicates, it affects
the accuracy of the results.
 Make the database more complete. We can fill in the attributes that are missing if
needed.
 Smooth the data. This way we make it easier to use and interpret.

Data cleaning

The goal of data cleaning is to provide simple, complete, and clear sets of examples for
machine learning.

Missing data

The situation when you have missing data in your dataset is quite common. In this case, you
look for additional datasets or collect more observations.

When you concatenate two or more datasets into one database to get a bigger training set,
some data field mismatches are quite common.

When not all the fields are present in every dataset being joined, it is better to delete such
fields before merging.

What to do: if more than 50% of the values are missing for any row or column, delete that
whole row or column unless it is possible to fill in the missing values.

Imagine you make a database of Haskell lovers. The values for the gender column are
missing for several records: Nik, Jane, Julia, and Helen. In this case, the researcher can add
the missing data based on their conclusions. However, this method has flaws, and the model
has to bear the risk of being inaccurate.
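
A hedged sketch of the 50% rule and mean imputation with pandas; the column names and values are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 44, np.nan, 52],
    "salary": [np.nan, np.nan, np.nan, 70000, np.nan, 55000],  # >50% missing
})

# Drop columns where more than 50% of the values are missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# Fill the remaining gaps with the column mean (the median works the same way).
df["age"] = df["age"].fillna(df["age"].mean())

print(df)
```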

Noisy data
A large amount of additional meaningless data is called noise.

This can be:

 duplicates or semi-duplicates of the data records;


 data segments that have no value for the particular research task;
 unnecessary information fields for each of the variables.

An example is when you only need to know whether a person speaks English or not, but you
get a whole set of features, including their eye color, shoe size, pulse and blood
pressure, etc.

You can apply one of the following methods to solve this problem:
 Binning. Use binning if you have a pool of sorted data. Divide all the data into
smaller segments of the same size and apply your dataset preparation methods
separately on each segment. For example, you can bin the values for Age into
categories such as 21-35, 36-59, and 60-79 (see the sketch after this list).
 Regression. Regression analysis helps to decide what variables do indeed have an
impact. Apply regression analysis to smooth large volumes of data. This will allow
you to only work with the key features instead of trying to analyze an overwhelming
number of variables. In our post about regression, you can learn more about how to
conduct a regression analysis step-by-step.
 Clustering. Finally, you can apply clustering algorithms to group the data. Here you
need to be careful with the outliers.
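
A small sketch of the binning step from the list above, using pandas and the age ranges mentioned there; the ages themselves are made up:

```python
import pandas as pd

ages = pd.Series([23, 29, 35, 41, 58, 63, 77])

# Bin the sorted ages into the segments 21-35, 36-59 and 60-79.
bins = [20, 35, 59, 79]
labels = ["21-35", "36-59", "60-79"]
age_bins = pd.cut(ages, bins=bins, labels=labels)

# Each bin can now be preprocessed (e.g. smoothed by its mean) separately.
print(ages.groupby(age_bins).mean())
```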

The outliers are the singular data points dissimilar to the rest of the domain.

It’s important not to discard the outliers by mistaking them for noise. For example, say we are
building an algorithm that sorts different varieties of apples. We can encounter two types of
outliers in our dataset:

 Images that contain exotic fruits like pineapples and kiwis. They can appear in your
data due to a sampling mistake and represent noise in your dataset.
 Photos of some “weird apples”, for example ones with a strange shape. When our
goal is to teach the machine to recognize apple varieties, such deviation from the
groups is important. These outliers help teach the ML model to recognize special
characteristics and increase the accuracy of the forecast.

When we are not talking about obvious things like apples and pineapples, it is quite
complicated to decide whether the item is important or just noise. Here, the expertise of
the data scientist has a great influence on the success of ML modelling.

Data transformation

In fact, by cleaning and smoothing the data, we have already performed data modification.
However, by data transformation we mean the methods of turning the data into a format
that is appropriate for the computer to learn from.

Example: For research about smog around the globe, you have data about wind speeds.
However, the data got mixed, and we have three variants of figures: metres per second,
miles per hour, and kilometres per hour. We need to transform the data to the same
unit and scale for ML modelling.
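
A minimal sketch of that transformation, converting every figure to metres per second; the conversion factors are standard, while the table layout is an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "speed": [5.0, 18.0, 12.5],
    "unit":  ["m/s", "km/h", "mph"],
})

# Conversion factors to metres per second.
to_mps = {"m/s": 1.0, "km/h": 1 / 3.6, "mph": 0.44704}

df["speed_mps"] = df["speed"] * df["unit"].map(to_mps)
print(df)
```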

Here are the techniques for data transformation or data scaling:

Aggregation
In the case of data aggregation, the data is pooled together and presented in a unified
format for data analysis.
Working with a large amount of high-quality data allows for getting more reliable results
from the ML model.

If we want to build a neural network algorithm that simulates the style of Vincent Van Gogh,
we need to feed it as many paintings by this famous artist as we can to provide enough
material for training. The images need to have the same digital format, and we will use data
transformation techniques to achieve that.
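
As a sketch, the COVID-19 table from Q.1 could be aggregated by WHO region so that the per-country rows are pooled into one unified summary; the column names below are shortened, illustrative versions of the real headers:

```python
import pandas as pd

df = pd.DataFrame({
    "who_region":       ["Europe", "Europe", "Africa", "Africa"],
    "cases_cumulative": [120000, 95000, 40000, 27000],
})

# Pool the per-country records into one row per WHO region.
summary = df.groupby("who_region", as_index=False)["cases_cumulative"].sum()
print(summary)
```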

Normalization

Normalization helps you to scale the data within a range to avoid building incorrect ML
models while training and/or executing data analysis. If the data range is very wide, it will be
hard to compare the figures. With various normalization techniques, you can transform the
original data linearly, perform decimal scaling or Z-score normalization.

For example, to compare the population growth of city X (over a million citizens) with that of
city Y (a thousand new citizens), we need to normalize these figures.
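
A short sketch of min-max (linear) and Z-score normalization with pandas; the population figures are illustrative:

```python
import pandas as pd

growth = pd.Series([1_200_000, 1_000, 56_000, 380_000], name="new_citizens")

# Min-max normalization: linear transform into the range [0, 1].
min_max = (growth - growth.min()) / (growth.max() - growth.min())

# Z-score normalization: zero mean, unit standard deviation.
z_score = (growth - growth.mean()) / growth.std()

print(pd.DataFrame({"original": growth, "min_max": min_max, "z_score": z_score}))
```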

Feature selection
Feature selection is the selection of variables in data that are the best predictors for the
variable we want to predict.

Feature selection methods can be:

a) Unsupervised

b) Supervised

If there are a lot of features, the classifier's operation time increases, and the prediction
accuracy often decreases, especially if the data contains many garbage features that are
not correlated with the target variable. In the Machine Learning Mastery blog, you can
learn how to perform feature selection for your ML database.
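
A hedged sketch of supervised feature selection with scikit-learn's SelectKBest; the synthetic data and the choice of k are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Keep the 5 features that score best against the target variable.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (200, 5)
print(selector.get_support(indices=True))  # indices of the kept features
```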

Discretization

During discretization, a programmer transforms the data into sets of small intervals. For
example, putting people in categories “young”, “middle age”, “senior” rather than working
with continuous age values. Discretization helps to improve efficiency.
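
A small sketch of that discretization with pandas; the age boundaries are an assumption:

```python
import pandas as pd

ages = pd.Series([19, 34, 45, 67, 81])

# Replace continuous ages with the categories "young", "middle age", "senior".
categories = pd.cut(ages, bins=[0, 35, 60, 120],
                    labels=["young", "middle age", "senior"])
print(categories)
```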

Concept hierarchy generation

If you use the concept hierarchy generation method, you can generate a hierarchy between
the attributes where it was not specified. For example, if you have the location information
that includes a street, city, province, and country but they have no hierarchical order, this
method can help you transform the data.
Generalization

With the help of generalization, it is possible to convert low-level data features to high-level
data features. For example, house addresses can be generalized to higher-level definitions,
such as town or country.
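
A minimal sketch of generalizing the low-level city attribute up a street < city < country hierarchy; the lookup table is made up:

```python
import pandas as pd

df = pd.DataFrame({"street": ["Baker Street", "Oak Avenue", "Main Street"],
                   "city":   ["London", "Toronto", "Springfield"]})

# street < city < country: generalize each record to a higher-level attribute.
city_to_country = {"London": "United Kingdom",
                   "Toronto": "Canada",
                   "Springfield": "United States"}
df["country"] = df["city"].map(city_to_country)

print(df[["city", "country"]])
```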

Data reduction

When you work with large amounts of data, it becomes harder to come up with reliable
solutions. Data reduction can be used to reduce the amount of data and decrease the costs
of analysis.

Researchers really need data reduction when working with verbal speech datasets. Massive
arrays contain individual features of the speakers, for example, interjections and filler
words. In this case, huge databases can be reduced to a representative sample for the
analysis.

Here are a few techniques for data reduction:

Attribute feature selection

Techniques for data transformation can also be used for data reduction. If you construct a
new feature combining the given features in order to make the data mining process more
efficient, it is called attribute selection. For example, the
features male/female and student can be combined into male student/female student.
This can be useful if we conduct research about how many men and/or women are students
but we are not interested in their field of study.
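
A quick sketch of that construction with pandas; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"gender":  ["male", "female", "female", "male"],
                   "student": [True, True, False, True]})

# Combine the two attributes into one, e.g. "female student".
df["gender_student"] = df["gender"] + df["student"].map({True: " student",
                                                         False: " non-student"})
print(df["gender_student"].value_counts())
```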

Dimensionality reduction

Datasets that are used to solve real-life tasks have a huge number of features. Computer
vision, speech generation, translation, and many other tasks cannot sacrifice the speed of
operation for the sake of quality. It’s possible to use dimensionality reduction to cut the
number of features used.
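
A hedged sketch of dimensionality reduction with PCA from scikit-learn; the sample and feature counts are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# 200 samples with 50 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Project the data down to 10 principal components.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 10)
print(pca.explained_variance_ratio_.sum())  # variance kept by the 10 components
```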

Numerosity reduction

Numerosity reduction is a method of data reduction that replaces the original data by a
smaller form of data representation. There are two types of numerosity reduction methods
– Parametric and Non-Parametric.

Parametric Methods

Parametric methods use models to represent data. Commonly, regression is used to build
such models.
Non-parametric methods

These techniques allow for storing reduced representations of the data through histograms,
data sampling, and data cube aggregation.
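
A short sketch contrasting the two approaches on synthetic data: a fitted linear-regression model stands in for the raw points (parametric), while a random sample and a histogram keep reduced representations of them (non-parametric):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1_000)
y = 3.0 * x + rng.normal(scale=0.5, size=1_000)
df = pd.DataFrame({"x": x, "y": y})

# Parametric: keep only the model parameters instead of the 1,000 points.
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_, model.intercept_)

# Non-parametric: keep a 5% random sample and a 10-bin histogram of the data.
sample = df.sample(frac=0.05, random_state=0)
hist, edges = np.histogram(df["y"], bins=10)
print(len(sample), hist)
```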
