
Data Preparation

By
Reema Monga
Assistant Professor, USMS
Example 1: Data vs Information
What is Data?
Data is a collection of raw, unorganised facts and details such as text, observations, figures, symbols
and descriptions of things. In other words, data does not carry any specific purpose and has
no significance by itself. Data is raw, unorganised, unanalysed, uninterpreted, and unrelated material
used in different contexts. For instance, facts and statistics gathered by researchers for their
analysis can collectively be called data. In essence, data lacks informative value and remains
meaningless unless it is given a purpose or direction that lends it significance.

What is Information?
Information is processed, organised and structured data. It provides context for data and
enables decision making. For example, a single customer’s sale at a restaurant is data – this
becomes information when the business is able to identify the most popular or least popular dish.
Information is data that has been processed to make it meaningful and useful: Data + Meaning
= Information. One way to add meaning is to process the data. For example, individual exam
marks are raw data, but if you were to process those to say that the average mark for the class
was 53%, or that boys did better than girls, or that 76% of the students in your school got a
grade A or B, then that is information!
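As a rough illustration of turning raw marks (data) into a summary (information), the short Python sketch below uses hypothetical marks and an assumed grade boundary; the 53% and 76% figures above are only examples and are not reproduced by this code.

```python
# Minimal sketch: processing raw exam marks (data) into summary figures (information).
# The marks list and the 70% boundary for grades A/B are assumptions for illustration.
marks = [41, 67, 53, 72, 38, 85, 59, 61, 47, 76]

average = sum(marks) / len(marks)
share_a_or_b = sum(m >= 70 for m in marks) / len(marks)

print(f"Average mark for the class: {average:.1f}%")
print(f"Share of students with grade A or B: {share_a_or_b:.0%}")
```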
Data vs Information
Data is a raw and unorganised fact that needs to be processed to make it
meaningful. Data must be interpreted, by a human or machine, before it acquires
meaning; on its own, data is meaningless. Data contains numbers, statements, and
characters in a raw form.

Information is a set of data which is processed in a meaningful way according to the
given requirement. Information is processed, structured, or presented in a given
context to make it meaningful and useful. It is processed data that possesses
context, relevance, and purpose. Information assigns meaning and improves
the reliability of the data. It removes undesirable details and reduces uncertainty, so
when data is transformed into information, it never has any useless details.
Example 2: Data vs Information
Data vs Information
● Data is unorganised and unrefined facts; information comprises processed, organised data presented in a meaningful context.
● Data is an individual unit that contains raw material which does not carry any specific meaning; information is a group of data that collectively carries a logical meaning.
● Data does not depend on information; information depends on data.
● Raw data alone is insufficient for decision making; information is sufficient for decision making.
● An example of data is a student’s test score; the average score of a class is information derived from the given data.
Types of Data
Qualitative vs Quantitative Data
Data can be of two types:
● Qualitative data: non-numerical data, e.g. the texture of the skin, the colour of the eyes, etc.
● Quantitative data: data given in numbers. Data that answers questions such as “how much” and “how many” is quantitative data.
Nominal data
Nominal data is defined as data that is used for naming or labelling variables,
without any quantitative value. It is sometimes called “named” data – a meaning
derived from the word nominal. Nominal data is the simplest form of data, and
there is usually no intrinsic ordering to it.

Nominal data are those items which are distinguished by a simple naming system;
they are also called categorical data. In the nominal scale, the subjects are
only allocated to different categories. The values grouped into these categories
have no meaningful order and no hierarchy. For example, gender and
occupation are nominal level values.
Nominal data
For example: If you collected data on hair color, when entering your data into a spreadsheet, you might
use the number 1 to represent blonde hair, the number 2 to represent gray hair, and so on. These numbers
are just labels; they don’t convey any mathematical meaning.
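As a rough sketch of this idea, the snippet below attaches arbitrary numeric codes to a hypothetical hair-colour column using pandas (the values and codes are assumptions for illustration); the codes are labels only, so arithmetic on them would be meaningless.

```python
import pandas as pd

# Hypothetical nominal data: hair colour recorded for a few respondents.
df = pd.DataFrame({"hair_color": ["blonde", "gray", "brown", "blonde", "brown"]})

# Treat the column as an (unordered) categorical variable and attach numeric codes.
df["hair_color"] = df["hair_color"].astype("category")
df["hair_color_code"] = df["hair_color"].cat.codes

print(df)
# The codes are only labels; computing their mean or sum would carry no meaning.
```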

Some examples of nominal data include:

● Eye color (e.g. blue, brown, green)
● Nationality (e.g. German, Cameroonian, Lebanese)
● Personality type (e.g. introvert, extrovert, ambivert)
● Employment status (e.g. unemployed, part-time, retired)
● Political party voted for in the last election (e.g. party X, party Y, party Z)
● Type of smartphone owned (e.g. iPhone, Samsung, Google Pixel)
Ordinal data

Ordinal data is a type of qualitative (non-numeric) data that groups variables into
descriptive categories. A distinguishing feature of ordinal data is that the categories
it uses are ordered on some kind of hierarchical scale, e.g. high to low. Ordinal data
classifies data while introducing an order, or ranking. For instance, measuring
economic status using the hierarchy: ‘wealthy’, ‘middle income’ or ‘poor.’ However,
there is no clearly defined interval between these categories.
Ordinal data
Some examples of ordinal data include:
● Academic grades (A, B, C, and so on)
● Happiness on a scale of 1-10 (this is what’s known as a Likert scale)
● Satisfaction (extremely satisfied, quite satisfied, slightly dissatisfied,
extremely dissatisfied)
● Income (high, medium, or low). Note that income is not an ordinal
variable by default; it depends on how you choose to measure it.
● Level of education completed (high school, bachelor’s degree, master’s
degree)
● Seniority level at work (junior, mid-level, senior)
Ordinal data
For example, rating pain on a scale of 1-5, or categorizing your income as high,
medium, or low. As you can see from these examples, there is a natural
hierarchy to the categories, but we don’t know what the quantitative
difference or distance is between each of the categories. We don’t know
how much respondent A earns in the “high income” category compared
to respondent B in the “medium income” category; nor is it possible to
tell how much more painful a rating of 3 is compared to a rating of 1.
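A minimal pandas sketch of ordinal data might look like the following (the income bands and their ordering are assumptions for illustration); the ordering allows sorting and comparison, but the distance between categories remains undefined.

```python
import pandas as pd

# Hypothetical ordinal data: income band per respondent, with an explicit order.
income = pd.Categorical(
    ["low", "high", "medium", "low", "high"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(income.min(), income.max())  # ordering is meaningful: low < medium < high
# But the "distance" between categories (e.g. high minus medium) is undefined.
```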
Nominal vs Ordinal data
1. Consider the two examples below:
a. How was your customer service experience?
_______

b. How was your customer service experience?

● Good
● Neutral
● Bad
The data to be collected from Example (a) is nominal data, while that from Example (b) is ordinal data.

2. For example, very hot, hot, cold, very cold, warm are all nominal data when considered
individually. But when placed on a scale and arranged in a given order (very hot, hot, warm,
cold, very cold), they are regarded as ordinal data.
Interval and Ratio data
● Interval data classifies and ranks data but also introduces measured
intervals. A great example is temperature scales, in Celsius or Fahrenheit.
However, interval data has no true zero, i.e. a measurement of ‘zero’
can still represent a quantifiable measure (such as zero Celsius, which is
simply another measure on a scale that includes negative values).

● Ratio data is the most complex level of measurement. Like interval data,
it classifies and ranks data, and uses measured intervals. However, unlike
interval data, ratio data also has a true zero. When a variable equals
zero, there is none of this variable. A good example of ratio data is the
measure of height—you cannot have a negative measure of height.
Interval and Ratio data
The interval scale is a numerical scale which labels and orders variables, with a known,
evenly spaced interval between each of the values. An oft-cited example of interval data is
temperature in Fahrenheit, where the difference between 10 and 20 degrees Fahrenheit is
exactly the same as the difference between, say, 50 and 60 degrees Fahrenheit. Interval
scales are numeric scales in which we know both the order and the exact differences between the
values. The classic example of an interval scale is Celsius temperature because the difference
between each value is the same. For example, the difference between 60 and 50 degrees is a
measurable 10 degrees, as is the difference between 80 and 70 degrees.

The ratio scale is exactly the same as the interval scale, with one key difference: The ratio
scale has what’s known as a “true zero.” A good example of ratio data is weight in
kilograms. If something weighs zero kilograms, it truly weighs nothing—compared to
temperature (interval data), where a value of zero degrees doesn’t mean there is “no
temperature,” it simply means it’s extremely cold!
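A quick numeric sketch of why the true zero matters: ratios of interval-scale values change when you convert units, whereas ratios of ratio-scale values do not (the particular numbers below are arbitrary).

```python
# Interval data (temperature): "20 degrees C is twice as hot as 10 degrees C" does not hold.
c1, c2 = 10.0, 20.0
f1, f2 = c1 * 9 / 5 + 32, c2 * 9 / 5 + 32  # 50 F and 68 F
print(c2 / c1)  # 2.0
print(f2 / f1)  # 1.36 -> the "ratio" depends on the unit, so it is not meaningful

# Ratio data (weight): the ratio survives a change of unit because zero means "none".
kg1, kg2 = 5.0, 10.0
lb1, lb2 = kg1 * 2.20462, kg2 * 2.20462
print(kg2 / kg1)  # 2.0
print(lb2 / lb1)  # 2.0 -> the ratio is meaningful in any unit
```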
Data Collection
Data Preparation - Meaning and Steps
Data Preparation
Data preparation is the process of cleaning and transforming raw data prior to processing and
analysis. It is an important step that often involves reformatting data, making
corrections to data, and combining datasets to enrich the data. It is the phase of transforming raw data
into useful information that will later be used for decision-making; data preparation is mainly
the phase that precedes the analysis.
Data preparation helps:

● Fix errors quickly — Data preparation helps catch errors before processing.
● Produce top-quality data — Cleaning and reformatting datasets ensures that
all data used in analysis will be of high quality.
● Make better business decisions — Higher-quality data that can be processed
and analyzed more quickly and efficiently leads to more timely, efficient,
better-quality business decisions.
Steps in Data Preparation
Data Preparation process
The data preparation process can vary depending on industry or need, but
typically consists of the following steps:

Acquiring data: The first step in any data preparation process is acquiring the data that an
analyst will use for their analysis. This means determining what data is needed, gathering it,
and establishing consistent access to build powerful, trusted analysis.

Exploring data: Determining the data’s quality, examining its distribution, and analyzing
the relationship between each variable to better understand how to compose an analysis.
During this phase, analysts should also evaluate the quality of their dataset. Is the
data complete? Are the patterns what was expected? If not, why?
Data Preparation Process

Cleansing data: Improving data quality and overall productivity to craft
error-proof insights. Cleansing data includes:

● Correcting entry errors
● Removing duplicates or outliers
● Filling in missing values or eliminating missing data
● Masking sensitive or confidential information like names or addresses
● Conforming data to a standardized pattern

Transforming data: Formatting, orienting, aggregating, and enriching the datasets used
in an analysis to produce more meaningful insights. Data comes in many shapes, sizes,
and structures. A minimal sketch of a few cleansing and transformation steps follows below.
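The snippet below is a rough pandas illustration of some of these steps; the dataset, column names and cleaning rules are all assumptions for illustration, not a prescribed procedure.

```python
import pandas as pd

# Hypothetical raw dataset; in practice this might be read from a CSV file or a database.
raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cara", None],
    "city":     ["delhi", "delhi", "Mumbai ", "MUMBAI", "Pune"],
    "amount":   [250.0, 250.0, None, 480.0, 120.0],
})

clean = (
    raw.drop_duplicates()                 # remove duplicate rows
       .dropna(subset=["customer"])       # eliminate rows missing a key field
       .assign(
           amount=lambda d: d["amount"].fillna(d["amount"].median()),  # fill missing values
           city=lambda d: d["city"].str.strip().str.title(),           # conform to a standard pattern
       )
)

# A simple transformation/aggregation step: total amount per city.
summary = clean.groupby("city", as_index=False)["amount"].sum()
print(summary)
```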
Data Preprocessing - Meaning and Steps
Data Preprocessing

Data preprocessing, a component of data preparation, describes any type of processing
performed on raw data to prepare it for another data processing procedure.
Data preprocessing transforms the data into a format that is more easily and effectively
processed in data mining, machine learning and other data science tasks.
Steps in Data Preprocessing

● Data cleaning: Real-world data contains irrelevant, duplicate and missing parts.
Data cleaning involves handling missing data, either by ignoring the tuples that contain
missing values or by filling in the missing values. For cleaning noisy data, different
machine learning methods such as clustering or regression are used.
● Data Transformation: Data transformation is used to convert real-world data
into an understandable format. It is the most important process of data
preprocessing.
● Data Reduction: It is used to handle large amounts of data; with very large
datasets, analysis becomes difficult. For this, we use different data reduction
techniques like dimensionality reduction or data cube aggregation (a small sketch
of dimensionality reduction follows this list).
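As an illustrative sketch of the dimensionality reduction mentioned above, the snippet below applies principal component analysis with scikit-learn to a small random dataset; the data and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical wide dataset: 100 observations with 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Reduce the 10 features down to 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance retained by each component
```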
Data Processing - Meaning and Steps
Data processing
Data processing refers to the process of transforming raw data into meaningful output, i.e.
information. Data processing occurs when data is collected and translated into usable information.
Usually performed by a data scientist or team of data scientists, it is important for data processing to be
done correctly so as not to negatively affect the end product, or data output.

Data processing starts with data in its raw form and converts it into a more readable format (graphs,
documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by
employees throughout an organization.
Data processing applies procedures to raw data to turn it into information. Data sets may be just figures,
text, excerpts, which alone do not tell us anything. These pieces need to be correlated into information
through finding connections in sets. This is the function of data processing. Data processing collects,
stores, cleans, transforms, and presents information from data sets in a valid way.

Without data processing, typically a data set cannot be very useful.


Steps in Data Processing

1. Data collection: Data from accessible sources, including data lakes and data
warehouses, is collected from different processes.
2. Data preparation: The main reason for this step is to reduce redundant data
(incomplete or incorrect data) so that we can create good-quality data for different
business purposes.
3. Data input: After preparation, the data is converted into a form that can be easily
understood and made usable.
4. Processing: Processing of data is done using machine learning algorithms to
manipulate the data so that information or patterns are identified.
5. Interpretation of data: At this stage, the data is interpreted for final use.
This stage provides the output of data processing.
6. Data storage: All the processed data is then stored for future use.

A highly simplified end-to-end skeleton of these stages is sketched below.
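Every function name, value and file path in this sketch is a placeholder assumption, intended only to show how the six stages chain together.

```python
import json
import statistics

def collect():
    # 1. Data collection - a hard-coded list standing in for a data lake / warehouse query.
    return [12, 7, None, 9, 12, 15]

def prepare(raw):
    # 2. Data preparation - drop missing and duplicate values.
    seen, cleaned = set(), []
    for value in raw:
        if value is not None and value not in seen:
            cleaned.append(value)
            seen.add(value)
    return cleaned

def process(values):
    # 3-4. Data input and processing - a simple summary standing in for a real model.
    return {"count": len(values), "mean": statistics.mean(values)}

def interpret(result):
    # 5. Interpretation - turn the output into a human-readable statement.
    return f"{result['count']} valid observations with a mean of {result['mean']:.1f}"

def store(result, path="output.json"):
    # 6. Data storage - persist the processed result for future use (hypothetical path).
    with open(path, "w") as f:
        json.dump(result, f)

result = process(prepare(collect()))
print(interpret(result))
store(result)
```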
Unprocessed Data or Raw Data

Raw data is the data that is collected from a source, but in its
initial state. It has not yet been processed — or cleaned,
organized, and visually presented. In statistics, raw data refers
to data that has been collected directly from a primary source
and has not been processed in any way.
In any type of data analysis project, the first step is gathering
raw data. Once this data has been gathered, it can then be
cleaned, transformed, summarized, and visualized.
Example: Collecting & Using Raw Data
Step 1: Collect Raw Data

Imagine that a basketball scout collects the following raw data for 10 players on a
professional basketball team (shown on the slide as a table including, among other
variables, minutes played and points scored):

This dataset represents the raw data because it’s collected directly by the scout and
it hasn’t been cleaned or processed in any way.
Step 2: Clean Raw Data

Before using this data to create summary tables, charts, or anything else, the scout
would first remove any missing values and clean up any “dirty” data values.

For example, we can spot several values in the dataset that need to be transformed
or removed.
Step 3: Summarize Data

Once the data has been cleaned, the scout may then summarize each variable in the
dataset. For example, he could calculate the following summary statistics for the
“Minutes” variable:

● Mean: 24 minutes
● Median: 22 minutes
● Standard deviation: 9.45 minutes
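With the cleaned minutes in a pandas Series, these statistics could be computed along the following lines; the values below are placeholders, since the scout's actual table is not reproduced here, so the output will not match the figures above.

```python
import pandas as pd

# Placeholder "Minutes" values for the 10 players (not the scout's actual data).
minutes = pd.Series([10, 14, 18, 21, 25, 26, 29, 31, 34, 40])

print("Mean:", minutes.mean())
print("Median:", minutes.median())
print("Standard deviation:", minutes.std())
```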
Step 4: Visualize Data

The scout can then visualize the variables in the dataset to gain a better understanding
of the data values.

For example, he could create the following bar chart to visualize the total minutes
played by each player:
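Such a bar chart could be produced with matplotlib roughly as follows; the player labels and minutes are placeholders rather than the scout's actual figures.

```python
import matplotlib.pyplot as plt

# Placeholder player labels and minutes played.
players = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"]
minutes = [10, 14, 18, 21, 25, 26, 29, 31, 34, 40]

plt.bar(players, minutes)
plt.xlabel("Player")
plt.ylabel("Minutes played")
plt.title("Total minutes played by each player")
plt.show()
```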
Step 5: Use Data to Build a Model

Lastly, once the data has been cleaned, the scout may decide to fit some type of
predictive model. For example, he may fit a simple linear regression model and use
minutes played to predict total points scored by each player.

The fitted regression equation is:

Points = 8.7012 + 0.2717*(minutes)

For example, an athlete that plays 30 minutes is predicted to score:

Points = 8.7012 + 0.2717*(30) = 16.85 points.
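Fitting such a simple linear regression in Python could look like the sketch below; the minutes and points values are placeholders, so the fitted intercept and slope will not match the 8.7012 and 0.2717 quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: minutes played and points scored for 10 players.
minutes = np.array([10, 14, 18, 21, 25, 26, 29, 31, 34, 40]).reshape(-1, 1)
points = np.array([11, 12, 14, 13, 16, 15, 17, 18, 17, 20])

model = LinearRegression().fit(minutes, points)
print("Intercept:", model.intercept_)
print("Slope:", model.coef_[0])

# Predict points for a player who plays 30 minutes.
print("Predicted points at 30 minutes:", model.predict([[30]])[0])
```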
Metadata
Metadata is simply data about data: a description of, and context for, the data.
In technology circles, the prefix “meta” generally means “an underlying
concept or explanation”. Metadata helps to organize, find and understand data; in
simple terms, it helps us understand the structure, nature, and context of the data.
Metadata is “data that provides information about other data”, but not
the content of the data itself, such as the text of a message or the image
itself. The author, the date created, the date changed, and the file size
are some basic examples.
Some typical metadata elements include:
● Purpose of the data
● Time and date of creation
● Creator or author of the data
● Location on a computer network where the data was created
● File size
● Data quality
● Source of the data
● Title and description
● Who created it and when
● Who last modified it and when
● Who can access or update it
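As a small illustration, basic file metadata of this kind (size, last-modified time) can be read with Python's standard library; the file path below is hypothetical.

```python
import os
from datetime import datetime
from pathlib import Path

path = Path("report.docx")  # hypothetical file
info = path.stat()          # filesystem metadata about the file, not its content

print("File size (bytes):", info.st_size)
print("Last modified:", datetime.fromtimestamp(info.st_mtime))
print("Readable and writable by this user?", os.access(path, os.R_OK | os.W_OK))
```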
Metadata Examples

Every time you take a photo with today's cameras, a bunch of metadata is gathered and saved with it:
● date and time,
● filename,
● camera settings,
● geolocation.

Every word processing software collects some standard metadata and enables you to add your own fields for each document. Typical fields are:
● title,
● subject,
● author,
● company,
● status,
● creation date and time,
● last modification date and time,
● number of pages.

Every blog post has standard metadata fields that usually appear before the first paragraph. These include:
● title,
● author,
● published time,
● category,
● tags.

Spreadsheets contain a few metadata fields:
● tab names,
● table names,
● column names,
● user comments.
