Data Preparation-Part 1-231018-220411
By
Reema Monga
Assistant Professor, USMS
Example 1: Data vs Information
What is Data?
Data is a collection of raw, unorganised facts and details such as text, observations, figures, symbols
and descriptions of things. In other words, data carries no specific purpose and has
no significance by itself. Data is raw, unorganised, unanalysed, uninterpreted, and unrelated
to any particular context. For instance, the facts and statistics gathered by researchers for their
analysis can collectively be called data. On its own, data is uninformative; it remains largely
meaningless until it is given a purpose or direction.
What is Information?
Information is processed, organised and structured data. It provides context for data and
enables decision making. For example, a single customer’s sale at a restaurant is data – this
becomes information when the business is able to identify the most popular or least popular dish.
Information is data that has been processed to make it meaningful and useful: Data + Meaning
= Information. Another way to add meaning is to process the data. For example, individual exam
marks are raw data, but if you process them to say that the average mark for the class
was 53%, or that boys did better than girls, or that 76% of the students in your school got a
grade A or B, then that is information.
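The exam-marks example can be sketched in a few lines of Python. This is only an illustration: the marks and the grade-B cutoff are hypothetical values chosen for the sketch, not figures from a real class.

```python
from statistics import mean

# Hypothetical raw data: individual exam marks (in percent) for one class.
marks = [35, 48, 53, 61, 68]

# Processing step 1: summarise the raw marks as a class average.
average_mark = mean(marks)

# Processing step 2: share of students at or above an assumed grade-B cutoff.
GRADE_B_CUTOFF = 60  # assumption for illustration only
share_a_or_b = sum(mark >= GRADE_B_CUTOFF for mark in marks) / len(marks)

print(f"Average mark: {average_mark}%")
print(f"Share with grade A or B: {share_a_or_b:.0%}")
```

The list of marks is the data; the two printed summaries are the information derived from it.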
Data vs Information
Data is a raw and unorganised fact that needs to be processed to become
meaningful. In other words, data must be interpreted, by a human or a machine, to
derive meaning, so data on its own is meaningless. Data contains numbers, statements, and
characters in raw form.
Nominal data are those items which are distinguished by a simple naming system.
Nominal data are also called categorical data. In the nominal scale, the subjects are
only allocated to different categories. The values grouped into these categories
have no meaningful order. There is no hierarchy. For example, gender and
occupation are nominal level values.
Nominal data
For example, if you collected data on hair color and entered it into a spreadsheet, you might
use the number 1 to represent blonde hair, the number 2 to represent gray hair, and so on. These numbers
are just labels; they don’t convey any mathematical meaning.
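A minimal sketch of this coding scheme, assuming a hypothetical column of coded observations. Counting how often each label occurs is meaningful for nominal data; doing arithmetic on the codes is not.

```python
from collections import Counter

# Codes from the hair color example; further colors would be coded the same way.
HAIR_COLOR_CODES = {1: "blonde", 2: "gray"}

# Hypothetical spreadsheet column of coded observations.
coded_observations = [1, 2, 2, 1, 2]

# Counting per category (and finding the mode) is valid for nominal data.
counts = Counter(HAIR_COLOR_CODES[code] for code in coded_observations)
most_common_color = counts.most_common(1)[0][0]

# Arithmetic on the codes runs without error, but the result names no category.
meaningless_average = sum(coded_observations) / len(coded_observations)

print(counts)
print(most_common_color)
print(meaningless_average in HAIR_COLOR_CODES)
```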
● Good
● Neutral
● Bad
The data collected in Example (a) is nominal, while that in Example (b) is ordinal.
2. For example, very hot, hot, cold, very cold, warm are all nominal data when considered
individually. But when placed on a scale and arranged in a given order (very hot, hot, warm,
cold, very cold), they are regarded as ordinal data.
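The temperature labels can be used to sketch what the ordering adds. The list of responses below is hypothetical; the scale itself is the one given in the text.

```python
# Order taken from the text, hottest first.
TEMPERATURE_SCALE = ["very hot", "hot", "warm", "cold", "very cold"]
RANK = {label: position for position, label in enumerate(TEMPERATURE_SCALE)}

# Hypothetical responses, in the order they were collected.
responses = ["cold", "very hot", "warm", "very cold", "hot"]

# Sorting by rank uses the ordinal scale; plain sorting only sees text.
ordered = sorted(responses, key=RANK.__getitem__)
alphabetical = sorted(responses)

# With an order defined, a middle (median) category exists.
median_label = ordered[len(ordered) // 2]

print(ordered)
print(alphabetical)
print(median_label)
```

Without the `RANK` lookup the labels can only be compared as strings, which is the nominal view; with it, comparisons such as "hotter than" and a median category become possible.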
Interval and Ratio data
● Interval data classifies and ranks data but also introduces measured
intervals. A great example is temperature scales, in Celsius or Fahrenheit.
However, interval data has no true zero, i.e. a measurement of ‘zero’
can still represent a quantifiable measure (such as zero Celsius, which is
simply another measure on a scale that includes negative values).
● Ratio data is the most complex level of measurement. Like interval data,
it classifies and ranks data, and uses measured intervals. However, unlike
interval data, ratio data also has a true zero. When a variable equals
zero, there is none of this variable. A good example of ratio data is the
measure of height—you cannot have a negative measure of height.
Interval and Ratio data
The interval scale is a numerical scale that labels and orders variables, with a known,
evenly spaced interval between each of the values: we know both the order and the exact
differences between the values. An oft-cited example of interval data is
temperature in Fahrenheit, where the difference between 10 and 20 degrees is
exactly the same as the difference between, say, 50 and 60 degrees. Celsius temperature
is the other classic example, because the difference
between each value is the same: the difference between 60 and 50 degrees is a
measurable 10 degrees, as is the difference between 80 and 70 degrees.
The ratio scale is the same as the interval scale, with one key difference: the ratio
scale has what is known as a “true zero.” A good example of ratio data is weight in
kilograms. If something weighs zero kilograms, it truly weighs nothing. Compare this with
temperature (interval data), where a value of zero degrees does not mean there is “no
temperature”; it is simply another point on the scale.
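One practical consequence of a true zero can be checked with a short sketch: ratios of ratio data survive a change of unit, while ratios of interval data do not. The values 10 and 20 are arbitrary, and the helper function names are introduced here for illustration.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard conversion between the two interval scales."""
    return celsius * 9 / 5 + 32


def kilograms_to_grams(kilograms: float) -> float:
    """Unit change on a ratio scale: zero stays zero."""
    return kilograms * 1000


# Interval data: "20 degrees is twice 10 degrees" depends on the unit chosen.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio data: "20 kg is twice 10 kg" holds in any unit.
ratio_kilograms = 20 / 10
ratio_grams = kilograms_to_grams(20) / kilograms_to_grams(10)

print(ratio_celsius, ratio_fahrenheit)
print(ratio_kilograms, ratio_grams)
```

The two temperature ratios disagree because zero degrees sits at a different point on each scale, so "twice as hot" is not a meaningful statement for interval data.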
Data Collection
Data Preparation-Meaning and Steps
Data Preparation
Data preparation is the process of cleaning and transforming raw data prior to processing and
analysis. It often involves reformatting data, making
corrections to data, and combining datasets to enrich data. It is the phase in which raw data is
turned into a usable form, so that the information derived from it can later be used for
decision-making. In short, data preparation is the phase that precedes the analysis.
Data preparation helps:
● Fix errors quickly — Data preparation helps catch errors before processing.
● Produce top-quality data — Cleaning and reformatting datasets ensures that
all data used in analysis will be of high quality.
● Make better business decisions — Higher-quality data that can be processed
and analyzed more quickly and efficiently leads to more timely, efficient,
better-quality business decisions.
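The three activities named above (reformatting, correcting, and combining datasets) can be sketched as follows. Every record, field name, and the `prepare` function is hypothetical, and dropping unparseable rows is only one possible correction policy.

```python
# Hypothetical raw sales records with stray spaces, mixed case and a bad amount.
raw_sales = [
    {"customer_id": " 101", "dish": "Pasta ", "amount": "12.50"},
    {"customer_id": "102", "dish": "pasta", "amount": "n/a"},
    {"customer_id": "103", "dish": "SOUP", "amount": "7"},
]

# Hypothetical second dataset used to enrich the first.
customer_city = {"101": "Delhi", "103": "Mumbai"}


def prepare(records, lookup):
    """Clean, reformat and enrich raw records before analysis."""
    prepared = []
    for record in records:
        try:
            amount = float(record["amount"])  # reformat: text to number
        except ValueError:
            continue  # correct: drop rows whose amount cannot be parsed
        customer_id = record["customer_id"].strip()
        prepared.append({
            "customer_id": customer_id,
            "dish": record["dish"].strip().lower(),  # reformat: one spelling per dish
            "amount": amount,
            "city": lookup.get(customer_id, "unknown"),  # combine: add city
        })
    return prepared


clean_sales = prepare(raw_sales, customer_city)
for row in clean_sales:
    print(row)
```

Catching the unparseable amount at this stage, rather than during analysis, is the "fix errors quickly" benefit in practice.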
Steps in Data Preparation
Data Preparation process
The data preparation process can vary depending on industry or need, but
typically consists of the following steps:
Data processing starts with data in its raw form and converts it into a more readable format (graphs,
documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by
employees throughout an organization.
Data processing applies procedures to raw data to turn it into information. Data sets may be just figures,
text, or excerpts, which alone do not tell us anything. These pieces need to be turned into information
by finding connections within and between the sets. This is the function of data processing: it collects,
stores, cleans, transforms, and presents information from data sets in a valid way.
Raw data is data collected from a source that is still in its
initial state: it has not yet been processed, that is, cleaned,
organized, and visually presented. In statistics, raw data refers
to data that has been collected directly from a primary source
and has not been processed in any way.
In any type of data analysis project, the first step is gathering
raw data. Once this data has been gathered, it can then be
cleaned, transformed, summarized, and visualized.
Example: Collecting & Using Raw Data
Step 1: Collect Raw Data
● Mean: 24 minutes
● Median: 22 minutes
● Standard deviation: 9.45 minutes
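Summary figures of this kind can be computed with Python's standard `statistics` module. The durations below are hypothetical and will not reproduce the values listed above; note also that `stdev` is the sample standard deviation, and it is an assumption that the sample version (rather than the population version, `pstdev`) is the one wanted.

```python
from statistics import mean, median, stdev

# Hypothetical cleaned measurements, in minutes.
durations_minutes = [10, 20, 30, 40]

summary = {
    "mean": mean(durations_minutes),
    "median": median(durations_minutes),
    "standard deviation": stdev(durations_minutes),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f} minutes")
```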
Step 4: Visualize Data