Session 2 - Excel Fundamentals For Data Exploration

Lab 1
Excel Fundamentals
for Data Exploration
TABLE OF CONTENTS
01 EXCEL OVERVIEW
   What is Excel?
   Excel Pros and Cons
02 DATA PREPARATION WITH EXCEL
   Handling data source
   Handling data types
   Handling data issues
   Handling data formatting
   Data problem checklist
03 ANALYZE AND VISUALIZE DATA
   Analysis toolpak
   Pivot chart
01
EXCEL
OVERVIEW
Excel is a spreadsheet application that features calculation capabilities, graphing tools, and pivot tables.
EXCEL
OVERVIEW
Popular Functions
o Pivot Tables: pivot tables allow you to obtain relevant data from a large dataset.
o What-if Analysis: What-if Analysis helps you experiment with different scenarios for values or formulas.
EXCEL
OVERVIEW
Popular Functions
o Conditional Formatting: the Conditional Formatting feature highlights cells with a distinct colour, depending on the value assigned to each cell.
o Charts: charts are easy to create and depict data in various ways, which is often more useful than the raw sheet.
EXCEL
OVERVIEW
Popular Functions
o Sort & Filter: a list can be sorted by colour, reversed, or randomized. Filters display only the data that meets certain criteria.
o Vlookup & Hlookup: Vlookup and Hlookup find a value in a database and fetch the other values corresponding to it.
EXCEL
OVERVIEW
Excel Advantages
o Excel is clear
o Excel is the best support tool for Standard Data Analysis
o Excel mapping is powerful
o Excel is non-technical
EXCEL
OVERVIEW
Excel Disadvantages
o Missing Data: an unwary user may not realize that anything is wrong
o Lack of Flexibility: Excel has its own expectations for data arrangement
o Scattered Output: output is scattered all over worksheets
o Erroneous Output: output can be incomplete or not properly labeled
o Repeated Requests: similar actions have to be requested multiple times
o No History Record: it is difficult to document or repeat your analysis
02
DATA PREPARATION
WITH EXCEL
1. Handling data source
2. Handling data types
3. Handling data issues
4. Handling data formatting
5. Data problem checklist
2.1 HANDLING
DATA SOURCE
Types Of Data Source
o Company Data / Open Data
o Web data, Survey data, APIs, Public records

2.1 HANDLING
DATA SOURCE
Web Data
Uses:
o Calculate conversion rates
o Monitor the popularity of different pieces of content
Fields recorded for each event:
o Name of the event
o URL of the page visited
o An identifier for the element that was clicked
o The timestamp of the event
o An identifier for the user that performed the action
2.1 HANDLING
DATA SOURCE
Survey Data
Net Promoter Score

2.1 HANDLING
DATA SOURCE
Public Records
2.2 HANDLING
DATA TYPES
Data Types in Statistics
Data is either Qualitative or Quantitative.

Qualitative: described by a characteristic
o Nominal: an unordered list of categories
  Example: What languages do you speak? (English, Vietnamese, German)
o Ordinal: on a scale, or ordered by magnitude
  Example: How likely are you to recommend to a friend? (1, 2, 3, 4)

Quantitative: described by a numerical scale
o Discrete: the data falls into a limited range of values
  Example: the result of rolling 2 dice only has the values 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
o Continuous: there is an infinite range of possible values
  Example: a person's height could be any value, not just certain fixed heights
o Interval: no true zero value
  Example: temperature in Celsius, SAT score (200-800)
o Ratio: there is a true zero value
  Example: income, weight, length
2.3 HANDLING
DATA ISSUES
Data Cleaning
HOW DATA SCIENTISTS SPEND TIME
o 80% Cleaning bad data
o 20% Analysing
Data Cleaning is the process of quickly and easily removing data that may distort your analysis.
2.3 HANDLING
DATA ISSUES
Data Cleaning
WHY?
Data cleaning matters for business intelligence and data analytics:
o To avoid costly errors
o Better decision-making process
o Enhance customer acquisition
o Improve employee productivity
o Accuracy, consistency, completeness, …

2.3 HANDLING
DATA ISSUES
Types of Data Issues
o Dirty data
o Missing data
o Outliers
2.3 HANDLING
DATA ISSUES
Dirty Data
Dirty data contains errors, or is in a format that is unfriendly or unusable.
o Not Parsed Correctly: a single Name field holding "Smith, John" instead of LName "Smith" and Fname "John"
o Extra Characters: a Name stored as “John Smith”, quotation marks included, instead of John Smith
o Unexpected Pattern: an Email column holding John.doe@gmail.com, helen@hotmail and schow@yahoo.com
o Misspelled Entries: "125 Main Strrreet" instead of "125 Main Street"
o Duplicate Data Records: the same ID and Email pair (1, John) recorded more than once
o Incorrect Data: a Date of 3000/07/08 in the records below

ID  Email  Date        Amount
1   John   2016/07/22  99
1   John   3000/07/08  330
2   Helen  2016/05/17  1000
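Outside Excel, two of these checks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `EMAIL_PATTERN` is a deliberately simple, hypothetical pattern (something@domain.tld), not a full email validator, and the `rows` list is an illustrative reading of the slide's ID/Email table.

```python
import re
from collections import Counter

# Hypothetical, deliberately simple pattern: something@domain.tld
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

emails = ["John.doe@gmail.com", "helen@hotmail", "schow@yahoo.com"]

# Unexpected pattern: keep the entries that do NOT look like an email address.
bad_emails = [e for e in emails if not EMAIL_PATTERN.match(e)]

# Duplicate records: rows as (ID, name); an exact repeat counts as a duplicate.
rows = [(1, "John"), (1, "John"), (2, "Helen")]
duplicates = [row for row, count in Counter(rows).items() if count > 1]

print(bad_emails)   # ['helen@hotmail']
print(duplicates)   # [(1, 'John')]
```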
2.3 HANDLING
DATA ISSUES
Dirty Data - Extra characters
Extra characters can be currency symbols, number signs… We need to remove these before changing between field types.
Example 1: “Maureen”, “Patrick”, “Ben”
Example 2: 33829L
Example 3: $346
2.3 HANDLING
DATA ISSUES
Dirty Data – Not Parsed Correctly
Raw data should be stored in its smallest parts, since that makes it easier to analyze.
2.3 HANDLING
DATA ISSUES
Missing Data
Missing data = gaps in data. The same gaps look different depending on where the data comes from:
o Blanks/Empty Cells (CSV)
o Null Values (Database), shown as (null)
o N/A (Program)

Name   City      Age     GPA
John   Paris     12      90
Minh   New York  (null)  100
Mei    (null)    (null)  23
Lucy   Tokyo     11      92
Peter  Beijing   12      (null)

Why care about missing data?
• Some statistical algorithms will not work
• Can add bias to a model
BIAS in statistics refers to the tendency of an analysis to either over- or underestimate the values of a specific field or parameter.
2.3 HANDLING
DATA ISSUES
Missing Data - Example
ID   Age (REAL DATA)   Age (COLLECTED DATA)
1    27                27
2    37                37
3    70                (missing)
4    55                55
5    23                23
6    25                25
7    35                35
8    51                51
9    65                (missing)
10   67                67
Average age without missing values (real data): 45.5
Average age with missing values (collected data): 40
The collected data therefore shows a DOWNWARD BIAS.
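As a quick check, the two averages can be recomputed from the ages listed in the example. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean

# Ages by ID as listed in the example (real data).
real = {1: 27, 2: 37, 3: 70, 4: 55, 5: 23, 6: 25, 7: 35, 8: 51, 9: 65, 10: 67}

# Collected data: the ages for IDs 3 and 9 were never recorded.
collected = {**real, 3: None, 9: None}
observed = [age for age in collected.values() if age is not None]

real_average = mean(list(real.values()))
collected_average = mean(observed)

print(real_average)       # 45.5
print(collected_average)  # 40
```

Because the two missing ages (70 and 65) are among the highest, the average of the collected data comes out lower than the real one.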

2.3 HANDLING
DATA ISSUES
Missing Data - Solutions
Before addressing the problem of missing data, you should answer the questions below:
1. How much data is really missing? (>=20%)
2. Is the missing data numeric or categorical?
The solutions will be based on the answers.
SOLUTIONS
1. Deleting Missing Data
2. Imputation
2.3 HANDLING
DATA ISSUES
Missing Data – Deleting Missing Data
Deleting missing data is often the default method because of its simplicity.

Before deletion:
ID  Name   City      Age      GPA
1   John   Paris     12       90
2   Minh   New York  (blank)  100
3   Mei    (blank)   (blank)  23
4   Lucy   Tokyo     11       92
5   Peter  Beijing   12       (blank)

After deletion:
ID  Name  City   Age  GPA
1   John  Paris  12   90
4   Lucy  Tokyo  11   92

However, you should make sure that deleting missing data doesn't have adverse effects on your analysis. For example, if a particular demographic tended to leave a response blank in a survey, then removing records with blank entries will mean that a part of the population is underrepresented.
One of the downsides is that eliminating missing data reduces the size of the dataset. (Ex: cost data)
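The deletion step can be sketched in Python as follows. It is an illustration only: blanks are represented as `None`, and the `records` list simply mirrors the table from the slide.

```python
# Each record mirrors one row of the table; None marks a blank cell.
records = [
    {"ID": 1, "Name": "John", "City": "Paris", "Age": 12, "GPA": 90},
    {"ID": 2, "Name": "Minh", "City": "New York", "Age": None, "GPA": 100},
    {"ID": 3, "Name": "Mei", "City": None, "Age": None, "GPA": 23},
    {"ID": 4, "Name": "Lucy", "City": "Tokyo", "Age": 11, "GPA": 92},
    {"ID": 5, "Name": "Peter", "City": "Beijing", "Age": 12, "GPA": None},
]

# Keep only the records in which every field has a value.
complete = [r for r in records if all(v is not None for v in r.values())]

# Share of the dataset lost by deleting incomplete records.
share_dropped = 1 - len(complete) / len(records)

print([r["ID"] for r in complete])  # [1, 4]
print(f"{share_dropped:.0%}")       # 60%
```

Checking the dropped share before deleting makes the cost of this method visible: here more than half of the dataset disappears.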
2.3 HANDLING
DATA ISSUES
Missing Data – Imputation
In statistics, imputation is the process of substituting values in the data where the values are missing, typically with the MEAN, MEDIAN or MODE.
We are creating fake data in order to develop a model that makes sense and is as close to reality as we can get it.
2.3 HANDLING
DATA ISSUES
Missing Data – Imputation
How can we decide when to use mean, median and mode?

MEAN vs MEDIAN
When is the median a better summary description of data than the mean?
Let’s look at seven employees at a small firm with the following salaries.
Salary: $28,000, $33,000, $33,000, $34,000, $37,000, $40,000, $400,000
What’s the typical salary in this group?
Mean: $86,000
Median: $34,000

MODE is not a relevant descriptive statistic when the data is essentially continuous.
Date      Rate
1/1/2022  0.936
2/1/2022  0.93
3/1/2022  0.876
4/1/2022  0.875
5/1/2022  0.86

● Mean: preferred if data is numeric and not skewed.
● Median: preferred if data is numeric and skewed.
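The effect of one extreme value on the mean, and a simple imputation based on it, can be sketched in Python. Two assumptions are made here: the seven salaries are taken as 28,000, 33,000, 33,000, 34,000, 37,000, 40,000 and 400,000 (the reading consistent with the stated mean and median), and `impute` is a hypothetical helper, not an Excel feature.

```python
from statistics import mean, median

# Assumed reading of the seven salaries from the example.
salaries = [28_000, 33_000, 33_000, 34_000, 37_000, 40_000, 400_000]

print(round(mean(salaries)))  # 86429, pulled up by the single 400,000 salary
print(median(salaries))       # 34000, close to what most employees earn


def impute(values, strategy=median):
    """Replace each None with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Ages from the missing-data example, with two gaps.
ages = [27, 37, None, 55, 23, 25, 35, 51, None, 67]
print(impute(ages))                 # gaps filled with the median, 36.0
print(impute(ages, strategy=mean))  # gaps filled with the mean, 40
```

Passing the statistic as a parameter keeps the choice between mean and median explicit, which matches the rule of thumb above: median for skewed numeric data, mean otherwise.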

2.3 HANDLING
DATA ISSUES
Missing Data – Solution Comparison
2.3 HANDLING
DATA ISSUES
Outliers
Identifying outliers in the data helps us understand how vulnerable our model would be to a small set of observations.
An outlier is either incorrect data or abnormal but correct data.

2.3 HANDLING
DATA ISSUES
Outliers - Identify
Identifying outliers more methodically rather than simply eyeballing them:
• Box and whisker plot: its elements are the lower fence, 1st quartile, median, 3rd quartile, upper fence and interquartile range; outliers fall outside the fences
• Violin Plot (a hybrid of a box plot and a kernel density plot): shows the volume of the distribution
• Others: z-scores or standard deviations
2.3 HANDLING
DATA ISSUES
Outliers – Identify: Exercise
Identify the outliers of this dataset using the IQR method below.

10 12 11 15
11 14 13 17
12 22 14 11

Lower Fence = Q1 - 1.5*IQR
Upper Fence = Q3 + 1.5*IQR
Q1 is the median of the data that lies between the minimum and the Median.
Q3 is the median of the data that lies between the Median and the maximum.
IQR is the difference between Q3 and Q1.
Values below the Lower Fence or above the Upper Fence are outliers.

2.3 HANDLING
DATA ISSUES
Outliers – Identify: Exercise
Dataset:
10 12 11 15
11 14 13 17
12 22 14 11
Sort the data points in ascending order:
10 11 11 11 12 12 13 14 14 15 17 22
Median = 12.5
Q1 = 11, Q3 = 14.5
IQR = Q3 – Q1 = 3.5
Lower Fence = Q1 - 1.5*IQR = 5.75
Upper Fence = Q3 + 1.5*IQR = 19.75
22 is the only outlier.
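The exercise can be reproduced in Python with the median-of-halves rule described on the slide. This is a sketch under that assumption; the function names `quartiles` and `iqr_outliers` are illustrative, and spreadsheet quartile functions may use a different interpolation rule and return slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of points the overall median is left out of both halves.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(upper)


def iqr_outliers(values, k=1.5):
    """Return the outliers and the (lower, upper) fences for the IQR method."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return outliers, (lower_fence, upper_fence)


data = [10, 12, 11, 15, 11, 14, 13, 17, 12, 22, 14, 11]
outliers, fences = iqr_outliers(data)

print(quartiles(data))  # (11.0, 14.5)
print(fences)           # (5.75, 19.75)
print(outliers)         # [22]
```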
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers
o RETAIN: when there legitimately may be high values
o TRIMMING: when you simply don’t believe the answers
o WINSORIZING: when we want to retain the high-value responses but not take them too literally
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers: EXAMPLE
In the example below, most physicians report under 100 patients per month, but a few, 4%, report much higher numbers.
The screener termination criteria already bound the responses to be at least 5, so we might clip answers above 100 as shown by the gold line.
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers: RETAIN
Just because numbers are atypical doesn’t mean they are unreasonable. Many phenomena yield “long-tail” distributions where a few outliers legitimately exist. For instance, in economics most people have modest wealth, but a few have very high net worth, and to exclude them from analysis would be misleading.
“Long-tail” distributions often look normal, or at least more reasonable, when shown on a log scale.
BEFORE / AFTER: here the distribution is shown on a log scale, with small bin ranges for smaller numbers and larger bin ranges for larger numbers.
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers: TRIMMING
Another approach is to ignore responses outside the main range.
To do this we can set a filter which includes only responses that
fall within the range [5, 100].
BEFORE AFTER
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers: WINSORIZING
We can cap those answers to within a defined range by setting the “ceiling” and “floor” attributes.
BEFORE / AFTER:
o The outlying values are not dropped but are now counted as if they were equal to 100 and thus fall in the range “91 to 100”.
o The range “91 to 100” has increased from 8% to 12%.
o The N size is still 100.
o The mean is a bit lower now.
Note that the median did not change at all. In all but the most extreme cases, the median is robust to outliers and unaffected by winsorizing because the extreme values stay on their side of the median.
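The difference between trimming and winsorizing can be sketched in Python. The `answers` list is hypothetical data invented for illustration (it is not the survey from the slides); the floor of 5 and ceiling of 100 follow the range used in the example.

```python
from statistics import mean, median


def trim(values, floor, ceiling):
    """Trimming: drop every value outside [floor, ceiling]."""
    return [v for v in values if floor <= v <= ceiling]


def winsorize(values, floor, ceiling):
    """Winsorizing: keep every value but cap it to [floor, ceiling]."""
    return [min(max(v, floor), ceiling) for v in values]


# Hypothetical patients-per-month answers, for illustration only.
answers = [5, 12, 20, 35, 40, 60, 75, 90, 250, 400]

trimmed = trim(answers, 5, 100)
capped = winsorize(answers, 5, 100)

print(len(answers), len(trimmed), len(capped))  # 10 8 10
print(mean(answers), mean(capped))              # 98.7 53.7
print(median(answers), median(capped))          # 50.0 50.0
```

Trimming shrinks the sample, while winsorizing keeps N unchanged, lowers the mean, and leaves the median where it was, which is the behaviour described above.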
2.4 HANDLE DATA
FORMATTING
Data Formatting
The 3 Hows:
• How to identify when your data needs to be formatted.
• How to massage data into the correct format.
• How to aggregate it to the form required.
1. Pivot/Unpivot
2. Aggregate Data/Group Data
3. Conditional Formatting
2.4 HANDLING
DATA FORMATTING
Data Formatting – Pivot/Unpivot
Pivot tables are one of Excel’s most powerful features for data analysis. Otherwise known as cross-tabulation, pivot tables are used to summarize (or slice) data so that you can focus on specific aspects that you want to explore in more depth.
You can quickly build a pivot table by dragging and dropping fields and rearranging them without needing to write any formulas. The flexibility to blend information also makes it easy to spot trends that might otherwise go unnoticed.
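To make the idea of cross-tabulation concrete, the summary a pivot table produces can be sketched in plain Python. The `sales` rows are hypothetical and exist only for illustration; in Excel the same result comes from dragging Region to rows, Product to columns and Amount to values.

```python
from collections import defaultdict

# Hypothetical rows: (region, product, amount).
sales = [
    ("North", "Pens", 120),
    ("North", "Paper", 80),
    ("South", "Pens", 200),
    ("South", "Paper", 50),
    ("North", "Pens", 30),
]

# Rows of the pivot are regions, columns are products, cells are summed amounts.
pivot = defaultdict(lambda: defaultdict(int))
for region, product, amount in sales:
    pivot[region][product] += amount

for region in sorted(pivot):
    print(region, dict(pivot[region]))
# North {'Pens': 150, 'Paper': 80}
# South {'Pens': 200, 'Paper': 50}
```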
2.4 HANDLING
DATA FORMATTING
Data Formatting – Aggregate Data
2.4 HANDLING
DATA FORMATTING
Data Formatting – Conditional Formatting
Conditional formatting is used to change the appearance of cells in a range based on your specified conditions.
The conditions are rules based on specified numerical values or matching text.
Changing the appearance of cells can visually highlight interesting data points for analysis.
DATA PROBLEM CHECKLIST
“Garbage in, Garbage out”
Creating an analytical dataset: for each issue, record a 1st Fix-date and a 2nd Fix-date.

Area             Issues
Data Source      Enough Data
                 Up to Date
Data Types       Data types correct
Data Issues      Dirty Data: Not Parsed Correctly, Extra Characters, Unexpected Pattern, Incorrect Data, Duplicate Data Records, Misspelled Entries
                 Missing Data: Deleting Missing Data, Imputation
                 Outliers: Case 1: Cross-check & fix, Case 1: Delete, Case 2: Trimming
Data Formatting  Pivot/Unpivot
                 Aggregating Data (Group Data)
03
ANALYZE AND
VISUALIZE DATA
ANALYZE AND
VISUALIZE DATA
Analyse Data - Analysis ToolPak
The Analysis ToolPak is a Microsoft Excel add-in that provides data analysis tools for statistical and engineering analysis. It consists of several functional tools. The list below gives the names of the functional tools available under the Analysis ToolPak:
4. Correlation: to measure relations between variables.
5. Covariance: to find how much two random variables vary together.
6. Descriptive Statistics: gives a report of univariate stats for your data (e.g. the mean and median).
7. Exponential Smoothing: a prediction tool for time series data.
8. F-Test Two-Sample for Variance: a test to compare two population variances.
9. Fourier Analysis: solves problems in linear systems and analyses periodic data.
10. Histogram: creates a histogram from your data.
11. Moving Average: predicts future performance based on averages from past data.
12. Random Number Generation: generates columns of independent random numbers.
13. Rank and Percents: calculates ordinal and percentage rank.
14. Regression: performs least squares regression on a set of data.
15. Sampling: takes a sample from a “population” that you input.
16. t-Test: to compare means between groups.
17. Z-Test: to compare means between groups.
18. Anova: to test if there’s a difference between groups.
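A few of these tools are easy to mirror outside Excel, which helps in understanding what they compute. The Python sketch below covers descriptive statistics, a moving average and a Pearson correlation; the `ad_spend` and `sales` figures are hypothetical, the function names are illustrative, and none of this is the ToolPak's own implementation.

```python
from statistics import mean, median


def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [mean(values[i - window + 1 : i + 1]) for i in range(window - 1, len(values))]


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariation / (spread_x * spread_y)


# Hypothetical monthly figures, for illustration only.
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 45, 57]

print(mean(sales), median(sales))   # 34.2 33
print(moving_average(sales, 3))     # [23, 34, 45]
print(correlation(ad_spend, sales))  # close to 1 for a near-linear relationship
```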
ANALYZE AND
VISUALIZE DATA
Visualize Data - Pivot Chart
A pivot chart is the visual representation of a pivot table in Excel. Pivot charts and pivot tables are connected with each other.