
Lab 1

Excel Fundamentals
for Data Exploration
TABLE OF CONTENTS

01 EXCEL OVERVIEW
What is Excel?
Excel Pros and Cons

02 DATA PREPARATION WITH EXCEL
Handling data source
Handling data types
Handling data issues
Handling data formatting
Data problem checklist

03 ANALYZE AND VISUALIZE DATA
Analysis toolpak
Pivot chart
01
EXCEL
OVERVIEW
Excel is a spreadsheet application that features calculation capabilities, graphing tools, and pivot tables.
EXCEL
OVERVIEW
Popular Functions

Pivot Tables
Pivot tables allow you to obtain relevant data from a large dataset.

What-if Analysis
What-if Analysis helps you experiment with different scenarios for values or formulas.
EXCEL
OVERVIEW
Popular Functions

Conditional Formatting
The Conditional Formatting feature allows highlighting cells with a distinct colour, depending on the value assigned to them.

Charts
Creating charts is quite easy, and charts depict data in various ways that are often more useful than a sheet.
EXCEL
OVERVIEW
Popular Functions

Sort & Filter
Lists can be sorted by colour, reversed, or randomized. Filters are applied to display data which meet certain criteria.

Vlookup & HLookup
Vlookup and Hlookup are used to find a value in a table and fetch other values corresponding to it.
EXCEL
OVERVIEW
Excel Advantages

o Excel is clear
o Excel is the best support tool for Standard Data Analysis
o Excel mapping is powerful
o Excel is non-technical
EXCEL
OVERVIEW
Excel Disadvantages

Missing Data
An unwary user may not realize that anything is wrong.

Lack of Flexibility
Excel has its own expectations for how data is arranged.

Scattered Output
Output is scattered all over worksheets.

Erroneous Output
Output can be incomplete or not properly labeled.

Repeated Requests
The same request has to be repeated multiple times for similar actions.

No History Record
It is difficult to document or repeat your analysis.
02
DATA PREPARATION
WITH EXCEL
1. Handling data source
2. Handling data types
3. Handling data issues
4. Handling data formatting
5. Data problem checklist
2.1 HANDLING
DATA SOURCE
Types Of Data Source

Company Data
o Web data
o Survey data

Open Data
o APIs
o Public records

2.1 HANDLING
DATA SOURCE
Web Data

Uses:
o Calculate conversion rates
o Monitor the popularity of different pieces of content

Fields recorded for each event:
o Name of the event
o URL of the page visited
o An identifier for the element that was clicked
o The timestamp of the event
o An identifier for the user that performed the action
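
A minimal sketch of how event fields like these could feed the two uses named on the slide. The event names (`page_view`, `purchase`), the sample rows, and the definition of conversion (share of visiting users who also purchased) are assumptions for illustration, not part of the lab data.

```python
from collections import Counter

# Hypothetical click-stream rows: (event name, URL, element id, timestamp, user id)
events = [
    ("page_view", "/home", "nav", "2022-01-01T10:00:00", "u1"),
    ("page_view", "/home", "nav", "2022-01-01T10:05:00", "u2"),
    ("purchase", "/checkout", "buy-button", "2022-01-01T10:07:00", "u2"),
    ("page_view", "/pricing", "nav", "2022-01-01T11:00:00", "u3"),
    ("page_view", "/home", "nav", "2022-01-01T11:30:00", "u4"),
]

visitors = {user for name, _, _, _, user in events if name == "page_view"}
buyers = {user for name, _, _, _, user in events if name == "purchase"}

# Assumed definition: share of visiting users who also purchased.
conversion_rate = len(buyers & visitors) / len(visitors)

# Popularity of content: number of page views per URL.
page_views = Counter(url for name, url, _, _, _ in events if name == "page_view")

print(f"Conversion rate: {conversion_rate:.0%}")
print(page_views.most_common())
```

In Excel the same counts would typically come from a pivot table over the event log; the sketch only shows the arithmetic.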
2.1 HANDLING
DATA SOURCE
Survey Data

Net Promoter Score
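
The slide names Net Promoter Score without defining it. Under the commonly used definition (answers on a 0 to 10 scale, promoters answer 9 or 10, detractors answer 0 to 6), a minimal sketch with made-up responses could look like this:

```python
# Hypothetical answers to "How likely are you to recommend us?" on a 0-10 scale.
responses = [10, 9, 8, 7, 6, 10, 3, 9, 8, 5]

promoters = sum(1 for r in responses if r >= 9)
detractors = sum(1 for r in responses if r <= 6)

# NPS = % promoters - % detractors, a number between -100 and 100.
nps = 100 * (promoters - detractors) / len(responses)
print(nps)
```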


2.1 HANDLING
DATA SOURCE
Public Records
2.2 HANDLING
DATA TYPES
Data Types in Statistics

Data
  Qualitative: described by a characteristic
    Nominal: an unordered list of categories
      Example: What languages do you speak? (English, Vietnamese, German)
    Ordinal: on a scale, or ordered by magnitude
      Example: How likely are you to recommend to a friend? (1, 2, 3, 4)
  Quantitative: described by a numerical scale
    Discrete: the data can only take a limited set of values
      Example: The result of rolling 2 dice only has the values 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
    Continuous: there is an infinite range of possible values
      Example: A person's height could be any value, not just certain fixed heights
    Interval: no true zero value
      Example: Temperature in Celsius, SAT score 200-800
    Ratio: there is a true zero value
      Example: Income, weight, length
2.3 HANDLING
DATA ISSUES
Data Cleaning

HOW DATA SCIENTISTS SPEND TIME
o 80% Cleaning bad data
o 20% Analysing

Data Cleaning is the process of quickly and easily removing data that may distort your analysis.
2.3 HANDLING
DATA ISSUES
Data Cleaning

WHY?
Data cleaning matters for business intelligence and data analytics:

o To avoid costly errors
o Better decision-making process
o Enhance customer acquisition
o Improve employee productivity
o Accuracy, consistency, completeness, …


2.3 HANDLING
DATA ISSUES
Types of Data Issues

DATA ISSUES
o DIRTY DATA
o MISSING DATA
o OUTLIERS
2.3 HANDLING
DATA ISSUES
Dirty Data
Dirty data contains errors, or is in a format that’s unfriendly or unusable.

Not Parsed Correctly
Name: Smith, John
LName: Smith   Fname: John

Extra Characters
Name: “John Smith”
Name: John Smith

Unexpected Pattern
Email
John.doe@gmail.com
helen@hotmail
schow@yahoo.com

Misspelled Entries
125 Main Strrreet
125 Main Street

Duplicate Data Records
ID  Email        ID  Email
1   John         1   John
1   John         2   Helen
2   Helen

Incorrect Data
Date        Amount
2016/07/22  99
3000/07/08  330
2016/05/17  1000
2.3 HANDLING
DATA ISSUES
Dirty Data - Extra characters

Extra characters can be currency symbols, number signs… We need to remove these before changing between field types.

Example 1: “Maureen”, “Patrick”, “Ben”
Example 2: 33829L
Example 3: $346
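
A minimal sketch of stripping extra characters from the three examples before converting field types. The helper names are hypothetical, and treating the trailing `L` in `33829L` as debris to discard is an assumption.

```python
import re

def strip_extra_characters(raw: str) -> str:
    """Remove surrounding whitespace and straight or curly quotes from a text value."""
    return raw.strip().strip('“”"\'')

def to_number(raw: str) -> float:
    """Keep only digits, the decimal point and the minus sign, then convert."""
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    return float(cleaned)

names = [strip_extra_characters(n) for n in ["“Maureen”", "“Patrick”", "“Ben”"]]
amounts = [to_number(v) for v in ["33829L", "$346"]]
print(names, amounts)
```

In Excel, functions such as `SUBSTITUTE`, `TRIM` and `VALUE` cover the same ground.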
2.3 HANDLING
DATA ISSUES
Dirty Data – Not Parsed Correctly

Raw data should be stored in its smallest parts, since it will be easier to analyze.
2.3 HANDLING
DATA ISSUES
Missing Data
Missing data = Gaps in data

Blanks/Empty Cells (CSV)
Name   City      Age     GPA
John   Paris     12      90
Minh   New York          100
Mei                      23
Lucy   Tokyo     11      92
Peter  Beijing   12

Null Values (Database)
Name   City      Age     GPA
John   Paris     12      90
Minh   New York  (null)  100
Mei    (null)    (null)  23
Lucy   Tokyo     11      92
Peter  Beijing   12      (null)

N/A (Program)
Name   City      Age     GPA
John   Paris     12      90
Minh   New York  N/A     100
Mei    N/A       N/A     23
Lucy   Tokyo     11      92
Peter  Beijing   12      N/A

Why care about missing data?
• Some statistical algorithms will not work
• Can add bias to a model

BIAS in statistics refers to the tendency of an analysis to either overestimate or underestimate the value of a specific field or parameter.
2.3 HANDLING
DATA ISSUES
Missing Data - Example

ID  Age (REAL DATA)  Age (COLLECTED DATA)
1   27               27
2   37               37
3   70
4   55               55
5   23               23
6   25               25
7   35               35
8   51               51
9   65
10  67               67

Average age without missing values (real data): 45.5
Average age with missing values (collected data): 40
DOWNWARD BIAS
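
The two averages can be checked with a short sketch that uses the ages listed above, with `None` standing in for the values missing from the collected data (IDs 3 and 9). For the ten ages as listed, the full-data mean works out to 45.5.

```python
from statistics import mean

real_ages = [27, 37, 70, 55, 23, 25, 35, 51, 65, 67]
# In the collected data the ages for ID 3 and ID 9 are missing.
collected_ages = [27, 37, None, 55, 23, 25, 35, 51, None, 67]

real_mean = mean(real_ages)
observed = [age for age in collected_ages if age is not None]
collected_mean = mean(observed)

print(real_mean, collected_mean)
```

Because the two missing ages (70 and 65) are both above average, ignoring them pulls the estimate down, which is the downward bias the slide refers to.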


2.3 HANDLING
DATA ISSUES
Missing Data - Solutions

Before addressing the problem of missing data, you should answer the questions below:

1. How much data is really missing? (>=20%)
2. Is the missing data numeric or categorical?

The solutions will be based on the answers.

SOLUTIONS
1 Deleting Missing Data
2 Imputation
2.3 HANDLING
DATA ISSUES
Missing Data – Deleting Missing Data

Deleting missing data is often the default method because of its simplicity.

However, you should make sure that deleting missing data doesn't have adverse effects on your analysis. For example, if a particular demographic tended to leave a response blank in a survey, then removing records with blank entries will mean that a part of the population is underrepresented.

One of the downsides is that eliminating missing data reduces the size of the dataset. (Ex: cost data)

Before deleting:
ID  Name   City      Age  GPA
1   John   Paris     12   90
2   Minh   New York       100
3   Mei                   23
4   Lucy   Tokyo     11   92
5   Peter  Beijing   12

After deleting:
ID  Name   City      Age  GPA
1   John   Paris     12   90
4   Lucy   Tokyo     11   92
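
A minimal sketch of deleting incomplete records, using the five-row example with `None` for each blank cell. It also reports how much of the dataset is lost, since that is the downside noted above.

```python
rows = [
    {"ID": 1, "Name": "John", "City": "Paris", "Age": 12, "GPA": 90},
    {"ID": 2, "Name": "Minh", "City": "New York", "Age": None, "GPA": 100},
    {"ID": 3, "Name": "Mei", "City": None, "Age": None, "GPA": 23},
    {"ID": 4, "Name": "Lucy", "City": "Tokyo", "Age": 11, "GPA": 92},
    {"ID": 5, "Name": "Peter", "City": "Beijing", "Age": 12, "GPA": None},
]

# Keep only the rows in which every field has a value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]
share_dropped = 1 - len(complete_rows) / len(rows)

print([r["ID"] for r in complete_rows], f"{share_dropped:.0%} of rows dropped")
```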
2.3 HANDLING
DATA ISSUES
Missing Data – Imputation

In statistics, imputation is the process of substituting values in the data where the values are missing.

We are creating fake data in order to develop a model that makes sense and is as close to reality as we can get it.

Common substitutes:
o MEAN
o MEDIAN
o MODE
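
A minimal sketch of imputation with the mean, median or mode of the observed values. The `impute` helper is hypothetical, the `ages` list is the Age column of the earlier example table, and the `cities` list is made up so that it has a clear mode.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [12, None, None, 11, 12]                           # numeric column
cities = ["Paris", "New York", None, "Tokyo", "Paris"]   # hypothetical categorical column

ages_filled = impute(ages, "median")
cities_filled = impute(cities, "mode")
print(ages_filled, cities_filled)
```

Mean and median only make sense for numeric columns; for a categorical column such as a city name, the mode is the only one of the three that applies.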
2.3 HANDLING
DATA ISSUES
Missing Data – Imputation

How can we decide when to use mean, median and mode?

MEAN vs MEDIAN
When is the median a better summary description of data than the mean? Let’s look at seven employees at a small firm with the following salaries.

Salary
$28,000
$33,000
$33,000
$34,000
$37,000
$40,000
$400,000

What’s the typical salary in this group?
Mean: $86,000
Median: $34,000

MODE is not a relevant descriptive statistic when the data is essentially continuous.

Date      Rate
1/1/2022  0.936
2/1/2022  0.93
3/1/2022  0.876
4/1/2022  0.875
5/1/2022  0.86

● Mean: preferred if the data is numeric and not skewed.

● Median: preferred if the data is numeric and skewed.
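
A short sketch of the salary comparison. It assumes the seven salaries are $28,000, $33,000, $33,000, $34,000, $37,000, $40,000 and $400,000, which is the set that reproduces the stated mean (about $86,000) and median ($34,000).

```python
from statistics import mean, median

# Assumed salary list for the seven employees (see the note above).
salaries = [28_000, 33_000, 33_000, 34_000, 37_000, 40_000, 400_000]

typical_mean = mean(salaries)
typical_median = median(salaries)

# The same statistics without the single very high salary.
without_top = salaries[:-1]
mean_without_top = mean(without_top)

print(round(typical_mean), typical_median, round(mean_without_top))
```

One salary of $400,000 pulls the mean far above what six of the seven employees earn, while the median stays close to the typical value. That is why the median is preferred for skewed data.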


2.3 HANDLING
DATA ISSUES
Missing Data – Solution Comparison
2.3 HANDLING
DATA ISSUES
Outliers

Identifying outliers in the data helps us understand how vulnerable our model would be to a small set of observations.

An outlier is either:
o Incorrect Data
OR
o Abnormal but Correct Data


2.3 HANDLING
DATA ISSUES
Outliers - Identify

Identify outliers methodically rather than simply eyeballing them:

• Violin Plot (a hybrid of a box plot and a kernel density plot): shows the volume of the distribution
• Others: z-scores or standard deviations

BOX AND WHISKER PLOT ELEMENTS
o Upper Fence
o 3rd Quartile
o Median
o 1st Quartile
o Lower Fence
o Interquartile Range (from the 1st Quartile to the 3rd Quartile)
o Outliers (points beyond the fences)
2.3 HANDLING
DATA ISSUES
Outliers – Identify: Exercise

Identify the outliers of this dataset using the IQR method below.

10 12 11 15
11 14 13 17
12 22 14 11

Lower Fence = Q1 - 1.5*IQR
Upper Fence = Q3 + 1.5*IQR

Q1 is the median of the data that lies between the minimum and the Median.
Q3 is the median of the data that lies between the maximum and the Median.
IQR is the difference between Q3 and Q1.


2.3 HANDLING
DATA ISSUES
Outliers – Identify: Exercise

10 12 11 15
11 14 13 17
12 22 14 11

Sort the data points in ascending order:
10 11 11 11 12 12 13 14 14 15 17 22

Median = 12.5
Q1 = 11, Q3 = 14.5
IQR = Q3 – Q1 = 3.5

Lower Fence = Q1 - 1.5*IQR = 5.75
Upper Fence = Q3 + 1.5*IQR = 19.75

22 is the only outlier.
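
The exercise can be checked with a short sketch that follows the same rule: Q1 and Q3 are the medians of the lower and upper halves of the sorted data. The function name is hypothetical. Library quartile functions (and Excel's `QUARTILE` variants) may interpolate differently and give slightly different fences.

```python
from statistics import median

def iqr_fences(values):
    """Return (q1, q3, lower_fence, upper_fence) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]  # the middle value is left out when n is odd
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 15, 11, 14, 13, 17, 12, 22, 14, 11]
q1, q3, lower_fence, upper_fence = iqr_fences(data)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, lower_fence, upper_fence, outliers)
```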
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers

RETAIN
when there legitimately may be high values

TRIMMING
when you simply don’t believe the answers

WINSORIZING
when we want to retain the high-value responses but not take them too literally
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers: EXAMPLE

In the example below, most physicians report under 100 patients per month, but a few, 4%, report much higher numbers.

The screener termination criteria already bound the responses to be at least 5, so we might clip answers above 100 as shown by the gold line.
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers: RETAIN

Just because numbers are atypical doesn’t mean they are unreasonable. Many phenomena yield “long-tail” distributions where a few outliers legitimately exist. For instance, in economics most people have modest wealth, but a few have very high net worth, and to exclude them from analysis would be misleading.

“Long-tail” distributions often look normal, or at least more reasonable, when shown on a log scale.

[Figure: the distribution BEFORE and AFTER. Here the distribution is shown on a log scale, with small bin ranges for smaller numbers and larger bin ranges for larger numbers.]
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers: TRIMMING
Another approach is to ignore responses outside the main range.
To do this we can set a filter which includes only responses that
fall within the range [5, 100].

[Figure: the distribution BEFORE and AFTER trimming]
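
A minimal sketch of trimming to the range [5, 100]. The response values are made up for illustration; only the bounds come from the slide.

```python
# Hypothetical monthly patient counts reported by physicians.
responses = [5, 12, 30, 45, 60, 88, 100, 150, 400]

FLOOR, CEILING = 5, 100

# Trimming: responses outside the range are dropped from the analysis.
trimmed = [r for r in responses if FLOOR <= r <= CEILING]

print(trimmed, f"N went from {len(responses)} to {len(trimmed)}")
```

Note that trimming reduces the sample size, unlike capping the values.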
2.3 HANDLING
DATA ISSUES
Outliers – Dealing with outliers: WINSORIZING

We can cap those answers to within a defined range by setting the “ceiling” and “floor” attributes.

o The outlying values are not dropped but are now counted as if they were equal to 100 and thus fall in the range “91 to 100”.
o The range “91 to 100” has increased from 8% to 12%.
o The N size is still 100.
o The mean is a bit lower now.

Note that the median did not change at all. In all but the most extreme cases, the median is robust to outliers and unaffected by winsorizing because the extreme values stay on their side of the median.

[Figure: the distribution BEFORE and AFTER winsorizing]
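
A minimal sketch of winsorizing with a floor of 5 and a ceiling of 100. The response values are made up; the sketch only illustrates the three effects described above: the sample size is unchanged, the mean drops, and the median stays where it was.

```python
from statistics import mean, median

# Hypothetical monthly patient counts reported by physicians.
responses = [5, 12, 30, 45, 60, 88, 100, 150, 400]

FLOOR, CEILING = 5, 100

# Winsorizing: values outside the range are capped, not dropped.
winsorized = [min(max(r, FLOOR), CEILING) for r in responses]

print(len(responses), len(winsorized))
print(mean(responses), mean(winsorized))
print(median(responses), median(winsorized))
```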
2.4 HANDLING
DATA FORMATTING
Data Formatting

The 3 Hows:

• How to identify when your data needs to be formatted.
• How to massage data into the correct format.
• How to aggregate it to the form required.

1 Pivot/Unpivot
2 Aggregate Data/Group Data
3 Conditional Formatting
2.4 HANDLING
DATA FORMATTING
Data Formatting – Pivot/Unpivot

Pivot tables are one of Excel’s most powerful features for data analysis. Otherwise known as cross-tabulation, pivot tables are used to summarize (or slice) data so that you can focus on specific aspects that you want to explore in more depth.

You can quickly build a pivot table by dragging and dropping fields and rearranging them without needing to write any formulas. The flexibility to blend information also makes it easy to spot trends that might otherwise go unnoticed.
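
To make the cross-tabulation idea concrete, here is a minimal sketch of what a pivot table computes: one field becomes the rows, another the columns, and a third is summed in each cell. The sales records and field names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, amount)
sales = [
    ("North", "Tea", 120),
    ("North", "Coffee", 200),
    ("South", "Tea", 80),
    ("North", "Tea", 30),
    ("South", "Coffee", 150),
]

# Rows = region, columns = product, values = sum of amount.
pivot = defaultdict(lambda: defaultdict(int))
for region, product, amount in sales:
    pivot[region][product] += amount

products = sorted({product for _, product, _ in sales})
print("Region", *products)
for region in sorted(pivot):
    print(region, *(pivot[region][p] for p in products))
```

Swapping the roles of `region` and `product` in the loop corresponds to dragging the fields between the Rows and Columns areas in Excel.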
2.4 HANDLING
DATA FORMATTING
Data Formatting – Aggregate Data
2.4 HANDLING
DATA FORMATTING
Data Formatting – Conditional Formatting

Conditional formatting is used to change the appearance of cells in a range based on your specified conditions.

The conditions are rules based on specified numerical values or matching text.

Changing the appearance of cells can visually highlight interesting data points for analysis.
DATA PROBLEM CHECKLIST
“Garbage in, Garbage out”

Creating an analytical dataset   Issues                      1st Fix-date   2nd Fix-date
Data Source                      Enough Data
                                 Up to Date
Data types                       Data types correct
Data Issues: Dirty Data          Not Parsed Correctly
                                 Extra Characters
                                 Unexpected Pattern
                                 Incorrect Data
                                 Duplicate Data Records
                                 Misspelled Entries
Data Issues: Missing Data        Deleting Missing Data
                                 Imputation
Data Issues: Outliers            Case 1: Cross-check & fix
                                 Case 1: Delete
                                 Case 2: Trimming
Data Formatting                  Pivot/Unpivot
                                 Aggregating Data (Group Data)
03
ANALYZE AND
VISUALIZE DATA
ANALYZE AND
VISUALIZE DATA
Analyse Data - Analysis ToolPak

Analysis ToolPak is an add-in for Microsoft Excel that allows users to use data analysis tools for statistical and engineering analysis. The Analysis ToolPak consists of several functional tools that can be used to do statistical/engineering analysis. Given below is a table that includes the names of the functional tools available under Analysis ToolPak:

4. Correlation: To measure relations between variables.
5. Covariance: To find how much two random variables vary together.
6. Descriptive Statistics: Gives a report of univariate stats for your data (e.g. the mean and median).
7. Exponential Smoothing: A prediction tool for time series data.
8. F-Test Two-Sample for Variance: A test to compare two population variances.
9. Fourier Analysis: Solves problems in linear systems. Analyses periodic data.
10. Histogram: Creates a histogram from your data.
11. Moving Average: Predicts future performance based on averages from past data.
12. Random Number Generation: Generates columns of independent random numbers.
13. Rank and Percents: Calculates ordinal and percentage rank.
14. Regression: Performs least squares regression on a set of data.
15. Sampling: Takes a sample from a “population” that you input.
16. t-Test: To compare means between groups.
17. Z-Test: To compare means between groups.
18. Anova: To test if there’s a difference between groups.
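
To show what two of these tools compute, here is a minimal sketch of descriptive statistics and a simple moving average. It does not reproduce the ToolPak's output format; the series and the window size are made up, and `moving_average` is a hypothetical helper.

```python
from statistics import mean, median, stdev

def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]

# Hypothetical monthly values.
series = [10, 20, 30, 40, 50]

summary = {"mean": mean(series), "median": median(series), "stdev": stdev(series)}
smoothed = moving_average(series, window=3)

print(summary)
print(smoothed)
```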
ANALYZE AND
VISUALIZE DATA
Visualize Data - Pivot Chart

A pivot chart is the visual representation of a pivot table in Excel. Pivot charts and pivot tables are connected with each other.