MDA - 1.module 1 - BI Introduction - Data Prep


BUSINESS INTELLIGENCE

Module 1: Business Intelligence Fundamental & Data Prep


Profile
Trinh was formerly Sales Transformation – Project Operation Manager at Marico SEA, where he
managed and developed various innovation projects including Business Intelligence, Data
Analytics & System Development. He is currently Head of Sales Operation & Business
Transformation at Unilever Food Solution.
With an extensive background at multinational FMCG companies, he specializes in Sales
Analytics & Effectiveness, aiming to transform the way people organize and use data. He has
experience in successfully developing end-to-end BI systems, including dashboards & reporting
using Tableau, ETL (Extract, Transform & Load), datamart design, and building Analysis Services
for self-analysis purposes. He also frequently conducted analytics to review business performance
and suggest appropriate actions in execution. Before that, he worked as a Sales Analyst and in
Customer Development, which gave him a solid foundation of business acumen for generating
insight from data to fund growth and enhance business operations.
Proficient in: Tableau, Power BI, MSBI (SSAS, SSIS), SQL Programming
Mastering Data Analytics
Copyright Notice & Disclaimer

The content contained in this module is provided only for educational use in Mastering Data Analytics'
training courses. You may not copy, reproduce, distribute, publish, display, perform, modify,
create derivative works, transmit, or in any way exploit any such content, nor may you distribute
any part of this content over any network, including a local area network, sell or offer it for sale, or
use such content to construct any kind of database.

For permission to use the content, please contact via email: training@mastering-da.com
1 Business Intelligence Fundamental

2 Business Statistics

3 Descriptive Analytics

4 Diagnostics Analytics

5 Data Visualization

6 Business Intelligence Capstone


1. BUSINESS INTELLIGENCE FUNDAMENTAL
& DATA PREPARATION

01 BI Terminology
1. BI vs BA
2. Technology in BI
3. Data, Analysis, Analytics
4. Understanding Data

02 Business Intelligence in Corporates
1. Self-service BI & Analytics
2. Analytics structure and Coordination Model
3. BI Success (Organization)
4. BI Success (Individual)
5. The Evolution of Business Intelligence
6. How can data analytics help organizations?
7. BI process
8. Decision making with BI
9. Decision Bias

03 Data Preparation
1. Power Query Overview
2. Get Data
3. PQ – Basic Transform Data
4. Profiling Data
5. Data Issues
5a. Bad Shape + Dirty Data
5b. Missing Data + Outliers
6. Combine Data from Folder
7. Blending Data
8. Checklist
Business Intelligence Terminology
Business Intelligence & Business Analytics
Business Intelligence Terminology
Technology in Business Intelligence
Business Intelligence Terminology
Data, Analysis, Analytics

DATA
Data, in the information age, are a large set of bits encoding numbers, texts, images, sounds,
videos, and so on.

ANALYSIS
Analysis provides you with information & raises questions.

ANALYTICS
The science that analyzes crude data to extract useful knowledge (patterns) from them.
Analytics gives you insights & attempts to answer questions.

Source: A General Introduction to Data Analytics, Wiley & ChartMogul


Business Intelligence Terminology
Data, Analysis, Analytics

DATA ANALYTICS IS A BROADER TERM OF WHICH DATA ANALYSIS IS A SUBCOMPONENT

Data analytics is an overarching science or discipline that encompasses the complete
management of data. This not only includes analysis, but also data collection,
organization, storage, and all the tools and techniques used.

Data analysis refers to the process of examining, transforming and arranging a
given data set in specific ways in order to study its individual parts and extract useful
information.

Source: www.getsmarter.com
Business Intelligence Terminology
Data, Analysis, Analytics

5 months ago, Bank ABC's loan portfolio decreased by a total of 10.200 bio. VND due to attrition.
(Ending Loan portfolio = Beginning Loan + New loan – Attrition – Maturity)

Top 4 reasons for attrition in the bank:
(1) Dissatisfaction with services (50%)
(2) Lower rates at other banks (30%)
(3) Switching to another loan package within the bank (10%)
(4) Death (10%)
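
The loan portfolio identity quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and all figures are hypothetical and are not Bank ABC's numbers.

```python
def ending_loan_portfolio(beginning, new_loans, attrition, maturity):
    """Ending Loan portfolio = Beginning Loan + New loan - Attrition - Maturity."""
    return beginning + new_loans - attrition - maturity

# Hypothetical figures (same currency unit throughout), for illustration only.
ending = ending_loan_portfolio(beginning=500, new_loans=80, attrition=50, maturity=30)
print(ending)  # 500
```

In this made-up case new lending exactly offsets attrition plus maturity, so the portfolio is flat. Any attrition not replaced by new loans shrinks the ending balance.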
Business Intelligence Terminology
Understanding Data – Categories of Data
Business Intelligence Terminology
Understanding Data – Structures of Data

Cross-sectional Data

• Data collected from several elements/entities at the same, or approximately the same, point in time.

Sep 22, 2015              GOOG      YHOO      FB        Industry
Market Cap:               426.88B   28.62B    261.91B   277.63M
Employees:                57148     12500     10955     355
Qtrly Rev Growth (yoy):   0.11      0.15      0.39      0.15
Revenue (ttm):            69.61B    4.87B     14.64B    132.20M
Gross Margin (ttm):       0.62      0.67      0.83      0.58
EBITDA (ttm):             22.62B    541.75M   6.38B     3.47M
Operating Margin (ttm):   0.26      0.02      0.32      0.01
Net Income (ttm):         14.39B    6.94B     2.72B     N/A
EPS (ttm):                21.22     7.2       0.98      0
P/E (ttm):                29.34     4.22      94.47     33.33
PEG (5 yr expected):      1.22      -2.38     1.59      1.07
P/S (ttm):                6.26      6.02      18.39     3.74

Time Series Data

• Data collected over several time periods (Year, Month, Day, Hour…).
• Charts of time series data are common in business and economics.
• Help analysts understand what happened in the past, identify trends over time, and project
future levels for the time series.
Business Intelligence Terminology
Understanding Data – Structures of Data

Original Panel Data (Table)


Business Intelligence Terminology
Understanding Data – Structures of Data

Cross Table / Contingency Table / Pivot Table

A cross table is a two-way table (matrix) consisting of columns and rows.
Also known as a pivot table or a multi-dimensional table.

Regular Table
Business Intelligence Terminology
Understanding Data – Data face

Level of Measurement

Categorical

Numerical
Business Intelligence Terminology
Understanding Data – Data Sources

Computer files Database Web-based


Business Intelligence Terminology
Understanding Data – Importance of Data Types
Business Intelligence Terminology
Understanding Data – Data Types

String: String data can be declared in a number of different ways depending on the character
set required and the anticipated length of the string: any kind of characters, alphanumeric,
including symbols.

Numeric: Numeric data are numbers which can be whole numbers, such as Integers, or numbers
with decimal places. Types: Byte, Integer, Fixed Decimal, Float, Double.

Date/time: Date/time contains a specific date, or a combination of both date and time.

Boolean: The Boolean type is sometimes also called a logical type and is a conditional flag
representing either true or false.

Other: Images, Maps, Report objects, Sound.
Business Intelligence Terminology
Understanding Data – Data Types Exercise

Quiz:
What are the data types of Employee, Address, City, ZIP Code, Distance and Telecommuter?
Business Intelligence Terminology
Understanding Data – Data Types vs Data Format

Power Query Power BI Power BI (Format Data Type)


Business Intelligence Terminology
Summary

Business Intelligence vs Business Analytics

4 Types of Analytics

Understanding Data Structure

Data Source

Data Face

Importance of Data Type


Business Intelligence in Corporates
Self-service BI & Analytics

The Hardest Thing In Data Science


Math & IT isn’t the hardest thing in Data Science. Since it’s
so mature, and documented, and well-known, it’s quite
possibly the easiest thing to conquer in the skillset. The hardest
thing about Data Science is asking the right question.
Business Intelligence in Corporates
Self-service BI & Analytics

Understanding: ORGANIZATIONS EXIST TO CREATE VALUE
Creating value is taking what you know and turning it into action in order to
achieve a desired business outcome.
❖ Data must be relevant
❖ Information must be meaningful
❖ Insight must be actionable

Agile BI: ORGANIZATIONS HAVE TO BE QUICK AND NIMBLE
Agile BI (speed-to-value): Now, more than ever, business leaders need access to the right
information at the right time in order to act before decision windows close.

Describing: ANALYTICS IS A JOURNEY TO VALUE ATTAINMENT

Defining: EFFECTIVE SELF-SERVICE IS A BALANCING ACT BETWEEN FREEDOM AND CONTROL
Broaden BI usage while reducing the burden on IT. These companies have
learned that the goal of self-service is not unfettered liberation from IT, but rather a
partnership that balances freedom and control, flexibility and standards,
governance and self-service.
Business Intelligence in Corporates
Self-service BI & Analytics

2017-2018
• Static Reports
• Silo Views
• Excel Reports
• Lack of Reporting Solution
• Lack of Data Integration
• Lack of Interaction
• Performance Measurement

2019+
• Automatic Reports
• Data Warehouse
• Interactive Dashboards
• BI Self-Services Tools
• Email Alerts
• Report Subscription
• Collaborative Analysis
• Data Visualization
• Performance Measurement
• Opportunity Identification
• Exception Report

2020/21+
• Predictive Modeling
• Prescriptive Analytics
• Advanced Analytics
Business Intelligence in Corporates
Analytics structure and Coordination Model

Business Intelligence Discovery

Combine Sources Instant Association

Operational Data Historical Data External Data

POS SCM E-COMM EDW 1 EDW 2
MMS ERP WMS ODS 1 ODS 2


Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical Organizations
How centralized or decentralized should these organizations be?
Functions:
– Reporting
– Ad-hoc Analytics
– Modeling
Roles:
– Database Analysts
– Data Analysts
– Modelers
– Data Scientists
– Etc.
Diagram legend: shading shows where analytics are executed (Centralized Functions, Collaboration)
Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical Organizations – Centralized
In a Centralized model, a set of analytical activities are accomplished through a
central clearinghouse
Example: An enterprise analytics team serves the needs of marketing, finance,
operations, customer care, etc. with respect to reporting, ad-hoc analysis, and
statistical modeling
Key Advantages:
– Consistency
– Optimal management of bandwidth & focus on enterprise priorities
– Maximum efficiency in low-level tasks
Key Disadvantages:
– Responsiveness
– Lack of context / expertise can limit effectiveness in more complex tasks
– Requires large group and consistent overall loading
Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical Organizations – Allocated
In an Allocated model, analytical activities are accomplished by a common team, but
specific capacity is reserved for each functional area served
Example: In the same enterprise team, at least one resource is “assigned” to each
functional area, and takes priority from that group
Key Advantages:
– More responsive
– More context and accelerated development of domain expertise
– Consistency
– Some load balancing possible across needs
Key Disadvantages:
– Difficult to match allocation with enterprise priority
– Requires large central group and consistent overall loading
Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical Organizations – Coordinated
In a Coordinated model, analytical activities are accomplished by teams in
each functional group, but those groups use centrally defined processes & methods
Example: Analytics teams located in Finance, Marketing & Operations regularly
convene in a users group and participate in an enterprise-level Data Governance
program
Key Advantages:
– Highly responsive
– High degree of context and expertise attained
– Some degree of consistency maintained
Key Disadvantages:
– Coordination difficult
– Effort & data duplication more likely
– Requires a larger overall number of resources, and harder to adapt resource
levels to enterprise needs
Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Analytical Organizations – Distributed
In a Distributed model, analytical activities are accomplished by separate teams in
each functional group, with little or no coordination
Example: The Business and Consumer divisions of a large bank each have their
own independently managed analytics function(s)
Key Advantages:
– Extremely responsive
– High degree of context and expertise attained
– Efficient localized use of contracting resources
Key Disadvantages:
– Lack of consistency in methods and sources
– Effort & data duplication very likely
– Requires largest overall resources, and can be expensive, esp. when
contractors used
Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Centralized Allocated Coordinated Distributed

Smaller SIZE OF COMPANY Larger

Less Different DIFFERENCES IN METHODS More Different

Near PHYSICAL LOCATION Far

Low CONTEXT REQUIRED Very High


Business Intelligence in Corporates
2. Analytics structure and Coordination Model

Board of director

Departments: Manufactory | Marketing | Sale | Analytics team | Planning | R&D | IT | Finance

Analytics Goal: Quality Control analytics | Marketing analytics | Sale analytics | Analytics Manager |
Planning & Strategy analytics | R&D analytics | IT analytics | Financial analytics

Expanded Team: QC Analyst | Marketing Analyst | Sales Analyst | Business Analyst |
Planning specialist | R&D Specialist | IT Specialist | Financial Analyst

Core Team: Advanced Analytics Specialist | Data engineer

Diagram arrows between the teams are labelled "Analytics Result".

Model of structure and coordination of personnel of analysis team members

Center of Excellence (CoE) model fits into the current context

- Under the Board of Directors
- Main member (full time)
- Extensive members (concurrently): Business, Information and Communication
Department, IT, R&D, Finance and Accounting, Human Resource Department
Business Intelligence in Corporates
4. BI Success (Individual)
Business Intelligence in Corporates
3. BI Success (Organization)

01. The People Domain
• Organizational Alignment
• The organizational model
• The importance of collaboration
• The right team
• The right support

02. The Process Domain
• Setting priorities
• Balancing risk and reward
• Building requirements that work
• Evaluate and Evolve
• Sell the vision
• Control the Chaos

03. The Technology Domain
• The role of technology
• The self-service architecture
• The place for big data
• The confusing analytics landscape

04. The Data Domain
• Understanding data
• Preparing data
• Data on the edge
• Understanding data science
• Data privacy
Business Intelligence in Corporates
5. The Evolution of Business Intelligence

1st Generation: Centralized
2nd Generation: Decentralized
3rd Generation: Democratized


Business Intelligence in Corporates
6. How can data analytics help organizations?
Business Intelligence in Corporates
7. BI process

Process diagram (labels recovered from the slide; original flow order not preserved):
Context / Big Picture, Data Dictionary, Data, Data Cleansing, Data Taxonomy, Data Model,
Analytics, Dashboard/Ad-hoc/ML, Visual Results, Insights, Focus (Declutter), Data Storytelling
Business Intelligence in Corporates
8. Decision Making with BI
Business Intelligence in Corporates
9. Decision Making Bias

Data Analytics for Decision making


✓ Impressively large data sets
✓ The best analytics tools
✓ Careful statistical methods
Decision-making traps: People don’t carefully process every piece
of information in every decision.
Instead, we often rely on heuristics—simplified procedures that allow us to
make decisions in the face of uncertainty or when extensive analysis is too
costly or time-consuming.
There are three main cognitive traps that regularly skew decision making:
• The Confirmation Trap
• The Overconfidence Trap
• The Overfitting Trap
Business Intelligence in Corporates
9. Decision Bias

Confirmation bias is our tendency to search for and favor all information that confirms our
beliefs while ignoring or devaluing information that contradicts our beliefs
Business Intelligence in Corporates
9. Decision Making Bias

Overconfidence Bias

Placing too much faith in your own knowledge, and believing
that your contribution to a decision is more valuable
than it actually is.
• It prevents us from questioning our methods,
motivation, and the way we communicate our
findings to others
• It makes it easy to underinvest in data and
analysis; when we feel too confident in our
understanding, we don’t spend enough time or
money acquiring more information or running
further analyses.
Business Intelligence in Corporates
9. Decision Making Bias

Overfitting occurs when a predictive model fits its historical data (training data)
too closely, reproducing noise as well as the underlying pattern. When this happens,
the model cannot perform accurately on unseen data.
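
A small sketch can make the overfitting trap concrete. The data, the "memorizer" model and every function name below are hypothetical, chosen only for illustration: one model memorizes the training points exactly, the other fits a simple straight line.

```python
def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Ordinary least-squares straight line (a simple model)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """Predicts the y of the nearest training x, so it reproduces training data exactly."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Hypothetical data: the true relationship is y = 2x; training y carries noise of +/-1.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]        # unseen data
test_y = [2 * x for x in test_x]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The memorizer has zero error on the training data but a larger error than the straight line on the unseen points, which is the pattern described above.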
Business Intelligence in Corporates
Summary

Understand Business Intelligence Self-Services

BI Self-Services in Corporates

Structure of BI Team

What Is Needed for BI Success (Individual & Corporate)

Business Intelligence Process

Decision Making Bias


DATA PREPARATION
Power Query Overview - Self-Service BI
DATA PREPARATION
Power Query Overview

REPORT DEVELOPMENT PROCESS POWER QUERY BENEFITS


DATA PREPARATION
Power Query Overview

Power Query is an ETL tool. ETL stands for Extract, Transform and Load.

• Extract – Data can be extracted from a variety of sources: databases, CSV files, text files, Excel
workbooks, websites and even PDFs.

• Transform – After the data has been extracted, it can be cleaned up (e.g., remove spaces, split columns,
change date formats, fill blanks, find and replace) and reshaped (e.g., unpivot, remove columns).
Data extracted from different sources is unlikely to be consistent, so the transform process is
used to make it ready for use.

• Load – Once the data has been extracted and transformed, it needs to be put somewhere so that you
can use it. From an Excel perspective, it can be pushed into a worksheet, a data model, or another query.

To summarize, Power Query takes data from different sources and turns it into something that can be
used. As a tool, this is already useful. But here is the best part: once the ETL process has been
created, it can be run over and over again with a single click, which can save hours of work every week.
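
The three ETL stages can be sketched outside Power Query as well. The Python below is a rough analogue, not Power Query itself: the sample data, column names and function names are all hypothetical.

```python
import csv
import io

# Hypothetical raw export: stray spaces, a currency symbol, and a blank row.
RAW_CSV = """Region , Sales
 North ,$1200
South,$950

 East , $400
"""

def extract(text):
    """Extract: read rows from a CSV source (here an in-memory string)."""
    return list(csv.reader(io.StringIO(text)))

def transform(rows):
    """Transform: trim spaces, drop blank rows, convert sales to numbers."""
    header, *body = rows
    cleaned = []
    for row in body:
        if not row:                      # skip blank lines
            continue
        region, sales = (cell.strip() for cell in row)
        cleaned.append({"Region": region, "Sales": int(sales.lstrip("$"))})
    return cleaned

def load(records, target):
    """Load: put the cleaned records somewhere usable (here, a list)."""
    target.extend(records)
    return target

def run_etl():
    """The whole pipeline is one call, so it can be re-run on every refresh."""
    return load(transform(extract(RAW_CSV)), [])

print(run_etl())
```

Because the steps are recorded once as functions, re-running them on new data is a single call, which mirrors the "single click" refresh described above.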
DATA PREPARATION
Get Data – Data Sources

If your data is local in any form, there is a small but necessary extra step involved in
setting up report refreshes: installing a Power BI data gateway. This tool is included in
the Power BI subscription. It is simply a piece of software that lets Power BI reports
retrieve data from a database or file location on a local server or machine.

If your data is not local, it resides in some type of cloud service. This typically makes it
much easier to connect to and refresh once the report is in the Power BI service. Since all
of the data is already online, no additional gateways or setup are required once you've
built the report and published it to powerbi.com.
DATA PREPARATION
Get Data – Data Gateway
DATA PREPARATION
Get Data
DATA PREPARATION – Get Data
Composite Models
DATA PREPARATION - Basic Transform Data
DATA PREPARATION - Basic Transform Data
Add/ Modify Columns
DATA PREPARATION - Basic Transform Data
Insert in Applied Steps
DATA PREPARATION - Basic Transform Data
Move in Applied Steps
DATA PREPARATION - Basic Transform Data
Data Types
DATA PREPARATION - Basic Transform Data
Column Data Type and Formatting

When you load data into Power BI Desktop, it will attempt to convert the data type of the
source column into a data type that better supports efficient storage, calculations, and
data visualization. For example, if a column of values you import from Excel has no
fractional values, Power BI Desktop will convert the entire column of data to a whole
number data type, which is better suited for storing integers. This concept is important
because some DAX functions have special data type requirements. We have already studied
the various data types available in DAX in detail. While in many cases DAX will implicitly
convert the data for you, there are some cases where it will not. For instance, if a DAX
function requires a date data type and the data type of the column is text, the DAX
function will not work correctly.
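
The same principle applies in most tools: date logic needs real dates, not text that looks like dates. The sketch below shows the idea in Python rather than DAX. The sample values and the day/month/year format are assumptions made for illustration.

```python
from datetime import date, datetime

# Hypothetical column imported as text; the "%d/%m/%Y" format is an assumption.
order_dates_as_text = ["05/01/2024", "17/02/2024", "30/03/2024"]

def to_date(text, fmt="%d/%m/%Y"):
    """Convert a text value to a real date so date logic can be applied."""
    return datetime.strptime(text, fmt).date()

order_dates = [to_date(t) for t in order_dates_as_text]

# Date arithmetic works on the converted values ...
days_between = (order_dates[-1] - order_dates[0]).days
print(days_between)  # 85

# ... but fails on the original text values.
try:
    order_dates_as_text[-1] - order_dates_as_text[0]
except TypeError as err:
    print("text column:", err)
```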
DATA PREPARATION - Basic Transform Data
Data Type vs Data Format
DATA PREPARATION - Basic Transform Data
Detect Data Type

3 BASIC STEPS

1. Change Query Name
2. Change Column Name
3. Change Data Type
Auto Detect Data Type (press Ctrl+A to select all columns)
DATA PREPARATION - Basic Transform Data
Merge Column
DATA PREPARATION - Basic Transform Data
Split Column

In Power Query, you can split a column through different methods. In this case, the column(s)
selected can be split by a delimiter
DATA PREPARATION - Basic Transform Data
Replace
DATA PREPARATION - Basic Transform Data
Trim & Clean

Trim - Removes leading and trailing whitespace from each cell in the selected columns.
Clean - Removes non-printable characters from each cell in the selected columns.
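
As a rough illustration of what the two operations do, here is a Python analogue. It is a sketch, not the Power Query implementation, and the sample cells are hypothetical.

```python
def trim(value):
    """Trim: remove leading and trailing whitespace."""
    return value.strip()

def clean(value):
    """Clean: remove non-printable characters (tabs, line feeds, control codes)."""
    return "".join(ch for ch in value if ch.isprintable())

raw_cells = ["  Kent ", "Seat\ttle\n", "\x00 WA  "]
tidy_cells = [trim(clean(cell)) for cell in raw_cells]
print(tidy_cells)  # ['Kent', 'Seattle', 'WA']
```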
DATA PREPARATION - Basic Transform Data
Filter Table
DATA PREPARATION - Basic Transform Data
Filter
DATA PREPARATION - Basic Transform Data
Advanced Filter
DATA PREPARATION - Basic Transform Data
Add Column from Example
DATA PREPARATION - Basic Transform Data
Add Custom Column
DATA PREPARATION – Basic Transform Data
Writing Power Query Functions
DATA PREPARATION – Basic Transform Data
M-code
DATA PREPARATION – Profile Data
Data profiling
DATA PREPARATION – Profile Data
Query Dependencies
DATA PREPARATION – Data Issue
Cleaning Data
DATA PREPARATION – Data Issue

Four kinds of data issues: Bad Shape, Dirty Data, Missing Data, Outliers
DATA PREPARATION – Data Issue
Data Shape Formatting

• How to identify when your data needs to be formatted.


• How to massage data into the correct format
• How to aggregate it to the form required

1. Transpose Table
2. Cross Tabulation (Pivot + Unpivot)
3. Aggregation (Group by)
DATA PREPARATION – Data Issue
Data Shape Formatting - Transpose Table

TRANSPOSE TABLE
DATA PREPARATION – Data Issue
Data Shape Formatting – Cross Tabulation

UNPIVOT & PIVOT


DATA PREPARATION – Data Issue
Data Shape Formatting – Cross Tabulation

Cross Tabulation Cross Tabulation


Pivot Unpivot
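
Pivot and unpivot are inverse reshaping operations. The sketch below shows the idea on a tiny table in plain Python. The table contents, column names and function names are hypothetical.

```python
# Hypothetical cross table: one row per product, one column per month.
cross_table = [
    {"Product": "A", "Jan": 10, "Feb": 12},
    {"Product": "B", "Jan": 7, "Feb": 9},
]

def unpivot(rows, id_column, attribute="Month", value="Sales"):
    """Turn every non-id column into its own (attribute, value) row."""
    return [
        {id_column: row[id_column], attribute: col, value: val}
        for row in rows
        for col, val in row.items()
        if col != id_column
    ]

def pivot(rows, id_column, attribute="Month", value="Sales"):
    """Inverse of unpivot: spread attribute values back into columns."""
    result = {}
    for row in rows:
        wide = result.setdefault(row[id_column], {id_column: row[id_column]})
        wide[row[attribute]] = row[value]
    return list(result.values())

long_table = unpivot(cross_table, "Product")
print(long_table[0])                  # {'Product': 'A', 'Month': 'Jan', 'Sales': 10}
print(pivot(long_table, "Product"))   # back to the cross table
```

The unpivoted (long) form has one fact per row, which is usually the shape analysis tools expect. The pivoted (cross table) form is easier for people to read.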
DATA PREPARATION – Data Issue
Data Aggregation

Data Aggregation
Aggregate by Month
DATA PREPARATION – Data Issue
Data Aggregation

Data Aggregation
Aggregate by Quarter
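
Aggregating by month or by quarter is the same "Group By" operation with a different grouping key. A minimal Python sketch, using hypothetical daily sales records:

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales records: (date, amount).
sales = [
    (date(2024, 1, 15), 100),
    (date(2024, 1, 28), 50),
    (date(2024, 2, 3), 70),
    (date(2024, 4, 9), 30),
]

def aggregate(records, key):
    """Group records by key(date) and sum the amounts (a 'Group By')."""
    totals = defaultdict(int)
    for day, amount in records:
        totals[key(day)] += amount
    return dict(totals)

by_month = aggregate(sales, lambda d: (d.year, d.month))
by_quarter = aggregate(sales, lambda d: (d.year, (d.month - 1) // 3 + 1))
print(by_month)    # {(2024, 1): 150, (2024, 2): 70, (2024, 4): 30}
print(by_quarter)  # {(2024, 1): 220, (2024, 2): 30}
```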
DATA PREPARATION – Data Issue
Dirty Data

Dirty data contains some kind of error, or is in a format that’s unfriendly or unusable.
DATA PREPARATION – Data Issue
Dirty Data

Extra characters can be currency symbols, number signs and so on. We need to remove these before
changing between field types, for example from text to number.
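
A short sketch of stripping extra characters before converting text to a number. The sample values are hypothetical, and the code assumes "." is the decimal separator and "," a thousands separator, which may not hold for every locale.

```python
import re

def to_number(text):
    """Strip everything except digits, minus sign and decimal point, then convert."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned)

raw_values = ["$1,250.50", "# 42", "USD 300", "-15%"]
print([to_number(v) for v in raw_values])  # [1250.5, 42.0, 300.0, -15.0]
```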
DATA PREPARATION – Data Issue
Dirty Data

Raw Data: Data stored in its smallest size

No:

Addresses
313 173rd Blvd, Kent, WA 981215
316 66th Blvd, Kent, WA 981244
4358 23rd St, Kent, WA 981225
965 151st St, Kent, WA 981162
7900 173rd Lane, Kent, WA 981266
4047 15th Ave, Kent, WA 981228
4907 13th Ave, Kent, WA 981232
3789 4th Blvd, Seattle, WA 981152
2977 66th Lane, Seattle, WA 981171
3392 23rd St, Seattle, WA 981131

Yes:

Address           City      State   Zip
313 173rd Blvd    Kent      WA      981215
316 66th Blvd     Kent      WA      981244
4358 23rd St      Kent      WA      981225
965 151st St      Kent      WA      981162
7900 173rd Lane   Kent      WA      981266
4047 15th Ave     Kent      WA      981228
4907 13th Ave     Kent      WA      981232
3789 4th Blvd     Seattle   WA      981152
2977 66th Lane    Seattle   WA      981171
3392 23rd St      Seattle   WA      981131
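
Going from the combined address to its smallest parts is a split by delimiter. The sketch below assumes every address follows the "street, city, state zip" pattern shown in the table. The function name is hypothetical.

```python
def split_address(full_address):
    """Split 'street, city, state zip' into four separate fields."""
    street, city, state_zip = (part.strip() for part in full_address.split(","))
    state, zip_code = state_zip.split()
    return {"Address": street, "City": city, "State": state, "Zip": zip_code}

print(split_address("313 173rd Blvd, Kent, WA 981215"))
# {'Address': '313 173rd Blvd', 'City': 'Kent', 'State': 'WA', 'Zip': '981215'}
```

The Zip value is kept as text on purpose: it is an identifier, not a quantity to calculate with.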
DATA PREPARATION – Data Issue
Missing Data

Missing data: gaps in data

Blank/Empty cells (CSV), Null value (Database), N/A (program)

BIAS in statistics refers to the tendency of an analysis to either over- or
underestimate the values of that specific field or parameter.
DATA PREPARATION – Data Issue
Missing Data

Real Data

Downward BIAS
DATA PREPARATION – Data Issue
Missing Data

SOLUTIONS
1. Deleting Missing Data
2. Imputation
DATA PREPARATION – Data Issue
Missing Data

Deleting Missing Data

Deleting missing data is often the default method because of its simplicity. No decisions
need to be made that might confuse the data. You just get rid of records where there are
missing values.

However, you should make sure that deleting missing data doesn't have adverse effects on
your analysis. For example, if a particular demographic tended to leave a response blank in
a survey, then removing records with blank entries will mean that a part of the population
is underrepresented.
One of the downsides is that eliminating missing data reduces the size of the dataset (Ex: cost).
DATA PREPARATION – Data Issue
Missing Data

Imputation
In statistics, imputation is the process of substituting values in the data where the
values are missing. When we impute values, we are making them up: we are creating
artificial data in order to develop a model that makes sense and is as close to
reality as we can get it.
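
The two solutions can be compared side by side. The sketch below uses hypothetical survey ages and mean imputation. Mean imputation is only one of several possible methods and is chosen here for simplicity.

```python
from statistics import mean

# Hypothetical survey ages; None marks a missing response.
ages = [25, 31, None, 40, None, 24]

# Option 1: delete records with missing values (the dataset gets smaller).
deleted = [a for a in ages if a is not None]

# Option 2: impute, here with the mean of the observed values.
fill_value = mean(deleted)
imputed = [a if a is not None else fill_value for a in ages]

print(len(deleted), len(imputed))  # 4 6
print(imputed)
```

Deletion keeps only real observations but loses a third of the records in this example. Imputation keeps every record but two of the six values are now estimates.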
DATA PREPARATION – Data Issue
Missing Data – Select the method

Which methodology might be the best approach?

1. How much data is really missing? (>=20%)
2. How is the missing data distributed across the dataset? (2/10 predictor variables missed)
3. Are those specific variables actually significant to our analysis and model-making
process?
4. Is the missing data numeric or categorical?
DATA PREPARATION – Data Issue
Outliers

Identifying outliers in the data helps us understand how vulnerable our model would be to a small
set of observations.
DATA PREPARATION – Data Issue
Outliers

Identifying outliers more methodically rather than simply eyeballing them


Violin Plot: shows the volume of the distribution
Others: z-scores or standard deviations
DATA PREPARATION – Data Issue
Outliers

If a value lies more than 1.5 times the INTERQUARTILE RANGE (IQR) below the first quartile
or above the third quartile of a data set, then it can be considered an OUTLIER
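
A common reading of the 1.5 × IQR rule flags values that lie more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3). The sketch below applies that reading to hypothetical ages. Quartile conventions differ slightly between tools, so the fence values may differ a little elsewhere.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)   # uses the statistics module's default method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical ages, including one obvious data-entry error.
ages = [22, 25, 27, 30, 31, 33, 35, 38, 41, 299]
print(iqr_outliers(ages))  # [299]
```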
DATA PREPARATION – Data Issue
Outliers

1 & 2/ ERRORS (Ex: Age: 299)
1. Try to go back to the original source to determine the correct data
2. Delete the record from the dataset

3. Don’t have obvious errors, but we aren’t certain whether the data is accurate or not


DATA PREPARATION
Combine Data from Folder
DATA PREPARATION - Data Blending

Data may come from different places, and as a result, it will all need to be
stitched together into one data file.
DATA PREPARATION - Data Blending
Unions / Append Queries

Union allows you to take multiple datasets and deal with them as one
DATA PREPARATION - Data Blending
Join / Merge Query
DATA PREPARATION - Data Blending
Union Query
DATA PREPARATION - Data Blending
Merge Query
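
The difference between a union (append) and a join (merge) can be sketched with plain Python lists. The tables, column names and the `left_join` helper are hypothetical. The helper assumes each key appears at most once in the right-hand table.

```python
# Hypothetical monthly extracts with identical columns, plus a customer table.
jan_sales = [{"CustomerID": 1, "Amount": 100}, {"CustomerID": 2, "Amount": 80}]
feb_sales = [{"CustomerID": 1, "Amount": 60}, {"CustomerID": 3, "Amount": 40}]
customers = [{"CustomerID": 1, "City": "Kent"}, {"CustomerID": 2, "City": "Seattle"}]

# Union / Append: stack the rows of both tables into one.
all_sales = jan_sales + feb_sales

def left_join(left, right, key):
    """Join / Merge: add matching right-hand columns to each left-hand row."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]

merged = left_join(all_sales, customers, "CustomerID")
print(len(all_sales))   # 4
print(merged[0])        # {'CustomerID': 1, 'Amount': 100, 'City': 'Kent'}
print(merged[-1])       # {'CustomerID': 3, 'Amount': 40}
```

Appending adds rows, while merging adds columns. Customer 3 has no match in the customer table, so that row keeps only its original columns.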
DATA PREPARATION - Data Blending
Fuzzy Matching

Fuzzy Matching will enable you to join 2 data sets together where a regular join may fail.
The Fuzzy Match identifies records with similar string values in specified fields.

Fuzzy Matching uses algorithms to score how similar 2 words or phrases are.

Fuzzy Matching Algorithms

Jaro: The Jaro algorithm measures the number of matching characters two strings have in
common, where matching characters are no more than half the length of the longer string
apart, with consideration for transpositions.
Levenshtein: The Levenshtein algorithm counts the number of edits (insertions, deletions, or
substitutions) needed to convert one string to the other.
DATA PREPARATION - Data Blending
Fuzzy Matching - Example

Fuzzy matching looks at these words and calculates a closeness-of-match score
based on their similarity.

The match threshold is the minimum score a pair must achieve in the fuzzy matching for
it to be considered a match.
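
To show how a score and a threshold work together, here is a sketch built on the Levenshtein edit distance. The normalisation to a 0-1 score, the 0.8 threshold and the function names are assumptions for illustration. Real fuzzy-matching tools may score differently.

```python
def levenshtein(a, b):
    """Number of insertions, deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def match_score(a, b):
    """Similarity from 0 to 1 (this normalisation is one common choice)."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a.lower(), b.lower()) / longest

def is_match(a, b, threshold=0.8):
    """A pair counts as a match when its score reaches the threshold."""
    return match_score(a, b) >= threshold

print(levenshtein("kitten", "sitting"))   # 3
print(match_score("Unilever", "Unilevr")) # 0.875
print(is_match("Unilever", "Unilevr"))    # True
print(is_match("Unilever", "Marico"))     # False
```

Raising the threshold makes the join stricter (fewer false matches, more misses). Lowering it does the opposite.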
DATA PREPARATION - Data Blending
Spatial Matching

Types of Spatial Data


All of these location data examples are represented by points, lines, or polygons

Points: A point, also referred to as a centroid, is in the form of a latitude
and longitude which we use to pinpoint its exact location.

Lines: A line is a string of latitude and longitude locations.

Polygons: Polygons are made up of a series of longitude and latitude coordinates
defining all of the vertices of a region.
DATA PREPARATION - Data Blending
Spatial Matching

There aren’t fields that can be used to join the two data sets together.
Gray area: to find how many customers fall within a store trade area, match customer
locations to the trade areas and assign a store number to each customer.
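
A spatial match of this kind boils down to a point-in-polygon test. The sketch below uses the ray-casting technique on hypothetical trade areas and customer locations. It treats coordinates as a flat plane, which is a simplification for real latitude/longitude data.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray from the point crosses."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):                      # edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical trade areas (lists of polygon vertices) keyed by store number.
trade_areas = {
    101: [(0, 0), (4, 0), (4, 4), (0, 4)],
    102: [(5, 0), (9, 0), (9, 4), (5, 4)],
}
customers = {"C1": (1, 1), "C2": (6, 2), "C3": (20, 20)}

def assign_store(location):
    """Return the first store whose trade area contains the location, else None."""
    for store, polygon in trade_areas.items():
        if point_in_polygon(location, polygon):
            return store
    return None

assignments = {name: assign_store(loc) for name, loc in customers.items()}
print(assignments)  # {'C1': 101, 'C2': 102, 'C3': None}
```

Once each customer carries a store number, the two data sets can be joined on that field like any regular merge.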
DATA PREPARATION - Data Blending
Spatial Matching - Example

Customer Information

Spatial Data
DATA PREPARATION – Transform Data
Why Combine Queries
DATA PREPARATION – Check List

CREATING AN ANALYTICAL DATASET
Columns: Issues | 1st Fix-date | 2nd Fix-date

Data Source
- Enough Data
- Up to Date

Data Types
- Data Types correctly

Data Issues
Dirty Data
- Not Parsed Correctly
- Extra characters
- Unexpected Pattern
- Incorrect Data
- Duplicate Data Records
- Misspelled Entries
Missing Data
- Deleting Missing Data
- Imputation
- Advanced methods
Outliers
- Errors: Cross-check & fix
- Errors: Delete
- No certainty: Remove if Insignificant
- Certainty: Truncation

Data Formatting
- Transposing
- Aggregating Data
- Cross Tabulation

Data Blending
- Unions
- Joins
- Fuzzy Matching
- Spatial Matching
Business Intelligence in Corporates
Summary

Know Role of Power Query & How to get data in Power Query?

PQ Basic Transformation. What is Profiling Data?

Data Issue – Bad Shape (Transpose + Unpivot/Pivot + Aggregation)

Data Issue – Missing Data + Outliers

Load from File vs Load from Folder

Data Blending. Why need to merge/append queries?
