4-Data Cleaning, Data Integration, Data Transformation, Data Reduction-03-02-2024


BECE352E

IoT Domain Analyst

Syllabus & Course Outcome

Module 2: Data Preprocess and EDA


Data Cleaning, Data Integration, Data Transformation, Data
Reduction, Significance of Exploratory Data Analysis, Making
sense of Data.

Course Outcome: 2
To analyze the data preprocessing techniques and EDA
techniques to convert data into insights.

Module 2: Data Preprocess and EDA

➢ Introduction
➢ Significance of Exploratory Data Analysis
➢ Making sense of Data
➢ Data Cleaning
➢ Data Integration
➢ Data Transformation
➢ Data Reduction

Introduction
• Data encompasses a diverse collection of objects, numbers, words,
events, facts, measurements, observations or descriptions.

• It is collected and stored by various events or processes across
disciplines like biology, economics, engineering, marketing, and many
others.

• Processing of such data aims to elicit useful information and generate
knowledge.

• The central question is how to derive meaningful and useful information
from this data; Exploratory Data Analysis (EDA) is the answer.

• EDA involves examining datasets to discover patterns, identify
anomalies, test hypotheses, and check assumptions using statistical
measures.

• Practical exploration will be done using open-source databases.

Introduction
EDA – Stages:
1. Data Requirements:
• Organizations need to identify and understand the types of data
essential for their operations; the collected data must then be curated
and stored appropriately.
• Example: An application monitoring the sleep patterns of dementia
patients requires data from various sensors, including sleep data,
heart rate, electro-dermal activities and user activity patterns.
• Different data points are necessary to accurately diagnose the
mental state of the person.
• Mandatory requirements for the application include specific types of
sensor data.
• Categorization of data, such as numerical or categorical, is crucial.
• Consideration of the format for storage and dissemination is also
essential.

Introduction
EDA – Stages:
2. Data Collection:
• Collected data from various sources needs to be stored in the
appropriate format.
• Transfer of data to the relevant IT personnel within the organization is
crucial.
• Data collection involves gathering information from different objects
and events using various sensors and storage tools.
3. Data Processing:
• Preprocessing is essential before actual analysis, involving tasks to
prepare the dataset.
• Common preprocessing tasks include:
– Correctly exporting the dataset.
– Placing data under the right tables.
– Structuring the data appropriately.
– Exporting the data in the correct format.
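The structuring and exporting tasks above can be sketched in Python with pandas. This is a minimal illustration, not a prescribed workflow: the record layout, the column names `sensor` and `value`, and the choice of CSV as the export format are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical raw records, as they might arrive from a collection step.
records = [
    {"sensor": "heart_rate", "value": 72},
    {"sensor": "sleep_hours", "value": 6.5},
]

# Structure the data: one table with a fixed, explicit column order.
table = pd.DataFrame(records, columns=["sensor", "value"])

# Export in a chosen format (CSV here), written to an in-memory buffer
# so the sketch does not touch the file system.
buffer = io.StringIO()
table.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing a file path to `to_csv` instead of the buffer would write the same content to disk.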
Introduction
EDA – Stages:
4. Data Cleaning:
• Preprocessed data is not ready for in-depth analysis and requires
transformation.
• Data cleaning involves tasks such as:
– Incompleteness check
– Duplicates check
– Error check
– Missing value check
• Responsibilities in data cleaning include:
– Matching correct records
– Finding inaccuracies in the dataset
– Understanding overall data quality
– Removing duplicate items
– Filling in missing values
• Data cleaning is dependent on the types of data being studied, and
understanding different types of datasets is crucial.
• Example: Using outlier detection methods for quantitative data cleaning.
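The checks listed above can be sketched with pandas. The readings below are made up for illustration, the column names are hypothetical, and both the median fill and the 1.5 × IQR outlier rule are common choices assumed here rather than methods prescribed by the course.

```python
import pandas as pd

# Hypothetical heart-rate readings; the values are invented for this sketch.
raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4, 5, 6],
    "heart_rate": [72.0, 80.0, 80.0, None, 76.0, 74.0, 190.0],
})

# Missing value check: count missing entries per column.
missing_per_column = raw.isna().sum()

# Duplicates check: drop rows that exactly repeat an earlier row.
deduplicated = raw.drop_duplicates()

# Fill in missing values, here with the column median (an assumed choice).
median_rate = deduplicated["heart_rate"].median()
filled = deduplicated.fillna({"heart_rate": median_rate})

# Outlier detection for quantitative data using the 1.5 * IQR rule.
q1 = filled["heart_rate"].quantile(0.25)
q3 = filled["heart_rate"].quantile(0.75)
iqr = q3 - q1
is_outlier = (
    (filled["heart_rate"] < q1 - 1.5 * iqr)
    | (filled["heart_rate"] > q3 + 1.5 * iqr)
)
outliers = filled[is_outlier]
cleaned = filled[~is_outlier]
```

Whether a flagged reading is an error or a genuine extreme value still depends on the type of data being studied, so flagged rows are kept in `outliers` for inspection rather than silently discarded.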

Introduction
EDA – Stages:
5. Modelling & Algorithms:
• Generalized models or mathematical formulas represent
relationships among different variables.
• Models involve one or more variables that depend on others to cause
an event.
• Example: Total price of pens (Total) = price for one pen (UnitPrice) *
number of pens bought (Quantity).
• Dependent Variable: The variable dependent on other variables.
• Independent Variable: Variables that cause an event and are not
dependent on others.
• A model describes the relationship between independent and
dependent variables.
• Inferential Statistics: Deals with quantifying relationships between
specific variables.
– Judd model: Data = Model + Error.
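The pen example and the Data = Model + Error relationship can be sketched as follows. The function name and the observed amount are hypothetical; only the formula Total = UnitPrice * Quantity comes from the slide.

```python
def total_price(unit_price: float, quantity: int) -> float:
    """Model: Total = UnitPrice * Quantity.

    Total is the dependent variable; UnitPrice and Quantity are the
    independent variables.
    """
    return unit_price * quantity


# Hypothetical observation: 12 pens at 2.50 each, but 29.00 was recorded.
observed_total = 29.0
predicted_total = total_price(2.5, 12)

# Data = Model + Error, so Error = Data - Model.
error = observed_total - predicted_total
```

Here the error term captures whatever the model does not explain, for example a discount or a recording mistake.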
Introduction
EDA – Stages:
6. Data Product:
• Definition: Any computer software using data as inputs,
producing outputs, and providing feedback.
• Basis: Generally based on a model developed during data
analysis.
• Example: Recommendation model using user purchase history
to recommend related items.
7. Communication:
• Stage Focus: Deals with disseminating results to end
stakeholders.
• Data Visualization: Techniques include tables, charts, summary
diagrams & bar charts.
Significance of EDA
• Data accumulates in various fields such as science, economics,
engineering, and marketing, and is primarily stored in electronic
databases.
• Decision-making based on collected data is challenging without
the aid of computer programs.
• Data mining is performed to extract insights and make informed
decisions.
• Exploratory Data Analysis (EDA) is a crucial initial step in data
mining.
– It aims to visualize data, understand its nature, and form hypotheses for
further analysis.
– EDA involves summarizing data, performing statistical analysis, and
utilizing visualization tools.
– Python offers specialized tools for EDA, including pandas for
summarization, scipy for statistical analysis, and matplotlib and plotly for
visualizations.
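A minimal sketch of the summarization step with pandas is shown here. The height and weight values are invented, and the column names are hypothetical. The correlation uses a pandas method to keep the example dependency-light; scipy and the plotting libraries named above are not used in this sketch.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 58.0, 69.0, 77.0, 88.0],
})

# Summarize each numeric column: count, mean, std, min, quartiles, max.
summary = df.describe()

# Pearson correlation between the two columns, as a first hypothesis check.
correlation = df["height_cm"].corr(df["weight_kg"])
```

A strong positive correlation like this one would typically be followed by a scatter plot to check that the relationship looks linear.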

Steps in EDA
Problem definition:
• Essential to define the business problem before extracting
insights from data.
• Serves as the driving force for executing a data analysis plan.
• Tasks include defining the main objective, deliverables, roles,
responsibilities, obtaining data status, defining timetable, and
performing cost/benefit analysis.
• Provides the basis for creating an execution plan.
Data preparation:
• Involves methods to prepare the dataset before actual analysis.
• Tasks include defining data sources, schemas, and tables,
understanding data characteristics, cleaning the dataset,
removing non-relevant data, transforming data, and dividing it
into required chunks for analysis.

Steps in EDA
Statistics and analysis of the data:
• Main tasks involve summarizing the data, finding hidden
correlations and relationships, developing predictive models,
evaluating models, and calculating accuracies.
• Techniques include summary tables, graphs, descriptive
statistics, inferential statistics, correlation statistics, searching,
grouping, and mathematical models.
Development and representation of the results:
• Involves presenting the dataset to the target audience using
graphs, summary tables, maps, and diagrams.
• Essential for making results interpretable by business
stakeholders.
• Graphical analysis techniques include scatter plots, character
plots, histograms, box plots, residual plots, mean plots, and
others.
Making Sense of Data
• Different disciplines store diverse types of data for specific
purposes.
• Examples include medical researchers storing patients' data,
universities managing students' and teachers' data, and real
estate industries maintaining datasets about houses and
buildings.
• A dataset consists of multiple observations related to a particular
object or entity.
• In a hospital's patient dataset, each patient is described by
variables such as patient ID, name, address, weight, date of
birth, email and gender.
• Variables are features that characterize an object, and each
observation has specific values for these variables.
• Understanding the types of data and variables is essential for
effective exploratory data analysis.
Making Sense of Data
• The dataset comprises five observations (001, 002, 003, 004, 005).
• Each observation is characterized by variables such as
PatientID, name, address, dob, email, gender & weight.
• Dataset can be categorized into: numerical & categorical.
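A dataset of this shape can be sketched in pandas to show how observations, variables, and the numerical/categorical split fit together. Every value below is a placeholder invented for the sketch; only the variable names and the five patient IDs come from the slide.

```python
import pandas as pd

# Five observations (rows), each described by seven variables (columns).
# All values are hypothetical placeholders.
patients = pd.DataFrame({
    "PatientID": ["001", "002", "003", "004", "005"],
    "name": ["Patient A", "Patient B", "Patient C", "Patient D", "Patient E"],
    "address": ["Addr 1", "Addr 2", "Addr 3", "Addr 4", "Addr 5"],
    "dob": ["1990-01-01", "1985-06-15", "1978-11-30", "2000-03-22", "1995-09-09"],
    "email": ["a@example.com", "b@example.com", "c@example.com",
              "d@example.com", "e@example.com"],
    "gender": ["Female", "Male", "Female", "Male", "Female"],
    "weight": [60.5, 72.0, 55.3, 80.1, 65.0],
})

# Split the variables by storage type as a first pass at categorization.
numerical = patients.select_dtypes(include="number").columns.tolist()
non_numerical = patients.select_dtypes(exclude="number").columns.tolist()
```

The type-based split is only a starting point: `gender` is categorical, while `name` and `email` are identifiers or free text that happen to be stored as strings.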

Making Sense of Data
Numerical Data:
• Numerical data (Quantitative) involves measurements, such as age,
height, weight, blood pressure, heart rate, temperature, number of
teeth, number of bones, and the number of family members.
• Can be discrete or continuous.
• Discrete Data:
– Countable data with finite and distinct values.
– Example: Number of heads in 200 coin flips.
– Variables representing discrete datasets are discrete variables.
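– Example: Rank variable in a classroom (1, 2, 3, ...).
– Note: a Country variable (e.g., Nepal, India, Norway, Japan) also takes
a fixed set of distinct values, but its values are labels, so it is
categorical rather than numerical.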
• Continuous Data:
– Variable can have an infinite number of numerical values within a specific
range.
– Examples: Temperature of a city, weight variable in the previous section.
– Variables describing continuous data are continuous variables.
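The difference between the two kinds of numerical data can be sketched with a short simulation. The seed, the fair-coin assumption, and the temperature range of 15 to 35 degrees are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Discrete: the number of heads in 200 coin flips is a count, so it can
# only take whole-number values from 0 to 200.
heads = sum(rng.random() < 0.5 for _ in range(200))

# Continuous: a temperature can take any real value within its range
# (approximated here by a floating-point number).
temperature_c = rng.uniform(15.0, 35.0)
```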

Making Sense of Data
Categorical Data:
• Represents characteristics of an object.
• Also known as qualitative datasets.
• Examples: Gender, marital status, type of address, movie genres,
blood type, types of drugs.
• Gender: Male, Female, Other, Unknown
• Marital Status: Annulled, Divorced, Interlocutory, Legally Separated,
Married, Polygamous, Never Married, Domestic Partner, Unmarried,
Widowed, Unknown
• Movie Genres: Action, Adventure, Comedy, Crime, Drama, Fantasy,
Historical, Horror, Mystery, Philosophical, Political, Romance, Saga,
Satire, Science Fiction, Social, Thriller, Urban, Western
• Blood Type: A, B, AB, O
• Types of Drugs: Stimulants, Depressants, Hallucinogens,
Dissociatives, Opioids, Inhalants, Cannabis

Making Sense of Data
Categorical Variables:
• Represented by variables describing categorical data.
• Limited number of values.
• Similar to enumerated types or enumerations in computer
science.
Types of Categorical Variables:
• Binary Categorical Variable:
– Takes exactly two values.
– Also known as dichotomous variable.
– Example: Experiment result (success or failure).
• Polytomous Variables:
– Categorical variables with more than two possible values.
– Example: Marital status with values like annulled, divorced, interlocutory,
legally separated, married, polygamous, never married, domestic
partner, unmarried, widowed, unknown.
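Since categorical variables resemble enumerated types, the distinction between binary and polytomous variables can be sketched with Python's `enum` module. The class names and the helper `kind_of_categorical` are hypothetical names introduced for this example; the category values come from the slides.

```python
from enum import Enum


class ExperimentResult(Enum):
    """Binary (dichotomous) variable: exactly two values."""
    SUCCESS = "success"
    FAILURE = "failure"


class BloodType(Enum):
    """Polytomous variable: more than two values."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"


def kind_of_categorical(variable: type[Enum]) -> str:
    """Classify a categorical variable by its number of possible values."""
    n_values = len(list(variable))
    if n_values == 2:
        return "binary"
    if n_values > 2:
        return "polytomous"
    raise ValueError("a categorical variable needs at least two values")


result_kind = kind_of_categorical(ExperimentResult)
blood_kind = kind_of_categorical(BloodType)
```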
Making Sense of Data
Measurement Scales

• Interval: Both the order and exact differences between the values
are significant. Interval scales are widely used in statistics, for
example, in the measures of central tendency (mean, median, mode)
and standard deviation.
• Ratio: Contains order, exact values, and an absolute zero.
Making Sense of Data
Comparing EDA with classical and Bayesian analysis

Making Sense of Data
Software Tools
• Python: Open source programming language widely used in
data analysis, data mining & data science - www.python.org
• R programming language: Open source programming
language widely utilized in statistical computation & graphical
data analysis - www.r-project.org
• Weka: Open source data mining package that includes several
EDA tools and algorithms - https://www.cs.waikato.ac.nz/ml/weka/
• KNIME: Open source tool for data analysis, based on
Eclipse - https://www.knime.com/

