Lesson 7 Data Description and Diagnostics

PREPROCESSING TECHNIQUES
Preprocessing techniques refer to a set of steps and
methods applied to raw data before it undergoes
analysis. The primary goal is to clean, organize, and
enhance the data, making it more suitable for accurate
and efficient analysis. This process is crucial because
real-world data is often messy, containing errors,
missing values, or unnecessary details that can hinder
the effectiveness of analytical tools.
1. DATA CLEANING
Data cleaning is the process of identifying and fixing
errors or inconsistencies in a dataset to ensure that it is
accurate, reliable, and ready for analysis. Imagine you
have a bunch of puzzle pieces, but some are missing,
others are damaged, and a few don't quite fit. Data
cleaning is like sorting through these pieces, fixing or
removing the problematic ones, and making sure
everything fits together smoothly.
Data Cleaning
Handling Missing Data: Sometimes, data is incomplete, and certain values are missing. Data cleaning involves figuring out how to fill in these missing pieces, either by estimating values or removing incomplete records.
Dealing with Noisy Data: Noisy data includes outliers or errors that don't reflect the actual trends in the dataset. Data cleaning identifies and corrects these anomalies, ensuring that they don't skew the analysis.
Real Life Example
Imagine you are managing a customer database for an e-commerce platform. In
this database, there's a column for customer ages, and upon closer inspection,
you notice a few entries that seem off. Some customers have ages recorded as
150 or even 999.

Identifying and Handling Outliers: You recognize these extreme age values as outliers that could distort any age-related analysis. Through data cleaning, you decide to either correct these entries by using a more realistic age or, if necessary, remove these outliers to maintain the accuracy of your customer age data.
Handling Missing Data: As you continue to review the age column, you also notice some entries marked as "NA" or left blank. In the data cleaning process, you decide on a strategy to fill in these missing values, perhaps by using the average age of customers or another suitable method.
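The age-column example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample values, the markers treated as missing, and the plausible range of 0 to 120 are all assumptions made for the sketch. Here out-of-range ages are treated like missing values and both are filled with the rounded average of the valid ages; removing those records instead would be an equally reasonable choice.

```python
from statistics import mean

# Hypothetical raw age column: valid ages mixed with outliers and missing markers.
raw_ages = [34, 150, 28, "NA", 45, None, 999, 52, ""]

MISSING_MARKERS = {None, "", "NA"}  # assumed encodings of "missing"
MIN_AGE, MAX_AGE = 0, 120           # assumed plausible range for a customer age


def clean_ages(values):
    """Replace missing and out-of-range ages with the mean of the valid ages."""
    parsed = []
    for v in values:
        if v in MISSING_MARKERS:
            parsed.append(None)           # missing entry
        else:
            age = int(v)
            # Out-of-range ages are treated as unknown rather than trusted.
            parsed.append(age if MIN_AGE <= age <= MAX_AGE else None)
    valid = [a for a in parsed if a is not None]
    fill = round(mean(valid))             # average of the valid ages
    return [fill if a is None else a for a in parsed]


cleaned = clean_ages(raw_ages)
print(cleaned)
```

With the sample data above, the valid ages are 34, 28, 45, and 52, so every missing or implausible entry is filled with 40.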
2. DATA INTEGRATION
Data integration is like bringing together pieces of
information from different sources to create a complete
and unified picture, much like combining various
ingredients to make a delicious recipe. It's the process
of merging data from different places to form a cohesive
and comprehensive dataset for better understanding
and analysis.
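As a rough sketch of what integration can look like in code, the Python snippet below merges records from two hypothetical sources into one record per customer. The source names, field names, and the shared `customer_id` key are assumptions made for illustration; real integration work usually also has to reconcile conflicting formats and duplicate entries.

```python
# Hypothetical records from two separate systems, both keyed by customer_id.
crm_records = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
]
order_records = [
    {"customer_id": 1, "total_spent": 120.0},
    {"customer_id": 3, "total_spent": 75.5},
]


def integrate(*sources, key="customer_id"):
    """Merge records from several sources into one record per key value."""
    merged = {}
    for source in sources:
        for record in source:
            # Create the unified record on first sight, then add the new fields.
            merged.setdefault(record[key], {}).update(record)
    return [merged[k] for k in sorted(merged)]


unified = integrate(crm_records, order_records)
for row in unified:
    print(row)
```

Customers that appear in only one source are kept, with the fields from the other source simply absent. Those gaps would then be handled in the data cleaning step.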
3. DATA TRANSFORMATION
Data transformation is like giving your data
a makeover to make it more
understandable and ready for analysis. It
involves converting and organizing data in
a way that makes it easier to work with and
reveals valuable insights.
DATA TRANSFORMATION
1. Normalization. Think of normalization as putting different things on the same scale. If you have data with varying ranges, normalization transforms them into a common scale, making comparisons and analysis more straightforward. In practice this usually means rescaling numeric values into a common range, such as 0 to 1. As a loose analogy, it's like converting temperatures to either all Celsius or all Fahrenheit for consistency.
2. Concept Hierarchy Generation. Imagine you have data about products, and some are categorized differently, like "Electronics" and "Gadgets." Data transformation creates a concept hierarchy, organizing these categories into a structured framework. It's like creating a clear hierarchy that groups related concepts together, making the data more organized and meaningful.
3. Smoothing. Sometimes data can be a bit noisy, with lots of ups and downs. Smoothing is like applying a filter to remove unnecessary fluctuations, helping to identify trends more easily. It's like smoothing out a bumpy road to reveal the underlying path.
4. Aggregation. Aggregation involves combining detailed data into summary information. For instance, if you have daily sales data, you might aggregate it into monthly totals. It's like summarizing the fine details into more manageable and informative chunks.
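Three of the transformations above can be sketched with plain Python. The snippet is illustrative only: min-max rescaling is one common form of normalization, a simple moving average is one common form of smoothing, and the function names and sample data are assumptions, not part of the lesson.

```python
from collections import defaultdict


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def monthly_totals(daily_sales):
    """Aggregate (ISO date string, amount) pairs into totals per 'YYYY-MM' month."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount    # 'YYYY-MM-DD'[:7] gives 'YYYY-MM'
    return dict(totals)


# Hypothetical data for illustration.
prices = [10, 20, 30, 50]
noisy = [1, 5, 3, 7, 5]
sales = [("2024-01-30", 100.0), ("2024-01-31", 50.0), ("2024-02-01", 70.0)]

normalized = min_max_normalize(prices)
smoothed = moving_average(noisy)
by_month = monthly_totals(sales)
print(normalized, smoothed, by_month)
```

Note that the moving average returns a shorter series than its input, because the first full window only exists once `window` values are available.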
4. DATA REDUCTION
Data reduction is like simplifying a large
and complex puzzle by focusing on the
most important pieces. It involves
techniques to streamline and condense
data while retaining its essential
information. Let's break it down:
DATA REDUCTION
1. Dimensionality Reduction. Imagine you have a puzzle with many pieces, and each piece represents a different aspect of your data. Dimensionality reduction is like combining or eliminating some pieces, reducing the overall complexity. It's like focusing on the main shapes and colors that capture the essence of the puzzle without getting lost in unnecessary details.
2. Concept Hierarchy Generation. Imagine you have data about products, and some are categorized differently, like "Electronics" and "Gadgets." Generating a concept hierarchy organizes these categories into a structured framework. It's like creating a clear hierarchy that groups related concepts together, making the data more organized and meaningful. Replacing detailed categories with their broader parent categories also leaves fewer distinct values to work with.
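Both ideas can be illustrated with a small Python sketch. It rests on assumptions: dropping features that never vary is only one very simple way of "eliminating some pieces" (other dimensionality reduction methods combine features instead), and the feature names, the category mapping, and the function names are hypothetical.

```python
from collections import Counter
from statistics import pvariance

# Hypothetical dataset: each key is a feature (column), each list its values.
features = {
    "price": [10.0, 250.0, 40.0, 999.0],
    "in_stock": [1, 1, 1, 1],  # constant, so it cannot distinguish any records
    "rating": [4.5, 3.0, 5.0, 2.5],
}

# Hypothetical concept hierarchy: detailed category -> broader parent category.
hierarchy = {"Gadgets": "Electronics", "Phones": "Electronics", "Novels": "Books"}


def drop_low_variance(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {
        name: vals for name, vals in columns.items()
        if pvariance(vals) > threshold
    }


def roll_up(categories, mapping):
    """Replace each category with its parent; unmapped categories stay as-is."""
    return [mapping.get(c, c) for c in categories]


reduced = drop_low_variance(features)
rolled = roll_up(["Gadgets", "Electronics", "Phones", "Novels"], hierarchy)
counts = Counter(rolled)
print(sorted(reduced), dict(counts))
```

In this sample, the constant `in_stock` column is dropped, and four detailed product categories collapse into two broader ones.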
Questions?
Group Activity. Pick a large company and
research how it applied
preprocessing techniques.
Instructions:
Company Selection:
Pick a well-known and sizable company for your research. Consider companies from various
industries such as technology, retail, finance, or healthcare.
Dataset Identification:
Research and identify a dataset or data-related challenge the chosen company faced. Look for
information on their website, annual reports, or reputable business articles.
Preprocessing Techniques:
Investigate how the company applied preprocessing techniques to clean, integrate, transform, or
reduce their data. Explore the specific challenges they aimed to address and the impact on
their analytical processes.
Success Stories:
Look for success stories or case studies related to the company's data preprocessing efforts.
Identify any significant improvements in decision-making, efficiency, or overall business
performance attributed to these techniques.
Presentation Preparation:
Prepare a brief presentation summarizing your findings. Include details on the chosen company,
the dataset, preprocessing techniques applied, and the outcomes or benefits achieved.
Presentation Delivery:
Present your research to the class, sharing insights into how the selected company strategically
employed preprocessing techniques to enhance their data quality and analysis.
