
DATA MUNGING

By: AATHITYAN MARIRAJ (2021115002)
WHAT IS DATA MUNGING?
SIMPLE DEFINITION
➢ Data munging is like tidying up a messy room so you can find things easily. It's about
taking raw data that's all jumbled and making it neat and organized. This makes it easier to
understand and use for things like studying trends or making predictions.

COMPLEX DEFINITION
➢ Data munging, sometimes called data wrangling or data cleaning, is the process of converting
and mapping raw data into a different format to improve its suitability and value for
downstream uses such as analytics. It prepares raw data for analysis by cleaning,
organizing, and enriching it into a readable format.
IMPORTANCE OF DATA MUNGING

➢ Accuracy and Precision: Data munging addresses discrepancies and errors in raw data,
leading to more accurate and precise analyses. Cleaning and organizing data ensure that
the insights derived are trustworthy and dependable.

➢ Data Suitability: Raw data is often collected from diverse sources in various formats.
Data munging transforms this unprocessed data into a more suitable format for analysis,
making it easier to work with and extract meaningful insights.

➢ Enhanced Readability: By standardizing formats, resolving inconsistencies, and handling
missing values, data munging improves the overall readability of the dataset. Analysts can
navigate and interpret the data more efficiently.
➢ Outlier Management: Data munging addresses outliers by considering their relevance to
the analysis. Deciding whether to modify or remove outliers ensures that anomalous data
points do not unduly influence the results.

➢ Categorical Data Transformation: Handling categorical data through methods like
one-hot encoding ensures that these variables are appropriately transformed for analysis.
This is vital for including categorical information in models.

➢ Data Integration: In scenarios where multiple datasets are involved, data integration
ensures compatibility and coherence. Merging datasets harmoniously contributes to a more
comprehensive and holistic analysis.

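
Outlier management and one-hot encoding, two of the points above, can be sketched as follows. This is a minimal illustration assuming pandas; the table, its column names, and the 1.5 * IQR rule are choices made here for the example, not something the slides specify.

```python
import pandas as pd

# Hypothetical data: cities and incomes are invented for illustration.
df = pd.DataFrame({
    "city": ["Chennai", "Delhi", "Chennai", "Mumbai", "Delhi"],
    "income": [42_000, 45_000, 47_000, 44_000, 900_000],  # last value is extreme
})

# Outlier management: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
trimmed = df[~is_outlier]  # here we drop outliers; capping them is another option

# Categorical transformation: one indicator column per city.
encoded = pd.get_dummies(trimmed, columns=["city"])
print(encoded)
```

Whether to drop, cap, or keep a flagged value depends on the analysis, so the removal step is only one possible decision.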
ESSENTIAL STEPS IN DATA MUNGING

1) Data Discovery:
(i) Defining the purpose and goals of data analysis.
(ii) Identifying potential uses and requirements of data.
(iii) Focusing on business requirements rather than technical specifications.
2) Data Structuring:
(i) Structuring raw data to make it machine-readable.
(ii) Organizing data into a well-defined schema with a consistent layout (rows and columns).
(iii) Extracting data from various sources and organizing it into a formatted repository.
3) Data Cleansing:
(i) Addressing data quality issues such as missing values and duplicate records.
(ii) Detecting and correcting erroneous data to avoid information gaps.
(iii) Applying transformations (e.g., removing, replacing, find-and-replace)
to eliminate redundant text and null values.
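
The cleansing step can be sketched as below, assuming pandas. The table is hypothetical, and filling missing marks with the column mean is just one possible replacement strategy.

```python
import pandas as pd

# Hypothetical raw table with a missing name, a duplicate row,
# inconsistent text, and a missing numeric value.
raw = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena", None],
    "dept": ["CSE", "cse ", "cse ", "ECE", "ECE"],
    "marks": [81.0, 68.0, 68.0, None, 66.0],
})

clean = (
    raw.dropna(subset=["name"])   # remove rows that have no name
       .drop_duplicates()         # remove exact duplicate rows
       .assign(
           # replace inconsistent text with a normalized form
           dept=lambda d: d["dept"].str.strip().str.upper(),
           # replace null marks with the mean of the remaining marks
           marks=lambda d: d["marks"].fillna(d["marks"].mean()),
       )
       .reset_index(drop=True)
)
print(clean)
```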

4) Data Enrichment:
(i) Appending one or multiple datasets from different sources to generate a holistic view of information.
(ii) Aggregating multiple data sources to make data more useful for reporting and analytics.
Example: Matching an order ID against a different database to obtain further details like account name,
account balance, buying history, etc.
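
The order-ID example can be sketched as a left join, assuming pandas. Both tables and all column names are hypothetical stand-ins for the "different database" mentioned above.

```python
import pandas as pd

# Hypothetical order table and account lookup table.
orders = pd.DataFrame({"order_id": [101, 102, 103], "amount": [250.0, 90.0, 40.0]})
accounts = pd.DataFrame({
    "order_id": [101, 102],
    "account_name": ["Asha", "Ravi"],
    "account_balance": [1200.0, 310.0],
})

# A left join keeps every order; orders with no match get missing values.
# validate="one_to_one" raises an error if an order_id repeats in either table.
enriched = orders.merge(accounts, on="order_id", how="left", validate="one_to_one")
print(enriched)
```

A left join is used so that enrichment never silently drops an order that has no matching account.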

5) Data Validation:
(i) Validating the accuracy, completeness, and reliability of data.
(ii) Performing a final check to ensure output information is accurate and reliable.
(iii) Rejecting data that does not comply with pre-defined rules or constraints.
Types of validation checks include consistency checks, data-type validation, and range and
constraint validation.
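
The three kinds of check can be sketched as follows, assuming pandas. The table, the 0 to 120 age range, and the rule that an end date must not precede its start date are hypothetical rules chosen for illustration.

```python
import pandas as pd

# Hypothetical records to validate.
records = pd.DataFrame({
    "age": [25, 41, -3, 130],
    "start": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"]),
    "end": pd.to_datetime(["2024-01-10", "2024-01-15", "2024-03-05", "2024-04-02"]),
})

type_ok = pd.api.types.is_integer_dtype(records["age"])  # data-type validation
in_range = records["age"].between(0, 120)                # range and constraint validation
consistent = records["end"] >= records["start"]          # consistency check

# Rows passing every row-level rule are kept; the rest are rejected.
valid = records[in_range & consistent]
rejected = records[~(in_range & consistent)]
print(f"type ok: {type_ok}, valid rows: {len(valid)}, rejected rows: {len(rejected)}")
```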
IMPLEMENTATION WITH EXAMPLE
In this part we explore the following functionality:

(i) Data exploration
(ii) Data filtering
(iii) Data wrangling using the merge operation
(iv) Data wrangling using group by
(v) Removing duplicates
(vi) Concatenating two datasets
DATA EXPLORATION
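
A minimal sketch of what data exploration can look like, assuming pandas. The student table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical student table with two missing entries.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena", "Kiran", "Divya"],
    "Age": [17, 17, 18, 17, 18],
    "Gender": ["F", "M", "F", "M", None],
    "Marks": [90.0, 76.0, None, 74.0, 65.0],
})

print(df.head())        # first rows, to see what the data looks like
df.info()               # column names, dtypes, and non-null counts
print(df.describe())    # summary statistics for the numeric columns

# Count of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```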
DATA FILTERING
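
A minimal sketch of row filtering, assuming pandas. The table and the thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical student table.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena", "Kiran"],
    "Age": [17, 18, 17, 19],
    "Marks": [90, 55, 74, 65],
})

# Boolean mask: keep rows where the condition is True.
passed = df[df["Marks"] >= 60]

# Conditions combine with & (and) and | (or); each needs its own parentheses.
young_high = df[(df["Age"] < 18) & (df["Marks"] > 80)]

# The same filter as `passed`, written as a query string.
same_as_passed = df.query("Marks >= 60")
print(passed)
```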
DATA WRANGLING USING MERGE FUNCTION

AFTER MERGING
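
A minimal sketch of two tables before merging and the result after, assuming pandas. The tables, the shared "ID" key, and the column names are hypothetical.

```python
import pandas as pd

# Two hypothetical tables that share an "ID" column.
details = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "Name": ["Asha", "Ravi", "Meena", "Kiran"],
})
fees = pd.DataFrame({"ID": [101, 102, 105], "Fee": [5000, 4500, 4000]})

# Inner merge (the default): only IDs present in both tables survive.
inner = pd.merge(details, fees, on="ID")

# Outer merge: every ID from either table; indicator=True adds a "_merge"
# column recording which table each row came from.
outer = pd.merge(details, fees, on="ID", how="outer", indicator=True)

print(inner)
print(outer)
```

The choice of `how` decides which unmatched rows are kept, so it is worth checking row counts before and after a merge.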
DATA WRANGLING USING GROUP BY
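
A minimal sketch of grouping and aggregating, assuming pandas. The sales table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales table.
sales = pd.DataFrame({
    "brand": ["A", "B", "A", "C", "B"],
    "year": [2022, 2022, 2023, 2023, 2023],
    "units": [10, 4, 7, 3, 6],
})

# Total units per brand.
units_by_brand = sales.groupby("brand")["units"].sum()

# Several aggregates per year, using named aggregation.
summary = sales.groupby("year").agg(
    total_units=("units", "sum"),
    n_brands=("brand", "nunique"),
)

print(units_by_brand)
print(summary)
```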
REMOVING DUPLICATES
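
A minimal sketch of finding and removing duplicate rows, assuming pandas. The table is hypothetical.

```python
import pandas as pd

# Hypothetical table: one exact duplicate row, plus a name that
# appears twice with different marks.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Asha", "Meena", "Ravi"],
    "Marks": [90, 55, 90, 74, 60],
})

# Count rows that exactly repeat an earlier row.
n_exact = df.duplicated().sum()

# Drop exact duplicates, keeping the first occurrence.
deduped = df.drop_duplicates()

# Treat rows as duplicates when "Name" matches, keeping the last entry.
latest_per_name = df.drop_duplicates(subset="Name", keep="last")

print(deduped)
print(latest_per_name)
```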
DATA CONCATENATING
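
A minimal sketch of concatenating two datasets, assuming pandas. The tables and their column names are hypothetical.

```python
import pandas as pd

# Two hypothetical batches with the same columns.
batch1 = pd.DataFrame({"ID": [1, 2], "Name": ["Asha", "Ravi"]})
batch2 = pd.DataFrame({"ID": [3, 4], "Name": ["Meena", "Kiran"]})

# Stack rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([batch1, batch2], ignore_index=True)

# Place tables side by side; rows are aligned on the index.
marks = pd.DataFrame({"Marks": [90, 55]})
side_by_side = pd.concat([batch1, marks], axis=1)

print(stacked)
print(side_by_side)
```

Unlike a merge, concatenation does not match rows on a key column: it appends along an axis and aligns on the index.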
BENEFITS OF DATA MUNGING

Data munging, also known as data wrangling, is the process of cleaning, structuring, and
enriching raw data into a usable format for analysis. The benefits include:

1. Improved Data Quality: Munging helps identify and rectify errors, inconsistencies, and
missing values, enhancing the overall quality of the data.

2. Enhanced Analysis: By organizing and restructuring data, analysts can perform more
accurate and insightful analyses, leading to better decision-making.

3. Efficient Data Handling: Munging streamlines data processing, making it easier to work
with large datasets efficiently.

4. Compatibility: It ensures data compatibility across different systems and formats,
facilitating integration and interoperability.

5. Increased Productivity: Automating repetitive data cleaning tasks frees up time for
analysts to focus on higher-value activities, increasing productivity.

Overall, data munging is essential for extracting meaningful insights from raw data and
maximizing its potential for decision-making and analysis.
APPLICATIONS OF DATA MUNGING
Data munging, or data wrangling, finds applications across various industries and domains. Some
common applications include:

1. Business Intelligence (BI): Data munging is used to clean and prepare data for BI tools, enabling
organizations to extract actionable insights from large datasets to make informed business decisions.

2. Data Analysis and Reporting: Munging is essential for preparing data for analysis and generating
reports in fields such as finance, marketing, and healthcare, enabling stakeholders to monitor
performance and trends.

3. Machine Learning and Data Mining: Data munging is a crucial preprocessing step in machine
learning and data mining tasks, where raw data needs to be cleaned, transformed, and structured before
training models or extracting patterns and trends.

4. Natural Language Processing (NLP): In NLP applications, such as sentiment analysis or text
classification, data munging involves cleaning and preprocessing text data through stopword removal,
tokenization, and normalization to improve the performance of NLP models.

5. Internet of Things (IoT): Data munging is used to preprocess and clean sensor data collected from IoT
devices before analysis, enabling organizations to derive insights and optimize operations in various
domains, including manufacturing, agriculture, and smart cities.
