MODULE 5


Reading data from XML files

XML stands for Extensible Markup Language. It's needed for keeping track of the tiny to
medium amount of knowledge. It allows programmers to develop their own applications to read
data from other applications. The method of reading the information from an XML file and
further analyzing its logical structure is known as Parsing. Therefore, reading an XML file is
that the same as parsing the XML document.

To read an XML file, we first import the ElementTree class found inside the xml library.
Then, we pass the filename of the XML file to the ElementTree.parse() method to start
parsing. Finally, we get the parent (root) tag of the XML file using getroot().
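
A minimal sketch of these steps, assuming a hypothetical file items.xml whose root element contains a few child elements:

import xml.etree.ElementTree as ET   # ElementTree from the xml library

tree = ET.parse('items.xml')         # parse the XML file
root = tree.getroot()                # get the parent (root) tag

print(root.tag)                      # name of the root element
for child in root:                   # iterate over the direct children of the root
    print(child.tag, child.attrib)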

Microsoft excel files


You must have worked with Excel at some point and felt the need to automate some
repetitive or tedious task. Don’t worry; in this tutorial we are going to learn how to work
with Excel using Python, that is, how to automate Excel using Python. We will be covering
this with the help of the Openpyxl module and will also see how to get Python in Excel.

Getting Started with Python Openpyxl

Openpyxl is a Python library that provides various methods to interact with Excel files using
Python. It allows operations like reading, writing, arithmetic operations, plotting graphs, etc.
This module does not come built-in with Python, so it has to be installed separately (for
example, with pip install openpyxl).

To read an Excel file, you have to open the spreadsheet using the load_workbook() method.
After that, you can use the active attribute to select the first available sheet and the cell()
method to select a cell by passing the row and column parameters. The value attribute gives
the value of that particular cell. See the example below to get a better understanding.

Example:
In this example, a Python program uses the openpyxl module to read an Excel file (“gfg.xlsx”),
opens the workbook, and retrieves the value of the cell in the first row and first column, printing
it to the console.
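
A minimal sketch of this example, assuming the file gfg.xlsx exists in the working directory:

from openpyxl import load_workbook

wb = load_workbook('gfg.xlsx')        # open the spreadsheet
sheet = wb.active                     # select the first (active) sheet
cell = sheet.cell(row=1, column=1)    # cell in the first row, first column
print(cell.value)                     # print the value of that cell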

JSON DATA

JSON (JavaScript Object Notation) is a format for structuring data. It is mainly
used for storing and transferring data between the browser and the server. Python supports
JSON with a built-in package called json. This package provides all the necessary tools for
working with JSON objects, including parsing, serializing, deserializing, and more.

Let’s see a simple example where we convert the JSON objects to Python objects and
vice versa

Convert from JSON to Python object

Here, the json.loads() method can be used to parse a valid JSON string and convert it into a
Python dictionary.
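A minimal sketch, assuming a small hand-written JSON string (the keys and values are purely illustrative):

import json

json_string = '{"name": "Geek", "age": 21, "languages": ["Python", "C"]}'

data = json.loads(json_string)        # JSON string -> Python dictionary
print(type(data))                     # <class 'dict'>
print(data["name"])                   # Geek

back_to_json = json.dumps(data)       # Python dictionary -> JSON string (vice versa)
print(back_to_json)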

PICKLE PYTHON OBJECT SERIALIZATION

In Python, we sometimes need to save the object on the disk for later use. This can be
done by using Python pickle. In this article, we will learn about pickles in Python along with a
few examples.

Python Pickle — Python object serialization

The Python pickle module is used for serializing and de-serializing a Python object structure.
Any object in Python can be pickled so that it can be saved on disk. What pickle does is
“serialize” the object first before writing it to a file. Pickling is a way to convert a Python
object (list, dictionary, etc.) into a character stream. The idea is that this character stream
contains all the information necessary to reconstruct the object in another Python script. It
provides a facility to convert any Python object to a byte stream. This byte stream contains all
essential information about the object so that it can be reconstructed, or “unpickled”, back
into its original form in any Python script.
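
A minimal sketch of pickling and unpickling, assuming a small hypothetical dictionary and an output file named data.pkl:

import pickle

record = {"id": 1, "name": "example", "marks": [88, 92, 79]}

with open('data.pkl', 'wb') as f:     # write the byte stream to disk
    pickle.dump(record, f)

with open('data.pkl', 'rb') as f:     # read it back ("unpickle")
    restored = pickle.load(f)

print(restored)                       # the reconstructed dictionary
print(restored == record)             # True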
Advantages of Using Pickle in Python

1. Recursive objects (objects containing references to themselves): Pickle keeps track of the
objects it has already serialized, so later references to the same object won’t be serialized
again. (The marshal module breaks for this.)
2. Object sharing (references to the same object in different places): This is similar to self-
referencing objects. Pickle stores the object once, and ensures that all other references
point to the master copy. Shared objects remain shared, which can be very important for
mutable objects.
3. User-defined classes and their instances: Marshal does not support these at all, but Pickle
can save and restore class instances transparently. The class definition must be
importable and live in the same module as when the object was stored.

Disadvantages of Using Pickle in Python

1. Python Version Dependency: Pickled data is sensitive to the version of Python that
   produced it. An object pickled with one version of Python might not unpickle correctly
   with a different version.
2. Non-Readable: The pickle format is binary and not easily readable or editable by
   humans. Content stored in JSON or XML format, by contrast, can be easily inspected
   and modified.
3. Large data inefficiency: Pickling and unpickling large datasets can be slow. Other
   serialization formats might be more appropriate for such use cases.

DATA MANIPULATION IN PYTHON USING PANDAS LIBRARY

Data manipulation with Python is the process of organizing data in the Python programming
language so that reading or interpreting the insights from the data becomes more structured
and better designed.
Data manipulation refers to the process of adjusting data to make it organised and easier
to read. Data manipulation language, or DML, is a programming language that adjusts data by
inserting, deleting and modifying data in a database such as to cleanse or map the data.

In simple terms, data manipulation is the moving around and preparing of data before
any analysis takes place. The three different types of data manipulation are manual,
semi-automated, and fully automated.
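
A minimal sketch of basic data manipulation with pandas, using a small hypothetical DataFrame (the column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [78, 85, 92],
})

df["grade"] = ["B", "A", "A"]                 # insert a new column
df.loc[df["name"] == "Ravi", "marks"] = 88    # modify a value
df = df.drop(columns=["grade"])               # delete a column
df = df.sort_values("marks")                  # reorganize rows for easier reading

print(df)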

Data analysis can be organized as a five-step framework. The five steps are:

1) Identify business questions

2) Collect and store data

3) Clean and prepare data

4) Analyze data

5) Visualize and communicate data.

Data preparation

Raw data may or may not contain errors and inconsistencies. Hence, drawing actionable
insights is not straightforward. We have to prepare the data to rescue us from the pitfalls of
incomplete, inaccurate, and unstructured data. In this article, we are going to understand data
preparation, the process, and the challenges faced during this process.

What is Data Preparation?

Data preparation is the process of making raw data ready for further processing and
analysis. The key steps are to collect, clean, and label raw data in a format suitable for
machine learning (ML) algorithms, followed by data exploration and visualization. The process
of cleaning and combining raw data before using it for machine learning and business analysis is
known as data preparation, or sometimes “pre-processing.” While it may not be the most
attractive of duties, careful data preparation is essential to the success of data analytics.
Drawing clear and important insights from raw data requires careful validation, cleaning, and
enrichment. Any business analysis or model created will only be as strong and valid as the
initial data preparation.

Why Is Data Preparation Important?

Data preparation acts as the foundation for successful machine learning projects as:

1. Improves Data Quality: Raw data often contains inconsistencies, missing values,
errors, and irrelevant information. Data preparation techniques like cleaning, imputation,
and normalization address these issues, resulting in a cleaner and more consistent
dataset. This, in turn, prevents these issues from biasing or hindering the learning
process of your models.
2. Enhances Model Performance: Machine learning algorithms rely heavily on the
quality of the data they are trained on. By preparing your data effectively, you provide
the algorithms with a clear and well-structured foundation for learning patterns and
relationships. This leads to models that are better able to generalize and make accurate
predictions on unseen data.

3. Saves Time and Resources: Investing time upfront in data preparation can significantly
save time and resources down the line. By addressing data quality issues early on, you
avoid encountering problems later in the modeling process that might require re-work or
troubleshooting. This translates to a more efficient and streamlined machine learning
workflow.

4. Facilitates Feature Engineering: Data preparation often involves feature engineering,
   which is the process of creating new features from existing ones. These new features can
   be more informative and relevant to the task at hand, ultimately improving the model’s
   ability to learn and make predictions.

Challenges in Data Preparation

Now, we have already understood that data preparation is a critical stage in the analytics
process, yet it is fraught with numerous challenges like:

1. Lack of or insufficient data profiling:

 Leads to mistakes, errors, and difficulties in data preparation.

 Contributes to poor analytics findings.

 May result in missing or incomplete data.

2. Incomplete data:

 Missing values and other issues that must be addressed from the start.

 Can lead to inaccurate analysis if not handled properly.

3. Invalid values:

 Caused by spelling problems, typos, or incorrect number input.

 Must be identified and corrected early on for analytical accuracy.

4. Lack of standardization in data sets:

 Name and address standardization is essential when combining data sets.

 Different formats and systems may impact how information is received.

5. Inconsistencies between enterprise systems:

 Arise due to differences in terminology, special identifiers, and other factors.


 Make data preparation difficult and may lead to errors in analysis.

6. Data enrichment challenges:

 Determining what additional information to add requires excellent skills and business
analytics knowledge.

7. Setting up, maintaining, and improving data preparation processes:

 Necessary to standardize processes and ensure they can be utilized repeatedly.

 Requires ongoing effort to optimize efficiency and effectiveness.

Data Preparation Process

There are a few important steps in the data preparation process, and each one is essential
to making sure the data is prepared for analysis or other processing. The following are the key
stages related to data preparation:

Step 1: Describe Purpose and Requirements

Identifying the goals and requirements for the data analysis project is the first step in the
data preparation process. Consider the following:

 What is the goal of the data analysis project and how big is it?

 Which major inquiries or ideas are you planning to investigate or evaluate using the
data?

 Who are the target audience and end-users for the data analysis findings? What positions
and duties do they have?

 Which formats, types, and sources of data do you need to access and analyze?

 What requirements do you have for the data in terms of quality, accuracy, completeness,
timeliness, and relevance?

 What are the limitations and ethical, legal, and regulatory issues that you must take into
account?

Answering these questions makes it simpler to define the data analysis project’s goals,
parameters, and requirements, as well as to highlight any challenges, risks, or opportunities
that can develop.

Step 2: Data Collection

Data is collected from a variety of sources, including files, databases, websites,
and social media, to conduct a thorough analysis and to ensure the use of reliable and high-
quality data. Suitable tools and methods, such as database queries, APIs, and web scraping,
are used to obtain and analyze data from these sources.
Step 3: Combining and Integrating Data

Data integration requires combining data from multiple sources or dimensions in order to
create a full, logical dataset. Data integration solutions provide a wide range of operations,
including merge, join, union, and difference, as well as support for a variety of data
schemas and architecture types.

To properly combine and integrate data, it is essential to store and arrange information in
a common standard format, such as CSV, JSON, or XML, for easy access and uniform
comprehension. Organizing data management and storage using solutions such as cloud storage,
data warehouses, or data lakes improves governance, maintains consistency, and speeds up
access to data on a single platform.

Audits, backups, recovery, verification, and encryption are all examples of strong
security procedures that can be used to ensure reliable data management. Privacy measures
protect data during transmission and storage, whereas authorization and authentication
control who can access it.
Step 4: Data Profiling

Data profiling is a systematic method for assessing and analyzing a dataset, checking its
quality, structure, and content, and improving accuracy within an organizational context. Data
profiling identifies consistency issues, differences, and null values by analyzing source data,
looking for errors and inconsistencies, and understanding file structure, content, and
relationships. It helps to evaluate elements including completeness, accuracy, consistency,
validity, and timeliness.

Step 5: Data Exploring

Data exploration is the process of getting familiar with data: identifying patterns, trends,
outliers, and errors in order to better understand it and evaluate the possibilities for analysis.
To evaluate the data, identify data types, formats, and structures, and calculate descriptive
statistics such as mean, median, mode, and variance for each numerical variable.
Visualizations such as histograms, boxplots, and scatterplots can provide an understanding of
data distribution, while techniques such as classification can reveal hidden patterns and show
exceptions.
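
A minimal sketch of this kind of exploration with pandas, using a hypothetical numeric column:

import pandas as pd

df = pd.DataFrame({"sales": [120, 135, 150, 110, 980, 140]})

print(df.dtypes)                  # identify data types
print(df["sales"].describe())     # count, mean, std, min, quartiles, max
print(df["sales"].median())       # median
print(df["sales"].mode())         # mode
print(df["sales"].var())          # variance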

Step 6: Data Transformations and Enrichment

Data enrichment is the process of improving a dataset by adding new features or
columns, enhancing its accuracy and reliability, and verifying it against third-party sources.

 The technique involves combining various data sources like CRM, financial, and
marketing to create a comprehensive dataset, incorporating third-party data like
demographics for enhanced insights.

 The process involves categorizing data into groups like customers or products based on
shared attributes, using standard variables like age and gender to describe these entities.
 Engineer new features or fields by utilizing existing data, such as calculating customer
age based on their birthdate. Estimate missing values from available data, such as absent
sales figures, by referencing historical trends.

 The task involves identifying entities like names and addresses within unstructured text
data, thereby extracting actionable information from text without a fixed structure.

 The process involves assigning specific categories to unstructured text data, such as
product descriptions or customer feedback, to facilitate analysis and gain valuable
insights.

 Utilize various techniques like geocoding, sentiment analysis, entity recognition, and
topic modeling to enrich your data with additional information or context.

 To enable analysis and generate important insights, unstructured text data is classified
into different groups, such as product descriptions or consumer feedback.

Use cleaning procedures to remove or correct flaws or inconsistencies in your data, such
as duplicates, outliers, missing values, typos, and formatting problems. Validation
techniques such as checksums, rules, constraints, and tests are used to ensure that data is
correct and complete.
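
A minimal sketch of such cleaning steps with pandas, using a small hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi", "Meena"],
    "sales": [250.0, 250.0, None, 18000.0],
})

df = df.drop_duplicates()                                  # remove duplicate rows
df["sales"] = df["sales"].fillna(df["sales"].median())     # impute missing values
df = df[df["sales"] < 10000]                               # drop an obvious outlier

print(df)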

Step 7: Data Validation

Data validation is crucial for ensuring data accuracy, completeness, and consistency, as it
checks data against predefined rules and criteria that align with your requirements, standards,
and regulations.

 Analyze the data to better understand its properties, such as data types, ranges, and
distributions. Identify any potential issues, such as missing values, exceptions, or errors.

 Choose a representative sample of the dataset for validation. This technique is useful for
larger datasets because it minimizes processing effort.

 Apply planned validation rules to the collected data. Rules may contain format checks,
range validations, or cross-field validations.

 Identify records that do not fulfill the validation standards. Keep track of any flaws or
discrepancies for future analysis.

 Correct identified mistakes by cleaning, converting, or entering data as needed.
Maintaining an audit record of modifications made during this procedure is critical.

 Automate data validation activities as much as feasible to ensure consistent and ongoing
data quality maintenance.
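
A minimal sketch of rule-based validation with pandas; the format and range rules and column names below are hypothetical examples:

import pandas as pd

df = pd.DataFrame({
    "email": ["asha@example.com", "not-an-email", "ravi@example.com"],
    "age":   [29, 240, 35],
})

format_ok = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)  # format check
range_ok = df["age"].between(0, 120)                                             # range validation

invalid = df[~(format_ok & range_ok)]     # records that do not fulfill the rules
print(invalid)                            # keep these for correction and auditing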

Discretization binning
“Data discretization, also known as quantization or binning, is the process of converting
a continuous variable into a categorical or discrete variable by dividing the entire range of the
variable into a set of intervals or bins.”

Why Learn Data Discretization?

Discretization is important for several reasons:

 Reduces the noise in continuous data and improves the accuracy of the machine
learning model.

 Handles missing values better than continuous data.

 Handles irrelevant or redundant features better than continuous data.

Methods of Data Discretization

There are several methods for discretizing data, including:

1. Equal Width Binning

2. Equal Frequency Binning

3. K-Means Clustering

4. Decision Trees

Example of Data Discretization

Consider a dataset containing the heights of 100 individuals. The heights are continuous data
and can range from 4 feet to 6 feet. To make this data easier to work with, we can discretize it
into the following categories:

1. 4 to 4.5 feet

2. 4.5 to 5 feet

3. 5 to 5.5 feet

4. 5.5 to 6 feet

Each individual’s height can then be assigned to one of the above categories, making the data
easier to work with and improving the accuracy of the machine learning model.
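
A minimal sketch of this binning with pandas, using hypothetical heights in feet and the four bins described above:

import pandas as pd

heights = pd.Series([4.2, 4.8, 5.1, 5.4, 5.9, 4.6, 5.7])
bins = [4.0, 4.5, 5.0, 5.5, 6.0]
labels = ["4 to 4.5 feet", "4.5 to 5 feet", "5 to 5.5 feet", "5.5 to 6 feet"]

height_bins = pd.cut(heights, bins=bins, labels=labels)   # assign each height to a bin
print(height_bins.value_counts())                         # how many individuals per bin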
Data aggregation and group iteration

Aggregation is used to get statistics such as the sum, mean, variance, and standard deviation
of all columns in a DataFrame, or of a particular column. A worked example follows the list
of methods below.

 sum(): It returns the sum of the data frame

Syntax:

dataframe['column'].sum()

 mean(): It returns the mean of the particular column in a data frame

Syntax:

dataframe['column'].mean()

 std(): It returns the standard deviation of that column.

Syntax:

dataframe['column'].std()

 var(): It returns the variance of that column

Syntax:

dataframe['column'].var()

 min(): It returns the minimum value in column

Syntax:

dataframe['column'].min()

 max(): It returns maximum value in column

Syntax:

dataframe['column'].max()
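A minimal sketch of these methods together with group iteration, using a small hypothetical DataFrame (column names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "department": ["IT", "IT", "HR", "HR"],
    "salary": [52000, 61000, 45000, 47000],
})

print(df["salary"].sum())     # 205000
print(df["salary"].mean())    # 51250.0
print(df["salary"].std())     # standard deviation of the column
print(df["salary"].var())     # variance of the column
print(df["salary"].min())     # 45000
print(df["salary"].max())     # 61000

# Group iteration: loop over each group produced by groupby()
for name, group in df.groupby("department"):
    print(name, group["salary"].mean())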

String manipulation

Working with strings

String Literals

String values begin and end with a single quote.

But if we want to use either double or single quotes within a string, then we have
multiple ways to do it, as shown below.

Double Quotes

One benefit of using double quotes is that the string can have a single quote character in
it.
Since the string begins with a double quote, Python knows that the single quote is part
of the string and not marking the end of the string.
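
A minimal illustrative example:

s1 = 'Hello world!'     # single-quoted string literal
s2 = "Bob's dog"        # double quotes let a single quote appear inside the string
print(s1)
print(s2)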

Escape Characters

If you need to use both single quotes and double quotes in the string, you’ll need to use
escape characters.

An escape character consists of a backslash (\) followed by the character you want to
add to the string.

Since the single quote in Bob\'s has a backslash before it, Python knows that it is not a
single quote meant to end the string value. The escape characters \' and \" allow you to put
single quotes and double quotes inside your strings, respectively.

Ex:
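A minimal illustrative example of escape characters:

spam = 'Say hi to Bob\'s mother.'     # escaped single quote inside a single-quoted string
print(spam)                           # Say hi to Bob's mother.
print("She said \"Hello\" to me.")    # escaped double quotes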

The different special escape characters that can be used in a program are listed below.
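
Some commonly used escape characters are:

\'      Single quote
\"      Double quote
\t      Tab
\n      Newline (line break)
\\      Backslash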

Raw Strings

You can place an r before the beginning quotation mark of a string to make it a raw
string. A raw string completely ignores all escape characters and prints any backslash
that appears in the string
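
A minimal illustrative example of a raw string:

print(r'That is Carol\'s cat.')     # prints the backslash literally: That is Carol\'s cat.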

Multiline Strings with Triple Quotes

A multiline string in Python begins and ends with either three single quotes or three
double quotes.

Any quotes, tabs, or newlines in between the “triple quotes” are considered part of the
string.

Program
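
A minimal illustrative example of a multiline string:

print('''Dear Alice,

Eve's cat has been arrested.

Sincerely,
Bob''')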

Multiline Comments

The hash character (#) marks the beginning of a comment for the rest of the line, while a
multiline string is often used for comments that span multiple lines.
Indexing and Slicing Strings

Strings use indexes and slices the same way lists do. We can think of the string 'Hello
world!' as a list and each character in the string as an item with a corresponding index.

The space and exclamation point are included in the character count, so 'Hello world!' is
12 characters long.

If we specify an index, we’ll get the character at that position in the string.

If we specify a range from one index to another, the starting index is included and the
ending index is not.

The substring we get from spam[0:5] will include everything from spam[0] to spam[4],
leaving out the space at index 5.
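
A minimal illustrative example of indexing and slicing:

spam = 'Hello world!'
print(spam[0])       # 'H'
print(spam[4])       # 'o'
print(spam[-1])      # '!'
print(spam[0:5])     # 'Hello' (index 0 up to, but not including, index 5)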
Useful String Methods

Several string methods analyze strings or create transformed string values.

The upper(), lower(), isupper(), and islower() String Methods

The upper() and lower() string methods return a new string where all the letters in the
original string have been converted to uppercase or lowercase, respectively.

These methods do not change the string itself but return new string values.

If we want to change the original string, we have to call upper() or lower() on the string
and then assign the new string to the variable where the original was stored.

The upper() and lower() methods are helpful if we need to make a case-insensitive
comparison.

In the following small program, it does not matter whether the user types Great,
GREAT, or grEAT, because the string is first converted to lowercase.
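
A minimal sketch of such a program, assuming simple keyboard input:

feeling = input('How are you feeling? ')
if feeling.lower() == 'great':          # case-insensitive comparison
    print('I feel great too.')
else:
    print('I hope the rest of your day is good.')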


The isupper() and islower() methods will return a Boolean True value if the string has
at least one letter and all the letters are uppercase or lowercase, respectively. Otherwise,
the method returns False.
Since the upper() and lower() string methods themselves return strings, you can call
string methods on those returned string values as well. Expressions that do this will
look like a chain of method calls.

The isX String Methods

There are several string methods that have names beginning with the word is. These
methods return a Boolean value that describes the nature of the string.

Here are some common isX string methods:

o isalpha() returns True if the string consists only of letters and is not blank.

o isalnum() returns True if the string consists only of letters and numbers and is
not blank.

o isdecimal() returns True if the string consists only of numeric characters and is
not blank.

o isspace() returns True if the string consists only of spaces, tabs, and newlines
and is not blank.

o istitle() returns True if the string consists only of words that begin with an
uppercase letter followed by only lowercase letters.
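
A minimal illustrative example of these methods:

print('hello'.isalpha())               # True  - only letters
print('hello123'.isalnum())            # True  - letters and numbers
print('123'.isdecimal())               # True  - only numeric characters
print('   '.isspace())                 # True  - only whitespace
print('This Is Title Case'.istitle())  # True  - title-case words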
