MODULE 3 – DAP


What is Pandas?

Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers
to both "Panel Data" and "Python Data Analysis"; the library was created by
Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze big data and draw conclusions based on statistical
theories. Pandas can clean messy data sets and make them readable and
relevant. Relevant data is very important in data science.

Import Pandas

Once Pandas is installed, import it in your applications with the import keyword:

import pandas

Now Pandas is imported and ready to use.

import pandas

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

Pandas as pd

Pandas is usually imported under the pd alias.

alias: In Python, an alias is an alternate name for referring to the same thing.

Create an alias with the as keyword while importing:

import pandas as pd

Now the Pandas package can be referred to as pd instead of pandas.


import pandas as pd

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

Two Workhorse Data Structures

1. Series

2. DataFrame

1. Series

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0 1
1 7
2 2
dtype: int64
Labels

If nothing else is specified, the values are labeled with their index number: the
first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

print(myvar[0])
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x 1
y 7
z 2
dtype: int64
Key/Value Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1 420
day2 380
day3 390
dtype: int64
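If you pass only some of the dictionary's keys as the index argument, only those items are included in the Series. A short sketch building on the example above:

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# keep only "day1" and "day2" by listing them in the index argument
myvar = pd.Series(calories, index=["day1", "day2"])

print(myvar)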
DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)
calories duration
0 420 50
1 380 40
2 390 45
Locate Row

As you can see from the result above, the DataFrame is like a table with rows
and columns.

Pandas uses the loc attribute to return one or more specified rows.

Return row 0:

# refer to the row index (df here is a DataFrame built from the 'data' dictionary above):
df = pd.DataFrame(data)
print(df.loc[0])

calories 420
duration 50
Name: 0, dtype: int64
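loc can also return several rows at once. When you pass a list of indexes, the result is a DataFrame rather than a Series. A short sketch:

# return rows 0 and 1:
print(df.loc[[0, 1]])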
Loading from CSV files

What Is a CSV File?

A CSV (Comma Separated Values) file is a plain text file that uses a specific
structure to arrange tabular data, such as a spreadsheet or database. Because it
is a plain text file, it can contain only actual text data, in other words printable
ASCII or Unicode characters. Each line of the file is a data record, and each
record consists of one or more fields separated by commas. The use of the
comma as a field separator is the source of the name for this file format. For
working with CSV files in Python, there is also a built-in module called csv.

import pandas as pd

df = pd.read_csv('data.csv')
print(df.to_string())

Duration Pulse Maxpulse Calories


0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
11 60 100 120 250.7
Use to_string() to print the entire DataFrame. If the DataFrame has many rows, print(df) shows only the first and last 5 rows:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

Duration Pulse Maxpulse Calories


0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4

[169 rows x 4 columns]

pandas.read_csv performs type inference, because the column data types
are not part of the data format. That means you don't necessarily have to specify
which columns are numeric, integer, boolean, or string. Other data formats, like
HDF5, Feather, and msgpack, have the data types stored in the format. Handling
dates and other custom types can require extra effort.
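As a rough sketch of overriding the inference (the column names 'price' and 'order_date' are hypothetical, used purely for illustration):

import pandas as pd

df = pd.read_csv('data.csv',
                 dtype={'price': 'float64'},    # force a specific dtype for one column
                 parse_dates=['order_date'])    # parse this column as datetime
print(df.dtypes)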
Comma-separated (CSV) text file:
In [8]: !cat examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Here we used the Unix cat shell command to print the raw contents of the file to the
screen. On Windows, use type instead of cat to achieve the same effect.
Since this is comma-delimited, we can use read_csv to read it into a DataFrame:
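For example, reading the file shown above:

import pandas as pd

df = pd.read_csv('examples/ex1.csv')
print(df)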
READ_CSV / READ_TABLE Function Arguments

Reading Text Files in Pieces:

When processing very large files, or while figuring out the right set of arguments to
correctly process a large file, you may only want to read in a small piece of a
file or iterate through smaller chunks of the file. Before looking at a large file,
we can make the pandas display settings more compact.
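A minimal sketch of both approaches, assuming a large file named 'data.csv':

import pandas as pd

# make the display settings more compact
pd.options.display.max_rows = 10

# read only the first 5 rows of the file
small_piece = pd.read_csv('data.csv', nrows=5)
print(small_piece)

# iterate over the file in chunks of 1000 rows at a time
chunker = pd.read_csv('data.csv', chunksize=1000)
for chunk in chunker:
    print(len(chunk))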
Writing Data to Text Format
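Data can also be written back out to a delimited text file with to_csv. A quick sketch:

import pandas as pd

df = pd.read_csv('data.csv')

# write the DataFrame out as CSV; index=False omits the row labels
df.to_csv('out.csv', index=False)

# a different delimiter can be used, for example '|'
df.to_csv('out_pipe.txt', sep='|', index=False)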
Accessing SQL Databases in Python

Interacting with Databases: In a business setting, most data may not be stored in
text or Excel files. SQL-based relational databases (such as SQL Server,
PostgreSQL, and MySQL) are in wide use, and many alternative databases have
become quite popular. The choice of database is usually dependent on the
performance, data integrity, and scalability needs of an application. Loading
data from SQL into a DataFrame is fairly straightforward, and pandas has some
functions to simplify the process. As an example, you can create a SQLite database
using Python's built-in sqlite3 driver.

Accessing an SQL database in Python can be done using various libraries, but
one of the most popular and widely used is sqlite3 for SQLite databases, with
SQLAlchemy for more general SQL database management. Here's a basic
overview of using sqlite3:

Using sqlite3:

import sqlite3

# Connect to the SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('example.db')

# Create a cursor object to execute SQL queries
cursor = conn.cursor()

# Execute an SQL query
cursor.execute("SELECT * FROM your_table")

# Fetch the results (if any)
results = cursor.fetchall()

# Process the results
for row in results:
    print(row)

# Close the cursor and connection
cursor.close()
conn.close()
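As the text above notes, pandas can simplify this. A hedged sketch of loading a query result directly into a DataFrame (your_table is still a placeholder table name):

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')

# read_sql_query runs the query and returns the result set as a DataFrame
df = pd.read_sql_query("SELECT * FROM your_table", conn)
print(df.head())

conn.close()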
Cleansing Data with Python: Stripping Out Extraneous Information

We can clean the data by using the Python modules NumPy and Pandas.

What is Data Cleansing?

Data Cleansing is the process of detecting and changing raw data by
identifying incomplete, wrong, repeated, or irrelevant parts of the data.
For example, when one takes a data set one needs to remove null values,
keep only the part of the data needed for the application, and so on. Besides
this, there are a lot of applications where we need to handle the obtained
information.

Below we perform data cleansing using the NumPy and Pandas modules. We can
use the following statements to install the modules:

pip install numpy
pip install pandas

import pandas as pd

data = pd.read_csv('data.csv')
print(data)

Output:

1. Finding and Removing Missing Values

The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional
for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to
represent missing data. We call this a sentinel value that can be easily detected.

We can find the missing values using the isnull() function.


Example of finding missing values:
data.isnull()

Output:
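To count how many values are missing in each column, a small follow-up sketch:

# number of missing values per column
print(data.isnull().sum())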

Example of removing missing values:


data.dropna()

Output:
2. Replacing Missing Values
We have different options for replacing the missing values. We can use the
replace() function or fillna() function to replace it with a constant value.
Example of replacing missing values using replace():
from numpy import nan

data.replace({nan: 0.00})

Output:

Example of replacing missing values using fillna():


data.fillna(3)

Output:

Using the fillna() function, we can also fill forward and fill backward (in recent pandas versions the equivalent methods are data.ffill() and data.bfill()).

Example of replacing missing values by filling forward:

data.fillna(method='pad')

Output:

Example of replacing missing values by filling backward:

data.fillna(method='backfill')

Output:

3. Removing Repeated Values


We can remove the repeated values by using the drop_duplicates() method.
Example of removing repeated values:
data.drop_duplicates()

Output:

4. Removing Irrelevant Data


We can remove irrelevant data (for example, an unneeded column) by using the del statement.
Example of removing irrelevant data:
del data['YOB']
print(data)

Output:

5. Renaming Columns
We have a function rename() to rename the columns.
Example of renaming columns:
print(data.rename(columns={'Name':'FirstName','Surname':'LastName'}))

Output:

NA handling methods:

isnull()  - Generate a Boolean mask indicating missing values
notnull() - Opposite of isnull()
dropna()  - Return a filtered version of the data
fillna()  - Return a copy of the data with missing values filled or imputed

Normalizing Data and Formatting Data

• Data Normalization: Data normalization is a common practice in machine
learning that consists of transforming numeric columns to a standard scale. In
machine learning, some feature values can be many times larger than others;
the features with higher values will dominate the learning process.

Steps Needed

Here, we will apply some techniques to normalize the data and discuss these with
the help of examples. For this, let’s understand the steps needed for data
normalization with Pandas.
1. Import Library (Pandas)
2. Import / Load / Create data.
3. Use the technique to normalize the data.
Examples
Here, we create data by some random values and apply some normalization
techniques to it.

• Python3

# importing packages
import pandas as pd

# create data
df = pd.DataFrame([
    [180000, 110, 18.9, 1400],
    [360000, 905, 23.4, 1800],
    [230000, 230, 14.0, 1300],
    [60000, 450, 13.5, 1500]],
    columns=['Col A', 'Col B', 'Col C', 'Col D'])

# view data (display() works in a Jupyter notebook; use print(df) in a plain script)
display(df)
Output:

See the plot of this dataframe:

• Python3
import matplotlib.pyplot as plt

df.plot(kind = 'bar')

Let’s apply normalization techniques one by one.

Using The maximum absolute scaling

The maximum absolute scaling rescales each feature between -1 and 1 by dividing
every observation by its maximum absolute value. We can apply the maximum
absolute scaling in Pandas using the .max() and .abs() methods, as shown below.
• Python3

# copy the data
df_max_scaled = df.copy()

# apply maximum absolute scaling to every column
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()

# view normalized data
display(df_max_scaled)

Output :

See the plot of this dataframe:

• Python3
import matplotlib.pyplot as plt

df_max_scaled.plot(kind = 'bar')

Output:
Using The min-max feature scaling

The min-max approach (often called normalization) rescales the feature to a fixed
range of [0, 1] by subtracting the minimum value of the feature and then dividing
by the range. We can apply min-max scaling in Pandas using the .min() and
.max() methods.

• Python3

# copy the data
df_min_max_scaled = df.copy()

# apply min-max scaling to every column
for column in df_min_max_scaled.columns:
    col_min = df_min_max_scaled[column].min()
    col_max = df_min_max_scaled[column].max()
    df_min_max_scaled[column] = (df_min_max_scaled[column] - col_min) / (col_max - col_min)

# view normalized data
print(df_min_max_scaled)

Output :
Let’s draw a plot with this dataframe:

• Python3
import matplotlib.pyplot as plt

df_min_max_scaled.plot(kind = 'bar')

Using The z-score method

The z-score method (often called standardization) transforms the data into a
distribution with a mean of 0 and a standard deviation of 1. Each standardized value is
computed by subtracting the mean of the corresponding feature and then dividing by the
standard deviation.

• Python3

# copy the data
df_z_scaled = df.copy()

# apply z-score standardization to every column
for column in df_z_scaled.columns:
    df_z_scaled[column] = (df_z_scaled[column] - df_z_scaled[column].mean()) / df_z_scaled[column].std()

# view normalized data
display(df_z_scaled)

Output :

Let’s draw a plot with this dataframe.

• Python3
import matplotlib.pyplot as plt

df_z_scaled.plot(kind='bar')
Summary

Data normalization consists of rescaling numeric columns to a standard scale. In
Python, we can implement data normalization in a very simple way. The Pandas
library contains multiple built-in methods for calculating the most common
descriptive statistics, which makes data normalization techniques very easy
to implement.

Formatting Data
Formatting data involves preparing it for presentation or further analysis. Here are some
common formatting tasks in Python:
1. Date and Time Formatting:
o Use libraries like datetime or pandas to format dates and times.
o Example:

Python

import datetime

current_date = datetime.datetime.now()
formatted_date = current_date.strftime("%Y-%m-%d %H:%M:%S")
print(f"Formatted date: {formatted_date}")
2. String Formatting:
o Use string formatting methods to create well-structured strings.
o Example:

Python

name = "Alice"
age = 30
formatted_string = f"Hello, my name is {name} and I am {age} years old."
print(formatted_string)
3. Numeric Formatting:
o Control the precision, decimal places, and alignment of numeric values.
o Example:

Python

pi_value = 3.141592653589793
formatted_pi = f"Value of pi: {pi_value:.2f}"
print(formatted_pi)

Combining and Merging Data Sets

Data contained in pandas objects can be combined together in a number of
built-in ways:

pandas.merge connects rows in DataFrames based on one or more keys. This
will be familiar to users of SQL or other relational databases, as it implements
database join operations.

pandas.concat glues or stacks together objects along an axis.

The combine_first instance method enables splicing together overlapping data to fill
in missing values in one object with values from another.
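A small illustrative sketch of combine_first with two Series (the values are made up for the example):

import pandas as pd
import numpy as np

a = pd.Series([np.nan, 2.5, np.nan, 3.5], index=['a', 'b', 'c', 'd'])
b = pd.Series([0.0, 1.0, 2.0, 3.0], index=['a', 'b', 'c', 'd'])

# missing values in a are filled with the corresponding values from b
print(a.combine_first(b))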

4.6.1 Database-style DataFrame Merges

Merge or join operations combine data sets by linking rows using one or more
keys. These operations are central to relational databases. The merge function in
pandas is the main entry point for using these algorithms on your data.
Example:

>>> df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

>>> df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

>>> df1

   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

>>> df2

   data2 key
0      0   a
1      1   b
2      2   d

Following is an example of a many-to-one merge situation: the data in df1 has
multiple rows labeled a and b, whereas df2 has only one row for each value in
the key column. Calling merge with these objects, we obtain:

>>> pd.merge(df1, df2)

Note that in the above example we didn't specify which column to join on. If not
specified, merge uses the overlapping column names as the keys. It is good
practice to specify the key explicitly, though:

>>> pd.merge(df1, df2, on='key')


If a key value appears in only one of the DataFrames (here 'c' and 'd'), the corresponding rows are
missing from the result. By default merge does an 'inner' join; the keys in the result are the
intersection. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of
the keys, combining the effect of applying both left and right joins:

>>> pd.merge(df1, df2, how='outer')


Merging on Index

In some cases, the merge key or keys in a DataFrame will be found in its index.

In this case, you can pass left_index=True or right_index=True (or both) to

indicate that the index should be used as the merge key:

>>> left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
....:                      'value': range(6)})

>>> right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

>>> left1
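To actually perform the join, pass left_on for the key column of left1 and right_index=True to use the index of right1 as its merge key (a sketch following the pattern above):

>>> pd.merge(left1, right1, left_on='key', right_index=True)

The row of left1 whose key is 'c' is dropped because 'c' does not appear in the index of right1 (the default is an inner join).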

Concatenating Along an Axis :


Another kind of data combination operation is alternatively referred to as

concatenation, binding, or stacking. NumPy has a concatenate function for

doing this with raw NumPy arrays:

>>> arr = np.arange(12).reshape((3, 4))

>>> arr
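For instance, concatenating this array with itself along the column axis (the output follows directly from arr as defined above):

>>> import numpy as np

>>> np.concatenate([arr, arr], axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])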

By default concat works along axis=0, producing another Series. If you pass

axis=1, the result will instead be a DataFrame (axis=1 is the columns):


>>> pd.concat([s1, s2, s3], axis=1)
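The Series s1, s2, and s3 are not defined in the excerpt above; one possible definition (values chosen purely for illustration) that makes both calls concrete is:

>>> s1 = pd.Series([0, 1], index=['a', 'b'])
>>> s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
>>> s3 = pd.Series([5, 6], index=['f', 'g'])

>>> pd.concat([s1, s2, s3])          # axis=0: one longer Series
>>> pd.concat([s1, s2, s3], axis=1)  # axis=1: a DataFrame with one column per Series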

Reshaping and Pivoting:

There are a number of fundamental operations for rearranging tabular data.

These are alternately referred to as reshape or pivot operations.

4.7.1 Reshaping with Hierarchical Indexing:

Hierarchical indexing provides a consistent way to rearrange data in a


DataFrame. There are two primary actions:

• stack: this “rotates” or pivots from the columns in the data to the rows

• unstack: this pivots from the rows into the columns

Example:

>>> data = pd.DataFrame(np.arange(6).reshape((2, 3)),

....: index=pd.Index(['Ohio', 'Colorado'], name='state'),

....: columns=pd.Index(['one', 'two', 'three'], name='number'))

>>> data

Using the stack method on this data pivots the columns into the rows,

producing a Series:

>>> result = data.stack()

>>> result
Unstacking might introduce missing data if all of the values in the level aren’t

found in each of the subgroups:

>>> s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

>>> s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

>>> data2 = pd.concat([s1, s2], keys=['one', 'two'])

>>> data2.unstack()
String Manipulation:

Python has long been a popular data munging language, in part due to its ease
of use for string and text processing. Most text operations are made simple with
the string object's built-in methods. For more complex pattern matching and
text manipulations, regular expressions may be needed.
4.9.1 String Object Methods

In many string munging and scripting applications, built-in string methods are

sufficient.

As an example, a comma-separated string can be broken into pieces with split:

>>> val = 'a,b, guido'

>>> val.split(',')

['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including newlines):

>>> pieces = [x.strip() for x in val.split(',')]

>>> pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter

using addition:

>>> first, second, third = pieces

>>> first + '::' + second + '::' + third

'a::b::guido'

But, this isn’t a practical generic method. A faster and more Pythonic way is to

pass a list or tuple to the join method on the string '::':

>>> '::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python’s in

keyword is the best way to detect a substring, though index and find can also

be used:

>>> 'guido' in val


True

>>> val.index(',')
1

>>> val.find(':')

-1

Note the difference between find and index is that index raises an exception if

the string isn’t found (versus returning -1):

>>> val.index(':')

---------------------------------------------------------------------------

ValueError Traceback (most recent call last) in ()

----> 1 val.index(':')

ValueError: substring not found

Relatedly, count returns the number of occurrences of a particular substring

>>> val.count(',')
2

replace will substitute occurrences of one pattern for another. This is

commonly used to delete patterns, too, by passing an empty string:

>>> val.replace(',', '::')
'a::b:: guido'

>>> val.replace(',', '')
'ab guido'

Regular expressions can also be used with many of these operations, as shown below.
Regular expressions

Regular expressions provide a flexible way to search or match string patterns

in text. A single expression, commonly called a regex, is a string formed

according to the regular expression language. Python’s built-in re module is

responsible for applying regular expressions to strings.

The re module functions fall into three categories:

1. Pattern matching

2. Substitution

3. Splitting

Example: suppose we want to split a string with a variable number of whitespace
characters (tabs, spaces, and newlines). The regex describing one or more
whitespace characters is \s+:

>>> import re

>>> text = "foo bar\t baz \tqux"

>>> re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and
then its split method is called on the passed text. You can compile the regex
yourself with re.compile, forming a reusable regex object:

>>> regex = re.compile(r'\s+')

>>> regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can
use the findall method:

>>> regex.findall(text)
[' ', '\t ', ' \t']

Creating a regex object with re.compile is highly recommended if you intend to

apply the same expression to many strings; doing so will save CPU cycles.

match and search are closely related to findall. While findall returns all

matches in a string, search returns only the first match. More rigidly, match

only matches at the beginning of the string.

Let's consider a block of text and a regular expression capable of identifying
most email addresses:
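A sketch in that spirit (the sample text and pattern are illustrative, not a complete email validator):

>>> text = """Dave dave@google.com
... Steve steve@gmail.com
... Rob rob@gmail.com
... Ryan ryan@yahoo.com"""

>>> pattern = r'[A-Z0-9._%+-]+@[A-Z0-9._%+-]+\.[A-Z]{2,4}'

>>> # re.IGNORECASE makes the regex case-insensitive
>>> regex = re.compile(pattern, flags=re.IGNORECASE)

>>> regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']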
