MODULE 3 – DAP


What is Pandas?

Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers
to both "Panel Data" and "Python Data Analysis"; the library was created by
Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze big data and draw conclusions based on statistical
theories. Pandas can clean messy data sets and make them readable and
relevant. Relevant data is very important in data science.

Import Pandas

Once Pandas is installed, import it in your applications with the import keyword:

import pandas

Now Pandas is imported and ready to use.

import pandas

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

Pandas as pd

Pandas is usually imported under the pd alias.

alias: In Python, an alias is an alternate name for referring to the same thing.

Create an alias with the as keyword while importing:

import pandas as pd

Now the Pandas package can be referred to as pd instead of pandas.


import pandas as pd

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

Two Workhorse Data Structures

1. Series

2. DataFrame

1. Series

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0 1
1 7
2 2
dtype: int64
Labels

If nothing else is specified, the values are labeled with their index number: the
first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

print(myvar[0])
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x 1
y 7
z 2
dtype: int64
Key/Value Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1 420
day2 380
day3 390
dtype: int64
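If you pass only some of the dictionary's keys as the index argument, only those items are included in the Series. A short sketch building on the example above:

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# keep only "day1" and "day2" by listing them in the index argument
myvar = pd.Series(calories, index=["day1", "day2"])

print(myvar)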
DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)
calories duration
0 420 50
1 380 40
2 390 45
Locate Row

As you can see from the result above, the DataFrame is like a table with rows
and columns.

Pandas uses the loc attribute to return one or more specified rows.

Return row 0:

# refer to the row index (df here is a DataFrame built from the 'data' dictionary above):
df = pd.DataFrame(data)
print(df.loc[0])

calories 420
duration 50
Name: 0, dtype: int64
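loc can also return several rows at once. When you pass a list of indexes, the result is a DataFrame rather than a Series. A short sketch:

# return rows 0 and 1:
print(df.loc[[0, 1]])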
Loading from CSV files

What Is a CSV File?

A CSV (Comma Separated Values) file is a plain text file that uses a specific
structure to arrange tabular data, such as a spreadsheet or database. Because it
is a plain text file, it can contain only actual text data, in other words printable
ASCII or Unicode characters. Each line of the file is a data record, and each
record consists of one or more fields separated by commas. The use of the
comma as a field separator is the source of the name for this file format. For
working with CSV files in Python, there is also a built-in module called csv.

import pandas as pd

df = pd.read_csv('data.csv')
print(df.to_string())

Duration Pulse Maxpulse Calories


0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
11 60 100 120 250.7
Use to_string() to print the entire DataFrame. If the DataFrame has many rows, print(df) shows only the first and last 5 rows:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

Duration Pulse Maxpulse Calories


0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4

[169 rows x 4 columns]

pandas.read_csv performs type inference, because the column data types
are not part of the data format. That means you don't necessarily have to specify
which columns are numeric, integer, boolean, or string. Other data formats, like
HDF5, Feather, and msgpack, have the data types stored in the format. Handling
dates and other custom types can require extra effort.
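As a rough sketch of overriding the inference (the column names 'price' and 'order_date' are hypothetical, used purely for illustration):

import pandas as pd

df = pd.read_csv('data.csv',
                 dtype={'price': 'float64'},    # force a specific dtype for one column
                 parse_dates=['order_date'])    # parse this column as datetime
print(df.dtypes)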
Comma-separated (CSV) text file:
In [8]: !cat examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Here we used the Unix cat shell command to print the raw contents of the file to the
screen. On Windows, use type instead of cat to achieve the same effect.
Since this is comma-delimited, we can use read_csv to read it into a DataFrame:
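For example, reading the file shown above:

import pandas as pd

df = pd.read_csv('examples/ex1.csv')
print(df)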
READ_CSV / READ_TABLE Function Arguments

Reading Text Files in Pieces:

When processing very large files, or while figuring out the right set of arguments to
correctly process a large file, you may only want to read in a small piece of a
file or iterate through smaller chunks of the file. Before looking at a large file,
we can make the pandas display settings more compact.
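A minimal sketch of both approaches, assuming a large file named 'data.csv':

import pandas as pd

# make the display settings more compact
pd.options.display.max_rows = 10

# read only the first 5 rows of the file
small_piece = pd.read_csv('data.csv', nrows=5)
print(small_piece)

# iterate over the file in chunks of 1000 rows at a time
chunker = pd.read_csv('data.csv', chunksize=1000)
for chunk in chunker:
    print(len(chunk))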
Writing Data to Text Format
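Data can also be written back out to a delimited text file with to_csv. A quick sketch:

import pandas as pd

df = pd.read_csv('data.csv')

# write the DataFrame out as CSV; index=False omits the row labels
df.to_csv('out.csv', index=False)

# a different delimiter can be used, for example '|'
df.to_csv('out_pipe.txt', sep='|', index=False)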
Accessing SQL Databases in Python

Interacting with Databases: In a business setting, most data may not be stored in
text or Excel files. SQL-based relational databases (such as SQL Server,
PostgreSQL, and MySQL) are in wide use, and many alternative databases have
become quite popular. The choice of database is usually dependent on the
performance, data integrity, and scalability needs of an application. Loading
data from SQL into a DataFrame is fairly straightforward, and pandas has some
functions to simplify the process. As an example, you can create a SQLite database
using Python's built-in sqlite3 driver.

Accessing an SQL database in Python can be done using various libraries, but
one of the most popular and widely used is sqlite3 for SQLite databases, with
SQLAlchemy for more general SQL database management. Here's a basic
overview of using sqlite3:

Using sqlite3:

import sqlite3

# Connect to the SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('example.db')

# Create a cursor object to execute SQL queries
cursor = conn.cursor()

# Execute an SQL query
cursor.execute("SELECT * FROM your_table")

# Fetch the results (if any)
results = cursor.fetchall()

# Process the results
for row in results:
    print(row)

# Close the cursor and connection
cursor.close()
conn.close()
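As the text above notes, pandas can simplify this. A hedged sketch of loading a query result directly into a DataFrame (your_table is still a placeholder table name):

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')

# read_sql_query runs the query and returns the result set as a DataFrame
df = pd.read_sql_query("SELECT * FROM your_table", conn)
print(df.head())

conn.close()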
Cleansing Data with Python: Stripping Out Extraneous Information

We can clean the data by using the Python modules NumPy and Pandas.

What is Data Cleansing?

Data Cleansing is the process of detecting and changing raw data by
identifying incomplete, wrong, repeated, or irrelevant parts of the data.
For example, when one takes a data set one needs to remove null values,
keep only the part of the data needed for the application, and so on. Besides
this, there are a lot of applications where we need to handle the obtained
information.

Below we perform data cleansing using the NumPy and Pandas modules. We can
use the following statements to install the modules:

pip install numpy
pip install pandas

import pandas as pd

data = pd.read_csv('data.csv')
print(data)

Output:

1. Finding and Removing Missing Values

The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional
for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to
represent missing data. We call this a sentinel value that can be easily detected.

We can find the missing values using the isnull() function.


Example of finding missing values:
data.isnull()

Output:
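To count how many values are missing in each column, a small follow-up sketch:

# number of missing values per column
print(data.isnull().sum())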

Example of removing missing values:


data.dropna()

Output:
2. Replacing Missing Values
We have different options for replacing the missing values. We can use the
replace() function or fillna() function to replace it with a constant value.
Example of replacing missing values using replace():
from numpy import nan

data.replace({nan: 0.00})

Output:

Example of replacing missing values using fillna():


data.fillna(3)

Output:

Using the fillna() function, we can also fill forward and fill backward (in recent pandas versions the equivalent methods are data.ffill() and data.bfill()).

Example of replacing missing values by filling forward:

data.fillna(method='pad')

Output:

Example of replacing missing values by filling backward:

data.fillna(method='backfill')

Output:

3. Removing Repeated Values


We can remove the repeated values by using the drop_duplicates() method.
Example of removing repeated values:
data.drop_duplicates()

Output:

4. Removing Irrelevant Data


We can remove irrelevant data (for example, an unneeded column) by using the del statement.
Example of removing irrelevant data:
del data['YOB']
print(data)

Output:

5. Renaming Columns
We have a function rename() to rename the columns.
Example of renaming columns:
print(data.rename(columns={'Name':'FirstName','Surname':'LastName'}))

Output:

NA handling methods:

isnull()  - Generate a Boolean mask indicating missing values
notnull() - Opposite of isnull()
dropna()  - Return a filtered version of the data
fillna()  - Return a copy of the data with missing values filled or imputed

Normalizing Data and Formatting Data

• Data Normalization: Data normalization is a common practice in machine
learning that consists of transforming numeric columns to a standard scale. In
machine learning, some feature values can be many times larger than others;
the features with higher values will dominate the learning process.

Steps Needed

Here, we will apply some techniques to normalize the data and discuss these with
the help of examples. For this, let’s understand the steps needed for data
normalization with Pandas.
1. Import Library (Pandas)
2. Import / Load / Create data.
3. Use the technique to normalize the data.
Examples
Here, we create data by some random values and apply some normalization
techniques to it.

• Python3

# importing packages
import pandas as pd

# create data
df = pd.DataFrame([
    [180000, 110, 18.9, 1400],
    [360000, 905, 23.4, 1800],
    [230000, 230, 14.0, 1300],
    [60000, 450, 13.5, 1500]],
    columns=['Col A', 'Col B', 'Col C', 'Col D'])

# view data (display() works in a Jupyter notebook; use print(df) in a plain script)
display(df)
Output:

See the plot of this dataframe:

• Python3
import matplotlib.pyplot as plt

df.plot(kind = 'bar')

Let’s apply normalization techniques one by one.

Using The maximum absolute scaling

The maximum absolute scaling rescales each feature between -1 and 1 by dividing
every observation by its maximum absolute value. We can apply the maximum
absolute scaling in Pandas using the .max() and .abs() methods, as shown below.
• Python3

# copy the data
df_max_scaled = df.copy()

# apply maximum absolute scaling to every column
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()

# view normalized data
display(df_max_scaled)

Output :

See the plot of this dataframe:

• Python3
import matplotlib.pyplot as plt

df_max_scaled.plot(kind = 'bar')

Output:
Using The min-max feature scaling

The min-max approach (often called normalization) rescales the feature to a fixed
range of [0, 1] by subtracting the minimum value of the feature and then dividing
by the range. We can apply min-max scaling in Pandas using the .min() and
.max() methods.

• Python3

# copy the data
df_min_max_scaled = df.copy()

# apply min-max scaling to every column
for column in df_min_max_scaled.columns:
    col_min = df_min_max_scaled[column].min()
    col_max = df_min_max_scaled[column].max()
    df_min_max_scaled[column] = (df_min_max_scaled[column] - col_min) / (col_max - col_min)

# view normalized data
print(df_min_max_scaled)

Output :
Let’s draw a plot with this dataframe:

• Python3
import matplotlib.pyplot as plt

df_min_max_scaled.plot(kind = 'bar')

Using The z-score method

The z-score method (often called standardization) transforms the data into a
distribution with a mean of 0 and a standard deviation of 1. Each standardized value is
computed by subtracting the mean of the corresponding feature and then dividing by the
standard deviation.

• Python3

# copy the data
df_z_scaled = df.copy()

# apply z-score standardization to every column
for column in df_z_scaled.columns:
    df_z_scaled[column] = (df_z_scaled[column] - df_z_scaled[column].mean()) / df_z_scaled[column].std()

# view normalized data
display(df_z_scaled)

Output :

Let’s draw a plot with this dataframe.

• Python3
import matplotlib.pyplot as plt

df_z_scaled.plot(kind='bar')
Summary

Data normalization consists of rescaling numeric columns to a standard scale. In
Python, we can implement data normalization in a very simple way. The Pandas
library contains multiple built-in methods for calculating the most common
descriptive statistics, which makes data normalization techniques very easy
to implement.

Formatting Data
Formatting data involves preparing it for presentation or further analysis. Here are some
common formatting tasks in Python:
1. Date and Time Formatting:
o Use libraries like datetime or pandas to format dates and times.
o Example:

Python

import datetime

current_date = datetime.datetime.now()
formatted_date = current_date.strftime("%Y-%m-%d %H:%M:%S")
print(f"Formatted date: {formatted_date}")
2. String Formatting:
o Use string formatting methods to create well-structured strings.
o Example:

Python

name = "Alice"
age = 30
formatted_string = f"Hello, my name is {name} and I am {age} years old."
print(formatted_string)
3. Numeric Formatting:
o Control the precision, decimal places, and alignment of numeric values.
o Example:

Python

pi_value = 3.141592653589793
formatted_pi = f"Value of pi: {pi_value:.2f}"
print(formatted_pi)

Combining and Merging Data Sets

Data contained in pandas objects can be combined together in a number of
built-in ways:

pandas.merge connects rows in DataFrames based on one or more keys. This
will be familiar to users of SQL or other relational databases, as it implements
database join operations.

pandas.concat glues or stacks together objects along an axis.

The combine_first instance method enables splicing together overlapping data to fill
in missing values in one object with values from another.
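A small illustrative sketch of combine_first with two Series (the values are made up for the example):

import pandas as pd
import numpy as np

a = pd.Series([np.nan, 2.5, np.nan, 3.5], index=['a', 'b', 'c', 'd'])
b = pd.Series([0.0, 1.0, 2.0, 3.0], index=['a', 'b', 'c', 'd'])

# missing values in a are filled with the corresponding values from b
print(a.combine_first(b))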

4.6.1 Database-style DataFrame Merges

Merge or join operations combine data sets by linking rows using one or more
keys. These operations are central to relational databases. The merge function in
pandas is the main entry point for using these algorithms on your data.
Example:

>>> df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

>>> df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

>>> df1

   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

>>> df2

   data2 key
0      0   a
1      1   b
2      2   d

Following is an example of a many-to-one merge situation: the data in df1 has
multiple rows labeled a and b, whereas df2 has only one row for each value in
the key column. Calling merge with these objects, we obtain:

>>> pd.merge(df1, df2)

Note that in the above example we didn't specify which column to join on. If not
specified, merge uses the overlapping column names as the keys. It is good
practice to specify the key explicitly, though:

>>> pd.merge(df1, df2, on='key')


If a key value appears in only one of the DataFrames (here 'c' and 'd'), the corresponding rows are
missing from the result. By default merge does an 'inner' join; the keys in the result are the
intersection. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of
the keys, combining the effect of applying both left and right joins:

>>> pd.merge(df1, df2, how='outer')


Merging on Index

In some cases, the merge key or keys in a DataFrame will be found in its index.

In this case, you can pass left_index=True or right_index=True (or both) to

indicate that the index should be used as the merge key:

>>> left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
....:                      'value': range(6)})

>>> right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

>>> left1
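To actually perform the join, pass left_on for the key column of left1 and right_index=True to use the index of right1 as its merge key (a sketch following the pattern above):

>>> pd.merge(left1, right1, left_on='key', right_index=True)

The row of left1 whose key is 'c' is dropped because 'c' does not appear in the index of right1 (the default is an inner join).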

Concatenating Along an Axis :


Another kind of data combination operation is alternatively referred to as

concatenation, binding, or stacking. NumPy has a concatenate function for

doing this with raw NumPy arrays:

>>> arr = np.arange(12).reshape((3, 4))

>>> arr
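For instance, concatenating this array with itself along the column axis (the output follows directly from arr as defined above):

>>> import numpy as np

>>> np.concatenate([arr, arr], axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])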

By default concat works along axis=0, producing another Series. If you pass

axis=1, the result will instead be a DataFrame (axis=1 is the columns):


>>> pd.concat([s1, s2, s3], axis=1)
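The Series s1, s2, and s3 are not defined in the excerpt above; one possible definition (values chosen purely for illustration) that makes both calls concrete is:

>>> s1 = pd.Series([0, 1], index=['a', 'b'])
>>> s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
>>> s3 = pd.Series([5, 6], index=['f', 'g'])

>>> pd.concat([s1, s2, s3])          # axis=0: one longer Series
>>> pd.concat([s1, s2, s3], axis=1)  # axis=1: a DataFrame with one column per Series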

Reshaping and Pivoting:

There are a number of fundamental operations for rearranging tabular data.

These are alternately referred to as reshape or pivot operations.

4.7.1 Reshaping with Hierarchical Indexing:

Hierarchical indexing provides a consistent way to rearrange data in a


DataFrame. There are two primary actions:

• stack: this “rotates” or pivots from the columns in the data to the rows

• unstack: this pivots from the rows into the columns

Example:

>>> data = pd.DataFrame(np.arange(6).reshape((2, 3)),

....: index=pd.Index(['Ohio', 'Colorado'], name='state'),

....: columns=pd.Index(['one', 'two', 'three'], name='number'))

>>> data

Using the stack method on this data pivots the columns into the rows,

producing a Series:

>>> result = data.stack()

>>> result
Unstacking might introduce missing data if all of the values in the level aren’t

found in each of the subgroups:

>>> s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

>>> s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

>>> data2 = pd.concat([s1, s2], keys=['one', 'two'])

>>> data2.unstack()
String Manipulation:

Python has long been a popular data munging language, in part due to its ease
of use for string and text processing. Most text operations are made simple with
the string object's built-in methods. For more complex pattern matching and
text manipulations, regular expressions may be needed.
4.9.1 String Object Methods

In many string munging and scripting applications, built-in string methods are

sufficient.

As an example, a comma-separated string can be broken into pieces with split:

>>> val = 'a,b, guido'

>>> val.split(',')

['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including newlines):

>>> pieces = [x.strip() for x in val.split(',')]

>>> pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter

using addition:

>>> first, second, third = pieces

>>> first + '::' + second + '::' + third

'a::b::guido'

But, this isn’t a practical generic method. A faster and more Pythonic way is to

pass a list or tuple to the join method on the string '::':

>>> '::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python’s in

keyword is the best way to detect a substring, though index and find can also

be used:

>>> 'guido' in val


True

>>> val.index(',')
1

>>> val.find(':')

-1

Note the difference between find and index is that index raises an exception if

the string isn’t found (versus returning -1):

>>> val.index(':')

---------------------------------------------------------------------------

ValueError Traceback (most recent call last) in ()

----> 1 val.index(':')

ValueError: substring not found

Relatedly, count returns the number of occurrences of a particular substring

>>> val.count(',')
2

replace will substitute occurrences of one pattern for another. This is

commonly used to delete patterns, too, by passing an empty string:

>>> val.replace(',', '::')
'a::b:: guido'

>>> val.replace(',', '')
'ab guido'

Regular expressions can also be used with many of these operations, as shown below.
Regular expressions

Regular expressions provide a flexible way to search or match string patterns

in text. A single expression, commonly called a regex, is a string formed

according to the regular expression language. Python’s built-in re module is

responsible for applying regular expressions to strings.

The re module functions fall into three categories:

1. Pattern matching

2. Substitution

3. Splitting

Example: suppose we want to split a string with a variable number of whitespace
characters (tabs, spaces, and newlines). The regex describing one or more
whitespace characters is \s+:

>>> import re

>>> text = "foo bar\t baz \tqux"

>>> re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and
then its split method is called on the passed text. You can compile the regex
yourself with re.compile, forming a reusable regex object:

>>> regex = re.compile(r'\s+')

>>> regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can
use the findall method:

>>> regex.findall(text)
[' ', '\t ', ' \t']

Creating a regex object with re.compile is highly recommended if you intend to

apply the same expression to many strings; doing so will save CPU cycles.

match and search are closely related to findall. While findall returns all

matches in a string, search returns only the first match. More rigidly, match

only matches at the beginning of the string.

Let's consider a block of text and a regular expression capable of identifying
most email addresses:
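A sketch in that spirit (the sample text and pattern are illustrative, not a complete email validator):

>>> text = """Dave dave@google.com
... Steve steve@gmail.com
... Rob rob@gmail.com
... Ryan ryan@yahoo.com"""

>>> pattern = r'[A-Z0-9._%+-]+@[A-Z0-9._%+-]+\.[A-Z]{2,4}'

>>> # re.IGNORECASE makes the regex case-insensitive
>>> regex = re.compile(pattern, flags=re.IGNORECASE)

>>> regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']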
