MOD-3 Dap
What is Pandas?
Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers
to both "Panel Data" and "Python Data Analysis"; the library was created by
Wes McKinney in 2008.
Pandas allows us to analyze big data and draw conclusions based on statistical
theories. Pandas can clean messy data sets and make them readable and
relevant. Relevant data is very important in data science.
Import Pandas
import pandas
import pandas
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Pandas as pd
alias: In Python, an alias is an alternate name for referring to the same thing.
import pandas as pd
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
Pandas has two primary data structures:
1. Series
2. DataFrame

1. Series
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
0 1
1 7
2 2
dtype: int64
Labels
If nothing else is specified, the values are labeled with their index number. First
value has index 0, second value has index 1 etc.
print(myvar[0])
Create your own labels with the index argument:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar)
x 1
y 7
z 2
dtype: int64
Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
day1 420
day2 380
day3 390
dtype: int64
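If you want to include only some of the items from the dictionary, you can pass an index argument listing the keys you want; a small sketch reusing the calories dictionary:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# keep only day1 and day2 by naming them in the index argument
myvar = pd.Series(calories, index=["day1", "day2"])
print(myvar)
```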
DataFrames
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
calories duration
0 420 50
1 380 40
2 390 45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows
and columns.
Pandas uses the loc attribute to return one or more specified rows.
Return row 0:
print(myvar.loc[0])
calories 420
duration 50
Name: 0, dtype: int64
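loc can also take a list of row indexes, in which case it returns a DataFrame rather than a Series; a small sketch using the same data:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)

# a list of row labels returns a DataFrame containing those rows
rows = df.loc[[0, 1]]
print(rows)
```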
Loading from CSV files
A CSV file (Comma Separated Values file) is a type of plain text file that uses
specific structuring to arrange tabular data. Because it’s a plain text file, it can
contain only actual text data—in other words, printable ASCII or Unicode
characters.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Using sqlite3:
import sqlite3
# Connect to the SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('example.db')
# Create a cursor object to execute SQL queries
cursor = conn.cursor()
# Execute an SQL query
cursor.execute("SELECT * FROM your_table")
# Fetch the results (if any)
results = cursor.fetchall()
# Process the results
for row in results:
    print(row)
# Close the cursor and connection
cursor.close()
conn.close()
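pandas can also read a query result straight into a DataFrame with read_sql_query, combining the query and fetch steps above. A minimal sketch, assuming an in-memory database and a hypothetical people table created for illustration:

```python
import sqlite3
import pandas as pd

# build a small in-memory database (hypothetical table for illustration)
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Alice", 30), ("Bob", 25)])
conn.commit()

# read the query result directly into a DataFrame
df = pd.read_sql_query("SELECT * FROM people", conn)
print(df)

conn.close()
```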
Cleansing Data with Python: Stripping out extraneous information
We can clean the data by using the Python modules NumPy and Pandas.
What is Data Cleansing?
Data Cleansing is the process of detecting and correcting raw data by
identifying incomplete, wrong, repeated, or irrelevant parts of the data.
For example, when one takes a data set, one needs to remove null values,
keep only the part of the data needed for the application, and so on. Besides
this, there are many applications where we need to handle the obtained
information further.
Here we perform data cleansing using the NumPy and Pandas modules. We can
use the below statements to install them:
pip install numpy
pip install pandas
data = pd.read_csv('data.csv')
print(data)
Output:
2. Replacing Missing Values
We have different options for replacing the missing values. We can use the
replace() function or fillna() function to replace it with a constant value.
Example of replacing missing values using replace():
import numpy as np
data.replace({np.nan: 0.00})
Output:
Using the fillna() family of functions, we can fill forward and fill backward as well.
Example of replacing missing values by filling forward:
data.ffill()   # in older pandas: data.fillna(method='pad')
Output:
Example of replacing missing values by filling backward:
data.bfill()   # in older pandas: data.fillna(method='backfill')
Output:
5. Renaming Columns
We have a function rename() to rename the columns.
Example of renaming columns:
print(data.rename(columns={'Name':'FirstName','Surname':'LastName'}))
Output:
Pandas provides the following functions for handling missing values:
isnull(): detects missing values (returns True where a value is NaN)
notnull(): opposite of isnull()
dropna(): drops the rows (or columns) that contain missing values
fillna(): fills the missing values with a specified value or method
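A minimal sketch showing all four functions on a tiny DataFrame with one missing value (the score column is a made-up example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [90, np.nan, 75]})

print(df["score"].isnull())    # True where the value is NaN
print(df["score"].notnull())   # opposite of isnull()
print(df.dropna())             # drop the row containing NaN
print(df.fillna(0))            # replace NaN with 0
```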
Steps Needed
Here, we will apply some techniques to normalize the data and discuss these with
the help of examples. For this, let’s understand the steps needed for data
normalization with Pandas.
1. Import Library (Pandas)
2. Import / Load / Create data.
3. Use the technique to normalize the data.
Examples
Here, we create data by some random values and apply some normalization
techniques to it.
• Python3
# importing packages
import pandas as pd
# create data (sample values for illustration)
df = pd.DataFrame([
    [180000, 110, 18.9, 1400],
    [360000, 905, 23.4, 1800],
    [230000, 230, 14.0, 1300],
    [60000, 450, 13.5, 1500]
], columns=['Col A', 'Col B', 'Col C', 'Col D'])
# view data
display(df)
Output:
• Python3
import matplotlib.pyplot as plt
df.plot(kind = 'bar')
Using the maximum absolute scaling
The maximum absolute scaling rescales each feature to the range [-1, 1] by dividing
every observation by its maximum absolute value. We can apply the maximum
absolute scaling in Pandas using the .max() and .abs() methods, as shown below.
• Python3
# copy the data
df_max_scaled = df.copy()
# apply maximum absolute scaling for every column
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
display(df_max_scaled)
Output :
• Python3
import matplotlib.pyplot as plt
df_max_scaled.plot(kind = 'bar')
Output:
Using The min-max feature scaling
The min-max approach (often called normalization) rescales the feature to a fixed
range of [0, 1] by subtracting the minimum value of the feature and then dividing
by the range. We can apply the min-max scaling in Pandas using the .min() and
.max() methods.
• Python3
# copy the data
df_min_max_scaled = df.copy()
# apply min-max scaling for every column
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())
print(df_min_max_scaled)
Output :
Let’s draw a plot with this dataframe:
• Python3
import matplotlib.pyplot as plt
df_min_max_scaled.plot(kind = 'bar')
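As a sanity check (a small sketch with a single assumed Col A column), min-max scaled values should span exactly [0, 1]:

```python
import pandas as pd

df = pd.DataFrame({"Col A": [180000, 360000, 230000, 60000]})

# apply min-max scaling column by column
scaled = df.copy()
for column in scaled.columns:
    col = scaled[column]
    scaled[column] = (col - col.min()) / (col.max() - col.min())

print(scaled["Col A"].min(), scaled["Col A"].max())  # 0.0 1.0
```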
The z-score method (often called standardization) transforms the data into a
distribution with a mean of 0 and a standard deviation of 1. Each standardized value is
computed by subtracting the mean of the corresponding feature and then dividing by
the standard deviation.
• Python3
# copy the data
df_z_scaled = df.copy()
# apply the z-score method for every column
for column in df_z_scaled.columns:
    df_z_scaled[column] = (df_z_scaled[column] - df_z_scaled[column].mean()) / df_z_scaled[column].std()
display(df_z_scaled)
Output :
• Python3
import matplotlib.pyplot as plt
df_z_scaled.plot(kind='bar')
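As a sanity check (again a sketch with a single assumed Col A column), each standardized column should end up with a mean close to 0 and a standard deviation close to 1:

```python
import pandas as pd

df = pd.DataFrame({"Col A": [180000, 360000, 230000, 60000]})

# apply the z-score method column by column
df_z = df.copy()
for column in df_z.columns:
    df_z[column] = (df_z[column] - df_z[column].mean()) / df_z[column].std()

print(df_z["Col A"].mean())  # close to 0
print(df_z["Col A"].std())   # close to 1
```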
Summary
Formatting Data
Formatting data involves preparing it for presentation or further analysis. Here are some
common formatting tasks in Python:
1. Date and Time Formatting:
o Use libraries like datetime or pandas to format dates and times.
o Example:
Python
import datetime
current_date = datetime.datetime.now()
formatted_date = current_date.strftime("%Y-%m-%d %H:%M:%S")
print(f"Formatted date: {formatted_date}")
2. String Formatting:
o Use string formatting methods to create well-structured strings.
o Example:
Python
name = "Alice"
age = 30
formatted_string = f"Hello, my name is {name} and I am {age} years old."
print(formatted_string)
3. Numeric Formatting:
o Control the precision, decimal places, and alignment of numeric values.
o Example:
Python
pi_value = 3.141592653589793
formatted_pi = f"Value of pi: {pi_value:.2f}"
print(formatted_pi)
Merge or join operations combine data sets by linking rows using one or more
keys. These operations are central to relational databases. The merge function in
pandas is the main entry point for using these algorithms on your data.
Example:
>>> df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
>>> df1
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b
>>> df2 = DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
>>> df2
   data2 key
0      0   a
1      1   b
2      2   d
df1 has multiple rows labeled a and b, whereas df2 has only one row for
each value in the key column. Calling merge with these objects, we obtain an
inner join that keeps only the keys found in both frames. Note that the join
column was not specified; if not specified, merge uses the overlapping column
names as the keys. It's good practice to specify the key explicitly:
>>> pd.merge(df1, df2, on='key')
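The many-to-one merge described above can be run as a self-contained sketch (pd.DataFrame is used in place of the bare DataFrame constructor):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# inner join on the overlapping column 'key';
# rows with keys 'c' and 'd' have no match and are dropped
merged = pd.merge(df1, df2, on='key')
print(merged)
```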
In some cases, the merge key or keys in a DataFrame will be found in its index;
you can then pass left_index=True or right_index=True (or both) to merge to use
the index as the merge key.
By default concat works along axis=0, producing another Series. If you pass
axis=1, the result will instead be a DataFrame (axis=1 is the columns).
• stack: this "rotates" or pivots from the columns in the data to the rows
• unstack: this pivots from the rows into the columns
Example:
>>> data
Using the stack method on this data pivots the columns into the rows,
producing a Series:
>>> result = data.stack()
Unstacking might introduce missing data if all of the values in the level aren't
found in each of the subgroups:
>>> data2.unstack()
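A runnable sketch of stack and unstack; the index and column labels ('Ohio', 'Colorado', 'one', 'two', 'three') are assumed sample values:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

# stack pivots the columns into the rows, producing a Series
result = data.stack()
print(result)

# unstack pivots back from the rows into the columns
print(result.unstack())
```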
String Manipulation:
Python has long been a popular data munging language in part due to its ease of
use for string and text processing. Most text operations are made simple with
the string object's built-in methods. For more complex pattern matching and
text manipulations, regular expressions may be needed.
4.9.1 String Object Methods
In many string munging and scripting applications, built-in string methods are
sufficient.
>>> val = 'a,b,  guido'
>>> val.split(',')
['a', 'b', '  guido']
split is often combined with strip to trim whitespace:
>>> pieces = [x.strip() for x in val.split(',')]
>>> pieces
['a', 'b', 'guido']
These substrings could be concatenated together with a two-colon delimiter
using addition:
>>> first, second, third = pieces
>>> first + '::' + second + '::' + third
'a::b::guido'
But this isn't a practical generic method. A faster and more Pythonic way is to
pass a list or tuple to the join method on the string '::':
>>> '::'.join(pieces)
'a::b::guido'
Using Python's in keyword is the best way to detect a substring, though index
and find can also be used:
>>> 'guido' in val
True
>>> val.index(',')
1
>>> val.find(':')
-1
Note the difference between find and index: index raises an exception if the
string isn't found (versus returning -1):
>>> val.index(':')
ValueError: substring not found
Relatedly, count returns the number of occurrences of a particular substring:
>>> val.count(',')
2
replace will substitute occurrences of one pattern for another:
>>> val.replace(',', '::')
'a::b::  guido'
>>> val.replace(',', '')
'ab  guido'
Regular expressions can also be used with many of these operations, as shown below.
Regular expressions
1. Pattern matching
2. Substitution
3. Splitting
Suppose we wanted to split a string with a variable number of whitespace
characters (tabs, spaces, and newlines). The regex describing one or more
whitespace characters is \s+:
>>> import re
>>> text = "foo bar\t baz \tqux"
>>> re.split('\s+', text)
['foo', 'bar', 'baz', 'qux']
When you call re.split('\s+', text), the regular expression is first compiled,
then its split method is called on the passed text. You can compile the regex
yourself with re.compile, forming a reusable regex object:
>>> regex = re.compile('\s+')
>>> regex.split(text)
['foo', 'bar', 'baz', 'qux']
If, instead, you wanted to get a list of all patterns matching the regex, you can
use the findall method:
>>> regex.findall(text)
[' ', '\t ', ' \t']
Creating a regex object with re.compile is highly recommended if you intend to
apply the same expression to many strings; doing so will save CPU cycles.
match and search are closely related to findall. While findall returns all
matches in a string, search returns only the first match. More rigidly, match
only matches at the beginning of the string.
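The whitespace examples can be pulled together into a runnable sketch that also shows the findall/search/match distinction (text is a sample string):

```python
import re

text = "foo bar\t baz \tqux"

# compiling the regex once lets you reuse it across many strings
regex = re.compile(r'\s+')

print(regex.split(text))     # ['foo', 'bar', 'baz', 'qux']
print(regex.findall(text))   # the whitespace runs themselves

# search finds the first match anywhere in the string
print(regex.search(text).start())

# match only succeeds at the very beginning of the string; here it fails
print(regex.match(text))     # None
```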