

Introduction to Pandas:

Pandas is a powerful, open-source Python library for working with large data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data. Typical tasks include:
• Working with missing data.
• Indexing, slicing, and sub-setting large data sets.
• Merging and joining two different data sets.
• Reshaping data sets.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes
McKinney in 2008.
The Pandas library introduces two new data structures to Python, Series and DataFrame, both of which are built
on top of NumPy.
Checking the Pandas version: the version string is stored in the __version__ attribute (pd.__version__).
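As a minimal sketch of the version check described above (the variable name version is only illustrative):

```python
import pandas as pd

# The installed pandas version is exposed as a plain string.
version = pd.__version__
print(version)
```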
Why use Pandas?
Some of the reasons to use Pandas are as follows:
• Handle Large Data Efficiently: Pandas is designed for handling large datasets. It provides powerful
tools that simplify tasks like data filtering, transforming, and merging. It also provides built-in
functions to work with formats like CSV (Comma Separated Values), JSON (JavaScript Object
Notation), TXT, Excel, and SQL databases.
• Tabular Data Representation: The DataFrame, the primary data structure of Pandas, holds
data in tabular format. This allows easy indexing, selecting, replacing, and slicing of data.
• Data Cleaning and Preprocessing: Data cleaning and preprocessing are essential steps in the data
analysis pipeline. Pandas has methods for handling missing values, removing duplicates,
handling outliers, normalizing data, etc.
• Time Series Functionality: Pandas contains an extensive set of tools for working with dates, times,
and time-indexed data, as it was initially developed for financial modelling.
• Free and Open-Source: Like Python, Pandas is free to use and distribute,
even for commercial use.
• Pandas is well-suited for working with tabular data, such as spreadsheets or SQL tables.
• The Pandas library is an essential tool for data analysts, scientists, and engineers working with
structured data in Python.
Here is a list of things that we can do using Pandas.
o Data set cleaning, merging, and joining.
o Easy handling of missing data in floating point as well as non-floating point data.
o Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
o Powerful group by functionality for performing split-apply-combine operations on data sets.
o Data Visualization.
What Can Pandas Do?
Pandas gives you answers about the data, such as:
Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?
Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values.
This is called cleaning the data.
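The questions above map onto a handful of DataFrame methods. The following is a small sketch using made-up data; the column names duration and calories are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing measurement.
df = pd.DataFrame({
    "duration": [60, 45, 30, 60, None],
    "calories": [400.0, 300.0, 200.0, 410.0, 250.0],
})

print(df["calories"].mean())   # average value
print(df["calories"].max())    # maximum value
print(df["calories"].min())    # minimum value
print(df.corr())               # pairwise correlation between numeric columns

# Cleaning: drop every row that contains at least one missing value.
cleaned = df.dropna()
print(cleaned)
```

Here dropna() returns a new DataFrame and leaves df unchanged, so the original data is still available for comparison.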
3 Data Structures in the Pandas Library
Pandas has provided three data structures for manipulating data, of which Series and DataFrame are the two in
current use. They are:
• Series – one-dimensional
Syntax: pd.Series(data, index)
• DataFrame – two-dimensional (the most commonly used data structure for analyzing data sets)
Syntax: pd.DataFrame(data)
• Panel – multidimensional (deprecated, and no longer available in current pandas releases)
Syntax: pd.Panel(data, major_axis, minor_axis, dtype)

1. Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float,
Python objects, etc.). The axis labels are collectively called the index.
A Pandas Series can be thought of as a single column in an Excel sheet. Labels need not be unique but must be of
a hashable type. The object supports both integer and label-based indexing and provides a host of methods for
performing operations involving the index.
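Creating a Series:
Empty series
import pandas as pd
S = pd.Series(dtype="float64")   # explicit dtype for an empty Series
print(S)
Output: Series([], dtype: float64)
Series using array
import numpy as np
A = np.array([10, 20, 30, 40])
S = pd.Series(A)
Series using Lists
L = [10, 11, 12, 13]
S = pd.Series(L)
Series using Dictionary
D = {"a": 10, "b": 20, "c": 30}   # example dictionary; the keys become the index labels
S = pd.Series(D)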
In practice, a Pandas Series is often obtained by loading data from existing storage (which can be a SQL database,
a CSV file, or an Excel file).
A Pandas Series can also be created from lists, dictionaries, scalar values, etc.
Key/Value Objects as Series:
You can also use a key/value object, like a dictionary, when creating a Series. The keys of the dictionary become
the labels.
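A short sketch of a key/value object used as a Series. The dictionary contents are hypothetical; the index argument is the standard pd.Series parameter and is used here to select a subset of keys.

```python
import pandas as pd

# Hypothetical data: calories recorded per day.
calories = {"day1": 420, "day2": 380, "day3": 390}

# The dictionary keys become the index labels.
s = pd.Series(calories)
print(s["day2"])   # label-based access

# Passing index= keeps only the listed keys, in the listed order.
subset = pd.Series(calories, index=["day1", "day3"])
print(subset)
```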
2. Pandas DataFrame: A Pandas DataFrame is a two-dimensional data structure with labelled axes (rows and
columns), i.e., data is aligned in a tabular fashion.
Creating a DataFrame: In practice, a DataFrame is usually created by loading data from existing storage (which
can be a SQL database, a CSV file, or an Excel file).
A Pandas DataFrame can also be created from lists, dictionaries, a list of dictionaries, etc.
A DataFrame can hold any number of columns, and we can perform many operations on them, such as
arithmetic operations, column/row selection, and column/row addition.
Locate Row:
A DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified rows by label.
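The following sketch shows loc with a single label and with a list of labels. The data and the index labels day1 to day3 are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

row = df.loc["day2"]              # a single label returns a Series
rows = df.loc[["day1", "day3"]]   # a list of labels returns a DataFrame

print(row)
print(rows)
```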

Pandas DataFrame can be created in multiple ways.


1. Creating an empty DataFrame: The most basic DataFrame is an empty one. It is created
by calling the DataFrame constructor with no arguments.
2. Creating a DataFrame from a dict of ndarrays/lists: All the arrays must be of the same length.
If an index is passed, its length must equal the length of the
arrays. If no index is passed, the default index is range(n), where n is the array length.
3. Creating a DataFrame from lists using a dictionary: Place each list in a dictionary under the
column name it should have, then pass the dictionary to pandas.DataFrame. This transforms
a dictionary of lists into a DataFrame.
4. Creating a DataFrame from lists using zip: Another way to create a Pandas DataFrame is the
zip() function. zip() returns an iterator that yields one tuple at a time, each tuple holding the
elements at the same position in the input lists. Converting the result to a list of tuples (or
building a dictionary from it) gives data that can be passed to the DataFrame constructor, so
two lists can be merged into one DataFrame.
Suppose there are two lists of student data: the first list holds the names of the students and the
second list holds their ages.
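The creation routes listed above can be sketched as follows. The student names, ages, and index labels are made up; only pd.DataFrame and the built-in zip() are used.

```python
import pandas as pd

# 1. Empty DataFrame: call the constructor with no arguments.
empty = pd.DataFrame()

# 2./3. Dict of equal-length lists; the optional index must match that length.
data = {"name": ["Tom", "Nick", "Krish"], "age": [20, 21, 19]}
df_dict = pd.DataFrame(data, index=["rank1", "rank2", "rank3"])

# 4. Two separate lists combined with zip() into a list of (name, age) tuples.
names = ["Tom", "Nick", "Krish"]
ages = [20, 21, 19]
pairs = list(zip(names, ages))
df_zip = pd.DataFrame(pairs, columns=["name", "age"])

print(empty.empty)
print(df_dict)
print(df_zip)
```

Because no index is passed for df_zip, it receives the default index 0, 1, 2.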
Creating a Pandas DataFrame using a list of tuples:
Pandas is widely used for data manipulation in Python. We can create a DataFrame from a list of simple tuples,
and can even choose the specific elements of the tuples we want to use.
1. Using the pd.DataFrame() function: Pass the list of tuples directly to pd.DataFrame(), optionally with
column names.
2. Using from_records(): Pass the list of tuples to the
pd.DataFrame.from_records() method.
3. Using the df.pivot() function: Build a DataFrame from the list of tuples first, then reshape it with
df.pivot().
4. With a Custom Function: For scenarios that require pre-processing or when working with complex data
structures, a custom function can parse the list of tuples before converting it to a DataFrame.
This method offers maximum flexibility.
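The four approaches above can be sketched on one small list of tuples. The records, the column names, and the helper tuples_to_frame are hypothetical; the upper-casing step only stands in for whatever pre-processing a real case would need.

```python
import pandas as pd

# Hypothetical records: (student, subject, score).
records = [
    ("Tom", "math", 82),
    ("Tom", "physics", 75),
    ("Nick", "math", 90),
    ("Nick", "physics", 68),
]
columns = ["name", "subject", "score"]

# 1. Directly through the constructor.
df = pd.DataFrame(records, columns=columns)

# 2. Through from_records().
df_rec = pd.DataFrame.from_records(records, columns=columns)

# 3. Reshape the long table into one row per student, one column per subject.
wide = df.pivot(index="name", columns="subject", values="score")

# 4. Custom pre-processing before building the DataFrame (illustrative only).
def tuples_to_frame(rows):
    """Upper-case the subject field, then build a DataFrame."""
    cleaned = [(name, subject.upper(), score) for name, subject, score in rows]
    return pd.DataFrame(cleaned, columns=columns)

df_custom = tuples_to_frame(records)

print(wide)
print(df_custom)
```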
Grouping and Aggregating:
Grouping and aggregating make data analysis easier. These methods help us
group and summarize our data, which makes complex analysis comparatively easy.
Aggregation in pandas applies a mathematical or logical operation to our
dataset and returns a summary of the result. It can be used to summarize columns in our
dataset, for example to get the sum, minimum, or maximum of a particular column. The function used
for aggregation is agg(); its parameter is the function (or list of functions) we want to apply.
Some functions used in aggregation are:
Function : Description
sum() : Compute sum of column values
min() : Compute min of column values
max() : Compute max of column values
mean() : Compute mean of column
size() : Compute group sizes
describe() : Generate descriptive statistics
first() : Compute first of group values
last() : Compute last of group values
count() : Compute count of column values
std() : Standard deviation of column
var() : Compute variance of column
sem() : Standard error of the mean of column
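As a brief sketch of agg() on a whole DataFrame, assuming a small invented table with columns a and b:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# One function: returns a Series with one value per column.
totals = df.agg("sum")

# Several functions: returns a DataFrame with one row per function.
summary = df.agg(["sum", "min", "max", "mean"])

# A different function per column, given as a dictionary.
per_column = df.agg({"a": "sum", "b": "mean"})

print(totals)
print(summary)
print(per_column)
```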
Grouping in Pandas
Grouping is used to group data from our dataset according to some criteria. It follows the split-apply-combine strategy:
• Splitting the data into groups based on some criteria.
• Applying a function to each group independently.
• Combining the results into a data structure.
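The three steps above can be sketched with groupby(). The team and points columns are hypothetical sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "B", "A", "B", "A"],
    "points": [10, 7, 12, 9, 8],
})

# Split: one group per distinct value in "team".
grouped = df.groupby("team")

# Apply + combine: one result row per group.
total = grouped["points"].sum()
stats = grouped["points"].agg(["mean", "count", "first", "last"])
sizes = grouped.size()

print(total)
print(stats)
print(sizes)
```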
Pandas combining DataFrames
In Pandas there are 4 ways to combine data from different frames:
1. Merging
2. Joining
3. Concatenating
4. Appending
Merging and joining largely overlap in functionality, as do concatenating and appending. Note that
DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; use concat() instead.
Concatenation in Pandas:
The concatenation operation in Pandas appends one DataFrame to another along an axis. It works similarly to the
SQL UNION ALL operation: all rows are kept, including duplicates.
We use the concat() method to concatenate two or more DataFrames in Pandas.
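A minimal sketch of concat() along each axis, using two invented DataFrames with the same columns:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "name": ["Tom", "Nick"]})
df2 = pd.DataFrame({"id": [3, 4], "name": ["Krish", "Jack"]})

# axis=0 (default): stack rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1: place the frames side by side, aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```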
Merging: merging performs complex column-wise combinations of DataFrames.
• Combines data on common columns
• The most flexible, but also the most complex, of the methods
• Many-to-one and many-to-many joins are possible
• Side-by-side merge

Joins:
Inner: joins all shared rows on the joining/key columns. You will lose rows that don't have a match in the
other DataFrame's key column.
Outer: joins all rows from both DataFrames. No data will be lost; values missing on either side are filled with NaN.
Left: joins on all rows from the left DataFrame. Any rows from the right DataFrame that do not have a match
in the key column of the left DataFrame are discarded.
Right: the opposite of the left join. Joins on all rows from the right DataFrame. Any rows from the left
DataFrame that do not have a match in the key column of the right DataFrame are discarded.
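The four join types can be compared on two small frames that share a key column. The frames and the column names key, x, and y are assumptions for illustration; pd.merge and its how parameter are the standard pandas API.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")       # keys present in both
outer = pd.merge(left, right, on="key", how="outer")       # keys from either side
left_join = pd.merge(left, right, on="key", how="left")    # every key from left
right_join = pd.merge(left, right, on="key", how="right")  # every key from right

for name, frame in [("inner", inner), ("outer", outer),
                    ("left", left_join), ("right", right_join)]:
    print(name)
    print(frame)
```

In the outer, left, and right results, cells with no matching row on the other side hold NaN.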

1. Data visualization: the graphical or pictorial representation of data using graphs, charts, etc. The
purpose of plotting is to visualize variation or show relationships between variables.
Visualization of data is used effectively in fields like health, finance, science, mathematics, and
engineering. Data can be visualized with matplotlib line, bar, and scatter charts, depending on the
type of data.

Why we need data visualization:


1. It identifies areas that need improvement and attention.
2. It clarifies which factors influence an outcome.
3. It helps to understand which product to place where.
4. It helps to predict sales volumes.
Plotting:
1. plot(): The plot() function is used to draw points in a diagram. By default, plot() draws
a line from point to point. The function takes parameters for specifying the points in the diagram.
Plotting data using matplotlib:
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility. Matplotlib
was created by John D. Hunter. Matplotlib is open source and we can use it freely. Matplotlib is mostly
written in Python; a few segments are written in C and JavaScript for platform compatibility.
1. Line chart:
A line chart can be created using the matplotlib plot() function. Line charts are used to represent the
relation between two data series, x and y, plotted on different axes.
2. Bar chart: A bar chart shows values as vertical bars, where the height of each bar indicates the value
it represents. Matplotlib provides the bar() function for turning data into bar charts.
3. Histogram: A histogram shows the distribution of numerical data by grouping the values into ranges
(bins) and showing how many values fall into each.
For example: age = np.array([22,34,5,6,7,67,44,34,34,67,45,79,78])
4. Scatter chart: A scatter chart is a two-dimensional data visualization method that uses dots to
represent the values obtained for two different variables, one plotted along the x-axis and the other
along the y-axis.
Scatter plots are used when you want to show the relationship between two variables. Scatter plots are
sometimes called correlation plots because they show how variables are correlated. Additionally, the
size, shape, or colour of the dots can represent a third (or even a fourth) variable.
5. Pie chart: A pie chart shows the size of items in one data series, proportional to the sum of the items.
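The five chart types above can be sketched in a single figure. This is a minimal example under stated assumptions: the x, y, and labels values are invented, the age array is the one given in the histogram item, the Agg backend is selected only so the script runs without a display, and charts.png is a hypothetical output file name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window instead
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([3, 8, 1, 10])
labels = ["A", "B", "C", "D"]
age = np.array([22, 34, 5, 6, 7, 67, 44, 34, 34, 67, 45, 79, 78])

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

axes[0].plot(x, y)             # line chart: points joined by a line
axes[0].set_title("Line chart")

axes[1].bar(labels, y)         # bar chart: one vertical bar per label
axes[1].set_title("Bar chart")

axes[2].hist(age, bins=5)      # histogram: ages counted in 5 bins
axes[2].set_title("Histogram")

axes[3].scatter(x, y)          # scatter chart: one dot per (x, y) pair
axes[3].set_title("Scatter chart")

axes[4].pie(y, labels=labels)  # pie chart: slices proportional to y
axes[4].set_title("Pie chart")

fig.tight_layout()
fig.savefig("charts.png")
```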
